:mod:`pmemobj` tutorial
===============================================================================

:mod:`pmem`, :mod:`pmemblk`, and :mod:`pmemlog` will be interesting to people
with specific needs that match the services provided by those libraries. The
vast majority of Python programmers interested in using persistent memory,
however, will be interested in :mod:`pmemobj`, which provides a Python-object
oriented interface to persistent memory.

This chapter explains how to use Python :mod:`pmemobj`, and how to write your
own :class:`~pmemobj.Persistent` objects.


Conceptual Overview
-------------------------------------------------------------------------------

In a normal Python program we create a bunch of objects and use them to
accomplish a goal. When the program ends, all of the objects are thrown away,
to be rebuilt from scratch the next time the program is run. :mod:`pmemobj`
provides the opportunity to change this paradigm: to create objects in a
program and have their state preserved between program runs, so that they do
not need to be reconstructed the next time the program is run.

The guarantee made by :mod:`pmemobj` is that the state of the objects will be
self-consistent no matter when the program terminates. Further, it provides a
:meth:`~pmemobj.PersistentObjectPool.transaction` context manager that can be
placed around multiple persistent object modifications to guarantee that
either *all* of the modifications are made, or *none* of them are made.

Contrast this with a persistence paradigm such as that provided by
`SQL Alchemy `_. Here we have objects whose data is mapped to relational
database tables. When the program starts up, it can query the database in any
of several ways in order to retrieve objects. The object state is thus
persistent in the sense that an object will have the same state it had the
last time that object was flushed to disk in a previous program run.
SQLAlchemy also provides transactions that guarantee that either all of the
changes in a block are committed to the database, or none of them are.

So, how do the two paradigms differ? At the higher conceptual levels, not by
much. In the SQLAlchemy case objects are retrieved by running a query to find
selected instances of a given object class. In :mod:`pmemobj` objects are
retrieved by walking an object tree from a
:attr:`~pmemobj.PersistentObjectPool.root` object defined by the program.

The difference, for both better and worse, is that persistent memory is
entirely an "object store", and *not* a relational database. It is thus more
similar to the `ZODB `_ than to SQLAlchemy. Where it differs from the ZODB is
in how objects are stored. In the ZODB, Python objects are serialized using
the :mod:`pickle` module and stored on disk. In :mod:`pmemobj`, objects are
stored directly in persistent memory, written to and read from using the same
*store* and *fetch* instructions used to access RAM. This means that in
principle read access can be nearly as fast as RAM access, and write access
can be orders of magnitude more efficient than disk writes. In practice we're
at the early stages of development, and at least in the Python case we aren't
anywhere near as fast as we could be. But it's fast enough to be useful.

To be a bit more concrete, consider the example of a Python list. CPython
stores a list in RAM via an object header that points to an area of allocated
memory that holds a list of pointers to the objects in the list. In
:mod:`pmemobj`, a list is stored in *persistent* memory as an object header
that points to an area of allocated *persistent* memory that contains a list
of *persistent* pointers to the objects in the list. An access to a list
element is a normal ``addr+offset`` fetch of a pointer. Pointer resolution is
another quick arithmetic operation.
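
The ``addr+offset`` idea can be pictured with a toy sketch in plain Python.
This is not how :mod:`pmemobj` or ``libpmemobj`` actually lay out memory. It
assumes a ``bytearray`` standing in for the mapped pool and models a
persistent pointer as a plain 8-byte offset from the start of the pool. The
names ``POOL``, ``store_ptr``, ``fetch_ptr``, ``resolve``, and
``list_getitem`` are invented for the illustration.

```python
import struct

PTR = struct.Struct("<Q")   # toy "persistent pointer": an 8-byte pool offset
POOL = bytearray(256)       # stand-in for a mapped region of persistent memory


def store_ptr(addr, offset_value):
    """Write a persistent pointer at byte address ``addr`` in the pool."""
    PTR.pack_into(POOL, addr, offset_value)


def fetch_ptr(addr):
    """Read the persistent pointer stored at byte address ``addr``."""
    return PTR.unpack_from(POOL, addr)[0]


def resolve(offset_value, base):
    """Turn a pool-relative offset into an address valid for this run."""
    return base + offset_value


def list_getitem(items_addr, index):
    """Fetch element ``index``: one addr+offset read of a persistent pointer."""
    return fetch_ptr(items_addr + index * PTR.size)


# Lay out a three-element "list": its item array lives at offset 64 and
# holds the pool offsets of three objects.
ITEMS = 64
for i, obj_offset in enumerate((128, 160, 192)):
    store_ptr(ITEMS + i * PTR.size, obj_offset)

second = list_getitem(ITEMS, 1)

# The stored offset is identical in every run; only the base address at
# which the pool happens to be mapped changes.
addr_run1 = resolve(second, base=0x7F0000000000)
addr_run2 = resolve(second, base=0x7F1200000000)
```

Because only offsets are stored, the same bytes remain meaningful wherever the
pool is mapped the next time the program starts.
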
Updating a list element is the reverse: calculating the persistent pointer to
the object and storing it at the correct offset in the persistent data
structure. It is clear that this is going to be more efficient than SQLAlchemy
marshalling to SQL-DB-update to disk-write to disk-flush, or ZODB-pickling to
disk-write to disk-flush.

There is, however, overhead involved in the integrity guarantees.
``libpmemobj`` uses a change-log to record all changes that are taking place
in a transaction. If the transaction is aborted or not marked as complete,
then all of the changes that did take place during the aborted transaction are
rolled back, either immediately in the case of an abort, or the next time the
persistent memory is accessed by ``libpmemobj`` in the case of a crash. This
log overhead has a non-zero cost, but what you buy with that cost is object
and transactional integrity in the face of hard crashes. And all of the log
and rollback activity takes place using direct memory *fetch* and *store*
instructions, so it is still fast, relatively speaking.

In this first version of :mod:`pmemobj` we have focused on proof of concept
and portability rather than efficiency. That is, it is implemented entirely in
Python, using `CFFI `_ to access the ``libpmemobj`` functions. In addition,
most immutable persistent objects are handled by converting them back to
normal Python RAM-based instances when accessed, rather than accessing them
directly in persistent memory. All of this adds conceptually unnecessary
overhead and results in execution times that are slower than optimal. There is
no conceptual barrier, however, to making it all quite efficient by moving the
object access to the C level in a future version. The object algorithms are,
for the most part, copied directly from the CPython codebase, with a few
modifications to deal with persistent pointers and updating the rollback log.
So in principle the object implementations can be almost as fast as the
CPython objects they are emulating.


Real and Fake Persistent Memory
-------------------------------------------------------------------------------

"Real" persistent memory in the context of this library is physical
non-volatile memory that is accessible via the Linux kernel `DAX `_
extensions. Persistent memory thus configured appears as a mounted filesystem
to Linux. An allocated area of persistent memory is labeled by a filename
according to normal Unix rules. Thus if your DAX memory is mounted at
/mnt/persistent, you would refer to an allocated area of memory named
``myprog.pmem`` via the path:

    /mnt/persistent/myprog.pmem

The persistent file system is a normal Unix filesystem when viewed through the
file system drivers. The magic of DAX, however, is that it allows a program to
bypass the file system drivers and have direct, unbuffered access to the
memory using normal CPU *fetch* and *store* instructions. There are, of
course, concerns with respect to CPU caches and when exactly a change gets
committed to the physical memory. See the :mod:`pmem` module for more details.
:mod:`pmemobj` handles all of those details so your program doesn't have to.

There are two sorts of "fake" persistent memory. One is discussed on the
`Persistent Memory Wiki `_ referenced above: you can emulate real persistent
memory using regular RAM by reserving RAM to be accessed through DAX via
kernel configuration.

The second sort of "fake" persistent memory is to simply ``mmap`` a normal
file. In this case the pmem libraries use different calls to ensure changes
are flushed to disk, but the remainder of the pmem programming infrastructure
can be tested. All of the pmem libraries automatically use this mode when the
specified path is not a DAX-backed path.
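
The idea behind the second kind of fake persistent memory can be pictured
using only the standard library. The sketch below is not what the pmem
libraries do internally. It only shows, under simplifying assumptions, the
steps of mapping an ordinary file, storing into it with memory operations,
flushing, and finding the bytes again after the file is reopened. The file
name and the helper functions ``write_greeting`` and ``read_greeting`` are
invented for the example.

```python
import mmap
import os
import tempfile

SIZE = 4096  # one page is plenty for the demonstration


def write_greeting(path, data):
    """Create a file of SIZE bytes, map it, and store ``data`` at offset 0."""
    with open(path, "wb") as f:
        f.truncate(SIZE)
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), SIZE) as mem:
            mem[0:len(data)] = data   # a plain memory store into the mapping
            mem.flush()               # ask the OS to write the pages to the file


def read_greeting(path, length):
    """Map the file again, as a later program run would, and fetch the bytes."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), SIZE) as mem:
            return bytes(mem[0:length])


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "fake.pmem")
write_greeting(path, b"Hello, David")
recovered = read_greeting(path, len(b"Hello, David"))
```

The data survives because the mapping is backed by the file, which is the
property the pmem libraries rely on when no DAX-backed path is available.
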
So, anywhere in the following examples where a filename is used, you can
substitute a path that will access the fake or real persistent memory as you
choose, and the examples should all work the same. (Except for losing the
persistent data on machine reboot, if you are using RAM emulation.)


Object Types and Persistence
-------------------------------------------------------------------------------

For the purposes of considering persistence, we can divide Python objects up
into three classes: immutable non-container objects, mutable non-container
objects, and container objects.

Immutable non-container objects are the easiest to handle. We can store them
in whatever form we want in persistent memory, and upon access we can
reconstruct the equivalent Python object and let the program use that. Because
the object is immutable, it doesn't matter that the object in persistent
memory and the object in use aren't the same object. (Or if it does, that's a
bug in your program, since Python makes no guarantees about the identity of
immutable objects.)

Mutable non-container objects *must* directly store, update, and retrieve
their data from persistent memory, since everything that points to that
mutable object will expect to see any updates. (An example of a mutable
non-container object is a :class:`bytearray`. :mod:`pmemobj` does not yet
support any of Python's mutable non-container types.)

Container objects may contain pointers to other objects. The rule in
:mod:`pmemobj` is that every object pointed to by a persistent container must
itself be stored persistently. This means that all pointers inside persistent
objects are persistent pointers; that is, pointers that can be resolved into a
valid pointer if the program is shut down and restarted running in a different
memory location.
Therefore we can't map a persistent immutable container object (such as a
tuple) to its Python equivalent, because the stored pointers are persistent
pointers, and may not even have the same length as a normal RAM pointer.

Mostly these distinctions matter only to someone implementing a new
:class:`~pmemobj.Persistent` type. However, the first category, the immutable
non-container objects, matters at the Python programming level. This is
because there are two possibilities for such objects: :mod:`pmemobj` may
support them directly, or it may support them through :mod:`pickle`. If a
class is supported directly, a :class:`~pmemobj.Persistent` container may
reference its instances and :mod:`pmemobj` will automatically deal with
storing their data persistently, and accessing it when referenced. If a class
is not supported directly, then a program using :mod:`pmemobj` can still
reference its instances, if the program nominates the class for persistence
via pickling. This is less efficient than direct support, but allows programs
to use data types for which support has not yet been written. (Pickling is not
applied automatically because there is no way for :mod:`pmemobj` to determine
whether a specific class is immutable or not.)


Hello
-------------------------------------------------------------------------------

We'll start the tutorial proper with the traditional "Hello, World" program.
To make it interesting from a persistence standpoint, we'll skip past the
static "Hello, world!" to the second part of the traditional example, where
you make it say hello to a specified name, and we'll make it remember the name
from one call to the next:

.. literalinclude:: examples/hello_you.py

This simple example demonstrates several things. Persistent memory is accessed
through a :class:`~pmemobj.PersistentObjectPool` object. By passing
``flag='c'`` to the constructor, we tell :mod:`pmemobj` to create the pool if
it doesn't exist yet, and to open it if it does.
It creates the pool with a fairly generous size, but a real application might
need to increase the allocated size depending on how much data it is handling.
Note that a pool's size is fixed once created. There are plans for future
improvements that will either provide a way to resize a pool or, at a minimum,
a way to dump the data from one pool and restore it into another. Neither of
these facilities exists as of this writing.

The pool object returned by the constructor has several methods and one
attribute. That attribute, :attr:`~pmemobj.PersistentObjectPool.root`, names
an arbitrary persistent object, and its default value is ``None``. When a pool
is first created, then, ``root`` is ``None``. Our program checks if ``root``
is ``None``, and if it is, sets about getting a value (the name to use). It
assigns that to the ``root`` attribute, which is enough to cause that object
to be persisted. It then prints out the "Hello" greeting. If ``root`` is not
``None``, then it has a value, so we use that value to print out the greeting.

If we name this script ``hello.py``, running it from the command line would
look like this::

    > python hello.py
    What is your name? David
    Hello, David
    > python hello.py
    Hello, David
    > python hello.py
    Hello, David


Guessing Game
-------------------------------------------------------------------------------

Another frequent example of a simple program is a guessing game. It might look
something like this:

.. literalinclude:: examples/normal_guess.py

A playing session might look like this::

    Hello, what is your name? David
    David, I've picked a number between 1 and 50.
    Take a guess.
    > 25
    Your guess is too low.
    Take a guess.
    > 40
    Your guess is too low.
    Take a guess.
    > 45
    Your guess is too low.
    Take a guess.
    > 48
    You guessed my number in 4 tries, David.

The magic of persistence is that everything is remembered between program
runs. So let's rewrite this so that instead of a loop, we're using commands
typed at the shell prompt to play the game.
First, we need a command to start the game:

.. literalinclude:: examples/start_guessing

This introduces several new concepts. The :func:`~pmemobj.create` function
raises an error if the persistent memory file already exists. This is
equivalent to specifying ``flag='x'`` in the
:class:`~pmemobj.PersistentObjectPool` constructor. This command is only
dealing with creating the pool, so it doesn't need an ``if`` test to see
whether ``root`` is ``None``; it can just go ahead and do the setup. However,
we want the setup to either work or fail completely, so we use the pool's
:meth:`~pmemobj.PersistentObjectPool.transaction` context manager to wrap all
of our initialization in a transaction.

The first thing we do is create a namespace to hold our persistent program
data. We use a dictionary for this, but we can't persist a normal Python dict.
Instead we use the :class:`pmemobj.PersistentDict`. To create one, we use the
:meth:`~pmemobj.PersistentObjectPool.new` method of the pool. The ``new``
method requires a class object that supports the :class:`~pmemobj.Persistent`
interface, and given one it creates an instance of the object that will store
its data in the pool. We could also pass constructor arguments after the class
name. :class:`~pmemobj.PersistentDict` accepts the same constructor arguments
as a normal dict.

Note that in addition to giving the :class:`~pmemobj.PersistentDict` a local
name, we also assign it to the :attr:`~pmemobj.PersistentObjectPool.root`
attribute of the pool. If we failed to do that, :mod:`pmemobj` would forget
about the object once the pool was closed, since nothing would be referring to
it. That is, when the pool is closed, :mod:`pmemobj` looks through all the
objects in the pool, and any that cannot be reached from
:attr:`~pmemobj.PersistentObjectPool.root` are garbage collected.

Once we have our namespace, we store the player's name and the number we want
them to guess, and create an empty :class:`~pmemobj.PersistentList` in which
to store the guesses.
Then we tell the player what to do next, and we're ready to play.

We've told the player to type ``guess`` followed by their guess at the command
line, so now we need to implement the ``guess`` command:

.. literalinclude:: examples/guess

Here we've used :func:`~pmemobj.open` to get access to the existing pool. It
will raise an error if the pool does *not* exist. This is equivalent to
passing ``flag='r'`` to the constructor.

We do need to check :attr:`~pmemobj.PersistentObjectPool.root` to see if it is
``None`` here, since it could be if the initialization did not complete. In
that case we just delete the pool and tell the player to start over from the
beginning.

Notice how we can use a local name for the ``guess`` list, and append to it,
and the list is updated persistently. This is because each
:class:`~pmemobj.Persistent` class knows which pool it belongs to, so it can
find the persistent memory it needs to update.

The other thing to notice about this example is that we haven't used an
explicit transaction anywhere. We only do one data structure update, and
that's the append of the new guess to the list. That append is guaranteed to
be atomic, so there is no need for an explicit transaction in this case.

With our persistent version of the guessing game, running the game looks like
this::

    > start_guessing
    Hello, what is your name? David
    David, I've picked a number between 1 and 50.
    Type 'guess' followed by your guess at the prompt.
    > guess 25
    Your guess is too low.
    > guess 35
    Your guess is too low.
    > guess 45
    Your guess is too high.
    > guess 40
    Your guess is too high.
    > guess 38
    Your guess is too high.
    > guess 37
    Your guess is too high.
    > guess 36
    You guessed my number in 7 tries, David.

Now, this code is somewhat more complicated than the non-persistent version,
but it would allow you to start a game one day, and come back days later and
finish the game. We could add a 'status' command that lets you know how many
guesses you'd made, and even replay the guesses.
While this is a trivial example, I think you can see how these principles
would apply to more useful programs with retained state.


Persistent Objects
-------------------------------------------------------------------------------

Python is an object oriented language, so we of course would like to be able
to persist arbitrary objects. We can't do that in the general case, since
anything that has a specific memory layout requires specific support in
:mod:`pmemobj`. However, Python objects that do not subclass built-in types
are, from the point of view of persistent memory, just a dictionary wrapped in
some extra behavior. So :mod:`pmemobj` does support persisting arbitrary
objects that do not subclass built-ins, via the
:class:`~pmemobj.PersistentObject` base class.

Our ``guess`` code above has to awkwardly pull the data of interest out of the
dictionary that we used as a namespace. The code would be simpler if we could
instead have that data be attributes on an object. To do that, we'll need to
be able to access that object from both programs, so we'll want a separate
Python file to hold our class definition:

.. literalinclude:: examples/guess_lib.py

The first thing to notice about the :class:`~pmemobj.PersistentObject`
subclass is that for the most part it doesn't look any different from a normal
Python class. There is an ``__init__`` that is executed when the object is
first created, and most attributes are referenced and set normally.

The one exception is our ``self.guesses`` attribute. We want that to be a
list. Since it is not a non-container immutable, it needs to be a
:class:`~pmemobj.Persistent` object itself. To accomplish this we make use of
the :attr:`~pmemobj.Persistent._p_mm` attribute of our
:class:`~pmemobj.PersistentObject` instance. This attribute points to the
:class:`~pmemobj.MemoryManager` instance associated with the
:class:`~pmemobj.PersistentObject`.
We can use that reference to access the ``MemoryManager``'s
:meth:`~pmemobj.MemoryManager.new` method, and use that method to create an
empty :class:`~pmemobj.PersistentList` that is associated with the same
``MemoryManager`` that manages our ``PersistentObject``.

We can also use the :attr:`~pmemobj.Persistent._p_mm` attribute to access the
``MemoryManager``'s :meth:`~pmemobj.MemoryManager.transaction` context
manager, as you can see in the ``check_guess`` method of the example. Unlike
our previous example, in this code block we are making several updates to our
class that should either all be done, or none of them done. By using the
transaction, we ensure that either the guess is completely processed, or it is
not processed at all, no matter when the program gets interrupted.

With the game logic now factored out into a class, our command scripts are
much simpler. ``start_guessing`` becomes:

.. literalinclude:: examples/start_guessing2

To start the game, we check there's no existing game file and create it, but
now initializing the data structures in the pool consists of just calling
:meth:`~pmemobj.PersistentObjectPool.new` on our ``guesser`` class and
assigning that to :attr:`~pmemobj.PersistentObjectPool.root`.

The ``guess`` command is now almost trivial:

.. literalinclude:: examples/guess2

We use our library function to reopen the pool, which checks for the various
error conditions and aborts with the appropriate message if we run into any of
them. Then we grab the ``guesser`` instance from the pool's
:attr:`~pmemobj.PersistentObjectPool.root` and pass the guess the player made
to its ``check_guess`` method to evaluate. We print the message associated
with whatever guess status it returns, and remove the game file if and only if
the game is over.

And now we can easily implement the ``game_status`` command mentioned earlier:

.. literalinclude:: examples/guess_status2

The pattern here is one I expect many persistent memory applications will
share (possibly via a single program with subcommands or sub-functions, rather
than the multiple program files in this example): the persistent memory is
accessed through an instance of an application-specific class that is assigned
to the :attr:`~pmemobj.PersistentObjectPool.root` of the object pool. When
run, the application makes sure it can access the pool, then grabs the
instance from :attr:`~pmemobj.PersistentObjectPool.root` and uses the
instance's methods to accomplish the application's goals.
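
The shape of that pattern can be sketched without any persistent memory at
all. In the sketch below ``FakePool`` is a hypothetical stand-in that has
nothing but a ``root`` attribute, and ``Counter`` and ``run`` are invented
names. In a real program the pool would come from :mod:`pmemobj`, the
application class would derive from :class:`~pmemobj.PersistentObject`, and
the instance would be created through the pool's
:meth:`~pmemobj.PersistentObjectPool.new` method rather than by calling the
class directly.

```python
class FakePool:
    """Hypothetical stand-in for an object pool: just a ``root`` attribute."""

    def __init__(self):
        self.root = None


class Counter:
    """Application-specific class that owns all of the program's state."""

    def __init__(self):
        self.count = 0

    def bump(self):
        """Record one more program run and return the running total."""
        self.count += 1
        return self.count


def run(pool):
    """One program run: fetch the application object from root and use it."""
    if pool.root is None:
        # First run: nothing is stored yet, so install the application object.
        # With a real pool this instance would be created via the pool itself.
        pool.root = Counter()
    return pool.root.bump()


pool = FakePool()    # plays the part of the pool file that outlives each run
first = run(pool)
second = run(pool)
```

Each call to ``run`` plays the part of a separate program invocation: the only
thing it needs to know is that the application object lives at ``root``.
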