pmemobj tutorial¶
pmem, pmemblk, and pmemlog will be interesting to people with specific needs
that match the services provided by those libraries. The vast majority of
Python programmers interested in utilizing persistent memory, however, will be
interested in pmemobj, which provides a Python-object oriented interface to
persistent memory. This chapter aims to explain how to use Python pmemobj, and
how to write your own Persistent objects.
Conceptual Overview¶
In a normal Python program we create a bunch of objects and use them to
accomplish a goal. When the program ends, all of the objects are thrown away,
to be rebuilt from scratch the next time the program is run. pmemobj provides
the opportunity to change this paradigm: to be able to create objects in a
program, and have their state preserved between program runs, so that they do
not need to be reconstructed the next time the program is run. The guarantee
made by pmemobj is that the state of the objects will be self-consistent no
matter when the program terminates. Further, it provides a transaction() that
can be placed around multiple persistent object modifications to guarantee that
either all of the modifications are made, or none of them are made.
Contrast this with a persistence paradigm such as that provided by SQLAlchemy. Here we have objects whose data is mapped to relational database tables. When the program starts up, it can query the database in any of several ways in order to retrieve objects. The object state is thus persistent in the sense that an object will have the same state it had the last time that object was flushed to disk in a previous program run. SQLAlchemy also provides transactions that guarantee that either all of the changes in a block are committed to the database, or none of them are.
So, how do the two paradigms differ? At the higher conceptual levels, not by
much. In the SQLAlchemy case objects are retrieved by running a query to find
selected instances of a given object class. In pmemobj objects are retrieved
by walking an object tree from a root object defined by the program. The
difference, for both better and worse, is that persistent memory is entirely an
“object store”, and not a relational database. It is thus more similar to
the ZODB than to SQLAlchemy.
Where it differs from the ZODB is in how objects are stored. In the ZODB
Python objects are serialized using the pickle module and stored on disk. In
pmemobj, objects are stored directly in persistent memory, written to and read
from using the same store and fetch instructions used to access RAM memory.
This means that in principle read access can be nearly as fast as RAM access,
and write access can be orders of magnitude more efficient than disk writes.
In practice we’re at the early stages of development, and at least in the Python case we aren’t anywhere near as fast as we could be. But it’s fast enough to be useful.
To be a bit more concrete, consider the example of a Python list. CPython
stores a list in RAM via an object header that points to an area of allocated
memory that holds a list of pointers to the objects in the list. In pmemobj, a
list is stored in persistent memory as an object header that points to an area
of allocated persistent memory that contains a list of persistent pointers to
the objects in the list. An access to a list element is a normal addr+offset
fetch of a pointer. Pointer resolution is another quick arithmetic operation.
Updating a list element is the reverse: calculating the persistent pointer to
the object and storing it at the correct offset in the persistent data
structure. It is clear that this is going to be more efficient than SQLAlchemy
marshalling to SQL-DB-update to disk-write to disk-flush, or ZODB-pickling to
disk-write to disk-flush.
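The address arithmetic described above can be tried out without any persistent memory at all. What follows is a minimal sketch, not pmemobj's actual on-media layout: a bytearray stands in for the pool, and the names pool, PTR, store_pointer, and load_pointer are invented for this illustration. A "persistent pointer" here is simply an offset from the start of the pool, which is why it stays meaningful wherever the pool happens to be mapped.

```python
import struct

# Hypothetical stand-in for a mapped pool: a plain bytearray.
pool = bytearray(256)

# One 8-byte little-endian offset per list slot (an assumption of this sketch).
PTR = struct.Struct("<Q")


def store_pointer(pool, list_offset, index, target_offset):
    """Store a persistent pointer (a pool offset) in slot `index` of a list."""
    PTR.pack_into(pool, list_offset + index * PTR.size, target_offset)


def load_pointer(pool, list_offset, index):
    """Fetch the persistent pointer held in slot `index` of a list."""
    (target_offset,) = PTR.unpack_from(pool, list_offset + index * PTR.size)
    return target_offset


# A three-slot "list" whose item area starts at offset 64 in the pool.
items_offset = 64
store_pointer(pool, items_offset, 0, 128)
store_pointer(pool, items_offset, 1, 160)
store_pointer(pool, items_offset, 2, 192)

# Copying the pool simulates mapping it at a different address:
# the offsets stored inside it are still valid.
remapped = bytearray(pool)
second = load_pointer(remapped, items_offset, 1)
```

Resolving such an offset into a usable address is then a single addition of the pool's base address, which is the "quick arithmetic operation" the text refers to.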
There is, however, overhead involved in the integrity guarantees. libpmemobj
uses a change-log to record all changes that are taking place in a transaction,
and if the transaction is aborted or not marked as complete, then all of the
changes that did take place during the aborted transaction are rolled back,
either immediately in the case of an abort, or the next time the persistent
memory is accessed by libpmemobj in the case of a crash. This log overhead has
a non-zero cost, but what you buy with that cost is the object and
transactional integrity in the face of hard crashes. And all of the log and
rollback activity takes place using direct memory fetch and store instructions,
so it is still fast, relatively speaking.
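To make the change-log idea concrete, here is a toy sketch in plain Python. It is an illustration of the general undo-log technique only, not libpmemobj's implementation: the class name UndoLog and its set method are invented for this example, the "memory" is an ordinary dict, and only the abort path is modelled (real crash recovery would replay a log that itself lives in persistent memory the next time the pool is opened).

```python
class UndoLog:
    """Toy undo log over a dict; illustrative only."""

    _MISSING = object()  # marks keys that did not exist before the change

    def __init__(self, store):
        self.store = store
        self.log = []

    def set(self, key, value):
        # Record the old value *before* changing anything.
        self.log.append((key, self.store.get(key, self._MISSING)))
        self.store[key] = value

    def __enter__(self):
        self.log.clear()
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            # Abort: undo the changes in reverse order.
            for key, old in reversed(self.log):
                if old is self._MISSING:
                    self.store.pop(key, None)
                else:
                    self.store[key] = old
        self.log.clear()
        return False  # let the exception propagate


store = {"balance": 10}
try:
    with UndoLog(store) as tx:
        tx.set("balance", 0)
        tx.set("note", "half done")
        raise RuntimeError("simulated failure in mid-transaction")
except RuntimeError:
    pass
# store is back to {"balance": 10}: neither change survived the abort.
```

The cost the text mentions is visible even here: every update does an extra read and a log append before the actual store.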
In this first version of pmemobj we have focused on proof of concept and
portability rather than efficiency. That is, it is implemented entirely in
Python, using CFFI to access the libpmemobj functions. In addition, most
immutable persistent objects are handled by converting them back to normal
Python RAM based instances when accessed, rather than accessing them directly
in persistent memory. All of this adds conceptually unnecessary overhead and
results in execution times that are slower than optimal. There is no
conceptual barrier, however, to making it all quite efficient by moving the
object access to the C level in a future version. The object algorithms are,
for the most part, copied directly from the CPython codebase, with a few
modifications to deal with persistent pointers and updating the rollback log.
So in principle the object implementations can be almost as fast as the CPython
objects they are emulating.
Real and Fake Persistent Memory¶
“Real” persistent memory in the context of this library is physical
non-volatile memory that is accessible via the Linux kernel DAX extensions.
Persistent memory thus configured appears as a mounted filesystem to Linux. An
allocated area of persistent memory is labeled by a filename according to
normal Unix rules. Thus if your DAX memory is mounted at /mnt/persistent, you
would refer to an allocated area of memory named myprog.pmem via the path:
/mnt/persistent/myprog.pmem
The persistent file system is a normal Unix filesystem when viewed through the
file system drivers. The magic of DAX, however, is that it allows a program to
bypass the file system drivers and have direct, unbuffered access to the memory
using normal CPU fetch and store instructions. There are, of course,
concerns with respect to CPU caches and when exactly a change gets committed to
the physical memory. See the pmem module for more details. pmemobj handles
all of those details so your program doesn’t have to.
There are two sorts of “fake” persistent memory. One is discussed on the Persistent Memory Wiki referenced above: you can emulate real persistent memory using regular RAM, by reserving RAM through kernel configuration so that it is accessed through DAX.
The second sort of “fake” persistent memory is to simply mmap a normal file.
In this case the pmem libraries use different calls to ensure changes are
flushed to disk, but the remainder of the pmem programming infrastructure can
be tested. All of the pmem libraries automatically use this mode when the
specified path is not a DAX-backed path.
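For readers who have not used mmap before, the following sketch shows the underlying mechanism using only Python's standard mmap module. It does not use the pmem libraries and says nothing about how they flush; the file name fake.pmem and the 4096-byte size are arbitrary choices for the illustration.

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fake.pmem")  # arbitrary name for this sketch

    # Create a fixed-size file to act as the memory region.
    with open(path, "wb") as f:
        f.truncate(4096)

    # Map it, store bytes directly into the mapping, and flush.
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 4096) as region:
            region[0:5] = b"hello"
            region.flush()  # ask the OS to write the dirty pages to the file

    # A later "program run" maps the same file and sees the data.
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 4096) as region:
            recovered = bytes(region[0:5])
```

The stores into region look like ordinary memory writes; the explicit flush is the step that differs between a regular file and real persistent memory.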
So, anywhere in the following examples where a filename is used, you can substitute a path that will access the fake or real persistent memory as you choose, and the examples should all work the same. (Except for losing the persistent data on machine reboot, if you are using RAM emulation.)
Object Types and Persistence¶
For the purposes of considering persistence, we can divide Python objects up into three classes: immutable non-container objects, mutable non-container objects, and container objects.
Immutable non-container objects are the easiest to handle. We can store them in whatever form we want in persistent memory, and upon access we can reconstruct the equivalent Python object and let the program use that. Because the object is immutable, it doesn’t matter that the object in persistent memory and object in use aren’t the same object. (Or if it does, that’s a bug in your program, since Python makes no guarantees about the identity of immutable objects.)
Mutable non-container objects must directly store, update, and retrieve
their data from persistent memory, since everything that points to
that mutable object will expect to see any updates. (An example of a
mutable non-container object is a bytearray. pmemobj does not yet support any
of Python’s mutable non-container types.)
Container objects may contain pointers to other objects. The rule in pmemobj
is that every object pointed to by a persistent container must
itself be stored persistently. This means that all pointers inside persistent
objects are persistent pointers; that is, pointers that can be resolved into a
valid pointer if the program is shut down and restarted running in a different
memory location. Therefore we can’t map a persistent immutable container
object (such as a tuple) to its Python equivalent, because the stored pointers
are persistent pointers, and may not even have the same length as a normal RAM
pointer.
Mostly these distinctions matter only to someone implementing a new persistent
type. However, the first category, the immutable non-container objects,
matters at the Python programming level. This is because there are two
possibilities for such objects: pmemobj may support them directly, or it may
support them through pickle. If a class is supported directly, a Persistent
container may reference its instances and pmemobj will automatically deal with
storing their data persistently, and accessing it when referenced. If a class
is not supported directly, then a program using pmemobj can still reference its
instances, if the program nominates them for persistence via pickling. This is
less efficient than direct support, but allows programs to use data types for
which support has not yet been written. (Pickling is not applied automatically
because there is no way for pmemobj to determine if a specific class is
immutable or not.)
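The reason pickling is safe only for immutable values can be seen with the standard pickle module alone. The sketch below does not show how pmemobj nominates a class for pickling (that API is not covered here); it only demonstrates the property being relied on: a value rebuilt from its pickled bytes is equal to the original but is a different object, which is harmless for an immutable value and would silently lose updates for a mutable one. Fraction is used simply as a convenient immutable non-container type.

```python
import pickle
from fractions import Fraction

original = Fraction(3, 4)        # an immutable non-container value
stored = pickle.dumps(original)  # bytes that could be kept in a pool
restored = pickle.loads(stored)  # rebuilt as a normal RAM object on access

same_value = restored == original   # equality survives the round trip
same_object = restored is original  # identity does not
```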
Hello <your_name_here>¶
We’ll start the tutorial proper with the traditional “Hello, World” program. To make it interesting from a persistence standpoint, we’ll skip past the static “Hello, world!” to the second part of the traditional example, where you make it say hello to a specified name, and we’ll make it remember the name from one call to the next:
from nvm.pmemobj import PersistentObjectPool

with PersistentObjectPool('hello_world.pmem', flag='c') as pool:
    if pool.root is None:
        name = input("What is your name? ")
        pool.root = name
    print("Hello, {}".format(pool.root))
This simple example demonstrates several things. Persistent memory is accessed
through a PersistentObjectPool object. By passing flag='c' to the constructor,
we tell pmemobj to create the pool if it doesn’t exist yet, and to open it if
it does. It creates the pool with a fairly generous size, but a real
application might need to increase the allocated size depending on how much
data it is handling.
Note that a pool’s size is fixed once created. There are plans for future improvements that will either provide a way to resize a pool or, at a minimum, a way to dump the data from one pool and restore it into another. Neither of these facilities exists as of this writing.
The pool object returned by the constructor has several methods and one
attribute. That attribute, root, names an arbitrary persistent object, and its
default value is None. When a pool is first created, then, root is None. Our
program checks if root is None, and if it is, sets about getting a value (the
name to use). It assigns that to the root attribute, which is enough to cause
that object to be persisted. It then prints out the “Hello” greeting.
If root is not None, then it has a value, so we use that value to print out the
greeting.
If we name this script hello.py, running it from the command line would look
like this:
> python hello.py
What is your name? David
Hello, David
> python hello.py
Hello, David
> python hello.py
Hello, David
Guessing Game¶
Another frequent example of a simple program is a guessing game. It might look something like this:
import random
import sys

guesses = []
max = 50
name = input("Hello, what is your name? ")
number = random.randint(1, max)
print("{}, I've picked a number between 1 and {}.".format(name, max))
while len(guesses) < 6:
    print('Take a guess.')
    guess = int(input('> '))
    if guess in guesses:
        print("You already tried that number!")
        continue
    if guess < number:
        print('Your guess is too low.')
    if guess > number:
        print('Your guess is too high.')
    if guess == number:
        print('You guessed my number in {} tries, {}.'.format(
            len(guesses)+1, name))
        break
    guesses.append(guess)
else:
    # The loop ran out of guesses without hitting 'break'.
    print("Too many guesses, {}!"
          " The number I was thinking of was {}".format(
              name, number))
A playing session might look like this:
Hello, what is your name? David
David, I've picked a number between 1 and 50.
Take a guess.
> 25
Your guess is too low.
Take a guess.
> 40
Your guess is too low.
Take a guess.
> 45
Your guess is too low.
Take a guess.
> 48
You guessed my number in 4 tries, David.
The magic of persistence is that everything is remembered between program runs. So let’s rewrite this so that instead of a loop, we’re using commands typed at the shell prompt to play the game.
First, we need a command to start the game:
#!/usr/bin/env python
import os
import sys
import random
from nvm.pmemobj import create, PersistentList, PersistentDict

pool_fn = 'guessing_game.pmem'
max = 50

try:
    pool = create(pool_fn)
except OSError as err:
    print(err)
    print("Are you already in the middle of a game?")
    sys.exit()

with pool:
    # Initialize everything in one transaction: all of it or none of it.
    with pool.transaction():
        root = pool.root = pool.new(PersistentDict)
        root['number'] = random.randint(1, max)
        root['name'] = name = input("Hello, what is your name? ")
        root['guesses'] = pool.new(PersistentList)
    print("{}, I've picked a number between 1 and {}.".format(name, max))
    print("Type 'guess' followed by your guess at the prompt.")
This introduces several new concepts. The create() function raises an error if
the persistent memory file already exists. This is equivalent to specifying
flag='x' in the PersistentObjectPool constructor. This command is only dealing
with creating the pool, so it doesn’t need an if test to see if root is None;
it can just go ahead and do the setup. However, we want the setup to either
work or fail completely, so we use the pool’s transaction() context manager to
wrap all of our initialization in a transaction.
The first thing we do is create a namespace to hold our persistent program
data. We use a dictionary for this, but we can’t persist a normal Python dict.
Instead we use the pmemobj.PersistentDict. To create one, we use the new()
method of the pool. The new method requires a class object that supports the
Persistent interface, and given one it creates an instance of the object that
will store its data in the pool. We could also pass constructor arguments
after the class name. PersistentDict accepts the same constructor arguments as
a normal dict.
Note that in addition to giving the PersistentDict a local name, we also assign
it to the root attribute of the pool. If we failed to do that, pmemobj would
forget about the object once the pool was closed, since nothing would be
referring to it. That is, when the pool is closed, pmemobj looks through all
the objects in the pool, and any that cannot be reached from root are garbage
collected.
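The reachability rule can be pictured with a small graph walk. This is a conceptual sketch only; it is not how pmemobj's collector is implemented, and the object names in the edges table (root_dict, orphan_list, and so on) as well as the function name reachable are made up for the illustration.

```python
def reachable(root, edges):
    """Return the set of object names reachable from `root`."""
    seen = set()
    stack = [root]
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        stack.extend(edges.get(obj, ()))
    return seen


# Hypothetical pool contents: object name -> names of objects it points to.
edges = {
    "root_dict": ["name", "number", "guesses"],
    "guesses": ["guess_1", "guess_2"],
    "orphan_list": ["orphan_item"],  # created, but never attached to root
}

all_objects = set(edges) | {t for targets in edges.values() for t in targets}
live = reachable("root_dict", edges)
garbage = all_objects - live
```

Here orphan_list and its item are the garbage: they exist in the pool, but no chain of references from the root leads to them.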
Once we have our namespace, we store the player’s name, and the number we want
them to guess, and create an empty PersistentList in which to store the
guesses.
Then we tell the player what to do next, and we’re ready to play.
We’ve told the player to type guess <their_guess> at the command line, so now
we need to implement the guess command:
#!/usr/bin/env python
import os
import sys
from nvm.pmemobj import open

pool_fn = 'guessing_game.pmem'

if len(sys.argv) != 2:
    print("Please specify a single integer as your guess.")
    sys.exit(1)
try:
    guess = int(sys.argv[1])
except ValueError as err:
    print("Please specify an integer as your guess.")
    sys.exit(1)

try:
    pool = open(pool_fn)
except OSError as err:
    print(err)
    print("Perhaps you need to run 'start_guessing' first?")
    sys.exit(1)

if pool.root is None:
    # The start_guessing script must have been killed before
    # initialization was complete.
    print("Looks like a start was aborted.  Please run"
          " start_guessing again.")
    pool.close()
    os.remove(pool_fn)
    sys.exit()

with pool:
    done = False
    root = pool.root
    guesses = root['guesses']
    name = root['name']
    number = root['number']
    if guess in pool.root['guesses']:
        print("You already tried {}".format(guess))
    elif guess < number:
        print("Your guess is too low.")
    elif guess > number:
        print("Your guess is too high.")
    elif guess == number:
        print('You guessed my number in {} tries, {}.'.format(
            len(guesses)+1, name))
        done = True
    # The only persistent update: a single, atomic append.
    guesses.append(guess)
    if not done and len(guesses) > 6:
        print("Too many guesses, {}!"
              " The number I was thinking of was {}".format(
                  name, number))
        done = True
if done:
    # The game is over, so remove the pool once it has been closed.
    os.remove(pool_fn)
Here we’ve used open() to get access to the existing pool. It will throw an
error if the pool does not exist. This is equivalent to passing flag='r' to
the constructor.
We do need to check root to see if it is None here, since it could be if the
initialization did not complete. In that case we just delete the pool and tell
the player to start over from the beginning.
Notice how we can use a local name for the guess list, and append to it, and
the list is updated persistently. This is because each Persistent class knows
which pool it belongs to, so it can find the persistent memory it needs to
update.
The other thing to notice about this example is that we haven’t used an explicit transaction anywhere. We only do one data structure update, and that’s the append of the new guess to the list. That append is guaranteed to be atomic, so there is no need for an explicit transaction in this case.
With our persistent version of the guessing game, running the game looks like this:
> start_guessing
Hello, what is your name? David
David, I've picked a number between 1 and 50.
Type 'guess' followed by your guess at the prompt.
> guess 25
Your guess is too low.
> guess 35
Your guess is too low.
> guess 45
Your guess is too high.
> guess 40
Your guess is too high.
> guess 38
Your guess is too high.
> guess 37
Your guess is too high.
> guess 36
You guessed my number in 7 tries, David.
Now, this code is somewhat more complicated than the non-persistent version, but it would allow you to start a game one day, and come back days later and finish the game. We could add a ‘status’ command that lets you know how many guesses you’ve made, and even replays the guesses. While this is a trivial example, I think you can see how these principles would apply to more useful programs with retained state.
Persistent Objects¶
Python is an object oriented language, so we of course would like to be able to
persist arbitrary objects. We can’t do that in the general case, since
anything that has a specific memory layout requires specific support in
pmemobj. However, Python objects that do not subclass built-in types are, from
the point of view of persistent memory, just a dictionary wrapped in some extra
behavior. So pmemobj does support persisting arbitrary objects that do not
subclass built-ins, via the PersistentObject base class.
Our guess code above has to awkwardly pull the data of interest out of the
dictionary that we used as a namespace. The code would be simpler if that data
were instead attributes on an object. To do that, we’ll need to be able to
access that object from both programs, so we’ll want a separate Python file to
hold our class definition:
import random
import os
from nvm.pmemobj import open, PersistentObject, PersistentList

pool_fn = 'guessing_game2.pmem'


class GameError(Exception):
    pass


def reopen_game():
    if not os.path.isfile(pool_fn):
        raise GameError("No game in progress.  Use 'start_guessing'"
                        " to start one.")
    try:
        pool = open(pool_fn)
    except OSError as err:
        # The game file is unusable; remove it so a new game can be started.
        try:
            os.remove(pool_fn)
        except OSError:
            raise GameError("Can't remove game file")
        raise GameError("Could not open game file ({}), start again"
                        " with 'start_guessing'".format(err))
    if pool.root is None:
        # Initialization never completed; discard the half-made pool.
        pool.close()
        os.remove(pool_fn)
        raise GameError("Looks like a game was aborted; start again with"
                        " 'start_guessing'")
    return pool


class Guesser(PersistentObject):

    def __init__(self, name, maximum=50):
        self.name = name
        self.maximum = maximum
        self.number = random.randint(1, maximum)
        self.guesses = self._p_mm.new(PersistentList)
        self.lost = False
        self.done = False

    def _guess_to_int(self, s):
        try:
            guess = int(s)
        except ValueError as err:
            raise ValueError("Please specify an integer; {} is not"
                             " valid: {}".format(s, err))
        if guess < 1 or guess > self.maximum:
            raise ValueError("Come now, {}, a guess outside of the"
                             " range I told you won't get you"
                             " anywhere".format(self.name))
        return guess

    def check_guess(self, guess):
        guess = self._guess_to_int(guess)
        # All of the updates below are committed together, or not at all.
        with self._p_mm.transaction():
            self.current_guess = guess
            if guess in self.guesses:
                self.current_outcome = 'SEEN'
            elif guess == self.number:
                self.current_outcome = 'EQUAL'
                self.done = True
            elif guess < self.number:
                self.current_outcome = 'LOW'
            else:
                self.current_outcome = 'HIGH'
            self.guesses.append(guess)
            if not self.done and len(self.guesses) > 6:
                self.lost = True
                self.done = True
        return self.current_outcome

    def message(self, key):
        return getattr(self, 'msg_' + key)()

    def msg_START(self):
        return "{}, I've picked a number between 1 and {}.".format(
            self.name, self.maximum)

    def msg_SEEN(self):
        return "You already tried {}".format(self.current_guess)

    def msg_EQUAL(self):
        return "You guessed my number in {} tries, {}.".format(
            len(self.guesses), self.name)

    def msg_LOW(self):
        return "Your guess is too low."

    def msg_HIGH(self):
        return "Your guess is too high."

    def msg_LOST(self):
        return ("Too many guesses, {}!"
                " The number I was thinking of was {}".format(
                    self.name, self.number))
The first thing to notice about the PersistentObject subclass is that for the
most part it doesn’t look any different from a normal Python class. There is
an __init__ that is executed when the object is first created, and most
attributes are referenced and set normally. The one exception is our
self.guesses attribute. We want that to be a list. Since it is not a
non-container immutable, it needs to be a Persistent object itself.
To accomplish this we make use of the _p_mm attribute of our PersistentObject
instance. This attribute points to the MemoryManager instance associated with
the PersistentObject. We can use that reference to access the MemoryManager's
new() method, and use that method to create an empty PersistentList that is
associated with the same MemoryManager that manages our PersistentObject.
We can also use the _p_mm attribute to access the MemoryManager's transaction()
context manager, as you can see in the check_guess method of the example.
Unlike our previous example, in this code block we are making several updates
to our class that should either all be done, or none of them done. By using
the transaction, we ensure that either the guess is completely processed, or it
is not processed at all, no matter when the program gets interrupted.
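Because the game rules themselves do not depend on where the data lives, they can be exercised without a pool. The following is a plain-Python restatement of the outcomes that check_guess is meant to produce, offered as a sketch for experimenting with the rules rather than as part of the library or the tutorial's own code. The names classify_guess and MAX_WRONG are hypothetical, and the limit assumes a seventh guess is still allowed, as in the persistent session shown earlier.

```python
MAX_WRONG = 6  # assumption: the game is lost once more than 6 guesses miss


def classify_guess(guess, number, guesses):
    """Record `guess` in `guesses` and return (outcome, done, lost)."""
    if guess in guesses:
        outcome = 'SEEN'
    elif guess == number:
        outcome = 'EQUAL'
    elif guess < number:
        outcome = 'LOW'
    else:
        outcome = 'HIGH'
    guesses.append(guess)
    won = outcome == 'EQUAL'
    lost = not won and len(guesses) > MAX_WRONG
    return outcome, won or lost, lost


history = []
results = [classify_guess(g, 36, history)[0] for g in (25, 45, 25, 36)]
# results == ['LOW', 'HIGH', 'SEEN', 'EQUAL']
```

In the persistent class, the several attribute updates that correspond to this function's return values are what the transaction groups into one all-or-nothing step.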
With the game logic now factored out into a class, our command scripts are much simpler.
start_guessing becomes:
#!/usr/bin/env python
import os
import sys
from nvm.pmemobj import create
from guess_lib import Guesser, pool_fn

name = input("Hello, what is your name? ")
if os.path.isfile(pool_fn):
    print("There is already a game file.  Use the guess_status command"
          " to see details of the current game.")
    sys.exit(1)
try:
    pool = create(pool_fn)
except OSError as err:
    print(err)
    sys.exit(err.errno)
with pool:
    # Creating the Guesser and making it the root is all the setup needed.
    pool.root = game = pool.new(Guesser, name)
    print(game.message('START'))
    print("Type 'guess' followed by your guess at the prompt.")
To start the game, we check there’s no existing game file and create it, but
now initializing the data structures in the pool consists of just calling new()
with our Guesser class and assigning the result to root.
The guess command is now almost trivial:
#!/usr/bin/env python
import os
import sys
from guess_lib import reopen_game, GameError, pool_fn

if len(sys.argv) != 2:
    print("Please specify a single integer as your guess.")
    sys.exit(1)
guess = sys.argv[1]

try:
    pool = reopen_game()
except (OSError, GameError) as err:
    print(err)
    sys.exit(1)

with pool:
    guesser = pool.root
    try:
        disposition = guesser.check_guess(guess)
    except ValueError as err:
        print(err)
        sys.exit(1)
    print(guesser.message(disposition))
    if guesser.lost:
        print(guesser.message('LOST'))
    # Read the flag while the pool is still open.
    done = guesser.done
if done:
    os.remove(pool_fn)
We use our library function to reopen the pool, which checks for the various
error conditions and aborts with the appropriate message if we run into any of
them. Then we grab the guesser instance from the pool’s root and pass the
guess the player made to its check_guess method to evaluate, printing the
message associated with whatever guess status it returns, and removing the game
file if and only if the game is over.
And now we can easily implement the guess_status command mentioned earlier:
#!/usr/bin/env python
import sys
from guess_lib import reopen_game, GameError

try:
    pool = reopen_game()
except (OSError, GameError) as err:
    print(err)
    sys.exit(1)

with pool:
    guesser = pool.root
    if not guesser.guesses:
        print("No guesses yet, use 'guess <integer>' to make a guess")
        sys.exit(0)
    print("guesses so far:")
    for guess in guesser.guesses:
        print("  {}".format(guess))
    print("my response to your last guess:")
    print("  {}".format(guesser.message(guesser.current_outcome)))
The pattern here is one I expect many persistent memory applications will share
(possibly via a single program with subcommands or sub-functions, rather than
the multiple program files in this example): the persistent memory is accessed
through an instance of an application specific class that is assigned to the
root of the object pool. When run, the application makes sure it can access
the pool, then grabs the instance from root and uses the instance’s methods to
accomplish the application’s goals.
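The single-program variant mentioned in passing could be organized with the standard argparse module. The sketch below shows only the dispatch skeleton: the program name guessing, the subcommand names, and the handler functions are all invented for this illustration, and the handlers return placeholder strings where a real program would open the pool and call methods on the object stored in root.

```python
import argparse


def cmd_start(args):
    # Placeholder: a real handler would create the pool and set its root.
    return "starting a game for {}".format(args.name)


def cmd_guess(args):
    # Placeholder: a real handler would reopen the pool and check the guess.
    return "checking guess {}".format(args.value)


def cmd_status(args):
    # Placeholder: a real handler would list the guesses made so far.
    return "reporting status"


def build_parser():
    parser = argparse.ArgumentParser(prog="guessing")
    sub = parser.add_subparsers(dest="command", required=True)

    start = sub.add_parser("start", help="begin a new game")
    start.add_argument("name")
    start.set_defaults(func=cmd_start)

    guess = sub.add_parser("guess", help="make a guess")
    guess.add_argument("value", type=int)
    guess.set_defaults(func=cmd_guess)

    status = sub.add_parser("status", help="show guesses so far")
    status.set_defaults(func=cmd_status)
    return parser


def main(argv):
    args = build_parser().parse_args(argv)
    return args.func(args)


message = main(["guess", "25"])
```

A command-line entry point would call main(sys.argv[1:]) and print the result; each handler then follows the same pattern as the scripts above: make sure the pool is accessible, fetch the instance from root, and call its methods.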