In the course of preparing a talk on Rust-Python interoperation, I have come to learn terrible, horrible things about Python's memory model.
But first, what is the Python memory model? Even if you're a Python developer, you may not be aware of any such thing: indeed, that was my experience in a job interview. Asked to describe the Rust memory model, I managed just fine. Asked to describe Python's, I floundered.
There is good reason for this. Rule #1 of the Python memory model: the user should never have to think about the Python memory model.
Python is a Thick Wrapper for C
Quick refresher on how Python works:
Any Python you write is first turned into Python bytecode, an intermediate representation. For example, print("Hello, world!")
(in a jupyter notebook) is evaluated to:
 81           0 RESUME                   0
 86           2 LOAD_GLOBAL              1 (NULL + compile)
             14 LOAD_FAST                1 (source)
             16 LOAD_FAST                2 (filename)
             18 LOAD_FAST                3 (symbol)
             20 LOAD_FAST                0 (self)
             22 LOAD_ATTR                1 (flags)
             32 LOAD_GLOBAL              4 (PyCF_ONLY_AST)
             44 BINARY_OP                7 (|)
             48 LOAD_CONST               1 (1)
             50 PRECALL                  5
        -->  54 CALL                     5
             64 RETURN_VALUE
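You can produce dumps like this yourself with the standard library's dis module; a minimal sketch (the exact opcodes differ across Python versions -- PRECALL, for instance, only exists in 3.11):

import dis

# Compile and disassemble a snippet to see the bytecode CPython will run.
dis.dis('print("Hello, world!")')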
Some operations, like control flow, are handled entirely in bytecode. But anything that would create, modify, or drop an object--in other words, any useful work the program does--ultimately bottoms out in a call to the CPython API, such as the call to compile (with the PyCF_ONLY_AST flag) in the disassembly above.
This is because Python is basically C, except it's wearing ten layers of trench coats. Python objects are represented under the hood by C structs, fundamental Python methods call C functions, and so on. They just happen to do this through numerous API layers, sacrificing performance (both runtime and memory) for safety.
So, question: what does an integer look like in Python?
What Have They Done to my Boy!?
x = 257
By default, a small integer object created in Python stores its payload in a single 4-byte digit--effectively a 32-bit integer (i32 to Rustaceans, int32_t in C, 'long' to you Windows C++ heathens). So it takes up 32 bits (4 bytes) of memory, right?
x = 257
x.__sizeof__()
> 28
Nope. 28 bytes.
How the hell does this happen?
Well, first off, let's think back to the Python memory model. How do you design a programming language's memory model so that its users basically never have to think about it?
- Everything is reference-counted.
- Everything is heap-allocated.
- For good measure, everything is dynamically typed and sized.
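You can watch the first of those rules in action without leaving Python. A quick sketch (sys.getrefcount always reports one extra, because the argument it receives is itself a temporary reference):

import sys

x = object()
print(sys.getrefcount(x))  # 2: the name x, plus getrefcount's own argument
y = x                      # a second name for the same heap object
print(sys.getrefcount(x))  # 3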
That means that, in addition to the contents of the number itself, we have some tagalongs:
struct _longobject {
    Py_ssize_t    ob_refcnt;   /* reference count */
    PyTypeObject *ob_type;     /* pointer to this object's type */
    Py_ssize_t    ob_size;     /* number of digits (its sign is the int's sign) */
    digit         ob_digit[1]; /* payload: 30-bit digits in 4-byte slots */
};
First, the refcount: on 64-bit machines, this is a signed 64-bit integer.
Yes, you heard me right. Signed. Not unsigned. Despite the fact that if this number is ever negative, it means something has gone catastrophically wrong. Anyway, that's 8 bytes.
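Because id() in CPython happens to be the object's memory address, you can even pull that signed refcount straight off the front of the struct with ctypes. A sketch that leans entirely on implementation details, so treat it as a party trick:

import ctypes
import sys

x = 257
# ob_refcnt is the first field of every object header, so a signed
# 8-byte read at the object's address recovers it (CPython-only trick).
print(ctypes.c_ssize_t.from_address(id(x)).value)
print(sys.getrefcount(x) - 1)  # should agree, minus the call's own temp ref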
Second, we have the type pointer: every object carries a pointer to its type object, which is how the interpreter can figure out at runtime what any given value is--dynamic typing's price of admission. That's another 8 bytes.
Third, the size counter. What, you think a pylong is always a long? Think again. This is how Python protects against over/underflow. If an operation produces a number that can no longer be contained within the 4 bytes initially allocated, we just... allocate more digits! But to keep track of how much space we've allocated, we need a size counter. And wouldn't you know it, this is also 8 bytes.
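You can watch the digits pile up with sys.getsizeof; a quick sketch (the exact byte counts are CPython- and version-specific):

import sys

# Each extra 30-bit digit costs another 4 bytes on top of the header.
for n in (1, 2**30, 2**60, 2**90):
    print(f"{n.bit_length():3d} bits -> {sys.getsizeof(n)} bytes")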
So we've got 24 bytes of metadata tagging along with 4 bytes of payload. Efficiency incarnate!
Except, that's not quite all. This is just what the sizeof dunder method tells us about, but chances are we're using more. All of this is heap-allocated (because the integer itself needs to be growable, and the metadata comes along for the ride), but you're not just going to allocate 28 bytes: you'll want another 4 bytes of padding on the end to ensure that the next object is aligned. So our actual memory cost rises to 32 bytes.
And note that, since this is heap-allocated, we need a pointer on the stack to refer to it: that's another 8 bytes, bringing our total memory use across stack and heap to 40 bytes... for a 4-byte integer.
Hooray.
This has several consequences: for one, you can only fit two of these damn things in a 64-byte cache line, and any two numbers you're using almost certainly won't be placed next to each other anyway. I hope you like cache misses.
This is why numpy is mandatory for remotely performant numeric operations: Python abstracts away all the tools you could use to make performant software, and you have to buy them back with other packages.
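To put rough numbers on it, here's a back-of-the-envelope comparison (a sketch: getsizeof over-counts the shared interned ints a little, and exact sizes vary by version and platform):

import sys
import numpy as np

n = 1_000_000
xs = list(range(n))
# The list stores n 8-byte pointers; every element is also
# a separate ~28-byte object out on the heap.
list_bytes = sys.getsizeof(xs) + sum(sys.getsizeof(x) for x in xs)

# numpy packs the same values into one contiguous buffer, 8 bytes apiece.
arr = np.arange(n)
print(f"list of ints: ~{list_bytes / 1e6:.0f} MB")
print(f"numpy array:  ~{arr.nbytes / 1e6:.0f} MB")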
Internment
But of course, I left something out earlier. See, Python knows that this is an extremely memory-heavy setup it has going, so it tries to optimize things.
For one, it makes use of peephole optimizations. In the process of bytecode creation, multiple instances of the same numeric literal in the same compilation unit will be aliased to the same constant object--and thus the same pointer.
x = 257
y = 257
print(f"ID of x: {id(x)}")
print(f"ID of y: {id(y)}")
print(f"Are they the same object? {x is y}")
ID of x: 4367987024
ID of y: 4367987024
Are they the same object? True
This will occur, for example, in a .py script, but not in a REPL or jupyter notebook, even in the same cell.
This way, it avoids allocating new heap memory... until one of those names is rebound to a different value, at which point a fresh object gets allocated.
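You can see the sharing directly in the compiled code object's constants table; a small sketch:

# Both 257 literals collapse into a single entry in co_consts.
code = compile("x = 257\ny = 257", "<demo>", "exec")
print(code.co_consts)  # (257, None): one shared 257 object for both lines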
But my choice of number is not arbitrary. If I use a smaller integer, like 256:
x = 256
y = 256
print(f"ID of x: {id(x)}")
print(f"ID of y: {id(y)}")
print(f"Are they the same object? {x is y}")
ID of x: 4380885464
ID of y: 4380885464
Are they the same object? True
We get this result even if they are not in the same compilation unit.
This is because of another Python optimization: integer interning.
Basically, when the Python interpreter starts up, it allocates the numbers -5 through 256 in its private heap, and any time one of these numbers is needed, it just hands out a pointer to that number's internal, permanent allocation.
This is very convenient... but it also means that all instances of these small numbers are actually the same object, which means they're never mutated. They're basically constants.
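You can probe the edge of the pool by forcing values to be computed at runtime, so the compile-time constant sharing from earlier can't muddy the water (this is a CPython implementation detail, not a language guarantee):

def f(n):
    return n + 0  # computed at runtime: no compile-time constant sharing

# 256 is inside the interned pool, so every computation of it hands back
# the same object; 257 is outside, so each result is a fresh allocation.
print(f(256) is f(256))  # True
print(f(257) is f(257))  # False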
So when you do x-=1 in Python, what happens?
Well, Python sends the operation down into C, which reads the internal numerical representation and computes the result. Then it sees... hey, the result is one of our interned numbers! No need to allocate new memory or mutate anything, let's just change the pointer!
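You can watch the pointer swap happen (a sketch; the id() values here are CPython memory addresses):

x = 100
print(id(x) == id(100))  # True: x already points at the interned 100
x -= 1
print(id(x) == id(99))   # True: the "mutation" just repointed x at 99

y = 1000
before = id(y)
y -= 1
print(id(y) == before)   # False: outside the pool, a brand-new object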
Madness.
Absolute madness.