The pickle module in Python is used for serializing and deserializing Python objects. Notice that the stress is on Python objects. The goal of this tutorial is to help you understand the concepts behind the pickle module in Python and how to use it. I will show some examples and also give you some problems to practice in the Jupyter notebook below. So, let’s get started.
- How to use the Pickle module in Python
- Serialization and Deserialization of Python objects
- Storing and retrieving data using Pickle
- Best practices for working with Pickle
Basics Of Serialization
Before we look at how the pickle module works, we need to understand what it is designed for. In programming, we often need to save complex data structures or objects for later use, or to transfer them between different parts of a program or even between different programs. This process of converting a complex data structure or object into a format that can be stored or transmitted is called serialization. The reverse process, where we reconstruct the object from the serialized format, is called deserialization.
Think of serialization as a way of “flattening” a complex, multi-dimensional object into a simple, one-dimensional stream of data. Here’s a conceptual diagram to illustrate this:
Complex Object               Serialized Data
┌─────────────────┐          ┌───────────────────────┐
│    ┌───┐        │          │                       │
│  A │ B │ C      │   ───►   │ A B C D E F G H I J K │
│    └─┬─┘        │          │                       │
│      └───┐      │          └───────────────────────┘
│    D   E │ F    │
│    ┌─────┘      │
│  G │ H I J K    │
└─────────────────┘
Why Do We Need Serialization?
You might wonder why we can’t just use normal data storage methods. The issue lies in the complexity and structure of the data. Consider a nested dictionary with various data types:
data = {
    'user': {
        'name': 'Alice',
        'age': 30,
        'hobbies': ['reading', 'cycling'],
        'address': {
            'street': '123 Main St',
            'city': 'Anytown'
        }
    },
    'transactions': [
        {'date': '2023-01-01', 'amount': 100.50},
        {'date': '2023-01-02', 'amount': 200.75}
    ]
}
Code language: Python (python)
Storing this structure in a text file would require a custom format and parser, while a serializer such as pickle handles it automatically, so you don’t have to write the parsing logic yourself. Serialization also preserves type information, which is crucial for accurate reconstruction; simple storage methods might convert everything to strings and lose it. Additionally, serialization handles complex object relationships: if the data contains circular references, pickle keeps track of them for you.
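To make the points about type information and circular references concrete, here is a minimal sketch that sends the same data through JSON and through pickle. The `record` and `loop` values are made up for illustration.

```python
import json
import pickle

# Hypothetical record mixing types that a text format tends to flatten
record = {"point": (1, 2), 7: "integer key"}

# JSON round trip: the tuple comes back as a list and the int key as a string
via_json = json.loads(json.dumps(record))
print(via_json)

# pickle round trip: the tuple and the int key come back unchanged
via_pickle = pickle.loads(pickle.dumps(record))
print(via_pickle)

# A circular reference: the list contains itself
loop = [1, 2]
loop.append(loop)
restored = pickle.loads(pickle.dumps(loop))
print(restored[2] is restored)  # the cycle is rebuilt, not copied endlessly
```

JSON is still the better choice when other languages need to read the data; the sketch only shows what is lost or kept in each round trip.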
Now let’s talk about the use cases and how it is helpful. Imagine you’re developing a game where players can save their progress. The game state might include the player’s position, inventory, health, and completed quests. This could be represented as a complex object:
class GameState:
    def __init__(self):
        self.player_position = (0, 0)
        self.inventory = ['sword', 'shield']
        self.health = 100
        self.completed_quests = ['tutorial']

# During gameplay
current_game = GameState()
current_game.player_position = (10, 20)
current_game.inventory.append('potion')
current_game.health = 80
current_game.completed_quests.append('rescue_villagers')
Code language: Python (python)
Without serialization, you’d need to manually write each attribute to a file and parse it when loading, which would be error-prone and inefficient for complex objects.
# Saving game (serialization)
import pickle

with open('save_game.pkl', 'wb') as file:
    pickle.dump(current_game, file)

# Later, loading game (deserialization)
with open('save_game.pkl', 'rb') as file:
    loaded_game = pickle.load(file)

print(loaded_game.player_position)  # (10, 20)
print(loaded_game.inventory)  # ['sword', 'shield', 'potion']
Code language: Python (python)
You don’t have to be intimidated by what’s going on here. Just know that dump and load do the conversion for us. This was just one example, but there are several such use cases:
- When different parts of a system need to communicate, they often need to exchange complex data structures. Serialization allows this data to be transmitted over simple protocols like HTTP or through message queues.
- Caching involves storing the results of expensive operations for future use. Serialization allows these results to be stored in a format that can be quickly retrieved and deserialized.
- Creating a deep copy of an object (a copy that includes copies of nested objects) can be achieved through serialization.
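The deep-copy use case from the list above can be sketched in a few lines. The `original` dictionary is a made-up example; for everyday code, `copy.deepcopy` from the standard library is the more usual tool, and this only illustrates the idea.

```python
import pickle

# Hypothetical nested structure
original = {"scores": [1, 2, 3], "meta": {"tags": ["a", "b"]}}

# Serializing and immediately deserializing yields an independent deep copy
clone = pickle.loads(pickle.dumps(original))

# Changes to the clone's nested objects do not touch the original
clone["scores"].append(4)
clone["meta"]["tags"].clear()

print(original)
print(clone)
```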
One of the most important use cases for the pickle module is saving your machine learning models for later use:
import pickle
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

# Train a simple model
iris = load_iris()
clf = DecisionTreeClassifier()
clf.fit(iris.data, iris.target)

# Save the model
with open('model.pkl', 'wb') as file:
    pickle.dump(clf, file)

# Later, load the model
with open('model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

# Use the loaded model
print(loaded_model.predict(iris.data[:1]))
Code language: Python (python)
Several serialization techniques are widely used in the industry. Here are some of them:
- JSON (JavaScript Object Notation) is a lightweight format used to store and exchange data. It’s easy for both humans and machines to read and write. JSON represents data in a way that’s compatible with many programming languages, supporting basic data types like strings, numbers, arrays, and objects. However, JSON has limitations when it comes to handling more complex data types.
- XML (eXtensible Markup Language) is a format that focuses on structuring data using tags. It can represent complex data structures and is self-descriptive, meaning the data includes information about its structure. While XML is very flexible, it tends to be more verbose and can be slower to process compared to JSON.
- YAML (YAML Ain’t Markup Language) is a format that combines human readability with the ability to represent complex data structures. It’s more powerful than JSON and is often used for configuration files and data that needs to be easily understood by people. Despite its readability, YAML can be more complex to parse and might have issues with formatting if not handled carefully.
Pickle Module In Python: A Unique Approach
Python’s pickle module takes a different approach to serialization. Unlike JSON or XML, which are designed to be language-independent, pickle is Python-specific. This allows it to serialize almost any Python object, including references to functions and classes. Pickle traverses the object structure, including nested objects and circular references, and generates a stream of opcodes for a small stack-based virtual machine; when the unpickler executes them, they reconstruct the object. Along the way, pickle keeps track of object references so that shared and circular references are handled correctly.
The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, or “flattening”.
The pickle module uses protocols to define the format of the serialized data. Higher protocol versions are more efficient, but older Python versions may not be able to read them.
- Protocol 0: The original ASCII protocol
- Protocol 1-4: Increasingly efficient binary protocols
- Protocol 5 (added in Python 3.8): Adds support for out-of-band data and more efficient serialization of certain object types. Protocol 4 is the default in Python 3.8 through 3.13, and protocol 5 is the default from Python 3.14
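As a rough illustration of the protocol list above, the following sketch pickles the same dictionary with protocol 0 and with the highest protocol your interpreter offers. The `data` dictionary is only an example, and the printed numbers depend on your Python version.

```python
import pickle

data = {"name": "Alice", "age": 30}

# Which protocols does this interpreter use and support?
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

# Protocol 0 produces ASCII output; higher protocols are binary
text_like = pickle.dumps(data, protocol=0)
binary = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(text_like)
print(len(text_like), len(binary))

# Binary protocols (2 and up) start with the PROTO opcode (0x80)
# followed by the protocol number
print(binary[0] == 0x80, binary[1] == pickle.HIGHEST_PROTOCOL)

# Both byte strings load back to the same object
print(pickle.loads(text_like) == pickle.loads(binary) == data)
```

If the pickled data has to be read by an older Python version, passing an explicit lower `protocol` value is one way to stay compatible.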
The pickle module supports only specific data types:
- None, True, and False
- Integers, floating-point numbers, and complex numbers
- Strings, bytes, and bytearrays
- Tuples, lists, sets, and dictionaries containing only picklable objects
- Functions defined at the top level of a module (using def, not lambda)
- Built-in functions defined at the top level of a module
- Classes that are defined at the top level of a module
- Instances of such classes whose __dict__ or the result of calling __getstate__() is picklable
Now what do I mean by the top level of the module? It refers to objects (functions, classes, etc.) that are defined directly in a module, so pickle can find them again by name when unpickling.
# This function is at the top level of the module
def top_level_function():
    pass

class TopLevelClass:
    # This method is nested inside the class, not at the top level of the module
    def method(self):
        pass

# Pickle can serialize top_level_function and TopLevelClass by name.
# Recent protocols can also reach TopLevelClass.method through its qualified name,
# but lambdas and functions defined inside other functions cannot be pickled.
Code language: Python (python)
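To see where this name-based lookup breaks down, here is a small sketch that tries to pickle a built-in function, a top-level function, a function created inside another function, and a lambda. The helper names `make_adder` and `try_pickle` are invented for this example.

```python
import pickle


def make_adder(n):
    # 'add' exists only inside make_adder, so pickle cannot look it up by name
    def add(x):
        return x + n
    return add


def try_pickle(obj):
    """Return True if obj can be pickled, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


print("built-in len:   ", try_pickle(len))
print("top-level def:  ", try_pickle(make_adder))
print("nested function:", try_pickle(make_adder(1)))
print("lambda:         ", try_pickle(lambda x: x + 1))
```

The exact exception type for the failing cases differs between Python versions, which is why the helper catches both `PicklingError` and `AttributeError`.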
When Pickle serializes a function or class, it doesn’t store the actual code. Let’s explore this with examples:
import pickle

def greet(name):
    return f"Hello, {name}!"

# Pickle the function
pickled_func = pickle.dumps(greet)

# In another Python session or script:
unpickled_func = pickle.loads(pickled_func)
print(unpickled_func("Alice"))  # "Hello, Alice!"
Code language: Python (python)
Here, pickle doesn’t store the function’s code. It stores only a reference to it: the function name (greet) and the name of the module where it’s defined. When unpickling, Python looks for a function named greet in that module. This allows you to modify the function’s implementation without breaking existing pickled data:
# Modified function (after pickling)
def greet(name):
    return f"Greetings, {name}! How are you?"

# Unpickle using the old data
unpickled_func = pickle.loads(pickled_func)
print(unpickled_func("Bob"))  # "Greetings, Bob! How are you?"
Code language: Python (python)
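If you want to check for yourself that only a reference is stored, one option is to inspect the pickled bytes. This is a sketch that assumes the code runs as a script; the text inside `greet` is a marker added only for this demonstration.

```python
import pickle
import pickletools


def greet(name):
    return f"Hello, {name}! This text lives in the function body."


payload = pickle.dumps(greet)
print(payload)

# The module name and the function name are in the payload...
print(greet.__module__.encode() in payload, b"greet" in payload)
# ...but nothing from the function body is
print(b"This text lives" in payload)

# pickletools.dis prints the opcodes the unpickler will execute
pickletools.dis(payload)
```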
The unpickled function now uses the new implementation. The same applies to classes:
import pickle

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def greet(self):
        return f"Hello, I'm {self.name}"

# Create and pickle an instance
alice = Person("Alice", 30)
pickled_alice = pickle.dumps(alice)

# In another Python session or script:
unpickled_alice = pickle.loads(pickled_alice)
print(unpickled_alice.greet())  # "Hello, I'm Alice"

# Modify the class (after pickling)
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def greet(self):
        return f"Greetings! My name is {self.name} and I'm {self.age} years old"

# Unpickle using the old data
unpickled_alice = pickle.loads(pickled_alice)
print(unpickled_alice.greet())  # "Greetings! My name is Alice and I'm 30 years old"
Code language: Python (python)
Here, Pickle stored:
- The class name (‘Person’)
- The module where it’s defined
- The instance attributes (‘name’ and ‘age’)
When unpickling, it uses the current definition of the Person class, allowing you to add or modify methods without breaking compatibility with existing pickled instances. This approach allows for bug fixes and feature additions in long-lived systems where objects may be pickled and unpickled across different versions of the software. However, it also means you need to be careful about maintaining backwards compatibility in your code, especially if pickled objects will be used across different versions of your software.
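One compatibility pitfall worth seeing in code: unpickling restores the saved attributes without calling `__init__`, so an attribute added in a newer version of the class is missing on old instances. The sketch below assumes a hypothetical `email` attribute added later.

```python
import pickle


class Person:
    def __init__(self, name):
        self.name = name


# Bytes saved by the "old" version of the software
old_bytes = pickle.dumps(Person("Alice"))


# A later version of the class adds an attribute in __init__
class Person:
    def __init__(self, name, email="unknown"):
        self.name = name
        self.email = email


restored = pickle.loads(old_bytes)

# __init__ was not called, so the old instance has no 'email' attribute
print(hasattr(restored, "email"))

# A defensive lookup keeps old pickles usable
print(getattr(restored, "email", "unknown"))
```

Reading new attributes with a default, as in the last line, is one simple way to cope; setting defaults inside a `__setstate__` method is another.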
Despite its benefits, pickle has some limitations that you need to keep in mind. It’s not secure against erroneous or maliciously constructed data, so never unpickle data received from an untrusted or unauthenticated source. It’s specific to Python and isn’t easily readable by other programming languages. And not all objects can be pickled: file objects, connection objects, and other objects tied to external resources typically can’t be.
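To illustrate why untrusted pickles are dangerous, here is a deliberately harmless sketch. The hypothetical `Payload` class uses `__reduce__` to tell pickle to call `print` during loading; an attacker could name a far more destructive callable in the same way.

```python
import pickle


class Payload:
    # __reduce__ tells pickle which callable to invoke when loading,
    # and with which arguments
    def __reduce__(self):
        return (print, ("this ran during unpickling",))


malicious_bytes = pickle.dumps(Payload())

# Loading calls print(...) instead of rebuilding a Payload instance
result = pickle.loads(malicious_bytes)
print(result)  # whatever the callable returned
```

Note that the loading side does not need the `Payload` class at all, because the bytes only refer to the callable to run.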
You can visit the official Python documentation for more details. JSON is another way to serialize data, but there are some differences. The comparison below is taken from the official Python documentation:
Areas | JSON | Pickle Module |
---|---|---|
Serialization Format | Text-based (outputs Unicode text, often UTF-8 encoded) | Binary |
Human Readability | Human-readable | Not human-readable |
Interoperability | Interoperable and widely used outside Python | Python-specific |
Supported Types | Limited to a subset of Python built-in types; no custom classes | Supports a wide range of Python types, including custom classes |
Security | Safe for deserializing untrusted data (does not execute arbitrary code) | Can pose security risks (untrusted data may lead to arbitrary code execution) |
Methods In Pickle Module
Now let’s look at the functions and constants that we commonly use:
Method Name | Description | Arguments |
---|---|---|
dump() | Serializes an object and writes it to a file object. | obj, file, protocol |
dumps() | Serializes an object and returns it as a byte stream. | obj, protocol |
load() | Deserializes an object from a file object. | file |
loads() | Deserializes an object from a byte stream. | byte_stream |
HIGHEST_PROTOCOL | Constant representing the highest protocol version available for pickling. | – |
DEFAULT_PROTOCOL | Constant representing the default protocol version used by pickle. | – |
The dump() method is used to serialize an object hierarchy and write it to a file-like object.
import pickle

data = {'name': 'Alice', 'age': 30, 'city': 'New York'}
with open('data.pkl', 'wb') as file:
    pickle.dump(data, file)
Code language: Python (python)
In this example, we’re serializing a dictionary and writing it to a file named ‘data.pkl’. The ‘wb’ mode opens the file for writing in binary mode, which is necessary for pickle.
The load() method deserializes an object from a file-like object.
import pickle

with open('data.pkl', 'rb') as file:
    loaded_data = pickle.load(file)
print(loaded_data)  # {'name': 'Alice', 'age': 30, 'city': 'New York'}
Code language: Python (python)
This method reads the serialized data from the file ‘data.pkl’ and reconstructs the original object.
The dumps() method serializes an object hierarchy and returns it as a bytes object, without writing it to a file.
import pickle

data = ['apple', 'banana', 'cherry']
serialized_data = pickle.dumps(data)
print(serialized_data)  # a bytes object, e.g. starting with b'\x80\x04\x95\x1f...' (exact bytes depend on the protocol)
Code language: Python (python)
The loads() method deserializes an object from a bytes object.
import pickle

# Bytes produced earlier by dumps()
serialized_data = pickle.dumps(['apple', 'banana', 'cherry'])
deserialized_data = pickle.loads(serialized_data)
print(deserialized_data)  # ['apple', 'banana', 'cherry']
Code language: Python (python)
This method is the counterpart to dumps() and is useful when you’ve received serialized data from a network transmission or database and want to reconstruct the original object.
These are the four important methods you need to know.
As mentioned previously, not all objects support serialization. Some Python objects, like file handles, database connections, or other objects involving external resources, are inherently non-serializable. These objects often contain references to external systems or states that cannot be easily converted into a byte stream and restored later. To handle objects that cannot be pickled directly, you can implement custom serialization methods:
- Determine which attributes of the object can be serialized. Ensure that any resources or non-picklable attributes are either excluded or transformed into a serializable form.
- Implement the __getstate__ method to return the serializable state of your object. It should return a dictionary with the attributes you want to serialize.
- Implement the __setstate__ method to restore the object’s state from the dictionary provided by __getstate__. Reinitialize any non-picklable resources as needed.
- Test the serialization and deserialization processes thoroughly to ensure that the object is restored to its original state correctly.
Imagine you have a class that manages a connection to an external resource, such as a file handle. You want to serialize instances of this class using the pickle module, but the file handle cannot be pickled directly. We’ll implement custom methods to handle this situation.
class ResourceManager:
    def __init__(self, data, filename):
        self.data = data  # This is serializable
        self._resource = open(filename, 'w')  # This is non-picklable

    def write_data(self):
        self._resource.write(self.data)
Code language: Python (python)
Here, self.data is a simple string that can be easily serialized, while self._resource is a file handle that can’t be pickled directly.
First, we implement the __getstate__ method. This method is called when pickle serializes the object. It should return a dictionary containing only the serializable state of the object.
class ResourceManager:
    def __init__(self, data, filename):
        self.data = data
        self._resource = open(filename, 'w')

    def write_data(self):
        self._resource.write(self.data)

    def __getstate__(self):
        # Create a copy of the object's state
        state = self.__dict__.copy()
        # Remove the non-picklable _resource attribute
        del state['_resource']
        return state
Code language: Python (python)
In __getstate__, we create a copy of the object’s state (self.__dict__). Then, we remove the _resource attribute because it cannot be pickled. The __getstate__ method ensures that the non-picklable _resource attribute is excluded from the serialized data.
Now, we implement the __setstate__ method. This method is called when pickle deserializes the object. It receives the state dictionary and should restore the object’s attributes, including reinitializing the _resource.
class ResourceManager:
    def __init__(self, data, filename):
        self.data = data
        self._resource = open(filename, 'w')

    def write_data(self):
        self._resource.write(self.data)

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['_resource']
        return state

    def __setstate__(self, state):
        # Restore the object's state
        self.__dict__.update(state)
        # Reinitialize the non-picklable _resource attribute
        self._resource = open('restored_file.txt', 'w')
Code language: Python (python)
In __setstate__, we update the object’s __dict__ with the state dictionary. Then, we reinitialize the _resource attribute, which could not be serialized. In simple words, the __setstate__ method restores the object’s state and reinitializes the _resource attribute to point to a new file (restored_file.txt).
import pickle

# Create an instance of ResourceManager
manager = ResourceManager("Hello, World!", "original_file.txt")
manager.write_data()  # Writes data to original_file.txt

# Serialize the manager object to a file
with open('manager.pkl', 'wb') as f:
    pickle.dump(manager, f)

# Deserialize the manager object from the file
with open('manager.pkl', 'rb') as f:
    restored_manager = pickle.load(f)

# Check the restored object's state
print(restored_manager.data)  # Output: Hello, World!
restored_manager.write_data()  # Writes data to restored_file.txt
Code language: Python (python)
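The example above reopens a fixed file name after unpickling. As a possible variation, and not the only way to do it, the object can remember its own file name and reopen that same path. `FileBackedNote`, its `filename` attribute, and the temporary path are all assumptions made for this sketch.

```python
import os
import pickle
import tempfile


class FileBackedNote:
    """Hypothetical variant of ResourceManager that remembers its filename."""

    def __init__(self, data, filename):
        self.data = data
        self.filename = filename  # plain string, so it pickles fine
        self._resource = open(filename, "a")

    def write_data(self):
        self._resource.write(self.data)
        self._resource.flush()

    def close(self):
        self._resource.close()

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_resource"]  # drop the open file handle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Reopen the same path in append mode so existing content is kept
        self._resource = open(self.filename, "a")


path = os.path.join(tempfile.mkdtemp(), "note.txt")
note = FileBackedNote("Hello, World!\n", path)
note.write_data()

# Round trip through pickle; the copy reopens the same file
restored = pickle.loads(pickle.dumps(note))
restored.write_data()

note.close()
restored.close()

with open(path) as f:
    content = f.read()
print(content)
```

Whether reopening the original path is the right behavior depends on your application; if the pickle may be loaded on another machine, the path might not exist there.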
Always test the custom serialization and deserialization process to ensure that your objects are correctly restored to their original state. Now, a few things you need to keep in mind. When dealing with large datasets or complex objects, the pickled data can be large, which may lead to memory issues or slow performance. Consider using the bz2, gzip, or lzma modules to compress the serialized data to save space.
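A minimal sketch of the compression idea, using gzip on in-memory bytes. The `rows` list is invented sample data, and how much space you save depends heavily on what you are pickling.

```python
import gzip
import pickle

# Hypothetical repetitive dataset
rows = [{"id": i, "label": "item"} for i in range(5000)]

raw = pickle.dumps(rows)
compressed = gzip.compress(raw)
print(len(raw), len(compressed))

# Decompress first, then unpickle
restored = pickle.loads(gzip.decompress(compressed))
print(restored == rows)
```

When writing to disk, `gzip.open('data.pkl.gz', 'wb')` returns a file-like object that can be passed to `pickle.dump()` in the same way as a regular binary file.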
Only unpickle data you trust, or consider using safer serialization formats like JSON when exchanging data with untrusted sources. The pickle module also does no locking of its own. If several threads share the same file, or modify an object while it is being pickled, protect the pickling and unpickling operations with locks to avoid data corruption or unexpected behavior.
Practice Problems
That’s all for this tutorial. Feel free to check out the Python documentation for more details but the best way to learn this module is to practice the questions below: