Pickle Module In Python: A Comprehensive Guide For Beginners

The pickle module in Python is used for serializing and deserializing Python objects. Notice the stress is on Python objects. The goal of this tutorial is to help you understand the concepts related to the pickle module in Python and how to use it. I will show some examples and also give you some problems to practice in the Jupyter notebook below. So, let’s get started.

Free

Beginner

2 Days to Learn

Pickle Module

Prerequisites

Basic Python Knowledge

Understanding of File I/O

What You Will Learn

How to use the Pickle module in Python
Serialization and Deserialization of Python objects
Storing and retrieving data using Pickle
Best practices for working with Pickle

Tutorial Outcomes

By the end of this tutorial on the Pickle module, you’ll have a good understanding of how to serialize and deserialize Python objects. You’ll be able to store and retrieve complex data structures, ensuring data persistence across sessions. This knowledge will enable you to handle Python objects efficiently in applications that require data storage, making your programs more versatile and robust.

This Tutorial Includes

Concepts & Explanation

Code Examples

Code Exercise Notebook

Basics Of Serialization

Before we look at how the pickle module works, we need to understand what it is designed for. In programming, we often need to save complex data structures or objects for later use, or to transfer them between different parts of a program or even between different programs. This process of converting a complex data structure or object into a format that can be stored or transmitted is called serialization. The reverse process, where we reconstruct the object from the serialized format, is called deserialization.

Think of serialization as a way of “flattening” a complex, multi-dimensional object into a simple, one-dimensional stream of data. Here’s a conceptual diagram to illustrate this:

Complex Object                Serialized Data
  ┌─────────────────┐           ┌───────────────────────┐
  │    ┌───┐        │           │                       │
  │ A  │ B │   C    │   ───►    │ A B C D E F G H I J K │
  │    └─┬─┘        │           │                       │
  │      └───┐      │           └───────────────────────┘
  │      D E │ F    │
  │    ┌─────┘      │
  │  G │ H I J K    │
  └─────────────────┘

Why Do We Need Serialization?

You might wonder why we can’t just use normal data storage methods. The issue lies in the complexity and structure of the data. Consider a nested dictionary with various data types:

data = {
    'user': {
        'name': 'Alice',
        'age': 30,
        'hobbies': ['reading', 'cycling'],
        'address': {
            'street': '123 Main St',
            'city': 'Anytown'
        }
    },
    'transactions': [
        {'date': '2023-01-01', 'amount': 100.50},
        {'date': '2023-01-02', 'amount': 200.75}
    ]
}
Code language: Python (python)

Storing this structure in a text file would require a custom format and parser, while serialization handles it automatically. This means you won’t have to write the parsing code by hand. A serializer such as pickle also preserves type information, which is crucial for accurate reconstruction; simple storage methods might convert everything to strings, losing this type information. Additionally, serialization handles complex object relationships. For example, if there are circular references, pickle handles them for you.
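To make the points about type information and circular references concrete, here is a minimal sketch. The variable names are made up for illustration; only pickle.dumps and pickle.loads come from the standard library.

```python
import pickle

# A structure mixing several built-in types
record = {
    "point": (10, 20),   # tuple
    "tags": {"a", "b"},  # set
    "raw": b"\x00\x01",  # bytes
    "ratio": 0.5,        # float
}

# A circular reference: the list contains itself
loop = [1, 2]
loop.append(loop)

# A shared reference: the same inner list appears twice
inner = ["shared"]
pair = [inner, inner]

restored_record = pickle.loads(pickle.dumps(record))
restored_loop = pickle.loads(pickle.dumps(loop))
restored_pair = pickle.loads(pickle.dumps(pair))

print(type(restored_record["point"]))        # <class 'tuple'>
print(type(restored_record["tags"]))         # <class 'set'>
print(restored_loop[2] is restored_loop)     # True
print(restored_pair[0] is restored_pair[1])  # True
```

The last two lines show that the restored objects keep the same shape as the originals: the list still contains itself, and the two entries of the pair are still one object rather than two equal copies.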

Now let’s talk about the use cases and how it is helpful. Imagine you’re developing a game where players can save their progress. The game state might include the player’s position, inventory, health, and completed quests. This could be represented as a complex object:

class GameState:
    def __init__(self):
        self.player_position = (0, 0)
        self.inventory = ['sword', 'shield']
        self.health = 100
        self.completed_quests = ['tutorial']

# During gameplay
current_game = GameState()
current_game.player_position = (10, 20)
current_game.inventory.append('potion')
current_game.health = 80
current_game.completed_quests.append('rescue_villagers')
Code language: Python (python)

Without serialization, you’d need to manually write each attribute to a file and parse it when loading, which would be error-prone and inefficient for complex objects.

# Saving game (serialization)
import pickle
with open('save_game.pkl', 'wb') as file:
    pickle.dump(current_game, file)

# Later, loading game (deserialization)
with open('save_game.pkl', 'rb') as file:
    loaded_game = pickle.load(file)

print(loaded_game.player_position)  # (10, 20)
print(loaded_game.inventory)  # ['sword', 'shield', 'potion']
Code language: Python (python)

You don’t have to be intimidated by what’s going on here. Just know that dump and load do the conversion for us. This was just one example, but there are several other use cases:

When different parts of a system need to communicate, they often need to exchange complex data structures. Serialization allows this data to be transmitted over simple protocols like HTTP or through message queues.
Caching involves storing the results of expensive operations for future use. Serialization allows these results to be stored in a format that can be quickly retrieved and deserialized.
Creating a deep copy of an object (a copy that includes copies of nested objects) can be achieved through serialization.
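As a small illustration of the deep-copy use case, the sketch below round-trips a nested dictionary through pickle. The data is invented for the example, and copy.deepcopy is shown alongside because it is usually the more direct tool for this job.

```python
import copy
import pickle

original = {"settings": {"volume": 7}, "scores": [1, 2, 3]}

# Serialize and immediately deserialize to get an independent copy
clone = pickle.loads(pickle.dumps(original))

# Changing the clone's nested objects leaves the original untouched
clone["settings"]["volume"] = 11
clone["scores"].append(4)

print(original)  # {'settings': {'volume': 7}, 'scores': [1, 2, 3]}
print(clone)     # {'settings': {'volume': 11}, 'scores': [1, 2, 3, 4]}

# The standard library's dedicated function gives the same kind of copy
also_clone = copy.deepcopy(original)
print(also_clone == original)  # True
```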

One of the most important use cases for the pickle module is saving your machine learning models for later use:

import pickle
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

# Train a simple model
iris = load_iris()
clf = DecisionTreeClassifier()
clf.fit(iris.data, iris.target)

# Save the model
with open('model.pkl', 'wb') as file:
    pickle.dump(clf, file)

# Later, load the model
with open('model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

# Use the loaded model
print(loaded_model.predict(iris.data[:1]))
Code language: Python (python)

Several serialization techniques are widely used in the industry. Here are some of them:

JSON (JavaScript Object Notation) is a lightweight format used to store and exchange data. It’s easy for both humans and machines to read and write. JSON represents data in a way that’s compatible with many programming languages, supporting basic data types like strings, numbers, arrays, and objects. However, JSON has limitations when it comes to handling more complex data types.
XML (eXtensible Markup Language) is a format that focuses on structuring data using tags. It can represent complex data structures and is self-descriptive, meaning the data includes information about its structure. While XML is very flexible, it tends to be more verbose and can be slower to process compared to JSON.
YAML (YAML Ain’t Markup Language) is a format that combines human readability with the ability to represent complex data structures. It’s more powerful than JSON and is often used for configuration files and data that needs to be easily understood by people. Despite its readability, YAML can be more complex to parse and might have issues with formatting if not handled carefully.

Pickle Module In Python: A Unique Approach

Python’s pickle module takes a different approach to serialization. Unlike JSON or XML, which are designed to be language-independent, pickle is Python-specific. This allows it to serialize almost any Python object, including references to functions and classes. Pickle traverses the object structure, including nested objects, and keeps track of the objects it has already seen so that shared and circular references are restored correctly. The output is a sequence of opcodes for a small stack-based virtual machine; when the unpickler executes them, it reconstructs the object.

From The Python Doc: The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, or “flattening”.

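The pickle module uses protocols to define the format of the serialized data. Higher protocol versions are generally more efficient, but data written with a newer protocol cannot be read by Python versions that predate it. The default differs between Python releases, so check pickle.DEFAULT_PROTOCOL on your interpreter.

Protocol 0: The original human-readable (ASCII) protocol
Protocol 1-4: Increasingly efficient binary protocols; protocol 4 became the default in Python 3.8
Protocol 5 (added in Python 3.8): Adds support for out-of-band data and more efficient serialization of certain object types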
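The protocol list can be explored directly. The following sketch, using made-up sample data, serializes the same object with every protocol the running interpreter supports and records the size of each result.

```python
import pickle

data = {"name": "Alice", "scores": list(range(50))}

print(pickle.DEFAULT_PROTOCOL)  # depends on the Python version
print(pickle.HIGHEST_PROTOCOL)  # depends on the Python version

sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    payload = pickle.dumps(data, protocol=protocol)
    sizes[protocol] = len(payload)

    # Every protocol must round-trip to an equal object
    assert pickle.loads(payload) == data

    # From protocol 2 onward the stream starts with the PROTO opcode (0x80)
    # followed by the protocol number, which is how load() detects the format
    if protocol >= 2:
        assert payload[0] == 0x80 and payload[1] == protocol

print(sizes)
```

Note that pickle.load() and pickle.loads() take no protocol argument for this reason: the protocol is detected from the data itself.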

The pickle module supports only specific data types:

None, True, and False
Integers, floating-point numbers, complex numbers
Strings, bytes, bytearrays
Tuples, lists, sets, and dictionaries containing only picklable objects
Functions defined at the top level of a module (using def, not lambda)
Built-in functions defined at the top level of a module
Classes that are defined at the top level of a module
Instances of such classes whose __dict__ or the result of calling __getstate__() is picklable

Now what do I mean by the top level of a module? It refers to objects (functions, classes, etc.) that are defined directly in a module, not nested inside other structures.

# This function is at the top level of the module
def top_level_function():
    pass

class TopLevelClass:
    # This method is not at the top level of the module
    def method(self):
        pass

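# Pickle can serialize top_level_function and TopLevelClass because both can be looked up by name in the module.
# Objects with no importable name, such as lambdas or functions defined inside another function, cannot be pickled.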
Code language: Python (python)
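A quick way to see the top-level rule in action is to try pickling a few callables and record what happens. This is only a sketch: make_adder and the labels are hypothetical names, and the exact exception type for the failing cases varies between Python versions, so both likely types are caught.

```python
import pickle

def make_adder(n):
    # 'add' is defined inside another function, so it is not at the top level
    def add(x):
        return x + n
    return add

candidates = {
    "built-in function len": len,
    "lambda": lambda x: x * 2,
    "nested function": make_adder(1),
}

results = {}
for label, obj in candidates.items():
    try:
        pickle.dumps(obj)
        results[label] = "picklable"
    except (pickle.PicklingError, AttributeError) as exc:
        results[label] = f"not picklable ({type(exc).__name__})"

for label, outcome in results.items():
    print(f"{label}: {outcome}")
```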

When Pickle serializes a function or class, it doesn’t store the actual code. Let’s explore this with examples:

import pickle

def greet(name):
    return f"Hello, {name}!"

# Pickle the function
pickled_func = pickle.dumps(greet)

# In another Python session or script:
unpickled_func = pickle.loads(pickled_func)
print(unpickled_func("Alice"))  # "Hello, Alice!"
Code language: Python (python)

Here, pickle doesn’t store the function’s code. It stores only the function’s name (greet) and the name of the module where it’s defined. Default argument values and other function attributes are not stored either. When unpickling, Python imports that module and looks up a function named greet in it, so the function must be importable in the environment that does the unpickling. This allows you to modify the function’s implementation without breaking existing pickled data:

# Modified function (after pickling)
def greet(name):
    return f"Greetings, {name}! How are you?"

# Unpickle using the old data
unpickled_func = pickle.loads(pickled_func)
print(unpickled_func("Bob"))  # "Greetings, Bob! How are you?"
Code language: Python (python)

The unpickled function now uses the new implementation. The same is true for classes:

import pickle

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def greet(self):
        return f"Hello, I'm {self.name}"

# Create and pickle an instance
alice = Person("Alice", 30)
pickled_alice = pickle.dumps(alice)

# In another Python session or script:
unpickled_alice = pickle.loads(pickled_alice)
print(unpickled_alice.greet())  # "Hello, I'm Alice"

# Modify the class (after pickling)
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
    
    def greet(self):
        return f"Greetings! My name is {self.name} and I'm {self.age} years old"

# Unpickle using the old data
unpickled_alice = pickle.loads(pickled_alice)
print(unpickled_alice.greet())  # "Greetings! My name is Alice and I'm 30 years old"
Code language: Python (python)

Here, Pickle stored:

The class name (‘Person’)
The module where it’s defined
The instance attributes (‘name’ and ‘age’)

When unpickling, it uses the current definition of the Person class, allowing you to add or modify methods without breaking compatibility with existing pickled instances. This approach allows for bug fixes and feature additions in long-lived systems where objects may be pickled and unpickled across different versions of the software. However, it also means you need to be careful about maintaining backwards compatibility in your code, especially if pickled objects will be used across different versions of your software.
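To check that functions really are stored by reference, one can inspect the pickled bytes. The sketch below uses json.dumps purely as a convenient example of a function defined at the top level of a standard-library module, and pickletools.dis to print the pickle opcodes in readable form.

```python
import json
import pickle
import pickletools

# json.dumps is an ordinary function defined at the top level of the json module
payload = pickle.dumps(json.dumps)

print(len(payload))          # only a few dozen bytes
print(b"json" in payload)    # True: the module name is stored
print(b"dumps" in payload)   # True: the function name is stored

# Print a human-readable listing of the opcodes in the payload
pickletools.dis(payload)

# Unpickling imports the module and looks the name up again,
# so we get back the very same function object
restored = pickle.loads(payload)
print(restored is json.dumps)  # True
```

The listing contains the two names and an opcode that looks them up, and nothing resembling the function body.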

Despite its benefits, there are some limitations that you need to keep in mind. Pickle is not secure against erroneous or maliciously constructed data, so never unpickle data received from an untrusted or unauthenticated source. It’s specific to Python and isn’t easily readable by other programming languages. Not all objects can be pickled. For example, file objects, connection objects, and other objects tied to external resources typically can’t be pickled.
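The security limitation is easier to appreciate with a harmless demonstration. In this sketch the Demo class is hypothetical, and it asks pickle to call the built-in sum when the data is loaded; real attacks rely on the same mechanism but name a dangerous callable instead.

```python
import pickle

class Demo:
    def __reduce__(self):
        # Tell pickle: "to rebuild this object, call sum([1, 2, 3])".
        # An attacker would name a dangerous callable here instead.
        return (sum, ([1, 2, 3],))

payload = pickle.dumps(Demo())

# Loading does not return a Demo instance. It calls the function
# named in the data and hands back whatever that function returns.
result = pickle.loads(payload)
print(result)        # 6
print(type(result))  # <class 'int'>
```

The code doing the loading never mentions sum; the choice of function came entirely from the bytes being loaded.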

Warning: The pickle module is not secure. Only unpickle data you trust. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with. Consider signing data with `hmac` if you need to ensure that it has not been tampered with. Safer serialization formats such as json may be more appropriate if you are processing untrusted data.
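The hmac suggestion in the warning can be sketched roughly as follows. SECRET_KEY, sign_and_pickle, and verify_and_unpickle are hypothetical names invented for this example, and the layout (signature followed by payload) is just one possible convention. Signing only detects tampering by parties who do not know the key; it does not make pickle safe for data from untrusted senders.

```python
import hashlib
import hmac
import pickle

# Hypothetical shared secret; in practice load it from secure configuration
SECRET_KEY = b"replace-with-a-real-secret"

def sign_and_pickle(obj):
    """Pickle obj and prepend an HMAC-SHA256 signature of the payload."""
    payload = pickle.dumps(obj)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return signature + payload

def verify_and_unpickle(blob):
    """Check the signature first and unpickle only if it matches."""
    digest_size = hashlib.sha256().digest_size
    signature, payload = blob[:digest_size], blob[digest_size:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("signature mismatch: refusing to unpickle")
    return pickle.loads(payload)

blob = sign_and_pickle({"user": "Alice", "age": 30})
print(verify_and_unpickle(blob))  # {'user': 'Alice', 'age': 30}

# Flip one bit in the payload to simulate tampering
tampered = blob[:-1] + bytes([blob[-1] ^ 0x01])
try:
    verify_and_unpickle(tampered)
except ValueError as exc:
    print(exc)  # signature mismatch: refusing to unpickle
```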

You can visit the official Python documentation for more details. JSON is another common way to serialize data, but there are some differences. The following comparison is taken from the official Python documentation:

Areas	JSON	Pickle Module
Serialization Format	Text-based (outputs Unicode text, often UTF-8 encoded)	Binary
Human Readability	Human-readable	Not human-readable
Interoperability	Interoperable and widely used outside Python	Python-specific
Supported Types	Limited to a subset of Python built-in types; no custom classes	Supports a wide range of Python types, including custom classes
Security	Safe for deserializing untrusted data (does not execute arbitrary code)	Can pose security risks (untrusted data may lead to arbitrary code execution)
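The "Supported Types" row of the table can be illustrated with a short round-trip comparison. The sample data is made up; the behavior shown for json is that of the standard library's default encoder.

```python
import json
import pickle

data = {"point": (1, 2), 1: "integer key"}

via_json = json.loads(json.dumps(data))
via_pickle = pickle.loads(pickle.dumps(data))

# JSON has no tuple type and only string keys, so both are converted
print(via_json)    # {'point': [1, 2], '1': 'integer key'}
# Pickle restores the original types
print(via_pickle)  # {'point': (1, 2), 1: 'integer key'}

# Some built-in types are rejected by the default JSON encoder altogether
try:
    json.dumps({"tags": {"a", "b"}})
    json_set_outcome = "encoded"
except TypeError as exc:
    json_set_outcome = f"TypeError: {exc}"
print(json_set_outcome)

# Pickle handles the same set without extra work
print(pickle.loads(pickle.dumps({"tags": {"a", "b"}})) == {"tags": {"a", "b"}})  # True
```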

Methods In Pickle Module

Now let’s look at the functions and constants that we commonly use:

Method Name	Description	Arguments
`dump()`	Serializes an object and writes it to a file object.	`obj`, `file`, `protocol`
`dumps()`	Serializes an object and returns it as a byte stream.	`obj`, `protocol`
`load()`	Deserializes an object from a file object.	`file`
`loads()`	Deserializes an object from a byte stream.	`byte_stream`
`HIGHEST_PROTOCOL`	Constant representing the highest protocol version available for pickling.	–
`DEFAULT_PROTOCOL`	Constant representing the default protocol version used by pickle.	–

The dump() method is used to serialize an object hierarchy and write it to a file-like object.

import pickle

data = {'name': 'Alice', 'age': 30, 'city': 'New York'}

with open('data.pkl', 'wb') as file:
    pickle.dump(data, file)
Code language: Python (python)

In this example, we’re serializing a dictionary and writing it to a file named ‘data.pkl’. The ‘wb’ mode opens the file for writing in binary mode, which is necessary for pickle.

The load() method deserializes an object from a file-like object.

import pickle

with open('data.pkl', 'rb') as file:
    loaded_data = pickle.load(file)

print(loaded_data)  # {'name': 'Alice', 'age': 30, 'city': 'New York'}
Code language: Python (python)

This method reads the serialized data from the file ‘data.pkl’ and reconstructs the original object.
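Because dump() and load() work with any binary file-like object, they can also be tried without touching the disk. The sketch below uses io.BytesIO as an in-memory stand-in for a file and also shows that several objects can be written to the same stream and read back in order.

```python
import io
import pickle

buffer = io.BytesIO()  # in-memory binary "file"

# Each dump() call appends one complete pickle to the stream
pickle.dump({"name": "Alice"}, buffer)
pickle.dump([1, 2, 3], buffer)

# Rewind before reading, just as you would reopen a real file
buffer.seek(0)

# Each load() call reads exactly one object, in the order written
first = pickle.load(buffer)
second = pickle.load(buffer)
print(first)   # {'name': 'Alice'}
print(second)  # [1, 2, 3]

# A further load() finds no more data and raises EOFError
try:
    pickle.load(buffer)
    reached_end = False
except EOFError:
    reached_end = True
print(reached_end)  # True
```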

The dumps() method serializes an object hierarchy and returns it as a bytes object, without writing it to a file.

import pickle

data = ['apple', 'banana', 'cherry']
serialized_data = pickle.dumps(data)

print(serialized_data)  # b'\x80\x04\x95\x1f\x00\x00\x00\x00\x00\x00\x00]\x94(... (output truncated; exact bytes depend on the protocol)
Code language: Python (python)

The loads() method deserializes an object from a bytes object.

import pickle

# Bytes produced earlier by pickle.dumps (for example, received over a network)
serialized_data = pickle.dumps(['apple', 'banana', 'cherry'])
deserialized_data = pickle.loads(serialized_data)

print(deserialized_data)  # output ['apple', 'banana', 'cherry']
Code language: Python (python)

This method is the counterpart to dumps() and is useful when you’ve received serialized data from a network transmission or database and want to reconstruct the original object.

These are the four important methods you need to know.

As mentioned previously, not all objects support serialization. Python objects like file handles, database connections, and other objects involving external resources are inherently non-serializable. These objects often contain references to external systems or states that cannot be easily converted into a byte stream and restored later. To handle objects that cannot be pickled directly, you might need to implement custom methods for serialization.

Determine which attributes of the object can be serialized. Ensure that any resources or non-picklable attributes are either excluded or transformed into a serializable form.
Implement the __getstate__ method to return the serializable state of your object. It should return a dictionary with the attributes you want to serialize.
Implement the __setstate__ method to restore the object’s state from the dictionary provided by __getstate__. Reinitialize any non-picklable resources as needed.
Test the serialization and deserialization processes thoroughly to ensure that the object is restored to its original state correctly.

Imagine you have a class that manages a connection to an external resource, such as a file handle. You want to serialize instances of this class using pickle module, but the file handle cannot be pickled directly. We’ll implement custom methods to handle this situation.

class ResourceManager:
    def __init__(self, data, filename):
        self.data = data  # This is serializable
        self._resource = open(filename, 'w')  # This is non-picklable

    def write_data(self):
        self._resource.write(self.data)
Code language: Python (python)

Here, self.data is a simple string that can be easily serialized, while self._resource is a file handle that can’t be pickled directly.
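A quick check confirms that such resources are rejected. This sketch tries to pickle an open file handle and a threading lock; os.devnull is used only so the example does not create a file.

```python
import os
import pickle
import threading

# An open file handle is tied to an operating-system resource
with open(os.devnull, "w") as handle:
    try:
        pickle.dumps(handle)
        file_outcome = "pickled"
    except TypeError as exc:
        file_outcome = f"TypeError: {exc}"

# A lock is another object that only makes sense inside a running process
try:
    pickle.dumps(threading.Lock())
    lock_outcome = "pickled"
except TypeError as exc:
    lock_outcome = f"TypeError: {exc}"

print(file_outcome)
print(lock_outcome)
```

An object that holds one of these as an attribute fails in the same way, which is why the attribute has to be left out of the pickled state.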

First, we implement the __getstate__ method. This method is called when pickle serializes the object. It should return a dictionary containing only the serializable state of the object.

class ResourceManager:
    def __init__(self, data, filename):
        self.data = data
        self._resource = open(filename, 'w')

    def write_data(self):
        self._resource.write(self.data)

    def __getstate__(self):
        # Create a copy of the object's state
        state = self.__dict__.copy()
        # Remove the non-picklable _resource attribute
        del state['_resource']
        return state
Code language: Python (python)

In __getstate__, we create a copy of the object’s state (self.__dict__). Then, we remove the _resource attribute because it cannot be pickled. The __getstate__ method ensures that the non-picklable _resource attribute is excluded from the serialized data.

Now, we implement the __setstate__ method. This method is called when pickle deserializes the object. It receives the state dictionary and should restore the object’s attributes, including reinitializing the _resource.

class ResourceManager:
    def __init__(self, data, filename):
        self.data = data
        self._resource = open(filename, 'w')

    def write_data(self):
        self._resource.write(self.data)

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['_resource']
        return state

    def __setstate__(self, state):
        # Restore the object's state
        self.__dict__.update(state)
        # Reinitialize the non-picklable _resource attribute
        self._resource = open('restored_file.txt', 'w')
Code language: Python (python)

In __setstate__, we update the object’s __dict__ with the state dictionary. Then, we reinitialize the _resource attribute, which could not be serialized. In simple words, the __setstate__ method restores the object’s state and reinitializes the _resource attribute to point to a new file (restored_file.txt).

import pickle

# Create an instance of ResourceManager
manager = ResourceManager("Hello, World!", "original_file.txt")
manager.write_data()  # Writes data to original_file.txt

# Serialize the manager object to a file
with open('manager.pkl', 'wb') as f:
    pickle.dump(manager, f)

# Deserialize the manager object from the file
with open('manager.pkl', 'rb') as f:
    restored_manager = pickle.load(f)

# Check the restored object's state
print(restored_manager.data)  # Output: Hello, World!
restored_manager.write_data()  # Writes data to restored_file.txt

Code language: Python (python)
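One possible refinement, sketched here under the assumption that the restored object should reopen the file it was originally given, is to keep the filename as a regular attribute. ReopeningManager is a hypothetical variant of the class above, not part of the original example; it opens the file in append mode so reopening does not erase earlier content, and it adds a close() method so the handle can be released.

```python
import os
import pickle
import tempfile

class ReopeningManager:
    """Hypothetical variant of ResourceManager that remembers its filename."""

    def __init__(self, data, filename):
        self.data = data
        self.filename = filename                # plain string, picklable
        self._resource = open(filename, "a")    # file handle, not picklable

    def write_data(self):
        self._resource.write(self.data)
        self._resource.flush()

    def close(self):
        self._resource.close()

    def __getstate__(self):
        # Copy the attributes and leave out the file handle
        state = self.__dict__.copy()
        del state["_resource"]
        return state

    def __setstate__(self, state):
        # Restore the attributes, then reopen the remembered file
        self.__dict__.update(state)
        self._resource = open(self.filename, "a")

path = os.path.join(tempfile.mkdtemp(), "notes.txt")

manager = ReopeningManager("Hello, World!\n", path)
manager.write_data()
payload = pickle.dumps(manager)
manager.close()

restored = pickle.loads(payload)
restored.write_data()
restored.close()

with open(path) as f:
    lines = f.read().splitlines()
print(lines)  # ['Hello, World!', 'Hello, World!']
```

Whether reopening the original path is the right behavior depends on the application; the file may not exist on the machine that loads the pickle.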

Always test the custom serialization and deserialization process to ensure that your objects are correctly restored to their original state. Now, a few things you need to keep in mind. When dealing with large datasets or complex objects, the pickled data can be large, which may lead to memory issues or slow performance. Consider using the bz2, gzip, or lzma modules to compress the serialized data and save space.
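Compression can be combined with pickle in a couple of lines. The sketch below works in memory with gzip.compress and gzip.decompress on made-up records; the actual savings depend entirely on the data.

```python
import gzip
import pickle

records = [{"id": i, "name": f"user{i}"} for i in range(1000)]

# Pickle first, then compress the resulting bytes
raw = pickle.dumps(records)
compressed = gzip.compress(raw)
print(len(raw), len(compressed))  # sizes vary; the compressed form is smaller here

# Reverse the two steps to get the object back
restored = pickle.loads(gzip.decompress(compressed))
print(restored == records)  # True
```

For files, gzip.open(path, 'wb') returns a file object that can be passed straight to pickle.dump(), and gzip.open(path, 'rb') works the same way with pickle.load().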

Only unpickle data you trust, or consider using safer serialization formats like JSON when exchanging data with untrusted sources. Pickling is also not atomic with respect to your own data: if another thread modifies an object while it is being pickled, or several threads write to the same file at once, the result can be inconsistent or corrupted. In a multi-threaded program, protect shared objects and shared files with locks around pickling and unpickling operations.
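The locking advice can be sketched as follows. The shared dictionary, the worker and snapshotter functions, and the thread counts are all invented for this illustration; the point is only that the code modifying the data and the code pickling it take the same lock.

```python
import pickle
import threading

shared = {"counter": 0}
lock = threading.Lock()
snapshots = []

def worker():
    # Modify the shared object only while holding the lock
    for _ in range(1000):
        with lock:
            shared["counter"] += 1

def snapshotter():
    # Pickle the shared object under the same lock, so each
    # snapshot reflects a consistent state
    for _ in range(10):
        with lock:
            snapshots.append(pickle.dumps(shared))

threads = [threading.Thread(target=worker) for _ in range(4)]
threads.append(threading.Thread(target=snapshotter))
for t in threads:
    t.start()
for t in threads:
    t.join()

final = pickle.loads(pickle.dumps(shared))
print(final["counter"])  # 4000
print(len(snapshots))    # 10
```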

Practice Problems

That’s all for this tutorial. Feel free to check out the Python documentation for more details but the best way to learn this module is to practice the questions below:

Click Here To Go to Google Colab Notebook


Amritesh Kumar
