Python cheatsheet for data analysts

Why analysts need a Python cheatsheet

You don't memorize Python by reading a tutorial end to end. You memorize it by writing the same fifteen patterns every day for six months: read a CSV, filter, group, aggregate, plot, export. This cheatsheet is the lookup layer — the page you alt-tab to when you forget whether it is list.append or list.add.

The audience: an analyst who already knows SQL and is now under pressure to do things SQL cannot do gracefully — bootstrapping a confidence interval, calling an internal API to enrich a cohort, gluing three CSVs into one weekly report. Python is not the goal. The report on Monday morning is the goal.

One framing that helps: most analyst Python is functions of a dataframe plus a few primitives around it. Once your fingers know lists, dicts, slicing, and a handful of pandas verbs, the rest is googling APIs.

Syntax, types, and operators

Python's surface syntax is small. Variables are untyped at the binding site, indentation defines blocks, and print is your debugger until you graduate. The four scalar types you touch every day are int, float, str, bool, plus None as the absence-of-value sentinel.

# Variables
x = 10
name = 'Anna'
is_active = True

# Comments
# Single line
"""
Multi-line docstring or comment block
"""

# Print and f-strings
print('Hello')
print(f'Name: {name}, x: {x}')
print(f'{1.234:.2f}')   # '1.23'

# Types
x = 10          # int
y = 3.14        # float
s = 'hello'     # str
b = True        # bool
n = None        # NoneType

# Inspect
type(x)               # <class 'int'>
isinstance(x, int)    # True
isinstance(n, type(None))  # True

Operators behave the way you expect from any C-family language, with two exceptions. / is always float division (7 / 2 returns 3.5, and 6 / 2 returns 3.0), and // is floor division (7 // 2 returns 3), which rounds toward negative infinity rather than toward zero, so -7 // 2 returns -4. Forget that and your retention curves come out off by half a point.
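A minimal sketch of the division rules, with made-up numbers and illustrative variable names. It also shows divmod and int() truncation, in case truncation toward zero is what you actually want.

```python
true_div = 7 / 2            # 3.5, always a float
whole_float = 6 / 2         # 3.0, still a float even when the result is whole
floor_div = 7 // 2          # 3, rounded down
neg_floor = -7 // 2         # -4, floors toward negative infinity
truncated = int(-7 / 2)     # -3, truncation toward zero
quotient, remainder = divmod(7, 2)   # (3, 1), floor division and modulo in one call
```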

# Arithmetic
+ - * /
//      # floor division: 7 // 2 = 3
%       # modulo: 7 % 2 = 1
**      # power: 2 ** 3 = 8

# Comparison
== != < > <= >=

# Boolean
and or not

# Membership
'a' in 'abc'         # True
5 in [1, 2, 5]       # True

Load-bearing trick: chained comparisons work like math, not like C. if 0 < x < 100: is valid Python and reads exactly the way you would write it on paper.
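A short sketch of how chaining behaves, using arbitrary values. The last line shows a pitfall that is easy to hit: adding parentheses breaks the chain and compares a bool with a number instead.

```python
x = 42
in_range = 0 < x < 100                 # True
spelled_out = (0 < x) and (x < 100)    # what the chain means

# Chains can mix operators; each adjacent pair is checked left to right.
ordered = 1 <= 2 < 3 != 4              # True

# Pitfall: parentheses break the chain.
chained = 0 < -5 < 100                 # False, because 0 < -5 fails
not_a_chain = (0 < -5) < 100           # True, because False < 100
```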

Strings and slicing

String slicing is where SQL refugees lose two hours their first week. Indexing starts at zero, negative indices count from the end, and slices are half-open intervals: start included, stop excluded. The same rule applies to lists, tuples, NumPy arrays, and positional slicing of pandas Series (.iloc), so paying the cost once compounds across the stack. The one exception to remember is label-based slicing with .loc in pandas, which includes the stop label.

s = 'Hello World'
len(s)                    # 11
s.upper()                 # 'HELLO WORLD'
s.lower()                 # 'hello world'
s.replace('Hello', 'Hi')  # 'Hi World'
s.split(' ')              # ['Hello', 'World']
'-'.join(['a', 'b', 'c']) # 'a-b-c'
s.strip()                 # trim whitespace

# Slicing
s[0]      # 'H'
s[0:5]    # 'Hello'
s[:5]     # 'Hello'
s[6:]     # 'World'
s[::-1]   # 'dlroW olleH' — reverse
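
The half-open rule is easier to trust once you see what it buys you. A small sketch, reusing the same string plus a made-up list:

```python
s = 'Hello World'

# Splitting at any index k and gluing the halves back always rebuilds s.
k = 5
rebuilt = s[:k] + s[k:]

# The length of s[a:b] is simply b - a (when both are in range).
middle = s[2:7]              # 'llo W'

last_three = s[-3:]          # 'rld'
every_other = s[::2]         # 'HloWrd'
past_the_end = s[8:100]      # 'rld', slices never raise IndexError

# Lists follow the same rules.
nums = [10, 20, 30, 40, 50]
window = nums[1:4]           # [20, 30, 40]
tail = nums[-2:]             # [40, 50]
```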

F-strings are the only formatting style you need in 2026. Use = inside the braces for quick debug prints: f'{x=}' expands to 'x=10' — the cheapest debugger known to analyst-kind.
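
A few format specs cover most report output. The sketch below uses invented values (revenue and rate are placeholders, not real figures); the = form needs Python 3.8 or newer.

```python
x = 10
revenue = 1234567.891        # hypothetical number
rate = 0.0473                # hypothetical conversion rate

debug = f'{x=}'              # 'x=10'
money = f'{revenue:,.2f}'    # '1,234,567.89', thousands separator and 2 decimals
percent = f'{rate:.1%}'      # '4.7%', multiplies by 100 and appends %
padded = f'{x:>5}'           # '   10', right-aligned in 5 characters
combined = f'{rate=:.1%}'    # 'rate=4.7%', debug form plus a format spec
```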

Lists, dicts, sets, tuples

These four collection types cover most analyst code. When to reach for each:

Type   Mutable  Ordered  Use for
list   yes      yes      sequences you will sort, filter, iterate
dict   yes      yes*     lookups by key, JSON-shaped records
set    yes      no       dedup, membership tests, set algebra
tuple  no       yes      fixed records, multiple return values, dict keys

* Dict insertion order has been guaranteed since Python 3.7. If your environment is older than that, you have other problems.

# Lists
l = [1, 2, 3]
list(range(5))       # [0, 1, 2, 3, 4] (not assigned, so l stays [1, 2, 3])
l[0], l[-1], l[1:3]  # 1, 3, [2, 3]

l.append(4)
l.extend([5, 6])
l.insert(0, 0)
l.remove(2)          # removes first occurrence
l.pop()              # removes and returns last
l.sort()             # in-place
sorted(l)            # new list

len(l), sum(l), min(l), max(l)

# Dicts
d = {'name': 'Anna', 'age': 25}
d['name']                  # 'Anna'
d.get('city', 'Berlin')    # default if missing

d['city'] = 'Berlin'
del d['age']
d.pop('age', None)         # safe pop

for key, value in d.items():
    print(key, value)

'name' in d                # True

# Sets
s = {1, 2, 3}
s = set([1, 2, 3, 3])      # {1, 2, 3}
s.add(4)
s.remove(2)

a = {1, 2, 3}; b = {2, 3, 4}
a | b   # union        {1, 2, 3, 4}
a & b   # intersection {2, 3}
a - b   # difference   {1}
a ^ b   # symmetric    {1, 4}

# Tuples (immutable)
t = (1, 2, 3)
t[0]                # 1
# t[0] = 5          # TypeError

# Unpacking
x, y, z = t
a, *rest = [1, 2, 3, 4]   # a=1, rest=[2, 3, 4]

Sanity check: if you reach for a list to test membership in a loop of 10,000+ items, stop and use a set. List membership is O(n), while set membership is O(1) on average, so this single swap can turn a 30-second script into one that finishes in milliseconds.
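
A rough way to see the difference on your own machine, sketched with made-up ids. The timings it prints depend on hardware, so treat them as an illustration rather than a benchmark.

```python
import time

known_ids = list(range(10_000))        # made-up user ids
known_ids_set = set(known_ids)         # one-time O(n) conversion
candidates = range(0, 20_000, 20)      # 1,000 lookups, half of them misses

start = time.perf_counter()
hits_list = sum(1 for c in candidates if c in known_ids)       # scans the list each time
list_seconds = time.perf_counter() - start

start = time.perf_counter()
hits_set = sum(1 for c in candidates if c in known_ids_set)    # hash lookup each time
set_seconds = time.perf_counter() - start

print(f'list: {list_seconds:.4f}s, set: {set_seconds:.4f}s, hits: {hits_set}')
```

Both versions count the same hits; only the cost per lookup changes.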

Control flow and comprehensions

Conditionals and loops look the way you expect. Two patterns worth memorizing: the ternary for one-line conditional assignment, and the list comprehension for transforming or filtering sequences without a four-line loop.

if x > 0:
    print('positive')
elif x == 0:
    print('zero')
else:
    print('negative')

# Ternary
label = 'positive' if x > 0 else 'non-positive'

# For
for i in range(5):              # 0, 1, 2, 3, 4
    print(i)

for i, x in enumerate(['a', 'b', 'c']):
    print(i, x)

for k, v in d.items():
    print(k, v)

# break / continue (the same keywords work inside while loops)
for x in l:
    if x == 0:
        break
    if x < 0:
        continue
    print(x)

# List comprehensions
squares = [x**2 for x in range(10)]
positives = [x for x in l if x > 0]
clipped = [x if x > 0 else 0 for x in l]

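# Dict / set comprehensions
rows = [{'id': 1, 'name': 'Anna'}, {'id': 2, 'name': 'Ben'}]  # sample data
names = ['Anna', 'anna', 'Ben']                                # sample data
lookup = {row['id']: row['name'] for row in rows}  # {1: 'Anna', 2: 'Ben'}
uniq = {x.lower() for x in names}                  # {'anna', 'ben'}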

Comprehensions are not always clearer than a loop. If you find yourself writing a triple-nested comprehension, unroll it into a named function. Readability beats cleverness on every code review.
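
As a sketch of that unrolling, here is the same logic written both ways. The orders structure and the purchased_pairs name are invented for illustration.

```python
orders = [  # hypothetical nested records
    {'user': 'anna', 'items': [{'sku': 'A', 'qty': 2}, {'sku': 'B', 'qty': 0}]},
    {'user': 'ben', 'items': [{'sku': 'A', 'qty': 1}]},
]

# Nested comprehension: correct, but the reading order is not obvious.
pairs = [(o['user'], i['sku']) for o in orders for i in o['items'] if i['qty'] > 0]


def purchased_pairs(orders):
    """Return (user, sku) pairs for items with a positive quantity."""
    result = []
    for order in orders:
        for item in order['items']:
            if item['qty'] > 0:
                result.append((order['user'], item['sku']))
    return result


same_pairs = purchased_pairs(orders)   # [('anna', 'A'), ('ben', 'A')]
```

The function is longer, but it has a name, a docstring, and a place to put a breakpoint.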

Functions, files, errors, classes

Functions take positional and keyword arguments, support defaults, and can accept arbitrary *args and **kwargs. The most important pattern for analysts: never use a mutable default argument such as def f(x, items=[]), because that one list is shared across every call. Use items=None and create the list inside the body.

def greet(name, greeting='Hello'):
    return f'{greeting}, {name}!'

greet('Anna')                  # 'Hello, Anna!'
greet('Anna', greeting='Hi')   # 'Hi, Anna!'

# Lambda
square = lambda x: x ** 2

# *args, **kwargs
def f(*args, **kwargs):
    print(args, kwargs)

File I/O uses the context manager pattern so handles close even when an exception fires. For CSVs, prefer pandas.read_csv over the stdlib csv module unless you are streaming a file too big to fit in memory.

# Reading
with open('file.txt', 'r') as f:
    content = f.read()
    # or: lines = f.readlines()

# Writing
with open('file.txt', 'w') as f:
    f.write('hello')

# CSV via stdlib
import csv
with open('file.csv', newline='') as f:   # newline='' is what the csv docs recommend
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
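
For the streaming case, one possible shape is to aggregate row by row so only the running totals stay in memory. In this sketch an io.StringIO object stands in for a large file, and the region and revenue column names are assumptions.

```python
import csv
import io
from collections import defaultdict

# Stand-in for a big file on disk; in practice use open(path, newline='').
fake_file = io.StringIO('region,revenue\nEU,100.5\nUS,200\nEU,50\n')

totals = defaultdict(float)
for row in csv.DictReader(fake_file):
    # Every value arrives as a string, so cast before doing math.
    totals[row['region']] += float(row['revenue'])

print(dict(totals))
```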

Error handling uses try / except / else / finally. Catch the specific exception you expect. A blanket except Exception: (or worse, a bare except:) swallows nearly every error, including the bugs you wanted to see, and turns your script into a haunted house.
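
A sketch of what catching the specific exception can look like when cleaning messy input. parse_amount is a hypothetical helper, not a library function.

```python
def parse_amount(raw):
    """Hypothetical helper: turn a raw CSV cell into a float, or None."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # TypeError: raw was None; ValueError: raw was text like 'n/a'.
        return None
    else:
        # Runs only when float() succeeded.
        return round(value, 2)


cells = ['19.999', 'n/a', None, '5']
parsed = [parse_amount(c) for c in cells]
```

Only the two failure modes you anticipated are handled; anything else still surfaces as a traceback, which is usually what you want.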

try:
    x = int('abc')
except ValueError as e:
    print(f'Error: {e}')
else:
    print('success: only runs if no exception')
finally:
    print('cleanup: always runs')

# Minimal class
class User:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def greet(self):
        return f'Hello, {self.name}'

u = User('Anna', 25)
u.greet()

# Inheritance
class Admin(User):
    def __init__(self, name, age, role):
        super().__init__(name, age)
        self.role = role

You write classes maybe once a month as a pure analyst — usually to wrap a metric calculator, a config loader, or a custom dataframe accessor. Functions plus dicts carry you the other 95% of the time.
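
For the metric-calculator case, a dataclass keeps the boilerplate small. This is only a sketch: ConversionMetric and its fields are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConversionMetric:
    """Hypothetical wrapper around one conversion calculation."""
    visitors: int
    buyers: int

    @property
    def rate(self) -> float:
        # Guard against empty cohorts instead of raising ZeroDivisionError.
        return self.buyers / self.visitors if self.visitors else 0.0


week = ConversionMetric(visitors=200, buyers=30)
empty = ConversionMetric(visitors=0, buyers=0)
print(week, week.rate, empty.rate)
```

The decorator generates __init__, __repr__, and __eq__, so the class body holds only the logic you care about.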

The analyst module stack

The stdlib plus six third-party packages cover roughly every analyst workflow before senior level. Memorize the import line for each — autocomplete fills in the rest.

Module      Use case                              Killer feature
pandas      tabular data, ETL, joins, group-by    read_csv, groupby.agg, merge
numpy       numeric arrays, vectorized math       broadcasting, np.where
matplotlib  quick plots                           plt.plot, plt.savefig
seaborn     statistical plots                     histplot, boxplot, heatmap
scipy       stats tests, optimization             stats.ttest_ind, stats.norm
requests    HTTP calls to internal/external APIs  requests.get(...).json()

# Pandas
import pandas as pd
df = pd.read_csv('file.csv')

# NumPy
import numpy as np
arr = np.array([1, 2, 3])

# Matplotlib
import matplotlib.pyplot as plt
plt.plot([1, 2, 3])
plt.savefig('out.png')

# Seaborn
import seaborn as sns
sns.histplot(df['x'])

# Scipy stats
from scipy import stats
stats.ttest_ind(a, b)

# HTTP
import requests
r = requests.get('https://api.example.com', timeout=10)  # without a timeout, a stalled server can hang the script
data = r.json()

# JSON
import json
d = json.loads('{"a": 1}')
s = json.dumps({'a': 1})

# Datetime
from datetime import datetime, timedelta
now = datetime.now()
week_ago = now - timedelta(days=7)

Built-ins to keep on the front burner: len, sum, min, max, sorted, reversed, zip, map, filter, any, all, abs, round, enumerate. Knowing these by name keeps you out of import-spaghetti.
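
A quick tour of several of these on made-up data. One detail worth knowing: round uses round-half-to-even, so round(2.5) is 2, not 3.

```python
names = ['anna', 'ben', 'cara']      # sample data
scores = [82, 45, 91]                # sample data

paired = list(zip(names, scores))                          # [('anna', 82), ('ben', 45), ('cara', 91)]
ranked = sorted(paired, key=lambda p: p[1], reverse=True)  # highest score first
all_passed = all(s >= 40 for s in scores)                  # True
any_top = any(s >= 90 for s in scores)                     # True
high = list(filter(lambda s: s > 80, scores))              # [82, 91]
doubled = list(map(lambda s: s * 2, scores))               # [164, 90, 182]
mean = round(sum(scores) / len(scores), 1)                 # 72.7
half_even = (round(2.5), round(3.5))                       # (2, 4)
```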

# Virtualenv — every project, no exceptions
python -m venv venv
source venv/bin/activate    # macOS / Linux
venv\Scripts\activate       # Windows

pip install pandas numpy
pip freeze > requirements.txt

If you want a structured way to drill these patterns until they are muscle memory, NAILDD is launching with hundreds of Python and SQL interview problems organized along these cheatsheet lines.

Common pitfalls

The first pitfall is mutable default arguments. Writing def append_log(entry, log=[]) looks innocent, but the empty list is constructed once at function-definition time, not once per call — so every call shares the same list and your log accumulates rows from every previous invocation. The fix is to default to None and assign log = log if log is not None else [] inside the function body.
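
The behavior is easiest to believe once you run it. A minimal sketch, with append_log_buggy named that way only to keep the two versions apart:

```python
def append_log_buggy(entry, log=[]):
    # The default list is created once, when the function is defined.
    log.append(entry)
    return log


def append_log(entry, log=None):
    # A fresh list is created on every call that omits log.
    log = log if log is not None else []
    log.append(entry)
    return log


first = append_log_buggy('a')
second = append_log_buggy('b')     # ['a', 'b'], and it is the same object as first

fixed_first = append_log('a')      # ['a']
fixed_second = append_log('b')     # ['b']
```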

The second pitfall is modifying a list while iterating over it. Code like for x in items: if bad(x): items.remove(x) skips elements because the index advances while the list shrinks. The fix is to build a new list with a comprehension: items = [x for x in items if not bad(x)].
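
Here is the skip in action on a small made-up list, where bad is a stand-in predicate that flags negative numbers:

```python
def bad(x):
    """Stand-in predicate: flag negative values."""
    return x < 0


items = [-1, -2, 3, -4, 5]

# Buggy: removing during iteration shifts later elements under the loop.
buggy = items.copy()
for x in buggy:
    if bad(x):
        buggy.remove(x)
# buggy is now [-2, 3, 5]; the -2 was never examined

# Fixed: build a new list instead of mutating the one being iterated.
cleaned = [x for x in items if not bad(x)]   # [3, 5]
```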

The third pitfall is shallow vs. deep copy confusion. Assigning b = a for a list or dict creates two names pointing at the same object — mutating b mutates a. Use a.copy() for a shallow copy and copy.deepcopy(a) for nested structures. Analysts hit this when they copy a config dict, mutate it for one experiment, and find baseline numbers shifted too.
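
A sketch of the config-dict scenario, with an invented baseline config. Note that the shallow copy protects top-level keys only; the nested list is still shared.

```python
import copy

baseline = {'metric': 'retention', 'filters': {'country': ['DE', 'FR']}}

alias = baseline                    # same object, just a second name
shallow = baseline.copy()           # new outer dict, shared inner objects
deep = copy.deepcopy(baseline)      # fully independent copy

shallow['metric'] = 'churn'                    # baseline['metric'] is unchanged
shallow['filters']['country'].append('US')     # baseline sees this change
deep['filters']['country'].append('JP')        # baseline does not see this one
```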

The fourth pitfall is comparing numbers that are stored as strings. Python never coerces them for you: '10' < '9' returns True because string comparison is lexicographic, and the character '1' sorts before '9'. This happens constantly when a CSV column comes in as strings instead of ints. Always confirm df.dtypes after read_csv and cast explicitly with astype(int) or pd.to_numeric when in doubt.
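
The same trap shows up in plain Python, without pandas. A small sketch with made-up values as they might arrive from a CSV:

```python
raw = ['10', '9', '100', '25']           # numbers that arrived as strings

wrong_compare = '10' < '9'               # True, compared character by character
wrong_sort = sorted(raw)                 # ['10', '100', '25', '9']
wrong_max = max(raw)                     # '9'

nums = [int(v) for v in raw]             # cast explicitly
right_sort = sorted(nums)                # [9, 10, 25, 100]
right_max = max(nums)                    # 100

# Mixing the two types is not silently coerced either; it raises.
try:
    '10' < 9
    mixed_error = None
except TypeError as exc:
    mixed_error = type(exc).__name__     # 'TypeError'
```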

FAQ

Where should I start if I know SQL but not Python?

Start with the collection types — list, dict, set, tuple — and the slicing rules. Most of your work moves data between these structures and a pandas DataFrame, so the faster you internalize how [start:stop:step] works, the faster the rest of the language unlocks. Skip web frameworks, async, and metaclasses until you actually need them, which as a pure analyst is approximately never.

Which editor or IDE should I pick?

Jupyter (or JupyterLab) for exploratory analysis where you want plots inline and cell-by-cell iteration. VS Code with the Python and Jupyter extensions for anything that becomes a script or a scheduled job. PyCharm is excellent but heavyweight. Cursor or any other AI-first editor is a reasonable default in 2026 if you are starting fresh.

Python 2 or Python 3?

Python 3 only. Python 2 reached end-of-life in 2020 and has not received security fixes since. If a tutorial still uses print as a statement or xrange, the material is older than your data. Target Python 3.10 or higher for structural pattern matching and improved error messages.

Do I need advanced Python — decorators, generators, metaclasses?

At junior level, no. At mid-level, generators help when you process files that do not fit in memory, and decorators show up when you write reusable utilities (timing, caching, retries). Metaclasses are a senior-engineer concern. Optimize for fluency in the basics: you get more leverage from a fast DataFrame.groupby than from a clever metaclass.
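
To make the mid-level cases concrete, here is a minimal sketch of a timing decorator and a generator that parses one line at a time. The names timed, read_numbers, and total are hypothetical.

```python
import functools
import time


def timed(func):
    """Decorator: record how long the last call took, in seconds."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_seconds = time.perf_counter() - start
        return result
    wrapper.last_seconds = None
    return wrapper


def read_numbers(lines):
    """Generator: yield one parsed value at a time, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield float(line)


@timed
def total(lines):
    # sum() pulls values from the generator one by one; no full list is built.
    return sum(read_numbers(lines))


result = total(['1.5\n', '\n', '2.5\n'])   # 4.0
```

Passing an open file object instead of a list works the same way, since iterating a file yields one line at a time.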

What about type hints?

Type hints are optional but increasingly standard. Annotating def f(x: int, y: list[str]) -> dict[str, int]: documents intent and lets your editor catch bugs early. Solo notebooks — skip them. The moment anyone else reads your code, add them.
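
A small annotated function as a sketch; count_words is an invented example, and the built-in generic syntax (list[str], dict[str, int]) assumes Python 3.9 or newer.

```python
def count_words(lines: list[str], min_length: int = 1) -> dict[str, int]:
    """Count lowercase words of at least min_length characters."""
    counts: dict[str, int] = {}
    for line in lines:
        for word in line.lower().split():
            if len(word) >= min_length:
                counts[word] = counts.get(word, 0) + 1
    return counts


counts = count_words(['Churn is up', 'churn IS flat'], min_length=3)
```

The hints are not enforced at runtime; they pay off through editor warnings and tools such as mypy.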