Chapter 16: File Handling, CSV & JSON
Every piece of data a pipeline processes started somewhere. A CSV dropped into a folder by an accounting system. A JSON response from an API call. A log file written by a server at midnight. Before any transformation happens, your code has to open that file, read what is inside, and do something with it.
Python handles files simply and cleanly. This chapter covers three things that go together in practice: reading and writing plain files, working with CSV data, and working with JSON. By the end, you can take raw data from disk, process it, and write results back out — the core loop of most real data scripts.
---o
16.1 Opening and Reading Files
The open() Function
open() opens a file and returns a file object you can read from or write to. It takes two arguments: the file path and the mode.
file = open("report.txt", "r")
File Modes
| Mode | Meaning | Creates file? | Overwrites? |
|---|---|---|---|
"r" | Read | No — error if missing | No |
"w" | Write | Yes | Yes — wipes existing content |
"a" | Append | Yes | No — adds to end |
"x" | Create | Yes — error if exists | No |
"r+" | Read and write | No | No |
"r" is the default — open("report.txt") and open("report.txt", "r") are identical.
The with Statement
Opening a file creates a connection between Python and the filesystem. That connection must be closed when you are done — otherwise data can be lost or the file can be locked. The with statement handles this automatically:
with open("report.txt", "r") as file:
content = file.read()
# File is closed here automatically, even if an error occurred
print(content)
with Statement Anatomy
with open("report.txt", "r") as file:
─────────────────────────────────────
│ │ │
keyword path alias
(the file object)
│
└── file closes automatically when the block ends
Always use with. It is not optional in production code.
Reading Methods
Three ways to read a file:
# Read entire file as one string
with open("notes.txt", "r") as f:
content = f.read()
print(content)
Server restarted at 02:14
Backup completed successfully
3 failed login attempts flagged
# Read one line at a time
with open("notes.txt", "r") as f:
first_line = f.readline()
print(first_line)
Server restarted at 02:14
# Read all lines into a list
with open("notes.txt", "r") as f:
lines = f.readlines()
print(lines)
['Server restarted at 02:14\n', 'Backup completed successfully\n', '3 failed login attempts flagged\n']
Each line includes a \n newline character at the end. Use .strip() to remove it when processing:
with open("notes.txt", "r") as f:
for line in f:
print(line.strip())
Server restarted at 02:14
Backup completed successfully
3 failed login attempts flagged
Iterating directly over the file object (for line in f) is memory-efficient for large files — it reads one line at a time instead of loading everything at once.
⚠ Common Mistake — Opening Without
withfile = open("report.txt", "r")content = file.read()# forgot file.close() — file stays lockedIf your script crashes between
open()andclose(), the file never closes properly.withcloses the file even when an error occurs. Use it every time.
Try It 16.1 — Create a plain text file with four lines of your choice. Open it with
with, read it line by line, strip whitespace, and print each line numbered (1. line one, 2. line two...).
16.2 Writing Files
Writing New Content
with open("summary.txt", "w") as f:
f.write("Pipeline run: complete\n")
f.write("Records processed: 9500\n")
f.write("Errors: 0\n")
Running this creates summary.txt if it does not exist. If it does exist, "w" mode wipes it completely before writing.
with open("summary.txt", "r") as f:
print(f.read())
Pipeline run: complete
Records processed: 9500
Errors: 0
Appending to an Existing File
Use "a" mode to add to a file without erasing what is already there:
with open("summary.txt", "a") as f:
f.write("Next run scheduled: 02:00 UTC\n")
with open("summary.txt", "r") as f:
print(f.read())
Pipeline run: complete
Records processed: 9500
Errors: 0
Next run scheduled: 02:00 UTC
Writing Multiple Lines at Once
writelines() takes a list of strings and writes them all. Note: it does not add newline characters automatically — you have to include them.
lines = [
"Region: Asia Pacific\n",
"Status: Active\n",
"Clients: 14\n"
]
with open("region.txt", "w") as f:
f.writelines(lines)
⚠ Common Mistake —
"w"Overwrites Silentlywith open("important_log.txt", "w") as f:f.write("new entry")If
important_log.txthad six months of log history, it is now gone. Before writing, check whether you need"w"or"a". In pipelines, log files almost always use"a".
Try It 16.2 — Write a script that creates a file called
run_log.txtand writes three pipeline status lines to it. Then append a fourth line using a separatewithblock. Read the final file and print all four lines.
16.3 Working with CSV Files
What Is a CSV?
A CSV (comma-separated values) file stores tabular data as plain text. Each line is a row. Each value within a row is separated by a comma. The first row is usually a header.
CSV structure:
┌──────────────────────────────────────────────────┐
│ id,name,department,status ← header row │
│ 1,Alice,Engineering,active ← data row │
│ 2,Bob,Finance,active ← data row │
│ 3,Carol,Engineering,inactive ← data row │
└──────────────────────────────────────────────────┘
Python's csv module handles CSV files correctly — including quoted fields, embedded commas, and different delimiters.
import csv
Reading with csv.DictReader
DictReader reads each row as a dictionary, using the header row as keys. This is the most practical way to read CSV files — you access columns by name, not by position.
Given a file employees.csv:
id,name,department,status
1,Alice,Engineering,active
2,Bob,Finance,active
3,Carol,Engineering,inactive
import csv
with open("employees.csv", "r") as f:
reader = csv.DictReader(f)
for row in reader:
print(row["name"], "—", row["department"])
Alice — Engineering
Bob — Finance
Carol — Engineering
Each row is a dictionary:
with open("employees.csv", "r") as f:
reader = csv.DictReader(f)
for row in reader:
print(dict(row))
break # just the first row
{'id': '1', 'name': 'Alice', 'department': 'Engineering', 'status': 'active'}
Note that every value is a string — including id. Convert types explicitly when needed.
Reading with csv.reader
csv.reader reads each row as a list. Use it when the file has no header, or when you need positional access:
with open("employees.csv", "r") as f:
reader = csv.reader(f)
next(reader) # skip header row
for row in reader:
print(row[1], row[2]) # name, department
Alice Engineering
Bob Finance
Carol Engineering
Writing with csv.DictWriter
import csv
records = [
{"id": 1, "name": "Alice", "department": "Engineering", "status": "active"},
{"id": 2, "name": "Bob", "department": "Finance", "status": "active"},
]
with open("output.csv", "w", newline="") as f:
fieldnames = ["id", "name", "department", "status"]
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(records)
Result in output.csv:
id,name,department,status
1,Alice,Engineering,active
2,Bob,Finance,active
⚠ Common Mistake — Missing
newline=""On Windows, opening a CSV file withoutnewline=""adds a blank line between every row. Always passnewline=""when opening a file forcsv.writerorcsv.DictWriter.
Try It 16.3 — Create a CSV file manually with five rows of product data (id, name, price, category). Use
DictReaderto read it, filter rows wherecategoryequals a chosen value, and print only the matching names and prices.
16.4 Working with JSON
What Is JSON?
JSON (JavaScript Object Notation) is the data format of APIs and configuration files. It looks almost identical to a Python dictionary — key-value pairs, arrays, nested objects.
JSON vs Python
| JSON | Python equivalent |
|---|---|
{ "key": "value" } | dict |
[1, 2, 3] | list |
"text" | str |
42 | int |
3.14 | float |
true / false | True / False |
null | None |
Python's json module converts between the two:
import json
String ↔ Python: json.loads() and json.dumps()
json.loads() parses a JSON string into a Python object:
import json
response = '{"status": "ok", "records": 4200, "errors": null}'
data = json.loads(response)
print(type(data))
print(data["status"])
print(data["records"])
print(data["errors"])
<class 'dict'>
ok
4200
None
json.dumps() converts a Python object into a JSON string:
result = {
"pipeline": "daily_load",
"rows_loaded": 9500,
"success": True
}
json_string = json.dumps(result)
print(json_string)
print(type(json_string))
{"pipeline": "daily_load", "rows_loaded": 9500, "success": true}
<class 'str'>
Notice Python's True becomes JSON's true. None becomes null. The module handles all type conversions automatically.
Pretty Printing
indent makes JSON human-readable:
print(json.dumps(result, indent=4))
{
"pipeline": "daily_load",
"rows_loaded": 9500,
"success": true
}
Always use indent=4 when writing JSON files that humans will read.
File ↔ Python: json.load() and json.dump()
json.load() reads directly from a file object:
with open("config.json", "r") as f:
config = json.load(f)
print(config["host"])
print(config["port"])
json.dump() writes directly to a file:
pipeline_result = {
"run_date": "2024-11-15",
"rows_processed": 8741,
"failed": 0,
"status": "success"
}
with open("last_run.json", "w") as f:
json.dump(pipeline_result, f, indent=4)
Result in last_run.json:
{
"run_date": "2024-11-15",
"rows_processed": 8741,
"failed": 0,
"status": "success"
}
The Four JSON Functions at a Glance
String → Python: json.loads(string) # 's' = string
Python → String: json.dumps(python_obj) # 's' = string
File → Python: json.load(file_obj) # no 's'
Python → File: json.dump(python_obj, file_obj) # no 's'
The s in loads/dumps stands for string. Without the s, the function works with a file.
⚠ Common Mistake —
loadvsloadsjson_string = '{"status": "ok"}'json.load(json_string) # TypeError — expects a file objectjson.loads(json_string) # Correct — parses a stringPass strings to
loads. Pass open file objects toload.
Try It 16.4 — Create a Python dictionary representing a server configuration (host, port, database name, ssl enabled). Write it to a JSON file with
indent=4. Read it back withjson.load()and print each key-value pair using an f-string.
16.5 Putting It Together
This script reads a CSV file of client records, filters for active clients, and writes the result to a JSON file — a pattern that appears in real data pipelines constantly.
Input file — clients.csv:
id,name,region,status,contract_value
1,Danial,Asia Pacific,active,95000
2,Alice,Europe,inactive,42000
3,Bob,Asia Pacific,active,78000
4,Carol,North America,active,120000
5,Eve,Europe,inactive,33000
Script:
import csv
import json
input_file = "clients.csv"
output_file = "active_clients.json"
target_region = "Asia Pacific"
active_clients = []
with open(input_file, "r") as f:
reader = csv.DictReader(f)
for row in reader:
if row["status"] == "active" and row["region"] == target_region:
active_clients.append({
"id": int(row["id"]),
"name": row["name"],
"region": row["region"],
"contract_value": int(row["contract_value"])
})
with open(output_file, "w") as f:
json.dump(active_clients, f, indent=4)
print(f"Exported {len(active_clients)} active clients to {output_file}")
Exported 2 active clients to active_clients.json
Contents of active_clients.json:
[
{
"id": 1,
"name": "Danial",
"region": "Asia Pacific",
"contract_value": 95000
},
{
"id": 3,
"name": "Bob",
"region": "Asia Pacific",
"contract_value": 78000
}
]
Every concept from this chapter is here: open() with with, csv.DictReader, type conversion, json.dump(), and f-strings for the summary line.
Summary
open() connects Python to a file on disk. The mode controls whether you read, write, or append — "r", "w", and "a" are the three you will use most. Always use with to ensure the file closes correctly, even when errors occur. For CSV files, csv.DictReader reads rows as dictionaries keyed by header names; csv.DictWriter writes them back. For JSON, json.loads() parses a JSON string into Python, json.dumps() converts Python back to a string, and json.load()/json.dump() do the same directly with file objects. The s in loads/dumps is the signal — string. Without it, expect a file.
Exercises
16.1 — Write a script that creates a file called servers.txt with five server hostnames, one per line. Read it back, strip whitespace from each line, and print only hostnames that start with "db-".
16.2 — You have a CSV file with columns date, event, and severity. Write code using csv.DictReader that reads the file and prints only rows where severity equals "HIGH". Then count and print how many high-severity events there were.
16.3 — The following code has a bug. Find it and explain exactly what goes wrong at runtime:
import json
data = {"host": "localhost", "port": 5432}
with open("config.json", "w") as f:
json.dumps(data, f, indent=4)
16.4 — Write a script that reads a JSON file containing a list of pipeline run records (each with run_date, rows, status). Filter for records where status == "failed". Write the filtered results to a new JSON file called failed_runs.json.
16.5 — A CSV file arrives with values in inconsistent casing — "Active", "ACTIVE", "active" all mean the same thing. Write a DictReader loop that normalizes the status column to lowercase before processing each row. Print the cleaned values.
16.6 — Think About It: Your pipeline uses open("data.csv", "w") to write results at the end of each run. On Tuesday the script crashes halfway through writing. On Wednesday the script runs again and opens the file in "w" mode. What does the output file contain on Wednesday? What mode would you use instead, and what changes would that require in how you structure the write logic?