Skip to main content

Chapter 16: File Handling, CSV & JSON

Every piece of data a pipeline processes started somewhere. A CSV dropped into a folder by an accounting system. A JSON response from an API call. A log file written by a server at midnight. Before any transformation happens, your code has to open that file, read what is inside, and do something with it.

Python handles files simply and cleanly. This chapter covers three things that go together in practice: reading and writing plain files, working with CSV data, and working with JSON. By the end, you can take raw data from disk, process it, and write results back out — the core loop of most real data scripts.

---o

16.1 Opening and Reading Files

The open() Function

open() opens a file and returns a file object you can read from or write to. It takes two arguments: the file path and the mode.

file = open("report.txt", "r")

File Modes

ModeMeaningCreates file?Overwrites?
"r"ReadNo — error if missingNo
"w"WriteYesYes — wipes existing content
"a"AppendYesNo — adds to end
"x"CreateYes — error if existsNo
"r+"Read and writeNoNo

"r" is the default — open("report.txt") and open("report.txt", "r") are identical.

The with Statement

Opening a file creates a connection between Python and the filesystem. That connection must be closed when you are done — otherwise data can be lost or the file can be locked. The with statement handles this automatically:

with open("report.txt", "r") as file:
content = file.read()

# File is closed here automatically, even if an error occurred
print(content)

with Statement Anatomy

with open("report.txt", "r") as file:
─────────────────────────────────────
│ │ │
keyword path alias
(the file object)

└── file closes automatically when the block ends

Always use with. It is not optional in production code.

Reading Methods

Three ways to read a file:

# Read entire file as one string
with open("notes.txt", "r") as f:
content = f.read()
print(content)
Server restarted at 02:14
Backup completed successfully
3 failed login attempts flagged
# Read one line at a time
with open("notes.txt", "r") as f:
first_line = f.readline()
print(first_line)
Server restarted at 02:14
# Read all lines into a list
with open("notes.txt", "r") as f:
lines = f.readlines()
print(lines)
['Server restarted at 02:14\n', 'Backup completed successfully\n', '3 failed login attempts flagged\n']

Each line includes a \n newline character at the end. Use .strip() to remove it when processing:

with open("notes.txt", "r") as f:
for line in f:
print(line.strip())
Server restarted at 02:14
Backup completed successfully
3 failed login attempts flagged

Iterating directly over the file object (for line in f) is memory-efficient for large files — it reads one line at a time instead of loading everything at once.

⚠ Common Mistake — Opening Without with

file = open("report.txt", "r")
content = file.read()
# forgot file.close() — file stays locked

If your script crashes between open() and close(), the file never closes properly. with closes the file even when an error occurs. Use it every time.

Try It 16.1 — Create a plain text file with four lines of your choice. Open it with with, read it line by line, strip whitespace, and print each line numbered (1. line one, 2. line two...).


16.2 Writing Files

Writing New Content

with open("summary.txt", "w") as f:
f.write("Pipeline run: complete\n")
f.write("Records processed: 9500\n")
f.write("Errors: 0\n")

Running this creates summary.txt if it does not exist. If it does exist, "w" mode wipes it completely before writing.

with open("summary.txt", "r") as f:
print(f.read())
Pipeline run: complete
Records processed: 9500
Errors: 0

Appending to an Existing File

Use "a" mode to add to a file without erasing what is already there:

with open("summary.txt", "a") as f:
f.write("Next run scheduled: 02:00 UTC\n")
with open("summary.txt", "r") as f:
print(f.read())
Pipeline run: complete
Records processed: 9500
Errors: 0
Next run scheduled: 02:00 UTC

Writing Multiple Lines at Once

writelines() takes a list of strings and writes them all. Note: it does not add newline characters automatically — you have to include them.

lines = [
"Region: Asia Pacific\n",
"Status: Active\n",
"Clients: 14\n"
]

with open("region.txt", "w") as f:
f.writelines(lines)

⚠ Common Mistake — "w" Overwrites Silently

with open("important_log.txt", "w") as f:
f.write("new entry")

If important_log.txt had six months of log history, it is now gone. Before writing, check whether you need "w" or "a". In pipelines, log files almost always use "a".

Try It 16.2 — Write a script that creates a file called run_log.txt and writes three pipeline status lines to it. Then append a fourth line using a separate with block. Read the final file and print all four lines.


16.3 Working with CSV Files

What Is a CSV?

A CSV (comma-separated values) file stores tabular data as plain text. Each line is a row. Each value within a row is separated by a comma. The first row is usually a header.

CSV structure:

┌──────────────────────────────────────────────────┐
│ id,name,department,status ← header row │
│ 1,Alice,Engineering,active ← data row │
│ 2,Bob,Finance,active ← data row │
│ 3,Carol,Engineering,inactive ← data row │
└──────────────────────────────────────────────────┘

Python's csv module handles CSV files correctly — including quoted fields, embedded commas, and different delimiters.

import csv

Reading with csv.DictReader

DictReader reads each row as a dictionary, using the header row as keys. This is the most practical way to read CSV files — you access columns by name, not by position.

Given a file employees.csv:

id,name,department,status
1,Alice,Engineering,active
2,Bob,Finance,active
3,Carol,Engineering,inactive
import csv

with open("employees.csv", "r") as f:
reader = csv.DictReader(f)
for row in reader:
print(row["name"], "—", row["department"])
Alice — Engineering
Bob — Finance
Carol — Engineering

Each row is a dictionary:

with open("employees.csv", "r") as f:
reader = csv.DictReader(f)
for row in reader:
print(dict(row))
break # just the first row
{'id': '1', 'name': 'Alice', 'department': 'Engineering', 'status': 'active'}

Note that every value is a string — including id. Convert types explicitly when needed.

Reading with csv.reader

csv.reader reads each row as a list. Use it when the file has no header, or when you need positional access:

with open("employees.csv", "r") as f:
reader = csv.reader(f)
next(reader) # skip header row
for row in reader:
print(row[1], row[2]) # name, department
Alice Engineering
Bob Finance
Carol Engineering

Writing with csv.DictWriter

import csv

records = [
{"id": 1, "name": "Alice", "department": "Engineering", "status": "active"},
{"id": 2, "name": "Bob", "department": "Finance", "status": "active"},
]

with open("output.csv", "w", newline="") as f:
fieldnames = ["id", "name", "department", "status"]
writer = csv.DictWriter(f, fieldnames=fieldnames)

writer.writeheader()
writer.writerows(records)

Result in output.csv:

id,name,department,status
1,Alice,Engineering,active
2,Bob,Finance,active

⚠ Common Mistake — Missing newline="" On Windows, opening a CSV file without newline="" adds a blank line between every row. Always pass newline="" when opening a file for csv.writer or csv.DictWriter.

Try It 16.3 — Create a CSV file manually with five rows of product data (id, name, price, category). Use DictReader to read it, filter rows where category equals a chosen value, and print only the matching names and prices.


16.4 Working with JSON

What Is JSON?

JSON (JavaScript Object Notation) is the data format of APIs and configuration files. It looks almost identical to a Python dictionary — key-value pairs, arrays, nested objects.

JSON vs Python

JSONPython equivalent
{ "key": "value" }dict
[1, 2, 3]list
"text"str
42int
3.14float
true / falseTrue / False
nullNone

Python's json module converts between the two:

import json

String ↔ Python: json.loads() and json.dumps()

json.loads() parses a JSON string into a Python object:

import json

response = '{"status": "ok", "records": 4200, "errors": null}'
data = json.loads(response)

print(type(data))
print(data["status"])
print(data["records"])
print(data["errors"])
<class 'dict'>
ok
4200
None

json.dumps() converts a Python object into a JSON string:

result = {
"pipeline": "daily_load",
"rows_loaded": 9500,
"success": True
}

json_string = json.dumps(result)
print(json_string)
print(type(json_string))
{"pipeline": "daily_load", "rows_loaded": 9500, "success": true}
<class 'str'>

Notice Python's True becomes JSON's true. None becomes null. The module handles all type conversions automatically.

Pretty Printing

indent makes JSON human-readable:

print(json.dumps(result, indent=4))
{
"pipeline": "daily_load",
"rows_loaded": 9500,
"success": true
}

Always use indent=4 when writing JSON files that humans will read.

File ↔ Python: json.load() and json.dump()

json.load() reads directly from a file object:

with open("config.json", "r") as f:
config = json.load(f)

print(config["host"])
print(config["port"])

json.dump() writes directly to a file:

pipeline_result = {
"run_date": "2024-11-15",
"rows_processed": 8741,
"failed": 0,
"status": "success"
}

with open("last_run.json", "w") as f:
json.dump(pipeline_result, f, indent=4)

Result in last_run.json:

{
"run_date": "2024-11-15",
"rows_processed": 8741,
"failed": 0,
"status": "success"
}

The Four JSON Functions at a Glance

String → Python: json.loads(string) # 's' = string
Python → String: json.dumps(python_obj) # 's' = string

File → Python: json.load(file_obj) # no 's'
Python → File: json.dump(python_obj, file_obj) # no 's'

The s in loads/dumps stands for string. Without the s, the function works with a file.

⚠ Common Mistake — load vs loads

json_string = '{"status": "ok"}'

json.load(json_string) # TypeError — expects a file object
json.loads(json_string) # Correct — parses a string

Pass strings to loads. Pass open file objects to load.

Try It 16.4 — Create a Python dictionary representing a server configuration (host, port, database name, ssl enabled). Write it to a JSON file with indent=4. Read it back with json.load() and print each key-value pair using an f-string.


16.5 Putting It Together

This script reads a CSV file of client records, filters for active clients, and writes the result to a JSON file — a pattern that appears in real data pipelines constantly.

Input file — clients.csv:

id,name,region,status,contract_value
1,Danial,Asia Pacific,active,95000
2,Alice,Europe,inactive,42000
3,Bob,Asia Pacific,active,78000
4,Carol,North America,active,120000
5,Eve,Europe,inactive,33000

Script:

import csv
import json

input_file = "clients.csv"
output_file = "active_clients.json"
target_region = "Asia Pacific"

active_clients = []

with open(input_file, "r") as f:
reader = csv.DictReader(f)
for row in reader:
if row["status"] == "active" and row["region"] == target_region:
active_clients.append({
"id": int(row["id"]),
"name": row["name"],
"region": row["region"],
"contract_value": int(row["contract_value"])
})

with open(output_file, "w") as f:
json.dump(active_clients, f, indent=4)

print(f"Exported {len(active_clients)} active clients to {output_file}")
Exported 2 active clients to active_clients.json

Contents of active_clients.json:

[
{
"id": 1,
"name": "Danial",
"region": "Asia Pacific",
"contract_value": 95000
},
{
"id": 3,
"name": "Bob",
"region": "Asia Pacific",
"contract_value": 78000
}
]

Every concept from this chapter is here: open() with with, csv.DictReader, type conversion, json.dump(), and f-strings for the summary line.


Summary

open() connects Python to a file on disk. The mode controls whether you read, write, or append — "r", "w", and "a" are the three you will use most. Always use with to ensure the file closes correctly, even when errors occur. For CSV files, csv.DictReader reads rows as dictionaries keyed by header names; csv.DictWriter writes them back. For JSON, json.loads() parses a JSON string into Python, json.dumps() converts Python back to a string, and json.load()/json.dump() do the same directly with file objects. The s in loads/dumps is the signal — string. Without it, expect a file.


Exercises

16.1 — Write a script that creates a file called servers.txt with five server hostnames, one per line. Read it back, strip whitespace from each line, and print only hostnames that start with "db-".

16.2 — You have a CSV file with columns date, event, and severity. Write code using csv.DictReader that reads the file and prints only rows where severity equals "HIGH". Then count and print how many high-severity events there were.

16.3 — The following code has a bug. Find it and explain exactly what goes wrong at runtime:

import json
data = {"host": "localhost", "port": 5432}
with open("config.json", "w") as f:
json.dumps(data, f, indent=4)

16.4 — Write a script that reads a JSON file containing a list of pipeline run records (each with run_date, rows, status). Filter for records where status == "failed". Write the filtered results to a new JSON file called failed_runs.json.

16.5 — A CSV file arrives with values in inconsistent casing — "Active", "ACTIVE", "active" all mean the same thing. Write a DictReader loop that normalizes the status column to lowercase before processing each row. Print the cleaned values.

16.6 — Think About It: Your pipeline uses open("data.csv", "w") to write results at the end of each run. On Tuesday the script crashes halfway through writing. On Wednesday the script runs again and opens the file in "w" mode. What does the output file contain on Wednesday? What mode would you use instead, and what changes would that require in how you structure the write logic?