Skip to main content

Chapter 19: Pandas

NumPy is fast at operating on numbers. But real-world data is not just numbers — it is tables with column names, mixed types, missing values, and rows that need to be filtered, grouped, and joined. Pandas was built to handle exactly that.

At its core, Pandas gives Python a data structure that behaves like a spreadsheet but operates like a database. You can load a CSV in one line, inspect a million rows in seconds, filter by condition, group by category, merge two tables, and export the result — all with readable, expressive code. It is the standard tool for data manipulation in Python, and nearly every data engineering pipeline touches it at some point.

import pandas as pd
import numpy as np

19.1 Series — The Building Block

A Series is a one-dimensional labeled array. Think of it as a single column from a spreadsheet — values on one side, labels (the index) on the other.

revenue = pd.Series([4200, 3800, 5100, 4700, 4100])
print(revenue)
0 4200
1 3800
2 5100
3 4700
4 4100
dtype: int64

By default the index is 0, 1, 2... You can assign meaningful labels:

revenue = pd.Series(
[4200, 3800, 5100, 4700, 4100],
index=["Jan", "Feb", "Mar", "Apr", "May"]
)
print(revenue)
print(revenue["Mar"])
print(revenue[revenue > 4000])
Jan 4200
Feb 3800
Mar 5100
Apr 4700
May 4100
dtype: int64

5100

Jan 4200
Mar 5100
Apr 4700
May 4100
dtype: int64

A Series supports arithmetic, statistical functions, and boolean indexing — the same patterns as NumPy, plus the labeled index on top.


19.2 DataFrame — The Core Structure

A DataFrame is a two-dimensional table with labeled rows and labeled columns. It is the primary structure you will work with.

Creating from a Dictionary

data = {
"name": ["Alice", "Bob", "Carol", "David", "Eve"],
"department": ["Engineering", "Finance", "Engineering", "HR", "Finance"],
"salary": [85000, 72000, 91000, 65000, 78000],
"active": [True, True, False, True, True]
}

df = pd.DataFrame(data)
print(df)
name department salary active
0 Alice Engineering 85000 True
1 Bob Finance 72000 True
2 Carol Engineering 91000 False
3 David HR 65000 True
4 Eve Finance 78000 True

DataFrame Anatomy

name department salary active
┌──────────────────────────────────┐
0 ──▶│ Alice Engineering 85000 True │
1 ──▶│ Bob Finance 72000 True │◀── rows
2 ──▶│ Carol Engineering 91000 False │
3 ──▶│ David HR 65000 True │
4 ──▶│ Eve Finance 78000 True │
└──────────────────────────────────┘
▲ ▲
index columns

Reading from a CSV

In practice you will create DataFrames from files, not dictionaries.

df = pd.read_csv("employees.csv")

Common options:

df = pd.read_csv("data.csv", sep=";") # semicolon delimiter
df = pd.read_csv("data.csv", index_col="id") # set a column as index
df = pd.read_csv("data.csv", usecols=["name","salary"]) # load specific columns

Try It 19.1 — Create a DataFrame from a dictionary with at least four columns and five rows. The columns should include a string, a number, and a boolean. Print the full DataFrame.


19.3 Exploring a DataFrame

Before touching data, understand its shape and contents. These are the first five lines of every data engineering script.

print(df.shape) # (rows, columns)
print(df.dtypes) # data type of each column
print(df.columns) # column names
df.head() # first 5 rows
df.tail(3) # last 3 rows
df.info() # column names, non-null counts, dtypes
df.describe() # statistics for numeric columns

Exploration Methods at a Glance

MethodReturns
df.shape(rows, columns) as a tuple
df.dtypesData type of each column
df.columnsList of column names
df.head(n)First n rows (default 5)
df.tail(n)Last n rows (default 5)
df.info()Column names, non-null counts, memory usage
df.describe()Count, mean, std, min, max, percentiles for numeric columns
df.value_counts()Frequency of each unique value in a Series
df.nunique()Number of unique values per column
print(df.shape)
print(df.dtypes)
print(df["department"].value_counts())
(5, 4)

name object
department object
salary int64
active bool
dtype: object

Engineering 2
Finance 2
HR 1
Name: department, dtype: int64

19.4 Selecting Data

Selecting Columns

print(df["name"]) # one column — returns a Series
print(df[["name", "salary"]]) # multiple columns — returns a DataFrame
0 Alice
1 Bob
2 Carol
3 David
4 Eve
Name: name, dtype: object

name salary
0 Alice 85000
1 Bob 72000
2 Carol 91000
3 David 65000
4 Eve 78000

⚠ Common Mistake — [] vs [[]]

df["name"] # returns a Series (one column)
df[["name"]] # returns a DataFrame (one column, still a table)

When you need to pass columns to another function that expects a DataFrame, use double brackets. When you just need values, single brackets are fine.

Selecting Rows — .loc[] and .iloc[]

.loc[].iloc[]
Based onLabel (index value or column name)Integer position (0, 1, 2...)
Use whenYou have meaningful index labelsYou want rows by position
# .loc — by label
print(df.loc[2]) # row with index label 2
print(df.loc[1:3, "name"]) # rows 1 to 3, name column
name Carol
department Engineering
salary 91000
active False
Name: 2, dtype: object

1 Bob
2 Carol
3 David
Name: name, dtype: object
# .iloc — by position
print(df.iloc[0]) # first row
print(df.iloc[0:2, 0:2]) # rows 0-1, columns 0-1
name Alice
department Engineering
salary 85000
active True
Name: 0, dtype: object

name department
0 Alice Engineering
1 Bob Finance

Filtering Rows with Conditions

# Active employees only
active = df[df["active"] == True]
print(active)

# Engineering department with salary above 80,000
senior_eng = df[(df["department"] == "Engineering") & (df["salary"] > 80000)]
print(senior_eng)
name department salary active
0 Alice Engineering 85000 True
1 Bob Finance 72000 True
3 David HR 65000 True
4 Eve Finance 78000 True

name department salary active
0 Alice Engineering 85000 True
2 Carol Engineering 91000 False

Try It 19.2 — Using the DataFrame from Try It 19.1, select two columns by name. Filter rows where the numeric column is above a threshold you choose. Print the shape of the filtered result.


19.5 Handling Missing Data

Missing data in Pandas is represented as NaN (Not a Number). Every data cleaning step starts with finding and handling it.

data = {
"name": ["Alice", "Bob", "Carol", "David"],
"salary": [85000, None, 91000, 65000],
"city": ["Karachi", None, "Lahore", None]
}
df = pd.DataFrame(data)
print(df)
name salary city
0 Alice 85000.0 Karachi
1 Bob NaN None
2 Carol 91000.0 Lahore
3 David 65000.0 None

Finding Missing Values

print(df.isnull()) # True where value is missing
print(df.isnull().sum()) # count of missing per column
name salary city
0 False False False
1 False True True
2 False False False
3 False False True

name 0
salary 1
city 2
dtype: int64

Handling Strategies

SituationMethod
Drop rows with any missing valuedf.dropna()
Drop rows where specific column is missingdf.dropna(subset=["salary"])
Fill missing with a fixed valuedf.fillna(0)
Fill missing with the column meandf["salary"].fillna(df["salary"].mean())
Forward-fill (use previous value)df.fillna(method="ffill")
# Fill salary with mean, drop rows missing city
df["salary"] = df["salary"].fillna(df["salary"].mean())
df_clean = df.dropna(subset=["city"])
print(df_clean)
name salary city
0 Alice 85000.0 Karachi
2 Carol 91000.0 Lahore

⚠ Common Mistake — fillna Does Not Modify In Place by Default

df["salary"].fillna(0) # does nothing to df
df["salary"] = df["salary"].fillna(0) # correct

Most Pandas methods return a new object. Assign the result back or pass inplace=True.

Try It 19.3 — Create a DataFrame with at least two columns containing some None values. Use isnull().sum() to count missing values per column. Fill numeric missing values with the column mean. Drop rows where the string column is missing. Print the cleaned result.


19.6 Data Cleaning

Renaming Columns

df = df.rename(columns={
"name": "full_name",
"salary": "annual_salary"
})
print(df.columns)
Index(['full_name', 'annual_salary', 'city'], dtype='object')

Changing Data Types

df["annual_salary"] = df["annual_salary"].astype(int)
print(df.dtypes)
full_name object
annual_salary int64
city object
dtype: object

String Cleaning with .str

Pandas exposes string methods through .str — the same methods as Python strings, applied to every value in a column at once.

data = {
"name": [" alice ", "BOB", "carol smith"],
"email": ["Alice@Example.COM", "bob@example.com", "CAROL@EXAMPLE.COM"]
}
df = pd.DataFrame(data)

df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.lower()

print(df)
name email
0 Alice alice@example.com
1 Bob bob@example.com
2 Carol Smith carol@example.com

The .str accessor is one of the most useful patterns in data cleaning — it applies any string method across an entire column without writing a loop.

Try It 19.4 — Create a DataFrame with a column of messy city names (inconsistent casing, extra spaces). Clean the column using .str.strip() and .str.title(). Add a second column showing the length of each cleaned name using .str.len().


19.7 Adding and Transforming Columns

Adding a New Column

Assign to a new column name:

df["monthly_salary"] = df["salary"] / 12
df["senior"] = df["salary"] > 80000
print(df[["name", "salary", "monthly_salary", "senior"]])
name salary monthly_salary senior
0 Alice 85000 7083.333333 True
1 Bob 72000 6000.000000 False
2 Carol 91000 7583.333333 True
3 David 65000 5416.666667 False
4 Eve 78000 6500.000000 False

apply() — Custom Transformations

apply() runs a function over each row or column:

def salary_band(salary):
if salary >= 90000:
return "Senior"
elif salary >= 75000:
return "Mid"
else:
return "Junior"

df["band"] = df["salary"].apply(salary_band)
print(df[["name", "salary", "band"]])
name salary band
0 Alice 85000 Mid
1 Bob 72000 Junior
2 Carol 91000 Senior
3 David 65000 Junior
4 Eve 78000 Mid

np.where() for Simple Conditionals

For binary (two-option) conditions, np.where is faster than apply:

df["status"] = np.where(df["active"], "Active", "Inactive")

19.8 Sorting and Grouping

sort_values()

# Sort by salary descending
print(df.sort_values("salary", ascending=False)[["name", "salary"]])
name salary
2 Carol 91000
0 Alice 85000
4 Eve 78000
1 Bob 72000
3 David 65000

groupby() — The Most Powerful Pandas Operation

groupby() splits a DataFrame into groups, applies a function to each group, and combines the results.

groupby() flow:

DataFrame

├── SPLIT by "department"
│ Engineering: [Alice, Carol]
│ Finance: [Bob, Eve]
│ HR: [David]

├── APPLY function (mean, sum, count...)

└── COMBINE into result
summary = df.groupby("department")["salary"].mean()
print(summary)
department
Engineering 88000.0
Finance 75000.0
HR 65000.0
Name: salary, dtype: float64

Multiple aggregations at once with .agg():

summary = df.groupby("department")["salary"].agg(
count="count",
mean="mean",
max="max"
)
print(summary)
count mean max
department
Engineering 2 88000.0 91000
Finance 2 75000.0 78000
HR 1 65000.0 65000

Try It 19.5 — Using a DataFrame of your choice with at least one categorical and one numeric column, group by the categorical column and compute the sum, mean, and count of the numeric column in a single .agg() call.


19.9 Merging DataFrames

Pandas merge() combines two DataFrames on a shared column — exactly like a SQL JOIN.

employees = pd.DataFrame({
"id": [1, 2, 3, 4],
"name": ["Alice", "Bob", "Carol", "David"],
"dept_id": [10, 20, 10, 30]
})

departments = pd.DataFrame({
"dept_id": [10, 20, 40],
"dept_name": ["Engineering", "Finance", "Marketing"]
})

Merge Types

how=Returns
"inner"Only rows with a match in both tables
"left"All rows from left, matched from right
"right"All rows from right, matched from left
"outer"All rows from both, NaN where no match
# Inner join — only matched rows
inner = pd.merge(employees, departments, on="dept_id", how="inner")
print(inner)
id name dept_id dept_name
0 1 Alice 10 Engineering
1 3 Carol 10 Engineering
2 2 Bob 20 Finance
# Left join — all employees, NaN if no department match
left = pd.merge(employees, departments, on="dept_id", how="left")
print(left)
id name dept_id dept_name
0 1 Alice 10 Engineering
1 2 Bob 20 Finance
2 3 Carol 10 Engineering
3 4 David 30 NaN

David has no matching department — his dept_name is NaN.


19.10 Exporting Data

# Save to CSV
df.to_csv("output.csv", index=False) # index=False prevents writing row numbers

# Save to JSON
df.to_json("output.json", orient="records", indent=4)

orient="records" exports as a list of dictionaries — the format most APIs and downstream tools expect.


19.11 Putting It Together

Ayan receives a raw sales CSV every Monday. Before loading it to the warehouse, he runs a cleaning and profiling script.

import pandas as pd
import numpy as np

# --- Load ---
df = pd.read_csv("weekly_sales.csv")
print(f"Loaded: {df.shape[0]} rows, {df.shape[1]} columns\n")

# --- Inspect ---
print("Missing values:")
print(df.isnull().sum(), "\n")

# --- Clean ---
df["product"] = df["product"].str.strip().str.title()
df["region"] = df["region"].str.strip().str.upper()
df["revenue"] = df["revenue"].fillna(df["revenue"].median())
df = df.dropna(subset=["product"])

# --- Transform ---
df["tax"] = (df["revenue"] * 0.17).round(2)
df["total"] = df["revenue"] + df["tax"]
df["tier"] = np.where(df["revenue"] >= 10000, "High", "Standard")

# --- Aggregate ---
summary = df.groupby("region")["total"].agg(
orders="count",
total_revenue="sum",
avg_order="mean"
).round(2)

print("Regional Summary:")
print(summary)

# --- Export ---
df.to_csv("weekly_sales_clean.csv", index=False)
print(f"\nExported {len(df)} cleaned rows.")
Loaded: 847 rows, 5 columns

Missing values:
product 3
region 0
revenue 12
date 0
dtype: int64

Regional Summary:
orders total_revenue avg_order
region
EAST 213 2847492.5 13369.4
NORTH 189 2341087.2 12386.7
SOUTH 224 3102843.6 13851.1
WEST 218 2761930.8 12669.4

Exported 844 cleaned rows.

Load → inspect → clean → transform → aggregate → export. This is the shape of almost every data cleaning script in production.


19.12 Real-World Analysis: The Titanic Dataset

The Titanic dataset is one of the most studied datasets in data science. It contains information about 891 passengers on the RMS Titanic — who they were, their ticket class, their age, how much they paid, and whether they survived. It is an ideal dataset for practising every Pandas skill from this chapter on real, messy data.

Download the dataset from Kaggle: https://www.kaggle.com/competitions/titanic/data

Download train.csv and save it as titanic.csv in your working directory.

The Columns

ColumnDescription
PassengerIdUnique ID
Survived0 = No, 1 = Yes
PclassTicket class — 1 (First), 2 (Second), 3 (Third)
NamePassenger name
Sexmale / female
AgeAge in years
SibSpNumber of siblings/spouses aboard
ParchNumber of parents/children aboard
TicketTicket number
FarePassenger fare
CabinCabin number
EmbarkedPort — C (Cherbourg), Q (Queenstown), S (Southampton)

Step 1 — Load and First Look

import pandas as pd
import numpy as np

df = pd.read_csv("titanic.csv")

print(df.shape)
print(df.head())
(891, 12)

PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3

Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0

Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S

891 passengers. Already you can see Cabin and Age have NaN values.


Step 2 — Understand the Data

print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
print(df.describe())
PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 714.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008
std 257.353842 0.486592 0.836071 14.526497 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000

The mean of Survived is 0.38 — meaning 38% of passengers survived.


Step 3 — Missing Values

missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)

print(pd.DataFrame({"missing": missing, "percent": missing_pct}))
missing percent
PassengerId 0 0.0
Survived 0 0.0
Pclass 0 0.0
Name 0 0.0
Sex 0 0.0
Age 177 19.9
SibSp 0 0.0
Parch 0 0.0
Ticket 0 0.0
Fare 0 0.0
Cabin 687 77.1
Embarked 2 0.2

Three columns have missing values:

  • Age — 20% missing. Fill with median (preferred over mean for skewed data).
  • Cabin — 77% missing. Too many gaps to be useful. Drop the column.
  • Embarked — only 2 missing. Fill with the most common port.

Step 4 — Clean the Data

A Pandas Series stores its values as a NumPy array underneath. Calling .values exposes that array directly — which means every NumPy function works on it.

# The Age column is backed by a NumPy array
age_array = df["Age"].values
print(type(age_array))
print(age_array[:8])
<class 'numpy.ndarray'>
[22. 38. 26. 35. 35. nan 54. 2. ]

Use np.nanmedian() to compute the median while ignoring NaN values — more explicit than Pandas' .median() and directly from NumPy:

age_median = np.nanmedian(df["Age"].values)
print(f"Age median (NumPy): {age_median}")
Age median (NumPy): 28.0

Now clean all three columns:

# Fill Age with NumPy-computed median
df["Age"] = df["Age"].fillna(age_median)

# Drop Cabin — 77% missing, not salvageable
df = df.drop(columns=["Cabin"])

# Fill the two missing Embarked values with the most common port
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Verify
print(df.isnull().sum())
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Embarked 0
dtype: int64

Clean. Now the analysis can begin.


Step 5 — Overall Survival Rate

total = len(df)
survived = df["Survived"].sum()
not_surv = total - survived
surv_rate = (survived / total * 100).round(1)

print(f"Total passengers : {total}")
print(f"Survived : {survived} ({surv_rate}%)")
print(f"Did not survive : {not_surv} ({100 - surv_rate}%)")
Total passengers : 891
Survived : 342 (38.4%)
Did not survive : 549 (61.6%)

Only 38 out of every 100 passengers survived.


Step 6 — Survival by Gender

One of the most striking patterns in the dataset.

gender_surv = df.groupby("Sex")["Survived"].agg(
total="count",
survived="sum"
)
gender_surv["survival_rate"] = (
gender_surv["survived"] / gender_surv["total"] * 100
).round(1)

print(gender_surv)
total survived survival_rate
Sex
female 314 233 74.2
male 577 109 18.9

74% of women survived. Only 19% of men did. The "women and children first" protocol was real and visible in the data.


Step 7 — Survival by Passenger Class

class_surv = df.groupby("Pclass")["Survived"].agg(
total="count",
survived="sum"
)
class_surv["survival_rate"] = (
class_surv["survived"] / class_surv["total"] * 100
).round(1)
class_surv.index = ["1st Class", "2nd Class", "3rd Class"]

print(class_surv)
total survived survival_rate
1st Class 216 136 63.0
2nd Class 184 87 47.3
3rd Class 491 119 24.2

First-class passengers had more than twice the survival rate of third-class. Access to lifeboats was unequal.


Step 8 — Survival by Gender and Class Combined

combined = df.pivot_table(
values="Survived",
index="Pclass",
columns="Sex",
aggfunc="mean"
).round(3) * 100

combined.index = ["1st Class", "2nd Class", "3rd Class"]
print(combined)
Sex female male
1st Class 96.8 36.9
2nd Class 92.1 15.7
3rd Class 50.0 13.5

A first-class woman had a 96.8% chance of survival. A third-class man had a 13.5% chance. The intersection of gender and class tells a much sharper story than either variable alone.


Step 9 — Age Analysis

surv_age = df[df["Survived"] == 1]["Age"]
nosurv_age = df[df["Survived"] == 0]["Age"]

print("Age — Survivors:")
print(f" Mean: {surv_age.mean():.1f}")
print(f" Median: {surv_age.median():.1f}")

print("\nAge — Did Not Survive:")
print(f" Mean: {nosurv_age.mean():.1f}")
print(f" Median: {nosurv_age.median():.1f}")

# Children under 10
children = df[df["Age"] < 10]
child_surv_rate = (children["Survived"].sum() / len(children) * 100).round(1)
print(f"\nChildren under 10 survival rate: {child_surv_rate}%")
Age — Survivors:
Mean: 28.3
Median: 28.0

Age — Did Not Survive:
Mean: 30.6
Median: 28.0

Children under 10 survival rate: 59.4%

Younger passengers had a slight survival advantage. Children under 10 survived at nearly 60%.

Use NumPy to understand the full age distribution through percentiles:

age_arr = df["Age"].values # extract the underlying NumPy array

percentiles = np.percentile(age_arr, [10, 25, 50, 75, 90])
labels = ["10th", "25th", "50th", "75th", "90th"]

print("Age percentile distribution:")
for label, val in zip(labels, percentiles):
print(f" {label} percentile : {val:.1f} years")

print(f"\n Std deviation : {np.std(age_arr):.1f} years")
print(f" Variance : {np.var(age_arr):.1f}")
Age percentile distribution:
10th percentile : 14.0 years
25th percentile : 22.0 years
50th percentile : 28.0 years
75th percentile : 36.0 years
90th percentile : 50.0 years

Std deviation : 13.0 years
Variance : 169.0

10% of passengers were under 14. Half were under 28. Now use np.where() to label passengers into age groups — no loop needed:

df["age_group"] = np.where(
df["Age"] < 18, "Child",
np.where(df["Age"] < 40, "Adult", "Senior")
)

group_surv = df.groupby("age_group")["Survived"].agg(
total="count",
survived="sum"
)
group_surv["rate"] = (
group_surv["survived"] / group_surv["total"] * 100
).round(1)

print(group_surv)
total survived rate
age_group
Adult 551 215 39.0
Child 139 70 50.4
Senior 201 57 28.4

Children had the highest survival rate at 50.4%. Seniors the lowest at 28.4%. The nested np.where() built the entire column in one line — the same pattern works on NumPy arrays and Pandas Series identically.


Step 10 — Fare Analysis

print(f"Mean fare : £{df['Fare'].mean():.2f}")
print(f"Median fare: £{df['Fare'].median():.2f}")
print(f"Max fare : £{df['Fare'].max():.2f}")
print()

fare_by_class = df.groupby("Pclass")["Fare"].mean().round(2)
fare_by_class.index = ["1st Class", "2nd Class", "3rd Class"]
print("Average fare by class:")
print(fare_by_class)
Mean fare : £32.20
Median fare: £14.45
Max fare : £512.33

Average fare by class:
1st Class 84.15
2nd Class 20.66
3rd Class 13.68
dtype: float64

The mean (£32) is far above the median (£14) — the distribution is right-skewed. A small number of very expensive first-class tickets pull the mean up.

Use NumPy to see the full distribution through percentiles and confirm the skew:

fare_arr = df["Fare"].values

print("Fare percentile breakdown:")
for p in [25, 50, 75, 90, 95, 99]:
val = np.percentile(fare_arr, p)
print(f" {p:>3}th percentile : £{val:.2f}")

print(f"\n Std deviation : £{np.std(fare_arr):.2f}")
Fare percentile breakdown:
25th percentile : £7.91
50th percentile : £14.45
75th percentile : £31.39
90th percentile : £77.96
95th percentile : £112.08
99th percentile : £249.06

The top 1% paid over £249. The bottom 25% paid under £8. This extreme spread is what makes the mean misleading here.

Apply a log transformation using np.log1p() to compress the skew. log1p means log(1 + x) — the +1 handles zero values safely:

df["fare_log"] = np.log1p(df["Fare"].values)

print(f"Original fare — mean: £{df['Fare'].mean():.2f}, std: £{np.std(df['Fare'].values):.2f}")
print(f"Log fare — mean: {df['fare_log'].mean():.3f}, std: {np.std(df['fare_log'].values):.3f}")
Original fare — mean: £32.20, std: £49.69
Log fare — mean: 2.964, std: 1.020

The log-transformed fare has a much tighter spread (std of 1.0 vs 49.7). This is a standard preprocessing step before feeding data into machine learning models.

Now use np.corrcoef() to check how strongly each numeric feature correlates with survival:

numeric_cols = ["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]
data_matrix = df[numeric_cols].values.T # transpose: each row = one variable

corr_matrix = np.corrcoef(data_matrix)
surv_corr = corr_matrix[0] # row 0 = correlations with Survived

print("Correlation with Survival:")
for col, corr in zip(numeric_cols, surv_corr):
bar = "█" * int(abs(corr) * 20)
direction = "+" if corr >= 0 else "-"
print(f" {col:<12} {direction}{abs(corr):.3f} {bar}")
Correlation with Survival:
Survived +1.000 ████████████████████
Pclass -0.338 ██████
Age -0.077 █
SibSp -0.035
Parch +0.082 █
Fare +0.257 █████

Pclass has the strongest negative correlation with survival — higher class number (3rd) means lower survival. Fare has the strongest positive correlation — paying more meant a better chance. Age is weakly negative — older passengers were slightly less likely to survive.


Step 11 — Export the Cleaned Dataset

df.to_csv("titanic_clean.csv", index=False)
print(f"Exported {len(df)} rows to titanic_clean.csv")
Exported 891 rows to titanic_clean.csv

What This Analysis Used

Every major concept from Chapters 18 and 19 appeared in this walkthrough:

ConceptWhere used
pd.read_csv()Step 1
head(), info(), describe(), shapeSteps 1–2
isnull().sum()Step 3
fillna(), drop(), mode()Step 4
.values — accessing the NumPy arrayStep 4
np.nanmedian()Step 4
Boolean filteringSteps 5, 9
groupby().agg()Steps 6, 7, 10
pivot_table()Step 8
np.percentile()Steps 9, 10
np.std(), np.var()Step 9
np.where() — vectorized age groupingStep 9
np.log1p() — log transformationStep 10
np.corrcoef() — correlation matrixStep 10
to_csv()Step 11

The NumPy operations did not replace Pandas — they extended it. Pandas organized and filtered the data. NumPy computed the statistics, transformed distributions, and built columns without a single loop.


Summary

Pandas provides two core structures. A Series is a 1D labeled array — one column of data with an index. A DataFrame is a 2D labeled table — the primary structure for all data work. Load data with pd.read_csv() and explore it with shape, dtypes, info(), and describe(). Select columns with df["col"] or df[["col1","col2"]]; select rows with .loc[] (by label) or .iloc[] (by position); filter rows with boolean conditions. Handle missing data with isnull(), fillna(), and dropna() — remember most methods return a new DataFrame, not a modified one. Clean string columns with the .str accessor. Add columns by direct assignment or apply() for custom logic. Sort with sort_values(); aggregate with groupby().agg(). Merge two DataFrames like a SQL JOIN using merge() with how="inner", "left", "right", or "outer". Export with to_csv() or to_json().


Exercises

19.1 — Load a CSV file (or create a DataFrame from a dictionary with 8+ rows). Print its shape, dtypes, and the count of unique values in each column using nunique(). Then print the five rows with the highest numeric value in one column.

19.2 — Create a DataFrame with a salary column containing two None values. Use isnull().sum() to confirm. Fill missing values with the column mean. Confirm no missing values remain. Print the before and after.

19.3 — The following code is intended to clean a column but does not work as expected. Explain why and fix it:

df["city"].fillna("Unknown")
df["city"] = df["city"].str.title()
print(df["city"])

19.4 — From a DataFrame of five employees with name, department, salary, and years_experience columns: filter to employees with more than 3 years experience, add a bonus column equal to 10% of salary, and sort by bonus descending. Print the result.

19.5 — Create two DataFrames: one of orders with order_id, customer_id, and amount; one of customers with customer_id and name. Perform a left merge. Print how many rows have a NaN customer name (orders with no matching customer).

19.6 — Group a sales DataFrame by both region and product_category simultaneously using groupby(["region","product_category"]). Compute total revenue and order count for each combination. Which combination has the highest average order value?

19.7 — Write a complete data cleaning function clean_dataframe(df) that: strips and title-cases all string columns, fills numeric missing values with the column median, drops rows where more than half the columns are missing, and returns the cleaned DataFrame. Test it on a deliberately messy DataFrame.

19.8 — Think About It: groupby() in Pandas uses the Split-Apply-Combine pattern. SQL's GROUP BY does the same thing. What does Pandas give you that SQL does not, and what does SQL give you that Pandas does not? When would you clean and group data in Pandas before loading to a database, and when would you let the database do the grouping instead?