Chapter 19: Pandas

NumPy is fast at operating on numbers. But real-world data is not just numbers — it is tables with column names, mixed types, missing values, and rows that need to be filtered, grouped, and joined. Pandas was built to handle exactly that.

At its core, Pandas gives Python a data structure that behaves like a spreadsheet but operates like a database. You can load a CSV in one line, inspect a million rows in seconds, filter by condition, group by category, merge two tables, and export the result — all with readable, expressive code. It is the standard tool for data manipulation in Python, and nearly every data engineering pipeline touches it at some point.

import pandas as pd
import numpy as np

19.1 Series — The Building Block

A Series is a one-dimensional labeled array. Think of it as a single column from a spreadsheet — values on one side, labels (the index) on the other.

revenue = pd.Series([4200, 3800, 5100, 4700, 4100])
print(revenue)

  4200
  3800
  5100
  4700
  4100
dtype: int64

By default the index is 0, 1, 2... You can assign meaningful labels:

revenue = pd.Series(
    [4200, 3800, 5100, 4700, 4100],
    index=["Jan", "Feb", "Mar", "Apr", "May"]
)
print(revenue)
print(revenue["Mar"])
print(revenue[revenue > 4000])

Jan    4200
Feb    3800
Mar    5100
Apr    4700
May    4100
dtype: int64

5100

Jan    4200
Mar    5100
Apr    4700
May    4100
dtype: int64

A Series supports arithmetic, statistical functions, and boolean indexing — the same patterns as NumPy, plus the labeled index on top.

19.2 DataFrame — The Core Structure

A DataFrame is a two-dimensional table with labeled rows and labeled columns. It is the primary structure you will work with.

Creating from a Dictionary

data = {
    "name":       ["Alice", "Bob", "Carol", "David", "Eve"],
    "department": ["Engineering", "Finance", "Engineering", "HR", "Finance"],
    "salary":     [85000, 72000, 91000, 65000, 78000],
    "active":     [True, True, False, True, True]
}

df = pd.DataFrame(data)
print(df)

    name   department  salary  active
Alice  Engineering   85000    True
  Bob      Finance   72000    True
Carol  Engineering   91000   False
David           HR   65000    True
  Eve      Finance   78000    True

DataFrame Anatomy

            name   department  salary  active
          ┌──────────────────────────────────┐
    0  ──▶│ Alice  Engineering  85000   True  │
    1  ──▶│  Bob    Finance     72000   True  │◀── rows
    2  ──▶│ Carol  Engineering  91000  False  │
    3  ──▶│ David      HR       65000   True  │
    4  ──▶│  Eve    Finance     78000   True  │
          └──────────────────────────────────┘
    ▲              ▲
  index          columns

Reading from a CSV

In practice you will create DataFrames from files, not dictionaries.

df = pd.read_csv("employees.csv")

Common options:

df = pd.read_csv("data.csv", sep=";")              # semicolon delimiter
df = pd.read_csv("data.csv", index_col="id")       # set a column as index
df = pd.read_csv("data.csv", usecols=["name","salary"])  # load specific columns

Try It 19.1 — Create a DataFrame from a dictionary with at least four columns and five rows. The columns should include a string, a number, and a boolean. Print the full DataFrame.

19.3 Exploring a DataFrame

Before touching data, understand its shape and contents. These are the first five lines of every data engineering script.

print(df.shape)          # (rows, columns)
print(df.dtypes)         # data type of each column
print(df.columns)        # column names
df.head()                # first 5 rows
df.tail(3)               # last 3 rows
df.info()                # column names, non-null counts, dtypes
df.describe()            # statistics for numeric columns

Exploration Methods at a Glance

Method	Returns
`df.shape`	`(rows, columns)` as a tuple
`df.dtypes`	Data type of each column
`df.columns`	List of column names
`df.head(n)`	First n rows (default 5)
`df.tail(n)`	Last n rows (default 5)
`df.info()`	Column names, non-null counts, memory usage
`df.describe()`	Count, mean, std, min, max, percentiles for numeric columns
`df.value_counts()`	Frequency of each unique value in a Series
`df.nunique()`	Number of unique values per column

print(df.shape)
print(df.dtypes)
print(df["department"].value_counts())

(5, 4)

name          object
department    object
salary         int64
active          bool
dtype: object

Engineering    2
Finance        2
HR             1
Name: department, dtype: int64

19.4 Selecting Data

Selecting Columns

print(df["name"])                        # one column — returns a Series
print(df[["name", "salary"]])            # multiple columns — returns a DataFrame

  Alice
    Bob
  Carol
  David
    Eve
Name: name, dtype: object

    name  salary
Alice   85000
  Bob   72000
Carol   91000
David   65000
  Eve   78000

⚠ Common Mistake — [] vs [[]]
df["name"]       # returns a Series (one column)
df[["name"]]     # returns a DataFrame (one column, still a table)
When you need to pass columns to another function that expects a DataFrame, use double brackets. When you just need values, single brackets are fine.

Selecting Rows — `.loc[]` and `.iloc[]`

	`.loc[]`	`.iloc[]`
Based on	Label (index value or column name)	Integer position (0, 1, 2...)
Use when	You have meaningful index labels	You want rows by position

# .loc — by label
print(df.loc[2])              # row with index label 2
print(df.loc[1:3, "name"])    # rows 1 to 3, name column

name          Carol
department    Engineering
salary              91000
active              False
Name: 2, dtype: object

1      Bob
2    Carol
3    David
Name: name, dtype: object

# .iloc — by position
print(df.iloc[0])              # first row
print(df.iloc[0:2, 0:2])       # rows 0-1, columns 0-1

name           Alice
department    Engineering
salary              85000
active               True
Name: 0, dtype: object

    name   department
0  Alice  Engineering
1    Bob      Finance

Filtering Rows with Conditions

# Active employees only
active = df[df["active"] == True]
print(active)

# Engineering department with salary above 80,000
senior_eng = df[(df["department"] == "Engineering") & (df["salary"] > 80000)]
print(senior_eng)

    name   department  salary  active
Alice  Engineering   85000    True
  Bob      Finance   72000    True
David           HR   65000    True
  Eve      Finance   78000    True

    name   department  salary  active
Alice  Engineering   85000    True
Carol  Engineering   91000   False

Try It 19.2 — Using the DataFrame from Try It 19.1, select two columns by name. Filter rows where the numeric column is above a threshold you choose. Print the shape of the filtered result.

19.5 Handling Missing Data

Missing data in Pandas is represented as NaN (Not a Number). Every data cleaning step starts with finding and handling it.

data = {
    "name":   ["Alice", "Bob", "Carol", "David"],
    "salary": [85000, None, 91000, 65000],
    "city":   ["Karachi", None, "Lahore", None]
}
df = pd.DataFrame(data)
print(df)

    name   salary     city
Alice  85000.0  Karachi
  Bob      NaN     None
Carol  91000.0   Lahore
David  65000.0     None

Finding Missing Values

print(df.isnull())          # True where value is missing
print(df.isnull().sum())    # count of missing per column

    name  salary   city
0  False   False  False
1  False    True   True
2  False   False  False
3  False   False   True

name      0
salary    1
city      2
dtype: int64

Handling Strategies

Situation	Method
Drop rows with any missing value	`df.dropna()`
Drop rows where specific column is missing	`df.dropna(subset=["salary"])`
Fill missing with a fixed value	`df.fillna(0)`
Fill missing with the column mean	`df["salary"].fillna(df["salary"].mean())`
Forward-fill (use previous value)	`df.fillna(method="ffill")`

# Fill salary with mean, drop rows missing city
df["salary"] = df["salary"].fillna(df["salary"].mean())
df_clean = df.dropna(subset=["city"])
print(df_clean)

    name    salary     city
0  Alice  85000.0  Karachi
2  Carol  91000.0   Lahore

⚠ Common Mistake — fillna Does Not Modify In Place by Default
df["salary"].fillna(0)       # does nothing to df
df["salary"] = df["salary"].fillna(0)   # correct
Most Pandas methods return a new object. Assign the result back or pass inplace=True.

Try It 19.3 — Create a DataFrame with at least two columns containing some None values. Use isnull().sum() to count missing values per column. Fill numeric missing values with the column mean. Drop rows where the string column is missing. Print the cleaned result.

19.6 Data Cleaning

Renaming Columns

df = df.rename(columns={
    "name":   "full_name",
    "salary": "annual_salary"
})
print(df.columns)

Index(['full_name', 'annual_salary', 'city'], dtype='object')

Changing Data Types

df["annual_salary"] = df["annual_salary"].astype(int)
print(df.dtypes)

full_name        object
annual_salary     int64
city             object
dtype: object

String Cleaning with `.str`

Pandas exposes string methods through .str — the same methods as Python strings, applied to every value in a column at once.

data = {
    "name":   ["  alice  ", "BOB", "carol smith"],
    "email":  ["Alice@Example.COM", "bob@example.com", "CAROL@EXAMPLE.COM"]
}
df = pd.DataFrame(data)

df["name"]  = df["name"].str.strip().str.title()
df["email"] = df["email"].str.lower()

print(df)

          name              email
      Alice  alice@example.com
        Bob    bob@example.com
Carol Smith  carol@example.com

The .str accessor is one of the most useful patterns in data cleaning — it applies any string method across an entire column without writing a loop.

Try It 19.4 — Create a DataFrame with a column of messy city names (inconsistent casing, extra spaces). Clean the column using .str.strip() and .str.title(). Add a second column showing the length of each cleaned name using .str.len().

19.7 Adding and Transforming Columns

Adding a New Column

Assign to a new column name:

df["monthly_salary"] = df["salary"] / 12
df["senior"] = df["salary"] > 80000
print(df[["name", "salary", "monthly_salary", "senior"]])

    name  salary  monthly_salary  senior
Alice   85000     7083.333333    True
  Bob   72000     6000.000000   False
Carol   91000     7583.333333    True
David   65000     5416.666667   False
  Eve   78000     6500.000000   False

`apply()` — Custom Transformations

apply() runs a function over each row or column:

def salary_band(salary):
    if salary >= 90000:
        return "Senior"
    elif salary >= 75000:
        return "Mid"
    else:
        return "Junior"

df["band"] = df["salary"].apply(salary_band)
print(df[["name", "salary", "band"]])

    name  salary    band
Alice   85000     Mid
  Bob   72000  Junior
Carol   91000  Senior
David   65000  Junior
  Eve   78000     Mid

`np.where()` for Simple Conditionals

For binary (two-option) conditions, np.where is faster than apply:

df["status"] = np.where(df["active"], "Active", "Inactive")

19.8 Sorting and Grouping

`sort_values()`

# Sort by salary descending
print(df.sort_values("salary", ascending=False)[["name", "salary"]])

    name  salary
Carol   91000
Alice   85000
  Eve   78000
  Bob   72000
David   65000

`groupby()` — The Most Powerful Pandas Operation

groupby() splits a DataFrame into groups, applies a function to each group, and combines the results.

groupby() flow:

  DataFrame
      │
      ├── SPLIT by "department"
      │       Engineering: [Alice, Carol]
      │       Finance:     [Bob, Eve]
      │       HR:          [David]
      │
      ├── APPLY function (mean, sum, count...)
      │
      └── COMBINE into result

summary = df.groupby("department")["salary"].mean()
print(summary)

department
Engineering    88000.0
Finance        75000.0
HR             65000.0
Name: salary, dtype: float64

Multiple aggregations at once with .agg():

summary = df.groupby("department")["salary"].agg(
    count="count",
    mean="mean",
    max="max"
)
print(summary)

             count     mean    max
department
Engineering      2  88000.0  91000
Finance          2  75000.0  78000
HR               1  65000.0  65000

Try It 19.5 — Using a DataFrame of your choice with at least one categorical and one numeric column, group by the categorical column and compute the sum, mean, and count of the numeric column in a single .agg() call.

19.9 Merging DataFrames

Pandas merge() combines two DataFrames on a shared column — exactly like a SQL JOIN.

employees = pd.DataFrame({
    "id":         [1, 2, 3, 4],
    "name":       ["Alice", "Bob", "Carol", "David"],
    "dept_id":    [10, 20, 10, 30]
})

departments = pd.DataFrame({
    "dept_id":   [10, 20, 40],
    "dept_name": ["Engineering", "Finance", "Marketing"]
})

Merge Types

`how=`	Returns
`"inner"`	Only rows with a match in both tables
`"left"`	All rows from left, matched from right
`"right"`	All rows from right, matched from left
`"outer"`	All rows from both, NaN where no match

# Inner join — only matched rows
inner = pd.merge(employees, departments, on="dept_id", how="inner")
print(inner)

   id   name  dept_id    dept_name
 1  Alice       10  Engineering
 3  Carol       10  Engineering
 2    Bob       20      Finance

# Left join — all employees, NaN if no department match
left = pd.merge(employees, departments, on="dept_id", how="left")
print(left)

   id   name  dept_id    dept_name
 1  Alice       10  Engineering
 2    Bob       20      Finance
 3  Carol       10  Engineering
 4  David       30          NaN

David has no matching department — his dept_name is NaN.

19.10 Exporting Data

# Save to CSV
df.to_csv("output.csv", index=False)     # index=False prevents writing row numbers

# Save to JSON
df.to_json("output.json", orient="records", indent=4)

orient="records" exports as a list of dictionaries — the format most APIs and downstream tools expect.

19.11 Putting It Together

Ayan receives a raw sales CSV every Monday. Before loading it to the warehouse, he runs a cleaning and profiling script.

import pandas as pd
import numpy as np

# --- Load ---
df = pd.read_csv("weekly_sales.csv")
print(f"Loaded: {df.shape[0]} rows, {df.shape[1]} columns\n")

# --- Inspect ---
print("Missing values:")
print(df.isnull().sum(), "\n")

# --- Clean ---
df["product"]  = df["product"].str.strip().str.title()
df["region"]   = df["region"].str.strip().str.upper()
df["revenue"]  = df["revenue"].fillna(df["revenue"].median())
df = df.dropna(subset=["product"])

# --- Transform ---
df["tax"]       = (df["revenue"] * 0.17).round(2)
df["total"]     = df["revenue"] + df["tax"]
df["tier"]      = np.where(df["revenue"] >= 10000, "High", "Standard")

# --- Aggregate ---
summary = df.groupby("region")["total"].agg(
    orders="count",
    total_revenue="sum",
    avg_order="mean"
).round(2)

print("Regional Summary:")
print(summary)

# --- Export ---
df.to_csv("weekly_sales_clean.csv", index=False)
print(f"\nExported {len(df)} cleaned rows.")

Loaded: 847 rows, 5 columns

Missing values:
product      3
region       0
revenue     12
date         0
dtype: int64

Regional Summary:
        orders  total_revenue  avg_order
region
EAST       213      2847492.5    13369.4
NORTH      189      2341087.2    12386.7
SOUTH      224      3102843.6    13851.1
WEST       218      2761930.8    12669.4

Exported 844 cleaned rows.

Load → inspect → clean → transform → aggregate → export. This is the shape of almost every data cleaning script in production.

19.12 Real-World Analysis: The Titanic Dataset

The Titanic dataset is one of the most studied datasets in data science. It contains information about 891 passengers on the RMS Titanic — who they were, their ticket class, their age, how much they paid, and whether they survived. It is an ideal dataset for practising every Pandas skill from this chapter on real, messy data.

Download the dataset from Kaggle: https://www.kaggle.com/competitions/titanic/data

Download train.csv and save it as titanic.csv in your working directory.

The Columns

Column	Description
`PassengerId`	Unique ID
`Survived`	0 = No, 1 = Yes
`Pclass`	Ticket class — 1 (First), 2 (Second), 3 (Third)
`Name`	Passenger name
`Sex`	male / female
`Age`	Age in years
`SibSp`	Number of siblings/spouses aboard
`Parch`	Number of parents/children aboard
`Ticket`	Ticket number
`Fare`	Passenger fare
`Cabin`	Cabin number
`Embarked`	Port — C (Cherbourg), Q (Queenstown), S (Southampton)

Step 1 — Load and First Look

import pandas as pd
import numpy as np

df = pd.read_csv("titanic.csv")

print(df.shape)
print(df.head())

(891, 12)

   PassengerId  Survived  Pclass  \
          1         0       3
          2         1       1
          3         1       3
          4         1       1
          5         0       3

                                                Name     Sex   Age  SibSp  \
                          Braund, Mr. Owen Harris    male  22.0      1
Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
                           Heikkinen, Miss. Laina  female  26.0      0
     Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
                         Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
    0         A/5 21171   7.2500   NaN        S
    0          PC 17599  71.2833   C85        C
    0  STON/O2. 3101282   7.9250   NaN        S
    0            113803  53.1000  C123        S
    0            373450   8.0500   NaN        S

891 passengers. Already you can see Cabin and Age have NaN values.

Step 2 — Understand the Data

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
10   Cabin        204 non-null    object
11   Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)

print(df.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008
std     257.353842    0.486592    0.836071   14.526497    1.102743
min       1.000000    0.000000    1.000000    0.420000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000
50%     446.000000    0.000000    3.000000   28.000000    0.000000
75%     668.500000    1.000000    3.000000   38.000000    1.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000

The mean of Survived is 0.38 — meaning 38% of passengers survived.

Step 3 — Missing Values

missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)

print(pd.DataFrame({"missing": missing, "percent": missing_pct}))

             missing  percent
PassengerId        0      0.0
Survived           0      0.0
Pclass             0      0.0
Name               0      0.0
Sex                0      0.0
Age              177     19.9
SibSp              0      0.0
Parch              0      0.0
Ticket             0      0.0
Fare               0      0.0
Cabin            687     77.1
Embarked           2      0.2

Three columns have missing values:

Age — 20% missing. Fill with median (preferred over mean for skewed data).
Cabin — 77% missing. Too many gaps to be useful. Drop the column.
Embarked — only 2 missing. Fill with the most common port.

Step 4 — Clean the Data

A Pandas Series stores its values as a NumPy array underneath. Calling .values exposes that array directly — which means every NumPy function works on it.

# The Age column is backed by a NumPy array
age_array = df["Age"].values
print(type(age_array))
print(age_array[:8])

<class 'numpy.ndarray'>
[22.  38.  26.  35.  35.   nan  54.   2. ]

Use np.nanmedian() to compute the median while ignoring NaN values — more explicit than Pandas' .median() and directly from NumPy:

age_median = np.nanmedian(df["Age"].values)
print(f"Age median (NumPy): {age_median}")

Age median (NumPy): 28.0

Now clean all three columns:

# Fill Age with NumPy-computed median
df["Age"] = df["Age"].fillna(age_median)

# Drop Cabin — 77% missing, not salvageable
df = df.drop(columns=["Cabin"])

# Fill the two missing Embarked values with the most common port
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Verify
print(df.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

Clean. Now the analysis can begin.

Step 5 — Overall Survival Rate

total      = len(df)
survived   = df["Survived"].sum()
not_surv   = total - survived
surv_rate  = (survived / total * 100).round(1)

print(f"Total passengers : {total}")
print(f"Survived         : {survived} ({surv_rate}%)")
print(f"Did not survive  : {not_surv} ({100 - surv_rate}%)")

Total passengers : 891
Survived         : 342 (38.4%)
Did not survive  : 549 (61.6%)

Only 38 out of every 100 passengers survived.

Step 6 — Survival by Gender

One of the most striking patterns in the dataset.

gender_surv = df.groupby("Sex")["Survived"].agg(
    total="count",
    survived="sum"
)
gender_surv["survival_rate"] = (
    gender_surv["survived"] / gender_surv["total"] * 100
).round(1)

print(gender_surv)

        total  survived  survival_rate
Sex
female    314       233           74.2
male      577       109           18.9

74% of women survived. Only 19% of men did. The "women and children first" protocol was real and visible in the data.

Step 7 — Survival by Passenger Class

class_surv = df.groupby("Pclass")["Survived"].agg(
    total="count",
    survived="sum"
)
class_surv["survival_rate"] = (
    class_surv["survived"] / class_surv["total"] * 100
).round(1)
class_surv.index = ["1st Class", "2nd Class", "3rd Class"]

print(class_surv)

           total  survived  survival_rate
1st Class    216       136           63.0
2nd Class    184        87           47.3
3rd Class    491       119           24.2

First-class passengers had more than twice the survival rate of third-class. Access to lifeboats was unequal.

Step 8 — Survival by Gender and Class Combined

combined = df.pivot_table(
    values="Survived",
    index="Pclass",
    columns="Sex",
    aggfunc="mean"
).round(3) * 100

combined.index = ["1st Class", "2nd Class", "3rd Class"]
print(combined)

Sex         female   male
1st Class     96.8   36.9
2nd Class     92.1   15.7
3rd Class     50.0   13.5

A first-class woman had a 96.8% chance of survival. A third-class man had a 13.5% chance. The intersection of gender and class tells a much sharper story than either variable alone.

Step 9 — Age Analysis

surv_age   = df[df["Survived"] == 1]["Age"]
nosurv_age = df[df["Survived"] == 0]["Age"]

print("Age — Survivors:")
print(f"  Mean:   {surv_age.mean():.1f}")
print(f"  Median: {surv_age.median():.1f}")

print("\nAge — Did Not Survive:")
print(f"  Mean:   {nosurv_age.mean():.1f}")
print(f"  Median: {nosurv_age.median():.1f}")

# Children under 10
children = df[df["Age"] < 10]
child_surv_rate = (children["Survived"].sum() / len(children) * 100).round(1)
print(f"\nChildren under 10 survival rate: {child_surv_rate}%")

Age — Survivors:
  Mean:   28.3
  Median: 28.0

Age — Did Not Survive:
  Mean:   30.6
  Median: 28.0

Children under 10 survival rate: 59.4%

Younger passengers had a slight survival advantage. Children under 10 survived at nearly 60%.

Use NumPy to understand the full age distribution through percentiles:

age_arr = df["Age"].values   # extract the underlying NumPy array

percentiles = np.percentile(age_arr, [10, 25, 50, 75, 90])
labels      = ["10th", "25th", "50th", "75th", "90th"]

print("Age percentile distribution:")
for label, val in zip(labels, percentiles):
    print(f"  {label} percentile : {val:.1f} years")

print(f"\n  Std deviation : {np.std(age_arr):.1f} years")
print(f"  Variance      : {np.var(age_arr):.1f}")

Age percentile distribution:
  10th percentile : 14.0 years
  25th percentile : 22.0 years
  50th percentile : 28.0 years
  75th percentile : 36.0 years
  90th percentile : 50.0 years

  Std deviation : 13.0 years
  Variance      : 169.0

10% of passengers were under 14. Half were under 28. Now use np.where() to label passengers into age groups — no loop needed:

df["age_group"] = np.where(
    df["Age"] < 18, "Child",
    np.where(df["Age"] < 40, "Adult", "Senior")
)

group_surv = df.groupby("age_group")["Survived"].agg(
    total="count",
    survived="sum"
)
group_surv["rate"] = (
    group_surv["survived"] / group_surv["total"] * 100
).round(1)

print(group_surv)

            total  survived  rate
age_group
Adult         551       215  39.0
Child         139        70  50.4
Senior        201        57  28.4

Children had the highest survival rate at 50.4%. Seniors the lowest at 28.4%. The nested np.where() built the entire column in one line — the same pattern works on NumPy arrays and Pandas Series identically.

Step 10 — Fare Analysis

print(f"Mean fare  : £{df['Fare'].mean():.2f}")
print(f"Median fare: £{df['Fare'].median():.2f}")
print(f"Max fare   : £{df['Fare'].max():.2f}")
print()

fare_by_class = df.groupby("Pclass")["Fare"].mean().round(2)
fare_by_class.index = ["1st Class", "2nd Class", "3rd Class"]
print("Average fare by class:")
print(fare_by_class)

Mean fare  : £32.20
Median fare: £14.45
Max fare   : £512.33

Average fare by class:
1st Class    84.15
2nd Class    20.66
3rd Class    13.68
dtype: float64

The mean (£32) is far above the median (£14) — the distribution is right-skewed. A small number of very expensive first-class tickets pull the mean up.

Use NumPy to see the full distribution through percentiles and confirm the skew:

fare_arr = df["Fare"].values

print("Fare percentile breakdown:")
for p in [25, 50, 75, 90, 95, 99]:
    val = np.percentile(fare_arr, p)
    print(f"  {p:>3}th percentile : £{val:.2f}")

print(f"\n  Std deviation : £{np.std(fare_arr):.2f}")

Fare percentile breakdown:
  25th percentile : £7.91
  50th percentile : £14.45
  75th percentile : £31.39
  90th percentile : £77.96
  95th percentile : £112.08
  99th percentile : £249.06

The top 1% paid over £249. The bottom 25% paid under £8. This extreme spread is what makes the mean misleading here.

Apply a log transformation using np.log1p() to compress the skew. log1p means log(1 + x) — the +1 handles zero values safely:

df["fare_log"] = np.log1p(df["Fare"].values)

print(f"Original fare — mean: £{df['Fare'].mean():.2f},  std: £{np.std(df['Fare'].values):.2f}")
print(f"Log fare      — mean: {df['fare_log'].mean():.3f}, std: {np.std(df['fare_log'].values):.3f}")

Original fare — mean: £32.20,  std: £49.69
Log fare      — mean: 2.964,   std: 1.020

The log-transformed fare has a much tighter spread (std of 1.0 vs 49.7). This is a standard preprocessing step before feeding data into machine learning models.

Now use np.corrcoef() to check how strongly each numeric feature correlates with survival:

numeric_cols = ["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]
data_matrix  = df[numeric_cols].values.T    # transpose: each row = one variable

corr_matrix  = np.corrcoef(data_matrix)
surv_corr    = corr_matrix[0]               # row 0 = correlations with Survived

print("Correlation with Survival:")
for col, corr in zip(numeric_cols, surv_corr):
    bar = "█" * int(abs(corr) * 20)
    direction = "+" if corr >= 0 else "-"
    print(f"  {col:<12} {direction}{abs(corr):.3f}  {bar}")

Correlation with Survival:
  Survived      +1.000  ████████████████████
  Pclass        -0.338  ██████
  Age           -0.077  █
  SibSp         -0.035  
  Parch         +0.082  █
  Fare          +0.257  █████

Pclass has the strongest negative correlation with survival — higher class number (3rd) means lower survival. Fare has the strongest positive correlation — paying more meant a better chance. Age is weakly negative — older passengers were slightly less likely to survive.

Step 11 — Export the Cleaned Dataset

df.to_csv("titanic_clean.csv", index=False)
print(f"Exported {len(df)} rows to titanic_clean.csv")

Exported 891 rows to titanic_clean.csv

What This Analysis Used

Every major concept from Chapters 18 and 19 appeared in this walkthrough:

Concept	Where used
`pd.read_csv()`	Step 1
`head()`, `info()`, `describe()`, `shape`	Steps 1–2
`isnull().sum()`	Step 3
`fillna()`, `drop()`, `mode()`	Step 4
`.values` — accessing the NumPy array	Step 4
`np.nanmedian()`	Step 4
Boolean filtering	Steps 5, 9
`groupby().agg()`	Steps 6, 7, 10
`pivot_table()`	Step 8
`np.percentile()`	Steps 9, 10
`np.std()`, `np.var()`	Step 9
`np.where()` — vectorized age grouping	Step 9
`np.log1p()` — log transformation	Step 10
`np.corrcoef()` — correlation matrix	Step 10
`to_csv()`	Step 11

The NumPy operations did not replace Pandas — they extended it. Pandas organized and filtered the data. NumPy computed the statistics, transformed distributions, and built columns without a single loop.

Summary

Pandas provides two core structures. A Series is a 1D labeled array — one column of data with an index. A DataFrame is a 2D labeled table — the primary structure for all data work. Load data with pd.read_csv() and explore it with shape, dtypes, info(), and describe(). Select columns with df["col"] or df[["col1","col2"]]; select rows with .loc[] (by label) or .iloc[] (by position); filter rows with boolean conditions. Handle missing data with isnull(), fillna(), and dropna() — remember most methods return a new DataFrame, not a modified one. Clean string columns with the .str accessor. Add columns by direct assignment or apply() for custom logic. Sort with sort_values(); aggregate with groupby().agg(). Merge two DataFrames like a SQL JOIN using merge() with how="inner", "left", "right", or "outer". Export with to_csv() or to_json().

Exercises

19.1 — Load a CSV file (or create a DataFrame from a dictionary with 8+ rows). Print its shape, dtypes, and the count of unique values in each column using nunique(). Then print the five rows with the highest numeric value in one column.

19.2 — Create a DataFrame with a salary column containing two None values. Use isnull().sum() to confirm. Fill missing values with the column mean. Confirm no missing values remain. Print the before and after.

19.3 — The following code is intended to clean a column but does not work as expected. Explain why and fix it:

df["city"].fillna("Unknown")
df["city"] = df["city"].str.title()
print(df["city"])

19.4 — From a DataFrame of five employees with name, department, salary, and years_experience columns: filter to employees with more than 3 years experience, add a bonus column equal to 10% of salary, and sort by bonus descending. Print the result.

19.5 — Create two DataFrames: one of orders with order_id, customer_id, and amount; one of customers with customer_id and name. Perform a left merge. Print how many rows have a NaN customer name (orders with no matching customer).

19.6 — Group a sales DataFrame by both region and product_category simultaneously using groupby(["region","product_category"]). Compute total revenue and order count for each combination. Which combination has the highest average order value?

19.7 — Write a complete data cleaning function clean_dataframe(df) that: strips and title-cases all string columns, fills numeric missing values with the column median, drops rows where more than half the columns are missing, and returns the cleaned DataFrame. Test it on a deliberately messy DataFrame.

19.8 — Think About It: groupby() in Pandas uses the Split-Apply-Combine pattern. SQL's GROUP BY does the same thing. What does Pandas give you that SQL does not, and what does SQL give you that Pandas does not? When would you clean and group data in Pandas before loading to a database, and when would you let the database do the grouping instead?

19.1 Series — The Building Block​

19.2 DataFrame — The Core Structure​

Creating from a Dictionary​

DataFrame Anatomy​

Reading from a CSV​

19.3 Exploring a DataFrame​

Exploration Methods at a Glance​

19.4 Selecting Data​

Selecting Columns​

Selecting Rows — .loc[] and .iloc[]​

Filtering Rows with Conditions​

19.5 Handling Missing Data​

Finding Missing Values​

Handling Strategies​

19.6 Data Cleaning​

Renaming Columns​

Changing Data Types​

String Cleaning with .str​

19.7 Adding and Transforming Columns​

Adding a New Column​

apply() — Custom Transformations​

np.where() for Simple Conditionals​

19.8 Sorting and Grouping​

sort_values()​

groupby() — The Most Powerful Pandas Operation​

19.9 Merging DataFrames​

Merge Types​

19.10 Exporting Data​

19.11 Putting It Together​

19.12 Real-World Analysis: The Titanic Dataset​

The Columns​

Step 1 — Load and First Look​

Step 2 — Understand the Data​

Step 3 — Missing Values​

Step 4 — Clean the Data​

Step 5 — Overall Survival Rate​

Step 6 — Survival by Gender​

Step 7 — Survival by Passenger Class​

Step 8 — Survival by Gender and Class Combined​

Step 9 — Age Analysis​

Step 10 — Fare Analysis​

Step 11 — Export the Cleaned Dataset​

What This Analysis Used​

Summary​

Exercises​