
Chapter 18: NumPy

Python lists are flexible: they can hold mixed types, grow dynamically, and do most things you need. But that flexibility has a cost. When you need to multiply a million numbers by two, Python loops through each one, checks its type, does the operation, and moves to the next. On large datasets that is slow.

NumPy solves this with a single idea: store values of a single type in one contiguous block of memory, and let compiled C code operate on the whole block at once. Work that takes a Python loop seconds can take NumPy milliseconds. Pandas, the data engineering workhorse in the next chapter, is built largely on NumPy arrays underneath.

import numpy as np
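To make the speed claim concrete, here is a rough timing sketch. The variable names and the array size of one million are illustrative choices, and the exact timings will differ from machine to machine, so treat the numbers as indicative only.

```python
import time
import numpy as np

values = list(range(1_000_000))      # plain Python list
array = np.arange(1_000_000)         # equivalent NumPy array

# Pure-Python approach: visit each element one at a time
start = time.perf_counter()
doubled_list = [v * 2 for v in values]
loop_seconds = time.perf_counter() - start

# NumPy approach: one vectorized operation over the whole block
start = time.perf_counter()
doubled_array = array * 2
numpy_seconds = time.perf_counter() - start

print(f"Python loop: {loop_seconds:.4f} s")
print(f"NumPy:       {numpy_seconds:.4f} s")
```

Both approaches produce the same values; only the time taken differs.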

18.1 What Is an ndarray?

NumPy's core data structure is the ndarray — an N-dimensional array. Think of it as a grid of values, all the same type, with a fixed size.

Python List vs NumPy Array

| | Python List | NumPy ndarray |
|---|---|---|
| Types | Mixed ([1, "a", True]) | All the same (int64, float64) |
| Memory | Scattered: each element stored separately | Contiguous: one block |
| Arithmetic | Loop required | Operates on whole array at once |
| Speed on large data | Slow | Fast (C under the hood) |
| Size | Dynamic: grows freely | Fixed at creation |

import numpy as np

# Python list
py_list = [1, 2, 3, 4, 5]

# NumPy array
np_array = np.array([1, 2, 3, 4, 5])

print(py_list)
print(np_array)
print(type(np_array))
[1, 2, 3, 4, 5]
[1 2 3 4 5]
<class 'numpy.ndarray'>

Note the difference in how they print — Python lists use commas between elements, NumPy arrays do not.
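The printing difference is cosmetic, but the behavioural differences are not. The short sketch below (with illustrative variable names) shows two of them: the * operator means repetition for a list and multiplication for an array, and mixed numeric input is converted to one common type.

```python
import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

repeated = py_list * 2      # list repetition: [1, 2, 3, 1, 2, 3]
doubled = np_array * 2      # element-wise multiplication: [2 4 6]

# Mixed input is converted to one common type rather than kept as-is
mixed = np.array([1, 2.5, 3])   # the integers become floats

print(repeated)
print(doubled)
print(mixed, mixed.dtype)
```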


18.2 Creating Arrays

From a Python List

scores = np.array([88, 92, 74, 95, 61])
prices = np.array([29.99, 14.50, 99.00, 4.75])
print(scores)
print(prices)
[88 92 74 95 61]
[29.99 14.5 99. 4.75]

Built-in Creation Functions

NumPy provides functions to create common array patterns without typing out every value.

print(np.zeros(5)) # five zeros
print(np.ones(4)) # four ones
print(np.arange(0, 10, 2)) # 0 to 9, step 2
print(np.linspace(0, 1, 5)) # 5 evenly spaced values from 0 to 1
[0. 0. 0. 0. 0.]
[1. 1. 1. 1.]
[0 2 4 6 8]
[0. 0.25 0.5 0.75 1. ]
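One detail worth checking for yourself is how the two range functions treat the stop value. In this small sketch (variable names are illustrative), np.arange excludes the stop value while np.linspace includes it by default.

```python
import numpy as np

steps = np.arange(0, 1, 0.25)      # stop value 1 is excluded
points = np.linspace(0, 1, 5)      # stop value 1 is included

print(steps)     # four values, ending at 0.75
print(points)    # five values, ending at 1.0
```

As a rule of thumb, reach for np.arange when you know the step and np.linspace when you know how many values you want.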

Creation Functions Reference

| Function | What it creates | Example |
|---|---|---|
| np.array(list) | Array from existing data | np.array([1, 2, 3]) |
| np.zeros(n) | n zeros | np.zeros(4) |
| np.ones(n) | n ones | np.ones(3) |
| np.arange(start, stop, step) | Range of values (stop excluded) | np.arange(0, 10, 2) |
| np.linspace(start, stop, n) | n evenly spaced values (stop included) | np.linspace(0, 1, 5) |
| np.random.randint(low, high, n) | n random integers (high excluded) | np.random.randint(1, 100, 5) |
| np.random.rand(n) | n random floats in [0, 1) | np.random.rand(4) |
| np.full(n, value) | n copies of a value | np.full(3, 7) |

print(np.random.randint(1, 100, 5)) # 5 random integers between 1 and 99
print(np.full(4, 99)) # array of four 99s
[43 71 18 62 29]
[99 99 99 99]

Try It 18.1 — Create four arrays: one from a list of six temperatures, one of ten zeros, one using arange from 5 to 50 in steps of 5, and one of five random integers between 0 and 100. Print all four.


18.3 Array Attributes

Every array carries metadata about itself. Four attributes you will check constantly:

data = np.array([
[10, 20, 30, 40],
[50, 60, 70, 80],
[90, 100, 110, 120]
])

print(data.shape) # (rows, columns)
print(data.dtype) # data type of elements
print(data.ndim) # number of dimensions
print(data.size) # total number of elements
(3, 4)
int64
2
12

Array Anatomy

data = np.array([[10, 20, 30, 40],
[50, 60, 70, 80],
[90,100,110,120]])

data.shape → (3, 4) 3 rows, 4 columns
data.dtype → int64 all values are 64-bit integers
data.ndim → 2 two dimensions (a matrix)
data.size → 12 3 × 4 = 12 total elements

dtype — Why It Matters

NumPy assigns a dtype automatically based on your data (integers default to int64 on most platforms). You can also set it explicitly:

a = np.array([1, 2, 3])
b = np.array([1.0, 2.0, 3.0])
c = np.array([1, 2, 3], dtype=np.float32)

print(a.dtype)
print(b.dtype)
print(c.dtype)
int64
float64
float32

float32 uses half the memory of float64: 4 bytes per element instead of 8. For large datasets this matters, because switching dtype halves the array's memory footprint, at the cost of reduced precision.
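You can measure this directly with the itemsize and nbytes attributes. The sketch below uses a hypothetical array of one million readings; the array name and size are only for illustration.

```python
import numpy as np

readings64 = np.zeros(1_000_000, dtype=np.float64)
readings32 = readings64.astype(np.float32)   # converted copy

print(readings64.itemsize, readings64.nbytes)   # 8 bytes each, 8000000 total
print(readings32.itemsize, readings32.nbytes)   # 4 bytes each, 4000000 total
```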

Try It 18.2 — Create a 2D array with 3 rows and 5 columns of your choice. Print its shape, dtype, ndim, and size. Then create the same array with dtype=float32 and confirm the dtype changed.


18.4 Indexing and Slicing

1D Arrays

Works like Python list slicing — same [start:stop:step] notation.

temps = np.array([22, 25, 19, 30, 28, 24, 21])

print(temps[0]) # first element
print(temps[-1]) # last element
print(temps[2:5]) # elements at index 2, 3, 4
print(temps[::2]) # every other element
22
21
[19 30 28]
[22 19 28 21]

2D Arrays

Use [row, column] notation. Think row first, column second.

grid = np.array([
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]
])
col 0 col 1 col 2 col 3
row 0 [ 1 2 3 4 ]
row 1 [ 5 6 7 8 ]
row 2 [ 9 10 11 12 ]
print(grid[0, 0]) # row 0, col 0 → 1
print(grid[1, 2]) # row 1, col 2 → 7
print(grid[2, -1]) # row 2, last col → 12
print(grid[0, :]) # entire first row
print(grid[:, 1]) # entire second column
print(grid[0:2, 1:3]) # subgrid — rows 0-1, cols 1-2
1
7
12
[1 2 3 4]
[ 2 6 10]
[[2 3]
[6 7]]

⚠ Common Mistake — Slices Are Views, Not Copies When you slice a NumPy array, the result shares memory with the original. Changing the slice changes the original:

a = np.array([1, 2, 3, 4, 5])
b = a[1:4]
b[0] = 99
print(a)
[ 1 99 3 4 5]

Use .copy() when you need an independent copy:

b = a[1:4].copy()

Python list slicing makes copies automatically. NumPy slicing does not.
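If you are unsure whether two arrays are linked, np.shares_memory can tell you. A minimal check, with illustrative variable names:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
view = a[1:4]            # slice: shares memory with a
copy = a[1:4].copy()     # independent copy

print(np.shares_memory(a, view))   # True
print(np.shares_memory(a, copy))   # False

copy[0] = 99             # modifies only the copy
print(a)                 # a is unchanged: [1 2 3 4 5]
```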

Try It 18.3 — Create a 4×4 array using np.arange and reshape. Print the element at row 2, column 3. Print the entire second row. Print the entire third column. Print a 2×2 subgrid from the centre.


18.5 Array Operations

NumPy operations apply to every element at once — no loop needed. This is called vectorization.

Arithmetic on Arrays

prices = np.array([100.0, 250.0, 75.0, 430.0, 90.0])

print(prices * 1.17) # apply 17% tax to every price
print(prices - 10) # discount of 10 from every price
print(prices / 100) # convert to hundreds
[117. 292.5 87.75 503.1 105.3 ]
[ 90. 240. 65. 420. 80. ]
[1. 2.5 0.75 4.3 0.9 ]

Operations Between Two Arrays

revenue = np.array([5000, 8000, 6500, 9200])
costs = np.array([3200, 4100, 3900, 5500])

profit = revenue - costs
margin = (profit / revenue) * 100

print(profit)
print(np.round(margin, 1))
[1800 3900 2600 3700]
[36. 48.8 40. 40.2]

Broadcasting

Broadcasting lets NumPy operate on arrays of different shapes, following a set of rules. The most common case: a scalar applied to an array.

scores = [88, 92, 74, 95, 61]
bonus = 5

scores + bonus → [93, 97, 79, 100, 66]

NumPy "broadcasts" 5 across every element
as if it were [5, 5, 5, 5, 5]
scores = np.array([88, 92, 74, 95, 61])
curved = scores + 5
print(curved)
[93 97 79 100 66]

⚠ Common Mistake — Shape Mismatch Operations between two arrays require compatible shapes. Mismatched shapes raise an error:

a = np.array([1, 2, 3])
b = np.array([1, 2])
print(a + b)
ValueError: operands could not be broadcast together with shapes (3,) (2,)

Check .shape on both arrays before operating on them together.
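Broadcasting also applies between a 2D array and a 1D array, as long as the trailing dimensions line up. The sketch below uses made-up arrays (sales, per_column, per_row) to show one case that works as-is and one that needs a reshape first.

```python
import numpy as np

sales = np.array([
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
])                                      # shape (3, 4)
per_column = np.array([1, 2, 3, 4])     # shape (4,)
per_row = np.array([100, 200, 300])     # shape (3,)

# (3, 4) + (4,): trailing dimensions match, so per_column is added to each row
print(sales + per_column)

# (3, 4) + (3,) would raise ValueError because 4 != 3.
# Reshaping to (3, 1) makes the shapes compatible: one value per row.
print(sales + per_row.reshape(3, 1))
```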

Try It 18.4 — Create an array of ten product prices. Apply a 20% discount to all of them. Then create a second array of ten tax rates and calculate the final price for each product after discount and tax. Print the result rounded to two decimal places.


18.6 Statistical Functions

NumPy has a full set of statistical operations that work across an entire array or along a specific axis.

Core Statistical Functions

data = np.array([42, 67, 19, 85, 33, 71, 55, 48, 90, 26])

print(np.sum(data))
print(np.mean(data))
print(np.median(data))
print(np.min(data))
print(np.max(data))
print(np.std(data))
print(np.var(data))
536
53.6
51.5
19
90
23.12
534.44
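One thing to be aware of: by default np.std and np.var compute the population statistic (dividing by n). If you want the sample version (dividing by n - 1), pass ddof=1. The small dataset below is a made-up example chosen because its population standard deviation comes out to a round number.

```python
import numpy as np

sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # illustrative data, mean is 5

population_std = np.std(sample)          # default ddof=0: divide by n
sample_std = np.std(sample, ddof=1)      # ddof=1: divide by n - 1

print(population_std)            # 2.0
print(round(sample_std, 3))      # 2.138
```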

Function Reference

| Function | Returns |
|---|---|
| np.sum(a) | Total of all elements |
| np.mean(a) | Average |
| np.median(a) | Middle value |
| np.min(a) | Smallest value |
| np.max(a) | Largest value |
| np.std(a) | Standard deviation |
| np.var(a) | Variance |
| np.argmin(a) | Index of the smallest value |
| np.argmax(a) | Index of the largest value |
| np.sort(a) | Sorted copy of array |
| np.unique(a) | Unique values |
| np.cumsum(a) | Cumulative sum |

print(np.sort(data))
print(np.argmax(data)) # index of largest value
print(np.cumsum(data[:4])) # running total of first 4 values
[19 26 33 42 48 55 67 71 85 90]
8
[ 42 109 128 213]

Axis — Rows vs Columns

On 2D arrays, pass axis=0 to operate down columns, axis=1 to operate across rows.

monthly = np.array([
[4200, 3800, 5100], # Q1
[4700, 4100, 5600], # Q2
[3900, 4300, 4800], # Q3
])
# Product A B C

print(np.sum(monthly, axis=0)) # total per product (down columns)
print(np.sum(monthly, axis=1)) # total per quarter (across rows)
print(np.mean(monthly, axis=1)) # average per quarter
[12800 12200 15500]
[13100 14400 13000]
[4366.67 4800. 4333.33]
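A useful way to remember the axis argument is that the axis you name is the one that disappears from the shape. The sketch below uses a small made-up 2×3 array, and also shows the keepdims=True option, which keeps the collapsed axis as length 1 so the result broadcasts against the original.

```python
import numpy as np

sales = np.array([
    [10, 20, 30],
    [40, 50, 60],
])                                          # shape (2, 3)

print(np.sum(sales, axis=0).shape)          # (3,)  rows collapsed
print(np.sum(sales, axis=1).shape)          # (2,)  columns collapsed

# keepdims=True gives shape (2, 1), which broadcasts against (2, 3)
row_totals = np.sum(sales, axis=1, keepdims=True)
share = sales / row_totals                  # each row now sums to 1
print(np.round(share, 3))
```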

Try It 18.5 — Create a 2D array with 4 rows and 3 columns of random integers between 10 and 100. Print the mean, max, and min of the entire array. Then print the sum of each column and the mean of each row.


18.7 Reshaping Arrays

reshape()

Reshaping changes the structure of an array without changing its data. The total number of elements must stay the same.

flat = np.arange(1, 13) # [1, 2, 3, ... 12]
print(flat)
print(flat.shape)

matrix = flat.reshape(3, 4) # 3 rows, 4 columns
print(matrix)
print(matrix.shape)
[ 1 2 3 4 5 6 7 8 9 10 11 12]
(12,)
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]]
(3, 4)
reshape visual:

flat: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

reshape(3, 4):
┌── 1 2 3 4 ──┐
│ 5 6 7 8 │ → same data, new structure
└── 9 10 11 12 ──┘
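Two practical details follow from the "same number of elements" rule. You can pass -1 for one dimension and let NumPy work it out, and a shape that does not fit raises a ValueError. A brief sketch:

```python
import numpy as np

flat = np.arange(1, 13)            # 12 elements

auto = flat.reshape(-1, 4)         # -1 is inferred: 12 / 4 = 3 rows
print(auto.shape)                  # (3, 4)

try:
    flat.reshape(5, 3)             # 5 * 3 = 15, but there are only 12 elements
except ValueError as error:
    print("reshape failed:", error)
```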

flatten()

Returns a 1D copy of any multi-dimensional array:

print(matrix.flatten())
[ 1 2 3 4 5 6 7 8 9 10 11 12]
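flatten() always gives you an independent copy. NumPy also offers ravel(), which is not covered in this chapter; it returns a view when the memory layout allows, so writes can reach the original. The comparison below is a sketch with illustrative names.

```python
import numpy as np

matrix = np.arange(1, 13).reshape(3, 4)

flat_copy = matrix.flatten()    # always a new, independent array
flat_view = matrix.ravel()      # a view here, because matrix is contiguous

flat_copy[0] = -1               # does not touch matrix
print(matrix[0, 0])             # 1

flat_view[0] = -1               # writes through to matrix
print(matrix[0, 0])             # -1
```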

transpose()

Flips rows and columns — rows become columns, columns become rows:

print(matrix.T)
print(matrix.T.shape)
[[ 1 5 9]
[ 2 6 10]
[ 3 7 11]
[ 4 8 12]]
(4, 3)

18.8 Boolean Indexing

Boolean indexing filters an array based on a condition — without a loop. This is one of NumPy's most practically useful features.

scores = np.array([88, 45, 92, 31, 74, 96, 58, 83])

# The condition produces a boolean array
print(scores > 75)
[ True False True False False True False True]

Use that boolean array as an index to filter:

passing = scores[scores > 75]
print(passing)
[88 92 96 83]

Combine conditions with & (and) and | (or):

# Scores between 70 and 90
mid_range = scores[(scores >= 70) & (scores <= 90)]
print(mid_range)
[88 74 83]
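The parentheses around each condition are required, because & and | bind more tightly than comparison operators. The sketch below also shows ~ (not) for inverting a mask, and what happens if you reach for Python's and by mistake.

```python
import numpy as np

scores = np.array([88, 45, 92, 31, 74, 96, 58, 83])

# ~ inverts a mask: everything NOT above 75
not_passing = scores[~(scores > 75)]
print(not_passing)                      # [45 31 74 58]

# | keeps elements matching either condition
extremes = scores[(scores < 40) | (scores > 90)]
print(extremes)                         # [92 31 96]

# Python's `and` asks for a single True/False, which an array cannot give
try:
    scores[(scores >= 70) and (scores <= 90)]
except ValueError as error:
    print("ValueError:", error)
```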

np.where()

np.where(condition, value_if_true, value_if_false) applies a conditional transformation across an entire array:

scores = np.array([88, 45, 92, 31, 74, 96, 58, 83])
labels = np.where(scores >= 75, "Pass", "Fail")
print(labels)
['Pass' 'Fail' 'Pass' 'Fail' 'Fail' 'Pass' 'Fail' 'Pass']
# Replace negative values with 0 (clipping)
raw = np.array([4.2, -1.1, 7.8, -0.3, 5.5])
clipped = np.where(raw > 0, raw, 0)
print(clipped)
[4.2 0. 7.8 0. 5.5]
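For this particular clipping case, NumPy also has a dedicated function, np.clip, which is not otherwise used in this chapter. The sketch below checks that it gives the same result as the np.where version.

```python
import numpy as np

raw = np.array([4.2, -1.1, 7.8, -0.3, 5.5])

with_where = np.where(raw > 0, raw, 0)
with_clip = np.clip(raw, 0, None)      # lower bound 0, no upper bound

print(with_clip)
print(np.array_equal(with_where, with_clip))   # True
```

np.where remains the more general tool, since the replacement values can be anything, not just a bound.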

18.9 Putting It Together

Qasim Hassan runs a quick profiling check on a dataset of monthly server response times before deciding whether to escalate a performance issue.

import numpy as np

# Monthly response times in milliseconds — 12 months of data
response_times = np.array([
142, 158, 201, 189, 134, 167,
223, 198, 145, 177, 210, 163
])

print("=== SERVER RESPONSE TIME ANALYSIS ===\n")

# Basic statistics
print(f"Mean: {np.mean(response_times):.1f} ms")
print(f"Median: {np.median(response_times):.1f} ms")
print(f"Min: {np.min(response_times)} ms")
print(f"Max: {np.max(response_times)} ms")
print(f"Std Dev: {np.std(response_times):.1f} ms")

# Flag months that exceeded the SLA threshold
sla_threshold = 180
breached = response_times[response_times > sla_threshold]
breach_count = len(breached)

print(f"\nSLA threshold: {sla_threshold} ms")
print(f"Breached months: {breach_count} of {len(response_times)}")
print(f"Breach values: {breached}")

# Label each month
labels = np.where(response_times > sla_threshold, "BREACH", "OK")
months = np.array(["Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec"])

print("\nMonthly status:")
for month, time, label in zip(months, response_times, labels):
    print(f" {month}: {time:>4} ms [{label}]")
=== SERVER RESPONSE TIME ANALYSIS ===

Mean: 175.6 ms
Median: 172.0 ms
Min: 134 ms
Max: 223 ms
Std Dev: 27.5 ms

SLA threshold: 180 ms
Breached months: 5 of 12
Breach values: [201 189 223 198 210]

Monthly status:
Jan: 142 ms [OK]
Feb: 158 ms [OK]
Mar: 201 ms [BREACH]
Apr: 189 ms [BREACH]
May: 134 ms [OK]
Jun: 167 ms [OK]
Jul: 223 ms [BREACH]
Aug: 198 ms [BREACH]
Sep: 145 ms [OK]
Oct: 177 ms [OK]
Nov: 210 ms [BREACH]
Dec: 163 ms [OK]

Five of the twelve months exceed the 180 ms threshold: Mar, Apr, Jul, Aug, and Nov.

Every concept from this chapter is here: array creation, statistical functions, boolean indexing, np.where(), and working with two arrays simultaneously.
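One possible extension, sketched here rather than taken from the analysis above: a boolean mask built from one array can index a second array of the same length. That lets you pull out the month names directly, and np.argmax can locate the slowest month.

```python
import numpy as np

months = np.array(["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                   "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
response_times = np.array([142, 158, 201, 189, 134, 167,
                           223, 198, 145, 177, 210, 163])
sla_threshold = 180

mask = response_times > sla_threshold      # one True/False per month
breached_months = months[mask]             # same mask, different array
worst_month = months[np.argmax(response_times)]

print(breached_months)    # ['Mar' 'Apr' 'Jul' 'Aug' 'Nov']
print(worst_month)        # Jul
```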


Summary

NumPy stores values in an ndarray — a fixed-size, same-type, contiguous block of memory that operates at C speed. Arrays are created with np.array(), or generated with np.zeros(), np.ones(), np.arange(), and np.linspace(). Every array has a shape, dtype, ndim, and size. Indexing follows [row, column] for 2D arrays; slices return views, not copies — use .copy() to get an independent array. Vectorized operations apply arithmetic across entire arrays without loops; broadcasting extends scalars to match array shape. Statistical functions — np.mean(), np.sum(), np.std() and others — work on whole arrays or along an axis. Reshaping changes structure without changing data; flatten() collapses dimensions; .T transposes. Boolean indexing filters an array by condition in one expression; np.where() applies conditional transformations element-wise.


Exercises

18.1 — Create an array of 20 random integers between 50 and 150. Print its shape, dtype, min, max, mean, and standard deviation. Then sort it and print the sorted version.

18.2 — Create two 1D arrays of length 6 representing monthly revenue and monthly costs. Calculate profit for each month, total annual profit, and the month with the highest profit using np.argmax(). Print all results.

18.3 — Create a 4×5 array using np.arange and reshape. Print the element at row 3, column 2. Print the entire second row. Print the entire fourth column. Print a 2×3 subgrid from rows 1–2 and columns 2–4.

18.4 — The following code has a bug. Identify it and explain what happens at runtime:

a = np.array([10, 20, 30])
b = np.array([1, 2])
result = a + b
print(result)

18.5 — Create an array of 15 exam scores between 40 and 100. Use boolean indexing to extract all scores above 70. Use np.where() to label each score "Distinction" (≥85), "Pass" (≥50), or "Fail" (below 50) — using nested np.where() calls.

18.6 — Create a 3×4 array of sales figures. Compute: the total sales for each row (use axis=1), the average for each column (use axis=0), and the single highest value across the entire array with its row and column position using np.argmax() and np.unravel_index().

18.7 — Think About It: NumPy slices return views, not copies. This is a deliberate design decision — not a bug. What is the benefit of views over copies when working with arrays that have millions of elements? In what situation does it become a problem, and how do you fix it?