# Cleaning & Transforming Real Datasets

Real-world data is messy. By commonly cited estimates, data cleaning takes 60–80% of a data scientist's time. Here's how to tackle it efficiently with pandas.

## Loading Data

```python
import pandas as pd

df = pd.read_csv("students.csv")
print(df.shape)          # e.g. (1200, 8) → 1200 rows, 8 columns
print(df.dtypes)         # Column data types
print(df.describe())     # Statistical summary
print(df.head())         # First 5 rows
```
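
To try the inspection calls without a file on disk, here is a minimal sketch that reads a small CSV from an in-memory string. The CSV text and its columns are made up for illustration and are not the contents of `students.csv`.

```python
import io

import pandas as pd

# Hypothetical sample data, invented for this example.
csv_text = """name,score,enrolled
Ana,91.5,True
Ben,,False
"""

# read_csv accepts any file-like object, so StringIO stands in for a file.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # the empty score cell becomes NaN, so score is float
sample.info()         # non-null counts per column, plus memory usage
```

Note that the blank `score` for Ben is read as a missing value rather than as an empty string, which is why checking for nulls is the natural next step.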

## Handling Missing Values

```python
# Check for nulls
print(df.isnull().sum())

# Drop rows with any null
df_clean = df.dropna()

# Fill nulls with a value
df["score"] = df["score"].fillna(df["score"].median())

# Forward-fill (useful for time series)
df["price"] = df["price"].ffill()
```
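
The three strategies give quite different results on the same data. The following sketch compares them on a tiny made-up frame; the values are invented for illustration, and only the column names echo the snippet above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in both columns.
toy = pd.DataFrame({
    "score": [90.0, np.nan, 70.0, 80.0],
    "price": [10.0, np.nan, np.nan, 13.0],
})

null_counts = toy.isnull().sum()        # number of nulls per column
dropped = toy.dropna()                  # keeps only rows with no nulls at all
median_filled = toy["score"].fillna(toy["score"].median())
forward_filled = toy["price"].ffill()   # repeats the last observed value

print(null_counts.to_dict())
print(len(dropped), "of", len(toy), "rows survive dropna()")
print(median_filled.tolist())
print(forward_filled.tolist())
```

Dropping is the bluntest option: here it discards half the rows, while filling keeps every row at the cost of introducing estimated values.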

## Filtering & Selecting

```python
# Boolean filter
high_scorers = df[df["score"] > 80]

# Multiple conditions: combine with & and wrap each condition in parentheses
# (assumes "enrolled" is already a boolean column)
top = df[(df["score"] > 80) & df["enrolled"]]

# Select columns
subset = df[["name", "score", "grade"]]

# Query method (readable!)
rich_query = df.query("score > 80 and age < 30")
```
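
As a quick check that the boolean-mask form and `query()` select the same rows, here is a small sketch on invented data. The names and values are hypothetical; only the column names follow the snippet above.

```python
import pandas as pd

# Hypothetical students, invented for this example.
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "score": [95, 85, 60, 88],
    "age": [22, 35, 28, 29],
    "enrolled": [True, False, True, True],
})

# Each condition needs its own parentheses because & binds tighter than > or <.
mask_result = people[(people["score"] > 80) & (people["age"] < 30)]

# The same filter written as a query string.
query_result = people.query("score > 80 and age < 30")

# ~ negates a boolean mask.
not_enrolled = people[~people["enrolled"]]

print(mask_result["name"].tolist())
print(mask_result.equals(query_result))
print(not_enrolled["name"].tolist())
```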

## Applying Transformations

```python
# Standardize a column (z-score). Vectorized arithmetic computes the mean and
# std once; a per-element .apply() with a lambda would recompute them per row.
score_mean = df["score"].mean()
score_std = df["score"].std()
df["score_normalized"] = (df["score"] - score_mean) / score_std

# Map categories to numbers
grade_map = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
df["gpa_points"] = df["grade"].map(grade_map)

# String operations
df["email_domain"] = df["email"].str.split("@").str[1]
```
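
Both `map()` and the `.str` accessor fail quietly: a value with no match becomes `NaN` instead of raising an error. The sketch below shows this on made-up data that deliberately includes a grade missing from `grade_map` and an email without an `@`.

```python
import pandas as pd

# Hypothetical rows, invented to trigger the silent-NaN cases.
records = pd.DataFrame({
    "grade": ["A", "B", "E", "F"],
    "email": ["a@x.edu", "b@y.org", "no-at-sign", "d@x.edu"],
})

grade_map = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
records["gpa_points"] = records["grade"].map(grade_map)

# Grades that were not in the mapping end up as NaN; list them explicitly.
unmapped = records.loc[records["gpa_points"].isna(), "grade"].unique().tolist()

# Splitting on "@" yields a one-element list when there is no "@",
# so taking element 1 gives NaN for that row.
records["email_domain"] = records["email"].str.split("@").str[1]

print(unmapped)
print(records[["gpa_points", "email_domain"]])
```

Checking for unexpected nulls right after a transformation is a cheap way to catch categories or formats the mapping did not anticipate.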

## GroupBy Aggregations

```python
# Average score by grade
df.groupby("grade")["score"].mean()

# Multiple aggregations
summary = df.groupby("city").agg({
    "score": ["mean", "std", "count"],
    "age":   "median",
})
```
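
The dictionary form above produces two-level column labels such as `("score", "mean")`. If flat column names are easier to work with, pandas' named aggregation is one alternative; the sketch below uses invented data, and the output column names are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical data, invented for this example.
cities = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
    "score": [80, 90, 70, 60],
    "age": [20, 30, 40, 50],
})

# Named aggregation: output_name=(input_column, aggregation).
flat_summary = cities.groupby("city").agg(
    score_mean=("score", "mean"),
    score_count=("score", "count"),
    age_median=("age", "median"),
)

print(flat_summary)
```

Each keyword becomes a single-level column, so the result can be filtered or merged without handling a column MultiIndex.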

Rule of thumb: Always inspect your data with head(), info(), and describe() before doing anything. Assumptions about data are almost always wrong.
