Real-world data is messy. Data cleaning is often estimated to take 60–80% of a data scientist's time. Here's how to tackle it efficiently with Pandas.
import pandas as pd

df = pd.read_csv("students.csv")
print(df.shape)       # (1200, 8) → 1200 rows, 8 columns
print(df.dtypes)      # Column data types
print(df.describe())  # Statistical summary
print(df.head())      # First 5 rows

# Check for nulls
print(df.isnull().sum())

# Drop rows with any null
df_clean = df.dropna()

# Fill nulls with a value
df["score"] = df["score"].fillna(df["score"].median())

# Forward-fill (useful for time series)
df["price"] = df["price"].ffill()

# Boolean filter
high_scorers = df[df["score"] > 80]

# Multiple conditions
top = df[(df["score"] > 80) & (df["enrolled"] == True)]

# Select columns
subset = df[["name", "score", "grade"]]

# Query method (readable!)
rich_query = df.query("score > 80 and age < 30")

# Apply a function to a column
# Compute the mean and std once rather than once per row inside the lambda
score_mean, score_std = df["score"].mean(), df["score"].std()
df["score_normalized"] = df["score"].apply(lambda x: (x - score_mean) / score_std)

# Map categories to numbers (grades missing from the map become NaN)
grade_map = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
df["gpa_points"] = df["grade"].map(grade_map)

# String operations
df["email_domain"] = df["email"].str.split("@").str[1]

# Average score by grade
df.groupby("grade")["score"].mean()

# Multiple aggregations
summary = df.groupby("city").agg({
    "score": ["mean", "std", "count"],
    "age": "median",
})

Rule of thumb: Always inspect your data with head(), info(), and describe() before doing anything. Assumptions about data are almost always wrong.
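
To make the rule of thumb concrete, here is a minimal sketch of an inspect-first pass. The three-row sample, its values, and the `io.StringIO` stand-in for a real file are all assumptions made for illustration; they are not taken from `students.csv`.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a real CSV file; the values are made up.
raw = io.StringIO(
    "name,score,age\n"
    "Ana,91,21\n"
    "Ben,,19\n"
    "Chloe,78,not recorded\n"
)
df = pd.read_csv(raw)

print(df.head())      # eyeball the first rows
df.info()             # dtypes and non-null counts (prints directly)
print(df.describe())  # summary statistics, numeric columns only by default

# What the inspection surfaces for this sample:
missing_scores = int(df["score"].isnull().sum())         # one blank score
age_is_numeric = pd.api.types.is_numeric_dtype(df["age"])  # text slipped in
described_columns = list(df.describe().columns)          # "age" is absent

# One possible repair: coerce unparseable ages to NaN, then treat them
# like any other missing value.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
missing_ages = int(df["age"].isnull().sum())
```

In this sample, `info()` shows a non-null count below the row count for `score`, and `age` never appears in the `describe()` output because a stray text value kept the column from being parsed as numeric. Both problems are easy to miss if you jump straight to filtering or aggregating.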