How to Prepare Excel Files for Data Analysis with Pandas

Excel is the world’s most popular data tool, but for advanced analysis and machine learning, we often need to move our data into Python. The Pandas library makes this incredibly easy, but raw Excel files are often “messy.”

In this guide, we’ll walk through the essential steps to load, clean, and prepare an Excel file for analysis.

1. Installing Required Libraries

To read Excel files in Pandas, you usually need the openpyxl engine. You can install it via pip:

pip install pandas openpyxl

2. Loading the Data

Loading a file is a single line of code. You can even specify which sheet you want to load.

import pandas as pd

# Load the Excel file
# If you have multiple sheets, use sheet_name="Sheet1"
df = pd.read_excel("my_data.xlsx")

# View the first 5 rows
print(df.head())

3. Common Data Cleaning Steps

Renaming Columns

Excel headers often contain spaces or special characters that are hard to type in Python. It’s a good practice to rename them to lowercase with underscores.

df = df.rename(columns={
    'First Name': 'first_name',
    'Date of Birth': 'dob',
    'Total Revenue ($)': 'revenue'
})

Handling Missing Values (NaN)

Real-world data often has empty cells. You can either remove these rows or fill them with a default value.

# Option A: Drop any row that has a missing value
df_clean = df.dropna()

# Option B: Fill missing values with 0
df['revenue'] = df['revenue'].fillna(0)

# Option C: Fill with the average (mean)
df['age'] = df['age'].fillna(df['age'].mean())

Removing Unnecessary Columns

Sometimes Excel files contain “notes” or “metadata” columns that aren’t useful for your analysis.

# Drop columns by name
df = df.drop(columns=['Notes', 'Created By'])

4. Changing Data Types

Pandas tries to guess the data type, but sometimes it gets it wrong (e.g., treating a Zip Code as a number when it should be a string).

# Convert a column to string
df['zip_code'] = df['zip_code'].astype(str)

# Convert a column to datetime
df['dob'] = pd.to_datetime(df['dob'])

Summary Checklist

Before you start your analysis, make sure you have: 1. Loaded the correct sheet. 2. Cleaned up column names. 3. Handled missing (NaN) values. 4. Verified that numbers and dates are in the correct format.

Cleaning your data might take 80% of your time, but it’s the most important step for accurate results!

Written by

Abdur-Rahmaan Janhangeer

Chef

Python author of 7+ years having worked for Python companies around the world

1. Installing Required Libraries

2. Loading the Data

3. Common Data Cleaning Steps

Renaming Columns

Handling Missing Values (NaN)

Removing Unnecessary Columns

4. Changing Data Types

Summary Checklist

Related Posts:

Abdur-Rahmaan Janhangeer

Suggested Posts

Efficient Column Removal in Pandas: Dropping by Index or Range

Data Scaling Techniques in Machine Learning

Measures in Statistics for Data Science