How to Prepare Excel Files for Data Analysis with Pandas


data science

Excel is the world’s most popular data tool, but for advanced analysis and machine learning, we often need to move our data into Python. The Pandas library makes this incredibly easy, but raw Excel files are often “messy.”

In this guide, we’ll walk through the essential steps to load, clean, and prepare an Excel file for analysis.


1. Installing Required Libraries

To read Excel files in Pandas, you usually need the openpyxl engine. You can install it via pip:

pip install pandas openpyxl

2. Loading the Data

Loading a file is a single line of code. You can even specify which sheet you want to load.

import pandas as pd

# Load the Excel file
# If you have multiple sheets, use sheet_name="Sheet1"
df = pd.read_excel("my_data.xlsx")

# View the first 5 rows
print(df.head())

3. Common Data Cleaning Steps

Renaming Columns

Excel headers often contain spaces or special characters that are hard to type in Python. It’s a good practice to rename them to lowercase with underscores.

df = df.rename(columns={
    'First Name': 'first_name',
    'Date of Birth': 'dob',
    'Total Revenue ($)': 'revenue'
})

Handling Missing Values (NaN)

Real-world data often has empty cells. You can either remove these rows or fill them with a default value.

# Option A: Drop any row that has a missing value
df_clean = df.dropna()

# Option B: Fill missing values with 0
df['revenue'] = df['revenue'].fillna(0)

# Option C: Fill with the average (mean)
df['age'] = df['age'].fillna(df['age'].mean())

Removing Unnecessary Columns

Sometimes Excel files contain “notes” or “metadata” columns that aren’t useful for your analysis.

# Drop columns by name
df = df.drop(columns=['Notes', 'Created By'])

4. Changing Data Types

Pandas tries to guess the data type, but sometimes it gets it wrong (e.g., treating a Zip Code as a number when it should be a string).

# Convert a column to string
df['zip_code'] = df['zip_code'].astype(str)

# Convert a column to datetime
df['dob'] = pd.to_datetime(df['dob'])

Summary Checklist

Before you start your analysis, make sure you have: 1. Loaded the correct sheet. 2. Cleaned up column names. 3. Handled missing (NaN) values. 4. Verified that numbers and dates are in the correct format.

Cleaning your data might take 80% of your time, but it’s the most important step for accurate results!

Related Posts:

Written by

Abdur-Rahmaan Janhangeer

Chef

Python author of 7+ years having worked for Python companies around the world

Suggested Posts

Efficient Column Removal in Pandas: Dropping by Index or Range

When cleaning data in Pandas, we often need to remove columns. While dropping by name (df.drop(colum...

Read article

Data Scaling Techniques in Machine Learning

Data and its quality affect machine learning models and their accuracy, and the quality of the data ...

Read article

Measures in Statistics for Data Science

Statistics is a critical component of data science and machine learning algorithms. Almost all the m...

Read article
Free Flask Course