Excel is the world’s most popular data tool, but for advanced analysis and machine learning, we often need to move our data into Python. The Pandas library makes this incredibly easy, but raw Excel files are often “messy.”
In this guide, we’ll walk through the essential steps to load, clean, and prepare an Excel file for analysis.
1. Installing Required Libraries
To read Excel files in Pandas, you usually need the openpyxl engine. You can install it via pip:
pip install pandas openpyxl
2. Loading the Data
Loading a file is a single line of code. You can even specify which sheet you want to load.
import pandas as pd
# Load the Excel file
# If you have multiple sheets, use sheet_name="Sheet1"
df = pd.read_excel("my_data.xlsx")
# View the first 5 rows
print(df.head())
3. Common Data Cleaning Steps
Renaming Columns
Excel headers often contain spaces or special characters that are hard to type in Python. It’s a good practice to rename them to lowercase with underscores.
df = df.rename(columns={
'First Name': 'first_name',
'Date of Birth': 'dob',
'Total Revenue ($)': 'revenue'
})
Handling Missing Values (NaN)
Real-world data often has empty cells. You can either remove these rows or fill them with a default value.
# Option A: Drop any row that has a missing value
df_clean = df.dropna()
# Option B: Fill missing values with 0
df['revenue'] = df['revenue'].fillna(0)
# Option C: Fill with the average (mean)
df['age'] = df['age'].fillna(df['age'].mean())
Removing Unnecessary Columns
Sometimes Excel files contain “notes” or “metadata” columns that aren’t useful for your analysis.
# Drop columns by name
df = df.drop(columns=['Notes', 'Created By'])
4. Changing Data Types
Pandas tries to guess the data type, but sometimes it gets it wrong (e.g., treating a Zip Code as a number when it should be a string).
# Convert a column to string
df['zip_code'] = df['zip_code'].astype(str)
# Convert a column to datetime
df['dob'] = pd.to_datetime(df['dob'])
Summary Checklist
Before you start your analysis, make sure you have: 1. Loaded the correct sheet. 2. Cleaned up column names. 3. Handled missing (NaN) values. 4. Verified that numbers and dates are in the correct format.
Cleaning your data might take 80% of your time, but it’s the most important step for accurate results!
Related Posts:
Written by
Abdur-Rahmaan Janhangeer
Chef
Python author of 7+ years having worked for Python companies around the world
Suggested Posts
Efficient Column Removal in Pandas: Dropping by Index or Range
When cleaning data in Pandas, we often need to remove columns. While dropping by name (df.drop(colum...
Data Scaling Techniques in Machine Learning
Data and its quality affect machine learning models and their accuracy, and the quality of the data ...
Measures in Statistics for Data Science
Statistics is a critical component of data science and machine learning algorithms. Almost all the m...