Data Cleaning and Exploratory Data Analysis (EDA) with Python

Data cleaning and exploratory data analysis (EDA) are critical steps in any data-driven project. They ensure that the data is accurate, consistent, and ready for analysis. In this blog post, we will explore the processes of data cleaning and EDA using Python, leveraging libraries like pandas and matplotlib. We’ll also delve into key statistical concepts such as mean, median, mode, quartile deviations, histograms, and boxplots, including handling outliers.

Data Cleaning

What is Data Cleaning?

Data cleaning is the process of detecting, correcting, or removing errors, inconsistencies, and inaccuracies in data. It prepares raw data for analysis, ensuring its quality and usability.

Steps in Data Cleaning

Loading the Data

Use pandas to load data from CSV files, Excel files, or databases.
```
import pandas as pd

df = pd.read_csv('data.csv')
```

Inspecting the Data

Preview the data:
```
print(df.head())
print(df.info())
```
Check for missing values:
```
print(df.isnull().sum())
```
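
Raw counts are easier to judge next to the size of the dataset. As a minimal sketch, assuming a small hypothetical DataFrame (the `age` and `city` columns are made up for illustration), the share of missing values per column can be computed like this:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, None],
})

# isnull() returns booleans, so mean() gives the fraction of missing values.
missing_share = df.isnull().mean()
print(missing_share)
```

Here `age` is missing in one of four rows and `city` in two of four, which can help decide between dropping and filling.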

Handling Missing Values

Drop missing values:
```
df = df.dropna()
```

Fill missing values:

```
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
```
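
The mean is not the only possible fill value. The sketch below, which uses a hypothetical salary series with one extreme entry, compares filling with the mean and with the median. Which one suits your data is a judgment call, so treat this as an illustration rather than a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one extreme value and one missing entry.
s = pd.Series([30_000, 32_000, 35_000, 1_000_000, np.nan])

filled_mean = s.fillna(s.mean())      # the extreme value pulls the mean upward
filled_median = s.fillna(s.median())  # the median is less affected by it

print(filled_mean.iloc[-1], filled_median.iloc[-1])
```

With these numbers the mean fill is 274250.0 while the median fill is 33500.0, which is much closer to the typical salary.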

Removing Duplicates
```
df = df.drop_duplicates()
```
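
By default, `drop_duplicates()` only removes rows that match in every column. If a record should be unique on a key instead, the `subset` and `keep` parameters can express that. The following sketch assumes a hypothetical `customer_id` key:

```python
import pandas as pd

# Hypothetical records: rows 0 and 1 are exact duplicates;
# row 2 repeats the same id with a different city.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "city": ["Pune", "Pune", "Delhi", "Goa"],
})

exact = df.drop_duplicates()                                      # compares all columns
by_id = df.drop_duplicates(subset=["customer_id"], keep="first")  # compares one column

print(len(exact), len(by_id))
```

The first call keeps three rows and the second keeps two, one per `customer_id`.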

Fixing Data Types

```
df['date_column'] = pd.to_datetime(df['date_column'])
df['numeric_column'] = pd.to_numeric(df['numeric_column'])
```
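
Both conversions raise an error when they meet a value they cannot parse. One way to handle messy columns, sketched here with hypothetical sample values, is to pass `errors="coerce"` so that bad entries become missing values you can inspect afterwards:

```python
import pandas as pd

# Hypothetical raw columns containing values that cannot be parsed.
df = pd.DataFrame({
    "date_column": ["2024-01-15", "not a date"],
    "numeric_column": ["42", "n/a"],
})

# errors="coerce" turns unparseable entries into NaT / NaN instead of raising.
df["date_column"] = pd.to_datetime(df["date_column"], errors="coerce")
df["numeric_column"] = pd.to_numeric(df["numeric_column"], errors="coerce")

print(df.dtypes)
print(df.isnull().sum())
```

Coercing hides the original bad values, so it is worth checking how many entries became missing before moving on.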

Standardizing Data

Convert text to lowercase:

```
df['text_column'] = df['text_column'].str.lower()
```
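
Case is often not the only inconsistency. As a small sketch with made-up city names, stripping surrounding whitespace before lowercasing lets values that should be equal actually compare equal:

```python
import pandas as pd

# Hypothetical city names with inconsistent case and stray whitespace.
df = pd.DataFrame({"text_column": [" Delhi", "delhi ", "DELHI", "Pune"]})

# Strip whitespace, then lowercase, so equal values compare equal.
df["text_column"] = df["text_column"].str.strip().str.lower()

print(df["text_column"].nunique())
```

After standardizing, the four raw strings collapse to two distinct values.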

Outlier Detection and Removal

Use statistical methods or visualization tools (discussed in the EDA section) to detect and handle outliers.
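
Once outliers are identified, two common ways to handle them are dropping the rows or capping the values. The sketch below applies the 1.5 × IQR rule from the EDA section to a hypothetical `salary` column. The data and the choice between trimming and capping are assumptions for illustration, not a recommendation.

```python
import pandas as pd

# Hypothetical salaries (in thousands); 500 is far from the rest.
df = pd.DataFrame({"salary": [40.0, 42.0, 45.0, 47.0, 50.0, 52.0, 55.0, 500.0]})

q1 = df["salary"].quantile(0.25)
q3 = df["salary"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: drop rows outside the fences.
trimmed = df[df["salary"].between(lower, upper)]

# Option 2: keep every row but cap values at the fences.
capped = df["salary"].clip(lower=lower, upper=upper)

print(len(trimmed), capped.max())
```

Trimming leaves seven rows, while capping keeps all eight and replaces 500 with the upper fence.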

Exploratory Data Analysis (EDA)

EDA involves analyzing and summarizing data sets to understand their main characteristics. It’s often the first step in any data analysis project.

Key Concepts in EDA

Mean
- The average of a dataset.
- Formula: $\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}$
- In Python:
```
mean_value = df['column_name'].mean()
```
Median
- The middle value in a sorted dataset.
- In Python:
```
median_value = df['column_name'].median()
```
Mode
- The most frequent value in a dataset.
- In Python:
```
mode_value = df['column_name'].mode()[0]
```
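
To see how the three measures differ, here is a short worked example on a hypothetical series with one extreme observation:

```python
import pandas as pd

# Hypothetical values with one extreme observation (100).
s = pd.Series([2, 3, 3, 4, 5, 100])

mean_value = s.mean()      # 19.5: pulled upward by the extreme value
median_value = s.median()  # 3.5: average of the two middle values, 3 and 4
mode_value = s.mode()[0]   # 3: the only value that appears twice

print(mean_value, median_value, mode_value)
```

Note that `mode()` returns a Series because several values can tie for most frequent; `[0]` picks the first of them.
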
Quartiles and Quartile Deviation
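- Quartiles: Three cut points (Q1, Q2, Q3) that divide sorted data into four equal parts.
- Interquartile Range (IQR): The difference between Q3 (75th percentile) and Q1 (25th percentile).
- Quartile Deviation: Half of the IQR, also called the semi-interquartile range.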
```
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
```
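
As a quick worked example on made-up data, the values 1 through 9 give round numbers for each quantity. The quartile deviation is computed here as half the IQR (the semi-interquartile range):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = s.quantile(0.25)         # 3.0
q3 = s.quantile(0.75)         # 7.0
iqr = q3 - q1                 # 4.0
quartile_deviation = iqr / 2  # 2.0

print(q1, q3, iqr, quartile_deviation)
```

pandas interpolates linearly between data points by default, so on other datasets the quartiles may fall between observed values.
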
Outliers
- Values outside the range $[Q1 - 1.5 \times IQR,\ Q3 + 1.5 \times IQR]$ are treated as outliers.
- Detecting outliers:
```
outliers = df[(df['column_name'] < (Q1 - 1.5 * IQR)) | (df['column_name'] > (Q3 + 1.5 * IQR))]
```
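
If you check several columns, it can help to wrap the rule in a function. The helper below is a hypothetical sketch (the name `iqr_outlier_mask` and the parameter `k` are our own, not part of pandas) that returns a boolean mask instead of a filtered DataFrame:

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical data: 95 sits far above the other values.
s = pd.Series([10, 12, 11, 13, 12, 95])
mask = iqr_outlier_mask(s)
print(s[mask].tolist())
```

A mask can be reused to inspect the outliers with `s[mask]` or to exclude them with `s[~mask]`.
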
Describe Method
- Summarizes the central tendency, dispersion, and shape of a dataset’s distribution.
- Example:
```
print(df.describe())
```
- Output includes statistics such as count, mean, standard deviation, minimum, maximum, and quartiles for numerical columns.
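
The result of `describe()` is itself a DataFrame, so individual statistics can be looked up by label. A small sketch with hypothetical `age` and `city` columns:

```python
import pandas as pd

# Hypothetical data with one numeric and one text column.
df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "city": ["Pune", "Delhi", "Pune", "Goa"],
})

summary = df.describe()            # numeric columns only by default
median_age = summary.loc["50%", "age"]

full = df.describe(include="all")  # adds unique/top/freq for text columns
top_city = full.loc["top", "city"]

print(median_age, top_city)
```

The `50%` row is the median, and `include="all"` brings text columns into the summary.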

Visualizations in EDA

Histogram

Shows the frequency distribution of a dataset.

Example:

```
import matplotlib.pyplot as plt

df['column_name'].plot(kind='hist', bins=20)
plt.title('Histogram of Column')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
```
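
To make the role of `bins` concrete, the sketch below computes the bin edges and counts directly with NumPy on made-up values. As far as we know, a matplotlib histogram relies on the same kind of equal-width binning, but treat that as an assumption.

```python
import numpy as np

# Hypothetical values spanning 1 to 9.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Four equal-width bins between the minimum and the maximum.
counts, edges = np.histogram(values, bins=4)

print(edges.tolist())   # bin boundaries
print(counts.tolist())  # number of values per bin
```

The edges are 1, 3, 5, 7, 9 and the counts are 3, 5, 1, 1. Each bar in the histogram corresponds to one of these counts.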

Boxplot

Displays data distribution and highlights outliers.

Example:

```
df.boxplot(column='column_name')
plt.title('Boxplot of Column')
plt.show()
```
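
The individual points drawn beyond the whiskers are the same values the 1.5 × IQR rule flags. As a sketch on hypothetical data, matplotlib's `boxplot` returns the plotted artists, so the outlier points can be read back from the figure. The `Agg` backend is used only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: 95 sits far above the other values.
s = pd.Series([10, 12, 11, 13, 12, 95])

fig, ax = plt.subplots()
artists = ax.boxplot(s.to_numpy())  # default whis=1.5 matches the 1.5 * IQR rule

# Points drawn individually beyond the whiskers ("fliers").
fliers = artists["fliers"][0].get_ydata().tolist()
print(fliers)
plt.close(fig)
```

Here the only flier is 95, which is also the only value above Q3 + 1.5 × IQR for this data.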

Practical Example

Data Cleaning and EDA Workflow

```
# Importing libraries
import pandas as pd
import matplotlib.pyplot as plt

# Loading data
df = pd.read_csv('data.csv')

# Data Cleaning
print(df.isnull().sum())
df['Age'] = df['Age'].fillna(df['Age'].mean())
df = df.drop_duplicates()

# Outlier Detection
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Salary'] < (Q1 - 1.5 * IQR)) | (df['Salary'] > (Q3 + 1.5 * IQR))]
print(outliers)

# Summary Statistics
print(df.describe())

# Visualizations
plt.figure(figsize=(10, 5))
df['Salary'].plot(kind='hist', bins=20, color='blue', alpha=0.7)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.show()

df.boxplot(column='Salary')
plt.title('Boxplot of Salary')
plt.show()
```

Conclusion

Data cleaning and EDA are essential to ensure that your data is ready for further analysis or modeling. Using Python libraries like pandas and matplotlib, you can efficiently clean your data and gain meaningful insights. By understanding statistical concepts and visualizations, you’re better equipped to make data-driven decisions.

