Python - Machine Learning Part 2: Data Preprocessing and Cleaning
Data preprocessing transforms raw data into a clean, usable format. This step includes handling missing values, encoding categorical variables, and scaling features. The Python libraries pandas, NumPy, and scikit-learn cover these tasks.
Examples:
Handling Missing Data
import pandas as pd
import numpy as np
# Sample data
data = {'Name': ['Alice', 'Bob', np.nan], 'Age': [25, np.nan, 22]}
df = pd.DataFrame(data)
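# Fill missing ages with the column mean; assign the result back rather than
# calling fillna(inplace=True) on a column selection, which newer pandas versions warn about
df['Age'] = df['Age'].fillna(df['Age'].mean())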
print(df)
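Explanation: The missing age is replaced with the mean of the observed ages (23.5), so the row is kept instead of being dropped. The missing value in Name is left untouched, because a mean is not defined for text. Note that mean imputation keeps the sample size but reduces the variance of the column.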
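The example above fills only the Age column. A minimal sketch of handling both columns follows: it counts the missing values first, then fills each column with a strategy suited to its type. The 'Unknown' placeholder and the choice of the median are assumptions made for illustration, not part of the original example.

```python
import numpy as np
import pandas as pd

# Same sample data as above: one missing name, one missing age
df = pd.DataFrame({'Name': ['Alice', 'Bob', np.nan], 'Age': [25, np.nan, 22]})

# Count missing values per column before changing anything
missing_before = df.isna().sum()
print(missing_before)

# Fill each column with its own value: a placeholder string for text
# (hypothetical choice) and the median for the numeric column
filled = df.fillna({'Name': 'Unknown', 'Age': df['Age'].median()})

missing_after = filled.isna().sum()
print(filled)
```

Passing a dictionary to fillna returns a new DataFrame and leaves df unchanged, which makes it easy to compare the data before and after imputation.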
Encoding Categorical Data
from sklearn.preprocessing import LabelEncoder
data = ['cat', 'dog', 'mouse', 'dog']
encoder = LabelEncoder()
encoded = encoder.fit_transform(data)
print(encoded)
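Explanation: LabelEncoder maps each distinct label to an integer in sorted order, so the output is [0 1 2 1]. scikit-learn intends LabelEncoder for target labels; when it is applied to input features, the integer codes can suggest an ordering (cat < dog < mouse) that does not exist in the data.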
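When the categories are an input feature with no natural order, one-hot encoding is a common alternative to integer codes. The sketch below uses scikit-learn's OneHotEncoder on the same four labels; the variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A feature column must be 2-D: one row per sample, one column per feature
animals = np.array([['cat'], ['dog'], ['mouse'], ['dog']])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for printing
one_hot = encoder.fit_transform(animals).toarray()

print(encoder.categories_[0])  # the learned categories, in sorted order
print(one_hot)                 # one column per category, a single 1 per row
```

Each row contains exactly one 1, in the column of its category, so no category appears larger or smaller than another.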
Scaling Features
from sklearn.preprocessing import StandardScaler
import numpy as np
X = np.array([[1, 200], [2, 300], [3, 400]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
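Explanation: StandardScaler rescales each column to zero mean and unit variance, so features measured on very different scales (here 1 to 3 versus 200 to 400) contribute comparably. This matters for algorithms such as SVMs and k-nearest neighbors, whose results depend on distances between samples.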
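A quick way to check the scaling is to inspect the column means and standard deviations afterwards. The sketch below also applies the fitted scaler to a new sample, which is the usual pattern when data arrives after fitting; X_new is a hypothetical value added for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1, 200], [2, 300], [3, 400]], dtype=float)
X_new = np.array([[4, 500]], dtype=float)  # hypothetical unseen sample

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation, then scales
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses the learned statistics; it does not refit on the new data
X_new_scaled = scaler.transform(X_new)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_new_scaled)
```

Reusing the statistics learned from the original data keeps new samples on the same scale that a model saw during training.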