Practical 14: Real-World Project - Part 1 (Data & Preprocessing)

Objective

Work on a complete real-world machine learning project from data collection to deployment.

Part 1 focuses on data gathering, cleaning, and preprocessing.

Duration

5-6 hours

Prerequisites


Project Options

Choose one of these real-world problems:

Option 1: Customer Churn Prediction

Option 2: House Price Prediction

Option 3: Disease Diagnosis

Option 4: Sentiment Analysis


📋 Tasks for Part 1

1. Data Collection & Understanding

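# Load data
import pandas as pd

df = pd.read_csv('your_dataset.csv')  # replace with the path to your dataset

# Explore
print(df.head())          # first five rows
df.info()                 # prints dtypes and non-null counts itself (returns None)
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # number of missing values per column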
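
The exploration calls above can be condensed into a single per-column report. The following is a minimal sketch, not a required part of the practical: the helper name `summarize_columns` and the toy data are hypothetical and only stand in for your own dataset.

```python
import numpy as np
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing %, unique count."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
        "n_unique": df.nunique(),  # NaN is not counted as a value
    })
    # Columns with the most missing data come first
    return summary.sort_values("pct_missing", ascending=False)


# Hypothetical toy data standing in for your_dataset.csv
toy = pd.DataFrame({
    "age": [34, np.nan, 51, 29],
    "plan": ["basic", "pro", None, "basic"],
    "monthly_fee": [20.0, 45.0, 45.0, 20.0],
})

report = summarize_columns(toy)
print(report)
```

A report like this makes it easier to decide, column by column, whether to drop rows, impute values, or discard the column in the cleaning step.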

2. Data Cleaning

# Handle missing values
df.dropna(subset=['critical_column'], inplace=True)        # drop rows missing a must-have field
df.fillna(df.median(numeric_only=True), inplace=True)      # impute numeric columns with their median

# Remove duplicates
df.drop_duplicates(inplace=True)

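# Handle outliers (quartiles are only defined for numeric columns)
Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1  # interquartile range per numeric column
# ... outlier removal code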

3. Feature Engineering

# Create new features
df['new_feature'] = df['col1'] * df['col2']

# Encode categorical variables
df = pd.get_dummies(df, columns=['category_col'])

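# Normalize features (standardize to zero mean and unit variance)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Scale numeric columns only; StandardScaler fails on strings and returns a NumPy array
df_scaled = scaler.fit_transform(df.select_dtypes(include='number'))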
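
Fitting the scaler on the full dataset lets information from the test rows leak into training. A minimal sketch of one way to avoid this is shown below: split first, fit the scaler on the training rows only, then apply the same transformation to the test rows. The column names and toy data are hypothetical, loosely modelled on the churn option.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for a churn dataset
toy = pd.DataFrame({
    "tenure": [1.0, 5.0, 12.0, 24.0, 36.0, 48.0, 60.0, 72.0],
    "monthly_fee": [20.0, 25.0, 30.0, 45.0, 50.0, 55.0, 70.0, 80.0],
    "plan": ["basic", "basic", "pro", "pro", "basic", "pro", "pro", "basic"],
    "churned": [1, 1, 0, 0, 1, 0, 0, 0],
})

# One-hot encode the categorical column, keep the target separate
X = pd.get_dummies(toy.drop(columns="churned"), columns=["plan"], dtype=int)
y = toy["churned"]

# Split before scaling so the test rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

numeric_cols = ["tenure", "monthly_fee"]
scaler = StandardScaler()
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])  # learn mean/std here
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])        # reuse them here

print(X_train.head())
```

The one-hot columns are left unscaled in this sketch; that is a design choice, and scaling them as well is also common for some models.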

4. Data Visualization & Analysis

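# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns

df.hist(figsize=(15, 10))  # one histogram per numeric column
plt.show()

# Correlation matrix (numeric columns only)
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()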
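
A heatmap becomes hard to read with many columns, so it can help to list the correlations with the target directly. This is a minimal sketch under the assumption that your target is numeric; the helper name `rank_by_target_correlation` and the toy data are hypothetical.

```python
import pandas as pd


def rank_by_target_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Return each numeric feature's Pearson correlation with the target,
    ordered by absolute strength (strongest first)."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Hypothetical toy data: price is exactly 3 * area
toy = pd.DataFrame({
    "area": [50, 60, 80, 100, 120],
    "age": [30, 5, 20, 10, 15],
    "city": ["A", "B", "A", "C", "B"],  # non-numeric, ignored by corr
    "price": [150, 180, 240, 300, 360],
})

ranked = rank_by_target_correlation(toy, "price")
print(ranked)
```

Correlation only captures linear relationships, so a low value here does not by itself mean a feature is useless.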

📊 Deliverables for Part 1


📊 Learning Outcomes

