Logistic Regression Classification

The Titanic dataset is one of the best-known beginner-friendly datasets in data science. It records information about passengers aboard the Titanic, including age, sex, ticket class, and survival status.

In this project, I build a logistic regression model that predicts whether a passenger survived, based on their characteristics. Data cleaning, visualization, feature engineering, and model building together help uncover the key factors that influenced survival.

1. Dataset & Tools

  • Dataset: Titanic Dataset (Kaggle)

  • Tools: Python, Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn

  • Skills Demonstrated:

    • Data Cleaning & Preprocessing

    • Exploratory Data Analysis (EDA)

    • Feature Engineering

    • Logistic Regression Modeling

    • Model Evaluation

2. Load and Explore the Dataset

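import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data (assumed path to the Kaggle training file; adjust as needed)
titanic = pd.read_csv('train.csv')

# Count the missing values in each column
titanic.isnull().sum()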
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
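
Raw counts are easier to judge as a share of all rows. A minimal sketch, using a small made-up frame in place of the real data; the `missing_summary` helper and the toy values are illustrative assumptions, not part of the original analysis:

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)


# Toy stand-in for the Titanic frame (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})
summary = missing_summary(toy)
print(summary)
```

Applied to the real frame, the same helper would show that Cabin is missing in most rows, while Age is missing in a much smaller share.
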
# Count survivors and non-survivors, split by sex
sns.countplot(x='Survived', data=titanic, hue='Sex')

[Figure: count plot of Survived, split by Sex]

# Count survivors and non-survivors, split by passenger class
sns.countplot(x='Survived', data=titanic, hue='Pclass')

[Figure: count plot of Survived, split by Pclass]

# Age distribution within each passenger class
sns.boxplot(x='Pclass', y='Age', data=titanic)

[Figure: box plot of Age by Pclass]

# Median age of passengers in each class
titanic.groupby('Pclass')['Age'].median()
Pclass
1    37.0
2    29.0
3    24.0
Name: Age, dtype: float64
# Fill a missing age with the median age of the passenger's class
def impute_age(cols):
    # Access by label; positional access such as cols[0] is deprecated in recent pandas
    Age = cols['Age']
    Pclass = cols['Pclass']

    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age


# Apply the function row by row
titanic['Age'] = titanic[['Age', 'Pclass']].apply(impute_age, axis=1)
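
The class medians do not have to be hard-coded. As an alternative sketch (not the approach used in this post), `groupby(...).transform('median')` computes each class median and aligns it with every row, so `fillna` can do the imputation in one step. The toy frame below is made up so the example runs on its own:

```python
import numpy as np
import pandas as pd

# Toy data: the observed ages are chosen so the class medians are 37, 29 and 24
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, 34.0, np.nan, 29.0, np.nan, 20.0, 28.0, np.nan],
})

# Median age of each class, broadcast back to every row of that class
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the missing ages; observed ages stay unchanged
toy["Age"] = toy["Age"].fillna(class_median)
print(toy)
```

This keeps the imputed values in sync with the data if the rows change, at the cost of being a little less explicit than the function above.
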


titanic.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# Drop the Cabin column: most of its values (687) are missing
titanic = titanic.drop(columns='Cabin')
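
The post does not show how `X` and `y` were built before the split. One possible preparation is sketched below on made-up rows. The choice of dropped columns and the one-hot encoding are assumptions, not necessarily what was done for the results reported here:

```python
import numpy as np
import pandas as pd

# Toy rows with the same columns as the cleaned frame (values are made up)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 29.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Fare": [7.25, 71.28, 7.92, 13.0],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Assumption: identifier and free-text columns are dropped as non-predictive
features = toy.drop(columns=["PassengerId", "Name", "Ticket", "Survived"])

# One-hot encode the categorical columns; drop_first removes one redundant
# column per feature. A missing Embarked value is encoded as all zeros,
# the same as the dropped first category.
X = pd.get_dummies(features, columns=["Sex", "Embarked"], drop_first=True)
y = toy["Survived"]
print(X.columns.tolist())
```

Because `get_dummies` leaves no missing values behind, no rows have to be discarded for the two passengers without an Embarked value.
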

from sklearn.model_selection import train_test_split
# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
print(len(X_train))
print(len(X_test))
from sklearn.linear_model import LogisticRegression
# Initialize the model
Passenger_Survival = LogisticRegression()
# Train the model
Passenger_Survival.fit(X_train, y_train)
# Predicted labels (y-hat) for the test set
predictions = Passenger_Survival.predict(X_test)
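
To see which factors a fitted logistic regression relies on, its coefficients can be read as odds ratios. The sketch below uses synthetic data, so the column names, effect sizes, and resulting numbers are assumptions for illustration and say nothing about the real Titanic results:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    "Sex_male": rng.integers(0, 2, n),
    "Pclass": rng.integers(1, 4, n),
})

# Synthetic outcome: survival is made less likely for males and lower classes
logit = 2.0 - 2.5 * X_demo["Sex_male"] - 0.8 * (X_demo["Pclass"] - 1)
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# One coefficient per feature; exp(coef) is the multiplicative change in the
# odds of the positive class for a one-unit increase in that feature
coefs = pd.Series(model.coef_[0], index=X_demo.columns)
odds_ratios = np.exp(coefs)
print(odds_ratios)
```

An odds ratio below 1 means the feature lowers the odds of survival in the fitted model. The same two lines after `fit` would work on `Passenger_Survival` with the columns of `X_train`.
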

# Check the performance of the classification model
# with a classification report and a confusion matrix

from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           0       0.81      0.82      0.82       170
           1       0.68      0.67      0.68        98

    accuracy                           0.76       268
   macro avg       0.75      0.75      0.75       268
weighted avg       0.76      0.76      0.76       268

# Confusion matrix: rows are actual classes, columns are predicted classes
confusion_matrix(y_test,predictions)
array([[139,  31],
       [ 32,  66]])
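
The headline numbers in the classification report can be recovered from the confusion matrix above. A short sketch, using scikit-learn's layout (rows are actual classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix reported above
cm = np.array([[139, 31],
               [32, 66]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the predicted survivors, how many survived
recall = tp / (tp + fn)     # of the actual survivors, how many were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

These values match the report's accuracy and the precision and recall listed for class 1, so the model misses roughly one in three actual survivors.
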


 





