This project involves performing data cleaning and exploratory data analysis (EDA) on the Titanic dataset from Kaggle. The objective is to explore relationships between variables and identify patterns and trends in the data.
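
The cleaning steps are not spelled out here, so the following is only a minimal sketch of what such a pass might look like. It assumes the standard Kaggle Titanic column names (`Age`, `Embarked`, `Cabin`) and uses a tiny made-up sample instead of `titanic.csv`. The fill and drop strategies are illustrative assumptions, not necessarily what `analysis.py` does.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with Kaggle-style column names (not real passenger data).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 2, 1],
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 13.0, 53.1],
    "Embarked": ["S", "C", None, "S", "S"],
    "Cabin": [None, "C85", None, None, "C123"],
})

# Count missing values per column before cleaning.
missing_before = df.isna().sum()

cleaned = df.copy()
# Assumption: fill missing ages with the median age.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
# Assumption: fill missing embarkation ports with the most common port.
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
# Assumption: drop Cabin because most of its values are missing.
cleaned = cleaned.drop(columns=["Cabin"])

print(missing_before)
print(cleaned.isna().sum())
```

With the real dataset, `df` would instead come from something like `pd.read_csv("titanic.csv")`.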
# Problem Statement
The goal is to analyze the Titanic dataset to:
- Explore the distribution of variables like age, gender, and class.
- Understand survival rates based on various factors such as gender, passenger class, and family size.
- Visualize relationships between variables to identify significant patterns.

The analysis covered the following:
- Age Distribution: Visualized the distribution of ages among passengers.
- Survivors vs Non-Survivors: Compared survival outcomes based on multiple factors.
- Survival Rate by Gender: Analyzed how survival rates differed between male and female passengers.
- Survival Rate by Passenger Class: Explored the survival rates across passenger classes (1st, 2nd, 3rd).
- Survival by Family Size: Investigated the relationship between family size and survival chances.
- Correlation Heatmap: Created a heatmap to examine the correlation between numerical variables in the dataset.
- Fare vs Survival: Analyzed whether higher ticket fares were associated with higher survival rates.
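
The survival-rate comparisons above can be sketched with a pandas `groupby`. This is a hedged illustration, not the code from `analysis.py`. It assumes the standard Kaggle column names (`Survived`, `Sex`, `Pclass`, `SibSp`, `Parch`), defines family size as `SibSp + Parch + 1` (a common convention that may differ from the script), and runs on a few made-up rows, so the printed rates say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows with Kaggle-style column names (not real passenger data).
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0, 0, 1],
    "Sex": ["female", "male", "female", "male", "female", "male", "male", "male"],
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1],
    "SibSp": [1, 0, 0, 1, 0, 0, 4, 0],
    "Parch": [0, 0, 2, 0, 0, 0, 1, 0],
})

# Survived is coded 0/1, so the mean within a group is that group's survival rate.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()

# Assumption: family size = siblings/spouses + parents/children + the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
rate_by_family = df.groupby("FamilySize")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
print(rate_by_family)
```

Taking the mean of a 0/1 column keeps each comparison to a single line and makes it easy to pass the resulting Series straight to a bar plot.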

Libraries used:
- pandas: For data manipulation and cleaning.
- matplotlib: For creating static visualizations.
- seaborn: For advanced visualizations and plots.
- numpy: For numerical operations.
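
As a rough illustration of how these libraries fit together, the sketch below builds a correlation heatmap of numerical columns. The data is randomly generated, and the output filename `correlation_heatmap.png` is hypothetical; the actual script may select different columns and write to a different location.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in for the numerical Titanic columns (not real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Survived": rng.integers(0, 2, size=50),
    "Pclass": rng.integers(1, 4, size=50),
    "Age": rng.uniform(1, 80, size=50),
    "Fare": rng.uniform(5, 500, size=50),
})

# Pairwise Pearson correlation between the numerical columns.
corr = df.corr()

# Draw the matrix as an annotated heatmap on a fixed -1..1 color scale.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation between numerical variables")
fig.tight_layout()
fig.savefig("correlation_heatmap.png")  # hypothetical output filename
plt.close(fig)
```

Fixing `vmin` and `vmax` at -1 and 1 keeps the color scale comparable across runs, which helps when comparing heatmaps before and after cleaning.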

Install the dependencies:

```bash
pip install pandas matplotlib seaborn numpy
```

Run the analysis:

```bash
python analysis.py
```

# Files Included
- analysis.py: The script containing the EDA and visualizations.
- titanic.csv: The dataset used for analysis (optional; may not be included).
- output/: Directory with images of the generated visualizations.

Through EDA, we uncovered interesting trends such as the higher survival rates of women and first-class passengers, and we visualized important relationships between key variables in the Titanic dataset.