Case Study 2: Berlin Marathon Performance Analysis
Repository: View on GitHub
Project Overview
Investigated age-related differences in finish times among female participants in the Berlin Marathon. The analysis compares six age groups using exploratory data analysis, visualization, and statistical hypothesis testing.
Data Science Approach
- Data Processing: Imported and cleaned real-world marathon data using Python and Jupyter Notebook. Grouped participants into six defined age brackets for comparative analysis.
- Exploratory Data Analysis (EDA): Visualized finish time distributions per age group using boxplots, histograms, and density plots. Identified trends and outliers in race performance.
- Statistical Analysis: Applied ANOVA to test for overall differences in finish times across age groups. Conducted post-hoc pairwise comparisons (Tukey’s HSD) to pinpoint significant group differences. Calculated descriptive statistics (mean, median, IQR) for each age bracket.
- Visualization: Created clear, publication-quality plots directly within Jupyter Notebooks (matplotlib, seaborn). Exported summary tables and visuals for reporting.
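The processing and testing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic, and the column names (`age`, `finish_min`, `age_group`), the bracket boundaries, and the 0.05 significance level are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in for the marathon data (NOT real results).
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({"age": rng.integers(18, 75, size=n)})
# Hypothetical effect: finish time in minutes grows with distance from age 30.
df["finish_min"] = 240 + 0.8 * (df["age"] - 30).abs() + rng.normal(0, 25, size=n)

# Group participants into six age brackets (boundaries are assumed).
bins = [17, 29, 39, 49, 59, 69, 120]
labels = ["18-29", "30-39", "40-49", "50-59", "60-69", "70+"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)


def iqr(s):
    """Interquartile range of a Series."""
    return s.quantile(0.75) - s.quantile(0.25)


# Descriptive statistics per bracket: count, mean, median, IQR.
grouped = df.groupby("age_group", observed=True)["finish_min"]
summary = grouped.agg(["count", "mean", "median", iqr])
print(summary.round(1))

# One-way ANOVA: do mean finish times differ across the brackets?
samples = [values.to_numpy() for _, values in grouped]
f_stat, p_value = stats.f_oneway(*samples)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4g}")

# Post-hoc Tukey HSD: which pairs of brackets differ?
tukey = pairwise_tukeyhsd(
    endog=df["finish_min"],
    groups=df["age_group"].astype(str),
    alpha=0.05,
)
print(tukey.summary())
```

With six groups, the Tukey table lists 15 pairwise comparisons. The `reject` column marks the pairs whose mean difference is significant at the chosen level.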
Sample Visualizations
Preview all figures and analysis in the full report PDF:
Tools & Technologies
- Python (pandas, matplotlib, seaborn, scipy, statsmodels): Data wrangling, visualization, and statistical testing.
- Jupyter Notebook: Reproducible analysis and reporting.
- LaTeX/PDF Export: High-quality final report.
Key Insights
- Found statistically significant differences in mean finish times across female age groups.
- Older and younger groups exhibited distinct performance patterns, with clear trends highlighted in visualizations.
- Provided evidence-based recommendations for athletes and race organizers.
See the full analysis and visualizations: