Case Study 2: Berlin Marathon Performance Analysis

Project Overview

This project investigates age-related differences in finish times among female participants in the Berlin Marathon. The analysis compares six age groups using exploratory data analysis, visualization, and statistical hypothesis testing.

Data Science Approach

Data Processing: Imported and cleaned real-world marathon data using Python and Jupyter Notebook. Grouped participants into six defined age brackets for comparative analysis.
Exploratory Data Analysis (EDA): Visualized finish time distributions per age group using boxplots, histograms, and density plots. Identified trends and outliers in race performance.
Statistical Analysis: Applied ANOVA to test for overall differences in finish times across age groups. Conducted post-hoc pairwise comparisons (Tukey’s HSD) to pinpoint significant group differences. Calculated descriptive statistics (mean, median, IQR) for each age bracket.
Visualization: Created clear, publication-quality plots directly within Jupyter Notebooks (matplotlib, seaborn). Exported summary tables and visuals for reporting.
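
The statistical steps listed above (age bracketing, descriptive statistics, one-way ANOVA, Tukey's HSD) can be sketched roughly as follows. This is an illustrative sketch, not the notebook code from the project: the data is synthetic, and the column names (age, finish_minutes), the bracket edges, and the bracket labels are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in data (NOT the Berlin Marathon results). The column
# names and the finish-time model are assumptions for illustration only.
rng = np.random.default_rng(42)
n = 600
age = rng.integers(18, 80, size=n)
finish_minutes = 230 + 0.9 * np.abs(age - 30) + rng.normal(0, 25, size=n)
df = pd.DataFrame({"age": age, "finish_minutes": finish_minutes})

# Six hypothetical age brackets; right=False makes each bin [lower, upper).
edges = [18, 30, 40, 50, 60, 70, 120]
labels = ["18-29", "30-39", "40-49", "50-59", "60-69", "70+"]
df["age_group"] = pd.cut(df["age"], bins=edges, labels=labels, right=False)


def iqr(series):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return series.quantile(0.75) - series.quantile(0.25)


# Descriptive statistics (count, mean, median, IQR) for each bracket.
grouped = df.groupby("age_group", observed=True)["finish_minutes"]
summary = grouped.agg(["count", "mean", "median", iqr])

# One-way ANOVA: do mean finish times differ across the brackets overall?
samples = [values.to_numpy() for _, values in grouped]
f_stat, p_value = stats.f_oneway(*samples)

# Tukey's HSD: which specific pairs of brackets differ at alpha = 0.05?
tukey = pairwise_tukeyhsd(
    endog=df["finish_minutes"].to_numpy(),
    groups=df["age_group"].astype(str).to_numpy(),
    alpha=0.05,
)

print(summary.round(1))
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3g}")
print(tukey.summary())
```

The ANOVA only answers whether at least one group mean differs; the Tukey table is what identifies the specific pairs, while controlling the family-wise error rate across all 15 pairwise comparisons.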

Sample Visualizations

Preview all figures and analysis in the full report PDF:

📄 Download the Full Report (PDF)
🔗 Preview PDF Online
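
For readers who want a feel for the figure style without opening the PDF, a per-age-group boxplot of finish times could be produced along these lines. This is a minimal matplotlib-only sketch on synthetic data; the bracket labels, the group centers, and the output filename are hypothetical and do not come from the report.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic finish times in minutes (NOT the real race data). The bracket
# labels and the per-group centers are assumptions for illustration only.
rng = np.random.default_rng(0)
labels = ["18-29", "30-39", "40-49", "50-59", "60-69", "70+"]
centers = [235, 240, 248, 258, 272, 290]
data = [rng.normal(center, 25, size=100) for center in centers]

fig, ax = plt.subplots(figsize=(8, 4.5))
ax.boxplot(data)  # one box per age bracket, drawn at positions 1..6

# Label the boxes explicitly so the code does not depend on a keyword
# argument whose name differs between matplotlib versions.
ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels)
ax.set_xlabel("Age group")
ax.set_ylabel("Finish time (minutes)")
ax.set_title("Finish time by age group (synthetic example)")

fig.tight_layout()
# Hypothetical output filename; 300 dpi is a common choice for print.
fig.savefig("finish_times_by_age_group.png", dpi=300)
```

A boxplot is a reasonable first view here because it shows the median, the IQR, and outliers for every bracket side by side, which matches the descriptive statistics reported for each group.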

Tools & Technologies

Python (pandas, matplotlib, seaborn, scipy, statsmodels): Data wrangling, visualization, and statistical testing.
Jupyter Notebook: Reproducible analysis and reporting.
LaTeX/PDF Export: High-quality final report.

Key Insights

ANOVA showed statistically significant differences in mean finish times across the six female age groups.
Older and younger groups exhibited distinct performance patterns, with clear trends highlighted in visualizations.
Provided evidence-based recommendations for athletes and race organizers.

See the full analysis and visualizations:

📄 View the PDF Report
GitHub Repository & Source Code