Case Study · Data Analytics · Portfolio Project

Apple Sales
Analytics
Workspace

Comprehensive data analysis, machine learning & decision-support dashboard

From static spreadsheets to an interactive end-to-end analytics workflow — statistics, clustering, regression, constrained optimisation, and reproducible exports in one cohesive Streamlit application.

iPhone
Unit sales
iPad
Unit sales
Mac
Unit sales
Wearables
Unit sales
Services
Revenue
11
App Modules
5
ML Models
3
Clustering Algos
1
Source of Truth
Scroll
01 — Problem Statement

From Static Spreadsheets
to Interactive Intelligence

Regional sales data for hardware categories and Services revenue is typically locked in static spreadsheets. Stakeholders needed comparative statistics, segmentation, predictive models, and repeatable exports — without standing up a full BI stack.

Objective How It's Addressed
Single source of truth One file: apple_sales_2024.csv, loaded and column-normalised in src/data/loader.py
Reusable analytics Logic lives in src/ (statistics, models, charts, insights) — not embedded in button callbacks
Interactive exploration Streamlit sidebar filters (regions, min iPhone volume) across 11 focused tab views
Trust & reproducibility Validation tab with cross-validation; Export tab writes joblib + JSON + CSV under src/models/saved/
Guided presentation "Launch Guided Demo" in app/components/demo.py locks navigation and walks through the workflow
02 — Data Pipeline

Ingest · Engineer · Filter

A deterministic, reproducible pipeline from raw CSV to enriched dataframe — cached at every layer to avoid redundant I/O across Streamlit reruns.

CSV
apple_sales_2024.csv · raw data
loader.py
Strip headers · normalise columns · @st.cache_data
builder.py
Feature engineering · derived ratios
Sidebar
Region filter · iPhone threshold
Tab Renderers
Shared filtered dataframe across all pages
// Engineered Feature Overview — Derived from builder.py
Total Hardware
iPhone + iPad + Mac + Wearables
Sum col.
Rev / Unit
Services ÷ Total Hardware
Ratio
iPhone Share
iPhone ÷ Total Hardware
Market %

Data Quality Layer

Reliability Index
Heuristic score from completeness, z-score outliers (>3σ), negative values, and zero-variance columns. Drives the weighted quality gauge.
Real computation
IQR Outlier Detection
Box plots and 1.5×IQR bounds with suggested mitigation narratives — prune, winsorize, or median impute — shown as UI guidance strings.
Real computation
Drift Simulation
Resampled, jittered Services_Revenue slice against the mean — demonstrates how a drift alarm would behave. Labeled as synthetic in the implementation.
Pedagogical demo
Schema Audit
Record count, missing values, duplicate rows, dtypes, and memory usage — returned by get_data_quality() alongside the main load.
Real computation
03 — Application Modules

11 Focused Views

Each module shares the same filtered dataframe. Switching filters in the sidebar propagates consistently across every tab — no stale state, no divergent views.

01
Live Mission Control
The command centre for real-time monitoring and high-level system status. Provides an immediate pulse on global sales performance integrating KPI tiles, synthetic sparklines (deterministic RNG, fixed seed), and anomaly-style callouts. Not a live telemetry backend — values are deterministic and reproducible.
KPI tiles Synthetic sparklines Anomaly callouts IF/THEN recommendations
83% GROWTH $2.4B REVENUE 94% MARGIN 12 REGIONS SERVICES REVENUE TREND System healthy · 0 anomalies detected EMEA ↑ 12% · APAC ↑ 8% · Americas ↑ 4% Recommendation: Increase iPad allocation in APAC
02
Executive Summary
A distilled, high-impact view designed for C-suite decision-makers. Highlights the most critical KPIs, profitability metrics, and strategic growth trends — allowing rapid assessment of organisational health and market share analysis without wading through raw data.
Critical KPIs Market share Growth trends Profitability metrics
Executive Dashboard $4.2B TOTAL iPhone 45% iPad 20% Mac 13% Wear 7% Services 15%
03
Overview
A comprehensive look at the entire data landscape, bridging the gap between high-level summaries and granular analytics. Features comparative regional analysis and detailed product performance distributions across major global markets.
Regional comparison Product distributions Global markets
REGIONAL PERFORMANCE AMERICAS EMEA APAC LATAM MEA
04
Data Quality
The foundational layer ensuring analytical integrity. Performs automated audits of missing values, schema consistency, and data coverage, providing a weighted quality index. Outlier detection via IQR (1.5×) with suggested mitigation narratives. Drift detection is demonstrative.
Reliability index IQR outlier detection Schema audit Drift demo
92% QUALITY INDEX Completeness No nulls 2 outliers
05
Geospatial
Interactive mapping and spatial analysis to identify regional hot-spots and logistical optimisation opportunities. Utilises heatmaps and localised deep-dives to visualise sales density and market penetration across global territories.
Choropleth maps Sales heatmaps Market penetration Regional deep-dive
Americas · EMEA · APAC · LATAM
06
Analytics
Advanced statistical analysis of sales drivers, correlations, and feature importance. Explores dependencies between product lines via Welch t-tests, one-way ANOVA, and Pearson correlation — appropriate for exploratory cross-sectional analysis, not causal claims without domain design.
Welch t-test One-way ANOVA Pearson correlation Feature importance
CORRELATION MATRIX iPhone iPad Mac Wear. Svc.
07
Advanced Segmentation
ML-driven cluster analysis to uncover hidden patterns and targetable cohorts. K-Means, Agglomerative Clustering, and DBSCAN with optional silhouette-driven k-selection. Cluster narratives generated by comparing centroids to global means.
K-Means Agglomerative DBSCAN Silhouette k 3D scatter
iPhone Sales Services Rev.
08
Predictions
Forward-looking forecasting with regression and time-series. Linear, polynomial, Random Forest, and Gradient Boosting pipelines. Auto-selects features via correlation + quick RandomForest importance. SLSQP constrained optimisation maximises predicted Services revenue under hardware-mix constraints. ARIMA demo on resampled series.
Linear & Polynomial Random Forest Gradient Boosting SLSQP optimisation ARIMA demo What-if sliders
ACTUAL vs PREDICTED R² = 0.94
09
Validation & Backtesting
Rigorous testing of predictive models against historical data. K-fold cross_validate on a Random Forest pipeline, on-demand 5-fold CV across linear/RF/GB, and TimeSeriesSplit forward-chaining. Band-style visualisation comparing predictions to held-out points with approximate ±1.96×RMSE band.
K-fold CV TimeSeriesSplit ±1.96×RMSE band Sensitivity analysis
5-FOLD CROSS VALIDATION F1 F2 F3 F4 F5 Mean CV R² = 0.91 ± 0.03
10
Export & Reproducibility
Trains a Random Forest pipeline on active features, saves joblib model, emits a JSON snapshot (timestamp, feature list, hyperparameters from fitted estimator), and attaches prediction CSV outputs. Supports "train once, download artifacts" workflows for notebooks or batch jobs.
joblib model JSON config Predictions CSV Metadata lineage
EXPORT ARTIFACTS model.joblib RF pipeline { } config.json hyperparams preds.csv outputs src/models/saved/ — versioned artifacts
11
C-Suite AI Copilot
Chat UI that routes natural-language-style prompts by keyword to the same deterministic metrics and recommendation functions used across the app. No external LLM API — responses pull region-level stats or generate_business_recommendations(). Honest about its implementation: keyword routing, not true NLP.
Keyword routing Region-level stats IF/THEN recommendations No external LLM
What drove EMEA dip last Q? EMEA analysis: Services revenue −8% vs prior. iPhone attach rate below threshold. Generate strategy report Recommendation: Prioritise Services bundling in EMEA...
04 — Analytics Stack

Six Analytical Capabilities

A full statistical and machine learning pipeline implemented in a clean src/ layer — decoupled from the UI, independently testable, and ready for extension.

Statistical Inference
Welch t-test, one-way ANOVA, and Pearson correlation between regional and metric comparisons. Appropriate for exploratory cross-sectional analysis — results are flagged as exploratory, not causal.
ttest_ind f_oneway pearsonr equal_var=False
Clustering
StandardScaler normalisation → K-Means, Agglomerative, or DBSCAN. Optional silhouette-driven k selection. Cluster narratives map centroids vs global means to strategy labels — surfaced in the Segmentation tab.
KMeans AgglomerativeClustering DBSCAN silhouette_score
Regression Pipelines
Linear, Polynomial (StandardScaler → PolynomialFeatures → LinearRegression), Random Forest, and Gradient Boosting inside sklearn Pipelines. Actual vs predicted, residuals, and feature importance plots.
LinearRegression PolynomialFeatures RandomForestRegressor GradientBoostingRegressor
Constrained Optimisation
SLSQP (scipy.optimize.minimize) with a linear sum constraint on hardware units, maximising predicted Services revenue under the currently selected fitted model. What-if sliders surface the result interactively.
scipy.optimize.minimize SLSQP Linear sum constraint
Model Validation
K-fold cross_validate on a Random Forest pipeline, 5-fold CV across model families, and TimeSeriesSplit forward-chaining on row-ordered data. Band-style visualisation with ±1.96×RMSE confidence intervals.
cross_validate TimeSeriesSplit GridSearchCV
Insight Engine
IF-THEN style recommendations from linear regression coefficients between hardware columns and Services revenue, regional efficiency aggregates, and Wearables attach-rate heuristics. Powers Mission Control, Executive Summary, and the Copilot.
insights.py Coefficient analysis Attach-rate heuristics
Featured — Constrained Optimisation
Hardware Mix → Maximum Services Revenue
maximise: f_services(iPhone, iPad, Mac, Wearables) subject to: iPhone + iPad + Mac + Wearables = total_units iPhone ≥ 0, iPad ≥ 0, Mac ≥ 0, Wearables ≥ 0 solver: scipy.optimize.minimize(method='SLSQP')
The solver identifies the hardware allocation that maximises predicted Services revenue under the constraint that total unit count remains fixed — giving actionable product-mix guidance directly from the active fitted regression model.
// ML Segmentation — Cluster Profiles (K-Means, k=4)
C1
High-Value Markets
Above-average iPhone and Services revenue. High efficiency ratio. Primary growth targets.
KMeans · centroid above global μ on iPhone + Services
C2
iPad-Led Regions
Strong iPad attach, moderate Mac. Lower Wearables share. Education or enterprise-skewed markets.
KMeans · iPad share > 1σ above mean
C3
Emerging Markets
Below-average volumes across all hardware lines. High growth potential, currently under-penetrated.
KMeans · all features below global μ
C4
Wearables Champions
High Wearables attach rate relative to iPhone base. Services revenue strong. Premium lifestyle segment.
KMeans · Wearables > 1.5σ above mean
05 — Architecture

Layered Design

A strict separation between presentation (app/) and analytics (src/) keeps the codebase maintainable, independently testable, and ready for extension.

app/ — Presentation Layer
Streamlit UI
  • main.py — Entry: load → features → sidebar → KPIs → tab router
  • sidebar.py — Filters, nav, demo launcher
  • kpi.py — KPI card components
  • styles.py — Global dark CSS system
  • demo.py — Guided walkthrough overlay
  • tabs/* — One module per page (11 total)
src/ — Analytics Layer
Pure Python Logic
  • data/loader.py — CSV load + quality dict + @cache
  • features/builder.py — Derived feature engineering
  • analysis/statistics.py — t-test, ANOVA, Pearson
  • analysis/insights.py — IF/THEN recommendation engine
  • models/regression.py — Linear, Poly, RF, GB, CV
  • models/clustering.py — K-Means, Agg., DBSCAN
  • visualization/charts.py — Plotly dark theme system
. ├── app/ │ ├── main.py # Entry point: load → features → sidebar → tab router │ └── components/ │ ├── sidebar.py # Region filter, nav, guided demo launcher │ ├── styles.py # Global Streamlit CSS (dark card layout) │ ├── kpi.py demo.py # KPI cards & presentation overlay │ └── tabs/ # One module per page (11 tab files) ├── src/ │ ├── data/ loader.py # CSV ingest, normalise, @st.cache_data │ ├── features/ builder.py # Total sales, rev/unit, iPhone share │ ├── analysis/ statistics.py insights.py │ ├── models/ regression.py clustering.py │ │ └── saved/ # joblib + JSON + CSV (created by Export tab) │ └── visualization/ charts.py # Plotly helpers + apply_minimal_styling() ├── data/raw/ │ └── apple_sales_2024.csv # Single source of truth ├── requirements.txt README.md CASE_STUDY.md APP_GUIDE.md └── Comprehensive Data Analysis ... .ipynb
06 — Technology

The Tech Stack

Seven packages, each with a single responsibility. No bloat, no redundancy — every dependency earns its place.

Streamlit
App shell, caching, widgets, sidebar filters
pandas
Data handling, column normalisation, filtering
Plotly
Interactive charts with shared dark template
scikit-learn
Regression, clustering, CV, pipelines, scalers
scipy
Statistical tests + SLSQP constrained optimisation
statsmodels
ARIMA demonstration block in Predictions tab
NumPy
Numerics, array operations, synthetic RNG
07 — End-to-End Workflow

Ingest → Insight → Export

A linear workflow through six stages — each building on the last, with reproducible artifacts at every step.

01
Ingest & Normalise
load_data() reads the CSV, strips headers, normalises column names to stable internal identifiers. @st.cache_data prevents redundant I/O across reruns. get_data_quality() audits the frame simultaneously.
loader.py
02
Engineer Features
builder.py derives Total_Product_Sales, Revenue_Per_Unit, and iPhone_Market_Share — the three compound features that power ratios, rankings, and model inputs downstream.
builder.py
03
Filter via Sidebar
Region multiselect and minimum iPhone sales threshold propagate a single shared filtered_df to all 11 tabs simultaneously. No stale views, no manual refresh.
sidebar.py
04
Explore & Analyse
Statistical inference (t-tests, ANOVA, correlations), geospatial mapping, and advanced segmentation (K-Means / Agglomerative / DBSCAN) surface hidden patterns across the filtered dataset.
statistics.py clustering.py
05
Model & Validate
Auto-selected features feed regression pipelines (Linear, Poly, RF, GB). K-fold CV and TimeSeriesSplit stress-test generalisability. SLSQP optimises the hardware mix for maximum Services revenue.
regression.py validation.py
06
Export Artifacts
Train-once workflow writes joblib model, JSON config snapshot (timestamp + feature list + hyperparameters), and predictions CSV under src/models/saved/ — ready for notebooks or downstream batch jobs.
export.py joblib + JSON + CSV
08 — Limitations & Honest Scope

What This Is — and Isn't

Intellectual honesty is part of the design. These boundaries are explicit in the code documentation, not buried in fine print.

Data
Toy / Stylised Dataset
apple_sales_2024.csv is a stylised dataset for analysis exercises. It does not represent Apple's official reporting or actual retail data.
Metrics
In-Sample vs. Generalised
Many training metrics are in-sample unless the Validation tab is used. Cross-validated and TimeSeriesSplit metrics are the correct generalisation indicators.
Recommendations
Structured Storytelling
Business text and dollar figures in insights are heuristic (linear shortcuts + scaling). Treat them as discussion prompts, not compliance-checked financial advice.
Copilot
Keyword Routing, Not NLP
The C-Suite AI Copilot routes on keywords (risk, report, strategy) to deterministic functions. It is not connected to any external LLM API and does not understand semantics.
11
App Modules
5
Regression Models
3
Clustering Algos
3
Derived Features
7
Python Packages
1
Source of Truth