Apple Sales Analytics | Shahid Ul Islam

01 — Problem Statement

From Static Spreadsheets
to Interactive Intelligence

Regional sales data for hardware categories and Services revenue is typically locked in static spreadsheets. Stakeholders needed comparative statistics, segmentation, predictive models, and repeatable exports — without standing up a full BI stack.

Objective	How It's Addressed
Single source of truth	One file: `apple_sales_2024.csv`, loaded and column-normalised in `src/data/loader.py`
Reusable analytics	Logic lives in `src/` (statistics, models, charts, insights) — not embedded in button callbacks
Interactive exploration	Streamlit sidebar filters (regions, min iPhone volume) across 11 focused tab views
Trust & reproducibility	Validation tab with cross-validation; Export tab writes `joblib` + JSON + CSV under `src/models/saved/`
Guided presentation	"Launch Guided Demo" in `app/components/demo.py` locks navigation and walks through the workflow

02 — Data Pipeline

Ingest · Engineer · Filter

A deterministic, reproducible pipeline from raw CSV to enriched dataframe — cached at every layer to avoid redundant I/O across Streamlit reruns.

CSV

apple_sales_2024.csv · raw data

loader.py

Strip headers · normalise columns · @st.cache_data

builder.py

Feature engineering · derived ratios

Sidebar

Region filter · iPhone threshold

Tab Renderers

Shared filtered dataframe across all pages

// Engineered Feature Overview — Derived from builder.py

Total Hardware

iPhone + iPad + Mac + Wearables

Sum col.

Rev / Unit

Services ÷ Total Hardware

Ratio

iPhone Share

iPhone ÷ Total Hardware

Market %

Data Quality Layer

Reliability Index

Heuristic score from completeness, z-score outliers (>3σ), negative values, and zero-variance columns. Drives the weighted quality gauge.

Real computation

IQR Outlier Detection

Box plots and 1.5×IQR bounds with suggested mitigation narratives — prune, winsorize, or median impute — shown as UI guidance strings.

Real computation

Drift Simulation

Resampled, jittered Services_Revenue slice against the mean — demonstrates how a drift alarm would behave. Labeled as synthetic in the implementation.

Pedagogical demo

Schema Audit

Record count, missing values, duplicate rows, dtypes, and memory usage — returned by get_data_quality() alongside the main load.

Real computation

03 — Application Modules

11 Focused Views

Each module shares the same filtered dataframe. Switching filters in the sidebar propagates consistently across every tab — no stale state, no divergent views.

Live Mission Control

The command centre for real-time monitoring and high-level system status. Provides an immediate pulse on global sales performance integrating KPI tiles, synthetic sparklines (deterministic RNG, fixed seed), and anomaly-style callouts. Not a live telemetry backend — values are deterministic and reproducible.

KPI tiles Synthetic sparklines Anomaly callouts IF/THEN recommendations

Executive Summary

A distilled, high-impact view designed for C-suite decision-makers. Highlights the most critical KPIs, profitability metrics, and strategic growth trends — allowing rapid assessment of organisational health and market share analysis without wading through raw data.

Critical KPIs Market share Growth trends Profitability metrics

Overview

A comprehensive look at the entire data landscape, bridging the gap between high-level summaries and granular analytics. Features comparative regional analysis and detailed product performance distributions across major global markets.

Regional comparison Product distributions Global markets

Data Quality

The foundational layer ensuring analytical integrity. Performs automated audits of missing values, schema consistency, and data coverage, providing a weighted quality index. Outlier detection via IQR (1.5×) with suggested mitigation narratives. Drift detection is demonstrative.

Reliability index IQR outlier detection Schema audit Drift demo

Geospatial

Interactive mapping and spatial analysis to identify regional hot-spots and logistical optimisation opportunities. Utilises heatmaps and localised deep-dives to visualise sales density and market penetration across global territories.

Choropleth maps Sales heatmaps Market penetration Regional deep-dive

Analytics

Advanced statistical analysis of sales drivers, correlations, and feature importance. Explores dependencies between product lines via Welch t-tests, one-way ANOVA, and Pearson correlation — appropriate for exploratory cross-sectional analysis, not causal claims without domain design.

Welch t-test One-way ANOVA Pearson correlation Feature importance

Advanced Segmentation

ML-driven cluster analysis to uncover hidden patterns and targetable cohorts. K-Means, Agglomerative Clustering, and DBSCAN with optional silhouette-driven k-selection. Cluster narratives generated by comparing centroids to global means.

K-Means Agglomerative DBSCAN Silhouette k 3D scatter

Predictions

Forward-looking forecasting with regression and time-series. Linear, polynomial, Random Forest, and Gradient Boosting pipelines. Auto-selects features via correlation + quick RandomForest importance. SLSQP constrained optimisation maximises predicted Services revenue under hardware-mix constraints. ARIMA demo on resampled series.

Linear & Polynomial Random Forest Gradient Boosting SLSQP optimisation ARIMA demo What-if sliders

Validation & Backtesting

Rigorous testing of predictive models against historical data. K-fold cross_validate on a Random Forest pipeline, on-demand 5-fold CV across linear/RF/GB, and TimeSeriesSplit forward-chaining. Band-style visualisation comparing predictions to held-out points with approximate ±1.96×RMSE band.

K-fold CV TimeSeriesSplit ±1.96×RMSE band Sensitivity analysis

Export & Reproducibility

Trains a Random Forest pipeline on active features, saves joblib model, emits a JSON snapshot (timestamp, feature list, hyperparameters from fitted estimator), and attaches prediction CSV outputs. Supports "train once, download artifacts" workflows for notebooks or batch jobs.

joblib model JSON config Predictions CSV Metadata lineage

C-Suite AI Copilot

Chat UI that routes natural-language-style prompts by keyword to the same deterministic metrics and recommendation functions used across the app. No external LLM API — responses pull region-level stats or generate_business_recommendations(). Honest about its implementation: keyword routing, not true NLP.

Keyword routing Region-level stats IF/THEN recommendations No external LLM

04 — Analytics Stack

Six Analytical Capabilities

A full statistical and machine learning pipeline implemented in a clean src/ layer — decoupled from the UI, independently testable, and ready for extension.

Statistical Inference

Welch t-test, one-way ANOVA, and Pearson correlation between regional and metric comparisons. Appropriate for exploratory cross-sectional analysis — results are flagged as exploratory, not causal.

ttest_ind f_oneway pearsonr equal_var=False

Clustering

StandardScaler normalisation → K-Means, Agglomerative, or DBSCAN. Optional silhouette-driven k selection. Cluster narratives map centroids vs global means to strategy labels — surfaced in the Segmentation tab.

KMeans AgglomerativeClustering DBSCAN silhouette_score

Regression Pipelines

Linear, Polynomial (StandardScaler → PolynomialFeatures → LinearRegression), Random Forest, and Gradient Boosting inside sklearn Pipelines. Actual vs predicted, residuals, and feature importance plots.

LinearRegression PolynomialFeatures RandomForestRegressor GradientBoostingRegressor

Constrained Optimisation

SLSQP (scipy.optimize.minimize) with a linear sum constraint on hardware units, maximising predicted Services revenue under the currently selected fitted model. What-if sliders surface the result interactively.

scipy.optimize.minimize SLSQP Linear sum constraint

Model Validation

K-fold cross_validate on a Random Forest pipeline, 5-fold CV across model families, and TimeSeriesSplit forward-chaining on row-ordered data. Band-style visualisation with ±1.96×RMSE confidence intervals.

cross_validate TimeSeriesSplit GridSearchCV

Insight Engine

IF-THEN style recommendations from linear regression coefficients between hardware columns and Services revenue, regional efficiency aggregates, and Wearables attach-rate heuristics. Powers Mission Control, Executive Summary, and the Copilot.

insights.py Coefficient analysis Attach-rate heuristics

Featured — Constrained Optimisation

Hardware Mix → Maximum Services Revenue

maximise: f_services(iPhone, iPad, Mac, Wearables) subject to: iPhone + iPad + Mac + Wearables = total_units iPhone ≥ 0, iPad ≥ 0, Mac ≥ 0, Wearables ≥ 0 solver: scipy.optimize.minimize(method='SLSQP')

The solver identifies the hardware allocation that maximises predicted Services revenue under the constraint that total unit count remains fixed — giving actionable product-mix guidance directly from the active fitted regression model.

// ML Segmentation — Cluster Profiles (K-Means, k=4)

High-Value Markets

Above-average iPhone and Services revenue. High efficiency ratio. Primary growth targets.

KMeans · centroid above global μ on iPhone + Services

iPad-Led Regions

Strong iPad attach, moderate Mac. Lower Wearables share. Education or enterprise-skewed markets.

KMeans · iPad share > 1σ above mean

Emerging Markets

Below-average volumes across all hardware lines. High growth potential, currently under-penetrated.

KMeans · all features below global μ

Wearables Champions

High Wearables attach rate relative to iPhone base. Services revenue strong. Premium lifestyle segment.

KMeans · Wearables > 1.5σ above mean

05 — Architecture

Layered Design

A strict separation between presentation (app/) and analytics (src/) keeps the codebase maintainable, independently testable, and ready for extension.

app/ — Presentation Layer

Streamlit UI

main.py — Entry: load → features → sidebar → KPIs → tab router
sidebar.py — Filters, nav, demo launcher
kpi.py — KPI card components
styles.py — Global dark CSS system
demo.py — Guided walkthrough overlay
tabs/* — One module per page (11 total)

src/ — Analytics Layer

Pure Python Logic

data/loader.py — CSV load + quality dict + @cache
features/builder.py — Derived feature engineering
analysis/statistics.py — t-test, ANOVA, Pearson
analysis/insights.py — IF/THEN recommendation engine
models/regression.py — Linear, Poly, RF, GB, CV
models/clustering.py — K-Means, Agg., DBSCAN
visualization/charts.py — Plotly dark theme system

. ├── app/ │ ├── main.py # Entry point: load → features → sidebar → tab router │ └── components/ │ ├── sidebar.py # Region filter, nav, guided demo launcher │ ├── styles.py # Global Streamlit CSS (dark card layout) │ ├── kpi.py demo.py # KPI cards & presentation overlay │ └── tabs/ # One module per page (11 tab files) ├── src/ │ ├── data/ loader.py # CSV ingest, normalise, @st.cache_data │ ├── features/ builder.py # Total sales, rev/unit, iPhone share │ ├── analysis/ statistics.py insights.py │ ├── models/ regression.py clustering.py │ │ └── saved/ # joblib + JSON + CSV (created by Export tab) │ └── visualization/ charts.py # Plotly helpers + apply_minimal_styling() ├── data/raw/ │ └── apple_sales_2024.csv # Single source of truth ├── requirements.txt README.md CASE_STUDY.md APP_GUIDE.md └── Comprehensive Data Analysis ... .ipynb

06 — Technology

The Tech Stack

Seven packages, each with a single responsibility. No bloat, no redundancy — every dependency earns its place.

Streamlit

App shell, caching, widgets, sidebar filters

pandas

Data handling, column normalisation, filtering

Plotly

Interactive charts with shared dark template

scikit-learn

Regression, clustering, CV, pipelines, scalers

scipy

Statistical tests + SLSQP constrained optimisation

statsmodels

ARIMA demonstration block in Predictions tab

NumPy

Numerics, array operations, synthetic RNG

07 — End-to-End Workflow

Ingest → Insight → Export

A linear workflow through six stages — each building on the last, with reproducible artifacts at every step.

Ingest & Normalise

load_data() reads the CSV, strips headers, normalises column names to stable internal identifiers. @st.cache_data prevents redundant I/O across reruns. get_data_quality() audits the frame simultaneously.

loader.py

Engineer Features

builder.py derives Total_Product_Sales, Revenue_Per_Unit, and iPhone_Market_Share — the three compound features that power ratios, rankings, and model inputs downstream.

builder.py

Filter via Sidebar

Region multiselect and minimum iPhone sales threshold propagate a single shared filtered_df to all 11 tabs simultaneously. No stale views, no manual refresh.

sidebar.py

Explore & Analyse

Statistical inference (t-tests, ANOVA, correlations), geospatial mapping, and advanced segmentation (K-Means / Agglomerative / DBSCAN) surface hidden patterns across the filtered dataset.

statistics.py clustering.py

Model & Validate

Auto-selected features feed regression pipelines (Linear, Poly, RF, GB). K-fold CV and TimeSeriesSplit stress-test generalisability. SLSQP optimises the hardware mix for maximum Services revenue.

regression.py validation.py

Export Artifacts

Train-once workflow writes joblib model, JSON config snapshot (timestamp + feature list + hyperparameters), and predictions CSV under src/models/saved/ — ready for notebooks or downstream batch jobs.

export.py joblib + JSON + CSV

08 — Limitations & Honest Scope

What This Is — and Isn't

Intellectual honesty is part of the design. These boundaries are explicit in the code documentation, not buried in fine print.

Data
Toy / Stylised Dataset
apple_sales_2024.csv is a stylised dataset for analysis exercises. It does not represent
            Apple's official reporting or actual retail data.
Metrics
In-Sample vs. Generalised
Many training metrics are in-sample unless the Validation tab is used. Cross-validated
            and TimeSeriesSplit metrics are the correct generalisation indicators.
Recommendations
Structured Storytelling
Business text and dollar figures in insights are heuristic (linear shortcuts + scaling).
            Treat them as discussion prompts, not compliance-checked financial advice.
Copilot
Keyword Routing, Not NLP
The C-Suite AI Copilot routes on keywords (risk, report, strategy) to deterministic
            functions. It is not connected to any external LLM API and does not understand semantics.

App Modules

Regression Models

Clustering Algos

Derived Features

Python Packages

Source of Truth

Apple SalesAnalyticsWorkspace

From Static Spreadsheetsto Interactive Intelligence

Ingest · Engineer · Filter

Data Quality Layer

11 Focused Views

Six Analytical Capabilities

Layered Design

The Tech Stack

Ingest → Insight → Export

What This Is — and Isn't

Apple Sales
Analytics
Workspace

From Static Spreadsheets
to Interactive Intelligence