Orchestrating 12 ML Models Daily for Retail Execution at Scale
Inside the architecture of an AI-driven sales execution platform that runs 12 ML models daily across 10,000+ retail outlets and 100,000+ SKUs. We cover the medallion architecture, OmegaConf-based model configuration, and the orchestration patterns that keep it all running on Databricks.
There is a particular kind of complexity that emerges when you combine machine learning with retail execution at scale. Individual ML models are well-understood. Running a demand forecasting model or a customer segmentation algorithm in isolation is a solved problem. But orchestrating twelve interdependent models that must run daily against 40+ data sources, produce actionable recommendations for 10,000+ retail outlets across five major US retail chains, and have the results ready before field teams start their morning — that is an engineering problem as much as a data science one.
This post details how we built and operate the AI-driven sales execution platform for a global FMCG leader. The platform ingests data from over 40 sources, processes it through a medallion architecture with 25+ bronze tables, 30+ silver tables, and a curated gold layer, then runs 12+ ML models daily to produce prioritized action plans for field sales teams.
The Scale of the Problem
To appreciate the engineering challenges, consider the numbers. The platform covers 5 major US retail chains, each with its own data formats, delivery schedules, and API quirks. Across these chains, there are over 10,000 individual retail outlets (stores). The product catalog spans 100,000+ SKUs across multiple categories. Each day, the platform must ingest the latest point-of-sale data, inventory levels, promotional calendars, pricing feeds, distribution data, and competitive intelligence, then run the full model suite and deliver recommendations before 6 AM Eastern.
The data sources include retailer POS feeds (typically delivered as flat files via SFTP between midnight and 3 AM), syndicated data from third-party providers, internal sales planning systems, promotional calendars, master data from SAP, competitive pricing data from web scraping services, and weather data (yes, weather affects retail execution more than you might expect).
Each data source has its own delivery schedule, format, and reliability profile. Some arrive reliably at the same time each night. Others are sporadic. Some fail silently — the file arrives but is empty or truncated. The platform must handle all of these failure modes while still producing results by the morning deadline.
Medallion Architecture at This Scale
We use the medallion architecture (bronze, silver, gold) on Databricks, but at this scale, each layer has specific design considerations that go beyond the typical tutorial examples.
The bronze layer consists of 25+ tables representing raw data from every source. These tables are append-only Delta tables that preserve the exact data as received, including any malformation. Bronze tables are partitioned by ingestion date and source, enabling efficient time-travel queries for debugging.
The critical design decision at the bronze layer is the separation of ingestion and validation. We ingest first, validate second. This means that even if a data source delivers corrupt data, we have a record of what was received. The validation step flags issues and routes them to our data quality monitoring, but does not prevent the data from being persisted. This has saved us countless hours of debugging when upstream systems change their output format without notice.
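The ingest-first, validate-second pattern can be sketched as a post-persistence check that only flags problems. The function and field names below are illustrative, not the platform's actual code; the point is that validation produces issues for the data quality monitor rather than rejecting the delivery.

```python
from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    source: str
    issues: list = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.issues

def validate_delivery(source: str, rows: list, expected_min_rows: int,
                      required_columns: set) -> ValidationResult:
    """Run AFTER the raw delivery is already persisted to bronze.

    Issues are routed to data-quality monitoring; nothing is rejected,
    so a record of what was received always survives.
    """
    result = ValidationResult(source)
    if len(rows) == 0:
        # The silent-failure mode: the file arrived but is empty.
        result.issues.append("empty delivery")
    elif len(rows) < expected_min_rows:
        result.issues.append(f"possible truncation: only {len(rows)} rows")
    if rows and (missing := required_columns - rows[0].keys()):
        # Upstream changed its output format without notice.
        result.issues.append(f"missing columns: {sorted(missing)}")
    return result
```

Because the check runs against data that is already on disk, a schema change upstream degrades into an alert rather than a lost night of data.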
The silver layer contains 30+ tables with cleaned, conformed, and enriched data. This is where the heavy transformation work happens: deduplication, schema normalization, slowly changing dimension handling, and cross-source entity resolution. The most complex silver transformations are the entity resolution tasks — matching products across retailers (the same SKU might have different identifiers at each chain) and matching store locations to our internal outlet master.
Entity resolution deserves its own discussion. Retailers do not use consistent identifiers for the same product. A 12-pack of a beverage might have one UPC at one chain and a different one at another, or the same UPC but different internal item numbers. We maintain a product master table with mappings, but these mappings require constant maintenance as new products launch and retailers reorganize their catalogs. The silver layer transformations handle this matching using a combination of exact UPC matches, fuzzy string matching on product descriptions, and manual override tables maintained by the product management team.
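The matching cascade described above can be sketched in miniature. This is a simplified stand-in for the silver-layer logic (names, thresholds, and the use of `difflib` for fuzzy matching are assumptions), but it shows the precedence order: manual overrides first, then exact UPC, then fuzzy description matching.

```python
from difflib import SequenceMatcher

def resolve_product(retailer_item: dict, master: list, overrides: dict,
                    fuzzy_threshold: float = 0.85):
    """Match one retailer item to the internal product master.

    Precedence: manual override > exact UPC > fuzzy description match.
    Returns the internal product_id, or None if nothing clears the bar.
    """
    # 1. Manual overrides (maintained by product management) win outright.
    key = (retailer_item["retailer"], retailer_item["item_number"])
    if key in overrides:
        return overrides[key]
    # 2. Exact UPC match.
    for product in master:
        if product["upc"] == retailer_item.get("upc"):
            return product["product_id"]
    # 3. Fuzzy match on lowercased product descriptions.
    best_id, best_score = None, 0.0
    desc = retailer_item["description"].lower()
    for product in master:
        score = SequenceMatcher(None, desc, product["description"].lower()).ratio()
        if score > best_score:
            best_id, best_score = product["product_id"], score
    return best_id if best_score >= fuzzy_threshold else None
```

Below-threshold fuzzy matches return None rather than a guess; those land in a review queue for the manual override table, which is how the mappings stay maintained as catalogs change.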
The gold layer is the model-ready data. These tables are structured specifically for ML model consumption — feature tables, training datasets, and prediction output tables. Gold tables are typically wider (more columns) and more denormalized than silver tables, because ML models consume features as flat arrays, not normalized relational structures.
The OmegaConf Configuration System
With 12+ ML models, each with its own set of hyperparameters, feature configurations, training schedules, and deployment settings, we needed a configuration system that could handle complexity without becoming a maintenance burden. We chose OmegaConf, a hierarchical configuration library for Python, and structured it to manage the entire model suite.
The configuration hierarchy has three levels. At the top is a base configuration that defines defaults shared across all models: the Databricks workspace settings, storage paths, logging configuration, and default training parameters. Below that are model-specific configurations that override the base defaults for each model type. At the bottom are environment-specific overrides (development, staging, production) that adjust parameters like compute cluster size and output destinations.
OmegaConf’s merge semantics handle this hierarchy cleanly. A model configuration inherits all base settings and only specifies the values that differ. An environment overlay then adjusts only the settings that differ between environments. The result is that each model’s effective configuration is the merge of all three levels, with more specific levels taking precedence.
A model configuration specifies the model’s name and version, input features (which gold tables and columns to use), the training configuration (algorithm, hyperparameters, cross-validation strategy), the inference configuration (batch size, output schema, prediction threshold), the schedule (daily, weekly, or triggered by data availability), dependencies (which other models or data tables must be current before this model runs), and monitoring thresholds (acceptable drift ranges for input features and model performance metrics).
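Laid out as a file, a model configuration covering those sections might look like the following sketch. Every name and value here is hypothetical; only the shape mirrors the structure described above:

```yaml
# Illustrative model configuration — names and values are hypothetical
name: demand_forecast_short_term
version: 2.3.1
features:
  tables: [gold.pos_weekly, gold.promo_calendar]
  columns: [units_sold, promo_flag, seasonality_index]
training:
  algorithm: lightgbm
  hyperparameters: {num_leaves: 63, learning_rate: 0.05}
  cv_strategy: time_series_split
inference:
  batch_size: 500000
  prediction_threshold: null   # regression model, no threshold
schedule: daily
dependencies: [gold.pos_weekly, gold.promo_calendar]
monitoring:
  feature_drift_max: 0.15
  max_wmape: 0.25
```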
This approach has a practical benefit that became apparent only after several months of operation: when a data scientist wants to experiment with a new hyperparameter setting or feature set, they create a new configuration variant without touching any pipeline code. The experiment runs alongside the production model using the same infrastructure, and the results are directly comparable because the only variable is the configuration.
The Model Suite
The twelve models fall into several categories, each serving a different aspect of retail execution.
Demand Forecasting Models (3 models): These predict unit sales at the store-SKU-week level for different planning horizons. The short-term model (1-2 weeks out) uses recent POS trends, promotional calendars, and seasonal patterns. The medium-term model (4-8 weeks) incorporates distribution pipeline data and new product launch schedules. The long-term model (13-26 weeks) is used for supply chain planning and relies more heavily on historical patterns and macro trends.
Customer Segmentation (1 model): Segments retail outlets by sales velocity, product mix, growth trajectory, and competitive landscape. The segmentation determines which outlets receive priority attention from field teams and influences the assortment recommendations.
Compliance Detection (2 models): These models identify stores where execution is likely non-compliant — incorrect pricing, missing displays, out-of-stock conditions. One model uses POS data anomalies to flag probable issues. The other uses image recognition data from field team photos to detect display compliance, though the image processing itself happens in a separate pipeline.
Pricing Optimization (2 models): One model optimizes promotional pricing based on price elasticity estimates. The other monitors competitive pricing and flags opportunities where price adjustments could improve market share without sacrificing margin.
Assortment Optimization (1 model): Recommends the optimal product assortment for each outlet based on local demand patterns, store demographics, and shelf space constraints.
KPI Prioritization (1 model): This is perhaps the most important model in the suite, because it sits downstream of all the others. It takes the outputs from every other model — demand forecasts, compliance flags, pricing recommendations, assortment suggestions — and produces a single prioritized action list for each outlet. The field sales representative visiting a store sees a ranked list of five to seven actions ordered by expected impact on revenue.
The KPI prioritization model uses a multi-objective optimization approach that balances revenue impact, execution feasibility (some actions are harder to implement than others), and strategic priority (the company may be emphasizing certain categories or brands in a given quarter). The weights for these objectives are configured through OmegaConf, allowing the commercial team to adjust strategic priorities without modifying the model.
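A stripped-down sketch of that weighted scoring, with hypothetical objective names and weights standing in for the real configuration:

```python
def prioritize(actions: list, weights: dict, top_n: int = 7) -> list:
    """Rank candidate actions by a weighted multi-objective score.

    Each action carries normalized scores in [0, 1]. The objective
    weights come from the OmegaConf config, so the commercial team can
    shift strategic priorities without touching model code.
    """
    def score(a: dict) -> float:
        return (weights["revenue"] * a["revenue_impact"]
                + weights["feasibility"] * a["feasibility"]
                + weights["strategic"] * a["strategic_fit"])
    # Highest expected impact first; cap at the 5-7 actions a rep can act on.
    return sorted(actions, key=score, reverse=True)[:top_n]
```

The real model is richer than a linear blend, but the configurable-weights idea is the part that lets a quarterly change in brand emphasis flow through as a config edit rather than a redeploy.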
Anomaly Detection (2 models): Cross-cutting models that monitor for unusual patterns in POS data and distribution data. These are less about generating actions and more about alerting the analytics team when something unexpected is happening — a sudden drop in sales at a cluster of stores, an unusual spike in returns, or a distribution disruption.
Orchestration: The Daily Run
The daily orchestration is managed through Databricks Workflows, with each stage represented as a task in a directed acyclic graph (DAG).
The DAG has four main phases. Phase one is data ingestion, which runs from midnight to 3 AM as data sources become available. Each source has its own task that polls for data availability, validates the delivery, and loads it into bronze tables. These tasks are configured with retry logic — if a source is not available at its expected time, the task retries with exponential backoff up to a configurable deadline.
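The retry behavior for a polling ingestion task can be sketched as follows. This is a simplified stand-in (the real tasks run as Databricks Workflow retries, and the delays and deadline here are invented), injected with a clock so it is testable:

```python
import time

def poll_with_backoff(check_available, deadline_s: float,
                      base_delay_s: float = 60, max_delay_s: float = 900,
                      sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll a data source until it delivers or the deadline passes.

    Delay doubles on each failed attempt, capped at max_delay_s so a
    late source is still checked regularly before the cutoff.
    """
    start, attempt = clock(), 0
    while clock() - start < deadline_s:
        if check_available():
            return True
        delay = min(base_delay_s * 2 ** attempt, max_delay_s)
        sleep(delay)
        attempt += 1
    return False  # escalate: the source missed its ingestion deadline
```

Returning False rather than raising keeps the decision of what to do next (alert, degrade, or abort) with the orchestration layer.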
Phase two is transformation, running from approximately 3 AM to 4:30 AM. Silver layer transformations depend on the bronze tables being current. Gold layer transformations depend on silver tables. The DAG expresses these dependencies explicitly, and Databricks Workflows handles the scheduling.
Phase three is model execution, running from 4:30 AM to 5:30 AM. The models have their own dependency graph. Demand forecasting can start as soon as the relevant gold tables are ready. Compliance detection depends on demand forecasts (it uses expected vs. actual sales as an input signal). KPI prioritization runs last because it depends on outputs from all other models.
Phase four is delivery, running from 5:30 AM to 6 AM. Model outputs are written to the delivery gold tables, which feed the mobile application used by field sales teams. A final validation step checks that all outlets have received recommendations and that the recommendation counts fall within expected ranges.
Each task in the DAG is configured with alerts for failure, timeout, and SLA breach. If any critical-path task fails after exhausting its retries, an alert fires to the on-call engineer’s phone. If the overall pipeline misses its 6 AM SLA, a broader alert goes to the engineering and business teams.
DAB Deployment Strategy
We deploy the platform using Databricks Asset Bundles (DAB), which package notebooks, libraries, workflows, and cluster configurations into versioned, deployable artifacts.
Each model has its own DAB bundle that contains the model’s training notebook, inference notebook, OmegaConf configuration files, Python library dependencies (specified as a requirements file), and the workflow definition for that model’s orchestration tasks.
The orchestration DAG itself is defined in a top-level DAB bundle that references the individual model bundles. This two-level structure means that a data scientist can update a model’s training logic and configuration without affecting the orchestration layer, while the platform team can modify the DAG structure without touching model code.
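A per-model bundle definition follows the standard DAB `databricks.yml` shape. The sketch below uses invented names, paths, and hosts; it shows only how a model's train/infer tasks and per-environment targets hang together in one bundle:

```yaml
# Sketch of a per-model DAB bundle — names, paths, and hosts are illustrative
bundle:
  name: demand-forecast-short-term

resources:
  jobs:
    demand_forecast_daily:
      tasks:
        - task_key: train
          notebook_task:
            notebook_path: ./notebooks/train.py
        - task_key: infer
          depends_on: [{task_key: train}]
          notebook_task:
            notebook_path: ./notebooks/infer.py

targets:
  staging:
    workspace:
      host: https://staging.example.cloud.databricks.com
  production:
    workspace:
      host: https://prod.example.cloud.databricks.com
```

`databricks bundle deploy -t staging` then promotes exactly this artifact through the environments, which is what keeps model code and orchestration independently deployable.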
DAB deployments go through the standard CI/CD pipeline: a pull request triggers automated tests (unit tests for transformation logic, integration tests against a development Databricks workspace), code review, and then deployment to staging followed by production. The staging environment runs the full daily pipeline against a subset of data to validate end-to-end correctness before production deployment.
Data Freshness and SLA Management
The tightest constraint on this platform is time. Every minute of delay in the pipeline is a minute less that field teams have to review their daily action plans before store visits begin.
We manage this through a combination of aggressive monitoring and graceful degradation. The monitoring system tracks the freshness of every table in the platform — the time elapsed since its last successful update. Each table has a freshness SLA, and a dashboard shows the current freshness status across the entire data estate in real time.
Graceful degradation means that the platform can produce useful outputs even when some data sources are late or missing. If a retailer’s POS feed is delayed, the demand forecasting model falls back to using the most recent available data, with a flag indicating reduced confidence. The KPI prioritization model adjusts its weights accordingly, downranking recommendations that depend on stale data.
This degradation logic is encoded in the OmegaConf configuration as data freshness thresholds. Each model specifies, for each of its input features, the maximum acceptable staleness and the behavior when that threshold is exceeded (use stale data with reduced weight, skip the feature entirely, or abort the model run).
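In miniature, evaluating those thresholds looks like the sketch below. The policy table stands in for the OmegaConf freshness section (feature names, hours, and behavior labels are all hypothetical):

```python
from datetime import datetime, timezone

# Per-feature staleness policy, as it might appear in the OmegaConf config.
POLICY = {
    "pos_weekly":     {"max_staleness_h": 30, "on_breach": "downweight"},
    "promo_calendar": {"max_staleness_h": 72, "on_breach": "skip_feature"},
    "store_master":   {"max_staleness_h": 24, "on_breach": "abort"},
}

def apply_freshness_policy(last_updated: dict, now=None) -> dict:
    """Decide, per input feature, how a model run should degrade.

    Returns "use" for fresh features, otherwise the configured breach
    behavior: downweight, skip the feature, or abort the run.
    """
    now = now or datetime.now(timezone.utc)
    decisions = {}
    for feature, policy in POLICY.items():
        age_h = (now - last_updated[feature]).total_seconds() / 3600
        decisions[feature] = (policy["on_breach"]
                              if age_h > policy["max_staleness_h"] else "use")
    return decisions
```

A single "abort" decision short-circuits that model's run; "downweight" decisions propagate to the KPI prioritization model so recommendations built on stale inputs sink in the ranking.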
Scale Challenges
At 10,000+ outlets and 100,000+ SKUs, the combinatorial space for predictions is enormous. The demand forecasting models must produce predictions for every active store-SKU combination, which can exceed 100 million rows.
We address this through several strategies. Partition pruning is the most impactful — all gold tables are partitioned by retailer and category, and model inference queries filter aggressively on these partitions so that each model run only processes the relevant subset of the data.
Cluster auto-scaling is essential for the model execution phase, where compute demand spikes dramatically for 60-90 minutes and then drops to near zero. We use Databricks job clusters with aggressive auto-scaling policies — clusters scale up quickly and scale down with a short idle timeout.
Delta table optimization (Z-ordering, compaction, and liquid clustering) keeps query performance consistent as tables grow. We run optimization jobs during off-peak hours (midday) to avoid competing with the nightly pipeline.
Caching at the gold layer reduces redundant computation. Many models share the same base features, so we materialize common feature sets as gold tables rather than recomputing them for each model.
Results
After twelve months of daily operation, the platform delivers:
- Daily execution across 10,000+ outlets with results available by 6 AM Eastern, meeting the SLA 97% of the time.
- Measurable improvement in retail execution effectiveness, as field teams now prioritize actions based on data-driven recommendations rather than intuition.
- 12 models running in coordinated daily cadence, with new models onboarded in weeks rather than months thanks to the OmegaConf configuration system and DAB deployment pattern.
- 40+ data sources integrated with automated freshness monitoring and graceful degradation for late or missing feeds.
The most unexpected outcome was the impact of the KPI prioritization model on field team behavior. Before the platform, representatives visited stores with a generic checklist. After, they arrived with a short list of specific, high-impact actions tailored to that outlet on that day. The behavioral change was more impactful than any individual model improvement.
Lessons Learned
OmegaConf is the right level of abstraction for ML configuration. It is more structured than raw YAML dictionaries, supports type checking and validation, handles hierarchical merging, and integrates cleanly with Python. We tried MLflow’s parameter tracking as a configuration mechanism early on and found it was designed for experiment tracking, not operational configuration management.
The KPI prioritization model is the glue. Individual model outputs are not directly actionable by field teams. A demand forecast or a compliance score is interesting but not prescriptive. The prioritization model that synthesizes all outputs into a ranked action list is what makes the platform useful in practice. We should have invested in this model earlier.
Data freshness SLAs need to be a first-class concern, not an afterthought. We initially treated monitoring as a phase-two feature and paid for it with unreliable early-morning runs where stale data produced misleading recommendations.
Graceful degradation is not optional at this scale. With 40+ data sources, something fails every night. The platform must produce useful outputs even when running on incomplete data. Designing for degradation from the start — rather than bolting it on after the first incident — is worth the upfront investment.