Zero-Shot Forecasting: Our Search for a Time-Series Foundation Model
The infrastructure monitoring space is at an inflection point. Classical forecasting models—ARIMA, SARIMA, Prophet—have served observability teams well, but they carry a steep operational cost: every new data stream needs its own tuned model. At Parseable, we manage continuous observability streams across infrastructure, and the overhead of maintaining hundreds of per-stream models was becoming unsustainable.
So we ran a benchmark. We evaluated four time-series foundation models—Amazon Chronos, Google TimesFM, IBM Tiny Time-Mixers (TTM), and Datadog Toto—against classical baselines on real Kubernetes pod metrics. The goal was to understand which, if any, zero-shot forecasting models were production-ready for observability use cases.
This is what we found.
Quick Summary
- We evaluated zero-shot forecasting models on real observability data from Kubernetes pod metrics.
- The benchmark compared Chronos, TimesFM, IBM Tiny Time-Mixers, Datadog Toto, Lag-LLaMA, and classical baselines.
- Datadog Toto performed strongest on the multivariate observability workload, achieving a MAPE of 0.006 at one-minute granularity.
- Chronos and IBM TTM showed useful trade-offs around accuracy, generalization, and compute efficiency.
- Classical models remained competitive on steady-state workloads, especially when latency and simplicity mattered.
Why Foundation Models for Observability?
Traditional statistical forecasting requires a separate, hand-tuned model for each data stream. For a small number of stable metrics, that is manageable. At the scale of a modern observability platform, it is not.
Foundation models trained on large, diverse time-series corpora offer a different path: one model that generalizes across many streams without retraining. The same principle behind large language models—train broadly, apply narrowly—is now being applied to temporal data.
The practical question for observability teams is not whether this works in principle. It is whether it works well enough on real infrastructure telemetry to justify replacing or supplementing the classical stack.
Why Zero-Shot Forecasting Matters for Observability
Observability data is structurally difficult for classical forecasting models:
- Streams appear and disappear as services are deployed and torn down.
- Infrastructure metrics are noisy, high-cardinality, and often correlated across dimensions.
- Data distributions shift with every deployment, incident, or traffic pattern change.
- Manual retraining does not scale when you have thousands of active streams.
Zero-shot forecasting—predicting over new series without task-specific fine-tuning—directly addresses this. A model trained on diverse time-series data can transfer structural patterns across domains: seasonal cycles, trend shapes, variance profiles. For teams managing dynamic infrastructure, that reusability reduces both engineering overhead and time-to-insight when onboarding new services or metrics.
Where Classical Forecasting Still Works
Foundation models are not universally better. ARIMA and Prophet remain strong choices for stable, narrow workloads with predictable seasonality and low noise. They are cheaper to run, easier to explain, and deliver sub-second inference without warm-up overhead.
The trade-off is clear: if your streams are stable and few, classical models are hard to beat on pure cost-efficiency. If your streams are varied, noisy, multivariate, or constantly changing, zero-shot time-series forecasting models offer a meaningful operational advantage—particularly when you need coverage across dozens or hundreds of services simultaneously.
Models Explored
We evaluated four foundation models and included Lag-LLaMA as an additional lightweight baseline. Here is a brief profile of each.
Amazon Chronos
Chronos is a transformer-based forecasting model from AWS, trained on a large and diverse corpus of open time-series datasets. It frames forecasting as a language modeling problem: it tokenizes time-series values and generates future tokens autoregressively. This architecture gives it strong generalization across data distributions, making it one of the most versatile general-purpose options in the foundation model ecosystem.
We used Chronos Bolt Base (205M parameters) for this evaluation. It supports both univariate and multivariate forecasting and is available in multiple size tiers.
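For readers who want to reproduce a single zero-shot call, here is a minimal sketch assuming the open-source chronos-forecasting package and its published BaseChronosPipeline interface. The checkpoint id matches the model we tested; the context series is synthetic, and exact class names can vary between releases.

```python
# Minimal zero-shot inference sketch with the chronos-forecasting package
# (pip install chronos-forecasting). Interfaces may differ across releases.
import torch
from chronos import BaseChronosPipeline

# Load the Chronos Bolt Base checkpoint from Hugging Face.
pipeline = BaseChronosPipeline.from_pretrained(
    "amazon/chronos-bolt-base",
    device_map="cpu",  # use "cuda" if a GPU is available
)

# Context window: the most recent 512 observations of one metric,
# matching the input length used in this benchmark. Synthetic here.
context = torch.randn(512).cumsum(0)

# Forecast 64 steps ahead; quantile levels give a simple uncertainty band.
quantiles, mean = pipeline.predict_quantiles(
    context=context,
    prediction_length=64,
    quantile_levels=[0.1, 0.5, 0.9],
)
print(mean.shape)  # torch.Size([1, 64]): one series, 64 forecast steps
```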
Note: This evaluation used the model version available at the time of the benchmark. Newer Chronos releases—including Chronos-2, which adds covariate support and improved multivariate handling—may produce different results and should be evaluated separately.
Google TimesFM
TimesFM is a large-scale time-series foundation model from Google Research, pretrained on over 100 billion real-world time points. It uses a decoder-only transformer architecture designed with zero-shot generalization as a primary objective. Google's research showed strong performance across diverse public benchmarks without fine-tuning.
We used timesfm-2.0-500m-pytorch (500M parameters). TimesFM is primarily univariate-focused, which shaped its performance profile on our multivariate observability task—strong at hourly granularity, but less applicable to joint multi-stream forecasting.
IBM Tiny Time-Mixers (TTM)
IBM Tiny Time-Mixers—part of the IBM Granite Time Series family—take a different architectural approach. Rather than large transformers, TTM uses a lightweight mixer-based design with under one million parameters in its smallest variants. IBM describes these models as optimized for resource-constrained environments: GPU-free inference, fast adaptation with minimal data, and favorable accuracy-to-efficiency ratios.
We used granite-timeseries-ttm-r2 (805K parameters). Its size makes it one of the most practical options for teams with constrained inference budgets or edge deployment requirements.
Datadog Toto
Toto is a time-series foundation model built specifically for observability forecasting. Where general-purpose models are trained on diverse corpora, Toto was pretrained on observability time series—infrastructure metrics, application telemetry, and system-level signals. The result is a model that understands structural patterns common to production systems: correlated pod metrics, traffic-driven latency spikes, and rolling deployment effects.
We used Toto-Open-Base-1.0 (151M parameters). It is a multivariate model, which made it a natural fit for joint forecasting over CPU, memory, and request latency.
Model Overview
| Model | Publisher | Signal Type | Approx. Size | License | Best Fit |
|---|---|---|---|---|---|
| Chronos Bolt | Amazon | Univariate / Multivariate | 205M | Apache 2.0 | General-purpose forecasting |
| TimesFM | Google | Primarily univariate | 500M | Apache 2.0 | Zero-shot univariate forecasting |
| TTM | IBM | Multivariate | 805K | Apache 2.0 | Efficient, edge-friendly forecasting |
| Toto | Datadog | Multivariate | 151M | Apache 2.0 | Observability metrics |
| Lag-LLaMA | Community | Univariate | 2.45M | Apache 2.0 | Lightweight univariate baseline |
Evaluation Metric: MAPE
Why We Used MAPE
We selected Mean Absolute Percentage Error (MAPE) as our primary metric for three reasons:
- Interpretability: A MAPE of 5% is immediately meaningful to engineers and stakeholders. It does not require statistical background to interpret.
- Scale-invariance: It allows fair comparison across metrics with very different value ranges—CPU percentages, memory in bytes, request latency in milliseconds.
- Practical alignment: It reflects the accuracy question a production team actually asks: "How far off is the forecast, as a percentage of the actual value?"
Formula: MAPE = (1/n) × Σ |(actual − forecast) / actual| × 100
MAPE Limitations
MAPE is useful, but it has known failure modes worth understanding before applying it to production evaluation:
- It breaks down when actual values are near or at zero, producing very large or undefined errors.
- It over-penalizes errors on low actual values, which can skew comparisons on metrics with near-zero floors (e.g., idle CPU during off-peak hours).
- It does not capture directionality—a forecast that misses a spike in the wrong direction scores the same as one that misses in the right direction.
- It does not reflect anomaly usefulness: a forecast that correctly anticipates the shape of a spike but lands 15% off the peak may still be operationally valuable.
We addressed the zero-value issue through filtering. MAE was tracked alongside MAPE throughout, and sMAPE was considered for cases where symmetry mattered. In production use, we recommend pairing MAPE with operational thresholds and residual analysis rather than relying on it as a standalone signal.
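For reference, here is a minimal numpy sketch of these metrics, including the near-zero filtering described above. The eps threshold is an illustrative assumption rather than the exact filter used in the benchmark:

```python
import numpy as np

def mape(actual, forecast, eps=1e-6):
    """MAPE (%) with near-zero actuals filtered out."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    mask = np.abs(actual) > eps  # drop points where the ratio is undefined or explodes
    return float(np.mean(np.abs((actual[mask] - forecast[mask]) / actual[mask])) * 100)

def smape(actual, forecast):
    """Symmetric MAPE (%): bounds each term, treats over- and under-forecasts evenly."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = (np.abs(actual) + np.abs(forecast)) / 2
    mask = denom > 0
    return float(np.mean(np.abs(actual[mask] - forecast[mask]) / denom[mask]) * 100)

def mae(actual, forecast):
    """Mean absolute error, in the metric's native units."""
    return float(np.mean(np.abs(np.asarray(actual, float) - np.asarray(forecast, float))))

# Example: mape([100, 200], [110, 190]) -> 7.5
```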
Dataset Used
Production Kubernetes Pod Metrics
Our evaluation used real production telemetry from a retail checkout application running on Kubernetes. The dataset included:
- CPU usage per pod
- Memory consumption per pod
- Request latency (p50/p99) per pod
- Sampling rate: 1-second intervals, downsampled to 1-minute averages for analysis
- Coverage: Multiple pods across checkout, cart, and payment services
The data reflects real-world workload behavior: sustained load periods, traffic spikes during peak hours, brief gaps from pod restarts, and occasional outliers from deployment events. This is a more challenging evaluation surface than standard synthetic benchmark datasets, and more representative of what production observability forecasting actually faces.
Pre-Processing Steps
- Resampling: Downsampled 1Hz telemetry to 1-minute averages; further aggregated to 1-hour and 1-day granularities for multi-resolution evaluation.
- Missing value handling: Forward-fill imputation for short gaps (under five minutes); larger missing intervals were masked and excluded from scoring.
- Normalization: Z-score normalization applied per series to support zero-shot generalization without leaking absolute scale information to models.
- Sliding window split: 70% training / 15% validation / 15% test, using a sliding window to simulate real deployment conditions rather than a single static split.
- Multivariate structuring: CPU, memory, and latency records joined per pod to form joint multivariate inputs for models that support them. The full pipeline is sketched after this list.
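Concretely, these steps map onto a short pandas pipeline. The sketch below is illustrative: column names, the gap threshold, and the window and stride values are assumptions standing in for our internal tooling.

```python
# Illustrative pre-processing pipeline for one pod's telemetry.
import pandas as pd

def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """raw: 1 Hz samples indexed by timestamp, with columns such as
    ['cpu', 'memory', 'latency_p99'] (hypothetical names)."""
    # 1. Resample 1 Hz telemetry to 1-minute averages.
    minute = raw.resample("1min").mean()

    # 2. Forward-fill short gaps only (up to five 1-minute rows);
    #    longer gaps stay NaN and are masked out of scoring.
    minute = minute.ffill(limit=5)

    # 3. Per-series z-score normalization so models see shape, not absolute
    #    scale (in a strict split, fit these stats on the training portion).
    return (minute - minute.mean()) / minute.std()

def sliding_windows(values, input_len=512, output_len=64, stride=64):
    """Yield (context, target) pairs for rolling evaluation.
    Works on a Series or a multivariate DataFrame via .to_numpy()."""
    arr = values.to_numpy()
    last_start = len(arr) - input_len - output_len
    for start in range(0, last_start + 1, stride):
        yield (arr[start : start + input_len],
               arr[start + input_len : start + input_len + output_len])
```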
Ensuring Fair Evaluation
Each model received identical input windows and was scored against the same held-out test segments. No model-specific fine-tuning was applied—this was a zero-shot evaluation by design. The goal was not to optimize each model for this specific dataset, but to evaluate how each model performed under a shared, consistent protocol. Classical baselines were trained on the training split and evaluated on the same test segments.
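As a concrete illustration of that protocol, the hypothetical harness below feeds identical context windows to every model and averages per-window MAPE, reusing the mape helper sketched earlier. Each entry in models is assumed to wrap a real pipeline behind a common (context, horizon) -> forecast callable:

```python
import numpy as np

def evaluate(models: dict, windows, horizon: int = 64) -> dict:
    """models: name -> callable(context, horizon) returning a forecast array.
    windows: iterable of (context, target) pairs shared by all models."""
    scores = {name: [] for name in models}
    for context, target in windows:                  # identical windows for every model
        for name, predict in models.items():
            forecast = predict(context, horizon)     # zero-shot: no fine-tuning step
            scores[name].append(mape(target, forecast))  # mape() defined earlier
    return {name: float(np.mean(vals)) for name, vals in scores.items()}
```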
Results and Observations
Benchmark Results
Results are split by granularity. All MAPE and MAE values reflect test-split performance without fine-tuning.
1-Minute Granularity
| Model | Type | Input Length | Output Length | MAPE | MAE |
|---|---|---|---|---|---|
| Datadog Toto | Multivariate | 512 | 64 | 0.006 | 0.00646 |
| Amazon Chronos Bolt | Multivariate | 512 | 64 | 0.046 | 0.04395 |
| Google TimesFM | Univariate | 512 | 64 | 0.108 | 0.09553 |
| Lag-LLaMA | Univariate | 512 | 64 | 0.537 | 0.47321 |
| IBM TTM | Multivariate | 512 | 96 | 1.121 | 1.00742 |
1-Hour Granularity
| Model | Type | Input Length | Output Length | MAPE | MAE |
|---|---|---|---|---|---|
| Google TimesFM | Univariate | 128 | 24 | 0.534 | 0.51253 |
| Amazon Chronos Bolt | Multivariate | 220 | 64 | 1.790 | 1.72385 |
| IBM TTM | Multivariate | 180 | 60 | 2.592 | 2.54402 |
| Datadog Toto | Multivariate | 220 | 24 | 3.866 | 3.69394 |
| Lag-LLaMA | Univariate | 220 | 24 | 9.983 | 9.54780 |
1-Day Granularity
| Model | Type | Input Length | Output Length | MAPE | MAE |
|---|---|---|---|---|---|
| Datadog Toto | Multivariate | 10 | 3 | 0.541 | 0.52186 |
| Amazon Chronos Bolt | Multivariate | 10 | 3 | 2.697 | 2.65760 |
| Google TimesFM | Univariate | — | — | — | — |
| IBM TTM | Multivariate | — | — | — | — |
TimesFM and IBM TTM did not produce results at 1-day granularity under the evaluated configuration.
What Performed Best?
Results split clearly across granularity and signal type:
- At 1-minute resolution, Toto was the clear leader with a MAPE of 0.006—roughly 7x better than the next-best model (Chronos at 0.046). This granularity is the most relevant for real-time observability alerting and anomaly detection.
- At 1-hour resolution, TimesFM led with a MAPE of 0.534, followed by Chronos at 1.79. Toto's 1-hour MAPE was notably higher, indicating its strength is concentrated in short-horizon, high-frequency forecasting over correlated observability streams.
- At 1-day resolution, Toto again outperformed Chronos (0.541 vs 2.697), though both models showed elevated MAPE given the limited available input length of 10 data points.
- IBM TTM trailed the larger models at 1-minute resolution but stayed mid-pack at 1-hour granularity despite its 805K parameter footprint—a strong argument for resource-constrained deployments where efficiency matters as much as raw accuracy.
- Classical Vector-ARIMA remained competitive for steady-state workloads where series were stationary and individual per-stream retraining was feasible.
Multivariate Pod Metrics: Foundation Models in the Trenches
The multivariate evaluation—jointly forecasting CPU, memory, and latency per pod—was the task most representative of production forecasting for Kubernetes metrics. Toto was built for exactly this: correlated, high-dimensional observability metrics with structural relationships that general-purpose models are not specifically trained to capture.
Its MAPE of 0.006 at 1-minute resolution reflects how effectively it learned cross-metric patterns from its observability-focused pretraining. Chronos also performed well here, benefiting from its transformer architecture's capacity to model cross-series dependencies. TTM, despite its much smaller footprint, was a competitive multivariate performer—which speaks to the practical efficiency of mixer-based architectures for structured time-series tasks.
Classical Vector-ARIMA matched or beat foundation models on individual pods with stable, predictable load profiles. When workloads were steady and data was clean, classical approaches had no generalization gap to compensate for.
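For context on what that classical baseline involves, here is a minimal per-pod sketch using statsmodels. We use a plain VAR as a stand-in for the Vector-ARIMA family discussed above; the lag selection is illustrative, and unlike the zero-shot models this has to be refit for every stream:

```python
# Classical multivariate baseline: a vector autoregression over the joint
# (cpu, memory, latency) series for one pod. Requires statsmodels.
import pandas as pd
from statsmodels.tsa.api import VAR

def var_forecast(train: pd.DataFrame, steps: int = 64) -> pd.DataFrame:
    """train: per-pod DataFrame with one column per metric."""
    fitted = VAR(train).fit(maxlags=15, ic="aic")  # lag order chosen by AIC
    last_obs = train.values[-fitted.k_ar:]         # the k_ar most recent rows
    forecast = fitted.forecast(y=last_obs, steps=steps)
    return pd.DataFrame(forecast, columns=train.columns)
```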
Robustness and Real-World Behavior
Foundation models and classical models diverged most clearly under three real-world conditions:
- Sudden spikes: Foundation models produced smoother, more stable forecasts during periods of input volatility. Classical models occasionally over-reacted to recent noise, producing erratic short-term predictions.
- Missing data: The forward-fill pre-processing helped across the board, but foundation models generally handled imputed sequences more gracefully than ARIMA-family models on longer gaps.
- Regime changes: No model handled true first-of-its-kind events—a new traffic pattern, a major configuration change—well in zero-shot mode. Foundation models recovered faster as new patterns emerged, particularly with even minimal fine-tuning on a few hundred recent observations.
Qualitative Patterns
A few patterns emerged that do not surface directly in MAPE scores:
- Inference latency: Toto and Chronos required brief warm-up periods before settling into stable prediction speeds. Classical models delivered sub-second responses from the start—relevant for time-sensitive alerting pipelines.
- Outlier resistance: Foundation models were less likely to produce physically implausible forecasts when input sequences were noisy or contained anomalous values.
- Cross-pod consistency: Toto showed the most consistent performance across different pod types and service contexts—practically valuable for multi-service monitoring where forecasting quality should not vary widely by team or service.
When Do Foundation Models Win?
Based on this benchmark, foundation models provide a clear advantage when:
- You are forecasting multivariate, correlated metrics (CPU + memory + latency together, or multi-service telemetry).
- Your streams are noisy, non-stationary, or subject to frequent workload shifts.
- You need to onboard new streams without retraining—zero-shot coverage across new services or deployments.
- You manage enough stream diversity that per-stream model maintenance has become a real operational cost.
Classical models remain the right choice when:
- Streams are stable, stationary, and have predictable seasonality.
- You need sub-second inference with no warm-up cost.
- Compute or memory budget is tightly constrained.
- Your team values model explainability and simple auditing over raw generalization capability.
What This Means for Predictive Observability
Zero-shot forecasting is a genuine step forward for predictive observability, but it is not a drop-in solution. A few grounding points for teams considering foundation models in production:
- They reduce retraining overhead, which is the primary operational win. One model covering many streams is significantly easier to manage than per-stream ARIMA pipelines at scale—especially when new OpenTelemetry-instrumented services are being onboarded regularly.
- Model choice depends on workload: Toto is the strongest pick for high-frequency multivariate observability data. Chronos is the more versatile option across mixed workload types. TTM is the right choice when compute efficiency is a hard constraint.
- Production deployment still requires work: backtesting against historical telemetry, drift monitoring as workloads evolve, fallback baselines for edge cases, and careful threshold calibration for alerting pipelines.
- Foundation models forecast expected behavior—they are not anomaly detectors. Detection still requires a separate layer, whether residual-based, statistical, or learned: forecasts create the baseline, and detection logic acts on deviations from it, as in the sketch below.
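To make that last point concrete, here is a minimal residual-based sketch over a forecast baseline. The rolling window and the k multiplier are illustrative and would need per-metric calibration:

```python
import pandas as pd

def flag_anomalies(actual: pd.Series, forecast: pd.Series,
                   window: int = 60, k: float = 3.0) -> pd.Series:
    """Flag points whose residual exceeds k rolling standard deviations."""
    residual = actual - forecast                 # deviation from the forecast baseline
    band = k * residual.rolling(window, min_periods=window // 2).std()
    return residual.abs() > band                 # boolean anomaly mask (NaN band -> False)
```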
If you are exploring predictive observability and want to build forecasting workflows over real production telemetry, Parseable provides the storage, query, and data access layer needed to run these pipelines over live infrastructure data.
Conclusion
Foundation models have earned a place in the time-series forecasting toolbox for observability teams. This benchmark showed that zero-shot forecasting is viable for production-grade use cases—but with clear conditions and well-understood trade-offs.
Datadog Toto performed strongest on our multivariate observability workload, particularly at high-frequency one-minute resolution where it is most relevant for real-time alerting. Amazon Chronos delivered solid, general-purpose performance across granularities and is the most versatile option for teams that need coverage across varied signal types. IBM TTM demonstrated that efficiency and accuracy are not mutually exclusive, especially for teams working within tighter compute budgets. Classical baselines remained competitive for stable, narrow workloads—and should not be abandoned without a clear reason to move.
The practical takeaway is not that foundation models replace classical approaches. It is that zero-shot time-series forecasting gives observability teams a more scalable default: one that generalizes across streams without constant retraining, handles data variety more gracefully, and reduces the maintenance overhead that comes with managing large per-stream model fleets at scale.
Predictive observability is becoming a core capability, not an optional extension. Teams that integrate forecasting into their telemetry pipelines now will be better positioned as infrastructure complexity continues to grow.
Have questions about this benchmark or results from your own evaluation? We'd like to hear from you.


