Enterprise ML costs are dominated by inference, not training — production inference typically accounts for 60-90% of total ML spend. The highest-impact cost optimizations target the serving layer: model compression (quantization, pruning, distillation) can reduce inference costs by 2-8x, efficient serving frameworks (vLLM, Triton, TGI) improve GPU utilization from 20-30% to 70-80%, and strategic hardware selection (matching workloads to the right GPU class) prevents overspending by 3-5x. Teams that treat ML cost optimization as a continuous engineering discipline — not a one-time exercise — achieve 50-80% cost reductions while maintaining or improving latency and quality targets.
The ML Cost Crisis in 2026
The economics of production machine learning have reached an inflection point. As organizations move from ML experiments to production deployments serving millions of predictions daily, AI infrastructure costs have become a top-line budget concern. According to industry surveys, enterprise ML infrastructure spending grew 47% year-over-year in 2025, with many organizations reporting that AI compute costs now exceed their entire pre-AI cloud spend.
The problem is not that ML is inherently expensive — it is that most teams optimize for model accuracy during development and discover cost problems only after deployment. A model that costs $0.03 per prediction seems reasonable in a notebook. At 10 million daily predictions, that is $300,000 per day, or roughly $9 million per month.
"The biggest surprise for most ML teams is that training is a one-time cost, but inference is forever. We see organizations spending 5-10x more on serving models than on training them, and that ratio only increases as adoption grows."
— Andreessen Horowitz, "The Cost of AI Infrastructure" Report, 2025
This guide covers the engineering strategies that reduce ML costs across the entire lifecycle — from training and fine-tuning through production inference — without sacrificing the performance characteristics your users depend on.
Anatomy of ML Costs: Training, Inference, Data & Infrastructure
Before optimizing, you need to understand where your ML dollars actually go. Most teams underestimate the dominance of inference costs and overestimate training costs, leading to misallocated optimization efforts.
| Cost Category | % of Total ML Spend | Growth Pattern | Key Drivers |
|---|---|---|---|
| Inference / Serving | 60-90% | Scales with traffic | Request volume, model size, latency SLAs |
| Training / Fine-Tuning | 5-20% | Periodic spikes | Model size, dataset size, experiment frequency |
| Data Pipeline & Storage | 5-15% | Grows with data volume | Feature stores, data processing, storage tiers |
| MLOps Infrastructure | 3-10% | Relatively fixed | Monitoring, CI/CD, experiment tracking, registries |
Training Costs
Training costs are bursty but predictable. A fine-tuning run on a 7B parameter model might cost $500-$5,000 depending on dataset size and GPU selection. Full pre-training of large models costs $1M+, but most enterprise teams fine-tune or run transfer learning rather than training from scratch. The key optimization levers for training are: using spot/preemptible instances (50-70% savings), mixed-precision training (2x throughput), gradient checkpointing (train larger models on smaller GPUs), and efficient data loading pipelines that keep GPUs saturated.
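The spot-instance math is worth making concrete. A rough back-of-envelope model, where the 65% discount, 10% interruption overhead, and $4.10/GPU-hour rate are all illustrative assumptions, shows why checkpointable training jobs should default to spot capacity:

```python
def spot_training_cost(on_demand_hourly, gpu_hours, spot_discount=0.65,
                       interruption_overhead=0.10):
    """Estimate a training run's cost on-demand vs. on spot capacity.

    Assumes checkpointing lets the run resume after interruptions, at
    the price of some repeated work (interruption_overhead = fraction
    of GPU hours lost to restarts and re-computation).
    """
    on_demand = on_demand_hourly * gpu_hours
    spot = on_demand_hourly * (1 - spot_discount) * gpu_hours * (1 + interruption_overhead)
    return on_demand, spot

# Hypothetical 7B fine-tune: 8 GPUs for 24 hours at $4.10/GPU-hour
od, sp = spot_training_cost(4.10, 8 * 24)
print(f"on-demand ${od:,.0f} vs spot ${sp:,.0f}")  # spot saves ~60% even after restarts
```

Even after paying for repeated work, the spot run costs well under half the on-demand price, which is why the savings survive realistic interruption rates.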
Inference Costs
Inference costs are continuous and scale directly with user adoption — exactly the dynamic you want for a successful product, but dangerous without cost controls. Each prediction requires GPU compute (or CPU, for smaller models), memory for model weights, and network bandwidth. The cost per prediction depends on model size, hardware choice, batch size, and how efficiently your serving infrastructure utilizes the underlying compute. Most teams achieve only 20-30% GPU utilization in production inference, meaning they are paying for 3-5x more compute than they actually use.
Data Pipeline Costs
Often overlooked, data pipeline costs include feature computation, data validation, storage for training datasets and feature stores, and the ETL infrastructure that keeps everything current. These costs tend to be stable but can spike during dataset expansions or when feature engineering becomes more complex.
MLOps Infrastructure Costs
The operational overhead of running ML in production — MLOps pipelines, model monitoring, experiment tracking, model registries, and CI/CD for models — typically represents a smaller but essential fixed cost. Teams that underinvest here end up spending more overall due to failed deployments, undetected model degradation, and inefficient experimentation.
Model Compression Techniques
Model compression is the highest-ROI optimization for most teams because it reduces the compute required for every single prediction. A compressed model serves the same function with fewer resources — smaller memory footprint, faster inference, and lower per-prediction cost.
Quantization
Quantization reduces the numerical precision of model weights from 32-bit floating point (FP32) to lower precision formats — FP16, INT8, or INT4. This directly reduces memory usage and increases throughput because modern GPUs have specialized hardware for lower-precision arithmetic.
| Precision | Memory Reduction | Throughput Improvement | Typical Quality Impact | Best For |
|---|---|---|---|---|
| FP16 / BF16 | 2x | 1.5-2x | <0.5% degradation | Default for all GPU inference |
| INT8 (PTQ) | 4x | 2-3x | 0.5-2% degradation | Classification, embedding models |
| INT8 (QAT) | 4x | 2-3x | <0.5% degradation | Quality-sensitive workloads |
| INT4 (GPTQ/AWQ) | 8x | 3-4x | 1-5% degradation | LLM inference at scale |
Post-Training Quantization (PTQ) applies quantization after training is complete and requires no retraining. It is the fastest path to cost reduction — you can quantize an existing model and deploy it within hours. Quantization-Aware Training (QAT) incorporates quantization into the training loop, producing models that are more robust to precision reduction, at the cost of a full retraining cycle.
For LLMs specifically, GPTQ and AWQ are the dominant INT4 quantization methods in 2026. Both use calibration data to determine optimal quantization parameters per weight group, achieving 4-bit precision with minimal quality loss. A 70B parameter LLM that requires 140GB of GPU memory at FP16 shrinks to roughly 35GB at INT4, small enough to serve from a single 48GB or 80GB GPU rather than a multi-GPU cluster — a dramatic cost reduction.
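The memory arithmetic behind these numbers is simple enough to sanity-check with a minimal helper (weights only; KV cache, activations, and framework overhead come on top):

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Weights-only footprint: params (billions) × bits / 8 bytes → GB.
    KV cache, activations, and runtime buffers are additional."""
    return params_billion * bits_per_weight / 8

print(weight_memory_gb(70, 16))  # 140.0 GB at FP16
print(weight_memory_gb(70, 4))   # 35.0 GB at INT4
print(weight_memory_gb(7, 4))    # 3.5 GB: a quantized 7B model fits small GPUs easily
```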
Pruning
Pruning removes redundant weights or entire structures (neurons, attention heads, layers) from a model. Unstructured pruning zeroes out individual weights, achieving high sparsity ratios (80-95% of weights removed) but requiring sparse computation support for speedups. Structured pruning removes entire channels or layers, providing immediate speedups on standard hardware without specialized sparse kernels.
In practice, structured pruning combined with fine-tuning delivers the most predictable results: prune 30-50% of the model's parameters, fine-tune for a few epochs to recover quality, and achieve a proportional inference speedup with standard serving infrastructure.
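The channel-selection step at the heart of structured pruning is straightforward. A toy sketch using magnitude-based selection on a plain nested-list weight matrix (real pipelines operate on framework tensors, rewire downstream layers, and fine-tune afterward):

```python
def prune_channels(weights, keep_ratio=0.5):
    """Keep the output channels (rows) with the largest L1 norm.

    This is only the selection step of structured pruning; recovery
    fine-tuning and adjusting downstream layer shapes are omitted.
    """
    norms = [(sum(abs(w) for w in row), i) for i, row in enumerate(weights)]
    keep = max(1, int(len(weights) * keep_ratio))
    kept_idx = sorted(i for _, i in sorted(norms, reverse=True)[:keep])
    return [weights[i] for i in kept_idx], kept_idx

W = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]
pruned, kept = prune_channels(W)  # keeps the two high-magnitude rows
```

Because entire rows are removed, the resulting matrix is smaller and dense, so the speedup materializes on standard hardware with no sparse kernels.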
Knowledge Distillation
Knowledge distillation trains a smaller "student" model to replicate the behavior of a larger "teacher" model on your specific task. Unlike quantization and pruning (which compress an existing model), distillation creates an entirely new, smaller model. The student learns not just the teacher's predictions but also the probability distribution across outputs, capturing the teacher's "dark knowledge" about which outputs are similar.
Distillation is the most powerful compression technique when you have a well-defined, high-volume use case. A 1B parameter distilled model can match a 70B parameter teacher model's quality on a specific task while being 70x cheaper to serve. The trade-off is engineering effort: distillation requires generating a training dataset from the teacher, training the student, and validating that quality is maintained across your evaluation suite. For teams already fine-tuning foundation models, adding distillation to the pipeline is a natural extension.
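The soft-target objective that transfers this "dark knowledge" is compact. A dependency-free sketch of the classic temperature-scaled distillation loss (in practice it is combined with a hard-label cross-entropy term and computed on framework tensors, not Python lists):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    the soft-target term of the classic distillation loss."""
    p = softmax(teacher_logits, T)   # teacher's full output distribution
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

loss = distillation_loss([2.0, 1.0, 0.1], [2.2, 0.9, 0.2])
```

The temperature softens both distributions so the student sees which wrong answers the teacher considers plausible; the T² factor keeps gradient magnitudes comparable across temperature settings.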
ONNX Runtime Optimization
Converting models to ONNX (Open Neural Network Exchange) format enables hardware-specific graph optimizations — operator fusion, constant folding, memory planning — that reduce inference latency by 20-50% on the same hardware. ONNX Runtime is particularly effective for non-LLM models (computer vision, tabular ML, traditional NLP) deployed on CPUs or mixed CPU/GPU environments. It supports quantization natively, meaning you can combine format conversion with precision reduction in a single optimization pass.
Inference Optimization: Batching, Caching & Serving Frameworks
Even with a compressed model, how you serve it determines whether you extract full value from your hardware investment. The serving layer is where production ML systems leave the most money on the table.
Dynamic Batching
GPUs are throughput-oriented devices — they achieve peak efficiency when processing many inputs simultaneously. Single-request inference (batch size 1) wastes the vast majority of available compute. Dynamic batching collects incoming requests into batches before sending them to the GPU, dramatically improving throughput at the cost of slightly increased latency (the time spent waiting for a batch to fill).
The optimal batch size depends on model size, GPU memory, and your latency SLA. A well-tuned dynamic batching configuration typically improves GPU utilization from 15-25% to 60-80%, effectively reducing your per-prediction cost by 3-4x without any model changes.
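A toy batcher makes the throughput/latency trade explicit. This sketch dispatches a batch when it fills up or when the oldest request has waited past a deadline; production frameworks like Triton and vLLM do this continuously, in flight:

```python
def form_batches(arrival_times_ms, max_batch=8, max_wait_ms=10):
    """Toy dynamic batcher: dispatch when a batch reaches max_batch
    requests or when the oldest queued request has waited max_wait_ms."""
    batches, current = [], []
    for t in sorted(arrival_times_ms):
        if current and (len(current) == max_batch or t - current[0] > max_wait_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

# 20 requests arriving 1 ms apart become 3 GPU calls instead of 20
batches = form_batches(list(range(20)), max_batch=8, max_wait_ms=10)
```

The cost of the consolidation is bounded added latency per request (at most max_wait_ms of queueing), which is why the batch window must be tuned against your latency SLA.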
Inference Result Caching
For many ML workloads, a significant fraction of requests are identical or near-identical. Caching prediction results at the application layer — keyed by input features — avoids redundant GPU computation entirely. Embedding-based semantic caching extends this to "similar enough" inputs by comparing input embeddings against cached results within a similarity threshold.
Caching is most effective for classification, recommendation, and embedding generation workloads where inputs repeat frequently. For generative workloads (text generation, image generation), cache at the prompt level and reuse the KV cache for shared prefixes. Production systems with well-implemented caching typically achieve 30-60% cache hit rates, directly reducing GPU compute requirements proportionally.
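An embedding-keyed cache can be sketched in a few lines. The linear scan and the hand-picked 0.95 threshold below are for illustration only; production systems use a vector index and tune the threshold against quality metrics:

```python
import math

class SemanticCache:
    """Sketch of a semantic cache: return a stored result when a new
    input's embedding is within a cosine-similarity threshold of a
    cached one. A real deployment would use an ANN index, TTLs, and
    eviction; none of that is shown here."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, result) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    def get(self, emb):
        for cached_emb, result in self.entries:
            if self._cosine(emb, cached_emb) >= self.threshold:
                return result
        return None  # cache miss: caller runs the model and calls put()

    def put(self, emb, result):
        self.entries.append((emb, result))

cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "cat")
hit = cache.get([0.99, 0.05])   # nearly identical direction: cache hit
miss = cache.get([0.0, 1.0])    # orthogonal embedding: miss
```

Every hit avoids a GPU forward pass entirely, which is why even modest hit rates translate directly into capacity savings.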
KV-Cache Optimization for LLMs
For transformer-based LLMs, the key-value cache consumes significant GPU memory and grows linearly with sequence length. Techniques like PagedAttention (used by vLLM), which manages KV-cache memory in non-contiguous pages similar to OS virtual memory, improve memory utilization from 20-40% to 90%+. This directly translates to higher throughput — more concurrent requests per GPU — and lower cost per token.
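The memory pressure PagedAttention addresses is easy to quantify: KV-cache size is 2 (keys and values) × layers × KV heads × head dimension × sequence length × batch × bytes per element. The configuration below is illustrative of a 7B-class model at FP16:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elt=2):
    """KV-cache footprint in GB: 2 (K and V) × layers × kv_heads ×
    head_dim × seq_len × batch × bytes per element."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elt / 1e9

# Illustrative 7B-class config: 32 layers, 32 KV heads, head_dim 128, FP16
per_seq = kv_cache_gb(32, 32, 128, seq_len=4096, batch=1)
print(f"{per_seq:.2f} GB per 4K-token sequence")
```

At roughly 2 GB per 4K-token sequence on top of the model weights, a handful of naively pre-allocated sequences exhausts a 24GB card, which is why paged, on-demand allocation is what unlocks high request concurrency.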
Model Serving Frameworks Compared
Choosing the right serving framework has a material impact on inference cost and operational complexity. Here is how the major options compare for production deployments in 2026:
| Framework | Best For | Throughput | LLM Support | Operational Complexity | Key Advantage |
|---|---|---|---|---|---|
| vLLM | LLM inference | Very High | Excellent | Low-Medium | PagedAttention, continuous batching, highest throughput for LLMs |
| NVIDIA Triton | Multi-model, multi-framework | High | Good | Medium-High | Framework-agnostic, model ensembles, production-grade |
| TGI (Text Generation Inference) | LLM inference (HuggingFace) | High | Excellent | Low | Easy HuggingFace integration, built-in quantization support |
| TorchServe | PyTorch models | Medium-High | Moderate | Medium | Native PyTorch support, custom handlers, AWS integration |
| ONNX Runtime Server | Cross-platform, CPU-optimized | Medium-High | Limited | Low | Best CPU performance, cross-platform deployment |
| Ray Serve | Complex pipelines | High | Good | Medium | Multi-model composition, auto-scaling, Python-native |
For LLM-heavy workloads, vLLM has emerged as the performance leader. Its PagedAttention mechanism and continuous batching achieve 2-4x higher throughput than naive serving, translating directly to 2-4x lower cost per token. For organizations running diverse model types (vision, NLP, tabular, LLMs), Triton Inference Server provides a unified serving layer with model ensembling capabilities. For teams prioritizing simplicity and already invested in the HuggingFace ecosystem, TGI offers the fastest path to optimized LLM serving.
Speculative Decoding
Speculative decoding uses a small, fast "draft" model to generate candidate tokens, then verifies them in parallel using the larger target model. Because verification is parallelizable (unlike autoregressive generation), this technique achieves 2-3x faster generation with mathematically identical output quality. The cost is additional memory for the draft model and slightly higher compute per accepted token — but the net throughput improvement reduces cost per generated token significantly.
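The expected speedup follows a simple formula from the original speculative decoding analysis (Leviathan et al., 2023): with per-token acceptance probability a and k draft tokens per round, each target-model pass emits (1 − a^(k+1)) / (1 − a) tokens in expectation. A quick sketch:

```python
def expected_tokens_per_target_pass(accept_rate, draft_len):
    """Expected tokens emitted per target-model forward pass: the
    target verifies draft_len draft tokens in parallel and always
    emits at least one token of its own."""
    a, k = accept_rate, draft_len
    if a == 1.0:
        return k + 1  # every draft token accepted
    return (1 - a ** (k + 1)) / (1 - a)

# 80% per-token acceptance with 4 draft tokens per round
e = expected_tokens_per_target_pass(0.8, 4)
```

At 80% acceptance with 4 draft tokens, each expensive target pass yields about 3.4 tokens instead of 1, before subtracting the draft model's own (much smaller) cost.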
GPU Optimization & Hardware Selection
Choosing the right GPU (or non-GPU accelerator) for each workload is one of the most impactful cost decisions. Many teams default to the most powerful available GPU when a less expensive option would deliver identical production performance.
| Hardware | VRAM | FP16 TFLOPS | On-Demand $/hr (AWS) | Best For | Cost Efficiency |
|---|---|---|---|---|---|
| NVIDIA H100 (SXM) | 80 GB | 989 | ~$12.29 (p5.48xlarge, per GPU) | Large LLM training & inference, multi-model serving | Best throughput/$ for large models |
| NVIDIA A100 (80GB) | 80 GB | 312 | ~$5.12 (p4de.24xlarge, per GPU) | Training, large model inference, batch workloads | Strong price/performance for training |
| NVIDIA L4 | 24 GB | 121 | ~$0.81 (g6.xlarge) | Small-medium model inference, video processing | Best $/inference for sub-13B models |
| NVIDIA T4 | 16 GB | 65 | ~$0.53 (g4dn.xlarge) | Light inference, embedding generation, batch scoring | Cheapest NVIDIA GPU option |
| AWS Inferentia2 | 32 GB per chip | ~190 (BF16) | ~$0.76 (inf2.xlarge) | High-volume inference, transformer models | Up to 50% cheaper than GPU for supported models |
| AWS Graviton3 (CPU) | N/A (system RAM) | N/A | ~$0.14 (c7g.xlarge) | Small models, ONNX-optimized, low-latency classification | 10-20x cheaper than GPU when viable |
Matching Workloads to Hardware
The cardinal rule of GPU cost optimization: never use a GPU when a CPU will do, and never use a large GPU when a small one will do.
- Embedding generation & small classifiers (<500M params): Start with CPU (Graviton3 with ONNX Runtime). If latency requirements demand GPU, use T4 or L4. These workloads do not benefit from H100-class hardware.
- Medium models (500M-7B params): L4 GPUs deliver the best cost per inference. INT8 quantized models in this range fit comfortably in 24GB VRAM with room for batch processing.
- Large LLMs (7B-70B params): A100 80GB for cost-effective serving, H100 when throughput demands justify the premium. INT4 quantization is essential to keep memory requirements manageable.
- Very large LLMs (70B+ params): Multi-GPU serving on H100 clusters. Tensor parallelism distributes the model across GPUs, with NVLink providing the inter-GPU bandwidth needed for efficient parallel inference.
AWS Inferentia2 deserves special consideration for high-volume inference workloads. For models that compile successfully to the Neuron SDK (most transformer architectures), Inferentia2 delivers comparable throughput to A100 GPUs at roughly half the cost. The trade-off is a narrower set of supported operations and longer compilation times. Teams running on AWS's AI/ML ecosystem should evaluate Inferentia2 for any high-volume, latency-tolerant inference workload.
GPU Utilization Monitoring
You cannot optimize what you do not measure. Production GPU utilization should be monitored across four dimensions: SM (Streaming Multiprocessor) utilization, memory utilization, memory bandwidth utilization, and interconnect utilization (for multi-GPU setups). NVIDIA's DCGM (Data Center GPU Manager) provides these metrics. If SM utilization is consistently below 50%, you are either under-batching, using too large a GPU, or your serving framework is not efficiently scheduling work.
Cloud Cost Strategies
Infrastructure-level cost strategies compound with model-level optimizations. A quantized model on spot instances with auto-scaling can be 10-20x cheaper than an unoptimized model on on-demand instances with static provisioning.
Spot & Preemptible Instances
Spot instances offer 50-70% discounts over on-demand pricing for GPU workloads. For training (which can be checkpointed and resumed) and batch inference (which is inherently resumable), spot instances should be the default. For real-time inference, spot instances are viable as part of a mixed fleet: maintain a baseline of on-demand or reserved instances for guaranteed capacity, and add spot instances for burst capacity with graceful degradation when spot capacity is reclaimed.
Reserved Capacity & Savings Plans
For steady-state inference workloads with predictable traffic, 1-year or 3-year reserved instances or compute savings plans provide 30-60% discounts. The key is accurately forecasting your baseline GPU demand. Under-reserving wastes the discount opportunity; over-reserving locks you into capacity you do not use. A common pattern: reserve capacity for your p50 (median) traffic level and use spot or on-demand for everything above that.
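The p50-baseline pattern can be sanity-checked with a small model. The discounts below are illustrative mid-range figures, and the demand series is a stylized 24-hour profile with business-hours peaks:

```python
import statistics

def blended_hourly_cost(hourly_gpu_demand, on_demand_rate,
                        reserved_discount=0.40, spot_discount=0.65):
    """Reserve capacity at the median (p50) demand level and serve
    demand above the baseline from spot capacity. Reserved capacity
    is paid for every hour, used or not."""
    baseline = statistics.median(hourly_gpu_demand)
    reserved_rate = on_demand_rate * (1 - reserved_discount)
    spot_rate = on_demand_rate * (1 - spot_discount)
    cost = 0.0
    for demand in hourly_gpu_demand:
        cost += baseline * reserved_rate               # always-on reservation
        cost += max(0, demand - baseline) * spot_rate  # burst capacity on spot
    return cost

# 24-hour profile: 4 GPUs overnight, ramping to 12 at midday
demand = [4] * 8 + [8] * 4 + [12] * 4 + [8] * 4 + [4] * 4
cost = blended_hourly_cost(demand, on_demand_rate=1.0)
```

For this profile the blended cost is about $100, versus $160 for perfectly demand-tracked on-demand capacity and $288 for static provisioning at peak: the reservation covers the floor cheaply while spot absorbs the bursts.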
Auto-Scaling for Inference
ML inference workloads often have sharp traffic patterns — peak during business hours, minimal overnight. Auto-scaling that tracks GPU utilization, request queue depth, or custom latency metrics ensures you pay for capacity only when needed. The challenge is GPU spin-up time: launching a new GPU instance and loading a model takes 2-10 minutes, which is too slow for sudden spikes. Mitigation strategies include maintaining warm pools of pre-loaded instances, using model caching on instance storage for faster loading, and setting scale-up thresholds conservatively to trigger before capacity is actually exhausted.
Multi-Region Deployment
GPU availability and pricing vary significantly across cloud regions. Organizations with flexible latency requirements can route inference requests to the region with the best current pricing or availability. For latency-sensitive workloads, deploy models in regions closest to your users and use multi-region routing to balance cost with latency. Scaling AI infrastructure across regions also provides resilience — a GPU capacity shortage in one region does not impact availability.
Build vs. Buy Inference Platforms
The build-vs-buy decision for inference infrastructure depends on your scale, team expertise, and how central ML is to your business.
| Approach | Monthly Cost Range | Engineering Effort | Best For | Key Trade-off |
|---|---|---|---|---|
| Managed API (OpenAI, Anthropic, Google) | $1K-$500K+ | Minimal | Prototypes, low-medium volume, fast iteration | Highest per-prediction cost, lowest operational burden |
| Managed Inference (SageMaker, Vertex AI, Bedrock) | $2K-$200K+ | Low-Medium | Teams without GPU expertise, AWS/GCP-native orgs | Moderate cost premium, provider lock-in |
| Self-Hosted (vLLM/Triton on cloud GPUs) | $5K-$100K+ | Medium-High | High volume, cost-sensitive, custom requirements | Lowest per-prediction cost, highest ops burden |
| On-Premise / Colo GPUs | $10K-$500K+ (amortized) | High | Massive scale, data sovereignty, predictable workloads | Lowest long-term cost, highest upfront investment |
The typical progression mirrors the enterprise ML maturity model: start with managed APIs to validate use cases, move to managed inference platforms as volume grows, and invest in self-hosted infrastructure only when monthly spend justifies the engineering investment. The crossover point where self-hosting becomes cheaper than managed services is typically $20K-$50K per month in inference costs, assuming you have (or can hire) the MLOps expertise to manage the infrastructure reliably.
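A first-pass break-even check needs only the article's own numbers: self-hosted cost is 24/7 GPU capacity plus engineering labor. The $4/hr blended GPU rate and fleet size below are assumptions; the FTE figure uses the midpoint of the $200K-$350K fully-loaded range:

```python
def self_host_monthly_cost(gpu_count, gpu_hourly, fte_count=1.5,
                           fte_annual_loaded=275_000):
    """Monthly cost of self-hosted inference: round-the-clock GPU
    capacity plus the engineering labor to operate it reliably."""
    gpu_cost = gpu_count * gpu_hourly * 24 * 30          # 24/7 for a 30-day month
    labor_cost = fte_count * fte_annual_loaded / 12       # amortized FTE cost
    return gpu_cost + labor_cost

# Hypothetical fleet: 8 GPUs at a $4.00/hr blended rate, 1.5 FTEs
cost = self_host_monthly_cost(8, 4.00)
print(f"${cost:,.0f}/month all-in")
```

At roughly $57K/month all-in for this fleet, self-hosting only wins once the managed-API bill clears that figure; smaller fleets or partial FTE allocations lower the bar, which is how the crossover lands in the $20K-$50K range for many teams.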
"Self-hosting inference saves 40-70% on per-prediction costs at scale, but teams underestimate the hidden costs: on-call rotations, model deployment pipelines, GPU driver updates, capacity planning, and incident response. Budget at least 1-2 full-time engineers for a self-hosted inference platform before calculating ROI."
— Chip Huyen, "Designing Machine Learning Systems"
Cost Monitoring & Allocation
Effective ML cost management requires granular visibility into where every dollar goes — by model, by feature, by team, and by customer. Without this visibility, optimization is guesswork.
Cost Attribution Dimensions
- Per-model: Which models cost the most to serve? This identifies candidates for compression or replacement.
- Per-feature: Which product features drive the most inference cost? This informs pricing and product decisions.
- Per-customer / per-tenant: In multi-tenant systems, which customers consume disproportionate ML resources? This enables usage-based pricing.
- Per-team: Which engineering teams are responsible for which ML costs? This enables accountability and budgeting.
Tooling & Implementation
Cloud-native cost tools (AWS Cost Explorer, GCP Billing) provide infrastructure-level visibility but lack ML-specific granularity. For ML cost attribution, you need to correlate infrastructure metrics with application-level request data. A common architecture: tag all GPU instances with the model they serve, log every inference request with model ID, feature ID, and customer ID, and join these datasets in your data warehouse to compute cost-per-prediction by any attribution dimension.
Purpose-built ML cost monitoring platforms (Vantage, Kubecost with GPU support, custom Prometheus/Grafana dashboards) can automate this correlation. The key metric to track is cost per prediction by model and use case, trended over time. This metric captures the combined effect of all your optimization efforts and makes cost regression immediately visible.
Integrating cost monitoring into your ML observability stack ensures that model quality and model cost are tracked together — preventing optimizations that reduce cost at the expense of undetected quality degradation.
Cost-per-Prediction Benchmarking
Cost per prediction (CPP) is the fundamental unit economics metric for production ML. It enables apples-to-apples comparison across models, serving configurations, and hardware choices.
How to Calculate CPP
CPP = (Total infrastructure cost for a model) / (Total predictions served by that model) over a given period. Include all costs: GPU compute, memory, networking, storage, and the proportional share of shared infrastructure (load balancers, monitoring, logging). For LLMs, normalize by tokens rather than requests since request costs vary dramatically with input/output length.
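The calculation above can be sketched directly, including the shared-infrastructure allocation and the per-token normalization for LLMs (the dollar figures are illustrative):

```python
def cost_per_prediction(gpu_cost, shared_infra_cost, model_share, predictions):
    """CPP over a billing period: direct GPU cost plus this model's
    allocated share of shared infrastructure, per prediction served."""
    return (gpu_cost + shared_infra_cost * model_share) / predictions

def cost_per_1k_tokens(total_cost, total_tokens):
    """For LLMs, normalize by tokens rather than requests, since
    request costs vary with input/output length."""
    return total_cost / total_tokens * 1000

# Hypothetical model: $12,000/month in GPUs, 10% of $5,000 shared
# infrastructure, 40M predictions served
cpp = cost_per_prediction(12_000, 5_000, 0.10, 40_000_000)
```

Computed this way, the example lands at roughly $0.0003 per prediction, squarely in the optimized text-classification band of the benchmark table below.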
Benchmark Reference Points (2026)
| Workload Type | Model Size | Hardware | CPP (Optimized) | CPP (Unoptimized) |
|---|---|---|---|---|
| Text Classification | 100M-500M params | L4 / CPU | $0.00001-$0.0001 | $0.0005-$0.002 |
| Embedding Generation | 100M-1B params | L4 / T4 | $0.00005-$0.0003 | $0.001-$0.005 |
| Image Classification | 50M-300M params | L4 / T4 | $0.00002-$0.0002 | $0.001-$0.003 |
| LLM Inference (7B) | 7B params | L4 (INT4) | $0.0005-$0.002 per 1K tokens | $0.003-$0.01 per 1K tokens |
| LLM Inference (70B) | 70B params | A100/H100 (INT4) | $0.003-$0.01 per 1K tokens | $0.02-$0.08 per 1K tokens |
| Image Generation | 1B-3B params | A100 / L4 | $0.002-$0.01 per image | $0.02-$0.10 per image |
The gap between optimized and unoptimized CPP is typically 5-10x. This gap represents the cost savings available through the techniques covered in this guide: model compression, efficient serving, hardware matching, and infrastructure optimization. Tracking your CPP against these benchmarks tells you how much headroom remains.
Total Cost of Ownership Framework
TCO for production ML extends well beyond compute costs. Teams that optimize only for GPU spend often miss the larger picture of what it actually costs to run ML in production.
TCO Components
- Compute (40-60% of TCO): GPU/CPU instances for training and inference, including spot, reserved, and on-demand mix.
- Engineering labor (20-35% of TCO): MLOps engineers, ML engineers, and data engineers maintaining the infrastructure and models. Often the largest real cost, especially at smaller scale.
- Data infrastructure (5-15% of TCO): Storage, feature stores, data pipelines, labeling and annotation.
- Platform tooling (5-10% of TCO): Experiment tracking, model registries, monitoring, CI/CD — whether built in-house or purchased as SaaS.
- Opportunity cost: What could your ML engineers build if they were not managing infrastructure? This invisible cost often dominates at smaller organizations.
TCO Decision Framework
When evaluating ML infrastructure decisions, calculate the 12-month TCO including all components above. A common mistake: choosing self-hosted inference because the compute cost is lower, without accounting for the 1-2 engineering headcount required to manage it. At a fully-loaded cost of $200K-$350K per ML engineer per year, the engineering overhead can easily exceed the compute savings at moderate scale.
Use this framework when making decisions about model selection (larger models have higher serving costs but may require less fine-tuning), infrastructure choices (cloud vs. self-hosted), and build vs. buy for ML platform components. The true cost of AI software development includes all of these dimensions, and teams that measure ROI holistically make better investment decisions.
Putting It All Together: An Optimization Playbook
If you are starting from an unoptimized baseline, attack cost reduction in this order — each step builds on the previous:
- Measure: Instrument cost-per-prediction by model and feature. You need a baseline before optimizing.
- Right-size hardware: Match each workload to the cheapest viable GPU class. Move CPU-viable workloads off GPUs entirely.
- Quantize: Apply FP16/BF16 universally, INT8 for classification and embedding models, INT4 (GPTQ/AWQ) for LLMs. This is the highest-ROI single optimization.
- Optimize serving: Switch to an efficient serving framework (vLLM for LLMs, Triton for mixed workloads). Enable dynamic batching.
- Implement caching: Cache inference results for repeated and semantically similar inputs.
- Optimize cloud spend: Move training to spot instances, reserve baseline inference capacity, enable auto-scaling.
- Distill high-volume models: For workloads exceeding 100K daily predictions, evaluate knowledge distillation to smaller, cheaper models.
- Continuously monitor: Track CPP trends and GPU utilization. Cost creeps back up as models change and traffic patterns evolve.
Teams that execute this full playbook typically achieve 50-80% total cost reduction. The first three steps alone (measure, right-size, quantize) usually deliver 40-60% savings with relatively low engineering effort. Steps 4-8 provide incremental but compounding improvements as scale grows.
Frequently Asked Questions
What is the single highest-impact ML cost optimization?
Model quantization — specifically, moving from FP32 to FP16/BF16 for all GPU inference and to INT4 (GPTQ or AWQ) for LLMs. Quantization reduces memory usage by 2-8x and increases throughput proportionally, often with less than 1-2% quality degradation. It requires minimal engineering effort (hours to days) and applies to every inference request, making it the optimization with the best cost-reduction-to-effort ratio. For most teams, quantization alone reduces inference costs by 50-75%.
When should we self-host inference instead of using managed APIs?
Self-hosting typically becomes cost-effective when monthly inference API spend exceeds $20K-$50K and your workloads are well-understood. Below this threshold, the engineering cost of managing GPU infrastructure (1-2 FTEs at $200K-$350K fully loaded) exceeds the savings. Calculate your break-even: compare current API costs to projected GPU infrastructure costs plus engineering labor, amortized monthly. Also consider non-cost factors: data privacy requirements, latency needs, and model customization requirements that managed APIs may not support.
How do I choose between vLLM, Triton, and TGI for LLM serving?
Choose vLLM for maximum LLM throughput — its PagedAttention and continuous batching achieve the highest tokens-per-second on standard GPU hardware. Choose Triton when you need to serve a mix of model types (LLMs, vision models, tabular models) through a unified platform with model ensembling. Choose TGI when you are already invested in the HuggingFace ecosystem and want the simplest deployment path with built-in quantization support. For most teams starting with LLM serving, vLLM offers the best performance with reasonable operational complexity.
What GPU should I use for production LLM inference?
For models up to 13B parameters (quantized to INT4), the NVIDIA L4 at ~$0.81/hr offers the best cost per token. For 13B-70B parameter models, the A100 80GB provides the memory capacity needed at a reasonable cost. For 70B+ models or workloads requiring maximum throughput, the H100 delivers the best throughput per dollar despite its higher hourly cost. AWS Inferentia2 is worth evaluating for any high-volume transformer inference — it can deliver comparable throughput to A100 at roughly half the cost for supported model architectures.
How do I calculate the ROI of ML cost optimization efforts?
Measure cost-per-prediction (CPP) by model and use case before and after optimization. Multiply the per-prediction savings by your monthly prediction volume to get monthly savings. Compare this to the engineering time invested in the optimization (at fully-loaded engineer cost). Most optimizations pay for themselves within 1-3 months. For a comprehensive view, use the TCO framework that includes compute, engineering labor, data infrastructure, and platform tooling costs. Track CPP as a continuous metric alongside model quality metrics to ensure optimizations do not degrade production performance.