Machine Learning

Fine-Tuning Foundation Models for Enterprise: A Practical Engineering Guide for 2026

CodeBridgeHQ

Engineering Team

Mar 17, 2026
38 min read

Foundation Model Landscape in 2026

The foundation model ecosystem has matured significantly. Enterprise teams now have a clear spectrum of models available as fine-tuning base candidates, each with different licensing, capability, and cost profiles.

Three categories of models dominate the fine-tuning landscape:

  • Open-weight models (Llama 3.1/4, Mistral Large, Qwen 2.5, DeepSeek-V3, Gemma 2): Fully downloadable weights with permissive or semi-permissive licenses. These are the primary targets for enterprise fine-tuning because you control the entire pipeline end-to-end.
  • Provider fine-tuning APIs (OpenAI, Anthropic, Google): Upload your data and fine-tune through managed APIs. Simpler operationally but less control over the training process, hyperparameters, and deployment.
  • Specialized base models (Code Llama, BioMistral, FinGPT): Models pre-trained on domain-specific corpora that provide a stronger starting point for vertical fine-tuning in code, biomedical, financial, and legal domains.

"The gap between open-weight and proprietary models has narrowed dramatically. For most enterprise fine-tuning use cases, the best open-weight models match or exceed proprietary model performance after domain adaptation." — Stanford HAI, AI Index Report 2026

For a broader view of how these models compare for general use, see our AI model selection guide. When evaluating fine-tuning candidates, pay attention to the base model architecture, license terms (especially around derivative model distribution), and community ecosystem (availability of adapters, tooling, and benchmarks).

| Model Family | Parameters | License | Fine-Tuning Suitability | Key Strength |
|---|---|---|---|---|
| Llama 4 | 8B-405B | Llama Community | Excellent | Broad ecosystem, extensive tooling |
| Mistral Large | 7B-123B | Apache 2.0 (small) / Commercial (large) | Excellent | Strong reasoning, efficient architecture |
| Qwen 2.5 | 0.5B-72B | Apache 2.0 / Qwen License | Very Good | Multilingual, strong coding capability |
| Gemma 2 | 2B-27B | Gemma Terms of Use | Good | Small footprint, Google ecosystem |
| DeepSeek-V3 | 671B (MoE) | MIT | Good (expert-level) | MoE efficiency, strong reasoning |
| OpenAI Fine-Tuning API | GPT-4o, GPT-4o-mini | API only | Managed | Simplest workflow, no infra needed |

When to Fine-Tune vs Prompt Engineer vs RAG

The most expensive mistake in enterprise AI is fine-tuning when you do not need to. Before committing to a fine-tuning project, rigorously evaluate whether prompt engineering or retrieval-augmented generation (RAG) can achieve your target performance. Each approach has distinct strengths, and the best production systems often combine them.

The Decision Matrix

| Factor | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Setup time | Hours to days | Days to weeks | Weeks to months |
| Data requirement | Few examples (0-20) | Document corpus | 1,000-100,000+ labeled examples |
| Knowledge injection | Limited by context window | Excellent for factual recall | Bakes in patterns, not facts |
| Output formatting | Moderate control | Moderate control | Strong control over style and structure |
| Domain reasoning | Relies on base model knowledge | Retrieves relevant context | Learns domain-specific reasoning patterns |
| Inference cost | Higher (long prompts) | Higher (retrieval + long context) | Lower (shorter prompts, specialized behavior) |
| Maintenance | Low | Medium (index updates) | High (retraining cycles) |
| Best for | General tasks, exploration, prototyping | Factual Q&A, document grounding | Consistent style, domain reasoning, cost reduction at scale |

When Fine-Tuning Is the Right Choice

Fine-tune when you need at least one of the following:

  1. Consistent output formatting — Your application requires strict adherence to a specific JSON schema, report structure, or coding style that prompt engineering cannot reliably enforce across thousands of requests.
  2. Domain-specific reasoning — The model needs to apply specialized logic (medical coding rules, legal citation formats, financial compliance checks) that cannot be adequately described in a system prompt.
  3. Inference cost reduction — You are spending heavily on long, detailed prompts. Fine-tuning can encode that behavior into the model weights, reducing per-request token usage by 40-70%.
  4. Latency requirements — Shorter prompts from a fine-tuned model mean faster inference. For latency-sensitive applications processing millions of requests, this matters.
  5. Proprietary behavior patterns — Your competitive advantage depends on the model behaving in a way that reflects your company's unique methodology, tone, or decision-making framework.

When Not to Fine-Tune

Do not fine-tune if:

  • Your goal is injecting factual knowledge — use RAG instead. Fine-tuning is poor at memorizing facts and prone to hallucination when asked to recall specific data points.
  • Prompt engineering achieves 90%+ of your target accuracy. The incremental improvement from fine-tuning rarely justifies the cost if you are already close.
  • Your domain data changes frequently. Retraining cycles add latency to knowledge updates; RAG pipelines can update in real-time.
  • You have fewer than 500 quality examples. Small datasets lead to overfitting and unreliable generalization.

For architecture patterns that combine RAG with fine-tuning, see our guide on AI data pipeline architecture. To understand where fine-tuning fits into a broader ML strategy, read our enterprise ML strategy guide.

Data Preparation for Fine-Tuning

Data quality is the single most important factor in fine-tuning success. A well-curated dataset of 2,000 examples will outperform a noisy dataset of 50,000. The preparation process involves curation, formatting, quality assurance, and splitting.

Data Curation Strategy

Start by defining what "excellent output" looks like for your use case. Then work backward:

  1. Collect production examples — Mine your existing systems for input-output pairs. Customer support tickets with verified resolutions, analyst reports with structured outputs, code reviews with approved fixes. Production data is the gold standard because it reflects real distribution.
  2. Expert annotation — For tasks without existing data, have domain experts create examples. Budget $10-50 per expert-annotated example depending on complexity. A medical coding fine-tune might need board-certified coders; a customer support fine-tune might need senior agents.
  3. Synthetic data augmentation — Use a stronger model (e.g., GPT-4o or Claude Opus) to generate additional training examples, then have humans verify quality. This is especially effective for expanding edge case coverage.
  4. Deduplication and cleaning — Remove near-duplicates using embedding similarity (cosine similarity > 0.95 indicates likely duplicates). Strip PII, normalize formatting, and remove low-quality examples.
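
The deduplication step above can be sketched in pure Python. This assumes embeddings have already been computed by some sentence-embedding model (the embedding source is not specified here); the 0.95 threshold comes from the text, and the tiny 2-D vectors in the demo are illustrative stand-ins:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def deduplicate(examples, embeddings, threshold=0.95):
    """Keep only examples whose embedding is not a near-duplicate
    (cosine similarity > threshold) of an already-kept example.
    O(n^2) pairwise scan -- fine for a few thousand examples; use an
    approximate-nearest-neighbor index at larger scale."""
    kept, kept_vecs = [], []
    for ex, vec in zip(examples, embeddings):
        if all(cosine_similarity(vec, kv) <= threshold for kv in kept_vecs):
            kept.append(ex)
            kept_vecs.append(vec)
    return kept

# Tiny demo with hand-made 2-D "embeddings":
examples = ["refund request", "refund request!", "password reset"]
vectors = [[1.0, 0.01], [0.99, 0.02], [0.0, 1.0]]
print(deduplicate(examples, vectors))  # the near-duplicate second item is dropped
```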

Data Formatting

Most fine-tuning frameworks expect data in conversation format (system/user/assistant turns) or instruction-completion pairs:

// Conversation format (preferred for chat models)
{
  "messages": [
    {"role": "system", "content": "You are a financial analyst..."},
    {"role": "user", "content": "Analyze Q3 revenue trends for..."},
    {"role": "assistant", "content": "Based on the quarterly data..."}
  ]
}

// Instruction format (for completion-style fine-tuning)
{
  "instruction": "Classify the following support ticket...",
  "input": "My account was charged twice for...",
  "output": "Category: Billing\nPriority: High\nSentiment: Negative"
}
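
Before training, it pays to validate every record programmatically. Here is a minimal validator for the conversation format shown above; the field names match the example, but the specific strictness rules (non-empty content, final assistant turn) are our assumptions:

```python
import json

ALLOWED_ROLES = ("system", "user", "assistant")

def validate_conversation(record):
    """Return a list of problems found in one conversation-format record."""
    problems = []
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    for i, msg in enumerate(messages):
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: bad role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty content")
    if messages[-1].get("role") != "assistant":
        problems.append("last message must be an assistant turn")
    return problems

line = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
print(validate_conversation(json.loads(line)))  # [] -- record is valid
```

Run this over every line of your JSONL file and reject the whole file on any problem; silent formatting errors in training data are hard to diagnose after the fact.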

Quality Assurance Checklist

  • Accuracy — Every output in your training set must be correct. Errors in training data directly degrade model quality.
  • Consistency — Outputs should follow identical formatting conventions. If some examples use bullet points and others use numbered lists for the same task, the model will be inconsistent.
  • Diversity — Cover the full distribution of inputs your model will encounter in production. Underrepresented categories in training data become failure modes in production.
  • Appropriate length — Training examples should match your target output length. If you train on long-form examples but want concise outputs, the model will be verbose.
  • Split correctly — Use 80/10/10 train/validation/test split. The test set must be completely held out and representative of production distribution. Never contaminate your test set.
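
One way to keep the 80/10/10 split stable (and the test set uncontaminated) is to assign each example by hashing a stable ID, so the assignment never changes when the dataset is reshuffled or regrown. A sketch, assuming each example has a unique string ID:

```python
import hashlib

def assign_split(example_id, train=0.8, validation=0.1):
    """Deterministically assign an example to train/validation/test based
    on a hash of its ID. The same ID always lands in the same split, so
    re-running the pipeline never leaks test examples into training."""
    digest = hashlib.sha256(example_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"

splits = [assign_split(f"example-{i}") for i in range(10_000)]
print({s: splits.count(s) for s in ("train", "validation", "test")})
```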

"We spent 6 weeks curating 3,000 examples and 2 weeks training. If we had spent 2 weeks curating and 6 weeks training, we would have gotten worse results. Data quality dominates everything." — ML Engineering Lead, Fortune 500 Financial Services

Fine-Tuning Techniques: Full, LoRA, QLoRA, PEFT & Adapters

The choice of fine-tuning technique determines your compute cost, training time, and the tradeoff between model quality and efficiency. In 2026, parameter-efficient methods dominate enterprise fine-tuning because they deliver 90-98% of full fine-tuning performance at a fraction of the cost.

Technique Comparison

| Technique | Parameters Trained | GPU Memory (7B model) | GPU Memory (70B model) | Quality vs Full FT | Training Time |
|---|---|---|---|---|---|
| Full Fine-Tuning | 100% | ~60 GB | ~600 GB | Baseline | Longest |
| LoRA | 0.1-1% | ~18 GB | ~160 GB | 95-98% | 3-5x faster |
| QLoRA | 0.1-1% | ~10 GB | ~48 GB | 92-96% | 2-4x faster |
| PEFT (Prefix Tuning) | <0.1% | ~16 GB | ~140 GB | 85-93% | 4-6x faster |
| Adapter Layers | 1-5% | ~20 GB | ~180 GB | 93-97% | 2-4x faster |

LoRA (Low-Rank Adaptation)

LoRA is the default recommendation for enterprise fine-tuning in 2026. It works by freezing the original model weights and injecting small trainable rank-decomposition matrices into each transformer layer. Key hyperparameters:

  • Rank (r) — Typically 8-64. Higher ranks capture more complex adaptations but increase memory and risk overfitting. Start with r=16 for most tasks.
  • Alpha — Scaling factor, usually set to 2x the rank (alpha=32 for r=16). Controls the magnitude of the adaptation.
  • Target modules — Which layers to adapt. For transformer models, targeting query and value projection matrices (q_proj, v_proj) is the standard. Adding key and output projections (k_proj, o_proj) can improve quality at moderate cost.
  • Dropout — 0.05-0.1 for datasets under 5,000 examples, 0.0 for larger datasets.
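
The heuristics above can be captured in a small helper that produces a LoRA hyperparameter dict. This is a hypothetical convenience function, not part of any library; the field names mirror common LoRA tooling (e.g., Hugging Face PEFT's `LoraConfig`), but adapt them to whatever framework you use:

```python
def lora_config(rank=16, dataset_size=5_000, extended_targets=False):
    """Build a LoRA hyperparameter dict following the rules of thumb above:
    alpha = 2 * rank, dropout only for small datasets, and q/v projection
    matrices as the default target modules. (Hypothetical helper.)"""
    targets = ["q_proj", "v_proj"]
    if extended_targets:
        # Adding key/output projections can improve quality at moderate cost.
        targets += ["k_proj", "o_proj"]
    return {
        "r": rank,
        "lora_alpha": 2 * rank,
        "lora_dropout": 0.05 if dataset_size < 5_000 else 0.0,
        "target_modules": targets,
    }

print(lora_config(rank=16, dataset_size=3_000))
```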

QLoRA (Quantized LoRA)

QLoRA quantizes the base model to 4-bit precision (NF4 format) while keeping LoRA adapters in full precision. This dramatically reduces GPU memory requirements — fine-tune a 70B model on a single A100 80GB GPU instead of needing 8 GPUs. The quality tradeoff is modest (2-5% degradation on most benchmarks) and often acceptable for production workloads.
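
A back-of-envelope memory estimate makes the single-GPU claim concrete. The formula below is an approximation under stated assumptions (0.5 bytes per parameter for NF4 weights, bf16 adapters plus fp32 Adam states for the ~0.5% of trained parameters, a fixed overhead for activations and CUDA context), not a guarantee:

```python
def qlora_memory_gb(params_billions, lora_fraction=0.005, overhead_gb=6.0):
    """Rough QLoRA GPU memory estimate (approximate, assumption-laden):
    - base weights in 4-bit NF4: ~0.5 bytes per parameter
    - LoRA adapter weights (bf16, 2 B) + Adam optimizer states (~8 B)
      for the small fraction of trained parameters
    - fixed overhead for activations, KV cache, and CUDA context"""
    base = params_billions * 1e9 * 0.5 / 1e9
    adapters = params_billions * 1e9 * lora_fraction * (2 + 8) / 1e9
    return base + adapters + overhead_gb

print(f"70B QLoRA: ~{qlora_memory_gb(70):.0f} GB")  # fits a single 80 GB A100
```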

Full Fine-Tuning

Full fine-tuning updates every parameter and provides the best quality ceiling, but requires distributed training across multiple GPUs for any model above 7B parameters. Reserve full fine-tuning for cases where:

  • You are fine-tuning a small model (under 3B parameters) where LoRA overhead is proportionally high.
  • Your task is significantly different from the base model's training distribution (e.g., adapting an English model for a low-resource language).
  • You have verified through experimentation that LoRA leaves a meaningful quality gap for your specific task.

Multi-Adapter Serving

A significant advantage of LoRA-based fine-tuning is that multiple adapters can share a single base model in production. Load different LoRA adapters at inference time based on the request type, customer, or domain. This enables serving dozens of specialized models with the memory footprint of one base model plus small adapter overhead.
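
The routing logic for multi-adapter serving can be as simple as a lookup table. In a real deployment the serving engine (for example, vLLM with dynamic LoRA loading) swaps adapters on the shared base model; the adapter names and request types below are hypothetical:

```python
# Hypothetical routing table: request type -> LoRA adapter name.
ADAPTER_ROUTES = {
    "support": "lora-support-v3",
    "billing": "lora-billing-v1",
    "legal": "lora-legal-v2",
}

def pick_adapter(request_type, default="base-model"):
    """Route a request to the appropriate LoRA adapter, falling back to
    the un-adapted base model for unknown request types."""
    return ADAPTER_ROUTES.get(request_type, default)

print(pick_adapter("billing"))   # lora-billing-v1
print(pick_adapter("unknown"))   # base-model
```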

Training Infrastructure & Compute Requirements

Fine-tuning compute requirements scale with model size, dataset size, and technique. Here is a practical guide to infrastructure planning.

Hardware Requirements by Model Size

| Model Size | Technique | Minimum Hardware | Recommended Hardware | Training Time (5K examples) |
|---|---|---|---|---|
| 7-8B | QLoRA | 1x A100 40GB | 1x A100 80GB | 2-4 hours |
| 7-8B | LoRA | 1x A100 80GB | 2x A100 80GB | 3-6 hours |
| 7-8B | Full FT | 4x A100 80GB | 8x A100 80GB | 8-16 hours |
| 13-14B | QLoRA | 1x A100 80GB | 2x A100 80GB | 4-8 hours |
| 70B | QLoRA | 1x A100 80GB | 2x H100 80GB | 12-24 hours |
| 70B | LoRA | 4x A100 80GB | 4x H100 80GB | 16-32 hours |
| 70B | Full FT | 8x H100 80GB | 16x H100 80GB | 48-96 hours |

Cloud Platform Options

For most enterprises, cloud-based GPU instances provide the right balance of flexibility and cost. Major options include:

  • AWS — p5.48xlarge (8x H100) or p4d.24xlarge (8x A100). SageMaker provides managed training jobs with automatic checkpointing and spot instance support. For a detailed overview, see our AWS AI/ML ecosystem guide.
  • Google Cloud — a3-highgpu-8g (8x H100) or a2-ultragpu-8g (8x A100). Vertex AI offers managed fine-tuning for select model families.
  • Azure — ND H100 v5 series (8x H100). Azure ML provides managed training with built-in experiment tracking.
  • Specialized providers — Lambda Labs, CoreWeave, and Together AI offer GPU-optimized instances at 30-50% lower cost than hyperscalers, with the tradeoff of less mature enterprise tooling.

Use spot/preemptible instances for training runs and implement robust checkpointing (save every 100-200 steps). Spot instances reduce GPU costs by 60-70% and are ideal for training workloads that can resume from checkpoints. See our ML cost optimization guide for detailed strategies.
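
The checkpoint-and-resume pattern looks like this in skeleton form. A real training loop would also persist model and optimizer state (most frameworks handle this for you); here only the step counter is saved, to keep the sketch self-contained:

```python
import json
import os
import tempfile

def train_with_checkpoints(total_steps, ckpt_path, save_every=100):
    """Skeleton of a spot-instance-safe training loop: resume from the
    last saved step, then checkpoint every `save_every` steps. Returns
    the step it resumed from."""
    start = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            start = json.load(f)["step"]
    for step in range(start, total_steps):
        # ... one optimizer step would happen here ...
        if (step + 1) % save_every == 0:
            with open(ckpt_path, "w") as f:
                json.dump({"step": step + 1}, f)
    return start

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
train_with_checkpoints(250, path)        # first run: starts from step 0
print(train_with_checkpoints(250, path)) # a preempted re-run resumes at 200
```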

Training Frameworks

The dominant training frameworks for enterprise fine-tuning in 2026:

  • Hugging Face TRL + PEFT — The default choice. SFTTrainer handles supervised fine-tuning with LoRA/QLoRA integration, gradient checkpointing, and mixed precision. Excellent ecosystem integration.
  • Axolotl — Configuration-driven fine-tuning that simplifies multi-GPU training, DeepSpeed integration, and hyperparameter sweeps. Lower learning curve than raw TRL.
  • LLaMA-Factory — Unified interface for fine-tuning 100+ models with a web UI. Good for teams exploring multiple model-technique combinations.
  • NVIDIA NeMo — Enterprise-grade framework with built-in support for large-scale distributed training, RLHF, and deployment to Triton Inference Server.

Evaluation Frameworks

A fine-tuned model is only as good as your evaluation proves it to be. Enterprise evaluation requires a multi-layered approach combining automated metrics, domain-specific benchmarks, and human evaluation.

Automated Evaluation

Build an automated evaluation pipeline that runs after every training run and on every candidate model before deployment:

  • Task-specific metrics — Accuracy, F1, precision, recall for classification tasks. BLEU, ROUGE, BERTScore for generation tasks. Exact match for structured output tasks.
  • Domain benchmark suite — Create a benchmark of 200-500 examples that represent your production distribution, including edge cases and adversarial inputs. Track performance across training runs to detect regressions.
  • LLM-as-judge — Use a stronger model (e.g., Claude Opus or GPT-4o) to evaluate your fine-tuned model's outputs on criteria like accuracy, relevance, completeness, and safety. This scales better than human evaluation and correlates at 80-90% with human judgments.
  • Regression testing — Ensure the fine-tuned model has not degraded on general capabilities. Run a subset of general benchmarks (MMLU, HumanEval, MT-Bench) and compare against the base model. Acceptable degradation is under 5% on general benchmarks.
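
The regression gate above is easy to automate: compare per-benchmark scores against the base model and fail the candidate if any relative drop exceeds the threshold. The benchmark names and scores below are illustrative, not measurements:

```python
def regression_gate(base_scores, tuned_scores, max_drop=0.05):
    """Flag any general benchmark where the fine-tuned model degraded
    by more than `max_drop` (relative) versus the base model."""
    failures = []
    for name, base in base_scores.items():
        drop = (base - tuned_scores[name]) / base
        if drop > max_drop:
            failures.append((name, round(drop, 3)))
    return failures

# Illustrative scores only:
base = {"MMLU": 0.68, "HumanEval": 0.55, "MT-Bench": 8.1}
tuned = {"MMLU": 0.66, "HumanEval": 0.49, "MT-Bench": 8.0}
print(regression_gate(base, tuned))  # HumanEval dropped ~11% -> flagged
```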

Human Evaluation

Automated metrics are necessary but insufficient. Human evaluation catches failure modes that metrics miss:

  1. Blind comparison — Present evaluators with outputs from the base model and fine-tuned model (randomized order) and ask which is better. Sample 100-200 examples stratified by difficulty and category.
  2. Error taxonomy — Have domain experts categorize errors in a 50-100 sample of fine-tuned model outputs. Common categories: factual errors, formatting violations, reasoning failures, hallucinations, and safety issues.
  3. Production shadow evaluation — Run the fine-tuned model in shadow mode alongside your current production system for 1-2 weeks. Compare outputs without serving the fine-tuned model's results to users.

"We have seen teams skip human evaluation because their automated metrics looked good, only to discover in production that the model had learned subtle biases from the training data. Budget at least 20 hours of domain expert evaluation per fine-tuning iteration." — Google DeepMind, Best Practices for LLM Fine-Tuning (2025)

For ongoing production monitoring after deployment, see our guide on ML model monitoring and observability.

Deployment & Serving Fine-Tuned Models

Deploying a fine-tuned model requires different infrastructure than API-based model integration. You are now responsible for model serving, scaling, and reliability.

Serving Infrastructure

The primary serving frameworks for fine-tuned LLMs in 2026:

  • vLLM — The most popular open-source serving engine. PagedAttention for efficient memory management, continuous batching for throughput, and support for LoRA adapter hot-swapping. Recommended as the default choice.
  • NVIDIA Triton + TensorRT-LLM — Maximum inference performance on NVIDIA hardware. More complex setup but 20-40% faster than vLLM for high-throughput workloads. Best for production deployments that justify the engineering overhead.
  • Text Generation Inference (TGI) — Hugging Face's production serving solution. Good integration with Hugging Face ecosystem and simpler than Triton for moderate-scale deployments.
  • SageMaker / Vertex AI endpoints — Managed serving that handles autoscaling, health checks, and blue/green deployments. Higher cost but lower operational burden.

Deployment Patterns

Use these proven patterns for enterprise fine-tuned model deployment:

  1. Canary deployment — Route 5% of traffic to the fine-tuned model, monitor quality metrics and error rates for 24-48 hours, then gradually increase traffic. This is the safest approach for production systems.
  2. A/B testing — Serve both base and fine-tuned models to different user cohorts with randomized assignment. Measure business metrics (task completion rate, user satisfaction, downstream accuracy) alongside model metrics.
  3. Fallback chains — Use the fine-tuned model as the primary, with automatic fallback to the base model (or a commercial API) if the fine-tuned model produces low-confidence outputs or times out. This provides a safety net during the validation period.
  4. Multi-adapter routing — Deploy one base model with multiple LoRA adapters. Route requests to the appropriate adapter based on task type, customer, or domain. vLLM supports dynamic LoRA loading for this pattern.
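
Pattern 3, the fallback chain, reduces to a few lines of control flow. The model callables and the `confidence` field in their responses are assumptions made for illustration; real serving stacks surface confidence differently (log-probabilities, validator checks, or explicit scoring):

```python
def with_fallback(prompt, primary, fallback, min_confidence=0.7):
    """Fallback-chain sketch: use the fine-tuned model unless it errors,
    times out, or reports low confidence; otherwise fall back to the
    base model or a commercial API."""
    try:
        result = primary(prompt)  # expected shape: {"text": ..., "confidence": ...}
        if result["confidence"] >= min_confidence:
            return result["text"], "primary"
    except Exception:
        pass  # timeout or serving error -> fall through to the fallback
    return fallback(prompt), "fallback"

# Stub models for illustration:
fine_tuned = lambda p: {"text": "Category: Billing", "confidence": 0.91}
base_api = lambda p: "Category: Billing (base)"
print(with_fallback("classify: double charge", fine_tuned, base_api))
```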

For the broader MLOps pipeline context, including CI/CD for models, see our MLOps pipeline architecture guide. For integration patterns with existing systems, see AI API integration with existing tech stacks.

Quantization for Deployment

Fine-tuned models can be quantized for serving to reduce GPU memory and improve latency without significant quality loss:

  • GPTQ — Post-training quantization to 4-bit. Reduces memory by 4x with 1-3% quality degradation. The standard for deployment quantization.
  • AWQ — Activation-aware quantization that preserves important weights. Often better quality than GPTQ at the same bit width.
  • FP8 — 8-bit floating point, natively supported on H100 GPUs. Minimal quality loss (under 1%) with 2x memory reduction. Preferred when H100s are available.

Cost Analysis: Training + Inference

Understanding the full cost picture is essential for making the business case for fine-tuning. Costs break down into training (one-time per iteration), evaluation (recurring), and inference (ongoing).

Training Costs

| Scenario | Model | Technique | Hardware | Training Time | Estimated Cost |
|---|---|---|---|---|---|
| Small | Llama 3.1 8B | QLoRA | 1x A100 80GB | 3 hours | $10-15 |
| Medium | Llama 3.1 70B | QLoRA | 2x H100 80GB | 18 hours | $250-400 |
| Large | Llama 3.1 70B | LoRA | 4x H100 80GB | 24 hours | $700-1,200 |
| Full FT | Llama 3.1 70B | Full | 16x H100 80GB | 72 hours | $8,000-15,000 |
| API-based | GPT-4o mini | Managed | OpenAI | 2-6 hours | $50-500 |

These are per-run costs. Budget for 3-5 training runs per iteration (hyperparameter tuning, data refinement) and 2-4 major iterations before achieving production-quality results. Total budget for a medium-complexity enterprise fine-tuning project: $5,000-$25,000 including compute, data preparation, and evaluation.

Inference Cost Comparison

The long-term ROI of fine-tuning often comes from inference cost savings. A fine-tuned model that requires shorter prompts saves on every request:

| Approach | Cost per 1M Requests | Latency (p50) | Notes |
|---|---|---|---|
| GPT-4o with detailed prompt | $2,500-5,000 | 2-4s | Long system prompts, few-shot examples |
| GPT-4o-mini with detailed prompt | $150-400 | 0.5-1.5s | Lower quality for complex tasks |
| Fine-tuned GPT-4o-mini | $200-600 | 0.5-1.5s | Higher per-token cost, but shorter prompts |
| Self-hosted fine-tuned 8B | $50-150 | 0.3-0.8s | Fixed GPU cost, unlimited requests |
| Self-hosted fine-tuned 70B | $200-500 | 1-3s | Fixed GPU cost, quality matches larger APIs |

The breakeven point for self-hosted fine-tuned models vs API calls typically occurs at 100,000-500,000 requests per month, depending on task complexity and model size. For a comprehensive analysis of ML infrastructure costs, see our ML cost optimization guide.
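
The breakeven arithmetic is straightforward: a self-hosted model carries a fixed monthly GPU cost, while API pricing scales per request. The figures in the example ($1,200/month for one inference GPU, $3,000 per 1M requests for a frontier API with long prompts) are assumed for illustration, and the calculation ignores engineering and ops overhead, which favor the API at low volume:

```python
def breakeven_requests(gpu_cost_per_month, api_cost_per_million_requests):
    """Monthly request volume above which a fixed-cost self-hosted GPU
    beats pay-per-request API pricing (ops overhead ignored)."""
    return gpu_cost_per_month / (api_cost_per_million_requests / 1_000_000)

print(f"{breakeven_requests(1_200, 3_000):,.0f} requests/month")  # 400,000
```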

Licensing & Legal Considerations

Fine-tuning introduces licensing complexities that enterprise legal teams must evaluate before production deployment.

Open-Weight Model Licenses

  • Apache 2.0 (Mistral 7B, Qwen small models) — Most permissive. Use, modify, distribute freely. No restrictions on commercial use or derivative works.
  • Llama Community License (Llama 3/4) — Free for commercial use up to 700M monthly active users. Requires attribution. Derivative models must include "Llama" in the name. No use for training competing models.
  • Gemma Terms of Use — Free for commercial use. Restrictions on generating harmful content, using for surveillance, and redistribution of model weights without the license.
  • Model-specific licenses — Always read the full license. Some models have restrictions on specific use cases (medical, military), geographic regions, or revenue thresholds.

API Fine-Tuning Terms

When using provider fine-tuning APIs:

  • Data usage — Most providers (OpenAI, Google, Anthropic) state that fine-tuning data is not used to train their base models. Verify this in current terms of service.
  • Model ownership — Fine-tuned model weights typically remain on the provider's infrastructure. You cannot download or migrate them.
  • Data retention — Understand how long your training data is retained and whether it can be deleted on request.
  • Export restrictions — Some model weights have export control implications. Verify compliance with ITAR, EAR, and applicable regulations.

Training Data IP

Ensure your training data does not create intellectual property risks:

  • Verify you have rights to use all training data for model training (separate from data access rights).
  • If training on customer data, ensure data processing agreements cover ML training use cases.
  • Document data provenance for audit purposes. Maintain a data lineage record for every training dataset.
  • Consider copyright implications if training on copyrighted text — the legal landscape is still evolving in 2026.

For a broader perspective on build-vs-buy decisions in enterprise AI, see our build vs buy AI solutions guide.

Continuous Improvement & Retraining Cycles

Fine-tuning is not a one-time activity. Production models degrade over time as user behavior shifts, domain knowledge evolves, and edge cases accumulate. A systematic retraining cycle is essential.

Monitoring for Retraining Triggers

Establish automated monitoring that detects when your fine-tuned model needs updating:

  • Quality metric drift — Track your domain benchmark scores weekly. A sustained drop of more than 3-5% signals potential degradation.
  • User feedback patterns — Monitor thumbs-down rates, correction rates, and escalation frequency. Increasing negative feedback often precedes metric degradation.
  • Input distribution shift — Compare production input embeddings against training data distribution. Significant drift indicates the model is encountering inputs outside its training distribution.
  • Error pattern analysis — Cluster model failures to identify systematic error categories. New error categories that did not exist at launch indicate emerging failure modes.
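
A crude but useful input-distribution-shift signal is the distance between the centroid of training-set embeddings and the centroid of recent production embeddings. This sketch uses Euclidean distance on toy 2-D vectors; production systems typically use stronger tests (e.g., maximum mean discrepancy or population-stability metrics):

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def embedding_drift(train_embeddings, prod_embeddings):
    """Euclidean distance between the training-set centroid and the
    production-traffic centroid -- larger values suggest the model is
    seeing inputs outside its training distribution."""
    a, b = centroid(train_embeddings), centroid(prod_embeddings)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 2-D "embeddings" for illustration:
train = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]]
prod_same = [[0.1, 0.9], [0.0, 1.0]]
prod_shifted = [[1.0, 0.0], [0.9, 0.1]]
print(embedding_drift(train, prod_same) < embedding_drift(train, prod_shifted))  # True
```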

For a comprehensive monitoring setup, see our ML model monitoring and observability guide.

Retraining Workflow

  1. Collect new examples — Continuously collect production examples, especially corrected outputs and edge cases. Aim for 200-500 new high-quality examples per retraining cycle.
  2. Merge with existing dataset — Combine new examples with the original training set. Do not train only on new data — this causes catastrophic forgetting of previous capabilities.
  3. Run training with validation — Fine-tune from the base model (not the previous fine-tuned version) using the combined dataset. This avoids error accumulation from sequential fine-tuning.
  4. Evaluate against all benchmarks — Run the full evaluation suite. The new model must pass on both the original benchmarks and any new test cases added from production failures.
  5. Canary deploy and validate — Deploy the updated model alongside the current production model. Validate with live traffic before full rollover.

Retraining Cadence

Typical retraining schedules by domain:

  • Fast-moving domains (social media, news, trending topics): Monthly retraining or continuous learning pipelines.
  • Moderate domains (customer support, general business): Quarterly retraining with monthly evaluation checkpoints.
  • Stable domains (legal, medical, regulatory): Semi-annual retraining with monthly monitoring. Retrain sooner when regulations change.

Version every model artifact (weights, training data snapshot, hyperparameters, evaluation results) for reproducibility and audit. Use a model registry (MLflow, Weights & Biases, or cloud-native options) to track the full lineage from data to deployed model.

Frequently Asked Questions

How many training examples do I need to fine-tune a foundation model?

For most enterprise tasks, 1,000-5,000 high-quality examples deliver strong results with LoRA fine-tuning. Simple formatting or classification tasks may work with 500-1,000 examples. Complex domain reasoning tasks benefit from 5,000-20,000 examples. Quality matters more than quantity — 2,000 expert-verified examples consistently outperform 20,000 noisy examples. Start with 1,000, evaluate, and scale up only if the evaluation shows improvement with more data.

Should I fine-tune a small model or a large model?

Start with the smallest model that meets your quality bar. A fine-tuned 8B model often matches or exceeds a base 70B model for domain-specific tasks because fine-tuning concentrates capability on your task. Fine-tune an 8B model first, evaluate it rigorously, and only move to a larger model if the 8B model cannot reach your quality threshold. Larger models cost 5-10x more to train and serve, so justify the upgrade with evidence.

Can I fine-tune a model and use RAG at the same time?

Yes, and this combination is often the best approach for enterprise applications. Fine-tune the model for domain-specific reasoning patterns, output formatting, and task behavior. Use RAG to inject current factual knowledge at inference time. The fine-tuned model becomes better at interpreting and synthesizing retrieved context because it understands your domain's language and reasoning patterns. This avoids baking volatile facts into model weights while improving how the model uses retrieved information.

How do I prevent a fine-tuned model from losing general capabilities?

This is called catastrophic forgetting and is managed through several techniques. First, mix 10-20% of general-purpose instruction data into your domain training set to maintain broad capabilities. Second, use LoRA instead of full fine-tuning — LoRA preserves the base model weights and adds domain capability on top. Third, monitor general benchmarks (MMLU, MT-Bench) alongside domain benchmarks during evaluation. If general capability drops more than 5%, reduce training epochs or learning rate.

What is the typical timeline for an enterprise fine-tuning project?

A realistic timeline for the first production deployment: 2-4 weeks for data collection and curation, 1-2 weeks for initial fine-tuning experiments, 1-2 weeks for evaluation and iteration, and 1-2 weeks for deployment and validation. Total: 5-10 weeks from kickoff to production. Subsequent retraining cycles are faster (2-3 weeks) because the data pipeline, evaluation framework, and deployment infrastructure are already in place. Budget the majority of time on data quality — it determines model quality.

Tags

Fine-Tuning · Foundation Models · LLM · Transfer Learning · Enterprise AI
