Munich Datageeks e.V.
Talk "Fine-Tuning Language Models"

Talk "Fine-Tuning Language Models"

"Fine-Tuning Language Models: When to customize Language Models for better performance," presented by Laurens Tetzlaff from Netlight at Munich Datageeks - September Edition 2025.

Abstract

This talk covers how LLMs changed the way we do and see Data Science, a decision matrix on when to consider what type of GenAI solution as well as in which cases you actually want to fine-tune LLMs. We'll look at practical instruction-based fine-tuning techniques like LoRA/QLoRA and share lessons learned, including common mistakes to avoid.

About the speaker

Laurens Tetzlaff is an engineer and scientist specializing in building and scaling data and AI solutions, teams, and organizations. With a background in data science, information systems engineering, and entrepreneurship, he thrives at the intersection of business and technology to create impactful products. At Netlight, Laurens enables clients in the data and AI domain to unlock their full potential while shaping Netlight as a sustainable platform for client impact and personal growth.

Transcript Summary

Introduction and Speaker Background

The talk focuses on Large Language Models (LLMs), specifically addressing when fine-tuning is necessary and when it should be avoided. The speaker, a data consultant specializing in GenAI with a background in machine learning and data engineering, emphasizes that in most cases, fine-tuning is not the optimal approach. The presentation covers the historical development of LLMs, the current AI landscape, practical use cases, and a decision framework for determining when to invest in fine-tuning.

The Evolution of Large Language Models

The development of modern LLMs began in 2017 with the groundbreaking paper "Attention Is All You Need", which introduced the transformer architecture. It was followed in 2018 by BERT, one of the first LLM architectures to demonstrate the practical potential of these models.

GPT-2 arrived in 2019 with 1.5 billion parameters and a context length of 1,024 tokens, the amount of input text the model could process at once. One year later, GPT-3 significantly increased both model size and context window, marking a shift in thinking: instead of baking information into the model through fine-tuning or training, developers could provide information directly in the context.

This trend accelerated rapidly. ChatGPT launched in 2022, followed by GPT-4 in 2023 with approximately 1.7 trillion parameters (as widely reported) and a 32,000-token context window. At this scale, users could add entire documents to the context without needing to train the model weights directly.

2024 brought what the speaker calls the "token explosion": Claude 3.5 and Gemini 1.5 advertised ever larger context windows, with Gemini claiming up to one million tokens, though performance at that scale was debatable. By 2025, multimodal models that can process both images and text became well established, with very large context windows and off-the-shelf functionality through APIs from providers like Anthropic and OpenAI.

The Modern AI Dilemma

The speaker identifies a common perception in the industry regarding GenAI promises. Many believe that advanced models like GPT-5 and Claude can handle any task, and if they cannot, the solution is simply better prompt engineering or adding more context. When prompts are insufficient, the answer is to add even more context, which should theoretically eliminate the need for fine-tuning. If this still fails, the next model iteration in a few weeks or months should solve the problem.

This perspective is frequently encountered among data scientists, practitioners, and business stakeholders, leading to ongoing questions about when to invest time in building training data and custom models. In 2025, there is significant discussion about AI agents that can build and execute tasks autonomously.

Reality of AI Implementation Challenges

Despite the promise of universal AI solutions, several practical challenges exist in real-world applications:

Specialized Domains: Models available through third-party APIs or open-weight models are rarely specialized for specific companies or industry needs. Many processes involve implicit institutional knowledge—expertise developed by employees over three, four, five, or ten years that is not encoded anywhere in accessible formats.

Consistency and Scale: When deploying models at scale, consistency becomes crucial. Some models are non-deterministic, occasionally producing undesired outputs, which presents significant challenges for production systems.

Cost versus Latency: The latest chain-of-thought or reasoning models take longer to generate responses than smaller or non-reasoning models. Hosting adds further complexity: whether on-premises or in private cloud clusters, GPU costs can become substantial. Even running a 7B model on hardware like an H100 quickly becomes expensive, while smaller models on more modest GPUs offer far more economical alternatives.

Privacy and Compliance: In regulated industries like healthcare, finance, and other sectors with strict data governance requirements, privacy and compliance considerations become paramount factors in model selection and deployment.

Practical Industry Use Cases

Document AI for Invoice Processing: Many companies across media, finance, medtech, and other sectors receive invoices from third parties, employees, or partner companies. When not already automated, manual verification is required to check if charges match contractual agreements, determine which team or department should handle the invoice, and route it appropriately. This process often relies on implicit institutional knowledge held by accountants and office workers.

In specialized contexts like the construction industry, invoices may not be fully digitized, requiring employees to apply their institutional knowledge to understand and process them. This knowledge is not present in standard LLMs and cannot easily be added to context, making fine-tuning an interesting option.

Product Description Generation: For e-commerce platforms where sellers upload products with lengthy descriptions, a small, fine-tuned model can quickly generate quality product descriptions. This use case also allows for adjusting the model's tone of voice to match specific industry standards and company branding.

Conceptual Cost Considerations

Different approaches carry different cost profiles:

Manual Processes: Traditional manual approaches have zero setup costs but high ongoing operational expenses due to labor requirements.

Off-the-Shelf Models: Using pre-trained models via APIs requires low setup effort and generally provides good performance, though usage costs can accumulate with high-volume applications.

Fine-Tuned LLMs: Fine-tuning involves considerations around latency and hardware limits, and full fine-tuning in particular can require significant compute budgets. Domain adaptation necessitates training data and an understanding of the target domain, resulting in longer lead times and higher costs. However, once properly trained, the model performs reliably for its specific domain, and smaller fine-tuned models can be deployed on more modest GPUs, yielding faster inference than larger general-purpose models.

Additional factors to evaluate include performance requirements (response time measured in seconds versus sub-second needs), domain knowledge requirements, data privacy concerns (common in industry settings), time to deployment and market, request volume (100 requests per day versus 100,000), and available resources including both personnel expertise and compute infrastructure.

Decision Framework: When Not to Fine-Tune

The speaker emphasizes that fine-tuning is often not the first option for most GenAI use cases and recommends starting simple.

Baseline Capability Check: First, assess whether existing models can already accomplish the required task. Use APIs, apply prompt engineering techniques, and explore few-shot learning. This approach resolves most challenges without additional investment.

Deployment and Privacy Check: Evaluate whether data is confidential, whether the environment is regulated, and whether information is proprietary. For sensitive scenarios, consider on-premises setups, though hosting large models requires substantial compute resources. Alternatively, VPC and cloud solutions with encrypted data may suffice. This consideration frequently arises in client projects.

Problem Type Classification: Categorize the problem by type:

  1. Knowledge Retrieval: If knowledge already exists in structured formats like documents, images, or presentations, use Retrieval Augmented Generation (RAG).
  2. External Tools or Functions: If the solution requires internet access, database queries, integration with other applications, or execution of business logic, consider an agentic approach using tool calling and function execution. This represents the current focus of industry innovation due to its automation potential.
  3. Fine-Tuning Scenarios: Only when models lack necessary information even with RAG and agents, and when dealing with proprietary private data and institutional knowledge embedded in processes, should fine-tuning be considered.

Fine-Tuning Commitment Checklist

Problem Validation: Define success criteria and metrics clearly. Establish baseline performance using off-the-shelf models. Secure stakeholder buy-in, acknowledging that development takes longer than the quick GenAI applications often marketed: not just one day or one week. Budget appropriately, as costs may exceed simple API usage.

Data Validation: Ensure high-quality training data is available. Verify that data represents the target population appropriately, as with any data science problem. Confirm proper labeling quality through domain expert verification. This is particularly critical because different annotators might assign different labels to the same content, requiring consistency checks. Consider edge cases and failure modes where assumed domain knowledge may prove insufficient.

Resource Validation: Confirm technical expertise is available, compute resources are accessible, and timelines are realistic. Watch for red flags including use case creep (discovering six months later that the model needs to handle additional scenarios not initially considered, potentially requiring training from scratch), corrupted data (problematic in any data science project but especially critical for large model training), and expectations of 100% accuracy from day one (unrealistic given that these models are typically non-deterministic).

Understanding LLM Fundamentals

Large Language Models output a probability distribution over tokens, predicting the next best token from that distribution during inference. Fine-tuning aims to improve the performance of existing models for specific use cases, potentially adjusting which next token is predicted.

In instruction fine-tuning, training data consists of instruction-query and response pairs. These prompt-completion pairs train the model with a loss function that compares the model's predicted next-token distribution to the target tokens, quantifying how far off the predictions are and how strongly the weights should be corrected.
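To make the prompt-completion mechanics concrete, here is a minimal sketch (not from the talk; gpt2 and the example pair are purely illustrative) of how a single pair becomes a masked next-token loss with the Hugging Face transformers library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM from the Hub works the same way; gpt2 keeps the sketch small.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One instruction fine-tuning example: a prompt-completion pair.
prompt = "Classify the sentiment of this review: 'Great value, fast shipping.'\nSentiment:"
completion = " positive"

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids

# Labels mirror the inputs, but prompt positions are set to -100 so the
# cross-entropy loss is computed only on the completion tokens.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

# transformers shifts the labels internally: each position's prediction is
# compared against the actual next token, yielding the training loss.
loss = model(input_ids=full_ids, labels=labels).loss
print(f"next-token loss: {loss.item():.3f}")
```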

Fine-Tuning Techniques

Full Fine-Tuning: This approach updates all model weights using the entire training data corpus. The challenges are substantial: large models require extensive GPU compute resources and very large training corpora. The ready-to-use models on platforms like Hugging Face are themselves typically instruction fine-tuned in this fashion. Training examples consist of prompt-completion pairs, such as an input sentence paired with the expected classification output.

The primary problem with full fine-tuning is catastrophic forgetting. Since LLMs are trained on diverse tasks and task categories, changing all weights can destroy the model's previous capabilities to respond to those tasks. This approach is very expensive and unrealistic for most use cases due to computational and data requirements.

Parameter-Efficient Fine-Tuning (PEFT): This approach freezes most or all model weights and trains only a small subset of task-specific parameters. Various PEFT techniques exist, with LoRA and QLoRA being the primary focus due to their widespread support.

Advantages of PEFT include manageable memory requirements (training only a small parameter subset reduces compute, training data, and time requirements), no catastrophic forgetting (original model weights remain unchanged), and practical deployment options. During inference, the fine-tuned parameters and original model weights are combined.

For deployment, vLLM and similar inference runtimes must support the chosen technique. When deploying custom models, verification of runtime compatibility is essential.

LoRA (Low-Rank Adaptation) in Detail

LoRA enables fine-tuning on high-end consumer GPUs without requiring massive data center infrastructure, unlike full fine-tuning. The technique freezes the original model weights and injects a pair of low-rank decomposition matrices alongside each targeted weight matrix; only these much smaller matrices are trained.

For example, a weight matrix of dimensions 64 by 512 contains 32,768 parameters. Decomposing it into two rank-8 matrices (64 by 8 and 8 by 512) leaves only 4,608 trainable parameters, an 86% reduction. This dramatic decrease enables behavior modification, output structure changes, and domain knowledge integration at a fraction of full fine-tuning costs.
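The same arithmetic applies to any layer shape and rank; a few illustrative lines of Python (not from the talk) make the savings explicit:

```python
def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Full vs. LoRA trainable parameters for one d_in x d_out weight matrix."""
    full = d_in * d_out                # the frozen original weight matrix
    lora = d_in * rank + rank * d_out  # the two injected low-rank matrices
    return full, lora, 1 - lora / full

for rank in (4, 8, 16):
    full, lora, saving = lora_param_counts(64, 512, rank)
    print(f"rank {rank:2d}: {full} -> {lora} trainable params ({saving:.0%} reduction)")
# rank  8: 32768 -> 4608 trainable params (86% reduction)
```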

Deployment Flexibility: With Nvidia NIM or open-source solutions like vLLM, the base model and LoRA adapters can be deployed together. For scenarios with multiple use cases and extensive training data, different LoRA adapters can even be swapped at runtime. The base model remains loaded on the GPU while adapters are switched for different use cases—use case A loads adapter A, use case B loads adapter B.
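A minimal sketch of this runtime adapter swapping with vLLM's offline Python API; the base model name and adapter paths are illustrative assumptions:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# The base model stays loaded on the GPU; adapters are applied per request.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)  # assumed base model
params = SamplingParams(max_tokens=128, temperature=0.2)

# Hypothetical adapters for two different use cases.
invoice_adapter = LoRARequest("invoice_routing", 1, "/adapters/invoice-routing")
product_adapter = LoRARequest("product_descriptions", 2, "/adapters/product-descriptions")

# Use case A and use case B hit the same base model with different adapters.
out_a = llm.generate(["Route this invoice: ..."], params, lora_request=invoice_adapter)
out_b = llm.generate(["Describe this product: ..."], params, lora_request=product_adapter)
```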

QLoRA: This quantized variant of LoRA keeps the frozen base weights in a low-precision format (typically 4-bit) during training, which changes the performance profile. QLoRA is generally more efficient during training, while LoRA tends to be more efficient during inference. However, results vary by use case and model, warranting independent testing and evaluation.
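In the Hugging Face ecosystem, a QLoRA setup typically just loads the frozen base model in 4-bit via bitsandbytes before attaching the usual LoRA adapters; a hedged sketch, with the model name as an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization of the frozen base weights, as proposed in the QLoRA paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # assumed base model
    quantization_config=bnb_config,
)
# From here, LoRA adapters are attached exactly as in the sketch below.
```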

Practical Implementation

Implementation using Hugging Face is straightforward. The process involves importing the transformers library, adding components from the PEFT library, loading the base model and tokenizer, and configuring LoRA settings such as the rank and other hyperparameters, for which the LoRA and QLoRA papers provide sensible defaults. Training then proceeds with the parameter-efficient model and the training data, after which the adapter weights are loaded from their saved location on disk and combined with the base model.
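A condensed sketch of that flow using the transformers and peft libraries; the base model, target modules, and paths are illustrative assumptions, not the speaker's exact code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, PeftModel, get_peft_model

base_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(base_name)

# LoRA configuration: rank and scaling roughly follow the papers' defaults.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a fraction of a percent is trainable

# ... train on your prompt-completion pairs (e.g. with transformers.Trainer
# or trl's SFTTrainer), then persist only the small adapter:
model.save_pretrained("out/lora-adapter")

# At deployment time, load the adapter from disk and merge it into the
# base weights to obtain a single standalone model.
base = AutoModelForCausalLM.from_pretrained(base_name)
merged = PeftModel.from_pretrained(base, "out/lora-adapter").merge_and_unload()
```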

For organizations preferring managed services, hyperscalers including GCP, AWS, and Azure now offer button-click fine-tuning. Platforms like Vertex AI accept training data and handle the tuning process automatically, eliminating the need for custom code implementation. Nevertheless, understanding the underlying mechanics remains valuable.

Example Application

Using product description generation as an example (not from an actual client due to confidentiality), a base LLM generates standard product descriptions. After fine-tuning for more punchy, concise, to-the-point descriptions, the output matches the desired style and brevity. This example, courtesy of Philipp Schmid's blog, demonstrates practical fine-tuning applications in industry settings.

Key Recommendations and Summary

Do Not Start With Fine-Tuning: Follow the decision framework presented. Consider context enrichment, RAG, and agentic approaches first, as these deliver results faster in most scenarios.

When Considering Fine-Tuning: Validate the dataset thoroughly before training to avoid discovering performance issues post-training. Confirm the genuine need for implicit institutional knowledge that cannot be structured and added to context. Do not expect a silver bullet—fine-tuning has limitations and should be viewed as one tool among many.

Technical Considerations During Fine-Tuning: Libraries and models evolve, particularly for vision models where architecture variations may require different fine-tuning approaches. Some components integrate directly with standard libraries while others require custom solutions.

QLoRA typically offers higher training compute efficiency, while LoRA usually provides better inference speed. As with any technical endeavor, avoid reinventing the wheel—extensive resources exist on Hugging Face and GitHub including notebooks and example implementations that can serve as starting points.

The core message emphasizes strategic decision-making about when fine-tuning genuinely adds value versus when simpler, faster approaches suffice for the use case at hand.