RAG vs. Fine-Tuning
Choosing the Right Method for External Knowledge
In AI development, incorporating proprietary data and external knowledge is crucial. Two key methodologies are Retrieval Augmented Generation (RAG) and fine-tuning. Here’s a quick comparison.
Retrieval Augmented Generation (RAG)
RAG combines an LLM’s reasoning with external knowledge through three steps:
1️⃣ Retrieve: Identify related documents from an external knowledge base.
2️⃣ Augment: Enhance the input prompt with these documents.
3️⃣ Generate: Produce the final output using the augmented prompt.
The retrieve step is pivotal, especially when dealing with large knowledge bases. Vector databases are often used to manage and search these extensive datasets efficiently.
Implementing a RAG chain with vector databases: time to recall last month's post "AI concept in a Nutshell: LLM series – Embeddings & Vectors"!
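Below is a minimal sketch of the three steps in Python. The `embed()` and `generate()` functions are hypothetical placeholders for your embedding model and LLM, and in production a vector database would replace the brute-force similarity search.

```python
# Minimal RAG sketch; embed() and generate() are placeholders to be wired to
# your own embedding model and LLM.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return a dense vector for `text` from your embedding model."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call your LLM with the augmented prompt."""
    raise NotImplementedError

def rag_answer(question: str, documents: list[str], top_k: int = 3) -> str:
    # 1) Retrieve: rank documents by cosine similarity to the question.
    q = embed(question)
    doc_vecs = [embed(d) for d in documents]
    scores = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vecs]
    top_docs = [documents[i] for i in np.argsort(scores)[::-1][:top_k]]

    # 2) Augment: prepend the retrieved context to the user question.
    context = "\n\n".join(top_docs)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

    # 3) Generate: produce the final answer from the augmented prompt.
    return generate(prompt)
```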
Fine-Tuning
Fine-tuning adjusts the LLM’s weights using proprietary data, extending its capabilities to specific tasks.
Approaches to Fine-Tuning:
1️⃣ Supervised Fine-Tuning: Uses demonstration data with input-output pairs.
2️⃣ Reinforcement Learning from Human Feedback: Requires human-labeled data and optimizes the model based on quality scores.
Both approaches need careful decision-making and can be complex.
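As a rough illustration of supervised fine-tuning, here is a minimal PyTorch/Transformers loop. The "gpt2" model name and the single demonstration pair are placeholders for your own base model and proprietary dataset; a real run would add batching, multiple epochs, and evaluation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative small model; swap in your own base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Proprietary demonstration data: (input, expected output) pairs
pairs = [("Translate to French: cat", "chat")]

model.train()
for prompt, target in pairs:
    batch = tokenizer(prompt + " " + target, return_tensors="pt")
    # For causal LMs, labels are the input ids; the model shifts them internally.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```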
When to Use RAG or Fine-Tuning?
· RAG: Great for adding factual knowledge without altering the LLM. Easy to implement but adds extra components.
· Fine-Tuning: Best for specializing in new domains. Offers full customizability but requires labeled data and expertise. May cause catastrophic forgetting.
Choose based on your needs and resources. Both methods have their strengths and challenges, making them valuable tools in AI development.
LLM Temperature
I love the analogy shared during one of the break-out sessions at the OVHcloud summit: LLM temperature is like blood alcohol level – the higher it is, the more unexpected the answers!
To be more specific, temperature is a parameter that controls the randomness of the model’s output. A higher temperature encourages more diverse and creative responses, while a lower temperature makes the output more deterministic and predictable.
Key Points:
🔹 High Temperature: More random and creative outputs, useful for brainstorming and generating novel ideas.
🔹 Low Temperature: More predictable and coherent outputs, ideal for factual information and structured content.
Understanding and adjusting the temperature can help tailor LLM outputs to specific needs, whether you’re looking for creativity or precision.
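Under the hood, temperature simply rescales the model's logits before sampling. This toy sketch (plain NumPy, not tied to any specific LLM API) shows how a low value sharpens the distribution and a high value flattens it.

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / temperature          # lower T sharpens, higher T flattens
    exp = np.exp(scaled - scaled.max())    # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5])         # toy scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))   # low T: nearly deterministic, top token dominates
print(softmax_with_temperature(logits, 1.5))   # high T: flatter distribution, more diverse sampling
```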
Low-rank adaptation (LoRA): Turning LLMs and other foundation models into specialists
Before a foundation model is ready to take on real-world problems, it's typically fine-tuned on specialized data and its billions, or trillions, of weights are recalculated. This style of conventional fine-tuning is slow and expensive.
⚡ LoRA is a quicker solution. With LoRA, you fine-tune a small subset of the base model's weights, creating a plug-in module that gives the model expertise in, for example, biology or mathematical reasoning at inference time. Like custom bits for a multi-head screwdriver, LoRAs can be swapped in and out of the base model to give it specialized capabilities.
By detaching model updates from the model itself, LoRA has become the most popular of the parameter-efficient fine-tuning (PEFT) methods to emerge with generative AI.
The LoRA approach also makes it easier to add new skills and knowledge without overwriting what the model previously learned, thereby avoiding the phenomenon known as catastrophic forgetting. LoRA offers a way to inject new information into a model without sacrificing performance.
But perhaps its most powerful benefit comes at inference time. Loading LoRA updates on and off a base model with the help of additional optimization techniques can be much faster than switching out fully tuned models. With LoRA, hundreds of customized models or more can be served to customers in the time it would take to serve one fully fine-tuned model.
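Here is a minimal hand-rolled sketch of the idea in PyTorch (not the `peft` library's API): the base layer stays frozen, and only the small low-rank matrices A and B are trained, forming the plug-in module that can be swapped in and out.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear and learn a low-rank update B @ A on top of it."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # Output = frozen base projection + scaled low-rank "plug-in" update
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only A and B are trainable: 2 * 8 * 768 parameters
```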
Chain of Thought (CoT)
One of the most effective techniques to improve the performance of a prompt is the Chain of Thought (CoT).
The principle is to break a problem down into several intermediate tasks, or to first consolidate all the available data, before asking for the final answer.
Examples:
1) Content generation: Rather than directly asking the LLM to generate a summary, we can first ask it to generate a table of contents and then a summary.
The final result is therefore more likely to cover the initial document exhaustively.
2) Creation of structured entities: CoT lets you control the output format of structured entities (for example, a YAML schema that you have specified in detail).
Providing such examples greatly reduces the risk of hallucinations in the LLM's final response.
3) Agent Workflow: For an LLM agent, it is preferable to ask the LLM to explicitly generate a plan of the actions it must perform to accomplish its task.
"Start by making a plan of the actions you will have to perform to solve your task. Then answer with the tools you want to use in a YAML block, example: […]"
4) RAG Generation: Using CoT can improve the quality of the generated response while reducing hallucinations. Asking the LLM to summarize the relevant information contained in the documents will strengthen this semantic field when it has to generate the final response.
Conclusion
Chain of thought strengthens the semantic field of the expected result in two different ways:
- Explicit chain of thought: by giving examples of the expected result in the prompt
- Internal chain of thought: by telling the LLM to generate intermediate "reasoning" steps
This increases latency and cost, but in many situations the result is worth the investment.
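To make the two styles concrete, here is a small prompt-building sketch. The `call_llm()` function and the exact prompt wording are illustrative placeholders, not a specific API.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your LLM client of choice here

document = "..."  # the source text to process

# Internal chain of thought: ask for intermediate steps before the final answer.
internal_cot = (
    "First, generate a table of contents for the document below. "
    "Then, using that table of contents, write an exhaustive summary.\n\n"
    f"Document:\n{document}"
)

# Explicit chain of thought: show an example of the expected structured output.
explicit_cot = (
    "Extract the entities from the document below as YAML, following this example:\n"
    "person:\n  name: Jane Doe\n  role: engineer\n\n"
    f"Document:\n{document}"
)

summary = call_llm(internal_cot)
entities = call_llm(explicit_cot)
```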
LLM Quantization
Quantization in AI is a technique used to reduce the precision of the numbers that represent the model’s parameters, which are typically weights and activations.
The two most common quantization cases are float32 -> float16 and float32 -> int8.
This reduction in precision leads to several benefits:
1️⃣ Model Size Reduction: Lower precision means fewer bits are needed to store each parameter, resulting in a smaller model size.
2️⃣ Faster Inference: Lower-precision computations can be performed more quickly, especially on hardware designed for integer operations.
3️⃣ Reduced Power Consumption: Lower-precision operations consume less power, which is crucial for edge devices like smartphones and IoT sensors.
For example, a deep learning model used for image classification might originally use 32-bit floats for its weights. By applying quantization, these weights can be converted to 8-bit integers. This reduces the model size by a factor of 4 and significantly speeds up inference, making it more efficient to deploy on resource-constrained devices.
In practice, quantization can be applied post-training or during training (quantization-aware training) to minimize any potential loss in model accuracy.
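As a toy illustration of the float32 -> int8 case (plain NumPy, not a specific framework's quantization API), this sketch quantizes a small weight matrix symmetrically and shows the 4x storage reduction and the rounding error introduced.

```python
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)   # original float32 weights

scale = np.abs(weights).max() / 127.0                # map the largest magnitude to 127
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
deq_weights = q_weights.astype(np.float32) * scale   # dequantize for comparison

print("storage: float32 =", weights.nbytes, "bytes, int8 =", q_weights.nbytes, "bytes")  # 4x smaller
print("max rounding error:", np.abs(weights - deq_weights).max())
```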
Fostering and enhancing impactful collaborations with a diverse array of AI partners, spanning Independent Software Vendors (ISVs), Startups, Managed Service Providers (MSPs), Global System Integrators (GSIs), subject matter experts, and more, to deliver added value to our joint customers.
CAREER JOURNEY – Got a Master's degree in IT in 2007 and started my career as an IT Consultant.
In 2010, had the opportunity to develop business in the early days of the new Cloud Computing era.
In 2016, started to witness the POWER of Partnerships & Alliances to fuel business growth, especially as we became our hyperscaler's main go-to MSP partner for French healthcare projects.
Decided to double down on this approach through various partner-centric positions: as a managed service provider or as a cloud provider, for Channels or for Tech alliances.
➡️ Now happy to lead an ECOSYSTEM of AI players, each bringing high value to our joint customers.