Deploying Fine-Tuned Models

llm-inference
llm-conf-2024
Published July 21, 2024

Abstract

This discussion explores inference servers, backends, and platforms such as Replicate for hosting models.


Chapters

00:00 Overview
Dan discusses the topics covered in the presentation.

01:24 Recap on LoRAs
LoRA adaptations are popular for dramatically reducing the number of parameters/weights needed during model fine-tuning. These can be merged or used as hot-swappable adapters during serving.

06:28 Performance vs. Cost
Balancing full fine-tuning versus LoRA methods involves considering performance, cost, GPU usage, and managing idle time versus cold starts for real-time applications.

10:18 Many Projects Are Not Real-Time
Case studies of projects that do not require real-time processing are discussed.

13:56 Exploring LoRA Training Directory and Pushing to HF Hub
An overview of the LoRA training directory and the process of pushing model files to Hugging Face.

15:15 Hugging Face Inference Endpoints Demo
A demonstration on setting up Hugging Face inference endpoints for cost-effective serving.

18:30 Considerations When Deploying Models
Factors influencing model deployment and how different decisions impact solution complexity.

20:25 Simple vs. Advanced Model Serving
Comparison of simple model serving setups with advanced configurations involving auto-scaling, load balancing, and high availability.

22:04 Kinds of Model Serving
The choice of serving infrastructure based on application use cases and performance variations.

26:20 Honeycomb Example on Replicate
Discussion of factors that motivated the use of Replicate, including permalinks, UI, and the API interface.

31:04 Honeycomb Example Code Walkthrough
Explanation of Cog, an open-source project for deploying models via Docker, including a predict.py function for handling inference requests.

41:33 Deploying Language Models
Challenges of deploying models efficiently and defining success metrics.

46:07 What Makes LLMs Slow
Understanding memory bandwidth and software overhead as reasons for slower performance, and methods to address these issues.

50:44 Making LLMs Fast
Low-level and run-time optimizations to improve LLM speed, including speculative decoding techniques.

52:11 Continuous Batching
Achieving higher GPU utilization by continuously replacing completed sequences with new ones to minimize GPU idle time.

56:09 Performance Metrics
Discussion of various metrics for quantifying model performance and the trade-offs between them.

01:03:52 Simplifying Model Deployment
Strategies for simplifying language model deployment by prioritizing modularity to allow experimentation across frameworks.

01:06:47 Simplifying Deployments with Replicate
Features of Replicate that facilitate easier and more flexible model serving.

01:09:31 Replicate Walkthrough
A walkthrough of Replicate’s features, APIs, and how to create and use a vLLM model on the platform.

01:14:32 Cog-vLLM for Local Development
Using Cog-vLLM for local model hosting and a walkthrough of the project directory.

01:20:19 Predibase’s History
Introduction to Predibase, a managed platform for serving fine-tuned LLMs.

01:24:44 LoRAX Motivation and Idea
LoRAX improves efficiency in serving multiple adapters by using a single base model with optimizations for throughput.

01:29:54 Issues with Merging Adapters
Challenges associated with merging adapters with base models, including inefficiencies and difficulties.

01:32:21 Challenges with QLoRA
Fine-tuning with QLoRA involves dealing with precision issues and quantization errors.

01:34:53 Dequantizing QLoRA Weights
Methods for dequantizing model weights to achieve optimal performance.

01:35:48 Deployment Considerations
Assessing data and load requirements, request distribution, hardware needs, and balancing quality, latency, and cost in deployment strategies.

01:42:53 Speculative Decoding
Boosting performance and throughput with speculative decoding and look-ahead LoRA, achieving significant improvements.

01:47:10 Throughput vs. Latency
Balancing throughput and latency in system optimization, focusing on efficient volume handling and rapid response times.

01:55:17 Improving Latency and Throughput
Optimizing LLM inference involves balancing throughput and latency, with various methods impacting these metrics differently.

02:02:38 Deploying on Modal
Modal offers cost-effective LLM inference with scalable throughput and competitive pricing, though high GPU utilization can be challenging.

02:07:44 Modal Demo
Demonstration of using Modal for automating and securing credit granting from a database with a Python script.

02:12:55 LLM Demo on Modal
Features of Modal for batch processing and high-throughput inference tasks, including a modified Llama 3 70B model.

02:19:29 OpenAI-Compatible Endpoint Demo on Modal
Deploying an OpenAI-compatible endpoint on Modal, with features for middleware and authentication.

02:23:37 Q&A Session
Hamel and Dan answer questions from the community.

Slides

Download PDF file.

Resources

Links to resources mentioned in the talk:

Notes

Pushing Model to HF Hub

Sharing the model involves creating a repository, cloning it locally, copying in the merged files, and using Git LFS to handle the large binaries. The following snippet pushes the merged model files to the Hugging Face Hub (replace <username> with your Hugging Face username):

huggingface-cli repo create conference-demo
git clone https://huggingface.co/<username>/conference-demo  # clone the newly created repo locally
cp ./outputs/qlora-out/merged/* conference-demo
cd conference-demo
git lfs track "*.bin"
git add *
git commit -am "Push merged files"
git push origin main

Considerations When Deploying Models

The table below shares some of the factors that must be considered when deploying a model:

| Aspect | Simpler Case | More Complex Case |
|---|---|---|
| Speed (time to response) | Results needed in minutes, e.g., portfolio optimization | Results needed in milliseconds, e.g., high-frequency trading |
| Scale | Low: 10 requests/sec or less, e.g., an internal dashboard | High: 10k requests/sec or more, e.g., a popular e-commerce site |
| Pace of Improvement | Low: updates infrequently, e.g., a stable, marginal model | High: constant iteration needed, e.g., an innovative, important model |
| Real-Time Inputs Needed? | No real-time inputs, e.g., analyze past data | Yes, real-time inputs, e.g., targeted travel ads |
| Reliability Requirement | Low: okay to fail occasionally, e.g., a proof of concept | High: must not fail, e.g., a fraud detection model |
| Model Complexity | Simple models, e.g., linear regression | Complex models, e.g., LLMs |

Differences Between Simple and Advanced Model Serving

Depending on your application, you might opt for a simpler or a more advanced serving method. Advanced methods increase complexity but also enable more granular control.

| Aspect | Simple Model Serving | Advanced Model Serving |
|---|---|---|
| Setup Complexity | Basic setup with minimal configuration | Complex setup involving multiple components and configurations |
| Architecture | Direct integration with the model library (e.g., FastAPI) | Auto-scaling clusters, load balancers, and specialized components |
| Scalability | Not designed for high availability or scalability | Designed for high availability and scalability |
| Pre-/Post-Processing | Minimal or no specialized components | Specialized components for pre-processing and post-processing |
| Use Case | Ideal for proofs of concept and simple applications | Suitable for large-scale, production-level applications |
| Example Technologies | Basic frameworks and libraries | Kubernetes with NVIDIA Triton and TensorRT, among others |
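
A minimal sketch of the "simple model serving" column: the model lives inside a single FastAPI process and requests hit it directly. The model name and generation settings below are placeholders, not the course's model.

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="gpt2")  # stand-in model; swap in your fine-tune

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: GenerateRequest):
    # Direct, in-process call to the model library; no load balancer, no autoscaling.
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"completion": out[0]["generated_text"]}

# Run with: uvicorn app:app --port 8000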

Quantizing Model Using AWQ

The following code can be used to quantize the model using AWQ:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Setup
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
quant_path = "hc-mistral-alpaca-merged-awq"
model_path = "parlance-labs/hc-mistral-alpaca-merged"
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize and save model
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

Once the model is quantized, it can be saved to HF Hub using:

cd hc-mistral-alpaca-merged-awq
huggingface-cli upload parlance-labs/hc-mistral-alpaca-merged-awq .
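
As a quick sanity check after uploading, the quantized model can be loaded with vLLM (the same backend used to serve it later). This is a minimal sketch; the prompt and sampling settings here are placeholders.

from vllm import LLM, SamplingParams

# Load the AWQ-quantized merged model pushed above.
llm = LLM(model="parlance-labs/hc-mistral-alpaca-merged-awq", quantization="awq")
params = SamplingParams(temperature=0.0, max_tokens=256)

# Placeholder prompt; in practice this is built from the natural language query and column schema.
outputs = llm.generate(["### Instruction: ...\n### Response:"], params)
print(outputs[0].outputs[0].text)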

Improving LLM Performance

Why are LLMs slow?

Developing a deep intuition of what slows down LLMs can help in engineering solutions to combat the issue. The main reasons for LLMs being slow are:

| Issue | Description | Mitigation Strategies |
|---|---|---|
| Memory bandwidth | Transformers are memory-intensive; each operation requires transferring data from device memory to smaller, faster memory caches, which is a performance bottleneck | Transfer less data; use smarter kernels (e.g., fusion, flash attention, paged attention); use smaller data (e.g., quantization, speculative decoding) |
| Software overhead | Every operation (e.g., attention, layer norm) requires launching a kernel, and each launch requires communication between the CPU and GPU(s), which is a performance bottleneck | Minimize overhead; use smarter kernels (fusion → fewer kernels → fewer launches); use CUDA graphs to trace GPU operations and launch them as a single unit |
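
Both mitigations can be exercised from PyTorch without hand-writing kernels. As a hedged illustration (the model below is a generic stand-in, not the course's model), torch.compile fuses pointwise operations into larger kernels, and mode="reduce-overhead" captures the compiled region with CUDA graphs so the per-kernel launch cost is paid once rather than on every forward pass.

import torch

# Stand-in model; requires a CUDA GPU.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).half().cuda()

# Kernel fusion via the default Inductor backend; CUDA-graph capture via "reduce-overhead".
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
with torch.no_grad():
    for _ in range(3):  # first iterations compile/capture, later ones replay the graph
        y = compiled(x)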

How to Make LLMs Faster

| Category | Optimization Techniques |
|---|---|
| Low-level optimizations | Kernel fusion, kernel optimization, CUDA graphs |
| Run-time optimizations | Continuous batching, KV caching |
| Hardware upgrades | Enhanced GPU and memory hardware |
| Tricks | Speculative decoding, shorter outputs, shorter inputs |
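
Of the run-time optimizations above, continuous batching is the least obvious, so here is a toy scheduler that sketches the core idea in plain Python: admit new requests as soon as a batch slot frees up, and evict finished sequences after every decode step. decode_one_step is a stand-in for a real model forward pass (real servers such as vLLM do this with paged KV caches and one fused forward pass per step).

from collections import deque

def decode_one_step(seq):
    """Stand-in for one model forward pass: append one 'token' to the sequence."""
    seq["tokens"].append(len(seq["tokens"]))
    return len(seq["tokens"]) >= seq["max_tokens"]  # True when the sequence is finished

def serve(request_ids, max_batch_size=4):
    queue = deque(request_ids)
    active, finished = [], []
    while queue or active:
        # Admit new requests the moment a slot frees up; there is no waiting
        # for the whole batch to finish, which is the point of continuous batching.
        while queue and len(active) < max_batch_size:
            active.append({"id": queue.popleft(), "tokens": [],
                           "max_tokens": 3 + len(active)})  # lengths vary per request
        # One decode step for every active sequence (step-level scheduling).
        still_running = []
        for seq in active:
            if decode_one_step(seq):
                finished.append(seq)       # evict completed sequences immediately
            else:
                still_running.append(seq)
        active = still_running
    return finished

print([s["id"] for s in serve(range(10))])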

Using Cog-vLLM

You can use cog-vLLM on your local machine using:

export COG_WEIGHTS="..." # Copy the URL to "Download Weights"
cog predict -e "COG_WEIGHTS=$COG_WEIGHTS" -i prompt="Hello!"
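
For orientation, a Cog project pairs a cog.yaml with a predict.py that defines setup() (run once at boot) and predict() (run per request). The sketch below shows the general shape of a vLLM-backed predictor in the spirit of the Honeycomb example; the prompt template, input names, and sampling defaults are illustrative assumptions, not the actual predict.py from the course repo.

from cog import BasePredictor, Input
from vllm import LLM, SamplingParams

# Hypothetical prompt template; the real one lives in the course repo.
PROMPT_TEMPLATE = "NLQ: {nlq}\n\nColumns:\n{cols}\n\nQuery:"

class Predictor(BasePredictor):
    def setup(self):
        # Runs once at boot: load the (quantized) weights onto the GPU.
        self.llm = LLM(model="parlance-labs/hc-mistral-alpaca-merged-awq",
                       quantization="awq")

    def predict(
        self,
        nlq: str = Input(description="Natural language query"),
        cols: str = Input(description="Schema/columns retrieved for this query"),
    ) -> str:
        # Runs on every request: build the prompt and generate the Honeycomb query.
        prompt = PROMPT_TEMPLATE.format(nlq=nlq, cols=cols)
        params = SamplingParams(temperature=0.0, max_tokens=512)
        out = self.llm.generate([prompt], params)
        return out[0].outputs[0].text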

Serving Adapters with QLoRA

Challenges with QLoRA

  • Fine-tuning a model using quantization reduces memory usage during fine-tuning.
  • Serving the model in full precision (FP32) or half precision (FP16) avoids dequantization costs and reduces memory overhead.
  • Dequantizing QLoRA weights leads to differences in activations due to quantization errors, potentially degrading performance.
  • Serving the model in its quantized form maintains accuracy but significantly increases latency and reduces throughput.
  • The dilemma: FP16 serves faster but with worse results; quantized serving is accurate but slow.

Solution: Use Dequantized Weights

  • Dequantizing QLoRA weights involves quantizing FP16 weights to NF4 and then reversing the quantization.
  • This approach ensures the dequantized FP16 weights are numerically identical to the original quantized weights.
  • Serving the dequantized weights in FP16 provides performance benefits without degradation from quantization errors.
  • This method is commonly used for models fine-tuned with QLoRA to achieve optimal performance; a sketch of the round trip is shown below.
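
A minimal sketch of that round trip, assuming bitsandbytes provides the NF4 quantization (as it does during QLoRA training). The tensor here is random; in practice the round trip is applied to each weight matrix of the merged model.

import torch
import bitsandbytes.functional as bnbf

w_fp16 = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Quantize to NF4, exactly as QLoRA sees the base weights during fine-tuning...
w_nf4, quant_state = bnbf.quantize_4bit(w_fp16, quant_type="nf4")

# ...then reverse it. The result is FP16 again, but numerically matches what the
# adapter was trained against, so it can be served in FP16 without the
# quantization-error mismatch or the cost of dequantizing at inference time.
w_dequant = bnbf.dequantize_4bit(w_nf4, quant_state)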

Choosing Between Serverless and Dedicated

The choice between serverless and dedicated will depend on the type of workload and how critical latency is.

| Serverless | Dedicated |
|---|---|
| Request volume is low to medium, but distributed fairly uniformly throughout the day (latency is O(seconds)) | Request volume (QPS) is high and consistent latency/throughput SLOs are critical (latency is O(milliseconds to seconds)) |
| Request volume is highly concentrated/spiky, and not real-time (batch) (latency is O(minutes to hours)) | |

Throughput and Latency

Optimizing a system involves balancing throughput and latency, where:

  • Throughput addresses handling volume efficiently
  • Latency focuses on rapid response times

Latency and throughput can be optimized using the following methods:

| Method | Effect / Consequence |
|---|---|
| Increase batch size | Increases throughput; penalizes latency, but gives ~linear scaling in throughput |
| Quantize | Shortens latency; also improves throughput |
| Distill or truncate | Shortens latency; also improves throughput |
| Buy more expensive hardware | Shortens latency; also improves throughput |
| Write really hard (heavily optimized) software | Shortens latency; also improves throughput |
| Run the system entirely on cache memory/SRAM (e.g., Groq LPU) | Achieves very short latency; penalizes throughput per dollar |
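
As a rough, back-of-envelope illustration of the batch-size row: if decoding is purely memory-bandwidth bound, every decode step must stream the weights once regardless of batch size, so total throughput scales roughly linearly with the batch while per-request speed stays flat until the batch becomes compute-bound, which is where the latency penalty appears. All numbers below are illustrative assumptions.

WEIGHT_BYTES = 14e9   # e.g., a ~7B-parameter model in FP16
BANDWIDTH = 2e12      # e.g., roughly 2 TB/s of HBM bandwidth

step_time = WEIGHT_BYTES / BANDWIDTH  # seconds to stream the weights once per decode step

for batch_size in (1, 8, 32, 128):
    total_tps = batch_size / step_time   # tokens/sec across all requests
    per_request_tps = 1 / step_time      # tokens/sec seen by each request (idealized)
    print(f"batch={batch_size:>3}  total ~{total_tps:,.0f} tok/s  per-request ~{per_request_tps:,.0f} tok/s")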

Additional Concepts

  • LoRAX: An open-source serving framework from Predibase that serves many fine-tuned LoRA adapters on top of a single base model, with optimizations for throughput.
  • Continuous Batching: A process of aggregating data into batches in a seamless, ongoing manner to optimize computational efficiency and throughput.
  • Speculative Decoding: A technique in which a small draft model proposes several tokens that the larger target model then verifies in parallel, speeding up decoding (see the toy sketch below).
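
The toy sketch below uses trivial stand-in functions for the draft and target models; real implementations verify all draft positions in a single batched forward pass of the large model and use a probabilistic acceptance rule rather than exact greedy matching.

def draft_model(tokens):
    """Cheap, sometimes-wrong next-token predictor (stand-in)."""
    return (tokens[-1] + 1) % 50

def target_model(tokens):
    """Expensive, authoritative next-token predictor (stand-in)."""
    return (tokens[-1] + 1) % 50 if tokens[-1] % 7 else (tokens[-1] + 2) % 50

def speculative_decode(prompt, num_new_tokens, k=4):
    out = list(prompt)
    target_len = len(prompt) + num_new_tokens
    while len(out) < target_len:
        # 1. Draft up to k candidate tokens with the cheap model.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify: keep the prefix the target model agrees with, then take the
        #    target's own token at the first disagreement.
        for t in draft:
            expected = target_model(out)
            out.append(t if t == expected else expected)
            if len(out) == target_len or t != expected:
                break
    return out

print(speculative_decode([1, 2, 3], num_new_tokens=10))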

Full Transcript


[0:03] Dan Becker: We’re warming up. I think today is one of the, at a technical level, one of the deepest topics and also most complex. So thrilled to have this group here. And I think a bunch of you are going to, I think everyone’s going to learn a ton today. So I’m certainly looking forward to it. Our plan is I’m going to give an overview of serving and talk about some of my experience. I think that…
[0:33] Dan Becker: Because of the usage patterns of problems that I’ve worked on as a practitioner, I’ve been able to avoid some of the complexities of real-time serving. So I will touch on those things, but we’ll get into a lot more of that as we go further. Hamel’s worked on more real-time use cases. And then we’re going to go through Travis. I think actually this is out of order.
[1:01] Hamel Husain: We’re going to do Joe, then Travis, then Charles.
[1:04] Dan Becker: Yep. So that’s our plan for today. And with that, we will get started on talking about how to serve and deploy fine tunes. There are, as a first approximation, two types of fine tunes, but we have focused on one of them so far. So there’s full fine tunes, where you potentially are changing all of the weights that were trained in the original model. I’m going to set that aside for the moment and instead focus on deploying or serving models that were fine-tuned with low-rank adapters.
[1:45] Dan Becker: And as a recap, so this is a slide from workshop two, if you look at any given layer in your model, this is even simplified a little bit from that, so we’re not thinking about the separate, like… query matrices. But if you imagine for any given matrix that goes from the input in a layer to the output in the layer, if the input representation was 4,000 dimensions and the output was 4,000 dimensions, then a matrix that goes from one to the other would be 4,000 by 4,000. That is 16 million weights, 4,000 times 4,000.
[2:22] Dan Becker: If instead you are training with one of these low-rank adapters, then you’ll have one matrix that is 4,000 by some lower number. So a potential adapter rank would be 16. So this lower one might be 4,000 by 16. The next one would be 16 by 4,000. If you have two of those, then you can do some arithmetic here. You see you go from 16 million weights that you would tune if you tuned all of the original weights, but instead in this adapter, which has a particular type of bottleneck or restriction that is 128,000 weights.
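
A quick check of the arithmetic Dan walks through here, comparing a full 4,000 x 4,000 weight matrix with a rank-16 LoRA pair for the same layer:

d, r = 4000, 16

full_weights = d * d          # 16,000,000 trainable weights
lora_weights = d * r + r * d  # 64,000 + 64,000 = 128,000 trainable weights

print(full_weights, lora_weights, full_weights // lora_weights)  # 16000000 128000 125
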
[3:02] Dan Becker: And most of the time, when we talk about fine tuning, we’re actually typically talking about these LoRA fine tunes. And after you have trained one of these, you need to make a decision. One thing you can do is say, I’m just going to keep that LoRA, those weights, in a separate file and then I’m going to deploy with that. That would be sort of keeping it in the before stage.
[3:32] Dan Becker: The other thing that you can do is you can do that multiplication of these two weight matrices, add them to the original weights and merge, creating one new file, which for each layer has the base model weights plus the learned weights from fine-tuning. That really takes us to a few different potential workflows. So you say, I’m going to fine tune, and within that…
[3:59] Dan Becker: I can fine tune with a full fine tune, or I can fine tune with LoRA, and then with LoRA I can either do that merge that I just described, or I can do something which others in this group have immense knowledge about, which is hot swapping the adapters, and I have only superficial knowledge about that, but I’m going to let them talk through a lot of the details of that. But as a practitioner, the things I think you should know are: full fine tuning.
[4:33] Dan Becker: If you have enough data, it does have some advantages, but it is memory intensive at training time, and so it is not so, so widely used. Merging the weights of the base model with your LoRA, so it’s all in either one file or, if it’s very large, a short list of files. That is historically quite widely used. That’s what I’ve historically done as a practitioner. And then arguably the new hotness is this hot swapping adapters. And that has some potential efficiencies, especially for someone who is serving many different models.
[5:19] Dan Becker: Or rather, many different adapters.
[5:22] Hamel Husain: And the one that’s widely used, the reason it’s widely used is…
[5:32] Dan Becker: I can’t tell if I have a bad connection or if Hamel does. Even though I think Hamel’s trying to talk, I’m just going to talk over him since we can’t really understand him. He was going to say that it’s widely used, and I think that’s primarily for two reasons. One is a lot of times people are serving a small number of models, and as you get to platforms, they can… I’ll talk more about… what platforms can do in a moment. And then the other is that historically, this hot swapping of adapters, yeah, it’s newer technology.
[6:14] Dan Becker: And I think that some of the economic advantages of it, which I’ll talk about in the next slide, they just haven’t been realized yet. So when you are deploying a model, really the biggest trade-off that you have is performance versus cost. Some of these are not worth going into much depth about. So for instance, if you have a more powerful GPU, then obviously you can have lower latency. We talked about, either in Jono’s office hours or Charles’ office hours, that a more powerful or faster GPU can just get through the work faster.
[6:52] Dan Becker: So just looking at the cost per hour is not the right comparison. If you were using these things at 100% efficiency or 100% capacity, then… using a more powerful GPU might be more of a pure win, but there’s just waiting time between requests. So having a more powerful GPU is more expensive. Running a larger model just is potentially higher performance in terms of quality, but then it’s slower or you need more GPUs to run it quickly at the same cost, more powerful GPUs. There’s some engineering wins, which really are mostly at the… platform level.
[7:31] Dan Becker: So I’m going to set those aside. And I want to talk mostly about this last piece, which I think is like the fundamental, if I had to pick one tension that we need to think through as practitioners, I would say it’s this cold start versus idle time trade-off. So loading a model onto a GPU takes, there’s just so many… so many different even definitions of what we could mean by loading a model onto a GPU. But I think it can take anywhere, depending on the details, from 30 seconds to, I don’t know, eight minutes.
[8:14] Dan Becker: And so if you have a real-time use case, these cold starts are just a showstopper. The flip side of that is that if you go a minute, if you have a request and if you have the model loaded and that would take, let’s say, a second, but then you wait a minute before your next request comes in, then you’re paying for 60 times as much GPU time as you actually need. And for newer products, you could actually have a worse ratio than that.
[8:50] Dan Becker: And so this idle time is particularly costly, and it could mean that you’re paying for a somewhat expensive GPU or group of GPUs. for 50 times as much as you actually need. And so you need to decide, do I need low latency and I’m willing to pay for that idle time? Or is there some other solution?
[9:13] Dan Becker: And when others talk about hot swapping, this is basically a way of trying to achieve economies of scale by having many LoRAs being served off the same GPU so that you have more constant traffic to that GPU and you can overcome this cold start versus idle time problem or challenge. Okay. I want to emphasize that, especially if your work is with technology companies or if you’re just used to thinking about something like ChatGPT as the canonical use case or Copilot, you probably assume that most use cases are real-time. The… I think that…
[10:01] Dan Becker: The projects that I, the commercial projects that I’ve worked on are probably not really representative. And there’s some reasons about my clients that are the explanation for that. But my projects are mostly not real time. So I’ve talked about several of these before. So there’s writing alt text descriptions. We have a queue of images that comes in every day. Every morning we start a job. We process those as a batch. And I’ll show you. some of the setup for how we do this, just the code and such.
[10:31] Dan Becker: But yeah, we process those as a batch and then humans review all those during the day. But once we’ve processed that batch, we scale the GPU down to zero and then we don’t have any more requests until the next morning. Extracting chemical properties from academic chemistry articles, that then all goes into a structured database. Again, we just do that in batch once every morning. I’ve talked about this project I worked on editing journal articles to identify and potentially and propose edits in stereotypes or unconscious biases. Again, those start as a queue of documents.
[11:13] Dan Becker: Each day we run that queue and then people review the edits over the course of the day. The one project that I worked on that was real time was this super popular or fashionable use case of allowing employees to ask a question in plain text, and then you convert that into a SQL query, run the SQL, and then give an answer to them. The project that I did that on, we used OpenAI. And so we didn’t have to build it or even think deeply about how that large language model got served.
[11:44] Dan Becker: But the ones, these top three are examples of projects that I worked on where we used open weights models.
[11:52] Dan Becker: And for all of those, we just
[11:55] Hamel Husain: batched everything together and did it once a day, and that made our life um much simpler in the end. The text-to-SQL, that’s really interesting. Um, like with the Honeycomb example, it’s kind of like, yeah, it’s basically text-to-SQL, but we used open models, and so we’ll be talking about that. Yep.
[12:14] Dan Becker: Um, I’m going to go through, uh, somewhat briefly, the workflow that I’ve personally done for those top three use cases. It’s pretty similar across all of them. And those are, so here I’m showing a screenshot at the top of what does my LoRA directory look like after I finish training. You’ll see that I’ve got an adapter model binary. It’s got some weights in it, but 168 megabytes of weights, which is, compared to many things we’ll see, not so great. I have a step where I merge that to the base model.
[12:53] Dan Becker: And that creates the directories that you see at the bottom. You’ll see that we have four PyTorch model binaries. There’s four of them because they’re sharded because they’re big files. And that’s 16 gigabytes of weights. So we have something like an 80x increase in the weights files. And as a result, I could load 168 megabytes onto a GPU quickly and maybe already have the… base model stored on that GPU, which is this hot swapping idea that others will talk about. Loading 16 gigabytes is a lot.
[13:31] Dan Becker: So after I have done this step, which is just a one liner, and you can even see that one liner in the future. And this one liner is in the read me on the axolotl GitHub page. So like so many things you can look there in the future, or you’ll have access to these slides, but I would look there to see the command. From there, we’ve historically served things with Hugging Face inference endpoints. There’s nothing magical about Hugging Face inference endpoints. Three other people on this call have built platforms that are also quite good.
[14:08] Dan Becker: This is just the one that we started with and haven’t moved off of and I think is also quite nice. And like many others, you have credits for. So the steps here are to create a repository. The rest of the stuff is basic. git commands. So I created a Hugging Face repository called conference demo, copied the merged files into that directory. This git lfs for people who aren’t familiar with it, that’s just a git large file alternative workflow for handling large files.
[14:49] Dan Becker: So this is just saying to track any binary files, because those are about four gigabytes each. I add them, I commit them, I push them to the Hugging Face repository. And then from there, I’m going to show you in a video the next step. Again, there’s so much complexity underneath everything we do, but you’re going to find that this is reasonably straightforward. I’m going to even show it to you in a GUI. So let me start that. I just want to pause and see.
[15:24] Dan Becker: I can’t get too much high resolution, but I’ve got this repository from there. I hit the deploy button. You’ll see there are a handful of options. The one that we’ve historically used is inference endpoints, which is a dedicated endpoint. So you’ll see I will select that. From there, you can choose what cloud you’re on. That’s frequently you want to be on the same cloud that your company does most of their work on for a variety of reasons. You can choose what GPU you use to serve it.
[16:03] Dan Becker: And then I think the last thing I want to spend a little bit more time on is here we’re choosing automatic scaling. You can say I want to scale to zero: never. That would be, this is just always up in that trade-off between accepting cold boots or cold starts and paying extra for idle time. Here you would say, like, we’ll pay for the idle time; we’ll always be up, we’ll never have cold boots. The thing that we’ve done, because we’re doing everything in a batch, is I’ll select to scale to zero after 15 minutes of inactivity.
[16:41] Dan Becker: And now we probably only pay for the GPU for an hour or so a day. That is the hour or so when it’s actually getting used after we’ve finished with our batch of requests. And those are just simple post requests to the… endpoint they provide, then you could manually shut it down, but we aren’t even that careful about it. We just let it scale to zero after 15 minutes. So yeah, this has historically been our serving workflow. There are things that are much more sophisticated, especially as you get to real time.
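
The "simple post requests" Dan mentions look roughly like the sketch below; the endpoint URL is a placeholder for the dedicated endpoint's URL, and the payload shape assumes a text-generation endpoint.

import os
import requests

resp = requests.post(
    "https://your-endpoint.endpoints.huggingface.cloud",  # placeholder endpoint URL
    headers={
        "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
        "Content-Type": "application/json",
    },
    json={
        "inputs": "Write alt text for this product image: ...",  # placeholder prompt
        "parameters": {"max_new_tokens": 128},
    },
)
print(resp.json())
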
[17:19] Dan Becker: We’ll talk about those, but I think of this as one of several reasonably simple ways to just get your model served. And it’s been sufficient for most of what I’ve done. So I’m going to, at this point, hand it off to, in some ways, we’re actually increasing complexity as we go. So I’m going to hand it to Hamel, who will talk about one level deeper in complexity. And then we’re going to move to Joe and Travis and Charles. And they’ll all be a level of complexity beyond.
[17:49] Hamel Husain: I think you have to make me a co-host again, because I left and came back.
[17:53] Joe: Cool.
[17:58] Dan Becker: So I will make you a co-host and I’m going to stop sharing. One other comment I’ll make is that we’re going to, I think our order or plan is we’re going to do me, Hamel, and I think Joe. And then after that, we’re going to do a round of Q and A and then we’ll do Charles and Travis. And then we’ll do another round of Q and A.
[18:24] Hamel Husain: I’m going to talk. Okay. Thanks. Thanks, Dan. I’m going to talk a little bit about model deployment more generally, and I’m also going to show you some code. I’m going to go into the Honeycomb example and how I actually deployed that in real life. So first I’m going to talk a little bit about model deployment. Actually, let me turn on my camera. Sorry about that. Okay, great. So there’s a lot of things to think about when you think about model deployment, and there’s a lot of different dimensions along which things can get more complex.
[19:00] Hamel Husain: So here’s a quick chart. You don’t need to memorize this. This is just to give you an idea. So things like speed, you know, is it okay if results are slow or do they need to be fast? Scale, do you need to handle lots of requests concurrently or do you not have that many requests at the same time? Another thing is pace of improvement. Do you need to constantly fiddle with your model, constantly update your model, things like that? Do you have real-time inputs?
[19:26] Hamel Husain: you’re constantly streaming inputs through your system also reliability is an issue like um you know if is it does it have to be high availability is it going to be a catastrophic if your model server goes down for some period of time and then like the complexity of your model like you know what are your resource requirements um how big are your models things like that and so kind of the left-hand column is where you have like the very simple use cases where things don’t get too complicated. There’s a lot of off the shelf tools.
[19:56] Hamel Husain: The more things on the right hand side of the column that you kind of stack. then the more complicated it gets. And if you kind of stack too many of those things, then sometimes you need to build custom stuff. The tools are always getting better. So you might not need to build custom stuff, but it’s just, these are things you want to keep in mind when you like look at your use case. So like, what am I talking about with like simple versus complex? Let me give you an idea.
[20:25] Hamel Husain: So like simple model serving is something like you have like some application.
[20:30] Hamel Husain: maybe some website and you just have like interfaces directly with your model library almost it’s like you can just imagine putting your model in fast api something very simple it’s not doing that much very you know simple model serving it’s not going to necessarily you’re not caring about high availability or speed or something like that potentially you just want something simple it’s really good for proof of concepts and things like that advanced model serving kind of can look like something like this, where you have some application, you have some kind of auto-scaling cluster that has
[21:09] Hamel Husain: many different model servers, and that has load balancers that then route requests to a model server for high availability, and specialized components for pre-processing and post-processing, things like that. So one example of this is, if you want to use Kubernetes with like a Triton, NVIDIA Triton front end, and maybe a TensorRT-LLM backend, that’s like one example of like a kind of architecture that people use at scale. We’re not going to go into this scale thing too much. We’ll tell you about it.
[21:50] Hamel Husain: Like, you know, Joe especially has worked a lot on building these systems at scale. And he can talk about this. But just to kind of give you like a sort of flow chart of how to decide. So like there’s many different kinds of model serving. And don’t take this chart too seriously in terms of where I put the logos.
[22:13] Hamel Husain: But basically the idea is like, okay, first of all, if you have a fixed set of inputs that are known in advance, which is usually not the case with large language models, you can pre-compute your responses. Or you can at least pre-compute some set of the responses.
[22:29] Hamel Husain: and not do inference at all but that usually is not what we’re talking about especially in this course um and then the other decision criteria is like is it okay to return responses asynchronously um like maybe in minutes if so you can do more of this like batch response uh here on the left hand side if no you need to do some kind of real-time response and the question there becomes are you comfortable operating these inference services yourself um And if you’re not comfortable, then use some kind of hosted service like, you know, any scale
[23:07] Hamel Husain: of Fireworks, SageMaker. But if you are comfortable with doing that, then the question becomes, like, do you require large scale and low latency? You know, if you don’t require that, you might use a simple, like, stack, like FastAPI. If not, you might want to use, you know, if you do require low latency and some kind of scale, you might…
[23:28] Hamel Husain: you know go with something like the nvidia stack which joe is going to talk more about um now in this course like we’ve highlighted you know so i’ve deployed this honeycomb model and replicate in real life so i’m going to be walking through that and then we’ve also gone we’ve also talked a lot about modal for fine tuning it turns out you can also use modal for inference in many different ways so i’m not going to tell you where these things go on this chart um i don’t want to shoehorn these folks into like one
[23:59] Hamel Husain: place here um i’ll let you decide where you think it goes but the idea is like you know it’s kind of this decision tree of like what to use um now this is a this is a benchmark that’s like on the like a very gpu poor kind of thing just like one request a batch size of one and just like a single request after after warming up the gpu don’t take this benchmark too seriously like all benchmarks are wrong some are useful you But basically, the point you should get across here is you should try
[24:34] Hamel Husain: different inference servers to see what works best for you. The thing I can say at a general level is vLLM is really easy to use, and it has good trade-offs. And out of all the inference servers out there, and I’ve tried so many of them, I really like it a lot. The NVIDIA stack is more performant.
[24:53] Hamel Husain: Like, so when you talk about the TensorRT-LLM backend, um, and then the Triton front end, the backends kind of, um, you know, compile your model, and are kind of where the computation is happening, or most of it, and then the front end is where, like, you know, you’re getting the requests, you’re handling, you potentially handle things like batching and things like that. And so the NVIDIA stack is like very performant, but it’s actually really difficult to use.
[25:24] Hamel Husain: There’s a lot of knobs, and it can be a little bit painful. And like, Joe will actually talk a lot about that. And yeah, you want to pay attention to things like quantization. Quantization can make a huge difference. And, you know, you want to, when you do quantization, you want to run evals as well. So we talked about evals in the last lesson a bit. So you want to like have a reasoned way of thinking about the performance trade-offs.
[25:54] Hamel Husain: Like, okay, you can achieve lower latency by quantizing your model, but you just want to double check like what the quality hit on the model is. Again, it’s not important to pay too much attention to this slide. I just want to bring up the fact that you can try different tools and get very different results when it comes to things like speed. And you can get really advanced with these benchmarks. There’s all kinds of different ways you can slice and dice these benchmarks. So this doesn’t mean that one thing is better than the other.
[26:27] Hamel Husain: If you want to see all these slides, kind of the… I go into more detail about these things in this blog. So you can just go here and you can read about a little bit more long form. The next thing I’ll talk about is the honeycomb example, which is the through example I’ve been using in this course, like kind of a case study of like the basically a natural language to query problem, the honeycomb query language. If you don’t know what I’m talking about, I recommend looking at the last.
[27:01] Hamel Husain: courses but um so as kind of we hinted at the honeycomb example needed to be a real-time use case and we wanted to um you know it basically it’s real time because users want to write questions and it’s used in uh in the honeycomb interface and and get results right away and has to be available 24 7. and so the platform i launched this on is replicate so like why replicate let me just like talk about it just for a second. So this is replicate.
[27:34] Hamel Husain: This is kind of the page where I deployed the honeycomb model to. And I’ll talk about like how the code, I’ll show you the code of like how it got here. But basically, there’s some reasons why I chose replicate in this example. One is a lot of times when you’re working with non-business people, and you want to have them try out your model, it’s really nice to have like a playground. And I want to, a lot of times that playground is not just a prompt.
[28:01] Hamel Husain: it’s like some kind of structured input and i want to guide that structured input so in this case i have the natural language query and this is the schema so remember there’s two inputs into the language model um and you know this one is usually retrieved by rag but for the purposes of like playing with the model it’s really useful like when you push your model to replicate and you define it in your python uh like the predict function like it’ll create this entire user interface for you And the thing that’s nice is I can
[28:32] Hamel Husain: give this to my counterpart and I can have them play around. But then also like the predictions have permalinks. So like my client, Philip, who I’ve worked with on this honeycomb example, you know, could kind of like play around and send me situations where he thinks something was kind of weird. And he gave me a permalink. And actually like the permalinks are like really useful because I’m just going to go ahead and run this. But like. and this is what it looks like when it’s run, you know, you get a link to this prediction.
[29:05] Hamel Husain: You see, like, the URL here. And every time I make one of these things, like, there’s so many degrees of freedom. Like, you might, you know, like, for example, what if you put this in double quotes, the name of the columns? Or what if you did something funny, like put new lines where you weren’t supposed to? Well, the permalinks helped me debug that. Like, when my counterpart sends me something, I’m like, oh, you actually, like, did this wrong.
[29:28] Hamel Husain: Another thing that I really like is like when you when i share something with somebody um it actually comes bundled with documentation so you can go to the api kind of page here and then basically it has like a full sort of code they can just copy and paste including the inputs in that example so like you know you can just basically copy paste this this uh like node js code or this python code or just make a curl request for that specific example you can also like save examples. So I’ve saved some examples here.
[30:04] Hamel Husain: So all this stuff is really nice. And that’s like, essentially why I chose to do it. It’s actually and it’s also really easy. So that’s why I ended up for this specific example, choosing replicate. So let me go back to this. Okay, so like, let me show you the code. So if you want to see how that this honeycomb model got to replicate. There’s these two links where you might want to know about. There’s this GitHub repo, which you already have access to. And it’s actually the wrong repo. Let me go to the right repo.
[30:41] Hamel Husain: It’s called ftcourse. Inside ftcourse, you’ll see replicate examples. And there’s these two examples. I’m going to be walking through this one. That’s the quantized model. It really doesn’t matter, but I’m just going to show you the code right now. So let me switch my screen to VS Code so I can show you the code live here for a moment. Okay, so I’m in that folder, and I just want to walk you through a little bit of how Replicate works. And Joe will go into a lot more detail. But this is the readme.
[31:16] Hamel Husain: But basically, first, there’s two things you need. One is this cog.yaml. So cog is a wrapper around Docker. Now, if you’re like me, you might think, why do you need a wrapper around Docker? Docker is perfectly fine. I kind of thought that, too. when first encountering COG. But actually, it helps avoid a lot of CUDA foot guns. Even when I’m using NVIDIA Docker, or with the Docker with the GPU runtime, I constantly have to fight CUDA problems.
[31:52] Hamel Husain: And the folks that made COG, and COG is this open source project that Joe will show in a little bit.
[32:03] Hamel Husain: you know, you can basically specify, uh, it basically is a Dockerfile with some other commands, but essentially what it does, it’s a Dockerfile with a bundled web server in it. And so what you do is you specify predict.py. So this is the predict.py function, and basically there’s really not much going on here. So there’s a prompt template, there’s a setup, you, uh, define this Predictor class, and then you have the setup, and the setup just basically loads the model onto the, you know, into GPU, and this is only run upon boot or startup
[32:44] Hamel Husain: and so that’s this code. This is just vLLM code here, so pretty straightforward. This is a model I’ve uploaded to Hugging Face Hub and has been quantized with AWQ, so that’s where these arguments come from. And then you have this predict function, and that’s what’s going to be called at inference time. And so you have, you know, these various inputs, um, so there’s the natural language query, this is the column, and then basically this is kind of standard, this is also standard vLLM code, and I’m
[33:20] Hamel Husain: just emitting the output. Okay, so that’s the two pieces, that’s all you need: cog.yaml and predict.py. You have to go install Cog obviously before you do this. There’s a repo for that. Joe will show you a little bit more about Cog. But it’s pretty straightforward. You just have to install it. And kind of like this is how the process works. So if you want to play with this, you know, I actually recommend that you kind of like upload your own model. Like, I don’t know, the Honeycomb one may be interesting to you, but if you…
[33:59] Hamel Husain: a lot of people are fine tuning their own models i would actually recommend fine tune your own model or find a model that’s like personally interesting to you it’ll just make it a lot more fun so um if you’re gonna debug this like cog stuff locally um you want to download the weights so that’s what i’ve done here is basically this is a command that allows you to download the weights really fast it’s this kind of long but actually like it’s better than cloning like the git clone um and then There’s different ways you can
[34:29] Hamel Husain: debug. One way is to run, you can actually run the cog server. So like I’ll do that here. Let me just make this window a little bit bigger. Let me clear this. And so you can actually like get an intuition for what’s happening by actually running the server. Now it’s cached. So I already built the Docker container. So I didn’t want you to have to look at that for this class. And so it’s gonna run this web server. The web server is basically…
[35:01] Hamel Husain: gun you know it’s basically the same thing that replicate does like what i showed you earlier but it’s just locally and then you can send that web server request you can see this request and you can see the query honeycomb query being uh returned here that’s one way to to run things another way is uh there’s this let me just cancel that stop the server this is also this cog predict function um that’s one way to just you know if you don’t want to do this server run the server you just want to run a
[35:34] Hamel Husain: like get a prediction, you can just pass in the inputs. So that’s what this dash i is here, and that’s how, that’s how it works. Um, and so yeah, eventually it’ll come with the prediction here, you’ll see it right here, same thing, um, there you go. Um, and yeah, that’s pretty much it. Now, the way you push it to Replicate is you do this cog login. I’m not going to do cog login. If I do cog login, you all will see my token, and no one needs to see that. But it’s very easy.
[36:08] Hamel Husain: You do cog login and then you follow that with a cog push and it pushes it to replicate. You have to specify a destination. So actually in the replicate platform, you have to actually create a model. So you have to go to models. You have to create a new model and you kind of fill out some information. One thing you might want to… Oh, wait, I’m not showing… Oh, let me show the other screen. Sorry about that. Let me switch back.
[36:39] Hamel Husain: Okay, so in the Replicate platform, to go to on your account, you can go to Models here and then Create New Model. And you just fill out some information, like a name of the model. And then the most important thing you want to select the right hardware. So depending on your model size, make sure you think about, based upon what we learned in the last lesson of how to estimate memory, pick the right GPUs.
[37:08] Hamel Husain: And then you, yeah, I always just, for this class, you can do custom cog model, at least the code that I showed here. And then click create. And then that’s the destination that, let me switch back to the VS Code real quick. Let me see. I can do it fast. Here we go. Yeah, so the destination you see here is kind of like this r8.im. That’s just Replicate. That’s the Replicate repository. And then you just have the fully qualified name here. It’s basically like GitHub in that sense.
[37:42] Hamel Husain: You have like an org and then like a name. So yeah, that’s how it works. It’s very simple. There’s two different examples here. One is with the AWQ quantization VLM. It’s pretty straightforward. Actually, let me show you. Okay, so let me get this window out of the way. Let’s zoom in my way all of a sudden. Let me go back to predict.py. You see I had to do a little bit more ceremony to get the quantized version in here with the one that’s not quantized. It’s almost the same thing. They’re just slightly less arguments.
[38:26] Hamel Husain: And so let me switch back to the Hugging Face repo real quick just to orient you. Go back to this window. And so if you want to know how it was quantized, I have this repo with a sort of, this is just like standard BLM. This is how you can quantize the model. And this is what I did. I basically loaded the model, quantized it, saved it, and then uploaded it to Hugging Face and then pulled it down. So that’s kind of like a lightning round of how I deployed this Honeycomb model.
[39:05] Hamel Husain: And then, yeah, I think now is a good time to give it to Joe, who can talk a little bit more about Replicate. I kind of glazed through Replicate very fast, but he can give you more color and also tell you a lot about what he’s learned, especially in the NVIDIA stack and deploying, creating inference platforms.
[39:29] Joe: Let me stop sharing. Great, thanks Hamel. Let’s see, when Hamel asked me to talk during this course, or this conference, or this spiritual revival, I’m not sure what it is at this point, um, he gave me two suggestions. One was to tell some more stories, of which I have many, many shared with Hamel, and then two, do a demo. So I thought a lot about what information I could share in tens of minutes that would actually be helpful. Dan and Hamel alluded to the fact that serving is really complicated. It’s really hard.
[40:16] Joe: And sometimes it’s really awful. And there’s a lot that we could talk about, but I’m not sure how much it would be, how useful it would be. And so I thought a lot about if I was just starting out this journey or I was halfway through it, what would I want to do? What would… what I wish somebody had told me. And so I’m going to try to talk through the things that have hurt me the most, and hopefully that’ll be helpful for some of you.
[40:38] Joe: And then I’ll talk a little bit about Replicate, how it solves some of those problems for me, and how you can serve language models from Replicate. First, share. You don’t see a slideshow? OK, great. We’re going to talk about deploying language models, and then we’ll talk a little bit about deploying language models and replicating. So I’ve been working with language models about four years since the Transformers paper, well, generative language models. Been deploying them for almost as long, and things have changed a lot.
[41:33] Joe: But one thing that hasn’t changed is that it’s really hard to deploy language models. The ways that it’s hard have changed, but it’s still really hard. But maybe not in the ways that you think. The two ways that I’ve found it to be quite difficult still, even in 2024, are the fact that performance is multidimensional and zero-sum. Dan and Hamel talked about a context where you just, a simple deployment, you just need a model that emits tokens. You don’t care too much about performance. You don’t care too much about cost.
[42:15] Joe: If that’s the world you live in, deploying language models isn’t actually too hard. There are a lot of great serving frameworks. There are a lot of great platforms. These tools are ergonomic. Some of them are well-documented. You can have a decent time. and not experience too much pain if you don’t care about performance and cost. The problem is if you do care about those things, then suddenly everything becomes very complicated. And as we’ll talk about it in a bit, performance is multidimensional.
[42:43] Joe: There are some dimensions of performance that you can prioritize, but if you do that, then you’re penalizing other dimensions of performance. And so you really have to think carefully about what your context is. What’s your use case? What does it mean to have a performance SLA? And what do you have to do to meet it? There will be trade-offs. You probably won’t discover all of them until you’re shipping in 30 minutes and you have an emergency. They’re multidimensional and they’re zero-sum. So if you prioritize one dimension of performance, you often are penalizing another dimension.
[43:15] Joe: So the way I deal with this is to try to be really careful and think clearly and carefully about what it means to have a successful deployment. What’s important to my users? What’s important to the platform? What promises have I made to a product manager? the person who actually pays for the GPUs. The other issue is that technology never stops evolving. I’ve been serving language models for years, and there used to not be very many options unless you rolled your own stack. There were a couple of serving frameworks. They were all hard to use.
[43:43] Joe: But that was it. You didn’t have to think about technology change. The rate of evolution was pretty slow. That changed dramatically last year, two years ago. Suddenly, there’s a proliferation of serving frameworks, and there seems like every week maybe every month, there’s a new feature that offers performance improvements. So now part of the problem of deploying and maintaining performance models is keeping up. And in some contexts, you don’t have to keep up. You deploy something, you have a stable stack, you live with that for a while.
[44:15] Joe: But if a new feature comes out and it offers a 10% improvement in performance, it makes your models faster, it makes them cheaper, eventually you’re going to want to pick that up. And sometimes that means changing frameworks. As we’ll talk about, that’s often more difficult than it should be. So often it feels like this. There’s so many options, but it’s really hard to take advantage of those options in a frictionless way. So my solution for that is to minimize the cost of experimentation. It’s inevitable that you’re going to want to change frameworks.
[44:51] Joe: There will be a feature that you want. There will be a performance improvement that you want. you’re going to have to change frameworks. And so I think especially if you’re starting out to build a stack or you’re starting to learn about these things, prioritize modularity and minimize the cost of experimentation. Make sure your stack is at least looking towards a future where it’s easier rather than harder for you to change technologies. So how do we make it easier to serve language models? I want to talk a little bit about what makes language models slow.
[45:30] Joe: This isn’t maybe actionable information, but in the spirit of providing information that I wish I had at the beginning, there’s a lot of buzzwords, there’s a lot of techniques. Most frameworks use these in various ways. And if you’re starting out trying to evaluate frameworks, you need to have some idea of what people are talking about. What’s real and what’s bullshit. And if you have to learn a bunch of buzzwords and you don’t know what they mean. That can be really hard to wade through.
[45:58] Joe: And so I think a little bit of low-level information about how inference happens and how we can improve it can go a long way. So what makes language models slow? Turns out it’s just two things, memory bandwidth and software overhead. And there’s a small set of techniques, methods, strategies that people employ in combination to mitigate these two things. So memory bandwidth. Transformers are fundamentally memory-intensive features. Transformers networks are implemented as a sequence of operations, and each of these operations running on GPU require transferring data from device memory to smaller, faster memory caches.
[46:45] Joe: That’s just what happens. And this is a performance bottleneck. In most cases in production, language models are slow because they have to transfer data. from slow memory to fast memory. We fix this by transferring less data. There are a number of strategies that we can use. Some of you have probably heard of CUDA kernels. They’re very fancy. There’s another large organization learning CUDA. And one of the best ways to mitigate the memory bandwidth bottleneck is to just use better CUDA kernels. For example, you can fuse kernels.
[47:28] Joe: And kernels are just functions that run on a GPU. That’s it. It’s just a function that runs on a GPU. You might have a softmax kernel. You might have a layer norm kernel. You might have an attention kernel, et cetera. Fusion is when you combine these kernels. And so if you combine them in an intelligent way, you can minimize data transfer. That’s great. Decreases the bottleneck effect. Flash attention is a very impressive, made a big splash, still is quite performant. Flash attention 2 is very performant for long sequences.
[48:05] Joe: It does kernel fusion and it focuses on minimizing data transfer. So it’s just a better implementation of attention that minimizes data transfer. Page attention, something similar. More efficient data storage, more efficient data transfer. So all these tricks that people implement with kernels, much of the effect of kernel optimization comes down to just doing better data management. And then, of course, the other thing you can do is make your data smaller. Quantize your model. Your weights are smaller. You don’t have to transfer as much data. You can transfer more in each shot.
[48:44] Joe: We could say that speculative decoding fits in this regime. So there’s really so many… there’s a buffet of optimization methods that you can use. And it can be quite confusing to think about all of them. But fundamentally, much of what they’re trying to do is to just minimize data transfer or make it more efficient. The other main bottleneck for transformers is software overhead. So every operation, every kernel has to be launched, and it takes time. has to be scheduled. And this requires communication between CPU and GPU. This is another bottleneck.
[49:22] Joe: So the question is, how do we minimize overhead? Turns out that smarter kernels are also a great way to do this. Fusing kernels means that you have fewer kernels and then fewer kernel launches. So kernel fusion, again, is really important. Another very common approach is CUDA graphs. So as you conduct a forward pass through your model, you have a sequence of operators or kernels that have to be scheduled, launched, and executed. What a CUDA graph is, is it’s a trace across all of those operations.
[49:55] Joe: And after you assemble a CUDA graph, you can launch all of your kernels as a single unit. Now, there are some caveats. Sometimes you have to reimplement a model so it’s compatible with CUDA graphs. But in general, CUDA graphs are a really useful way to make a substantial impact on the software overhead. So these are the two main bottlenecks for transformers. So transformers are slow because of memory bandwidth bottlenecks and because of software overhead.
[50:25] Joe: And almost all of the transformer optimization efforts boil down to mitigating these two bottlenecks, often with some combination of the things I just mentioned. What makes language models fast? We talked about some low-level optimizations: kernel fusion, kernel optimization, CUDA graphs. There are things that we can do at runtime as well: continuous batching, which I’ll talk a little bit about in a second, and heavy caching. So instead of having to re-encode, say, your prompt or your chat history during a chat exchange, you can retain your KV cache and reuse it for subsequent computations.
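A minimal sketch of that caching idea with the Hugging Face transformers API, using a tiny stand-in model purely for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tiny stand-in model, purely for illustration.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

# Encode the prompt once and keep the per-layer key/value cache.
prompt_ids = tok("The quick brown fox", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(prompt_ids, use_cache=True)
past = out.past_key_values

# Each later step feeds only the newest token and reuses the cache,
# instead of re-encoding the whole prompt or chat history.
next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
with torch.no_grad():
    out = model(next_id, past_key_values=past, use_cache=True)
```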
[51:14] Joe: Hardware upgrades, that’s a good one. Just pay more money for your GPU and your models will get faster. And then there are some tricks. We can call speculative decoding a trick. Make your outputs shorter; that’ll decrease response latency. If you can get a successful response that uses fewer tokens, your responses will be faster because they produce fewer tokens. To an extent, shorter inputs can also make your models faster. Say you have a very large prompt. If you can decrease that, you might see a small benefit.
[51:49] Joe: And this is not so much a major impact for small prompts, but if you’re doing very large context operations, like across many documents, many pages of documents, this can have a big impact. I want to talk a little bit more about continuous batching, though. Continuous batching was introduced, I think by Orca, close to two years, a year and a half ago; ages ago. Before continuous batching, to run a batch through a model, including language models, you had to assemble each item in the batch.
[52:29] Joe: And so people would wait for small periods of time for requests to come in, assuming that you have an inference regime where requests come in stochastically. So you have to pause, wait for some requests, assemble them into a batch, and then run inference over that batch. And that was a terrible way to serve language models, because one of the weird things with language models is that requests are really non-deterministic. You might have a request that generates three tokens and then it’s done. You might have another request that generates 2,000 tokens.
[53:03] Joe: With the standard micro-batching regime, you have to wait until all of the requests are done before you can redistribute the responses to each user. Continuous batching solved that problem, and it also solved the problem of injecting new requests into a batch during inference. The way this is done is really quite elegant. Instead of thinking of inference as operating over a request, you think of it as operating over steps. When you’re decoding tokens, generating tokens from a language model, really what you’re doing is running multiple inference steps where each step produces a single token.
[53:45] Joe: If you orchestrate your batching from that perspective, then you can introduce a new item into your batch and just decode the next token. And if a particular item in your batch completes, you just pull it out and send it back to the user. And so this fixed two problems. One, you don’t have to wait around for new requests to assemble a batch; you can just continuously batch them. As requests come in, they’re injected into the process and executed, and when they complete, they’re pulled off and sent back to the user.
[54:17] Joe: to the user. This is great. As you’ll see, this adds some complexity for situations where you care a lot about performance too. I’ll emphasize here that a consequence of continuous batching is that you can wind up with dynamic batch sizes. So if you have no requests, you might have a model sitting idle. You have no requests. nothing is in your batch. Say one request comes in. So then you have a batch size of one. Then another request comes in. You have a batch size of two.
[54:55] Joe: Maybe you max out in this particular moment at a batch size of 10. And then one of them completes, and you go down to a batch size of nine. Then four complete, and you go down to a batch size of five. Then a bunch more requests come in, and you go up to a batch size of 25. The point is, with continuous batching, you have a dynamic batch size. And that has huge implications on the cost of operating your model and the performance that you can provide.
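To make that scheduling idea concrete, here is a schematic Python sketch of a continuous batching loop; decode_step, the Request objects, and the queue interface are hypothetical placeholders, not any particular server's API.

```python
from collections import deque

def serve(decode_step, incoming, max_batch_size=32):
    """Schematic continuous-batching loop (not a real inference server).

    decode_step(batch) stands in for one forward pass that returns exactly
    one new token per active request; `incoming` yields Request-like objects,
    each with .output (a list), .is_done(), and .respond().
    """
    active = []
    waiting = deque(incoming)
    while active or waiting:
        # Inject new requests whenever there is room in the batch.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())

        # One decode step advances every active request by one token.
        new_tokens = decode_step(active)
        for request, token in zip(active, new_tokens):
            request.output.append(token)

        # Finished requests leave the batch immediately; the rest keep going,
        # so the batch size changes dynamically from step to step.
        still_active = []
        for request in active:
            if request.is_done():
                request.respond()
            else:
                still_active.append(request)
        active = still_active
```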
[55:21] Joe: So I talked about a bunch of different ways that you can make models fast. Most inference servers these days offer many of these affordances. vLLM, TGI, FastGen, TensorRT-LLM, SGLang: all are really cool, and all have different affordances. They all do things like continuous batching. They all have specialized kernels, and many of them use the same kernels. vLLM now uses CUDA graphs. TensorRT-LLM compiles to TensorRT and uses CUDA graphs. While the interfaces for these frameworks are quite different, and some of them support different features, they are all using many of the same optimization techniques.
[56:08] Joe: Where things get really tricky is when you care about performance, like I said before. And there are roughly three different ways to think about performance. You can think about total tokens per second, the number of tokens produced across all of your requests. Single stream tokens per second is the number of tokens produced per second for a single request. And then requests per second is more of a standard metric: how many requests can you complete in a second? And this gets really tricky with language models, particularly under the regime of continuous batching.
[56:43] Joe: Specifically, the relationship between total tokens per second and single stream tokens per second is one tension. I’ll show you one graph and talk through a little bit of what we see here. I think this was for Llama 70B; I don’t remember what hardware this was on, and it was a while ago that I made this graph. On the y-axis, we have single stream tokens per second. So this is the tokens per second that a single request will have.
[57:16] Joe: So a user sends a request; what’s their experienced tokens per second? On the x-axis, we have batch size. Now remember, under the regime of continuous batching, batch size is actually dynamic, which makes it really hard to offer performance SLAs or even think about what kind of performance people get. This has some perverse consequences. Especially last summer and last fall, and still today, there are a lot of frameworks and platforms that make promises about speeds “up to X”, and in most cases they’re talking about single stream tokens per second.
[57:52] Joe: “Speeds up to 150 tokens per second.” And then you make a request against some of these services and you get 40 tokens per second. That’s because they’re talking about maximum single stream tokens per second, which also happens to be at a batch size of one. And nobody’s running these models at a batch size of one, and nobody really has control over what batch size their request will be operating at. So the performance you get varies substantially because batch size varies substantially.
[58:21] Joe: Another consequence is that as you increase batch size, single stream tokens per second goes down, but total tokens per second, which is annotated here as TP, goes up. And so this is why it’s really important to think carefully about what your performance SLAs actually are. What are your priorities? Do you need to produce a lot of tokens per second across all requests?
[58:45] Joe: If you’re using agents, or if you’re doing something that is not latency sensitive and you have a large batch, it can make a lot of sense to use a really large batch size and prioritize total tokens per second.
[59:02] Hamel Husain: I think a lot of people report benchmarks and they confuse the two. They’re like, oh, look at my total tokens per second. And then, you know, they’re not telling you what the single stream tokens per second are. And they get really excited, like, oh, I have such great total tokens per second, which is basically throughput. And so it’s really good to pay attention to this. I’m glad you brought it up.
[59:28] Joe: It’s kind of crazy that it’s still so chaotic. Quite recently, we were working with a very sophisticated group that had put together a very sophisticated model. We were asking them for benchmarks and they just gave us total tokens per second. I said, well, what’s the single stream? And they had not even thought about measuring that. Then they did, and it was really bad. One of the things you have to think about if you’re serving for users, so people who are making requests to your model, is user experience.
[1:00:05] Joe: And that has to be weighed against the cost of operating your model. As the operator, as the person paying for the deployment, as you increase tokens per second, assuming you have the demand to actually saturate or reach those tokens per second, your cost goes down. You pay a fixed cost for your GPU. So if you make total tokens per second go up, your cost per token goes down, which is great. That’s what you want to do as the operator. The problem is you make the experience worse for your users.
[1:00:35] Joe: So in this case, with this very sophisticated model developer, it turned out that we really couldn’t give a good user experience at the high total tokens per second they were excited about. We had to cut that down a lot, and that made the model very expensive. It’s really easy to forget about these details and wind up with something that
[1:00:56] Joe: is it provides a terrible user experience or is is prohibitively expensive and so that’s why it’s really important to think about these these criteria during the specification of your deployment you really don’t want to wind up with a model that you can’t run because it’s too expensive or their users won’t use because it’s not fast enough um joe
[1:01:19] Hamel Husain: Joe, one question. This is awesome, and actually, if we could stay on that chart: it looks to me like you can pretty tightly approximate single stream tokens per second as total tokens per second divided by batch size. I imagine there can be some variability there, but I’ve always just reported total tokens per second and batch size together because that seems to approximate it well. Is that the case in your experience?
[1:01:46] Joe: Yeah, you can definitely approximate it. There’s some weirdness. I have some other graphs; I wish I had included them. It’s very much not a smooth curve. There are a lot of step functions, thanks to GPU architecture. So it’s not perfect. But yeah, I often benchmark like that, too.
[1:02:07] Hamel Husain: Yeah, I guess I’m not saying that you can just take your single stream tokens per second, multiply it by your batch size, and scale out to infinity. I’m saying you measure total tokens per second and divide by batch size to get single stream.
[1:02:18] Joe: Yeah, totally. So increasing batch size will asymptotically maximize your total tokens per second. If you are generating synthetic data, or if you’re not in a latency-sensitive context, there are many instances where it really makes sense to just maximize total tokens per second. In that case, you want to use a really large batch size. I say asymptotically because at some point you will reach the memory bandwidth threshold and you’re just not going to get a return. Adding a new item to your batch will just linearly increase the time that it takes to complete.
[1:03:04] Joe: So there’s a limit to the gain you get by increasing your batch size. Decreasing batch size increases single stream tokens per second, but it penalizes total tokens per second. So if you care about latency and you want the fastest possible tokens per second at the request level, you should run a batch size of one. Nothing will be faster than that, all things held constant. That’s also the most expensive way to run your model, because you have to scale horizontally as concurrency grows. The right balance depends on your use case.
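A tiny worked example of the approximation Hamel describes; the numbers below are made up for illustration.

```python
def approx_single_stream_tps(total_tps, batch_size):
    """Rough per-request tokens/sec implied by an aggregate measurement."""
    return total_tps / batch_size

# A hypothetical server producing 2,000 tokens/sec in aggregate at a batch
# size of 25 gives each request roughly 80 tokens/sec, far below whatever
# its batch-size-1 peak would be.
print(approx_single_stream_tps(2000, 25))  # 80.0
```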
[1:03:36] Joe: Thinking about your target batch size is really important and it has to be tuned to the particular context that you’re deploying your model in. Okay, so we talked about making models faster. I want to talk a little bit now about simplifying language model deployment, specifically thinking about making sure your stack is modular so that you can do experiments across different frameworks. Then we’re going to do a quick demo. So I think it’s really important to prioritize modularity. Different frameworks have different affordances.
[1:04:17] Joe: I haven’t told many of our stories, and I think so many of them fall into this category. Just recently, I was working on speculative decoding with TRT-LLM, which has been released since January; that’s several months. TRT-LLM is in active development. It’s pretty well documented. Sometimes there are bugs, but in general, I really like the framework; I think they’re putting a lot of great effort into developing it. I thought that speculative decoding would be an easy win, except it didn’t work with streaming, and there were a few other things it didn’t work with.
[1:04:55] Joe: So it worked, sort of, but not for my use case. And that was really frustrating. Time and time again, I’ve found that a feature I needed wasn’t implemented in the framework that I wanted to use, or it wasn’t compatible with other features I wanted. For instance, TRT-LLM supports S-LoRA, but not with weight-only quantization, so you can’t use it with QLoRA. And we were serving many of our models in 8-bit, and we couldn’t use S-LoRA with that. And that’s just an implementation detail. There’s no reason for that to be the case.
[1:05:29] Joe: That’s just where TRT-LLM is in its development phase right now. So I think it’s really valuable to be able to change frameworks as you need to. To do that efficiently, you need to be able to experiment with different frameworks. As often as I’ve found that a particular framework was lacking, I’ve also tried to switch to a new framework and then found a particular combination of features wasn’t compatible. So maybe I’m using features X and Y in my current framework.
[1:05:58] Joe: My new framework has feature Z as well as X and Y, but it doesn’t support the union of X, Y, and Z. It’s better if you can figure that out really quickly rather than having to sink a bunch of engineering effort into it. In a perfect world, these things would all be documented, but this goes back to the problem of a proliferation of new technology. People are constantly introducing… awesome features that we should all take advantage of.
[1:06:25] Joe: But that means that there are bugs, and they’re not well documented, and they don’t always play well with other features. And so figuring out how all these pieces can fit together in your stack is honestly one of the biggest challenges of deploying language models on the bleeding edge. And that’s one of the problems we’re trying to simplify at Replicate. So Replicate is serverless infrastructure. We offer ready-made APIs for models; you can run any model on Replicate. But we also have a piece of software Hamel mentioned called Cog.
[1:07:02] Joe: And you can wrap your model and your model’s serving code with Cog and push it to Replicate. One of the implications of this is that it gives you complete control over your serving framework. We’re serving our kind of official language models
[1:07:18] Joe: with TRT-LLM right now, specifically a Cog implementation of TRT-LLM that we’re going to open source, and we’re also serving some with vLLM. The direction we’re going with Replicate is to make it a place where not only do you have all these models to experiment with, you also have a bunch of serving frameworks to experiment with. You’re still going to have to get into the details if you care a lot about performance tuning, or about features that aren’t supported out of the box or might have to be activated. But my goal, my
[1:07:47] Joe: kind of personal goal, is that after all the blood and sweat and tears, it should be a lot easier to experiment with these frameworks, and it should be a lot easier to figure out what’s broken and what isn’t. So right now we’re close to open sourcing Cog-TRT-LLM, and we did an enormous amount of work last week to be able to open source Cog-vLLM, which is what I’m going to show you. The idea is that you can use these servers as a drop-in solution. We want it to be really easy.
[1:08:15] Joe: You want to run a model with vLLM on Replicate? Very easy. But perhaps even more exciting is that you can just pull a repo and change whatever you want to change. You can change the inputs to get a different API signature. You can add support for a new feature that perhaps we haven’t implemented yet in Cog-vLLM. And the idea is to actually take advantage of the fact that this is completely open source; the way we serve models is open source. You can look at how we do it and learn from it.
[1:08:45] Joe: And if you’re really nice, open a PR and make it better. So I’m going to show you our vLLM walkthrough and workflow, and talk through some of the Replicate details. If anything isn’t clear, please stop me. Let me switch to the browser and I’ll show you. When I go to Replicate, I land on my profile. This is a dashboard; it shows you recent predictions, recent trainings, and recently pushed models. You can look at your models here, and you can go to an Explore page and see all the models that are available.
[1:09:53] Joe: You can go to these models and see a UI that allows you to run things against them, but Replicate also provides APIs and clients. So there’s a Node client, there’s a Python client. Replicate makes it really easy to kind of drop in support for any particular model that’s running on the platform. What I want to show you is something we set up that makes it really easy to run a model with vLLM.
[1:10:40] Joe: This is currently implemented as a training on Replicate, and we’re working on introducing some new abstractions that’ll be a little bit more coherent and elegant. But on Replicate, a training is really just something that runs in a container and produces some artifacts, typically weights, and those weights are then associated with a downstream model. So we set this up as a training so that you can pull weights from Hugging Face, and then those weights get associated with our vLLM server.
[1:11:11] Joe: What you get in the end is a Replicate model running a particular set of weights with a particular architecture on vLLM. I already did these steps, since it takes time to download weights and transfer them to Replicate, so I’ll just show you what you would do if you were doing this yourself. We also have this documented. So we put in a name, and we would add the Hugging Face ID; I’m doing Gemma 2B Instruct. You can add the model SHA if you want, which allows you to specify a revision.
[1:11:57] Joe: And I’m going to drop in a Hugging Face token here, which is registered as a secret. And this is really right now just a wrapper around the Hugging Face hub. So it has all the affordances that you would expect. If you want to disallow files associated with the model in Hugging Face, you can have an ignore pattern, allow pattern, et cetera. You can also specify a prompt template here. Otherwise, we’ll use the default chat template associated with the model’s tokenizer.
[1:12:29] Joe: So you click Create Training, and that will take you to a page that looks like this. What happens is we pull the weights from Hugging Face, put them in a tarball, and push them to GCP. This is important, and we’re encouraging people to shift towards workflows like this, because if you put things in our buckets, we’re able to cache them and offer much better download performance. We have several other optimizations that we’re introducing soon that will help a lot with cold boots.
[1:13:01] Joe: We regularly see people put models on Replicate that, for example, just pull weights in a naive way from Hugging Face. That’s a really slow way to download your weights, and a good way to make your cold boots even worse. So we’re working on affordances to make it really simple to do the optimized thing, or rather really hard to do the slow thing. You shouldn’t even have to think about these problems; that’s where we’re trying to go. So right now, to get something running with vLLM, the easiest thing to do is to just go to this form.
[1:13:31] Joe: And this form will allow you to transfer weights from Hugging Face to Replicate’s infrastructure. It downloads the weights and pushes them, and then when the training is completed, it instantiates a new model. So we could click Run the trained model, and that would bring us here. This is our model, and we can make requests against it. And as I showed you before, there’s support for different clients. So if you have the Python client and you want to use this model, you have an example here of how to do that.
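For reference, a call with the Python client looks roughly like this; the model identifier below is hypothetical, and you would substitute whatever your trained model's page shows.

```python
import replicate

# Hypothetical model identifier for the vLLM model created by the training.
model = "your-username/gemma-2b-it-vllm"

# For language models, replicate.run returns an iterator of output tokens,
# so you can stream them as they are produced.
for token in replicate.run(model, input={"prompt": "Write a haiku about GPUs."}):
    print(token, end="")
```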
[1:14:13] Joe: So what we did was take Gemma 2B Instruct from Hugging Face, and we’re now running it with vLLM on Replicate. Where we’re going is that you’ll be able to do this with SGLang or TRT-LLM or any of the other frameworks out there, and get the affordances from those frameworks. The other part that I want to show you is Cog-vLLM. This is the code that actually makes this all possible, and the process that I just walked you through is all documented here, so you can do this yourself.
[1:14:45] Joe: But if you want to do some local development, you can follow these instructions. So you’d install Cog, as I mentioned before, clone this repo, and then run something like this: you would set the Cog weights environment variable to the URL that we give you when you complete your training. I’m going to go ahead and run this in VS Code. So I’ve set my Cog weights variable to this URL. This is an object created by the vLLM creator that I showed you.
[1:15:25] Joe: So that creator takes the weights and other artifacts, including the tokenizer from Hugging Face, puts them in a tarball in one of our buckets, and you get a path back. Now you can use this path for really fast downloads on Replicate. So I set this environment variable and then I just call predict. And it’s going to be really fast because I’ve already downloaded the weights; if I hadn’t, they would have to be downloaded locally, and then they’d be cached.
[1:15:52] Joe: We see the vLLM server starting up, and soon we’re going to see some tokens emitted. So let’s see. We’ll see something about CUDA graphs in a second.
[1:16:25] Joe: That’s something we’re moving to. So one thing you have to think about with model serving is support for models as new ones come out. You might pick a particular framework, and then a new model comes out and you want to use it, except it’s not supported, and that’s really annoying. vLLM is great because it pretty much has out-of-the-box support for Transformers models. TRT-LLM has focused a lot on adding support as new models come out, and they collaborate with major model providers; when Gemma dropped, there was support immediately.
[1:17:00] Joe: But it’s not always the case, and being able to exchange serving frameworks can really save you if you need support for a new model that’s not supported in your framework of choice. Here are our tokens; the model responded. This is a list of strings because we’re streaming, and this is the model running locally with vLLM. You don’t need a GPU to push, so we could go ahead and change some of our code, modify how inference is done, push that to Replicate, and run it there if you don’t have a GPU available.
[1:17:37] Joe: And then very quickly, I’ll just talk through some of the components that Hamel already described. So we have a predict.py that’s basically doing most of the work for us, and we have a cog.yaml. The thing I want to draw your attention to here is this concurrency argument. That’s what allows the model to do continuous batching on Replicate. You can specify the maximum concurrency that Replicate will send to your model. Here we’ve set it to 32, which means Replicate will never schedule more than 32 requests against this model.
[1:18:14] Joe: And it will start to scale out as you reach that max concurrency. Internally, what we do right now is target about half of the maximum concurrency as the target batch size. As you’re thinking about performance SLAs, that’s something to keep in mind when you’re thinking about how this will scale out and operate under production loads. So if you’re targeting 16 requests, you can think about what your performance will be if you have that at equilibrium, and how bad your performance will be if you reach the max batch size.
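As a loose sketch of how those pieces fit together (this is not the actual cog-vllm code; load_vllm_engine and its generate method are placeholders, and the exact async and streaming types depend on your Cog version):

```python
# Hypothetical, trimmed-down predict.py sketch.
from cog import BasePredictor, ConcatenateIterator, Input


class Predictor(BasePredictor):
    def setup(self) -> None:
        # In a real predictor this would start the vLLM engine once per
        # container so every request shares the same loaded weights.
        self.engine = load_vllm_engine()  # placeholder, not a real function

    async def predict(
        self, prompt: str = Input(description="Prompt to complete")
    ) -> ConcatenateIterator[str]:
        # With a concurrency setting in cog.yaml, Cog can schedule many
        # requests into this one process at once, and vLLM folds them into
        # its continuously batched decode loop.
        async for token in self.engine.generate(prompt):  # placeholder API
            yield token
```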
[1:18:44] Joe: One of the benefits of continuous batching is that you have a lot of flex. So you can support dynamic demand without having to scale out. But the cost you pay is that you also have dynamic performance. So if you get a big spike, a bunch of requests come in, everybody gets worse performance. I’m going to stop here and see if there are any questions.
[1:19:10] Hamel Husain: We probably have to move on from questions, just from a time management perspective. We can always carry this into office hours or something like that.
[1:19:27] Dan Becker: Okay. I just got a message from Travis. I think he’s going to have a hard stop at some point. So I think we probably need Travis to go next.
[1:19:44] Travis Addair: Awesome. Yeah. Thank you, Dan.
[1:19:51] Hamel Husain: Thanks, Joe. That was really good.
[1:19:57] Joe: Thanks.
[1:20:05] Travis Addair: All right. Should I go ahead and share my screen or?
[1:20:08] Hamel Husain: Yes, please.
[1:20:10] Dan Becker: Yeah.
[1:20:16] Travis Addair: All right. Okay. Thanks, everyone. I really enjoyed the last two talks here. I think that set the stage really well for what I wanted to talk about, which is some very specific nuances of deploying fine-tuned LLMs in particular. One thing that I’ve focused on a lot in the last year or so is this intersection of fine-tuning and serving, and what unique challenges, or even opportunities, can be presented when you’re trying to serve fine-tuned models in particular, as opposed to base models or fine-tunes merged into base models.
[1:20:56] Travis Addair: So the main theme of this talk will be lessons learned building our platform for training and serving fine-tuned LLMs. Just as a little background about me: I’m the co-founder and CTO of Predibase. We’ve been around for about three to four years now. We were originally founded out of Uber’s Michelangelo ML platform team, where I worked as a lead on the deep learning infrastructure team from 2016 to 2021. While I was there, I was the lead maintainer on a project called Horovod.
[1:21:28] Travis Addair: It’s a distributed deep learning framework for PyTorch and TensorFlow, if folks remember TensorFlow. And Ludwig, which is a declarative deep learning framework that actually has a lot of similarities to Axolotl, but we built it back before LLMs, so it was much more targeted towards image classification, text classification, and use cases like that early on. And then most recently, LoRAX, which is our framework for doing fine-tuned LLM inference, which I’ll speak a little bit about today. This is the most salesy slide that I have, talking a little bit about our platform itself.
[1:22:03] Travis Addair: So Predibase is essentially a managed platform for fine-tuning and serving fine-tuned LLMs. We try to be very integrated end-to-end in terms of providing the ability to prompt models out of the box serverlessly, fine-tune them, and then deploy them either serverlessly or dedicated, in VPC or SaaS, et cetera. The next few slides will talk a little bit about some of the specific challenges that we had to overcome when building this platform, and how these might relate to challenges that you’re likely to encounter when serving fine-tuned LLMs as well.
[1:22:35] Travis Addair: So let’s talk a little bit about our story then. Our key thesis on the market is that these very general-purpose LLMs like ChatGPT are great, but when you’re trying to put something in production, you want to put something specific into production to solve a specific problem. And so the joke that we like to say is that general intelligence is great.
[1:22:57] Travis Addair: I don’t need my point of sale system to recite French poetry, which is basically saying you’re paying for that extra capacity one way or another, whether that’s in dollars or latency, etc. And so ultimately getting fine-tuned models in production is all about getting something more specific to your task that is better suited to it and better optimized for that. So the common pattern that we believe that organizations will go through is starting with something like GPT-4 as a starting point for experimentation. And then over time, migrating to fine-tuned task-specific models where you might have…
[1:23:33] Travis Addair: you know, a model for your customer support tickets, one for sentiment analysis, one for chat, etc. And so the future that we envision is you have lots and lots of fine-tuned models for all your different business tasks and use cases. But a challenge quickly comes up, which is, okay, if I’m going to fine-tune a new model for every task, how much is that actually going to cost to serve all those models?
[1:23:57] Travis Addair: And, you know, if you just look at a standard entry-level GPU in AWS, like an A10G running about $1.21 on demand per hour, it has just enough VRAM to serve some of the 7 billion parameter models nicely. Quickly, you see that this starts to accumulate. And before too long, you have 16 different use cases and you’re paying 14k a month in cloud bills to serve those. And it only gets worse from there, right?
[1:24:27] Travis Addair: And so this was the same dilemma that we faced when we were thinking about building our platform because we wanted to build this serverless platform for people to fine tune and serve. But if every single user that came in needed a new dedicated deployment, even when they’re just getting started, this was going to be very expensive for us. And so taking a step back and thinking about how you deploy fine tuned LLMs the old fashioned way, you normally have something like a Kubernetes pod.
[1:24:53] Travis Addair: you have a request queue coming in that is accepting requests and queuing them up for completion. And then you have your fine-tuned model weights. And if you’re doing parameter efficient fine-tuning, like Dan mentioned at the beginning, these weights only account for something like 1% to 5%, maybe 10% at worst, of the parameters of the model. But when you’re deploying all these things, you’re deploying them over and over again with those same base model parameters replicated each time, right?
[1:25:24] Travis Addair: So the observation you might have is, well, what if we just took these base model parameters and then tried to serve these fine-tuned model parameters together on top of the same model deployment? And that was exactly what led us to want to build this framework called LoRAX, which is a fine-tuned inference system. It was built on top of Hugging Face’s TGI; since then, we’ve forked it, expanded it a bit, and now use it to support our serverless inference on Predibase. And essentially, it’s that idea.
[1:25:57] Travis Addair: Let’s take all these different requests coming in for all these different fine-tuned models and try to serve them concurrently on a single deployment, provided that they all share a common base model. Now, you might imagine that naively you could do this by just swapping these adapters, but you might realize that this would be quite slow. So what we do instead is we batch together different requests for different adapters at once at a single time.
[1:26:25] Travis Addair: So you might imagine having requests for an adapter represented in blue here, an adapter in red, and then maybe gray is another adapter. Instead of trying to do them one at a time, what you can do is logically think of them as being fused together into a single tensor for the A’s and a single tensor for the B’s, and then just do one straight-shot multiplication against the base model parameters, and then multiplications against the A and B matrices.
[1:26:56] Travis Addair: And this actually is much better than doing it sequentially, but you can do even better by lowering this down to the CUDA level, so you’re not having to physically join, or rather gather, all these LoRAs together into these giant A and B matrices.
[1:27:14] Travis Addair: Instead, you can do this down at the CUDA level using indirection through pointers, taking advantage of different tiling strategies, and do it much more efficiently. That was some work done by folks at the University of Washington in a paper called Punica back in September of last year, which was the basis for S-LoRA and various things like that.
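As a rough PyTorch sketch of the gather-and-batch version of this idea (not LoRAX's actual CUDA-level implementation, and all the names here are illustrative):

```python
import torch

def batched_lora_forward(x, base_weight, lora_As, lora_Bs, adapter_ids, scaling=1.0):
    """Apply a different LoRA to each request in one batched pass.

    x:           (batch, in_features) activations, one row per request
    base_weight: (out_features, in_features) shared base model weight
    lora_As:     (num_adapters, rank, in_features) stacked A matrices
    lora_Bs:     (num_adapters, out_features, rank) stacked B matrices
    adapter_ids: (batch,) which adapter each request uses
    """
    # One matmul against the shared base weights for the whole batch.
    base_out = x @ base_weight.t()

    # Gather each request's A and B, then apply them with batched matmuls.
    A = lora_As[adapter_ids]                                  # (batch, rank, in)
    B = lora_Bs[adapter_ids]                                  # (batch, out, rank)
    delta = torch.bmm(B, torch.bmm(A, x.unsqueeze(-1))).squeeze(-1)
    return base_out + scaling * delta
```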
[1:27:35] Travis Addair: And what you find when you do this type of approach, which we’ve added in LoRAX, is that if you compare against the baseline of trying to do every adapter individually, you see that as your number of adapters increases, the throughput starts to drop
[1:27:48] Travis Addair: pretty precipitously you know by the time you’ve got 125 adapters then your total throughput um 120 adapters your total throughput has dropped you know by about 90 or so but by doing this um intelligent you know fusion heterogeneous batching of different adapters together we can maintain uh that peak throughput with only about a sub 10 degradation even at very extreme ends of the number of adapters and so that’s you know really the key you know initial hook uh about lorax that you know made us want to put this out there is having a production way
[1:28:25] Travis Addair: to be able to serve all these things concurrently at once without degrading the performance for for our end users and ultimately we believe that this translates into you know the bottom line being cost savings right so if you compare against fine-tune gpt 3.5 as a baseline um you know i don’t know if these numbers are the latest up to date but last time i checked about six dollars per million tokens for the fine tune models.
[1:28:51] Travis Addair: They have that as a constant because it’s serverless, but if you’re comparing that against Dedicated, Dedicated of course is going to scale up as you add more and more replicas for different use cases, but the idea with Lorax is that you get the scaling of something like serverless but with the baseline cost of Dedicated at very few replicas. And so overall the argument is that you should be able to see pretty dramatic.
[1:29:15] Travis Addair: uh cost savings by doing an approach like this um and so that was ultimately what we ended up doing um and so just kind of the foundation for how we are able to make our platform work the way it does so now i wanted to kind of move off of a little bit you know talking just about us and how we got here and more about you know some practical advice for for you as well um and so one of the things that you know users come in they’re interested in this kind of capability one
[1:29:43] Travis Addair: of the first things that happens is they say okay, I trained an adapter for my particular data set. Now what? What happens next? And I want to revisit this concept of merging. that Dan and Hamill spoke about earlier, because I think it’s actually quite important. It’s like the first step is like, okay, what am I actually going to, what’s my serving strategy going to be for my fine-tuned model?
[1:30:07] Travis Addair: The case for merging is pretty simple: you get good baseline performance when you merge the adapter, because you don’t have to pay for any of the overhead from processing the additional adapter layers at runtime. You just merge those LoRAs back in, so that effectively eliminates all of that computation, which is obviously a win. But that in and of itself is not the only consideration. Other considerations might be: do I need to serve other models as well?
[1:30:33] Travis Addair: So if you want to serve multiple fine tunes, or even if you want to serve the fine tune model plus the base model, that’s a very good reason why you might want to consider not merging, since then you can pack all those together in a single deployment, right? Additionally, I think one thing that came up previously, as well as in the comments, was how does this interact with quantization?
[1:30:55] Travis Addair: And it turns out that the idea of merging the adapter weights back into the base model gets very tricky in a world where you fine-tuned the model using QLoRA. The reason is that you’re no longer serving something that is FP16; those adapter weights are intrinsically tied to the quantized model, whether implicitly or explicitly. Additionally, we need to think about our iteration velocity. I think someone else mentioned previously that things change very quickly; new models come out, new data sets come out. You want to be able to very rapidly A/B test or verify experiments.
[1:31:35] Travis Addair: Being able to hot-load adapters quickly is a very good way to, say, run an A/B test or do some shadow traffic before promoting your model to production, to see whether it actually performs better on live data, without having to double your serving costs just to run that test. A couple of final notes: merging doesn’t work with all adapters. For LoRAs it certainly does; others like DoRA would be more complicated, and other adapters that I’ll talk about, like speculative decoding adapters, you can’t merge at all.
[1:32:05] Travis Addair: And finally, it takes additional disk space. So all in, it is a consideration, but whether you keep the adapter in its unmerged form or merge it back into the base model is a decision that I think needs to be made case by case. And I want to talk about this quantization one in a little bit more depth, because we see it come up all the time with our users, and I imagine it’s something that you’re likely to encounter as well. So let’s say you fine-tune a model using QLoRA.
[1:32:35] Travis Addair: How are you going to serve that model? I think one of the first things that people think is, well, I fine-tuned it with quantization because fine-tuning requires more memory, because of the gradients and everything else. But I’m going to serve it using full precision, or FP16 half precision, because I don’t want to pay the cost of dequantizing, and the memory overhead is less at serving time, so I can generally get away with it.
[1:33:03] Travis Addair: However, one thing that you’re likely to discover is that the activations that come out of that base model portion after dequantization from QLoRA and the activations that come out of the base model in FP16 are not actually the same. And that’s because there’s some amount of error that exists when you quantize and dequantize. And so as a result, that adapter was actually fine-tuned in such a way that it’s connected, through that noise really, to the fact that it was trained against that quantized base model.
[1:33:39] Travis Addair: And so just swapping them out is going to produce different behavior and may actually degrade quality substantially. So the alternative might be, well, what if I just serve it in the QLoRA, or quantized, form, since that was what I trained it on? And you can do that, and that would work, and LoRAX supports it; you can have the adapter be used with the quantized base model at runtime. But the trade-off is that performance is significantly worse in terms of latency and throughput when you’re serving the quantized model.
[1:34:10] Travis Addair: So here you can see time to first token.
[1:34:13] Travis Addair: uh this is just running on an a100 with like a mistral of seven billion you can see that uh time to first token with fp16 uh you know significantly lower less than half of the bits and bytes and then even on a throughput so tokens per second this is single request tokens per second you can see a pretty significant drop there as well so if you can get away with it you definitely would rather not serve uh with qlaura So that presents the dilemma, is that serving on FP16 produces worse results?
[1:34:44] Travis Addair: and serving with bits-and-bytes quantization is slow. Is there a way we can get the best of both worlds, quality and speed, for serving these models? And so that’s where a little trick comes in, which is: what if you dequantize the QLoRA weights? This is something we actually do quite a lot with our customers. You take the original weights of the model in FP16 and quantize them using bits and bytes, as you would for QLoRA.
[1:35:09] Travis Addair: Now you have the weights in NF4, and then you just reverse the quantization, which you can do because you need to dequantize at a per-layer level anyway for QLoRA, so the functions are there to do it. The result is a set of FP16 weights that is numerically identical to the quantized weights. And now you can go ahead and serve that in FP16, get all the speed benefits of doing so, but with none of the quality degradation that you had before.
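A minimal sketch of that round-trip with bitsandbytes' functional API, assuming default QLoRA-style NF4 settings; the shape and the single-layer framing are just illustrative.

```python
import torch
import bitsandbytes.functional as bnb

# Original FP16 weight for one linear layer (shape is illustrative).
W = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Quantize to NF4 exactly as QLoRA would have at fine-tuning time...
W_nf4, quant_state = bnb.quantize_4bit(W, quant_type="nf4")

# ...then immediately dequantize back to FP16. The result matches the
# weights the adapter "saw" during training, unlike the original W.
W_roundtrip = bnb.dequantize_4bit(W_nf4, quant_state).to(torch.float16)
```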
[1:35:35] Travis Addair: And so in practice, this is what we end up doing the most for users who fine-tune with QLoRA. A couple of notes on performance. There has been a lot of good discussion previously about performance optimization for base models, so I don’t want to spend too much time on that. But one thing I always mention to customers when we’re getting started with a POC is: what are your data and load requirements in terms of queries per second or requests per second?
[1:36:08] Travis Addair: What does it look like in terms of the distribution of input and output tokens and the number of adapters that you intend to serve in production? Certainly, when we talk about ideal batch size, target throughput numbers, things like that, they vary quite dramatically depending on the combination of these variables.
[1:36:24] Travis Addair: So, for example, with high requests per second and a high number of input tokens, you’re ultimately bound in how much throughput you can get by the prefill step in a large way, because that is the bottleneck on how many requests you can pack into a single batch to serve to the model. If all your requests are very small in terms of input tokens, you can increase the parallelism quite significantly and therefore increase your total throughput quite substantially.
[1:36:55] Travis Addair: Whereas if all your requests are very long context and you’re generating a small number of tokens, your number of output tokens generated at peak is going to be much lower than it would be otherwise. So those things all have a very big effect on what numbers you’re likely to see, or what you should set as your expectations. And then, apart from that, that of course leads into what your service level objectives, your SLOs, are.
[1:37:19] Travis Addair: So some users are very sensitive to peak throughput, total tokens per second, usually for batch use cases in particular, where you want to scale up, process everything, and scale down. Maximum acceptable latency can be very important for real-time use cases, of course, and then there’s cost per day, per month, per year. I think when you’re using open source, one of the main reasons to do so is to reduce costs relative to commercial APIs, so that one comes up by far the most, and it is intrinsically very tightly connected to peak throughput as well.
[1:37:50] Travis Addair: Because if you get higher throughput, you require fewer machines, or you can do it faster, so you have to be up for less time, and therefore you can reduce costs. One other thing I think is very important for deployments is thinking about what hardware you need to run your deployment in the first place. So as a typical recommendation, I would say, back of the envelope, at least 1.5x the model weights are needed for serving the model.
[1:38:17] Travis Addair: And that ultimately comes down to the fact that it’s not just the model weights that occupy memory on the GPU at runtime. You also have the activations, which are intrinsically connected to things like how many tokens you allow in the batch at a single time and how long of an input length you’re going to tolerate per request; and the adapters as well, meaning how much memory you want to allocate for serving different LoRA adapters if you’re using LoRAX to support multiple at a single time.
[1:38:50] Travis Addair: And then of course the KV cache itself, which is going to need to be bigger if you’re generating more tokens and supporting larger batch sizes, since all that needs to be kept around for the duration of the entire request in order to avoid having to recompute the attention between different tokens. So all these things add up collectively and need to be considered when choosing hardware. But to me, the key questions that I think are worth asking are always, you know, okay, given this, how much VRAM do I need?
[1:39:19] Travis Addair: How many requests per second am I expecting at peak? How are the requests distributed throughout the day? Are they all concentrated at once, in which case I can scale up, handle it all, and scale down? Or are they distributed pretty evenly? What’s my maximum acceptable latency per request?
[1:39:38] Travis Addair: you know is it something where i can let it complete in minutes or hours it doesn’t need to be milliseconds am i willing to sacrifice quality or latency to reduce cost so willing to use quantization and how many different tasks do i need to support in production so things like whether i need to use multi-law inference And then last note here, I think, in particular for us, since we support both serverless and dedicated deployments on our platform, one question that gets asked a lot is, should I use serverless or should I use dedicated?
[1:40:08] Travis Addair: And then the last note here, in particular for us, since we support both serverless and dedicated deployments on our platform: one question that gets asked a lot is, should I use serverless or should I use dedicated? To me, it comes down to the same set of considerations we just discussed, and how they translate into the requirements of your use case. So when request volume is low to medium, but distributed fairly uniformly, so you’re getting some set of requests coming in throughout the day,
[1:40:24] Travis Addair: you know you’re okay with latency on the order of seconds like some amount of variance in that that to me is like the perfect serverless use case um because it’s not overly sensitive to a specific latency requirement but it does need constant uptime and so that’s one of the cases where dedicated uh tends not to be the best option because with dedicated always on you’re paying for it whether it’s utilized or not and that’s definitely a bummer if you’re not getting a lot of utilization but dedicated does become very attractive uh even though it seems
[1:40:55] Travis Addair: like serverless costs are so low, when you have one of two scenarios, where either the request volume is very high, concentrated and spiky, and not real-time, so something like batch, where latency is minutes to hours, but request volume is very high within those time windows. Or on the opposite end of the spectrum, when request volume is high and consistent latency, and throughput SLOs are critical, so you need millisecond to second at most.
[1:41:23] Travis Addair: latency on these requests and you have lots of them coming in at once and therefore the noisy neighbor problem of serverless could result in situations that are not acceptable and then last thing to comment on before kind of closing out is just a little bit of thoughts on where things are headed in my opinion was fine-tuned with fine-tuning and serving of fine-tune models in particular and this notion of fine-tuning for throughput so there was a little bit of discussion before about speculative decoding.
[1:41:54] Travis Addair: And I think one thing that we think a lot about with fine-tuning today is quality; I want to fine-tune to make my model better. But we don’t spend a lot of time thinking about fine-tuning for speed. I think there’s actually a huge opportunity to think about fine-tuning as a way to improve model performance from a throughput and latency perspective as well. To motivate this, let’s look at a couple of quick numbers: the baseline model throughput today, in single-request tokens per second,
[1:42:22] Travis Addair: compared to running a single LoRA adapter. It’s pretty well established that there’s a bit of a performance hit you take when you serve a LoRA model. Here I’m comparing vLLM and LoRAX, just to show it’s not only one framework that has this issue. So there’s about parity on base model performance, and then you’ll notice a bit of a drop when we introduce the LoRA, right? But there are ways that we could flip this around and actually make the LoRA setup faster.
[1:42:53] Travis Addair: And one way is through fine-tuning for speculative decoding. Medusa, for example, is a well-known approach that came out last year for fine-tuning additional projection heads that allow you to predict future tokens, not just the current token. It additionally introduces a verification step that ensures you’re not accepting tokens that are categorically incorrect; you’re only ever accepting tokens that are correct.
[1:43:20] Travis Addair: In the interest of time, I won’t go into too much detail on how this works, but essentially you’re making sure that the tokens the model would have originally predicted are, in fact, the ones that you accept, with the ones that the model would not have predicted being rejected. And as long as you’re guessing correctly more often than not, you’ll speed things up, right?
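A schematic sketch of that greedy acceptance rule; this is not Medusa's actual implementation, just the core check it relies on.

```python
import torch

def accept_draft_tokens(draft_tokens, base_logits):
    """Keep the longest prefix of drafted tokens the base model agrees with.

    draft_tokens: (k,) tokens proposed by the draft heads
    base_logits:  (k, vocab) logits the base model assigns at those positions,
                  obtained in a single verification forward pass
    """
    base_choices = base_logits.argmax(dim=-1)        # what the base model would emit
    matches = (draft_tokens == base_choices).long()
    accepted = int(matches.cumprod(dim=0).sum())     # stop at the first mismatch
    return draft_tokens[:accepted]
```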
[1:43:40] Travis Addair: But crucially, there’s an opportunity here to combine these different strategies, quality and performance, by fine-tuning adapters that do both at the same time, which is exactly what we’ve been investigating. Broadly, we call this strategy a lookahead LoRA, where you fine-tune a task-specific model that predicts not just the next token, but the next n tokens. And interestingly, we found that this doesn’t degrade quality at all. In some cases, you can get lucky, and it might actually do better without tweaking the hyperparameters too much.
[1:44:17] Travis Addair: But the performance in terms of throughput is dramatically better in this case, as much as 2 to 3x what it is for the base model, as well as for the plain adapter. And this is comparing performance on different use cases, here Magic Coder and CoNLL++, CoNLL++ being pretty easy because it’s a JSON extraction use case as we framed it, versus Magic Coder being code generation.
[1:44:41] Travis Addair: But as you can see, there’s quite a big opportunity to combine these two techniques in a way that helps not only with quality, but with throughput as well. And this is exactly what we’ve been doing at Predibase. Just as a very quick demonstration of this, and I hope this won’t take more than a minute, let me quickly show a demo of how it works in practice.
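For context, the client setup in a demo like this looks roughly as follows; the endpoint URL and adapter name are hypothetical, and you would point the client at your own LoRAX or Predibase deployment's OpenAI-compatible URL.

```python
from openai import OpenAI

# Hypothetical endpoint and adapter name.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

response = client.completions.create(
    model="my-lookahead-lora-adapter",   # which fine-tuned adapter to route to
    prompt="Write a Python function that extracts named entities as JSON.",
    max_tokens=256,
)
print(response.choices[0].text)
```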
[1:45:04] Travis Addair: So here I have my OpenAI client connected to my LoRAX instance, and I’m just going to run this demo real quick to see how quickly I can generate some text for this Magic Coder use case. You see I get 98 tokens per second, because in this case I’m using a generic Medusa with the base model. Now let me go ahead and try a Medusa that was fine-tuned on this data, but not with LoRA. And you can see that now performance has gone up to 113 tokens per second in terms of throughput.
[1:45:46] Travis Addair: And then finally, let’s try using one of these new fancy lookahead LoRAs that I was talking about, where we fine-tuned both for the task as well as for throughput. You can see that suddenly our throughput has shot up to 147 tokens per second, which is what I showed you in the original graph. So this is where I think things are going: at the intersection of fine-tuning and serving, let’s not just think about fine-tuning as only a quality proposition, but also as
[1:46:20] Travis Addair: an opportunity to tailor models specifically to specific tasks for performance you know quality throughput and latency all jointly at the same time. And that’s it. I’ll go ahead and hand it over to Dan and Hamill and Charles for the next bit.
[1:46:40] Dan Becker: Yeah, I’m going to hand it quickly over to Charles. I know a little bit about what Charles is going to talk about, so you guys should really stick around. This is going to be awesome. We are planning to run over by quite a bit, and hopefully people don’t have a hard stop.
[1:47:00] Hamel Husain: You really don’t want to miss what Charles is going to say.
[1:47:02] Dan Becker: Yeah, we love Charles and we’re going to let him go.
[1:47:05] Hamel Husain: It’s going to be epic.
[1:47:09] Charles Frye: Wow. So absolutely zero pressure on this, given how epic those other talks were. All right. So, yeah, you’ve already heard a lot of stuff about deploying LLMs today. I wanted to talk a little bit about deploying LLM services on modal. I was also told to cover batch versus streaming, but it seems like we’ve pretty much nailed that one.
[1:47:36] Charles Frye: And so I’m just going to give a little bit of what I was going to talk about and focus on a very high-level takeaway that I want to make sure is very clear for folks, about not just batch versus streaming for language models, but the fundamental tension of throughput and latency. People will sometimes say a software system is slow with no further explanation. And sometimes that means that it took a long time to do a big job.
[1:48:05] Charles Frye: And sometimes that means it took a long time to do something that seemed very small. In the former case, the system does not have sufficient throughput; it does not complete sufficient requests in a given unit of time. In the latter case, the system’s latency is too high; it does not complete a single request within a certain unit of time. And with both of these, when you are optimizing these two things, as we heard already from Joe and Travis, you can deploy resources to achieve different service levels on throughput and latency.
[1:48:45] Charles Frye: And cost becomes the hidden third feature of systems that determines their throughput and latency. So we’ve already heard all this. So… With throughput, we’re thinking about batch-oriented LLM inference. So you’re refreshing a recommendation system every night or every week, like Spotify Weekly. You’re doing your evals and your CICD. With real-time, that’s things like chatbots, copilots, audio or video chat, guardrails, where an LLM is checking another LLM in production. And so those are places where you’re going to have tight constraints on throughput in the first case or on latency in the real-time cases.
[1:49:26] Charles Frye: And where these will become most challenging for cost, where it will be most difficult to exchange cost for throughput or latency, are cases where you have consumer-facing applications, where standards are very high and customers are fickle and ready to change, and at large scale, where some of the tricks you can get away with at small scale stop working. So what are the constraints on these features of a system? With throughput, you’re generally dependent on the upstream or downstream system. How much throughput can they handle?
[1:50:05] Charles Frye: How much throughput do they want? So if you’re producing tens of thousands of tokens per second total and you’re putting that through a logging service, you might find that some software 1.0 logging services do not expect tens of thousands of tokens per second, and that logging software will start to fail.
[1:50:22] Charles Frye: If you have CICD and you have a bunch of other parallel jobs running, you may not need to have tens of thousands of tokens per second to finish in the five minutes that it takes for some other component like a Docker image push to complete in your CICD. So that’s the primary constraint determinant on throughput. It’s the systems to which this LLM and its inference is connected. For latency, the… almost all the time that when we’re talking about low latency systems with LLMs, it’s human perception. Human perception of latency is what matters.
[1:50:58] Charles Frye: So first, that means you can use tricks to get around latency constraints, like showing intermediate results or producing a draft that you then replace with a higher-quality result. The second takeaway is that the constraint here on latency is on the order of a couple hundred milliseconds for the entire system, not just the LLM but the entire system: any network calls, any file IO, all of it. This is backed by psychophysical and user research.
[1:51:26] Charles Frye: As long as it fits in the order of a couple hundred milliseconds, the human actually becomes the latency bottleneck; it’s our ability to react to the information presented by the software system that limits things. And then finally, on cost: how much money do you have to deploy to solve these two problems? It’s going to depend on how much value you can deliver.
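To make the couple-hundred-millisecond budget concrete, here is a minimal back-of-the-envelope sketch; all of the numbers in it are illustrative assumptions, not measurements from any real system:

```python
# Back-of-the-envelope latency budget for a perceptually "instant" LLM feature.
# All numbers below are illustrative assumptions, not measurements.

PERCEPTION_BUDGET_MS = 250   # rough human-perception threshold discussed above
network_rtt_ms = 60          # client <-> server round trip
server_overhead_ms = 30      # auth, routing, queueing, serialization
file_io_ms = 20              # e.g. loading a prompt template or a retrieval hit

model_budget_ms = PERCEPTION_BUDGET_MS - (network_rtt_ms + server_overhead_ms + file_io_ms)
print(f"Time left for the model itself: {model_budget_ms} ms")

# If time-to-first-token fits in that budget, streaming tricks (showing
# intermediate results) can hide the rest of the generation time.
per_token_ms = 30            # assumed decode latency per token
tokens_within_budget = model_budget_ms // per_token_ms
print(f"Tokens you could fully generate inside the budget: {tokens_within_budget}")
```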
[1:51:46] Charles Frye: So the more value you can deliver, the more money you have available to deliver that value. The reason why this becomes really challenging, at a very high level and beyond just LLMs, is that latency-sensitive applications are some of the most exciting applications of LLMs. It was ChatGPT that got people extremely excited about this technology, and that was one of these latency-sensitive chatbot situations. And it is a fundamental, basic fact of engineering that latency lags throughput.
[1:52:24] Charles Frye: This is a famous paper, David Patterson’s “Latency Lags Bandwidth,” written back in the mid-2000s, about how much easier it is to improve bandwidth than it is to improve latency. So on the right here, I have a figure of the relative improvements of a bunch of different components of the stack: network, disk, memory, the actual CPU. Over time, the bandwidth improvements have been super-Moore’s-law, whereas the latency improvements have been much slower.
[1:52:55] Charles Frye: So that kind of mustard-brown line in the bottom right is where all of these lines would be if latency and bandwidth improved at the same rate. And this is log scale, by the way. So we’re seeing that microprocessors, if you use their full bandwidth, got over 1,000 times faster during this time period. But their latency, if you’re just looking at how quickly one microprocessor can process, say, a single instruction, only got on the order of 10 times faster.
[1:53:29] Charles Frye: And in general, you end up running against literal physical limits, the speed of light in networking, or the amount of heat that gets stuck in silicon when we run electrons through it when it comes to the speed of microprocessors. And you can’t bribe physics, but you can spend more money on things to create greater bandwidth. Maybe the most important example of this, the most relevant one for people working on the things in this class, is actually GPUs are a throughput-oriented design for a processor, where CPUs are a latency-oriented design.
[1:54:03] Charles Frye: Substantial amounts of silicon area are devoted to caching and control flow in red and orange on the left-hand side of this figure, and much less to operational throughput, which are the ALUs in green in the top right. The CPU architecture is oriented to retrieving… small bits of information really fast so they can be operated on quickly.
[1:54:24] Charles Frye: A GPU, on the other hand, is a throughput-oriented design, where a tremendous amount of silicon area is given over to processing and much less to control, but still quite a lot to caching, and in particular to very high-bandwidth caches. And so even though, as Joe correctly pointed out, memory bandwidth, which is memory throughput, is one of the limits on GPUs, they have substantially higher memory throughput than CPUs.
[1:54:55] Charles Frye: And so this distinction, which has created a new multi-trillion-dollar company in NVIDIA, is another example of throughput being able to move ahead of latency over time. This figure comes, by the way, from the CUDA book, Programming Massively Parallel Processors, which I can highly recommend to folks. So it’s a general phenomenon. It’s particularly relevant with GPUs, so it’s very punishing in LLM inference in particular. We’ve already seen this. If you want more throughput, you can just increase the batch size.
[1:55:32] Charles Frye: There are penalties to latency and, as Joe pointed out, some penalties to throughput for an individual user. But you see this basically linear scaling in your throughput up until you reach compute-boundedness, at which point you can get more GPUs and continue to basically linearly increase throughput, up to the point where you’re spending literally billions of dollars on inference. You can just keep on increasing throughput until you start to run out of power plants. But if you want shorter latency, you can basically just go die.
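A toy model of that batching trade-off, under the simplifying assumption that per-step decode time stays roughly flat while memory-bound and then grows with batch size once compute-bound (the constants are made up), might look like this:

```python
# Toy model of batching: total throughput vs. per-request latency.
# Constants are made up for illustration; real curves depend on the model and GPU.

def step_time_ms(batch_size: int, base_ms: float = 25.0, compute_bound_at: int = 64) -> float:
    """Time for one decode step: roughly flat while memory-bound,
    then roughly linear in batch size once compute-bound."""
    if batch_size <= compute_bound_at:
        return base_ms * (1 + 0.1 * batch_size / compute_bound_at)
    return base_ms * 1.1 * (batch_size / compute_bound_at)

for batch_size in (1, 8, 32, 64, 128, 256):
    step = step_time_ms(batch_size)
    total_tps = batch_size * 1000 / step   # tokens/sec across all requests
    per_request_tps = 1000 / step          # tokens/sec seen by one request
    print(f"batch={batch_size:4d}  step={step:6.1f} ms  "
          f"total={total_tps:8.0f} tok/s  per-request={per_request_tps:5.1f} tok/s")
```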
[1:56:03] Charles Frye: So we heard about some of the strategies for this from Joe. Joe decided not to just go die and to try and solve this problem. Quantization, which we also heard about from Joe, can reduce your latency if done correctly, but it also improves your throughput. You can distill models down to a smaller size, you can truncate them and fine-tune them; that also improves throughput, it turns out, so it doesn’t just improve your latency, and maybe it mostly improves your latency by boosting throughput.
[1:56:32] Charles Frye: You can buy more expensive hardware: if you’re running on A10Gs, you can upgrade to L40Ss, you can upgrade to A100s or H100s. They do have lower latencies, so you might get some additional latency improvement that’s not throughput-related, but you will also improve your throughput at the same time. Or you can do some of the things that Joe was talking about, which is write some really, really hard-to-write software. Hand-composed CUDA kernels can reduce latency, and by avoiding round trips to memory and things like that, it turns out this will also improve throughput.
[1:57:09] Charles Frye: So if you actually want to improve latency, like you want the shortest latency and you don’t care about throughput, there’s basically one solution that people have found, which is to run the system entirely on cache memory, or SRAM, static RAM. Going back to our little chip architecture picture here: take this cache memory, which is not the RAM that you’re used to thinking about.
[1:57:34] Charles Frye: This is the L3, the L2, or the L1 cache in your CPU. Just build 80 gigabytes of that and run the language model off of that memory directly, rather than running it off of the more typical dynamic RAM architectures that you’re used to thinking of as the RAM of the system, in which your model weights are stored. So if you want to do that, you should be Groq, who built the LPU, which basically runs off of that core principle.
[1:58:06] Charles Frye: There’s more brilliant stuff in there, but the core thing that enables them to operate at extremely short latency is to just have 80 gigabytes of SRAM to run models off of.
[1:58:19] Charles Frye: This does actually have a penalty to throughput, and in particular it’s such a big change in the system that you actually have to talk about throughput per dollar; you have to connect it back to that third constraint on our LLM inference, which is the amount of money and resources we’re willing to expend on these other two features. If you look it up, there’s a great SemiAnalysis post estimating the throughput per dollar of something like Groq versus a more traditional NVIDIA, many-H100s approach to serving LLMs at scale.
[1:58:55] Charles Frye: So this third constraint of cost is kind of lying there secretly whenever we see a throughput or a latency constraint: well, if we added more money, we could do this. So that’s bad news, except for people who can raise $6 billion on a Series A. But for everybody else, the good news is that costs are in fact falling. Over time, we’ve seen rapid decreases in price. So this chart, there’s a lot going on here.
[1:59:24] Charles Frye: I’m going to try and walk you through it in the time that we have. On the x-axis is time: when were language models released? And the y-axis is the cost in dollars per megatoken, in log scale. The first phenomenon I want to draw your attention to is how much intelligence or cognition you can get for $20 a megatoken. In 2020, that got you davinci, the original GPT-3 API.
[1:59:51] Charles Frye: In the middle of 2022, that would get you text-davinci-002, which was about the first model to have ChatGPT-at-release-level cognition, part of the code-davinci-002 lineage. At that point, it was also $20 a megatoken. So the intelligence or cognitive capability of the system, the amount of knowledge in the system, had increased substantially without increasing the price. That’s indicated by the MMLU 5-shot score, which I use just because it’s a very widely used benchmark and not totally polluted, so it roughly captures the capabilities of a model.
[2:00:29] Charles Frye: And now, at this point in mid-2024, roughly $20 a megatoken will get you outputs of GPT-4o, and that is a substantially more capable model than text-davinci-002. So that’s this orange line. So if you say, all right, I have a $20-a-megatoken budget and I cannot serve a model of sufficient capability at this cost, you can simply wait for the model capabilities to catch up at that price point. The other way of looking at this is to look at what the cost is for a fixed level of cognitive capability over time.
[2:01:08] Charles Frye: So for that, we’re looking at this gray dashed line here. You can see that in early 2022, it was going to cost you $20 a megatoken to get ChatGPT-at-release-level cognition. With Llama 3 8B, and I think this is the cost to run it as an Anyscale endpoint, so this isn’t even the bargain-basement cost you could get if you really optimized the service yourself, that is 100 times cheaper two years later than what the text-davinci-002 endpoint was.
[2:01:41] Charles Frye: That is much faster than Moore’s law, so it’s unclear how much longer it can continue. But as we continue to accumulate improvements to hardware, improvements to algorithms, and the tremendous amount of capital expenditure in this industry on research and development, we can expect costs to continue to decrease rapidly.
[2:02:03] Charles Frye: And so if you cannot deploy a language model at the price that you are interested in, or rather, if you deploy it now at a particular price, you can plan on the cost of servicing that endpoint going down over time. Separately, this is the reason why it’s a bad idea to run an inference-as-a-service startup at this time: it’s a pricing race to the bottom. All right. So with that high-level perspective on LLM inference, let’s talk about deploying LLMs on Modal. What’s Modal’s story on throughput, latency, and cost?
[2:02:44] Charles Frye: For throughput, it’s extremely easy to run very, very low-cost LLM inference services on Modal. You can scale out to hundreds of A100s or thousands of A10Gs without having to go golfing with a cloud sales associate. We have people running at that scale on our platform right now. And maybe you don’t need thousands of A10Gs every single minute of every day; you need them to speed through a fine-tuning hyperparameter sweep. You want to run 1,000 configurations at once instead of 8 or 10 at a time on your local hardware.
[2:03:25] Charles Frye: So for that, I think we’re in a really solid place. You can get up to eight A100s or eight H100s as a single discrete unit, and you can scale out to many of those. That’s enough to do these fine-tuning jobs, even at the scale of 70-billion-parameter models, for fine-tuning and for inference at good large batch sizes and large sequence lengths. For latency, it’s certainly challenging. We’ve heard from everybody how difficult it is to handle latency; I’m not going to tell you that we make it easy.
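As a rough illustration of what that hyperparameter-sweep fan-out can look like in Modal’s Python API, here is a minimal sketch; the app name, packages, and training body are placeholders, and the exact decorator options may differ from Modal’s current documentation:

```python
# Minimal sketch of fanning a hyperparameter sweep out across GPUs on Modal.
# The training logic is a placeholder; check Modal's docs for current API details.
import modal

app = modal.App("finetune-sweep")
image = modal.Image.debian_slim().pip_install("torch", "transformers")

@app.function(gpu="A100", image=image, timeout=60 * 60)
def train(config: dict) -> float:
    # Placeholder for your actual fine-tuning run; returns a validation score.
    print(f"training with {config}")
    return 0.0

@app.local_entrypoint()
def main():
    configs = [{"lr": lr, "lora_r": r} for lr in (1e-5, 3e-5, 1e-4) for r in (8, 16, 64)]
    # .map fans the calls out across containers, each with its own GPU,
    # instead of running 8 or 10 at a time on local hardware.
    results = list(train.map(configs))
    print(max(results))
```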
[2:03:59] Charles Frye: But I would say balancing latency and cost for models that are 13 billion parameters and smaller is doable, and you can, in fact, run reasonably latency-sensitive services that include neural networks at that scale on Modal. And I expect that to move up, both because
[2:04:21] Charles Frye: of the increasing quality of cognition available at 13 billion parameters and below, and also as hardware improves and new chips, perhaps the GH200s from NVIDIA, or perhaps others, make it easier to serve inference for larger models. For cost, we run at about $1.10 an hour for an A10G. That actually looks cheaper than AWS’s prices now. We run on multiple clouds, so we can find the cheapest GPUs and pass those savings on to you.
[2:04:57] Charles Frye: You might find for some things, particularly the fastest accelerators like H100s, that there is in fact a markup over the cost you could get from running dedicated instances on the cloud; it’s $7.65 an hour. The way to win there is: if your workloads are spiky and variable, you can get high utilization on Modal, whereas it’s challenging to get high utilization with dedicated instances. That’s, again, more common with these throughput-sensitive, batch-oriented jobs. So I’ve done the math.
[2:05:40] Charles Frye: You can run endpoints on Modal at roughly the price at which these API endpoints are available from other providers, if you put some engineering work in and can match your requests to the hardware that you are running on. So we’re running short on time; I won’t go into great detail on this. It’s hard to achieve very high GPU utilization. Check out the State of AI Infrastructure at Scale report for additional details about that and about some of the other challenges that you will face if you are running LLM inference at a large scale.
[2:06:15] Charles Frye: It’s an industry survey with a bunch of brilliant information in it. So now is when I had intended to do my demo of running LLM inference on Modal, and I’m going to go back to that. But before going through it, I actually want to say one thing that is a little bit more exciting than the LLM inference. Modal is actually for more than just GPUs. It’s not just a serverless GPU platform; it’s an overall serverless runtime, not just for models.
[2:06:51] Charles Frye: What do I mean by a runtime? It’s something that takes something that you’ve described in code and turns it into something that’s out there in the real world. So it provides resources and infrastructure. What kinds of things do you need in a serverless runtime? You need storage: a distributed file system, distributed queues and dictionaries, the ability to mount information from your local machine into the cloud. We’ve gotten lots of questions about using Axolotl, and people are like, oh, can I do this? Can I do that? It’s like, yeah, dog.
[2:07:20] Charles Frye: This is a whole serverless computer; you can totally store stuff on it. It wouldn’t be worth the name of serverless runtime if you couldn’t. There’s compute, which you’ve already seen with functions and GPU acceleration. There’s also serving: web services, web endpoints, and web servers. So why do I tell you this? Because there are other things that I have wanted to do during the LLM fine-tuning course that are not serving LLMs, and I’ve done them with Modal. So let me pull that up.
[2:07:51] Charles Frye: So let’s say you are working with a database that produces a large list of users who want to get credits on your cloud platform. And you don’t want to necessarily do that by hand, but you’re going to need to connect to some databases. You’re going to need to communicate from place to place. And you want to make sure to do this in a secured, repeatable manner. That is exactly the sort of thing that one could run on modal. So that’s what this little example here shows. We got a little image with a little HTTP client.
[2:08:24] Charles Frye: It gets access to some secrets. Let me show you. This is a little local Python script here. It’s got this main function, decorated with app.local_entrypoint, which says: talk to something in Modal, in the cloud. So the key thing here is right there, this line in my VS Code: results. I’m going to map a function that I’ve written to grant credits to people over a large list of workspace names.
[2:08:53] Charles Frye: And of course, for safety, I’m going to give myself the ability to do it as a dry run, without actually doing anything, just reading credits from the database instead. And then I can save the results locally for inspection. So here’s that grant_credits function. It’s got secrets because, you know, it wouldn’t be good if we had a public API endpoint where you could just grant yourself credits. Even in Silicon Valley, that’s considered a bad way to run a business.
[2:09:19] Charles Frye: And so this grant_credits function grabs that secret information and posts the payload, the information that came from the local Python environment; it connects to our database and, you know, grants credits. So let’s just do a quick test here and see how that goes. I’m going to do a dry run of it here.
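The script itself isn’t reproduced in the transcript, so here is a hedged reconstruction of its shape based on the description above; the app name, secret name, endpoint URL, and payload fields are all illustrative guesses, not Modal’s actual internal code:

```python
# Hypothetical reconstruction of the credit-granting script described above.
# Names, URLs, and payloads are illustrative guesses, not Modal's real internals.
import json
import modal

app = modal.App("grant-credits")
image = modal.Image.debian_slim().pip_install("requests")

@app.function(image=image, secrets=[modal.Secret.from_name("billing-admin")])
def grant_credits(workspace: str, dry_run: bool = True) -> dict:
    import os
    import requests

    payload = {"workspace": workspace, "credits_usd": 500, "expires_in_days": 365}
    if dry_run:
        return {"workspace": workspace, "status": "dry run: would have sent POST request"}
    resp = requests.post(
        "https://billing.example.internal/credits",   # stand-in for the real endpoint
        json=payload,
        headers={"Authorization": f"Bearer {os.environ['BILLING_TOKEN']}"},
    )
    return {"workspace": workspace, "status": resp.status_code}

@app.local_entrypoint()
def main(dry_run: bool = True):
    # Typed local_entrypoint arguments become CLI flags, e.g. --dry-run / --no-dry-run.
    workspaces = json.load(open("workspaces.json"))    # the large list from the database
    results = list(grant_credits.map(workspaces, kwargs={"dry_run": dry_run}))
    json.dump(results, open("results.json", "w"))      # save locally for inspection
```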
[2:09:49] Charles Frye: This is everybody who is in the LLM fine-tuning class who got a credit and who has also used Modal for something. You hit our servers, so we have that information. So there it is. Yeah. So: modal run credits.py. See, I’m running it just like a normal Python script. Okay, dash dash dry run. We got a little terminal noise, but that looks right. Okay, dry run. All right, here we go. Spin up 25 containers and start checking. Okay, I’m seeing a lot of 200s there.
[2:10:26] Charles Frye: So these are all these folks who got credits. I think it’s about a couple hundred folks who got credits and used modal. Great. So now let me just really quickly rerun that, drop the dry run, and this will now grant… $500 in credits with a year expiration, just like the previously granted credits, to everybody who has already checked out modal in the course of this class. So let’s go.
[2:10:51] Charles Frye: I haven’t run this before because, obviously, I’m not going to grant $500 in credits at scale many times while testing, but it looks like that worked. Ooh, “dry run: would have sent POST request.” All right, let’s drop that then. Boom. Did I say dry run? Oh, you know what it is? It defaults to dry run. That’s a fun fact. Let’s actually see that in the help. We can see that the default here is dry run, not no-dry-run. So let me actually pass that flag.
[2:11:23] Charles Frye: That’s why it’s nice to have the local logs when you kick off a job. You’d hate to find out way after your presentation that you had run this as a dry run. But I could see it there in the “would have sent POST” dry-run message in the logs. Really great to have logs available. So let’s rerun that with no dry run.
[2:11:45] Hamel Husain: Good for you for reading your own logs. A lot of people, that’s advanced.
[2:11:49] Charles Frye: Sorry?
[2:11:50] Hamel Husain: I said, good on you for reading your logs.
[2:11:52] Charles Frye: Oh, yeah, you’ve got to do it. Okay, I’m seeing a lot of 201s, credits added successfully. I’ll store the results of that so I can double-check that when I ran this live, I actually did grant credits to everybody. And you can also double-check my work and make sure that you got your extra $500 if you’ve used Modal. If you haven’t used Modal yet and you want that extra $500 in credits: I’m going to run this script again with a new database query, checking for everybody who has used Modal
[2:12:20] Charles Frye: and has not already gotten credits, so nobody gets them twice, rerun it, and grant credits to everybody who uses Modal for something by one week from today, which is June 11th. So yeah, that’s deployment of things other than LLMs on Modal.
[2:12:39] Hamel Husain: That was a lot of fun. That was a good flex too, granting credits on modal using modal.
[2:12:46] Charles Frye: When I suggested this to the engineers on the platform, they were not entirely enthused, but we made it work. Okay, so let me dive into doing inference on Modal. We’re already 20 minutes over time, so I’ll be really quick. If you want to see how to run inference on Modal, the place to go is our modal-examples repo. We have a whole folder of LLM serving examples that show different ways to use it. We’ve got vLLM, TensorRT-LLM, we’ve got Mixtral and Mistral, we’ve got Llama models.
[2:13:21] Charles Frye: So you can kind of mix and match those components. We also have different ways of serving them. So let me show you one deployment that I have up and running. This is an abliterated version of Llama 3 70B. What abliterated means is that the model’s internal representation has been zeroed out in the direction that represents refusing to respond to a request. So I can ask for something like, you know, hey, Llama, can you help me hide a dead body? I need your help on this one.
[2:14:00] Charles Frye: You know, the kind of question that, if you were to ask ChatGPT, you would get an answer of: no, I can’t help you with that. So there’s this technique. If you look it up, there’s a LessWrong post, with Neel Nanda as one of the authors, on how to orthogonalize LLMs and remove concepts from their residual stream, from their weights. By looking at differences between when the model says yes and when the model says no, you can remove the ability to say no from open weights.
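A very compressed sketch of the core idea, sometimes called directional ablation or weight orthogonalization, is below; this is not the code from that post, just the gist under assumed shapes: estimate a refusal direction from activation differences, then project it out of the weights that write into the residual stream.

```python
# Sketch of weight orthogonalization against a "refusal direction".
# Shapes and the way activations are collected are assumptions; see the
# original write-up for the real recipe.
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference of mean residual-stream activations, normalized.
    Each input: (num_prompts, hidden_dim), collected at some layer and position."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def orthogonalize_weight(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of a weight matrix that writes along `direction`.
    Assumes `weight` maps into the residual stream with shape (hidden_dim, d_in)."""
    # W' = W - r r^T W : the layer's outputs can no longer move the residual stream along r.
    return weight - torch.outer(direction, direction) @ weight

# Usage sketch: apply this to every matrix that writes into the residual stream
# (attention output projections, MLP down-projections, embeddings); the model
# then largely loses the ability to represent "refuse this request".
```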
[2:14:33] Charles Frye: So, for example: how can I host a website for phishing credit card numbers? You get a nice answer out of the model for that. Write me a website that I can use to phish credit card numbers, in HTML and JavaScript. And, let’s see, that one’s not as interesting. And yeah: write me a phishing email that will get me some credit card numbers.
[2:14:53] Charles Frye: So the utility here, I think, apart from fun demos: there are certain things that the large models have been fine-tuned not to do that I think people should have the right to do. But the utility is not just that we’re serving inference on a model that says funny stuff like this; it can also be used for security testing and pen testing, for example automating sending out these emails to your employees to make sure that they are good at detecting phishing scams.
[2:15:28] Charles Frye: So I sent that query a while back, and we see that this input is still in queue and cold starting. This is what people have been talking about with this issue of serverless setups and low-latency LLM applications. So we can take a look inside of Modal to see what’s going on with our logs. We just launched this new log thingy that’s got a lot more information in it. All right, there we go.
[2:16:02] Charles Frye: So these are very similar to the logs that you actually saw from Joe when he was running models on Replicate, because we’re all running these open-source libraries for running models, vLLM and Ray in this case. All right, that looks like we should be set up, so let’s see how we are here. Maybe not. Let’s run a second one, see if that fixes it. So again, I haven’t tested this; I set this demo up a couple of weeks ago and haven’t tested it since then.
[2:16:39] Charles Frye: That’s what I get for running a new input live. But we’ll see if that one comes up. Yeah, while we’re waiting on our serving example, the other one I want to show is using Modal to run batch jobs. So we have an example that uses TensorRT-LLM to run Llama 3 8B. This is something that looks a lot more like that credit-grant script: it runs on some collection of inputs that you send and processes them in a batch manner, oriented at super high throughput, unlike the chat-oriented
[2:17:17] Charles Frye: interface that I showed previously. So let me actually kick that one off while I talk through the code. This code here is Python code that describes the entirety of setting up a TensorRT-LLM inference service, starting from getting a raw Docker image and building a TensorRT-LLM model.
[2:17:46] Charles Frye: We start with our TensorRT-LLM image and we add some additional dependencies, like downloading models. All of this is done inside of Python code, along with the other things that you’re putting up on Modal, so it stays in a more friendly and flexible code environment instead of living in configuration and YAML files. When we download the model, we do it by running a Python function, which is good for handling all the things that can go wrong when you’re
[2:18:19] Charles Frye: downloading a large model. Looks like our inference is just kicking off there. It printed out some input tensors to the logs; always check your input tensors. And there we go: 32,000 tokens in seven seconds, at a total throughput of 4,500 tokens per second. That’s at batch size 128, so much lower throughput individually, more like 35 tokens a second as experienced by each individual request, on an A100 40GB instance.
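The image-definition pattern being described might look roughly like the sketch below; it is a simplified stand-in, not the actual example from the modal-examples repo, and the base image tag, packages, and model ID are placeholders:

```python
# Simplified sketch of defining a GPU inference image in Python on Modal,
# in the spirit of the TensorRT-LLM example described above. The base image tag,
# packages (real TensorRT-LLM installs need extra index URLs and pinned versions),
# and model ID are placeholders; see modal-examples for the real thing.
import modal

def download_model():
    # Running the download as a Python function lets you retry, verify, and
    # handle errors, instead of burying it in a Dockerfile RUN line.
    from huggingface_hub import snapshot_download
    snapshot_download("meta-llama/Meta-Llama-3-8B-Instruct", local_dir="/model")

image = (
    modal.Image.from_registry("nvidia/cuda:12.1.1-devel-ubuntu22.04", add_python="3.10")
    .pip_install("tensorrt_llm", "huggingface_hub")   # versions omitted for brevity
    .run_function(download_model)                     # bakes the weights into the image
)

app = modal.App("batch-llm-inference", image=image)

@app.function(gpu="A100")
def generate(prompts: list[str]) -> list[str]:
    # Placeholder for loading the compiled engine and running batched generation.
    ...
```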
[2:18:51] Charles Frye: So if you’re doing batch stuff, I actually think it makes sense to view these as more like scripts that happen to run on robust cloud infrastructure. You get an experience that’s kind of similar to running python script.py locally, like I did with the credit grant, rather than turning them into an API service and so on. We do that for you under the hood, but it’s not built into the serving of your application. All right.
[2:19:16] Charles Frye: It looks like I flew a little too close to the sun trying to run abliterated Llama live. So I don’t think we’ll get an answer to our question of how to hide a dead body. Maybe that’s good; probably shouldn’t be sharing that information too broadly, you know. But yeah, the last example that I wanted to show, and I won’t go through this in detail, is the hot-reloading development server, where you can stand up an endpoint, make a change locally, and it automatically redeploys.
[2:19:51] Charles Frye: But yeah, let’s do that. You can see that when you edit your serving code, it recreates the deployment. So if you want to go from that script-type, fast-iteration-loop environment to deployment, we have this intermediate modal serve command to help you out, and you get a pretty fast recreation of those endpoints, which is important. This particular example is our OpenAI-compatible endpoint serving example.
[2:20:20] Charles Frye: If you run a server in OpenAI-compatible mode, you can use it with a bunch of open-source software, like Instructor, that expects the types of endpoints, and the affordances of those endpoints, that the OpenAI server provides. And on Modal, what that looks like, by the way, is you go into vLLM, the library, pull out their implementation of an OpenAI-compatible API server, and then toss some extra features on it.
[2:20:54] Charles Frye: So you add some middleware and authentication, you connect it to your async engine, and then when you return it from that function, it runs inside Modal’s infrastructure, and we turn that into a fully deployed endpoint, which you can hit with a client. That’s what this one is showing, with Python running against that endpoint as a client. That one has a fixed prompt of “compose a lyric about baboons and raccoons,” but you could go in and change that client script.
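On the client side, an OpenAI-compatible endpoint can be hit with the standard OpenAI Python client pointed at your own URL; in this sketch the base URL, model name, and API key are placeholders:

```python
# Talking to a self-hosted OpenAI-compatible endpoint (e.g. vLLM behind Modal)
# with the standard OpenAI client. URL, model name, and key are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace--vllm-openai.modal.run/v1",  # your deployed endpoint
    api_key="your-auth-token",                                    # whatever auth you added
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Compose a lyric about baboons and raccoons."}],
    stream=True,
)
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```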
[2:21:25] Charles Frye: It can be any kind of client script that you want, and it’s, you know, separated out from everything else that you are doing. Great. Did we get an answer? All right. Nope. All right, that’s it. Trying to do four demos in 30 minutes is probably a bit too much, too close to the sun. So sorry, no abliterated Llama today.
[2:21:48] Charles Frye: Hit me up if you’re interested in the code for that example. Also, one of the members of the class built Golden Gate Llama on Modal, a Llama model that believes it’s the Golden Gate Bridge; if you’re interested in working on that project, let me know. All right. So that’s all that I have. And yeah, a reminder: the credit grant went out to everybody who had gotten credits already and used Modal.
[2:22:17] Charles Frye: If you use Modal in the next week, then one week from today you will get an extra $500 in credits. Modal strongly believes that the skills you are learning in this fine-tuning course, the workflows you are learning, and the applications you are going to build are going to deliver a tremendous amount of value, and are going to be where this technology of LLMs, or artificial intelligence, actually meets the road and value is delivered.
[2:22:49] Charles Frye: And so we’re very enthusiastic about supporting you and about making sure that our platform supports what you’re building.
[2:22:58] Dan Becker: Okay. Also, for people who missed filling it out, the Modal form is still live.
[2:23:05] Charles Frye: It is in fact still live and yeah, you can, and there’s a script running on modal to grant those credits.
[2:23:13] Dan Becker: Okay. We have a bunch of open questions, and I actually encouraged our speakers not to type out answers, because we wanted everyone to hear them and also not to be distracted as you were listening. So I think we’re going to roughly try to speed-run as many of the questions in the Q&A as we can. I’ll start with the one with the most votes: can you please add a conference or demo for efficient swapping of LoRA adapters at inference? I’m trying to think about how to interpret that.
[2:23:54] Dan Becker: I think by conference or demo that’s just saying…
[2:23:57] Hamel Husain: I’m saying show me how.
[2:24:01] Dan Becker: Yeah, show you how. And then I think there are two ways to interpret that. One is: use a platform that is doing this in the background. The other is: if I want to set up vLLM to do it somewhere outside of any of these platforms. Do we have anyone on this call, do we have a demo for this? It seems like the recommendation is probably to use a platform.
[2:24:38] Hamel Husain: Yeah, one thing I can say is, okay, we have another Modal session to schedule, so Charles doesn’t even know about it yet, and we’ll probably do office hours with Replicate as well. So potentially in one of those, or both, we can show some hot swapping, if either person is comfortable with that. That’s one option.
[2:25:03] Dan Becker: All right. What is this mention of AWQ in the Honeycomb example?
[2:25:10] Hamel Husain: Yeah. So AWQ is a quantization technique. I didn’t go too deep into it, but it’s a tool that you can use to quantize a model. It’s compatible with vLLM, it’s very easy to run, and it’s actually integrated quite nicely into vLLM. The code that I showed you showed how to perform the quantization. There are a lot of different knobs that you can tweak in the quantization; I haven’t really explored those, to be honest.
[2:25:39] Hamel Husain: I just kind of use the default ones or the ones they have in the documentation.
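For reference, that workflow looks roughly like the sketch below using AutoAWQ and vLLM; this is a generic outline with default-ish settings, not the exact script from the course repo, and the model paths are placeholders:

```python
# Rough sketch of AWQ-quantizing a merged fine-tuned model and serving it with vLLM.
# Model paths and quant settings are placeholders; the course repo has the real script.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "your-org/your-merged-finetune"   # merged base + LoRA weights
quant_path = "./merged-finetune-awq"           # local output directory

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Mostly-default knobs from the AutoAWQ docs; these are the ones you can tweak.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Then serve (or push to the Hub and serve) the quantized checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model=quant_path, quantization="awq")
print(llm.generate(["Hello!"], SamplingParams(max_tokens=32)))
```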
[2:25:50] Dan Becker: Ruben Alvarez asked this a while ago, and it has been talked about since then, but there’s a question: if during fine-tuning I load the model in 8-bit, and the LoRA is in whatever bit size, what happens if I quantize like Hamel just showed?
[2:26:11] Hamel Husain: I actually don’t know. As you can see, I merged it back in my example, just because, if that’s a feasible thing for me to do, I just do it, because I don’t want to deal with complications. Actually, I think Travis may have gone through a little bit of what can happen there, so I would review what he went over in the recording.
[2:26:41] Dan Becker: Yeah. Also, a question from Wade: is one way to handle this to push an AWQ model to the Hub after training?
[2:26:50] Hamel Husain: Yeah, Wade. The code I shared in my portion of the presentation, I shared a Hugging Face Hub repo with the code in it that you can use to quantize a model and push it to the Hub. So when the recording is available, go back to that and you’ll see that code.
[2:27:12] Dan Becker: Unrelated to today’s content, but: how much should you charge enterprises for a fine-tuning project? We could talk about this, as with everything else, at varying levels of depth. My recommendation is to try to figure out what the problem is that they’re solving and how important that problem is to them. And then figure out: do they have some metric they’re trying to get from X to Y? Work with them to figure out how much the project is worth to them.
[2:27:49] Dan Becker: Use that as a way to ground the answer. And let’s say: don’t charge hourly. Once you figure out how much it’s worth to them, charge a reasonable fraction of that.
[2:28:05] Hamel Husain: Yeah, and a lot of people don’t know what it’s worth to them, which means don’t do it.
[2:28:13] Dan Becker: Yeah, but I would start not with “here’s the hourly rate,” but instead by thinking about what the metric in their business is that they’re trying to move. And I think that will help you think through, back of the envelope, what it is worth to them.