Napkin Math For Fine Tuning

fine-tuning

llm-conf-2024

Published

July 1, 2024

Abstract

We will show you how to build intuition around training performance with a focus on GPU-poor fine-tuning.

Chapters

01:23 About Johno and AnswerAI

Johno shares his background and his work at AnswerAI, an applied R&D lab focusing on the societal benefits of AI.

03:18 Plan for the Talk

Johno outlines the structure of the talk, including objectives, running experiments, and live napkin math to estimate memory use.

04:40 Training and Fine-Tuning Loop

Description of the training loop: feeding data through a model, measuring accuracy, updating the model, and repeating the process.

09:05 Hardware Considerations

Discussion on the different hardware components (CPU, GPU, RAM) and how they affect training performance.

12:28 Tricks for Efficient Training

Overview of various techniques to optimize training efficiency, including LoRA, quantization, and CPU offloading.

13:12 Full Fine-Tuning

Describes the parameters and memory involved with full fine-tuning.

18:14 LoRA

Detailed explanation of full fine-tuning versus parameter-efficient fine-tuning techniques like LoRA.

21:04 Quantization and Memory Savings

Discussion on quantization methods to reduce memory usage and enable training of larger models.

23:10 Combining Techniques

Combining different techniques like quantization and LoRA to maximize training efficiency.

22:55 Running Experiments

Importance of running controlled experiments to understand the impact of various training parameters.

25:46 CPU Offloading

How CPU offloading works and the tradeoffs.

28:31 Real-World Example

Demo of memory optimization and problem-solving during model training, with code. This also includes pragmatic ways to profile your code.

45:44 Case Study: QLoRA + FSDP

Discussion of QLoRA with FSDP, along with a discussion of tradeoffs.

54:25 Recap / Conclusion

Johno summarizes the key points of his talk.

Resources

Links to resources mentioned in the talk:

Johnowhitaker.dev << Personal website for Johno Whitaker.
FSDP+QLoRA Benchmarks << Johno’s (and others) benchmarks for FSDP+QLoRA used in an example
Transformers Issue 25572 << Someone showing math for activations etc
Talk Slides << Talk slides
PyTorch Tutorial on Optimizer Step in Backward << More use of memory viz plus an under-rated technique

Full Transcript

[0:00] Johno: All right, so welcome everybody. The talk is titled Napkin Math for Fine Tuning. The goal here is to answer a number of different related questions that often come up when you’re talking about training, and especially with a lot of people getting into training models for the first time via fine-tuning these big existing models. What affects the performance? How do I make this better or worse? Why is this running out of memory? Or why is this taking so long? There’s some questions already in the Q&A. What if we’re over the limit around GPU?
[0:33] Johno: What are the things that we can turn on to bring us under? you know, how do we, how do we like reason about these different parameters? Because if you’ve looked at the axolotl config files or anything like that, you realize there are so many knobs to tweak and it can be tricky to get a mental model of what those different things will do. So that’s the goal in this talk is to kind of get a feel for the space. What are the different types of fine tuning? [0:55] Johno: What are the different things that we can tweak and how does that affect how much memory it uses, how much computation it has to do, how fast or slow it is, how cheap it is, et cetera. Cool. Okay, so these links I’ve also posted in the channel. These are some of the things we’re going to reference. I guess, yeah, for those who are curious about me, there’s a site there if you want all the info. I currently work at AnswerAI, which is an applied R&D. [1:23] Johno: research and development lab for AI, trying to figure out what these models are useful for, what things like fine-tuning are actually useful for, and also how can we make them more accessible, easier, as a public benefit corporation. So the goal is to try and find societally beneficial uses of AI, which is quite a big challenge and I think an open question. [1:44] Johno: Okay, so the good news for this talk, we understand how these models work from it, like not in how do they learn things, that’s still quite an open question, but at least in how do we produce an answer, right? We have some known mathematical operations, we have some known transforms of the data, we put the data through these different operations and we get an output. We do understand this. We can do this maths and we can also experiment and find things out. [2:08] Johno: The bad news is it’s never as simple as you think it’s going to be. The paper might be different from their code, might be different from the Hugging Face implementation. 
The implementations are changing and shifting over time. The maths can get very tricky. It’s easy to forget things. I always feel as soon as I start reaching into this that I’m very quickly into a depth where oh, wow, I thought I had this down. I thought I understood X. There’s always something deeper. [2:35] Johno: There’s always the little asterisk that says, oh, actually in a multi-GPU system, we keep two copies of this buffer around just in case. You know, there’s always something that you don’t know. And I want to make this clear from the outset. This is like, I will try and be useful in this talk. We’re doing napkin math, so it will be hand wavy. I make no claims that anything I do or say here is 100% accurate. We’re just going to do our best. [2:59] Johno: And the whole talk is going to be this process of trying to guesstimate, trying to get there without having to do necessarily all the perfect nitty gritty steps. All right. Cool. So it’s going to be okay. We’re going to get through even though there’s those hard bits. So, what is the plan? If you’re just joining, we’re talking about napkin math for fine tuning, we’re going to go through what the objectives are, that’s the intro that we’ve just done, so that’s what, 14 and a bit percent of the way through already. [3:30] Johno: We’re going to talk about what happens during fine tuning to get an understanding of like, okay, what are the pieces that we should be keeping an eye on and thinking about? We’re going to talk about how you can run experiments to interrogate some of these questions. And then we’ll also do obviously some napkin math to say, before I run the experiment, maybe I can try and at least make some predictions as to what I might see. 
[3:51] Johno: I’ll show some code as a way to run experiments at a smaller scale than running a full fine-tuning run on some GPUs that I’ve rented in the cloud. Like, how do you test things out more locally? Again, we’ll do more napkin math. I’m going to try and do some live mathing. We’ll see how well that goes. And then we’ll dive into maybe like another case study that I’ve done recently, just as a way to hopefully surface more and more questions around some of the nitty gritty details. [4:21] Johno: And then like I said, hopefully lots of time for questions. Cool. Does that sound good? Sounds great. Fantastic. Okay. Yeah. So, yeah. Oh, sorry. I just said I’m really excited. Oh, fantastic. Cool. Okay. So this is the loop that we’re going to be talking about. And training and fine-tuning, it’s sort of the same operation in a sense, except that with fine-tuning, we’re starting with something that’s already been trained for a little while. What we do is we take some data and we feed it through a model and we get an answer. [5:01] Johno: And then we measure how good that answer is, and we try and update the model, hopefully to make a better answer in the future. Right, then we take some more data, we feed it through, we get an answer, we compare it to the true answer, we update the model, and we repeat and repeat. And in the case of language models, which is what this course is focusing on, usually the correct answer is what word actually came next. And the predictions are the probabilities for what word might come next. [5:27] Johno: So we have some fine-tuning data set and we’re saying, oh, you know, I feed my instruction and my response through the model. And at each token, it’s predicting, well, what’s the next token that I look at, especially for the response? What tokens did I predict versus which ones do I actually want? That’s going to be the thing I measure. And that’s going to be what I update with, right? So this is our loose map. 
And we’re doing this not just in some sort of virtual space, but we’re doing this on actual physical hardware. [5:55] Johno: You usually have a CPU with some CPU RAM. You have hopefully a GPU with some GPU RAM. Ideally, multiple GPUs, each with their own, potentially, a little bit of GPU RAM. Maybe you have multiple computers, each of which have multiple GPUs. So this hardware that we’re running on is going to start to influence a lot of the questions that we’re asking in conjunction with what’s actually going on in that training loop. Okay, so now this is that same loop as before, just annotate it a little bit. [6:23] Johno: What are some of the things we need to keep an eye on when we talk about what makes fine-tuning faster or slower? Why is it difficult? Why do we need lots of GPUs? Can’t I just do it with one? So the thing is, some data that we’re loading from disk or from the internet or from somewhere, that takes up some space, usually not too much in the grand scheme of things. We want to feed that through a model. And now each of these operations that we’re doing, like attention, we’re feeding it through these feed-forward networks. [6:51] Johno: Everything here is maths. Lots and lots of operations. So this is crunching a lot of numbers. This takes time. The GPU has a number of different cores in it, you can imagine. They’re each going as fast as they can. But still, there’s just a lot of numerical operations. You’ve heard flops, floating point operations. There’s a lot of those we need to do. Also, the parameters of the model… take up space as well. [7:16] Johno: So we have the data, but we also need the matrices that we’re multiplying that data against, and then the results are getting multiplied with more matrices. So the model takes up a lot of space too. And then since we want to train this model, we’re also keeping track of something called gradients. Like, okay, if I were to tweak this parameter, how would that affect the output of this layer? 
And then how would that affect the output of the next layer all the way down to the final prediction? [7:41] Johno: So I’m feeding data through the model, that’s taking a lot of computation. The model is taking up a lot of space. We’re storing gradients, and they’re taking up a lot of space. Eventually I get an answer. So we compare it to the right answer. And then we now want to say, okay, how do I update the model? Again, this is a lot of crunching of numbers to figure out what those gradients are, how I need to update these parameters. And then if you’re using an optimizer to update the parameters, there’s more stuff to store there. [8:08] Johno: So the point of the slide is to look a little confusing. I see someone’s very helpfully annotated the flow. Yeah, we’re still going circular, right? This is still the same as this one. Feed some data through a model, get an answer, measure how good it is, update, right? A looping cycle. And here we’re just saying at various points in the cycle, there’s things that involve a lot of computation and there’s things that take up a lot of space. And so what I want you to take from this: keep an eye on those two values, right? [8:34] Johno: There’s things that take number crunching and there’s things that hold memory. So, yeah, keep these two aspects in your mind. These are going to be like two sides of the coin when you talk about performance, et cetera. People in the chat are saying, oh, there should be arrows in that slide. Napkin math, who has time for the arrows? Okay, so that’s the framing that we want to keep in mind. We’re holding things in memory and we’re shuffling data around and we’re doing operations. Okay, so why do I say shuffling data around? [9:09] Johno: Remember I mentioned this is all running on actual computers. And what that means is that there’s a whole bunch of different types of memory. And I don’t want to get too deep into the weeds on this, but for example, on your physical CPU die, right? 
This is like an actual piece of silicon with lots of little tiny wires and circuits etched into it. You have some memory that is right next to the CPU. It is like a very, very short actual physical distance. The transfer times between those two are almost instantaneous. [9:43] Johno: Then you also have those sticks of RAM that you put in your desktop over to the side. Those are also very fast, right? Things that are in RAM, we usually think of that as fast memory compared to something like stored on your hard drive. So every little piece of this chain, we have different chunks of memory at different stages, and the copying back and forth is going to take more or less time. Same thing on the GPU. On the GPU, you have some RAM that’s actually on the physical GPU die, usually not very much. [10:10] Johno: You have some other RAM blocks usually right next to it that are still really, really fast, but a little slower. You can share memory across GPUs, but then you’re now having to communicate via NVLink or via the PCIe lanes on your motherboard. And so if we’re copying data back and forth, like say I’m copying data to the GPU so that I can do some computations. If I’m copying that data from a spinning hard drive, that’s going to take a very long time. If I’m copying it from an SSD, it’s going to be faster. [10:37] Johno: If I’m copying it from RAM, like on the CPU, it’s going to be faster, et cetera, et cetera, et cetera. So we have this hierarchy, and we want to be as aware as we can that putting things in slower memory is going to mean it takes longer to copy them across. And if we have to copy them across every time we go through that loop, that’s going to slow things down. So that’s a big piece of the performance puzzle to keep an eye on. And again, when we talk about… [11:02] Johno: GPU rich scenarios where you have multiple nodes and each node has multiple GPUs. 
The network link between two nodes might be like 10 or 100 times slower than the inter-GPU communication, which might again be like 10 times slower than the high bandwidth memory on the GPU. Okay, so that’s a lot of like background, but these are the pieces that we want to be thinking about when we’re talking about what makes something fast. Okay, I should check the chats and questions in case I’m… glossing over things too fast. Someone says my voice sounds AI generated. [11:38] Johno: That’s great. Thank you. Also apologies. I mean, hopefully the recording, you can put it at half speed or something like that. I’ve been told I do talk quite fast. Okay. Some specific questions that we’ll get into at some point. Cool. All right. So let’s plod on. Okay. So the goal in training is we want to keep this GPU fed. And so if I said, hey, I’ve got a brand new idea, my hard drive has a terabyte of space. [12:08] Johno: So why can’t I train a massive model just by keeping the whole model on the hard drive and then layer by layer, I’ll copy it onto my GPU, I’ll do my operations and then I’ll copy it back onto the hard drive. That’s going to be a lot slower than if I could say, okay, I can fit my model in the GPU RAM. Okay. So maybe we’ve got lots of tricks that we can do to keep things closer to the metal. [12:35] Johno: Maybe we should first switch to the napkin and look at an example of like, yeah, some different types of fine tuning and how much memory they’re using. Hamel, do you have questions or other things that you’ve seen that we should look at before we start putting this theory into practice? No, I mean, I think this outline looks good to me. All right, cool. So let’s switch to the napkin and let’s try and doodle an actual example. Okay, and so we’re going to start with, I saw a question, different types of fine tuning. [13:15] Johno: Let’s start with full fine tuning. So every parameter of the model I am wanting to update. 
And so on this napkin here, what I’m going to try and do is I’m going to try and keep size somewhat representative of how much memory something’s taking. So let’s imagine we’re lucky. We have a lot of GPU RAM and our model is quite small. So I have my model. This model is represented by a number of parameters. Each parameter is some numeric value, and we’re going to use maybe high precision. We’re going to use 32 bits per parameter. [13:56] Johno: So that’s four bytes. So if my model is 100 million parameters, this is going to be 400 million bytes, so it’s going to be 400 megabytes, right? And the reason for this factor of four here, just to be clear, we’re saying we often have multiple bits per parameter, usually 8, 16, 32, right? So one byte is eight bits. If I have 32 bits, that’s four bytes. So for every parameter in this 100 million parameter model, I have four bytes of memory representing that parameter. So that’s why I said this is a 400 megabyte model. [14:38] Johno: Okay, so then I take some data and let’s imagine our data is tiny, right? Like this is a model that takes in one number and it tells you if it’s greater than or less than one, right? This is classic overparameterized machine learning. Why use a small model when a big model will do? So my data is going to be negligible. I put it on the GPU and I want to run it through this model. Now, at the start, the model has no gradients, we haven’t done any operations yet. [15:03] Johno: But when I start feeding the data through the model, what we’re going to end up with is for every parameter in the model, we’re also going to be storing a gradient potentially. And so then I have now a 32-bit value that is separate from the… model parameter that’s like a gradient, right? And then I get my final answer. I compare it to the true answer. Was the label correct? I call backward. And actually, it’s the backward pass that’s filling in these gradients. Now I would like to update the model. 
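The bits-to-bytes arithmetic in this part of the talk can be written as a short sketch. The helper name `param_memory_mb` is made up for illustration, and it uses decimal megabytes (10^6 bytes) to match the napkin numbers.

```python
def param_memory_mb(n_params: int, bits_per_param: int = 32) -> float:
    """Memory needed to hold the parameters alone, in decimal megabytes."""
    bytes_per_param = bits_per_param / 8  # one byte is eight bits
    return n_params * bytes_per_param / 1e6


# 100 million parameters at 32 bits (4 bytes) each
print(param_memory_mb(100_000_000, 32))  # 400.0
```

This only counts the weights themselves. Gradients, optimizer state, and activations come on top.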
[15:44] Johno: And if I’m using an optimizer like stochastic gradient descent, the optimization step is just looking at these gradients and then figuring out from there how to update the model. But some fancier optimizers and specifically something like Adam, which a lot of us use, they don’t just use the gradient from this particular training step. They have ways of accounting for like momentum, right? So that’s another set of parameters for momentum. They might have like a variance measure of like the last few gradients, have they all been very similar or have they been very different? [16:17] Johno: So we can end up with potentially multiple sets of numbers in the optimizer state, each of which is the same amount of parameters, same amount of storage as the model. Right, so you can see… So is it usual to have like, I mean, is it this pictorial representation? Is the optimizer taking up three to four times more space than the weights? Yes, with lots of caveats. So often people will use, you’ve heard of like maybe 8-bit Adam. [16:47] Johno: Right, that was a thing where we said, okay, rather than representing these in full precision, you know, maybe I’ll use smaller little 8-bit precision rather than 32-bit precision buffers to store the momentum or whatever, right? We’ll have some sort of clever filtering or clever quantization that compresses this. There’s also like different optimizers will try and say, how can we get away with less state, right? [17:19] Johno: So I have my gradients and maybe I only have one other value, or maybe what I do is I have one value for every block of gradients, you know, so for every layer maybe. So then I suddenly end up with a much smaller set of stored states. So this is a lot of the juggling that goes on. But in general, for the high-performing optimizers that everyone uses, you do end up with, yeah, my model takes up 400 megabytes. The gradients take up 400 megabytes. 
The optimizer state takes up 400 or 800 megabytes as well. [17:49] Johno: So it is a lot of overhead. What we’ll see is that this changes when we come to the kind of LoRA LLM fine-tuning that we’re talking about. Boom. Okay, so this is one example here. When we’re talking about language model fine-tuning, we are in a slightly different situation, right? Because we usually have a very big model, but often we’re using something like LoRA specifically because I don’t have space on this GPU to fit four copies of this model, right? I just actually don’t have enough space. [18:27] Johno: So one of the tricks at our disposal is to say, well, why don’t I keep most of this model frozen? In other words, I’m not going to be updating those weights. And if I’m not updating them, I don’t need gradients and I don’t need optimizer state, right? But what I’m going to do is maybe just train some layers, or maybe I’ll add a few extra parameters, right? [18:48] Johno: Like some small subset, 1% of the weights I’m going to add. For every big, you know, 1000 by 1000 matrix, I’m going to add a 1000 by 32 matrix, something like that, right? So a much smaller subset of the weights. And these are the ones that I’m going to be keeping track of gradients for. And these are the ones that I’m going to be optimizing. So now, even though I don’t have room for four copies of this whole network, that’s fine because I only need one copy. I can feed my data through. [19:19] Johno: So I get some data, I feed it through, get my answer, and now my gradients are only for these trainable parameters. My optimizer state is only for these trainable parameters. And so I’ve still got some room here. So that’s one of the reasons why LoRA was so popular is it’s saying suddenly… you don’t need 4x your model size to train. You need just enough room for your model plus 4x your trainable parameters, which are usually like 1% of the model size. 
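As a rough sketch of the two budgets described here: full fine-tuning needs the weights, the gradients, and the optimizer state, while a LoRA-style setup needs the frozen model once plus about 4x the trainable parameters. The function names are hypothetical, `optimizer_copies=2` assumes an Adam-like optimizer that keeps two extra values per parameter, and activations are ignored.

```python
def full_finetune_mb(model_mb: float, optimizer_copies: int = 2) -> float:
    """Weights + gradients + optimizer state, all the same size as the model."""
    return model_mb * (1 + 1 + optimizer_copies)


def lora_finetune_mb(
    model_mb: float,
    trainable_fraction: float = 0.01,  # "usually like 1% of the model size"
    optimizer_copies: int = 2,
) -> float:
    """Frozen model once, plus weights, gradients, and optimizer state
    for the small trainable subset only."""
    trainable_mb = model_mb * trainable_fraction
    return model_mb + trainable_mb * (1 + 1 + optimizer_copies)


model_mb = 400.0  # the 100M-parameter, 32-bit model from the napkin
print(f"full fine-tuning: ~{full_finetune_mb(model_mb):.0f} MB")
print(f"LoRA-style:       ~{lora_finetune_mb(model_mb):.0f} MB")
```

With these assumptions the full run needs about four copies of the model, while the LoRA-style run needs only slightly more than one.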
All right, so if my model… [19:51] Johno: Just for the audience a bit, I see people trying to take notes and write down an outline. I guess maybe at the end, maybe you can enumerate right down on whatever all the different components. And then we can just pay attention to this. I don’t know. I think the collaborative note taking, some people find helpful. And that’s also nice for me because I’m going to be like, what I should be doing is redrawing this diagram multiple times. So I should have maybe made a new one for the LoRA. [20:20] Johno: But I think it might be easiest if I can just erase things, scribble over things. So we might not end up with much of an artifact here. So if people feel like they can take their own notes, that’s even better. Okay. Maybe it’s useful to actually write down, just say the name on the side. Like, hey, yeah. All the components. Okay. So I mentioned in the slides, we have lots of tricks at our disposal to try and beat this situation where we’re… [20:47] Johno: running out of memory, we’re having to put things in slower memory, and that’s slowing everything down. Okay, so LoRA was one, only training some parameters, right? So I’ve got fewer parameters to train, smaller sets of gradients to keep track of, smaller optimizer state. So let’s look at another one that’s quite popular these days, right? Here’s my GPU. If this is how much the model would take in 32-bit, we can say, okay… can I represent that with fewer bits per parameter? So we call this quantization. [21:22] Johno: And so where we would have had this big model, I can now say, well, rather than using 32 bits to represent every parameter, can’t I get away with some smaller number? So 8 bits, suddenly I’m using already a quarter of the size versus 32 bits. And if you’re really crafty, you can compress that down even further to something like 4 bits per parameter. 
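To see what fewer bits per parameter buys, here is a small sketch that tabulates weight storage at different bit widths. The 1-billion-parameter model is a hypothetical example, and the sketch ignores the small amount of extra metadata (such as scaling factors) that real quantization schemes typically store.

```python
def weights_gb(n_params: int, bits_per_param: int) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 1_000_000_000  # hypothetical 1B-parameter model
for bits in (32, 16, 8, 4):
    size = weights_gb(n_params, bits)
    reduction = 32 / bits
    print(f"{bits:>2}-bit: {size:.1f} GB ({reduction:.0f}x smaller than 32-bit)")
```

Going from 32 bits to 8 bits gives the quarter-size figure mentioned in the talk, and 4 bits halves that again.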
[21:46] Johno: So now I have my frozen model, the weights are stored in this quantized state, we’re not going to go into what quantization is or all the nitty gritty of how that’s done, but it just means that these take up much less room. And so suddenly, from 60% of my GPU being taken up by these 32-bit weights, we have gone from 32 bits to 4 bits. So that’s an 8x reduction. So now we’re at… 5, 6, 7, 8% of the GPU. Or we can train a 5 or 10 times larger model. You can combine these techniques. [22:22] Johno: Let me just put a Q for quantization. So you can keep the base model quantized and then you can have trainable LoRA layers that are still in full precision. But because they’re so small, the total size taken is a lot less than if you had a full model in 16-bit. I now have a full model in 4-bit and some LoRA layers in 16-bit, but the LoRA layers maybe add up now to 10% of what the quantized model is. So this is another big trick we can do to get more space to do more things. [22:55] Johno: So if you’re trying to train, like here’s a very actionable thing. Oh, I’m trying to train a model, it just doesn’t fit on my GPUs. It’s like, okay, maybe try out LoRA, right? So now you’re only training a much smaller set of parameters. See if that fits. Okay, cool. This is great. I want to go to an even bigger model and now it doesn’t fit on my GPU even if I’m just doing LoRA. Like, okay, maybe consider quantization. So do look up QLoRA, right? There’s a load_in_4bit: true option in Axolotl. [23:22] Johno: A lot of other trainers will support this. And so now you’re saying, okay, I’m quantizing my model and I’m putting it on and we can go from there. There’s a question: when I do QLoRA, are both weights and gradients quantized or only one of them? Are there pros and cons of choosing full precision of gradients over weights? 
So yes, what we usually do is we just quantize the frozen base weights and then the LoRA parameters we still keep in higher precision and the gradients we still keep in higher precision. [23:51] Johno: So, um, there’s a few reasons. One, we can get away with it, right? Because the LoRA parameters are still some small subset of the whole. But two, it’s very tricky because what we’re usually doing during training is we’re making very small updates to the weights. And if you have only a four-bit value and you’re trying to train that directly, you might calculate your gradient and then your update might be 0.001 in some direction. But because I’ve quantized the value so aggressively, the possible values I can represent with four bits might be 0 or 0.2. [24:23] Johno: And so if I’m saying, okay, I’m going to try and add 0.001 to 0. Okay, it’s 0.001 now, but then I’m quantizing it so it still stays zero. We don’t actually get an update. There are tricks like the 8-bit optimizer does some various tricks to try and get effectively higher precision with something called the Kalman filter and all these clever tricks. But in general, you want your trainable parameters in higher precision. You want the base weights in low precision if you can get away with it. And all of this is usually found via experimentation. [24:52] Johno: So we tried 8-bit training that was very popular for a while. Then people found like, hey, as long as we’re doing LoRA and the LoRA is still in 16-bit, we can get away with… 4-bit, 3-bit, even 2-bit compression of the main base weights. But at below 4-bits, you start to see really some performance degradation. So that’s kind of like the sweet spot. Maybe for some models, it’s 6-bits. For some models, it’s 2-bits, whatever. But yeah, this is the sort of thinking that people are doing. Like, okay, where can I keep full precision? 
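The point about tiny updates vanishing on a coarse grid can be shown with a toy example. This is only an illustration: `quantize` is a made-up helper that rounds to a uniform grid with a spacing of 0.2, whereas real 4-bit formats use different, non-uniform value sets.

```python
def quantize(value: float, step: float = 0.2) -> float:
    """Round to the nearest multiple of `step` (a toy uniform grid)."""
    return round(value / step) * step


update = 0.001

# Training a quantized weight directly: every update is rounded away.
quantized_weight = 0.0
for _ in range(50):
    quantized_weight = quantize(quantized_weight + update)
print(quantized_weight)  # 0.0, the weight never moves

# Keeping the trainable value in higher precision: the updates accumulate.
high_precision_weight = 0.0
for _ in range(50):
    high_precision_weight += update
print(round(high_precision_weight, 3))
```

This is one way to picture why the trainable parameters stay in higher precision while only the frozen base weights are quantized.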
[25:22] Johno: There’s some layers that we never quantize down, or at least we keep the activations in high precision, things like the position embeddings, layer norm. But for the most part, it’s like wherever we can, we want to compress it as much as we want. But for the things that are being trained, we still keep them in high precision. Okay, so these are some of the tricks. There’s more. If we look back at my slide, there’s a few others I listed. CPU offloading. [25:48] Johno: If we get to the stage where we’re saying, look, here’s my model, and here’s my GPU. Oh, I can quantize my model. So here’s my quantized model. It’s still not going to fit. I don’t need to keep gradients. I don’t need to do other things. So at this point, we can say, well, I have my CPU RAM. And it’s, you know, CPU RAM is cheap. I have 256 gigs or 128 gigs in my machine, even though my GPU only has 32 gigs. So I’ll kind of like declare bankruptcy. [26:19] Johno: And what I’ll do is I’ll keep the GPU like empty for calculations. And I’ll only put one layer at a time of the model. And then I’ll do some operations and then I’ll store back the gradients, the weights and everything on the CPU. I’ll bring the next layer on. I’ll do some stuff on the GPU. I’ll put it back so we can get even more size in terms of models, but there’s a trade off. There’s some copying. Okay. So this is all to do with memory. [26:49] Johno: Specifically, I’ve been talking about the weights and the gradients and things. There’s a few other considerations to start thinking about now, and maybe we’ll jump into the code shortly and then switch back to the napkin. But one really good question is, at what context length do activations become significant enough for us to start thinking about? And that is an excellent question because this first explanation here, this is what you’ll see in a lot of places online. But it definitely doesn’t tell the whole story. And so let’s go back, let’s draw another GPU and a model. 
[27:25] Johno: And we can go whatever, here’s my gradients, say. Before I said, oh, here’s my data, and we’re going to feed it through. Now if we’re feeding a lot of data, like I have a really large batch size or a really long sequence, suddenly the size of that starts to matter. And for every layer in this model, I’m storing the outputs of that layer to be fed into the next layer. And I often need to keep those around to be able to calculate the gradients. [27:53] Johno: And so what you end up with is then, okay, I’ve got my input data. It’s probably still quite small relative to the model size. But for every layer, I’m keeping around these output activations so that I can calculate the gradients. And these really start to matter. These start to add up. And suddenly, you might see a situation where, like if any of you have tried long context length training, this starts to be a problem. So let’s switch over into… Well, actually, first of all, I’ll show you how you can run some experiments like this. [28:22] Johno: And maybe I should have started with this before we went straight to the napkin. But then I’ll look at some code and show some concrete examples of the stuff we’re talking about. Okay, so one way to run experiments is to… do two versions of your training, right? If you follow the Modal getting-started-with-Axolotl docs, and I did this, this was my first time using it, you get a nice configuration file, and you can change the batch size, right? Which, on our napkin, is going to be changing like maybe how much data we have. [28:54] Johno: You know, there’s one sample, another sample. So if I have a higher or lower batch size, there’s more data. You can run that through and you can see how long it takes. Another way is to go directly into code and to try and look at these individual steps more close to the metal, rather than having to rely on multiple minutes of training on some GPU in the cloud. It’s sometimes nice to have a smaller example. 
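The earlier point about per-layer activations growing with batch size and sequence length can be sketched as a napkin estimate. Everything here is an assumption for illustration: the helper name, the idea of one saved tensor of shape (batch, sequence, hidden) per layer, 2 bytes per value, and the model dimensions. Real transformer layers keep several intermediate tensors, and techniques such as gradient checkpointing change the picture, so treat this as a rough lower bound on the trend.

```python
def activation_gb(
    batch_size: int,
    seq_len: int,
    hidden_size: int,
    n_layers: int,
    bytes_per_value: int = 2,    # assumes 16-bit activations
    tensors_per_layer: int = 1,  # assumption: one saved output per layer
) -> float:
    """Very rough size of the activations kept around for the backward pass."""
    values = batch_size * seq_len * hidden_size * n_layers * tensors_per_layer
    return values * bytes_per_value / 1e9


# Hypothetical small model: hidden size 2048, 22 layers.
for batch_size, seq_len in [(1, 1_000), (1, 12_000), (2, 12_000)]:
    gb = activation_gb(batch_size, seq_len, hidden_size=2048, n_layers=22)
    print(f"batch={batch_size}, seq_len={seq_len}: ~{gb:.2f} GB")
```

The estimate scales linearly in both batch size and sequence length, which is why long-context training can run out of memory even when the weights fit comfortably.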
So that’s the notebook that I prepared. We’re just going to run through… [29:24] Johno: doing one cycle effectively, or not even, like we’re not even going to worry about the optimizer, just feeding some data through the model and looking at the gradients and starting to see like, how do we understand a question like this? How does context length come in? How does batch size come in? When does that start to matter? Okay, so we have a notebook here. We’re going to be using a few different things. One, PyTorch has some nice built-in memory tracking options. And so we can print out… the memory allocated and the memory reserved. [29:57] Johno: Memory reserved is what you’ll see in something like the Weights & Biases memory usage logs or nvidia-smi. But sometimes it’s like we’ve put stuff in memory. We’ve actually deleted it or we’re not using it at the moment or PyTorch has put it there ready for the next computation. But if we didn’t have enough space, it wouldn’t really matter and we could just load that as needed. So memory allocated is maybe a slightly better measure of how much you can get away with if you have 24 gigabytes of GPU RAM. [30:26] Johno: and you can do something where memory allocated is 23 point something, right? It’ll be a squeeze, but you might fit in. Memory reserved might show 24 gigs used, but actually only 18 might be allocated. It’s a small difference, but these are two measures of like GPU RAM used. Okay, so we can do something like load up a model. In this case, TinyLlama, it’s a very small, well, still over a billion parameters. So it’s kind of funny to say small, but in the days we live in, that is small. [30:53] Johno: We can load it in 4-bit, so this is doing the quantization. We can set up LoRA, so this is now making sure the base model is frozen, but we have these trainable parameters that are a small subset of the total. We can create a batch of data, and with data, usually you’re thinking about, wait, where is the text? 
This is the language model. Each token of text gets transformed into a number, and so that’s why I’m just using random integers here. [31:22] Johno: And then we have a list of them that is 12,000 tokens long for each example in the batch, and in this case batch size is one. I’m going to feed that data through my model, and then I’m going to calculate the gradients, and then I’m going to print the memory usage. And if I run this, we’ll see there’s some amount of memory used. Okay, so this is where we can start to do some napkin math, right? So first it might be to say, what if the data was really, really small? And that ran too fast. [31:58] Johno: I was going to say, make your predictions how much space is going to be here. But you can see this is pretty small. It’s a small model and it’s quantized. As we increase this, we’ll see that suddenly the memory usage actually is quite a lot bigger. And so you can play with this. In fact, I put a little exercise there. I may have left the solution in the notebook that I shared, but just try this out. [32:22] Johno: Plot your memory usage versus context length or see what happens when I say, okay, what if I double the batch size here? So let’s do that. Double the batch size. Suddenly I have a higher memory. So you can see here, this is a lot easier than like, I kick off a training run on Modal, I wait for it to go, I take time for the training to finish, and then I go look on Weights & Biases and I see which finished. Here we have this very like concrete way of checking things out. [32:52] Johno: So I could do, for example, load in 8-bit, I think this is a thing that one can just say. And see what that does. Cool. So I’ll get a different memory usage there. If I didn’t load the model with any quantization at all, right? So we’re just loading it now. Sorry, where is the print memory stats thing coming from again? That’s just up here. 
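The print-memory-stats helper asked about here is not shown in the transcript, so the following is only a guess at its shape. The function names `format_memory_stats` and `print_memory_stats` are assumptions; the two `torch.cuda` calls are the real PyTorch peak-memory counters. PyTorch is imported lazily so that the formatting part runs even on a machine without it.

```python
def format_memory_stats(allocated_bytes, reserved_bytes):
    """Format two byte counts as a one-line summary in gigabytes."""
    gb = 1e9
    return (f"Max memory allocated: {allocated_bytes / gb:.2f} GB, "
            f"max memory reserved: {reserved_bytes / gb:.2f} GB")


def print_memory_stats():
    """Print peak GPU memory so far (hypothetical stand-in for the notebook helper)."""
    try:
        import torch  # lazy import: the formatter above works without PyTorch
    except ImportError:
        print("PyTorch is not installed; nothing to report.")
        return
    if not torch.cuda.is_available():
        print("No CUDA device available; nothing to report.")
        return
    # Peak bytes held by live tensors vs. peak bytes held by the caching allocator.
    print(format_memory_stats(torch.cuda.max_memory_allocated(),
                              torch.cuda.max_memory_reserved()))
```

When comparing settings such as batch size or sequence length in one session, calling `torch.cuda.reset_peak_memory_stats()` between runs should keep an earlier, larger run from masking a later, smaller one.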
So this is just like a nice wrapper around torch.cuda.max_memory_allocated and torch.cuda.max_memory_reserved, which I think used to be called max_memory_cached. [33:31] Johno: So if we load it without any quantization, I’m expecting those numbers to be even slightly higher. Yeah, suddenly we’re up to 8 gigs instead of 6. So this is a little thing for you to go play with. I’d love for you to explore, like, what if I turn on CPU offloading, turn on and off gradient checkpointing? As always, there’s complication. Question about that print thing. Yeah, yeah. Does that account for all of the GPU memory that you might care about? [33:56] Johno: So I’ve tried using this before, and I’ve found, like, yeah, like, discrepancy between, like, what Weights & Biases logs and what that prints out. I’m just curious if you’ve experienced anything like that before. This will give you a pretty good idea, except, I mean, maybe look at the higher one, just to be safe. Aim to not fully saturate your memory because sometimes during training, if you’ve got a bit of extra, it can do things like prefetching the next layer or whatever. It’s sometimes helpful to have a little bit of headroom. [34:28] Johno: And then also, as soon as you’re talking about a distributed setup, I’ve got multiple GPUs or I’ve got multiple nodes, there’s things like if my weights are spread out over the GPUs, PyTorch needs to be able to combine them and prepare them ahead of doing the computation. So sometimes you need a little bit of extra overhead there. [34:46] Johno: But in general, this is a really good way to track, you know, if I have a 24 gigabyte GPU and then this is saying, you know, 20 gigabytes, then I probably can’t do my training on a 12 gig GPU, right? It’s a pretty good approximation. Okay, so that’s one way to like get insight. If you really want to dig deeper, there are some other tools. So… Here is a cool little function that’s maybe not as documented as it should be. 
[35:13] Johno: If you turn on record memory history, PyTorch will log exactly what is taking up memory over time. So if I run this cell, it’s going to take a little while, and then it’s going to spit out a pickle file, which if I go to this PyTorch memory site, sorry, just a sec, I’ll need to find this file. Yeah, here we go. I can drag this on. This is now over time, the total memory usage. You can see here on the left, figures in gigabytes. [35:49] Johno: And so this is a really instructive way of looking at things because you can see here, this like base usage that was there at the start, that’s the model weights. The model weights are already on the GPU quantized in this example. So they’re pretty small. And then as the data starts going through the model, we’re building up more and more memory that’s being used by the activations. We get to the end, there’s a little spike during the cross-entropy calculation and the final language modeling head. And then in backwards, we’re now calculating gradients. [36:24] Johno: So you’ll see that these gradients are appearing. But once you’ve calculated the gradients, we can kind of let go of the activations and so we’re reducing the memory overall. This will look very different if you’ve got gradient checkpointing turned off. It’ll look very different if you’ve got quantized versus unquantized weights, or different batch sizes. In a lot of the model training that we’re doing, the base usage of the weights might be most of it. And then you’ll see a little up and down as you get more activations and things. [36:50] Johno: But it’s a very, very fun way to get an idea for, hey, this is what’s actually going on. And if there’s some big spike, you can click on that. And you can try and parse this big list of this is what’s going on. Occasionally, if you’re lucky, you can then figure out what is actually causing that and what you need to do. So very underrated tool. 
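A rough sketch of the record-memory-history workflow described above, assuming a recent PyTorch release. `torch.cuda.memory._record_memory_history` and `torch.cuda.memory._dump_snapshot` are underscore-prefixed (private) functions, so their behaviour may change between versions. The wrapper name `record_memory_snapshot` and the `run_step` callback are hypothetical and are not taken from the notebook.

```python
def record_memory_snapshot(run_step, path="memory_snapshot.pickle"):
    """Run run_step() with memory-history recording on, then dump a snapshot.

    Returns the snapshot path, or None if PyTorch or CUDA is unavailable.
    """
    try:
        import torch  # lazy import so this file loads without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # Private API: start logging every allocation and free with stack traces.
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    try:
        run_step()  # e.g. one forward and backward pass
        torch.cuda.memory._dump_snapshot(path)  # writes the pickle file
    finally:
        # Passing enabled=None stops the recording again.
        torch.cuda.memory._record_memory_history(enabled=None)
    return path
```

The resulting pickle can then be dragged onto the PyTorch memory visualizer page (pytorch.org/memory_viz at the time of writing), which appears to be the site used in the demo.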
And I’m putting it in there mostly for like, this is like, you might only use this if you’re desperate, but it is useful to know. [37:16] Johno: And it’s also cool for like, if you look at the PyTorch tutorials, some of them have started to incorporate this to say, okay, here’s what’s actually going on. You can see, oh, we’re using this new special technique that does the updates. with the optimizer as it’s doing the backward pass. And so then we don’t have to like store these gradients and then do the gradient up. So there’s all these different little tricks. So this is a way to get a much deeper insight into that. I have a question about this. [37:43] Johno: So if you do like prefetching and stuff like this, does this like smooth out into more of a plateau-ish looking thing? Or how does it… Yeah. Sorry, prefetching? Oh, like of your data loader and you’re trying to shove it through the model? Not really, because the data might already be on the GPU here, but it’s all the intermediate activations that stack up. I see. And so maybe this is a good bit of napkin math we can do. [38:17] Johno: One of the links I shared was to a PyTorch issue where someone asked, what’s the deal with this activation memory on… on this model and whatever. So here’s what the model looks like. We’ve got some number of layers. They each have, there’s a hidden dimension. It is possible, and maybe my talk was misleading because I’m not talking much about this kind of actual math, but it is possible to go and do some calculations and figure it out. Specifically, if we go to the slides, right, they’ll be here. [38:53] Johno: In the slides, we had a link to some issue, this one. And the person gives a formula, right? So there are ways to calculate, okay, every attention layer is going to take this type of input, produce this output, and then it’s going to multiply that with that and take that output, and then it’s going to do these operations. 
So we can take this formula and we can just run this for our model. So let me go back to tldraw. I’ll keep this on a different screen. [39:24] Johno: Okay, so they said activation memory is equal to s × b × h × (34 + 5as/h). A lot of these, if you go read the paper that they link, these are all terms that are defined. s is the length of our sequence, which in our case was 1,000 tokens, I believe. Batch size b is 1, so we don’t need to worry about that. Hidden dimension h is 2048. This is the activation memory per layer. We said there was 22 layers. What else do we need? I think the number of attention heads is what a is. [40:04] Johno: I think that’s 8. Okay, so let’s run this calculation. We have… 1000 times 1 times 2k, right? So this is 2 million-ish, times 34 plus 5as over h. I said a was 8, s is the sequence length, which is 1k, h is 2k, so as over h is 4, times 5, so that is 20. So the bracket is 54-ish, call it 50. So this is about 100 megabytes. And you can see, like, I’m not worrying about this multiple bits. I’m kind of rounding off everything that I can. So for 22 layers, this is… Have I done this right? Yes, so this is… [41:06] Johno: 2.2 gigabytes? Is that right? No. Right. Because there’s floating bit. I feel like this might need to be multiplied because you’re doing everything in 16-bit. Oof. No. Hmm. Okay. Because on our plot, right, we didn’t have about 2 gigabytes, we had about 4 gigabytes. I should probably have rehearsed this. And on that plot, I guess like, okay, that plot is showing there’s a reserved amount. And does that stay flat throughout the whole? Oh, like this base thing here? Is that what is reserved? It’s more like these are the model weights. [42:00] Johno: So the model weights… You had two numbers, right? Like the higher one and the lower one. Yeah, this is active memory. So this is the slightly lower value of the two. This is things that are actually active. 
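The napkin calculation above can be redone without rounding. This sketch plugs the values quoted in the talk (s = 1000, b = 1, h = 2048, a = 8, 22 layers) into s × b × h × (34 + 5as/h). Treating the result as bytes is an assumption here; the talk itself is unsure whether a further factor for the data type is needed, so read the output as an order-of-magnitude figure.

```python
def activation_bytes_per_layer(s, b, h, a):
    """Per-layer activation memory from s * b * h * (34 + 5 * a * s / h).

    s: sequence length, b: batch size, h: hidden dimension, a: attention heads.
    The unit (bytes) is an assumption; see the note above.
    """
    return s * b * h * (34 + 5 * a * s / h)


# Values as quoted in the talk, not verified against the model config.
s, b, h, a, layers = 1000, 1, 2048, 8, 22

per_layer = activation_bytes_per_layer(s, b, h, a)
total = per_layer * layers

print(f"per layer: {per_layer / 1e6:.1f} MB")         # about 110 MB
print(f"all {layers} layers: {total / 1e9:.2f} GB")  # about 2.4 GB
```

Because of the 5as/h term, the estimate grows faster than linearly with sequence length: doubling s more than doubles the result.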
And so, for example, as these intermediates are being thrown away, they might not be immediately cleared out of the cache or whatever. So they might still be reported as lingering around in the reserved value, but not in the active value. Can you talk about an example where you looked at this graph and it helped you? [42:31] Johno: Like, you know, get unstuck on something or something? Yeah, sure. I mean, one… I found some bugs because the spike was suddenly extremely high due to an incorrect implementation. I’ve also seen cases where, okay, so now we have like maybe some optimizer state or some gradients or something like that. Those were larger than they should have been because I hadn’t frozen all the weights I should have frozen. If you call loss.backward() and then you do your update, but then you don’t delete the loss, some gradients might stick around rather than being forcibly cleared. [43:07] Johno: And so then your second training step might actually take slightly more memory. That’s been me before. And this graph was really helpful to log the memory over several steps to check that it wasn’t just continuously escalating. That was pretty helpful. What else? Oh, and then in finding, there was a thing where in… One of the recent versions of Transformers, I think 4.37 or something like that, Hugging Face changed their implementation. They did some optimization to make things faster for inference with KV caching and whatnot. [43:41] Johno: But what it meant was that they weren’t actually using the right efficient kernel. They were using the efficient attention rather than the flash attention. But efficient attention means like compute efficiency. But for training, it was now using a lot more memory. And so we could come in and see before and after, like here’s version, you know, 4.36 and here’s version 4.39, which of the pieces of this graph are larger? 
[44:05] Johno: This is not the interactive one, but if you hover, you can see which chunks suddenly went from 100 megabytes to two gigabytes or whatever on longer sequences. And we could say that’s the culprit, see which kernel was being called. Wait, that looks like it’s from a different attention kernel to the one that we expected. Go and find the bug, go and, please change it back to the flash attention for training. Yeah, I see. Does it take a lot of practice to read the logs when you hover over those things and you get those long logs? [44:35] Johno: Yeah, I would say don’t even worry too much about that. This is more to just get a general picture. I’ve got some memory allocated. I’ve got some memory that’s going up and down. If I change the sequence length to be smaller, the base is not going to change, but this little spike will be lower. And so this starts to give you a feel for where you can have these tradeoffs of, oh, if my model is taking most of the space and I really don’t have much space for the… [44:59] Johno: the actual batch data, then I’m reduced to doing batch size of one on short sequences. That’s not ideal. I’ve just realized we’ve gone longer than I wanted on this. I should have maybe just… left this as an exercise for the reader. I’ve shown how you can use a smaller model here. It’s my fault, sorry. No, no, it’s totally on me. What we should do is go back to this links page and as a, well, okay. So let me jump in. We have another speaker. [45:34] Johno: Right on the hour, so we’re not going to be able to go over. We’re not going to be able to go over. Okay, so I will tell people, maybe just go check out this link. 
[45:44] Johno: The reason that I’m talking about being able to estimate memory usage and things like that is that, especially for a GPU-poor situation, where maybe you don’t have the fastest interconnect between your GPUs or you don’t have that much GPU memory, every cycle, every time I’m loading the model weights from CPU onto GPU or something like that, that’s taking a lot of time. Then I do some computation and then I have to load the next layer and that takes some time. Then I do some computation. So that data transfer takes time. [46:22] Johno: which means that if I can find any way to do more computation before I then have to copy data across, that’s a good thing. And so this example here, this 3090 basement rig, you’ll notice that all the tricks in the book, quantization, using QLoRA rather than just LoRA, anything that we could do to use a larger batch size resulted in quite a significant speed up, right? Because now I’m doing fewer cycles overall because I can do 32 samples at once versus eight or something like that, right? [46:56] Johno: And this in general is going to hold true for a lot of training where you’re memory bandwidth constrained on slower machines, cheaper machines. And so then you’re really trying to think like, yeah, how can I optimize this? How can I maximize either the sequence length or the batch size that I can fit? And so all of these tricks come into play. [47:16] Johno: But then if you go and read through this post later on, we try some different machines and one that stood out was once you get to an H100 with the super fast like SXM is like the proprietary server version of NVLink, the communication between the GPUs is so fast, and the GPUs have so much memory, that you can keep the model weights in that fast memory. 
[47:38] Johno: And even if they’re spread across multiple GPUs, you can load them so fast that you haven’t even finished doing the computations for the last layer when the next layer is already loaded. And so suddenly you’re not memory bound at all, you’re compute bound. And in that case, if you do a batch size of eight versus a batch size of 12, you’re doing fewer steps, but you still have to crunch the same total number of numbers. And so the time doesn’t actually change that much. [48:05] Johno: And so in the benchmarking example, where we were looking at the axolotl version, where I said, hey, here’s a way to run experiments, increasing the batch size from 16 to 32 didn’t actually change the runtime that much, right? Sorry, from 32 to 64. And quantizing, it didn’t really change the runtime that much either. And the reason is, okay, quantized weights might be faster to copy from the CPU to the GPU, but that almost doesn’t matter because it’s all about like, how fast can you crunch those numbers? [48:37] Johno: So the like dark red part of my diagram. Whereas if this was on a slow machine or a machine with good GPUs but slow interconnect, it really does matter being able to fit a larger batch size so you can do fewer total batches, so you can do fewer total loads. So there’s this kind of juggling trade-off that one ends up doing. Okay, so I should see… There’s lots of rabbit holes we could go down further. [49:01] Johno: I should see what questions are in the chat that I can help rather than trying to finish everything I wanted to cover. Okay, does CPU offloading make any practical sense for training or only for inference? So I found it helps specifically in a case like mine. There’s no way I can fit a 70 billion parameter model on one GPU. And even though I have maybe several 3090s, if I spread the model weights over those GPUs, there’s very little overhead left for the actual data and the activations and so on. 
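To make the memory-bound versus compute-bound distinction concrete, here is a deliberately crude timing model. Every number in it is made up for illustration, and the function `epoch_time` is hypothetical. It assumes each step pays a fixed cost to move weights onto the GPU plus a per-sample compute cost, and that on a fast interconnect the transfer is fully hidden behind compute.

```python
def epoch_time(num_samples, batch_size, load_s, compute_s_per_sample, overlap):
    """Toy estimate of the time to process num_samples.

    load_s: seconds to bring the weights onto the GPU once per step.
    compute_s_per_sample: seconds of GPU compute per training sample.
    overlap: True if transfers hide behind compute (fast interconnect).
    """
    steps = num_samples / batch_size
    compute = batch_size * compute_s_per_sample
    per_step = max(load_s, compute) if overlap else load_s + compute
    return steps * per_step


# Slow rig: transfers are expensive and not hidden, so bigger batches win.
slow_bs8 = epoch_time(1024, 8, load_s=2.0, compute_s_per_sample=0.05, overlap=False)
slow_bs32 = epoch_time(1024, 32, load_s=2.0, compute_s_per_sample=0.05, overlap=False)

# Fast rig: transfers are cheap and hidden, so batch size barely matters.
fast_bs8 = epoch_time(1024, 8, load_s=0.1, compute_s_per_sample=0.05, overlap=True)
fast_bs32 = epoch_time(1024, 32, load_s=0.1, compute_s_per_sample=0.05, overlap=True)

print(f"slow rig: bs=8 {slow_bs8:.1f}s, bs=32 {slow_bs32:.1f}s")
print(f"fast rig: bs=8 {fast_bs8:.1f}s, bs=32 {fast_bs32:.1f}s")
```

In this toy setup the slow rig drops from roughly 307 s to 115 s when the batch size goes from 8 to 32, while the fast rig stays at about 51 s either way, which mirrors the pattern described in the talk.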
[49:33] Johno: And so even though it’s slower to copy the model weights from CPU, because I can fit a much larger batch, I only have to copy the model weights from the CPU once to then process like 32 samples versus having to process one or two samples at a time because there’s so little overhead. So yeah, CPU offloading does actually give you a reasonable speed up often if you’re in this case where you really just need more room. [49:59] Johno: As soon as you’ve got more capacity, you’re training smaller models or you’ve got 80 gig H100s at your disposal, then definitely not. So my recommendation actually, if you check out the bottom of that benchmarking post, there’s like a sequence of steps. Start with the default, right? None of these optimizations turned on. And then slowly go through, like turn on quantization and then see if you can increase the batch size as much as you can and then check, did that actually give you an improvement? Right? [50:25] Johno: If you’re in a FSDP scenario where you’ve got multiple GPUs, start with just data parallel, no sharding. Then, see if the sharding gives you an advantage. Then see if the full sharding and cross nodes, just like there’s a nice little sequence there that Kerem wrote up to say, yep, start with the basics, LoRA, no quantization, just vanilla data parallel if you’re on multi-GPUs. And then slowly add, if you find that you actually really have so little overhead that you’re needing a very small batch size, you can slowly start to add in. [50:56] Johno: Quantization, and then maybe CPU offloading, you know, definitely have gradient checkpointing, all these little tricks that I list, just turn them on one by one until you find like the sweet spot where okay, at this point, like I can now go fast. Could I share the notebook? I’ve put it in the discord, we’ll probably also then put it up as a gist or something like that and share that. Sweet spot with quantization between compression and accuracy. 
Yeah, it seems to be especially quantization plus adapters seems to be a really nice thing. [51:24] Johno: I think if you’re just doing quantization below maybe six or four bits, you start to see a drop. But if you can then correct for that with a LoRA that you train, you can go, it seems like maybe two bits is potentially doable. But four bits for me is like a nice default that’s just, okay, if I’m training four bit plus a LoRA, I can usually get the same performance I’d get without any quantization. And then as long as you’re keeping the LoRA in high precision. Yeah, that’s pretty nice. [51:54] Johno: Some discussion around gradient accumulation steps and micro batch size. Gradient accumulation is where if I want to do, say I wanted to target a batch size of 32, but I can only fit a batch size of eight on my GPU. What I can do is I can run four batches before I do an update step. And so then it’s effectively the same as having a batch size of 32 from the learning dynamics perspective. But I’ve been able to do it in four micro-batches. [52:26] Johno: So this is useful if you’re trying to target a specific, like, oh, I want to match what someone else did, but they had an A100 and I only have a 3090. And yeah, that’s a useful way if you’re… If you’re training at batch size 1 or 2, you probably want to be using gradient accumulation to get that to a slightly larger value. But you don’t need to go crazy. You don’t need to target a batch size of 1000. 32, 64, 16, these are all reasonable values, especially just for a little bit of LoRA fine tuning. [52:57] Johno: What is the x-axis on the plot? I think that’s probably the memory profile. That’s time. So that’s over time as we do more computation. You can see forward pass, backward pass. That’s kind of like the flow. So the data is going layer by layer through that transformer, and then we’re back propagating those gradients back. 
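The gradient accumulation point made a little earlier (four micro-batches of eight standing in for one batch of 32) can be checked on a toy problem. This is a pure-Python sketch with a one-parameter model and invented data; it is not how a training framework implements it. The key detail is scaling each micro-batch gradient by the number of accumulation steps so that the sum matches the full-batch mean.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Invented data: 32 samples from y = 3x + 1, and an arbitrary starting weight.
xs = [float(i) for i in range(32)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

# Gradient from one full batch of 32.
full = grad_mse(w, xs, ys)

# Same gradient from 4 micro-batches of 8, accumulated before any update.
micro, accum_steps = 8, 4
accumulated = 0.0
for i in range(accum_steps):
    chunk = slice(i * micro, (i + 1) * micro)
    accumulated += grad_mse(w, xs[chunk], ys[chunk]) / accum_steps

print(full, accumulated)  # the two values agree up to floating-point rounding
```

Only one micro-batch of activations needs to be in memory at a time, which is why this helps on a small GPU; the price is four forward and backward passes per update instead of one.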
If you have multiple LoRAs, do they update the same parameters or do they each update a different subset? You could target different ones, I guess. Usually, if I’ve applied several LoRAs, each LoRA specifies which layers it’s targeting. [53:32] Johno: Maybe it’s the up-projection matrix, the down-projection matrix, and the attention matrices. So you can look in the LoRA config. If people talk about having multiple LoRAs applied, they’re usually updating the same weights. Any other questions that you guys wanted to raise to the top? I’m looking at most highly upvoted on… I think we got this top one. I will also… I’ll keep an eye on the Discord. Sorry, I said I would be able to multitask. I totally couldn’t. But I will try and answer questions in that channel even going forward. [54:20] Johno: But maybe to summarize, like to step back and summarize, let me stop sharing my screen. Um… Napkin math for fine tuning. We’ve said we’re trying to juggle shuffling data around as little as possible and trying to get through all the computations we need to do this training. And so things that take up space in memory, the model weights, the gradients, et cetera, tricks that we have at our disposal. LoRA means we only need gradients for a small subset of the weights. Quantization means we can store those weights with fewer bits. [54:52] Johno: Experimentation is key to see where these trade-offs are. You can measure the amount of memory allocated. You can go and try and tweak these things and see experimentally how does this work. As your batch size or your context length increases, you need more space for activations, intermediate values. And so suddenly you’re not just dealing with the model and optimizer state. You suddenly have this extra component that scales as you go to longer sequences or larger batches. [55:16] Johno: So if you’re finding you’re running out of memory and you can reduce your context length, that’s fine. 
If you can’t, then you need to start looking at the other places that you can save memory, like quantizing the weights, like offloading them to the CPU. It’s possible to calculate all of these things, although my one attempt to actually show that maybe was off by a factor of two, because I forgot to account for the data storage type. But it’s often like… [55:43] Johno: rather than trying to calculate from a, here’s this big formula I found, calculating from a, look, I measured, and with a single layer, if I do the forward and backward pass, this is how much the activations took, this is how much the weights took. I can kind of guess, okay, for a 7 billion parameter model in 16-bit, that’s 14 gigs of memory. Then I need, the LoRAs are 1% of that, but I need, you know, activation there. Okay, cool, that’s maybe another gigabyte. [56:08] Johno: And then for every, like, thousand tokens I add to my sequence length, I need another gigabyte of memory, you can start to quite quickly get a feel for, okay, I can probably go from the one that I’m currently training at, I can probably bump my sequence length up to here before I need to think about buying a more expensive GPU or renting out. So this is the space that we’re operating in. [56:30] Johno: Lots of parameters to tweak, but hopefully this talk has given you a bit of a feel for where the key things are coming from, why people even care about things like LoRA, things like quantization, tricks like CPU offloading. Yeah, I hope that’s given you some context. I will be on the Discord for further follow-up.
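The closing rule of thumb can be written down as a few lines of arithmetic. This sketch only encodes the rough figures quoted in the talk (7 billion parameters in 16-bit is about 14 GB, roughly one extra gigabyte for the LoRA side, roughly one more gigabyte per thousand tokens of sequence length). The function names and default values are illustrative assumptions, and the per-token figure in particular should be replaced with something measured on your own model.

```python
def weights_gb(num_params, bits_per_param):
    """Memory for the model weights alone, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


def napkin_estimate_gb(num_params, bits_per_param, seq_len_tokens,
                       adapter_overhead_gb=1.0, gb_per_1k_tokens=1.0):
    """Rough total: weights + adapter overhead + activations.

    adapter_overhead_gb and gb_per_1k_tokens default to the ballpark
    figures quoted in the talk; measure your own before relying on them.
    """
    activations_gb = seq_len_tokens / 1000 * gb_per_1k_tokens
    return weights_gb(num_params, bits_per_param) + adapter_overhead_gb + activations_gb


print(weights_gb(7e9, 16))                # 14.0 GB of weights in 16-bit
print(weights_gb(7e9, 4))                 # 3.5 GB of weights in 4-bit
print(napkin_estimate_gb(7e9, 16, 4000))  # 14 + 1 + 4 = 19.0 GB
```

With numbers like these it takes seconds to see how far the sequence length can be pushed on a given GPU before quantization or CPU offloading becomes necessary.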