FSDP, DeepSpeed and Accelerate

fine-tuning
llm-conf-2024
Published

July 30, 2024

Abstract

Advanced techniques and practical considerations for fine-tuning large language models, comparing tools, discussing model precision and optimization, and exploring best practices for effective training and deployment.


Chapters

00:00 Axolotl vs. Hugging Face AutoTrain
Zach discusses the differences between Axolotl and Hugging Face AutoTrain.

02:06 Becoming an Effective LLM Engineer
Zach emphasizes the importance of hands-on experience in training models for effective learning.

07:13 Getting Feedback from Experts
Participants explore ways to reach out to experts for valuable feedback.

09:44 Datasets for Finetuning LLMs
Discussion on the use of synthetic data, Hugging Face datasets, and other sources for fine-tuning LLMs.

14:23 Advantage of FSDP
Zach explains how FSDP was crucial for fine-tuning Llama 3 across multiple GPUs.

15:22 Advantages of Using torch.compile
Torch.compile is highlighted as an optimization tool beneficial for both inference and training due to its dynamic operator fusion capabilities.

17:34 Training Inference Precision
Zach talks about the challenges of training models in bf16 precision.

19:18 Downsides of FSDP
Techniques like FSDP are crucial for running training or inference when the model is larger than the VRAM of a single GPU. DeepSpeed is an FSDP alternative that offers more flexibility.

25:24 Fine-tuning vs. Frontier Models
Discussion on the possibility of fine-tuned models beating frontier models.

27:27 Ensuring Proper Model Functioning During Inference
Zach advises using the Hugging Face pipeline to test the finetuned model to ensure proper behavior.

29:51 Running an 8-Billion-Parameter Model on a 4090
Participants discuss techniques like AWQ, TensorRT-LLM, and vLLM for running large models locally.

31:01 Training Models in INT8
Training in INT8 and FP8 is unstable, leading to a preference for BF16, though some companies are exploring FP8.

34:30 Accelerate Failure Case
Zach discusses the robustness and resilience of Accelerate in demanding workloads using thousands of GPUs.

35:58 Relevance of Chinchilla Scaling Laws
Chinchilla scaling laws remain relevant, and extended training offers significant benefits.

38:22 Relevance of TensorFlow
Despite its decline in popularity, TensorFlow is still supported by transformers, and there is a discussion on the future of JAX.

41:50 Training on Apple Silicon
Apple Silicon is suitable for inference but not ideal for training, with NVIDIA GPUs being recommended.

45:18 Serving Multiple LoRAs with Accelerate
vLLM and LoRAX can be used to serve multiple LoRAs.

48:56 Mixture of LoRAs
A discussion on using LoRAs in a mixture setting with a router.

51:35 Deciding on a Fine-tuning Project
Zach suggests recreating existing solutions for various problems as a good starting point, emphasizing learning from both successes and failures.

53:19 Choosing Model Sweet Spot
Zach recommends choosing a model based on available VRAM, while Hamel suggests using the smallest model that achieves the desired performance. Dan highlights the importance of considering inference cost over training cost.

58:31 Phi-3’s Popularity for Finetuning
Zach discusses the Phi-3 model’s lack of real-world performance, making it less popular for fine-tuning.

Resources

Links to resources mentioned in the talk:

Slides

Download PDF file.

Full Transcript


[0:03] Dan Becker: What are your feelings on Axolotl versus HF Autotrain?
[0:10] Zack Mueller: There’s a wink emoji. Who is this? Because it’s either my coworker or you know who from Axolotl. They solve two different problems in a way, right? Autotrain is agnostic in the sense of like train whatever you want with the data you have. And it’s also a slightly different API with what it looks for. And axolotl is very high level, train a model for basically text right now, as Wing was talking about earlier. Do it quickly, do it fast, following a pretty loose guideline towards what you can fit.
[0:53] Zack Mueller: So there are two different solutions to a problem that can have overlap.
[1:02] Dan Becker: That makes sense. I’ve never used HF AutoTrain, so I actually prioritized this question because I was hoping to learn, and I think I did learn, what Hugging Face AutoTrain actually does through your answer. It almost sounds like: is AutoTrain really doing more of the smart search-over-parameters part of the problem?
[1:32] Zack Mueller: Oh boy. The AutoTrain talk soon will cover that better, because I don’t know; my extent of AutoTrain is helping debug AutoTrain when Accelerate is involved. So that would be a good question to save for the AutoTrain talk, is my official answer.
[1:55] Dan Becker: Great. I’ll count that one as answered and I’ll let you start picking some.
[2:01] Zack Mueller: Sure. Since many of us are new to LLM fine tuning, how do you think the typical learning journey looks like until one becomes an effective LLM engineer? Just tweak with shit. Like just play with code, build models, do things. Because you can spend ages reading and reading and reading and reading and never touch code. And someone who’s just in their backyard doing these things for fun on the side will have more practical experience.
[2:32] Zack Mueller: That was sort of the situation three or four years ago when I was learning fast.ai, and I still believe it today. Train models, get comfortable with the code. The depth of expertise will come with respect to the amount of time you put in. Even if it’s just running the same Axolotl command and looking at the Weights & Biases graph for different datasets, you will learn something there that I probably won’t, because I haven’t done that enough yet.
[3:05] Zack Mueller: So just start playing with things and sort of the same advice that we started giving with Fast.ai back in the day. Make it public, make your learnings public, you know, your models available, your teachings available, and be open to criticism from the people that know a lot more than you to learn from it and be very humble in that regard, is I guess my answer to that.
[3:32] Dan Becker: I want to push back for a moment just to hit, like, maybe there’s something interesting in this. And that is when you are just fiddling, which I quite like doing, you learn by seeing what works or what happens and what doesn’t happen. And if we were doing, there are many domains and conventional machine learning is probably a great example where you try something, your score goes up or it goes down and you’re like, oh. this improved it or it didn’t improve it.
[4:04] Dan Becker: And with generative AI use cases, and I think language in particular, sometimes you can see what happened to your perplexity score, but it’s a little hard to know, like, do I like what just happened? Or like, what did just happen when I, I don’t know, changed my learning rate or changed whatever. Do you have any thoughts about how one gets feedback as they’re experimenting in order to make sure they’re learning?
[4:42] Zack Mueller: Yeah. Talk to the people that are doing the cool things. Just send them a DM and say, hey, I’m doing this. I don’t understand what this particular graph means or post it on Twitter. If you go through my last Twitter feed from the last month, I’m pretty sure I made a tweet that was quite literally just Hey guys, here’s my loss graph. Does this seem normal? And because if you’re only judging yourself and going off of yourself and your own knowledge, you’re going to think everything’s fine because everything looks normal, right?
[5:17] Zack Mueller: And so that’s where communities really matter, especially like this community that is now forming, because we’re all here to go learn this thing, right? So we can all sit there and go, hey, here’s my graph. What are your thoughts?
[5:32] Hamel Husain: And I think it’s also important to pick a small project. It’s really good if you have an LLM in production already and you have some traces, some logs about how your LLM is responding and what the user is asking and all of that. We’ll go through how to collect those logs a bit in the next lecture. But you can take those and just fine-tune a model. And start with a smaller model. Start with LoRA. It’s actually kind of hard to mess it up. It’s really nice.
[6:03] Hamel Husain: Like, LoRA on a pre-trained model, use Axolotl. You don’t necessarily have to get the hyperparameters exactly right. It kind of has this random forest feeling where you just point the model at the data and it kind of works. And if you get that positive reinforcement, it really helps if you get it fast.
[6:28] Zack Mueller: Yes. It’s easy. as I look at my graphs from the last week.
[6:34] Dan Becker: Yeah,
[6:35] Zack Mueller: you’re good. You’re good.
[6:37] Dan Becker: I was going back to our talk from when we were with Wing. Parts of it might be easy. The part that’s hard is actually dealing with tokenization and prompts and making sure that you do the same thing at inference time as training time. And that is like the opposite of… That’s very hard.
[6:58] Zack Mueller: I think that’s what I spent the last hour on: making sure that I could get my prompts right for inference after doing my Axolotl training. So it’s a learning curve, and some of it’s easier than other parts, but that’s part of the game.
[7:11] Dan Becker: There was a question in Discord, which was, and I want to say one thing before I hand it to you guys for an answer, something like: where do all the cool people hang out that I can use to get feedback? I think that despite the immense amount of hype, the community of people who are actively doing stuff and sharing it, people who are doing interesting things, is smaller than you would expect and is sort of in its infancy.
[7:48] Dan Becker: And as a result, the ability to get feedback is much, much greater now than it will be a year from now, when there are more people doing it and it’s harder to get people’s attention.
[8:00] Dan Becker: And so we could talk about Twitter and our Discord, a bunch of places that you can think of, but I just want to say, I think it is a huge, huge opportunity for us to, if I can sound like a salesman for a moment, get in on the ground floor. By which I mean, people who do cool stuff like Zach are not so inundated that they refuse messages from anyone or anything like that.
[8:25] Zack Mueller: No, this community, it’s been a very humbling experience, because I sort of wasn’t doing community stuff for a bit. We’re all people, you know, and a lot of us don’t have egos. And especially right now, while it’s a fine group of people, like you were saying, make use of that. Talk to the people that are coming out with these models and evals and ask them questions, in Twitter DMs or in replies. More likely than not, they’ll answer whatever your question is, as long as
[8:57] Zack Mueller: you phrase it right and don’t give them too little information.
[9:02] Zack Mueller: That’s a,
[9:03] Hamel Husain: But you definitely should put in the work. So don’t just DM someone with whatever; tell them what you tried, what didn’t work, what you researched and where you got stuck. Show the person that you are trying. From time to time, I’ll get a DM from someone who just didn’t try to read the documentation, read their error message, read anything, and in that case many times I have to ignore those. So don’t be that person.
[9:44] Charles Frye: i’d be actually super interested to hear zach’s answer to one of the most upvoted questions which are like what are some good public you data sets or Kaggle challenges for training or fine tuning LLMs. Like the Titanic data set or Iris or Penguins or whatever, but for LLMs.
[10:02] Zack Mueller: That’s a good question that I’m not 100% sure on the answer for, and I’ll sort of explain a little why. So my Lama 3 fine tune is based off of the StarCoder 2 self-instruct paper. And essentially what that is, is using an LLM, using the base LLM to generate your completion dataset for you off of existing code on GitHub on a dataset called the stack.
[10:30] Zack Mueller: So personally, if I were to choose a starting dataset, that one’s not terrible, because you’re generating it yourself, and you have known benchmarks with end-to-end, fully transparent pipelines. So you know what’s good, you know what’s bad, and you know roughly where you should end up. That’s at least one that I know of that I’m currently using to learn. But I’d be very interested to know what you guys also use as a gateway dataset and problem.
[10:57] Hamel Husain: I like the instruction tuning one a lot, the Phil Schmid one. I don’t know if I’m pronouncing his last name correctly. Whatever. Phil Schmid, he has a fine-tuning Llama blog post. I think instruction tuning is a really good gateway because it kind of gets your mind ready for fine-tuning.
[11:20] Zack Mueller: 100%. There’s a ton you can learn in that. Let’s see. What was…
[11:30] Dan Becker: The other thing: I need to put together demos for potential clients somewhat frequently. And if you just go to Hugging Face and do a dataset search, like if I wanted a dataset about discussion of chemistry topics, I can probably find one. And then it’s probably in a standard format, and then I just drop that link in the Axolotl config. I would say that the same way that Kaggle used to be, and maybe still is, for tabular datasets, the breadth of what’s in Hugging Face datasets is pretty great.
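For readers who want to try this, a minimal sketch of the dataset-search workflow Dan describes, using the Hugging Face datasets library (the dataset ID below is just a placeholder for whatever your search turns up):

```python
# Load a Hub dataset and inspect its schema before pointing a fine-tuning
# config (e.g. an Axolotl YAML) at it. The dataset ID is a placeholder.
from datasets import load_dataset

ds = load_dataset("some-org/chemistry-discussions", split="train")  # hypothetical ID
print(ds)      # column names and row count
print(ds[0])   # eyeball one record to see which prompt/response fields it uses
```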
[12:14] Charles Frye: Yeah. One thing I would say is that synthetic data is really nice for, like, one, there’s a practical use, which is taking a model that’s too expensive to run and getting it into a smaller model. And then two…
[12:29] Charles Frye: you have a very clear data-generating process that you’re trying to mimic, and you can run it, right? If you have a dataset you got from the internet, you can’t generate new data from it on the fly, ad hoc, and that’s really important for testing whether your training is working or not. And you can always compare to the model you’re executing: maybe you run inference on a 70B model and you’re fine-tuning an 8B or a 3B
[12:59] Charles Frye: model. And you have a really solid gold standard, because all you’re trying to do is mimic that thing. And then the second piece of advice is that you also want something that changes over time, like a dataset that grows over time. Because when it comes time to actually deploy machine learning in production, you’re going to have a stream of data, not a batch dataset that you just run on once. So, just for example, you might take tweets and have GPT-4 classify them.
[13:37] Charles Frye: And that’s your dataset. And then a week later, there will be more tweets that you can run through GPT-4 and classify. And now you can find out whether you have train-test leakage issues. You can find out if you’re overfitting to this week’s tweets when next week’s are something different. So it’s kind of similar to Numerai or some of the ongoing challenges where you need to actually learn how to model a stream of data.
[14:07] Charles Frye: And that also is going to really help you know what you need to do when it’s time to build a production ML application.
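As a rough sketch of the teacher/student setup Charles describes (model IDs and file names are illustrative, not a prescribed recipe): a larger model generates completions for a growing pool of inputs, and those pairs become the training set for a smaller model.

```python
# Hypothetical distillation-style data generation: a 70B "teacher" writes
# completions that an 8B model will later be fine-tuned to mimic.
import json
from transformers import pipeline

teacher = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # assumed teacher; needs serious hardware
    device_map="auto",
)

raw_inputs = ["Explain what FSDP does.", "Summarize this tweet: ..."]  # grows week over week

with open("synthetic_train.jsonl", "a") as f:
    for prompt in raw_inputs:
        completion = teacher(prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```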
[14:21] Zack Mueller: Let’s see. One of the questions I was interested in, in my experience, it’s been a pain either to use accelerate or torch. Can you clarify the best practices when using it? And is it worth using it when running DDP, FSDP or deep speed? I just remember when torch compile launched, it was all hype. But then there was no example in Excel all these compile. And does it have any benefit for VRAM? Yes, I would not be able to do my llama three billion fine tune on 240 90s if I did not have FSDP available.
[14:52] Zack Mueller: Because essentially what FSTP allows is your 24090s are not two 24 gig cards. They are, in a light sense, 148 gigs. And so it’s critical that you do that, which go back through the talk. I’m pretty sure I talked very explicitly about the difference in memory usage between the three of those. Let’s see.
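A minimal sketch of the sharding idea (not Zack's exact setup): with PyTorch FSDP, parameters, gradients, and optimizer state are split across ranks, so two 24 GB cards behave more like one roughly 48 GB pool, minus communication overhead. build_model is a stand-in for your own model constructor.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")      # e.g. launched via: torchrun --nproc_per_node=2 train.py
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = build_model()                # hypothetical constructor for the 8B model
model = FSDP(model, device_id=rank)  # shards params, grads, and optimizer state across GPUs

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```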
[15:22] Charles Frye: I guess the question was also kind of about Torch compile, if I understand it right. It was like using Accelerate with compile, using Axolotl with compile. And I guess my feeling is that.
[15:33] Charles Frye: compile is a little bit more of an inference-time optimization than a training-time optimization, where you have a very fixed graph. At inference time, your batch sizes are often much smaller, because you’re only getting one request at a time, so the overhead from running dynamic graph generation is too high, and you’d rather switch to a static compiled graph. That’s my impression from how I’ve used the tools.
[16:09] Zack Mueller: That’s certainly how it used to be. That is not where PyTorch is going. PyTorch wants Torch compiled, which mixed, right? Because it’s compiled, it can have issues. For instance, they have a library called Pippi out right now that is native Torch pipeline parallelism for training. that relies heavily on Torch Compile. So that’s where things get weird and sort of experimental. And I expect that landscape to change dramatically over the next like three to six months. So for right now, I believe you can train on compile.
[16:51] Zack Mueller: How well it works, a lot depends on sort of how the model is laid out, I found, with just where people’s issues have come.
[17:00] Charles Frye: So the benefit of train time is that you’re getting dynamic operator fusion?
[17:05] Zack Mueller: Yes. So it speeds up train time. Mostly it’s throughput optimizations, more than memory optimizations.
[17:16] Charles Frye: Yeah, like kernel fusion?
[17:18] Zack Mueller: Yes. Yes.
[17:21] Charles Frye: Got it.
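A small sketch of what training with torch.compile can look like, per the discussion above; the win is mostly throughput from operator fusion, not memory. build_model and dataloader are stand-ins.

```python
import torch

model = build_model().cuda()   # hypothetical model constructor
model = torch.compile(model)   # first steps pay compile overhead; later steps run fused kernels
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:       # hypothetical dataloader yielding dicts of tensors
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```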
[18:08] Zack Mueller: BF16 Here’s sort of an interesting one. I have a question about training inference precision. I’m training on 3090s at home or thinking of renting an A100 on Azure. However, my inference setting T4s do not support BF16, but my trained model, trained based on DeepSpeak uses BF16 seems to work poorly in FP16. Any recommendation on here on how I can increase throughput? I would use vanilla Hugging Face inference, which has fake BF16. but the throughput is painfully slow. You’re gonna get painfully slow. You know, it’s, it’s, they’re optimized for for a very good reason.
[18:12] Zack Mueller: And that’s especially a scary place to be in because we had a lot of issues with models that were trained fully in BF16, which broke entirely when they were upcast to be to FP32 again. That’s why Trainer uses an autocast form to train everything in bf16 so that way it can uh be upscaled back easily so uh but normal fp16 is and same might come in and smack me for this but in my experience usually a little less uh optimized than bf16 will be so it will be slower Let’s see what else we have.
[19:17] Zack Mueller: Ah, I like this. I like anti-questions. What are the main downsides with FSDP? When is it really not the best tool for the job in training and fine-tuning, across hardware configurations, LLMs, diffusion models, and vice versa? Basically, FSDP really shines when you’re training big models, or if your model barely fits on one GPU and you have two available. Because now, essentially, what you can kind of imagine is that the model is fully on one GPU and the gradients and everything else are on the other GPU.
[19:57] Zack Mueller: And so you have a ton of extra VRAM that you can go use. Now, FSDP versus DeepSpeed: DeepSpeed is very configurable. It has a little more freedom than FSDP when it comes to offloading certain things and having certain device configurations. FSDP takes the all-or-nothing approach. So, I’m pretty sure with DeepSpeed you can specify that just these layers should be offloaded to the CPU, so that you just barely don’t hit out-of-memory. With FSDP, it’s all or nothing.
[20:32] Zack Mueller: It’s little situations like that. Does it all fit in memory? Use FSDP. If you have to do a little bit of offloading, you might look at DeepSpeed, be that ZeRO-3 or ZeRO-2, and see what optimizations could go there.
[20:47] Hamel Husain: One question I have about that, and this is always the debate, is like, okay, if you’re GPU poor and using your own rig, there’s some fair amount of debate of like…
[21:02] Hamel Husain: how much does the lack of NVLink destroy performance? And if you Google for this topic, there’s very sparse information. There’s one Hugging Face page where Sylvain did an experiment one time, and Sylvain, and maybe even Stas, showed some 25% degradation, but then there’s some debate about whether that’s the right number and whatever. And then there’s a Tim Dettmers blog that just flatly says you don’t need it.
[21:38] Hamel Husain: And then there are other people who say you do need it, or that it does make a huge difference. So I’m just curious. Yeah, I don’t know the answer either.
[21:47] Zack Mueller: So it’s, NVLink doesn’t work for 30, for 49 days. And I believe maybe barely works for 30, 90s. I forget, but, um,
[21:59] Hamel Husain: That’s true. Yeah. So if you’re GPU poor, it would probably, yeah, yeah, yeah. It’s like you’re stuck in the 90s. Yeah.
[22:04] Zack Mueller: It’s only 3090s, which is why for a while, 3090s were sort of the golden child. I don’t have 3090s. I could find 4090s before I could find 3090s. That’s how insane it is. So the exact numbers on stuff like this is like, you know, Tim Detmers talked about how consumer cards are throttled a bit on purpose with their drivers. And NVIDIA knows about this. And so when I was doing the math recently to get like a new A4500 at the same VRAM as the 4090 was an extra $500, I think, five or 600.
[22:44] Zack Mueller: So it was like a tier up, because you’re already paying, you know, $1,800 for a, well, it was about the same ish. I think it was like 24 or 2500 for the A4500. If I had to build my rig again, I would probably buy the card.
[23:01] Hamel Husain: You would probably buy what card?
[23:03] Zack Mueller: I’d buy the A4500. Because it is a thing where like, more than the performance hit, the power usage.
[23:13] Hamel Husain: And the heat and everything, yeah. And the space.
[23:15] Zack Mueller: It’s a fraction. Yeah. Right? Like, I have two water-cooled 4090s, and I have to have a giant case because of it. Versus it, I could fit four A4500s in that case. I’m pretty sure. And because I think they’re single shot slot, if not, they’re too. Yeah,
[23:31] Hamel Husain: they’re very nice and slim.
[23:33] Zack Mueller: Yeah, they’re super slim. And they use like a fourth of the wattage. So and also they have more CUDA cores available to them for actually to be used by for training. However, if you can’t find them, which a lot of people can’t, you can get by, right? I’ve done my job. granted my job has had limited training but i’ve done my job with just two 40 90s to do four iterations on laura tuning uh llama 3 8 billion it’s about somewhere between four to six hours to get full training done yeah um it’s
[24:17] Hamel Husain: i think it’s a pretty good car yeah it’s a pretty nice yes card for sure like yeah it’s like small and yeah makes sense it works and it’s
[24:25] Zack Mueller: you know it’s one of those things where it’s like if you can find it i’d probably do it but if not like you’ll be okay it sucks because nvidia is sort of squeezing us out but yeah we’ll find a way they’re waging a war on the gpu ports it’s
[24:42] Hamel Husain: great we love it really bad um i like having the rig though oh it’s great for some reason like i just you Yeah, it’s kind of crazy. It’s 2024. I still like having a rig, but I still like having a rig.
[24:56] Zack Mueller: Well, it’s because you own it. You own it. I don’t know.
[25:00] Hamel Husain: It’s just like, it’s still a little bit easier for me to like tinker around.
[25:05] Zack Mueller: Yeah. And you know, you never get bored if you just sit there and upgrade computer parts every once in a while. Gives you a fun side hobby to do while you stare at the pretty fans. RGB or not? Let’s see. We want to know if fine tuning can ever catch up to frontier models. That’s a fun question.
[25:31] Hamel Husain: Well, if you look at Teknium’s models, he has this community of fine-tunes, and they exceed the base on many benchmarks because he tries to fine-tune them broadly. It’s almost like continued pre-training, in a way, what he’s doing.
[25:46] Zack Mueller: Exactly. And so it’s largely on the community, right? And so if we can go through and make those fine-tunes and make them competitive, that’s great. In some ways, we’re also fighting a losing war because of data, right? Because the closed-source people have access to a ton more data than we do. But, you know, as you said, Hamel, we’re kind of keeping pace as best we can.
[26:18] Zack Mueller: Also, notice this in the Discord, someone getting excited because they never heard about the A4500s before and the price, how it looks affordable because it’s close to the 3090. Make sure you’re looking at the right one. The older version is 20 gig. The new version is 24 gig. Either or, again, is fine in my opinion. I almost went with the 20 gigs instead of the 24 gigs. And I believe Sam actually recommended the 20 gigs for me instead of 4090s back in the day when I was building that. But here we are.
[26:49] Zack Mueller: I had to do some debugging with FP8, so I needed 4090s.
[27:02] Charles Frye: Wait, sorry, the A40, oh, the 4090s do have FP8, but the A4500s don’t because they’re Ampere series. Okay, yeah.
[27:10] Zack Mueller: The prior ones. There’s A4500s that are ADAs.
[27:16] Charles Frye: And they didn’t make it in H?
[27:18] Zack Mueller: Nope.
[27:20] Charles Frye: You hate to see it.
[27:21] Zack Mueller: So, but that’s the newer version. That’s the 24 gig that has the ADA capability. What are some ways to ensure that prompting and tokens work at… inference time. That’s a fun one. I was just looking at that. The best thing I found to do is upload your model to the hub or just have it locally and load it in a pipeline. And do, I believe it’s model.generate or even just calling the pipeline itself. There’s a phenomenal guide, which I will link here. Well, I guess I can’t link it there. I’ll link it in the discord.
[28:01] Zack Mueller: that talks about chat templating. And it was the exact debugging guide I used when I was looking at trying to find problems with inference when it came to taking these trained models and trying to look at how are my tokens looking? Let’s see.
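A sketch of the sanity check Zack describes: load the merged fine-tune, apply the tokenizer's chat template, and generate through a pipeline to confirm the prompt formatting matches training. The model ID is a placeholder for your own Hub repo or local path.

```python
from transformers import AutoTokenizer, pipeline

model_id = "your-username/your-finetune"   # hypothetical
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model_id, tokenizer=tokenizer, device_map="auto")

messages = [{"role": "user", "content": "What does FSDP do?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)   # inspect the exact template and special tokens being sent
print(pipe(prompt, max_new_tokens=128)[0]["generated_text"])
```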
[28:29] Hamel Husain: Yeah. I mean, the Hugging Face chat templates are pretty cool, but I guess I actually don’t understand: can you use those directly in Axolotl? I’m not sure.
[28:41] Zack Mueller: No, you can’t. That was actually a big thing I was talking about with Wing, because Axolotl doesn’t do that. And so you essentially have to copy and paste your entire chat template with the question you want to answer, which was very concerning for me, because I felt like I was doing something wrong; I felt like that should be magical.
[28:59] Hamel Husain: Yeah, it does feel like the right design pattern in the future is like maybe standardizing on something like that, like the Hugging Face chat template. Maybe that’s what you’re advocating for in the Discord.
[29:12] Zack Mueller: Yeah, I was talking with him about that. And it’s like, that’d be a good thing at some point that even I might look at doing. Because that’s, you know, when you… merge your PEFT model, one thing I immediately wanted to do was test the weights to make sure that everything worked. So that’s huge.
[29:28] Hamel Husain: Yeah. And like, yeah, if you could do that, it would actually get rid of lots of spaghetti code. Because like a lot of the spaghetti code is around the prompt templating craziness. But the masking part, I don’t know the input output masking. Okay, maybe you can’t. Maybe it’s like, yeah.
[29:52] Zack Mueller: What do I need to do to run inference on an 8-billion model using a single NVIDIA 4090? My current calculations show a minimum GPU requirement of 13 gigs. That’s about right. If you’re doing half precision, you can quantize. You can also accept it being a bit slower and offload onto the CPU. It’ll be a bit slower, but you can run it. And that’s just device_map="auto" whenever you’re bringing the model in from transformers.
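A minimal sketch of that: half precision plus device_map="auto", which lets Accelerate place layers on the 4090 and spill the rest to CPU RAM if the model doesn't quite fit (slower, but it runs). The model ID is an assumed example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed 8B model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # ~2 bytes per parameter; quantization shrinks this further
    device_map="auto",           # places weights on the GPU and, if needed, the CPU
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```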
[30:26] Hamel Husain: Yeah. I mean, from a user-experience standpoint, the sweet spot, I would say, is vLLM with AutoAWQ or some quantization method like that, and it’s nice. You know, you can always get faster with TensorRT-LLM, but then you have to suffer.
[30:45] Zack Mueller: I quickly fell in love with vLLM. That’s what the StarCoder2-Instruct people used for their dataset generation. It was like, holy crap, this is easy to use. You just point it to the folder and it goes.
[30:59] Hamel Husain: Yeah, for sure.
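A sketch of that vLLM-plus-AWQ sweet spot, assuming you already have an AWQ-quantized checkpoint (the repo name below is a placeholder):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/llama-3-8b-instruct-awq", quantization="awq")  # placeholder repo
params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain LoRA in one paragraph."], params)
print(outputs[0].outputs[0].text)
```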
[31:01] Zack Mueller: Why aren’t models trained in int8? That’s a good question. Because it’s very unstable. NVIDIA has come out with something called Transformers Engine or, oh, there’s another one that I can’t remember off the top of my head. MS AMP. And they are experimental. You can train in 8-bit. Practically, from what I’ve heard just around the bend, is people have tried it, they still go back to BF16. It’s just better. That’s part of why I’m doing this Lama 3A billion experiment, is I want to go see it for myself. But that’s just what I’ve heard.
[31:42] Zack Mueller: I know PyTorch, though, is adding official support in. And so maybe they figured out some secret sauce with quantization on the fly plus native FP8. I don’t know. I haven’t looked too much into that repo yet, but I’m hoping that’ll change a bit in the future. For right now, it’s just not realistic from a quality perspective.
[32:08] Charles Frye: And that’s fair. You were answering that both for Int8 and for FP8, because you mentioned the transformer engine, which is FP8, not Int8,
[32:16] Zack Mueller: right? Yeah, it’s one of those weird things, right, where it’s Int8 for quantization or FP8. So I was answering…
[32:23] Charles Frye: For FP8, aka native 8-bit. Got it. And so, yeah, I’ve also heard reports from people about instability and difficulty using FP8, which is why it’s kind of surprising that they doubled down with FP4 in
[32:42] Zack Mueller: the Blackwell architecture, right? Do you have any thoughts about that? Yeah, I have thoughts. How much of that I can say, I don’t know. But it’s definitely interesting, because clearly they have some sort of special sauce that lets them train models decently well, because otherwise they wouldn’t be doing it. The thing that’s weird is we can’t necessarily recreate it, you know? Because otherwise everyone would be doing FP8, if it was truly as good as advertised. So there’s some other stuff happening on the back end. I would love to know,
[33:20] Zack Mueller: when the Llama 3 paper finally comes out, what they trained it in, because they have all those H100s, which are optimized for FP8. And so I would love to know if they actually were able to train it in FP8 well, or if it wound up having to be BF16, which is a bit of a misnomer; it’s BF16 with FP8, essentially. So yeah, it’s definitely a wait-and-see sort of question.
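For anyone who wants to experiment anyway, Accelerate exposes FP8 through Transformer Engine or MS-AMP; a hedged sketch, assuming one of those libraries is installed, the GPU supports FP8, and the build_* helpers stand in for your own setup (bf16 may still end up being the stable choice, as Zack notes):

```python
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp8")   # instead of "bf16"
model, optimizer, dataloader = accelerator.prepare(
    build_model(), build_optimizer(), build_dataloader()   # hypothetical constructors
)
# ...train as usual; the FP8 casting happens inside the prepared model
```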
[34:02] Hamel Husain: Amon wants to know if you can explain the difference between RLHF, DPO, and SFT.
[34:09] Zack Mueller: I cannot. I have not touched that library at all, sadly. but I think maybe one of my coworkers will be here that talks about it. I’ll double check the list, but at the very least, I know nothing about that. I’m sorry.
[34:28] Hamel Husain: I don’t know too much about that either, to be honest.
[34:32] Zack Mueller: What have you seen in terms of reasonable-scale applications and projects where Accelerate failed, and what was missing that forced you to change or update Accelerate? Not a lot. So in the presentation, I mentioned how lucidrains uses Accelerate for most of his projects. What he does is recreate closed-source papers and trainings, fully from scratch, and he’s orchestrating these trainings on over 1,000 GPUs. Accelerate can’t fail, per se, because all we are is a light wrapper around what everyone else is orchestrating.
[35:16] Zack Mueller: So it’d be very rare for Accelerate to fail in a few ways. The only thing I think I might have seen… is occasionally like weird timeout things and timeout issues that I still haven’t diagnosed the cause of and I’m still not 100% sure if it’s Accelerate because training and just distributed is just hard. You get weird issues that you don’t know the answer to that you got to go figure out. Your best advice is the experts at PyTorch, if not NVIDIA engineers. So that’s a good question.
[35:56] Charles Frye: There’s one I thought was interesting and I’d like to hear your take on it, Zach. Are the Chinchilla scaling laws still relevant and can we use it, them or other scaling laws to estimate which models will benefit the most from additional training?
[36:14] Zack Mueller: That’s a fun one because I’ve been mildly keeping an eye on that space. So for a while we thought, yes, right? Chinchilla scaling laws are sort of the end all be all. They still kind of are, except what Lama3 showed us and what all these other labs are showing us as well is you can still just keep training. You can still just keep training and do good. But I still think the scaling laws matter.
[36:43] Charles Frye: Yeah, I like the scaling laws tell you how to optimally allocate a fixed number of flops between parameters and data. And like for that, they are still.
[36:54] Charles Frye: like the there’s been like revisions and some window dressing around chinchilla but like for that they’re still correct but they do not tell you how to get the best model because the answer to get the best model is always to just continue training until it has converged right and so i but i think what’s maybe interesting for people who are doing continued pre-training and fine-tuning is that Some empirical evidence and most people’s intuition is that models that are under-trained, like Chinchilla models or the original Hoffman scaling law models, and anything but Lama 3, bottles that
[37:37] Charles Frye: are heavily under-trained should be more steerable and easier to fine-tune. Because there should be more slop in the weights. I don’t know to what extent, Zach, that has been your experience.
[37:49] Zack Mueller: No, I can agree with that thinking. I don’t have…
[37:51] Zack Mueller: comments on Llama 3 yet. The only comment I do have is from looking at my base checkpoints. I actually went a bit over in my PEFT tuning compared to what I think you guys recommend; usually it’s one to two iterations through the data, and I trained for four, if not five. The fourth or fifth one showed the best results while still improving my eval loss. So at least for 8 billion, there might be some space there still. But otherwise, fully agree. Is it fair to say that TensorFlow isn’t really relevant for
[38:24] Zack Mueller: any of this, Hugging Face, Accelerate, etc.? It seems like everything is built on Torch now. Yes and no. TensorFlow, and especially Keras, is still very much a backend in Transformers. Support for it has died down a little bit, but now we have a few people working directly on it, and so it’s starting to go back up. But yes, the general trend for the last few years has been that PyTorch started out for research, and then people just started preferring that framework a lot for a lot of things.
[39:05] Zack Mueller: No, I’m not going to say one is better than the other because they’re both equal. You know, it’s just everything I do is in PyTorch. That’s how I got started and I’m still in PyTorch. Kind of how it is. And I’m sure you guys feel mildly similar about that.
[39:22] Dan Becker: It seemed like Jax was going to gain a lot of popularity, and then I never hear people talk about it anymore. Is there much happening that you see in Jax?
[39:36] Zack Mueller: Kind of. The fun with Jax.
[39:39] Hamel Husain: The danger zone. Danger zone. Some person’s going to come in.
[39:45] Zack Mueller: Shoot me. You’re allowed to take me to the back room. It’s fine. It’s Google. At the end of the day, it’s Google, you know? So take it with a grain of salt; they rewrite everything they do every few years. You know, it depends. It’s a risk. It’s a gamble. And that’s sort of, I think, part of why people are hesitant to go use it. And also, I don’t know, it’s like with XLA. For a while, it was just on TPUs, and only researchers had access to TPUs.
[40:17] Zack Mueller: And so you didn’t really do a whole lot with XLA unless you were a researcher. And they’ve only recently brought it over to GPUs, and it works. It’s fine. It’s just weird, because I… wouldn’t have thought that they would have moved over to GPUs. So it’s kind of a situation like that of, it’ll be neat to see where it is in three years, if it’s still a thing in three years, and if it is, then that’s probably a good sign. Yeah.
[40:40] Charles Frye: I think also it is a very beautiful framework for thinking about, it’s not just an auto-differentiation framework, it’s like a program transformation framework. And so it’s like in some galaxy brain way, it’s like the right way. to do this. However, that makes it a lot harder to use. Like, like, it’s, it’s, you know, like pure functional programming when you get down to it.
[41:07] Charles Frye: And a lot of people who are researchers and tinkerers and data scientists are not comfortable with like, managing an RNG monad, you know, and like, Jax kind of like puts that stuff in your face.
[41:20] Hamel Husain: You just convinced me not to use it.
[41:24] Charles Frye: I don’t like that sort of thing.
[41:28] Hamel Husain: Well, some of us enjoy the nerdy stuff. I enjoy nerdy stuff too, but sometimes you’ve got to pick and choose. Let’s see, any
[41:45] Zack Mueller: other interesting ones? Any particular tips or limitations for training on Apple Silicon? That’s a fun one. I can talk about this since PyTorch admitted it; Soumith tweeted about this last month. Apple Silicon with PyTorch is okay. Inference, it’s great. Training, you’re looking at three different lower-level libraries that all try to do the same thing, each of them with varying degrees of success. One of them is PyTorch’s, and they have admitted that theirs is one of the worst ones.
[42:33] Zack Mueller: They’re looking to revamp that, get more into it, and actually invest time in it in the next few months, I believe Soumith said. So for right now, it’s good for inference; training... I’m sorry if you can hear my cat. Perfect.
[42:54] Hamel Husain: It’s not a big deal. Don’t worry about it.
[42:55] Zack Mueller: Okay. For training, it’s mixed. Take it with a grain of salt and try with an NVIDIA GPU if you have one. MacBooks are cool. I refuse to sign on to that. And so thankfully, I don’t have to sign on.
[43:11] Hamel Husain: Look, so much stuff can go wrong. I’m like, I’m just going to go the paved path here. I’m using NVIDIA GPUs, using Axolotl. Yeah. You know, something battle tested, like, because I don’t have time. Like, there’s so much stuff to do. Just barely have time to fix the data, usually.
[43:31] Charles Frye: Yeah, and I think it’s unlikely that Apple Silicon is going to become a preferred target for training because it’s like a system on a chip, CPU, GPU together. There isn’t something that looks like a server card with fast GPU interconnect. And additionally, on top of that, you need fast communication between each server blade.
[43:54] Charles Frye: and that’s just not something where apple has done work so like in 10 years we might be looking back at this the way somebody 10 years ago would have been like apple’s not making a gpu um you know apple’s not going to make their own hardware so like in the foreseeable future there is not the infrastructure in place to make it a really great target for training um so zach do you agree
[44:26] Dan Becker: We can’t hear. I didn’t hear what you said, Zach.
[44:28] Zack Mueller: Oh yeah, sorry. I muted my mic because Max was right here. I agree with that. The biggest sign for me: if you’re not familiar with Kache on Twitter, who writes Dingboard, I love him. He’s phenomenal. He’s hilarious, but also has really good insights. He was this close to executing on a cluster of the new Mac workstations to get the total VRAM. And then he said, no, actually, it’s much better if I just stick with NVIDIA.
[44:56] Zack Mueller: He did a total 180; he was about to buy it all and then went, nope, actually going with NVIDIA, because everything’s too unstable, I think. Yeah, similar story with George Hotz, right, and the AMD chips. Yes, AMD is a whole different ball game, but yes. How would you serve multiple LoRAs with Accelerate? For inference, have you thought about how you could hot-swap LoRAs in and out? I don’t think we’ve done that yet. I don’t necessarily know of anyone that’s done that yet. It’d be a neat problem.
[45:34] Charles Frye: Does vLLM not have that?
[45:36] Zack Mueller: I don’t know.
[45:38] Charles Frye: You can load several LoRAs. Yeah,
[45:40] Hamel Husain: it does. You can?
[45:41] Zack Mueller: Oh,
[45:42] Hamel Husain: nice. It has a hot-swap thing, kind of. You can go between LoRAs. It’s pretty cool. It’s like you load a new batch.
[45:48] Charles Frye: Or no.
[45:50] Hamel Husain: I didn’t try that part.
[45:51] Zack Mueller: Yeah. You’re slowly making me love it.
[45:54] Charles Frye: Yeah.
[45:54] Hamel Husain: But it’s really cool, actually. I can pull up the docs real quick, actually.
[45:58] Zack Mueller: Yeah, that’s fascinating, because that was a thought I had, I think, after the class on earlier this week, where I was like, huh, that’d be a fun experiment to see if we could go do. Someone go do that, document it, and write a blog. Tell us what happens.
[46:15] Dan Becker: I think that LoRAX does that.
[46:21] Hamel Husain: OK, so you see this one? So basically, using LoRA adapters, and then you can serve LoRA adapters. Just a second. Yeah, so you have these LoRA requests; you can have different adapters. I don’t know how you would hot-swap. Yeah, I guess because you already, yeah. At inference, in the generate call, you can just decide to pass in a different adapter.
[46:55] Hamel Husain: Apparently. Did you click that multi-LoRA inference.py file? Yeah, I remember looking at this in the past and not quite figuring out whether they would be hot-swapped. Yeah, I don’t know, I haven’t tried this yet. Maybe it’s not hot-swapped; maybe I’m just not remembering. We’d have to go into it, because they’re sort of presenting it as though it’s a separate model, and I don’t know. I guess you’re loading the model separately, yeah, you see. And then it’s only in the
[47:28] Hamel Husain: whatever forward pass that you are then passing the adapter. So I guess that’s hot-swapping. I might not know that; maybe I’m just assuming what the word hot-swap means.
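For reference, a sketch of the vLLM multi-LoRA serving being discussed here: one base model is loaded with LoRA support enabled, and each generate call can pass a different LoRARequest, so adapters are effectively swapped per request. The adapter names and paths are illustrative.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True)  # assumed base model
params = SamplingParams(max_tokens=128)

sql_out = llm.generate(
    ["Write a SQL query for monthly revenue."],
    params,
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql-lora"),        # hypothetical adapter
)
support_out = llm.generate(
    ["Summarize this support ticket: ..."],
    params,
    lora_request=LoRARequest("support_adapter", 2, "/path/to/support-lora"),
)
print(sql_out[0].outputs[0].text)
```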
[47:40] Hamel Husain: I mean,
[47:41] Charles Frye: there’s like there’s like versions where you need to merge it into the model weights. And that’s, you know, that’s fine. It’s nice to not have to merge it in the model weights because it’s just like easier to manage the artifacts. And then the next optimization on top of that is as a batch is going through, you route some rows in the matrix through different loras. And that’s the really cool thing about loras, especially if you’re serving many people’s fine-tunes. You have a fine-tune for every user, you have a fine-tune for every region, whatever.
[48:14] Charles Frye: It’s great to just have your batches just get collected together, and you get most of the throughput benefit of the… of…
[48:21] Charles Frye: batching for the large model with all the customization of LoRAs. So that’s the full, per-request LoRA parallelization. That’s why it’s so nice to have these parallel adapters like LoRA, instead of sequential adapters, like replacing the language model head or putting an adapter at the end of every linear layer or whatever.
[48:47] Hamel Husain: Makes sense. Don’t pay attention, I was just scrolling around.
[48:55] Charles Frye: Yeah. Somebody asked about a mixture of LoRAs. I guess that would be live routing between LoRAs as if they were experts. I haven’t heard of that being used.
[49:09] Hamel Husain: That was like a fun idea.
[49:11] Charles Frye: Yeah. I guess, yeah, you need to train them together the way you train experts together.
[49:19] Zack Mueller: Because you need routing between each one. Well, what was the Kraken model that just came out, that has like six different models, each on completely different tasks, to
[49:35] Hamel Husain: help route through things? Oh god, please tell me about it. You could make a really janky version of it, like, just have a classifier or something to route. I think that’s a separate model, half of what that was. Isn’t the MoE thing trained to route? In the network itself, it’s routing
[49:57] Charles Frye: to a different path more directly than, you know, going outside. Yes. Okay. Yeah, the person asking about mixture of LoRAs, I think, is posting about it in the Discord, saying: train LoRAs separately, get a router. Yeah, a LoRA mixture of experts. Kraken LoRA.
[50:26] Zack Mueller: The picture is quite literally Cthulhu with a bunch of different fingers pointing to different LoRAs that perform different tasks.
[50:33] Charles Frye: Nice.
[50:35] Zack Mueller: It’s a collection of experts is what they’re calling it.
[50:39] Charles Frye: Oh, boy.
[50:41] Zack Mueller: Yeah. It combines the best Python, SQL, function-calling, reasoning, and German models by applying dynamic LoRAs, so far.
[50:53] Charles Frye: Oh, I see. The top tentacle is holding some muesli, I guess. That’s the German one.
[50:59] Zack Mueller: Ah, that makes sense. Yeah, dynamic model routing: it uses a sequence classification model to route inputs to the most suitable language model based on their characteristics. Fascinating. I did drop that in the Discord in case people are getting FOMO. How do you decide on a fine-tuning project? Even this Llama 3 8-billion experiment you’re running now, what led you to say, let’s do it? I read the paper, it seemed neat. It seemed like something that was reasonable for me to go and do. The steps were all right there.
[51:52] Zack Mueller: It was something where I was like, okay, that’s fine. It’s a pet problem. It doesn’t have to be unique. You could be just recreating what other people did. I think that’s what one of my winning solutions at a little hackathon back in college was: I was quite literally taking the ULMFiT lesson from Jeremy’s fast.ai course back in the day and applying it to news articles. And then we just served it. And that, all in all, took like a day and a half.
[52:20] Zack Mueller: And so generally, when I think of ideas, it’s what interests me and what’s relevant to what I’m doing right now, and I try to find a connection there, either to previous work that people have done, especially if it’s open source, or to data, and just go with it. One thing I like: there’s a quote from one of the guys on Mythbusters, and it basically goes, the difference between just messing around and science is writing it down. Even if you’re just messing around with data, write it down. Great, now it’s a blog.
[52:55] Zack Mueller: You know, model fails? Great. That’s fine. Write it down. That’s a lesson learned. And all it does is bolster you and your experience. There’s another question similarly that you guys might actually have some thoughts on. When you work on a fine tuning project, do you typically get constraints for inference time and memory consumption when you start? Or how do you choose the sweet spot of model size and performance? My first guess would be go off of your budget. That dictates how much VRAM you have available.
[53:41] Zack Mueller: And then that dictates what kind of model you can train, right? On two 4090s, I can’t train the 70-billion Llama, but I can train 8 billion and smaller. And time, again, is cost. It’s free for me to just run that for two days, or I can pay $80 and spin up an H100 and get it done in an hour or two. But that’s my thought. What are your thoughts on that?
[54:09] Hamel Husain: Yeah, it’s similar. Okay, like the honeycomb example, which I’m going through in this course.
[54:14] Hamel Husain: In that situation, the incumbent model was OpenAI, right? So you look at their latency, their cost, whatever, and their latency and their cost are pretty good; it’s hard to compete with that. Now, if you’re going to replace it, well, humans have a cognitive bias, by the way: once you already have something, you value it more. Same thing with models. A lot of times it has to be markedly better for anyone to really notice. I mean, you can
[54:48] Hamel Husain: measure it and stuff, but to really notice... In this case, I also do consulting, so I really need it to be better, so that you can noticeably feel it from a user-experience perspective. So I want it to be a lot faster. And so in that case, it was pretty clear to me it needs to be a 7-billion-parameter model. It needs to be small. It needs to be fast. And I need to actually make the prediction quality higher than GPT-3.5 through fine-tuning. So it was kind of like that.
[55:20] Hamel Husain: But usually, it’s like, OK, what is the smallest model I can get away with for the quality threshold? Basically.
[55:30] Dan Becker: I have a bunch of thoughts on this. I mean, first of all, I generally think of training compute as free. Like we talked about.
[55:41] Hamel Husain: It is after this course, yeah.
[55:42] Dan Becker: Yeah. With all the credits that you have, you will need to find ways to use compute. So in the past, I would have used an 8-billion model, and now I’m like, well, I’ve got to find something to do with all these credits. But inference is so vastly more expensive for all the projects I work on than training that I just think it’s not even worth thinking about training costs. There’s one project we’re in the middle of working on, which I’ve talked about before: multimodal alt-text generation, so image to text.
[56:19] Dan Becker: We’re doing the most expensive training runs that I’ve worked on. And they still like, it’s a trivial fraction of inference costs. It’s not even like a meaningful fraction of the amount of time that we spend talking. Like we, I get paid hourly for that project. The amount of time that I spent on a single call deciding what model to run would cover all the models that we train for weeks. So training really doesn’t matter. And inference is what matters. And then the other thing is the time, like your iteration speed matters.
[56:55] Dan Becker: We’ve talked in this course about how important data is. And because data is so important, I typically will start with an 8 billion per meter model because you can iterate faster, see what’s breaking faster. If you want to change later on to a larger model, you can. And what we do is we always start with the 7 or 8 billion per meter model because it’s like a smartish size. You can train it quickly and easily on a single GPU, and it’s nice to not be using distributed GPUs.
[57:29] Dan Becker: And then frequently we’ll say, OK, let’s try a bigger model. And then that would have some cost for inference if we end up using a bigger model. And we do that, and then we say, oh, actually, it’s not meaningfully better.
[57:45] Dan Becker: And so we almost always end up just deploying 7- or 8-billion-parameter models, because they’re a little faster than a 13-billion-parameter model, or a little cheaper, like you’d use a less powerful GPU. And my experience is that we can’t actually tell the difference when you just look at the results side by side. So model size is a less interesting problem than you would expect; we just always run 7- or 8-billion-parameter models. And it’s not because we’re scared to train. I could train a 70-
[58:20] Dan Becker: billion-parameter model and not worry about the cost. It just isn’t worth it.
[58:29] Zack Mueller: Let’s see. Here’s an interesting one. Why aren’t people fine tuning more on five three? Is it because it’s smaller? Or is it just mistral small enough with Laura training? I think five series. I don’t know. There’s something weird with the data and how they trained it. That isn’t showing real world performance from what I’ve gathered on Twitter. And that’s generally why people don’t like it. Like, to be biased, I generally make my decisions on what models I look at based on and sort of his experiments.
[59:05] Zack Mueller: Because he just sits there and hacks with prompts all day and figures out what he wants to run locally. And for like, I think 12 hours, he said 5.3 was actually good. And then he threw it into his real world scenarios and it didn’t work. Whereas Lama…
[59:20] Zack Mueller: 70 billion performed out of the box phenomenal 8 billion is still good for him if he wants to do a small fine tune so it’s just the phi series is weird i think would be an apt way of saying that i guess it’d be cool if it worked but it’s kind of the situation i think where seven or eight billion is kind of the threshold of where things need to be at for things to happen still parameter wise
[59:49] Hamel Husain: So we might be at time, right? Yes. Yeah.
[59:55] Dan Becker: I found how to export unanswered questions. So I will export the CSV, drop it into the chat for this session.
[1:00:07] Hamel Husain: That’s a great homework for someone to build some kind of RAG application or whatever you want to do.
[1:00:14] Dan Becker: Nice. Yeah. So if we want to come back, we have questions, both answered and unanswered, saved that I’ll share with everyone. And we’ll share this recording, of course.