Best Practices For Fine Tuning Mistral

fine-tuning

llm-conf-2024

Published

June 11, 2024

Abstract

We will discuss best practices for fine-tuning Mistral models. We will cover topics like: (1) The permissive Mistral ToS and how it’s perfect for fine tuning smaller models from bigger ones (2) How should people collect data (3) Domain specific evals (4) Use cases & examples (5) Common mistakes

Chapters

00:00 Introduction

Sophia Yang introduces herself and provides an overview of the talk, which will cover Mistral models, their fine-tuning API, and demos.

0:35 Mistral’s History and Model Offerings

Sophia discusses Mistral’s history, from their founding to the release of various models, including open-source and enterprise-grade models, as well as specialized models like CodeStraw.

02:52 Customization and Fine-Tuning

Mistral recently released a fine-tuning codebase and API, allowing users to customize their models using LoRa fine-tuning. Sophia compares the performance of LoRa fine-tuning to full fine-tuning.

04:22 Prompting vs. Fine-Tuning

Sophia discusses the advantages and use cases for prompting and fine-tuning, emphasizing the importance of considering prompting before fine-tuning for specific tasks.

05:35 Fine-Tuning Demos

Sophia demonstrates how to use fine-tuned models shared by colleagues, as well as models fine-tuned on specific datasets like research paper abstracts and medical chatbots.

10:57 Developer Examples and Real-World Use Cases

Sophia showcases real-world examples of startups and developers using Mistral’s fine-tuning API for various applications, such as information retrieval, medical domain, and legal co-pilots.

12:09 Using Mistral’s Fine-Tuning API

Sophia walks through an end-to-end example of using Mistral’s Fine-Tuning API on a custom dataset, including data preparation, uploading, creating fine-tuning jobs, and using the fine-tuned model.

19:10 Open-Source Fine-Tuning with Mistral

Sophia demonstrates how to fine-tune Mistral models using their open-source codebase, including installing dependencies, preparing data, and running the training process locally.

Resources

Links to resources mentioned in the talk:

mistral-finetune is a light-weight codebase that enables memory-efficient and performant finetuning of Mistral’s models.: mistral-finetune
Example notebook
Fine-Tuning guide
Spaces in the prompt templates for different versions of Mistral: tweet
Mistral Inference guide.

Full Transcript

Expand to see transcript

[0:01] Sophia Yang: Okay, cool. Yeah. Yeah. Thank you so much everyone for joining this course and this talk. I’m super excited to be here. My name is Sophia Young. I lead developer relations at Mistral. So, yeah, sorry, it’s June already. Online of this talk, this talk is we’re going to talk about an overview of Mistral models. We’re going to talk about our fine-tuning API that we just released today. And also we have an open source fine-tuning code base that you can play with. And then I will show you some demos. Some brief history about Mistral.
[0:40] Sophia Yang: I hope all of you know some of Mistral, but if you don’t, we are a Paris based team of more than 50 people. We were founded about a year ago and in September last year we released our first model, Maestro 7B. And then December we released Maestro 8x7B with a lot of good feedback on both models. And we also released our first commercial model, Maestro Medium, and also the platform where you can use our model on the platform through our API. And then February this year we released Mr. Small and Mr. Large. Mr.
[1:20] Sophia Yang: Large is our flagship model with advanced reasoning, multilingual capabilities, function calling, all the good things. And Lachat is our conversational AI interface. It’s completely free. So if you’d like to talk to our models, please sign up and use Lachat. And in April this year, we released, at the time it was the best open source model, A times 22B. It was really good.
[1:49] Sophia Yang: And just last week, we released CodeStraw, which is a specialized model trained on 80 plus languages, programming languages, and you can use with various VS Code plugins to generate code and talk with your code. And here’s another view of our model offerings. We have three open source models, Apache 2 license, which means you can use it freely for your personal use cases or commercial use cases. We have two enterprise grade models, Mr. Small and Mr. Large. Yeah, so for your most sophisticated needs, Mr. Large is really, really good.
[2:33] Sophia Yang: It supports multilingual function calling, really good at RAG, And now we support fine-tuning Mistral-small and Mistral-7b. And again, we have specialized Postal for coding, and we have an embedded model. So we really care about customization. A lot of people asking about, like, how do you fine-tune Mistral? A lot of people want to have a recipe or want to use our API to fine-tune our model. So about… Two weeks ago, we released Mr. FineTune, which is a fine-tuning code base everyone can use to fine-tune our open-source models.
[3:18] Sophia Yang: Then just two hours ago, we announced a fine-tuning API so you can customize your Mr. Model using our fine-tuning API directly. And yeah, so the technology we use is LoRa fine tuning. Since this is probably the end of the course, you probably already know a lot about LoRa. It’s very efficient. It’s very performant. So we did some analysis on the comparison between Mr. LoRa fine tuning and a full fine tuning on Mr. 7B and Mr. Small. So they have really similar performance, as you can see here.
[4:00] Sophia Yang: MR7B, lower fine tuning on this benchmark is 0.9 and the full fine tuning is 0.91. So very, very close. Note that this benchmark is an internal benchmark normalized to this MR small number. So just some note there. And then before I show you some demos, just some thoughts on prompting and fine tuning. You probably already learned about this. So before you start fine tuning, you should always think about maybe you can just do prompting. So with prompting, your model can work out of the box, doesn’t require any data or training to make it work.
[4:44] Sophia Yang: So it’s really easy and it can be easily updated for new workflows or prototyping. With fine-tuning, sometimes it works a lot better than prompting. It can work better than a larger model, faster and cheaper, because it doesn’t require a very long prompt. So for specific use cases with a large model, if you have a sophisticated prompt, you can make it work. But with small fine-tuned models, you can just fine-tune it on specific behavior that doesn’t require a long prompt.
[5:21] Sophia Yang: So, yeah, again, it’s better alignment with the task of interest because it’s specifically trained on different tasks and you can teach new facts and information. Okay, so now I want to show you some demos, basically how to use our FineTune API and also how to use, excuse me, our Mr. FineTune, the open source code base. specifically, I want to show you four demos. Demo of how can you use a fine-tuned model that may be fine-tuned by yourself or by someone else. Demo of some developer examples, like real-world use cases.
[6:07] Sophia Yang: And also the API work through and the Mistral fine-tune walkthrough. So let me move this bar. Okay, so in this notebook, first of all, we need to install Mistral API. This is the latest version, that was released today. So if you want to use our fine-tuning features, make sure you have the latest version. And in this notebook, I’m actually using three fine-tuned models shared by my coworker.
[6:49] Sophia Yang: So if you are in an organization, you want to use models created by your coworker, or you want to share the fine-tuned model you yourself created, you can use the model from the same organization with different sets of options. So it’s really easy for me to use. a model that my coworker has fine-tuned today. So for the first example, it was trained on title abstract peers from archive. So basically if you input a title of a research paper, this model will generate a abstract for you.
[7:33] Sophia Yang: So if I input fine-tuning is all you need, it will give us some… similarly okay abstract for this paper title. And then just another fun example, the croissant convolution. We made it up, obviously. So the abstract is like we propose novel convolution there, the croissant convolution. So it’s just a fun example for people to see how the finding link could work to play with. And then I have another example, Mroy is all you need. And again, it will output the abstract.
[8:13] Sophia Yang: So I’m or a research paper because we trained on the title and abstract from the research papers. And another example, here’s another model. Note that The model name always have this structure. It’s fine-tuned. It’s fine-tuned on OpenMistral7b, our 7b model. So it will always have this name here, and it will have some random strings for the fine-tuned version. So for the medical chatbot, we actually trained on this hanging face data set of AI medical chatbot data. And then…
[8:56] Sophia Yang: Here, as you can see, we ask it some questions and it will answer like it was a chatbot. And another example, so those two examples are from actual data. So here the data we get from archive and here’s the data we got from Hugging Face. But what if you don’t have the data? You can actually generate data from a larger model. Sometimes we want to mimic the behavior of a larger model like Mr. Large because we know Mr. Large behaves really well and we want to do some knowledge distillation or knowledge transfer.
[9:34] Sophia Yang: from a larger model. So in this case, if you want to see how exactly we generated the data, you can go to our docs. And in this example, we have this prompt, you are a news article stylist following the economist style guide. So basically, we want to rewrite the news as the economist news article. And then we use Mr. Large to generate the revised news or like different revisions of how the news can be revised. And we give it some guidelines, basically. So So in this example, we use Mr.
[10:18] Sophia Yang: Large to generate our data, and then we give it a news article, right? And then we use the fine-tuned data. It can generate a new piece of news article that may sound more like economist news. And then if you give it some prompting, it can also generate a set of revisions it proposed for you to change, how you can change the style of the news to make it sound better. Or more like economist. Yeah, so that’s. That’s some quick demonstrations of how the fine-tune model can look like.
[10:57] Sophia Yang: In our docs, we have some developer examples of real-world use cases that have been using our fine-tuning API. For example, we have FOSFO. They are a startup using our fine-tune model for RAC, for Internet retrieval, and you can see the details here. We have examples of RAC for medical domain, financial conversation assistant, legal co-pilot. So they are all using fine-tuned Mr. Models with our fine-tuning API. And in these examples, you can see the specific data. You can see the evaluation training steps and how the benchmark results look like.
[11:47] Sophia Yang: Yeah, so the fine-tuned with just small is better than the not fine-tuned with just small. Not surprising. Yeah, and then different results. So yeah, if you’re interested, you can check out our developer examples for some real-world use cases of fine-tuned model. So, Yeah, so in this next section, next demo, there are so many demos I want to show you today because it’s very exciting. In this next section, I want to show you an end-to-end example of how you can use Mistral Fine Tuning API on your own and for your own dataset.
[12:33] Sophia Yang: Of course, we need to install Mistral AI. Make sure it’s or some numbers larger than that after probably after today. And we also need to install Pandas. The first step is, of course, we need to prepare our data set. Yeah. So whenever you want to do some fine tuning, the first step is always the data. In this simple example, we. read data from a parquet file that’s hosted on Honeyface is the UltraChat data.
[13:10] Sophia Yang: We are only reading one parquet file because this data is quite large, and we don’t want to spend too much money on it. So you can feel free to change the data to your own use cases. We split the data into training and evaluation here, and then we save the data locally into the JSON-IL format. This format is actually needed for using our API. And note that here are the sizes for the training data and evaluation data. So there is a size limit here for our API. For training data, it’s limited at 512 megabytes.
[13:49] Sophia Yang: So you can have multiple files feeding to your training, your fine-tuning pipeline. But each file needs to be under 512 megabytes. For your evaluation dataset, it needs to be less than one megabytes. Yeah, so a lot of times the data on Hugging Face is not greatly formatted for the training purposes. So we have a script to help you reformat your data. This script is not, maybe not robust in all of the use cases, but in our use cases it works.
[14:35] Sophia Yang: So if your use case is more sophisticated, you might want to change the script to your own use case accordingly. So in this example, we basically just skipped several examples because they’re not not right formatted. So if we take a look at one of the examples, one of the issues here is the assistant message is empty. So we can’t really train based on this kind of data. So we need to make sure the data is right formatted. Otherwise, it’s going to be difficult.
[15:09] Participant 1: I have a quick question about this. Is there like a format validator that y’all have? I see that you have a thing to convert it to format, but I’m curious about validation.
[15:19] Sophia Yang: That’s a great question. We do have a validator script as well. That was in my next notebook.
[15:26] Participant 1: Oh, sorry about that.
[15:28] Sophia Yang: No, no, no. That’s a great question. So we do have a validation data script from the Mr. Find2 repo. Also, whenever you upload the model to our server, it will validate the data for you. So if your data is not right, it will tell you why it’s not right and ask you to change it.
[15:54] Participant 1: Excellent.
[15:55] Sophia Yang: Yeah, thanks for the question. Yeah, so yeah, so the next step is to upload the data set with the files create function and then you can just define the file name and the actual file here and we’re uploading the file and then we can see the ID of the file, the purpose of the file. The purpose right now is just fine too, maybe there will be some other purposes later, but right now it’s just fine too. And then to create a fine tuning job, you will need the IDs of the files we just uploaded.
[16:32] Sophia Yang: And of course, you need to define the model you want to fine tune. Right now we only support minstrel7b and minstrelsmall, but in the future we might add more models here. And then you can try different hyperparameters. And in this example, I only ran 10 steps just to have it faster. But if you want to increase your steps, actually, you probably should increase your steps if you’re doing something serious. Yeah. So make sure you can change those different configurations here.
[17:10] Sophia Yang: And then we can find the job ID and then we can use, and then we can like do different things like listing the jobs that we have. I have quite a few jobs that I ran. And we can retrieve a job based on the job ID to see if it’s successful or not. And then you can see we have some metrics here, training loss, validation loss, token accuracy. Because we only ran 10 steps, you only see this once. If you run it 100 steps, you will see it 10 times.
[17:48] Sophia Yang: So every 10 step or every 10% of the step, you will see the metrics. So you can see if you are making progress or not making progress. And finally, with the fine-tuned model, when the model is, the job is successful, you will see the fine-tuned model get populated from your retrieved jobs, and then you can call this model and then ask any questions you want. So, so that’s, that’s how you can use a fine-tuned model. It’s exactly the same syntax as what if you, if you’re using our normal, non-fine-tuned models.
[18:31] Sophia Yang: Okay, and finally, if you want to use weight and biases, you can add your weight and biases credentials here. Yeah, you will need your API key and everything. Just to show you how that might look like. Basically, you can check your losses, your perplexity score, your learning rate and everything. So. Yep, so that’s how you can run the fine-tuning through our API. I hope it’s clear. Any questions? Okay, cool.
[19:11] Sophia Yang: So this last demo I want to show you is what if you want to just use our open source code base to fine tune MISDRAW7B or other MISDRAW models. I’m actually running this in Colab Pro Plus account, so it’s possible to fine tune a model in Colab. So in this example, we… I just git clone this repo, the MrFinding repo, because it’s not a package, it’s a repository. And in this repo, we need to install all the required packages. And of course we need to download our model because we’re doing everything basically locally.
[19:58] Sophia Yang: And we’re downloading this model from the Mistral website and we’re downloading 7BB3 model here. And then… Same thing as we have seen before, we’re preparing the dataset. We’re using the exactly the same data as before reading the parquet file, splitting into training and evaluation, save the data locally. And then. Same as before, we reformat our data. Everything, the evaluation data looks pretty good. And afterwards, you can verify if your data is correctly formatted or not with our evaluation data script.
[20:40] Sophia Yang: So yeah, so it will, this, this actually will take some time because it’s evaluating, yeah, each record. And then you can start training. The important thing here is actually this config file, right? This basically is telling the model, telling the LoRa how you want to fine-tune your model and different, where is the path of everything. So we have the data file. You want to define the path of your instruct data, your evaluation data. This is the training data, your evaluation data. We want to define the path of your model.
[21:26] Sophia Yang: You might need to define or just leave it as default, those hyperparameters. We recommend using a sequence length of 32K, but because I’m using a Colab as the memory is limited, I ran into auto memory issue a lot. So I decreased this number to 8,000. But if you have a better GPU, you should use 82.
[21:54] Sophia Yang: a thousand and then you can you can define all the other fun stuff right okay yeah and then with this one line of code we can start training and fine-tune our model um and then in uh the result as a result you can see the checkpoint of of your LoRa result here. This is where we can use to in our inference as our fine tuning model. And to run our inference, we have a package called Mr. Inference to help you run all of our open source models and also all the fine tuning models.
[22:40] Sophia Yang: So basically, In this, in Mistral inference, you need to define the tokenizer, which is in the, which you are downloading from the Mistral, Mistral fine tune file. We use the v3 tokenizer because it’s a v3 model. And then we need to define the model that’s reading from the models folder we downloaded. And then we need to load LoRa from the checkpoint that we have, we just saved from the fine tuning process. And then we can run check completions and get the result. So that’s basically how you can run Mr. Fine-Tuning with Mr.
[23:21] Sophia Yang: Fine-Tune the code base. And finally, I want to share some exciting news that since we just released our Fine-Tuning API, we are hosting a Fine-Tuning Hackathon starting today to June 30th. Yeah, so please feel free to check out our hackathon and you can submit from this Google form and yeah, very exciting. Looking forward to see what people are building. Thank you so much.
[24:00] Participant 1: It’s very cool. What’s your favorite, like, so the very end you did like kind of like training like an open model, not using API or whatever, using Torch Run. Is that your preferred way to fine tune? I prefer the open models.
[24:23] Sophia Yang: You can fine tune Mr7b with our API as well. I would recommend using our API just because you don’t need a GPU. It’s so much easier. You will not run into out of memory issues, hopefully.