Fine Tuning LLMs for Function Calling

fine-tuning
llm-conf-2024
Published

July 2, 2024

Abstract

In this talk, we will go through the process and best practices of fine-tuning an LLM for function/tool use. We will discuss topics like data preparation, objective-based tuning, efficient serving, and evaluation.

This talk was given by Pawel Garbacki at the Mastering LLMs Conference.


Chapters

00:00 Introduction and Background

00:29 Functional Tool Calling Overview

02:23 Single-Turn Forced Call Objective

02:51 Forced Call Explanation

03:28 Parallel Function Calling

04:00 Nested Calls Explanation

06:24 Multi-Turn Chat Use Case

13:54 Selecting Function Call Syntax

17:44 Full-Weight Tuning vs. LoRA Tuning

19:19 Efficient LoRA Serving

23:06 Constrained Generation

26:21 Generic Function Calling Models

40:02 Q&A

Resources

Notes

Why Function-Calling?

To give the model the ability to interface with the external world, for example to provide information not available in the training dataset (e.g., real-time stock prices) or to orchestrate tools in multi-agent systems.

Framing the Objective of Your System

Pick the right objective for your use case – it determines the data you need to prepare, how much of it, and the overall complexity of your fine-tuning approach. The most common objective: single-turn forced function call – "forced" means the model must respond with a function call instead of an open-ended answer. More complex objectives: parallel function calling, nested calls, multi-turn chat with optional function calls. A minimal sketch of a forced-call training example follows.
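
For illustration, here is a minimal sketch of what a single-turn forced-call training sample might look like in OpenAI-style message format. The function name `get_stock_price`, its `ticker` parameter, and the field names are illustrative, based on the stock-price example used in the talk, not an actual training record:

```python
# Hypothetical single-turn forced-call sample (OpenAI-style roles).
# The assistant turn is the training target: always a function call, never free text.
sample = {
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the latest stock price for a ticker symbol.",
            "parameters": {
                "type": "object",
                "properties": {"ticker": {"type": "string"}},
                "required": ["ticker"],
            },
        },
    }],
    "messages": [
        {"role": "user", "content": "What is NVIDIA trading at right now?"},
        {
            "role": "assistant",
            "tool_calls": [{
                "type": "function",
                "function": {"name": "get_stock_price",
                             "arguments": '{"ticker": "NVDA"}'},
            }],
        },
    ],
}
```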

Syntax for Function Calls

Special "function call" token: introduce a special token to the vocabulary that prefixes every function call. This enables (1) easier parsing of responses, (2) switching into "function-call mode" when streaming (buffer the complete function call rather than streaming it token by token), and (3) switching into constrained generation. A sketch of adding such a token is shown below.
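
A minimal sketch of how such a token could be added with Hugging Face `transformers`, assuming a causal-LM fine-tuning setup; the token string `<|function_call|>` and the base model name are illustrative, not the ones used for FireFunction:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Register a dedicated token that will prefix every function call.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|function_call|>"]})

# Grow the embedding matrix so the new token gets its own (trainable) row.
model.resize_token_embeddings(len(tokenizer))

# At inference time, seeing this token id tells the serving layer to stop
# streaming token by token and buffer until the full call has been generated.
FUNCTION_CALL_ID = tokenizer.convert_tokens_to_ids("<|function_call|>")
```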

General syntax: Python function signature or JSON schema. Python syntax is easier for models to generate because they have been trained extensively on Python; JSON schema is better for complex, deeply nested parameter types, pairs naturally with constrained generation, and stays compatible with OpenAI APIs and other leading models. The two styles are contrasted below.
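
To make the contrast concrete, here is the same hypothetical call written in both styles; the exact formats a given model expects will differ:

```python
# Python-signature style: the model emits a one-line call expression.
python_style = 'get_stock_price(ticker="NVDA", period="1y")'

# JSON-schema style: the model emits a structured object, which is easier
# to validate against a schema and to constrain during generation.
json_style = {
    "name": "get_stock_price",
    "arguments": {
        "ticker": "NVDA",
        "period": "1y",
    },
}
```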

Preserving Model Capabilities

Tune on top of instruction-tuned models, not base models, to preserve instruction-following capabilities.
Caveat: for forced function calling, where you know you always want a function call and don't need general chat, a base model is probably fine.

LoRA vs. Full Fine-Tuning

The field is divided, but LoRA seems to be enough for function-calling tuning. LoRA is preferred in limited-data regimes, for faster iteration, and for lower resource demands. Fireworks serves up to hundreds of models per account cheaply by keeping LoRA weights separate from the base model and batching requests intelligently. A sketch of a typical LoRA setup follows.
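
A rough sketch of what LoRA tuning looks like in practice with the `peft` library; the rank, target modules, and other hyperparameters here are illustrative defaults, not values recommended in the talk:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative base model
)

lora_config = LoraConfig(
    r=16,                      # low-rank dimension; keeps trainable params small
    lora_alpha=32,             # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base weights
```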

Constrained Generation

Limit LLM output to tokens that are valid for the given schema. This reduces hallucinations and speeds up inference, because tokens already fixed by the schema can be auto-completed instead of generated. Supported in Fireworks. A simplified sketch of the idea follows.
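
This toy sketch illustrates only the auto-completion idea (it is not how Fireworks implements it): spans of output fully determined by the JSON schema, such as key names and punctuation, are appended without any model call, and the model is sampled only for the free-form value slots. `sample_value` stands in for whatever decoding loop you already have:

```python
from typing import Callable, List, Tuple

# A "template" alternates fixed literals (determined by the JSON schema)
# with free slots the model must fill in.
Template = List[Tuple[str, str]]  # (kind, text) where kind is "fixed" or "slot"

def constrained_generate(template: Template,
                         sample_value: Callable[[str, str], str]) -> str:
    """Emit fixed spans verbatim; only call the model for free slots."""
    out = ""
    for kind, text in template:
        if kind == "fixed":
            out += text                      # no model call: the schema decides this span
        else:
            out += sample_value(out, text)   # model fills the slot, e.g. a string value
    return out

# Example template derived from a one-parameter schema {"ticker": string}.
template: Template = [
    ("fixed", '{"name": "get_stock_price", "arguments": {"ticker": "'),
    ("slot", "ticker"),
    ("fixed", '"}}'),
]

# Dummy sampler for demonstration; a real one would call the LLM.
print(constrained_generate(template, lambda prefix, slot: "NVDA"))
```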

Off-the-Shelf Models and Evals

Use open-source function-calling models if possible to save time and effort. Otherwise, be prepared to invest a lot to achieve high-quality models. Focus on high-quality training data over large quantities.

Public evals only indicate so much; your use case likely has special cases that a public model would fail on or that the eval does not cover. Performance also varies along different axes (e.g., sometimes prompt engineering beats function-calling APIs); pick the top five models and try them on your own use case. A minimal eval check is sketched below.
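
When building your own eval, the simplest useful check is an exact match of the predicted call against a gold call (correct function name and correct arguments). This is a minimal sketch of that idea; real evals usually add per-argument scoring and execution-based checks:

```python
import json

def call_matches(predicted: str, gold: dict) -> bool:
    """Return True if the predicted JSON call names the right function
    and supplies exactly the gold arguments."""
    try:
        pred = json.loads(predicted)
    except json.JSONDecodeError:
        return False                      # unparsable output counts as a failure
    return (pred.get("name") == gold["name"]
            and pred.get("arguments") == gold["arguments"])

gold = {"name": "get_stock_price", "arguments": {"ticker": "NVDA"}}
print(call_matches('{"name": "get_stock_price", "arguments": {"ticker": "NVDA"}}', gold))  # True
print(call_matches('{"name": "get_weather", "arguments": {"city": "SF"}}', gold))          # False
```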

Gorilla focuses mainly on forced function-calling scenarios, and its dataset includes only fairly simple function signatures.
NexusRaven leans toward more complex function-calling scenarios, and is thus likely a stronger model.

Use Case Recommendations

For security, focus on read-only functions to minimize risks. Consider precise instructions in system prompts for safety.

Potential ways to address a scenario with hundreds of functions and local serving (see the RAG sketch below):
- Put the function signatures in the system prompt (and don't vary them); then pre-populate the KV cache before each session starts.
- Or, instead of putting all the functions in the prompt, build a RAG system: embed the function signatures and inject only the relevant ones per request.
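
A minimal sketch of the RAG variant, using `sentence-transformers` to embed function signatures and retrieve the best matches for a user request; the embedding model name and the example functions are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative catalog of function signatures (in practice: hundreds).
functions = [
    "turn_on_light(room: str) - Turn on the lights in a room.",
    "set_thermostat(temperature: float) - Set the target temperature.",
    "get_stock_price(ticker: str) - Get the latest price for a ticker.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly
function_vecs = embedder.encode(functions, normalize_embeddings=True)

def top_k_functions(user_message: str, k: int = 2) -> list:
    """Return the k function signatures most similar to the user message."""
    query_vec = embedder.encode([user_message], normalize_embeddings=True)[0]
    scores = function_vecs @ query_vec          # cosine similarity (vectors are normalized)
    best = np.argsort(-scores)[:k]
    return [functions[i] for i in best]

# Only the retrieved signatures get injected into the prompt.
print(top_k_functions("Please warm up the living room to 21 degrees"))
```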

How many samples does it take to get good results? With LoRA supervised fine-tuning, as few as 1,000, maybe even fewer. People tend to hand-curate datasets for this purpose and get decent results. For downstream alignment with DPO, even 100 samples may suffice.

If you have an audience of beta testers, you can share the model with them and gather their positive/negative feedback for DPO.
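
For reference, DPO preference data is typically a set of (prompt, chosen, rejected) records; here is a hypothetical one built from tester feedback, where the rejected completion calls the wrong function:

```python
# Hypothetical DPO preference pair: one prompt, a preferred and a dispreferred
# completion. Tester thumbs-up/down feedback maps naturally onto this format.
preference_pair = {
    "prompt": "User: What is NVIDIA trading at right now?\nAssistant:",
    "chosen": '<|function_call|>{"name": "get_stock_price", '
              '"arguments": {"ticker": "NVDA"}}',
    "rejected": '<|function_call|>{"name": "get_weather", '
                '"arguments": {"city": "Santa Clara"}}',
}
```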

Synthetic Data Generation

Open-source models tend not to produce the cleanest synthetic datasets; that said, good quality is still achievable, especially with post-filtering of samples. Define clear objectives and boundary cases; many varied synthetic samples are likely better than repeated examples of a single use case or objective. A simple schema-based post-filter is sketched below.
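
One cheap but effective post-filter is to validate every generated call against the function schema it was supposed to follow and drop anything that does not conform; this sketch checks only the function name and required arguments, a deliberately minimal version of that idea:

```python
import json

def is_valid_sample(raw_call: str, schemas: dict) -> bool:
    """Keep a synthetic sample only if its call parses and matches a known schema."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return False
    schema = schemas.get(call.get("name"))
    if schema is None:
        return False                                   # hallucinated function name
    required = set(schema["parameters"].get("required", []))
    provided = set(call.get("arguments", {}))
    return required <= provided                        # all required args present

schemas = {
    "get_stock_price": {
        "parameters": {"properties": {"ticker": {"type": "string"}},
                       "required": ["ticker"]},
    },
}

samples = [
    '{"name": "get_stock_price", "arguments": {"ticker": "NVDA"}}',
    '{"name": "get_stonk_price", "arguments": {"ticker": "NVDA"}}',   # dropped
]
clean = [s for s in samples if is_valid_sample(s, schemas)]
```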

Full Transcript


[0:00] Pavel: Yeah, so thanks for having me. I'm Pavel from Fireworks AI. And, you know, in this presentation, I will give you some tips and tricks around fine-tuning your models for function calling. Next slide. So let's start with what function or tool calling actually is. So the way I like to think about it is that, you know, in some cases, you need to give your model the ability to interface with the external world.
[0:29] Pavel: So, you know, one example use case is where you want your model to get access to the information that wasn’t available in the model training data set. So like, you know, that includes in particular the types of information that is available in real time. As in this specific example. where we want the model to help to retrieve the current price of NVIDIA over the last year. This type of information requires access to the stock price time series over the recent time.
[1:03] Pavel: And since that type of information wasn’t included in the pre-training data set, we need to give the model the ability to pull this information from elsewhere. Another very common use case for function calling is orchestration in multi-agent systems where you have a number of tools that the model may access in order to assist the user. The function calling functionality provides a way to do that. Next slide. So, let me guide you with some of the decisions you need to make when you decide to tune your model for function calling.
[1:41] Pavel: One of the more important decisions is the objective, right? And the objective you pick is going to affect a lot of things like the data you need to prepare, the amount of training data, the overall complexity of the fine tuning and the complexity of use. So in general, it is recommended to pick just the right objective that is required for your use case, and try to keep the objective as simple as possible. Previous, yeah. So typically, the most common and also the easiest objective is what I call single-turn forced call.
[2:23] Pavel: Some other people also call it routing use case. So in this case, we have access to a number of functions. And the user will provide us with a single instruction and that instruction will be mapped to a call of one of those functions. And basically the objective of the model is to pick the right function and to pick the parameters for that.
[2:46] Hamel: Why is it called a forced call?
[2:51] Pavel: It’s called a forced call because we are forcing a function call. The model won’t be able to answer a question like, how are you today? Because that will be just like a natural language response. But in this case, we are kind of forcing the function call. So the response is always going to be a function call.
[3:11] Hamel: I see.
[3:12] Pavel: Thanks. Next slide. Okay, so a slightly more complex objective is parallel function calling. So in this case, we're also forcing function calls, but instead of calling one function, we may call multiple functions. But those functions are kind of independent of each other, so they can be called in parallel. So one example here is like, let's say we want to get the stock price of two companies, Nvidia and Apple. And we have access to a function that can take only one ticker at a time.
[3:46] Pavel: So we need to split this invocation into two function calls, but they are independent of each other so they can be called in parallel. Next slide. Okay, so let’s complicate it. So you can also have cases where you have dependency between functions, right? So like you may call one function and that will return some response and that response will need to feed into another function call. So I call it nested calls. So as in this example, after getting the stock price of Nvidia and Apple, we also want to plot those prices.
[4:23] Pavel: So we basically have two functions, one to get the prices and the other one to interpret those prices and plot them. So as you can see, in this case, we are entering these multi-turn scenarios where we need to call the functions sequentially, starting with the nested function, and then we call the outer function to interpret the results of the nested calls. Next slide.
[4:52] Hamel: Let me ask a quick question on this one. So, if I use this as training data, I don’t see where the model is learning that when it calls the tool for plotting. Oh, I see. The tool is returning this 120 and 121. I see. And then that’s what’s getting plugged into the plot call. Okay. Sorry. I’m just saying.
[5:20] Pavel: It's a good question. It's a good question. So here, maybe I should have specified it earlier. I'm using the terminology, which is following the OpenAI APIs. And typically in this case, you have kind of like three roles. Like you have the user role, which is usually the… client interfacing with the model. The assistant is the model itself, so it is generating the function calls. And then you have like a special role called tool, which is basically like used also on the client side to feed the result of the function call back into the model, right?
[5:57] Pavel: So in this case, you know, we have the user issuing an instruction and then we have the model generating function calls. Typically, those function calls are interpreted on the client side by basically invoking them and feeding the results back into the model. And the model may decide to follow up on that, as in this case, calling another function, feeding the results from the initial function calls into this second function call.
[6:22] Hamel: Yep. Okay.
[6:24] Pavel: And finally, probably the most complicated, one of the more complicated use cases is the one where we want to have conversations with the function calls. We may think about it as the user is having conversation with the model, and in certain situations, and the response from the model may involve a function call. I call it a multi-turn chat use case with optional function calling. So as in this case, the user may inquire about the like what’s in the news today, to which assistant will respond with a function call to pull in the trending news.
[7:12] Pavel: And response from that will be fed back into the assistant who is going to summarize the response, to which user may follow up with additional inquiries related to the previous responses. So this is typically like the most complicated use case to tune for because it basically involves multiple objectives. So like the model needs to be good at both, you know, responding in a natural language and having like a natural language conversation with the user, which is interleaved with function calls. Next slide. Okay, so let's switch gears and talk about the function call token.
[7:52] Pavel: So what is the function call token? So in general, like, you know, the way that the client interfaces with the model is that, you know, the model is going to return a response and the client will have to interpret it. And you know, especially in the cases where we have complex objectives, like the multi-turn general chat with optional function calling. the client will have to tell the intent of the response. And the intent could be like a function call or maybe like no function call or maybe like multiple function calls.
[8:24] Pavel: So it is generally advisable to make the output of the model structured in such a way that it is easy for the client to parse it and to tell when we perform a function call. And if the response includes a mix of natural language and function calls to also tell where is the boundary of the function call. So the general advice here is to introduce like a special token, like a single token, that is going to prefix the function call. So the reason to do that is… actually threefold.
[8:57] Pavel: So the first one I already mentioned, like it makes parsing of the responses easier. But there are also more. So, you know, typically the modern inference services give you the ability to stream the response back. Instead of waiting for the response to come back fully and only then revealing it to the client, you may want to stream the response chunk by chunk.
[9:29] Pavel: However, like when it comes to function calling, you probably don't want to like stream the response token by token, but instead you want to, you know, when you enter the function call mode, you want to wait for the entire function call signature to be generated before you return it back to the user. Otherwise, like, you know, it's pretty hard to parse like, you know, partial function calls. So having like a single token telling you that, okay, you know, now we are switching into the function call mode makes it easier to implement streaming generation.
[9:59] Pavel: Also, there will be like another concept, which I call constrained generation, that I will be discussing later in this presentation, but just as a preview, it is a way to enforce a certain schema for the function calls. And also, by having this way of, you know, switching the model intent to function call, it makes it easier to basically enable the constrained generation mode. I will come back to this concept in a few slides. Next slide, please. Right. Another decision you need to make is to pick the syntax for the function calling.
[10:41] Pavel: So I would say that, you know, based on what I have seen, there are two major contenders here. One is to perform a function call through the Python syntax. So basically generate like a Python function call signature. And the other one is to generate a JSON schema. So basically have like a JSON structure that describes the function name and the parameters. And so, you know, so far in this presentation, I have been following the JSON schema syntax. But, you know, the Python function signature generation is also a pretty good option.
[11:21] Pavel: So when considering the trade-offs between those two here is what you should take into account. So generally, it is a little bit easier for the model to generate Python function signatures. So the main reason is that, you know, modern LLMs have been trained extensively with Python and to understand Python. So, by relying on generating the function calls in the Python syntax, we’re kind of piggybacking on the models like internal understanding of Python.
[11:53] Pavel: However, like the problem with this approach is that, you know, it is not very common for, you know, the existing code in Python to perform calls of functions with very complex parameters passed inline in the call. So if you have a complex parameter, right? Typically what you do, like in Python, you would create like another class that, you know, defines the type of this parameter. And then like you would basically set the values of that object, like outside of the function call and pass it to the function call.
[12:36] Pavel: However, like, you know, here we are generating basically like a single-line invocation of the function, which makes it pretty unnatural to pass in deeply nested parameter types. So if your use case involves deeply nested complex parameter types, the Python signature may not be the right choice for you. JSON schema, on the other hand, is better for those complex, deeply nested parameter types.
[13:11] Pavel: It is also a little bit easier to generate with this constrained generation concept, which I will discuss in a slide or two, where we are basically specifying a grammar of our function and making sure that the model follows this grammar. And it also has the advantage that if you want your model to be exposed through an OpenAI-compatible API, which is pretty common nowadays, it may be a better choice, because OpenAI models like GPT-4 are following the JSON schema syntax.
[13:41] Pavel: So if you want to stay compatible with the leading offerings, you may want to pick the JSON schema option. Next slide.
[13:53] Hamel: All right.
[13:57] Pavel: Right. Okay. Another consideration you may take into account is around preserving the capabilities of the model you are tuning on top of. So, you know… So if you compare the quality of the models that we are getting nowadays compared to the quality of models we have been getting, let’s say, six months ago, we have seen a very significant improvement in instruction following capabilities. So this is mainly coming from the fact that, you know, companies that… tune those open source models have been investing pretty heavily in creating very high quality alignment data sets.
[14:39] Pavel: And maintaining and curating those data sets involves significant effort and also costs a lot of money. So for instance, Llama 3 has been aligned with 10 million human-curated samples. And because of that, Llama 3 is really good at instruction following. So if you are tuning your function calling model to mix function calling with general chat, it would be a little bit of a shame if we would kind of override the instruction following capabilities during our fine tuning process.
[15:15] Pavel: So what we are doing nowadays when tuning function calling models for those more complicated objectives, we try to preserve the existing instruction following capabilities of the model we are tuning on top of while adding function calling on top of them. You should pick the instruct version of the model, not the bare-bones version of the base model, when you are doing tuning. Also, you know, in order to preserve…
[15:57] Hamel: Can I ask you a question about that? Sorry. So are you saying, like, okay, when you're fine-tuning models for function calling, you recommend fine-tuning on top of the instruction-tuned model instead of starting from the base? Did I understand it correctly?
[16:13] Pavel: Exactly. So it kind of… Yes. That's what I'm saying. However, the caveat here is that it kind of depends on the objective you are tuning on. So that goes back to my previous slides. If you are tuning for a forced function call objective, right, where you know you are going to call a function, you are not going to follow instructions or have a general conversation with the user. In this case, it probably doesn't matter. And you can pick the base version of the model.
[16:42] Pavel: But if you are tuning for a more complicated objective that interleaves general conversation with function calling, you may want to pick the instruct variant of the model.
[16:53] Hamel: I see. Thank you.
[16:57] Pavel: Right. So, you know, in general, in order to preserve as many of the instruction following capabilities as we can, it makes sense to reduce the amount of training data, right? But then of course, like it may have an impact on the quality of the model. So if we don’t use a lot of training data, like we would better make sure that the data that we are using is of high quality. So the general recommendation here is to basically reduce the amount of training data, but at the cost of preparing higher quality data for our tuning.
[17:37] Pavel: Next slide. Right, so another consideration is whether to do full weight tuning versus LoRA tuning. So based on my experience, and I guess the field is kind of divided here, it depends who you ask, but based on my personal experience, LoRA is good enough for tuning models for function calling, especially in this limited data regime that I mentioned in the previous slide. And so since we won't be using tons of tuning data, it is actually better to use LoRA because it has fewer parameters that we need to converge.
[18:23] Pavel: So actually when you tune in this extremely low data volume regime, by using LoRA you can actually paradoxically end up with a model that is of higher quality. Another thing I wanted to point out, that may not be obvious to people, is that if you're going to be fine-tuning your models, you will be going through a lot of iterations. And if you go through a lot of iterations, it is actually better to have a setup that lets you churn through the different versions of the model faster. And tuning LoRAs is like way, way faster than performing full weight tuning.
[19:10] Pavel: It takes less resources and it also is cheaper to host and experiment with. Next slide. So why is it cheaper to host LoRAs? So it's not always cheaper. It is cheaper if your inference service has certain facilities that enable efficient LoRA serving. So in particular, the inference solution that we have developed at Fireworks, it has the ability to serve multiple LoRAs very cheaply. So we are able to load hundreds, if not thousands of LoRAs per user account. And we charge users only per token. We don't charge per memory footprint.
[20:00] Pavel: So the reason why we can do that is that, you know, instead of merging the LoRA weights into the base model, we are keeping them separately. And since the LoRA weights are way, way smaller compared to the base model weights, we are able to greatly reduce the memory footprint. And also at the inference time, we're able to do intelligent batching where we can batch requests that go into multiple LoRA adapters as they pass through the base model together.
[20:34] Pavel: So we are basically able to do larger batch size inference for the base model part of the model. And we can use smaller batches for the individual LoRA adapter weights. But inferring through the LoRA weights is cheaper than inferring through the base model weights. So the…
[20:55] Hamel: I have a question, if you don't mind. While we're talking about Fireworks: you showed us a lot about, generally, what to think about when you're fine-tuning for function calling. Is there a paved path on Fireworks that allows that? Like, for fine-tuning with function calling, is it easy? You know, is there some product that does that?
[21:23] Pavel: I’m actually going to reference a family of models that we have tuned for function calling when it comes to fine-tuning your own models. We have a fine-tuning service that you can use. Currently, we do not have a recipe specifically for function calling, but you can use a general recipe with your curated dataset to tune a model for function calling. And as I said, our platform is built in such a way that we encourage people to experiment with multiple variants of the model with very limited cost, which is super, super important.
[22:05] Pavel: From my own experience, in order to fine tune a really high quality function calling model, you need to go through hundreds of iterations, right? And you need to modify things like the you know, the training data, different hyperparameters, you may want to iterate over your benchmarks and so forth. So having the ability to, you know, move really quickly across the iterations and having the ability to host multiple versions of the model at the same time and kind of comparing them side by side is pivotal in being able to create like a high quality model.
[22:45] Pavel: I guess like this whole thing applies not only to function calling, but in general to, you know, fine tuning any kind of model for a reasonably complex objective. Yeah, okay. So going back to the constrained generation I referenced previously. So constrained generation is a mechanism that allows us to reduce the hallucinations in the model output, right? So when we are performing function call generation, we know the signatures of the functions because they're kind of given upfront in the API. So we can use this knowledge
[23:32] Pavel: to guide the model, to basically force it to generate the output in a certain schema, right? So when the model decides to call a function to get the stock price of a company, like we know that that function takes like a single parameter and that parameter is of type string and then it has the name ticker, right? So we can basically like, you know, guide the model generation with this knowledge of the schema.
[24:00] Hamel: And does Fireworks offer a constrained generation thing? Yes. On their product? Okay.
[24:06] Pavel: Yes. We have constrained generation support.
[24:09] Hamel: And how do you specify it? Is it some kind of grammar that you provide? Or how do you formulate?
[24:17] Pavel: Yeah. So currently we support it for function calling only. So you will have to basically provide the schema of your functions. And based on the schema of the functions, we are going to extract the grammar and force the output to conform to it.
[24:38] Hamel: Out of curiosity, are you using some kind of open source constrained generation tool like outlines or something? Or do you make your own?
[24:47] Pavel: Something like that. So our solution is inspired by certain context-free grammar open source solutions, but we have customized it pretty heavily. So it's not just like, you know, plug-in use of an existing tool. It's a heavily customized version of basically multiple open source tools. We took the best out of multiple available solutions.
[25:16] Hamel: That’s really fascinating. That should be a talk on its own.
[25:20] Pavel: I agree. Yeah, so, you know, the grammar gives you the ability to reduce the hallucinations, or many times even eliminate hallucinations, but also it actually speeds up the generation. The reason why it can speed up the generation is that, you know, like, for instance, in this example, right, like we have one field called temperature. So we kind of know that, you know, this function takes only one parameter that has the name temperature. So let's say there was a token generated for temp, right?
[25:56] Pavel: So we know that, you know, the next token is going to be like auto-completing temperature, right? So instead of like, you know, running the generation for the subsequent tokens through the model, we can kind of auto-complete them from the grammar. And this way we’re kind of short-circuiting the generation and making the generation faster. Next slide. All right. So this is my final slide. And this is probably like the most important tip I have for you guys is to work smart, not hard. And actually not to fine tune for function calling unless you have to.
[26:38] Pavel: There are like a lot of great open source models for function calling, including one that we provided at Fireworks. And, you know, in many, many cases, like, you know, those models are good enough for a lot of use cases. If you are going to be tuning for function calling, like, you know, be prepared to put, like, you know, sweat and tears into it. Like it's not easy. It definitely, it's a process. It's an iterative process. It's going to take time. It's going to take effort. It's like very rewarding.
[27:13] Pavel: But like many times, you know, when you are on a deadline, it's not required, and you can get pretty far.
[27:20] Hamel: I have a question about this.
[27:23] Pavel: Sure.
[27:24] Hamel: When I go to fireworks.ai forward slash models and I go to that webpage and I go to language models and I try to filter by this FireFunction model, I only see V1. Is V2 some kind of alpha thing that is not publicly available yet?
[27:45] Pavel: Yes, yeah, you did your homework. Yeah, so we didn't officially release it yet, so this is actually the first time I'm mentioning it externally, and we are almost ready to release it. We'll be releasing it in the next few days. We have been testing the release candidate pretty extensively with our Discord users for the last few weeks. So I'm pretty confident about the quality of the model in the wild. So FireFunction V2 is not discoverable through the UI, but this link that I pasted here should work.
[28:25] Pavel: And the best part is that it is free.
[28:28] Pavel: for the time being, so feel free to play with it, you know, send us your feedback, join our Discord. We're always happy to hear from our users, and your feedback helps a lot.
Hamel: And is it based on Llama or something?
Pavel: Sure, it is based on Llama 3 70B.
[28:53] Hamel: When you train these, there are many things that we fine-tune on where we're collecting the data while it's actually just naturally occurring. There might be some dataset of questions and answers that we can find where, just in the natural course of things, they get generated. With function calling, that tends not to be the case. I'm interested, when you create a FireFunction, or in other examples you've seen, what is the process that you use to bootstrap getting a bunch of examples to fine-tune on? So I think there are two parts to it.
[29:25] Hamel: One is what’s the process to bootstrap that process? And then after that, you will have many function calling conversations where there’s good results and many that have bad results, it’s calling the wrong function. And so what are you doing on an ongoing basis or what have you seen on an ongoing basis to curate so that you have another, a better data set for future fine tuning?
[29:50] Pavel: Yeah, that's a great question. So it is like, you know, a multi-stage process. So like probably a good starting point is to look into some of the open source data sets for function calling. And there are a few out there. One of the problems with those is that they are typically like, you know, very focused on one type of objective. For FireFunction V2, for instance, we wanted to try to approximate the conversation capabilities of GPT-4 mixed with function calling. So there are not that many higher quality data sets that contain those more complicated use cases.
[30:30] Pavel: So I don't think that you will be able to fine tune your model purely based on the existing open source data sets, at least not for the most complicated objective. So you would have to indeed invest time in building your own data sets. Some of the tips here are that, first of all, it's kind of important to define the types of data, or the categories of data, you are interested in. So like, you know, should it be more like… parallel function calling, nested function calling, how many turns is it going to be?
[31:03] Pavel: Maybe single turn or more than 10 turns. Another thing to keep in mind is how many functions do you want to support at a time? This is another important parameter. For instance, in some cases, people want to have support for up to five functions. And this is a very different story between tuning for five functions versus tuning for 50 functions. And so it is like really, really important that you kind of have some, you know, objectives in mind or some sort of like, you know, boundary cases for the use cases for the model.
[31:39] Pavel: And you need to make sure that you have representative data in your data set. Actually, one pretty good source of data are also existing multi-agent systems like AutoGen, for instance. Right. So you may look into those, and they typically come with some tutorials that plug in the function calling model. And typically, those multi-agent frameworks, they have pretty complicated scenarios involving different agents with different roles and also pretty complex system prompts. But I would say that by far the hardest data set to get is… one with complex system prompts.
[32:26] Pavel: And like paradoxically, this is also what we have seen among the common use cases among our users. There's a pretty big difference between this clean textbook example, get stock price, which requires almost no system prompt, and the real world use case where you may want to say, okay, I have those five functions and they can be very close to each other. One can be Google search, the other one may be, let's say, looking for a restaurant.
[33:00] Pavel: And in the system prompt, you may want to provide some guidance around which of those functions should be called in which case. It is super hard, not only in the function calling domain, but in the general fine-tuning dataset domain, to find high quality open source datasets with complex instructions. And this is where a lot of effort on our side went into generating some of the synthetic data sets with complex instructions.
[33:29] Hamel: I have one more question that I have. And then if you have time to stick around, we’ve got a bunch of other questions. So when you are generating text that gets shown to people, the real world impact from an adversarial input is not really that great. Like if you say, if someone puts something in and says like, use a curse word, then the worst thing they could see is a curse word. But if you allow someone to make arbitrary function calls or function calls from some set, the security implications are obviously much greater.
[34:04] Hamel: Are there any guardrails or guardrails best practices that you’ve seen that you quite like?
[34:12] Pavel: So in general, I think that almost all of the use cases of function calling I have seen are around read-only functions. I actually haven't seen a lot of use cases around function calls that modify data. For instance, if you are accessing your database, you will have functions that correspond to select statements but not update or delete statements. Of course, it is not a solution and it also doesn't mean that it has no risks, because you can still access some sensitive information this way.
[34:52] Pavel: But yeah, in general, I think this is like a problem that we will need to address at some point, as function calling and multi-agent systems become a little bit more ubiquitous. But I would say that at this point, like…
[35:08] Pavel: I don't think we are at the level where it matters that much yet, to some extent, because I feel like people still kind of focus on, you know, a little bit simpler use cases and basically, you know, not exposing the model to very sensitive APIs, as opposed to, you know, trying to fix it in some way. And I guess like, you know, one way to do it would be to include precise instructions in the system prompts. And that kind of like goes back to the lack of good training data sets with complex system prompts.
[35:45] Pavel: But yeah, I definitely acknowledge it as a problem. And, you know, I'm also going to admit that we haven't looked very heavily into this space yet, but I believe that, you know, as those models and multi-agent systems become more and more popular, there'll be a time where we will need to look into it.
[36:09] Hamel: Yeah, I’ve in the past worried about, even with read-only functions, some denial of service attack, and I get this thing to run some expensive operation very many times. Who knows? We’ve got a bunch of questions. So the first from Wade Gilliam, what should a prompt template look like for all these use cases where we are fine tuning an open source model for a single or multiple tool calls?
[36:39] Pavel: Right. So in general, the system prompt is your choice, but there are some guidelines. So for instance, you know, in one of the slides I was proposing to tune on top of the existing Instruct model and try to preserve as many of the capabilities of this model as possible. So if we are doing that, it's generally better to stick to the prompt format of the Instruct model, not to change it too much. So, you know, in particular, you know, for something like FireFunction V2, like…
[37:12] Pavel: We didn't even introduce a new role for tools, but rather reused the user role, trying to minimize the differences between the Instruct model and the model we're tuning. But yeah, in general, the structure of the prompt doesn't matter as much in the cases where you're looking for single-turn forced calls. Because the only thing that matters here is that you have the correct, you know, prefix for the role. Like you typically have like two roles or like three roles, maybe like one system, one user, and then you have the assistant response.
[38:03] Pavel: So in this case, the format doesn’t matter too much. But yeah, when you are entering this area of like multi-turn chats, it is important to pick the right prompt format.
[38:21] Hamel: I’m taking the questions from most votes to least. I’m going to bring up one, but I think you’ve already answered this. Someone asked very early in your presentation, do you treat the tool as a user or give it its own role? And does message format matter? And I think you showed that it’s got its own role, which is different from the user’s role. And then it says, does message format matter?
[38:48] Pavel: So I think message formats in general, like, matter in the sense that, you know, as I mentioned in one of the slides, we need an easy way to parse the function call, right? So there are some cases in this more generic use case of mixing function calls with general chat where like a single assistant response may contain like a mix of like natural language sentence or sentences and function call like you know for instance like you ask a model, can you give me the stock price of Nvidia? And the model tries to be very nice.
[39:24] Pavel: So instead of just like dumping the function call on you, like it will tell you like, sure, I can do it for you. And, you know, here is the function call. And then, you know, we’re going to inject this special token to indicate that, you know, now it’s time to switch the modes to function call generation. So yeah, it is kind of important, you know, to make the format so that it is easier. you know, for the client to parse.
[39:49] Pavel: And also, you know, if it’s easier for the client to parse, it is typically also easier for the model to generate.
[39:59] Hamel: Okay. We’ve got another question from Wade. I’m going to make a slight addition to this. So he says, what is the most complicated function calling example you have ever seen successfully fine-tuned? And then I’d also be interested about what is the most complicated function calling example you’ve seen just with prompt engineering?
[40:27] Pavel: Sure. So some of the most complicated cases I have seen, you know, being successfully tuned for are actually cases where GPT-4 breaks. So one example here is that, unless it got changed in the last few weeks, but as of a few weeks ago, the OpenAI API has a limit on the number of characters you can provide in the function description, right? So like each function has some like schema that you have to follow in order to define a function. And one of the fields in that schema is description.
[41:12] Pavel: So we can describe what this function does. And they have some limit on the, like the number of characters you can put in the description of this function. So if you have like, you know, very complicated function, like the description may be longer than this limit. But what we can do when we are doing fine-tuning of our models is, we basically can remove this constraint. So we can make the descriptions as long as possible.
[41:40] Pavel: And to make it even more concrete, we have seen cases where people have functions with enum values that are very, very long. Like for instance, like you may have, let's say, 1000 enums, right? And unless like the names are super descriptive, like you may want to actually include some information, you know, about like the meaning of those enums. And the JSON schema that OpenAI supports doesn't have a dedicated field for each enum value. So typically people would put it in the description of the function or description of the parameter.
[42:20] Pavel: So that's where the longer character limit is helpful. In the case of the other question, so like what was the most complex case that we were able to prompt engineer for. So I would say that, you know, that's where the instruction following capabilities of the original model come into play. Right. I have seen cases where, like, maybe the function is not like super
[42:55] Pavel: complicated, and maybe there are not that many functions, because typically the more functions you have, the more the prompt engineering starts breaking, and also the more turns you have, the more the prompt engineering starts breaking. But if you have, let's say, a case where there are not too many turns, and maybe the functions are not too complex, but the instructions are very complex, right? So you may want to have a very precise algorithm saying, in this case, you know, call this function; in that case, call some other function.
[43:27] Pavel: Yeah, so I guess that would be my experience, is that if it's not that much about following the schema of the function or the parameter values, and it's more about, like, very complicated instructions on when to call which function, then prompt engineering in some cases may be even more successful than going through a function calling model.
[43:56] Hamel: There's a question. I guess you've mostly answered this between this slide and a previous comment. So someone asks, any examples you suggest for function calling data sets and evaluation? So you have a bunch of evals or benchmarks on this slide, but other suggestions for function calling data sets, potentially multi-turn, and for function calling evaluations.
[44:39] Pavel: Right. Yeah. So there are data sets out there. So there's the Glaive data set that is of pretty high quality, but it has also pretty limited coverage of different use cases. But it is a sizable data set of pretty high quality for the fine tuning. In terms of the evaluation data sets, there are some data sets from Gorilla, from Berkeley, and some data sets from Nexus that you can use. One caveat I would mention here though is that those data sets are also oftentimes very clean. So there's a very clean use case where you…
[45:28] Pavel: have, let’s say, functions of certain complexity. Like, for instance, the Gorilla dataset, it contains rather simple functions without too many nested parameters. And it also requires the model to follow the Python syntax. So I think that in general, what we do is we start with those publicly available datasets. But if you’re… Okay, let me put it this way. If your model is doing good on those data sets, there is a high chance you can use one of the open source models.
[46:09] Pavel: I mean, things get really hairy when you have some very special use cases where the open source models cannot be of help. And typically in those cases, you would need to invest in your own evaluation data set in order to cover those use cases. But yeah, there are definitely quite a few data sets that you can use to get yourself started.
[46:35] Hamel: I'm going to go back to questions in a second. As we look at this slide, I haven't said congratulations yet. These are like really great metrics, beating GPT-4o across many benchmarks. So this is, yeah, this is right now number one on the Gorilla leaderboard. I mean, not that you should get overly excited about any one leaderboard, but I mean, it's pretty cool.
[47:03] Pavel: Yeah, thanks, thanks. But yeah, as I said, we kind of wanted to… Instead of building our own benchmarks, we kind of wanted to improve the legitimacy of this model by using someone else's data sets. There are also a few things to pay attention to. So for instance, if you look at the Gorilla leaderboard, even for the same model, for instance for GPT-4, there are many variants of the benchmark. There is one that is using the OpenAI function calling API, going through JSON schema.
[47:50] Pavel: And then there is another version where they do some prompt engineering of the GPT. And actually, interestingly enough, some of the GPT-4 versions with prompt engineering do better than going through the API. And that kind of goes back to one of the slides I had there where I was pointing out that generating Python syntax is easier for the model than generating JSON schema, but then you're losing some capabilities, like the grammar enforcement and so forth. So I think it's good to look at the benchmarks to have some rough
[48:31] Pavel: estimate of the capabilities of your model. But I think it’s really, really important to just try things out. So when you do the model selection, just don’t go to the leaderboard and pick the top model, but rather, I don’t know, pick the top five models and try them out on your use case and see which one does best.
[48:54] Hamel: We got another question here. What's a good base model to fine-tune off of for function calling?
[49:05] Pavel: So I would say that Llama 3 is definitely a good model. FireFunction V1 was actually tuned on Mixtral and FireFunction V2 was trained on Llama. And we have seen, so of course we did a lot of changes, including the data set changes, like some of the parameters and so forth. But just switching the base model, we have seen pretty big improvements. Also I would say that the model selection also depends a little bit on the objective.
[49:42] Pavel: So for instance, if you are picking one of those forced function call objectives, and if you do single-turn, and you are okay with Python signature generation, you may want to pick one of those, you know, coding models. Like, you know, Llama, at least Llama 2, had some, you know, model that was tuned specifically for Python code generation. So in this case, this model may do better.
[50:11] Pavel: Also, we are starting to experiment with it because this is a pretty new model, but the Qwen model seems to be also showing a lot of promise. So I don't have a conclusion here yet, but based on some experiments, it seems to be doing pretty well for this use case.
[50:31] Hamel: Yeah, there’s a question a little lower down, but it’s one that… My impression is that this is actually a quite hard problem. So if you have a long chain of calls, what is a preferred way to retain memory of state across that chain of calls?
[50:52] Pavel: Okay, so this is like, you know, a very open-ended question. So in general, if you do have, you know, very long conversations, it may be also better to pick a model with longer context, right? So for instance, you know, Llama 3 has only like 8k context, right? So it may not be like the best use case for very, very long conversations. But then like, you know, you also need to make sure that you have enough of the long conversations in your tuning data set to kind of make sure that the model can capture this property.
[51:28] Pavel: So this is kind of like the easy way out, right? But of course, you know, you can use your imagination as the limit, right? So there are, of course, things you can build, like layers you can build on top of your function calling model. So OpenAI has this memory layer, right?
[51:52] Pavel: But you can build your own version of it where you can, let's say, as your conversation is going out of the context window of the model, you may still want to use some sort of approach where, you know, for this user question,
[52:11] Pavel: you will try to, like, find the messages from the previous conversation that fell out of the context window and kind of inject them back, right? So basically that would fall under the umbrella of intelligent conversation pruning, right? So typically the most naive approach to, you know, working around the context size limitation is that you take the top messages, like, you know, the early messages, and then you take the tail messages, and you kind of remove the messages from the middle, right?
[52:48] Pavel: Because typically, like, you know, the beginning, like the end of the conversation are the most relevant parts. But of course, you can have a smarter algorithm to do some sort of semantic matching between the latest user instruction and the middle messages. I don’t think I’m going to invent anything super smart on the spot, but I guess there are definitely a lot of opportunities. And if people are interested in exploring this area, like how to do intelligent message selection. And that would be potentially a very high impact area.
[53:27] Hamel: For what it's worth, there's something that I have on my backlog to explore, which is instead of keeping track of the whole conversational back and forth, at the end of each turn, basically have the model summarize, like, here is the conversation that we want to carry forward. So you just drop basically all the messages and you just pass its own summary to its future self. I've heard this called a whiteboard approach, where it gets to write on the whiteboard, and then that's what gets passed. But it's on my to-explore list.
[54:07] Pavel: Yeah, here you go. I think this is an excellent idea.
[54:14] Hamel: Let's go back to some upvoted questions. This is currently leading in upvotes, but I think it's a quite hard and open-ended question. It is, how would you go about creating an agentic team of all of these models?
[54:37] Pavel: Yeah, I mean, you know, there are those multiple multi-agent frameworks, right, where you can describe agents. So I would basically probably start with identifying the strengths and weaknesses of individual models, right? Because, you know, probably the easiest way to prototype and play with some ideas would be to leverage one of the existing multi-agent frameworks. And in those frameworks, typically you define agents, right? And those agents have descriptions so that you can put some sort of routing or controller model in front of them.
[55:20] Pavel: And based on those descriptions, the controller model would be able to redirect individual messages to the appropriate agent. So I guess the good first step would be to identify Like, what are the strengths and weaknesses of individual models? And once you have that, you can wrap them up with your APIs, right? But also, like, I guess, like, you need to be a little bit…
[55:48] Pavel: careful here, because, you know, things are getting pretty hairy as the conversation is getting longer and longer. Like if you just have, you know, an agent that is capable of a certain thing, but it is called deep in the conversation, it may be, I don't know, missing some sort of context from the previous exchanges, right? So you need to make sure that you are building not only the proper routing layer but also, as you said, like, you know, to make sure that you can properly
[56:19] Pavel: extract the context and pass it to that model, and also make sure that, you know, that model is able to interpret this context correctly. But yeah, I mean, I guess there are, like, you know, different approaches, and this is like the most direct one because it's based on just wrapping the existing models. But there has been also a lot of interesting work around merging multiple models, right? There is this open source project MergeKit.
[56:49] Pavel: Where you can take layers from individual models and mash them together, creating those Frankenstein types of models, composed of multiple pieces of existing models. I didn't personally explore this area, but it could be also a pretty interesting way to combine the skill sets of multiple models into one model.
[57:14] Hamel: I think that's a great answer. For what it's worth, when Harrison Chase was with us, he talked about how he thought having multiple models together was a very exciting direction. And then I asked, is it really the case that we think models have different strengths so much that you could actually benefit a lot from this routing? He said that the thing that he's seen that's been successful is not so much having different models with different strengths, but basically having like a big model and a small model.
[57:51] Hamel: And then you’re like, oh, this task, I just want it to be quick. And so we’ll just use a small model and here’s some other tasks that’s like strategic and complex. And when you route it to the big model, which I hadn’t heard that before, but that strikes me as less complex than some of the things I hear people talk about for like teams of different models that are all specialists in various ways.
[58:12] Pavel: Yeah, this is a great point. So, you know, it kind of depends what you are trying to optimize for, right? Like if you have no constraints, you know, you just go with GPT-4, right? Like it's expensive, but it's intelligent. And, you know, it's also slow, but that may be okay. If your optimization objective is more around cost or latency reduction, then, you know, it definitely makes sense to…
[58:42] Pavel: start exploring those, like, more, you know, I don't know, hierarchical approaches, where you maybe have some intelligent model that doesn't require that much context and is just crafting this routing decision. And you basically get a bunch of those less intelligent, but more specialized models. And I guess function calling fits here. So for instance, what you could do is you could, let's say, fine tune your model for single-turn function calling, right? And if you identify that the use case fits within that, you just use this model.
[59:20] Pavel: And maybe instead of using Llama 3 70B, you would use a way smaller model for this use case because it doesn't require as much intelligence. And this way, you would reduce the cost. So I definitely buy this argument.
[59:37] Hamel: We've got, I think, the next three questions are all questions I really like. So the next one from Andrew is, how does your fine-tuning compare to the Gorilla project from Berkeley's BAIR, in terms of, whether that's, depth of function signatures, breadth of functions? He says he's only briefly seen Gorilla and has not used it. But how does your work differ from the Gorilla project?
[1:00:05] Pavel: Yeah, so the Gorilla project is mainly focused on single-turn use cases and forced function calling and generating Python signatures, right? So typically there is an instruction from the user there that results in function calling. And they actually support a pretty good variety of function calling scenarios, so they have you know, single function calling, they have parallel calling, they have nested calling.
[1:00:37] Pavel: However, like, based on my understanding, like, the way that this model was tuned is tuned mainly for, like, pretty simple, pretty basic functions that don’t have, like, too many parameters or maybe, like, don’t have, like, very complex parameters. But it is a model of a pretty decent quality. However, as I said before, many of the real world use cases that we see from our users involve like very precise system prompting that like require model to behave like in a very specific way and also mixing conversations with function calling.
[1:01:19] Pavel: So I think this is like, although I didn't evaluate that model under those scenarios, but like, you know, for instance, the Gorilla leaderboard doesn't contain any tasks for that. And even if you look at this slide right here, I specifically included MT-Bench. And it's a multi-turn bench, which basically doesn't involve any function calling. It is like the pure, you know, just general instruction following. So I really wanted to include it for our model to make sure that our model can still perform non-function calling tasks.
[1:01:57] Pavel: Nexus Raven is also a very good model. It’s actually, it includes some of the samples that include deeply nested, pretty complex parameters. So to some extent, like at least if you look at the benchmarks, like the Nexus benchmarks are more challenging than the Gorilla leaderboard benchmarks. If I remember correctly, like that model is also focused mainly on the Python function calling. Yeah, I guess that’s the brief summary of what I remember about those alternative models.
[1:02:45] Hamel: Okay, this next question is another one that I exceptionally like because it's someone who has a real use case problem they're trying to solve, though it sounds extremely difficult. So I'll be interested in your reaction to this. Misha says, I have a very complex use case. For Home Assistant, I need to train a model that will work with hundreds of functions for a smart home and all the Home Assistant standard things, and the model should be local. Any idea on the smallest model that I can start with? And then there's more to it.
[1:03:18] Hamel: But he says ideally he should be able to run it without a GPU. Why don’t we stop there? Does that even sound possible?
[1:03:30] Pavel: I think it sounds possible. Let me speculate a little bit, play with those assumptions, and see what we can come up with. So say I want to support a very large number of functions and host the model locally. In general, the inference speed will depend on the hardware, of course, on the number of functions (which affects the context size), and also on the size of the model. But there are some things you can do.
[1:04:13] Pavel: So for instance, you could make sure that the function spec is provided in the system prompt, at the top of the context, and you don’t change it throughout the conversation. Then, even before you expose the model to actual traffic, you run this system prompt through the model and pre-populate the KV cache. And then this KV cache stays in memory.
[1:04:53] Pavel: So you may have multiple users chatting with the model, but whenever a new session starts, you can just copy the KV cache that already has the function definitions pre-populated. This is one way to reduce the serving cost. Another way would be to put those functions behind some sort of RAG. For each function, we compute an embedding based on its definition, so we don’t have to include all of them in the system prompt.
[1:05:37] Pavel: And then, based on the conversation with the user, you embed the user prompt, match it against, let’s say, the top five functions, and inject only those five into the prompt, which reduces the prompt size. And of course, using smaller models could be an option. The Phi models by Microsoft seem to be doing pretty well for their size, so you could try them out. If an 8-billion-parameter model is an option, trying the Llama family of models could also work.
[1:06:24] Pavel: But yeah, in general, there are many things you can do here.
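A minimal sketch of the second idea above, putting the functions behind retrieval: embed each function definition once, then per user turn retrieve only the top-k most relevant functions and inject those into the prompt. This assumes a sentence-transformers embedding model; the function specs are toy smart-home examples, not a real API.

```python
# Retrieve only the functions relevant to the current user message.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any small embedding model works

functions = [
    {"name": "set_thermostat", "description": "Set the target temperature of a thermostat in a given room."},
    {"name": "toggle_light",   "description": "Turn a smart light on or off."},
    {"name": "lock_door",      "description": "Lock or unlock a smart door lock."},
    # ... hundreds more in the real use case
]

# One-time step: embed every function definition.
function_texts = [f"{f['name']}: {f['description']}" for f in functions]
function_embeddings = embedder.encode(function_texts, convert_to_tensor=True)

def select_functions(user_message: str, k: int = 5):
    """Return the k function specs most similar to the user message."""
    query_embedding = embedder.encode(user_message, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, function_embeddings)[0]
    top_indices = scores.topk(min(k, len(functions))).indices.tolist()
    return [functions[i] for i in top_indices]

# Only these selected specs go into the (much shorter) prompt.
print(select_functions("It's too cold in the bedroom", k=2))
```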
[1:06:33] Hamel: Yeah. He asked, any idea of the smallest model he can start with? So you said Phi. I think there’s also a new set of models out; Qwen2 has something like a 2-billion-parameter model. And it should run without a GPU. Very cool. I guess he said it has to be local. If you were putting that on an embedded device and adding RAG and all this stuff, it starts to get pretty complicated. But if it’s local on a conventional system, then it sounds more plausible.
[1:07:13] Hamel: I really like that. That RAG idea is a great idea. The next one, I think, is another really interesting question. Have you tried using function calling with GraphQL? Any tips or tricks, for example, by leveraging the graph data structure?
[1:07:26] Pavel: I guess you could. In general, it’s a question of how to map GraphQL to, let’s say, function specs, right? Both are basically structured data formats. We call it a function call because it is intended to invoke something, but “something” is a pretty abstract concept. At the end of the day, we’re basically training the model to generate structured data, and GraphQL also fits the definition of a structured format.
[1:08:15] Pavel: So I guess you could try existing function-calling models: define the graph query as a function and see if you can leverage the model to fill out the missing fields. And also, the grammar mode (constrained generation) should actually be pretty useful in this case.
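One hedged way to try this: express a GraphQL operation as a tool spec whose parameters are the operation’s variables and selectable fields, then render the model’s arguments back into a query string. The operation, field names, and rendering rules below are all illustrative, not a standard mapping.

```python
# Hypothetical mapping of a GraphQL operation onto a JSON-schema tool spec.
get_user_orders_tool = {
    "type": "function",
    "function": {
        "name": "get_user_orders",
        "description": "GraphQL query: fetch a user's recent orders with selected fields.",
        "parameters": {
            "type": "object",
            "properties": {
                "userId": {"type": "string", "description": "GraphQL variable $userId"},
                "first":  {"type": "integer", "description": "Number of orders to return"},
                "fields": {
                    "type": "array",
                    "items": {"type": "string", "enum": ["id", "total", "status", "createdAt"]},
                    "description": "Which order fields to include in the selection set",
                },
            },
            "required": ["userId"],
        },
    },
}

def to_graphql(args: dict) -> str:
    """Render the model's tool-call arguments back into a GraphQL query string."""
    fields = " ".join(args.get("fields", ["id", "total"]))
    return (
        f'query {{ user(id: "{args["userId"]}") '
        f'{{ orders(first: {args.get("first", 5)}) {{ {fields} }} }} }}'
    )

print(to_graphql({"userId": "42", "first": 3, "fields": ["id", "status"]}))
```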
[1:08:44] Hamel: Cool. Yeah, a lot of questions. Let me see what else has upvotes now. Okay, I don’t know how much time you have, but there’s a question about whether you can demo how to run FireFunction. Wade says, I’m not sure how to provide it the function definitions to use.
[1:09:10] Pavel: Yeah, I’m sorry, I have an issue with screen sharing, so I won’t be able to demo. But maybe you can join our Discord and ping me; we have a channel for function calling, and I can follow up with you there to share some examples.
[1:09:35] Hamel: And then is the way that you call it the same for FireFunction V1 and FireFunction V2?
[1:09:44] Pavel: Yeah, in terms of the API, we are just relying on the OpenAI-compatible API. So just go to the OpenAI SDK spec and follow the same structure for the input. You can define the functions in the same way, and in the same format, as you do through the OpenAI API.
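A minimal sketch of that call shape, using the OpenAI Python SDK pointed at Fireworks’ OpenAI-compatible endpoint. The base URL and model identifier are what Fireworks documents at the time of writing; double-check both against the current Fireworks docs, and the tool definition is a toy example.

```python
# Call a FireFunction model through the OpenAI-compatible chat completions API.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the latest price for a stock ticker.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

response = client.chat.completions.create(
    model="accounts/fireworks/models/firefunction-v2",  # same call shape works for v1
    messages=[{"role": "user", "content": "What is NVDA trading at?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```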
[1:10:10] Hamel: Okay, and we should also join your Discord. Maybe because I’m staring at these benchmarks, I’m quite excited to try this stuff, so hopefully others are as well, and a bunch of us can join your Discord. There’s a question from an hour ago, and it’s, I think, about a very specific slide or point in the presentation: “Is this called multi-agent?” I don’t know if you remember where that question might have come up.
[1:10:57] Pavel: Yeah, I think second slide, maybe.
[1:10:59] Hamel: All right, let’s see if we can go back. Probably this. Not this one. This one, right?
[1:11:07] Pavel: No, just go to the second slide. Okay. Yeah, this one. We mentioned multi-agent systems here.
[1:11:23] Hamel: Okay. “Commonly used to orchestrate agents in a multi-agent system.” How do you define a multi-agent system?
[1:11:34] Pavel: So you can basically think about a function as an agent, right? You can wrap it any way you want, but this is one way to plug function calling into an SDK that has abstractions that make it easy to build applications on top of. There are many open-source multi-agent frameworks where, in the simplest case, you define a Python function and then say, okay, this is an agent.
[1:12:10] Pavel: The framework is going to extract the schema of this function and translate it into a JSON schema object that can be passed to the function-calling model. Then, when the model responds with a function call, the framework can automatically parse the response, call the function, get the result back, and pass it to the model in the next iteration.
[1:12:37] Pavel: But I guess this is also a valid question, in the sense that the term “agent” is abused quite a bit and is not super well defined; its definition is context dependent. To me, an agent is a component with well-defined functionality that can be interacted with through a well-defined API. Yep.
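As a rough illustration of what such a framework does under the hood, here is a sketch that derives a JSON-schema tool spec from a plain Python function and dispatches the model’s tool call back to it. The type mapping and helper names are simplified placeholders, not any specific framework’s API.

```python
# Derive a tool spec from a Python function, then route a tool call back to it.
import inspect
import json

def set_thermostat(room: str, temperature: float) -> str:
    """Set the target temperature (Celsius) for a room."""
    return f"Thermostat in {room} set to {temperature}C"

PY_TO_JSON = {str: "string", float: "number", int: "integer", bool: "boolean"}

def to_tool_spec(fn):
    """Build a JSON-schema tool spec from the function's signature and docstring."""
    sig = inspect.signature(fn)
    props = {
        name: {"type": PY_TO_JSON.get(p.annotation, "string")}
        for name, p in sig.parameters.items()
    }
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": (fn.__doc__ or "").strip(),
            "parameters": {"type": "object", "properties": props, "required": list(props)},
        },
    }

def dispatch(tool_call, registry):
    """Call the registered function named in the model's tool call."""
    fn = registry[tool_call["name"]]
    return fn(**json.loads(tool_call["arguments"]))

registry = {"set_thermostat": set_thermostat}
print(json.dumps(to_tool_spec(set_thermostat), indent=2))
print(dispatch({"name": "set_thermostat", "arguments": '{"room": "bedroom", "temperature": 21.5}'}, registry))
```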
[1:13:02] Hamel: Thank you. We’ve got a couple of questions again from Wade about thoughts on just using the same syntax we use to provide tools and retrieve results from OpenAI tool calling. I think you’ve said that that’s what you do. Yeah. Cool. Okay, here’s a question which I think comes back to some of the challenges with running fine-tuning systems in general. Any recommendations on how to handle changes or updates to APIs?
[1:13:47] Pavel: So in general, at the end of the day, fine-tuning is just a dataset and a process, plus hyperparameters. Once you have prepared your dataset, if there is any change in the APIs, say we want to go from Python syntax to JSON schema, typically what we do is store the data in some sort of canonical format. Either of those syntaxes can be designated as the canonical format.
[1:14:28] Pavel: And then we have some sort of translation layer, so that during training, when we read the data, we can convert it into a different syntax. That’s one way. But also, if you are standardizing around OpenAI-style APIs, you actually don’t need to modify the model itself. You can write some sort of wrapper around the API that translates the call into the syntax you need, and that is typically what you do on the client side.
[1:15:08] Pavel: The idea is that you basically don’t want to have two complex APIs. You prefer to have one API and then different translation layers that users can customize.
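A small sketch of the canonical-format-plus-translation-layer idea: store function specs in one canonical form (JSON schema here) and render them into the target syntax (a Python-style signature here) when the training data is read. The rendering rules are illustrative and deliberately minimal.

```python
# Render a canonical JSON-schema function spec as a Python-style signature at data-loading time.
JSON_TO_PY = {"string": "str", "number": "float", "integer": "int", "boolean": "bool"}

def to_python_signature(spec: dict) -> str:
    fn = spec["function"]
    params = fn["parameters"]["properties"]
    required = set(fn["parameters"].get("required", []))
    rendered = []
    for name, schema in params.items():
        py_type = JSON_TO_PY.get(schema.get("type", "string"), "str")
        rendered.append(f"{name}: {py_type}" if name in required else f"{name}: {py_type} = None")
    return f'def {fn["name"]}({", ".join(rendered)}):\n    """{fn["description"]}"""'

canonical_spec = {
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the latest price for a stock ticker.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}, "exchange": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}
print(to_python_signature(canonical_spec))
```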
[1:15:19] Hamel: Yeah, and it seems like some people might fine-tune a model to make specific function calls, and this would be a reason to say, actually, let’s put the function definitions in our prompt and use a more general model. Then, when our functions or APIs change, we can just update the prompt rather than needing to train a new model, potentially on data for the new API that doesn’t exist yet.
[1:15:47] Pavel: Right. Yes.
[1:15:48] Hamel: Okay. Here’s a question we get, in different guises, across many different presentations. I’ll be interested in your reaction specifically for fine-tuning with functions. The question is, how many training samples would be good enough for LoRA fine-tuning? You could answer that in general, but I think we’re especially interested in cases where you are fine-tuning for calling a specific set of functions.
[1:16:27] Pavel: Yeah, so actually surprisingly few samples. If your data is of high enough quality, I would say that around 1,000 samples, maybe even fewer, could be good enough. In many cases I have actually seen people investing time in hand-curating samples, because a few hundred samples is something you can manage to curate by hand. It also depends on what kind of techniques you use.
[1:17:09] Pavel: For instance, whether you use a single-stage process where you just do supervised fine-tuning, or you also use some sort of alignment with DPO. I would say for supervised fine-tuning you need maybe around 1,000 samples; if you do LoRA fine-tuning, of course, it depends a lot on the base model, the use case, and so forth. For DPO tuning, we have also seen very promising results with very few samples. Even around 100 samples should give you a reasonable boost in model quality.
[1:17:52] Hamel: Well, there’s another question here that I liked. What are best practices for generating synthetic datasets? And here I’m going to add: for function calling.
[1:18:10] Pavel: So I think the two most important things are that you need proper prompts and seed datasets, and also that having a good-quality model to generate those samples is super important. And of course, legally you’re not supposed to use the closed-source models. OpenAI, Anthropic, and also Google basically won’t allow you, legally, to use their models to generate data for tuning other models. So you have to be pretty creative. But on the other hand…
[1:18:52] Pavel: if you use some technique to post-filter the samples, then even weaker models are able to generate pretty high-quality datasets on average; they just make more mistakes or require a little bit more prompt engineering. So I would say you will get more false or bad-quality samples with an open-source model compared to a closed-source model.
[1:19:24] Pavel: But if you do proper filtering of those bad samples, then you should be able to get comparable-quality datasets. For instance, one thing I can tell you from my own experience: at Fireworks we also have a family of multimodal models that we train on top of the LLaVA architecture. The reason we did that was that the original LLaVA dataset was generated with OpenAI models, and because of that, the original LLaVA was not permissively licensed for commercial use.
[1:20:07] Pavel: So what we did was regenerate the synthetic dataset that was used for LLaVA using open-source models, and then we retrained a version of LLaVA on this dataset, and we saw almost no quality degradation. So it is definitely possible to use open-source models for synthetic data generation, but it requires a little bit more effort. And also, when you go into synthetic data generation, have a plan.
[1:20:37] Pavel: I think it’s more important to have more variety of cases as opposed to having more examples of the same case. For instance, for single-turn forced function calling, you will get diminishing returns if you generate more than, I don’t know, let’s say 200 samples for that case. So it’s more important to have a plan: okay, I want to support single-turn forced calling, parallel calling, nested calling, and for each of those have a good prompt.
[1:21:15] Pavel: Another thing that helps a lot, and it may be obvious, is including few-shot samples in the prompt. Instead of just providing a description of the data you want to generate, include a few samples of what you would like to see; that’s helpful. Also, varying the temperature helps, because then the model gets a little more creative. But those are the pretty obvious things.
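A sketch of that recipe: few-shot examples in the prompt, varied temperature for diversity, and schema-based post-filtering of bad samples. Here `generate` is a stand-in for whichever permissively licensed open model you use, and the schemas and few-shot examples are toys.

```python
# Generate synthetic function-calling samples and keep only those that validate.
import json
import random
from jsonschema import validate, ValidationError

FEW_SHOT = """\
User: What's the weather in Berlin tomorrow?
Call: {"name": "get_weather", "arguments": {"city": "Berlin", "date": "tomorrow"}}

User: Turn off the kitchen lights.
Call: {"name": "toggle_light", "arguments": {"room": "kitchen", "on": false}}
"""

ARG_SCHEMAS = {
    "get_weather": {"type": "object", "properties": {"city": {"type": "string"}, "date": {"type": "string"}}, "required": ["city"]},
    "toggle_light": {"type": "object", "properties": {"room": {"type": "string"}, "on": {"type": "boolean"}}, "required": ["room", "on"]},
}

def generate(prompt: str, temperature: float) -> str:
    raise NotImplementedError("call your open-weights generator here")

def is_valid(sample_line: str) -> bool:
    """Keep only lines whose call parses and whose arguments match the schema."""
    try:
        call = json.loads(sample_line.split("Call:", 1)[1])
        validate(call["arguments"], ARG_SCHEMAS[call["name"]])
        return True
    except (ValueError, KeyError, IndexError, ValidationError):
        return False

def make_samples(n: int):
    kept = []
    while len(kept) < n:
        temp = random.choice([0.7, 0.9, 1.1])  # vary temperature for diversity
        raw = generate(FEW_SHOT + "\nUser:", temperature=temp)
        kept.extend(line for line in raw.splitlines() if is_valid(line))
    return kept[:n]
```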
[1:21:43] Hamel: I have two follow-ups on that. Actually, three. One is: you said that you are doing some work with LLaVA. Do you have a model that can take both text and images as input and then do function calling?
[1:22:03] Pavel: Not yet.
[1:22:03] Hamel: Okay. When you do, I want to hear about it. But okay, the second, and I think there are some comments about this in the Discord: you mentioned DPO in the context of function calling. How does that work? Where do the preferences of what’s better and what’s worse come from?
[1:22:27] Pavel: Right, so I’m not sure if it’s a typical approach or not, but we would do the supervised fine-tuning first and then, let’s say, share the model with our Discord users. They would start reporting some failure cases, and some of those failure cases could be translated into more generic DPO negatives. Because the difference between SFT and DPO is that DPO also has a negative, right?
[1:23:01] Pavel: So we not only teach the model what to do, we also tell it what not to do. I don’t think DPO is that useful if you have very simple system prompts, but DPO is really useful if you want to make your model follow the system prompts more precisely. And the reason goes back to what I mentioned before: you typically have a very limited amount of instruction data with complex system prompts.
[1:23:41] Pavel: DPO is a little bit more efficient in terms of utilizing the information, because you have to show a lot of positive examples for the model to discover a pattern of what to do, but it takes relatively fewer samples if you can mix both positives and negatives. So show it: okay, do this, but don’t do that.
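For a concrete picture, here is an illustrative shape of one such preference pair: the prompt carries the (complex) system prompt plus the conversation, “chosen” is the desired behavior, and “rejected” is a reported failure mode. The field names follow common DPO trainer conventions (for example, TRL’s `DPOTrainer` expects `prompt`/`chosen`/`rejected`); this is not Fireworks’ internal format.

```python
# One illustrative DPO preference pair for function calling.
dpo_sample = {
    "prompt": (
        "SYSTEM: You may only call functions from the provided list. "
        "If no function applies, answer in plain text.\n"
        "USER: Book me a table for two tonight."
    ),
    "chosen": '{"name": "book_restaurant", "arguments": {"party_size": 2, "date": "tonight"}}',
    "rejected": "Sure! I've booked you a table for two.",  # hallucinated answer instead of a call
}
```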
[1:24:07] Hamel: And staying with this idea of distinguishing better from worse, when you generate synthetic data, do you ever generate both the code and then some tests that you want to run on it and then use that to filter out the bad code that would then go into your SFT step?
[1:24:29] Pavel: I think that’s a great idea. We actually don’t do it, but I think it’s a good idea.
[1:24:34] Hamel: That’s it. Okay. What else do we got here? This is a question from a while ago. You can choose to take this seriously or not.
[1:25:02] Pavel: I mean, it feels good, but I also don’t think about it this way, right? You have a problem at hand and you want to address it, you basically want to fix it. And then you see that your model is behaving well in scenarios you didn’t even think about when you were preparing the data. It’s definitely very rewarding, because then you can see that the model was able to generalize beyond what we envisioned for it.
[1:25:49] Pavel: I still feel that we are quite far from general AGI capabilities, but function calling will definitely make it more real, in the sense that if we do have an AGI-capable model, then through function calling we basically give it access to the real world. So I guess I feel pessimistic about the future of the human race: you give the models all the tools to do things in the real world, and function calling becomes the proxy to that. It’s kind of an interesting thing to think about.
[1:26:30] Pavel: But I don’t think we’re close to that. So I basically don’t honestly think too much about it at this point.
[1:26:40] Hamel: Yeah. Well, the most recent question was another one about AGI, so given your last answer, I’ll skip that one. We’re also coming up on, I don’t even know whether it’s two or three times the original scheduled time. So let’s see what else we’ve got here. I think this is a question that I asked, about whether there are best practices for creating custom function-calling datasets. I don’t know if you have anything else to add beyond what you’ve said.
[1:27:27] Pavel: Well, it is definitely important to have the right data. Actually, maybe I can reframe the question a bit: how important is the data versus the hyperparameters versus the base model, right? I would say that the recent trend we have seen, at least with fine-tuning, is that since models are getting more intelligent and people are using less data, the quality of the data becomes more and more important, because you want to maximize the signal density in your data given that you have a limited number of samples.
[1:28:15] Pavel: But we have also seen that if you train with fewer data samples, the model becomes more sensitive to hyperparameters. Every 2x or 3x change in the learning rate can make a big difference in the quality of the model, so it definitely makes sense to play with the hyperparameters when doing fine-tuning. And also, even with the same learning rate, it may make sense to run a few
[1:28:57] Pavel: training rounds, because unless you fix the random seed across the entire board, there is still some randomness that leaks into the tuning process, and you may get slightly different quality models even with the same dataset. So it helps to run multiple rounds and pick the best variant. So yeah, as I said before, if you don’t have to go into fine-tuning, just don’t
[1:29:27] Pavel: go there, because slides are one thing and it may sound easy, but at the end of the day, if the use case is reasonably complex, it will definitely take a lot of time to not only prepare the right data but also play with different kinds of parameters, do tons of tests, and so forth. So it’s not super easy. It is rewarding, but it takes time.
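A minimal sketch of the “run a few rounds and pick the best” advice above: sweep a couple of learning rates and seeds and keep the best run on a held-out eval. Here `finetune` and `evaluate` are stand-ins for your own training and evaluation harness, and the learning-rate grid is only an example.

```python
# Sweep learning rates and seeds, keep the run that scores best on a held-out eval.
import itertools

def finetune(dataset, learning_rate: float, seed: int):
    raise NotImplementedError("run your LoRA fine-tuning job here")

def evaluate(model) -> float:
    raise NotImplementedError("score the model on your held-out function-calling eval")

def sweep(dataset, learning_rates=(1e-5, 3e-5, 1e-4), seeds=(0, 1, 2)):
    runs = []
    for lr, seed in itertools.product(learning_rates, seeds):
        model = finetune(dataset, learning_rate=lr, seed=seed)
        runs.append((evaluate(model), lr, seed, model))
    best_score, best_lr, best_seed, best_model = max(runs, key=lambda r: r[0])
    print(f"best run: lr={best_lr}, seed={best_seed}, score={best_score:.3f}")
    return best_model
```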
[1:29:56] Hamel: This is one of the last talks of our conference, and it has been so interesting and enjoyable. For people who want to follow you and keep hearing about all the new things you’re doing, what’s the best way to keep track of your work?
[1:30:17] Pavel: Sure. I’m not as active on Twitter/X as I should be, but I have an account there, so you can look me up. I’m a little more active on LinkedIn, so feel free to look me up there. Also, follow Fireworks HQ on X and on LinkedIn. We have a Discord where we are also pretty active, so feel free to join it; we have a channel for function calling. Yeah, and thank you so much for having me.
[1:30:53] Pavel: It was a very enjoyable experience to hear the questions from the community, and I’m looking forward to getting more feedback and working with you guys. As a company, we’re really committed to supporting developers, and that’s why we are investing in cheap and efficient inference. Our inference service is built around LoRA tuning that you can use to experiment with different variants, and the model hosting is pretty cheap.
[1:31:34] Pavel: So we are very motivated to make developers happy and are always interested in your feedback. Definitely keep in touch.
[1:31:48] Hamel: Okay. I will do that. And I think many of us will. And it sounds like actually getting into the Fireworks Discord might be a good way to just stay in the loop about what you guys have going on as well.
[1:32:02] Pavel: Yeah, yeah. As a company, we are active on X. It’s just me personally. I’m not super active. But yeah, definitely follow us on X and on LinkedIn as well.
[1:32:14] Hamel: Okay. Sounds good. Well, thanks so much.
[1:32:19] Pavel: Yeah. Thank you for having me. Have a good day.
[1:32:22] Hamel: Bye.