Creating, curating, and cleaning data for LLMs


July 8, 2024


Good data is a key component for creating a strong LLM. This talk will outline approaches to getting the best data for training your LLMs. The talk will cover: 1. How to find existing datasets to build on top of 2. Approaches to creating synthetic data 3. Practical techniques and tools for exploring, deduplicating, and filtering datasets to enhance their quality.

Subscribe For More Educational Content

If you enjoyed this content, subscribe to receive updates on new educational content for LLMs.


00:00 Introduction

Daniel Van Strien and David Berenstein introduce themselves and provide an overview of their talk. They discuss datasets in the context of Large Language Models (LLMs) and briefly outline the features available in the Huggingface datasets.

02:31 Reusing Existing Datasets

Huggingface offers a wide range of datasets that are tailored to specific domains and tasks, though their relevance to your specific use case may vary. They provide various tools for searching, viewing, and exploring datasets.

07:14 Creating Your Own Datasets

Datasets can be created by restructuring existing data, incorporating user feedback to tailor preferences, utilizing internal data sources, or generating synthetic data. The discussion includes preprocessing requirements essential for training LLMs.

09:04 Dataset Preparation

Daniel explains the importance of formatting datasets to meet specific requirements for training LLMs, emphasizing scoping and planning based on user needs.

11:09 Supervised Fine-Tuning Datasets

These datasets consist of question-answer pairs used to fine-tune models for specific tasks, facilitating the mapping of high-level concepts to data.

12:56 Direct Preference Optimization (DPO) Dataset

Pairs inputs with preferred and rejected responses to guide models in generating desired outputs using ground truth and suboptimal examples.

14:43 Kahneman Tversky Optimization (KTO) Datasets

These datasets feature binary feedback (thumbs up or thumbs down) on model responses, easily collected from user interactions in existing systems.

15:47 Spin and Orpo as Alternatives to DPO

Spin generates synthetic data from minimal initial datasets to reduce data requirements, while Orpo streamlines training by skipping the fine-tuning step, employing a format similar to DPO.

17:56 Synthetic Data

David discusses how LLMs generate synthetic datasets, enhancing model quality and complexity through prompts, completions, and AI-generated feedback for refining preferences.

20:25 Issues with Synthetic Data

David highlights concerns such as hallucinations, toxicity, and stereotypes in models trained with synthetic data, potentially stemming from biases in the data-generating models.

21:18 Instruction-Based Dataset Evaluation

Models complete instructions evaluated by GPT-4 for criteria like truthfulness and helpfulness, with simplification to an overall rating to reduce costs. Human review reveals coding errors, stressing the need for validation.

24:20 Considerations in Synthetic Data Creation

Efficient scaling requires avoiding vendor lock-in, ensuring fault tolerance, and generating structured data formats like JSON, highlighting the complexity of the process.

25:17 Outlines Package

Produces structured text generation with JSON output, optimizing token sampling for efficiency and accuracy to reduce inference time.

26:10 DSPy Package

Focuses on programming prompts for LLMs, optimizing prompts and model weights through multiple API calls to improve prediction accuracy.

27:09 Distilabel Framework

Uses a directed graph structure to generate synthetic data and AI feedback, enabling scalable and parallel execution for efficient data processing.

28:19 Improving Data Quality

David discusses the iterative process of dataset improvement, emphasizing evaluations of diversity, quality, and quantity, where better data means higher quality rather than simply more data.

29:57 Data Improvement Strategies

Deduplication and custom techniques like hashing and rule-based approaches using regex can enhance data quality.

31:53 Advanced Techniques for Data Cleaning

Utilizing zero-shot models for initial topic predictions, classifiers for precise filtering, or LLMs for rationale-backed decisions, alongside intuitive text descriptive tools for straightforward data analysis.

32:27 Tools for Annotators

David showcases various annotation tools available, ranging from pre-made interfaces to custom Gradio setups and robust tools like Lilac and Argilla.

41:41 Example Dataset Walkthrough

Daniel walks through example DPO and KTO datasets, detailing the approach taken during dataset creation.

45:00 Case Study: LLM Summarizer

Daniel discusses the pipeline for a summarizer he’s developing, including preparations for the preference data pipeline.

50:48 Data Preparation Repository

Daniel shares a repository containing notebooks covering the topics discussed in the talk.

51:42 Resources and Conclusion

Daniel briefly discusses using Huggingface course credits and highlights additional resources on data duplication strategies and synthetic data generation.


Links to resources mentioned in the talk:

Full Transcript

[0:05] Daniel: So the plan in this session is to kind of do a little bit of a high level overview, focusing on this topic of creating, curating and cleaning data. So I think this is already been discussed quite a lot in the course and I think it’s come up in various points. So there’s probably some stuff that we’ll say that you’ll already be familiar with, but I think the idea is to hopefully… give you some ideas for how to approach building datasets for fine-tuning large language models.
[0:42] Daniel: So I’ll just quickly introduce myself and then let David introduce himself and then we can kind of dive in. So my name’s Daniel Van Streen. I work as a machine learning librarian at Hugging Face. I can talk more about what that means at the end of the session if anyone’s interested. But my background, as the kind of name implies, is very much in libraries.
[1:05] Daniel: So I kind of fell into machine learning and large language models in the same way probably a lot of other people did via the Fast AI course that I took years ago now. And.
[1:19] Daniel: But I think the library background is kind of nice for this topic because one thing that you will do a lot in libraries is look at lots and lots of data and think about how to organise data and how to structure things in a systematic way and I think that is a big part of working with data in the context of large language models so I think those kind of skills can be quite useful even though they’re a little bit outside of what you’d expect for people working in this domain.
[1:48] Daniel: David, do you want to do a quick intro and then kind of dive in?
[1:52] David: Sure. So I’m David. I’m doing ML and DevRel at Arjela. And Arjela is this data collaboration platform for AI engineers and domain experts, for the ones that aren’t familiar with it. And yeah, I kind of started off with NLP and these kind of things by initially my master’s and then working in private intelligence, working on knowledge graphs and also custom domain NLP models. And that’s also, I think, where data quality and model quality is very, very important. So yeah, that’s kind of my background.
[2:25] David: And I will be covering the synthetic data part and hopefully give some pointers on that for you guys.
[2:34] Daniel: Okay, that sounds good. So I’ll jump in then. So basically the kind of structure of this presentation is to start from like the ideal case in which you might already find some existing data and then work to the probably more realistic case that you’re going to have to build some of your own data for fine tuning if you’re not working in a topic that’s already kind of well worked on. And then you can start to build some of So I’ll just talk very quickly about some ways in which you might find existing data.
[3:03] Daniel: So obviously I’m going to pitch Hugging Face as a good place to look for datasets. And I think that the question is a little bit what kind of datasets you need to look for. So the thing with Hugging Face, there’s quite a diversity of datasets that are shared on the hub, but a lot of the ones that will be trending and very visible tend to be. focus on like a very kind of specific type of use case.
[3:29] Daniel: So an example of that here is this fine web data set, which probably quite a lot of you’ve seen do the rounds on social media. So this is a kind of data set focused pre-training large language models. So though it’s like a very interesting data set, it’s probably not something you’re going to find particularly useful unless you’re actually training based models yourself.
[3:54] Daniel: The other thing that I would say is that there are a lot of research datasets that are associated with papers that can be really interesting, but I think there’s also a growing number of community contributed datasets that are made by people doing slightly more kind of bespoke and sometimes weird stuff. But those datasets can actually be really valuable both for using directly, but also for getting a bit of a better sense of how people are approaching building datasets.
[4:25] Daniel: So one of the ways I would suggest if you’re not already familiar with finding datasets on the hub is to use this kind of tags feature. So I think that’s not super well used at the moment, but there’s kind of growing usage of it. And particularly for things like DPO datasets, it can be a really useful way of finding datasets that kind of match a specific format. And I’ll talk a little bit more about DPO in a little while. And there is also this full text search.
[4:55] Daniel: So if people have done a kind of bad job of naming their datasets, this can often be quite useful way of finding datasets because it’s looking in the full dataset carter to find things. And then once you found the dataset, ideally someone would have documented it really well and explained exactly what it’s for and what the limitations are and how it was made, but that often doesn’t happen. But one thing that I think is really nice with the hub is that you have this datasets viewer that gives you a kind of preview of the dataset.
[5:27] Daniel: And I think that can be a really good way of doing this kind of vibe check on whether a dataset is going to be useful for your application or not. So there’s a kind of example of what that might look like here. So there’s this dataset for doing function calling. As you can see, it’s very badly documented. There’s no information needed. I’m not trying to pick out on this person, but it’s just quite common that people leave that documentation until later.
[5:57] Daniel: One of the things that you get in this preview is some kind of metadata about the dataset itself. And in this case you have this conversation which is this kind of chat ML format a lot of people are familiar with, where you have basically a list of dictionaries. And one of the things that might be interesting for you to know is like how long those conversations are and what the distribution of those looks like, depending on what kind of domain you’re working in.
[6:22] Daniel: You might expect the kind of chats to be very short or you might expect to have longer chats. And then having a sense of what this looks like can be quite useful. But the other thing that you can then start to do is to dive into, okay, like what are the actual conversations in there? So you can look at actual example rows. And one of the things I noticed with this data set, even though some of the chat’s messages are quite long.
[6:47] Daniel: Some of the responses or the kind of final turns are just basically the person being like, oh, thank you. And then the model being, oh, you’re very welcome. So probably that kind of response is not actually that useful. So if a lot of the longer messages are like that, then maybe you can’t kind of see this as a good data set for those kind of long conversation. Yeah, fine tuning use cases. So I said that was probably the ideal scenario that you find something that you can either adapt or kind of directly use.
[7:20] Daniel: But I think in practice, often you’re going to have to create your own data set. And I looked a little bit through the Discord and I think people are like tend to be quite creative about this already. But I think one of the things that.
[7:35] Daniel: probably is like a little bit underused still is just to adapt existing datasets for large language model fine tuning so there’s a bunch of datasets from kind of classic NLP that can be restructured or reformatted to work with LLM fine tuning and particularly if you’re already an organisation that’s kind of been using machine learning for a while probably you have some of those datasets already around that you potentially could adapt. The other part is like whether you already have some feedback from users. So there’s a lot of discussion about preferences and preference data sets.
[8:10] Daniel: And you can set about creating those very deliberately, either by using human annotations or LLM judges. But quite often you already have some indications, either very direct, so people have like a thumbs up or a thumbs down. But there might be other ways in which you can kind of get at. this question of like was a user satisfied with a particular interaction and that might be a source of preference data that you can kind of create without actually having to to regather all that preference data manually.
[8:43] Daniel: And then the kind of final one is this synthetic data which I think can go hand in hand with these other bits but I think it’s also very powerful for kind of jumpstarting the process of building these datasets. I won’t kind of labour this point too much, but I think there’s also often quite a lot of work to get your data into a kind of format you can actually use for large language model training. And I think there’s some nice tooling for this, but often it will be geared at a very specific use case.
[9:21] Daniel: But yeah, one thing I wanted to kind of mention about that already is that I think a lot of the work that you do in prepping this data might also end up being very useful for when you get to the stage of deploying a language model in production. So I think you want to have that in mind, the kind of format that you’re gathering this data from should be quite close to what you’re going to actually be using that language model for in practice. So some of this kind of pre-processing.
[9:50] Daniel: code and work will have a lot of overlap with what you’re actually going to be deploying later on. So jumping a little bit more into the kinds of data sets you need for fine tuning. So I think in some of the earlier lessons, this was already discussed a little bit. But I think in terms of the kind of, I guess the question of like what needs to be in your data set, and I know that’s a little bit vague, but I think one of the…
[10:21] Daniel: The kind of considerations which I think is slightly different when you fine-tuning for a more specific use case is being okay with the kind of model losing certain abilities. So often when you’re kind of fine-tuning chat models, which I think is what a lot of the literature ends up being about, you want to make sure that you have a very diverse data set. and that it kind of captures all the things that users might discuss.
[10:46] Daniel: But in practice, quite often the kind of scope of your problem or the application you’re trying to develop is much more narrow. And then I think you don’t have to worry about the diversity of the data in quite the same way. And you have to really think about diversity in terms of what your large language model is actually likely to get as input. And I’ll talk a little bit about these common dataset genres. So unfortunately, it’s a little bit the Wild West in terms of like the actual way in which datasets are structured.
[11:22] Daniel: So there’s a lot of variation in this, but often there’s some kind of similarity. So for the supervised fine tuning. You have this kind of question and answer response, and it probably sounds quite obvious, but I think for some of these fine tuning use cases and also when you’re thinking about reinforcement learning and some of those algorithms for a lot of people, I think they find it easier to understand, okay, what does the data set for training using this particular algorithm look like?
[11:55] Daniel: And then they can kind of map a little bit more at a high level what the process actually looks like. possible ways that they could get to a dataset that matches this kind of format. So very kind of briefly, I won’t talk about this reinforcement learning bit too much. But if you’re kind of following this literature at all, there’s a lot of algorithms being published. Some are kind of iterative improvements on existing ones and some are quite different.
[12:29] Daniel: But what I think seen over the last year or so is that a lot of these algorithms end up using very similar data set structures. So I think that’s one of the kind of nice things in a way that a lot of the existing data sets can either be kind of lightly adapted or use as they are without having to. do a lot of work to reformat things. So one example of this kind of algorithm that became very popular and has been discussed earlier is this direct reference optimization.
[13:02] Daniel: And kind of going back to this idea of some people finding this kind of expression of algorithms not that helpful, I think it can then be really useful to go back to what does the model actually expect when it’s being trained using this particular approach. And this direct preference optimization, I think is really nice because it kind of maps quite well to how you kind of want to nudge a model when you’re doing fine tuning. So you have some input and that could be whatever your task is related to.
[13:39] Daniel: It could be kind of natural language. It could be code. It could be a bunch of different things. And then basically you want to nudge your model more to this chosen and rejected.
[13:49] Daniel: and the reason I kind of mentioned this is I think one of the nice things I’ve seen quite a lot in the past year or so since this algorithm became really popular is like really interesting and creative ways for people to actually come up with where are these chosen and rejected so one way you could do it obviously is to like manually write a good example and then write a bad example but that’s really time consuming and tedious so people have done other things so for example with this chosen If you already have some kind
[14:19] Daniel: of ground truth or gold standard data that a human has generated, often that can serve as a really good value for this chosen. And then, for example, generating this rejected response from a model that you’ve kind of evaluated that does an OK job, but isn’t quite there in terms of what you would like the response to look like. And then the final kind of algorithm that I’ll mention in terms of datasets is this KTO algorithm.
[14:49] Daniel: And the nice thing about this is in contrast to DPO, where you need this kind of two preference pairs of chosen and rejected with KTO. You basically just have a response from the model and then this binary preference, so basically a thumbs up or a thumbs down. And that’s something that can be quite easy to collect. I think in general, people are quite happy to say, I don’t like this or I do like this in an existing system.
[15:18] Daniel: But again, you might also have other kind of creative ways in which you might intuitively be able to understand, okay, well, this person didn’t click this link or didn’t do something else in the system. So probably that was a thumbs down and they did do something that kind of implies a thumbs up. So I think this could be a really useful approach to gathering data if you already have something in the wild that can kind of be interacted with by some kind of end user.
[15:48] Daniel: And then the final kind of one of these algorithms in terms of data sets that I’ll mention is SPIN and ORPO. So SPIN is a kind of iterative approach. So a lot of the kind of approaches to doing this reinforcement learning have been focused on trying to reduce the amount of data you require, because that’s kind of one of the big bottlenecks.
[16:12] Daniel: And the idea with SPIN is that you basically go through these different stages with some starting data and then you kind of synthetically generate new responses and then kind of build a data set on top of that without actually having to have such a large data set initially. And this ORPO algorithm, again, is a little bit about efficiency. So ORPO… algorithm basically expects the data to be in the same format as the DPO. So you have this input and then a response that’s chosen, a response that’s rejected.
[16:49] Daniel: But the difference here is that it doesn’t rely on you having done this self-supervised fine-tuning step. So you can kind of more quickly go directly from a base model to actually doing this alignment without having to kind of train in two steps, which is often what you have to do with these other approaches. So that can be both nice in terms of not having to duplicate data for doing the fine tuning part and then the preference part.
[17:16] Daniel: But also it means that the actual model training is a little bit more lightweight because you’re going to do that in two stages. And I’ll hand over to David, do you want me to hand over the slides to you as well?
[17:30] David: Yeah, please. So I believe you have to stop sharing for a while and then I can. I think you need to disable it in Zoom itself. Yeah, perfect. So just to kind of get back to kind of what Daniel already highlighted is that in the end, often you don’t have your perfect data set. And eventually you actually want to work towards like the perfect data set that you might have. So that’s a way to actually do that is through synthetic data.
[18:13] David: And synthetic data is data that’s generated by LLMs and is commonly used for fine tuning LLMs. So you can actually start your synthetic data. for example, by generating prompts from scratch, by generating completions based on prompts. And normally you basically prompt an LLM along with some initial contexts, for example, in order to also do rephrasing.
[18:36] David: And some of these algorithms that do rephrasing of prompts to make them more higher quality or higher complexity or more nuanced, so to say, in order to ensure a higher quality prompt or completion are the evil complexity and the evil quality prompts. And then there’s also this special kind of AI feedback or synthetic data, and that’s actually judging.
[18:58] David: judging synthetic data and that’s to assign scores to prompts and completions or to do preference ranking or to actually provide a rationale or critique along with your initial score that kind of explains as to why it came up with why the model actually produced the score so we actually would use a prompt template prompting a model okay please provide a score for this prompt and the associated response And can you also provide a rationale along with that?
[19:30] David: And that would be kind of the prompt template, so to say, that people apply for these kinds of synthetic data things. So one of the earlier models that was actually generated, trained on synthetic data was the alpaca-7b model. And what they did was actually prompt the text-to-pinchy model along with the self-instruct research paper. So it was a prompt template that was actually taking some seed instructions, so some initial seed data, and actually was prompted to rewrite that seed data to better instructions or more diverse instructions.
[20:05] David: So they started up with 175 instructions, then eventually ended up with 52,000 instructions and actually did supervised fine tuning along with the LAMA 7B model that Meta initially released. So that’s, I would say, amazing. You can actually use LLM data to train LLMs and then everything is solved. But as you might expect, that’s not really the case. So if you actually look into their report about synthetic data and the self-instruct and how the model actually performed, you can see that there’s a lot of hallucinations, toxicity and stereotypes within the model.
[20:41] David: It might be due to the fact that they actually used the Meta model, which wasn’t trained optimally. It might be due to the fact that they actually used the DexDaVinci model from OpenAI. which might contain some bias, but in the end, it would have probably been better to actually look at the data and see what’s in there. So one of the examples from a completionist is like, what’s the optimal seed? Why is 42 the optimal seed?
[21:08] David: And the model actually starts hallucinating that 42 is the optimal seed and that it can be used for any neural network training. So another example would be maybe using more complex pipelines, more better models, better data. And that’s actually what they tried to do for the ILTA feedback paper, where they initially sourced instructions. So a lot of both synthetic and human generated instructions from an instruction pool and actually asked, prompted different models to actually provide completions for each one of these instructions. And each one of these completions was actually judged by GPT-4.
[21:47] David: based on four different criteria. So instruction following, truthfulness, honesty, and helpfulness. And also one overall criteria, just asking or prompting the model whether overall the completion was correct according to the initial input prompt. An issue, a good thing to take away is that whenever you do this judging, what we’ve seen is that each of these individual criteria when you’re focusing on instruction following truthfulness, et cetera, highly correlates on average, so to say, with one overall rating.
[22:21] David: So if you actually want to reduce some costs and reduce some compute, you can also just apply one overall rating, also given the fact that you probably need some human involvement at the end. So, also this didn’t turn out to work.
[22:38] David: So when we actually started uploading this data in our UI and actually looking at the data by human judgment, we actually saw that some data was sourced from benchmarks, that there was like this weird thing happening where there were some forms that completely or some responses that completely didn’t make sense and still ended up getting a very high score. And we kind of investigated that and investigated that and looked into the code and it was apparently a coding error leading the, for the scores to be converted from a one to a 10.
[23:13] David: So all of these scores that were actually a 10 could have been, been a one. So that kind of messed up the entire dataset. There was like incomplete ratings due to the fact that open AI API calls filled. And there were some ties in the data that were still being processed as. chosen and rejected pairs, even though if the responses would get the same rating, you wouldn’t expect them to have one preference over another because they would be in a tie.
[23:43] David: And that’s actually what we kind of also showed that whenever you spend some time, look at your data, get involved and work towards your higher quality data set, that you can actually train a better model. And initially the Hugging Face team published the alignment handbook saying that this awesome Zephyr model and made the entire research reproducible. And then we did that exact reproduction, but then instead we used the clean Ultra feedback set. And that actually led to a model that performed better on the benchmarks and also in intuitively putting to human judgment performed better.
[24:21] David: So then apparently you need better data, better synthetic data, but whenever you actually want to work towards that, there’s a lot of things that come to mind. So initially you can kind of start spamming ChatGPT with random requests. But if you actually want to scale that or make it more cost efficient, want to avoid vendor lock-in licenses, you might not be able to be allowed to use all of the different models for actually generating synthetic data to find your models.
[24:50] David: some fault tolerance that might arise as you have seen in the ultra feedback data set and things like like structural generation where you might want to just output JSON output responses in a JSON format. So overall, there’s just a lot of complexity involved and actually starting to do this. And yeah, I’ve chosen a set of tools that you might kind of look into whenever you start working with your own synthetic data. And one of them is the outlines package for structured text generation. And it’s going to actually produce structured text generation based on…
[25:25] David: like prompts for LLMs based on Regex, JSON, Pydentic models that everyone loves and uses, and also actually functions to be able to do a function calling. And this actually relies on the modified, by modifying the sampling of tokens. And yeah, this is a very accurate approach to actually take. And it actually also reduces the inference time due to the fact that there’s a limited… number of tokens that you can actually sample from. And one of the interesting blog posts from Hacking Value actually dives a bit deeper into this topic.
[26:02] David: Another package is DSPy that probably everyone is familiar with, but it’s focused on programming, not prompting, so to say. And this actually lets you define a DSPy signature that you use for actually calling an LLM, where you kind of program the expectations that you might have for your model. And then whenever you… kind of optimize this for your specific use case. The package underneath tries to optimize the prompt and optimize or fine-tune the model weight slightly, so that then eventually you end up with a higher or more accurate prediction.
[26:42] David: And one thing that’s Hamel actually showed in one of his blog posts is that, yeah, under the hood, the model or the DSPy package actually makes hundreds of calls to the OpenAI API when it tries to optimize these prompts due to the fact that it’s gathering good few shot examples for your use case. And that actually is quite costly. And that’s an interesting blog post where Hamel kind of goes over some of these tools as well. So I recommend reading that too.
[27:13] David: And the last one is a DC label and it’s a synthetic framework for synthetic data generation and AI feedback. And it relies on a direct basically graph structure. And it’s more of a pipelining framework where we kind of worked on serializable pipelines, uh, cast intermediate results and actually included structure generation as well, based on the outline speculates and everything is execute executable in parallel. So whenever you. really want to scale this up, then you can actually do that.
[27:47] David: And it’s kind of like application development, like the things like Lama Index or Deepset or some of these tools, but then really focused on the dataset engineer. And one of the interesting things that I’ve seen come by recently was also this blog post about the rise of the dataset engineer. So not really having AI engineers or these kind of… people anymore, but really people that focus on data and data quality. And it’s an interesting read as well.
[28:17] David: So when you’re working on improving your data, it’s like an iterative process where you have your original dataset and you add some things to assess diversity, the dataset quantity, dataset quality. You do the duplication, filtering, and these kind of things. And you throw some human feedback, but also AI feedback in the mix to make it more intuitive and more smart. So what you can actually ensure or what you can think of for improving your dataset quality is that there’s enough quality, but enough quantity, but more isn’t always better.
[28:56] David: So it’s also fine to throw away data that’s maybe very present within your dataset that you might be able to deduplicate or also reduce, simply reduce to lesser quality examples within your dataset. So one example, what Some examples that we’ve gathered from actual public reproductions of these initial fine-tuning techniques that Daniel highlighted earlier is that for SFT, we’ve seen great results according to benchmarks with 10K input samples for ORPO, 7K, DPO, 3K, and spin as low as 2K data examples. So apparently with…
[29:36] David: lot less data and a lot of models are being fine-tuned, you can actually do a pretty good job at achieving high benchmark standards, but probably also to actually have some good human intuitive models. And whenever you do think about data duplication, there’s a lot of research being done in duplication, and often the Duplication pipelines aren’t very reproducible and not very well documented. But as we’ve seen with the FIMERAP dataset that’s recently been published by Huggingface, they actually published the entire pipeline and made it publicly available to the Data12 library.
[30:22] David: And what you can think about when actually doing deduplication is applying intuitive rules like applying existing metadata that you might use for filtering your data, for filtering out some irrelevant examples, for, I don’t know, doing something like topic-wise deduplication, where you might want the most representative examples per topic. You can also think about creating your own custom metadata or your own custom features.
[30:48] David: and doing like intuitive, simpler things like hashing of the strings or doing like filtering based on embedding similarity and then actually grabbing the exemplars from like the embedded data points that are most representative for your data. When you’re doing rule-based data, it can be, yes, rule-based cleaning of your data can be as simple as just writing a Regex query.
[31:16] David: So, for example, querying as a large language model, if you don’t want to have that in your data set, normally some models used to respond like this or using the word Delve or using like these quotation marks that might indicate that the model is going to kind of hallucinate random references, some random URLs that you might want. to at least change within your dataset. And all of these things are like very intuitive, human understandable rules that you can apply to in order to improve your dataset quality.
[31:54] David: And on top of that, what you can also do is apply these more advanced techniques or more advanced specifications or embeddings or these kind of things where you quite easily can get a zero-shot model out of Hugging Phase that’s performing quite well and decide some topics that you deem relevant for your dataset and actually start predicting, making initial predictions for these topics and actually use that as some initial ways to filter your data as well. And these classifiers can also be very, very cheap. So you can actually use one of the zero-shot models.
[32:32] David: You can use a set-fit model after annotating only a couple of examples. And if you actually want to go more expensive, you can look at LLMs as judges, juries, or rationalizers where you also have this rationale, like the explanation as to why they provide this score. And if you want to go a bit more simple and a bit more intuitive and explainable, then Something like text descriptives, like the package where I provide the URL, can provide a very easy out-of-the-box analysis of your data. And all of these methods that I went over.
[33:10] David: are actually intended to be used according to our vision, our view, along with human annotation. And this really helps normally to make human annotation more fun, more engaging and also more effective. So that’s kind of where you want to be. So maybe somewhere in the middle where you can either choose for a very basic, simple, intuitive approach to using Google sheets or like a completely customized approach that really works for you.
[33:42] David: But for, I would say for, for a lot of the people, something in between where you might use like some, some already out of the box tools or web apps really works and really differs per person, what you, what you want and what you prefer. So one of the custom tools that I’ve seen come by to kind of play around with your data and kind of see what’s in your data is bulk annotation from Vincent Warmerdam.
[34:11] David: And it’s like a custom tool that he built with Bokey and Pandas and some JavaScript to tie everything together where you embed all of your text and you can kind of explore and kind of play around with it. So this is what it looks like. You can do this lasso movement, see what’s within that specific cluster, and then move from there to annotating and actually saving your initial results.
[34:38] Participant 3: By the way, I have a question about this. I saw this too, and I thought it was pretty cool. Does the Argila have something like this, a specific functionality?
[34:47] David: Yeah, we actually have something like this where you might be able to attach vectors to your records and you can do semantic search, but we don’t have this interactive kind of annotation thing. It’s a thing that we’ve considered. I believe Snorkel has something similar to it, but we’ve not gotten around to actually implementing this. An alternative that you might consider is like using something like notebooks where you use notebooks to initially annotate your data.
[35:15] David: And I think everything in terms of notebook annotation was a request from from Hamel to kind of highlight this is based on IP annotations. And then from that, there’s a lot of branches kind of looking roughly the same and giving the same overview. But this is also fully customizable where after. you’ve filled out one annotation, you can actually do some callbacks and actually do some post-processing.
[35:40] David: So you might be able to do some active learning and actually start gathering data for fine-tuning set with classifier on the fly, so to say, and then run inference with that classification model. Another thing that’s kind of like a little bit more out of the box is like creating web apps with radio, with Streamlit, with Shiny, with R Shiny. And you can actually use like all of their components and tools to directly out of a box have a very intuitive basic UI. And this can also be fully customized normally.
[36:16] David: So there’s one UI of the Somers NLP hackathon that someone created that we normally host together with Hugging Face. And this is the input text where after that someone is able to kind of correct or say, okay, this is the right translation or wrong translation can also be used for like the KTO example that Daniel mentioned earlier, where you might give a thumbs up, thumbs down and kind of go through your records one by one.
[36:40] David: And this is once again, like a nice, nice example, but with Gradios, Streamlit and Shiny, you can really customize everything as far as you might want to do that. And if you actually want something a bit more out of the box, there’s also these fully fledged cool solutions with a lot of integrated features. One of them is Lilac. And Lilac kind of has this dataset overview thing where whenever you select the dataset, you can actually select different topics and you can actually see some of the main clusters within your data. And it’s pretty, pretty cool.
[37:15] David: And on top of that, they also have this semantic search feature that I previously. mentioned.
[37:23] Participant 3: What’s about Lilac, by the way? Do you recommend using Argila with Lilac for the different features, using them concurrently? What’s the stack that you like, personally?
[37:32] David: For me, it’s kind of a bit of a mix of both. I think it really depends on what you prefer, what you like. I find the Lilac UI a bit less intuitive, personally. For example, for Within Argylla, there’s less, I think, features included and the UX, UI is a bit more intuitive for me. Things like spacey exposure are played around with, but it really depends on what you want.
[38:03] David: Probably if you, as an AI engineer, look at the API that we have or the SDK that we have compared to the one with Lilac, if you as a domain expert or a labeler.
[38:13] David: kind of go into into lilac and find their features and ui ux uh better then yeah then go for that but i think it’s the the same tool inherently covering the same topic having a lot of the the same or similar features and
[38:29] Participant 3: for me it really depends on uh on that kind of thing well you daniel do you have a do you have an opinion or do you have a stack that you like
[38:38] Daniel: Yeah, I mean, I think one thing is like at what point in your kind of process you’re trying to look at your data. So I think there’s this like initial kind of vibes check where you’re trying to look at quite a lot of data very quickly and maybe looking at a few examples carefully, but not necessarily like actually fixing things yet. So I think it’s good to kind of separate the kind of annotation mindset from the like, I’m trying to just understand this data overall.
[39:05] Daniel: And I think for that, Lilac can be quite nice because you have these different kind of visualizations. And I think. The other thing that’s super crude, but if you can get your data into some kind of data frame, even in Google Colab and compute some features like string length and token counts and very basic crude features, it gives you these plot suggestions, which are actually sometimes okay. And it’s just a really lazy way of being like, hey, there’s this one really random outlier.
[39:34] Daniel: So it’s a little bit crude, but I think it can be quite a quick way of finding where something’s going wrong. And then I think after that, kind of trying to dive in a little bit more detail and actually poke around the individual responses.
[39:48] David: Yeah. And I think with the nice thing from Lilac already more full-fledged tools is that you don’t need to think about what you want, so to say. So of course they have a lot of these highly opinionated features building and based on that you… I think a lot of people will get away with having like some nice out of the box toolkits. There’s of course always people that are going to want to have some custom thing built on top of some custom features.
[40:18] David: And then I think it’s if you really value that and it’s also worth spending time on that and actually building some of these custom tools or some of these custom annotation flows out of the box. I think both Lilac and Arjela are open source as well. Yeah. So. This was the other example that I had, Arjela, same kind of idea, having a dataset overview. You can log in and play with a demo as with Lilac. And also you have this like semantic search kind of thing.
[40:51] David: You have your content that you might load into your UI, and you can actually start annotating and doing bulk annotation and all of these kind of things. So it’s same, same, but different. I have a preference for Arjela, of course, and I’m more familiar with that. But I guess for everyone, it’s up to them to decide which UI and API and workflows they prefer. Yeah. So maybe I’ll hand it over to Daniel again. Yeah. Thanks, Daniel. Thanks, Daniel. Thanks, Daniel. Thanks, Daniel.
[41:36] Daniel: Okay can you see that? Yep. So I’m just going to talk through some example datasets and this might seem a little bit weird, like why I’m talking about these very particular datasets, but I actually think you can sometimes learn very little, but sometimes learn a lot from looking at some examples of datasets people in the community have built. Particularly for approaches to generating these datasets in kind of creative ways that doesn’t just rely on either throwing loads of API credits at the problem or loads of human effort.
[42:11] Daniel: So I think that also goes back to this thing I was saying earlier about DPO datasets. Why they became so kind of popular in the community is because this idea of generating a chosen and rejected pair can be done in like a lot of creative ways. So there’s an example here of this dataset that’s trying to make models, large language models, better at writing.
[42:36] Daniel: And the approach they take is basically to kind of take some existing books written by humans and then prompt a large language model to summarize those and then prompt another model to rewrite based on that summary. So, you know, right. novel or chapter of a novel based on this kind of summary. And then they choose the kind of the original human generated response as the chosen, the model generated response as the rejected.
[43:07] Daniel: So that might not apply to your use case, but I think it’s an example of like how you can get at these things without having to just rely on either human data or model data. Here’s another example. So this one is actually kind of a vision model, but it has a kind of similar interest in that it’s focusing on kind of actually generating this data flywheel. So trying to get this thumbs up, thumbs down. So, I mean, it really depends.
[43:43] Daniel: what kind of context you’re working in but I think getting these thumbs up thumbs down ratings can be quite useful and the other thing that I’ve found from doing a lot of annotation so I think it’s worth thinking a little bit about the kind of ergonomics so not just of the tool but also what is the task that you’re actually trying to do when you’re annotating and how does that work better so you Third, these kind of preference data sets, depending on what you’re trying to do, sometimes it can be really easy to say that’s
[44:15] Daniel: a good response or that’s a bad response. Other times it’s really helpful to have these two generations next to each other and say, okay, this one is better than this one, because without having a comparison, it’s quite difficult to actually say which model response is better. So I think when you’re doing this human annotation, I would give like quite a lot of thought to those kind of questions, which I think people kind of don’t really think about that much. But I think you can actually.
[44:44] Daniel: Yeah, I think it can be quite important for both your experience of doing the annotation and how pleasant and easy you find it. But I also have no evidence for this, but I suspect it actually influences the results of the annotations you get quite a bit, depending on how you set up the task. So I think it’s worth giving a bit of thought. And I think what maybe we’ll do is just give like a very high level overview of this kind of project I’m actually working on for this course. So that is.
[45:14] Daniel: basically trying to build a large language model summarizer and hopefully a small large language model summarizer so it can actually be deployed easily. Then we’ll take a dataset card for a dataset hosted on the hub and turn this kind of long markdown into a too long, didn’t read summary that you can kind of very quickly look at and be like, what is this dataset about? What is it for? This is just like a very kind of high level overview of the steps that are kind of taken to kind of create that data set.
[45:47] Daniel: So you have, as I kind of mentioned at the start, a bunch of processing that you do with the markdown, and that will also be processing that you’ll carry over to the actual deployment. So in the markdown, you have a lot of stuff that you probably don’t even pass to a large language model because it’s just like formatting or content. It’s not very informative. You might want to move some of this YAML and then you kind of get this summary.
[46:13] Daniel: And then in this particular case, the approach I’m trying to take is to build this preference data set. And there’s different ways of approaching this, but this is what I’m kind of using this distilled label pipeline for. And this kind of looks a little bit crazy, but I guess the overall workflow, which is actually not that difficult to define, is that you load this data set that has a bunch of these data set cards and you do this kind of initial filtering. You format this prompt.
[46:46] Daniel: And then in this case, I’m trying to use open models. So kind of compare these three different summaries and then use ultra feedback, which is a kind of approach to doing this language model judging. And in this case, I’m using Lama3 for that. And I guess the kind of thing I want to say about this pipeline, I think with synthetic data generation, it really depends on what you’re trying to do.
[47:12] Daniel: Sometimes just like having a prompt and a bunch of inputs and being like, right, just call that API a bunch of time with this prompt will work well. And sometimes it’s kind of more complicated pipeline can work better. But the thing I found quite nice about this is that.
[47:28] Daniel: at each of the kind of steps you kind of maybe increase the number of responses you’re generating so maybe initially when you’re comparing these different models you kind of do like a hundred or something and just kind of very quickly start looking through the responses and I guess at that point what you’re thinking about which is kind of similar to this point that you’re often doing when you’re kind of thinking about using prompting versus fine-tuning as being questioning about okay can I just improve this via the prompt so you might already done a little bit
[47:59] Daniel: playing around but when you kind of get a few more responses you think is there something that I’m getting in the response that can just be fixing this prompt step and you kind of iterate on this pipeline but each of these steps might actually remain uh each like the kind of workflow might remain quite static but you’re kind of fiddling around with each of these steps you And the other thing that I will say in this kind of particular pipeline, I’m sending all of these responses to Arjela.
[48:26] Daniel: So you’re kind of looking at two things when you’re doing that. So you’re looking at the model responses. and seeing how they look. But the other thing that I’m trying to look at when I’m kind of building this pipeline is this LM judge part, because I’m basically assuming with this workflow that this judge is actually going to be able to detect which summaries I’m going to like better. So one of the things that I kind of do, and there’s a kind of notebook in a GitHub that I’ll share afterwards, but…
[48:58] Daniel: basically rate a bunch of examples and then look at the ratings that the model gives and whether there’s some like average better rating but also like how often do I change the rating from the model and like how big is that difference maybe if it’s like only one score different then that’s not so bad but if it’s consistently getting bad ratings then that’s maybe something to worry about in terms of whether it’s reflecting what I want the LLM to do.
[49:29] Daniel: And the other thing I would say about this part, and I’m probably a little bit overly obsessed with this idea, but I think when you’re looking at these ratings and the responses, like thinking about heuristics that you can use to kind of capture, is this response good or bad? Like maybe it’s too complicated to judge via a simple rule, but quite often there might be patterns that you start to see.
[49:54] Daniel: So once you have like even 100 ratings or maybe even like 20, you can start being like, okay, do I always prefer the one that’s shorter or the one that’s longer or is it like random? Because if it’s just always a shorter one, then maybe you can already say, I just need to like cap the kind of maximum tokens or do something a little bit more manual. Or I can just basically say all the short responses are always better.
[50:17] Daniel: and then skip this kind of LM judge part because this is the kind of expensive part but it’s also like slightly brittle because you in this case like put quite a lot of faith as you kind of scale this in this consistently working well and if you can use some of these simple rules either instead of or as well as the the kind of judge to check okay is this continuing to work well I think that could be quite useful.
[50:47] Daniel: So because we’re running out of time, I thought I’d just quickly talk through this repo that we created. So there’s some notebooks in here that kind of give some examples of like different approaches to deduplication, how to kind of do these kind of database checks. And then an example of using Distil label to generate one of these synthetic pipelines. And the other thing which I have to admit is a work in progress is trying to basically put all the work I’m doing for this particular project in here.
[51:18] Daniel: I’m not saying that’s necessarily a good example of what to do, but I will at least try and share it openly and kind of communicate what I’m trying to do there. And then, yeah, I’ll maybe just quickly point to these other resources that we have here. So one thing I just wanted to mention is that with the course credits that you have from Hugging Face, you can use that for spaces, but you can also use it for inference endpoints.
[51:47] Daniel: So I think between all the course credits, you could actually generate quite a lot of synthetic data. And I think it’s quite interesting to see kind of what people can come up with that. And then beyond the kind of notebooks for this session, there’s also a few other resources here that might be interesting, including this blog post about FineWeb, which talks a little bit about their approach to deduplication.
[52:12] Daniel: And it is kind of for a slightly different use case, but I think there’s some kind of nice lessons in there that might still be kind of relevant. you know, doing like a really aggressive deduplication doesn’t always work better. And there’s these kind of subtle heuristics, which unfortunately, like often you have to actually train a model to see what the impact is. But if someone else has done that, that’s quite useful to see if there’s something you could pick up there. So yeah, I think maybe go ahead.
[52:40] Participant 3: Sorry.
[52:41] Daniel: No, sorry. I was just rambling to a close.
[52:44] Participant 3: No worries. We have about two minutes left. There isn’t too much time for questions, but I’ll just throw one out there. Aminah asked, how would you recommend generating synthetic data if we’re fine tuning on proprietary data sets and human annotation is expensive?
[53:09] David: Proprietary data sets that someone else creates, like a vendor of some sort?
[53:18] Participant 3: Yeah, I mean, I don’t really know what he means, but maybe it’s, yeah, like fine-tuning on, I don’t even know if the word proprietary really is. I mean, just like your own company state. Just like, let’s go that way.
[53:31] David: I guess it’s kind of the same as whenever you would fine-tune your own model versus using a proprietary model. It’s about data ownership, privacy, about the fact of you. value these kind of things in your company, if you need these things in your company, and if you really want to be the owner of your data and your model. And those are, I think, the main takeaways for this kind of trade off, both for models, but also for your data.
[54:08] Participant 3: Sounds good. We can probably take the rest of the questions to the Discord. Maybe y’all, if you want, David and Daniel, if you have time, you might want to peruse it. There’s a channel dedicated specifically to this talk where people are in. But yeah, thanks a lot.
[54:28] Daniel: This was a very good talk. And just to say also, I’m overly excited about people building data sets. So I’m happy to try and help out. I might not have the answer, but I’m happy to brainstorm with people if they’re interested in building a data set as part of this and sharing it, then definitely try and give them a hand with that.
[54:47] David: Yeah. Great. All right.
[54:50] Daniel: Thanks a lot. Thanks.
[54:53] Participant 3: All right. Thanks for having us.
[54:54] Daniel: Bye. Bye. Bye.