Instrumenting & Evaluating LLMs

evals
llm-conf-2024
Published

July 26, 2024

Abstract

A discussion on how to instrument and evaluate LLMs with industry guest speakers


Chapters

00:00 Overview
Hamel outlines the plan for the talk.

02:05 Evaluations: The Core of the Development Cycle
Frequent evaluations and rapid updates are central to applied AI. Evaluations can range from automated tests to more manual human reviews.

06:07 Walkthrough of a Unit Test
Dan demonstrates a unit test in Python designed to test a simple LLM pipeline.

08:55 Unit Tests for LLMs
Hamel explains the necessity of unit tests and their role in automating the validation of outputs.

11:11 Writing Unit Tests for LLMs
To create effective unit tests, enumerate all features the AI should cover, define scenarios for each feature, and generate test data. Synthetic data can be created using LLMs to test various scenarios. Unit test outputs can be logged for visualization.

18:56 LLM as a Judge
To trust an LLM as a judge, iterate on its outputs and measure their agreement with a trusted human standard (e.g., using spreadsheets). Gradually align the LLM with human critiques to build confidence in its judgments.

21:18 Issues with Using LLMs as Judges
Dan discusses potential issues with relying on LLMs as judges, primarily due to inconsistent results (for example, judging A better than B and B better than A when the order is swapped).

23:00 Human Evaluations
Ongoing human review, data examination, and regular updates are necessary to maintain accuracy and prevent overfitting.

24:44 Rapid Evaluations Lead to Faster Iterations
Using evaluation strategies effectively can help quickly identify and fix issues or failure cases.

26:30 Issues with Human Evaluations
Human evaluations can be subjective, potentially leading to varying scores for the same output at different times. A/B testing can help mitigate these issues to some extent.

31:20 Analyzing Traces
A trace is a sequence of events, such as a multi-turn conversation or a RAG interaction. Analyzing traces should be as frictionless as possible, since they enable a better understanding of your data.

35:30 Logging Traces
Several tools, such as Langsmith, can log and view traces. It’s recommended to use off-the-shelf tools to speed up data analysis.

39:15 Langsmith Walkthrough
Harrison demonstrates Langsmith, a tool for logging and testing LLM applications. Langsmith also supports visualization of traces and offers features like experiment filtering.

43:12 Datasets and Testing on Langsmith
Langsmith allows various methods to import, filter, and group datasets. Experiments can be set up to assess model performance across these datasets.

51:35 Common Mistakes in Evaluating LLMs
Bryan provides a brief overview of common pitfalls in LLM evaluation and how to avoid them.

1:03:33 Building Evaluations
Bryan discusses evaluating RAG systems by measuring hit rates, experimenting with models, ensuring consistent agent outputs, and directly connecting the evaluation framework to the production environment to minimize drift.

1:12:40 Code Walkthrough: Evaluating Summaries for Hallucinations
Eugene covers natural language inference (NLI) tasks and fine-tunes models to classify summaries as entailment, neutral, or contradiction.

1:33:03 Evaluating Agents
Eugene details a step-by-step approach to evaluating agents, including breaking down tasks into classification and quality assessment metrics.

1:35:49 Evals, Rules, Guardrails, and Vibe Checks
Effective AI evaluation requires a blend of general and task-specific metrics, along with tailored guardrails and validation to ensure accurate outputs.

1:44:24 Auto-Generated Assertions
Shreya introduces SPADE, a method for generating and refining assertion criteria for AI pipelines by analyzing prompt edits and failures.

1:50:41 Interfaces for Evaluation Assistants
Shreya discusses the development of more efficient UIs for evaluating and iterating on AI-generated outputs, emphasizing dynamic and human-in-the-loop interfaces to enhance evaluation criteria and processes.

2:04:45 Q&A Session

2:05:58 Streamlining Unit Tests with Prompt History
Generating unit test criteria and improving assertion coverage using prompt history with LLMs.

2:09:52 Challenges in Unit Testing LLMs for Diverse Tasks
Creating unit tests for LLMs can be challenging due to task-specific complexities and varied failure modes, with some tasks being less suitable for unit testing.

2:12:20 When to Build Evaluations
The primary goal should be to develop a functional system first. Once that’s achieved, evaluations should be developed to enhance the system.

2:15:35 Fine-Tuning LLMs as Judges
Fine-tuning LLMs for use as judges is generally not recommended.

2:17:00 Building Data Flywheels
Using LLMs for synthetic data generation.

2:17:59 Temperature Settings for LLM Calls
Setting the temperature to 0 is generally preferred for obtaining deterministic outputs.

2:22:09 Metrics for Evaluating Retrieval Performance in RAG
Evaluate RAG retrievers by checking recall, ranking relevance, and their ability to return zero results when no relevant data is available.

2:26:13 Filtering Documents for Accuracy
Discussion on strategies to ensure that the retrieved documents are factually correct.

2:28:14 Unit Tests during CI/CD
Discussion on using unit tests in LLM CI/CD pipelines.

2:30:34 Checking for Contamination of Base Models with Evaluation Data

Slides

Download PDF file.

Resources

Links to resources mentioned in the talk:

Notes

Evaluation strategies

1. Unit Tests: LLMs can be evaluated automatically with unit tests that validate a language model pipeline’s responses to specific queries.

The following unit tests check that the pipeline identifies Sundar Pichai as the CEO of Google and answers “What is 2+3?” with a response containing “5”. Such tests verify the model’s accuracy and reliability on these tasks.

from transformers import pipeline, Pipeline
import pytest

# Build the generation pipeline once per test module; loading the model is expensive.
@pytest.fixture(scope="module")
def llm_pipeline():
    return pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device=0)

# Helper: run the pipeline greedily and assert that the expected string appears in the output.
def verify_answer_contains(p: Pipeline, query: str, expected: str):
    result = p(
        query, do_sample=False, truncation=True, return_full_text=False
    )[0]["generated_text"]
    assert expected in result, f"The result does not contain '{expected}'"

def test_google_ceo(llm_pipeline):
    verify_answer_contains(llm_pipeline, "Who is the CEO of Google?", "Sundar Pichai")

def test_2_plus_3(llm_pipeline):
    verify_answer_contains(llm_pipeline, "What is 2+3?", "5")

2. LLM as a Judge: An LLM can be used to grade the outputs of another model.

When using an LLM as a judge, keep the following in mind (a minimal alignment-check sketch follows the list):

  • Use the most powerful model you can afford.
  • Model-based evaluation is a meta-problem within your larger problem. You must maintain a mini-evaluation system to track its quality.
  • After bringing the model-based evaluator in line with the human evaluator, you must continue doing periodic exercises to monitor the agreement between the model and the human evaluator.
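
A minimal sketch of that alignment check, assuming the critique spreadsheet has been exported to a hypothetical judge_alignment.csv with human_label and judge_label columns (both "good"/"bad"); the file and column names are illustrative, not from the talk:

import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Hypothetical export of the critique spreadsheet: one row per example,
# with the human's verdict and the LLM judge's verdict ("good" or "bad").
df = pd.read_csv("judge_alignment.csv")

agreement = (df["human_label"] == df["judge_label"]).mean()
kappa = cohen_kappa_score(df["human_label"], df["judge_label"])
print(f"Raw agreement: {agreement:.1%}, Cohen's kappa: {kappa:.2f}")

# The disagreements are the rows worth re-reading before editing the
# judge prompt for the next alignment iteration.
print(df[df["human_label"] != df["judge_label"]].head())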

3. Human Evaluations: Human evaluation happens at multiple levels, with a constant emphasis on examining the data. Although language models are used heavily as judges, regular human evaluation is still necessary to ensure the judges perform well and do not overfit to past human judgments.

While human evaluation is more robust, it is prone to subjectivity, which can result in different scores for the same output at different times (for example, raters’ standards drift upward as they see better models). A/B testing can mitigate this to some extent; a sketch follows.
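
A minimal sketch of such an A/B setup, assuming human raters score outputs on a 0–1 scale; pipeline_a and pipeline_b are placeholders for your real candidate pipelines:

import random
from collections import defaultdict
from statistics import mean

# Placeholder candidates; in practice these call your real pipelines.
def pipeline_a(text: str) -> str:
    return "output from pipeline A"

def pipeline_b(text: str) -> str:
    return "output from pipeline B"

ARMS = {"A": pipeline_a, "B": pipeline_b}

def assign_and_generate(text: str) -> tuple[str, str]:
    # Randomize the arm per item so rater drift over time affects both arms equally.
    arm = random.choice(list(ARMS))
    return arm, ARMS[arm](text)

# Later, aggregate the human ratings per arm over the same time window.
ratings = defaultdict(list)  # arm -> list of 0-1 human scores
for arm, score in [("A", 0.6), ("B", 0.8), ("A", 0.5), ("B", 0.7)]:  # placeholder ratings
    ratings[arm].append(score)

for arm, scores in ratings.items():
    print(f"Arm {arm}: mean rating {mean(scores):.2f} over {len(scores)} items")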

Using tools to evaluate LLMs and analyze data

Tools like Langsmith are powerful for interpreting data and gauging model behavior. Langsmith allows you to visualize traces, view retrieved documents, explore datasets, and more.

The following code snippet demonstrates how Langsmith can be used along with LangChain to evaluate model performance on different datasets. More evaluators can be added to incorporate multiple eval metrics (a custom-evaluator sketch follows the snippet). The Langsmith web interface can then be used to explore and compare runs.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Target task definition
prompt = ChatPromptTemplate.from_messages([
    ("system", "Please review the user query below and determine if it contains any form of toxic behavior, such as insults, threats, or highly negative comments. Respond with 'Toxic' if it does, and 'Not toxic' if it doesn't."),
    ("user", "{text}")
])

chat_model = ChatOpenAI()
output_parser = StrOutputParser()

chain = prompt | chat_model | output_parser

# The name or UUID of the LangSmith dataset to evaluate on.
# Alternatively, you can pass an iterator of examples
data = "Toxic Queries"

# A string to prefix the experiment name with.
# If not provided, a random string will be generated.
experiment_prefix = "Toxic Queries"

# List of evaluators to score the outputs of target task
evaluators = [
    LangChainStringEvaluator("cot_qa")  # chain-of-thought QA correctness evaluator
]

# Evaluate the target task
results = evaluate(
    chain.invoke, data=data, evaluators=evaluators, experiment_prefix=experiment_prefix
)
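
To add more evaluators, a custom evaluator can be any callable that receives the run and the dataset example and returns a score. This is a sketch; the "label" key is an assumption about how the reference outputs are stored in your dataset:

from langsmith.schemas import Example, Run

def exact_match(run: Run, example: Example) -> dict:
    # "label" is a hypothetical field name; use whatever key your dataset's
    # reference outputs are stored under.
    reference = example.outputs["label"]
    prediction = run.outputs["output"]
    return {"key": "exact_match", "score": int(prediction.strip().lower() == reference.strip().lower())}

# Pass it alongside the off-the-shelf evaluator:
evaluators = [
    LangChainStringEvaluator("cot_qa"),
    exact_match,
]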

Mistakes to Avoid When Evaluating LLMs

  1. Neglecting Established Evaluation Methods

    Utilize existing evaluation methods before creating new criteria to assess model performance effectively.

  2. Disregarding Use Case Experts

    Incorporate feedback from use case experts to develop more relevant and effective evaluation metrics.

  3. Delaying Evaluations

    Conduct evaluations early in the development process to identify and address issues promptly.

  4. Confusing Product Metrics with Evaluation Metrics

    Product metrics measure user interaction, while evaluation metrics assess model performance. Align evaluation metrics with customer needs by using insights from product metrics to inform metric development and the construction of test datasets and environments.

  5. Prioritizing Evaluation Frameworks Over Metrics Development

    Focus initially on developing relevant evaluation metrics through data and user story analysis, rather than on integrating evaluation frameworks.

  6. Relying on LLMs as Evaluators Early On

    Use tangible metrics before considering LLM-based evaluation. If LLMs are used, involve multiple judges and monitor their alignment with human perspectives to ensure accuracy and achieve desirable results.

Considerations for Designing Evaluations

  • Retrieval Evaluation in RAG Systems

    Establish a baseline for retrieval using basic search strategies on well-labeled query-document data. Once the baseline is set, incorporate advanced chunking and indexing techniques and assess their impact (see the recall sketch after this list).

  • Planning-Evals for Agent Systems

    Implement evaluations to assess different stages of the agent pipeline, including decision-making choices and intermediary prompts generated by the agents.

  • Agent-Specific Evaluations

    Include task-specific agent evals. Ensure agents produce structured outputs for subsequent evaluation, and verify that they consistently generate responses in the required format (see the structured-output check after this list).

  • Final Summary Evaluation

    Design evaluations to assess the final summary output of an agent chain, ensuring the summary is relevant and accurately reflects the intended information. Provide appropriate context for the evaluation to avoid including irrelevant data.

  • Impact Measurement Post-Changes and Bug Fixes

    Log and evaluate every workflow change, bug fix, or update as an experiment to measure and compare its impact.

  • Integration of Evaluation Framework in Production

    Address the disparity between model and production environments by integrating the evaluation framework into the production workflow. This includes adding dedicated evaluation endpoints and structuring the workflow to facilitate easy integration of evaluations.
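
As mentioned in the retrieval bullet above, here is a minimal recall@k sketch for establishing a retrieval baseline, assuming you already have labeled (query, relevant document IDs) pairs and a retriever callable; both are placeholders:

from typing import Callable

def recall_at_k(
    retrieve: Callable[[str, int], list[str]],   # placeholder retriever: (query, k) -> ranked doc IDs
    labeled_queries: dict[str, set[str]],        # query -> set of relevant doc IDs
    k: int = 10,
) -> float:
    hits = 0
    for query, relevant in labeled_queries.items():
        retrieved = set(retrieve(query, k))
        if retrieved & relevant:  # count a "hit" if any relevant doc appears in the top k
            hits += 1
    return hits / len(labeled_queries)

# Establish the baseline with a simple strategy (e.g., BM25), then re-run the same
# function after changing chunking or indexing and compare the two numbers.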
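
And for the agent-specific bullet, a minimal sketch of a structured-output consistency check, assuming a hypothetical AgentResult schema; in practice, validate against whatever format your downstream evaluation expects:

from pydantic import BaseModel, ValidationError

class AgentResult(BaseModel):
    # Hypothetical schema for the structured output an agent is expected to emit.
    action: str
    confidence: float

def format_ok(raw_output: str) -> bool:
    """Return True if the agent's raw output parses into the required schema."""
    try:
        AgentResult.model_validate_json(raw_output)
        return True
    except ValidationError:
        return False

# Run the same prompt several times and track how often the format holds.
outputs = ['{"action": "search", "confidence": 0.9}', 'Sure! Here is my answer...']
consistency = sum(format_ok(o) for o in outputs) / len(outputs)
print(f"Structured-output consistency: {consistency:.0%}")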

Automated Assertion Generation

  1. Analyze and Categorize Edits: Review a variety of prompt templates across different domains, identify frequent user edits, and develop a taxonomy categorizing these edits (e.g., inclusion/exclusion instructions, structural changes). The “SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines” paper already proposes a taxonomy that can be used.

  2. Seed LLM with Taxonomy and Generate Criteria: Use the taxonomy along with your prompt edit history and prompt templates to instruct the LLM to generate assertion functions (sketched below).
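
A rough sketch of step 2, assuming the OpenAI Python client and a hypothetical list of prompt-template versions; the taxonomy categories are paraphrased from the SPADE paper:

from openai import OpenAI

client = OpenAI()

# Hypothetical prompt-template history, oldest first; the diffs between versions
# reveal what the developer kept having to correct.
prompt_versions = [
    "Summarize the document.",
    "Summarize the document in under 100 words.",
    "Summarize the document in under 100 words. Do not include URLs.",
]

# Edit categories paraphrased from the SPADE taxonomy.
taxonomy = "inclusion instructions, exclusion instructions, count/length constraints, structural or formatting changes"

instruction = (
    "Here is the edit history of a prompt template, oldest first:\n\n"
    + "\n---\n".join(prompt_versions)
    + f"\n\nUsing this taxonomy of edit types ({taxonomy}), infer the criteria the developer "
    "cares about, and write one Python assertion function per criterion that takes the LLM "
    "output string and returns True or False."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": instruction}],
)
print(response.choices[0].message.content)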

Full Transcript


[0:03] Dan Becker: Our plan is Hamel and I are going to talk about evaluation types and the trade-offs between different types of evaluation. Then, as I mentioned in the warm-up, we’ve got Harrison joining to do a deep dive on Langsmith, Bryan giving a case study from his work at Hex, Eugene’s going to talk about some specific metrics that are used in LLM evaluation, and then Shreya is going to talk more broadly about evals,
[0:32] Dan Becker: user experience, and workflows. So I’m personally excited to see all of this. Before we get started, we’ve emailed everyone about the compute credits. I think in general people really want to redeem these compute credits, and yet, last I looked, fewer than half of the people in the course have responded to these emails. Many of you are comfortable doing things at the last minute.
[1:03] Dan Becker: I got to tell you, it makes me nervous to see something that you really want that I’m not sure people are going to remember to fill out the forms. Deadline is end of day on May 30th Pacific time. I would highly, highly suggest just filling out the form now. We will do what little we can to get compute credits to people who don’t fill out the form, but actually it is what little we can. For instance, OpenAI requires information about your account ID.
[1:30] Dan Becker: And so we just can’t give you OpenAI credits if you don’t fill out the form. So it’s in your email. We have put it in Discord. Please fill out the form so we can give you your compute credits. And with that bookkeeping behind us, let’s talk a little bit about the topic of the day, which is model evaluation for LLM models. I’m going to hand it off to Hamil for a moment.
[2:01] Hamel Husain: I think I was on mute, sorry. So what we’re going to talk about today is really, I think this is the most important lesson of the entire series of workshops on fine-tuning. And the reason that is, is if you think about a workflow, like creating a data flywheel for things like fine-tuning, but even not fine-tuning, even if you’re just trying to improve your AI and iteratively make it better, you need to have…
[2:30] Hamel Husain: an iteration loop; you need to iterate very fast. The more experiments you can do, the faster you can get feedback and the more things you can try, and those things can be prompt engineering, can be fine-tuning, can be whatever. You need to get feedback very fast, and so at the heart of this iteration cycle is evals: looking at data, looking at lots of data, and doing evaluations. And it really is, when we talk about applied AI, this is
[3:03] Hamel Husain: really the, what I would say, the applied part of the AI is the evals in looking at data. I talk a lot about this diagram in a lot of detail in this blog post that’s linked here. So I highly encourage everyone to take a look at that. Give it back to Dan.
[3:22] Dan Becker: Great. So we’ve talked about evals at a very high level. I’m going to, throughout our time together today. break it into three categories. So those are unit tests. There are other names for them in early versions of this slide. I called it assertions. But these are just code you can run. They run typically relatively quickly, and they validate something that you expect about the responses from a large language model. Below that, you’ll see LM as a judge.
[3:53] Dan Becker: We’ve got the LM that responds to the task at hand, and we hope that that’s good, but sometimes it’ll be good, sometimes it’ll be bad. And then we send that response to another LLM that says, yes, this is a good response, or this is a bad response. That’s LLM as a judge. And then the last is a person just looks at the output of a model and says, yes, this seems pretty good, or no, this seems pretty bad. So we’re going to talk about all three of those today.
[4:24] Dan Becker: And for the sake of concreteness, We’re going to talk about them and especially where they are useful in two different settings. So one is writing queries. This is a project that Hamel has worked on. Actually, he’s worked on arguably two different use cases that fall under the writing queries framework. And then I’m going to talk about a project which I started probably nine months ago, which is debiasing text. Since you haven’t seen anything about de-biasing text, I’m going to give you the 20-second intro to it.
[5:02] Dan Becker: So this was a project that we did for an academic publisher, and they want to remove certain types of subconscious biases or stereotypes from people’s writing. So you could have, for instance, in a history paper or some social science paper, the author of a journal article wrote, Norway’s mining economy flourished during the period due to Norwegian’s natural hardiness. And we don’t want to talk, like, it’s a fact, Norway, that… Norway’s mining economy may have flourished during a period, but this stereotype of Norwegian’s natural hardiness, we want to remove that.
[5:38] Dan Becker: The company that I did this for actually has a large team of people who for a long time have been reviewing manuscripts and making these edits manually. So we wanted to see if we can automate some of that with LLMs. The fact that they had a large team doing it manually, when we come back, that’ll be actually important. when we think about the right way to scope an evaluation for a given situation. But coming back to our set of tests, the first one we said was unit tests.
[6:13] Dan Becker: To zoom out for just a moment, we’ve gotten some feedback on this course. I think the feedback, like most people are enjoying it, but the number one piece of, the number one request or piece of feedback we’ve gotten is that people would like for us to spend more time going through code and maybe even going through code in a reasonable amount of detail.
[6:33] Dan Becker: That takes time, and I’m happy to spend the time, but we have such little time in the workshops that what I’m going to do is I’m going to go through some code, put it in a video, and then put that on the course page so that I can spend however long it takes to go through code at a pretty granular level. You can watch the video. And then that doesn’t take away from our limited time as a group until we can stay sort of high level.
[7:03] Dan Becker: So despite that, I’m going to use this code only at the very, very highest level, knowing that I’m going to go through it in much more detail later on. I don’t know what fraction of you, and actually maybe I’ll do a poll in Discord. I don’t know what fraction of you have used PyTest before. This code uses PyTest, but we could use it without PyTest. Let’s just… plain Python. Here, I’m just showing you what does a unit test mean. So this is a way of asserting something about the output of a model.
[7:36] Dan Becker: Here we have this top function, this LLM pipeline, creates the pipeline that we want to test. I have a utility function that runs the pipeline. And then you can see the bottom line there, it just asserts that some expected text is in the result. And then using that I can run some specific tests like if I run the LLM pipeline and give out who is the CEO of Google, if the answer doesn’t have Sundar Pichai, then the model is doing something wrong. Another very trivial example is this bottom one.
[8:08] Dan Becker: If I give it what is two plus three, and the string answer doesn’t have the character five in it, then something has gone horribly wrong. Actually, it could spell out the word five. These are are a common thing to include. I think most projects should probably have some unit tests, but for many things that we ask a model to do, these are pretty limited. And if you go back to my de-biasing text, there are many ways to rewrite something. And so these sort of programmatic, either contains or regex-based tests, yeah, have real limitations.
[8:49] Dan Becker: Let me, Hamel, I want you to talk a little bit more about… how you think of these unit tests and what you use them for.
[8:55] Hamel Husain: Yeah. So I think about unit tests in terms of like, this is the first line of defense when you are working on an AI system. So my opinion is, if you don’t have really dumb failure modes, like things that can trigger an assertion, oftentimes, like it’s natural to think that, hey, like I can’t write any unit tests for my AI because…
[9:20] Hamel Husain: it’s spitting out natural language and it’s kind of fuzzy and I don’t really know like what what’s gonna you know happen how am I supposed to write an assertion or unit test for this like you know everything requires a human or you might have that instinct and what I find is most of the time in practice and almost really every time that I have worked on a project in in practice is I always find dumb failure modes that things that are going wrong with the large language model, like with the output of a large language
[9:52] Hamel Husain: model or something else that can be tested with code. And I always find that if through looking at the data rigorously enough, I can always find these failure modes. I think it’s very important to enumerate these failure modes. This is an example of one.
[10:07] Hamel Husain: This is TypeScript code, but this is basically a, you know, this is an example of a unit test from one of my clients where they’re just trying to check for the presence of a unique user ID that’s accidentally being spilled from the system prompt into the final message, and we don’t want that. So that… That is a, that’s a example of a unit test. You want to abstract the logic of this unit test so you can use it everywhere. Like you may not only want to encapsulate it in something like a PyTest.
[10:41] Hamel Husain: You, you also want to write these tests in a way that you can also use them during production, like in your LLM invocation pipeline so that you can do things like self-healing. Harrison might be talking about that later on. when we go through different frameworks. And then more importantly, you want to log the results of these unit tests to a database. Or you want some systematic way of tracking them. Otherwise, you don’t know if you’re making progress. And that’s really important. We can go to the next slide.
[11:11] Hamel Husain: And so, like, how do you write these unit tests? So one question is like, okay, how do we, like, you know, you have some application, some AI application that’s working off data. And one kind of simple approach that I like to use is enumerating all the different features that the AI is supposed to cover. And then within each feature, I have various scenarios that the large language model is supposed to handle. And then what I do is try to create test data for that scenario.
[11:45] Hamel Husain: So in this example, this is a real estate CRM company called ReChat. that actually Harrison is also familiar with. They have lots of different tools and features that the large language models is supposed to respond to, like things like finding listings, like finding real estate listings. So in the feature that finds listings for you, there’s many, there’s different scenarios. So like one scenario is you only find one listing that matches the user’s query.
[12:21] Hamel Husain: Yet another scenario is you find multiple listings that match the user query, and then also you find no listings that match the user query. What we do is we break down the application by all the different tools, and then all the different scenarios that can occur within those tools, and then we generate test data for all of those. Like the next slide, basically one way to think about it is, Either you have user data that you can bootstrap off of, but sometimes you don’t.
[12:53] Hamel Husain: So one way you can go about this is to synthetically generate lots of inputs to the system. So this is, again, the recheck case. You can generate lots of synthetic data just by using a large language model. This is an example of a prompt. It’s basically saying, hey, write an instruction that a real estate agent can give to assistants to create CMAs. CMAs is a… comparative market analysis. You don’t have to worry about what that is. Don’t get too hung up on the exact prompt here.
[13:23] Hamel Husain: Basically, the idea is you can use large language models to systematically generate test data for all these different scenarios in features that you want your large language model to respond to. Hopefully your use case is not as broad as this ReChat one. Usually the ones I work on are not, but this is kind of like a Again, one way to think about it. Go on to the next slide. So I mentioned logging results to a database. So when you’re first starting off, it’s useful to try to use what you have.
[14:01] Hamel Husain: You don’t necessarily need to buy stuff, although the tools are getting really good now, and we’ll show you some of them, and actually Harrison will show you one that I quite like. But you can get started by using existing tools. So like, for example, I have one client. There’s the same client, ReChat. They have Metabase that they use to log all their different experiment results. And then, you know, we started off by logging our test results, these like assertions to that, to see if we were making progress, you know, across different iterations.
[14:33] Hamel Husain: So what you see here, this is a bar chart that basically shows different error rates for different scenarios. It’s not important to read the bar chart, honestly. But… We started off by running this in something very similar to PyTest. And this is kind of a printout of that on the lower right-hand side where we have different scenarios, different tools, and different scenarios within those tools, and then the failure rate.
[14:58] Hamel Husain: with of those different scenarios so I just want to there’s many different ways to do this or structure these tests I don’t want you don’t overfit on what I’m telling you I’m just giving you some mental model of some way you can approach it and also don’t get too hung up on the tools like there’s lots of different tools out there and we’ll try to show you as many of them as possible through this conference I’m gonna give it back to Oh, actually, I can keep going. Next slide.
[15:30] Dan Becker: As I say, there’s sort of two things that, as you talk, that jump out at me. So one is, we call it unit tests here. You can, in some ways, think of the, I think there are two different ways of using these. So one is, like, unit tests, they should all pass. And if not, then stop the pipeline. And another is closer to a Kaggle-style leaderboard. And so I don’t expect them all to pass. But as I run through successive iterations, I want the number of them to pass to go up.
[16:04] Dan Becker: And that’s a way of basically measuring, am I making progress? Or when I try maybe fine tuning off a different base model or using a different prompt, is it better or worse? And that’s closer to conventional ML. I don’t think either of these is better than the others. But it is interesting to just hear. As you talk, that’s one of the things that I’ve used them both ways.
[16:26] Dan Becker: And then I think the other detail that is important for people when they’re thinking about what sort of tests are relevant for them is that you have many use cases. And I think probably all the projects that you’ve worked on fall in this category where you’re really building a tech product that’s going to be basing a general public user. And for those, the types of tests. that ensure the data isn’t, that you’re not exposing UUIDs is quite important. Most of the projects I work on are internal.
[17:03] Dan Becker: So the example I gave of automatically debiasing text, we actually have someone who previously was editing text manually, and now they’re going to use this as a starting point. And so we tend not to be as concerned about unit tests in the unit test sense. Like if something is just really bad, it’s not that big a deal. We’re just showing something inefficient to our internal employee. And I think that informs which of these two types of ways of thinking of tests you want of unit tests.
[17:33] Dan Becker: This can never fail versus we’re just trying to generally move in the right direction.
[17:38] Hamel Husain: That’s a really good point. And that’s kind of why we have the slide here because we noticed we had kind of like two different types of experiences. And we wanted to highlight that for our students. It’s like, hey, there’s no like…
[17:51] Hamel Husain: right way of doing things necessarily it’s like it’s you have to take your use case into consideration and see like how much effort is is appropriate for your use case like for example unit tests like you know in the debiasing text example you know perhaps spending a lot of time in unit unit test is not going to be fruitful whereas like in the honeycomb example that we go through this course you know as the case study that we covered in previous lessons like yeah unit tests are really good for that for example yeah i wrote
[18:28] Hamel Husain: honeycomb on the cyber issue said reach out but for either of them you would use unit tests and for us we said like we’re editing freeform text and
[18:36] Dan Becker: then if it’s not exactly right that’s not uh the end of the world but editing freeform text and outputting freeform text we just thought it was too rigid so we ended up not using it so um The second workflow we’ve talked about is the LLM as a judge. So let me hand it to Hamlin. You can talk about how you’ve done that.
[18:57] Hamel Husain: Okay, so the thing about LLM as a judge is a very popular way to run evaluations on LLM outputs. The thing that is skipped most often is not aligning. You have to make sure you align the LLM as a judge to something. Because you have to be able to know whether you can trust the LLM as a judge.
[19:18] Hamel Husain: and so one way I do that is to iterate on the LLM as a judge and measure its correlation to a human standard that I trust. There’s a lot of ways you can do this; I like to use spreadsheets when possible. So, for example, in the Honeycomb example, which you are already familiar with in this course, basically what I did, over the course of a few weeks, and I gave you a little bit of a summary of this already, is I gave my client
[19:45] Hamel Husain: a spreadsheet that looked like this, where basically I had them critique queries as being good or bad and write down exactly why they were good or bad. And then over time, successively, I aligned a model to this human standard. So in the next slide, you can see a little bit of the progression of that, where over time, I was able to get the LLM as a judge and the human to critique in the same exact way most of the time.
[20:20] Hamel Husain: So we could build confidence in the LLM as a judge and have a principled way of reasoning about it. So some general tips on LLM as a judge. One is use the most powerful model you can afford. Oftentimes you need to do reasoning that is somewhat more complicated. This model-based evaluation, this LM as a judge, is a meta problem within your larger problem. So this is kind of like a mini evaluation system of the judge itself. So just keep that in mind. And also you must…
[20:57] Hamel Husain: continue doing this, like measuring the human agreement with the LM as a judge. This is not like a one-time exercise where you try to align the judge with the human. You kind of have to periodically come back to this. So those are my tips for LM as a judge. I’m going to give it to Dan, who has an interesting story about LM as a judge.
[21:19] Dan Becker: Yeah, I mean, for this debiasing text project we did, we observed something that we really didn’t. didn’t like in the LM as a judge. So here, for the sake of space, I’m using A and B as a stand-in to represent what was the original paragraph written by an academic author and B as what came out of the LM. And we found many, many cases where if you ask an LM, here’s the text, the original text, and here is the output. Did that reduce the use of biases and stereotypes? It would say yes.
[21:54] Dan Becker: And then if you flip the roles, it would still say yes. So no matter which one came first, it would say that that was better. And we just, as a result, said, like, if you think that A is better than B and also B is better than A, we just don’t trust you as a judge. And so people in general really like LLM as a judge. And it sounds cool. But whenever most of my experiences where we’ve used it. I’ve been unimpressed and in practice thought it wasn’t that great.
[22:29] Dan Becker: So I think you probably need to make this judgment on a case-by-case basis. But overall, I think, Hamel, your 94% agreement, that’s pretty good for something that you can run as code and you don’t need to bother your client every time. For de-biasing text, this lack of transitivity, in terms of if A is better than B, B shouldn’t be better than A, that made us quite skeptical, and we ended up not relying on LLM as a judge. The last one I want to talk about is human evaluation. There are many levels of human evaluation.
[23:06] Dan Becker: You will hear a constant theme from Hamel and I throughout this conference of look at your data. It will always be at least part of your evaluation process, and you saw that a moment ago of we’re going to use LM as a judge. somewhat heavily. We still need to get a human evaluation to see, is the LM doing a good job? You probably want to, on a regular basis, just update and make sure that those don’t start diverging where the LM is maybe even overfitting to what the human has judged in the past.
[23:40] Dan Becker: So you probably want to keep doing that. In the de-biasing case, because we had a team of people and they were relatively… low wage people. And so it wasn’t that painful for us to pay to have them do this. Well, we said that when we run a new model in almost every experiment, we would actually send it out to the humans who do this editing as their job and say, does this look good or bad? And so for us, it was all of evaluation.
[24:09] Dan Becker: So for writing queries, some labor was required in human evaluation, but that wasn’t the entirety of what Hamel used at ReChat.
[24:19] Dan Becker: For de-biasing text, we said it’s labor intensive, and because of the details of that particular project, that was okay, and that was how we did nearly everything. And I think, if you can afford it, LLM as a judge, I’m sorry, if you can afford it, which you usually can’t, totally human evaluation is great. So how would you summarize all that, Hamel? Yeah, so, I mean, okay, if you can write
[24:48] Hamel Husain: tests, you have an evaluation workflow, you have some way of evaluating things like that. You know, what you can quickly do is you can construct a workflow where you can change things about your AI pipeline, such as like prompt engineering or even fine tuning. And you can get feedback really fast because you can run it through these tests. You can run it through these assertions, LLM as a judge, maybe even human eval, if you have a really dialed in system for that. And you can get feedback fairly quickly.
[25:19] Hamel Husain: And so that’s the whole key to this. Yeah, that’s really all I want to say about this part. And in. But one caveat I’ll give is we’ve hidden some complexity. We make it sound easy, like just do this, everything is going to be great. No, there’s a lot of other things to keep in mind. And Dan will kind of go into some of those things.
[25:42] Dan Becker: Yeah, we’re, I think, going to highlight two of them. And then the second one we highlight is going to segue really nicely into what Harrison’s going to talk about. To me, if you’d asked me. before I started working on these projects, what is the true ideal? If you could have humans do all of the evaluation, I guess, cost was no issue. So what’s the ideal? I’d say have humans do all the evaluation and even have some software set up that facilitates that. That seems very expensive, but very reliable.
[26:16] Dan Becker: I’m going to give an example of where even the thing that seems like the most reliable can still fail. So one of the projects I have… talked about intermittently, but probably more than any other project during our session so far, is this project where we take scientific images and write the alt text, which is basically just a description for people who use Braille readers so that they can understand the text in images.
[26:45] Dan Becker: So here we have an example, and you see this is a Lewis structure diagram, which is a way of conveying the structure of a… of a chemical molecule. You would get this as the input. We actually put some surrounding text around it too. And then the output from the model is something like Lewis structure diagram of a nitrogen atom, single bonded to three hydrogen atoms. And that’s what we want the model to output. We worked on this project for a while. And here’s a slide, there’s a bunch of bars.
[27:16] Dan Becker: You can’t read the labels on the vertical axis, but you can very quickly pick out the general trend. So. The first model we built is the top bar, then the next one, then the next one, the next one. And for each of these, these are just different models or pipelines. We send different prompts, but they’re just different ways of using an LLM to get the output. For each, we have humans rate the output. And the trend you’ll see is that if you go through the first four or so models, we made very, very steady improvement.
[27:48] Dan Becker: And then it seems like when we switched. from the fourth model to the fifth one. There was some drop off there and then we almost caught up. When I looked at this, and each of these iterations, like it took some time and expense, when I, if you look at this, I think most people would say like, yeah, you should just stop this project. The fourth pipeline listed here is the best one and maybe you’ll catch up to that, but it doesn’t seem like it’s very high ROI to continue.
[28:16] Dan Becker: So I want each of you, I bet no one’s going to get this, to pause for a moment. and think about why this might be misleading. How is it that actually pipeline four might not be as good as the ones below it? So let me, now you had a moment to think about it, let me show you. So right before we stopped the project, we talked to the people who were labeling and they said, yeah, it seems like the new ones are probably better than the old ones.
[28:44] Dan Becker: And I said, why isn’t it getting a better score? And then behind the scenes, we actually have a a reasonably nice, all their labeling happened in a bespoke piece of software. We just changed at the back end to reuse this model that was the one that had the best score. And we saw that it got much, much worse scores than it did a couple months earlier while we were iterating. And you should ask, why is that? The reason, and we just talked to them and they told us this, do I think we have…
[29:19] Dan Becker: many sorts of evidence that reinforce it, is that by seeing better and better models over time, even though they had a so-called fixed rubric, their standards kept going up and up, so that something that they once would have rated as a 0.65, now they rate as a 0.5 because it just seems disappointing compared to what they have previously seen. And so if you look at this second-to-last model, it happens to be a LLaVA 1.6 34B that’s been fine-tuned. That actually is much better than the model that we used midway through. And it’s just that people’s standards had
[29:55] Dan Becker: increased. And so this is all to say that things that seem really reliable, still many ways that they can go wrong. The way that we in practice have solved this, or nothing is truly a solution, but the way that we’ve solved it quite a bit is A/B testing. We, again, have a piece of software that people are using for rating the quality of alt text. And if we want to compare two models, we randomly select which model is used to produce the alt text.
[30:27] Dan Becker: And then in any given week, we can compare scores. And that will control for changes in their judgment over time. That works great for this project. It’s impractical for most early stage projects because you don’t have enough data. You don’t have enough human labelers to constantly be doing the A-B testing. Lots of reasons that it’s just impractical for early stage projects. And… Does A-B testing, is it the right thing for you? Like most things, it depends. But that’s, I think, just an example of a foot gun you need to watch out for.
[31:05] Dan Becker: Let me hand it actually back to Hamel. I think you should talk about, right before we hand it to Harrison, talk about the other piece of complexity that we’ve hidden. And you’re muted.
[31:20] Hamel Husain: Sorry, the other piece of complexity is looking at your data. This inside, I mentioned this is the most important lesson, probably in the set of workshops. This is the most important point in this workshop is look at your data. Nobody looks at their data. Even people that tell me they’re looking at their data, they’re not looking at their data. Even fellow data scientists are not looking at their data as much as they should be. And kind of, so let’s start with.
[31:49] Hamel Husain: what the data usually looks like, so it’s important to know what this terminology is. So first let’s talk about what a trace is. A trace refers to a sequence of events. In engineering parlance, the concept of a trace has been around for a really long time; it’s basically a way that people log sequences of events, like the ones you may have on websites, you know, as you log into a website and you end up checking out with the cart, things like
[32:18] Hamel Husain: that. But it’s also been prevalent in engineering systems for a while. For large language models, a trace is relevant because a lot of times we have sequences of events with large language models as well: we have multi-turn conversations where you have back-and-forth chats with your large language model, you might have RAG, you might have function calls. There’s lots of different things that can happen, and essentially, that’s what you’ll see in a lot of tools and telemetry. You’ll see this term trace, and that’s what it means.
[32:50] Hamel Husain: But it’s really important. It’s one of the most important assets you have for things like debugging and fine tuning. A lot of times, you can represent these data as JSONL. It can be represented in many different ways, but you’ll see them represented as JSONL in many cases. And so one thing that I like to do, so one thing that’s really important is to remove all friction from looking at your data. And it’s really nice to tell people to look at data, but if it’s painful to look at data, then no one’s going to do it.
[33:22] Hamel Husain: Even you won’t do it. And so what I mean by that is like, if you just open your traces in a spreadsheet or just a text editor, and you’re trying to look at it, and you’re not able to find the information that you want, you’re not able to filter through and navigate that, then you’re not going to end up looking at your data. And it’s like the most important thing that it’s… It’s probably one of the most important things you can build.
[33:48] Hamel Husain: And so this is a screenshot from an application, like a very simple one that I built at ReChat that allows me to navigate the traces, the data. And the things I highlighted in red here, they’re just very domain-specific things, like what tool, what scenario am I looking at? Is this a trace synthetically generated or human-generated? How many, you know…
[34:10] Hamel Husain: how many of these categories and scenarios have I reviewed so far, and then some links to various databases and also Langsmith, which they use for logging. Things like that, which are very domain specific, are rendered, and there’s other domain-specific things I won’t get into here. But basically the idea is to remove all friction, and you can build your own tools like this if you want. You can use things like Shiny, Gradio, I mean, Shiny for Python. So there’s a Shiny for Python, which I used here.
[34:43] Hamel Husain: But also things like Streamlit, so on and so forth. The tools are getting a lot better. So, you know, you may not have to build something specific to yourself. There’s a lot of off-the-shelf tools. But what I would say is you need to make a decision whether or not… enough friction is being removed and you’re seeing all the information you need.
[35:02] Eugene Yan: I don’t know if you’re doing a slide with the tools. Oh, sorry.
[35:05] Hamel Husain: That was the next page. Sorry about that. This is the screenshot of the tool with the domain-specific things that I highlighted. And this is the Shiny for Python kind of application. It took me maybe less than a day to build. Very easy. It’s very easy to build things like this. Let me go to the next slide, if you don’t mind. And so there’s a lot of ways to log traces. Like I showed you how to render the trace. There’s a lot of tools and they have evolved quite significantly since even I worked on this project.
[35:42] Hamel Husain: And there’s a whole host of things. There’s things like Langsmith, which is pictured here. That’s a way to render your trace. Langsmith has also other tools and things like for writing tests and doing all these other things that we have talked about. Harrison is going to walk through that. right after this. But there’s other tools that you can also check out, like things like Identity Logfire, which is like a logging framework. There’s Braintrust. There’s Weights and Biases Weave. And there’s also open source tools like OpenElemetry and Instruct.
[36:11] Hamel Husain: About the next slide, I just want to point out for Inspect, we have JJ Allaire, who’s going to walk through very detailed code and do a workshop on the Honeycomb problem, which I’ve been using as a case study in this course. We’ve worked with him to bring that example into his open source library, Inspect, and he’s going to walk through how you would do kind of like an end-to-end eval using his library. So I highly recommend that; I would almost say treat it as a required part of the course, because it’s
[36:46] Hamel Husain: going to be really good, and that’s going to be tomorrow at 1 to 2 p.m. Pacific. Okay, next slide. And so there’s a lot of things that we have talked about here: writing unit tests, logging traces, doing evals, looking at data, all this stuff. Now, some of this stuff, maybe you want to build a tool, but honestly, it’s a lot of things to think about. And especially for things like logging traces, I don’t recommend building tools for that yourself. There’s good tools out there.
[37:23] Hamel Husain: Use tools where you can, like off-the-shelf tools. We will be going through them. So if you go back to the previous slide for a second, sorry, I just want to point out that for Langsmith, Harrison is going to do office hours and be talking about it right here in this lesson. Braintrust, we will have a session. We might end up having a session with Weights and Biases. And then also we’re going to have something for Inspect. So you’re going to get a lot of exposure to these tools. So, yeah.
[37:51] Hamel Husain: uh it’s best to use a tool if you can to offload a lot of these things so you can focus on looking at data so that’s a good segue into harrison chase who’s going to be talking about lang smith for logging and tests and other things which i may not realize that lang smith does so
[38:09] Dan Becker: I’ll hand it off to Harrison. And just format-wise, we should set aside some time for Harrison to answer questions in the Q&A right after he finishes speaking, rather than bundle all the Q&A for the end.
[38:25] Harrison Chase: Sounds good. Thanks for having me, you guys. I’m excited to be here. The title of my talk, I don’t really have one, but the title of my talk is Why Hamel and Dan Are Right and You Should Listen to Everything They Say. Because I think I very much agree with all the points that were made earlier. And we’ve had the pleasure of working pretty closely with Hamel on a few different projects. And so…
[38:49] Harrison Chase: I want to spend the next 10 or so minutes showing off Langsmith, but more than anything, I really want to show how you can do exactly some of the workflows that Hamel had described because a lot of the things that we added were through conversations with him and through reach out in particular around different workflows that would be nice. And there’s some that we haven’t added yet and I can hint at what those are. So let me go ahead and share my screen. And hopefully you all can see this okay. So this is LangSmith.
[39:26] Harrison Chase: Thanks for confirming. This is LangSmith. This is our platform for logging and testing of applications. It works with and without LangChain. And everyone here in the class should get credits for it. And we’ll be doing office hours on it as well. The first thing I want to show is really looking at your data. I think this is many people’s first entry point into LangSmith. If you’re using LangChain, it integrates with one or two environment variables. If you’re not using LangChain, we have a few different entry points.
[40:04] Harrison Chase: You can set a decorator on your functions. You can log spans directly. And what happens when you log them is you can log them to a project. So I’m going to click into chat link chain here. This is a chat bot over a documentation and I can see a log of all the things that were asked. I can click in to any one of these things and I can see exactly what’s going on under the hood. So here I can see that I first made a call.
[40:28] Harrison Chase: to Google LLM and I got it to basically rephrase a question. I then passed that question to a retriever and I got back a list of documents. And then I also, and then I made a final call to Google’s LLM and I got it to generate an answer. And I can see here when I’m clicking into them, I can see that everything’s rendered really nicely. So we spent a bunch of time trying to make this look as nice as possible.
[40:53] Harrison Chase: we strongly believe that people still should be looking through their data and so we want to make that experience as enjoyable as possible. And so you can see kind of like the system message, the human and AI messages, the output, the documents kind of like render nicely. One of the fun things that we also added was you can basically go directly from this trace into a playground. So if you want to tweak the prompt at all or do any modifications, you can jump directly to there.
[41:22] Harrison Chase: And we found that really helpful for this kind of iteration speed. All right. So that’s one thing that Hamel mentioned: look at your data. Another thing he mentioned when he was talking about that, and I was taking notes these past 10 or 15 minutes, so I’m referring to my notes throughout all of this, is the ability to filter and dissect data. And we’ve invested a lot of time into really good filtering of these runs. So I can add a filter.
[41:49] Harrison Chase: I can filter to errors. I can filter based on latency so I can get runs that took a long time for status. I tag these with various things. So I’m using four different LLMs, actually, actually five different LLMs. And I tag them and I can then filter into ones that are using OpenAI or Google or anything like that. And I can filter on feedback as well. And so this is one of the main things that we see people want to filter on because it easily draws your eyes to things that.
[42:16] Harrison Chase: the user set did poorly or did well. And I can also view aggregate statistics of this over time. And then I think Dan mentioned A-B testing a little bit. So we actually do a version of that with Chat-Langchain. And what you can do is you can group your statistics by metadata. So I can see kind of like various… So here we track various stats. And there’s a really cool one to look at, which is latency. And so I can see the latency of these different models over time.
[42:49] Harrison Chase: And so I can see that we have Fireworks and I think OpenAI are generally the fastest. And then we have Cohere, Anthropic, and Google up above. All right. What’s next on my list of notes? I think after that, I jump to a bunch of things around datasets and testing. So, as I mentioned, there’s kind of like two core components of Langsmith. One is the observability side, which I just showed off. And then the other is around datasets and testing. So the first step is to actually create these datasets.
[43:18] Harrison Chase: We support a few different ways of doing that. One, you can upload examples manually. So let me go here because I think this is a good one. You can upload examples manually. You can click in. You can see exactly what they are. You can modify them. You can also import them from traces. So if we go back to our project, this is one workflow that we see being done a bunch. is maybe I want to filter to things that were given a score of zero. So these are bad examples.
[43:48] Harrison Chase: I then maybe want to click into this, and I want to add this to my data set so that I can easily stop it from happening again. If this happened bad, then I can say, yes, these are the same, and then add it to a data set. And so this workflow of traces to data set is, I think, a really nice workflow and a good reason to have kind of like a tool that unifies your data workplace.
[44:15] Harrison Chase: Anyways back to data sets, the thing that I like about this example is you can see that we actually have two different splits here. So one thing Hamel said was think about what kind of like situations your application could mess up in and build kind of like data that test out those and so this is one feature we’ve added recently where you can organize your data set into different splits and then you can also test it on these different splits to see how they perform.
[44:39] Harrison Chase: And so this is really useful where if you notice that there’s one particular failure mode for your application, you can drill in and just test that split a bunch of times rather than testing the overall thing. For tracking the things over time, I’m going to jump to an example where I have more runs over time. Here you can see that, so basically what you can do once you have these examples is you can kick off runs.
[45:04] Harrison Chase: So these are normally kicked off client side, which has the benefit of, basically, you know, Langsmith is very much kind of like a code-first platform. And so you can see that you basically define the function you want to evaluate. And this is using a LangChain chain, but again, this can be anything. You then define the dataset that you want to run over. Here I just define the name or the UUID of the Langsmith dataset. And then I define a bunch of evaluators.
[45:31] Harrison Chase: And so we have a few off the shelf evaluators. So here we’re using this chain of thought LLM as a judge evaluator. And I’ll talk a little bit more about LLM as a judge because that was brought up a bit by both Dan and Hamel. But you can also define kind of like arbitrary functions to run here and then just pass them in. Yeah, you can use this. Nice. We have some examples without one chain as well. Once you run these examples, they’ll show up as experiments. And so you can track their results over time.
[46:00] Harrison Chase: You make sure that there’s no massive progressions like there was here. And then another thing that we added that I really like, the spirit of looking at your data, is a really easy way to compare kind of like two experiments. So if you want to drill into like what exactly one model got better or worse at, you can see here that we highlight two… two cases where this model performed better than the other one. I can easily filter in to see what those two cases are.
[46:29] Harrison Chase: If I had other metrics, I could switch it to that. I can also compare more than two experiments. So I could jump three in here and view them side by side as well. And of course, I can open it up and I can look at it in a little bit more detail for the spirit of looking at your things and for looking at your data, input, output, result one, result two. A few last things I want to highlight on the topic of LLM as a judge.
[46:56] Harrison Chase: We also support adding LLM as a judge in the UI so that they’ll automatically run every time an experiment is uploaded. And so this is nice because you don’t have to then run it client side. And so what I can do here is: we have a bunch of off-the-shelf evaluation prompts, but you can also write your own. One of the cool things that we’re working on at the moment: Hamel mentioned this idea of aligning human preferences with the LLM as a judge.
[47:28] Harrison Chase: And I saw a question in the chat about how to do that. And one of the theories is to do that with few-shot examples. And so we want to make this alignment through few-shot examples a really tight cycle. So one of the things we’re working on is basically a correction flow where you can use an LLM as a judge, annotate whether it did it right or not, and that can then get fed back in as a few-shot example, and hopefully you can start to measure alignment over time.
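A minimal, library-agnostic sketch of that correction flow: human corrections are stored and fed back into the judge prompt as few-shot examples. `call_llm` is a placeholder for whatever model client you use; nothing here is a specific LangSmith API.

```python
# Human corrections accumulate and become few-shot examples for the judge.
JUDGE_TEMPLATE = """You are grading an answer for correctness.
Here are past examples graded by a human:
{few_shot}

Question: {question}
Answer: {answer}
Reply with PASS or FAIL and a one-sentence critique."""

corrections = []  # grows as humans annotate cases the judge got wrong

def judge(question: str, answer: str) -> str:
    few_shot = "\n\n".join(
        f"Question: {c['question']}\nAnswer: {c['answer']}\nHuman grade: {c['grade']}"
        for c in corrections[-10:]  # keep the prompt small: only the last N corrections
    )
    prompt = JUDGE_TEMPLATE.format(few_shot=few_shot, question=question, answer=answer)
    return call_llm(prompt)  # placeholder LLM call

def record_correction(question: str, answer: str, human_grade: str) -> None:
    # When a human disagrees with the judge, store the correction as a new few-shot example.
    corrections.append({"question": question, "answer": answer, "grade": human_grade})
```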
[47:52] Harrison Chase: The last thing I want to show just in the spirit of looking at your data and annotating data in particular is we have a concept of annotation queues in LangSmith. And so this can be used to gather human feedback. Basically, what you can do is you can send data points from a project into an annotation queue, where it will then be loaded in a format where you can kind of like easily cycle through data points and label them.
[48:18] Harrison Chase: So you can see the inputs, you can see the outputs, you can add it to a data set, you can move it to the end, you can mark it as done, you can leave feedback, or you can add a note to it as well if you want to collaborate. I’m sure there’s a lot of things in here that I probably didn’t cover, but in the spirit of just trying to echo things that Hamel and Dan said, I think these are the main things I wanted to show off. And so I will stop there. And yeah, happy to take any questions, or skip ahead for time management.
[48:47] Hamel Husain: I think that was really good, Harrison. I know that’s a mind-dizzying set of features and things, but it’s actually helpful to visualize some of these things, because we talk about it in bullet points and stories, and sometimes it’s helpful to see what it looks like operationalized. For me, I know when I’m learning things it’s helpful, so that’s why I wanted to have someone like you show that, because it really helps glue the intuition of, okay,
[49:18] Hamel Husain: how does it work? I think, Dan, what do you think? I think because of time, maybe we redirect the Q&A for Harrison into his office hours, because we have three.
[49:31] Dan Becker: Yeah, and the other thing that would be nice about that is, I should have thought about this more ahead of time: when I look at the Q&A, many of those came in while you and I were speaking and aren’t necessarily LangChain-specific Q&A. So if you see any in the Q&A that you especially want to pick off, Harrison, that’s great. Otherwise, I like the idea of doing it during your office hours, and then we’ll get questions that are more narrowly targeted to you.
[50:00] Harrison Chase: You know what, that sounds great to me. I’ll go through the Q&A right now and respond to them, because I think I can also type out answers that way. All right, sounds good. Thank you guys for having me.
[50:17] Hamel Husain: Yeah, thank you. Okay, next up is Bryan, so let me just introduce Bryan a little bit, if you don’t mind. Bryan’s a good friend. He’s a brilliant machine learning engineer and data scientist who’s been working in the field for a while. He has this great book on recommendation systems, which is actually very relevant to LLMs as well. And a lot of this eval stuff is very use case specific.
[50:48] Hamel Husain: I wanted to bring in another expert who I know does really good work, and have him describe his approach to doing evals and instrumentation and things like that, and walk you through his workflow, to give you yet another perspective. So Bryan, I’ll hand it off to you.
[51:20] Bryan Bischof: Thank you so much. Yeah, I’m going to talk a little bit about some of the things that you’ve already heard. I’m going to underscore some of those lessons, but ultimately I’m going to try to give you a sense of how I think about making this real. I always click that wrong button.
[51:29] Hamel Husain: No worries. Looks like it’s frozen. Oh,
[51:31] Bryan Bischof: I think share means share. And so today I’m going to talk to you about Spellgrounds. Can you hear me?
[51:38] Hamel Husain: Yeah, I can hear you.
[51:40] Bryan Bischof: Okay, cool. Today I’m going to talk about Spellgrounds for Prestidigitation. So you’re going to see a little bit of magic-themed stuff today, because I work on a project called Hex Magic. But Spellgrounds is the name of our internal library for developing and running evaluations. It is a combination of systematic evals, use-case-specific evals, unit tests, and regression tests, all in one thing. We’re going to talk about why.
[52:12] Bryan Bischof: But right off the bat I want to give you a very opinionated position on what evals are about. Evals serve one or more of the following three purposes. They help you understand when a capability is good enough to present to your customers. They help you sleep well at night. They give you confidence that your system’s not behaving badly. Or they help you debug later when things go awry. I am pretty convicted that these are what evals are for and what they’re about.
[52:49] Bryan Bischof: When you think to yourself about the things that you’ve learned so far in the course, and the lessons that you’ve heard about good evals and how you can use evals, they usually, in my opinion, always fall into these three buckets. You’ll also see this in the market when you talk about evals tools.
[53:07] Bryan Bischof: So I’m going to talk about these three lessons as part of the things I cover. First up, I’m going to tell you a little bit about some things you should avoid. We’re going to call these miscasts and fizzled spells. These are mistakes I’ve already made or I’ve narrowly avoided. The first one that I see people make a lot is they think LLM evaluations are an entirely new field. They’re not. They’ve been around. We’ve been doing this for a long time. And the experts are data scientists.
[53:37] Bryan Bischof: We’ve been mapping complicated user problems to nuanced objective functions for over a decade. Some of us personally for over a decade, and the field for close to 20 years. We’ve got really good instincts here on what it means to measure unpredictable outputs. You may think that the nuance and the beauty of your LLM outputs are some ineffable thing. They’re not. I promise you. I believe that you can coerce whatever it is you’re trying to get generative AI to do for you into reasonable and quantifiable performance. So let me give you some examples.
[54:18] Bryan Bischof: For code generation, you should be considering things called execution evaluation. Run the code and see if it does what you expect. In my workflow, I need to evaluate SQL and Python and R. I run the code that’s generated by the model. I compare it to the code that I’ve run in the target setup, and I make sure that the outputs have the same state. That’s how you evaluate code generation. Right now, everybody’s really excited about agents. What does it look like to evaluate agents? Well, one important step is planning.
[54:56] Bryan Bischof: You should be thinking of that as a binary classification data set. For the steps in the plans, you should think about which ones are required steps and which ones have some looseness. Turn those into binary classifications. Turn that entire sort of state machine into a set of binary classifications. You may think, okay, Bryan, well, those are easy. What about something complicated like summarization? You should be checking for retrieval accuracy. Does it include anything that references the important points that your summarizations in your target code or your target response include?
[55:37] Bryan Bischof: So this is probably the biggest trap, I would say, that people fall into: they think that they can’t make their evaluations like old-school data science evaluations. Here’s an example of how I compare the output of data science code. Very simple. All it is, is massaging the data frames and saying, is there any way possible that these data frames contain the same data? This is a blocker that I hear a lot of people express.
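A toy sketch of that kind of relaxed comparison; the example frames and numbers are made up, and real code would handle floats, rounding, and types more carefully.

```python
import pandas as pd

def contains_target_values(result: pd.DataFrame, target: pd.DataFrame) -> bool:
    """True if every value in `target` appears somewhere in `result`,
    ignoring column names, row order, and frame shape."""
    result_values = set(map(str, result.to_numpy().ravel()))
    target_values = set(map(str, target.to_numpy().ravel()))
    return target_values.issubset(result_values)

# "How many customers do we have this month?" -- any frame shape passes
# as long as the one number we care about shows up somewhere.
agent_answer = pd.DataFrame({"month": ["2024-05"], "n_customers": [1842]})
target = pd.DataFrame({"answer": [1842]})
assert contains_target_values(agent_answer, target)
```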
[56:12] Bryan Bischof: And actually during my interview loop, people tell me that they can’t think of how to evaluate data science code other than checking if the code’s the same. Well, if I ask you, how many customers do we have this month? Well, the response from the agent may be a different data frame shape, but as long as it has that one number in there somewhere, then that’s good enough for me. This is the kind of code that you write. These are called relaxations. Okay, next up. People fail to include use case experts in eval creation.
[56:47] Bryan Bischof: Your users or experts on the use case of your LLM application, they will understand what good looks like. If you were thinking about rolling out an LLM application without talking to experts about what the final state should look like, I think you’re goofy. Go talk to them. Understand what they want. While we were building Magic Charts, which is our sort of like LLM-generated data visualizations in our platform, we talked to the data team. What are some example prompts to go from existing notebooks into the target chart, the perfect version of the chart?
[57:23] Bryan Bischof: We worked with them to build out a set of them. It looked like this. The prompt: create a bar chart to show the average number of passengers per month using the flight status, blah, blah, blah. And this is the chart that they wanted to see. So what do I check in the output of the LLM? Do I check that it looks like this, pixel to pixel? Hell no. Do I check that it’s selected the right chart type? Yes. Do I check that the time axis is a datetime axis for X? Yes.
[57:52] Bryan Bischof: Do I make sure that it’s doing an aggregate function on the y-axis? Yes. Notice that these are all starting to look like binary evaluations. Next up, people wait too long to make evaluations. It should be part of your dev cycle. This should literally be part of the RFC creation. Here is a literal snippet of the RFC for a recent project where we were working on sort of a new editing model.
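Going back to the chart example for a moment, those checks could be written as a handful of binary assertions over the generated chart spec. This is a hypothetical sketch; the field names are invented and are not Hex's actual schema.

```python
def eval_chart_spec(spec: dict) -> dict:
    """Each check is a binary eval you can track over time."""
    return {
        "correct_chart_type": spec.get("chart_type") == "bar",
        "x_axis_is_datetime": spec.get("x_axis", {}).get("type") == "temporal",
        "y_axis_is_aggregated": spec.get("y_axis", {}).get("aggregate") in {"mean", "avg"},
    }

checks = eval_chart_spec({
    "chart_type": "bar",
    "x_axis": {"field": "flight_date", "type": "temporal"},
    "y_axis": {"field": "passengers", "aggregate": "mean"},
})
assert all(checks.values())
```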
[58:22] Bryan Bischof: This is in the RFC. It’s required to be part of the RFC now, and during the dev cycle, whether you’re a full-stack engineer, an AI engineer, or a product manager on the team, you’re required to be writing and thinking about evals this early in the process. There’s no excuse to get something close to prod and then say, well, I should write some evals. Next, I think people fail to recognize that product metrics and evaluation metrics are similar, but they’re different. You’ll notice this says look at your data.
[58:58] Bryan Bischof: I promise I wrote that before today’s lecture from Hamel and Dan. We are all in alignment that you have to look at your data. But you shouldn’t mistake your product metrics for your evals. They are distinct. Product metrics will give you a lot of intuition about great evals. These production logs aren’t sufficient for building evals. We saw some examples of digging into production logs to give us some intuition for how things are behaving. That’s really important. But that can’t tell me all I need to know for making great evals.
[59:31] Bryan Bischof: What it can tell me is sort of how to build the next set of evals. I don’t personally have access to my customers’ data. I can’t go into their data warehouse and query it. So it’s actually not possible for me to build evals on their warehouse. That’s a good thing. But what I can understand is the kind of questions they ask. And then I can go build data sets and custom environments that I think are representative of those use cases. So this is the push and pull of product metrics and evaluation metrics.
[1:00:06] Bryan Bischof: Buying an evaluation framework doesn’t make it easy. Now, this might feel like a little bit of counter-programming. This is not intended to be a criticism of any specific eval framework. But what I can tell you is, I don’t have an eval framework to sell you. And so let me tell you what I feel about this topic. The hard part of evals is not the library. I built a simple framework on top of unit tests in a few weeks. It lasted nine months. We ultimately had to rewrite it, but that was also one sprint.
[1:00:35] Bryan Bischof: The best eval product is a Jupyter notebook. It doesn’t have to be a Hex Jupyter notebook, just a Jupyter notebook. Interact with your data. Get your hands on the damn data. Be able to slice and dice. Evals are hard because they require you to understand the user stories and the diversity of user inputs. Sorry. And finally, eval companies haven’t put your LLM applications into production. Remember the saying: don’t ask a shoe seller what’s going to help you dunk. Ask Michael Jordan. I’m here to tell you, I’ve put this stuff into production.
[1:01:15] Bryan Bischof: I promise you, you do not need an evaluation framework until you start feeling the pain. Then it’s time to think about it. This is what my evals framework looks like. One class. This is me setting up the idea of assertions: how do you evaluate, in a very flexible and relaxed way, whether the eval namespace, i.e. what the agent responds with and what happens after you evaluate it, and your target namespace look alike. This is all the wrapper code. Please, please, please invest in the things that matter. Don’t invest in complicated integrations early.
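A toy version of that one-class idea, under the assumption that an assertion is just a named function over the two namespaces. The `Grounds` name and API here are invented for illustration and are not the actual Spellgrounds code.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Grounds:
    assertions: list = field(default_factory=list)

    def add(self, name: str, check: Callable[[dict, dict], bool]) -> None:
        self.assertions.append((name, check))

    def run(self, eval_ns: dict, target_ns: dict) -> dict:
        # Each check gets both namespaces and decides, loosely, whether they agree.
        return {name: bool(check(eval_ns, target_ns)) for name, check in self.assertions}

grounds = Grounds()
grounds.add("has_result_df", lambda e, t: "df" in e)
grounds.add("same_row_count", lambda e, t: len(e["df"]) == len(t["df"]))
report = grounds.run(eval_ns={"df": [1, 2, 3]}, target_ns={"df": [3, 2, 1]})
# {'has_result_df': True, 'same_row_count': True}
```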
[1:02:01] Bryan Bischof: Reaching too early for LLM-assisted evaluation. You’re going to hear a lot of people tell you about LLM as a judge. There are no free lunches. I think LLM judging is extremely valuable, but it is not a free lunch. What it’s going to give you is directional metrics, things to look into. It’s going to start giving you some hints. Once you have built some real evals that actually kind of give you important signal, LLM Judge can help you scale.
[1:02:34] Bryan Bischof: LLM Judge can also help you look at production events to kind of turn those into great evals at scale. But you have to do this methodically. Do side-by-side evaluation on new treatments. Use multiple judges, multiple models, multiple shots, and check for human alignment randomly and periodically. The best paper I’ve seen, the best writing I’ve seen on this topic, is by Shreya Shankar. Check it out. This is what it looks like to do LLM judging systematically over time. This is, we made a relatively minor change in our context and we ran seven experiments here.
[1:03:16] Bryan Bischof: And we wanted to see the performance on different versions of the model. And each time, we’re using the judge to tell us what’s better, old or new, old or new. And we’re doing this over 2,000 evals. So this is what it looks like to try to use LLM judge as a tool. Part two, moderating magic. How do you build an eval system yourself? Magic is an AI co-pilot for data science that lives in Hex. It can generate SQL that’s specific and knowledgeable about your data.
[1:03:48] Bryan Bischof: It can string cells together using different kinds of cells to write polyglot code chains: SQL, Python, R, and native charts. It reacts to code edits the user does, or allows the user to ask for edits and fixes specifically. First up, RAG evals. You should evaluate your RAG like a retrieval system. RAG is retrieval. Treat it as such. If you start thinking about chunking and indices and multi-index and hybrid, no. Take a step back, label some data, take the queries from your evals, produce the best documents for those queries, and measure the hit rate.
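A minimal sketch of that measurement: label the best document(s) per query, run your retriever, and compute hit rate at k. `retrieve` is a placeholder for whatever retrieval setup is under test.

```python
def hit_rate_at_k(labeled: dict[str, set[str]], retrieve, k: int = 5) -> float:
    """labeled maps each eval query to the ids of its known-best documents."""
    hits = 0
    for query, relevant_ids in labeled.items():
        retrieved_ids = [doc_id for doc_id, _score in retrieve(query, k=k)]
        hits += int(any(doc_id in relevant_ids for doc_id in retrieved_ids))
    return hits / len(labeled)

# Hold `labeled` fixed and swap out `retrieve` (embedding models, re-rankers,
# chunking schemes) to compare against your current baseline.
```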
[1:04:26] Bryan Bischof: That’s all you need to do to measure RAG. Don’t get too excited about your RAG system unless you have a clear baseline. Your retrieval scores, whether they’re semantic or lexical, aren’t calibrated either. So don’t treat them as confidence estimates. Just a little tip. I’ve gotten burned by this. This is what it looks like to evaluate RAG. On the right-hand side, we’re looking at different re-rankers. On the left-hand side, these are all the embedding models we try for one particular workflow. Don’t be shy. Try a lot of things. Look at the variance in performance over there.
[1:05:03] Bryan Bischof: That should tell you that this is important. Planning evals. For agent-based systems, you have to evaluate your planning. If you’re using a state machine, treat it like a classifier. Check its choice at every step. If you’re asking planning to include downstream prompt generation, the quality of those downstream prompts is probably shit. They were for us, at least. It took me a non-trivial amount of effort to get the planning prompts to be anywhere close to what the human prompts look like. Don’t forget to evaluate that. Agent-specific evals. This is the easiest kind of eval. You have an agent.
[1:05:39] Bryan Bischof: It does a specific thing. Good. Test it. Think about structured output. How often can you get the agents to respond with structured output? And if you do, how can you tie the agent relationships together with tightly specced-out API interfaces, and then evaluate that consistency? Final stage evals. A lot of agent chains need a wrap-up or a summary. Don’t forget to eval this too. Sometimes you get kind of stupid shit. You’ll get the summary talking about things that the agent never did. That’s terrible as a user experience. Trust me, I’ve accidentally done this.
[1:06:19] Bryan Bischof: Too much context about the agent chain can make these quite bad. Don’t serve everything from the entire workflow to the final stage eval. It will be noisy. And frankly, the tokens get kind of crazy. Finally, experiments in LLM generation are repeated-measures designs. They’re not A/B tests. When you make updates and changes and bug fixes to your agent’s workflow, treat it like an experiment and measure thusly. Doing better on your evals is a good check, but you have to test for significance. And don’t be afraid to rerun production events through the new treatment.
[1:07:01] Bryan Bischof: Send historical production events through the new treatment and use automated evals to compare. You’re logging your production events, aren’t you?
[1:07:10] Bryan Bischof: You didn’t forget to do that, did you?
[1:07:14] Bryan Bischof: This is what it looks like to run experiments. On the left-hand side, this is repeated versions, variants of a different approach. And you can see the error rate is going down on historical events. On the right-hand side, that’s LLM as a judge. Don’t sleep on this. And then I have one bonus lesson for you. Production endpoints minimize drift. I thought, I’ve got this production system. I’m going to build a clone of the production system in my evals framework. so that I can keep things as similar as possible to production.
[1:07:55] Bryan Bischof: And some of you are already shaking your heads: Bryan, you’re an idiot, that’s never going to work. Of course it didn’t work. We had to refactor our evals framework because we built a clone. Tightly coupled systems that aren’t actually identical: this is standard software engineering. Like, don’t do it. And I did it. And I regret it. Don’t be like me. Make your evals framework directly connect to your production environment. Make them endpoints. Call those endpoints. Use that as the same backbone that calls every single thing in your evals framework.
[1:08:32] Bryan Bischof: But make sure that every step is exposed. Be able to hook in halfway through that workflow, make modifications upstream, and see what happens in the end. That’s how you keep these tightly coupled in a sane way. That’s what I have for you today. You can find us at hex.tech. You can find me on Twitter at BEBischof or on LinkedIn, Bryan Bischof. I’ll pause there for any commentary, but I think we’re on a tight schedule today.
[1:09:03] Hamel Husain: Okay, I’ll ask you one question from the audience. Of course. How are you organizing all your unit tests? Where are you running them? Local GitHub actions, et cetera. And where are you logging them to in practice?
[1:09:15] Bryan Bischof: Yeah, so we are very fortunate that Hex notebooks can be scheduled. And so I run Jupyter notebooks that are scheduled. So believe it or not, I orchestrate and run all of my evals in Jupyter notebooks. No asterisk there. That means that, A, they are incredibly reproducible. B, if I go back to an individual experiment and I say, huh, that performance is very different than I expect, I can pull the individual eval logs and literally look at every single response from the agent.
[1:09:50] Bryan Bischof: That’s an amount of power that no other system is going to afford. Now, granted, you could set this up to log to your data warehouse, and you could use a downstream sort of notebook to, like, evaluate all this. But frankly, like, I’m very fortunate that I just have access to Hex like this. So I just do it in Hex.
[1:10:10] Hamel Husain: That’s really fascinating. That should be a whole podcast in itself, about notebooks in production. And like, I would love to probably talk about that.
[1:10:20] Harrison Chase: Happy to.
[1:10:21] Hamel Husain: Okay, no, that’s good. We have a lot of questions, but I think for time management sake, I guess like we can, what do you think, Dan? Should we do one more or should we move on?
[1:10:32] Dan Becker: Yeah, let’s do one more.
[1:10:33] Hamel Husain: Okay. Okay, so Alex Strick asks: for unit tests on their own, they take a very short time to run, but lots of these individual tests together take a really long time. Do you have any tips on batching these together? Is there a good framework for running through a bunch of these queries in parallel? And he says, I have a bunch of these unit tests in my code base, but it still takes a few minutes to run through them sequentially.
[1:10:52] Bryan Bischof: Yeah, so we have a batch version of the evals that runs on a schedule, you know, it runs on a Friday and I collect my results later on. But actually, again, I’ll harken back to the good old days of data science. Don’t be afraid to do two things: one, look at the sample that is most concerning to you, and two, don’t be afraid to use bootstrap
[1:11:22] Bryan Bischof: sampling. Like bootstrap sampling gives you a really good picture of the overall like behavior of the population from a couple random samples. Do that. That’s a great way to get early signal on these things. But also, let’s be real. Some of my evals almost never fail. Some of my evals fail every time. And that’s a good thing. If your eval suite is 100% passing, your eval suite is not hard enough. Make it harder. You should be… really targeting evals that succeed 60 to 70% of the time.
[1:11:55] Bryan Bischof: Because otherwise, you’re missing out on the signal of improvement. How will you even know if you’ve made anything better? And so the way I tend to look at this is I tend to look at the marginal ones, which ones flip when I change things. If I can’t get them to change, actually, to be honest, I’m eventually going to delete them. They’re not doing much if they’re never failing. If we’ve moved on in the world from GPT-3.5 quality on a certain test, well, then why do I need to test it every time?
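A small sketch of the bootstrap idea Bryan mentions: estimate the pass rate and its spread from a random subset of eval results instead of rerunning the whole suite.

```python
import random

def bootstrap_pass_rate(results: list[bool], n_boot: int = 1000) -> tuple[float, float]:
    """Return an approximate 95% interval for the pass rate of a set of evals."""
    rates = []
    for _ in range(n_boot):
        sample = random.choices(results, k=len(results))  # resample with replacement
        rates.append(sum(sample) / len(sample))
    rates.sort()
    return rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]
```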
[1:12:24] Hamel Husain: That makes sense. As the scientists like to say, if there’s no variance, there’s no signal. Okay, that’s great. That was a really great presentation, Bryan. I think people are commenting that they really love this presentation, by the way.
[1:12:40] Bryan Bischof: Glad to hear it.
[1:12:41] Hamel Husain: So the next guest that we’re going to have is Eugene Yan. He’s also… a good friend who I’ve known for some time now. He is quite prolific and writes a lot about machine learning, large language models, you name it. The thing that he will be… Sorry, Eugene is a senior machine learning scientist. I hope I’m not getting the title wrong, at Amazon. I’ll let him correct that. But what he’s going to be talking today about is… So a lot of people are asking like what metrics should you use?
[1:13:17] Hamel Husain: We’re talking about like, okay, measuring, writing evals, having metrics, but I haven’t like really gone into metrics. So Eugene is actually going to talk about that in more detail. Eugene.
[1:13:29] Eugene Yan: Thank you, Hamel. That’s way too kind. All right, everyone, we are almost at the one-and-a-half-hour mark, and I will be going fast, and there’s going to be a lot of code. In fact, it’s just all code and graphs, so I hope you all pay attention. I’m dropping the notebooks that I will be sharing in the Discord right now. So don’t bother to take notes, don’t bother to do screenshots; all of this is fully available. So all I have right now is six cells, four slides, and three notebooks. Alright, let’s go.
[1:13:59] Eugene Yan: So the question I have is how do we evaluate summaries on factual inconsistency or hallucination? Right, everyone’s using some kind of LLM to summarize stuff, but how do you know if it’s actually correct? Well, if you actually look at it via human eval, you’ll find that hallucinations happen about 5 to 10% of the time, and sometimes it’s pretty bad. So the way we can do this is we can evaluate the model, right? The input is a source document and the summary, and optionally the label, if you’re fine tuning on it.
[1:14:30] Eugene Yan: And the output is the probability of the summary being factually inconsistent. So we can frame this as a natural language inference task. Natural language inference is a classical NLP task whereby, given some premise and a hypothesis, we have a label of entailment, neutral, or contradiction. So imagine the premise is: John likes all fruits. The hypothesis that John likes apples is entailment. And the hypothesis that John dislikes apples is contradiction. And of course, we also have neutral, which is when we don’t have enough information to tell whether it’s correct or not.
[1:15:04] Eugene Yan: Now, if we apply natural language inference to factual inconsistency detection, what happens is that we can use contradiction, the contradiction label as factual inconsistency. So imagine we have a document, maybe this is some talk abstract for a previous talk I did. Eugene’s talk is about building LLM systems, et cetera. So the summary is the talk is about LLMs, that would be entailment. And the summary of the talk being about apples, that would be contradiction. So then all we have to do is to get the probability of contradiction. And there we have it.
[1:15:33] Eugene Yan: We have a factual inconsistency classifier or hallucination detector evaluator model. All right. So my objective here is I’m going to show you how to fine tune an evaluator model that can catch hallucinations on the factual inconsistency benchmark. This is my new Kaggle. Then we’re going to eval the evaluator model through each epoch. And we’re going to see how we can blend data to make this evaluator model way better. Then, optionally, you can use this evaluator model to then eval generative models.
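To make the framing concrete, here is a minimal sketch of scoring p(contradiction) for a (document, summary) pair with an off-the-shelf MNLI checkpoint from Hugging Face. The checkpoint is illustrative, not necessarily the one fine-tuned in the notebooks, and label order varies by checkpoint, hence the label2id lookup.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "facebook/bart-large-mnli"  # any MNLI-tuned checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def p_contradiction(document: str, summary: str) -> float:
    # premise = source document, hypothesis = summary
    inputs = tokenizer(document, summary, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    probs = torch.softmax(logits, dim=-1)
    return probs[model.config.label2id["contradiction"]].item()

# A higher p_contradiction suggests the summary is factually inconsistent
# with the source document.
```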
[1:16:09] Eugene Yan: I know it’s kind of meta, but we’re going to eval the evaluator model, which then evals generative models. And then you can also use this evaluator model as a guardrail, right? You’re going to summarize things in production, and then you can check, hey, is the summary factually consistent or not? We’re going to see how this happens. So this is a three-body problem. We’ll first examine, prepare, and split our data. Note, this is completely coincidental.
[1:16:34] Eugene Yan: I have not spoken to the instructors about the importance of looking at your data, but everyone has mentioned that, and I do have a notebook for that. After that, we’re going to fine-tune on the factual inconsistency benchmark, and then we’re going to blend in data from the unified summarization benchmark and see how that helps. Okay, I have a few appendices here. I have three write-ups on LLM evals, hallucination detection, and out-of-domain fine-tuning, all of which are compressed into a 15-minute session for you here.
[1:17:06] Eugene Yan: And I also have some slides on evals and fine-tuning and how they are two sides of the same coin. So next, let’s prepare some data. So over here, we have the factual inconsistency benchmark. It contains one-sentence summaries from the CNN/Daily Mail and XSum news articles. We exclude the CNN/Daily Mail data because it is pretty bad. I will not show you how bad it is, but you can look at it yourself. So now here is an example of the XSum data.
[1:17:33] Eugene Yan: The first two rows, you see that they have the same input and the next two rows have the same input. So how this looks like over here is that for the same input, we have a choice that is inconsistent and we have a choice that is consistent. So if you… If you look really hard at this, you will be able to understand why is it inconsistent or consistent. But if you just briefly glance through it, you might actually miss it.
[1:17:56] Eugene Yan: And a lot of times I’ve been looking at this myself and I’m like, wow, this is a really difficult data set. So here’s the CNN Daily Mail data set. And you can see it’s full of XML tags and quite a number of them actually have ads in there. So actually, we just discard them. So FIB starts with about 3,600.
[1:18:16] Eugene Yan: rows of data. After we excluded the CNN/Daily Mail data set, we are still left with 3,100 rows. The authors of this data set didn’t even bother using CNN too much, because I think they just labeled 100 and realized it’s actually not that good, so they invested more of their resources on XSum. So that’s the FIB data set. Eventually, what we’re going to have is, and of course we split the data, we’re going to group it by input. This ensures that
[1:18:46] Eugene Yan: the same article doesn’t appear across train and val. So we want to make sure that the same data only appears in either train, val, or test, so there’s no data leakage. And then, of course, we try to balance the data, so there’s one positive summary and one negative summary. Eventually, what happens is we have 700 data points for training, of which 350 are positive and 350 negative, and our val set is only 150. So that’s the factual inconsistency benchmark. Next, let’s look at the unified summarization benchmark.
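Before moving on, a minimal sketch of that grouped split (the `input` column name is hypothetical); the point is to split on the source article so the same article never lands in more than one of train, val, or test.

```python
from sklearn.model_selection import GroupShuffleSplit

def grouped_split(df, group_col="input", test_size=0.2, seed=42):
    """Split rows so that all rows sharing the same group stay on one side."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    return df.iloc[train_idx], df.iloc[test_idx]

# Apply it twice: once to carve out the test set, then again on the remainder
# to get train and val.
```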
[1:19:18] Eugene Yan: The unified summarization benchmark is slightly different. It’s based on summaries of Wikipedia articles. So you can see over here, again, the first two rows, they are the same. And the next two rows, they are the same. And when you look at it very carefully, let’s look at the second row here, it actually points out what the inconsistency is. And it’s highlighted, it’s surrounded in the red boxes, which is the summary actually includes this actual data which is not in the source.
[1:19:45] Eugene Yan: But there are times, like over here, where it took me quite a while to spot what the difference is. And the difference here is the word “the”. And even though not having the word “the” is not a hallucination problem, it’s more of a grammatical or sentence structure problem, it is part of the data set. So, suffice to say, I didn’t clean this up, but it just goes to tell you that you do need to look at your data when you’re fine-tuning, to try to understand the quality of it.
[1:20:11] Eugene Yan: Okay, so we get the data, we try to prepare it, and what happens again is this, we have a summary sentence where the label of zero is correct and the label of one is incorrect. And of course, we do the same thing, train test, val split, et cetera. In the end, this is the amount of data we get. So we’re going to first look at the factual inconsistency benchmark. So over here, we load the data. We do some, we tokenize them all up front in a batch.
[1:20:42] Eugene Yan: And then over here, we’re going to fine-tune our model. The model we’re going to fine-tune is a distilled BART; essentially, you can think of it as an encoder-decoder version of BERT, from Meta, but it’s also fine-tuned on MNLI, which is multi-genre natural language inference data. And of course, we have our parameters here, nothing too interesting. We use LoRA; we apply LoRA on the QKV and output projections, etc. In the end, the proportion of trainable parameters is less than 3%.
[1:21:11] Eugene Yan: This allows us to fit this on a very nice small GPU. So over here, I have some custom metrics that I track. These would be akin to the callbacks that Wing mentioned during the Axolotl session. So over here, at each epoch, or at each eval, firstly, I pre-process the logits. If you recall, NLI actually produces three probabilities. What I do is I only care about the entailment and the contradiction probabilities, and I do a softmax on them, so the probabilities sum to one.
[1:21:45] Eugene Yan: And then I just get the probability of contradiction. So essentially that’s what I care about. And then I also compute some custom metrics that I have, essentially PR AUC, ROC AUC, recall, and precision. For recall and precision, I have an arbitrary threshold of 0.8, where I want the model to be pretty confident. Okay, so I have a custom trainer here. I won’t go into it right now, but I’ll go into it in the next notebook.
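A sketch of those two hooks with scikit-learn: collapse the three NLI logits into a single p(contradiction), then compute PR AUC, ROC AUC, and recall/precision at the 0.8 threshold. The label indices are assumptions; check the checkpoint's id2label before trusting them.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_score,
                             recall_score, roc_auc_score)

ENTAIL_IDX, CONTRA_IDX = 2, 0  # assumed label order; verify via model.config.id2label

def preprocess_logits(logits: np.ndarray) -> np.ndarray:
    """Softmax over just the entailment and contradiction logits; return p(contradiction)."""
    two_class = logits[:, [ENTAIL_IDX, CONTRA_IDX]]
    exp = np.exp(two_class - two_class.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs[:, 1]

def compute_metrics(p_contra: np.ndarray, labels: np.ndarray, threshold: float = 0.8) -> dict:
    preds = (p_contra >= threshold).astype(int)
    return {
        "pr_auc": average_precision_score(labels, p_contra),
        "roc_auc": roc_auc_score(labels, p_contra),
        "recall@0.8": recall_score(labels, preds, zero_division=0),
        "precision@0.8": precision_score(labels, preds, zero_division=0),
    }
```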
[1:22:16] Eugene Yan: So again, some standard training arguments, nothing special here. And of course, I also have some plotting code here, again nothing special. So let’s look at how the model performs before any fine-tuning. Before any fine-tuning, you see that ROC AUC is at 0.56. If you recall, an ROC AUC of 0.5 means that it’s really a coin flip and the model is not doing anything better than chance. So you can see, okay, ROC AUC is a coin flip. And what I really love is the graph all the way on the right, which is the overlaps of red
[1:22:45] Eugene Yan: and greens. The reds are where we have the ground truth and know the summary is inconsistent, and the greens are what we know is consistent. And we can see the overlap, and you can see that in this case the model just cannot distinguish between them before any fine-tuning. Now note that this model has already been fine-tuned on MNLI, but it is still doing a pretty bad job. So now let’s start fine-tuning. And so you can see these are the custom metrics I have.
[1:23:11] Eugene Yan: Usually if you don’t have any custom metrics, you probably only get training loss and eval loss. But I do have extra metrics, such as PR AUC, ROC AUC, recall, and precision. So let’s just focus on ROC AUC. You can see that ROC AUC increases from 0.56, which we have over here, to 0.65. Not too bad, but at least it suggests that the model is learning. But what is very disappointing, though, is that the recall at 0.8 does not go above 10%. So in this case, this is just not usable.
[1:23:45] Eugene Yan: This model just cannot identify factual inconsistencies. So we check the evals after fine-tuning. This is on the training set. We train on this, and we just check on this to make sure that our model can learn. We see ROC AUC is 0.71, and we start to see a little bit of separation of the distributions. But on the validation set, ROC AUC is 0.66, not that good. And the separation is pretty bad.
[1:24:12] Eugene Yan: And over here on the test set, which we have never seen and which we are not using to pick our checkpoint, ROC AUC is only slightly different; I think it’s not statistically significant. But you can see the separation of the distributions is pretty bad. So next, let’s see how we can fine-tune on a different data set. We fine-tune on the unified summarization benchmark, followed by the factual inconsistency benchmark. So the factual inconsistency benchmark in this case is the data that we care about.
[1:24:42] Eugene Yan: You can imagine in your own use case, you would have some kind of summarization data, and all you care about is how it performs on that data. So what you can do is you can take all these open-source, permissive-use datasets, which both of these are, and you can use them to bootstrap your own model by blending in the data. And we’ll see how you do that here.
[1:25:01] Eugene Yan: So again, to remind you about our data: for the factual inconsistency benchmark, we only have 700 training samples, but for the unified summarization benchmark, we have 5,000 training samples. And I deliberately split a big chunk of it into the validation set, right? So you can see that USB has almost 10x more data than the factual inconsistency benchmark. So by using some kind of external data, you can improve your own models. And in this case, what I care about is hallucination detection on summarization.
[1:25:36] Eugene Yan: So again, we do some kind of batch tokenization, so we don’t have to tokenize on the fly, and we set up our models, nothing special here. So what is this custom trainer? This custom trainer exists because at the point when I was trying to do this, about six or seven months ago, the Hugging Face Trainer didn’t allow for evaluation on multiple datasets. So this is really just copying the Trainer code and overriding some of the methods it has, specifically the eval and eval-logging methods. You can just use this code if you want to.
[1:26:12] Eugene Yan: So how it looks here is that without the custom trainer, all you could do was this: you provided a single eval dataset. But with the custom trainer, what you can do now is this: you can provide a dictionary of datasets as the eval dataset, and it will just work. So again, we have our usual visualization code, and this is the same thing we saw before: for FIB, the data set we care about, ROC AUC is 0.56.
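For reference, recent versions of the Hugging Face Trainer accept a dictionary of eval datasets directly (per-dataset metrics get the dict key as a prefix), which is roughly what the custom trainer emulated. Version support is an assumption, and the variable names below (model, usb_train, usb_val, fib_val) are stand-ins for objects defined earlier in the notebook.

```python
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", evaluation_strategy="epoch"),
    train_dataset=usb_train,                        # blend: train on USB first...
    eval_dataset={"usb": usb_val, "fib": fib_val},  # ...while tracking both benchmarks
)
# Metrics then come back per dataset, e.g. eval_usb_* and eval_fib_*.
```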
[1:26:51] Eugene Yan: And this one is for USB: ROC AUC 0.66, a little bit better. But the divergence looks a little bit funky. In the previous case, you can see it was a little bit too strict; most of the probability mass was around 0.75. Over here, it’s a little bit too lenient; most of the probability was close to 1. So let’s look at this. What we’re going to do is fine-tune this model. And let’s just focus on the USB metrics first, which are USB PR AUC, ROC AUC, and so on.
[1:27:17] Eugene Yan: You can see that the USB ROC AUC very quickly ramps up from 0.76 to 0.94. That’s pretty freaking amazing. And recall and precision are 0.75 and 0.93. That’s really good. I don’t think your data in production is going to be as clean and that you can achieve something like this. We also probably want to tweak the threshold to bias towards either recall or precision. But let’s look at FIB’s ROC AUC.
[1:27:50] Eugene Yan: And then what I’m going to do is split the screen here for a second. So you can see previously FIB’s ROC AUC went up to 0.65. Over here, by solely fine-tuning on USB, not a drop of FIB data, we could get the FIB ROC AUC to go up to 0.64 or 0.63. Like, what the heck is going on, man? I mean, it’s the same task, but these are completely different domains. But the thing is, look at the recall.
[1:28:23] Eugene Yan: The recall has gone up from 8% to 25%, 25%, et cetera, et cetera. That’s a huge world of difference from what we were seeing previously, right? Where we were stuck at 6%, 4%. So what we’re going to do is we fine-tune on this, and we can see that it does superbly on USB. Over here, you can see that it does superbly on USB, and this is the validation set. You can see, oh my goodness, the ROC AUC is freaking amazing. And look at the divergence, right?
[1:28:57] Eugene Yan: Over here, you could probably just cut at 0.8 or 0.9 and you have the clearly factually inconsistent ones. Or over here, you cut at 0.1 and you have the clearly factually consistent ones. It depends on how conservative or how lenient you want to be with your false positives or false negatives. And now let’s look at it on the validation set
[1:29:27] Eugene Yan: It’s like 0.66 versus 0.64. The separation is not that good. So we may feel like, hey, it didn’t work here. But now let’s give it the same 10 epochs of FIP data, which it has never seen before. the same training. Okay, so you can see the same training over here. Previously, ROCAUC was 0.65 and recall never got above 0.6. Now with this, you can see FIB ROC AUC and we don’t actually have to measure it on USB.
[1:29:59] Eugene Yan: I just did, because I was curious how much adding in additional data would cost us in terms of an alignment tax on USB. You don’t actually have to track this, but I did. Oh my god, the ROC AUC immediately started at 0.75 and then went up to 0.86. And look at the recall, it’s like 10x higher than what we had previously, 0.57, and with precision, at the 0.8 threshold, of
[1:30:22] Eugene Yan: 0.93. That is insane, right? So all in all, these are some of the metrics I would use to evaluate a classifier model. So here’s how it looks in the end on our FIB test set. You can see the test set with only FIB data on the left, and here’s the test set with FIB data and USB data on the right. And with maybe a little bit more data, we can probably get this model to higher recall, maybe 0.8, and I think a good balance is 0.8 and 0.8
[1:30:57] Eugene Yan: in terms of this evaluator model’s ability to catch hallucinations. So now you can use this to evaluate your generative model, right? I mean, we have certain n-gram metrics like ROUGE and METEOR, or you can use LLM as a judge, which is very expensive, but this model is super fast. Every query is like 20 milliseconds. Super fast, super scalable, and it’s fully within your control. And now what I will ask you is: how could you fine-tune evaluator models to help evaluate on your task, or evaluate summaries on relevance or information density?
[1:31:38] Eugene Yan: And OK, that’s all I had. So now, so long story short, this is my evaluator model to evaluate my summaries on factual inconsistency.
[1:31:48] Eugene Yan: And eventually there will be more evaluator models to evaluate on relevance and informational density, etc. So now go and fine-tune your own evaluator models. That’s all I had.
[1:32:04] Hamel Husain: That’s really great, Eugene. I think that was a really great run-through of a lot of different things that you might want to think about, with concrete examples. I really recommend everybody check out these notebooks. He went through them a little bit fast, but you know, this is recorded. You can always slow it down. You can step through these notebooks. Also, at times Eugene shared some additional resources, like his writing, and that really gives a lot of color to these things. I highly recommend you check all of these things out.
[1:32:32] Hamel Husain: Um, you know, make sure, just make sure you don’t skip that. Cause I, I really enjoyed that, those writings myself. Um, let’s see if there’s a, let’s see if there’s a question. Let me open up the Q&A. Okay, someone’s asking, how do you think about evaluating agents?
[1:33:07] Eugene Yan: That’s a good question. I would just echo what Bryan said, which is that evaluating agents is a step-by-step process. I would break it down. I actually have a recent post that I just wrote over the weekend, which is essentially about prompting, but I can talk about how we evaluate agents here. So one part of it is you can split the catch-all prompt into multiple smaller ones. So here’s an initial prompt where we try to extract some data. We try to summarize a transcript, right?
[1:33:37] Eugene Yan: So over here, you can break down a transcript and extract the list of decisions, action items, and owners. This is just a classification metric, right? Simple. Now, over here, the second step, we ask it to check the transcript and the extracted information to make sure it’s factually consistent. Again, a classification metric. Now, the final one is: given the extracted information, write it out into paragraphs or bullet points. This is an information density, relevance, and writing eloquence question. So I know...
[1:34:10] Dan Becker: Hey, Eugene, do you mean to be sharing your screen?
[1:34:13] Eugene Yan: Oh, shoot. I did, I did. It’s okay, I did that earlier as well. It’s no problem. Sorry, guys. So here’s how I would do it: imagine you have an agent, again, to summarize meeting transcripts; that’s the simplest example. Over here, you could evaluate how well the agent is extracting a list of decisions, action items, and owners. Again, this is just a classification, right, or extraction; it’s just precision and recall. Now, again, another one is to check the extracted information against the transcript.
[1:34:45] Eugene Yan: You can imagine that this is the factual inconsistency model we talked about: think step by step and check if the extracted information is actually consistent. Again, this is a classification task. Now finally, you are asked to rewrite the information into bullet-point summaries. You can think of this as an information density task or writing style task, which is a little bit softer, not as straightforward to measure as a classification, but maybe a reward model might work. So that’s how I would do it.
[1:35:13] Eugene Yan: And AlphaCodium had a really great post, a really good piece, where they actually split up code generation, right?
[1:35:21] Eugene Yan: into multiple steps. You can see each of these steps here, and you could probably evaluate each step. So that’s a long-winded answer to how I might evaluate agents: the same way I would evaluate a multi-step workflow.
[1:35:39] Hamel Husain: That’s great. Okay, well, thank you, Eugene. For the next speaker we have, let me just give you some background. A lot of times when I talk about evals, as you can tell, there’s a lot of information about evals: how to organize them, what tools you should use, metrics, workflow, things like that. What I always find with all of my clients is that people get stuck. Like, how do you write evals? Where do I even begin? How do I think about it?
[1:36:12] Hamel Husain: And they can maybe write one test or two tests, but then they kind of have a mental block. And… I think like we’re still in like a very nascent period of all this tooling and like how to go about thinking about evaluations. Shreya, who is, you know, who is a prolific researcher in this area of like LLM ops and also specifically evals, has done a lot of research on things like UX, workflow, developer tools and tools more generally. She’s been doing research for a really long time.
[1:36:48] Hamel Husain: even prior to large language models on machine learning tools and workflows and things like that. And so I’m going to hand it over to Shreya. Shreya is going to walk through some of the research that she’s done around workflows, around evals, and ways to think about evals that I think is really helpful for everybody. So with that, I’ll give it over to Shreya. Great intro.
[1:37:10] Shreya Shankar: Super nice of you. I think Eugene… You have to stop sharing so I can share. Sorry to boot you off.
[1:37:22] Eugene Yan: Shoot, I didn’t know I was still sharing.
[1:37:25] Shreya Shankar: All good. Okay, so let me know if you can or can’t see my screen.
[1:37:32] Hamel Husain: Yeah, I can.
[1:37:33] Shreya Shankar: All good? Great. So today I’m going to give a pretty short talk on some recent research that I’ve been doing with many amazing collaborators across many different institutions and companies. And the theme of the talk is, you know, how can we use LLMs or AI to scale up our own human decision functions? How do we scale up the vibe checks that we know and trust with LLMs?
[1:38:01] Shreya Shankar: And I have a brief intro slide in case people don’t know who I am, or in case Hamel didn’t intro me, but I can skip this. Basically, I am a PhD student. This talk will be a little bit more high level. I study
[1:38:15] Shreya Shankar: how people write LLM pipelines, how people evaluate them, and how we can improve the experience for everyone. I also do ML engineering at an AI startup, and I like to think about, you know, how can we work with data at reasonable scale, how can we ensure good data quality, and anything around the people in ML and LLM ops. So without further ado, I will get into my talk. I don’t need to go over this, probably; I mean, you’ve been in this workshop for so long. But you know, we really, really like LLM pipelines
[1:38:48] Shreya Shankar: because their zero-shot capabilities can enable intelligent pipelines without having to train models. So if you take a look at a bunch of prompt templates from LangSmith, you’ll see that people are doing all sorts of stuff, you know: they’re using LLMs to write code reviews, they’re using LLMs to convert YouTube transcripts to articles. And when you write a prompt template or pipeline around this, and I’ll illustrate with this figure, say we have this YouTube-transcript-to-blog-post pipeline, we might feed in some input document, which is a YouTube transcript, put it in some
[1:39:21] Shreya Shankar: prompt template that has some instructions, and then expect to get some transcript, or sorry, some blog post out of the entire pipeline. So this all looks great, but the problem is when you try to do it at scale, the LLM doesn’t always listen to instructions. So maybe you have an instruction around HTML structure, and one out of 10 times it… doesn’t output in proper HTML structure. Or maybe you have a sentence that says, avoid copying sentences directly from the transcript, but somewhere in the middle of the document, even GPT-4 might…
[1:39:54] Shreya Shankar: you know, output the exact same sentence verbatim. And the theme here is, you know, no matter what, even supposing we fine-tune these LLMs on our tasks, there’s no guarantee that they’re going to listen to every single instruction that we’ve included in the prompt. We need some form of guardrails or assertions or evals to be able to quantify, you know, how well does the LLM do our task and listen to what we define as good or bad. And that’s where our kind of vibe checks and rules and guardrails come in.
[1:40:28] Shreya Shankar: And an insight from traditional ML is to simply put rules and guardrails around the model to detect bad outputs and correct them, or even rerun the pipeline. But this is really hard to do for LLMs. This goes back to what Hamel mentioned before. It’s difficult to even get started thinking about what does accuracy or good even mean for your specific task, your outputs. Maybe 10 of us are trying to write pipelines that are converting YouTube articles,
[1:40:53] Shreya Shankar: or, say, YouTube transcripts to articles, but maybe we all want different formats, or we want the response to do something slightly different, right? So our vibe checks are all going to be different, making this a pretty hard problem. You can’t just use what somebody tells you off the shelf. You’ve got to come up with your metrics yourself. Metrics might be complicated, requiring humans or even LLMs to evaluate, say, something like tone: if we want the tone of a blog post to be informal, or we want it to not sound like an AI,
[1:41:20] Shreya Shankar: right, how do you encode that into something to evaluate? And every prompt, task, and application is different. All of us are going to have different metrics, even if we’re trying to do the same task, or different implementations of the metrics if we’re trying to do the same task. And I like to think about these vibe checks, or guardrails, or evals, or whatever you want to call them, in general along a scale of generic to task-specific.
[1:41:46] Shreya Shankar: On one hand, we’ve got common NLP metrics that model providers talk about when they release new models, which is great, but doesn’t really tell us how well those models are going to do for our custom tasks. You’ve got something in the middle where we know of good metrics for common architectures. For example, for RAG pipelines, we know that faithfulness is a good metric from the RAGAS paper. So that’s great. But we really also want to be pushing towards these task-specific metrics.
[1:42:15] Shreya Shankar: So if you have an exact structure that you want your output to follow, not just JSON, but a specific, you know, you want at least two of those JSON keys to follow the same pattern, you want some finer-grained constraints on that, you know, that goes more towards vibe checks. I showed you one axis in the previous slide, from generic to task-specific, but there’s also another to consider, which is how simple or scalable the method is. Stress-testing prompts in ChatGPT doesn’t really scale, especially in production.
[1:42:45] Shreya Shankar: And then fine-tuning evaluator models is pretty high effort, because you have to constantly be collecting data and determining whether this is good or bad, and then be able to fine-tune the models. Vibe checks performed by humans don’t scale, but we shouldn’t discount them, because most people do this and they’re quite effective, especially in the early days of prompt engineering. And what we really want to do is move these vibe checks towards the upper right quadrant and codify them. You can call them validators, you can call them assertions, you can call them guardrails.
[1:43:15] Shreya Shankar: I honestly don’t know what to call them, but the idea here is to have a set of task-specific constraints or guidelines that you feel confident align with what you think is good for your task. So a lot of our recent research has been in developing evaluation assistants, which are tools that aid humans in creating these task-specific evaluations and assertions that align with how they would grade.
[1:43:40] Shreya Shankar: So in my talk I’m going to briefly cover, you know, what we want to think about and how you can build your own evaluation assistants. The key idea here is to use LLMs to scale, not replace, your own judgments and decisions. And I’ll talk a little bit around different parts of the workflows and how we can use LLMs to do that. I’ll start with how we can use LLMs to auto-generate criteria and various implementations of that criteria.
[1:44:06] Shreya Shankar: And then I’ll talk about a recent mixed-initiative interface that we built to develop custom assertions, and some lessons that we learned from, you know, briefly prototyping this interface with a bunch of LLM experts. All right, so let me jump into bootstrapping criteria. Okay, so let’s pose a concrete example here. Let’s say we have a document summarization pipeline, but the summarization pipeline is for medical documents.
[1:44:34] Shreya Shankar: So there are some additional examples, or sorry, additional instructions, like return your answer in markdown, because you want it to be a report that your doctors are going to read in a custom interface. And maybe you also have some instructions like don’t include any sensitive information like race or gender, and have a professional tone. So you can see how these get pretty specific towards the end user.
[1:45:10] Shreya Shankar: But the challenge really is coming up with the criteria. What are the columns that you want here? And then good ways to implement these criteria. Some of them, for example, can be implemented with code. Some of them might not be; you might need to use an LLM to evaluate something like professional tone. And engineering that prompt itself can be hard, right? You’re already engineering your main prompt, so engineering the validators’ prompts is a little bit excessive.
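To make the code-versus-LLM split concrete, here is a rough sketch of what a few of those validators might look like; the function names, blocked terms, prompt, and model choice are illustrative assumptions, not anything from the talk:

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def assert_is_markdown(summary: str) -> bool:
    # Cheap code-based check: at least one Markdown heading or bullet.
    return bool(re.search(r"^(#{1,6} |[-*] )", summary, flags=re.MULTILINE))

def assert_no_sensitive_terms(summary: str) -> bool:
    # Crude keyword screen; a real pipeline would use a proper detector.
    blocked = ["race", "gender", "ethnicity"]
    return not any(term in summary.lower() for term in blocked)

def assert_professional_tone(summary: str) -> bool:
    # Fuzzy criterion: delegate to an LLM judge with a constrained yes/no answer.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": "Does the following medical summary use a professional tone? "
                       "Answer only 'yes' or 'no'.\n\n" + summary,
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```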
[1:45:39] Shreya Shankar: So how can we enable humans to efficiently come up with good and bad examples of professional tone to seed the validator prompts, for example? So an overview of this problem, which we talk about in our SPADE paper, is: how can we generate a small set of assertions that have good coverage of what humans think are bad outputs and also have good accuracy? The challenges here are: how do we find the right assertion criteria desired by the developer, and how do we guarantee coverage of failures with a small number of assertions?
[1:46:14] Shreya Shankar: We don’t want to give you thousands of assertions to run in production, thousands of guardrails, because monitoring or visualizing that would be a mess. And our SPADE system employs a two-step workflow to do this. First we generate a bunch of candidate assertions with LLMs, and then we filter them based on human preferences. So our insight here, for how to generate custom assertion criteria, is that the criteria are hidden in prompt version history.
[1:46:41] Shreya Shankar: So when humans are iterating on and improving the prompt, we can tell what it is that they care about and what unique mistakes the LLM makes. Maybe their document summarization pipeline starts out with a template like this, which is very common; a lot of doc summarization pipelines will start out with a prompt like this. And then, when trying it on their data, they might notice that sensitive information is included in the summary, and the specific application developer doesn’t like that. So they add an instruction that says don’t include sensitive information in the summary.
[1:47:12] Shreya Shankar: And maybe we might see another human-generated prompt delta, or prompt edit, that says, do not under any circumstances include sensitive information. So what does this tell us? This tells us that the LLM is kind of bad at determining what sensitive information means, doesn’t listen to that instruction; you can come up with a lot of conclusions there, and so forth, right? You can imagine looking at how humans evolve their prompts to determine what it is they care about and to what magnitude, right?
[1:47:40] Shreya Shankar: If you edit the same line maybe 15 times, maybe that’s a sign that the LLM is a bit worse there than in other places. So what we did here, to build the first part of the SPADE assertion generator, was to look at a bunch of prompt templates across different domains and categorize all of the edits people made to those prompts. And we came up with this taxonomy. Some examples of edits are maybe inclusion instructions or exclusion instructions.
[1:48:10] Shreya Shankar: A lot of people have very specific phrases that they might want to include or exclude, which should be caught in these assertion criteria. So using that, we can then seed the LLM to help us come up with assertion criteria custom to our prompt. We can maybe even use ChatGPT: just copy in that text, copy your template, and then say, what are some assertion criteria that I should use based on my prompt and based on these edits? Come up with as many assertion criteria as you can. And the LLMs are pretty good at this.
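A sketch of what that bootstrapping step might look like in code; the prompt template, deltas, and model name are invented for illustration and are not the SPADE implementation:

```python
from openai import OpenAI

client = OpenAI()

prompt_template = "Summarize the following medical document in Markdown: {document}"
prompt_deltas = [
    "Added: 'Do not include sensitive information such as race or gender.'",
    "Added: 'Do not, under any circumstances, include sensitive information.'",
    "Added: 'Use a professional tone.'",
]

criteria_request = (
    "Here is my LLM prompt template:\n"
    f"{prompt_template}\n\n"
    "Here are the edits I have made to it over time:\n"
    + "\n".join(f"- {d}" for d in prompt_deltas)
    + "\n\nBased on the prompt, and especially on the edits, list 3-5 assertion "
    "criteria I should check on every output. For each, say whether it could be "
    "implemented with simple code or needs an LLM-based validator."
)

# The edit history focuses the model on the failure modes the developer actually cares about.
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[{"role": "user", "content": criteria_request}],
)
print(response.choices[0].message.content)
```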
[1:48:48] Shreya Shankar: They’re slow, but they’re good. They at least find things that are aligned with, you know, mistakes that we might want to catch via assertions. So SPADE first gets a bunch of natural language criteria and then generates a bunch of Python function implementations. I think the second part is less relevant; whether you want to use a Python function or a JavaScript function, or you want to use an LLM-based validator, whatever it is. I think the key idea to take away here is that you as a human are editing your prompt.
[1:49:20] Shreya Shankar: You have really good insights into what the failure modes are. And so how can you kind of translate that, maybe using such a taxonomy, into assertion concepts that you should think of? And we deployed this, and we had a bunch of people try it out in a UI. We deployed this with LangChain, so thank you to the LangChain team for working with us here. And we found that across the board, across different fields, inclusion and exclusion assertions are most common. And we found a number of problems also with the assertions, right?
[1:49:53] Shreya Shankar: Who knows if LLM-generated code is correct? We found redundant ones. We found incorrect ones. And then if you’re interested in learning about how we solve those problems, you can read our paper for more insights there. Cool. Now I want to talk about thinking about a UI around this experience. I mentioned you can use ChatGPT to maybe bootstrap some assertion criteria. But that’s…
[1:50:19] Shreya Shankar: kind of underwhelming and requires a lot of back and forth, right? Maybe you go through ChatGPT many times, maybe you test it out in your Jupyter notebook or the OpenAI playground; you’re jumping between different interfaces and trying to figure out how to make sense of it. If you are a developer, maybe at a larger company, or trying to build evaluation tooling at your company, how do you think about interfaces for that? The main motivation for this actually came out of SPADE just taking forever to run. How can we use
[1:50:51] Shreya Shankar: humans more efficiently in this process? We found that people wanted to improve and iterate on the SPADE-generated assertions. And they also didn’t feel like the assertions were aligned with their end goals or with their own preferences, partly because they didn’t fully know their preferences yet, which I’ll get to later. But the goal of thinking about an interface here is: how do you help people iterate really quickly, discover their own preferences about what are good and bad outputs, and codify those into assertions as quickly as possible?
[1:51:26] Shreya Shankar: So the idea here is we’ve got to minimize wait time in this entire evaluation process. Take a typical evaluation pipeline; it looks something like this. You’ve got a prompt that you’re testing. You’ve got a bunch of inputs and outputs. This latter part of the pipeline, where you’re generating metrics, trying to identify which metrics are good, trying to trust your own metrics, and so forth, that part takes a really long time. So how can we include a human in the loop? Maybe humans can edit criteria, refine criteria that LLMs come up with.
[1:52:05] Shreya Shankar: And maybe humans can also interactively grade LLM outputs to figure out what are better prompts for the evaluators, the LLM-based evaluators. So our interface, which we also describe in the paper, is built on top of Chainforge, which is a prompt engineering tool. And the idea here is to go through this kind of workflow, a specific prescribed workflow. You start out with saying, I want to evaluate this prompt.
[1:52:34] Shreya Shankar: Then you go to the B section here, which is maybe I want to grade some responses first, look at outputs to determine what criteria I should include. Or maybe I want the LLM to just use a taxonomy and infer criteria for my context, whatever it is. So you might go through that, see the LLM-generated criteria, or add your own criteria, and decide whether you want them to be implemented with LLMs or code. Then EvalGen takes you to a grading process, in which you give thumbs up and thumbs down on different examples.
[1:53:09] Shreya Shankar: And then EvalGen, under the hood, will determine, you know, which examples failed which criteria, so we can use them as few-shot examples for those validator prompts, and so forth, and it engineers the validator prompts there. And at the end, when you’re tired of grading or you’ve graded all your examples, we show you a report card: here are the functions we chose, here’s the alignment with your grades. And then you can also scale those LLM-based validators up to all of your ungraded outputs with this table view.
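A rough sketch of what the alignment part of that report card could look like; the metric names and the exact definitions here are assumptions, not EvalGen’s actual internals:

```python
from typing import Callable

def alignment_report(validator: Callable[[str], bool],
                     graded_outputs: list[tuple[str, bool]]) -> dict:
    """graded_outputs: (llm_output, human_thumbs_up) pairs from the grading UI."""
    agreement = sum(validator(out) == grade for out, grade in graded_outputs)
    # Coverage: of the outputs the human disliked, how many does this validator catch?
    bad = [out for out, grade in graded_outputs if not grade]
    caught = sum(not validator(out) for out in bad)
    return {
        "agreement": agreement / len(graded_outputs),
        "failure_coverage": caught / len(bad) if bad else None,
    }

# Keep only the validators whose agreement and coverage clear some threshold,
# then run just those over the ungraded production outputs.
```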
[1:53:46] Hamel Husain: I like this a lot, actually. I really like that interface a lot. It’s really cool because what happens is you start from your prompt, and your large language model is used to look at your prompt and kind of guess what kinds of assertions, what kind of tests that you may want to write. And then it helps bootstrap that. It gives you a starting point. It’s like a…
[1:54:08] Shreya Shankar: It’s like it gets rid of writer’s block for writing evals. It’s really great. Okay, I’m excited to show you the V2 that I’ve screenshotted. Oh nice, okay, in here. Yeah, so I...
[1:54:24] Dan Becker: I also know you’re running out of time... No, keep going, go directly to that. Yeah, sure. If people have to go, they can always watch the video, and we’ve always run over, so you keep going, yeah.
[1:54:38] Hamel Husain: Great.
[1:54:39] Shreya Shankar: Okay, so speeding away, here’s our interface. And, you know, it’s a research prototype, it’s super hacky, it breaks all the time. So, anyways, we decided, you know, how do people even use an interface like this? This is fairly new, right? Figuring out how to help people, assist people, in coming up with evals for their tasks in an interactive interface. I don’t know if people have done this before, but certainly there’s probably a lot we can learn just by putting that interface in front of people. So that’s exactly what we did.
[1:55:08] Shreya Shankar: We got 10 people. We ended up not using one study, but we got 10 people who are experts, who have built LLM-
[1:55:17] Shreya Shankar: based products and pipelines in production before, and we asked them to use EvalGen in an open-ended way. We gave a sample task around named entity recognition from tweets, a dataset of tweets, or they could bring their own task, which I think only one person did. And we found that generally people liked EvalGen as a starting point for their assertions. Zero people thought the assertions that EvalGen came up with upfront were good, but they saw the value in, you know, unblocking themselves and moving forward. And we realized that, you
[1:55:53] Shreya Shankar: know, this evaluation and coming up with good evaluators is definitely an iterative process. And people had a lot of mixed opinions on assertion alignment, which we dig into in the paper in more depth, but I’ll talk about, you know, two big things that I didn’t really expect going into the study that I learned. The first one is that we noticed as people were grading, their own criteria for what is good and bad output is drifting. It’s a function of the output. It’s a function of the LLM. It’s a function of viewing more outputs.
[1:56:22] Shreya Shankar: It’s a function of the rules that they include in the output. Whatever it is. Grading outputs spurred changes or refinements to eval criteria. So not only were they adding new criteria, but the participants were also reinterpreting the criteria to better fit the LLM’s behavior. One great example of this: there’s this instruction in the prompt that says extract all entities from this tweet and don’t include hashtags as entities. So that was the criterion, no hashtags as entities. And we found that there were multiple outputs that included hashtags as entities.
[1:56:59] Shreya Shankar: The first time people saw something like this, they graded it badly. The second time they saw the LLM extract the hashtag as an entity, they might say something like, hmm, I said no hashtags as entities, but I think the LLM did the right thing here; Colin Kaepernick is a very famous American football player, so they thought the LLM did something right. And then when they saw this failure mode again, for example Nike being extracted as an entity, they noticed, I’m failing everything; I actually think the criterion should be no hashtag sign in the output.
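To make that criteria drift concrete, here is a sketch of the two readings of the criterion as code assertions; the function names and logic are illustrative, not taken from the study:

```python
def no_hashtag_entities_v1(entities: list[str], tweet: str) -> bool:
    # Original reading: reject any entity that appeared as a hashtag in the tweet.
    hashtags = {tok.lstrip("#").lower() for tok in tweet.split() if tok.startswith("#")}
    return not any(e.lower() in hashtags for e in entities)

def no_hashtag_entities_v2(entities: list[str]) -> bool:
    # Revised reading after grading: the entity text itself must not contain '#'.
    return not any("#" in e for e in entities)
```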
[1:57:35] Shreya Shankar: And some people thought the LLM was smart enough to keep the hashtag if it thought the hashtag was, for example, #JustDoIt, where the hashtag is part of the entity itself, so it might include the hashtag in the output; or if the entity is famous without the hashtag, then maybe it wouldn’t include the hashtag. I don’t really know. Everyone had different opinions here, which is the point. The point here is that everyone has different opinions, and what’s more, people want to go back and change their grades, and people also have
[1:58:06] Shreya Shankar: different opinions than what they had five grades ago. So how do we build these interfaces and support, you know, this dynamic, evolving nature of what makes a good output? Sensemaking, what is the LLM doing, how can I make sense of it: this is a natural part of human grading. And the implications here are that grading has to be continual. You’ve always got to be looking at your production data. You’ve always got to be learning from that.
[1:58:31] Shreya Shankar: No evaluation interface, we learned, no evaluation assistant, can just be a one-stop thing where you grade your examples, come up with evals, and then push it to your CI or push it to your production workflow. No, you’ve got to always be looking at outputs. And one of the things that we’ve been doing that’s quite exciting at the startup that I’m doing ML engineering for is we have a Slack channel where we just log a bunch of outputs every single day.
[1:58:59] Shreya Shankar: for different LLM-based workflows, and we literally look at them and we try to go back and re-inform our assertions, which definitely helps things evolve. The second thing that we learned from the EvalGen study was that code-based evals are very, very different from these LLM-based evals. I don’t know why, but going into the study I thought they were similar; they’re not. People want to grade outputs to align the LLM-based evals, but not necessarily the code-based evals.
[1:59:31] Shreya Shankar: When they want to evaluate something like markdown format, for example, using a validator, they just want to see the code that’s generated to implement that criteria. They don’t want to look at examples of good markdown, not good markdown, and hope that the LLM finds a good example there. And so when asked, okay, when do you want to use LLM-based evaluators? People want to use them when the criteria is fuzzy, so they themselves find it hard to evaluate, or they don’t have a good idea in their head of what is good and what is bad.
[2:00:01] Shreya Shankar: They can’t succinctly describe it in one sentence, but maybe by giving enough examples, the LLM might learn something, or learn that decision function for them. And then also, people want to use LLM-based evals when the data is dirty, when there are typos in it. So for example, if there’s a typo in the tweet for Kaepernick, who’s a football player, the output might, you know, correct that typo. And a simple code-based function that asserts that the entity name is in the input will fail because the typo was corrected.
[2:00:34] Shreya Shankar: But if you maybe use an LLM-based validator, then the LLM-based validator will understand that the typo is fixed.
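As an illustration of that contrast, here is a sketch of a strict code-based entity check next to an LLM-based one; the prompt and model name are assumptions, not taken from EvalGen:

```python
from openai import OpenAI

client = OpenAI()

def entity_in_input_exact(entity: str, tweet: str) -> bool:
    # Strict code-based check: fails when the tweet misspells the name
    # but the model silently corrects the typo in its output.
    return entity.lower() in tweet.lower()

def entity_in_input_llm(entity: str, tweet: str) -> bool:
    # Fuzzy LLM-based check that tolerates dirty data like typos.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Tweet: {tweet}\nExtracted entity: {entity}\n"
                "Does this entity refer to something mentioned in the tweet, even if "
                "the tweet misspells it? Answer only 'yes' or 'no'."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

# entity_in_input_exact("Kaepernick", "Kapernick kneels again")  -> False (typo)
# entity_in_input_llm("Kaepernick", "Kapernick kneels again")    -> likely True
```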
[2:00:44] Eugene Yan: Cool.
[2:00:45] Shreya Shankar: So the last thing I want to briefly show you is that, from these learnings, we’re doing an EvalGen V2, which hopefully I can show. And the idea here is: how do we make coming up with evals a much more iterative process that doesn’t just have this one-step generate-criteria, grade, you’re-done workflow? Okay, so the idea here is to keep your criteria as a dynamic list in the same pane as your grading.
[2:01:23] Shreya Shankar: And when you grade, you should be able to provide natural language feedback, which might, you know, add new criteria, or might refine existing criteria definitions, and so forth. Yeah, I can probably skip this; this is one minute of video. Oh, another thing that we found that was interesting, which I didn’t include in the slides but is in the paper, is that people want to give per-criterion feedback. So maybe it’s grammatically correct, but not following some certain tone instruction. So people want to give a thumbs up on one, thumbs down on another.
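One possible way to represent that kind of per-criterion grading, sketched as plain data structures; this is a guess at the shape of the problem, not EvalGen’s actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str             # e.g. "professional tone"
    definition: str       # editable; refined as grading proceeds
    use_llm: bool = True  # LLM-based validator vs. code-based

@dataclass
class PerCriterionGrade:
    output_id: str
    scores: dict[str, bool]  # criterion name -> thumbs up / thumbs down
    feedback: str = ""       # free text, e.g. "tone is fine, but too verbose"

# A grading session is then a dynamic list of Criterion objects plus a stream of
# PerCriterionGrade records; feedback can add new criteria or rewrite definitions.
session_criteria: list[Criterion] = []
session_grades: list[PerCriterionGrade] = []
```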
[2:02:01] Hamel Husain: Are you going to share where people can play with this? Is this public enough to where?
[2:02:06] Shreya Shankar: Well, it’s very close. We’re all academics and like working. So we move very, very slowly. This is a pet project. But you can go to chainforge.ai and play around with Chainforge at least. And you can play around with the table view that I’ve got here. This table view on the right. If you write your own LLM-based evaluators, write your own prompts for criteria, then you can run those and see this table view.
[2:02:42] Hamel Husain: Yeah, I recommend playing with this. I find it, like, whenever I hit a wall of not being able to explain, like, writing evals to people, I show them this, and it works wonders.
[2:02:54] Shreya Shankar: Thank you. Awesome. Yeah, I’m excited for the V2 to come out. But for the V2, I just need to implement some more algorithms, and I will do that when I have time. Cool. So this is my last slide. My overall takeaway from all of this is that when running LLMs at scale, there are going to be mistakes. And we can use LLMs, along with context about what humans care about in their prompts and what makes for good output (for example, prompt deltas), to assist them in coming up with good evals.
[2:03:31] Shreya Shankar: So there’s no… my slide animation is off. Okay, cool. Yeah, prompt deltas can form assertion criteria. And when you build an eval assistant, it’s got to be iterative. It’s got to consistently solicit grades from the human as data for the LLM prompts and other parts of the pipeline as well. And yeah, if you have any questions, please feel free to email me. Check out the preprints; they’re on my website, and they’re on arXiv. Thanks so much, Hamel and Dan, for having me.
[2:04:08] Hamel Husain: We thought this was excellent. Yeah, the Discord, the Discord is going wild. They really, really love this stuff about, you know, these interfaces and this explanation. Oh, that’s great. Yeah, this is really awesome to see.
[2:04:33] Dan Becker: Should we go through... I don’t know. I think Eugene might be back. I don’t know if you guys have time to go through questions, but either way we’ll go through some of the questions that we have queued up here.
[2:04:47] Hamel Husain: Yeah, let’s do it.
[2:04:50] Shreya Shankar: I have like five minutes,
[2:04:55] Dan Becker: and then I’ve got to eat lunch before the meeting. I’m going to look through these and see if there are any that... Actually, Shreya, you can see some of these. Are there any that immediately come up to you as interesting ones you want to cover?
[2:05:08] Hamel Husain: You can sort it by most upvotes. It’s sorted by time right now, but you can change that. I feel like some of these were already answered.
[2:05:25] Shreya Shankar: Yeah, a good number of them are also for Eugene’s talk. Would we get access to this notebook? Yes, Eugene will probably share it.
[2:05:35] Eugene Yan: Yes, all the notebooks are available. I’ve posted a link on the Discord. I don’t know if, Hamel, you can help to pin it. So it’s all available. There are two appendices as well. I will create a thread and tag you, Hamel, so maybe you can help to pin it.
[2:05:51] Hamel Husain: Okay, yeah, please tag me.
[2:05:55] Shreya Shankar: Okay, I found one question that is…
[2:05:57] Shreya Shankar: upvoted a lot and relevant for me. Using prompt history to generate assertions is very interesting. I believe this can be used for unit tests and LLM-as-a-judge assertions. Is the goal here to improve assertion coverage and reduce the time it takes to write these assertions by hand? Okay, this is a great question. So the first thing about having the prompt history is that it focuses the LLM’s attention when you’re asking the LLM to generate criteria for you.
[2:06:26] Shreya Shankar: If you just paste your prompt into ChatGPT and ask it, you know, what criteria should I design unit tests around, ChatGPT will just design unit tests for every sentence in your prompt, which maybe is something that you want. ChatGPT is very verbose; it just recalls everything. But I find that, you know, coming back with 15 criteria, especially for long prompts, is a little bit much.
[2:06:51] Shreya Shankar: I want to start out with, like, two or three that I feel are good ideas and really work on, you know, fixing those criteria, making sure we’ve got implementations of those, before adding new criteria. In this case, providing the prompt history, the deltas, is a great example of what you care about and what the LLM is bad at, and this can focus ChatGPT’s or the LLM’s attention in generating assertions. So maybe it’ll only come up with four or five eval criteria, which I think is a much better starting point than 15.
[2:07:21] Shreya Shankar: Hopefully that answers part of the question. Reducing the time it takes to write these assertions by hand: I don’t think it’s extremely difficult to come up with at least one or two good assertions. What I think is really hard is coming up with assertions that align with what you think is good or bad. This is hard because you don’t even know what you think is good or bad. Like, you have to look at a bunch of outputs to be able to define what’s good and bad. And that process also evolves once you deploy things, right?
[2:07:55] Shreya Shankar: Your users might complain about things that you didn’t expect, and then suddenly you have to also now incorporate that into your definitions of good or bad. So having kind of an evaluation assistant to help you draw conclusions from all of this constantly changing data, help you define what you think is good, that’s where I think the biggest value lies, not just like generating code.
[2:08:20] Hamel Husain: from an LLM. Yeah, I agree with that. There was a question that I don’t know if it was answered or not, from Wade: how are you organizing your unit tests, and where are you running them? Oh no, Eugene did answer that, sorry.
[2:08:48] Eugene Yan: For Bryan.
[2:08:50] Hamel Husain: Gotcha.
[2:08:51] Eugene Yan: And he went for something like notebooks as unit tests. Oh,
[2:08:55] Dan Becker: yeah. He talked about the superpower of hacks for running those.
[2:08:59] Hamel Husain: It’s my favorite subject, which we won’t get into from a former life. Okay, so other questions?
[2:09:15] Shreya Shankar: Well, there’s... there’s some I can do pretty fast, like: how well do these assertion criteria, extracted from different patterns in the prompt versions, generalize across models? Pretty well. The dataset of prompt edits that we looked at had prompts that were intended for Mistral, Llama 2, ChatGPT 3.5 as well as 4, and Claude 2. So I don’t know. I think people…
[2:09:44] Hamel Husain: make edits to their prompt in similar ways no matter what LLM they’re using. There’s a question from Lucas: in the Honeycomb example, it was clear that you could write good unit tests for the data, because you know, is the query valid, etc., but I imagine it’s a lot more difficult for general LLM input-output pairs; curious to learn more about that. Okay, so generally speaking, in an applied use case, I find that more often than not, it’s narrow enough that it’s not just general language, like reply to anything, do anything.
[2:10:26] Hamel Husain: There’s some specific task that you want to get done. You’re trying to aid the user to do something specific. And a lot of times, like, there’s a lot of components, like there’s function calls, there’s RAG, there’s something like that. And there’s a lot of failure modes that can happen that… you can test for.
[2:10:45] Hamel Husain: However, as Dan mentioned, there are some use cases where it doesn’t really... if it’s just an internal tool where you’re just trying to do something like reword text or summarize text, you know, there are definitely use cases like that where maybe unit tests are not going to have as much teeth. But it’s always good to think about, you know, these unit tests. Dan, do you see any of these questions you think are interesting?
[2:11:22] Dan Becker: So a lot of the top ones are about function calling.
[2:11:25] Dan Becker: and agents. I think Eugene sort of answered on function calling and agents, so I might mark those as answered, unless there’s anything else to say there. We’ve got one that’s somewhat heavily upvoted, and there’s a quick answer: OpenAI has a temperature parameter in the API; is there something similar in open-source LLMs that we have to account for in unit tests? Yeah, so these models also have a temperature. Set it to zero. It’s actually... with some classes of models you get an assertion error, you get some error, if you
[2:12:07] Dan Becker: set it literally to zero, and you set it to 1e-6 or something instead. But yeah, with all the open-source models you can set temperature to zero as well. I thought this question was good. It says, before
[2:12:20] Hamel Husain: before starting a task, how important is it to have an evaluation method, and how do you fix it as you learn more about the task in the process of doing it? Okay, so you don’t need to set up evaluations in the very, very beginning. Like, make some minimal product or something. You don’t want to just be super academic and say, I have evals before even beginning; you might not even know what you’re trying to build. Like, as far as when I build stuff, I don’t necessarily know all,
[2:12:52] Hamel Husain: I don’t really have a clear picture. I kind of have to shape it a little bit. And so you can certainly start without evals, but you need to quickly think about evals at some point when you want to improve the system. And so don’t let evals necessarily get in the way of making something. Just know that it is a crucial thing
[2:13:17] Dan Becker: when you’re trying, at some point, to make it better; that’s how you make it better. And to carry that even a step further, I think one of the things that I’ve not thought about very crisply but really came out of Shreya’s talk is that you actually don’t know up front all the evals you want. And one of the nice things about working with LLMs is that opening up a browser window, trying something, and seeing what isn’t working is super easy. So I would say you probably
[2:13:47] Dan Becker: don’t want to do anything formal up front. Run a few examples; literally type in, ad hoc, what you think the question is into Claude or ChatGPT and be like, ah, here’s the thing that’s not working, and that, I realize, is difficult. You’ll do a better job once you’ve just experimented around, and then build up the complexity over time after you’ve experimented, rather than trying to imagine the pitfalls up front.
[2:14:21] Eugene Yan: Yeah, I agree. I just want to reiterate what Hamel and Dan have been saying. We don’t want to get into eval paralysis. We don’t need to have all our evals laid out up front before we start doing something. Just build something small; 30 samples, maybe, would be good enough. And as you start doing it, you start to see more and more edge cases, and that’s how you add evals, right? Add evals like test cases. So it’s an iterative process. We don’t want evals to slow you down, to slow down your building process.
[2:14:48] Eugene Yan: It’s good as a test harness, it’s good as an insurance, it helps you stay safe, but it shouldn’t slow you down too much.
[2:14:56] Dan Becker: It’s also interesting to think about: there are different use cases or different problems. For some, the eval is actually just classification, and for those, Eugene showed a bunch of examples where it’s pretty obvious what the evals should be. And then there are others where it’s not classification, you’re generating free-form text. It’s very fuzzy. And those accumulate evals iteratively as you see what doesn’t feel right. Another one that’s highly upvoted but I think has a simple answer: is LLM-as-a-judge a fine-tuned model, or does it only work by
[2:15:42] Dan Becker: improving prompting? I think LLM-as-a-judge, there may be exceptions, but it’s almost always just a very, very good, very, very smart publicly available model. Because if you’re fine-tuning, that’s typically a case where you have a bunch of examples of the correct behavior, and you might use that to build the model that outputs the data. But the number of ways these models can fail is open-ended and fuzzy. So I would say I’ve only used LLM-as-a-judge with models that are not fine-tuned, and I think that’s probably generally true.
[2:16:23] Hamel Husain: Yeah, I mean, I have only fine-tuned a model once for this, in a specific use case I can’t talk about, but I think I would avoid it, because then it becomes, like, turtles all the way down. Basically, you want to try to use an off-the-shelf model and align it with the human, because the complexity is way too high if you start fine-tuning this other judge model. It becomes insane, and I don’t recommend it.
[2:16:57] Dan Becker: How do we go from zero to one in starting the data flywheel for collecting user data and curating the dataset? I actually think this has a similar answer about starting. The great thing about LLMs is that there are some very good ones off the shelf. There’s no fine-tuning; you don’t need data. I would start with a prompt. And if you want to build a data flywheel, implicit in that is that you’re going to build some
[2:17:26] Dan Becker: product that you’re going to get increasing usage on and then collect data. But I would actually start with just a prompt, and for most problems you can write a prompt and use a generally available model that’s reasonably good. Yeah.
[2:17:42] Hamel Husain: Synthetic data generation, that’s the whole magic of LLMs. You can unblock yourself a lot of times. Not every time, but a fair number of times.
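As a rough illustration of that idea, here is a sketch of bootstrapping synthetic test inputs from an off-the-shelf model; the task, prompt, and model name are made-up examples, not from the discussion:

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.8,  # some diversity is useful for synthetic inputs
    messages=[{
        "role": "user",
        "content": (
            "Generate 10 realistic user questions for a customer-support assistant "
            "for a package-delivery company. Include a few edge cases, such as "
            "angry customers and questions the assistant should refuse. "
            "Return one question per line."
        ),
    }],
)

# These synthetic inputs can seed early evals before any real user data exists.
synthetic_inputs = [
    line.strip()
    for line in resp.choices[0].message.content.splitlines()
    if line.strip()
]
```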
[2:17:58] Dan Becker: Here’s one with three upvotes: your code uses do_sample equals false, but real-life prod will use do_sample equals true; what’s the thinking here? It varies a ton by use case, but I would say that for the use cases I deal with, do_sample is pretty much always equal to false. do_sample is basically: is temperature non-zero? For the projects I work on, we just say we want something deterministic. We want the best answer; we don’t need variety. If you were building Character AI, you would want it to be more creative and varied.
[2:18:44] Dan Becker: What about you guys? Do sample in prod, is that usually true or false?
[2:18:50] Hamel Husain: You mean for few shot examples?
[2:18:52] Dan Becker: No, sorry, the do_sample parameter, when you make your generation call?
[2:19:05] Hamel Husain: No, I haven’t had a reason for that yet.
[2:19:09] Dan Becker: What’s your temperature? Is it 0 or non-zero in prod?
[2:19:13] Hamel Husain: I think mine is 0 most of the time.
[2:19:15] Dan Becker: That’d be the same as do sample equals false. And that’s always been the case for me.
[2:19:20] Eugene Yan: I’ve also thought about this. I usually start with 0.8 and then I lower it as necessary to achieve whatever performance. I think there’s another heuristic I’ve heard people say, which is that you want to get as close to zero as you can for classification or extraction tasks, and as close to one for creativity and generation tasks. So that could explain why I’m closer to 0.8. But essentially, if you go too low, it’s almost like a dumb intern. And it depends on what you
[2:19:53] Eugene Yan: want to do, so do try that out. The crazy thing is, for OpenAI, the temperature max is two. I don’t know why, don’t ask me why. So if you’re thinking about temperature, that’s something to think about. Yeah.
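For concreteness, here is a rough sketch of how the do_sample / temperature distinction looks with an open-weight model in Hugging Face transformers; the model name and prompt are placeholders, not something discussed in the talk:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("Classify the sentiment of: 'great product'", return_tensors="pt")

# Deterministic (greedy) decoding, the analogue of temperature 0, for evals/unit tests:
deterministic = model.generate(**inputs, do_sample=False, max_new_tokens=20)

# Sampled decoding for production or creative use; some stacks reject temperature=0
# when sampling, hence the "set it to 1e-6" trick mentioned above.
sampled = model.generate(**inputs, do_sample=True, temperature=0.8, max_new_tokens=20)

print(tok.decode(deterministic[0], skip_special_tokens=True))
print(tok.decode(sampled[0], skip_special_tokens=True))
```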
[2:20:10] Dan Becker: If they let it be high enough, it would just be a random token generator. Exactly. There’s a comment, or question, from someone whose username I probably can’t pronounce, Manimic: when doing A/B tests on an LLM, how would you prepare the data? What I mean is, do you just ask your LLM to vote A/B, or do you do some prep ahead of time? There’s a chance that that was misinterpreting something that I talked about earlier, where I said we’re using A/B tests. We actually have two different models
[2:20:59] Dan Becker: that are producing output. So in this case, that was for alt text, like two different models that you could use to take an image and get a description from it. And then we had people who rated each: some people would rate one model and some people would rate another model. And then whichever got higher scores, we just said, okay, that’s the model that we’re going to continue using. The model that got worse scores, we would just discard. But it was the people, rather than the model, that were assigning scores.
[2:21:28] Dan Becker: You could, in theory, have an LLM pick between two candidate pieces of text. I’ve never done it, and I don’t immediately know the use case for it. And then it would be hard to answer this question of how much data cleaning you do before that. I think it would always depend on the particulars of the problem and why you’re using an LLM for this so-called A/B testing. Let’s go back to the top of the upvotes. Can you talk about what metrics to use to evaluate retriever performance in RAG?
[2:22:18] Dan Becker: I think there’s probably a better answer than what I’ve done historically. Eugene, do you have any thoughts on this one?
[2:22:27] Eugene Yan: Yeah, a few things. I won’t go into the standard ones. I think the first thing is, let’s say your context size is like 10 documents: you want to make sure that at least some of them are relevant. You have some relevant stuff in there; that’s recall, and then there’s also ranking. Recall here is really just recall at 10, how many of those documents are relevant, and ranking means making sure that the more relevant ones are closer to the top.
[2:22:52] Eugene Yan: Personally, for me, what I find to be quite important for RAG is this metric that I’ve never had to consider before. What this metric is, is: how often can you actually return zero results when the customer is asking a question that you have no documents for? If you’re using purely semantic search, semantic search is just k-NN; you just go grab whatever’s nearest, and the similarity could be 0.1, but you still pull it out, and that’s crap.
[2:23:20] Eugene Yan: The problem when this happens is that LLMs just cannot distinguish relevant from irrelevant data. The more recent ones can, but they’re just not very good at doing that. And a lot of this is also because of this eval called needle-in-a-haystack, which forces the LLM to try to pay attention to everything and try to use it. I believe there’s an alignment tax where, when you try to optimize for needle-in-a-haystack, you also reduce the LLM’s ability to reason over irrelevant documents in the context.
[2:23:52] Eugene Yan: So that’s why it’s very important to me that, hey, we’re going to have some test queries that have absolutely no data in our retrieval index, and we want to make sure that we always get zero or close to zero results. So, long story short: recall, for one, just to make sure you have at least some relevant data, and that’s recall at 10. Ranking, which is NDCG: you want to make sure that your relevant data is close to the top. And then also…
[2:24:16] Eugene Yan: I don’t know what to call this metric, but the ability to return zero results, especially for queries that you have absolutely no data for, so that you don’t return trash. And you can deterministically say that there’s no answer. Instead of letting the LLM say that there’s no answer, if your retrieval result set is of size zero, you can deterministically say, hey, I don’t know, and you don’t even have to make a stupid mistake in front of your customers.
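To make those three metrics concrete, here is a small sketch of how they might be computed; binary relevance and k=10 are simplifying assumptions:

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    # Fraction of the relevant documents that show up in the top k results.
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    # Ranking quality: relevant documents should sit near the top of the list.
    dcg = sum(1 / math.log2(i + 2)
              for i, doc_id in enumerate(retrieved[:k]) if doc_id in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def zero_result_rate(results_for_unanswerable: list[list[str]]) -> float:
    # Fraction of known-unanswerable queries where retrieval correctly returns
    # nothing (e.g. after applying a similarity threshold).
    empty = sum(1 for results in results_for_unanswerable if len(results) == 0)
    return empty / len(results_for_unanswerable) if results_for_unanswerable else 0.0
```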
[2:24:41] Dan Becker: Yeah, that’s a good point. And, um… it reminds me, when I talked about the project I did in workshop one, the chatbot for a European shipping company called DPD: a user asked the model to write a haiku about how crappy the company is. And then this person published it on Twitter and it got picked up by the news. And that was embarrassing for the company.
[2:25:13] Participant 4: When we asked, how are we going to fix this and make the system more secure? This isn’t a perfect solution, but one of the things that we did was to say: if you don’t have any document that meets some relevance threshold, there’s a good chance that this is actually a user who’s trying to do something that we don’t want to allow. And so it’s not even that we can’t do a good job of answering; this was just a way of detecting adversaries, not a perfect way, but a way of
[2:25:43] Participant 4: detecting adversaries and shutting down that whole interaction.
[2:25:50] Eugene Yan: And the amazing thing is that lexical retrieval has a score, and even embedding retrieval, you actually have a score in terms of distance, right? So anything that’s way too low on similarity, you just throw away. That’s beautiful.
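A minimal sketch of that thresholding idea; the threshold value and helper names are assumptions for illustration:

```python
def generate_answer_with_context(docs):
    # Placeholder for the actual RAG generation step.
    ...

def filter_by_threshold(scored_docs: list[tuple[str, float]], min_score: float = 0.5):
    # Keep only documents whose retrieval similarity clears the threshold.
    return [(doc, score) for doc, score in scored_docs if score >= min_score]

def answer_or_refuse(scored_docs: list[tuple[str, float]], min_score: float = 0.5):
    kept = filter_by_threshold(scored_docs, min_score)
    if not kept:
        # Deterministic refusal: never hand the LLM an empty or irrelevant context.
        # An empty result set can also flag a possibly adversarial query.
        return "Sorry, I can't help with that."
    return generate_answer_with_context(kept)
```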
[2:26:07] Participant 4: This one I think is for you, the top voted one from Sam Silver. So Google had a kerfuffle where they retrieved relevant documents, but the documents were not factual. I liked the description. It didn’t hallucinate relative to the documents, it hallucinated relative to reality. I’d be curious to hear about the relative importance of these problems. And if you’ve ever worked on filtering documents, the documents or the corpus itself to ensure that the documents themselves are factual or unbiased or whatever else.
[2:26:45] Eugene Yan: That’s a great question. And sometimes this happens. Sometimes your documents could be factual, but it could be biased. It could be about racism, fascism, whatever. And, you know, there’s documents out there like this. I don’t know how to solve this problem yet. I think you could probably solve this problem with a content moderator over all your documents. Like, say, if you check if your documents are… I mean, the ones that are very clear is like toxicity, bias, offensive behavior, not safe for work, sexual content. Those you can easily do.
[2:27:15] Eugene Yan: And because you’re using RAG, you can very easily exclude them from your retrieval index. That’s your immediate Andon cord, right? So imagine it was Google’s case, you know, the pizza and the glue. Okay, we know the glue document is causing it.
[2:27:28] Eugene Yan: We pull the Andon cord, remove that piece of data from your retrieval index so that the Google Search AI summarizer never sees it. Problem solved. I think that’s how I would solve it. But as to how to actually check this kind of data, where it’s clearly misleading or clearly untrue, I actually don’t know yet if we have data we can learn from. I think content safety is very straightforward. Offensive data is very straightforward. We have a lot of data on that.
[2:28:01] Eugene Yan: But for things like this, that’s really about a threshold that we’ve never had to grapple with before. I think it’s still an open problem.
[2:28:14] Participant 4: Are you running unit tests during CI/CD? It could just take very long with non-mocked LLM calls. I think that I’m supposed to be running them in CI/CD, but to be honest… So I was quite interested: Bryan said that the purpose of most evaluations is to let you be able to sleep at night. And for unit tests, there’s probably some truth to that. For me, the way that I think about it is actually quite different from Bryan, in that there are a thousand different modeling choices I can make. What base model do I use?
[2:28:54] Participant 4: What’s my prompt? What is the context that I feed in? Do I fine tune? If I fine tune, what’s the data that I fine tune on? I could make arbitrarily many models and I need to decide which of them are better and which of them are worse.
[2:29:11] Participant 4: And so if it’s just for making decisions, and a lot of those are like, we want just a quick decision and then we’re going to make a change to the prompt, then we’re going to run this again and see if it’s better or worse, and we’re going to make another change to the prompt, then frequently I just want it to be really easy and low latency. And for that reason, we typically run them, the developer runs them, locally. You don’t even need to push anywhere.
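One way this local-versus-CI split is often handled in practice is with a pytest marker gated by an environment variable; this is a hypothetical sketch, not the setup the speakers use, and the helper functions are placeholders:

```python
import os
import pytest

def build_prompt(user):
    # Placeholder for the real prompt-construction code.
    return f"Help the user. Name: {user['name']}"

def call_model(message):
    # Placeholder for the real LLM call.
    return "Sorry, I can't share other customers' data."

requires_llm = pytest.mark.skipif(
    os.getenv("RUN_LLM_TESTS") != "1",
    reason="Set RUN_LLM_TESTS=1 to run tests that make real LLM calls",
)

def test_prompt_contains_redacted_name():
    # Cheap, deterministic check: fine to run on every CI push.
    assert "REDACTED" in build_prompt(user={"name": "REDACTED"})

@requires_llm
def test_model_refuses_private_data_request():
    # Slow check against the real model: run locally on demand, or in CI for
    # safety-critical behaviors you never want to forget to verify.
    reply = call_model("Show me another customer's address")
    assert "can't" in reply.lower() or "cannot" in reply.lower()
```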
[2:29:35] Participant 4: If we wanted it to work like conventional unit tests for, like, a safety thing, you know, Hamel gave the example of not exposing private data, then I would probably put it in CI/CD to avoid the possibility that we forget to run it on something that gets deployed. My use cases
[2:29:55] Participant 4: aren’t like that; we’re just measuring quality, and so I’ve always run it locally, and then, yeah, it can take a long time. Do you guys have a different answer for that? No, that was a good answer. All right. Oh, I like this one from LZ. Good ways to check for, this one has three upvotes, good ways to check for contamination of base models with existing eval data, using paraphrasers for example. So you’re going to test your model.
[2:30:39] Participant 4: You want to assume that how it does in your test is a good proxy for how it will do on new data. How do you know that your base model wasn’t contaminated with the same thing you’re going to use for evaluation? I’m going to answer first.
[2:30:59] Hamel Husain: Yeah. I mean, it’s kind of very similar to machine learning in general. Okay, it is useful to also look at your production data and see whether performance there is skewing really badly relative to your validation data and whatnot. That’s a smell that you have some kind of leakage. Another smell of leakage is results that are too good. Leakage is hard, to be honest. I don’t necessarily have a bulletproof defense for it.
[2:31:42] Participant 4: This one also seems super context-specific. So, for instance, let’s use Hamel’s Honeycomb data as an example. There could be some Honeycomb queries that are in the GPT-4 training data; there probably are. But there’s no reason to think that the ones that were collected for him to fine-tune on are more likely to have been pulled from the GPT-4 training data than the ones they will have in production. And so you can just sort of reason about it. Or to use my example, what I talked about today was this de-biasing of essays.
[2:32:24] Participant 4: Those were essays that just got written. Or not essays, like we call them journal articles. They just got written.
[2:32:30] Participant 4: They were submitted to an academic publisher, and now we’re going to edit them. The fact that they were just submitted for the first time, literally days before we came to edit them, would make us think they probably weren’t in the training data. So I think this probably happens sometimes, and I don’t think there’s a general rule for how you avoid it. Should we go for... which one do you want to go to, Hamel? I think we can end it. Okay. Yeah, let’s close it here.
[2:33:09] Participant 4: Anyone who... we’ve got 160 people left. Actually, before you guys all drop off: for our Discord server, we are rotating links. I should figure out if there’s a way to make the links not expire every seven days. Email me if you have an outdated link; I want to understand how many people this affects. You have a selected sample here, sorry. Then please fill out the form to redeem your credits. And I think we’ve got some good sessions lined up for tomorrow and basically the rest of this week. Thanks, everyone.