LLMs on the command line

Published

June 11, 2024

Abstract

The Unix command-line philosophy has always been about joining different tools together to solve larger problems. LLMs are a fantastic addition to this environment: they can help wire together different tools and can directly participate themselves, processing text directly as part of more complex pipelines. Simon shows how to bring these worlds together, using LLMs - both remote and local - to unlock powerful tools and solve all kinds of interesting productivity and automation problems.

He’ll cover LLM, his plugin-based command-line toolkit for accessing over a hundred different LLMs - and show how it can be used to automate prompts and compare them across different models.

LLM logs prompts and responses to SQLite, which can then be explored and analyzed further using Datasette. Other CLI tools covered will include ollama, llamafile, shot-scraper, files-to-prompt, ttok and LLM’s support for calculating, storing and querying embeddings.

Chapters

00:00 Introduction

Simon Willison introduces LLM - a command line tool for interacting with large language models.

01:40 Installing and Using LLM

Simon demonstrates how to install LLM using pip or Homebrew and run prompts against OpenAI’s API. He showcases features like continuing conversations and changing default models.

10:30 LLM Plugins

The LLM tool has a plugin system that allows access to various remote APIs and local models. Simon installs the Claude plugin and discusses why he considers Claude models his current favorites.

13:14 Local Models with LLM

Simon explores running local language models using plugins for tools like GPT4All and llama.cpp. He demonstrates the llm chat command for efficient interaction with local models.

26:16 Writing Bash Scripts with LLM

A practical example of creating a script to summarize Hacker News threads.

35:01 Piping and Automating with LLM

By piping commands and outputs, Simon shows how to automate tasks like summarizing Hacker News threads or generating Bash commands using LLM and custom scripts.

37:08 Web Scraping and LLM

Simon introduces shot-scraper, a tool for browser automation and web scraping. He demonstrates how to pipe scraped data into LLM for retrieval-augmented generation (RAG).

41:13 Embeddings with LLM

LLM has built-in support for embeddings through various plugins. Simon calculates embeddings for his blog content and performs semantic searches, showcasing how to build RAG workflows using LLM.

Notes

Note

These notes were originally published by Simon Willison here

Notes for a talk I gave at Mastering LLMs: A Conference For Developers & Data Scientists.

Getting started

brew install llm # or pipx or pip
llm keys set openai
# paste key here
llm "Say hello in Spanish"

Installing Claude 3

llm install llm-claude-3
llm keys set claude
# Paste key here
llm -m haiku 'Say hello from Claude Haiku'

Local model with llm-gpt4all

llm install llm-gpt4all
llm models
llm chat -m mistral-7b-instruct-v0
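
With a hosted model and a local model both installed, the same prompt can be run against each and the answers compared side by side. The following is a minimal sketch and not part of the original notes: `compare_models` is a hypothetical helper name, and it assumes the `haiku` alias and the `mistral-7b-instruct-v0` model ID shown above are available on your machine.

```bash
# Hypothetical helper (not part of LLM itself): run one prompt against
# several models and label each answer with the model that produced it.
compare_models() {
  local prompt="$1"
  shift
  local model
  for model in "$@"; do
    printf '## %s\n' "$model"
    llm -m "$model" "$prompt"
  done
}

# Example usage (assumes both models are installed and any API keys are set):
# compare_models 'Five great names for a pet pelican' haiku mistral-7b-instruct-v0
```

Every call goes through `llm`, so each response is also logged to the SQLite database and can be compared again later in Datasette.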

Browsing logs with Datasette

https://datasette.io/

pipx install datasette # or brew or pip
datasette "$(llm logs path)"
# Browse at http://127.0.0.1:8001/

Templates

llm --system 'You are a sentient cheesecake' -m gpt-4o --save cheesecake

Now you can chat with a cheesecake:

llm chat -t cheesecake

More plugins: https://llm.datasette.io/en/stable/plugins/directory.html

llm-cmd

Help with shell commands. Blog entry is here: https://simonwillison.net/2024/Mar/26/llm-cmd/

files-to-prompt and shot-scraper

files-to-prompt is described here: https://simonwillison.net/2024/Apr/8/files-to-prompt/

shot-scraper javascript documentation: https://shot-scraper.datasette.io/en/stable/javascript.html

JSON output for Google search results:

shot-scraper javascript 'https://www.google.com/search?q=nytimes+slop' '
Array.from(
  document.querySelectorAll("h3"),
  el => ({href: el.parentNode.href, title: el.innerText})
)'

This version gets the HTML that includes the snippet summaries, then pipes it to LLM to answer a question:

shot-scraper javascript 'https://www.google.com/search?q=nytimes+slop' '
() => {
    function findParentWithHveid(element) {
        while (element && !element.hasAttribute("data-hveid")) {
            element = element.parentElement;
        }
        return element;
    }
    return Array.from(
        document.querySelectorAll("h3"),
        el => findParentWithHveid(el).innerText
    );
}' | llm -s 'describe slop'

Hacker News summary

https://til.simonwillison.net/llms/claude-hacker-news-themes describes my Hacker News summary script in detail.
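
As described in the talk, the script is a three-stage pipeline: fetch the thread as JSON with curl, flatten it to plain text with jq, and pipe that into LLM with a system prompt. The following is a rough sketch reconstructed from that description rather than the script itself; the TIL has the real thing. The function names are made up for illustration, the Algolia-hosted Hacker News API endpoint is an assumption (the talk does not name the URL), and the system prompt is abbreviated.

```bash
# Return success only when the argument is a non-empty string of digits.
is_valid_id() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Summarize the themes of a Hacker News thread.
# Usage: hn_summary <item id> [model]
hn_summary() {
  local id="$1"
  local model="${2:-haiku}"  # the talk says the script defaults to Haiku
  if ! is_valid_id "$id"; then
    echo "Usage: hn_summary <numeric item id> [model]" >&2
    return 1
  fi
  # Assumed endpoint: returns the whole thread as one nested JSON document.
  curl -s "https://hn.algolia.com/api/v1/items/$id" \
    | jq -r 'recurse(.children[]) | .author + ": " + .text' \
    | llm -m "$model" -s 'Summarize the themes of the opinions expressed here. For each theme, output a markdown header. Include direct quotes where appropriate.'
}
```

Passing a second argument, for example `hn_summary <item id> gpt-4o`, runs the same prompt through a different model.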

Embeddings

Full documentation: https://llm.datasette.io/en/stable/embeddings/index.html

I ran this:

curl -O https://datasette.simonwillison.net/simonwillisonblog.db
llm embed-multi links \
  -d simonwillisonblog.db \
  --sql 'select id, link_url, link_title, commentary from blog_blogmark' \
  -m 3-small --store

Then looked for items most similar to a string like this:

llm similar links \
  -d simonwillisonblog.db \
  -c 'things that make me angry'
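
Because `--store` saved the original text alongside each vector, the output of `llm similar` should include that text, so it can be piped straight into a prompt. That gives a very small retrieval-augmented generation loop. A sketch under those assumptions: `ask_links` is a hypothetical helper name, the system prompt wording is invented for illustration, and the default model is used.

```bash
# Hypothetical helper: find the stored blogmarks most similar to a question,
# then hand those results to a model as context for answering it.
ask_links() {
  local question="$1"
  llm similar links -d simonwillisonblog.db -c "$question" \
    | llm -s "Answer this question using only the search results provided as JSON: $question"
}

# Example usage (assumes the embed-multi command above has already been run):
# ask_links 'What has been written about slop?'
```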

Full Transcript

[0:00] Simon Willison: Hey, hey everyone, it’s great to be here. So yeah, the talk today, it’s about command line tools and large language models. And effectively the argument I want to make is that the Unix command line, dating back probably 50 years now, is, it turns out, the perfect environment to play around with this new cutting edge technology, because the Unix philosophy has always been about tools that output things that get piped into other tools as input. And that’s really what a language model is, right?
[0:27] Simon Willison: An LLM is a, it’s effectively a function that you pipe a prompt to, and then you get a response back out, or you pipe a big chunk of context to, and you get a response that you can do things with. So I realized this last year and also realized that nobody had grabbed the namespace on PyPI, the Python Package Index, for the term LLM. So I leapt at that. I was like, okay, this is an opportunity to grab a really cool name for something. And I built this…
[0:54] Simon Willison: little tool, which originally was just a command line tool for talking to OpenAI. So you could be in your terminal and you could type llm, say hi in French, and that would fire it through the OpenAI API and get back a response and print it to your terminal. That was all it did. And then over time, as other model providers became interesting, and as local models emerged that you could run on your computer, I realized there was an opportunity to have this tool do way more than that.
[1:24] Simon Willison: So I started adding plugin support to it so you can install plugins that give you access to Claude and local Mistral and all sorts of other models. There’s hundreds of models that you can access through this tool now. I’ll dive into that in a little bit more detail in a moment. First thing you need to know is how to install it. If you are Python people, that’s great. pip install llm works. I recommend using pipx to install it because then the dependencies end up packaged away somewhere nice. Or you can install it using Homebrew.
[1:54] Simon Willison: I think the one I’ve got here, yeah, this one I installed with brew install llm. I made the mistake of running that command about half an hour ago and of course it’s Homebrew so it took half an hour to install everything. So treat with caution. But it works. And once it’s installed, you have the command. And so when you start using LLM, the default is for it to talk to OpenAI. And of course, you need an OpenAI API key for that. So I’m going to grab my API key. There’s a command you can run.
[2:25] Simon Willison: LLM secrets. Is it LLM secrets?
[2:30] Hugo Bowne-Anderson: Yes.
[2:32] Simon Willison: Yes. No, it’s not. It’s. What is it? llm keys. That’s it. So I can type llm keys set openai, and then I paste in the key, and I’m done. And having done this, I’ve actually got all sorts of keys in here, but my OpenAI API key has now been made available. And that means I can run prompts. Five great names for a pet pelican. This is my favorite test prompt. And it gives me five great names for a pet pelican. So that’s fun. And that’s running over the API.
[3:05] Simon Willison: And because it’s a Unix command line thing, you can do stuff with the app. So you can do things like write that to pelicans.txt. The greater than sign means take the output and write it to a file. Now I’ve got a nice permanent pelicans.txt file with my five distinctive names for a pet pelican. Another thing you can do is you can continue. If you say -c, which stands for continue, I can say now do walruses. And it will continue that same conversation, say here are five fitting names for a pet walrus.
[3:35] Simon Willison: I’m going to say justify those. Oops. And now it says why each of these names are justified for that walrus. That’s super, super, super delightful.
[3:48] Hugo Bowne-Anderson: I like Gustav.
[3:50] Simon Willison: Gustav, what was, let’s do LLM logs dash. Gustav, a touch of personality and grandeur, strong regal name that suits the impressive size and stature of a walrus, evoking a sense of dignity. It’s good. The justifications are quite good. This is GPT-4o I’m running here. You can actually say llm models default to see what the default model is, and then you can change that as well. So if I want to… be a bit cheaper, I can set it to chatgpt.
[4:24] Simon Willison: And now when I do this, oops, these are the GPT 3.5 names, which are slightly less exciting. So that’s all good and fun. But there are a whole bunch of other useful things you can do when you start working with these things in the terminal. So I’m going to… Let’s grab another model. So LLM, as I mentioned, has plugins. If you go to the LLM website, look at list of plugins, there is a plugin directory. This is all of the plugins that are currently available for the tool. And most of these are remote APIs.
[5:06] Simon Willison: These are plugins for talking to Claude or Reka or Perplexity or Anyscale Endpoints, all of these different providers. And then there are also some local model plugins that we’ll jump into in a moment. But let’s grab Claude 3. Claude 3 is my current favorite, my favorite family of models. So I can say llm install llm-claude-3, and it will go ahead and install that plugin. If I type llm plugins, it shows me the plugins that it has installed. Oh, I didn’t mean to install llm-claude, but never mind. And if I see LLM…
[5:39] Hugo Bowne-Anderson: Is Claude 3 currently your favorite? Just out of interest?
[5:41] Simon Willison: Two reasons. One, Claude 3 Haiku is incredible, right? Claude 3 Haiku is cheaper than GPT 3.5. I think it’s the cheapest decent model out there. It’s better than 3.5. It has the 100,000 token limit. So you can dump a huge amount of stuff on there. And it can do images. So we’ve got, I think it’s the most exciting model that we have right now if you’re actually building because for the price, you get an enormous array of capabilities. And then Opus, I think Opus was better than GPT-4 Turbo.
[6:15] Simon Willison: I think GPT-4o has just about caught up for the kind of stuff that I do. But Opus is still like a really interesting model. The other thing I’ll say about Claude is the… There’s this amazing article that just came up about Claude’s personality, which talks about how they gave Claude a personality. And this is one of the most interesting essays I have read about large language models in months. Like, what they did to get Claude to behave the way it does and the way they thought about it is super fascinating.
[6:45] Simon Willison: But anyway, so I’ve installed Claude. I’ve now got Claude 3 Opus, Claude 3 Sonnet and Claude 3 Haiku. LLM gives everything long names, so you can say that, say hi in Spanish, and it’ll say hola. If you say it with a flourish, it’ll add an emoji. That’s cute. Hola mi amigo. But you can also say llm -m and just the word haiku because that’s set up as a shorter alias. So if I do that, I’ll get the exact same response. Crucially, you can…
[7:22] Simon Willison: This is how I spend most of my time when I’m messing around with models. I install plugins for them, or often I’ll write a new plugin because it doesn’t take much effort to write new plugins for this tool. And then I can start mucking around with them in the terminal, trying out different things against them. Crucially, one of the key features of this tool is that it logs absolutely everything that you do with it. It logs that to a SQLite database.
[7:47] Simon Willison: So if I type llm logs path, it will show me the path to the SQLite database that it’s using. And I absolutely adore SQLite databases, partly because my main project, the thing I spend most of my time building, is this thing called Datasette, which is a tool for exploring data in SQLite databases. So I can actually do this. I can say datasette that, and if you put it in double quotes, it makes up for that space in the file name. This is taking, this is a good command line trick.
[8:18] Simon Willison: It’s taking the path to the log database and passing it to Datasette. And now I’ve got a web interface where I can start browsing all of my conversations. So let’s have a look at responses. We sort by ID. Here we go. Say hi in Spanish with a flourish. Hola mi amigo. There it is. It stores the options that we used. It stores the full JSON that came back. It also organizes these things into conversation threads. So earlier we started with, we built, we had that conversation that started five great names for a pet pelican.
[8:52] Simon Willison: We replied to it twice and each of those messages was logged under that same conversation ID. So we started five great names for a pet pelican. Now do walruses. Justify those. As a tool for evaluating language models, this is incredibly powerful because I’ve got every experiment I’ve ever run through this tool. I’ve got two and a half thousand responses that I’ve run all in one place, and I can use SQL to analyze those in different ways.
[9:18] Simon Willison: If I facet by them, I can see that I’ve spent the most time talking to GPT 3.5 Turbo. Claude 3 Opus, I’ve done 334 prompts through. I’ve got all of these other ones, Gemini, Gemini Pro, Orca Mini, all of these different things that I’ve been messing around with. And I can actually search these as well. If I search for pelican, these are the six times that I’ve talked to Claude 3 Opus about pelicans. And it’s mostly… Oh, interesting.
[9:46] Simon Willison: It’s mostly asking for names for a pelican, but I did at one point ask it to describe an image that was a close-up view of a pelican wearing a colorful party hat. The image features aren’t in the main shipped version of the software yet. That’s a feature I need to release quite soon. Basically, it lets you say llm, like, -i and then give it the path to an image and that will be sent as part of the prompt. So that’s kind of fun.
[10:10] Simon Willison: But yeah, so if you want to be meticulous in tracking your experiments, I think this is a really great way to do that. Like having this database where I can run queries against everything I’ve ever done with them. I can try and compare the different models and the responses they gave to different prompts. That’s super, super useful. Let’s talk a little bit more about plugins. So I mentioned that we’ve got those plugins that add additional models. We also have plugins that add extra features.
[10:45] Simon Willison: My favorite of those is this plugin called llm-cmd, which basically lets you do this. If I say llm install llm-cmd, that’s installing the plugin. That gives me a new command, llm cmd. That command wasn’t there earlier. I can now say llm cmd --help, and it will tell me that this will generate and execute commands in your shell. So as an example, let’s do llm convert the file demos.md to uppercase and spit out the results. Oh, I forgot. So llm cmd. And what it does is it passes that up to, I think, GPT-4o is my default model right now.
[11:36] Simon Willison: Gets back the command that does this thing. Why is this taking so long? Hang on. Models. Default. Let’s set that to GPT-4o. Maybe GPT-3.5 isn’t very good at that. Anyway, when this works, it populates my shell with the command that I’m trying to run. And when I hit enter, it will run that command. But crucially, it doesn’t just run the command, because that’s a recipe for complete disaster. It lets you review that command before it goes. And I don’t know why this isn’t working. This is the one live demo I didn’t test beforehand.
[12:21] Simon Willison: I will drop, I will make notes available afterwards as well. But this is, right, here we go. Here’s an animated GIF showing me, showing you exactly what happens. This is show the first three lines, show the first three lines of every file in this directory. And it spits out head -n 3 *. And that does exactly the job.
[12:41] Simon Willison: A fun thing about this is that because it’s a command here, it actually, And tab completion works as well, so you can give it file names by tab completing, and when the command is babing itself, that will do the right thing. But that’s kind of fun. It’s kind of neat to be able to build additional features into the tool, which use all of the other features of LLM, so it’ll log things to SQLite and it’ll give you access to all of those different models.
[13:07] Simon Willison: But it’s a really fun playground for sort of expanding out the kind of command line features that we might want to use. Let’s talk about local models. So local models that run on your laptop are getting shockingly effective these days. And there are a bunch of LLM plugins that let you run those. One of my favorites of those is called llm-gpt4all. It’s a wrapper around the GPT4All library that Nomic put out.
[13:35] Simon Willison: So Nomic have a desktop application that can run models that is accompanied by a Python library that can run models, which is very neatly designed. And so what I can do is I can install that plugin, llm install llm-gpt4all. This will go and fetch that. And now when I run the llm models command, we’ve got a whole bunch of additional models. This is all of these GPT4All ones. And you’ll note that some of these say installed. That’s because I previously installed them and they’re sat on my hard drive somewhere.
[14:08] Simon Willison: I’m not going to install any new models right now because I don’t want to suck down a four gigabyte file while I’m on a Zoom call. But quite a few of these are installed, including Mistral 7B Instruct. So I can grab that and I can say, llm -m mistral, let’s do. Five great names for a pet seagull. Explanations. And I’m going to fire up Activity Monitor for this right now. Let’s see if we can spot it doing its thing. Right now we’ve just asked a command line tool to load in a 4 gigabyte model file.
[14:42] Simon Willison: There we go. It’s loading it in. It’s at 64.3 megabytes. 235 megabytes. It’s spitting out the answer. And then it goes away again. So this was actually a little bit wasteful, right? We just ran a command which loaded four gigabytes of memory, of file into memory, and I think onto the GPU, ran a prompt through it, and then threw all of that away again. And it works. You know, we got our responses. And this here ran entirely on my laptop. There was no internet connection needed for this to work.
[15:13] Simon Willison: But it’s a little bit annoying to have to load the model each time. So I have another command I wrote, a command I added, llm chat. And llm chat, you can feed it the ID of a model, llm chat -m mistral-7b-instruct-v0. And now it’s giving me a little…
[15:33] Simon Willison: chat interface, so I can say, say hello in Spanish. The first time I run this it will load the model again. And now, now in French. So if you’re working with local models, this is a better way to do it, because you don’t have to pay the cost of loading that model into memory every single time. There’s also, you can stick in !multi and now you can copy and paste a whole block of code into here and it translates. And then you type !end at the end.
[16:08] Simon Willison: And if we’re lucky, this will now give me… Huh. Okay. Well, that didn’t. I may have hit the… I wonder if I’ve hit the context length. Yeah, something went a little bit wrong with that bit. But yeah, being able to hold things in memory is obviously really useful. There are better ways to do this. One of the best ways to do this is using the Ollama tool, which I imagine some people here have played with already. And Ollama is an absolutely fantastic tool.
[16:50] Simon Willison: It’s a Mac, Linux, and Windows app that you can download that lets you start running local language models. And they do a much better job than I do of curating their collection of models. They have a whole team of people who are making sure that newly available models are available in that tool and work as effectively as possible. But you can use that with LLM as well. If I do llm install llm-ollama, actually… That will give me a new plugin called llm-ollama. And now I can type llm models.
[17:22] Simon Willison: And now, this time, it’s giving me the Ollama models that I have available on my machine as well. So in this case, we can do Mixtral. Let’s do llm -m, llm chat -m mixtral:latest. I’ll write it in Spanish. And this is now running against the Ollama server that’s running on my machine, which I think might be loading Mixtral into memory at the moment, the first time I’m calling it. Once it’s loaded into memory, it should work for following prompts without any additional overhead. Again, I spend a lot of time in Activity Monitor these days.
[18:00] Simon Willison: There we go. Ollama Runner has four gigabytes in residence, so you’d expect that to be doing the right thing.
[18:07] Hugo Bowne-Anderson: So Simon, I just have a quick question. I love how you can use Ollama directly from LLM. I do think one of the other value props about Ollama is the ecosystem of tools built around it. Like you go to Ollama’s GitHub and there are all types of front ends you can use. And so I love using your LLM client, for example, with like a quick and dirty Gradio app or something like that. But I’m wondering, are there any front ends you recommend or any plugins or anything in the ecosystem to work with?
[18:38] Simon Willison: I’ve not explored the Ollama ecosystem in much detail. I tend to do everything on the command line and I’m perfectly happy there. But one of the features I most want to add to LLM as a plugin is a web UI. So you can type llm web, hit enter, it starts a web server for you and gives you a slightly more… slightly more modern interface for messing around with things. And I’ve got a demo that relates to that that I’ll show in a moment.
[19:04] Simon Willison: But yeah, so front-end, and actually the other front-end I spend a little bit of time with is LM Studio, which is very nice. That’s a very polished GUI front-end for working with models. There’s a lot of…
[19:19] Hugo Bowne-Anderson: It’s quite around with getting two LLMs answering your same question. But there’s a mode where you can…
[19:24] Simon Willison: get two or N LLMs, if you have enough processing power, to answer the same questions and compare their responses in real time. Yeah, it’s a new feature. Very cool. That is cool. I’ve been planning a plugin that will let you do that with LLM, like llm multi -m llama -m something, and then give it a prompt. But that’s one of the many ideas on the backlog at the moment. Would be super, super useful. The other one, of course, that people should know about if they don’t
[19:53] Simon Willison: is llamafile. I’m going to demonstrate that right now.
[19:57] Hugo Bowne-Anderson: Sorry, I’m going to shush now.
[20:00] Simon Willison: This is one of the most bizarrely brilliant ways of running language models. There’s this project that’s sponsored by Mozilla called llamafile. And llamafile effectively lets you download a single file, like a single binary that’s like four gigabytes or whatever. And then that file will run, it comes with both the language model and the software that you need to run the language model. And it’s bundled together in a single file and one binary works on Windows, macOS, Linux and BSD, which is ridiculous. What this thing does is technically impossible.
[20:38] Simon Willison: You cannot have a single binary that works unmodified across multiple operating systems. But llamafile does exactly that. It’s using a technology called Cosmopolitan, which, here we go, I’ve got an article about Cosmopolitan from when I dug into it just a while ago to try and figure out how this thing works. Astonishing project by Justine Tunney. But anyway, the great thing about llamafile is, firstly, you can just download a file and you’ve got everything that you need. And because it’s self-contained, I’m using this as my end-of-the-world backup of human knowledge.
[21:16] Simon Willison: I’ve got a hard drive here, which has a bunch of llamafiles on. If the world ends, provided I can still run any laptop and plug this hard drive in, I will have a GPT-3.5-class language model that I can just start using. And so the one that I’ve got on there at the moment, I have, I’ve actually got a whole bunch of them, but this is the big one. I’ve got Llama 3 70B, which is by far, I mean, how big is that thing? That’s a 37 gigabyte file. It’s the four-bit quantized version.
[21:55] Simon Willison: That’s genuinely a really, really good model. I actually started this running earlier because it takes about two minutes to load it over USB-C from the drive into memory. So this right here is that. And all I had to do was download that file, chmod 755 it, and then do ./ and the Meta Llama 3 70B file name and hit enter. And that’s it. And that then fires up this. In this case, it fires up a web server which loads the model and then starts running on port. Which port is it?
[22:32] Simon Willison: So now if I go to localhost 8080, this right here is the default web interface for Llama. It’s all based on llama.cpp but compiled in a special way. And so I can say, ten names for a pet pelican and hit go. And this will start firing back tokens. Oh, I spelt pelican wrong, which probably won’t matter.
[22:58] Hugo Bowne-Anderson: And while this is executing, maybe I’ll also add for those who want to spin up llamafile now you can do so as Simon has just done. One of the first models in their README they suggest playing around with, which is 4 gigs or 4.7 gigs or something, is the LLaVA model, which is a really cool multimodal model that you can play around with locally from one file immediately, which is actually mind blowing if you think about it for a second.
[23:22] Simon Willison: It really is. Yeah. Do I have that one? Let’s see. Let’s grab.
[23:30] Hugo Bowne-Anderson: With your end of the world scenario, now you’ve got me thinking about like post-apocalyptic movies where people have like old LLMs that they use to navigate the new world.
[23:40] Simon Willison: Completely. Yeah, that’s exactly how I.
[23:43] Hugo Bowne-Anderson: Preppers with LLMs.
[23:47] Simon Willison: I’m going to. Wow. Wow. Yeah, I very much broke that one, didn’t I?
[23:58] Hugo Bowne-Anderson: We’ve got a couple of questions that may be relevant as we move on. One from Alex Lee is how does LLM compare with tools like Ollama for local models? So I just want to broaden this question. I think it’s a great question. It’s the Python challenge, right? The ecosystem challenge. When somebody wants to start with a tool like this, how do they choose between the plethora? How would you encourage people to make decisions?
[24:21] Simon Willison: I would say LLMs… LLM’s unique selling point is the fact that it’s scriptable and it’s a command line tool. You can script it on the command line and it can run access both the local models and the remote models. Like, that I feel is super useful. So you can try something out against a hosted like cloud-free haiku and then you can run the exact same thing against a local Lama 3 or whatever. And that, I think, is the reason to pay attention to that. I just…
[24:52] Hugo Bowne-Anderson: I also, I do want to build on that. I’d say if like I’ve been using dataset for some time now, and I love local SQLite databases as well. So the integration of those three with all the dataset plugins as well, make it really, really interesting also. So I think that’s a huge selling point.
[25:07] Simon Willison: So what I’ve done here, I closed Llama 370B and I have switched over to that. I switched over to that Lava demo. And if we’re lucky, look at this. Person features a person sitting in a chair with a rooster nearby. She’s a chicken, not a rooster. A white ball filled with eggs. This I think is astoundingly good for a four gigabyte file. This is a four gigabyte file. It can describe images. This is remarkably cool. And then from LLM’s point of view, if I saw LLM Lava file. And then run LM models.
[25:50] Simon Willison: I’ve now got a model in here which is Lama file. So I can say LM dash Lama file. Describe chickens. This doesn’t yet. Ooh, what happened there? Error file not found. Not sure what’s going on there. I’ll dig into that one a little bit later.
[26:11] Simon Willison: the joys of live demos but yeah um so we’ve got all of this stuff what are some of the things that we can start doing with it well the most exciting opportunity i think is that we can now start um we can now start writing little bash scripts writing little tools on top of this and so if i do This is a script that I’ve been running for quite a while called HN summary, which is a way of summarizing posts on Hacker News or entire conversations on Hacker News. Because Hacker News gets pretty noisy.
[26:47] Simon Willison: What’s a good one of these to take a look at? Yeah, I do not have time to read 119 comments, but I’d like to know a rough overview of what’s going on here. So I can say HN-summary. And then paste in that ID. And this is giving me a summary of the themes from the hack news point. So, he’s in theme static versus dynamic linking, package management dependency, Swift cross-platform language. That totally worked. And if we look at the way this worked, it’s just a bash script which does a curl command to get the full.
[27:25] Simon Willison: This is from one of the hack news APIs. If you hit this URL here. You will get back JSON of everything that’s going on in that thread as this giant terrifying nested structure. I then pipe it through the JQ command. I use ChatGP to write this because I can never remember how to use JQ. That takes that and turns it into plain text. And actually, I’ll run that right now just to show you what that looks like. There we go.
[28:01] Simon Willison: So that essentially strips out all of the JSON stuff and just gives me back the names of the people and what they said. And then we pipe that into LLM-M model. The model defaults to Haiku, but you can change it to other models if you like. And then we feed it this, the dash S option. to LLM, also known as dash dash system, is the way of feeding in a system prompt.
[28:26] Simon Willison: So here I’m feeding the output of that curl command, goes straight into LLM as the prompt, and then the system prompt is the thing that tells it what to do. So I’m saying: summarize the themes of the opinions expressed here; for each theme, output a markdown header. Let’s try that. I’m going to try that one more time, but this time I’ll use GPT-4o. So we’re running the exact same prompt, but this time through a different model. And here we’re actually getting back quotes.
[28:52] Simon Willison: So when it says, you man wizard said this, Jerry Puzzle said this about dynamic and static linking. I really like this as a mechanism of sort of summarizing conversations because the moment you ask for direct quotes, you’re not completely safe from hallucination, but you do at least have a chance of fact checking what the thing said to you. And as a general rule, models are quite good at outputting text that they’ve just seen. So if you ask for direct quotes from the input, you’ll often get good results. But this is really good, right?
[29:20] Simon Willison: This is a pretty decent, quick way of digesting 128 comments in that giant thread. And it’s all been logged to my SQLite database. I think if I go and look in SQLite, I’ve got hundreds of Hacker News threads that I’ve summarized in this way, which if I wanted to do fancy things with later, there’s probably all sorts of fun I could have with them. And again, I will. Here we go.
[29:49] Simon Willison: I will share full notes later on, but there’s a TIL that I wrote up with the full notes on how I built up this script. That’s another reason I love Claude 3 Haiku, is that running this kind of thing through Claude 3 Haiku is incredibly inexpensive. It costs… it may be a couple of cents for a long conversation. And the other model I use for this is Gemini Flash that just came out from Google. It’s also a really good long context model with a very low price per token. But yeah. Where did we go to?
[30:28] Hugo Bowne-Anderson: So we have a couple of questions that maybe you want to answer now, maybe we want to leave until later and maybe you cover. And the fact that we saw some sort of formatted output leads to one of these. Is LLM compatible with tools like Instructor, Kevin asks, to generate formatted output? And there are other questions around like piping different things together. So I wonder if you can kind of use that as a basis to talk about how you pipe, you know?
[30:56] Simon Willison: Yes, I will jump into some quite complex piping in just a moment. LLM does not yet have structured output or function calling support. I am so excited about getting that in there. The thing I want to do, there are two features I care about. There’s the feature where you can like get a bunch of unstructured text and feed in a JSON schema and get back JSON. That works incredibly well. A lot of the models are really good at that now.
[31:19] Simon Willison: I actually have a tool I built for Datasette that uses that feature, but that’s not yet available as a command line tool. And the other thing I want to do is full-blown like tool execution where the tools themselves are plugins. Like imagine if you could install an LLM plugin that added Playwright functions, and now you can run prompts that can execute Playwright automations as part of those prompts, because it’s one of the functions that gets made available to the model.
[31:46] Simon Willison: So that, I’m still gelling through exactly how that’s going to work, but I think that’s going to be enormously powerful.
[31:53] Hugo Bowne-Anderson: Amazing. And on that point as well, there are some questions around evals: whether, when using something like this, you can do evals, and how that would work.
[32:05] Simon Willison: So work in progress. Two months ago, I started hacking on a plugin for LLM for running evals. And it is very, very alpha right now. The idea is I want to be able to define my evals as YAML files and then say things like llm eval simple.yml with the 4o model and the ChatGPT models. I’m running the same eval against two different models and then get those results back, log them to SQLite, all of that kind of thing. This is a very, very early prototype at the moment.
[32:37] Simon Willison: But it’s partly, I just, I’ve got really frustrated with how difficult it is to run evals and how little I understand about them. And when I don’t understand something, I tend to write code as my way of thinking through a problem. So I will not promise that this will turn into a generally useful thing for other people. I hope it will. At the moment, it’s a sort of R&D prototype for me to experiment with some ideas.
[32:58] Hugo Bowne-Anderson: I also know the community has generated a bunch of plugins and that type of stuff for Datasette. I’m not certain about LLM, but I am wondering if people here are pretty, you know, pretty sophisticated audience here. So if people wanted to contribute or that type of thing.
[33:13] Simon Willison: OK, the number one way to contribute to LLM right now is by writing plugins for it. The most exciting are the ones that enable new models. So I wrote a very detailed tutorial on exactly how to write a plugin that exposes new models. A bunch of people have written plugins for API-based models. Those are quite easy. The local models are a little bit harder, but the documentation is here.
[33:39] Simon Willison: And I mean, my dream is that someday this tool is widely enough used that when somebody releases a new model, they build a plugin for that model themselves. That would be the ideal. But in the absence of that, it’s pretty straightforward building plugins for new models. I’m halfway through building a plugin for the MLX Apple framework, which is getting really interesting right now. And I just this morning got to a point where I have a prototype of a plugin that can run MLX models locally, which is great. But yeah, let’s do some commands.
[34:17] Simon Willison: Okay, I’ll show you a really cute thing you can do first. LLM has support for templates. So you can say things like llm --system "you are a sentient cheesecake", tell it the model, and you can save that as a template called cheesecake. Now I can say llm chat -t cheesecake, tell me about yourself. And it’ll say, I’m a sentient cheesecake, a delightful fusion of creamy textures. So this is, I have to admit, I built this feature. I haven’t used it as much as I expected I would.
[34:47] Simon Willison: It’s effectively LLM’s equivalent of GPTs, of ChatGPT’s GPTs. I actually got this working before GPTs came along. And it’s kind of fun, but it’s, yeah, like I said, I don’t use it on a daily basis. And I thought I would when I built it. Let’s do some really fun stuff with piping. So one of the most powerful features of LLM is that you can pipe things into it with a system prompt to have it then process those things further.
[35:17] Simon Willison: And so you can do that by just like catting files to it. So I can say cat demos.md | llm -s "summary short". And this will give me a short summary of that document that I just piped into it, which works really well. That’s really nice. Cool. A little bit longer than I wanted it to be. Of course, the joy of this is that once this is done, I can then say llm -c "no, much, much, much shorter and in haikus".
[35:50] Simon Willison: And now it will write me some haikus that represent the demo that I’m giving you right now. These are sentient cheesecake, templates to save brilliant minds, cheesecake chats with us. That’s lovely. So being able to pipe things in is really powerful. I built another command called files-to-prompt, where the idea of this one is if you’ve got a project with multiple files in it, running files-to-prompt will turn those into a single prompt.
[36:14] Simon Willison: And the way it does that is it outputs the name of the file and then the contents of the file and then name of the next file, contents of the file, et cetera, et cetera, et cetera. But because of this, I can now do things like suggest tests to add to this project. Oh. I’m sorry. I’m sorry. I forgot the llm -s. Here we go. And this is now, here we go, reading all of that code and suggesting, okay, you should have tests that, oh, wow, it actually, it’s writing me sample tests.
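The "file name, then contents, then next file name" idea can be sketched in a few lines of Python. This is a hypothetical stand-in written for illustration, not the actual files-to-prompt implementation; the function name and the exact output layout are assumptions.

```python
from pathlib import Path
import tempfile


def files_to_prompt(root):
    """Concatenate every file under root as: relative path, then file contents."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # The path line tells the model which file the following text came from.
            parts.append(path.relative_to(root).as_posix())
            parts.append(path.read_text(encoding="utf-8"))
    return "\n".join(parts)


# Build a throwaway two-file project so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.py").write_text("print('a')\n", encoding="utf-8")
    Path(tmp, "b.py").write_text("print('b')\n", encoding="utf-8")
    prompt = files_to_prompt(tmp)

print(prompt)
```

The single string that comes out is what gets piped to the model, with the task ("suggest tests to add to this project") passed as the system prompt.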
[36:46] Simon Willison: This is a very nice result. So I use this all the time. When I’m hacking on stuff on my machine, I will very frequently just cat a whole directory of files into a prompt in one go, and use the system prompt to say, what tests should I add? Or write me some tests, or explain what this is doing, or figure out this bug. Very, very powerful way of working with the tool. But way more fun than that is another tool I built called shot-scraper.
[37:14] Simon Willison: So shot-scraper is a browser automation tool which started out as a way of taking screenshots. So once you’ve got it installed, you can do shot-scraper and then the URL to a web page, and it will generate a PNG file with a screenshot of that web page. That’s great. I use that to automate screenshots in my documentation. But then I realized that you can do really fun things by running JavaScript from the command line.
[37:38] Simon Willison: So a very simple example, if I say shot-scraper javascript, give it the URL to a website and then give it that string, document.title, it will load that website up in a hidden browser. It’ll execute that piece of JavaScript and it will return the result of that JavaScript directly to my terminal. So I’ve now got the title of this webpage and that’s kind of fun. Where that gets super, super fun is when you start doing much more complicated and interesting things with it. So let’s scrape Google. Google hate being scraped. We’ll do it anyway.
[38:12] Simon Willison: Here is a Google search for NY Times plus slop. There’s an article in the New York Times today with a quote from me about the concept of slop in AI, which I’m quite happy about. And then so you can open that up and start looking at the HTML, and you’ll see that there’s H3s for each result, and the H3s are wrapped by a link that links to that page.
[38:37] Simon Willison: So what I can do is I can write a little bit of JavaScript here that finds all of the H3s on the page, and for each H3, it finds the parent link and its href, and it finds the title, and it outputs those in an array. And if I do this… This should fire up that browser. That just gave me a JSON array of links to search results on the New York Times. Now I could pipe that to LLM. So I’m gonna do pipe LLM.
[39:06] Simon Willison: Actually, no, I’m gonna do a slightly more sophisticated version of this. This one goes a little bit further. It tries to get the snippet as well, because the snippet gives you that little bit of extra context. So if I take that, and I’m just going to say -s "describe slop". And what we have just done is we have done RAG, right? This is retrieval augmented generation against Google search results using their snippets to answer a question, done as a bash one-liner effectively.
[39:42] Simon Willison: Like we’re using shot-scraper to load up that web page. We’re scraping some stuff out with JavaScript. We’re piping the results into LLM, which in this case is sending it up to GPT-4o, but I could equally tell it to send it to Claude or to run it against a local model or any of those things. And it’s a full RAG pipeline. I think that’s really fun. I do a lot of my experiments around the concept of RAG, just as these little shell scripts here.
[40:09] Simon Willison: You could consider the Hacker News example earlier was almost an example of RAG, but this one, because we’ve got an actual search term and a question that we’re answering, I feel like this is it. This is a very quick way to start prototyping different forms of retrieval augmented generation.
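The shape of that pipeline, retrieved snippets as the prompt and the question as the system prompt, can be sketched in Python. The `title`, `link` and `snippet` keys, the function name and the sample results are all assumptions for illustration; the demo itself did this with a shell pipeline, not with this code.

```python
import json


def build_rag_input(results_json, question):
    """Turn scraped search results into prompt text, keeping the question separate."""
    results = json.loads(results_json)
    context = "\n\n".join(
        f"{r['title']}\n{r['link']}\n{r.get('snippet', '')}".strip()
        for r in results
    )
    # The retrieved context plays the role of the piped-in prompt;
    # the question plays the role of the system prompt.
    return {"system": question, "prompt": context}


# Made-up results standing in for the JSON array scraped from a search page.
scraped = json.dumps([
    {"title": "Example article one", "link": "https://example.com/one",
     "snippet": "A short extract mentioning slop."},
    {"title": "Example article two", "link": "https://example.com/two",
     "snippet": "Another extract with more context."},
])

rag_input = build_rag_input(scraped, "describe slop")
print(rag_input["prompt"])
```

Swapping the retrieval step (search results, a flattened thread, an embeddings lookup) while keeping this prompt shape is what makes the pattern quick to prototype.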
[40:27] Hugo Bowne-Anderson: Let me ask though, does it use? I may have missed, does it use embeddings?
[40:32] Simon Willison: Not yet, no, but I’ll get into embeddings in just a second.
[40:35] Hugo Bowne-Anderson: And that’s something we decided not to talk too much about today, but it’d be sad if people didn’t find out about your embeddings.
[40:42] Simon Willison: I have a closing demo I can do with embeddings. Yeah, this right here, effectively, we’re just copying and pasting these search results from the browser into the model and answering a question, but we’re doing it entirely on the command line, which means that we can hook up our own bash scripts that automate that and pull that all together. There’s all sorts of fun bits and pieces we can do with that. But yeah, the shot-scraper JavaScript thing I’ll share later. Let’s jump into the embedding stuff. So…
[41:16] Simon Willison: If you run llm --help, it’ll show you all of the commands that are available in the LLM family. The default command is prompt. That’s for running prompts. There are also these collections for dealing with embeddings. I would hope everyone in this course is familiar enough with embeddings now that I don’t need to dig into them in too much detail. But it’s exactly the same pattern as the language models. Embeddings are provided by plugins. There are API-based embeddings. There are local embedding models. It all works exactly the same way.
[41:46] Simon Willison: So if I type llm embed-models, that will show me the models that I have installed right now. And actually, these are the OpenAI ones, the 3-small, 3-large, and so on. If I were to install additional plugins... here is the embeddings documentation. There’s a section in the plugin directory for embedding models. So you can install sentence transformers, you can get CLIP running, and various other bits and pieces like that. But let’s embed something. So if I say llm embed, let’s use the OpenAI 3-small model and give it some text.
[42:28] Simon Willison: It will embed that text and it will return an array of, I think, 6,000. How many is that? Like jq length. An array of 1,536 floating point numbers. This is admittedly not very interesting or useful. There’s not a lot that we can do with that JSON array of floating point numbers right here. You can get it back in different shapes and things. I can say -f hex and get back a hexadecimal blob of those. Again, not particularly useful.
[43:04] Simon Willison: Where embeddings get interesting is when you calculate embeddings across a larger amount of text and then start storing them for comparison. And so we can do that in a whole bunch of different ways. There is a command called, where is it? embed-multi. Where’s my embed-multi documentation gone? Here we go. The embed-multi command lets you embed multiple strings in one go, and it lets you store the results of those embeddings in a SQLite database because I use SQLite databases for everything.
[43:40] Simon Willison: So what I have here is a SQLite database. I’m going to open it up actually using Datasette Desktop, which is my macOS Electron app version of Datasette. How big is that file? That’s a 129 megabyte file. Wow. Does this have embeddings in already? It does not. Okay, so this right here is a database of all of the content on my blog. And one of the things I have on my blog is I have a link blog, this thing down the side, which has 7,000 links in it.
[44:16] Simon Willison: And each of those links is a title and a description and a URL, effectively. So I’ve got those here in a SQLite database, and I’m going to create embeddings for every single one of those 7,168 bookmarks. And the way I can do that is with a, well, firstly, I need to figure out a SQL query that will get me back the data I want to embed. That’s going to be select id, link_url, link_title, commentary from blog_blogmark.
[44:44] Simon Willison: The way LLM works is when you give it a query like this, it treats the ID there as the unique identifier for that document, and then everything else gets piped into the embedding model. So once I’ve got that in place, I can run this command. I can say llm embed-multi. I’m going to create a collection of embeddings called links. I’m going to do it against that simonwillisonblog SQLite database. I’m going to run this SQL query, and I’m using that 3-small model.
[45:11] Simon Willison: And then --store causes it to store the text in the SQLite database as well. Without that, it’ll just store the IDs. So I’ve set that running, and it’s doing its thing. It’s got 7,000 items. Each of those has to be sent to the OpenAI API in this case, or if it was a local model, it would run it locally. And while that’s running, we can actually see what it’s doing by taking a look at the embeddings table. Here we go.
[45:38] Simon Willison: So this table right here is being populated by that script. We’re at 1,200 rows now. I hit refresh. We’re at 2,000 rows. And you can see for each one, we’ve got the content which was glued together. And then we’ve got the embedding itself, which is a big binary blob: a binary encoded version of that array of 1,500 floating point numbers. But now that we’ve got those stored, we can start doing fun things with them. I’m going to open up another. There we go.
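The "first column is the ID, everything else becomes the text to embed" rule can be illustrated with a short Python sketch against an in-memory SQLite database. The table contents are made up, and joining the remaining columns with a single space is an assumption about how the text is glued together, not a statement of what the tool does internally.

```python
import sqlite3

# In-memory stand-in for the blog database; the rows are invented for the example.
db = sqlite3.connect(":memory:")
db.execute(
    "create table blog_blogmark "
    "(id integer primary key, link_url text, link_title text, commentary text)"
)
db.executemany(
    "insert into blog_blogmark values (?, ?, ?, ?)",
    [
        (1, "https://example.com/a", "First link", "Some commentary"),
        (2, "https://example.com/b", "Second link", None),
    ],
)


def rows_to_documents(conn, sql):
    """First column becomes the document ID; remaining non-null columns form the text."""
    for row in conn.execute(sql):
        doc_id, *rest = row
        yield str(doc_id), " ".join(str(value) for value in rest if value is not None)


docs = list(rows_to_documents(
    db,
    "select id, link_url, link_title, commentary from blog_blogmark order by id",
))
for doc_id, text in docs:
    print(doc_id, "->", text)
```

Each (ID, text) pair is what would then be sent to the embedding model, with the ID kept as the key for later lookups.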
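To make "binary encoded version of that array of floating point numbers" concrete, here is a minimal Python sketch of packing a vector into a blob and back. It assumes little-endian 32-bit floats, which is an assumption for illustration; the helper names are hypothetical.

```python
import struct


def encode(values):
    """Pack a list of floats into a compact blob of little-endian 32-bit floats."""
    return struct.pack("<" + "f" * len(values), *values)


def decode(blob):
    """Unpack a blob produced by encode() back into a list of floats."""
    return list(struct.unpack("<" + "f" * (len(blob) // 4), blob))


vector = [0.5, -0.25, 1.0]
blob = encode(vector)

print(len(blob))      # 4 bytes per value
print(blob.hex())     # the same bytes shown as a hexadecimal string
print(decode(blob))
```

Under this encoding a 1,536-dimension embedding takes 6,144 bytes, far smaller than the same numbers written out as JSON text.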
[46:20] Simon Willison: So I’ve opened up another window here so that I can say llm similar. I’m going to look for similar items in the links collection to the text, things that make me angry. Oh, why doesn’t the, oh, because I’ve got to add the -d. Here we go. So this right here is taking the phrase, things that make me angry, it’s embedding it, and it’s finding the most similar items in my database to that. And there’s an absolutely storming rant from somebody. There’s death threats against bloggers. There’s a bunch of things that might make me angry.
[46:56] Simon Willison: This is the classic sort of embedding semantic search right here. And this is kind of cool. I now have embedding search against my blog. Let’s try something a little bit more useful. I’m going to say, let’s do Datasette plugins. So we’ll get back everything that looks like it’s a Datasette plugin. There we go. And I can now pipe that into LLM itself. So I can pipe it to LLM and I’m gonna say system prompt, most interesting plugins. And here we are. Again, this is effectively another version of command line RAG.
[47:39] Simon Willison: I have got an embeddings database in this case. I can search it for things that are similar to things. I like this example because we’re running LLM twice. We’re doing the llm similar command to get things out of that vector database. And then we’re piping to the llm prompt command to summarize that data and turn it into something interesting. And so you can build a full RAG-based system again as a little command line script. I think I’ve got one of those. Log answer. Yes, there we go.
[48:13] Simon Willison: This one I don’t think is working at the moment, but this is an example of what it would take to take a question, run an embedding search against, in this case, it’s every paragraph in my blog. I’ve got a little bit of jq to clean that up. And then I’m piping it into… in this case, the local llamafile, but I’ve piped into other models as well to answer questions.
[48:38] Simon Willison: So you can build a full RAG Q&A workflow as a bash script that runs against this local SQLite database and does everything that way. It’s worth noting that this is not a fancy vector database at all. This is a SQLite database with those embedding vectors as binary blobs. Anytime you run a search against this, effectively, it’s doing a brute force search. It’s calculating the vector similarity between your input and every one of those things, and then it’s sorting the records by that score.
[49:13] Simon Willison: I find for less than 100,000 records, it’s so fast it doesn’t matter. If you were using millions and millions of records, that’s the point where the brute force approach doesn’t work anymore, and you’ll want to use some kind of specialized vector index. There are SQLite vector indexing tools that I haven’t integrated with yet, but they’re looking really promising. You can use Pinecone and things like that as well. One of my future goals for LLM is to teach it how to work with external vector indexes.
[49:40] Simon Willison: Because I feel like once you’ve got those embeddings stored, having a command that like synchronizes up your Pinecone to run searches, that feels like that would be a reasonable thing to do. I realize we’re running a little bit short on time. So I’m gonna switch to questions for the rest of the section. I think I went through all of the demos that I wanted to provide.
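The brute force approach described above can be sketched in Python: decode every stored blob, score it against the query vector, and sort. The table layout, the three-dimensional toy vectors, the helper names and the choice of cosine similarity as the score are assumptions for illustration, not the tool's actual schema or code.

```python
import math
import sqlite3
import struct


def encode(values):
    """Pack floats as little-endian 32-bit values (assumed storage format)."""
    return struct.pack("<" + "f" * len(values), *values)


def decode(blob):
    return list(struct.unpack("<" + "f" * (len(blob) // 4), blob))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy collection: three documents with made-up 3-dimensional embeddings.
db = sqlite3.connect(":memory:")
db.execute("create table embeddings (id text primary key, embedding blob)")
db.executemany(
    "insert into embeddings values (?, ?)",
    [
        ("rant", encode([0.9, 0.1, 0.0])),
        ("recipe", encode([0.0, 0.2, 0.9])),
        ("complaint", encode([0.8, 0.3, 0.1])),
    ],
)


def similar(conn, query_vector, limit=2):
    """Score every stored vector against the query, then return the top matches."""
    scored = [
        (cosine_similarity(query_vector, decode(blob)), doc_id)
        for doc_id, blob in conn.execute("select id, embedding from embeddings")
    ]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:limit]]


results = similar(db, [1.0, 0.0, 0.0])
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Every query touches every row, so the cost grows linearly with the collection. That is fine at the scale mentioned in the talk and is the reason a dedicated vector index becomes attractive at millions of records.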
[49:59] Hugo Bowne-Anderson: Awesome. Well, thank you so much, Simon. That was illuminating as always. And there are a lot more things I want to try now. And I hope for those who have played around with LLM and these CLI utilities that you’ve got a lot more ideas about how to do so. And for those who haven’t, please jump in and let us know on Discord or whatever, like what type of stuff you, what type of fun you get to have.
[50:20] Hugo Bowne-Anderson: Question wise, I haven’t been able to rank all of them for some reason with respect to upvotes, unfortunately, this time. There was one, and I don’t know if you mentioned this, there was one quickly about Hugging Face hub models. Are there plugins?
[50:37] Simon Willison: No, there is not. And that’s because I am GPU poor. I’m running on a Mac. Most of the Hugging Face models appear to need an NVIDIA GPU. If you have an NVIDIA GPU and want to write the LLM Hugging Face plugin, I think it would be quite a straightforward plugin to write, and it would be enormously useful. So yeah, that’s open right now for somebody to do that. Same thing with, is it vLLM or something?
[51:05] Simon Willison: There’s a few different serving technologies that I haven’t dug into because I don’t have an NVIDIA GPU to play with on a daily basis. But yeah, the Hugging Face Models thing would be fantastic.
[51:15] Hugo Bowne-Anderson: Awesome. And how about some of the serverless inference stuff? Is there a way we can use LLM to ping those in?
[51:23] Simon Willison: Do you mean like the Cloudflare ones and so on?
[51:27] Hugo Bowne-Anderson: I’m thinking like, let me… If you go to any given model, there is some sort of serverless inference you can just use to ping the APIs that they’ve already got set up there.
[51:36] Simon Willison: Oh, interesting. I mean, so as you can see, we’ve got like 20 different plugins. Anyscale Endpoints is a very good one, Fireworks, OpenRouter. So if it’s available via an API, you can build a plugin for it. The other thing is, if it’s OpenAI compatible as the API, you don’t have to build anything at all. You can actually configure LLM. Yeah, you can teach it about additional
[52:08] Simon Willison: OpenAI compatible models by just dropping some lines into a YAML file. So if it already speaks OpenAI, without writing additional plugins, you can still talk to it.
[52:20] Hugo Bowne-Anderson: Amazing. And just check out the link I shared with you. If you want to open that one, it should be in the chat.
[52:27] Simon Willison: Is it the…
[52:29] Hugo Bowne-Anderson: It’s the Hugging Face API.
[52:34] Simon Willison: No, I’ve not built something against this yet.
[52:38] Hugo Bowne-Anderson: This could actually be really exciting. Yeah. Because they’ve got a lot of pretty heavy-duty models that you can just ping as part of their serverless.
[52:49] Simon Willison: I don’t think anyone’s built that yet, but that would be a really good one to get going, absolutely.
[52:55] Hugo Bowne-Anderson: So if anyone’s interested in that, definitely jump in there. We do have questions around using your CLI utility for agentic workflows.
[53:07] Simon Willison: Yeah, not yet, because I haven’t done the function calling piece. Once the function calling piece is in, I think that’s going to get really interesting. And that’s also the kind of thing where I’d like to, I feel like you could explore that really by writing additional plugins, like an LLM agents plugin or something. So yeah, the other side of LLM, which isn’t as mature, is that there is a Python API. So you can pip install llm and use it from Python code.
[53:35] Simon Willison: I’m not completely happy with the interface of this yet, so I don’t tend to push people towards it. Once I’ve got that stable, once I have a 1.0 release, I think this will be a very nice sort of abstraction layer over a hundred different models, because any model that’s available through a plugin to the command line tool will be available as a plugin that you can use from Python directly. So that’s going to get super fun as well, especially in Jupyter notebooks and such like.
[53:58] Hugo Bowne-Anderson: Awesome. We actually have some questions around hardware. Would you mind sharing the system info of your Mac? Is it powerful? Is it with all the commands you demoed? Wonder if I need a new Mac? I can tell people that I’ve got an M1 from 2021. So my MacBook’s got a GPU, but it’s like three, four years old or whatever. And it runs this stuff wonderfully.
[54:20] Simon Willison: So yeah, I’m on an M2 Mac, 64 gig. I wish I’d got more RAM. At the same time, the local models, the Mistral 7Bs and such like, run flawlessly. Phi-3, absolutely fantastic model, that runs flawlessly. They don’t even gobble up that much RAM as well. The largest model I’ve run so far is Llama 3 70B, which takes about 40 gigabytes of RAM. And it’s definitely the most GPT-3.5-like local model that I’ve ever run.
[54:53] Simon Willison: I have a hunch that within a few months the Mac will be an incredible platform for running models because Apple are finally like all speed ahead on local model stuff. Their MLX library is really, really good. So it might be that in six months time, an M4 MacBook Pro with 192 gigabytes of RAM is the best machine out there. But I wouldn’t spend any money now based on future potential.
[55:18] Hugo Bowne-Anderson: Right. And I’m actually very bullish on Apple and excited about what happens in the space as well. Also, we haven’t talked about this at all, but the ability to run all this cool stuff on your cell phone is people are complaining about all types of stuff at the moment and Apple hasn’t done this. But this is wild. This is absolutely like cosmic stuff.
[55:40] Simon Willison: There is an app that absolutely everyone with a modern iPhone should try out called MLC Chat. Yeah. It straight up runs Mistral on the phone. It just works. And it’s worked for like six months. It’s absolutely incredible. I can run Mistral 7B Instruct Quantized on my iPhone. Yeah. And it’s good. I’ve used this on flights to look up Python commands and things. Yeah. That’s incredible. And yeah, Apple stuff.
[56:09] Simon Willison: It’s interesting that none of the stuff they announced the other day was actually a chatbot. You know, they’re building language model powered features that summarize and that help with copy editing and stuff. They’re not giving us a chat thing, which means that they’re not responsible for hallucinations and all of the other weird stuff that can happen, which is a really interesting design choice. I feel like Apple did such a good job of avoiding most of the traps and pitfalls and weirdnesses in the products that they announced yesterday.
[56:40] Hugo Bowne-Anderson: Totally agree. And so two more questions, we should wrap up in a second. I wonder, people have said to me, and I don’t… My answer to this, people like, hey, why do you run LLMs locally when there are so many ways to access bigger models? And one of my answers is just, like you mentioned being on a plane or in the apocalypse or that type of thing. But it’s also just for exploration to be able to do something when I’m at home to use my local stuff.
[57:14] Simon Willison: The vast majority of my real work that I do with LLMs, I use Claude 3 Opus, I use GPT-4o, I occasionally use Claude 3 Haiku. But the local models as a way of exploring the space are so fascinating. And it’s also, I feel like if you want to learn how language models work, the best way to do that is to work with the really bad ones.
[57:35] Simon Willison: Like, spending time with a crap local model that hallucinates constantly is such a good way of getting your sort of mental model of what these things are and how they work. Because when you do that, then you start saying, oh, okay, I get it. ChatGPT 3.5 is like Mistral 7B, but it’s a bit better. So it makes fewer mistakes and all of those kinds of things. But yeah, and I mean, there are plenty of very valid privacy concerns around this as well. I’ve kind of skipped those.
[58:03] Simon Willison: Most of the stuff I say to models is me working on open source code, where if there’s a leak, it doesn’t affect me at all. But yeah, I feel like… If you’re interested in understanding the world of language models, running local models is such a great way to explore them.
[58:20] Hugo Bowne-Anderson: Totally. I do have a question also around, you mentioned the eval tool that you’re slowly working on. Does it incorporate Datasette as well? Because when I want to do like at least my first round of evaluations, I’ll do it in a notebook or spin up a basic Streamlit app where I can tag things as right or wrong and then filter by those. So these are the types of things that could make sense in Datasette.
[58:41] Simon Willison: Where I want to go. So the idea with the evals tool, and it doesn’t do this yet: it should be recording the results to SQLite so that you can have like a custom Datasette interface to help you evaluate them. I want to do one of those LMSYS arena style interfaces where you can see two different responses from prompts that you’ve run evals against and click on the one that’s better, and that gets recorded in the database as well.
[59:05] Simon Willison: Like there’s so much that I could do with that because fundamentally SQLite is such a great sort of substrate to build tools on top of. Like, it’s incredibly fast, it’s free, everyone’s got it, you can use it as a file format for passing things around. Like, imagine running a bunch of evals and then schlepping a five megabyte SQLite file to your co-worker to have a look at what you’ve done. That stuff all becomes possible as well. But yeah, so that’s the ambition there. I don’t know when I’ll get to invest the time in it.
[59:35] Hugo Bowne-Anderson: Well, once again, like if people here are interested in helping out or chatting about this stuff, please do get involved. I am also interested, speaking about the SQLite database and then Datasette. So one thing that’s also nice about LM Studio is that it’ll tell you, like it does have some serious product stuff. When you run something, it’ll like give you in your GUI the latency and number of tokens and stuff like that. Could we log that stuff to SQLite and have that in there, and then do, like, serious, you know, benchmarking of different models?
[1:00:10] Simon Willison: Yep. I’ve been meaning to file this ticket for a while. Awesome. That needs to happen. Yep. I guess it’s tokens per second and total duration. Yeah, exactly. It’s going to be interesting figuring out how to best do that for models where I don’t have a good token count from them, but I can fudge it. Just the duration on its own would be a useful thing to start recording.
[1:00:41] Hugo Bowne-Anderson: Absolutely. And so there’s kind of a through line in some of these questions. Firstly, a lot of people are like, wow, this is amazing. Thank you. Misha has said Simon is a hero. Thank you. has said this is brilliant. I can’t believe you’ve done this. So that’s all super cool. I want to build on this question. Eyal says a little off topic, but how are you able to build so many amazing things? I just want to-
[1:01:07] Simon Willison: I have a blog post about that.
[1:01:09] Hugo Bowne-Anderson: Raise that as an issue on a GitHub repository?
[1:01:12] Simon Willison: Well, yeah. Here we go. I have a blog-
[1:01:15] Hugo Bowne-Anderson: It’s not the building, it’s the writing as well. So yeah, what structures do you put in your own life in order to-
[1:01:24] Simon Willison: I have a great story. Basically, this talk here is about my approach to personal projects, and really the argument I make here is that you need to write unit tests and documentation because then you can do more projects. Take my LLM evals project: I haven’t touched it in two months, but because I’ve got a decent set of issues and notes tucked away in there, I’m going to be able to pick it up again really easily.
[1:01:48] Simon Willison: And then the other trick is I only work on projects that I already know I can do quickly. I don’t have time to take on a six-month mega project, but when I look at things like LLM, I already had the expertise of working with SQLite from the Datasette stuff. I knew how to write Python command line applications. I knew how to build plugin infrastructures because I’d done that for Datasette.
[1:02:09] Simon Willison: So I was probably the person on earth most well-equipped to build a command line tool in Python that has plugins and does language model stuff and logs to SQLite. And so really that’s my sort of main trick is I’ve got a bunch of things that I know how to do, and I’m really good at spotting opportunities to combine them in a way that lets me build something really cool, but quite quickly, because I’ve got so many other things going on. Amazing. That’s the trick. It’s being selective in your projects.
[1:02:38] Hugo Bowne-Anderson: And also there are small things you do like your, well, it’s not small anymore, right? But your Today I Learned blog, and what a wonderful way to, you know, it doesn’t necessarily need to be novel stuff, right? But because it’s Today I Learned, you just quickly write up something you learned.
[1:02:52] Simon Willison: I will tell you the trick for that is every single thing that I do, I do in GitHub issues. So if I’m working on anything at all, I will fire up a GitHub issue thread in a private repo or in a public repo, and I will write notes as I figure it out. And one of my favorite examples, this is when I wanted to serve an AWS Lambda function with a function URL from a custom subdomain, which took me 77 comments all from me to figure out because, oh my God, AWS is a nightmare.
[1:03:21] Simon Willison: And in those comments, I will drop in links to things I found and screenshots of the horrifying web interfaces I have to use and all of that kind of thing. And then when I go to write up a TIL, I just copy and paste the Markdown from the issue. So most of my TILs take like 10 minutes to put together because they’re basically just the semi-structured notes I already had, copied and pasted and cleaned up a little bit. But this is in that productivity presentation I gave. This works so well.
[1:03:49] Simon Willison: It’s almost like a scientist’s notebook kind of approach where, for anything you’re doing, you write very comprehensive notes: what do I need to do next? What did I just try? What worked? What didn’t work? And you get them all in that sequence.
[1:04:03] Simon Willison: And it means that, I don’t remember a single thing about AWS Lambda now, but next time I want to solve this problem, I can come back and read through this and it’ll sort of reboot my brain to the point that I can pick up the project from where I got to.
[1:04:16] Hugo Bowne-Anderson: Awesome. I know we’re over time. There are a couple more very interesting questions. So if you’ve got a couple more minutes.
[1:04:22] Simon Willison: Yes, absolutely.
[1:04:23] Hugo Bowne-Anderson: There’s one around, have you thought about using Textual or Rich in order to make pretty output?
[1:04:32] Simon Willison: I think... does that work? Like, what’s llm logs... what’s this? llm logs, pipe, python -m rich. What’s the thing? Is it rich.markdown you can do?
[1:04:47] Hugo Bowne-Anderson: I think so, but…
[1:04:50] Simon Willison: Maybe I do. Look at that! There we go, Rich just pretty printed my Markdown. So yeah, I haven’t added Rich as a dependency because I’m very, very protective of my dependencies. I try and keep them as minimal as possible. I should do a plugin. It would be really cool if there was a… I’d need a new plugin hook, but if there was a plugin where you could install llm-rich and now LLM outputs things like that, that would be super fun. So yeah, I should do that.
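The same pretty-printing can be done from Python rather than a shell pipe. The sketch below is an assumption-laden illustration, not part of LLM: `render_markdown` is a hypothetical helper, and it treats Rich as an optional dependency (in the spirit of keeping dependencies minimal) by falling back to plain `print` when Rich is not installed. It uses Rich's `Console` and `Markdown` classes.

```python
def render_markdown(text: str) -> None:
    """Pretty-print Markdown with Rich if available, else print it unchanged."""
    try:
        # Optional dependency: only imported when the function is called.
        from rich.console import Console
        from rich.markdown import Markdown
    except ImportError:
        print(text)
        return
    Console().print(Markdown(text))


# Stand-in for a logged model response; a real one would come from the logs.
sample = "# Response\n\nRich renders *emphasis* and lists:\n\n- one\n- two\n"
render_markdown(sample)
```

A plugin along the lines Simon describes would need a hook inside LLM itself; this helper only shows the rendering step in isolation.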
[1:05:18] Hugo Bowne-Anderson: That would be super cool. And just for full transparency, occasionally when I want to have fun playing around with the command line, I muck around with tools like cowsay. So piping LLM to cowsay has been fun as well, to get cows to- Final question: Misha has a few GPU rigs and they’re wondering if there’s any idea how to run LLM with multiple models on different machines but on the same LAN.
[1:05:53] Simon Willison: I would solve that with more duct tape. I’d take advantage of existing tools that let you run the same command on multiple machines. Ansible, things like that. I think that would be the way to do that. And that’s the joy of Unix: for so many of these problems, if you’ve got a little Unix command, you can wire it together with an extra bash script and Ansible and Kubernetes and Lord only knows what else. I run LLM occasionally inside of GitHub Actions, and that works because it’s just a command line.
[1:06:22] Simon Willison: So yeah, for that, I’d look at existing tools that let you run commands in parallel on multiple machines.
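As a rough illustration of the "duct tape" approach, the sketch below fans one command out to several machines in parallel over `ssh` using only the Python standard library. It is not Ansible and not a feature of LLM; the host names and model name are hypothetical, and it assumes `llm` is installed on each machine and that key-based `ssh` access already works.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_command(host, remote_command):
    """Build the argv that runs remote_command (a list of args) on host via ssh."""
    return ["ssh", host, shlex.join(remote_command)]


def run_on_hosts(hosts, remote_command, build=ssh_command):
    """Run the same command on every host in parallel.

    Returns a list of (host, exit_code, stdout) tuples in the order of hosts.
    The build parameter exists so the transport can be swapped out.
    """
    def run_one(host):
        proc = subprocess.run(
            build(host, remote_command), capture_output=True, text=True
        )
        return host, proc.returncode, proc.stdout.strip()

    with ThreadPoolExecutor(max_workers=max(len(hosts), 1)) as pool:
        return list(pool.map(run_one, hosts))


# Hypothetical hosts and model name, shown only to illustrate the call shape:
# results = run_on_hosts(
#     ["gpu-rig-1", "gpu-rig-2"],
#     ["llm", "-m", "some-local-model", "Ten names for a pet pelican"],
# )
# for host, code, output in results:
#     print(host, code, output)
```

Threads are enough here because each worker just waits on a subprocess; for anything beyond a handful of machines, a purpose-built tool such as Ansible is probably the better fit.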
[1:06:30] Hugo Bowne-Anderson: Amazing. So everyone, next step, pip install llm or pipx install llm and let us know on Discord how you go. I just want to thank everyone for joining once again. And thank you, Simon, for all of your expertise and wisdom. And it’s always fun to chat.
[1:06:46] Simon Willison: Thanks for having me. And I will drop a Markdown document with all of the links and demos and things in Discord at some point in the next six hours. So I’ll drop that into the Discord channel.
[1:06:58] Hugo Bowne-Anderson: Fantastic. All right. See you all on Discord. And thanks once again, Simon.
