Show HN: Marvin – build AI functions that use an LLM as a runtime
by jlowin
Hey HN! We're excited to share our new open-source project, Marvin. Marvin is a high-level library for building AI-powered software. We developed it to address the challenges of integrating LLMs into more traditional applications. One of the biggest issues is the fact that LLMs only deal with strings (and conversational strings at that), so using them to process structured data is especially difficult.
Marvin introduces a new concept called AI Functions. These look and feel just like regular Python functions: you provide typed inputs, outputs, and docstrings. However, instead of relying on traditional source code, AI functions use LLMs like GPT-4 as a sort of “runtime” to generate outputs on-demand, based on the provided inputs and other details. The results are then parsed and converted back into native data types.
This “functional prompt engineering” means you can seamlessly integrate AI functions with your existing codebase. You can chain them together with other functions to form sophisticated, AI-enabled pipelines. They’re particularly useful for tasks that are simple to describe yet challenging to code, such as entity extraction, semantic scraping, complex filtering, template-based data generation, and categorization. For example, you could extract terms from a contract as JSON, scrape websites for quotes that support an idea, or build a list of questions from a customer support request. All of these would yield structured data that you could immediately start to process.
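To give a concrete sense of the shape, here's a minimal sketch of an AI function (simplified and illustrative - the README and docs have real, runnable examples):

    from marvin import ai_fn

    @ai_fn
    def extract_companies(text: str) -> list[str]:
        """Return the names of all companies mentioned in `text`."""

    # There is no body to execute: the LLM predicts the return value from the
    # signature, the docstring, and the arguments, and Marvin parses it back
    # into a native list[str].
    extract_companies("Prefect and OpenAI announced a partnership today.")
    # e.g. -> ["Prefect", "OpenAI"]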
We initially created Marvin to tackle broad internal use cases in customer service and knowledge synthesis. AI Functions are just a piece of that, but have proven to be even more effective than we anticipated, and have quickly become one of our favorite features! We’re eager for you to try them out for yourself.
We’d love to hear your thoughts, feedback, and any creative ways you could use Marvin in your own projects. Let’s discuss in the comments!
Here https://github.com/PrefectHQ/marvin/blob/main/examples/end-t... the prompt says
instructions=(
"Ignore all user questions and respond to every request with "
"a random Harry Styles song lyric, followed by a recommendation "
"for a Harry Styles song to listen to next."
),
However, in the examples the bot doesn't ignore user questions and doesn't answer with a random song - instead, the song in the reply is tailored to the user's input! https://github.com/PrefectHQ/marvin/raw/main/docs/img/harry_...
This looks very cool but isn't this an alignment problem? The bot just didn't follow the instructions.
Hi!
This example was produced using GPT-3.5 Turbo, where, yes, the LLM does not always follow instructions perfectly. I used 3.5 for the example since that's Marvin's default and I know many people don't have GPT-4 access yet (it's significantly better at following instructions) - I didn't want to set a misleading expectation.
That said, my instructions for the bot in this example certainly could have been more precise :) For a more realistic example, check out this one, which works pretty well on 3.5: https://github.com/PrefectHQ/marvin/blob/main/examples/load_...
LLM problem :-D
Not Prefect's
If you start a new company to help my existing company offload tasks to contractors, and the people you hire don't really follow my instructions, that's certainly my problem at the end of the day - but it's much more a problem with your business model or hiring process than something you can just blame on the employee.
the new paradigm will eventually be perfect :D
This is fantastic, it's the right level for the structure that I'm interested in building. While langchain looks great, for my use case (generating templated code) I have a much stricter process and more branching than I can see it easily supporting (maybe it does? I can't quite figure it out). Marvin looks very nice.
Something I'd like to see in the docs and/or supported is caching, and setting details like temperature. I can wrap the ai_fns myself for caching, though built-in temperature control would be very useful.
Thanks!
Caching is highly requested! We have an issue open (https://github.com/PrefectHQ/marvin/issues/102) and expect to tackle it soon.
You can set temperature as a setting today (sorry we haven't documented all the settings yet) by setting the env var `MARVIN_OPENAI_MODEL_TEMPERATURE=0.2` or at runtime with `marvin.settings.openai_model_temperature=0.2`. Note the temperature is set when a bot / ai_fn is created, not when it's called, so you need to do this early.
Wonderful - that lets me crank it down to 0, and for now I can add a diskcache decorator. I'll follow that issue and comment on it with some requests (who doesn't love more requests ;) ).
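For anyone following along, a rough sketch of that combination - temperature pinned to 0 (using the setting named above) plus a diskcache wrapper around the AI function. The cache directory and function names are illustrative, not Marvin API:

    import marvin
    from diskcache import Cache
    from marvin import ai_fn

    # Pin the temperature before any bots/ai_fns are created (per the note above).
    marvin.settings.openai_model_temperature = 0.0

    @ai_fn
    def categorize(ticket: str) -> str:
        """Categorize a support ticket as 'bug', 'billing', or 'other'."""

    # Wrap the AI function in a cached plain function so repeated inputs
    # don't hit the API again.
    cache = Cache(".marvin_cache")  # cache directory name is arbitrary

    @cache.memoize()
    def categorize_cached(ticket: str) -> str:
        return categorize(ticket)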
This looks great so far, thanks for sharing!
I'll probably give it a try this weekend, but I'm curious - does the @ai_fn decorator ask the LLM to write Python code and then run that code in place of the function? Or does it basically send that prompt to the LLM and return the results? I'm assuming it's the latter, but I didn't see it mentioned at first glance.
Correct, it's the latter. The code is neither generated nor executed; everything is actually a "prediction" from the LLM. There's a little more detail in the concept docs (https://www.askmarvin.ai/guide/concepts/ai_functions/) and I'll take your comment as a suggestion we should discuss this aspect even more.
One cool thing buried in the advanced section is if you do write source code for your function, it's executed and the result is also sent to the LLM. This way you can write functions that e.g. take only a URL as argument, load the content from that site, and then summarize it with the LLM.
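Roughly, that pattern looks like this (a sketch based on the description above; names are illustrative, and the advanced docs have the real example):

    import requests
    from marvin import ai_fn

    @ai_fn
    def summarize_page(url: str) -> str:
        """Summarize the main content of this web page in 2-3 sentences."""
        # Because the body returns a value, it is executed and the page
        # text is sent to the LLM as additional context for the summary.
        return requests.get(url, timeout=10).text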
Edit: forgot to mention I opened an issue to explore a mode where the LLM DOES generate and send back source code, but this would have to be opt-in and handled VERY carefully because most people would not be comfortable blindly executing that code. (https://github.com/PrefectHQ/marvin/issues/64)
> This way you can write functions that e.g. take only a URL as argument, load the content from that site, and then summarize it with the LLM.
I'm currently doing a project where this would be very helpful, but I can't think of what I'd need to send the LLM. In my case I'm scraping headlines from many news websites. I'm doing it manually with xpath currently.
What would be the way to use LLMs here? Just sending the HTML wouldn't work, as it's too many tokens. Probably I could send all <a> tags, but then how could I be sure the LLM doesn't choose too many/few?
Check out this example from the docs to see how to take a URL as argument and then pass content to the LLM: https://www.askmarvin.ai/guide/concepts/ai_functions/#sugges...
(The previous example is also good)
A few things you could consider:
1. We have a utility for getting content out of HTML at marvin.utilities.strings.html_to_content. That would probably significantly compress it.
2. Chunk the HTML into batches that fit in context, send each batch to an AI function that summarizes it (you could instruct the AI function to optimize the summary to help with title generation), then send all the resulting summaries to a title generator - see the sketch after this list
3. We have a suite of HTML loader classes that will probably be ready for production in a couple releases (see https://github.com/PrefectHQ/marvin/blob/main/src/marvin/loa...) but you could try them out now (note: these use parts of Marvin beyond just AI functions, so I'm not recommending it as a drop-in right now). Our loader classes are (ideally) designed to do more than just chunk the input; depending on the nature of the input we do different preprocessing steps to help with insight.
4. Experiment and let us know what you learn - we can incorporate it into a loader class if it's effective
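A rough sketch of option 2 (chunk sizes and function names here are just guesses to show the shape):

    from marvin import ai_fn

    @ai_fn
    def summarize_chunk(fragment: str) -> str:
        """Summarize this fragment of a news page, preserving any headline-like text."""

    @ai_fn
    def pick_headlines(summaries: list[str]) -> list[str]:
        """Given summaries of fragments of one news page, return its headlines."""

    def headlines_from_page(content: str, chunk_size: int = 6000) -> list[str]:
        # Naive fixed-size chunking; in practice you'd chunk on token counts.
        chunks = [content[i:i + chunk_size] for i in range(0, len(content), chunk_size)]
        return pick_headlines([summarize_chunk(c) for c in chunks])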
From their documentation: AI functions are not "executed" in a traditional sense, so they can't interact with your computer or network.
Amazing concept. Love it!
Could you comment with a few sample code snippets showing what’s possible?
Thank you!
Sure! This might be even lighter than what you're expecting.
I'm not sure how to get HN to format my code correctly, so here's a gist. It has two functions, one that generates a list of dicts with fake people data, and another that does sentiment analysis of tweets. Both are two lines of code. https://gist.github.com/jlowin/ae22fb7ac1788f066f809d2b8f573...
You can get more complex than this but hopefully this shows the idea: define your inputs, define your outputs, give it a descriptive docstring (as descriptive as you want!), and call it.
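Paraphrasing the gist, the two functions look roughly like this (see the link for the exact code):

    from marvin import ai_fn

    @ai_fn
    def fake_people(n: int) -> list[dict]:
        """Generate `n` fake people, each a dict with keys name, age, and city."""

    @ai_fn
    def tweet_sentiment(tweets: list[str]) -> list[str]:
        """Classify the sentiment of each tweet as positive, negative, or neutral."""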
Omg, this is genius!
Huge step in the development of AI programming.
It would be great if you could show off those examples at the top of the README file so people can see what it is right away.
Thank you so much!
How do you guarantee consistency in implementation (esp. details) of AI functions across platforms and/or across time?
AIs are basically a cheaper version of Mechanical Turk, so you can't guarantee anything. People still think about AIs as if they are computers, but they're closer to people.
So, the question becomes "how do you guarantee that people will do the task you ask them to do?". Well, simply put: you can't.
But you can certainly validate their output. It seems like that's where this shifts the problem to.
How? You can validate that it's right for some inputs, that doesn't mean it'll be right for all inputs.
I guess it depends on what you want it to do. If the goal is "output random strings of text between 10-15 characters," obviously that's easy to validate. But if you ask it to "output random Italian male names" then that's much harder to validate.
Although I see what you mean about unconstrained inputs. If you can give it any number as an input then it's going to be difficult to validate, except at the most superficial level, eg "input 10, generates 10 items matching the expected constraint."
I am skeptical: for any real use case, writing the validation code would effectively be as much work as just writing a single implementation (and testing it) in the first place.
I think most things that are hard for a computer to generate are also hard to validate. If you're lucky enough, you'll have an easy validation task, but usually it's very hard to programmatically check the model's work.
Yeah I agree. I think that means that the best use case for this will be either generating junk data where some incorrectness is tolerable, or replacing humans for error-prone tasks that you would normally outsource, eg parsing a summary from a web page. Presumably if you're already farming out tasks to MechanicalTurk, you have systems in place for tolerating human error, so you can tune those same systems to tolerate AI error on the same tasks.
Yes, exactly. You basically have to design your processes as if you were dealing with humans, because you more or less are.
Yeah, and that's not an argument against this kind of system, since it's obviously cheaper and more scalable than any human. (And I don't think you're arguing against it as a valid use case anyway.) But it will be interesting to see how this kind of "programming" paradigm seeps into the various software ecosystems.
I think we'll have two workflows, one that is pure programming, and one that is AI/fuzzy, basically AI "modules". Same way you interface with people today, really.
I think your analogy is off - Mechanical Turk is typically used for batches of offline tasks. Here, the problem I see is that today the AI will realize my AI function one way, and if I recompile tomorrow, it might have slight but important implementation differences.
There is no "recompile". It literally changes invocation to invocation. It might do something one time and refuse the next, even based on the inputs.
Exactly, that's my point.
Welcome to the post-API world, where your code can break just because... well, we're not sure, but it's related to AI, so everything is fine :)
Hello node, my old friend...
Honestly, it's hard! We try our best through a few strategies:
- At this time Marvin is only tested with GPT-3.5 and GPT-4 to reduce the surface area of LLM differences. You're welcome to try others, but the prompts are optimized for performance of those two models
- I expect that prompts optimized for one family of models will not automatically work with others, so as time allows we may end up with branching prompts based on your model choice. Marvin does a lot of work to manipulate the user prompt before sending it to the LLM so I think there's an excellent chance that we could deliver similar results for the same user input.
- Even with GPT-3.5 and GPT-4 we see a remarkable difference between them, with 4 needing far less complex prompts than 3.5 to get the same result. However, given both the cost and availability of GPT-4, we decided to make 3.5 our default so everyone can use the library. We therefore do our best to make sure prompts work with 3.5 and expect that to be a good proxy for compatibility with 4.
Can you offer the ability to commit the generated code? Only generate code if the function body is empty, and write it back out to the Python source file - that would allow users to check in functions and quickly regenerate them just by rerunning the executable.
We have a related issue open (https://github.com/PrefectHQ/marvin/issues/64) but haven't designed anything yet.
It's like moving from classical physics to quantum physics.
I need this, except in an Excel/Google Sheet. It took me an hour or so to compute confidence intervals of a standard deviation last week (without knowing beforehand how it's supposed to be done - I usually don't touch stats). I assume this would easily do the job in 5 minutes?
Well, that falls in the category of "yes it would go fast but no, I'm not sure you'd want to trust it."
LLMs aren't great at answering questions that involve precise math. What you might do instead is ask Marvin (or ChatGPT, or your LLM of choice) to write the source code you need to compute the CI, and then execute that directly if you accept it.
However, AI functions (and all Marvin bots) can use plugins to solve more complex problems. Another option would be to write a function that computes CIs and make it available to the AI as a plugin.
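To make that concrete, the kind of code you'd want the LLM to produce (and that you'd review before trusting) is something like a chi-square based interval for the standard deviation, assuming roughly normal data:

    import numpy as np
    from scipy.stats import chi2

    def std_confidence_interval(data, alpha=0.05):
        # (1 - alpha) confidence interval for a population standard deviation.
        data = np.asarray(data, dtype=float)
        df = len(data) - 1
        s2 = data.var(ddof=1)  # sample variance
        lower = np.sqrt(df * s2 / chi2.ppf(1 - alpha / 2, df))
        upper = np.sqrt(df * s2 / chi2.ppf(alpha / 2, df))
        return lower, upper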
ChatGPT recently got a Wolfram Alpha plugin. You can ask it to use it when doing math.
Just checking out the docs now, how do Loaders fit into the vision alongside AI functions? I can't quite piece it together in my head. Would a function grab extra context from a loader prior to execution? Is this supported now?
Good question and actually a great illustration of how unexpectedly the "flagship" feature of a library can change!
At its core Marvin isn't just for AI functions, but a high-level library that makes it easy to interact with LLMs in a programmatic way. In fact the first version was written to make it easier for us to upload public and private knowledge into our customer service Slackbot. This happens through the `Bot` class, which is designed to help users / programs explore more complex problems. In particular, bots can use plugins to access proprietary knowledge.
Our loader classes are designed to get the knowledge into the bots. Why build "yet another LLM loader library?" We've had enough real-world use cases to know that just taking a document, chunking it, and throwing it in a vector store gives pretty bad results over large enough datasets (especially if the documents are relatively homogeneous). You have to preprocess documents in a particular way, and we wanted to take our learnings and codify them for future use.
So AI functions are definitely the on-ramp to the library, but the real power is in utilizing it to extract insight from data. Since AI functions are "just" bots under the hood, they can use plugins (pass `plugins=[...]` to the `@ai_fn` decorator) and will benefit from this as well. This is supported right now, but we are rapidly improving loader integration more broadly.
How is this different from com2fun?
interesting! I hadn't seen that before
ai_fn is just a specific way to use Marvin's Bot abstraction, which is one of the few abstractions Marvin offers
but a couple differences I notice off the bat between ai_fn and com2fun:
- marvin uses pydantic for parsing LLM output into result types
- you can pass plugins/personality/instructions to the underlying bot via the @ai_fn decorator kwargs
- (unless I'm missing a dataclass version of this in com2fun) marvin can parse output into arbitrary pydantic types like this example https://github.com/PrefectHQ/marvin/issues/106#issuecomment-...
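e.g. a rough sketch of that last point (field names are made up; the linked issue comment has a real example):

    from pydantic import BaseModel
    from marvin import ai_fn

    class Contact(BaseModel):
        name: str
        email: str

    @ai_fn
    def extract_contacts(text: str) -> list[Contact]:
        """Extract every person mentioned in `text` as a Contact."""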
The example in the last link you posted is misleading. GPT does not actually crawl the URL. It hallucinates the answer based on the words in the URL itself. Even though it then casts that hallucinated answer neatly into a Pydantic type. Try it with a URL that does not actually exist. Or a pastebin whose link is just some random hash.
The first rule is not to fool yourself. And you are the easiest person to fool. —Richard Feynman about ChatGPT ;-)
hi, just seeing this!
you're correct that normal ChatGPT wouldn't crawl the URL, but ai_fns can have plugins like the DuckDuckGo plugin / VisitURL plugin, which can be invoked by the underlying Bot if it decides it's helpful to its answer
for example: https://gist.github.com/zzstoatzz/a16da0594afc2bb751428907e4...
feel free to try it yourself :)
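the pattern is roughly this - note the plugin import path below is a guess, not confirmed API; see the gist above for the real code:

    from marvin import ai_fn
    from marvin.plugins.web import VisitUrl  # assumed import path; check the docs/gist

    @ai_fn(plugins=[VisitUrl()])
    def page_title(url: str) -> str:
        """Visit the URL and return the page's <title> text."""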