hckrnws
AI Report #4: AutoGPT And Open-source lags behind Part 2
by primordialsoup
Has anyone actually done anything of use with AutoGPT? I tried it on 4 tasks and it would inevitably get stuck on each and produce absolutely nothing of value. These were fairly simple tasks, like researching popular topics and writing articles on them.
It would do things like Google something, find the result wasn’t relevant, try again, get an error from one of the pages, and then seemingly start doing something completely incoherent related to the error message.
It is likely we can do better than 1:1 human input to GPT output on current tech; but the human in the loop is doing a lot of work very easily that the LLM is very bad at, just like the LLM is doing a lot of work very easily that is otherwise laborious for the human. We can't just take the things that the LLM is bad at and humans do easily and expect to fix it with more LLM.
Right now we have:
Step 1: Human reasoning, tool use, input.
Step 2: LLM output.
Step 3: Human reasoning, tool use, input.
Step 4: LLM output.
&etc.
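The alternation above can be sketched in a few lines. This is a minimal, hypothetical illustration, not how any particular framework works; `llm` is a placeholder for a real model call, and the human side is passed in as a callback.

```python
# Placeholder for a real completion call (e.g. an API request).
def llm(prompt: str) -> str:
    return f"draft for: {prompt.splitlines()[0]}"

def human_in_the_loop(task: str, get_feedback, max_rounds: int = 4) -> str:
    output = llm(task)  # step 2: LLM output
    for _ in range(max_rounds):
        # steps 1 and 3: the human vets the output, uses tools, and
        # either accepts it (empty feedback) or steers the next round
        feedback = get_feedback(output)
        if not feedback:
            break
        output = llm(f"{task}\nPrevious attempt: {output}\nFix: {feedback}")
    return output
```

The point of writing it out is that `get_feedback` is where the human's reasoning and tool use lives, and it is the part the "agent" frameworks are trying to replace.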
The observation that the input and output are both just text makes it possible to build "agents". But the "agent" movement is trying to close the whole loop far too early.
It's fine to lay the groundwork though, and the frameworks for it, like AutoGPT, can be used to just do a couple extra steps rather than close the whole loop.
Plugins and browsing can be seen as merging some of step 2 and 3. But then you still need the &etc iteration with the human closely in the loop.
Chain of thought prompting techniques are similarly an attempt to merge a little bit of the human's process of vetting the output by trying to get better output in individual iterations. Sometimes I make the LLM output multiple options and pick the best one with its reasoning; this is really just compressing multiple runs of the LLM and having it pick one, rather than me retrying if I get a bad output.
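That "generate multiple options, pick the best" trick can be sketched as follows. This is a hedged illustration of the pattern, not any specific library's API; `llm` stands in for a real completion call, and the digit-parsing fallback is one arbitrary way to handle a malformed pick.

```python
# Placeholder for a real model call: echoes the last line of the prompt.
def llm(prompt: str) -> str:
    return prompt.splitlines()[-1]

def best_of_n(task: str, n: int = 3) -> str:
    # one run per candidate instead of the human retrying on a bad output
    options = [llm(f"{task} (variant {i})") for i in range(n)]
    numbered = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options))
    choice = llm(
        f"Task: {task}\nOptions:\n{numbered}\n"
        "Reply with only the number of the best option."
    )
    digits = "".join(c for c in choice if c.isdigit())
    idx = int(digits) if digits else 0  # fall back to the first candidate
    return options[min(idx, n - 1)]
```

As the comment says, this just compresses several human-retry iterations into one call sequence.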
Anyway I think this is the right way to look at it; these are good tools for trying to compress iterations of human-in-the-loop. For some things maybe we'll eventually remove the human, but we shouldn't expect it right now. The Twitter demonstrations of "it did the whole thing" are a trick; good for influencers, but not realistic right now.
Right - programming is how you get computers to do things. AI isn’t magic
Very well put, thanks for laying it out so clearly.
In my experience, AutoGPT is limited primarily by the poor state of its tools. For instance, browsing web pages often does not return relevant text that a human would pick out of the same page content. GPT-4 makes very good plans of what it should do, but the tools fail to give it what a human would receive.
For example, when asked to search for the top executives at company X, it rightly uses Google Search with the query “top executives at company X,” which returns a list of web pages such as the company’s About page. It then parses the About page but because of messed up page formatting, it returns nonsense data like the LinkedIn profile URL and some marketing material like a case study link, even though the executive profiles are right there.
The google search function is also limited. For comparison, SerpAPI masterfully scrapes Google Search using a proxy network and very intelligent parsing. In experiments using SerpAPI in combination with Microsoft’s guidance module, I got much farther than AutoGPT.
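One plausible shape for that: make the live SerpApi call (which needs an API key, so it is commented out here) and flatten the structured response into clean text the agent can actually read. The helper below only reshapes the response dict, so it runs on any sample payload; the field names (`organic_results`, `title`, `link`, `snippet`) follow SerpApi's Google Search response format.

```python
# Flatten a SerpApi-style search response into agent-friendly text.
def summarize_results(response: dict, limit: int = 5) -> str:
    lines = []
    for r in response.get("organic_results", [])[:limit]:
        title, link = r.get("title", ""), r.get("link", "")
        snippet = r.get("snippet", "")
        lines.append(f"{title} - {link}\n  {snippet}")
    return "\n".join(lines)

# Live usage (requires `pip install google-search-results` and an API key):
# from serpapi import GoogleSearch
# raw = GoogleSearch({"q": "top executives at company X",
#                     "api_key": "YOUR_KEY"}).get_dict()
# print(summarize_results(raw))
```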
Fortunately, a lot of people are contributing to AutoGPT now and it is improving quickly. They are revamping the core right now and I expect it will work far better when they are done. With time, better tools will be made available to GPT-4 and progress should then be faster.
> The google search function is also limited. For comparison, SerpAPI masterfully scrapes Google Search using a proxy network and very intelligent parsing. In experiments using SerpAPI in combination with Microsoft’s guidance module, I got much farther than AutoGPT.
Thanks for your kind words. We are working on SerpApi integration for Auto-GPT: https://github.com/serpapi/public-roadmap/issues/905
> In experiments using SerpAPI in combination with Microsoft’s guidance module, I got much farther than AutoGPT.
would love to see how you implemented this with guidance. Did you use GPT4?
The ratio of 45 second “Twitter video demos” vs. examples of actual code/prompts/real world use cases you can replicate is quite striking. Dipping into related discords, I feel like I’m always missing something obvious because there is so much activity but what feels like to me so little replicable substance. I’m a terrible coder so I partially chalk it up to that but it definitely seems like it’s hitting the current boundaries of a parrot echoing itself into gibberish.
Can you help me understand AutoGPT? Is it just a recursive GPT, where an initial prompt is given and the output can be piped into further GPT prompts? Am I missing something?
I tried doing more complex tasks using GPT4 and was initially optimistic about plugins but they have all been very disappointing.
For instance, a dream for me would be something like: "Find some rental property opportunities within a 1 hour commute to New York City, that have a high rent-to-sale-price ratio and low taxes"
Broken down into steps it would be:
1. Find towns within 50 miles or so of Manhattan. Take the top 100 or so by population
2. Find commute times for each one leaving at 9am Monday and coming back at 6pm. Narrow the list down to towns within 1 hour. Unfortunately I didn't see any map plugins, but maybe something like Wolfram Alpha can suffice, or just Google the commute time for each town
3. Use Zillow to pull typical rents and sale prices for each town. Build a simple model (maybe Wolfram) to model the rent and apply it to homes for sale including taxes. Calculate median expected rent / median sale price
4. Remove towns that you don't have enough data on (not enough rentals or homes for sale) and return the top towns with a few examples of how much you can get
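The deterministic core of those steps is just a filter-and-rank pipeline. The sketch below uses stub data in place of real map/Zillow APIs (every field name here is hypothetical); the point is that the "model" in step 3 can be a one-line yield ratio, and the hard part the agent keeps failing at is fetching the data reliably.

```python
# Steps 2-4 over pre-fetched town data: filter by commute and data
# sufficiency, then rank by a simple annual-rent-over-cost yield ratio.
def rank_towns(towns, max_commute_min=60, min_listings=5):
    candidates = []
    for t in towns:
        if t["commute_min"] > max_commute_min:          # step 2: commute filter
            continue
        if t["rentals"] < min_listings or t["sales"] < min_listings:
            continue                                     # step 4: not enough data
        # step 3: yearly rent divided by sale price plus one year of taxes
        ratio = (t["median_rent"] * 12) / (t["median_price"] + t["annual_tax"])
        candidates.append((ratio, t["name"]))
    return [name for _, name in sorted(candidates, reverse=True)]
```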
If I were building something like AutoGPT, I would start with an example like this and use it almost like an integration test. Theoretically all the pieces are there, but it just falls apart very quickly. I've heard these models can't yet do "planning" and I'm not sure what that means technically, but I think this kind of problem requires planning, so it might be a model limitation
This is exactly the promise, and there are a number of handwavey demos, and it feels like it should be easy enough to have something exactly like this as a "hello world," but I haven't seen any of the AutoGPT types that can reliably execute even a basic version of this. As others have mentioned, a little scaffolding custom-built for the project can work great, but having GPT build that scaffolding isn't there, as far as I can see (I think a lot of people could benefit from a step-by-step of the parent's use case as a proof of concept).
The search plugin is bad. I have done some exploratory work [1] on specializing search agents to understand the query and result syntax and use that to structure results but I started with an explicitly structured data source to prove the concept and then kinda wandered off.
It was my hypothesis that the variety of trash returned by things like SerpApi needs to be massaged into something consistent, and potentially run through a result retrieval and fine-tuning stage, to be useful to a high-level agent like AutoGPT, but I didn't make it far enough to have anything working to show.
The agent loop is too frail, and too prone to lose focus, especially if you also want to combine it with some sort of persistent chat context. Hardcoding the control loop instead of letting the AI figure it out was the only way I've found to extract actual work out of GPT, and it works, to an extent, with smaller models as well.
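A hardcoded control loop here means the program owns the plan and the model only fills in each step, rather than the model deciding what to do next. A minimal sketch of that shape (with `llm` as a placeholder for a real model call, and an invented three-step plan):

```python
# Placeholder for a real model call.
def llm(prompt: str) -> str:
    return f"[result of: {prompt}]"

def run_fixed_plan(topic: str) -> str:
    # The plan is fixed in code; the model cannot wander off it.
    steps = [
        f"List three subtopics of {topic}.",
        "Draft one paragraph for each subtopic.",
        "Combine the paragraphs into a short article.",
    ]
    context = ""
    for step in steps:
        # Each step sees the accumulated context but not the control flow.
        context = llm(f"{step}\nContext so far:\n{context}")
    return context
```

Trading autonomy for reliability this way also explains why it degrades gracefully with smaller models: they only ever have to handle one narrow step at a time.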
Yeah I played around with it for a few days to improve my python code, and narrowed the command space enough for it to be helpful 50% of the time
here's the video: https://www.loom.com/share/5e83475be2464778950f7df7e209ac2d
Were you using it with gpt3 or 4? Personally, I can’t get gpt4 api access despite being a PI at a well known research institution with a project that would be great publicity for OpenAI. My theory is that people are running it mostly with 3 and then saying it’s useless. It definitely is useless with 3.
I've used GPT-4, and never got it to really do anything useful.
Same here, tried both autoGPT and autoGPT.js with GPT-4, kept failing at even the simplest of tasks.
Same experience here, I really wanted it to work but it often got stuck in errors or infinite loops. Hoping it'll improve in the next few months.
Same here with GPT-4 via ChatGPT Plus
Are you an OpenAI user who has a valid payment method attached to your account? I've noticed they approve GPT-4 access as long as you have a valid payment method and a reasonable justification, which in my case was a one-liner.
There's a waiting list to use the GPT-4 API. https://openai.com/waitlist/gpt-4-api
Anecdotally (and unsurprisingly), they seem to be prioritising those with "value-add" use cases in a variety of industries over individuals just wanting to play.
A waiting list that gave preferential treatment to YC companies, that is. This is according to a recent lawsuit that was filed against OpenAI.
Yes. I think LangChain fills in a lot of the problems and it becomes more like programming. So instead of having it “reason” about the results and plans just code those in and sometimes call a LLM when it makes sense.
Right now it’s extremely chaotic since there’s no human correction in the process so the errors compound until you quickly reach incoherence
Yes I am
That about matches my experience. It's a neat project but that's about it.
It’s a neat version 0.0001 of agentic AI. I’m sure one day it’ll be useful but not for a few years at least
I think it's safe to say that if people were getting value out of this then we would be hearing about it a lot
To all the AI hucksters - there was a decades long effort already tried, known as "5th generation programming languages" where they assumed the next level after high level programming languages would be to remove the human programmer completely.
basically, this effort ended up failing because, well, problem solving itself is inherently complex.
https://en.wikipedia.org/wiki/Fifth-generation_programming_l...
seems like the exact same thing happened with ChatGPT / AutoGPT / GPT4, and this will keep happening.
> seems like the exact same thing happened with ChatGPT/ AutoGPT/ GPT4, and this will keep happening.
A bit early to call it, by far.
Serious development around these capabilities has only just gotten off the ground.
It also doesn't seem like LLMs are done improving.
I have never built an agent before, nor am I knowledgeable about the latest studies in this field. So what I am saying below is likely to be nonsensical.
I was thinking that perhaps we have been working with abstractions that are too low-level. Instead of providing a set of tools such as API calls or text splitters, wouldn't it be more reliable to give agents templates or workflows of successful tasks, such as trimming videos or booking restaurants?
These templates would consist of a set of function calls, or a graph of connected components in low-code tools like LangFlow. I believe auto agents already use a similar concept where they cache successful tasks for future reuse. The idea is to populate these caches with the most common use cases, and use retrieval if they become too large, so that we don't experience cache-miss most of the time and work with lower-level abstractions (tools) as the baseline. Templates, like prompts, should be portable (e.g. JSON) to avoid the need for everyone to reinvent the wheel. While this solution may not be as impressive as a full autonomous agent and may not work for a generalized case, it should produce a more predictable outcome, I think.
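The cache-with-retrieval idea sketches out naturally: store successful workflows keyed by a task description, retrieve the closest match, and fall back to low-level tools on a miss. Everything below is hypothetical (the template names, the use of fuzzy string matching as a stand-in for real retrieval), but it shows the intended control flow.

```python
import difflib

# Portable templates: a task maps to an ordered list of tool-call names.
TEMPLATES = {
    "trim a video": ["load_video", "select_range", "export_clip"],
    "book a restaurant": ["search_restaurants", "check_availability", "reserve"],
}

def get_workflow(task: str, cutoff: float = 0.6):
    # Fuzzy match stands in for proper embedding-based retrieval.
    match = difflib.get_close_matches(task, TEMPLATES, n=1, cutoff=cutoff)
    if match:
        return TEMPLATES[match[0]]  # cache hit: replay the known workflow
    return None                      # cache miss: fall back to raw tools
```

Because templates are plain data (easily serialized to JSON), they could be shared between agents, which is the portability point the comment makes.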
In my experience, a GPT 3.5 or 4 agent has trouble accomplishing anything if you make too many APIs available. Using a completion to narrow down the list of options makes the entire exchange very slow. There is also a compounding chance of failure with multi-stage strategies; the “agent” may get stuck responding in the “wrong” way and burn up even more time with error recovery.
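One mitigation for the too-many-tools problem is the two-stage pattern the comment alludes to: spend one cheap completion narrowing the tool list, then build the agent prompt from only those tools. A sketch, with `llm` as a placeholder and an invented toy tool registry:

```python
# Placeholder for a real model call; pretend it names the relevant tools.
def llm(prompt: str) -> str:
    return "search, calculator"

TOOLS = {
    "search": "web search",
    "calculator": "arithmetic",
    "email": "send mail",
    "calendar": "schedule events",
}

def shortlist_tools(task: str) -> dict:
    # Stage 1: one completion picks the relevant subset of tools.
    names = llm(
        f"Task: {task}\nAvailable tools: {', '.join(TOOLS)}\n"
        "Name only the tools needed, comma-separated."
    )
    picked = {n.strip() for n in names.split(",")}
    # Stage 2 would prompt the agent with only this subset.
    return {k: v for k, v in TOOLS.items() if k in picked}
```

The trade-off the comment identifies still holds: the extra narrowing call adds latency, and a wrong shortlist is one more place for the multi-stage chain to fail.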
I know a big part of the discussion around this link is whether AutoGPT is viable or not, but I think a more interesting piece is the papers linked at the bottom (especially the "Tool Maker" one). This pattern of "well, maybe the LLM can do that too" is just so obvious but so meta and simultaneously brittle. "LLM can use tools" -> "The tools can, themselves, be LLM-based" -> "The LLM can create its own tools that it then uses".
It's really turtles all the way down.
Replace LLM with people and it makes a lot of sense.
The amount of hype around AutoGPT is probably good for other people building similar things though. It allows people to build without all the grifter attention that AutoGPT gets.
I would say the same about GPT and LLMs in general. For me, it’s been nothing but garbage in, garbage out and a massive waste of my time, so far. Hopefully they’ll figure out how to do symbolic computation in the future so I can finally trust anything that it spits out.
Feels like this newsletter is actually written by an LLM. It's full of repetition and very flat assertions stitched together with high-intensity random connectors: "Make no mistake [something incredibly common and already expressed here]", "Our own take on this is [another super cliche platitude here]".
I don't like how this newsletter uses "we/our" terminology - "Our personal take is that it might be beneficial to build agents that excel at some really well-defined and small-scale tasks" - without (as far as I can see) ever saying who "we" are.