I was served with papers yesterday from the Silverman folks. No idea what they could possibly want, since books3 is open source. https://twitter.com/theshawwn/status/1704559992135717238?s=4...
I think we’re witnessing the death of open source AI. The logical outcome of this is that only large companies will be able to acquire and use the training data necessary to compete with ChatGPT.
So anyone who thinks otherwise will have to answer the question: how are we going to make any datasets?
It's tempting to think that we can pull together data out of non-copyrighted work. But there isn't enough data. That would mean the model has no knowledge of any books.
It will take an act of Congress or the Supreme Court to decide the matter one way or another. But copyright is not universal law.
Japan, for one, has thrown the fight - they stand to gain more by harboring new technology than ensuring royalties are paid to publishers of the last century.
https://web.archive.org/web/20230607091817/https://technoman...
discussion: https://news.ycombinator.com/item?id=36144241
Oh, but this is interesting. If some countries stand to gain more by declaring copyrighted works "fair use" for training generative models, then other countries can either fall in line or risk falling back in technological development.
It's a race to the bottom in favor of open source.
Since the government in many ways views citizens as threats to their control / power, the last thing they want is that population armed with a powerful AI.
I agree - LLMs are a massive threat to the current state of government power.
"<LLM>, analyze the contents of this 500 page bill. Who stands to gain from this bill, and what outcomes is it likely to have for the general public? Is this bill in line with good faith evidence-based policymaking for the good of the population and the planet? What existing legal mechanisms could be used to fight the special-interest aspects of this bill?"
Is there really a shortage of this kind of analysis today?
Using those legal mechanisms requires money and often public support. The LLM can’t conjure either.
The world is already full of smart people and good ideas about policy. The reason they’re not getting implemented probably has little to do with things that AI can solve today. For starters, a lot of voters actively dislike policy suggestions from experts and choose politicians who proudly go against expert opinion. Giving AI tools to the experts won’t change that.
I think the difference is that instead of deferring to an expert's opinion, you can interact with a knowledge machine which can explore the topic in and on your terms, answer your questions about it, respond to your points and concerns about those responses, etc.
It's a fundamentally different kind of knowledge generation than reading an expert opinion - it's branching, self-directed, and responsive.
This is exactly the kind of work universities love.
I personally know three different groups trying to do exactly that.
This is incredibly naive. “If only the people were EDUCATED!” No, we are far beyond that.
Are they?
"<More Powerful/Popular State/Corpo-owned LLM>: it's pretty hard to fight this, just trust we've got got your back. Remember, we're currently handling that other case for you. Would be a shame to lose you as a client."
Exactly. I want an unaligned LLM to give me X potential solutions ranging in ethicality from "don't worry about it, they might be nice people" to "steal a nuke and ransom the world", and let me as an aligned human craft my prompt or chain of reasoning to weed out the useless or unethical responses, and then I can decide what is useful and suitable.
This is more or less the process that goes on inside a thinking human, is it not? I don't want to outsource ethical decision making, I want to outsource cognitive effort. By analogy, you don't rely on a bulldozer to decide not to bulldoze a populated nursing home - that's on the user, as are the consequences.
Current power structures demonstrably cannot be trusted to limit themselves to ethical solutions (Military Industrial Complex, Climate Change, etc etc etc pick your poison) - why should they be trusted to censor cognitive tools?
How about stay anonymous and just violate all the copyright laws? There's already pirate bay, libgen, sci-hub, zlibrary, etc., surely it's possible for there to be an opensource & pirate LLM model.
If it were practical to mirror sci-hub and libgen, that would be one thing, but despite a lot of talk online I have yet to see a practical way to put my hands on such a thing.
I'm not sure what you are referring to with 'such a thing', but mirroring libgen and zlib is really not hard. Libgen offers Torrent links as does Anna's Archive. The libgen domains are fragile, but here's a link to the Anna's Archive torrents: https://annas-archive.org/torrents. They even have a page talking about training LLMs on this data: https://annas-archive.org/llm
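For anyone who wants to try, here's a minimal bulk-download sketch using the libtorrent Python bindings (the .torrent filename and save path are placeholders; grab the actual files from the Anna's Archive torrents page above):

    import time
    import libtorrent as lt

    ses = lt.session()
    # "aa_books_dump.torrent" is a placeholder for one of the torrents
    # listed at annas-archive.org/torrents
    info = lt.torrent_info("aa_books_dump.torrent")
    handle = ses.add_torrent({"ti": info, "save_path": "./mirror"})

    while not handle.is_seed():
        s = handle.status()
        print(f"{s.progress * 100:.1f}% done, "
              f"{s.download_rate / 1000:.0f} kB/s, {s.num_peers} peers")
        time.sleep(10)

Leaving the session running after the download finishes (i.e. seeding) is what keeps the mirror alive for the next person.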
Do any of these methods actually work though? Last time I looked (admittedly, 6 months or so ago), there were 0 seeders on the torrents.
Exactly, it's easy to say "Just torrent it," but that requires a lot of people to stick their necks out, including the user who just wants a copy of the data.
We need the ability to circulate HDDs physically in a semi-organized fashion, samizdat-style.
Mirroring libgen is definitely within reach, it's "just" 50 or so terabytes with torrents freely available for bulk downloading.
Realistically only maybe 10% of that is actually useful, but reaching that 10% is gonna be very labour-intensive. You would have to do a lot of cleanup of different formats, duplicate uploads, different editions of the same book, scanned PDFs, and what not, while big players with their own ebook stores (Amazon, Google, Apple, any ebook store) already have all of the proper metadata, a common format to work with, and a lot less duplicates.
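To give a sense of what that cleanup involves, here's a naive first-pass dedup sketch; the normalization rules and format preferences are assumptions, and real libgen metadata is far messier:

    import re
    import unicodedata

    def book_key(title, author):
        # crude identity: normalized title+author, ignoring case,
        # punctuation, and edition/volume noise
        def norm(s):
            s = unicodedata.normalize("NFKD", s).lower()
            s = re.sub(r"\b(\d+(st|nd|rd|th) edition|vol(ume)? \d+)\b", "", s)
            s = re.sub(r"[^a-z0-9 ]", "", s)
            return " ".join(s.split())
        return (norm(title), norm(author))

    def dedupe(records):
        # keep one record per book, preferring born-digital formats
        # over scanned PDFs; each record is a dict with title/author/format
        fmt_rank = {"epub": 0, "mobi": 1, "djvu": 2, "pdf": 3}
        best = {}
        for r in records:
            k = book_key(r["title"], r["author"])
            if k not in best or fmt_rank.get(r["format"], 9) < fmt_rank.get(best[k]["format"], 9):
                best[k] = r
        return list(best.values())

Anything this naive will both over-merge (abridged vs. unabridged) and under-merge (retitled editions), which is exactly why the long tail is so labour-intensive.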
Isn't there some kind of standard for publication metadata? One which would allow you to uniquely identify a publication and track different editions as children of the "original" publication? Maybe we should create one and make it freely available?
How would one anonymously train an LLM of sufficient size to produce the performance needed? Does it not require hundreds/thousands of expensive Nvidia GPUs?
Hardware gets better, the masses have amassed quite a lot of it already, and it depends how soon you need your AI.
Hostile foreign nation trains it and releases it in the persona of an anonymous hackerman.
Basically the plot of Ghost in the Shell.
Take a look at Databricks getting a hand from its internal team to create the Dolly 15k dataset. ( https://www.databricks.com/blog/2023/04/12/dolly-first-open-... )
For training AGI (artificial general intelligence) maybe only a select few mega companies with massive datasets will be able to come up with training data.
There are so many other use cases that OSS projects can enable otherwise. Individuals or smaller companies have unique data that can be used to augment existing open source models. Many use cases are area specific, and without the need for general intelligence.
Palantir just did a talk at AIPCon ( https://youtu.be/o2b0DwNg6Ko ), where they recommended the use of many LLMs, open and closed (the example had Llama 2 70B, GPT-4, PaLM for coding, Claude, plus fine-tuned models), feeding into their synthesizer.
While I want open source to win, especially as an open source maintainer on Ollama (if you haven't seen it yet, it's one of the easiest ways to run LLMs locally - https://github.com/jmorganca/ollama ), I think the work so far in this space has been a positive sum one - open or closed.
People may hate on Palantir but sounds like they get it. I agree that the next step is going to be the interaction of multiple models. We have popular models that have been generalized across domains. The next step is models for specific domains, specific enterprise data sources, individual data sets. Each model has the opportunity to leverage the more generalized model above it. And if done properly, the models will be able to then build back references showing what data validates its output as a source of truth.
This is just a standard demand for preservation of evidence. They’re letting you know that although you’re not a party to the lawsuit, you may be called to produce information as a witness. They are informing you that certain classes of information are relevant to the litigation, and that you have a legal duty to preserve that information in case it is later subpoenaed.
Caveat: I am not your lawyer.
> Caveat: I am not your lawyer
???
Why would he suddenly think you are?
Because they were giving advice about a legal matter. By stating what seems obvious to you and me, the GP is covering themselves against "implied" lawyer-client relationships, which in some jurisdictions could impose on them the responsibilities of a lawyer to a client, if a person could prove in court that they reasonably believed the GP was acting as their lawyer.
I don't know if this has actually been a problem ever brought before a judge, to be honest, but that's the reason the disclaimer is made. And of course, I am not your lawyer.
I want to find out where lawyers hang out online and whether they close every sentence/post with "I am not your lawyer".
Check Legal Advice subreddits, it happens. It may be an abundance of caution most of the time, but it doesn’t hurt.
It’s pretty common in situations where you could be seen as giving legal, accounting, financial, etc. advice especially if you have some certification to that effect to disclaim what you’re saying as actually being advice. Lawyers are especially careful about this outside of a specific client relationship.
I also choose to not be this guy’s lawyer.
mixedCase outlined the liability point of view for a professional; the other reason is that each person seeking legal advice should retain a lawyer and not rely solely on advice from faceless internet forums. The same applies to many other professions, such as health: "Hey, I've heard x, y, and z, but I'm not your doctor."
Everyone is of course free to speak about legal matters and provide their take, but keep in mind that even well studied lawyers, with all of the context and evidence are not always successful in their arguments. Internet forums are many steps removed from these professionals, so you should moderate accordingly.
In forums such as Reddit/Hacker News/etc, there are many discussions that concern legal matters - in my experience the overwhelming majority of highly voted comments and frequently repeated ideas are not ones which are grounded in fact, but rather comments that align with the group's desires. People upvote and repeat ideas that appeal to them, not ones which they've investigated themselves as factual or accurate.
Another one to look out for is IANAL: "I am not a lawyer". (Yes I'm being serious with this acronym.) This is to avoid a different kind of liability: the unauthorised practice of law.
This line of reasoning holds for pretty much any online advice. You don't close your comments with "I'm not your DBA" or "please consult your own paid certified C++ developer" or whatever.
I don't think that once you pay a lawyer their IQ jumps 20 points, or that they're necessarily more correct than crowdsourced information. Doctors and lawyers aren't somehow infallible once hired.
Being someone's DBA does not imply certain legal status in the way being someone's lawyer does. It can be a real headache for a lawyer if someone they are not representing claims that they are representing them (which happens, a lot: funnily enough a lot of people seeking legal advice online are really not at all familiar with how this works), so it's standard boilerplate to clarify that if there's any potential cause for confusion.
> You don't close your comments with "I'm not your DBA" or "please consult your own paid certified C++ developer" or whatever
I (and others) actually do do that when it seems relevant, eg [0] (and its parent, which does similar [1]).
DBAs and C++ developers aren't licensed professionals. Doctors and lawyers are.
Yes, some types of engineering are licensed professions too. Software engineering isn't one of them.
Reminds me of when the RIAA and MPAA were able to end piracy. Dark days.
They were? How?
Definitely a joke, but also maybe not a joke. Enough pressure was put on pirates through ISPs threatening to drop customers (or was it to sue them?) that streaming services saw greater adoption. People are risk averse when it comes to lawsuits from publishers I suppose.
So you can still pirate, but most people don't. Maybe LLMs will be the same, sure you could use an uncensored/unlicensed model that doesn't pay royalties to publishers, but why risk it?
As peer comments mentioned, that's completely missing the order of causation. Steam is a better example, because there was never any weird organization threatening everybody they could send a letter to with lawsuits. Yet it grew exponentially.
I think, in general, there are relatively few completely ideological 'pirates.' People will buy things when they can easily afford it, and it's easy to do so. When it's too expensive, not locally available, or when the payment methods involve hassles, they will simply download it instead.
Streaming services got higher adoption because the price/trouble cost was worth it. People saw value in putting $7/month into a service that delivers high quality video.
This might not last with the fragmentation of streaming services (I don't think people are ready to pay $10x10 for access to all services).
Part of the trouble side though is taking the proper precautions to not trigger a strike from your ISP or a letter from RIAA. And a lot of people didn’t/don’t understand what those precautions are. Don’t know the size of the effect but there was probably some.
It's not quite the same... maybe even reversed. Big companies can "secretly" use private datasets with copyrighted or legally questionable data with a wink and a nod, without anyone being able to prove it in court.
Open source models generally do not have that option. Maybe there are some exceptions like Meta using some private data in future Llamas, but I question how sustainable those efforts are.
Most piracy is P2P and requires some "computer literacy" to use.
Most streaming, AFAIK, is not P2P. Streaming gives you instant access to the content, and streaming clients are often simpler to use, with features such as recommender systems, etc.
The copyright issue is a big one and certainly unsettled but the (very good) IP lawyers I know are pretty much universally of the opinion that training models with copyrighted material probably passes muster for a number of reasons.
Of course, as someone else noted, this is probably ultimately going to be in the US a question for Congress and the courts.
Sarah Silverman is almost certainly going to be crushed in court. The lawsuit is totally without merit.
> I think we’re witnessing the death of open source AI.
Nah, we aren't. You can't make any money off an open-source project via litigation, and torrents exist. This baseless fear-driven nonsense will eventually blow over.
Wow. Did you figure out what your next steps are? (understood if you don't want to discuss)
I spoke with a lawyer earlier tonight. Basically, there are no next steps for me. The next steps for them are to send targeted discovery requests for whatever it is they think I might have in relation to the case, and possibly ask me to testify in a deposition.
Books3 and The Pile are still available on the internet, just not "legally".
Isn't there sort of an implication here that Criminally Open-Source ai will win?
Who will give them the compute? It requires million dollar clusters to make any progress.
There's a huge amount of money in AI, and there is no global organisation tracking GPUs and making sure that "X startup which purchased a bunch of GPUs, then went bankrupt and sold them all to an unknown buyer" doesn't happen. Even stopping China from getting GPUs is proving extremely difficult, and that involves much, much more powerful actors than even the largest copyright cartel; NVIDIA is chafing even against that. GPU restrictions aren't going to work unless they're being enforced by the US military, and even then China is only ~5-10 years behind the US in fab tech and could easily custom-build chips for AI to fill the demand, in the internal market if nothing else.
Basically, there is a lot of money to be made in AI, and if money is available to fund development it will not be that difficult to turn that money into compute, particularly in a context where the current huge "legitimate" buyers like Microsoft, Amazon, Google are banned from using AI chips for AI.
Willingly? I wouldn't know. But then it becomes a problem of obfuscation, doesn't it?
Consider someone 20 years ago saying (of general software), “Open Source Will Win”. Were they right or wrong? I don’t think there’s a simple answer. (Open source is crucial and ubiquitous, but much of the most-used software is closed, etc). I assume AI software will likewise be a complicated mix for the foreseeable future.
Open-source (of general software) won and is winning every day.
For profit code is written on top of open-source, every single day.
If your metric for code success is the money it returns, then that is the only metric in which open-source has lost.
Humanity as a whole would be behind if it wasn't for open-source.
I'm not a FOSS/FSF/OSS freak, but credit is due where it's due. We're all building on the shoulders of unsung giants.
>If your metric for code success is the money it returns, then that is the only metric in which open-source has lost.
What? The dream of open source and its idea of winning, as it was discussed in the peak FOSS movement era (late 90s, early 00s), was about widespread user adoption and overthrowing the Microsofts and Apples of the world on the desktop, with people using FOSS software, open protocols, and so on, and taking control of their computers.
It wasn't about what Google or Uber or Facebook might use for its backend. Or what FOSS libs some company like Apple or Adobe might use to create their proprietary software.
So, this dream never happened. Instead, businesses used FOSS for their closed-off backend stuff, or for the behind-the-scenes OS backend of their proprietary mobile phone platforms, whose crucial part is all the proprietary add-on services, and which almost everybody uses in its corporate official release, while people's computers and smartphones were never more locked in.
The mass movement of FOSS idealism has also vanished (the kind that every kid in computer science cared about, the kind that made headlines, launched thousands of blogs and discussions on the subject) and so on. Now it's either a hobby or a corporate paying gig for most, seen in a pragmatic light.
Agreed, and I think there has been a shift where open source is used more frequently as a marketing technique for getting initial traction, funding, and eyeballs, then land-and-expand into nearby non-open-source areas, which muddies the waters of the entire community. Not only do the equivalents of the old hackers of yesteryear who supported this movement no longer feel they need to run Linux machines as their daily drivers (every startup engineer seems to have a Mac), but those running many well-respected projects have to justify business impact, which can put the OSS side at risk.
There are well-known huge OSS projects today that change licenses for future releases even after taking contributions (ex: text gen), or release open source models/software that would be very commendable except that they add their own modifications to the licensing agreements to filter out competitors, or just make it ambiguous enough that you don't want to touch it, since you don't want to pay for lawyers yourself to figure out what the non-standard additions mean for your project (ex: Llama, Falcon). Building a customer base off open source to funnel towards your revenue or kneecap competitors is the direct opposite of what the FOSS advocates of old would advocate for (even if it's the right strategic move); it would have been such a philosophically bad thing to do that they'd be too embarrassed to do it (there are ways to make money without doing that, see Red Hat). There was an emotional aspect to the devs who grew up in or adjacent to this movement 15+ years ago.
Open source for many companies now rhymes with the Open in OpenAI, where yes, it may have started that way with the best intentions, but things change. The idealists are gone, which is a bit sad, but it's probably OK to not have the next RMS evolving in a computer lab today with a new rendition of the open source software song reconfigured for TikTok virality (j/k, the GNU project was a key piece of what drove Linux forward, of course).
Yes, there’s a case for “open source has won”.
I guess I was saying it depends on how you define it. For example the percentage of personal computers using an open OS is relatively low. But, eg, macOS depends on a lot of open source.
Every Google search is going through closed software, but also internally uses lots of open source, etc.
So it’s a mix and complicated.
It's not that complicated. The closed-source software wouldn't be where it is today if it weren't for the open-source code it was built upon.
Every protocol under the sun that is in use today has an open source implementation used by closed source software and OSes. Why? Because if they didn't, those protocols wouldn't be popular or used. From file-type handling to images, audio driver code, TCP/IP libraries, every cryptography implementation, every webpage you visit: from top to bottom it's mostly made of open-source code.
Unless you're talking about a rocket's real-time OS, even the most closed platforms are made of roughly 50% open-source code.
>It's not that complicated. The closed-source software wouldn't be where it is today if it weren't for the open-source code it was built upon.
That's not open-source code winning. It's closed-source (and SaaS and closed off web backends) winning, by taking advantage of open source.
It's like a bunch of nature lovers buying land in some idyllic place, dreaming of a sustainable way of life there, and getting media coverage, and then some real estate moguls coming in, taking advantage of all the talk about the place in the media, to buy it, raze off all the trees and greenery, and build some huge crappy suburb of identical houses and chain shops, with huge success.
"GreedyRealEstate LTD wouldn't be where it is today without those nature lovers"
Sure. But it's the opposite of them winning.
Did it 'win' then?
Come on, it's not that cut and dry.
I'm paying a monthly subscription for a dozen services, and there's exactly zero possibility that I could use an open source equivalent for all of them. Maybe a few... but, mostly not.
Is that winning? The 'most important part' of software being trapped behind a monthly subscription? Where I have no freedom to choose what runs on my phone?
I think the free software foundation might take issue with the parent statement you're responding to:
> Open-source (of general software) won and is winning every day.
The battle is not won.
If you re-frame the statement as 'open source is used by lots of people' then suuuuure, once you change what you're talking about, then by all means.
Absolutely. There's a lot of great open source software.
...but that's not 'open source winning', that's just 'people like free stuff they don't have to pay for or make any effort to get'.
> Absolutely. There's a lot of great open source software.
>...but that's not 'open source winning', that's just 'people like free stuff they don't have to pay for or make any effort to get'.
That's the most cynical take I have ever read about opensource software. I think you're measuring the success of opensource software by only the amount of software that you have to buy. IMO, that's not fair. Does the software that you buy today exist without opensource software? In a hypothetical case where your commercial software was built without any opensource software, do you think the price of that software would be the same as it is today?
>That's the most cynical take I have ever read about opensource software
Cynical? If anything it's the opposite of cynical. It's an idealistic view, and closely matches the spirit we had back when the FOSS movement caught on - and the idea about the kind of FOSS future that never came to be.
>I think you're measuring the success of opensource software by only the amount of software that you have to buy. IMO, that's not fair
Fair or not, that was exactly the vision of FOSS, even starting from the first anecdote about the origins of FOSS by RMS. It wasn't "let's build something Apple or Google can use as a backend for their closed software" or even "let there be a lot of FOSS in the world".
Its success was supposed to be measured by the overtaking of proprietary software, not by assisting it or helping it behind the scenes as a server backend OS or service, and by its adoption at a mass level for the average user, giving them freedom, not as some niche for geeks. "Linux on the desktop", for example, was part of that dream, and it wasn't about "Linux being finally easy/good enough for some people's desktops", but about eclipsing Microsoft.
> "Open-source ... won and is winning every day. For profit code is written on top of open-source, every single day."
Open source is not a popularity contest! Any slob could "win" massive adoption by giving away something of value for nothing.
The whole point of FOSS was to destroy the proprietary, closed-source software industry so that everyone could have access to all code. Copyleft and the GPL was intended to defeat copyright by snowballing into an ecosystem so unstoppable that proprietary software businesses would fold because they couldn't compete. FOSS has unequivocally failed at that.
I see your point and I agree. I wasn't looking at it from this angle. My angle was advancing humanity, nothing regarding beating closed source counter parts.
Commercial interests advance humanity just fine. We'd arguably be approximately in the same place where we are now, with or without open source - and possibly even with better business models in tech than we have now (adtech, surveillance capitalism, SaaS).
All non-free OS vendors specializing in user-facing devices except Microsoft and Apple have already folded. And Microsoft will eventually fold as well, because Wine is just better at their own backwards-compatibility game.
There is a steady progress on the Godot / Blender front. Audio is lagging.
There is a steady progress on the KiCad front. In 10 years, it will probably get openEMS integration.
Almost nobody is licensing their programming languages anymore. Except for the silly low-code fad.
It did not win yet.
I agree with you, and I'll add that I've also seen a paradigm shift.
Years ago it was unthinkable for a software company to release open source, whereas today it's somewhat expected (and I know it's not an easy feat, with lawyers, domain-specific knowledge, etc.).
On one hand the cloud-based apps trend is the ultimate win for open source: now you can develop on a Linux machine even if 99% of your customers are using Windows.
On the other hand, cloud-based apps are the ultimate closed source code. Now you can't even touch the binary.
Quote:
"First they ignore you, then they laugh at you, then they fight you, then you win."
But it's important to remember that just because someone is laughing at you doesn't mean you will definitely win.
True, obviously. But it was a quote, not a rule.
Google it, it's fun. Not only Gandhi (misattributed, maybe); Red Hat used it too.
I would say
Postgres won over Oracle DB.
Linux won over Windows/macOS on the server.
Various languages and frameworks like Ruby, Python, and JavaScript won over Visual Basic and Flash.
Without these, developers would've been paying millions of dollars more for proprietary software.
It's more complicated than that. Open Source platforms win, Open Source products don't - but there are exceptions to both.
I guess there's a tradeoff between pain to use, pain to pay, and pain to comply with restrictions, and whatever product or platform gravitates to the lowest sum of those for each use case wins. Obviously, this assumes good-enough output.
IMHO, open source AI will win because the restrictions of the closed-source ones are too high, and their lower price point through deep pockets doesn't compensate enough for that. The UX is not that different.
> Open Source platforms win, Open Source products don't
Except for on laptops where Windows and macOS rule. And on phones, where iOS isn't dead yet. And on gaming computers, where the Steam Deck is very new and nowhere near PS5 or Xbox. And on the server where there's been a mass migration towards proprietary cloud platforms.
Really the only place open source wins is when programmers are the only ones deciding on the product. Which makes sense, because programmers are the only people who benefit from something being open source. For everyone else it's just a question of which set of programmers you pay.
In terms of "stuff that only closed source software can do", apart from 4/5G cellphone modems every large closed source program has an open source alternative, albeit potentially one with a less polished user experience.
Probably true, but “existing” is different than “winning” (cue joke about the year of the Linux desktop). Also, I’ll say “winning” may not imply being better (there are other factors), but I was taking “winning” to mean market share.
You can cue the joke all you want, no one is getting on stage to tell it.
Open source software is viral and self-propelled by nature. Corporate technologists may be on the retreat, spending billions to build moats for themselves with proprietary locked down hardware and cloud services, but these defenses won't last long, and once breached, the blow will be fatal.
But please, tell me the one about the guy using 3ds Max, Direct3D 12, the Intel C compiler, IDA Pro, WinRAR, Windows Media Player, uTorrent, and a proprietary cryptocurrency client.
You may see it as the midnight hour, but the entire industry is still in its infancy, maybe yet to even emerge from the womb. Imagine how Conway's game of software licensing plays out over 200 years.
I wish it were true lol.
This article claims that since most current business uses of LLMs don't require high reasoning (summarization, semantic analysis), high-level reasoning won't be important for businesses, thus they would prefer smaller models, thus open source will win, etc.
This is very naive and assumes that all the current use cases for LLMs will remain the same, and that the enormous benefits and eventual agentic magic of higher reasoning won't soon overtake the easier LLM stuff in economic value. If OpenAI releases a model that can design, build, deploy, maintain, act as support staff for, and do independent research for an entire product line of a tech company, no company is going to prefer sticking with smaller models that can summarize PDFs for them.
I agree that reasoning is quite important, but I think GPT-4 is already very capable and unlocks capabilities or levels that real use-cases can leverage today.
But I also think that increasing performance is just as much about curating data and model feedback, and better architecture, as it is about things like giant datasets. It looks like open source is catching up and will likely shortly reach a performance and efficiency level that rivals the current GPT-4.
Open source doesn't have to be as good as the latest closed models to be useful. Once it can get a little bit smarter and more convenient then we won't need OpenAI etc. to handle many complex tasks.
I'd agree if the scale of money being poured into closed source models was what it was only a year ago, but now it's 10x greater. Open source can't compete with hundreds of millions of dollars and the talent vacuum it creates.
Yes, but then LLaMA from FB/Meta has provided a pretty good base for open-source-like LLMs.
Honestly, the current general state of software may be a good proxy for what AI will settle to. I can imagine major open-source models, trained and generated by nonprofit efforts in a roughly similar fashion to Linux. Entire businesses might be built on top of "servicing" this model, such as enterprise-grade finetuning, serving, etc. Like the author mentioned, no business wants their core functionality to be dependent on external factors. As I understand it, this is also the case for some Linux-based corporations that focus on building a business around open source software. Of course, there will be proprietary models. Will the average home user cook their own custom distro of LLaMA 10 to be their home assistant? No. They'll probably use Alexa or whatever proprietary solution is out there.
Uncertain, but I'd be willing to bet that open source AI will win the way Linux won. In the ways that matter.
I don’t think we can. People are furious at anyone who even touches their data. This fury by the people will transfer from OpenAI towards the smaller players.
Result: open source AI dies. There’s no way to get any data, and what’s available outside of copyright isn’t enough. Not to compete with ChatGPT.
Sure, there will always be cool models. But nothing like what we were hoping for. LLaMA is being sued right now precisely because it used copyrighted books. Open source entities don’t have the legal resources to defend themselves from these threats.
Piracy doesn't respect copyright. Why not steal all the great books and works ever made just to feed this model? Like a protocol for free training data that you could torrent around and contribute to, like a hive. Not well thought out, but that's the idea.
On the contrary, it is possible to train efficient models with purely synthetic data, see Phi-1 and Phi-1.5 from Microsoft.
What if there was nobody to sue? Open source could be structured that way, like MAME.
> I can imagine major open-source models, trained and generated by nonprofit efforts in a roughly similar fashion to Linux.
Linux today is mostly built by for-profit companies, essentially as a large collaboration project. Microsoft, IBM (RedHat), Oracle, Intel, Huawei are some of the largest contributors.
Why Open Source AI won't win --- because "Big Tech" is working hard to convince technical incompetents in government that AI is dangerous and needs to be "regulated" and "licensed".
While technical incompetents may be overrepresented in government they aren't the only ones who think AI needs to be aggressively regulated.
They are the only ones with enough guns to do it.
If you read carefully, I never said they were the only ones.
https://news.yahoo.com/ai-tech-leaders-make-all-the-right-no...
The CEOs of leading AI companies — including Meta's Mark Zuckerberg, Microsoft's Satya Nadella, Alphabet's Sundar Pichai, Tesla's Elon Musk and Open AI's Sam Altman — appeared before Congress once again on Wednesday.
These guys would love to see AI regulated and licensed --- a move that would likely stifle competition --- for the safety of the world --- and their profits. And after 25 years in the tech industry I'm certain that the "AI revolution" ends the same way the last several tech revolutions have: a faceless horde of the dumb separated from their money, gross misallocations of resources (financial and intellectual), supercharged income inequality, and even more homeless. With the possible exception of a handful of niche models trained to assist pharma research, this tech doesn't solve any problems that people have. It does, however, potentially solve problems employers have with labor costs. This doesn't end well.
Suppose, despite your clear belief to the contrary, that it were factually correct that AI is incredibly dangerous. Embody that viewpoint for a moment. What action would you propose in that circumstance that would be more effective?
> What action would you propose in that circumstance that would be more effective?
Nothing would be effective. The singularity would have already occurred. No amount of legal and market manipulation would likely be effective at preventing despots and dictators from using AI to achieve world domination.
Fortunately for the world, the idea that a binary logic playback device (aka a computer as we know it) is capable of this is pure fantasy without any basis in fact.
Open source doesn’t have to obey the laws of any particular jurisdiction. Corporations do.
VLC is illegal in the US. But it’s legal in France and that’s all that matters.
Corporations don't care about Open Source as long as they have control and influence over the usage and "licensing" and "regulation" of the technology.
Corporations don't care about Open Source as long as they have control and influence over the usage and "licensing" and "regulation" of the technology. BTW, I think your statement regarding VLC is reversed: it's illegal in France but not in the USA.
No, right way round; in France, interoperability trumps anti-circumvention, in the USA, other way around due to the trafficking clause.
Out of curiosity, why is it illegal in the US? Is it the decss stuff?
Not sure if above had France and the US flipped or not, but this is all I could find for US: https://www.howtogeek.com/138969/why-watching-dvds-on-linux-... (from https://en.wikipedia.org/wiki/VLC_media_player#References > [17])
TL;DR yes, presumably decss stuff
No, that's why it will take open source AI longer to win than it should.
Money talks.
And Open Source has none.
Maybe your git repo with 2 stars doesn't get money, but open source projects that thousands of people use do get money. Many countries around the world fund developers to work on open source projects.
Where do you think open source AI models and frameworks come from, magical moneyless communist computing fairies?
They come from big businesses who see open source as the most effective way of reaching their goals, and as more of the tech filters out around the world, that will continue to be the case, even if regulation makes it less useful in some jurisdictions, making both open source in those jurisdictions and the firms that see benefit in it less competitive.
> Where do you think open source AI models and frameworks come from, magical moneyless communist computing fairies?
Stolen/leaked weights? :).
> big businesses who see open source as the most effective way of reaching their goals
That is the opposite of open source winning, though. Especially with everything becoming a service - SaaS is the ultimate killer of all that open source was supposed to bring to people. That SaaS is thoroughly built on open source - that's just adding an insult to injury.
> They come from big businesses who see open source as the most effective way of reaching their goals
I think people need to think really hard about what those goals are.
Is it to defeat regulation? To destroy their competitors? Retain talent? Sell their non-free-and-better models?
People don’t just hand millions of dollars out on the street corner.
There’s always a catch.
…and the people using llama and falcon, etc. and enjoying how great it is are being blinded by the shiny toys and not really, imo, thinking about what it means, or how they’re playing someone else’s game.
It’s extremely naive.
There's a nice and terrible thing about grand ambitions. It's that they generally don't come to pass. When people do awful things, motivated by great grand ambitions - we have a horrible society. When people do great things, motivated by awful grand ambitions - we have a great society. So in general I couldn't really care less what somebody's ambitions are when they are any meaningful length of time beyond the present.
I think it can be dangerous if it gets too powerful, but licensing won't help, because no one knows how you'd keep it under control yet, and stopping open source AI would just hurt attempts to understand and research it.
> licensing won't help because no one knows how you'd keep it under control yet
That's a very good argument for not making it more powerful until we can reliably align it.
Yes, what I mean is that allowing some companies to do it but not others won’t help
But it will likely help those specially anointed with a license to larger profits --- at least until the nitwits catch on to the ruse.
> If you’re building an AI native product, your primary goal is getting off of OpenAI as soon as you possibly can
"If you're building a Cloud-native product, your primary goal is getting off of AWS|GCP|Azure as soon as you possibly can"
The idea that there are "AI native" companies is absolutely horrifying to me. Are people really hitching their entire wagons to a fancy chat bot? I feel like I'm taking crazy pills and I'm the only one who can see how out of hand this whole thing is getting. LLMs are cool and somewhat useful but this is all getting a little inflated, isn't it?
The people taking crazy pills are those who are seeing the dumb VC vending machine handing out free piles of cash to anyone who presses the AI button.
Hitching your wagon to AI by slapping one word in your pitch puts a near-instant $100mil into your pocket.
Yeah, but it still isn't your money. Even with $100mil of valuation/investment you are rich on paper, but until you move your Ponzi/AI magic to another investor/IPO, you don't have much.
I, as Founder, CEO, and Serial Entrepreneur, can put a fat salary in my pocket from my now-funded startup to do f'all.
There's iOS native products and Windows native products. All of these are valid, they're just higher up the innovation curve.
And "as early as you can" usually ends up being years after you IPO cough snapchat GCP cough.
So to simplify it:
If you're building a product your primary goal is to IPO as soon as you possibly can.
Well sounds like the state of the industry...
>your primary goal is getting off of OpenAI as soon as you possibly can
I have calculated that running a LLaMA model on AWS/cloud, or even on vast.ai with a 3090, is much more expensive than OpenAI. Even with a colocated 3090, as they charge an absurd amount of money for power (hardware is free).
The only case where this is a few times cheaper is if I hosted it in my house, as I have fiber and not-that-expensive power.
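For what it's worth, a rough back-of-envelope points the same way. Every number below is an assumption (a late-2023 ballpark), not a quoted price:

    gpu_rate = 0.30       # $/hour for a rented 3090 (vast.ai-style rate)
    throughput = 30       # assumed tokens/second for a 13B LLaMA on that card
    self_hosted = gpu_rate / (throughput * 3600) * 1000   # $/1K tokens
    openai_price = 0.002  # $/1K output tokens, gpt-3.5-turbo
    print(f"self-hosted: ${self_hosted:.4f} per 1K tokens")   # ~$0.0028
    print(f"openai:      ${openai_price:.4f} per 1K tokens")

And that assumes 100% utilization of the rented card; at realistic duty cycles the gap only widens.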
"If you’re building an AI native product, your primary goal is getting off of OpenAI as soon as you possibly can."
Right now you essentially have:
Customer -> business -> open-ai -> microsoft_azure
Where-as many companies don't have this extra middle-man like this now -- they are more used to this type of situation:
Customer -> business -> microsoft_azure
Hmm.. :-) Seems like Microsoft just absorbing the tech of OpenAI into a set of models that you can use / train on in Azure proper, and forgetting about the middleman, is likely the end game here.
Since Microsoft offers OpenAI models as part of Azure Cognitive Services, we're already in the latter situation, at least for any company that isn't playing fast and loose with its proprietary data. The recent changes in OpenAI's offering are actually undoing this / reintroducing the first chain as an option.
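Concretely, the 2023-era openai Python client (0.x) can already target either backend with the same calling code; the resource and deployment names below are placeholders:

    import openai

    # default is api.openai.com; to go straight to Azure OpenAI, repoint the client
    openai.api_type = "azure"
    openai.api_base = "https://my-resource.openai.azure.com/"  # placeholder
    openai.api_version = "2023-05-15"
    openai.api_key = "..."

    resp = openai.ChatCompletion.create(
        engine="my-gpt35-deployment",  # Azure uses deployment names, not model names
        messages=[{"role": "user", "content": "hello"}],
    )
    print(resp["choices"][0]["message"]["content"])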
This article is misleadingly conflating short-term analyses with long-term outcomes.
The idea that "reasoning doesn't matter" in the long-term is absolutely asinine. Human-level general reasoning is obviously one of the coveted goals of AI research.
It remains unclear how the open-source community is ever going to amass the tens of millions required to train foundational models. And if it somehow does, no sane government will permit uncontrolled research towards AGI.
>And if it somehow does, no sane government will permit uncontrolled research towards AGI
If there was an effective, distributed means for training LLMs, enough people are passionate about LLMs that the only way governments could stop it is if every country in the world turned into communist China with respect to internet restrictions.
> If there was an effective, distributed means for training LLMs
Is it reasonable to believe that this is possible? Distributed training requires extremely high-bandwidth and low-latency interconnect. The internet ain't that.
Believe me, I dearly want the open-source community to "win". A future where only governments have a monopoly on AI research is absolutely terrifying. Given the known parameters though, that future seems inevitable.
>Is it reasonable to believe that this is possible? Distributed training requires extremely high-bandwidth and low-latency interconnect. The internet ain't that.
The same can be said about piracy and BitTorrent. The torrent community is so large that I would bet a lot of money they have more computing and bandwidth resources combined than OpenAI, by a wide margin, probably orders of magnitude.
Why low latency? Is the latency here a meaningfully annoying limit in training, that can't be reasonably offset by just adding more compute nodes to the network?
Latency is the limit in the end, but I feel that there are plenty of easy-ish wins to be had in redesigning the architecture and training approach to make it irrelevant in practice.
Distributed training without insane bandwidth requirements is the holy grail here. It has to work in terms of work units like Folding@Home.
> The idea that "reasoning doesn't matter" in the long-term is absolutely asinine. Human-level general reasoning is obviously one of the coveted goals of AI research.
I don't know if I agree with this. Human-level reasoning is what researchers care about, but an AI that is controllable and that is consistent is way more valuable than an AI that is smart. And I think the general point here is that capabilities have outpaced control. There is a huge gulf between the capabilities of current AI and the ability to actually manipulate and utilize that AI.
If you could somehow theoretically build an AI that was half as smart as GPT-4 but that was completely immune to prompt injection in every single situation, it would be more useful than GPT-4. See also Stability AI vs Midjourney, etc... Midjourney is far more capable but it simply doesn't matter -- input methods and control methods and the ability to fine-tune are more important than base model capabilities. Current models are quite capable for the tasks they're being used for, the reason they fall over and the reason why it's difficult to use them in those tasks is specifically because of the lack of control and reliability; and making them smarter seems to be only making them harder to control.
If you go long-long-term then we get into science fiction territory and it's easy to say that human-level general reasoning is the highest priority. But that's because when you think about that long-term you are not thinking about tradeoffs. You're assuming a theoretical world where human-level reasoning is perfectly controllable and doesn't give wildly inconsistent results that make it useless for critical tasks and that the safeguards you need to put around it don't make it harder to work with than a human being. And yeah, if you can have literally everything, sure, you want human-level reasoning.
But there are a lot of things in AI that matter a lot more than human-level reasoning and it's skipping a lot to say "human level reasoning is the most important" and to just assume that the other issues will get sorted out. Most businesses using AI are not using it because they're invested in replicating humanity, they're using it because they want to accomplish a specific task. If an AI replicates humanity but is bad at that task, businesses will go with the tool that's good at that task instead.
And researchers will be disappointed because they want AGI, but successful businesses don't choose their tech stack based on what makes researchers happy.
----
> It remains unclear how the open-source community is ever going to amass the tens of millions required to train foundational models.
I also think this is a little over-confident. This assumes that tens of millions are always going to be necessary to train foundational models, which I don't think is a safe bet to make. My impression looking at some of the more targeted work people are doing is that better curated data sets that are more focused on specific tasks may end up being straight-up better to train with than "the entire Internet".
To add to that, I don't think it's a safe bet that model knowledge won't at some point be fully transferable between models or that these foundational models won't become a commodity. I mean, heck, we don't even know if the model weights are under copyright. It is very feasible that some kind of collective Open model might end up being good enough and that everyone just kind of standardizes on that as a base and builds on top of it. If that becomes useful enough that the companies investing into OpenAI decide "meh, we'll just invest resources/GPUs/etc to the Open version" then there will be a point where no VC-backed competitor will then be able to outpace the speed of those contributions, because they'll be racing alone against the entire market contributing to a single Open base.
Even if none of that happens, it's also worth noting that the way we currently train LLMs is biased towards inefficiency, in part because research is primarily conducted by companies who can afford to be inefficient. But it's not a safe bet to assume that we won't find a better way to train models that uses less data and that doesn't try to recreate reasoning capabilities out of pure syntactic language -- a learning process that basically no living intelligent agent follows. Humans don't develop emergent reasoning from language, they learn language after developing reasoning and by mapping that language to real-world experiences; this is basically the complete opposite to how LLMs approach training. If different training methods get discovered, will they have the same ridiculous data requirements? I don't think I can say with complete confidence that they will.
----
> And if it somehow does, no sane government will permit uncontrolled research towards AGI.
My issue here is that no sane government would trust AGI to a private corporation either. The realistic outcomes here are that either the government won't interfere, in which case Open Source communities will be able to do their own research the same as private companies, or the government will heavily interfere, in which case probably only state-developed AGIs will exist.
A lot of businesses would love to have selective regulation, but the idea that AI is too dangerous to trust to hobbyists but is not too dangerous to put in the hands of people like Elon Musk or Sam Altman is ludicrous. OpenAI/Google aren't even responsible currently with their existing LLMs, I can only imagine how horribly they'd handle an actual AGI. If the government is currently shrugging its shoulders over the dumpster fire that is current LLM safety mechanisms, I'm not sure why they'd suddenly start caring about Open Source communities.
> If the government is currently shrugging its shoulders over the dumpster fire that is current LLM safety mechanisms, I'm not sure why they'd suddenly start caring about Open Source communities.
For the same reason that nobody will bother to shut down your pirate radio station if you broadcast on an empty channel at 79.1 MHz, but the wrath of God will fall on you if you broadcast on an empty channel at 89.1 MHz.
In the first case you're just breaking the law. No biggie. In the second case you're competing with powerful corporate interests who paid a lot of money for their own licenses and don't want competition.
Open source AI - OSAI - does not necessarily mean it is open. Take the latest OSAI models and try to run them. If you really want to support the idea of open source, you'd need to build a model in a way that is not just open but also accessible to everyone. There needs to be something like Napster that distributes those models into the hands of everyone, not just a few folks gatekeeping access because they can spend $150k/month on H100 GPUs.
The moat isn't the code, it's the obscene amount and expense of resources needed to actually do the training.
In that way, it makes sense to just release it under a permissive license because there's still a massive cost to use it.
I wonder if one could build something like SETI@home, but for open source model training. Assuming the model fits on a gaming GPU, it's just data distributed parallel training but with a large distance between training nodes.
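As a toy illustration of the work-unit idea, here's a minimal sketch that ignores stragglers, trust, and gradient compression (all things real attempts like Hivemind have to wrestle with):

    import torch
    import torch.nn as nn

    def volunteer_step(model, batch, targets):
        # one "work unit": forward/backward on a local data shard,
        # then ship the gradients back to the coordinator
        loss = nn.functional.cross_entropy(model(batch), targets)
        model.zero_grad()
        loss.backward()
        return [p.grad.clone() for p in model.parameters()]

    def coordinator_step(model, grad_lists, lr=1e-3):
        # average gradients from all volunteers, apply one SGD update,
        # then broadcast the new weights for the next round
        with torch.no_grad():
            for i, p in enumerate(model.parameters()):
                p -= lr * torch.stack([g[i] for g in grad_lists]).mean(dim=0)

The math is the easy part; the hard part is making the weight/gradient exchange infrequent and compressed enough to survive consumer uplinks.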
There's the RNDR token that allows you to buy compute. There are also distributed rendering networks out there (whose names escape me).
I wouldn't discount the complexity of the code and development. The model architecture itself is incredibly complex, likely with tons of custom layers and tensor operators, along with all the custom tooling for data I/O, a likely custom optimization package for training, utilities for observability and diagnostics, and the actual configuration/orchestration of storage and compute resources...
And then you have the resources themselves. Which enable them to iterate more quickly on building all of the above. Oh and the training dataset.
It’s a big moat, all things considered.
Scaled inference isn't cheap either :/
As a govvie sitting on vast data resources, I would love to figure out how to liberate them for the people of my country. But 1) how do I ensure that it helps my countrymen first? 2) How do I allow them to use the data and have high confidence they don't have a copy?
I have to bring them into our fold, and let them run code in our environment. I have done these deals, and would be happy to do more. Does anyone actually want to be sponsored in? If so, reach out.
Open source hardware is as important as open source software. Many laws established for physical goods have been applied to techniques, software and artwork.
>Do you really want to outsource your core business, one that relies on confidential data, to OpenAI or Anthropic?
My man. Half the internet is hosted on either AWS or Google Cloud, databases included.
When I was working at a large broadcast company, there were serious concerns voiced about using AWS for certain workloads, precisely because of that threat. Some people do take it seriously.
My 2 cents on the arguments:
- Reasoning doesn’t actually matter
That's the most absurd argument of all. Currently no AI really reasons, so yeah, it doesn't matter that open source AI doesn't either. But the day an AI is really able to reason, it's game over.
- Control above all else
Control is good, but it comes after ease of use for most users. Although, yes, open source allows people to run experiments that a company would not do by itself.
- The Real Problem is Hype
The real problem is money. I can barely run a small LLM or Stable Diffusion XL with my gamer graphics card from a few years ago, and I will not buy a new one for a long time, if at all. Maybe the AI craze will drive development of hardware that can run up-to-date models reasonably cheaply, but for the moment I can't really use AI except with online services.
Lack of censorship and surveillance is the key reason why open source LLMs are my go-to.
Once again with the "human learning and machine learning are the same" argument... then why not call them both 'learning'? There are qualitative and quantitative differences that people here are eager to ignore.
But on the main topic, I read the updates to Microsoft's Terms of Use that come into effect 30 September 2023. They are making jailbreaking and reverse engineering their chatbots explicitly against those ToS and granting themselves rights to whatever you generate using their various services - neither of which is unexpected, but now if you ask Bing to revert to Psychotic Sydney, Microsoft has an excuse to lock you out of every Microsoft service you own for breaking their rules.
As a commercial user for most of the year now, it mostly just comes down to convenience and quality. Right now the closed AI models offer me both at a low price. I really don’t care about other factors, even though I am ideologically aligned with open tech.
This. Path of least resistance wins. At the end of the day it's easier to get an OAuth token and quota from OpenAI than to spend time setting things up locally and get inferior results.
I do have Llama 2 running locally, but just for fun; in business I buy unless it's a core competency. You'd burn millions per year trying to match off-the-shelf in engineering costs alone.
In practice the OpenAI GPT and Azure GPT APIs are severely rate limited, such that they error out on one in 10 attempts, impose timeouts, and, most importantly, require you to trust them with your data.
That’s really not the case at all. In our experience in production there’s no timeouts and latency has only gotten better over time. Your comment was more accurate in early May of this year (but even then an exaggeration).
Just yesterday I was writing a test function with 4 cases, each sending one call to ChatGPT. I didn't have delays between them; they were running back-to-back. After 1-2 attempts it started complaining of API limits and advised me to wait 51 seconds. How can I write a test like that? It raises exceptions because of Azure congestion, not task errors.
Sounds like you're hitting rate limiting for tokens per minute. I'd suggest working on reducing the size of your prompt, which will also result in more accurate responses (since LLMs do poorly when prompts get past a certain size).
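Beyond trimming the prompt, the standard workaround is to wrap calls in exponential backoff; a minimal sketch with the 0.x openai client:

    import random
    import time
    import openai

    def chat_with_backoff(messages, model="gpt-3.5-turbo", max_retries=6):
        # retry on rate-limit errors with exponential backoff plus jitter
        for attempt in range(max_retries):
            try:
                return openai.ChatCompletion.create(model=model, messages=messages)
            except openai.error.RateLimitError:
                time.sleep(min(60, 2 ** attempt) + random.random())
        raise RuntimeError("still rate limited after retries")

That keeps tests green under transient congestion, though it obviously won't help if you're hard against your tokens-per-minute quota.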
The author says:
"Reasoning, the type you get from scaling these models to get larger, doesn’t matter for 85% of use cases. Researchers love sharing that their 200B param model can solve challenging math problems or build a website from a napkin sketch, but I don’t think most users (or developers) have a burning need for these capabilities."
This disqualifies his thesis for me. The generative AI revolution is not about having nice little features like summarization implemented in your apps. It is about much greater things than that. It's about real AI agents.
The cat is already out of the bag and $0 free AI models or open source AI cannot be stopped. Anyone building $0 free AI models that can be used on-device, like Stability, Meta, Apple, etc have already 'won' the AI race to zero.
As soon as it is on-device and is as good as or better than GPT-3.5, and far more transparent, then open source AI and $0 free AI models by default eventually reduce prices to $0.
The comments here sound like lots of people here have invested in cloud AI model companies or SaaS businesses and are having to aggressively justify their investment with so-called regulation or regulatory action.
No surprises here.
The same comments were made about Napster. Now everyone is still trading MP3s, but from a commercial perspective the music industry got bailed out.
The censorship is already good enough reason to use open source models, or actually the only option if you want to do stuff that is censored by the commercial models.
Yes, we were impressed originally by the pictures of astronauts, but it turns out astronauts are fairly easy and a suitable choice for a demo that's designed to impress. They don't have faces or fingers, and nobody's going to notice if the astronaut suit doesn't look quite right.
Also, guitars are pretty easy due to there being so many examples. I'll be more impressed if they can get an accordion right.
Ask Dall-E to draw a goalpost mounted on wheels. That'll solve your problem in the general case.
We should expect goalposts for different games to be different. A "wow, neat" game is different from a "can I actually use this" game. It would be weird if they had the same scoring rules.
(And I actually do play accordion, so that's the test I use. Nothing I've tried gets accordions right yet.)
I have two things to say.
Linux did not succeed (though Linux users might think so). Unless you think 1% market share is a win.
Secondly, open source means very little if it’s not paired with a GPL license. Please stop throwing the words Open Source around as if it means “free”.
All in all though, great article. I think we can apply this "community for the greater good" principle in many more areas of society (IRL).
1% desktop share you mean. Over 50% mobile and server share.
At some point an animal is different enough to be called a new species. Android passed that threshold on release day.
But if you still somehow think so, why not install some .deb files, open up that native terminal, and install Steam?
This seems wrong on the face of it. Even assuming open source AI is everything the article promises, surely taking an open source model then adding a pile of specialist private data and lots of training compute is going to make something just a bit better?
I guess the closest software analogy is AWS. Leverage open source hard, but don't bother about giving back.
I guess AWS never gave OpenSearch back to the open source community? Or anything on https://aws.amazon.com/opensource/
When I was writing that comment, I did come across this article from Apr 2023: https://www.infoworld.com/article/3694090/amazon-s-quiet-ope...
But please don't worry too much about the example I used, it's just a tangent. What's wrong with my argument about closed-source AI always having an edge?
(Actually, I can think of one possibility - when a company has an incentive to open-source AI in order to commoditize their complement. Content aggregation companies might want to commoditize content-generating AI in the same way that hardware companies wanted to commoditize software. Same argument holds for Nvidia maybe - what better way to sell silicon than to make sure there are a lot of really useful open source models out there to take advantage of it?)
If you want to make god laugh, make a prediction.
There will be personal-use AI, which could be made cheaply in the future, and then there is pro-grade AI, which solves a different set of problems, such as medical or military. So no, open source will not win; it will be like today, where both can thrive.
Just like Open source won with everything else.. right? Right?
Next year is the Year of Linux
strong doubt because the data sets are the hard part
a nuke and a bioweapon ready to download..
Define "win". What are the victory conditions? Whichever project contributes the most homeless worldwide?