> Indeed, last year GitHub was said to have tuned its programming assistant to generate slight variations of ingested training code to prevent its output from being accused of being an exact copy of licensed software.
If I, a human, were to:
1. Carefully read and memorize some copyrighted code.
2. Produce new code that is textually identical to it, except that in the process of typing it up I randomly and mechanically tweak a few identifiers or something, producing code that has the exact same semantics but isn't character-wise identical.
3. Claim that as new original code without the original copyright.
I assume that I would get my ass kicked legally speaking. That reads to me exactly like deliberate copyright infringement with willful obfuscation of my infringement.
How is it any different when a machine does the same thing?
You might not get your ass kicked. Copyright doesn't protect function, to the point where the court will assess the degree to which the style of the code can be separated from its function. In the event that they aren't separable, the code is not copyrightable.
https://www.wardandsmith.com/articles/supreme-court-announce...
https://easlerlaw.com/software-computer-code-copyrighted#:~:...
Software like Blackduck or Scanoss is designed to identify exactly that type of behaviour. It is used very often to scan closed source software and to check whether it contains snippets that are copied from open source with incompatible licenses (e.g. GPL).
To be able to do so, these tools build a syntax tree of your code snippet and compare the tree structure against similar trees in open source software, without being fooled by variable names. To speed up the search, they also compute a signature for each tree so that the signature can be looked up more easily in their database of open source code.
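To make that concrete, here is a minimal sketch of the idea in Python, assuming the signature is just a hash of the syntax tree with identifiers normalized away (real tools like Blackduck and Scanoss are far more sophisticated than this):

    # Toy structural fingerprint: hash the syntax tree after erasing
    # identifiers, so renaming variables doesn't change the signature.
    import ast
    import hashlib

    def structural_fingerprint(source: str) -> str:
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                node.id = "_"    # erase variable names
            elif isinstance(node, ast.arg):
                node.arg = "_"   # erase parameter names
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                node.name = "_"  # erase function/class names
        return hashlib.sha256(ast.dump(tree).encode()).hexdigest()

    # Two snippets that differ only in naming get the same signature:
    a = "def add(x, y):\n    return x + y"
    b = "def total(first, second):\n    return first + second"
    assert structural_fingerprint(a) == structural_fingerprint(b)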
And that's all well and good, but the code that is claimed to be protected by the GPL still has to stand the abstraction-filtration-comparison test.
The plain fact is that you can claim copyright on plenty of stuff that isn't copyrightable.
Consider AI model weights: they're the result of an automatic process and contain no human expression; almost by definition, model weights shouldn't be copyrightable, but people are still releasing "open source" models with supposed licenses.
But there has to be a threshold. If a GPL project contains a function which takes two variables and returns x+y, and I have functionally identical code in a project I made with an incompatible license, it is obviously absurd to sue me.
You are right but there is no legally defined threshold so it's subjective.
As a matter of fact, the Eclipse Foundation requires every contributor to declare that every piece of code is their own original creation and is not a copy/paste from other projects, with the exception possibly of other Eclipse Foundation or Apache Foundation projects because their respective licenses allow that. Even code snippets from StackOverflow are formally forbidden.
If I am not mistaken, in the Oracle-Google trial over Java on Android, Google's re-implementation of the Java API was ultimately considered fair use, because Google kept the original "signatures" of the Java SDK API and rewrote most of the implementation, copying only "0.4% of the total Java source code" [1]. However, the trial came to this conclusion only after several iterations in court.
[1] https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....
You're right, there is. The threshold is whatever a court decides is "substantial similarity" in that particular case. But there's no way to know that ahead of time as the interpretation/decision is subjective.
The simple version is that code is copyrightable as an expression, and the underlying algorithm is patentable.
The legal term you're looking for here is the "Abstraction-Filtration-Comparison" test: what remains if you subtract all the non-copyrightable elements from a given piece of code.
Algorithms have become patentable only very recently in the history of patents, without a rationale ever being provided for this change, and in some countries they have never become patentable.
Even in the countries other than the USA where algorithms have become patentable, that happened only due to the USA blackmailing those countries into changing their laws "to protect (American) IP".
It is true however that there exist some quite old patents which in fact have patented algorithms, but those were disguised as patents for some machines executing those algorithms, in order to satisfy the existing laws.
Doesn't really matter, the point is that they're patentable. They clearly shouldn't be IMO, but they are.
US copyright does protect for "substantial similarity" [0]. And at the other end of the spectrum, this has been abused in absurd ways to argue that substantially different code has infringed.
In Zenimax vs Oculus they basically argued that a bunch of really abstract yet entirely generic parts of the code were shared (we are talking some nested for loops and certain combinations of if statements), and due to the courtroom's lack of a qualitative understanding of code, syntax, common patterns, and what might actually qualify as substantively novel code, this was accepted as infringing. [1]
Point is, the legal system is highly selective when it comes to corporate interests.
[0] https://en.wikipedia.org/wiki/Substantial_similarity
[1] https://arstechnica.com/gaming/2017/02/doom-co-creator-defen...
> US copyright does protect for "substantial similarity"
Substantial similarity refers to three different legal analyses for comparing works. In each case what the analysis is attempting to achieve is different, but in no case does it operate to prohibit similarity, per se.
The Wikipedia page points out two meanings. The first is a rule for establishing provenance. Copyright protects originality, not novelty. The difference is that if two people coincidentally create identical works, one after another, the second-in-time creator has not violated any right of the first. (Contrast with patents, which do protect novelty.) In this context, substantial similarity is a way to help establish a rebuttable presumption that the latter work is not original, but inspired by the former; it's a form of circumstantial evidence. Normally a defendant wouldn't admit outright they were knowingly inspired by another work, though they might admit this if their defense focuses on the second meaning, below. The plaintiff would also need to provide evidence of access or exposure to the earlier work to establish provenance; similarity alone isn't sufficient.
The second meaning relates to the fact that a work is composed of multiple forms and layers of expression. Not all are copyrightable, and the aggregate of copyrightable elements needs to surpass a minimum threshold of content. Substantial similarity here means a plaintiff needs to establish that there are enough copyrightable elements in common. Two works might be near identical, but not be substantially similar if they look identical merely because they're primarily composed of the same non-copyrightable expressions, regardless of provenance.
There's a third meaning, IIRC, referring to a standard for showing similarity at the pleadings stage. This often involves a superficial analysis of apparent similarity between works, but it's just a procedural rule for shutting down spurious claims as quickly as possible.
> Point is, the legal system is highly selective when it comes to corporate interests.
I don't even think it's that. In recent cases like Oracle v. Google and Corellium v. Apple, Fair Use prevailed with all sorts of conflicting corporate interests at play. The Zenimax v. Oculus case very much revolved around NDAs that Carmack had signed and not the propagation of trade secrets. Where IP is strictly the only thing being concerned, the literal interpretation of Fair Use does still seem to exist.
Or for a plainer example, Authors Guild v. Google, where Google defended its indexing of thousands of copyrighted books as Fair Use.
In fact, I'd go so far as to argue that your example of Authors Guild v. Google is a good indication that most cases will probably go an AI platform's way. It's a pretty parallel case to a number of the arguments. Indexing required ingesting whole works of copyrighted material verbatim. It utilized that ingested data to produce a new commercial work consisting of output derived from that data. If I remember the case correctly, Google even displayed snippets when matching a search so the searcher could see the match in context, reproducing the works verbatim for those snippets, and one could presume (though I don't recall if it was coded against) that with sufficiently clever search prompts, someone could get the index search to reproduce a substantial portion of a work.
Arguably, the AI platforms have an even stronger case as their nominal goal is not to have their systems reproduce any part of the works verbatim.
> In fact, I'd go so far as to argue that your example of Authors Guild v. Google is a good indication that most cases will probably go an AI platform's way.
The more recent Warhol decision argues quite strongly in the opposite direction. It fronts market impact as the central factor in fair use analysis, explicitly saying that whether or not a use is transformative depends in decent part on the degree to which it replaces the original. So if you're building a generative AI tool that produces stock photos, trained by scraping stock photo databases... I mean, the fair use analysis need consist of nothing more than that sentence to conclude that the use is totally not fair; none of the factors weigh in favor of it.
I think that decision is much narrower than "market impact". It's specifically about substitution, and to that end, I don't see a good argument that Co-Pilot substitutes for any of the works it was trained on. No one is buying a license to co-pilot to replace buying a license to Photoshop, or GIMP, or Linux, or Tux Racer. Nor is Github selling co-pilot for that use.
To the extent that a user of Copilot could induce it to produce enough of a copyrighted work to both infringe on the content (remember that algorithms are not protected by copyright) and substitute for the original by being licensed in lieu of it, I would expect the courts to examine that the way they currently view a xerox machine being used to create copies of a book. While the machine might have enabled the infringement, it is the person using the machine to produce and then distribute copies who is doing the infringing, not the xerox machine itself nor Xerox the company.
Specifically in the opinion the court says:
> If an original work and a secondary use share the same or highly similar purposes, and the secondary use is of a commercial nature, the first factor is likely to weigh against fair use, absent some other justification for copying.
I find it difficult to come up with a good case that any given work used to train Copilot and Copilot itself share "the same or highly similar purposes". Even in the case of, say, a code generator that was used in the training of Copilot, I think the courts would also be looking at the degree to which Copilot is dependent on that program. I don't know offhand if there are any court cases challenging the use of copyrighted works in a large collage (like, say, a portrait of a person made from Time Magazine covers), but again my expectation here is that the court would find that while the entire work (that is, the magazine cover) was used and reproduced, that reproduction is a tiny fraction of the secondary work and not substantial to its purpose.
Similarly we have this line:
> Whether the purpose and character of a use weighs in favor of fair use is, instead, an objective inquiry into what use was made, i.e., what the user does with the original work.
Which I think supports my comparison to the xerox machine. If the plaintiffs against Copilot could have shown that a substantial majority of users and uses of Copilot were producing infringing works, or works that substitute for the training material, they might prevail in an argument that Copilot is infringing regardless of the intent of GitHub. But I suspect even that hurdle would be pretty hard to clear.
Of the various recent uses of generative AI, Copilot is probably the one most likely to be found fair use and image generation the least likely.
But in any case, Authors Guild is not the final word on the subject, and anyone trying to argue for (or against) fair use for generative AI who ignores Warhol is going to have a bad day in court. The way I see it, Authors Guild says that if you are thoughtful about how you design your product, and talk to your lawyers early and continuously about how to ensure your use is fair and will be seen as fair in the courts, you can indeed do a lot of copying and still be fair use.
I agree. Nothing is going to be the final word until more of these cases are heard. But I still don't think Warhol is as strong even against other uses of generative AI, and in fact I think in some ways argues in their favor. The court in Warhol specifically rejects the idea that the AWF usage is sufficiently transformed by the nature of the secondary work being recognizably a Warhol. I think that would work the other way around too, that a work being significantly in a given style is not sufficient for infringement. While certainly someone might buy a license to say, Stable Diffusion and attempt to generate a Warhol style image, someone might also buy some paints and a book of Warhol images to study and produce the same thing. Provided the produced images are not actually infringements or transformations of identifiably original Warhol works, even if they are in his style, I think there's a good argument to be made that the use and the tool are non-infringing.
Or put differently, if the Warhol image had used Goldsmith's image as a reference for a silk screen portrait of Steven Tyler, I'm not sure the case would have gone the same way. Warhol's image is obviously and directly derived from Goldsmith's image and was found infringing when licensed to magazines; yet if Warhol had instead gone out and taken black and white portraits of Prince, even in Goldsmith's style after having seen it, would it have been infringing? I think the closest case we have to that would be the suit between Huey Lewis and Ray Parker Jr. over "I Want a New Drug"/"Ghostbusters", but that was settled without a judgement.
I do agree that Warhol is a stronger argument against artistic AI models, but it would very much have to depend on the specifics of the case. The AWF usage here was found to be infringing, with no judgement made of the creation and usage of the work in general, but specifically with regard to licensing the work to the magazine. They point out the opposite case that his Campbell paintings are well established as non-infringing in general, but that the use of them licensed as logos for soup makers might well be. So as is the issue with most lawsuits (and why I think AI models in general will win the day), the devil is in the details.
A key finding by the judge in the Authors Guild v. Google case was that the authors benefited from the tool that Google created. A search tool is not a replacement for a book, and is much more likely to generate awareness of the book, which in turn should increase sales for the author.
AI platforms that replace and directly compete with authors cannot use the same argument. If anything, those suing AI platforms are more likely to bring up Authors Guild v. Google as a guiding case for determining when to apply fair use.
Copyright is abused often. Our modern version of copyright is BS and only benefits large corps who buy a lot of IP.
Yep. Now it is a legal cudgel wielded most effectively by corporate giants. It has mutated to become completely philosophically opposed to what it was expressly created to protect.
If I were to license a cover of a song for a music video, I'd have to license both the original song and the cover itself.
I'd say this is extremely relevant in this case.
if that is the case why do people ever license covers?
to clarify - I thought you just had to negotiate with the cover artist about rights and pay a nominal fee for usage of the song for cover purposes - that is to say you do not negotiate with the original artist, you negotiate with a cover artist and the whole process is cheaper?
You're maybe thinking about this in a way that's not helping you to understand the system and why it works the way it does. It's very clear when you think of a specific case.
Say you want to make a recording of "Valerie" by the Zutons. You need permission (a license) from the songwriters (the Zutons presumably) to do this. You usually get this permission by paying a fee. Having done that, you can do your recording. Whenever that recording is played (or used) you will get a performance royalty and they will get a songwriting royalty.
Say you want to use a cover of "Valerie" by the Zutons in your film or whatever. Say the Mark Ronson version featuring Amy Winehouse. You need permission (a license) from the person who produced that version (Mark Ronson or his company) and will need to pay them a fee, some of which goes to the songwriter as part of their deal with Mark Ronson which gave him the license to produce his cover in the first place.
The Zutons don't have the right to sell you a license to Mark Ronson's version so if that's the version you want you have to negotiate with him. Likewise he doesn't have the right to sell you a license like the license he has (ie a license to do a recording/performance) so if you want that you have to negotiate with them.
OK it seems exactly what I thought and described, and the opposite of what the parent poster described. The parent poster said that if you want to use the cover of the song you need to negotiate with both the people who did the cover and the original rights owner.
The closest I could get to a situation like that would be if I told Band B do a cover of Song A for my movie and I paid the licensing costs as part of my deal with Band B, but still not the same as the parent poster's description.
Cover songs are governed by a special and explicit law (the compulsory mechanical license). Not relevant.
While correct, the example given is that they COPY the code, then make adjustments to hide the fact. I suspect this is still a copyright violation. It’s interesting that a judge sees it differently when it’s just run through a programme. I’m not a legal expert so I’m guessing it’s a bit more complex than the headline?
Ok, I read the article, and it looks like the issue is the DMCA specifically, which requires the code to be more identical than was presented. I'm guessing separate claims could still come from other copyright laws?
No copy-paste was explicitly used. They compressed it into a latent space and recreated from memory, perhaps with a dash of "creativity" for flavor. Hypothetically, of course.
The distinction is pedantic but important, IMHO. AI doesn't explicitly copy either.
But isn’t that the same as memorising it and rewriting the implementation from memory? I’m sure “it wasn’t an exact reproduction” is not much of a defence.
I sure think so. I also think that (to first order) this is exactly what modern AI products do. Is a lossy copy still a copy?
I would have thought so but I’m not a lawyer. The article suggests DMCA is intended for direct copies so that’s why it failed here. Maybe more general copyright laws would apply for lossy copies.
You have a much smaller lobbying budget than the AI industry, and you didn't flagrantly rush to copy billions of copyrighted works as quickly as possible and then push a narrative acting like that's the immutable status quo that must continue to be permitted lest the now-massive industry built atop copyright violation be destroyed.
Violate one or two copyrights, get sued or DMCAed out of existence. Violate billions, on the other hand, and you magically become immune to the rules everyone else has to follow.
> Violate one or two copyrights, get sued or DMCAed out of existence. Violate billions, on the other hand, and you magically become immune to the rules everyone else has to follow.
Sounds like the same concept as commonly said of "murderer vs conqueror".
Could probably be applied to many other fields for disruption too. Not the murderer bit (!), more the "break one or two laws -> scaled up massively to a potential new paradigm".
"If you owe the bank $100 that's your problem. If you owe the bank $100 million, that's the bank's problem."
Pretty sure there's a bunch of pre-existing laws around that though, so not really ripe for disrupting by scaling up the problem. ;)
There's a strong geopolitical angle as well. If you force American companies to license all training data for LLMs, that is such a gargantuan undertaking it would effectively set US companies back by years relative to Chinese competitors, who are under no such restrictions.
Bottom line, if you're doing something considered relevant to the national interest then that buys you a lot of leeway.
You will need to first demonstrate that actual copying took place, and that whatever copying did take place was actually illegal or infringing.
As we're seeing in court, that's a very interesting question. It turns out that the answers are very counter-intuitive to many.
What about copyright's purpose of furthering the arts and sciences?
You want to look at the Supreme Court case "Eldred v. Ashcroft." Eldred challenged Congress's retroactive extension of existing copyrights, arguing that extending protection on works that already exist could not possibly further the arts and sciences. They also argued that if Congress had the power to continually extend existing copyrights by N years every N years, the Constitutional power of "for a limited time" had no meaning.
The Supreme Court's decision was a bunch of bullshit around "well, y'know, people live longer these days, and some creators are still alive who expected these to last their whole lives, and golly, coincidentally this really helps giant corporations."
Copyright has utterly failed to serve that purpose for a long time, and has been actively counterproductive.
But if you want to argue that copyright is counterproductive, I completely agree. That's an argument for reducing or eliminating it across the board, fairly, for everyone; it's not an argument for giving a free pass to AI training while still enforcing it on everyone else.
Could these "free passes" for AI training serve as a legal wedge to increase the scope of fair use in other cases? Pro-business selective enforcement sucks, but so long as model weights are being released and the public is benefiting then stubbornly insisting that overzealous copyright laws be enforced seems self-defeating.
Without copyright, entire industries would've been dead a long time ago, including many movies, games, books, tv, music, etc.
Just because their lobbies tend to push the boundaries of copyright into the absurd doesn't mean these industries aren't worth saving. There should be lawmakers who actually seek a balance of public and commercial interests.
> Without copyright, entire industries would've been dead a long time ago, including many movies, games, books, tv, music, etc.
Citation needed. There are many ways to make money from producing content other than restricting how copies of it can be distributed. The owner should be able to choose copyright as a means of control, but that doesn't mean nobody would create any content at all without copyright as a means of control.
There's nothing preventing people from producing works and releasing them without copyright restriction. If that were a more sustainable model, it would be happening far more often.
As it is now, especially in the creative fields (which I am most knowledgeable about), the current system has allowed for an incredible flourishing of creation, which you'd have to be pretty daft to deny.
> If that were a more sustainable model, it would be happening far more often.
that's not the argument. The fact that there currently are restrictions on producing derivative works is the problem. You cannot produce a star wars story, without getting consent from disney. You cannot write a harry potter story, without consent from Rowling.
That's not actually true. There's nothing stopping you from producing derivative works. Publishing and/or profiting from other people's work does have some restrictions though.
There's actually a huge and thriving community of people publishing derivative works, in a not-for-profit basis, on Archive of Our Own. (Among other places.)
> There's actually a huge and thriving community of people publishing derivative works, in a not-for-profit basis, on Archive of Our Own. (Among other places.)
Yes, and none of those people are making a living at creating things. That's why they are allowed by the copyright owners to do what they're doing--because it's not commercial. Try to actually sell a derivative work of something you don't own the copyright for and see how fast the big media companies come after you. You acknowledge that when you say there are "restrictions" (an understatement if I ever saw one) on profiting from other people's work (where "other people" here means the media companies, not the people who actually created the work).
It is true that without our current copyright regime, the "industries" that produce Star Wars, Disney, etc. products would not exist in their current form. But does that mean works like those would not have been created? Does it mean we would have less of them? I strongly doubt it. What it would mean is that more of the profits from those works would go to the actual creative people instead of middlemen.
> Yes, and none of those people are making a living at creating things.
Again, not true. One of the most famous examples is likely Naomi Novik, who is a bestselling author, in addition to a prolific producer of derivative works published on AO3. Many other commercially successful authors publish derivative works on this platform as well.
> It is true that without our current copyright regime, the "industries" that produce Star Wars, Disney, etc. products would not exist in their current form. But does that mean works like those would not have been created? Does it mean we would have less of them? I strongly doubt it. What it would mean is that more of the profits from those works would go to the actual creative people instead of middlemen.
Speculate all you want about an alternative system, but you really don't know what would have happened, or what would happen moving forward.
> not true
Sorry, I meant they're not making a living at creating derivative works of copyrighted content. They can't, for the reasons you give. Nor can other people make a living creating derivative works of their commercially published work. That is an obvious barrier to creation.
> the current system has allowed for a incredible flourishing of creation
No, the current system has allowed for an incredible flourishing of middlemen who don't create anything themselves but coerce creative people into agreements that give the middlemen virtually all the profits.
People do not put out their stuff. People get lured into contracts selling their IP to a shitty company that then publishes the stuff, of course WITH copyright, so the company can make money while the artist doesn't.
Given that copyrighting is automatic at the instant of creation, that is, um, debatable.
Slapping 3 lines in LICENSE.TXT doesn’t override the Berne convention.
Are you claiming that an author cannot place their work in the public domain?
Yes, they can't, because there is no legally reliable way to do it (briefly, because the law really doesn't like the idea of property that doesn't have an owner, so if you try to place a work of yours in the public domain, what you're actually doing is making it abandoned property so anyone who wants to can claim they own it and restrict everyone else, including you, from using it). The best an author can do is to give a license that basically lets anyone do what they want with the work. Creative Commons has licenses that do that.
In most of the world no, they can't.
Copyright laws prevent piracy. It is interesting to live in a country with no enforced copyrights and EVERYTHING is pirated. I think it is easy to not know about that context and just see the stick side of copyright vis-a-vis big money corporations
Technically speaking, copyright laws create piracy: without them we would still have our free speech rights to share whatever we want without approval from third parties, and so-called piracy, aka copyright infringement, would not be a thing. Laws also hardly prevent sharing of copyrighted content; they only make it illegal.
> we would still have our free speech rights to share whatever we want
This is a false dichotomy. It's not "free speech" to copy someone else's video game and then sell it for your own profit. By "copy", in the old days, that was literally duplicating the distribution CDs and providing a cracked keycode. It was not even a question of trademarks being close or whatnot; it's literally people taking the stuff, duplicating it, and selling it for their own profit. Eastern European mafia were greatly financed by this and ran this type of operation at industrial scale.
> Laws also hardly prevent sharing of copyrighted content, they only make it illegal.
Yeah, that's the point. Without that, everything is bootlegged. Imagine video games - they get bootlegged. DVDs, all bootlegged. Clothing bootlegged. Whatever your business is - bootlegged. Zero copyright is not a utopia of free speech, it is people ripping everyone else off. Per lived experience, I'm just saying the other extreme is not a utopia.
So true! Copyrights that last 20 years would be completely reasonable. Maybe with exponentially increasing fees for successive renewals, for super valuable properties like Disney movies.
Nobody cares anymore. We're sick of their rent seeking, of their perpetual monopolies on culture. Balance? Compromise? We don't want to hear it.
Nearly two hundred years ago one man warned everyone this would happen. Nobody listened. These are the consequences.
"At present the holder of copyright has the public feeling on his side. Those who invade copyright are regarded as knaves who take the bread out of the mouths of deserving men. Everybody is well pleased to see them restrained by the law, and compelled to refund their ill-gotten gains. No tradesman of good repute will have anything to do with such disgraceful transactions. Pass this law: and that feeling is at an end. Men very different from the present race of piratical booksellers will soon infringe this intolerable monopoly. Great masses of capital will be constantly employed in the violation of the law. Every art will be employed to evade legal pursuit; and the whole nation will be in the plot. On which side indeed should the public sympathy be when the question is whether some book as popular as “Robinson Crusoe” or the “Pilgrim’s Progress” shall be in every cottage, or whether it shall be confined to the libraries of the rich for the advantage of the great-grandson of a bookseller who, a hundred years before, drove a hard bargain for the copyright with the author when in great distress? Remember too that, when once it ceases to be considered as wrong and discreditable to invade literary property, no person can say where the invasion will stop. The public seldom makes nice distinctions. The wholesome copyright which now exists will share in the disgrace and danger of the new copyright which you are about to create. And you will find that, in attempting to impose unreasonable restraints on the reprinting of the works of the dead, you have, to a great extent, annulled those restraints which now prevent men from pillaging and defrauding the living."
https://www.thepublicdomain.org/2014/07/24/macaulay-on-copyr...
Books, music, and games are a lot older than copyright.
Have you looked at who created these things, by and large? For the most part, you have:
- aristocrats who were wealthy and didn't need to "work" to survive and put food on the table
- craftspeople supported through the patronage of a rich person (or religious order) who deigned to support their art
- (in the more modern world) national governments who want to support their national art, often out of fear that larger nations' cultural influences will dwarf their own
Are you implying that these three pillars will be able to produce anywhere near the current amount of content we produce?
How, in a world where digital copies are effectively free to copy ad infinitum, would a creator reap any benefits from that network effect?
A modern equivalent would be famous YouTubers whose whole job is to "watch" other people's hard-earned videos. The super lazy ones don't direct people to the original, don't provide meaningful commentary, and just consume the video as 'content' to feed their own audience, providing no value to the original creator. Killing copyright entirely would amplify this "just bypass the original source" dynamic, lowering the value of the original creator to zero.
> Are you implying that these three pillars will be able to produce anywhere near the current amount of content we produce?
Do you think the vast "amount of content we produce" is actually propped up by copyright? Have you ever heard of someone who started their career on YouTube due to copyright? On the contrary, how often have you heard of people stopping their YouTube career due to copyright, or explicitly limiting the content they create? I have only heard of cases of the latter. In fact, the latter partially happened to me.
> How in the world where digital copies are effectively free to copy and infinitum would a creator reap any benefits from that network effect?
You are making an assumption that people should reap (monetary) benefits for creating things. What you are ignoring is that the world where digital copies are effectively free is also the world where original works are insanely cheap as well. In this world, people create regardless of monetary gain.
To make this point: how much money did you make from this comment that you posted? It's covered by copyright, so surely you would not have created it if not for your own benefit.
Spending 6 minutes of my life engaging in political discourse is a far swing from hundreds of individuals producing a movie that took millions of dollars to make. Both are just as easily digitally repeatable, but the expensive content is likely way more beneficial to society as a whole. I am choosing to engage in this hobby because I have the means to provide this content recreationally. I fail to see this scaling to anything of real quality outside of some isolated instances.

For instance, some video game enthusiasts are using the work of Bethesda to make a new game called Fallout: London. It's a knock-off Fallout game using the engine that Bethesda built for their commercial games. It's exceptional in that it could actually achieve a mostly comparable level to a commercial product, as long as you ignore that they're leveraging an engine and story developed by commercial interests. In the same time, tens to hundreds of thousands of people are employed every year to produce video games for commercial reasons. Would they all stop making games if copyright were dead? No, but the vast majority would.
> Are you implying that these three pillars will be able to produce anywhere near the current amount of content we produce?
Yes, and better quality content too as it doesn't need to be compromised as much to allow for commercial exploitation in the current model.
But these are also not the only ways to fund content. Patronage in particular does not need to be restricted to singular rich patrons but can be extended to any group of people who decide to come together to make something exist. This already happens to some extent (e.g. Kickstarter) but is actually hobbled by copyright, where the norm is that the creator retains all rights while individual contributors to the funding are restricted in how they are allowed to share the creation they helped realize.
> How in the world where digital copies are effectively free to copy and infinitum would a creator reap any benefits from that network effect?
By having fans willing to pay him to create new content.
For that matter, if you think China ripping everyone else off is bad now… well, just wait until every company can do that.
If everyone could do it, it wouldn't be as big a deal - small western businesses would be on a more level playing field, since they would be almost as immune from being sued by big businesses as Chinese businesses are. As it is, small businesses aren't protected by patents (because a patent is a $10k+ ticket to a $100k+ lawsuit against a competitor with a $1M+ budget for lawyers) while still being bound by the restrictions of big business's patents. It's lose/lose.
Trademark isn't copyright, so no.
Yeah many industries like:
- Big Corps that buy IP
- Patent Trolls
- Companies that fuck over artists
Why would anyone make video games if they couldn't make money from selling them?
Video games would actually be better off if the profit incentive were removed. Modern high-budget video games have become indistinguishable from slot machines, optimized by literal psychologists to get you to waste as much of your money (and time) as possible without providing any meaningful experience. I'd rather see far fewer games created if what remains are games focused on having artistic and/or educational value rather than serving as investment opportunities for Wall Street.
This is just your own sanctimony. Go to a GameStop and ask people if they think we should have an IP regime with no GTA or football games. What a ridiculous response.
Out of passion for the art. See also: free (libre) software video games released and distributed for free (gratis).
Of course, money is a huge motivator, but so is self-expression.
Well they certainly aren't on level with each other in terms of motivation, so I don't think it's fair for you to say they are both huge motivators.
This is a specious argument. It is impossible for us to gesture at the works of art that do not exist because of draconian copyright. Humans have been remixing each others' works for millions of years, and the artificial restriction on derivative work is actively destroying our collective culture. There should be thousands of professional works (books, movies, etc.) based on Lord Of The Rings by now, many of which would surpass the originals in quality given enough time, and we have been robbed of them. And Lord Of The Rings is an outlier in that it still remains culturally relevant despite its age; most works will remain copyrighted for far longer than their original audience was even alive, meaning that those millions of flowers never get their chance to bloom.
> It is impossible for us to gesture at the works of art that do not exist because of draconian copyright.
We can gesture at the tiniest tip of the iceberg by observing things that are regularly created in violation of copyright but not typically attacked and taken down until they get popular:
- Game modding, romhacks, fangames, remakes, and similar.
- Memes (often based on copyrighted content)
- Stage play adaptations of movies (without authorization)
- Unofficial translations
- Machinima
- Speedruns, Let's Play videos, and streams (very often taken down)
- Music remixes and sampling
- Video mashups
- Fan edits/cuts, "Abridged" series
- Archiving and preservation of content that would otherwise be lost
- Fan films
- Fanfiction
- Fanart
- Homebrew content for tabletop games
> "- Speedruns, Let's Play videos, and streams (very often taken down)"
Very often taken down, but only by Nintendo.
There are several other publishers who regularly go after gameplay footage of people playing their games. It's not as visible, because it's hard to notice the absence of a thing.
This is all true, and in a vacuum I agree with it. There's a pretty core problem with these kinds of assertions, though: people have to make rent. Never have I seen a substantive, pass-the-sniff-test argument for how to make this system practical when your authors and your artists need to eat in a system of modern capital.
So I'm asking genuinely: what's your plan? What's the A to B if you could pass a law tomorrow?
> What's the A to B if you could pass a law tomorrow?
Top priority: UBI, together with a world in which there's so much surplus productivity that things can survive and thrive without having "how does this make huge amounts of money" as its top priority to optimize for.
Apart from that: Conventions/concerts/festivals (tickets to a unique live event with a crowd of other fans), merchandise (pay for a physical object), patronage (pay for the ongoing creation of a thing), crowdfunding/Kickstarter (pay for a thing to come into existence that doesn't exist yet), brand/quality preference (many people prefer to support the original even if copies can be made), commissions (pay for unique work to be created for you), something akin to "venture funding", and the general premise that if a work spawns ten thousand spinoffs and a couple of them are incredible hits they're likely to direct some portion of their success back towards the work they build upon if that's generally looked upon favorably.
People have an incredible desire both to create and to enjoy the creations of others, and that's not going to stop. It is very likely that the concept of the $1B movie would disappear, and in trade we'd get the creation of far far more works.
> UBI, together with a world in which there's so much surplus productivity that things can survive and thrive without having "how does this make huge amounts of money" as its top priority to optimize for.
The poster didn't posit it as "how does this make huge amounts of money"; they asked how authors are supposed to pay their rent in your scenario. Your solution, of course, has nothing to do with copyright policy.
Yeah, this is what I was expecting. I have no love for Disney et al but I think that this is dire (aside from UBI, which would be great but is fictional without a large-scale shift in American culture).
"Everybody else gets paid for the work they do; you get paid for things around the work you do, if you're lucky" is a way to expect creatives to live that, to put a point on it, always ends up being "for thee, but not for me". It's bad enough today--I think you described something worse.
The current model is "most people get paid for the work they do, but you get paid for people copying work you've already done", which already seems asymmetric. This would change the model to "people get paid for the work they do, and not paid again for copying work they've already done".
We converged on a system that protects the commercialization of copies because, in practice, "the first copy costs $X0,000" is not a viable way to pay your rent.
If we want art to be the province of the willfully destitute or the idle rich (and I do mean rich, the destruction of a functional middle class has compacted the available free time of huge swaths of society!), this is a good way to do it. I would rather other voices be included.
We converged on a system that makes copying illegal because that system was invented in an era when the only people who could copy were those with specialized equipment (e.g. printing presses). In that world, those who might do the copying were often larger than those whose works were being copied, and copyright had more potential to be "protective".
That system hasn't been updated for a world in which everyone can make perfect-fidelity copies or modifications at the touch of a key; on the contrary, it's been made stricter. And worse, per the story we're commenting on here, the much larger players who are mass-copying works largely by individuals or smaller entities have become effectively exempt from copyright, while copyright continues to restrict individuals and smaller entities, and the systems designed by those large players and trained on all those copied works are crowding individuals out of art and other creative endeavors.
I don't think the current system deserves valorizing, nor can it be credited as being intentionally designed to bring about most of the effects it currently serves.
I'm not suggesting that deleting copyright overnight will produce a perfect system, nor am I suggesting that it has zero positive effects. I'm suggesting that it's doing substantial harm and needs a massive overhaul, not minor tweaks.
> the much larger players who are mass-copying works largely by individuals or smaller entities have become effectively exempt from copyright
That's not true. I'm a copyright attorney and I spend my day extracting money from the largest players on behalf of individuals.
I was referring to AI training here.
We'll see, but hopefully they will not.
They don't have to copy work, they can make their own work!
Many of the funding models Josh listed are direct payment for creative work being done. If anything, in the current model creative work is often not paid directly (unless done as work for hire, where the creative doesn't get to own their creation) but is instead a gamble that you can later profit from the "intellectual property".
Not the person you responded to, but:
>So I'm asking genuinely: what's your plan? What's the A to B if you could pass a law tomorrow?
Patreon (or Liberapay, etc). Take a look at YouTube: so many creators are actively saying "YouTube doesn't pay the bills, if you like us then please support us on Patreon". Patreon works, some of the time at least, just like copyright. Also crowdfunding (e.g. Kickstarter), which worked out well for games like FTL and Kingdom Come: Deliverance.
Although, I personally don't believe copyright should be abolished; it just needs some amendments. It needs a duration amendment: not a flat duration (fast fashion doesn't need even 5 years of copyright, but aerospace software regularly needs several decades just to become profitable), but either some duration mechanism or a simple discrimination by industry.
Also, I think any sort of functional copyright (e.g. software copyright) ought to have an incentive or requirement to publish the functional bits - for instance, router firmware ought to require the source code in escrow (to be published once copyright duration expires) for any legal protections against reverse-engineering to be mounted. Unpublished source code is a trade secret, and should be treated as such.
Also, these discussions don't seem to mention fanfiction, which demonstrates plenty of people write good works without being professionally paid and without the protection of copyright.
How many subscribers on Patreon are there because the creator provides pay-walled extra content? How many would remain if that pay-walled content were mirrored directly on YouTube?
Crowdfunding might work better, but how many would donate to a game when, instead of getting it cheaper as a Kickstarter supporter, they could get it free after it is released?
I completely forgot about Patreon's paywalled content. Plenty of channels don't have any, though, so I don't think it's that important.
Copyright is not optimized for making sure artists and authors get enough to eat. It's optimized for people with a lot of money to make even more money by exploiting artists and authors.
I doubt there's a simple answer (I certainly don't have one), but the current system is not exactly a creators' utopia.
My own business model is to create Things That Don't Exist Yet. This (typically bespoke work) is actually the majority of work in any era I think. For me, copyright doesn't do much, it mostly gets in the way.
If you pass the law tomorrow -all else being equal- my profits would stay equal or go up somewhat.
Fashion is traditionally not copyrightable [1], and the fashion industry is doing rather well.
Similarly, our IT infrastructure is now built mostly on [a set of patches to the copyright system][2] called F/L/OSS that provided more freedom to authors and users, and led to more innovation and proliferation of solutions.
So even just in the modern west, we can see thriving ecosystems where copyright is absent or adjusted; and where the outcomes are immediately visible on the street.
[1] Though a quick search shows that lawyers are making inroads.
[2] One way of describing it at least, YMMV.
That ship sailed long ago. While copyright can and is used at times to protect the "little guy", the law is written as it is in order to protect and further corporate interests.
The current manifestation of copyright is about rent-seeking, not promoting innovation and creativity. That it may also do so is entirely coincidental.
Also, if it wasn't about rent-seeking and preventing access to works, copyright wouldn't have to last for decades, many multiples of a work's useful commercial life. The fact that it does last this long shows that it's not about promoting innovation and creativity.
Copyright was invented by a cartel of noblemen, the British Stationer's Company, who, due to liberal reform, were going to lose their publishing monopoly. The implementation of copyright law as they helped pen allowed them to mostly continue their position while portraying it as "protecting the little guy".
Funny how both the rhetoric and intentions are the same after three hundred years.
Copyright’s purpose is a cudgel to be wielded to enrich the holder for, ideally, eternity. If “eternity” is threatened, you use proceeds from copyright to change copyright law to protect future proceeds.
works the same for banks and owing them money
Violating billions or millions is what they used to nail warez folks with. So there is that.
> acting like that's the immutable status quo
It is immutable.
What are you going to do about it? Confiscate everyone's home gamer PCs?
Even in the most extreme hypothetical where lawsuits shutdown OpenAI, that doesn't delete the stable diffusion models that I have on my external hard drives.
The tech is out there. It's too late.
Somehow this argument does not seem to hold for copyright enforcement of works that have been shared over BitTorrent and its predecessors for decades.
I can start downloading any major/popular piece of media, starting right now, in under 60 seconds through bittorrent.
I cannot think of a better example of how futile copyright enforcement has been than the example that you just brought up.
That's a significant oversimplification of how it works, though, to the point of almost not being a useful analogy.
If your analogy were that you were a human who memorized every variation of a problem (and every other known problem), and there was a tiny percentage of a chance that you reproduced the exact variation of one you memorized, but then you added an after-the-fact filter so you don't directly reproduce it...
It's more like musicians who basically absorb a bunch of music patterns or chord progressions, then notice their final output sounds too similar to another song (which happens often IRL) and change it to be more original before releasing it to the public.
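For a concrete picture of what such an after-the-fact filter could look like, here is a hedged sketch (GitHub hasn't published how Copilot's duplication check actually works, and the names here are made up): flag a suggestion whose token n-grams overlap too heavily with an index of training code.

    # Hypothetical post-generation filter, an assumed design rather than
    # GitHub's actual implementation: flag a suggestion when too many of
    # its token n-grams appear verbatim in an index of the training corpus.
    def ngrams(tokens, n=10):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def looks_like_verbatim_copy(suggestion_tokens, training_index, n=10, threshold=0.5):
        grams = ngrams(suggestion_tokens, n)
        if not grams:
            return False
        hits = sum(1 for g in grams if g in training_index)
        return hits / len(grams) >= threshold

    # A caller would suppress or re-sample the suggestion when this returns
    # True; training_index would be a precomputed set of n-grams over
    # ingested code.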
> If your analogy were that you were a human who memorized every variation of a problem (and every other known problem)
This is mere assumption. AI is supposed to work like that, but that's a goal, not the result of current implementations. Research shows that they do memorize solutions as well, and quite regularly so. (This is an unavoidable flaw in current LLMs; they must be capable of memorizing input verbatim in order to learn specific facts.)
> and there was a tiny percentage of a chance that you reproduced the exact variation of one you memorized
This is copyright infringement. Actionable copyright infringement. The big music publishers go after this kind of accidental partial reproduction.
> but then added an after the fact filter so you don't directly reproduce it...
"Legally distinct" is a gimmick that only works where the copyright is on specific identifiable parts of a work.
Changing a variable name does not make a code snippet "legally distinct", it's still copyright infringement.
Meh, I still see that as a big oversimplification. Context matters, even if the copyright courts often ignore that for wealthy entities. Someone reproducing a song using AI and publishing it as their own is copyright infringement; a person specifically querying an AI engine that sucked up billions of lines of information, which generates what you ask it to and, with a small probability, reproduces a small subset of a larger commercial project in a chat box, is not exactly the same, IMO.
This is GitHub Copilot, after all. I use it daily, and it autocompletes lines of code or generates functions you can find on Stack Overflow. It's not giving you the source code to Twitter in full and letting you put it on the internet as a business under another name.
We are currently seeing the music industry react to AI learning a bunch of music patterns and chord progressions and outputting works that sound very similar to existing music and artists. They are not liking it.
To see just how much they dislike it: YouTube's copyright strike system is basically a trained AI that detects music patterns, identifying sounds with slight variations from copyrighted songs and taking videos down. Generating slight variations was one of the early methods videos used to bypass the takedown system.
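As a toy illustration of why slight variations eventually stop working (this is nothing like the real internals of a system such as Content ID), a fingerprint built from spectral peak positions survives volume changes and small perturbations that would defeat byte-for-byte matching:

    # Toy audio fingerprint: record the loudest frequency bin per frame.
    # Peak positions are unchanged by volume scaling and tolerate small
    # perturbations, unlike a hash over the raw samples.
    import numpy as np

    def spectral_peaks(samples: np.ndarray, frame: int = 4096):
        peaks = []
        for start in range(0, len(samples) - frame, frame):
            window = samples[start:start + frame] * np.hanning(frame)
            spectrum = np.abs(np.fft.rfft(window))
            peaks.append(int(np.argmax(spectrum)))  # dominant bin in this frame
        return peaks

    def similarity(a, b):
        # Fraction of frames whose dominant bin matches between two clips.
        matches = sum(1 for p, q in zip(a, b) if p == q)
        return matches / max(len(a), len(b), 1)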
From the article:
> The most recently dismissed claims were fairly important, with one pertaining to infringement under the Digital Millennium Copyright Act (DMCA), section 1202(b), which basically says you shouldn't remove without permission crucial "copyright management" information, such as in this context who wrote the code and the terms of use, as licenses tend to dictate.
> It was argued in the class-action suit that Copilot was stripping that info out when offering code snippets from people's projects, which in their view would break 1202(b).
> The judge disagreed, however, on the grounds that the code suggested by Copilot was not identical enough to the developers' own copyright-protected work, and thus section 1202(b) did not apply. Indeed, last year GitHub was said to have tuned its programming assistant to generate slight variations of ingested training code to prevent its output from being accused of being an exact copy of licensed software.
So (not a lawyer!) this reads like the point about GitHub tuning their model is not a generic defense against any and all claims of copyright infringement, but a response to a specific claim that this violates a provision of the DMCA.
I don't know whether this is a reasonable defense or not, but your intuitions or mine about whether there is a general copyright violation or what's fair are not necessarily relevant to how the judge construes that very specific bit of legal code.
What I got from this is: you can copy someone's copyrighted work provided you tweak a few things here and there. I wonder how that would hold up in court if you didn't have billions at your disposal.
Weird Al should be in the clear then, he changes probably 85% of all the song lyrics in his covers.
Weird Al explicitly seeks out permission from copyright holders and won't do a cover if he doesn't get their go-ahead [1].
Pretty much the exact opposite of all these AI companies :p
I'm implying that he doesn't seem to have to.
Just to set the stage and not entirely specific to this complaint... It really depends on what is and isn't subject to copyright for software.
Broadly, there is the distinction between expressive and functional code. [1]
And then there are the specific tests that have been developed by the courts to separate the expressive and functional aspects of software. [2] [3]
In practice it is very expensive for a plaintiff to do such analysis. For the most part the damages related to copyright are not worth the time and money. Plaintiffs tend to go for trade secret related damages as they are not restricted by the above tests.
There are also arguments to be made of de minimis infringements that are not worth the time of the court.
Most importantly the plaintiff fundamentally has the burden of proof and cannot just say that copying must have taken place. They need concrete evidence.
[1] https://en.wikipedia.org/wiki/Idea–expression_distinction
[2] https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...
[3] https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...
The guy who owns the machine is really rich, while you are more or less (all due respect of course) not worth suing.
That’s why I think the opposite of what you claim is true: if you were to do this, absolutely nothing would happen. When they do it, they will get sued over and over until the law changes and they can’t be sued, or they enter some mutually-beneficial relationship with the parties who keep suing.
> if you were to do this, absolutely nothing would happen
Read up on the DMCA and the impact it has on e.g. nintendo emulators and the developers thereof
Those emulators are very popular, though, to the point of potentially impacting another business's bottom line, whereas an individual putting out a small block of code isn't exactly going to attract expensive lawyers.
I'm skeptical Github Copilot reproducing a couple functions potentially used by some random Github project is going to be a threat to another party's livelihood.
When AI gets good enough to make full duplicates of apps I'd be more concerned about the source. Thousands of smaller pieces drawn from a million sources and being combined in novel ways is less worrying though.
There is no impact to a company's bottom line when you are emulating a product they do not sell.
Yuzu, the emulator that was sued by Nintendo, was emulating the Nintendo Switch, which is a product Nintendo does sell.
Yuzu is not the only emulator taken down by Nintendo and Nintendo is not the only company that has gone after emulators.
In that case, could you clarify what instances of this you're referring to?
The death of Citra wasn't really a deliberate action on the part of Nintendo, it was collateral damage. Citra was started by Yuzu developers and as part of the settlement they were not able to continue working on it. Citra's development had long been for the most part taken over by different developers, but the Yuzu people were still hosting the online infrastructure and had ownership of the GitHub repository, so they took all of it down. Some of the people who were maintaining Citra before the lawsuit opened up a new repository, but development has slowed down considerably because the taking down of the original repository has caused an unfortunate splintering of the community into many different forks.
There is some speculation Nintendo was involved with the death of the Nintendo 64 emulator UltraHLE a long time back, but this was never confirmed. If indeed they did go after UltraHLE, then this would just like Yuzu be a case of them taking down an emulator for a console they were still profiting from, as UltraHLE was released in 1999.
The most famous example of companies going after emulators is Sony, which went after Connectix Virtual Game Station and Bleem!. Both were PS1 emulators released in 1999, a period during which Sony was still very much profiting from PS1 sales. Sony lost both lawsuits and hasn't gone after emulators since.
In 2017, Atlus tried to take down the Patreon page for RPCS3, a PS3 emulator. However, Atlus only went after the Patreon page, not the emulator itself, which they did because of their use of Persona 5 screenshots on said page. The screenshots were simply taken down and the Patreon page was otherwise left alone. Of note is that Atlus is a game developer, so they were never profiting from PS3 sales. However, they were certainly still profiting from Persona 5 sales, which had only released in 2016.
These are the only examples I can remember. Did I miss anything?
Emulators for many Nintendo consoles have been developed and released while the console was still being sold, and they have been left alone as long as they had no direct links to piracy. Recent events are a bit of a change.
> There is some speculation Nintendo was involved with the death of the Nintendo 64 emulator UltraHLE a long time back, but this was never confirmed.
IIRC it got a C&D but a case was never filed in court; the source code turned up eventually anyway.
There's also the bnetd server emulator, which let Diablo and StarCraft players play online without going through Blizzard's Battle.net, though that's a bit different.
Yes there is. If I can emulate Super Mario Odyssey on my PC, I don't need to buy a Nintendo Switch. If it wasn't available there, I'd have to buy a Nintendo Switch to play it. That's a lost sale for Nintendo. You could argue that I wasn't going to buy a switch anyway, but then we're getting too into hypotheticals.
This is the same reasoning the music and movie industries use when they go after people downloading music. And contrary to popular opinion, I think it is wrong: if people want to pay, they will pay. Same for movies: if people really wanted to pay for a movie, they would go to a cinema, or stream it after a week or two. But there are also people who would rather jump through hoops than pay for music or movies. And that is not a lost sale, because there was never an intention to buy in the first place.
Music isn't video games or movies, and it is experienced differently, so while there are similarities, it's not the same thing.
Locks keep people honest. Unfortunately, software lockpicks are as easily distributable as the software itself.
I enjoy how you removed the “I think” qualifier which suggested that it’s very possible that you’re right.
I’m quite well read on the DMCA but admit you probably know far more about how Nintendo wields it.
Still, I suggest that it’s a lot more likely that GitHub is going to get sued than you or GP.
Finally, I believe using the legal system to bully independent software developers is, in legal terms, super lame. We are probably on the same side here.
The DMCA (at least the takedown-request part) is not really about suing someone and not really about making money. It's about getting certain works off the internet.
You are probably more likely to be on the wrong end of a DMCA takedown request as a poor person, since you don't have the resources to fight it, and it's not about recovering damages, just censorship.
We are really losing the plot of what this thread is about here, but: DMCA takedown requests that are ignored, or where the site does not comply with the process, are subject to private civil action. Obviously, a takedown request is distinct from suing someone. And the way that the rights holder forces the site to remove the content is under threat of monetary penalties.
> How is it any different when a machine does the same thing?
I think the argument is that the machine is not doing that, or at least there isn't evidence that it is doing that.
Specifically, no evidence that GitHub is doing both 1 and 2 at the same time. There might be cases where it makes trivial changes to code (point 2), but for code that does not meet the threshold of originality. Similarly, there might be cases with copyrighted code where the idea of it is taken but expressed in such a different way that it is not a straightforward derivative of the expression (keeping in mind you cannot copyright an idea, only its expression; using a similar approach or algorithm is not copyright infringement).
And finally, someone has to demonstrate it is actually happening, not just that it could happen in theory. Generally courts don't punish people for future crimes they haven't committed yet (sometimes you can get in trouble for being reckless even if nothing bad happens, but I don't think that applies to copyright infringement).
No clue.
But what if the generative AI were used to create music instead of code? Would the court have ruled differently?
CONSIDER:
In 2015, a federal court ordered Thicke & Pharrell to pay 50% of proceeds to the Marvin Gaye estate because "Blurred Lines" was found "too similar" to the song "Got to Give It Up".
Comparison and commentary: https://youtu.be/7_UiQueteN4?si=SkClbyBMOcucigRm
Comparison of both songs: https://youtu.be/ziz9HW2ZmmY?si=3_VZzfoLT-NrozoK
Regardless of the details here, it's become quite clear that the judicial system is for corporations. It doesn't matter whether they win, lose, or settle, as they win regardless, since the monetary benefits of what got them in court in the first place far outweigh any punishment or settlement cost.
You probably do this all the time. Forget memorizing but undoubtedly you've read code, learned from it, and then likely reproduced similar code. Probably nothing terribly important, just a function here or there. Maybe even reproduced something you did for a previous employer.
arr.sort((a, b) => a - b);
comes to mind. I bet most js devs have written this verbatim.
The machine alone doesn't do anything. The user and machine together constitute a larger system, and with autocomplete, the user is in charge. What's the user's intent?
I suspect that a lot of copyright violations are enabled by cut-and-paste and screenshot-taking functionality, and maybe we need to be careful with autocomplete, too? It's the user's responsibility to avoid this. We should be careful using our tools. Do users take enough care in this case? Is it possible to take enough care while still using CoPilot?
I've switched from CoPilot to Cody, but I use them the same way, to write my code. There's no particular reason to use CoPilot's output verbatim and lots of good reasons not to. By the time I've adapted it to my code base and code style and refactored it to hell and back, it's an expression of how I want to solve a problem, and I'm pretty confident claiming ownership.
Is that confidence misplaced? Are other people more careless?
> The machine alone doesn't do anything.
By the same token, the machine alone can't download pirated movies. Yet the sites hosting those movies are targeted as the infringers.
There's a point at which foisting this responsibility on the users is simply socializing losses. Ultimately Copilot is the one serving the code up - regardless of the user's request. If the user then goes on to republish that work as their own it becomes two mistakes. It'll be interesting to see if any lawyers are capable of articulating that well enough in any of these lawsuits.
> Is that confidence misplaced? Are other people more careless?
I would say yes, for two reasons. One is that using code of unknown provenance means you're opening yourself to unknown legal risks. The second is that if you're rewriting it fully (so as not to run afoul of easily spotted copyright), that's not actually "clean room" and you're still open to problems. I'd also wonder what the point of using a code-writing LLM is anyway if you're doing all the authorship yourself. It seems like doing double the work.
It is a lot of work to do a lot of rewrites, but it’s noncommercial and I’m not in a hurry. And autocomplete is still pretty useful.
>I assume that I would get my ass kicked legally speaking.
Why? This is no different than copy pasting and modifying a bit of code from some documentation/other project/tutorial/SO. Surely if that were a basis for copyright infringement most semi-large software projects would be infringing on copyright.
I don't think anyone here should be willing to open the can of worms that is copy-pasting small snippets of code and modifying them.
The judge seems to argue that the non-identical copies are at issue here and that they only happen under contrived circumstances. My moral opinion is that this is irrelevant and that even the defendant is the wrong party. Even verbatim copies of code snippets shouldn't be copyright infringement, and suing the company providing the AI is wrong to begin with, as the AI or its provider cannot possibly be the one to infringe.
I don't think it works that way. Over the course of your professional career as a developer you change jobs. And let's say that at every job you create APIs. Besides the particular functions those APIs provide, the API code itself (how you interact with clients, databases, etc.) will be pretty much the same as whatever you did at previous jobs. Does this constitute copyright infringement, or is it just experience?
My analogy is that as long as Copilot doesn't reproduce code 100% from another repository, it is OK for it to be trained on code available on GitHub and used by other people.
It would. And this is where some legislation "in the spirit of" would have helped, so Microsoft's huge legal arm can't just wiggle its way out on technicalities. Clearly, the law is not prepared to face the challenge of copyright violations on the scale created by the LLMs.
I also think it's not just copyright. It's simply not right to create a product on top of the collective work of all open source developers, monetize it on the absurd scale Microsoft operates at, and never ever credit the original creators.
Why stop there? Extrapolate that thought, keep generating more variants of the code, claim copyright, and seek rent from other people doing the same thing. To extrapolate full circle, there would be a business opportunity to generate as many variants as possible for the original author, to prevent all this from happening.
As long as we're not required to register copyright there's no reason to think the above will play out. International copyright agreements are not limited to verbatim copies only.
> Why stop there? Extrapolate that thought, keep generating more variants of the code, claim copyright, and seek rent from other people doing the same thing. To extrapolate full circle, there would be a business opportunity to generate as many variants as possible for the original author, to prevent all this from happening.
This has already been done[1] in music, though in their case they released them to the public domain. Admittedly I think that was more of a protest than anything.
[1]: https://www.vice.com/en/article/wxepzw/musicians-algorithmic...
You are taking the plaintiff's statement as is, which is wrong. You can blame the media, which didn't make it clear that it was a statement from the plaintiff.
> I assume that I would get my ass kicked legally speaking. That reads to me exactly like deliberate copyright infringement with willful obfuscation of my infringement.
It looks like wilful obfuscation because the obfuscation is so simplistic. But as the obfuscation gets increasingly sophisticated, it becomes ever harder to distinguish wilful obfuscation from genuine originality.
> But sufficiently complex obfuscation of infringement is very hard to distinguish from genuine originality.
for the purposes of copyright, originality is not required, just different expressions. It's ideas (aka, patent) that require originality.
The 'sufficiently complex obfuscation' is exactly what people's brains go through when they learn, and then reproduce what they learned in a different context.
I argue that AI-training can be considered to be doing the same.
Some different scenarios:
(1) You leave your employer, don’t take any code with you, start your own company, reimplement your ex-employer’s product from scratch, but you do it in a very different way (different language, different design choices, different tech stack, different architecture)
(2) You leave your employer, take their code with you, start your own company, make some superficial changes to their code to obscure your theft but the copying is obvious to anyone who scratches the surface
(3) You leave your employer, take their code with you, start your own company, start very heavily manually refactoring their code, within a few months it looks completely different, very difficult to distinguish from (1) unless you have evidence of the process of its creation
(4) You leave your employer, take their code with you, start your own company, download some “infringement obfuscation AI agent” from the Internet and give it your employer’s codebase, within a few hours it has transformed it into something difficult to distinguish from (1) if you didn’t know the history
(1) is unlikely to be held to be infringing. (2) is rather obviously going to be held to be infringing. But what about (3)? IANAL, but I suspect if you admitted that is how you did it, a judge would be unlikely to be very sympathetic. Your best hope would be to insist you actually did (1) instead. And then the outcome of the case might come down to whether the judge/jury believes your claim you actually did (1), or the plaintiff/prosecution’s claim you did (3).
And (4) is basically just (3) with AI to make it a lot faster. Such an agent likely doesn't exist yet, but it could happen.
Timing is obviously a factor. If you leave your employer and launch a clone of their app the next week, everyone is going to think either you stole their code, or you were moonlighting on writing it (in which case they may legally own it anyway). If it takes you 12 months, it becomes more believable you wrote it from scratch. But if someone uses AI to launder code theft, maybe they can build the “clone” in a few days or weeks, and then spend a few months relaxing and recharging before going public with it
Numbers 2, 3, & 4 are all illegal because they start with an illegal action.
If I find a dollar on the sidewalk and put it in my wallet, is that stealing? If I punch a man getting change at a hotdog stand and a dollar falls on the sidewalk and then I put that in my wallet, is that stealing?
It doesn't matter what the scenario is after you stole code from your former employer, all actions are poisoned after.
Although the question is - obviously the ex-employee is likely to be found guilty of copyright infringement (civilly or criminally or both). But what is the copyright status of the resulting work? Does its infringing origins condemn it to always be infringing? Or at some point if it is refactored/rewritten enough it ceases to so be?
Imagine the ex-employee open sources it, and I’m an innocent third party using that code base, ignorant of its unlawful origins. Am I infringing their ex-employers copyright (even if unintentionally)? For (2), obviously “yes”. But what about (3) or (4)?
I agree. I don’t see the difference.
That’s the entire reason “clean room reverse engineering” is done.
Using nothing but the binary itself, work out how things are done. Making sure that the reverse engineers don’t even have access to any material that could look like it came from the other organization in question. And that it is provable.
How is it any different? You have no money, and Microsoft has plenty. The problem with this is that it will give huge leverage to rich companies over poor ones, because the rich can steal (memorize with AI) anything, including music.
It seems the total disregard that the tech community showed toward copyright when it was artists losing out has come back to bite. Face-eating leopards, etc.
The actual answer here, regardless of a court ruling, is that you'd go broke if anyone big enough tried to go after you for it.
Legal protections for source code are still pretty fuzzy, understandably so given how comparatively new the industry is. That doesn't stop lawyers from racking up huge fees though, it actually helps because they need so much more prep time to debate a case that is so unclear and/or lacking precedent.
> How is it any different when a machine does the same thing?
Because intent matters in the law. If you intended to reproduce copyrighted code verbatim but tried to hide your activity with a few tweaks, that's a very different thing from using a tool which occasionally reproduces copyrighted code by accident but clearly was not designed for that purpose, and much more often than not outputs transformative works.
> clearly was not designed for that purpose,
I'm not aware of evidence that support that claim. If I ask ChatGPT "Give me a recipe for squirrel lemon stew" and it so happens that one person did write a recipe for that exact thing on the Internet, then I would expect that the most accurate, truthful response would be that exact recipe. Anything else would essentially be hallucination.
Recipes are not copyrightable for that exact reason.
Substitute recipe for literally any other piece of unique information.
Copyright doesn't apply to unique pieces of information. Copyright applies to unique expressions. You can't copyright a fact.
I think you are misconceiving, then, how LLMs work / what they are.
You can certainly try to hit a nail with a screw driver, but that doesn't make the screw driver a hammer.
As I understand it, LLMs are intended to answer questions as "truthfully" as they can. Their understanding of truth comes from the corpus they are trained on. If you ask a question where the corpus happens to have something very close to that question and its answer, I would expect the LLM to burp up that answer. Anything less would be hallucination.
Of course, if I ask a question that isn't as well served by the corpus, it has to do its best to interpolate an answer from what it knows.
But ultimately its job is to extract information from a corpus and serve it up with as much semantic fidelity to the original corpus as possible. If I ask how many moons Earth has, it should say "one". If I ask it what the third line of Poe's "The Raven" is, it should say "While I nodded, nearly napping, suddenly there came a tapping,". Anything else is wrong.
If you ask it a specific enough question where only a tiny corner of its corpus is relevant, I would expect it to end up either reproducing the possibly copyright piece of that corpus or, perhaps worse, cough up some bullshit because it's trying to avoid overfitting.
(I'm ignoring for the moment LLM use cases like image synthesis where you want it to hallucinate to be "creative".)
I get that's what you and a lot of people want it to be, but it isn't what they are. They are quite literally probabilistic text generation engines. Let's emphasize that: the output is produced randomly by sampling from distributions, or in simple terms, like rolling dice. In a concrete sense it is non-deterministic. Even if an exact answer is in the corpus, its output is not going to be that answer, but the most probable answer given all the text in the corpus. If that one answer that exactly matches contradicts the weight of other, less exact answers, you won't see it.
And you probably wouldn't want to: if I ask whether donuts are radioactive and one person explicitly said that on the internet, you probably aren't going to tell me you want it to spit out that answer just because it exactly matches what you asked. You want it to learn from the overwhelming corpus of related knowledge that says donuts are food, people routinely eat them, etc., and tell you they aren't radioactive.
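Concretely, the decoding step looks roughly like this (a minimal sketch, not any particular model's internals; logits and temperature are toy values):

// Rough sketch of next-token selection: softmax over logits, then a
// weighted random draw.
function sampleToken(logits, temperature = 1.0) {
  const scaled = logits.map(l => l / temperature);
  const max = Math.max(...scaled);               // subtract max for numerical stability
  const exps = scaled.map(l => Math.exp(l - max));
  const total = exps.reduce((a, b) => a + b, 0);
  let r = Math.random() * total;
  for (let i = 0; i < exps.length; i++) {
    r -= exps[i];
    if (r <= 0) return i;                        // token i drawn with prob exps[i]/total
  }
  return exps.length - 1;
}

// Even when one continuation is by far the most likely, any token with
// nonzero probability can still come out of the draw.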
They are all hallucinations. Calling lies hallucinations and truths normal output is nonsense.
Perfect analogy.
Not in copyright. The work speaks for itself, and the function of code is not a copyrightable aspect.
The intent of the work can matter when determining if de minimis applies as well as fair use.
Part of my point is that fair use doesn't apply.
Training a model doesn't involve reproducing a copyrighted work, preparing a derivative work, distributing that work, or performing that work.
Fair use isn't required because none of the exclusive rights afforded by copyright apply.
It's equally plausible to say you don't intend to reproduce copyrighted code verbatim but occasionally do so given either a sufficiently specific prompt or because the reproduced code is so generic that it probably gets rewritten a hundred times a day because that's how people learned to do basic things from books or documentation or their education.
Um, the entire intent of these "AI" systems is explicitly to reproduce copyrighted work with mechanical changes to make it not appear to be a verbatim copy.
That is the whole purpose and mechanism by which they operate.
Also, intent does not matter under law: not intending to break the law is not a defense if you break the law. Not intending to take someone's property doesn't mean it becomes your property. You might get lesser penalties and/or charges due to intent (the obvious example being murder vs manslaughter, etc.).
But here we have an entire ecosystem where the model is "scan copyrighted material" followed by "regurgitate that material with mechanical changes to fit the surrounding context and to appear to be 'new' content".
Moreover, given that this 'new' code is just a regurgitation of existing code, with mutations to make it appear to fit the context and not be directly identical to the existing code, that 'new' code cannot itself be subject to copyright: you can't claim copyright to something you did not create, copyright does not protect the output of mechanical or automatic transformations of other copyrighted content, and copyright does not protect the result of "natural processes" (e.g. 'I asked a statistical model to give me a statistically plausible sequence of tokens and it did'). So in the best-case scenario, the one where the copyright-laundering-as-a-service tool is not treated as just that, any code it produces is not protectable by copyright, and anyone can just copy "your work" without the license. And (since you've said it's OK if you weren't intending to violate copyright) they can say they could not distinguish the non-copyright-protected work from the protected work and assumed that therefore none of it was subject to copyright. To be super sure they weren't violating any of your copyrights, they then ran an "AI tool" to make the names better suit your style.
I am so sick of these arguments where people spout nonsense about "AI" systems magically "understanding" or "knowing" anything. They are very expensive statistical models: they produce statistically plausible strings of text, by a combination of copying the text of others wholesale and filling the remaining space with bullshit that for basic tasks is often correct enough, and for anything else is wrong, because again they're just producing plausible sequences of tokens and have no understanding of anything beyond that.
To be very, very clear: if an AI system "understood" anything it was doing, it would not need to ingest essentially all the text that anyone has ever written just to produce content that is at best only locally coherent, and that is frequently incorrect in more or less every domain to which it is applied. Take code completion (as in this case): developers can write code without first reading essentially all the code that has ever existed, because developers understand code. Developers don't intermingle random, unrelated, non-existent variables or functions as they write, because they understand what variables are and therefore can't use ones that don't exist. "AI", on the other hand, required more power than many countries use to "learn" by reading as much as possible of all code ever written, and it still produces nonsense output for anything complex, because it is just generating a string of tokens that is plausible according to its statistical model. The result of these AIs is essentially binary: either it has been asked to produce code that does something that was in its training corpus and can be copied essentially verbatim, with a transformation pass to make it fit, or it's not in the training corpus and you get random and generally incorrect code; hopefully code wrong enough that it fails to build, because they're also good at generating code that looks plausible but only fails at runtime, since 'plausible sequence of tokens' often overlaps with 'things a compiler will accept'.
I actually once tracked this claim down in the case of stable diffusion.
I concluded that it was just completely impossible for a properly trained stable diffusion model to reproduce the works it was trained on.
The SD model easily fits on a typical USB stick, and comfortably in the memory of a modern consumer GPU.
The training corpus for SD is a pretty large chunk of image data on the internet. That absolutely does not fit in GPU memory - by several orders of magnitude.
No form of compression known to man would be able to get it that small. People smarter than me say it's mathematically not even possible.
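To put rough numbers on the gap (toy figures; both quantities are assumed orders of magnitude, not measurements):

// Back-of-envelope only: bytes of model weights available per training image.
const modelBytes  = 4e9;  // ~4 GB checkpoint, roughly original-SD scale
const trainImages = 2e9;  // LAION-scale image corpus, order of magnitude
console.log(modelBytes / trainImages); // 2 -- about two bytes per image

// Even an aggressively compressed thumbnail needs thousands of bytes, so
// the weights cannot be a per-image archive; at most, images that recur
// many times in the corpus can be memorized (overfitting).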
Now for closed models, you might be able to argue something else is going on and they're sneakily not training neural nets or something. But the open models we can inspect? Definitely not.
Modern ML/AI models are doing Something Else. We can argue what that Something Else is, but it's not (normally) holding copies of all the things used to train them.
I think this argument starts to break down for the (gigantic) GPTs where the model size is a lot closer to the size of the training corpus.
Thinking in terms of compression, the compression in generative AI models is lossy. The mathematical bounds on compression only apply to lossless compression. Keeping in mind that a small fraction of the training corpus is presented to the training algorithm multiple times, it's not absurd to suggest that these works exist inside the algorithm in a recallable form. Hence the NYT's lawyers being able to write prompts that recall large chunks of NYT articles verbatim.
Well, certainly up to GPT-3 that would seem a little odd. Models of somewhat similar capability are not THAT big, really. Eg:
$ ollama list
NAME ID SIZE MODIFIED
yi:34b ff94bc7c1b7a 19 GB 7 days ago
mistral:latest 61e88e884507 4.1 GB 2 months ago
mixtral:8x22b bf88270436ed 79 GB 2 months ago
llama3:70b be39eb53a197 39 GB 2 months ago
phi3:latest a2c89ceaed85 2.3 GB 2 months ago
dolphin-mistral:latest 5dc8c5a2be65 4.1 GB 2 months ago
yarn-mistral:7b-128k 6511b83c33d5 4.1 GB 2 months ago
yarn-mistral:latest 8e9c368a0ae4 4.1 GB 2 months ago
llama3:latest a6990ed6be41 4.7 GB 2 months ago
For comparison, here are some stable diffusion checkpoints:
ComfyUI/models/checkpoints $ du -h .
6.5G breakdomainxl_v03d.safetensors
6.5G dreamshaperXL10_alpha2Xl10.safetensors
6.5G sd_xl_base_1.0.safetensors
5.7G sd_xl_refiner_1.0.safetensors
...
And I seem to recall there are some theoretical lower bounds on even lossy compression. Some quick back-of-the-envelope Fermi estimation gets me a hard lower bound of 5TB for "all the images on the internet"; but I'm not quite confident enough in my math to back that up right here and now.
> And I seem to recall there are some theoretical lower bounds on even lossy compression.
I'm not sure where your math is coming from, and it seems trivially wrong. A single black pixel is a very lossy compression of every image on the internet. A picture of the Facebook logo is a slightly-less-lossy compression of every picture on the internet (the Facebook logo shows up on a lot of websites). I would believe that you can get a bound on lossy compression of a given quality (whatever quality means) only if you assume that there is some balance of the images in the compressed representation. There are a lot of assumptions there, and we know for a fact that the text fed to the GPTs to train them was presented in an unbalanced way.
In fact, if you look at the paper "textbooks are all you need" (https://arxiv.org/pdf/2306.11644) you can see that presenting a very limited set of information to an LLM gets a decent result. The remaining 6 trillion tokens in the training set are sort of icing on the cake.
Ok, that's a really low lower bound.
I think you'll agree that it would be a bit absurd to threaten legal action against someone for storing a single black pixel.
OTOH Someone might be tempted to start a lawsuit if they believe their image is somehow actually stored in a particular data file.
For this to be a viable class action lawsuit to pursue, I think you'd have to subscribe to the belief that it's a form of compression where if you store n images, you're also able to get n images back. Else very few people would have actual standing to sue.
I think that when you speak in terms of images, for a viable lawsuit, you need to have a form of compression that can recall n (n >= 1) images from compressing m (m >= n) images. Presumably n is very large for LLMs or image models, even though m is orders of magnitude larger. I do not think that your form of compression needs to be able to get all m images back. By forcing m = n in your argument, you are forcing some idea of uniformity of treatment in the compression, which we know is not the case.
The black pixel won't get you sued, but the Facebook logo example I used could get you sued. Specifically by Facebook. There is an image (n = 1) that is substantially similar to the output of your compression algorithm.
That is sort of what Getty's lawsuit alleges. Not that every picture is recallable from an LLM, but that several images that are substantially similar to Getty's images are recallable. The same goes with the NYT's lawsuit and OpenAI.
Thank you for talking with me!
I do realize the benefits of the 'compression' model of ML. Sometimes you can even use compression directly, like here: https://arxiv.org/abs/cs/0312044 .
I suppose you're right that you only need a few substantively similar outputs to potentially get sued already. (depending on who's scrutinizing you).
While talking with you, it occurred to me that so far we've ignored the output set o, which is the set of all images output by -say- stable diffusion. n can then be defined as n = m ∩ o .
And we know m is much larger than n, and o is theoretically practically infinite [1] (you can generate as many unique images as you like) , so o >> m >> n . [2]
Really already at this point I think calling SD a compression algorithm might be just a little odd. It doesn't look like the goal is compression at all. Especially when the authors seem to treat n like a bug ('overfit'), and keep trying to shrink it.
That's before looking back at the "compression ratio" and "loss ratio" of this algorithm, so maybe in future I can save myself some maths. It's an interesting approach to the argument I might try more in future. (Thank you for helping me to think in this direction)
* I think in the case of the Getty lawsuit they might have a bit of a point, if the model might have been overfitted on some of their images. Though I wonder if in some cases the model merely added Getty watermarks to novel images. I'm pretty sure that will have had something to do with setting Getty off.
* I am deeply suspicious of the NYT case. There's a large chunk of examples where they used ChatGPT to browse their own website. This makes me wonder if the rest of the examples are only slightly more subtle. IIRC I couldn't replicate them trivially. (YMMV, we can revisit if you're really interested)
[1] However, in practice there appear to be limits to floating point precision.
[2] I'm using >> as "much greater than"
> Also the intent does not matter under law - not intending to break the law is not a defense if you break the law
Intent frequently matters a great deal when applying laws.
In the specific area of copyright law, it doesn't itself make the use non infringing, but it can absolutely impact the damages or a fair use argument.
Great point, I wonder how the court will look at OpenAI's internal discussions around utilizing copyrighted materials.
If you tell a programmer to implement a function foo(a, b) then there are actually only a tiny number of ways to do that, semantically speaking, for any given foo. The number of options narrows quickly as the programmer implementing it gets more competent.
Choosing function signatures is an art form but after that "copying" is hard to judge.
> a function foo(a, b) then there are actually only a tiny number of ways to do that
I'd argue there are infinite ways to implement any function, just almost all of them are extremely bad.
You would not get your ass kicked legally speaking. Copyright is not that broad. It's not a patent.
> How is it any different when a machine does the same thing?
Literally the bank account behind the action...
It depends on how much tax you are paying, really. If you pay billions in taxes annually, they might see past it. If the company you copied from pays billions in taxes annually, you will go to jail. If this isn't painfully obvious by now...
Taxes don't seem to be the deciding factor, seeing how little the government cares about tax dodging by corporations and the rich.
Adding to the sibling comments:
First: every human is per se doing that already. We have – to handwave – a "reasonable person" bar to separate violations versus results of learning and new innovation.
Second: You can be a holder of copyright and your creations result in copyrightable artifacts. Anything generated by the program has been held as uncopyrightable.
Who gets to claim copyright on the various array sorting algorithms, then?
Days like this, I wonder what Borges would have made of such questions.
"Pierre Menard, author of redis"
I know from experience that parents are aggressively pushing their children into STEM to maximize their chances of being economically secure, but, I really feel that we need a generation of philosophers and humanists to sift through the issues that our technology is raising. What does it mean to know something? What does authorship mean? Is a translated work the same as the original? Borges, Steiner, and the rest have as much to contribute as Ellison, Zuckerberg, and Altman.
Rules for thee but not for me (rich companies). Think of the shareholders!
> I assume that I would get my ass kicked legally speaking.
Maybe, maybe not. It's not as simple as you made it out to be. If you write a book with lots of stuff and you got inspiration from other books, and even put in phrases wholesale, but modified to use your own character names instead, I'm not convinced you would lose.
The court would look at the work as a whole, not single pieces of it.
They would also check if you are just copying things verbatim, or if you memorize a pattern and emit the same pattern - for example look at lawsuits about copying music, where they'll claim this part of the music is the same as that part.
It's really not as cut and dry as you make it out to be.
> The anonymous programmers have repeatedly insisted Copilot could, and would, generate code identical to what they had written themselves, which is a key pillar of their lawsuit since there is an identicality requirement for their DMCA claim. However, Judge Tigar earlier ruled the plaintiffs hadn't actually demonstrated instances of this happening, which prompted a dismissal of the claim with a chance to amend it.
It sounds fair from how the article describes it
Huh. There have definitely been well-publicized examples of this happening, like the Quake inverse square root.
You can't copyright a mathematical operation. Only a particular implementation of it, and even then it may not be copyrightable if its a straightforward and obvious implementation.
That said, the implementation doesn't appear to be totally trivial, and Copilot apparently even copies the comments, which are almost certainly copyrightable in themselves.
https://x.com/StefanKarpinski/status/1410971061181681674 https://github.com/id-Software/Quake-III-Arena/blob/dbe4ddb1...
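(For context, the snippet in question is the bit-level hack below; this is a rough JS re-creation of the C original in the linked repo, shown only to illustrate how distinctive and non-obvious the expression is:)

// Rough JS re-creation of the C original in the linked Quake III repo.
// A shared buffer stands in for C's pointer punning between float and int.
const buf = new ArrayBuffer(4);
const f32 = new Float32Array(buf);
const u32 = new Uint32Array(buf);

function fastInvSqrt(x) {
  f32[0] = x;
  u32[0] = 0x5f3759df - (u32[0] >>> 1); // the famous magic constant
  const y = f32[0];
  return y * (1.5 - 0.5 * x * y * y);   // one Newton-Raphson step
}

console.log(fastInvSqrt(4)); // ~0.4992, vs the exact 0.5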
However, a twitter post on its own isn't evidence a court will accept. You would need the original poster to testify that what is seen in the post is actually what he got from Copilot and not just a meme or joke that he made.
Also, the plaintiffs in this case don't include id Software, and there is some evidence that id Software actually stole the fast inverse sqrt code from 3dfx, so they might not want to bring a claim here anyway.
Not sure where you thought I said you could copyright a mathematical operation, I was clearly referring to the implementation due to the mention of “quake”.
When it was reported, I was able to reproduce it myself.
Weren't people getting it to spit out valid windows keys also?
GPT4 regurgitated almost full NYT articles verbatim. It's strange that this lawsuit seems to be so amateurish that they failed to properly demonstrate the reproduction. Though of course it might require a lot of legal technicalities that we naively think are trivial but might not be.
I read that case.
Absolutely there were a few outliers where a judge might want to look more closely. I'd be surprised if -under scrutiny- there wouldn't be any issues whatsoever that OpenAI overlooked.
However, it seemed to me that over half of the NYT complaints were examples of using the -then rather new- ChatGPT web browsing feature to browse their own website. In the case, they then claimed surprise when it did just what you'd expect a web browsing feature to do.
> You can't copyright a mathematical operation.
I agree from a philosophical POV, but this is clearly not the case in law.
The second step is to remove from consideration aspects of the program which are not legally protectable by copyright. The analysis is done at each level of abstraction identified in the previous step. The court identifies three factors to consider during this step: elements dictated by efficiency, elements dictated by external factors, and elements taken from the public domain.
https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...
It's even simpler: id is owned by ZeniMax, and ZeniMax is owned by Microsoft... who would they even sue?
That's not how that works.
All the plaintiffs would need to do is provide evidence that copyrighted code was produced verbatim. This includes showing the copyrighted code on GitHub, showing Copilot reproducing the code (including how you manipulated Copilot to do it), showing that they match, and showing that the setting to turn off reproduction of public code is enabled.
It makes no difference who owns the copyrighted code, it need only be shown that copilot is violating copyright. Microsoft can't say "uhh that doesn't count" or whatever simply because they own a company that owns a company that owns copyright on the code.
"Trust no one... even yourself"
Algorithms can be, and definitely are, patented in utility patents in the US.
It reads like the judge required them to show it happened to their code, not to any code in general. That's a much higher bar. There are thousands of instances of fast inverse square root in the training data, but only one copy of your random GitHub repositories. Getting the model to reproduce your code verbatim might be possible for all we know, but it isn't trivial.
>It reads like the judge required them to show it happened to their code, not to any code in general.
Rightly so, you have to show some sort of damage to sue someone, not just theoretical damages.
Of course, for standing. But it seems like, with the right plaintiffs, this could have gone forward.
But that's like saying my lawsuit alleging Taylor Swift copied my song could have gone forward with a plaintiff who had, years ago, written a song similar to what Ms. Swift recorded recently. That's true, but perhaps the lesson here is that damages that hinge on statistically rare victims should not be extrapolated out to provide windfalls for people who have not been harmed.
I think that is a weak analogy, and also unnecessary, because it is already clear what I am saying.
If it only copies code that has been widely stolen already then that's a lot weaker of a case and is something they can do a lot to prevent on a technical level.
Code that has been copied widely != code that has been widely stolen.
Open source licenses allow sharing under certain conditions.
It could be forced, of course. I can republish my copyrighted code millions of times all over the internet. Next time they retrain there is a good chance my code will end up in their corpus, maybe many many times, reinforcing it statistically.
The article mentions that GitHub Copilot has been trained to avoid directly copying specific cases it knows, and that although you can get it to spit out copyrighted code by prefixing the copyrighted code as a starting point, in normal use cases it's quite rare.
yes, but you need to show that it happened _in your case_, not that it can happen in general.
Fast inverse square root is now part of the public domain.
Also, even if this weren’t the case you can’t sue for damages to other people (they’d need to bring their own suit)
Is the particular implementation that the model spits out 70+ years old?
But copilot distributed it (allegedly) without complying with the GPL license (which requires any distribution to be accompanied by the license) so it still would be an instance of copyright infringement. https://x.com/StefanKarpinski/status/1410971061181681674
Has it really already been 70 years since John Carmack died?
Ah, you're right. I was wrong to say "public domain".
It would be more correct to say Quake III Arena was released to the public as free software under the GPLv2 license.
There is a large gap between public domain and GPL. For starters if Copilot is emitting GPL code for closed source projects... that's copyright infringement.
That would be license infringement, not copyright infringement.
Copyright infringement is emitting the code. The license gives you permission to emit the code, under certain conditions. If you don't meet the conditions, it's still copyright infringement like before.
No.
Copyright infringement could be emitting the code in a manner that exceeds fair use.
The license gives you permission to utilize the code in a certain way. If Copilot gives you GPLed code that you then put into your closed source project, you have infringed the license, not Copilot.
> If you don't meet the conditions, it's still copyright infringement like before.
Licensing and copyright are two separate things. Neither has anything to do with the other. You can be in compliance with copyright, but out of license compliance, you can be the reverse. But nothing about copyright infringement here is tied to licensing.
To be clear: I am a person who trashed his Reddit account when they said they were going to license that text for training (trashed in the sense of "ran a script that scrubbed each of my comments first with nonsense edits, then deleted them"). I am a photographer who has significant concerns with training other models on people's creative output. I have similar concerns about Copilot.
But confusing licensing and copyright here only muddies waters.
Without adhering to the conditions of the GPL you have no license to redistribute the code and are therefore infringing the copyright of the author.
Apparently, the court disagrees with you, and doesn't find "emitting" the code a copyright infringement.
It'd be a long bow to draw to say that what is akin to a search result of a snippet of code is "redistributing a software package".
Where it gets ethically dubious is that:
1. The copilot team rushed to slap a copyright filter on top to keep these verbatim examples from showing up, and now claims they never happen.
2. LLMs are prone to paraphrasing. Just because you filter out verbatim copies doesn't mean there isn't still copyright infringement/plagiarism/whatever you want to call it. The copyright filter is only a legal protection, not a practical protection against the issue of copyright infringement.
Everyone who knows how these systems work understands this. The Copilot FAQ to this day claims that you should run copyright-scanning tools on your codebase because your developers might "copy code from an online source or library".
GitHub has its own research from 2021 showing that these tools do indeed copy their training data occasionally: https://github.blog/2021-06-30-github-copilot-research-recit...
They clearly know the problem is real. Their own research agreed, their FAQs and legal documents are carefully phrased to avoid admitting it. But rather than owning up to the problem, it's "Ner ner ner ner ner, you can't prove it to a boomer judge".
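(For concreteness, a verbatim filter is roughly exact-substring matching against the training set. The sketch below is hypothetical, including the toy index and window length, and shows why it is only a legal protection, not a practical one:)

// Toy illustration of what a verbatim-output filter can and cannot do.
// A real system would use a suffix-array-style index over the corpus.
const trainingIndex = {
  corpus: "function add(a, b) { return a + b; }",
  containsSubstring(s) { return this.corpus.includes(s); },
};

function isVerbatimRecitation(output, index, windowLen = 20) {
  for (let i = 0; i + windowLen <= output.length; i++) {
    if (index.containsSubstring(output.slice(i, i + windowLen))) {
      return true; // some window appears verbatim in the training data
    }
  }
  return false;
}

console.log(isVerbatimRecitation("function add(a, b) { return a + b; }", trainingIndex)); // true
console.log(isVerbatimRecitation("function sum(x, y) { return x + y; }", trainingIndex)); // false: renamed, same code

The second call is the paraphrasing loophole: rename one variable and every window changes, so the filter passes it.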
> The copilot team rushed to slap a copyright filter on top to keep these verbatim examples from showing up, and now claims they never happen.
More than that: they claimed it wasn't possible before adding the filter, a filter to remove the thing they said wasn't possible. This doesn't help me trust anything else they might say or have already said.
My take on that was always: if it isn't possible, then why is MS not training the AIs on their internal code (like that for Office, in the case of MS with their Copilot product) as well as public code? There must be good examples for it to learn from in there, unless of course they think public code is massively better than their internal works.
How do you know they aren’t training it on their internal code?
Since you really need to work hard to make the AI spit out anything verbatim, and you have no knowledge of their internal code, how could you ever prove or deny it?
> How do you know they aren’t training it on their internal code?
Because if they were, they would have said.
It would be an excellent answer to the concerns being discussed here: “we are so sure that there is nothing to worry about in this regard, that we are using our own code as well as the stuff we've schlepped from github and other public sources”.
> Just because you filter out verbatim copies doesn't mean there isn't still copyright infringement/plagiarism/whatever you want to call it.
Actually, it does. The production of the output is what matters here.
If you copy someone else's copyrighted work and then rearrange a few lines and rename a few things, you're probably still infringing.
For a book or a song, for sure, although that isn't really punished. Search the drama surrounding a popular YA author in the 10's, Cassandra Claire. For code since you can only copy the form and not the function that might actually be enough.
People do clean room implementations because of paranoia, not because it's actually a necessary requirement.
Moving a few things around means your internal process already involved copyright infringement.
Probably not. Copyright infringement in the manner we're talking about presumes you already have license to access the code (like how Github does). What you don't have license to do is distribute the code -- entirely or not without meeting certain conditions. You're perfectly free to do whatever naughty things you want with the code, sans run it, in private.
The literal act of making modifications isn't infringement until you distribute those modifications -- and we're talking about a situation where you've changed the code enough that it isn't considered a derivative work anymore (apparently) so that's kosher.
First, the case would be dismissed if Copilot had permission to make copies. Clearly they didn't. Copyright cares about copies; for-profit distribution just makes this worse.
> you already have license to access the code
This isn’t access, that occurs before the AI is trained. It’s access > make copy for training > AI does lossy compression > request unzips that compression making a new copy > process fuzzes the copy so it’s not so obvious > derivative work sent to users.
Clearly Copilot had permission to make (unmodified) copies, the same way Github's webserver had permission to make (unmodified) copies. The lawsuit is about making partial copies without attribution.
GitHub's terms of service (TOS), in my non-lawyerly opinion, clearly states the license for uploaded works granted to them by users doesn't cover using the data to train an LLM or any kind of model beyond those used to improve the hosting service:
>You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time
>This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
https://docs.github.com/en/site-policy/github-terms/github-t...
I think the important questions are (1) whether "the Service" includes Copilot, and (2) whether GitHub is selling users' content with Copilot.
For (1), I'm unhappy to admit Copilot probably does fall under "the Service," which is nebulously defined as "applications, software, products, and services provided by GitHub." But I'll still say that users could not have agreed to this use while GitHub was training the Copilot model but hadn't yet announced it. At that time, a reasonable user would have believed GitHub's services only covered repository hosting, user accounts, and the extra features attached to those (issue trackers, organizations, etc.).
GitHub could defend themselves on point (2) by saying they aren't selling the code, instead selling a product that used the code as input. But does that differ much from selling an online service that relies on running user code? The code is input for their servers, and it doesn't need to be distributed as part of that questionable service. But it's a clear break from the TOS.
GitHub’s web server is not the same thing as Copilot and needs separate permission.
GitHub didn't just copy open source code; they copied everything without respect to license. As such, attribution, which may have allowed some copying, isn't generally relevant.
Really a public repo on GitHub doesn’t even mean the person uploading it owns the code, if they needed to verify ownership before training they couldn’t have started. Thus by necessity they must take the stance that copyright is irrelevant.
If you’ve copied three lines and rearrange and reword them, there’s little infringement left.
If you copy a whole book and do the same, there’s still lines-3 infringement left.
> 1.
Isn't that akin to destruction of evidence?
Legally? No.
In spirit? ... Probably?
Unlike most LLMs, Github copilot can trivially solve their copyright problem by just using only code they have the right to reproduce.
They have a giant corpus of code tagged with license, SELECT BY license MIT/Equivalent and you're done, problem solved because those licenses explicitly grant permission for this kind of reuse.
(It's still not very cash money to take open source work for commercial gain without paying the original authors, and there's a humorous question of whether MIT-Copilot would need to come with a multi-gigabyte attribution file, but everyone widely agrees it's legal and permitted.)
The only reason you'd hack a filter on top rather than doing the above is if you'd want to hide the copyright problem. It's an objectively worse solution.
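Mechanically, the proposed filter is about this much code (a sketch; the repos input and the spdxLicenseId field are assumptions, and the reply below explains why the license tag can't be trusted):

// Sketch of the proposed corpus filter, assuming every crawled repo
// carries a detected SPDX license id. The repo list is toy data.
const PERMISSIVE = new Set([
  "MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "Unlicense", "0BSD",
]);

const repos = [
  { name: "example/app",  spdxLicenseId: "MIT" },
  { name: "example/tool", spdxLicenseId: "GPL-3.0" },
];

function trainingEligible(repo) {
  return PERMISSIVE.has(repo.spdxLicenseId);
}

const corpus = repos.filter(trainingEligible); // keeps only example/app
// The catch: a LICENSE file only claims a license; it doesn't prove the
// uploader had the right to grant it.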
> Unlike most LLMs, Github copilot can trivially solve their copyright problem by just using only code they have the right to reproduce.
Absolutely not trivial, in fact completely impossible by computer alone. You can't determine if you have the right to reproduce a piece of code just by looking at the code and tags themselves. *Taps the color-of-your-bits sign.*
* I can fork a GPL project on Github and replace the license file with MIT. Okay to reproduce?
* If I license my project as MIT but it includes code I copied inappropriately and don't have the right to reproduce myself, can Github? (No) This one is why indemnity clauses exist on contracted works.
* I create a git repo for work and select the MIT license but I don't actually own the copyright on that code and so that license is worthless.
There is no difference when it comes to MIT and GPL here. If your model outputs my MIT licensed code, you still need to provide attribution in the form of a copyright notice as required by the MIT license.
Have the copyleft people, or anyone else, produced some boilerplate licenses that explicitly deny use in training models?
I would think it is pretty obviously not.
Is taking away a drunk driver's keys (before they get in the car) destruction of the evidence of their drunk driving?
This is not what I meant. By placing a copyright filter and claiming it never happened (please read the line I was replying to) before the system can be audited, they're indeed taking away the drunk driver's keys, which is a good thing, but they're also removing the offending car before the police arrive.
In this metaphor, removing the car of someone who was going to drink and drive but didn't, is certainly not a crime. Presumably though you mean removing the car after drunk driving actually took place - which might be, but probably depends a lot on if the person knew, and what the intent of the action was.
In the current case, it's unclear if any crime took place at all; it seems clear that the primary intent was to prevent future crime, not hide evidence of past ones. Most importantly, the past version of the app is not destroyed (presumably). GitHub still has the version of the software without the copyright filter. If relevant and appropriate, the court could order them to produce the original version. It can't be destroying evidence if the evidence was not destroyed.
Yes, sorta. We're talking about software here. A drunk driver in a car may or may not cause more accidents, and although we aren't sure, we prevent them from driving anyway just to be safe. Software, by contrast, will most certainly repeat its routine, because it has been written to do so. That's why I wondered about destruction of evidence: by removing/modifying it, or placing filters, they would prevent it from repeating the wrongdoing, but also take away any means of auditing the software to find out what happened and why.
Not in any way I'm aware of - and would be required if they were served a DMCA notification/Cease and Desist against a specific prompt.
The people who think Copilot is infringing their copyright would be happy with that, I would think? Unless they take a much stricter definition of fair use than current courts do.
No more so than scanner/printer manufacturers adding tech to prevent you from scanning and printing currency is destruction of evidence that they are in fact producing illegal machines for counterfeiting.
> The copilot team rushed to slap a copyright filter on top to keep these verbatim examples from showing up, and now claims they never happen.
Well if the copyright filter is working they indeed aren't happening. Putting in safeguards to prevent something from happening doesn't mean you're guilty of it. Putting a railing on a balcony doesn't imply the balcony with railing is unsafe.
> LLMs are prone to paraphrasing. Just because you filter out verbatim copies doesn't mean there isn't still copyright infringement/plagiarism/whatever you want to call it
Copyright infringement and plagiarism are different things. Stuff can be copyright infringement without being plagiarized, and can be plagiarized without being copyright infringement. The two concepts are similar but should not be conflated, especially in a legal context.
Courts decide based on laws, not on gut feeling about what is "fair".
> They clearly know the problem is real
They know the risk is real. That is not the same thing as saying that they actually committed copyright infringement.
A risk of something happening is not the same as actually doing the thing.
> "Ner ner ner ner ner, you can't prove it to a boomer judge".
It's always a cop-out to assume that they lost the argument because the judge didn't understand. I suspect the judge understood just fine, but the law and the evidence simply weren't on their side.
> Well if the copyright filter is working they indeed aren't happening. Putting in safeguards to prevent something from happening doesn't mean you're guilty of it. Putting a railing on a balcony doesn't imply the balcony with railing is unsafe.
Doesn't mean you weren't, at some point, guilty of it, either. It doesn't retcon things.
Sure, which is why we require evidence of wrongdoing. Otherwise it's just a witch hunt.
After all, you yourself probably cannot prove that you didn't commit the same offense at some point in the past. Like Russell's teapot, it's almost always impossible to disprove something like that.
Yeah but I think the main concern in this situation is copilot moving forward, not their past mistakes.
This is so stupid. Going after likeness is doomed to fail against constantly mutating enemies like booming tech companies with infinite resources. And likeness itself isn’t even that big of a deal, and even if you win it’s such a minor case-by-case event that puts an enormous burden of proof on the victims to even get started. If the narrative centers around likeness, they’ve already won.
The main issue, as I see it, is that they took copyrighted material and made new commercial products without compensating (let alone acquiring permission from) the rights holders, ie their suppliers. Specifically, they sneaked a fair use sticker on mass AI training, with neither precedent nor a ruling anywhere. Fair use originates in times before there were even computers. (Imo it’s as outrageous as applying a free-mushroom-picking-on-non-cultivated-land law to justify industrial scale farming on private land.) That’s what should be challenged.
This is pretty interesting, and I have conflicted feelings about the (seemingly obvious) outcome of this trial.
I wonder, if MS and OpenAI win, does that mean it will be legal for anyone to take the leaked source code for a proprietary product, train an LLM on it, and then ask the LLM to emit a version of it that is different enough to avoid copyright infringement?
That would be quite the double-edged sword for proprietary software companies.
I suspect that this is exactly what will happen; not just with code, but also prose and artwork.
Someone is likely to design an LLM that is specifically trained to do exactly that.
Lots of money to be made...
I was mainly inspired by this section:
> Specifically, the judge cited the study's observation that Copilot reportedly "rarely emits memorized code in benign situations, and most memorization occurs only when the model has been prompted with long code excerpts that are very similar to the training data."
That almost sounds like it'd be fine to train an "art transformation model" which takes an image and transforms it, which for all the frames of a specific Disney movie just so happen to output the very next frame...
That sounds like the opposite from the quote. The art transformation model you propose WOULD emit memorized art in benign situations, so in that judge's opinion it WOULD count as plagiarism.
With how modern video codecs use data from previous frames you could make a not-entirely-specious argument that we already have a tool that can do this and it's called ffmpeg.
> "art transformation model"
There is a reason a famous AI model architecture is called transformer, it is pretty much optimised to be good at transforming artistic and intellectual works.
On the matter of artwork there's no need for suspicion - it is and has been happening for a while now. There are entire online databases dedicated to providing non-consenting artist's "styles" as downloadable model parameters by name.
Style is not copyrightable so I see nothing wrong with making essentially a robot that can paint in the style of someone else.
In isolation, no. But the produced works can be too close for fair use (as demonstrated with the Prince pieces by Andy Warhol), and passing it off as a piece from the original artist can open you up to forgery/fraud charges.
To put it another way, the motivations to produce art in another artist's style can still land the artist/buyer in legal trouble regardless of fair use.
Yes that is true, but I don't think the people who use style transfer are actually passing it off as the original, they just like it for the aesthetic value of their own images. In other words, no one using the Van Gogh LoRA is actually trying to forge the Starry Night.
Given the value of an "authentic" painting of the Starry Night (or more realistically the value of something forged in, say, Samwise Didier's style) I can't agree with "no one".
I have to imagine that it's likely quite popular to sell AI generated art that mimics or copies existing works.
I guess there's always a greater fool, but forging an oil painting using AI digital images seems pretty far fetched.
You can paint over the printed image?
Not that it’d look anything like the artist you are copying, but it’s a fun idea.
Do you use AI art generators? Flaws are extremely easily found out; it is only good for a rough snapshot (without much fiddling, and even then, artifacts remain). I can guarantee you it is definitely not popular to sell existing works made with AI; you are better off hiring an actual forger. In fact, your suggestion is the first I've ever heard of such an idea.
The legality of using someone’s copyrighted work to train a model to reproduce it without their consent is still under debate - but the morality of the act at least, is not related to its legality - be it positively or negatively; and I personally consider it abhorrent.
Under what morals do you consider it "abhorrent"? I've yet to get a straight answer from those I've asked about this, as the counterarguments seem too easy to make.
It's just pure exploitation. You're using the product of someone's work to create a machine that takes away their work.
Why is doing a task with a machine suddenly objectionable when the same task performed by humans is perfectly fine?
A man with a small canoe catching a few fish with a fishing rod for his dinner is very different to a commercial fishing vessel trawling through the ocean with a massive net to catch thousands of fish at once. The two are treated differently under the law, and have different rules that apply to them due to the difference in scale.
Scale matters, and the scale that computers/these AIs operate under are absurd compared to a person doing it manually.
Why does scale matter in terms of AI? Just because a computer can do it at scale doesn't mean it should be treated similarly to your analogy. Rather than using an analogy, please tell me why it matters that computers can do something like AI at scale rather than individuals doing it.
Chiefly, scale and accountability.
The work of a person can be mitigated and a person can be held accountable for their actions.
Much of our society operates on the idea that we don’t need to codify and enforce every single good or bad thing due to these reasons; and having such an underpinning affords us greater personal freedom.
This does not actually answer the question of why it is bad (in your opinion) in the first place; it just states that bad things are mitigated. I am looking for a concrete answer to the former, not a justification of the latter. The former is what AI opponents can usually never answer; they assume prima facie that AI is bad, for whatever reason.
I answered your question plainly, but I'll try to go into detail. I have a suspicion that you don't see this as the philosophical issue that AI detractors do, and perhaps that hasn't been clearly communicated to you in the answers you've received, leading to your distaste for them or confusion at why they don't meet your criteria.
I believe that this kind of generative AI is bad because it approximates human behavior at an inhuman scale and cannot be held accountable in any way. This upends the entire social structure upon which humans have relied to keep each other in-check since the advent of the modern concept of "justice" beginning with the Code of Hammurabi.
In essence: Because you cannot punish, rehabilitate or extract recompense from a machine, it should not be allowed in any way to approximate a member of society.
This logic does not apply to machines that "automate" labor, because those machines do not approximate human communication - they do not pretend to be us.
Your argument can be applied to the printing press or the automatic loom, and before you say that AI operates at a much greater scale, I do not think it is any more at scale than producing billions of books and garments cheaply. If you instead say that AI is more autonomous than those machines, which required human operators, I will remind you that no AI today (and likely into the future) produces outputs autonomously with no human input (and indeed, many humans tweak those outputs further, making it more like photo editing than an end-to-end solution). Even if it could perfectly read your mind and work end-to-end, you must first think for it to do what you desire.
Should those machines then be subject to your same philosophies? I suspect you'd say "that's different" somehow, but that is only because you are alive at this moment and these machines have been normalized to you, so you do not care about them. Were you born a few centuries from now, you would likely feel the same way most do about those earlier machines; indeed, you'd be hard pressed to find anyone then who thinks that future generations' AI (probably simply called technology by then) is as problematic as you do today. Recency bias is one hell of a drug.
Why does someone's work matter?
Why do you want the end result of the work if the work itself doesn't matter?
I replied to the other comment.
If it didn’t matter, you wouldn’t want to take it.
The word "work" is being overloaded here, their work as in output might matter but I am asking why they must work at all in the first place. If your answer is because they must procure money to survive, that is an economic failure, not one of AI. Jobs are simply a roundabout way of distributing money for output to be produced, if an AI can produce the output, the job need not exist. This is the same argument that has been used for centuries as automation advances in every field, but suddenly, when it comes for my white collar high tech industry? It's an outrage.
Even then, their work as output can matter, but that doesn't necessarily mean they (should) have a per se right to their work without other people also using it, especially in cases where their work is not used as output directly, which is what plagiarism is. If that were the case, no one could learn from another's work, regardless of whether that one is a person or a computer.
Remember, we are discussing art here, not white collar tech jobs. AI coming for my job would be unpleasant and devastating, but that, like you said, is an economic problem. That I agree on.
I don't think there is a way to continue this particular branch of this argument without devolving into a debate on the value of human life like a couple of Macedonian philosophers - suffice to say, my point of view is that the work of others has intrinsic value tied to intent, and machines do not have intent.
If no output of humans has intrinsic value, then once machines can approximate humans sufficiently there is no reason for humans to exist - and that is an outcome that I, as a human, reject with all of my being.
Output of humans has value to humans; art does not have value to beings outside of humans, of course. That does not mean that one cannot use a machine to create new outputs, and it doesn't mean that those will or will not have value, as again, value is subjective to the (human) beholder. We see this already with people praising AI art. Therefore, I do not believe that intent matters in the slightest as long as people deem something valuable.
The reason for humans existing is not because of the output they produce (indeed, that is dystopic), humans have worth inherently, regardless of what they output. This is also what nihilists have figured out, so maybe that is something you should look into if you seriously have such an opinion as expressed in your last paragraph.
If you believe intent does not matter and is unrelated to human worth, then we are at an inherently impassable disagreement as to the nature of human society and will never agree on this issue. My belief is that this point of view, as well as others (like Nihilism, as mentioned), is fundamentally destructive to human society, which likely clarifies why you don't see a problem with what I and others believe is an existential threat to said society.
I don't see how your viewpoint is useful, though; Nihilism has been around for a century but I do not see how it is "an existential threat to said society." It seems like you believe something but don't have any empirical backing for it, therefore, let me remind you, as my other comment says [0], that people do not have the best record of stating why "society" is decaying.
[0] https://news.ycombinator.com/item?id=40919253&p=2#40920318
I sure wish I could non-consent to people observing me in the world, I'd like to move through society invisibly and only show myself when it benefitted me. Unfortunately, the only answer is to stay inside if I don't want people to see me.
> I sure wish I could non-consent to people observing me in the world,
You aren't allowed to use photos featuring a non-consenting person to, for instance promote a product.
You are allowed to use photos including a non-consenting person.
There's a lot of complicated law, differing between different jurisdictions to cover this question, and to balance the needs of the public with commercial desires. It's not as simple as you make it sound, and there's no reason we should just default to bending over backwards for commercial interests.
Laws exist to serve society, not the other way around.
I'm sure that the people who are being constantly victimized by paparazzi would like to know those rules that you just quoted, and have them be enforced.
If you had done a little research into this question, you'd realize that 1A use cases ('journalism') are treated by law quite differently than use of likeness for commercial intent.
This is my whole point. There isn't a single, one-size-fits-all rule that a five year old can comprehend that describes any particular country's legal framework around the many, many different dimensions of tension between public and private interests on this incredibly broad question.
And none of the existing frameworks fit the new use cases well, and we should probably have an open political debate about what we want to do going forward.
I'll happily take your picture against your will and put it on the internet with the tag "vkou mad at photographer, news at 11"
Okay? What will that prove? That you can be an ass?
Being an ass is generally not illegal. Particular behaviours might be, but no legal or social system intends to censure you for every possible one, and most people who are experts in law or ethics don't believe that they should.
If you identify particular problems with the particular paparazzi laws in your country, that's an interesting conversation, and maybe, if framed well, an interesting data point for this discussion, but is not in itself the 'last word' on it. Just because you can torture an analogy, doesn't mean the analogy has a lot of power.
> consent
Careful... A lot of people online have a selective understanding when it comes to this concept. It's selfishness and self-centredness taken to its extreme, and not seeing other people as humans, but as tools for their consumption, to be used and tossed aside for pleasure or for profit. It's one of the most disgusting things I've laid eyes on.
Note that in Europe (broadly speaking), this is a right people have.
We are not discussing people observing people. We are discussing programs observing people.
Seems like a meaningless distinction in the face of a government that defines giving money as speech.
So what? Humans are simply biological programs observing people (other biological programs). If you disagree, explain how humans are not simply biological computers.
Try getting Mickey Mouse comics.
That should be fun...
I feel like what really matters is who has more money to throw in tribunals.
Somehow I feel that if it were "Adobe vs dev who claims his code was spat out by copilot" it would not end the same.
> Someone is likely to design an LLM that is specifically trained to do exactly that.
Perplexity AI.
> Perplexity AI.
How does this describe Perplexity AI more than any other LLM?
I am referring to their service rather than their LLM in specific.
Perplexity is in the business of using an LLM to paraphrase existing content, then serving that up as their own "work" in a way that directly harms the original content they took.
It's not even a question of "Is AI training copyright infringement", they're just doing copyright infringement with AI. And it's horribly common already.
They plagiarize and blame it on the third party service they use for web scraping
https://www.theverge.com/2024/6/27/24187405/perplexity-ai-tw...
A Wine fork built using an LLM trained on leaked Windows code might be pretty useful.
You'd get a Wine full of ads, the need for an account to use and the not so occasional BSoD /s
Plus it would probably be worse at running programs written for both old and new Windows versions than current Wine is.
No, because judges aren't robots applying the law like code. Intent matters. If you do this, it will be painfully obvious that your intent is to duplicate a large body of copyrighted code.
It's painfully obvious that the intent of GitHub Copilot is to duplicate a large body of copyrighted code.
It doesn't appear to be painfully obvious, both because they're not losing court cases yet, and because there's a huge swath of non-copyrighted code being produced by Copilot every day. By contrast, the plaintiffs apparently were unable to induce Copilot to duplicate any parts of their code.
Oh so that's why Copilot has a filter to prevent suggesting copyrighted code, because the intent is to duplicate copyrighted code. It all makes sense now.
Following existing law and applying reasonable expectations, I would point to the old adage "intent is 9/10ths of the law".
It would probably be legal to do this, as long as no one could reasonably show that you intentionally trained the LLM on said leaked source code with the intent to reproduce the product.
Of course, civil suits could be another matter entirely. If you pick a product to rip off that's owned by a multi-billion dollar company, all that can save you is the ethical limits of their legal team's consciences.
The misappropriation of the code (a trade secret) would likely be grounds for legal action against the people who stole it and the people who received it. A lot depends on jurisdiction.
But if it was made public and then if an unrelated third party were to re-write the code in such a way that it was non-infringing, then it would be non-infringing. That’s just a tautology.
By definition you are allowed to take leaked source code and change it enough that it avoids infringement, and this will avoid infringement.
The LLM has nothing to do with it, and isn't required here.
Or even those AI-powered decompilers people are working on… you could clone virtually any software with that. Surely there will be limitations.
The source code of Windows XP is widely available. Same with a ~2 year old version of Bing, Bing Maps, Cortana etc. Yet that doesn't seem to have had major negative effects on those products. If anything having the Windows source code available seems to be a net boon for Windows development. Sometimes looking at the source is just better if the documentation is unclear.
MS probably hates that the source for XP/2K3 leaked because it means more people will put in effort to fix and extend/backport, even if it's not truly legal, when MS would rather coerce them into the latest most invasive and user-hostile version. Also because projects like NTVDMx64 show how some of their decisions have been political instead of technical as they like to claim.
Far fewer people care about Bing or Cortana.
If you compiled it and the resulting binary was substantially similar to the original you’d likely get sued.
The limitation is the amount of money & political power the owner of the software you're cloning has.
That's exactly what I expect to happen with the source code to Microsoft's own software products, namely Windows.
Hilarity will ensue :)
Let's be honest: It will be legal if you're a $3 trillion company, and not if you're not.
Yeah. In a ideal world where an open source developer gets equal treatment from the law facing a giant corporation with hordes of very expensive lawyers and "technical experts".
Not unless a big company is the one doing it lol
I mean you can legally do this by hand right now. That's how they cloned the IBM bios back in the day. IBM sued and lost.
No, that's not.
They cloned the bios by observing how it behaved and writing code that behaved the same way. Nobody even looked at the bios code.
That's not how they did it. They had one team read the BIOS source listings in the IBM PC Technical Reference Manual and create a technical specification and a second team take that specification and write a new BIOS [1]. The second team never saw the original code so therefore they could not have copied it.
To do something similar with AI, you really need to train one AI on the source code and then have it explain that code to a second AI that never saw the original code.
I thought there was a "clean room", where the people reading it and the people writing it were different; and they made a written specification instead of a Vulcan mind meld.
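As a rough illustration of the two-model analogue suggested above (purely a sketch: `spec_model` and `dev_model` are hypothetical LLM clients with a `complete(prompt) -> str` method, not any real provider's API):

```python
def clean_room_rewrite(original_source: str, spec_model, dev_model) -> str:
    # "Team 1": sees the original, but may only emit a behavioral spec.
    spec = spec_model.complete(
        "Describe the observable behavior and interfaces of this code "
        "without quoting or reproducing any of it:\n" + original_source
    )
    # "Team 2": never sees the original; implements purely from the spec.
    return dev_model.complete(
        "Write code that satisfies this specification:\n" + spec
    )
```

Whether the intermediate spec itself smuggles over copyrightable expression is, of course, exactly the sort of question a court would have to weigh.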
A slight aside, but this is the subtitle:
> A few devs versus the powerful forces of Redmond – who did you think was going to win?
I hate that kind of obnoxious "journalism". Sometimes the little guy is actually wrong. To clarify, I'm not commenting on the specifics of this case, I just hate how fake our online discourse has been by appealing to "big guy evil" before even bringing up the specifics of the case.
> Sometimes the little guy is actually wrong.
He is, sometimes. Also sometimes, the moon passes exactly between the sun and Earth, a new star appears in the sky, the magnetic field of our planet reverses, a proton decays (jury is still out on that one, actually). Etc.
Tools like Copilot are plagiarism machines. We know the data they're being trained on, and a conclusion of "that's plagiarism" is not - or anyway should not be - controversial. I'm not terribly against the notion of a plagiarism machine but I am against the owners of such machines reaping profits from them to the exclusion of the people who provide the source material. This is theft.
More importantly, getting back to big guys and little guys: big guys gang up on little guys all the time. It's usually how they get to be big. They tend to be the ones who realize that working together against the rest of us is to their benefit. So, in the interest of pushing back on that a little, and recognizing that I am after all a fellow "little guy" (figuratively speaking anyway), I tend to support the "little guy" unless I have overwhelming evidence confirming that they are, in fact, both wrong and that supporting them anyway would be against my best interest. Neither is the case, here.
At any rate, the subtitle here references a pretty ubiquitous and, I'm happy to report, increasingly well-known and understood facet of our economic and social institutions, which is that they absolutely positively do not work for us or further our interests in any sense.
> big guys gang up on little guys all the time
And obnoxious individuals gum up enterprises. It's lazy to the point of dismissal to conclude based on bigness.
Those poor corporations, however will they survive? I say we let them dump chemicals straight into our oceans, after all we don't want to gum them up from earning infinite profit!
You can't predict right or wrong based on bigness, but you can very often predict who will win.
EDIT: And by "win" I mean not who the judge will side with, but who will end up chugging along fine financially and who will end up broke.
> EDIT: And by "win" I mean not who the judge will side with, but who will end up chugging along fine financially and who will end up broke.
I can certainly agree with that sentence, but that is definitely not how the Register was referring to "win" (they clearly just meant the judicial outcome), so it's obnoxious to imply the legal ruling went Microsoft's way solely due to their greater resources.
Won't someone PLEASE think about microsoft???
Won't anyone think of the corporations? :(
One would think if these were "plagiarism machines", that one of the plaintiffs would have been able to produce even a single instance of the copying they alleged.
It's The Register, they are always like this. Especially when Microsoft is involved.
I think you're misinterpreting the sentence.
I think it merely implies MS has more resources to throw at the legal case.
I strongly disagree. I don't see how you can interpret that sentence, especially given the "who did you think was going to win?" part, and ignore the implication that Microsoft won solely because of their size and money.
There is actually zero evidence that the judge issued his ruling based on Microsoft's superior legal team, so why even put that sentence in there anyway?
Maybe but lack of resources doesn't seem to be the main problem. A handful of devs claim copyright infringement, the Judge says show me and they can't. Maybe if they had millions of lawyers trying to get Copilot to produce their copyrighted code, their case would be stronger.
I don't think that's something you can take away from the little-guy big-guy narrative. Class actions are funded by courts awarding lawyers huge payouts if they win, not directly by the plaintiffs. There should be plenty of resources on both sides of this fight.
You are sorely underestimating the legal resources available to one of the most powerful companies on earth
I don't believe I am. To flesh out my statement more fully: there are diminishing returns on investing more money into a lawsuit, and both sides in a class action with this much money at stake should be sufficiently funded to be far beyond the point of diminishing returns.
I'm not claiming Microsoft doesn't have tons of resources, I'm claiming that the plaintiffs attorneys should be sufficiently funded that the difference in outcomes is negligible.
they also have more resources to ensure they covered their liability surface before any legal case materialized
aka the plaintiffs were wrong and had no idea what they were talking about
What were the plaintiffs even thinking when they submitted a claim based on identicality without being able to produce a single instance of Copilot generating a verbatim copy? Even the research they submitted was unable to make a claim any stronger than "it's possible in theory, but we've never seen it".
A lot of people post AI outrage comments on HN that are clearly based on a rather poor understanding of the law and legal processes. This entire case and all of the plaintiffs statements about it reads like one of those comments.
I am not strongly opinionated on this, but the very fact Microsoft used all the code it could find, bar their own has always looked suspicious to me.
I mean, I imagine it used a lot of their public code, like VS code, typescript, the new windows terminal, or anything on https://github.com/microsoft . They didn't use their private code, but they didn't use anyone else's private code either.
They claim to not use anyone's private code, but I wouldn't trust the psychopathic C-suite at M$ not to murder kittens and human babies if it made the line go up a quarter of a percentage point, let alone something like this.
You're free to speculate, but they have on multiple occasions said they don't train on private repos. Furthermore, there's no real incentive for them to do so, since (1) there are a lot of public repos, and (2) training on private repos opens them up to leaking things like private keys which would be a nightmare. It just doesn't make a lot of sense for them to do it.
Also, if anyone else uses private repos like I do, much of it will be "shame corner" code where it's only private because it's either half-finished or just terrible code because I wanted it to work right now.
My maintainable code gets published, my nightmares get banished to private repos so no one else thinks it's a good idea to replicate.
Is that a fact? If true, not sure whether it would have bearing on the legal questions, but certainly would make it seem like their actions are not in very good faith. Would love to hear their explanation if it did get raised in court.
It seems to me that regardless of the outcome of this case, some developers do not want to have their code used to train LLMs. There may need to be a new license created to restrict this usage of software. Or, maybe developers will simply stop contributing open source. In today’s day and age, where open source code serves as a tool to pad Microsoft’s pockets, I certainly will not publish any of my software open source, despite how much I would like to (under GPL) in order to help fellow developers.
If I were Microsoft, I’d really be concerned that I’m going to kill my golden goose by causing a large-scale exodus from GitHub or open source development more generally. Another idea I’ve considered is publishing boatloads of useless or incorrect code to poison their training data.
As I see it, people should be able to restrict how people use something that they gave them. If some people prefer that their code is not used to train LLMs, there should be a way to enforce that.
> I certainly will not publish any of my software open source, despite how much I would like to (under GPL) in order to help fellow developers.
I think this is a rather radical approach. You're undermining the OSS movement because you dislike Microsoft (I do too). I think adding a clause or dual licensing your work is more effective at stopping big-tech funded AI crawlers than just not adhering to open source.
You can host your code on sourcehut or Codeberg (Forgejo), you don't NEED to host it on a Microsoft owned platform.
I love the OSS movement. But the OSS movement is dependent on developers making a living somewhere else. If Microsoft effectively replace our class or at least a big part of it with AI, OSS becomes mostly irrelevant.
Not everyone is multi-generationally rich or absurdly frugal. Most people like having good jobs.
There is, as far as I am aware, nothing to stop Microsoft from crawling any other site’s code. Please correct me if I’m wrong. Things like “copyright” didn’t seem to stop OpenAI.
I am personally happy to share all my public code to support the development of better models. While I believe the benefits of contributing to open source outweigh the drawbacks, and I don't foresee a "large-scale exodus from GitHub", it's ultimately up to individual developers to decide how their code is used.
I get the sentiment, but remember that OSS has always come with the caveat that someone you do not approve of benefits from your code. That part isn't really new with LLMs (which I'm definitely not a fan of). Really this goes for any productive pursuit where you do not have total control over your clients. I think it's better to focus on how your work benefits those you want to benefit than to worry about how leeches also profit.
"I don't license as open source because $something which I don't like could use my code" is a pretty common note over time but, despite almost always coming with a warning of the end of some large open source segment, is rarely impactful in any way. Some people probably will use a special license and most won't care except for when they run across projects using said one offs and it becomes a pain to integrate licensing models.
From the article:
> The anonymous programmers have repeatedly insisted Copilot could, and would, generate code identical to what they had written themselves, which is a key pillar of their lawsuit since there is an identicality requirement for their DMCA claim. However, Judge Tigar earlier ruled the plaintiffs hadn't actually demonstrated instances of this happening, which prompted a dismissal of the claim with a chance to amend it.
So, the problem is really one of the lack of evidence, which seems... like a pretty basic mistake from the plaintiffs?
They could've taken a screencap video back when Copilot still produced code more verbatim, and used that as evidence, I assume.
Should we move to modified versions of FOSS licenses that forbid AI training?
Found this: https://github.com/non-ai-licenses/non-ai-licenses
Legally sound or not, these should at least prevent your code from being included in Copilot's training data, hopefully without affecting any other use case. I'm going to use one of these next time I start a new project.
If copilot is ruled fair use, it doesn't matter what your license is; fair use supersedes it.
You can write whatever words you want on a piece of paper or uploaded to the info section of a GitHub repo.
That doesn't mean anyone has to follow it.
If it's legal to train on other people's stuff, without their permission, this would still apply to your code even if your code includes a license that said "I double extra declare that you can't train AI on this!!".
Note that wouldn’t be F/OSS — maybe OSS but the F wouldn’t be there.
Yes, that is clear. But personally I wouldn't want to write FOSS code anyway until Copilot learns to properly attribute FOSS code. Switching to a more permissive license later on shouldn't be an issue.
> Legally sound or not, these should at least prevent your code from being included in Copilot's training data
Has microsoft said this or something?
I assumed (heard somewhere) that they only include open source repos in the training data.
Turns out I was wrong. They don't care.
https://web.archive.org/web/20210708165143/https://twitter.c...
I would like to ask an obvious question to the legally inclined here. How is this any different than remixing a song (lyrics/audio)? It's not "identical", and doesn't output "verbatim" lyrics or audio. What is the distinction between <LLM> and <Singer/Remixer who outputs remixed lyrics/audio>. By a quick Google search it seems remixes violate copyright.
I'm not legally inclined, but... code and music are different? There must be different standards for when code is too similar, for when music is too similar, for when pictures are too similar, for when books are too similar.
Also, remixes almost always do contain verbatim lyrics and/or samples from the original song. LLM output isn't supposed to contain verbatim copies, but I've been told that sometimes it does. (I don't know much about LLMs and I don't think Copilot is useful. I want my 2010-era Intellisense back, when it was extremely fast and predictable.)
Not a lawyer, but that would be a fair use question, which I hear are notoriously complicated.
Colloquially, I generally expect a remix to consist of the original instrumentals/beat (potentially edited, but with virtually nothing actually new added), potentially new lyrics, and to still be recognizable as the original.
The "still be recognizable as the original" part is a huge problem for fair use, and why I don't think remixes generally qualify. If it doesn't sound like the original then it's not a remix, but if it does sound like the original it can't be fair use.
I think the underlying issue is the resulting work, not the process that went into creating it. I think (but am in no way sure) that copying parts of songs would be fine if you did something to them so they aren't recognizable as the original.
As an example, if I take a song by the Beatles and repeatedly compress it until it's entirely compression artifacts, I would bet that I could publish that. I don't think it would matter that I started with a copyrighted work, what matters is that my finished product bears no resemblance to any other copyrighted work.
That would mean it's just a normal "is this work too similar to existing works?" standard applied to humans as well.
There is still an ancillary question of whether it's okay to train on copyrighted music, but that's really a different question than whether the works it creates infringe.
The issue I have is that these models are inherently trained to duplicate stuff. You train them by comparing the output to the original.
If I made an “advanced music engine” which rips Taylor Swift files and duplicates them, I would be sued to oblivion. Why does calling it an AI suddenly fix that?
They should have to train them on information they legally own.
They're not "inherently trained to duplicate"; I think that's a bit of a disingenuous oversimplification. They're trained to learn abstract patterns in large datasets, and remix those patterns in response to a prompt.
"You train them by comparing the output to the original." To the best of my knowledge this isn't correct; can you expand or cite a reference?
They are trained to duplicate, we just hope they do so by abstracting patterns. Various techniques stack the deck to make it difficult to memorize everything but it still happens easily, especially for replicated knowledge.
"You train them by comparing the output to the original." ->
You train neural networks by producing output for known input, comparing the output with a cost function to the expected output, and updating your system toward minimizing the cost, repeatedly, until it stops improving or you tire of waiting. To work mathematically, a cost function must attain its minimum when the output exactly matches the expected output. Engineering-wise you can possibly fudge things, and they probably do so ... now.
I don't agree with your critiques. It isn't an oversimplification, published code literally works as stated.
I disagree with the statement "they are trained to duplicate" because "to" implies a purpose/intent, which is incorrect; i.e., "they are trained with the purpose of duplication". This is, I believe, pretty uncontroversially false. We already have methods to duplicate data. Saying they are trained with the purpose of learning abstract patterns is much more correct. One of the biggest _problems_ of training is duplication, aka over-fitting. To say it's the purpose is imo disingenuous.
Ah, I see what they meant by that statement. It is true that supervised learning operates on labelled input/output pairs, and that neural networks generally use gradient descent/backpropagation. (Disclaimer: it's been a few years since I've done any of this myself, so I don't quite remember it that well, and the field has changed a lot.) Note that since the parameter space of the neural network is usually _significantly_ smaller than the training data set, a network will not tend to minimise that cost function near 0 for an individual sample, since doing so would worsen the overall result. There is inherent "fudging", although near-identical output can potentially happen. The statement here is more reasonable, and closer to the actual training process, than the first one.
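To make the loop described above concrete, here is a toy sketch (a linear model with a squared-error cost, nothing like Copilot's actual training code): produce output for known inputs, score it against the expected output, and nudge the parameters toward lower cost.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))          # known inputs
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w                        # expected ("original") outputs

w = np.zeros(3)                       # model parameters
lr = 0.1
for step in range(200):
    pred = X @ w                      # produce output for known input
    err = pred - y
    cost = np.mean(err ** 2)          # minimal (zero) only at an exact match
    grad = 2 * X.T @ err / len(y)     # gradient of the cost
    w -= lr * grad                    # update toward minimizing the cost

# w converges to true_w: the objective's global minimum is exact reproduction.
# Generalization comes from limited capacity and regularization, not from the
# cost function itself, which is the crux of the disagreement above.
```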
All GitHub needs to do to make most happy is offer an opt-out toggle.
It still doesn't.
That wouldn't and shouldn't make most people happy. Repository owner != author for all the code - that's kind of the point of open source.
Good point. But on top of fork hierarchy, git commit authors can be used.
This would mean excluding non-github members, and excluding members that opt out.
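A hedged sketch of that idea: walk a repo's commit history, collect every author, and check the set against an opt-out registry (`git log --format=%ae` is a real git invocation; the registry and email below are made up):

```python
import subprocess

# Every commit author email in the current repository.
authors = set(
    subprocess.run(
        ["git", "log", "--format=%ae"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
)

OPTED_OUT = {"dev@example.com"}       # hypothetical opt-out registry
if authors & OPTED_OUT:
    print("exclude this repo from the training set")
```

Even then, it only covers identities attached to commits, not authors whose work arrived via copy/paste.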
Yet people keep feeding it their code by using GitHub as their repo… Just as we use the internet to share information; there’s just no escaping it.
"The lack of documents from the Windows maker is apparently down to "technical difficulties" in collecting Slack messages"
Wait, I'm forced to use Teams at work but Microsoft employees are on Slack?!
> The judge disagreed, however, on the grounds that the code suggested by Copilot was not identical enough to the developers' own copyright-protected work, and thus section 1202(b) did not apply.
How did they reach this conclusion? How can you prove that it never copies a code snippet verbatim, versus just showing that it does for one specific code snippet? The latter is a lot easier to show, but I don't know what it is exactly that the plaintiffs claimed. I guess the size of the copy also matters in copyright violations?
I think there's a difference between a mathematical proof and legal proof. The mathematical proof would be "show that it never copies a code snippet verbatim", and you of course cannot prove that by example.
Legal proof is, I think, different (not a lawyer). Courts are more pragmatic. If one observes a lot of cases where it does not copy verbatim, and an expert provides a reasonable argument as to why it is unlikely to copy verbatim, that is enough legal proof for a judge to conclude that the output is not identical enough to the developers' copyrighted code.
Big question: this thing called “training” AI off of data, how much of this is “training” and how much of this is “synthesizing”? It seems like if code is being copied and rephrased, it is synthetic. Not much “learning” and “training” going on here.
This kind of argument makes me feel like it also supports the abolition of patents: eventually multiple other people will come up with the same obvious solution, which becomes obvious once a person spends enough time looking at a problem.
The Patent System is not intended to be a test of exclusive original thought.
The function of the Patent System is to incentivize search for solutions by temporarily securing exclusive right to market novel devices and processes for the discoverer.
Of non-obvious inventions. My argument being all inventions are obvious once attention is applied to that area and scope.
Requiring attention IMO takes something out of the realm of “obvious”. And the standard is “novel”.
Everything in the future is novel, so that's a moot qualifier.
Everything requires attention to be seen; whether something is "obvious" is fully determined by where you're looking and the scope you're zoomed in on.
E.g. "matter is solid" until you zoom in and realize matter is mostly made up of space.
Moot in your opinion. The idea is to bring the future more expediently by providing temporary incentive to pioneers reaching into the future.
You just proved my point with your second sentence - that everything in the future will come.
And bringing things about more expediently is the actual opinion here, unsupported, where arguably it actually slows down progress, and leaves the value of that progress less widely distributed than it otherwise would be.
You continue to miss my point. Your point is a lazy, "the future will get here whenever it does" perspective. Mine is incentivizing discovery brings future innovations sooner.
You don't provide any supportive evidence that "your way of incentivizing" creates a net benefit - "just trust me bro" seems to be your fallback.
Unfortunately USPTO takes "non-obvious" to mean that it wasn't already suggested by combining patents or other written work, so if you are the first to work a problem you can claim easy solutions that anyone with a clue would have quickly reached. Land rushes to fence off new fields seem inevitable.
Can you insist or put instructions that AIs do not train on your code? If they train on your code but don't produce the exact same output, is there any protection you can have from that?
When are people going to get that this isn't a right folks have?
If your code is readable, the public can learn from it.
Copyright doesn't extend to function.
The public is not learning from it. A person or corporation is creating a derivative work of it. Training a model is deriving a function from the training data. It is not "a human learning something by reading it".
It's an extreme stretch to say that the model weights are a derivative work of the training data given the legal definition of "derivative work".
It's not more a stretch than saying that re-encoding a PNG as a JPEG is a derivative work even though the process is lossy and the resulting bits look nothing alike.
I'm not sure you're being intellectually honest.
You think that a model that's capable of being prodded into producing an infringing output in addition to all the other non-infringing outputs it could produce is no different than a compression algorithm?
It is processed data at the end of the day. And no, it is not like human reading. You can't read the whole of GitHub.
That doesn't make it a derivative work.
If I "process data" by doing a word count of a book, and then I publish the number of words in that book (not the words themself! Just a word count!) I haven't created a derivative work.
Processing data isn't automatically infringement.
People aren't going to get it, because you don't get them.
People have the right to learn non-copyrightable elements from your code.
The claim is that AI learns copyrightable elements.
The comment chain you are replying to includes a request to not train an AI on one's code.
I agree it's certainly possible for AI to produce infringing output.
Nevertheless, people don't have the right to enforce a limitation on training.
And to give a concrete example, in my view it should be allowed to use any source code to train a model such that the model learns that code is bad or insecure or slow or otherwise undesirable. In other words, it should be allowed to train on anything as long as the model does NOT produce that training data verbatim.
Maybe you should update your view with 17 USC 106.
What copyrightable elements of the original work persist in the model, if it is incapable of outputting them? I can derive a SHA-1 hash from a copyrighted image, and yet it would be absurd to call that a derivative work.
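Concretely (the filename is hypothetical):

```python
import hashlib

# A value derived from a copyrighted work that preserves none of its
# expression: 40 hex characters from which the image cannot be recovered.
with open("copyrighted_image.png", "rb") as f:
    print(hashlib.sha1(f.read()).hexdigest())
```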
More thinking out loud than answering your question, but nightshade for code and other plain text formats would be cool.
If this is how the law is applied for code, are we to expect this is also how it will be applied for other data (e.g. audio a la Udio and Suno)?
Looking good ! Go Copilot !
Copilot was apparently snipping license-bearing comments, and applying "semantic" variations of the remaining code.
I would package the entire code as a series of comments [ideally this would be snipped by the plagiarists], leaving a snippet of example code that no one of sound mind would allow to execute, being proffered by Copilot.
> of sound mind
That's a reach, these days...
I'm seeing some really ... interesting ... behavior, being exhibited by folks that, at first blush, I think are kids, just out of bootcamp, but, on further inspection, turn out to be middle-aged professionals.
I really think Teh Internets Tubes have been rather corrosive to collective mental health.
The ability to think for oneself will diminish rapidly in an environment that rewards one for not doing so.
Smart people still exist. They just aren't online.
From Plato's dialogue Phaedrus 14, 274c-275b:
Socrates: I heard, then, that at Naucratis, in Egypt, was one of the ancient gods of that country, the one whose sacred bird is called the ibis, and the name of the god himself was Theuth. He it was who invented numbers and arithmetic and geometry and astronomy, also draughts and dice, and, most important of all, letters.
Now the king of all Egypt at that time was the god Thamus, who lived in the great city of the upper region, which the Greeks call the Egyptian Thebes, and they call the god himself Ammon. To him came Theuth to show his inventions, saying that they ought to be imparted to the other Egyptians. But Thamus asked what use there was in each, and as Theuth enumerated their uses, expressed praise or blame, according as he approved or disapproved.
"The story goes that Thamus said many things to Theuth in praise or blame of the various arts, which it would take too long to repeat; but when they came to the letters, "This invention, O king," said Theuth, "will make the Egyptians wiser and will improve their memories; for it is an elixir of memory and wisdom that I have discovered." But Thamus replied, "Most ingenious Theuth, one man has the ability to beget arts, but the ability to judge of their usefulness or harmfulness to their users belongs to another; and now you, who are the father of letters, have been led by your affection to ascribe to them a power the opposite of that which they really possess.
"For this invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory. Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them. You have invented an elixir not of memory, but of reminding; and you offer your pupils the appearance of wisdom, not true wisdom, for they will read many things without instruction and will therefore seem to know many things, when they are for the most part ignorant and hard to get along with, since they are not wise, but only appear wise."
Awesome. Serves as a counter-example - would HN consider literacy to be damaging to the mind, or are we similarly mistaken by thinking that LLMs necessarily degrade the abilities of their users?
Pre-writing 'texts' (such as the Iliad) were memorized by poets, which is reflected in their heavy use of memory-friendly forms like rhyming, consistent meter, and close repetition.
Writing allowed greater complexity and more complex/information dense literary forms.
I feel that intelligent, critical LLM usage is just writing with less laboriousness, which opens up the writer's ability to explore ideas more widely rather than spend their time on the technical aspects of knowledge production.
Does it serve as a counterexample? Or did the predicted loss of memory function come to pass?
Worth noting that people were smoking plain old opium back in those times; I'd be reluctant to apply their reasoning to fentanyl.
What are you talking about with your second paragraph? I can't tell if it's supposed to be an analogy or whether you actually think everyone was smoking opium back then.
Yes, the ancient Greeks were smoking opium. Nobody said that "everyone" was doing it, but its use was pretty widespread in neolithic Europe even before the Sumerians were cultivating poppies in Mesopotamia, back in 3400 BCE.
I see, thanks for the clarification.
That's great!
They nailed us, what, four thousand years ago?
Humans have been anatomically unchanged for 50,000 years, I'd imagine every generation lamented the young with their new technology, otherwise we wouldn't have seen so many examples in written history, it is just that we have no records from prehistory, by definition.
Precisely the quote I was thinking of, thank you.
Suicide by words, here?
The Internet is still one of the easiest ways to find and participate in communities and conversations with other smart people, if you're invested in vetting and filtering who/what you're engaging with.
That said, I expect the ease of such will continue to decline as we approach a largely dead Internet, primarily consisting of bots talking to bots trying to sell each other herbal brain force supplements or whatever
That suggests there is actually a chance that someone would go for such a booby trap.
Wait... So Microsoft doesn't use Microsoft Teams, it uses Slack?
GitHub uses Slack, and has done since long before the Microsoft acquisition. GitHub also does a ton of chat-ops, or at least used to, so their migration from Campfire to Slack was a big move for the company, I doubt they want to move again.
Off topic: How does the judiciary decide which judge to assign to such a highly technical case?
District courts can set their own policies. The Northern California District - where this was filed - allocates a case according to the last 2 digits of the case number. Source: https://www.cand.uscourts.gov/judges/civil-docketing-assignm...
That's Matthew Butterick's case.
Microsoft has deep pockets. Judges aren't objective. More at 11.
| "Suddenly, you would steal a car"
Nah, but I would download a copy of one without hesitation... ;)
Linux/OSS is cancer; said who? Anything in the public domain is up for grabs by them.
Until the open tech community stops being too chicken to boycott their non-open-source stuff, such as GitHub and LinkedIn, nothing will happen.
Sir, are you OK??
I thought "the Copilot coding assistant was trained on open source software hosted on GitHub and as such would suggest snippets from those public projects to other programmers without care for licenses" was explicitly allowed by the GitHub Terms of Service: https://docs.github.com/en/site-policy/github-terms/github-t... "If you set your pages and repositories to be viewed publicly, you grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service." In other words, in addition to what's allowed by the LICENSE file in your repo, you are also separately licensing your code "to use ... through the GitHub Service" and this would (in my interpretation) include use by Copilot for training, and use by Copilot to deliver snippets to any other GitHub user.
That just means github can display the code, and you can see the code, but that does not mean you can then profit from or redistribute (profit or no) the code without attribution.
Amazon has the rights to publish a book, and you have the right to receive a copy of the book, but neither of those gives you the right to re-publish the book under your own name.
"use, display, and perform Your Content through the GitHub Service" might allow a wide range of uses on GitHub Pages websites, even if https://example.github.io is monetized (monetization is permitted by https://docs.github.com/en/site-policy/github-terms/github-t... in a few cases)
That will work if I upload only my code, but there are many open source projects with more than one author, and GitHub did not acquire the rights from all the authors; the uploader to GitHub might not even be the author.
Lots of my code is on github (eg https://github.com/syuu1228/uARM), uploaded by others. I gave no license for its use in training. What now?
If the person didn't have your permission or permission from the license to agree to github's terms, then you sue the person who uploaded it to GitHub.
You don't get to go after GitHub because you have no contractual relationship with them. At best, you can get an injunction forcing them to take it down, though getting them to un-train copilot may not be feasible. At best you'd get a small cash offer, since you're unlikely to be able to justify any damages in a suit.
> then you sue the person who uploaded it to GitHub.
> You don't get to go after GitHub because you have no contractual relationship with them
What makes you say that? If someone eg uploads my copyrighted work to YouTube, I file a DMCA notice with YouTube to stop distributing my work. If YT ignores the notice then I can pursue them with a lawsuit.
How is this situation different?
DMCA explicitly gives you a cause of action against the party who does not properly comply with your request. GP asserts that you lack a cause of action against GitHub before they fail to comply with DMCA but I’m not certain I agree.
DMCA is a narrow protection for operators of public websites like GitHub. I don't see what it has to do with GitHub taking the data submitted to it with dubious sourcing and developing their CoPilot whatever based on it. That has nothing to do with the privileges in DMCA.
That’s right. You have lost the thread of what we are talking about: causes of action based on privity vs those created by statute.
17 USC §504 says otherwise:
... the copyright owner may elect, at any time before final judgment is rendered, to recover, instead of actual damages and profits, an award of statutory damages for all infringements ... in a sum of not less than $750 or more than $30,000. ... in a case where the copyright owner sustains the burden of proving, and the court finds, that infringement was committed willfully, the court in its discretion may increase the award of statutory damages to a sum of not more than $150,000.
<https://www.law.cornell.edu/uscode/text/17/504>
The issue isn't contract. It's copyright infringement.
So hypothetically, if a developer publishes GPL software on Codeberg, and someone uploads it to GitHub, could the original developer file takedowns against the Github copy?
I'm curious if Github's ToS make uploading GPL software you don't own a copyright violation.
No, because the GPL is already more permissive than the GitHub TOS.