How to Finetune GPT-Like Large Language Models on a Custom Dataset
I have been working in this space for quite a while, and although I think PyTorch Lightning meant well in the beginning, it seems modern use cases have outgrown it.
These days, when I see content from Lightning AI, I prepare for a contrived approach to doing something that fits within the ecosystem. I can't help but feel they are trying to induce "vendor lock-in" where there really isn't a business case for it...
Anyways, I tried to follow these steps and hit a dead end. I have to say the content put out by Hugging Face is always far more straightforward and gets me where I need to be when I want to spin up quickly.
I have a question for the generative AI experts here.
So, I can use something like GPT-4 to label data and then use that as a train set for my own LLM, right?
EDIT: adding this from OpenAI Restriction TOS: "(iii) use output from the Services to develop models that compete with OpenAI;"
> I can use something like GPT-4 to label data and then use that as a train set for my own LLM, right?
Yes, almost all improved LLaMA models are tuned exactly that way (trained on examples of questions and answers from, say, GPT-4). If OpenAI stole copyrighted works to train their models, it is morally fair game to do the same to them regardless of their TOS. It's not like they can prove it anyway.
Plus there's the other point where they also say that everything generated by their models is public domain, so which one is it eh?
Use of copyrighted material in such a way that it’s aggregated into statistical properties is almost certainly fair use. Use of the model to produce reproductions of copyrighted material then consuming or distributing it is almost certainly violating the copyright. But it was the facsimile of the material that’s the violation, not the abstract use of it to generate an aggregate model.
You understand these things have a very very wide interpretation scope here that has yet to be tested in court. I wouldn’t make these statements so confidently as courts tend to reinterpret the law significantly for the balance of societal factors when serious technology changes occur.
AI generated work is not copyright-able. I guess the courts later could disagree though.
They are in the UK:
What if the AI generates a new Eric Clapton album, with a similar voice and guitar playing style?
Your example doesn't have to be AI generated. Human cover bands play Song X in the style of Y all the time.
This is true - afaik there have been no specific rulings on whether training models on copyrighted material is a violation. But to my mind it harkens back to stuff like Xerox, where the tool itself isn’t the violating thing; it’s the use of the tool. Likewise, derivative works are often largely reproductions with minor variations and are protected under fair use. A model that takes enormous amounts of data and distills it into a tiny vector representation way below the information theoretic levels for any meaningful fidelity, and mixes and overlaps data in a way that the original data isn’t plausibly stored in the model… I’m definitely not going to wager my life that’s fair use, but I would wager my company on it.
In the history of media law I’ve seen judges lean into whatever interpretation balances the ecosystem more than what is “literally the law”. The law is meant to serve people, not the other way around. I hope judges will understand the contribution and theft can’t just be “haha fuck humanity love, openAI”
I want to train my own LLM on public but copyrighted data. I think this is serving humanity (and fucking OpenAI). I also think it is ethical because there's a big difference between "learning from" and "copying".
Your proposed reading of the law means only big tech will be able to afford the license fees to train on large amounts of data.
How do YOU plan on compensating those whose labor helped you? I bet you don’t. It’s the same thing; you are just imagining that being David rather than Goliath makes it OK for you.
It's not always necessary to compensate those whose labor helped you. I haven't compensated many of the open source projects I use, for example, even those who clearly want me to (with nagging pop-ups). If the use of copyrightable material to train a model is legal, and it does not legally require compensation, it might be difficult to argue that the use of such material should be compensated or else. It would depend IMO on whether there are norms in place for this kind of thing, and I don't necessarily see wide agreement.
Ok, what about the open source and research models? I wouldn’t wager much on OpenAI keeping a lead indefinitely. Certainly not to establish case law on what’s a pretty new technology (at least in its current use).
Yes, laws are about politics and dispute resolution more than reasoning or correctness. Focusing on the pure logic is a trap for the computationally inclined.
I'm a lawyer so one should never break the law.
Nonetheless, I can observe and predict that non-consensual "open sourcing" of these models would likely end up being the best and safest way to do all of this stuff.
This ... but we all know business is corrupt.
The current attempts by OpenAI to spur on regulation are moat building.
We were complacent while it happened because OpenAI wasn't a business, it wasn't seen as unethical to use community work to contribute to community research. Now they're entrenched and pulled the rug out from the community, whilst also trying to shut the door on anyone else.
Just a really disappointing series of events, the money and profit were never the big issue.
It's against the terms of service to do the generation, but the generated text is not copyrighted. Those are different things.
GPT-4 is trained on a large number of web pages, some of which will have had their own terms of service.
Not only web sites, full books from scribd and other sources.
Is it legal for one of their computer systems to access mine without my consent, even if publicly routable via the internet?
If I found an open port on a government computer it is still illegal for me to access that isn't it? Is the difference that this is port 80/443 and happens to serve HTTP requests something that has been described in law or court?
see LinkedIn vs HiQ (which HiQ won) covering fair use of logged-out web pages.
I have to log in to OpenAI to generate conversations but the conversations I can post on my own logged-out blog. It's the same thing OpenAI would probably say if they got sued because GPT spits copyrighted content it found on a logged-out webpage. They can't reasonably expect people to not use them for training.
Show me the ToS where it says that, and I still won't care, because it would absolutely be legal under the same principle OpenAI is using for the training data as a transformative work.
FYI: here are the relevant parts from the TOS:
(iii) use output from the Services to develop models that compete with OpenAI; (iv) except as permitted through the API
Sounds like you are allowed to as long as it's from the API, as this "imaginary" restriction isn't in https://openai.com/policies/api-data-usage-policies or https://openai.com/policies/usage-policies.
Because by training it they created something new.
I don't mind just making a point.
But I don't think they mind. I don't believe models trained this way can be bleeding edge, which should guarantee that OpenAI has enough motivation to continue development while keeping competition healthy.
It is my understanding that this is how “alignment” works.
That is, OpenAI paid people to chat with their LLM to fine-tune it, and then other LLMs use ChatGPT to generate training data to align their models.
There are three ways
1. make your own RLHF dataset - like OpenAI and Open Assistant
2. exfiltrate data from a bigger/better LLM - Vicuna & family
3. use your pre-trained LLM to generate RLAIF data, no leeching - ConstitutionalAI, based on a set of rules instead of labelling examples
I wonder whether these approaches fit into the above categories:
That is against their ToS though if you use your new LLM commercially.
It prohibits anything that competes with OpenAI services, i.e., as long as you're not literally providing an LLM API commercially you should be fine.
Does it compete with them if you stop paying for their API?
And yet they trained theirs on commercial content on the internet. If that’s legal I doubt their argument holds up in court right?
They trained on publicly-available (no signup with TOS agreement) data, on the theory that training is fair use.
You signed up and agreed to their TOS to use GPT-4.
The legal situations are not similar.
OTOH, lots of people are openly using GPT-4 in one way or another to develop models, though they might generally be at arm’s length from people intending to sell services.
> They trained on publicly-available (no signup with TOS agreement) data, on the theory that training is fair use.
They openly state they used thousands of books from a pirate site as a training source. Go look up the datasets listed in the GPT-3 paper.
So set up a shell company that uses GPT4 to make public domain examples of what RLHF data would look like, and then the parent company takes that data afterwards since it's public domain. Shell company didn't break TOS.
Of course it will hold up in court, it's their service and their terms of service.
So what are they going to do about it?
Great question! I don’t know the end game there. Maybe if they suspected their model was used they would sue, and in discovery find you used their model for training?
Maybe we don't need to worry, OpenLLaMA is under training right now. It will be the commercial version of LLaMA.
> Update 05/22/2023
> We are happy to release our 700B token checkpoint for the OpenLLaMA 7B model and 600B token checkpoint for the 3B model. We’ve also updated the evaluation results. We expect the full 1T token training run to finish at the end of this week.
So we could develop on LLaMA for now and switch to OpenLLaMA later.
> So what are they going to do about it?
If they think they can prove you used it to develop a competing service, sue you for breaking the TOS and recover the greater of the harm it did to their business or the amount of your profits from the service that are due to the use of GPT-4 in violation of the agreement.
Have companies managed to get awarded damages in lawsuits against their customers who merely broke their terms of service?
Is there existing case law here?
That escalated quickly.
MS lawyers have a good track record at sending out those scary cease&desist letters
I don't think that works. LLM-generated contents are not copyrightable.
Breach of contract for violating the TOS agreed to when signing up for the service doesn’t depend on copyright.
What I don't understand - is there anything that would prevent Alice from publishing ChatGPT prompts and outputs for anyone to use, with no T&C attached?
Once Alice has done that, is there anything to prevent Bob, who has never agreed to ChatGPT ToS, to use those prompts and outputs to train his own models to compete with OpenAI's?
(Purely from a contractual/legal/IP angle rather than ML/technical.)
Right but cease and desist usually relates to intellectual property or copyright matters, typically not TOS violations. Please correct me if I am mistaken.
Cease and desist can be used for any issues where the person or entity issuing the C&D thinks they have a legal right that is being violated and wants to put the violator on notice in the hopes of securing a change in behavior short of legal action.
Is a terms of service considered a contract?
They can terminate your account.
Nothing until it’s worth their while.
As far as I remember, I fully own all the right to the output of OpenAI (for example).
I wonder how they reconcile naming themselves "Open"AI, telling people that generated works can be used however they please, except for training a potential competitor.
Yup, totally. This is a form of knowledge distillation. OpenAI, or other foundation model providers, can't really do anything about it.
Well they can sue you and bankrupt you by delaying trial for a decade. That's how the US patent system works anyways...
Sue on what grounds? It will be quickly dismissed.
If by quickly you mean 5 to 10 years of paying a retainer on a lawyer sure. Even if you win the case you lose in life. Most individuals can't afford 500k in legal fees to have them be reimbursed years later. Big companies have lawyers on staff at a discount and they play these games every day.
This happens with illegal things all the time. E.g., a manager sexually harasses someone on video or something, it's some CEO's nephew who did it, so they fire the person who got harassed. The person who got harassed now has to acquire legal counsel on top of paying relocation clawbacks etc. A few years go by and the person who was in the right is trying to hold down a job, a family, and the stress of the legal battle. The company offers to settle two years in for 50k and 99% of people take it, sometimes at a loss. Also, getting employed is a lot harder when a background check reveals you sued a previous employer or really any company, because shocker, most companies do illegal shit regularly... So it's almost always best to settle.
I realize I painted a picture pretty far from my previous statement but I figured you were new in your career and could benefit from an allegory of how stuff like this goes down.
That is not how the US legal system works. You can sue someone for anything, regardless of merit, and they will have to defend themselves. That costs time and legal fees. If the plaintiff loses, they can appeal, and continue appealing. If the suit is baseless, they'll lose, but the defendant still spends a lot of money and time dealing with the lawsuits.
Yes, and in fact that's the best method available if you want good performance. I would suggest using a local open source model to do this however, to cut down on costs and make it far simpler to deal with than the unwieldy OpenAI systems.
Indeed, fine-tuning with either synthetic data (as you are proposing) or human review works like that. You can read more here: https://huggingface.co/blog/rlhf
Not an AI expert, but from a talk I recently heard... if there is a mismatch in training data between the "teacher" LLM and "student" LLM, you risk teaching the student to hallucinate or to ignore information.
Is "ca" "can" or "can't"?
Can someone explain why I'd want to use fine-tuning instead of a vector database (or some other way of storing data/context)?
Assuming you would want to fine-tune over a codebase or set of documents, I would argue vector databases and fine-tuning are completely different tools.
I would strongly recommend against fine-tuning over a set of documents, as this is a very lossy information retrieval system. LLMs are not well suited for information retrieval in the way databases and search engines are.
The application of fine-tuning that we are seeing have a lot of success is making completion models like LLaMA or the original GPT-3 prompt-able: in essence, prompt-tuning or instruction-tuning. That is, giving the model the ability to respond in a chat interface of user prompt followed by LLM output.
Vector databases, for now, are a great way to store mappings of embeddings of documents with the documents themselves for relevant-document information retrieval.
I would highly recommend skimming this RLHF paper for how demonstration data was used to make a model prompt-able. Keep in mind RLHF is another concept altogether, and we might be seeing a revolution where it becomes optional (thanks to LIMA)!
Great reply, here's an example from my own work:
I want the user to be able to ask technical questions about a set of documents, then the user should retrieve a summary-answer from those documents along with a source.
I first need to finetune GPT4 so it better understands the niche-specific technical questions, the words used, etc. I could ask the finetuned model questions, but it won't really know from where it got the information. Without finetuning the summarised answer will suffer, or it will pull out the wrong papers.
Then I need to use a vector database to store the technical papers for the model to access; now I can ask questions, get a decent answer, and will have access to the sources.
Thanks (to both you and the parent) for sharing these details. So is it fair to say the following:
1. Fine-tuning bakes the knowledge into the model, but getting the "source" of an answer to a specific question becomes cagey and it is unclear if the answer is accurate or just a hallucination.
2. Therefore vector databases, which can provide context to the LLM before it answers, can solve this "citation" problem, BUT:
3. We then have limits because of the context window of the LLM to begin with.
Is that a fair understanding, or have I totally gotten this incorrect?
Edit: Or, are you saying that you both fine-tune AND also use a vector database which stores the embeddings of the dataset used to fine-tune the model?
Ah! That makes sense! That's a neat strategy!
I asked ChatGPT this question, and asked it to simplify as much as possible.
Fine-tuned Models: Imagine you have a super-smart robot that can talk about anything. But you want it to be really good at talking about, say, dinosaurs. So, you teach it more about dinosaurs specifically. That's what fine-tuning is – you're teaching the robot (or model) to be really good at a specific topic.
Vector Databases and Embeddings with LLM: This might be a little tricky, but let's think of it this way. Imagine you have a huge library of books and you want to find information on a specific topic, say, ancient Egypt. Now, instead of reading every book, you have a magical index that can tell you which books talk about ancient Egypt. This index is created by magically converting each book into a "summary dot" (that's the embedding). When you ask about ancient Egypt, your question is also converted into a "summary dot". Then, the magical index finds the books (or "summary dots") that are most similar to your question. That's how the vector database and embeddings work.
So, if you want your super-smart robot to be really good at one specific topic, you use fine-tuning. But if you want it to quickly find information from a huge library of knowledge, you use vector databases and embeddings. Sometimes, you might even use both for different parts of the same task!
First reason that comes to mind is you can make much smaller models, which helps with latency, cost and may enable you to run the model locally.
Fine Tuning = Output
Embeddings = Input
Fine-tuning is like a chef modifying a general pizza recipe to perfect a specific pizza, such as Neapolitan. This customization optimizes the result. In AI, fine-tuning adjusts a pre-existing model to perform better on a specific task.
Embeddings are like categorizing ingredients based on properties. They represent inputs so that similar inputs have similar representations. For instance, 'dog' and 'puppy' in an AI model have similar meanings. Like ingredients in a pizza, embeddings help the model understand and interpret the inputs. So, fine-tuning is about improving the model's performance, while embeddings help the model comprehend its inputs.
It turns out you can search a vector space of embeddings to find similar embeddings. If I turned my above post into 2 embeddings, and you searched for "golden retriever", then even though neither paragraph has that exact phrase, the model should know a golden retriever is most similar to the second paragraph, the one that compares puppy to dog.
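A minimal sketch of that kind of search, assuming toy 3-dimensional vectors in place of real embeddings. The numbers, the axis meanings, and the `cosine` helper are all made up for illustration; a real system would get its vectors from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings: axes loosely mean (cooking, animals, software).
paragraphs = {
    "pizza / fine-tuning paragraph": [0.9, 0.1, 0.3],
    "dog / puppy embeddings paragraph": [0.2, 0.9, 0.3],
}
query = [0.1, 0.95, 0.1]  # made-up vector standing in for "golden retriever"

# Nearest neighbour = the stored vector with the highest cosine similarity.
best = max(paragraphs, key=lambda name: cosine(paragraphs[name], query))
print(best)
```

With these made-up vectors the query lands closest to the dog/puppy paragraph, even though no text is compared at all; only the geometry of the vectors matters.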
I like to think of an LLM as a literal human. Not sure if it's the best analogy.
Fine tuning = Adding years of experience in a set environment, e.g. raise them in a home that only speaks Old English, have them learn Pig Latin, send them to a bootcamp.
Embedding = Giving them a book to reference information.
Just like a human, memory might fade a bit through the years but old habits die hard. You might not perfectly recollect what you learned years ago, but you still get the general idea, and if you took a class on the referenced book you'll be better at relaying information from it.
Edit: Asked ChatGPT to create the analogy.
A language model is like an intelligent person.
- Pre-training is their broad education and general knowledge.
- Fine-tuning is their years of specialized experience in a specific field.
- Embedding is like giving them a comprehensive book on a particular subject.
Just as a person gains knowledge, expertise, and specialized resources, the language model develops its understanding and performance through pre-training, fine-tuning, and embedding.
Fine-tuning could be useful to get a high text completion quality out of a small model within a specific domain. You would still use the resulting model alongside an info retrieval system to prompt with real context (unless you have a use case where hallucination is a feature).
Wouldn't a vector database just get you nearest-neighbors on the embeddings? How would that answer a generative or extractive question? I can see it might get you sentiment, but would it help with "tell me all the places that are mentioned in this review"?
I think the point is that you use the vector database to locate the relevant context to pass to the LLM for question answering. Here’s an end-to-end example:
Right. You feed the text chunks (from the matched embeddings) to a generative LLM to do the extractive/summarization part.
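A rough sketch of that retrieve-then-generate flow, under loud assumptions: `score` is a word-overlap stand-in for embedding similarity, `build_prompt` is a hypothetical helper, and the final call to a generative LLM is left out entirely.

```python
def score(query, chunk):
    """Stand-in for embedding similarity: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_prompt(query, chunks, k=2):
    """Pick the k best-matching chunks and place them ahead of the question."""
    top = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(top))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"

chunks = [
    "The hotel in Lisbon had a great view of the river.",
    "Breakfast was cold and the coffee was weak.",
    "We took a day trip from Lisbon to Sintra and Cascais.",
]
prompt = build_prompt("Which places near Lisbon are mentioned?", chunks)
print(prompt)  # this string is what you would send to the generative model
```

The vector database only does the first half (picking chunks). The extraction or summarization happens when the model reads the assembled prompt.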
I've been playing with using documents as OpenAI embeddings for the past few weeks and, at least for my use case, the results are meh. It seems sometimes just using context is not enough.
My next step is to play with fine-tuning, but I have no results to report yet.
Try using InstructXL for embeddings. It’s got a more complex prompt structure for generating embeddings which might be more useful
Have you tried other models to generate embeddings? I am heading in that direction too, to create an additional layer of helpers for search. Also, if the document is not too big, it might fit into the initial context with the prompt.
If the documents are large, try embedding smaller portions. If there's a heavy domain vocabulary, you might need a custom model.
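One way "embedding smaller portions" could look is a sliding window over words. This is only a sketch: `chunk_words` is a hypothetical helper, and the window size and overlap are arbitrary values you would tune for your documents.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words, each sharing `overlap`
    words with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

doc = " ".join(f"w{i}" for i in range(120))  # fake 120-word document
pieces = chunk_words(doc, size=50, overlap=10)
print(len(pieces), [len(p.split()) for p in pieces])
```

Each piece would then be embedded and stored separately, so a match points at a specific passage rather than at a whole document.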
I'd be very interested in knowing the outcome. Do you blog anywhere (or post on social)?
I think it probably works a lot better, but I would love to see some research validating this
I've read in a few places that it actually works worse in most cases. Much better to put the context in your prompt.
Fine tuning + context will outperform context alone, and it's cheaper to burn cycles fine tuning and then use a smaller context than to use a larger context in production.
Fine tuning + same context will probably outperform context alone, but if you use a smaller context that does not seem to work that well as GP stated.
While the fine-tuning pipeline is fairly straightforward for tuning and building custom models, the RLHF pipeline doesn't look to be as straightforward. Creating a dataset for RLHF seems like a fairly labour-intensive exercise, especially if your model is tuned to do work like code generation?
What about the Replit Ghostwriter? Did it have a RLHF phase?
What is the main difference between training and fine tuning?
Can you start with a model trained only in producing the letter a, and then fine tune it to learn b, then c, then words, sentences, etc?
Ideally you train a model right to begin with, and no fine tuning is necessary.
However, sometimes you can't do that. For example, perhaps you want your model to always talk like a pirate, but you don't have billions of words spoken like a pirate to train on.
So the next best thing is to train a model on all English text (which you have lots of), and then finetune on your smaller dataset of pirate speech.
Finetuning is simply more training, but with a different dataset and often a different learning rate.
Typically, finetuning uses far far far less data and compute, and can be done by individuals with a home PC, whereas training a large language model from scratch is in the $1M - $1B range.
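To make "finetuning is simply more training" concrete, here is a toy sketch with a one-parameter model y = w * x. The datasets, learning rates, and the `train` helper are invented for illustration and have nothing to do with how an actual LLM is trained, beyond the shape of the loop.

```python
def train(w, data, lr, epochs):
    """Plain gradient descent on mean squared error for the model y = w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# "Pre-training": the larger generic dataset, where y = 2x.
generic = [(float(x), 2.0 * x) for x in range(1, 11)]
# "Fine-tuning": a handful of domain examples, where y = 2.5x.
domain = [(1.0, 2.5), (2.0, 5.0), (3.0, 7.5)]

w_pre = train(0.0, generic, lr=0.01, epochs=200)     # start from scratch
w_fine = train(w_pre, domain, lr=0.005, epochs=200)  # same loop, new data, smaller lr
print(round(w_pre, 3), round(w_fine, 3))
```

The second call reuses the exact same function; only the starting weight, the data, and the learning rate differ.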
For "full fine tuning", mathematically there's no difference. Fine tuning is just extending the training on new data.
What you are suggesting is called "curriculum learning", and though it hasn't been applied to LLMs yet to the best of my knowledge, it has proven to improve learning and decrease training times in other areas of ML.
Yeah, since fine tuning seems to be so much cheaper than training, why hasn't OpenAI fine-tuned ChatGPT on data past 2021?
One argument is that it can contaminate training data from output of itself or other models.
We already have documented evidence of this effect. In the GPT-4 technical report [1], they reported contamination of HumanEval data in the training data.
They did measure against a "non-contaminated" training set, but no idea if that can still be trusted.
Why would this matter? A model can post seemingly strong results on contaminated benchmarks yet measure poorly against new and quarantined information. Classic overfitting.
Another argument is that data being put out there could very much be wrong, and the amount of it is amplified by other models. Take a look at this sample of demonstration data for codealpaca [2]. Not only is its output wrong, but bad practices, like making up the result of a computation without having access to a place to run it, teach the model that these types of responses are OK.
[1]: https://cdn.openai.com/papers/gpt-4.pdf
[2]: https://github.com/sahil280114/codealpaca/commit/0d265112c70...
My guess is that it's because they've already done RLHF on top of the standard next token prediction. In other words, they can't cheaply fine tune ChatGPT without undoing the RLHF objective by training on next token prediction with post-2021 data, and then retraining with RLHF to make sure it still gives good human-like output.
I mention the "undoing RLHF" since it's not uncommon for fine-tuned models to increase in error in the original training objective after being fine-tuned with a different one. I think people saw this happen in BERT.
Also ChatGPT is almost certainly huge.
Not an expert, but my high-level understanding is this: a model is a set of inputs, some middle layers, and a set of outputs, and fine tuning concentrates on only the output layers.
Useful for taking a generic model with a base level of knowledge, and tuning it so the output is more useful for an application specific use case.
not strictly true I think
- you could add new units throughout and train those while freezing existing units (adapter-based fine-tuning)
- you could train all units and use e.g. low-rank adaptation to limit how much they can change
- you could do prefix tuning and train an input to add at every layer
see e.g. - https://lightning.ai/pages/community/article/understanding-l...
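For the low-rank adaptation bullet, a small pure-Python sketch of the idea: keep the original weight matrix W frozen and learn only two thin matrices A and B whose product is added to it. The sizes here are toy values, the helper names are made up, and the usual scaling factor on the update is left out for brevity.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

d_out, d_in, r = 6, 8, 2  # toy sizes; r is the low rank
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, starts at zero

def lora_forward(x):
    """h = W x + B (A x): the frozen path plus the low-rank update."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]

x = [1.0] * d_in
full_params = d_out * d_in        # what full fine-tuning would update
lora_params = r * (d_in + d_out)  # what LoRA updates instead
big_full = 4096 * 4096            # the same arithmetic for one larger layer
big_lora = 8 * (4096 + 4096)      # with rank 8
print(full_params, lora_params, big_full, big_lora)
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen one, and the update it can ever learn has rank at most r. That is one sense in which the change is limited.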
I think that's more in line with transfer learning, a variant of fine-tuning. If I'm reading this article correctly, they're fine-tuning the LMs end-to-end.
It seems training Vicuna on a custom dataset could be quite easy as well, according to the following: https://github.com/skypilot-org/skypilot/tree/master/llm/vic...
Would it be feasible to fine-tune a large, capable model (like the recent LIMA) on the source code (and maybe a few high quality libraries) of a niche language, such that it's much better at helping you write and understand it?
Imagine how many doors it would open if you could fine-tune models capable of writing language bindings for you and keeping them up to date.
Totally. GPT-4 can already do this, untuned, on niche languages and libraries. One of the main problems is still that you don't know when it's hallucinating a function or whatever though.
When is fine tuning worth it, rather than just prompt engineering?
I think these are two very separate concepts.
What we are mostly seeing when it comes to fine-tuning is making a model promptable. Models like LLaMA or the original GPT3 weren't promptable. They were fine-tuned with demonstration data that looks like a prompt input, prompt output.
Prompt engineering is really just carefully designing what inputs and outputs on a prompt-ready model work best.
I highly recommend skimming this RLHF article and looking for the parts where it talks about demonstration data 
Prompt engineering and fine tuning are in many cases alternative ways to achieve the same goal. You claim that the "original GPT3" wasn't promptable. I'm unsure which version you refer to, but I'm guessing you refer to text-davinci-003 and it was definitely promptable. For one app I used prompt engineering to make it behave like a spirit talking through a ouija board. For another, I used prompt engineering to make it act like a dystopian search engine from the future. So, yeah, it's promptable.
Thanks for link 2 - it is worth a proper read! Read half of it already and it is very interesting and useful for understanding this.
From what I've seen, it's when embeddings get too large for the token limit or the embeddings drive the cost up too much because you're always operating near the max token limit. In those cases, it may be worth the up front training cost and slightly higher per-token cost to dramatically reduce the amount of tokens in the average request. If you're building a higher throughput solution, the difference in cost can be quite large.
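A back-of-the-envelope sketch of that trade-off. Every number below (request volume, token counts, prices, training cost) is invented purely to show the arithmetic; plug in your own figures.

```python
def monthly_cost(requests, prompt_tokens, price_per_1k, upfront=0.0):
    """Total cost = one-time training cost + per-token inference cost."""
    return upfront + requests * prompt_tokens / 1000 * price_per_1k

# All figures below are made up for illustration.
requests = 1_000_000
stuffed = monthly_cost(requests, prompt_tokens=3500, price_per_1k=0.002)
tuned = monthly_cost(requests, prompt_tokens=500, price_per_1k=0.003, upfront=500.0)

# Requests needed before the up-front training cost pays for itself.
per_request_saving = 3500 / 1000 * 0.002 - 500 / 1000 * 0.003
break_even = 500.0 / per_request_saving
print(stuffed, tuned, round(break_even))
```

Under these assumed numbers the tuned model wins at high volume despite the higher per-token price, because each request carries far fewer tokens.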
When you're starting to run into context limits.
It's worth it whenever you have a reasonable amount of training data. You can get substantial quality improvements automatically. Unless you're doing some kind of prompt-optimization, prompt-tuning is a lot of random guessing and trial-and-error. It's also most necessary when you have a smaller base model, as opposed to one of the big ones.
If you want to teach it, e.g., all of the text in your private training manuals and internal documentation, which wouldn't fit in the input token size.
These NanoGPT based models are great, thank you for contributing to OS. Would love to see this ported to CPUs ala llama.cpp. Any plans in that direction?
Is there a DreamBooth equivalent for fine-tuning ChatGPT as there is for Stable Diffusion? I have to imagine that if we can add custom data to a DL text-to-image model, we should be able to do the same with a text-to-text one.
Edit to add: There are a number of Google Colabs for fine-tuning SD, and I wonder if there are any for other txt2txt models (or if it is technically feasible to accomplish the same with them).
These aren't for ChatGPT, but work on LLaMA, Vicuna, etc.
If you're running the text-generation-webui (https://github.com/oobabooga/text-generation-webui) it has the ability to train LoRAs.
It'll require a beefy GPU but I've seen some fun examples like someone training a LoRA on Skyrim books.
Has anyone tried to use this?
The guide obviously didn't produce usable code, and the GitHub repo looks nearly unrelated.
I'm somewhat surprised there isn't a function that takes 'input_data' and 'output_data' parameters and returns a trained model. I can't figure out why there is so much boilerplate when that stuff could be contained as parameters.
Does anyone know the computational cost of training with these LoRA designs? Given that we are talking about rates of tokens per second, it seems training on a bigger dataset could be extremely expensive.
Adapters and LoRA have drastically fewer trainable parameters, so one might expect that forward + backward is roughly 2x the cost of forward.
Then (as far as I know), in contrast to generation, training is done on the entire output of the transformer (so all tokens of the full input) rather than serially token-by-token (in the RNN days, this was called teacher-forcing), so that may give you a significant boost in the tokens per second rate over generation.
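A tiny sketch of that difference, using a made-up five-token sequence and a lookup table as a stand-in "model". Nothing here is a real transformer; it only shows how many supervised examples one training pass yields versus how many sequential steps generation needs.

```python
tokens = ["<s>", "the", "cat", "sat", "</s>"]

# Training (teacher forcing): shift by one, and every position becomes a
# supervised example computed in the same forward pass.
inputs, targets = tokens[:-1], tokens[1:]
training_pairs = [(inputs[: i + 1], targets[i]) for i in range(len(inputs))]

# Generation: one new token per step, each step waiting on the previous one.
def generate(next_token, prompt, max_steps):
    out = list(prompt)
    for _ in range(max_steps):
        out.append(next_token(out))
    return out

lookup = dict(zip(inputs, targets))  # toy "model": last token -> next token
generated = generate(lambda ctx: lookup[ctx[-1]], ["<s>"], max_steps=4)
print(len(training_pairs), generated)
```

One pass over the sequence gives four loss terms during training, while producing the same four tokens at inference takes four dependent steps.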
How does this compare to fine tuning something like BERT?
I would say similar since the building block is the transformer for both. In this blog post, the fine-tuning strategy used is Adapter. It basically adds a learnable layer to the Transformer block.
Has anyone here used EasyLM? It seems to be the most used for the best finetuned models out there.
Sounds interesting. Curious if there is a tutorial for this.
This looks like the Orion browser logo