The RWKV language model: An RNN with the advantages of a transformer
by T-A
Unfortunately it's not very good at longer context lengths, which sort of defeats the point of efficient scaling with context. See https://twitter.com/arankomatsuzaki/status/16390003799784038...
It's also not really an RNN. The best way to describe the key time-mixing operation is as a normalized exponentially weighted moving average (EMA), with no non-linearity. Once viewed this way, it's not surprising that it struggles at longer contexts: everything decays, and it has limited space to put things. Of course, it does have some clever tricks, and it can choose to remember things for a while by upweighting them, but not forever.
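For a back-of-the-envelope feel for the decay argument (the decay rates below are made up for illustration, not RWKV's learned ones):

    # Under a normalized EWMA with per-channel decay w, a token k steps back
    # contributes with relative weight w**k compared to the newest token.
    for w in (0.9, 0.99, 0.999):
        print(w, ["%.1e" % w**k for k in (100, 1_000, 10_000)])
    # Even a very "slow" channel (w = 0.999) has down-weighted a token 10k steps
    # back to roughly 4.5e-5 of the newest one, so distant context fades unless
    # the model keeps re-upweighting it.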
Yup. This is why the long-context version of GPT-4 is going to be such a goddamn game changer.
The memory techniques we currently have, outside of ultra-long context lengths, are lossy and imperfect. I wish the LangChain spammers (yes, it's a good tool) would acknowledge this when they keep posting everywhere about the "memory module".
Yes, I am so curious how they did it. I know about flash attention, but there is no way this gets us all the way there.
Why not? They charge 8x more for a 32k context than for an 8k context (note that the prices are normally presented per token; here I'm talking about absolute cost). Naive scaling of the self-attention component would suggest 16x compute and 4x memory, while the rest of it (feed-forward layers, embeddings, activation functions, etc.) would all go up 4x in both compute and memory.
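Rough sketch of that arithmetic (treating attention compute as O(N^2) and everything else, including the KV cache, MLPs and activations, as O(N); real serving cost also depends on batching and implementation):

    n_8k, n_32k = 8_192, 32_768
    r = n_32k / n_8k                                 # 4x more tokens per request
    print("attention compute:", r ** 2)              # ~16x, the O(N^2) part
    print("KV-cache memory:  ", r)                   # ~4x,  O(N)
    print("FFN/embedding compute and memory:", r)    # ~4x,  O(N)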
It already is a game changer, Bing Chat's Creative mode appears to be using >8k token context (though I'm not sure if it's 16k or 32k).
The stats shown are pretty compelling that this transformer + RNN approach works. Looking back at the LLaMA paper (https://arxiv.org/pdf/2302.13971.pdf), many of the results are comparable between LLaMA and RWKV. E.g. LLaMA 13B scores 80.1 on PIQA and RWKV scores 77.5.
I wonder if RLHF would boost performance for this architecture in the same way
I personally think the way these chat assistants are created using stateless models is extremely flawed. I'm glad that the technique used in RWKV seems successful; I wonder if this transformer + RNN combination will be adopted elsewhere.
Transformers scale in training mainly because they are stateless!
They have a long lookback window (32k tokens in GPT-4!), so the issues are smoothed over.
LSTMs end up being very finicky to train, and their sequential nature bottlenecks parallelism and thus model scaling.
RWKV training can be parallelized just like transformers. It's not an LSTM.
Right, I meant that the reason transformers exploded over LSTM-based LMs in 2018 was those training issues, which transformers solved.
I'm excited about the next iteration of models focusing on inference. RWKV seems promising on this front. Also love the citizen-science aspect of how RWKV came about.
You can currently play around with the 7B[0] and 14B[1] parameter models hosted on Hugging Face Spaces.
Also here:
Man fears AGI
AGI fears Fabrice Bellard
You're not lying. When I saw the site I thought "isn't that the dude who managed to emulate an entire PC in Javascript and boot Windows in a browser? Yep that's him."
Somehow not surprised Bellard is also going to make an impact in AI/LLMs
> going to make an impact in AI/LLMs
This seems to be made for mass parallelism.
Question: Would this scale to 100 million collaborating 4G/5G smartphones? We have operational federated-learning code and decentralised-learning code (trivial stochastic gradient descent). Application: simple learning-to-rank based on click logs, like it's 2005 [1]. Then you have a true decentralised Google.
Would love to link RWKV to other purely decentralised tech. In the past we built the first self-compiling Android app and the first Android-to-Android P2P overlay network, without any helper peers for carrier-grade NAT puncturing. My university systems lab lacks the size to keep up with the recent pace of innovation.
> Somehow not surprised Bellard is also going to make an impact in AI[... ]
He also wrote an engine: "LibNC: C Library for Tensor Manipulation"
> [ ...]/LLMs
That could be a wish come true.
Edit: in fact, one of his marks can already be seen: «Larger models work optimally on lower cost GPUs (e.g. RTX 3090, RTX A6000) thanks to efficient quantization»
This gave me a good chuckle. Thought I would share. (From ChatRWKV)
Prompt:
A website can be built in 10 simple steps
Output:
1. Research
2. Research
3. Research
4. Research
5. Research
6. Research
7. Design
8. Design
9. Design
10. Design
While this topic has visibility, there is also another relatively unexplored research direction: fine-tuning pretrained transformers into RNNs.
When I've played around with this on my box, I haven't been able to get very decent output. Maybe I'm just used to ChatGPT though.
Does anybody have some examples of output that it generates to explain what this is capable of?
Check out the web demo for the 14B parameter model: https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio
I tried a few queries. It works okay for basic coding prompts, but can give pretty good results with more context / prompt engineering.
If you're expecting ChatGPT-esque output, I'd suggest this 7B demo that's fine-tuned on the Alpaca dataset.
I've seen this hyped a lot on Reddit, and its main contributor base seems to be on a Discord server, but I have yet to see any scholarly discussion behind it.
Is there a reason the creator hasn't written or tried to publish a paper on it? I'd love to see a peer-reviewed discussion detailing how it works, maybe studying how it behaves internally and evaluating its performance more rigorously than posting a few examples, perhaps in comparison to transformers with similar parameter counts or training/inference costs in FLOPs. Perhaps that's not the creator's main priority, though.
I think I've seen this same sentiment on many of the Reddit threads on RWKV. I think the author prefers to spend their time on coding it up and explaining it on the GitHub rather than taking the time to write a paper on it.
I kinda respect that, can't force someone to explain their ways if they prefer hacking to writing. I would also love to see more rigorous measurements and discussion though.
The creator replied once, on a Reddit post asking exactly this: "Thank you :) Too busy for that at this moment, but I will get a paper out later this year."
https://old.reddit.com/r/MachineLearning/comments/1135aew/r_...
> though in practice, the model might have a hard time generalizing to much longer context lengths than it saw during training
What's the benefit of RWKV then? The results are practically identical to transformers.
It is way cheaper to run inference on commodity hardware, because you don't have to keep track of the activations of all the previous tokens (only the 3 previous ones, IIRC).
- from O(N²) to O(N) complexity during training wrt context length
- only need the hidden state at position t to compute t+1, i.e. run inference locally on edge devices
- fine-tunable to longer context lengths than seen in pre-training
If it continues to scale as well as it has so far it's pretty huge.
> RWKV combines the best features of RNNs and transformers. During training, we use the transformer type formulation of the architecture, which allows massive parallelization (with a sort of attention which scales linearly with the number of tokens). For inference, we use an equivalent formulation which works like an RNN with a state. This allows us to get the best of both worlds.
> So we basically have a model which trains like a transformer, except that long context length is not expensive. And during inference, we need substantially less memory and can implicitly handle “infinite” context length (though in practice, the model might have a hard time generalizing to much longer context lengths than it saw during training).
The tl;dr: linear attention avoids the quadratic scaling with token count and is equivalent to specific instances of RNNs.
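A minimal NumPy sketch of that equivalence, using a generic linear-attention formulation (not RWKV's exact update, which adds per-channel time decay and a bonus weight for the current token; the 1+ELU feature map here is just an assumption for illustration):

    import numpy as np

    def phi(x):                          # positive feature map (assumed: 1 + ELU)
        return np.where(x > 0, x + 1.0, np.exp(x))

    T, d = 6, 4
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
    Q, K = phi(q), phi(k)

    # "Parallel" training-style form: causal weights, no softmax over all tokens.
    w = np.tril(Q @ K.T)                              # w[t, s] = phi(q_t) . phi(k_s) for s <= t
    out_parallel = (w @ v) / w.sum(1, keepdims=True)

    # Recurrent inference-style form: a fixed-size state (d x d matrix S and
    # d-vector z) updated once per token, with no growing KV cache.
    S, z = np.zeros((d, d)), np.zeros(d)
    out_recurrent = np.zeros((T, d))
    for t in range(T):
        S += np.outer(K[t], v[t])
        z += K[t]
        out_recurrent[t] = (Q[t] @ S) / (Q[t] @ z)

    print(np.allclose(out_parallel, out_recurrent))   # True: same outputs, different cost profile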
You can easily fine-tune for more context, and indeed he does. There are versions with at least 8192 tokens of context out now (it started at 1024), and he plans to go all the way to at least 16k.
If the context size is unbounded, how does the time complexity scale with the size of the context, and what are the limiting factors that affect performance as the context size grows larger?
The time and space complexity of inference is constant wrt. context size. You will probably need more parameters to match the performance of a transformer though, so whether it scales better in practice is an open question.
The model looks very promising: https://zhuanlan-zhihu-com.translate.goog/p/619721229?_x_tr_...
I hope this guy, the only man on the planet who can train big RNN models, beats OpenAI one day like he said he would.
Happy to see this getting attention, lots of great open models are being worked on and I can't wait to see something people at home with a 3090 could use.
You should be able to run Llama 30B q4 on an RTX 3090. Check the table here: https://bellard.org/ts_server/
30B at q4 requires 20 GB of VRAM, while the 3090 has 24 GB.
Seconding this.
Llama 30B 4-bit has amazing performance, comparable to GPT-3 quality for my search and novel-generating use cases, and fits on a single 3090. In tandem with third-party applications such as LlamaIndex and the Alpaca LoRA, GPT-3 (and potentially GPT-4) has already been democratized in my eyes.
Which one is the next incarnation of performance/price/goodness after the RTX 3090 for running a mini AI homelab?
Not entirely certain about this, but I believe that because Apple M-series GPUs use system memory, you could possibly run the larger models on accessible (albeit expensive) consumer hardware.
Need 64GB for 30B
And 128GB for the 65B
I'm not sure about the performance, but I think it should be ok? Especially given how much Apple has been investing in what (I believe) they call neural cores?
Here's some more context: https://news.ycombinator.com/item?id=35105364
To increase performance, you need bigger foundation models, like LLaMA 65B. You will always be RAM-bound for that (~40GB for LLaMA 65B INT4). So the next step is to upgrade your motherboard to have multiple GPU slots. If you really wanted to stick to a single-GPU setup, you could upgrade to RTX A6000, but it is more expensive than two RTX 3090 while holding the same amount of RAM.
It's a dual-slot GPU.
What benefit does that offer over 2 GPUs?
The two GPUs will need to exchange information in order to complete inference, with one GPU holding half of the network weights and the other holding the other half. That transfer will be limited by the PCIe bandwidth; for instance, 30 GB/s with PCIe 4.0 x16.
Meanwhile the A6000 VRAM bandwidth is 768 GB/s (= 16 Gb/s (GDDR6) × 384 bit-width ÷ 8 bits per byte).
Two 3090s can be connected using an NVLink bridge, which is much faster than PCIe.
I couldn’t find precise information on what bandwidth you’d get with NVLink on the 3090. To be fair, though, if all we do is inference using Hugging Face pipeline parallelism, the amount of data transferred is pretty small: 8192×2×n_tokens bytes; for most uses, with a recent PCIe setup, that bottleneck will add less than 0.1 ms per token generated, which may not be the dominant latency.
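Back-of-the-envelope for that, assuming fp16 activations and LLaMA 65B's hidden size of 8192 (numbers picked to match the formula above; actual setups will differ):

    hidden, bytes_per_val = 8192, 2                  # fp16 activations at the split point
    pcie_bytes_per_s = 30e9                          # ~PCIe 4.0 x16
    per_token = hidden * bytes_per_val               # 16 KiB crosses the GPU boundary per token
    print(per_token / pcie_bytes_per_s * 1e6, "us")  # ~0.5 us, far below 0.1 ms
    # Real transfers add per-transaction latency, but this stays well under
    # typical per-token generation time, so PCIe is unlikely to dominate.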
(Also, the RTX 3090 has faster VRAM, >900 GB/s, than the A6000, because it is GDDR6X.)
Raven: an open-source language model that challenges ChatGPT. The article highlights the significance of language models in AI, the limitations of proprietary models like GPT-4, and the potential of open-source models like Raven to level the playing field. https://thelearness.com/raven-the-open-source-language-model...
Can someone ELI5 how linear attention works? Isn't that impossible in principle?
It's not really attention, it's 'just' an exponential moving average over time where different channels have different decay rates. This is a simplification; in the actual architecture there are also convolution layers.
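A toy version of that in Python (a plain channel-wise EMA; the real RWKV time-mixing also keeps a normalizing denominator and a separate weight for the current token, and the decay rates below are made up):

    import numpy as np

    d = 4
    decay = np.array([0.5, 0.9, 0.99, 0.999])   # one (assumed) decay rate per channel
    state = np.zeros(d)

    xs = np.random.default_rng(0).standard_normal((10, d))   # stand-in token features
    for x in xs:
        state = decay * state + (1.0 - decay) * x
        # fast-decaying channels track only the last few tokens;
        # slow-decaying channels retain information over long ranges
    print(state)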