Embeddings are a good starting point for the AI curious app developer
by bryantwolf
One straightforward way to get started is to understand embeddings without any AI/deep learning magic. Just pick a vocabulary of words (say, some 50k words), pick a unique index between 0 and 49,999 for each of the words, and then produce an embedding by adding +1 to the given index for a given word each time it occurs in a text. Then normalize the embedding so it adds up to one.
Presto -- embeddings! And you can use cosine similarity with them and all that good stuff and the results aren't totally terrible.
The rest of "embeddings" builds on top of this basic strategy (smaller vectors, filtering out words/tokens that occur frequently enough that they don't signify similarity, handling synonyms or words that are related to one another, etc. etc.). But stripping out the deep learning bits really does make it easier to understand.
Those would really just be identifiers. I think the key property of embeddings is that the dimensions each individually mean/measure something, and therefore the dot product of two embeddings (similarity of direction of the vectors) is a meaningful similarity measure of the things being represented.
The classic example is word embeddings such as word2vec or GloVe, where due to the embeddings being meaningful in this way, one can see vector relationships such as "man - woman" = "king - queen".
> I think the key property of embeddings is that the dimensions each individually mean/measure something, and therefore the dot product of two embeddings (similarity of direction of the vectors) is a meaningful similarity measure of the things being represented.
In this case each dimension is the presence of a word in a particular text. So when you take the dot product of two texts you are effectively counting the number of words the two texts have in common (subject to some normalization constants depending on how you normalize the embedding). Cosine similarity still works for even these super naive embeddings which makes it slightly easier to understand before getting into any mathy stuff.
You are 100% right this won't give you the word embedding analogies like king - man = queen or stuff like that. This embedding has no concept of relationships between words.
But that doesn't seem to be what you are describing in terms of using incrementing indices and adding occurrence counts.
If you want to create a bag of words text embedding then you set the number of embedding dimensions to the vocabulary size and the value of each dimension to the count of the corresponding word in the text.
Heh -- my explanation isn't the clearest I realize, but yes, it is BoW.
Eg fix your vocab of 50k words (or whatever) and enumerate it.
Then to make an embedding for some piece of text
1. Initialize an all-zero vector of size 50k.
2. For each word in the text, add one at the index of the corresponding word (per our enumeration). If the word isn't in the 50k words in your vocabulary, discard it.
3. (Optionally) normalize the embedding to 1 (though you don't really need this and can leave it off for the toy example).
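A minimal Python sketch of steps 1-3 above (the toy vocabulary and whitespace tokenization are placeholder choices, purely illustrative):

import numpy as np

# toy vocabulary; a real one would be ~50k words
vocab = ["cat", "dog", "sat", "on", "the", "mat", "rug"]
index = {word: i for i, word in enumerate(vocab)}

def embed(text):
    # bag-of-words embedding: one dimension per vocabulary word
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in index:  # words outside the vocab are discarded
            vec[index[word]] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

a = embed("the cat sat on the mat")
b = embed("the dog sat on the rug")
print(np.dot(a, b))  # cosine similarity, since both vectors are unit length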
This is not the best way to understand where modern embeddings are coming from.
True, but what is the best way?
Are you talking about sentence/text chunk embeddings, or just embeddings in general?
If you need high quality text embeddings (e.g. to use with a vector DB for text chunk retrieval), then they are going to come from the output of a language model, either a local one or via an embeddings API.
Other embeddings are normally going to be learnt in end-to-end fashion.
I disagree. In most subjects, recapitulating the historical development of a thing helps motivate modern developments. Eg
1. Start with bag of words. Represent each word as a vector that is all zeros except for a single one at that word's index. Then a document is the sum (or average) of all the words in the document. We now have a method of embedding a variable length piece of text into a fixed size vector, and we start to see how "similar" is approximately "close", though clearly there are some issues. We're somewhere at the start of NLP now.
2. One big issue is that there are a lot of common, noisy words (like "good", "think", "said", etc.) that can make two embeddings more similar than we feel they should be. So now we develop strategies for reducing the impact of those words on our vector. Remember how we just summed up the individual word vectors in 1? Now we'll scale each word vector by how frequent the word is in our corpus, so that the more frequent the word, the smaller we make the corresponding word vector. That brings us to tf-idf embeddings.
3. Another big issue is that our representation of words doesn't capture word similarity at all. The sentences "I hate when it rains" and "I dislike when it rains" should be more similar than "I hate when it rains" and "I like when it rains", but with our embeddings from (2) the two pairs are going to score about the same. So now we revisit our method of constructing word vectors and start to explore ways to "smear" words out. This is where things like word2vec and GloVe pop up as methods of creating distributed representations of words. Now we can represent documents by summing/averaging/tf-idf-weighting our word vectors the same as we did in 2 (see the sketch after this list).
4. Now we notice there is an issue where words can have multiple meanings depending on their surrounding context. Think of things like irony, metaphor, humor, etc. Consider "She rolled her eyes and said, 'Don't you love it here?'" and "She rolled the dough and said, 'Don't you love it here?'". Odds are, the similarity per (3) is going to be pretty high, despite the fact that it's clear these are wildly different meanings. The issue is that our model in (3) just uses a static operation for combining our words, and because of that we aren't capturing the fact that "Don't you love it here" shouldn't mean the same thing in the first and second sentences. So now we start to consider ways in which we can combine our word vectors differently and let the context affect the way in which we combine them.
5. And that brings us to now, where we have a lot more compute than we did before and access to way bigger corpora so we can do some really interesting things, but it's all still the basic steps of breaking down text into its constituent parts, representing those numerically, and then defining a method to combine the various parts to produce a final representation for a document. The above steps greatly help by showing the motivation for each change and understanding why we do the things we do today.
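A rough sketch of the step-3 idea, averaging pretrained word vectors into a document embedding; the gensim model name and toy sentences are my own choices, not something from this thread:

import numpy as np
import gensim.downloader as api

# pretrained 50-dimensional GloVe vectors, downloaded on first use (~65MB)
wv = api.load("glove-wiki-gigaword-50")

def doc_embedding(text):
    # average the word vectors of all in-vocabulary tokens
    vectors = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(doc_embedding("i hate when it rains"),
             doc_embedding("i dislike when it rains")))
print(cosine(doc_embedding("i hate when it rains"),
             doc_embedding("i like when it rains")))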
This explanation is better, because it puts things into perspective, but you don't seem to realize that your 1 and 2 are almost trivial compared to 3 and 4. At the heart of it are "methods of creating distributed representation of words", that's where the magic happens. So I'd focus on helping people understand those methods. Should probably also mention subword embedding methods like BPE, since that's what everyone uses today.
I've noticed that many educators make this mistake: they spend a lot of time explaining very basic, trivial things, then rush over difficult-to-grasp concepts or details.
> We now have a method of embedding a variable length piece of text into a fixed size vector
Question: Is it a rule that the embedding vector must be higher dimensional than the source text? Ideally 1 token -> a 1000+ length vector? The reason I ask is because it seems like it would lose value as a mechanism if I sent in a 1000 character long string and only got say a 4-length vector embedding for it. Because only 4 metrics/features can't possibly describe such a complex statement, I thought it was necessary that the dimensionality of the embedding be higher than the source?
No. Number of characters in a word has nothing to do with dimensionality of that word’s embedding.
GPT4 should be able to explain why.
They're not, I get why you think that though.
They're making a vector for a text that's the term frequencies in the document.
It's one step simpler than tfidf which is a great starting point.
OK, sounds counter-intuitive, but I'll take your word for it!
It seems odd since the basis of word similarity captured in this type of way is that word meanings are associated with local context, which doesn't seem related to these global occurrence counts.
Perhaps it works because two words with similar occurrence counts are more likely to often appear close to each other than two words where one has a high count, and another a small count? But this wouldn't seem to work for small counts, and anyways the counts are just being added to the base index rather than making similar-count words closer in the embedding space.
Do you have any explanation for why this captures any similarity in meaning?
> rather than making similar-count words closer in the embedding space.
Ah I think I see the confusion here. They are describing creating an embedding of a document or piece of text. At the base, the embedding of a single word would just be a single 1. There is absolutely no help with word similarity.
The problem of multiple meanings isn't solved by this approach at all, at least not directly.
Talking about the "gravity of a situation" in a political piece makes the text a bit more similar to physics discussions about gravity. But most of the words won't match as well, so your document vector is still more similar to other political pieces than physics.
Going up the scale, here's a few basic starting points that were (are?) the backbone of many production text AI/ML systems.
1. Bag of words. Here your vector has a 1 for words that are present, and 0 for ones that aren't.
2. Bag of words with a count. A little better, now we've got the information that you said "gravity" fifty times not once. Normalise it so text length doesn't matter and everything fits into 0-1.
3. TF-IDF. It's not very useful to know that you said a common word a lot. Most texts do; what we care about is ones that say it more than you'd expect, so we take into account how often the words appear in the entire corpus.
These don't help with word meaning, but given how simple they are, they're shockingly useful (a quick sketch of all three follows below). They have their stupid moments, although one benefit is that it's very easy to debug why they cause a problem.
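A quick scikit-learn sketch of the three variants above (my own illustration; the example documents are made up):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the gravity of the situation in parliament",
    "the gravity of a black hole",
    "parliament debated the situation",
]

variants = [
    ("1. bag of words", CountVectorizer(binary=True)),  # 1/0 presence
    ("2. counts", CountVectorizer()),                    # raw term counts
    ("3. tf-idf", TfidfVectorizer()),                    # counts reweighted by corpus rarity
]
for name, vectorizer in variants:
    X = vectorizer.fit_transform(docs)
    print(name)
    print(cosine_similarity(X).round(2))  # 3x3 document similarity matrix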
Are you saying it's pure chance that operations like "man - woman" = "king - queen" (and many, many other similar relationships and analogies) work?
If not please explain this comment to those of us ignorant in these matters :)
It’s not pure chance that the above calculus shakes out, but it doesn’t have to be that way. If you are embedding on a word by word level then it can happen, if it’s a little smaller or larger than word by word it’s not immediately clear what the calculation is doing.
But the main difference here is you get 1 embedding for the document in question, not an embedding per word like word2vec. So it’s something more like “document about OS/2 warp” - “wiki page for ibm” + “wiki page for Microsoft” = “document on windows 3.1”
3Blue1Brown has some other examples in his videos about transformers; most notable, I think, is that Hitler - Germany + Italy ≈ Mussolini!
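For the curious, analogies like these are a one-liner with pretrained word vectors in gensim (a rough sketch; the model is downloaded on first use and results vary by model):

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # pretrained 100-dimensional GloVe vectors

# "king - man + woman": nearest neighbors of the resulting vector
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))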
I'm trying to understand this approach. Maybe I am expecting too much out of this basic approach, but how does this create a similarity between words with indices close to each other? Wouldn't it just be a popularity contest - the more common words have higher indices and vice versa? For instance, "king" and "prince" wouldn't necessarily have similar indices, but they are semantically very similar.
You are expecting too much out of this basic approach. The "simple" similarity search in word2vec (used in https://semantle.com/ if you haven't seen it) is based on _multiple_ embeddings like this one (it's a simple neural network not a simple embedding).
This is a simple example where it scores their frequency. If you scored every word by their frequency only you might have embeddings like this:
act: [0.1]
as: [0.4]
at: [0.3]
...
That's a very simple 1D embedding, and like you said it would only give you popularity. But say you wanted other stuff like: vulgarity, prevalence over time, whether it's slang or not, how likely it is to start or end a sentence, etc. Then you would need more than one number. In text-embedding-ada-002 there are 1536 numbers in the array (vector), so it's like: act: [0.1, 0.1, 0.3, 0.0001, 0.000003, 0.003, ... (1536 items)]
...
The numbers don't mean anything in-and-of-themselves. The values don't represent qualities of the words, they're just numbers in relation to others in the training data. They're different numbers in different training data because all the words are scored in relation to each other, like a graph. So when you compute them you arrive at words and meanings in the training data as you would arrive at a point in a coordinate space if you subtracted one [x,y,z] from another [x,y,z] in 3D.

So the rage about a vector db is that it's a database for arrays of numbers (vectors) designed for computing them against each other, optimized for that instead of, say, a SQL or NoSQL database, which are all about retrieval etc.
So king vs prince etc. - When you take into account the 1536 numbers, you can imagine how compared to other words in training data they would actually be similar, always used in the same kinds of ways, and are indeed semantically similar - you'd be able to "arrive" at that fact, and arrive at antonyms, synonyms, their French alternatives, etc., but the system doesn't "know" that stuff. Throw in Burger King training data and talk about French fries a lot, though, and you'd mess up the embeddings when it comes to arriving at the French version of a king! You might get "pomme de terre".
King doesn’t need to appear commonly with prince. It just needs to appear in the same context as prince.
It also leaves out the old “tf idf” normalization of considering how common a word is broadly (less interesting) vs in that particular document. Kind of like a shittier attention. Used to make a big difference.
It doesn't even work as described for popularity - one word starts at 49,999 and one starts at 0.
Yeah, that is a poorly written description. I think they meant that each word gets a unique index location into an array, and the value at that word's index location is incremented whenever the word occurs.
It's a document embedding, not a word embedding.
Maybe the idea is to order your vocabulary into some kind of “semantic rainbow”? Like a one-dimensional embedding?
Is that really an embedding? I normally think of an embedding as an approximate lower-dimensional matrix of coefficients that operate on a reduced set of composite variables that map the data from a nonlinear to linear space.
You're right that what I described isn't what people commonly think about as embeddings (given that we are more advanced now than the above description), but broadly an embedding is anything (in NLP at least) that maps text into a fixed-length vector. When you make embeddings like this, the nice thing is that cosine similarity has an easy-to-understand meaning: count the number of words two documents have in common (subject to some normalization constant).
Most fancy modern embedding strategies basically start with this and then proceed to build on top of it to reduce dimensions, represent words as vectors in their own right, pass this into some neural layer, etc.
A lot of people here are trying to describe to you that no, this is not at all the starting point of modern embeddings. This has none of the properties of embeddings.
What you're describing is an idea from the 90s that was a dead end. Bag of words representations.
It has no relationship to modern methods. It's based on totally different theory (bow instead of the distributional hypothesis).
There is no conceptual or practical path from what you describe to what modern embeddings are. It's horribly misleading.
> There is no conceptual or practical path from what you describe to what modern embeddings are.
There certainly is. At least there is a strong relation between bag of word representations and methods like word2vec. I am sure you know all of this, but I think it's worth expanding a bit on this, since the top-level comment describes things in a rather confusing way.
In traditional Information Retrieval, two kinds of vectors were typically used: document vectors and term vectors. If you make a |D| x |T| matrix (where |D| is the number of documents and |T| is the number of terms that occur in all documents), we can go through a corpus and note, in each |T|-length row for a particular document, the frequency of each term in that document (frequency here means the raw counts or something like TF-IDF). Each row is a document vector, each column a term vector. The cosine similarity between two document vectors will tell you whether two documents are similar, because similar documents are likely to have similar terms. The cosine similarity between two term vectors will tell you whether two terms are similar, because similar terms tend to occur in similar documents. The top-level comment seems to have explained document vectors in a clumsy way.
Over time (we are talking 70s-90s), people found that term vectors did not really work well, because documents are often too coarse-grained as context. So, term vectors came to be defined via |T| x |T| matrices where, if you have such a matrix C, C[i][j] contains how often the j-th term occurs in the context of the i-th term. Since this type of matrix is not bound to documents, you can choose the context size based on the goals you have in mind. For instance, you could only count terms that are within a distance of 10 tokens of the occurrences of term i.
One refinement is that rather than raw frequencies, we can use some other measure. One issue with raw frequencies is that a frequent word like "the" will co-occur with pretty much every word, so its frequency in the term vector is not particularly informative, but its large frequency will have an outsized influence on e.g. dot products. So, people would typically use pointwise mutual information (PMI) instead. It's beyond the scope of a comment to explain PMI, but intuitively you can think of the PMI of two words to mean: how much more often do the words co-occur than chance? This will result in low PMIs for e.g. PMI(information, the) but a high PMI for PMI(information, retrieval). Then it's also common practice to replace negative PMI values by zero, which leads to PPMI (positive PMI).
So, what do we have now? A |T|x|T| matrix with PPMI scores, where each row (or column) can be used as a word vector. However, it's a bit unwieldy, because the vectors are large (|T|) and typically somewhat sparse. So people started to apply dimensionality reduction, e.g. by applying Singular Value Decomposition (SVD, I'll skip the details here of how to use it for dimensionality reduction). So, suppose that we use SVD to reduce the vector dimensionality to 300, we are left with a |T|x300 matrix and we finally have dense vectors, similar to e.g. word2vec.
Now, the interesting thing is that people have found that word2vec's skipgram with negative sampling (SGNS) is implicitly factorizing a PMI-based word-context matrix [1], exactly like the IR folks were doing before. Conversely, if you matrix-multiply the word and context embedding matrices that come out of word2vec SGNS, you get an approximation of the |T|x|T| PMI matrix (or |T|x|C| if a different vocab is used for the context).
Summarized, there is a strong conceptual relation between bag-of-word representations of old days and word2vec.
Whether it's an interesting route didactically for understanding embeddings is up for debate. It's not like the mathematics behind word2vec are complex (understanding the dot product and the logistic function goes a long way) and understanding word2vec in terms of 'neural net building blocks' makes it easier to go from word2vec to modern architectures. But in an exhaustive course about word representations, it certainly makes sense to link word embeddings to prior work in IR.
[1] https://proceedings.neurips.cc/paper_files/paper/2014/file/f...
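A compact numpy sketch of the PPMI + SVD pipeline described above (the toy corpus, window size, and dimensionality are mine, purely illustrative):

import numpy as np

corpus = [
    "the king spoke to the queen",
    "the queen spoke to the king",
    "information retrieval is about documents",
    "documents contain information",
]
tokens = [doc.split() for doc in corpus]
vocab = sorted({w for doc in tokens for w in doc})
idx = {w: i for i, w in enumerate(vocab)}

# term-term co-occurrence counts within a +/-2 token window
C = np.zeros((len(vocab), len(vocab)))
for doc in tokens:
    for i, w in enumerate(doc):
        for j in range(max(0, i - 2), min(len(doc), i + 3)):
            if i != j:
                C[idx[w], idx[doc[j]]] += 1

# positive PMI: how much more often do two words co-occur than chance?
total = C.sum()
p_w = C.sum(axis=1, keepdims=True) / total
p_c = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (p_w * p_c))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# dense word vectors via truncated SVD of the PPMI matrix
k = min(10, len(vocab))
U, S, _ = np.linalg.svd(ppmi)
word_vectors = U[:, :k] * S[:k]
print(word_vectors[idx["king"]] @ word_vectors[idx["queen"]])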
Eh, I disagree. When I began working in ML everything was about word2vec and glove and the state of the art for embedding documents was adding together all the word embeddings and it made no sense to me but it worked.
Learning about BoW and simple ways of converting text to fixed-length vectors that can be used in ML algos clarified a whole lot for me, especially the fact that embeddings aren't magic, they're just a way to convert text to a fixed-length vector.
BoW and tf-idf vectors are still workhorses for routine text classification tasks despite their limitations, so they aren't really a dead end. Similarly, a lot of things that follow BoW make a whole lot more sense if you think of them as addressing limitations of BoW.
Well, you've fooled yourself into thinking you understand something when you don't. I say this as someone with a PhD in the topic, who has taught many students, and published dozens of papers in the space.
The operation of adding BoW vectors together has nothing to do with the operation of adding together word embeddings. Well, aside from both nominally being addition.
It's like saying you understand what's happening because you can add velocity vectors and then you go on to add the binary vectors that represent two binary programs and expect the result to give you a program with the average behavior of both. Obviously that doesn't happen, you get a nonsense binary.
They may both be arrays of numbers but mathematically there's no relationship between the two. Thinking that there's a relationship between them leads to countless nonsense conclusions: the idea that you can keep adding word embeddings to create document embeddings like you keep adding BoWs, the notion that average BoWs mean the same thing as average word embeddings, the notion that normalizing BoWs is the same as normalizing word embeddings and will lead to the same kind of search results, etc. The errors you get with BoWs are totally different from the errors you get with word or sentence or document embeddings. And how you fix those errors is totally different.
No. Nothing at all makes sense about word embeddings from the point of BoW.
Also, yes BoW is a total dead end. They have been completely supplanted. There's never any case where someone should use them.
> as someone with a PhD in the topic, who has taught many students, and published dozens of papers
:joy: How about a repo I can run that proves you know what you're talking about? :D Just put together examples and you won't have to throw around authority fallacies.
> if you add 2 vectors together nothing happens and its meaningless
In game dev you move a character in a 3D space by taking their current [x, y, z] vector and adding a different [x, y, z] to it. Even though left/right has nothing to do with up/down, because there are foundational concepts like increase/decrease in relation to a [0, 0, 0] origin, it still affects all the axes and gives you a valuable result.
Take that same basic idea and apply it to text-embedding-ada-002 with its 1536 dimensional embedding arrays, and you can similarly "navigate words" by doing similar math on different vectors. That's what is meant by the king - man = queen type concepts.
I think what the person before you meant about it not making sense is that it's strange that something like a dot product gives you similarity. To your point (I think) it only would assuming the vectors were scored in a meaningful way, of course if you did math on nonsense vectors you'd get nonsense results, but if they are modeled appropriately it should be a useful way to find nodes of certain qualities.
Aren't you just describing a bag-of-words model?
Yes! And the follow up that cosine similarity (for BoW) is a super simple similarity metric based on counting up the number of words the two vectors have in common.
How does this enable cosine similarity usage? I don't get the link between incrementing a word's index by its count in a text and how this ends up giving words that have similar meanings a high cosine similarity value.
I think they are talking about bag-of-words. If you apply a dimensionality reduction technique like SVD or even random projection on bag-of-words, you can effectively create a basic embedding. Check out latent semantic indexing / latent semantic analysis.
You're right, that approach doesn't enable getting embeddings for an individual word. But it would work for comparing similarity of documents - not that well of course, but it's a toy example that might feel more intuitive
I think that strips away way too much. What you describe is “counting words”. It produces 50,000-dimensional vectors (most of them zero for the vast majority of texts) for each text, so it’s not a proper embedding.
What makes embeddings useful is that they do dimensionality reduction (https://en.wikipedia.org/wiki/Dimensionality_reduction) while keeping enough information to keep dissimilar texts away from each other.
I also doubt your claim “and the results aren't totally terrible”. In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc (https://en.wikipedia.org/wiki/Most_common_words_in_English)
A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.
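Roughly what that reduction looks like with scikit-learn (an LSA-style sketch using TruncatedSVD; the corpus, vocabulary cap, and component count are placeholder choices):

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"]).data[:500]

tfidf = TfidfVectorizer(stop_words="english", max_features=50_000).fit_transform(docs)
reduced = TruncatedSVD(n_components=100).fit_transform(tfidf)  # tens of thousands of dims down to 100

print(cosine_similarity(reduced[:1], reduced[1:6]))  # document 0 vs the next five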
> I think that strips away way too much. What you describe is “counting words”. It produces 50,000-dimensional vectors (most of them zero for the vast majority of texts) for each text, so it’s not a proper embedding.
You can simplify this with a map and only store non-zero values, but you can also be inefficient: this is for learning. You can choose to store more valuable information than just word count. You can store any "feature" you want - various tags on a post, cohort topics for advertising, bucketed timestamps, etc.
For learning just storing word count gives you the mechanics you need of understanding vectors without actually involving neural networks and weights.
> I also doubt your claim “and the results aren't totally terrible”.
> In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc
(1) the comment suggested filtering out these words, and (2) the results aren't terrible. This is literally the first assignment in Stanford's AI class [1], and the results aren't terrible.
> A slightly better simple view of how embeddings can work in search is by using principal component analysis. If you take a corpus, compute TF-IDF vectors (https://en.wikipedia.org/wiki/Tf–idf) for all texts in it, then compute the n ≪ 50,000 top principal components of the set of vectors and then project each of your 50,000-dimensional vectors on those n vectors, you’ve done the dimension reduction and still, hopefully, are keeping similar texts close together and distinct texts far apart from each other.
Wow that seems a lot more complicated for something that was supposed to be a learning exercise.
[1] https://stanford-cs221.github.io/autumn2023/assignments/sent...
>> I also doubt your claim “and the results aren't totally terrible”.
>> In most texts, the dimensions with highest values will be for very common words such as “a”, “be”, etc
> (1) the comment suggested filtering out these words,
It mentions that as a possible improvement over the version whose results it claims aren’t terrible.
Embeddings must be trained, otherwise they don't have any meaning, and are just random numbers.
Really appreciate you explaining this idea, I want to try this! It wasn't clear to me until I read the discussion that you meant that you'd have similarity of entire documents, not among words.
Yes! And that’s an oversight on my part — word embeddings are interesting but I usually deal with documents when doing nlp work and only deal with word embeddings when thinking about how to combine them into a document embedding.
Give it a shot! I’d grab a corpus like https://scikit-learn.org/stable/datasets/real_world.html#the... to play with and see what you get. It’s not going to be amazing, but it’s a great way to build some baseline intuition for nlp work with text that you can do on a laptop.
Without getting into any big debates about whether or not RAG is medium-term interesting or whatever, you can ‘pip install sentence-transformers faiss’ and just immediately start having fun. I recommend using straightforward cosine similarity to just crush the NYT’s recommender as a fun project for two reasons: there’s an API and plenty of corpus, and it’s like, whoa, that’s better than the New York Times.
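A minimal sketch of that kind of setup (the model name and the toy "articles" are stand-ins, not the commenter's actual project):

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
articles = [
    "Opinion: the Fed should cut rates sooner rather than later",
    "Recipe: a weeknight pasta with anchovies and lemon",
    "Review: the new sci-fi epic is three hours of spectacle",
]

# normalize so that inner product == cosine similarity
embeddings = model.encode(articles, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

query = model.encode(["what should I cook tonight?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), 2)
print([articles[i] for i in ids[0]], scores[0])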
He’s trying to sell a SaaS product (Pinecone), but he’s doing it the right way: it’s OK to be an influencer if you know what you’re talking about.
James Briggs has great stuff on this: https://youtube.com/@jamesbriggs
> crush the NYT’s recommender as a fun project for two reasons
Could you share what recommender you're referring to here, and how you can evaluate "crushing" it?
Sounds fun!
One of the challenges here is handling homonyms. If I search in the app for "king", most of the top ten results are "ruler" icons - showing a measuring stick. Rodent returns mostly computer mice, etc.
https://www.v0.app/search?q=king
https://www.v0.app/search?q=rodent
This isn't a criticism of the app - I'd rather get a few funny mismatches in exchange for being able to find related icons. But it's an interesting puzzle to think about.
>> If I search in the app for "king", most of the top ten results are "ruler" icons
I believe that's the measure of a man.
Good call out! We think of this as a two part problem.
1. The intent of the user. Is it a description of the look of the icon or the utility of the icon?
2. How best to rank the results, which is a combination of intent, CTR of past search queries, bootstrapping popularity via usage on open source projects, etc.
- Charlie of v0.app
This is imo the worst part of embedding search.
Somehow Amazon continues to be the leader in muddy results which is a sign that it’s a huge problem domain and not easily fixable even if you have massive resources.
I don't seem to have this issue on any other webshop that uses normal keyword searches and always wondered what Amazon did to mess it up so much and why people use that website (also for other reasons, but search is definitely one of them: no way to search properly). The answer isn't always "massive resources" towards being more hi-tech
But, thanks, this explains a lot about Amazon's search results and might help me steer it if I need to use it in the future :)
I was reading this article and thinking about things like, in the case of doing transcription, if you heard the spoken word “sign” in isolation you couldn’t be sure whether it meant road sign, spiritual sign, +/- sign, or even the sine function. This seems like a similar problem where you pretty much require context to make a good guess, otherwise the best it could do is go off of how many times the word appears in the dataset right? Is there something smarter it could do?
Wouldn’t it help to provide affordances guiding the user to submit a question rather than a keyword? Then, “Why are kings selected by primogeniture?” probably wouldn’t be near passages about measuring sticks in the embedding space. (Of course, this idea doesn’t work for icon search.)
Only if you have an attention mechanism.
I think this is the point of the Attention portion in an LLM: to use context to skew the embedding result closer to what you're looking for.
It does seem a little strange 'ruler' would be closer to 'king' versus something like 'crown'.
Yeah, these can be cute, but they're not ideal. I think the user feedback mechanism could help naturally align this over time, but it would also be gameable. It's all interesting stuff
As the op, you can do both semantic search (embedding) and keyword search. Some RAG techniques call out using both for better results. Nice product by the way!
Hybrid searches are great, though I'm not sure they would help here. Neither 'crown' nor 'ruler' would come back from a text search for 'king,' right?
I bet if we put a better description into the embedding for 'ruler,' we'd avoid this. Something like "a straight strip or cylinder of plastic, wood, metal, or other rigid material, typically marked at regular intervals, to draw straight lines or measure distances." (stolen from a Google search). We might be able to ask a language model to look at the icon and give a good description we can put into the embedding.
Given

> not because they’re sufficiently advanced technology indistinguishable from magic, but the opposite. Unlike LLMs, working with embeddings feels like regular deterministic code.

I was hoping for a bit more under “Creating embeddings” than:

> They’re a bit of a black box ... Next, we chose an embedding model. OpenAI’s embedding models will probably work just fine.
Same here. I was saving the article for when I have a few hours to really dive into it, build upon it, learn from seeing and doing. Imagine my disappointment when I had the evening cleared, started reading, and discovered all they're showing is how to concatenate a string, download someone else's black-box model which outputs the similarity between the user's query and the concatenated info about each object, and then write queries on the output.
It's good to know you can do this performantly on your own system, but if the article had started out with "look, this model can output similarity between two texts and we can make a search engine with that", that'd be much more up front about what to expect to learn from it
Edit: another comment mentioned you can't even run it yourself, you need to ask ClosedAI for every search query a user does on your website. WTF is this article, at that point you might as well pipe the query into general-purpose chatgpt which everyone already knows and let that sort it out
I agree. The article was useful insofar as it detailed the steps they took to solve their problem clearly, and it's easy to see that many common problems are similar and could therefore be solved similarly, but I went in expecting more insight. How are the strings turned into arrays of numbers? Why does turning them into numbers that way lead to these nice properties?
It begs the question though, doesn't it...? Embeddings require a neural network or some reasonable facsimile to produce the embedding in the first place. Compression to a vector (a semantic space of some sort) still needs to happen – and that's the crux of the understanding/meaning. To just say "embeddings are cool let's use them" is ignoring the core problem of semantics/meaning/information-in-context etc. Knowing where an embedding came from is pretty damn important.
Embeddings live a very biased existence. They are the product of a network (or some algorithm) that was trained (or built) with specific data (and/or code) and assume particular biases intrinsically (network structure/algorithm) or extrinsically (e.g., data used to train a network) which they impose on the translation of data into some n-dimensional space. Any engineered solution always lives with such limitations, but with the advent of more and more sophisticated methods for the generation of them, I feel like it's becoming more about the result than the process. This strikes me as problematic on a global scale... might be fine for local problems but could be not-so-great in an ever changing world.
Great project and excellent initiative to learn about embeddings. Two possible avenues to explore more. Your system backend could be thought of as being composed of two parts: Icons -> Embedder -> PGVector -> Retriever -> Display Result
1. In the embedder part trying out different embedding models and/or vector dimensions to explore if the Recall@K & Precision@K for your data set (icons) improves. Models make a surprising amount of difference to the quality of the results. Try the MTEB Leaderboard for ideas on which models to explore.
2. In the Information Retriever part you can try a couple of approaches:
a. After you retrieve from PGVector, see if you can use a reranker like Cohere to get better results: https://cohere.com/blog/rerank
b. You could try a "fusion ranking" similar to the one you do, but structured such that 50% of the weight is for a plain old keyword search in the metadata and 50% is for the embedding-based search (a toy sketch of the weighting follows below).
Finally something more interesting to noodle on - what if the embeddings were based on the icon images and the model knew how to search for a textual descriptions in the latent space?
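For 2b, a toy sketch of a 50/50 score fusion; the icon names and scores are made up, and a real system would normalize each score list (min-max or rank-based) before blending:

def fused_score(keyword_score, embedding_score, keyword_weight=0.5):
    # blend a keyword-match score with a cosine-similarity score,
    # assuming both are already scaled to [0, 1]
    return keyword_weight * keyword_score + (1 - keyword_weight) * embedding_score

# hypothetical candidates for the query "king": (icon, keyword score, embedding score)
candidates = [("crown", 0.9, 0.70), ("ruler", 0.0, 0.80), ("chess-king", 0.6, 0.75)]
ranked = sorted(candidates, key=lambda c: fused_score(c[1], c[2]), reverse=True)
print(ranked)  # "crown" now outranks "ruler"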
One of my biggest annoyances with the modern AI tooling hype is that you need to use a vector store for just working with embeddings. You don't.
The reason vector stores are important for production use cases is mostly latency-related for larger sets of data (100k+ records), but if you're working on a toy project just learning how to use embeddings, you can compute cosine distance with a couple lines of numpy by taking the dot product of a normalized query vector with a matrix of normalized records.
Best of all, it gives you a reason to use Python's @ operator, which with numpy matrices does a dot product.
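Something like this minimal sketch (the shapes and random data are just illustrative):

import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

records = normalize(np.random.rand(100_000, 1536))  # your stored embeddings
query = normalize(np.random.rand(1536))

scores = records @ query                # cosine similarities, since rows are unit length
top10 = np.argsort(scores)[::-1][:10]   # indices of the 10 nearest records
print(top10, scores[top10])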
100k records is still pretty small!
It feels a bit like the hype that happened with "big data". People ended up creating Spark clusters to query a few million records, or using Hadoop for a dataset you could process with awk.
Professionally I've only ever worked with dataset sizes in the region of low millions and have never needed specialist tooling to cope.
I assume these tools do serve a purpose but perhaps one that only kicks in at a scale approaching billions.
I've been in the "mid-sized" area a lot where Numpy etc cannot handle it, so I had to go to Postgres or more specialized tooling like Spark. But I always started with the simple thing and only moved up if it didn't suffice.
Similarly, I read how Postgres won't scale for a backend application and I should use Citus, Spanner, or some NoSQL thing. But that day has not yet arrived.
Right on: I've used a single Postgres database on AWS to handle 1M+ concurrent users. If you're Google, sure, not gonna cut it, but for most people these things scale vertically a lot further than you'd expect (especially if, like me, you grew up in the pre-SSD days and couldn't get hundreds of gigs of RAM on a cloud instance).
Even when you do pass that point, you can often shard to achieve horizontal scalability to at least some degree, since the real heavy lifting is usually easy to break out on a per-user basis. Some apps won't permit that (if you've got cross-user joins then it's going to be a bit of a headache), but at that point you've at least earned the right to start building up a more complex stack and complicating your queries to let things grow horizontally.
Horizontal scaling is a huge headache, any way you cut it, and TBH going with something like Spanner is just as much of a headache because you have to understand its limitations extremely well if you want it to scale. It doesn't just magically make all your SQL infinitely scalable, things that are hard to shard are typically also hard to make fast on Spanner. What it's really good at is taking an app with huge traffic where a) all the hot queries would be easy to shard, but b) you don't want the complexity of adding sharding logic (+re-sharding, migration, failure handling, etc), and c) the tough to shard queries are low frequency enough that you don't really care if they're slow (I guess also d) you don't care that it's hella expensive compared to a normal Postgres or MySQL box). You still need to understand a lot more than when using a normal DB, but it can add a lot of value in those cases.
I can't even say whether or not Google benefits from Spanner, vs multiple Postgres DBs with application-level sharding. Reworking your systems to work with a horizontally-scaling DB is eerily similar to doing application-level sharding, and just because something is huge doesn't mean it's better with DB-level sharding.
The unique nice thing in Spanner is TrueTime, which enables the closest semblance of a multi-master DB by making an atomic clock the ground truth (see Generals' Problem). So you essentially don't have to worry about a regional failure causing unavailability (or inconsistency if you choose it) for one DB, since those clocks are a lot more reliable than machines. But there are probably downsides.
Numpy might not be able to handle a full O(n^2) comparison of vectors, but you can use a lib with HNSW and it can have great performance on medium (and large) datasets.
If I remember correctly, the simplest thing Numpy choked on was large sparse matrix multiplication. There are also other things like text search that Numpy won't help with and you don't want to do in Python if it's a large set.
This sentiment is pretty common I guess. Outside of a niche, the massive scale for which a vast majority of the data tech was designed doesn't exist and KISS wins outright. Though I guess that's evolution, we want to test the limits in pursuit of grandeur before mastering the utility (ex. pyramids).
KISS doesn't get me employed though. I narrowly missed being the chosen candidate for a State job which called for Apache Spark experience. I missed two questions relating to Spark and "what is a parquet file?" but otherwise did great on the remaining behavioral questions (the hiring manager gave me feedback after requesting it). Too bad they did not have a question about processing data using command lines tools.
yeah, glad the hype around big data is dead. Not a lot of solid numbers in here, but this post covers it well[0].
We have duckdb embedded in our product[1] and it works perfectly well for billions of rows of a data without the hadoop overhead.
When I'm messing around, I normally have everything in a Pandas DataFrame already so I just add embeddings as a column and calculate cosine similarity on the fly. Even with a hundred thousand rows, it's fast enough to calculate before I can even move my eyes down on the screen to read the output.
I regret ever messing around with Pinecone for my tiny and infrequently used set ups.
Could not agree more. For some reason Pandas seems to get phased out as developers advance.
Actually, I had a pretty good experience with Pinecone.
Yup. I was just playing around with this in Javascript yesterday and with ChatGPT's help it was surprisingly simple to go from text => embedding (via `openai.embeddings.create`) and then to compare the embedding similarity with the cosine distance (which ChatGPT wrote for me): https://gist.github.com/christiangenco/3e23925885e3127f2c177...
Seems like the next standard feature in every app is going to be natural language search powered by embeddings.
For posterity, OpenAI embeddings come pre-normalized so you can immediately dot-product.
Most embeddings providers do normalization by default, and SentenceTransformers has a normalize_embeddings parameter which does that. (it's a wrapper around PyTorch's F.normalize)
As an individual, I love the idea of pushing to simplify even further to understand these core concepts. For the ecosystem, I like that vector stores make these features accessible to environments outside of Python.
If you ask ChatGPT to give you a cosine similarity function that works against two arrays of floating numbers in any programming language you'll get the code that you need.
Here's one in JavaScript (my prompt was "cosine similarity function for two javascript arrays of floating point numbers"):
function cosineSimilarity(vecA, vecB) {
if (vecA.length !== vecB.length) {
throw "Vectors do not have the same dimensions";
}
let dotProduct = 0.0;
let normA = 0.0;
let normB = 0.0;
for (let i = 0; i < vecA.length; i++) {
dotProduct += vecA[i] * vecB[i];
normA += vecA[i] ** 2;
normB += vecB[i] ** 2;
}
if (normA === 0 || normB === 0) {
throw "One of the vectors is zero, cannot compute similarity";
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
Vector stores really aren't necessary if you're dealing with less than a few hundred thousand vectors - load them up in a bunch of in-memory arrays and run a function like that against them using brute-force.

I get triggered that the norm vars aren't ever really the norms but their squares
I love it!
Even in production my guess is most teams would be better off just rolling their own embedding model (huggingface) + caching (redis/rocksdb) + FAISS (nearest neighbor) and be good to go. I suppose there is some expertise needed, but working with a vector database vendor has major drawbacks too.
Or you just shove it into Postgres + pg_vector and just use the DBMS you already use anyway.
Using Postgres with pgvector is trivial and cheap. It's also available on AWS RDS.
Also on supabase!
hnswlib, usearch. Both handle tens of millions of vectors easily. The latter even without holding them in RAM.
Does anyone know the provenance for when vectors started to be called embeddings?
In an NLP context, earliest I could find was ICML 2008:
http://machinelearning.org/archive/icml2008/papers/391.pdf
I'm sure there are earlier instances, though - the strict mathematical definition of embedding has surely been around for a lot longer.
(interestingly, the word2vec papers don't use the term either, so I guess it didn't enter "common" usage until the mid-late 2010s)
I think it was due to GloVe embeddings back then: I don't recall them ever being called GloVe vectors, although the "Ve" does stand for vector so it could have been RAS syndrome.
>> https://nlp.stanford.edu/projects/glove/
A quick scan of the project website yields zero uses of 'embedding' and 23 of 'vector'
It's how I remember it when I was working with them back in the day (word embeddings): I could be wrong.
Is there any easy way to run the embedding logic locally? Maybe even locally to the database? My understanding is that they’re hitting OpenAI’s API to get the embedding for each search query and then storing that in the database. I wouldn’t want my search function to be dependent on OpenAI if I could help it.
Support for _some_ embedding models works in Ollama (and llama.cpp - Bert models specifically)
ollama pull all-minilm
curl http://localhost:11434/api/embeddings -d '{
"model": "all-minilm",
"prompt": "Here is an article about llamas..."
}'
Embedding models run quite well even on CPU since they are smaller models. There are other implementations with a library form factor like transformers.js https://xenova.github.io/transformers.js/ and sentence-transformers https://pypi.org/project/sentence-transformers/

If you are building on the Supabase stack (Postgres as the DB with pgvector), we just released a built-in embedding generation API yesterday. It works both locally (on CPUs) and you can deploy it without any modifications.
Check this video on building Semantic Search in Supabase: https://youtu.be/w4Rr_1whU-U
Also, the blog on announcement with links to text versions of the tutorials: https://supabase.com/blog/ai-inference-now-available-in-supa...
So handy! I already got some embeddings working with supabase pgvector and OpenAI and it worked great.
What would the cost of running this be like compared to the OpenAI embedding api?
There are no extra costs other than the what we'd normally charge for Edge Function invocations (you get up to 500K in the free plan and 2M in the Pro plan)
neat! one thing i’d really love tooling for: supporting multi user apps where each has their own siloed data and embeddings. i find myself having to set up databases from scratch for all my clients, which results in a lot of repetitive work. i’d love to have the ability one day to easily add users to the same db and let them get to embedding without having to have any knowledge going in
This is possible in supabase. You can store all the data in a table and restrict access with Row Level Security
You also have various ways to separate the data for indexes/performance
- use metadata filtering first (e.g. filter by customer ID prior to running a semantic search). This is fast in Postgres since it's a relational DB
- pgvector supports partial indexes - create one per customer based on a customer ID column
- use table partitions
- use Foreign Data Wrappers (more involved but scales horizontally)
We provide this functionality in Lantern cloud via our Lantern Extras extension: <https://github.com/lanterndata/lantern_extras>
You can generate CLIP embeddings locally on the DB server via:
SELECT abstract,
introduction,
figure1,
clip_text(abstract) AS abstract_ai,
clip_text(introduction) AS introduction_ai,
clip_image(figure1) AS figure1_ai
INTO papers_augmented
FROM papers;
Then you can search for embeddings via:

SELECT abstract, introduction
FROM papers_augmented
ORDER BY clip_text(query) <=> abstract_ai
LIMIT 10;
The approach significantly decreases search latency and results in cleaner code.
As an added bonus, EXPLAIN ANALYZE can now tell you the percentage of time spent in embedding generation vs search.

The linked library enables embedding generation for a dozen open source models and proprietary APIs (list here: <https://lantern.dev/docs/develop/generate>), and adding new ones is really easy.
Lantern seems really cool! Interestingly we did try CLIP (openclip) image embeddings but the results were poor for 24px by 24px icons. Any ideas?
Charlie @ v0.app
I have tried CLIP on my personal photo album collection and it worked really well there - I could write detailed scene descriptions of past road trips, and the photos I had in mind would pop up. Probably the model is better for everyday photos than for icons
There are a bunch of embedding models you can run on your own machine. My LLM tool has plugins for some of those:
- https://llm.datasette.io/en/stable/plugins/directory.html#em...
Here's how to use them: https://simonwillison.net/2023/Sep/4/llm-embeddings/
Yes, I use fastembed-rs[1] in a project I'm working on and it runs flawlessly. You can store the embeddings in any boring database (it's just an array of f32s at the end of the day). But for fast vector math (which you need for similarity search), a vector database is recommended, e.g. the pgvector[2] postgres extension.
Fun timing!
I literally just published my first crate: candle_embed[1]
It uses Candle under the hood (the crate is more of a user friendly wrapper) and lets you use any model on HF like the new SoTA model from Snowflake[2].
[1] https://github.com/ShelbyJenkins/candle_embed [2] https://huggingface.co/Snowflake/snowflake-arctic-embed-l
The MTEB leaderboard has you covered. That is a goto for finding the leading embedding models and I believe many of them can run locally.
This is a good call out. OpenAI embeddings were simple to stand up, pretty good, cheap at this scale, and accessible to everyone. I think that makes them a good starting point for many people. That said, they're closed-source, and there are open-source embeddings you can run on your infrastructure to reduce external dependencies.
If you're building an iOS app, I've had success storing vectors in coredata and using a tiny coreml model that runs on device for embedding and then doing cosine similarity.
Open WebUI has langchain built-in and integrates perfectly with ollama. They have several variations of docker compose files on their github.
My problem with this is that it doesn't explain a lot.
You can manually make a vector for a word and then step-wise work up to the word2vec approach and then document embeddings. My post [1] does some of the first part and this great word2vec post [2] dives into it in more detail.
[1] https://earthly.dev/blog/cosine_similarity_text_embeddings/
For an article extolling the benefits of embeddings for developers looking to dip their toe into the waters of AI it's odd they don't actually have an intro to embeddings or to vector databases. They just assume the reader already knows these concepts and dives on in to how they use them.
Sure many do know these concepts already but they're probably not the people wondering about a 'good starting point for the AI curious app developer'.
I published this pretty comprehensive intro to embeddings last year: https://simonwillison.net/2023/Oct/23/embeddings/
I found many of your other posts and they were the spark that finally made me "get it" and look deeper into LLMs. This post looks like another slam dunk.
Keep up the good work!
To add to the other recommendations, here's a primer on vector DB's: https://www.pinecone.io/learn/vector-database/
Apologies!
Here's a good primer on embeddings from openai: https://platform.openai.com/docs/guides/embeddings
I learned how to use embeddings by building semantic search for the Bhagavad Gita. I simply saved the embeddings for all 700 verses into a big file which is stored in a Lambda function, and compared against incoming queries with a single query to OpenAI's embedding endpoint.
Shameless plug in case anyone wants to test it out - https://gita.pub
Really nice and beautiful site!
Thank you! :)
I have been saying similar things to my fellow technical writers ever since the ChatGPT explosion. We now have a tool that makes semantic search on arbitrary, diverse input much easier. Improved semantic search could make a lot of common technical writing workflows much more efficient. E.g. speeding up the mandatory research that you must do before it's even possible to write an effective doc.
Embeddings are indeed a good starting point. Next step is choosing the model and the database. The comments here have been taken over by database companies so I'm skeptical about the opinions. I wish MySQL had a cosine search feature built in
pg_vector has you covered
My smooth brain might not understand this properly, but the idea is we generate embeddings, store them, then use retrieval each time we want to use them.
For simple things we might not need to worry about storing much, we can generate the embeddings and just cache them or send them straight to retrieval as an array or something...
The storing of embeddings seems the hard part, do I need a special database or PG extension? Is there any reason I can't store them as a blobs in SQlite if I don't have THAT much data, and I don't care too much about speed? Do embeddings generated ever 'expire'?
You'd have to update the embedding every time the data used to generate it changes. For example, if you had an embedding for user profiles and they updated their bio, you would want to make a new embedding.
I don't expect to have to change the embeddings for each icon all that often, so storing them seemed like a good choice. However, you probably don't need to cache the embedding for each search query since there will be long-tail ones that don't change that much.
The reason to use pgvector over blobs is if you want to use the distance functions in your queries.
Yes, you can shove the embeddings in a BLOB, but then you can't do the kinds of query operations you expect to be able to do with embeddings.
You can run similarity scores with a custom SQLite function.
I use a Python one usually, but it's also possible to build a much faster one in C: https://simonwillison.net/2024/Mar/23/building-c-extensions-...
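For reference, registering a Python cosine-similarity function with SQLite looks roughly like this; the table and column names are made up for the sketch:

import json
import math
import sqlite3

def cosine_similarity(a_json, b_json):
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

db = sqlite3.connect("embeddings.db")
db.create_function("cosine_similarity", 2, cosine_similarity)

# hypothetical table: documents(id, embedding), embeddings stored as JSON arrays of floats
db.execute("CREATE TABLE IF NOT EXISTS documents (id INTEGER PRIMARY KEY, embedding TEXT)")

query_embedding = json.dumps([0.1] * 1536)  # stand-in for a real query embedding
rows = db.execute(
    "SELECT id, cosine_similarity(embedding, ?) AS score "
    "FROM documents ORDER BY score DESC LIMIT 10",
    (query_embedding,),
).fetchall()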
Right like you could use it sort of like cache and send the blobs to OpenAI to use their similarity API, but you couldn't really use SQL to do cosine similarity operations?
My understanding of what's going on at a technical level might be a bit limited.
Yes.
Although if you really wanted to, and normalized your data like a good little Edgar F. Codd devotee, you could write something like this:
SELECT SUM(v.dot) / (SQRT(SUM(v.v1 * v.v1)) * SQRT(SUM(v.v2 * v.v2)))
FROM (
  SELECT v1.dimension AS dim,
         v1.value AS v1,
         v2.value AS v2,
         v1.value * v2.value AS dot
  FROM vectors AS v1
  INNER JOIN vectors AS v2 ON v1.dimension = v2.dimension
  WHERE v1.vector_id = ? AND v2.vector_id = ?
) AS v;
This assumes one table called "vectors" with columns vector_id, dimension, and value; vector_id and dimension being primary. The inner query grabs two vectors as separate columns with some self-join trickery, computes the product of each component, and then the outer query computes aggregate functions on the inner query to do the actual cosine similarity.
No, I have not tested this on an actual database engine, so I may well have gotten the SQL slightly wrong. And obviously it's easier to just have a database (or Postgres extension) that recognizes vector data as a distinct data type and gives you a dedicated cosine-similarity function.
Thanks for the explanation! Appreciate that you took the time to give an example. Makes a lot more sense why we reach for specific tools for this.
A KV store is both good enough and highly performant. I use Redis for storing embeddings and expire them after a while. Unless you have a highly specialized use case, it's not economical to persistently store chunk embeddings.
Redis also has vector search capability. However, the most popular answer you'll get here is to use Postgres (pgvector).
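A small sketch of that caching pattern with redis-py; embed() here is a stand-in for whatever embedding model or API call you're actually using:

    import hashlib

    import numpy as np
    import redis

    r = redis.Redis()

    def embed(text):
        # Stand-in: replace with your embedding model or API call.
        return np.zeros(384, dtype=np.float32)

    def get_embedding(text, ttl_seconds=86400):
        key = "emb:" + hashlib.sha256(text.encode()).hexdigest()
        cached = r.get(key)
        if cached is not None:
            return np.frombuffer(cached, dtype=np.float32)
        vec = np.asarray(embed(text), dtype=np.float32)
        # Expire after a while rather than storing chunk embeddings forever.
        r.set(key, vec.tobytes(), ex=ttl_seconds)
        return vec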
Redis sounds like a good option. I like that it's not more infrastructure; I already have Redis set up for my app, so I'm not adding more to the stack.
Vector databases are used to store embeddings.
But why is that? I’m sure it’s the ‘best’ way to do things, but it also means more infrastructure which for simple apps isn’t worth the hassle.
I should use redis for queues but often I’ll just use a table in a SQLite database. For small scale projects I find it works fine, I’m wondering what an equivalent simple option for embeddings would be.
Perhaps sqlite-vss? It adds vector search to SQLite.
check out https://github.com/tembo-io/pg_vectorize - we're taking it a little bit beyond just the storage and index. The project uses pgvector for the indices and distance operators, but also adds a simpler API, hooks into pre-trained embedding models, and helps you keep embeddings updated as data changes/grows
Re storing vectors in BLOB columns: ya, if it's not a lot of data and it's fast enough for you, then there's no problem doing it like that. I'd even just store them in JSON/npy files first and see how long you can get away with it. Once that gets too slow, then try SQLite/Redis/Valkey, and when that gets too slow, look into pgvector or other vector database solutions.
For SQLite specifically, very large BLOB columns might affect query performance, especially for large embeddings. For example, a 1536-dimension float32 vector from OpenAI would take 1536 * 4 = 6144 bytes of space if stored in a compact BLOB format. That's larger than SQLite's default page size of 4096 bytes, so the extra data spills into overflow pages. Which, again, isn't too big of a deal, but if the original table had small values before, then table scans can get slower.
One solution is to move it to a separate table, e.g. for an original `users` table, you can make a new `CREATE TABLE users_embeddings(user_id, embedding)` table and just LEFT JOIN that when you need it. Or you can use newer techniques like Matryoshka embeddings[0] or scalar/binary quantization[1] to reduce the size of individual vectors, at the cost of lower accuracy. Or you can bump the page size of your SQLite database with `PRAGMA page_size=8192`.
I also have a SQLite extension for vector search[2], but there's a number of usability/ergonomic issues with it. I'm making a new one that I hope to release soon, which will hopefully be a great middle ground between "store vectors in a .npy files" and "use pgvector".
Re "do embeddings ever expire": nope! As long as you have access to the same model, the same text input should give the same embedding output. It's not like LLMs that have temperatures/meta prompts/a million other dials that make outputs non-deterministic, most embedding models should be deterministic and should work forever.
[0] https://huggingface.co/blog/matryoshka
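To make the side-table idea concrete, here's a rough sketch in SQLite with a hypothetical users/users_embeddings pair (a 1536-dimension float32 vector round-trips as exactly 6144 bytes):

    import sqlite3

    import numpy as np

    db = sqlite3.connect("app.db")
    db.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT, bio TEXT)")
    # Keep the wide BLOBs out of the hot table so scans over users stay fast.
    db.execute("CREATE TABLE IF NOT EXISTS users_embeddings (user_id INTEGER PRIMARY KEY, embedding BLOB)")

    vec = np.random.rand(1536).astype(np.float32)  # stand-in for a real embedding
    db.execute("INSERT OR REPLACE INTO users (id, name, bio) VALUES (1, 'alice', 'hello')")
    db.execute("INSERT OR REPLACE INTO users_embeddings (user_id, embedding) VALUES (1, ?)", (vec.tobytes(),))
    db.commit()

    name, blob = db.execute(
        "SELECT u.name, e.embedding FROM users u LEFT JOIN users_embeddings e ON e.user_id = u.id WHERE u.id = 1"
    ).fetchone()
    restored = np.frombuffer(blob, dtype=np.float32)  # len(blob) == 1536 * 4 == 6144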
This is very useful, appreciate the insight. Storing embeddings in a table and joining when needed feels like a really nice solution for what I'm trying to do.
I store them as blobs in SQLite. It works fine - depending on the model they take up 1-2KB each.
I've been adding embeddings to every project I work on for the purpose of vector similarity searches.
I was just trying to order Uber Eats and wondering why they don't have a better search based on embeddings.
Almost finished building a feature for JSON Resume that takes your hosted resume and WhoIsHiring job posts and uses embeddings to return relevant results -> https://registry.jsonresume.org/thomasdavis/jobs
Nice project! I find it can be hard to think of an idea that is well suited to AI. Using embeddings for search is definitely a good option to start with.
I made a reverse image search when I learned about embeddings. It's pretty fun to work with images https://medium.com/@christophe.smet1/finding-dirty-xtc-with-...
I’d love to build a suite of local tooling to play around with different embedding approaches.
I’ve had great results using SentenceTransformers for quick one-off tasks at work for unique data asks.
I’m curious about clustering within the embeddings and seeing what different approaches can yield and what applications they work best for.
If I have 50,000 historical articles and 5,000 new articles, and I apply SBERT and then k-means with N=20, I get great results in terms of articles about Ukraine, sports, chemistry, and nerdcore from Lobsters ending up in distinct clusters.
I've used DBSCAN for finding duplicate content, which has been less successful. With the parameters I'm using it's rare to get false positives, but there aren't that many true positives either. I'm sure I could do better if I tuned it, but I'm not sure there's an operating point I'd really like.
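A rough sketch of that setup with SentenceTransformers and scikit-learn; the model name and clustering parameters here are plausible defaults, not necessarily the ones used above:

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import DBSCAN, KMeans

    articles = ["..."]  # your corpus of article texts

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(articles, normalize_embeddings=True)

    # Topic-style clustering, roughly the k-means-with-N=20 setup described above.
    n_clusters = min(20, len(articles))
    topic_labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)

    # Near-duplicate detection: a tight eps keeps false positives rare,
    # at the cost of missing some true duplicates.
    dup_labels = DBSCAN(eps=0.15, min_samples=2, metric="cosine").fit_predict(embeddings)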
Embeddings have a special place in my heart since I learned about them 2 years ago. Working in SEO, it felt like everything finally "clicked" and I understood, on a lower level, how Google search actually works, how they're able to show specific content snippets directly on the search results page, etc. I never found any "SEO Guru" discussing this at all back then (maybe even now?), even though this was complete gold. It explains "topical authority" and gave you clues on how Google itself understands it.
This is where I got started too. GloVe embeddings stored in Postgres.
Pgvector is nice, and it's cool seeing quick tutorials using it. Back then, we only had cube, which didn't do cosine similarity indexing out of the box (you had to normalize vectors and use euclidean indexes) and only supported up to 100 dimensions. And there were maybe other inconveniences I don't remember, cause front page AI tutorials weren't using it.
pgvector is very nice indeed. And you get to store your vectors close to the rest of your data. I've yet to understand the unique use case for dedicated vector DBs. It seems so annoying having to query your vectors in a separate database without being able to easily join/filter based on the rest of your tables.
I stored ~6 million Hacker News posts, their metadata, and the vector embeddings on a cheap $20/month VM running pgvector. Querying is very fast. Maybe there's some penalty to pay when you get to billion+ row counts, but I'm happy so far.
As I'm trying to work on some pricing info for pgvector - can you share some more info about the Hacker News posts you've embedded?
* Which embedding model? (or number of dimensions)
* When you say 6 million posts, is it just the URL of the post, title, and author, or do you mean you've also embedded the linked URL (be it HN or elsewhere)?
Cheers!
You can also store vectors or matrices in a split-up fashion as separate rows in a table, which is particularly useful if they're sparse. I've handled huge sparse matrix expressions (add, subtract, multiply, transpose) that way, cause numpy couldn't deal with them.
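As a sketch of that row-per-nonzero-cell idea (table and column names made up for illustration), a sparse matrix product then becomes a join plus a GROUP BY:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE cells (matrix TEXT, row_idx INTEGER, col_idx INTEGER, value REAL)")
    db.executemany(
        "INSERT INTO cells VALUES (?, ?, ?, ?)",
        [("A", 0, 0, 2.0), ("A", 0, 5, 1.0), ("B", 0, 1, 4.0), ("B", 5, 1, 3.0)],
    )

    # C = A @ B: join A's columns to B's rows, sum the products per output cell.
    product = db.execute(
        """
        SELECT a.row_idx, b.col_idx, SUM(a.value * b.value) AS value
        FROM cells a JOIN cells b ON a.col_idx = b.row_idx
        WHERE a.matrix = 'A' AND b.matrix = 'B'
        GROUP BY a.row_idx, b.col_idx
        """
    ).fetchall()  # [(0, 1, 11.0)]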
seeing comments about using pgvector... at Pinecone, we spent some time understanding its limitations and pain points. Pinecone eliminates these pain points entirely and makes things simple at any scale. check it out: https://www.pinecone.io/blog/pinecone-vs-pgvector/
Has Pinecone gotten any cheaper? Last time I tried it was $75/month for the starter plan / single vector store.
yep. pinecone serverless has reduced costs significantly for many workloads.
> You can even try dog breeds like ‘hound,’ ‘poodle,’ or my favorite ‘samoyed.’ It pretty much just works. But that’s not all; it also works for other languages. Try ‘chien’ and even ‘犬’!
Can anyone explain how this language translation works? The magic is in the embeddings of course, but how does it work, how does it translate ~all words across all languages?
One thing I'm not sure of is how much of a larger bit of text should go into an embedding? I assume it's a trade off of context and recall, with one word not meaning much semantically, and the whole document being too much to represent with just numbers. Is there a sweet spot (e.g. split by sentence) or am I missing something here?
Tangential question: how are people using GenAI for financial datasets for insights and recommendations? Assume tens of disparate databases with financial data. Does NL2SQL work well for this? Or OpenAI Tools (formerly OpenAI Functions)? What have you found that is consistently accurate?
Can someone give a qualitative explanation of what the vector of a word with 2 unrelated meanings would look like compared to the vector of a synonym of each of those meanings?
If you think about it like a point on a graph, and the vectors as just 2D points (x,y), then the synonyms would be close and the unrelated meanings would be further away.
I'm guessing 2 dimensions isn't enough for this.
Here's a concrete example: "bow" would need to be close to "ribbon" (as in a bow on a present) and also close to "gun" (as a weapon that shoots a projectile), but "ribbon" and "gun" would seem to need to be far from each other. How does something like word2vec resolve this? Any transitive relationship would seem to fall afoul of this.
Yes, only more sophisticated embeddings can capture that, and they use 300+ dimensions.
But we still need a measure of "closeness" that is non-transitive.
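A tiny numpy illustration of how that falls out naturally: "bow" can point partly along the "ribbon" direction and partly along the "gun" direction, so it is close to both while they stay far from each other. Real models just do this in hundreds of dimensions; the three toy vectors here are made up.

    import numpy as np

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    bow = np.array([1.0, 1.0, 0.0])
    ribbon = np.array([1.0, 0.0, 0.0])
    gun = np.array([0.0, 1.0, 0.0])

    print(cos(bow, ribbon))  # ~0.71  close
    print(cos(bow, gun))     # ~0.71  close
    print(cos(ribbon, gun))  # 0.0    far; closeness is not transitive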
I think he is saying: embeddings are deterministic, so they are more predictable in production.
They’re still magic, with little explainability or adaptability when they don’t work.
They are named 'feature' vectors with scored attributes, similar to associative arrays. Just ask M. I. Jordan, D. Blei, S. Mian or A. Ng.
They are embedded into a particular semantic vector space that is learned by a model. Another feature vector could be hand-rolled based on feature engineering, tf-idf n-grams, etc. Embeddings are typically distinct from manual feature engineering.
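To make the distinction concrete, here's a small sketch contrasting a hand-rolled tf-idf feature vector with a learned sentence embedding. The two example sentences are made up and share almost no words, so tf-idf sees them as unrelated while a trained model does not.

    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["the dog chased the ball", "a puppy ran after a toy"]

    # Hand-engineered features: one dimension per n-gram, weighted by tf-idf.
    tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)
    print(cosine_similarity(tfidf)[0, 1])  # ~0.0, no shared terms

    # Learned embedding: dimensions come out of a trained model, not manual features.
    learned = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
    print(cosine_similarity(learned)[0, 1])  # noticeably higher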
I strongly agree with the title of the article. RAG is very interesting right now, just as an example of how technology moves from being fresh out of academia to being engineered and commoditized into regular off-the-shelf tools. On the other hand, I don't think it's that important to understand how embeddings are calculated; for the beginner it's more important to showcase why they work, why they enable simple reasoning like "queen = woman + (king - man)", and the possible use cases.
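If you want to see that analogy arithmetic first-hand, gensim's pretrained GloVe vectors make it a one-liner (the model is downloaded on first use):

    import gensim.downloader as api

    # A small pretrained GloVe model from the gensim-data repository.
    glove = api.load("glove-wiki-gigaword-50")

    # king - man + woman lands near queen.
    print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))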
Does anyone have examples of word (ngram) disambiguation when doing Approximate Nearest Neighbour (ANN) on word vector embeddings?
Can embeddings be used to capture stylistic features of text, rather than semantic? Like writing style?
Probably, but you might need something more sophisticated than cosine distance. For example, you might take a dataset of business letters, diary entries, and fiction stories and train some classifier on top of the embeddings of each of the three types of text, then run (embeddings --> your classifier) on new text. But at that point you might just want to ask an LLM directly with a prompt like - "Classify the style of the following text as business, personal, or fiction: $YOUR TEXT$"
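A toy sketch of that classifier-on-top-of-embeddings idea; the model name, example texts, and labels are made up, just to show the shape of it:

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    texts = [
        "Dear Sir, please find the attached invoice for Q3.",
        "Dear diary, today I finally went for that long walk.",
        "The dragon circled the tower twice before landing.",
    ]
    labels = ["business", "personal", "fiction"]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    clf = LogisticRegression(max_iter=1000).fit(model.encode(texts), labels)

    print(clf.predict(model.encode(["Once upon a time, a knight set out at dawn."])))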
You may get way more accurate results from relatively small models as well as logits for each class if you ask one question per class instead.
Likely not; embeddings are very crude. The embedding of a text is roughly just an average of the "meanings" of its words.
As is, embeddings lack a lot of the tricks that made transformers so efficient.
ah, pgvector is kind of annoying to start with, you have to set it up and maintain it, and then it starts falling apart when you have more vectors
Can you elaborate more on the falling apart? I can see pgvector being intimidating for users with no experience standing up a DB, but I don't see how Postgres or pgvector would fall apart. Note, my reason for asking is I'm planning on going all in with Postgres, so pgvector makes sense for me.
https://www.pinecone.io/blog/pinecone-vs-pgvector/ check it out :)
What is "more vectors"? How many are we talking about? We've been using pgvector in production for more than 1 year without any issues. We dont have a ton of vectors, less than 100,000, and we filter queries by other fields so our total per cosine function is probably more like max of 5000. Performance is fine and no issues.
Take a look at https://github.com/tembo-io/pg_vectorize. It makes it a lot easier to get started. It runs on pgvector, but as a user, its completely abstracted from you. It also provides you with a way to auto-update embeddings as you add new data or update existing source data.
This is good until it isn’t. Tried to get it working for 4 hours and it just did not.
And then I had an important architectural gotcha moment: I want my database to be dumb. Its purpose is to store and query data in an efficient and ACID way.
Adding cronjobs and http calls to the database is a bad idea.
I love the simplicity and that it helps keep embeddings up to date (if it works), but I decided not to treat my database as an application.
on the other hand, if you have postgres already, it may be easier to add pgvector than to add another dependency to your stack (especially if you are using something like supabase)
another benefit is that you can easily filter your embeddings by other fields, so everything is kept in one place, which can also help with performance
it's a good place to start in those cases and if it is successful and you need extreme performance you can always move to other specialized tools like qdrant, pinecone or weaviate which were purpose-built for vectors