by bclavie on 1/4/2024, 4:49:02 PM
by jimmySixDOF on 1/5/2024, 7:41:51 AM
I looked at this on Twitter and will try it out using the integration with LlamaIndex! The idea of Late Interaction sounds like an improvement either on top of or in place of a vector DB approach. I was looking at the BERT family in general, so while ColBERT is a great implementation, it would be interesting to have this same "everything in the one gumbo pot" type library for working with roberta/tinybert/qbert/distilbert/.....
Great project and I hope the post gets on the HN second chance loop!
by okhat on 1/4/2024, 4:51:24 PM
I'll admit even I sometimes wish ColBERT was more user-friendly. I'll probably start using ColBERT through RAGatouille now.
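For anyone wondering what that looks like in practice, here's a rough sketch of the RAGatouille flow (written from memory of the README, so exact method signatures and defaults may differ slightly; the index name and documents are just placeholders):

    from ragatouille import RAGPretrainedModel

    # Load a pretrained ColBERT checkpoint through RAGatouille.
    RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

    # Index a small collection; RAGatouille handles chunking and encoding internally.
    RAG.index(
        collection=[
            "ColBERT is a late-interaction retrieval model.",
            "RAGatouille wraps ColBERT behind a simpler API.",
        ],
        index_name="my_index",
    )

    # Query the index and get the top-k passages back.
    results = RAG.search(query="What is ColBERT?", k=3)
    print(results)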
Longer Background/Explanation:
I’ve been working on RAG problems for quite a while now, and it’s very apparent that solving real-life problems with it is very, very different from what the basic tutorials show.
There are a million moving parts, but a huge one is obviously the model you use to retrieve the data. The most common approach relies on just using dense embeddings (like OpenAI’s embedding models) and retrieving the documents whose embedding vectors are closest to the query’s own embedding.
The problem is that in practice, it’s a bit of a Sisyphean task: you’re asking a model to compress a document into a tiny vector. And then, it must also be able to encode a very differently worded query into another tiny vector, one that ends up close to the first. And it must do so in a way that can represent any specific aspect of the document that could be requested.
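To make the single-vector approach concrete, here's a minimal sketch of dense retrieval with cosine similarity. The choice of embedding model is purely illustrative; any encoder that maps text to one vector works the same way:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Illustrative model choice: any single-vector embedding model behaves like this.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    docs = [
        "ColBERT is a late-interaction retrieval model.",
        "Dense embeddings compress a whole document into one vector.",
    ]

    # One vector per document: the entire document is squeezed into a single point.
    doc_vecs = model.encode(docs, normalize_embeddings=True)

    query = "what is late interaction?"
    query_vec = model.encode([query], normalize_embeddings=True)[0]

    # Retrieval = nearest neighbour by cosine similarity (dot product of unit vectors).
    scores = doc_vecs @ query_vec
    best = int(np.argmax(scores))
    print(docs[best], float(scores[best]))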
The result is that dense embeddings require tons of data to train (billions of pretraining examples), are relatively hard to fine-tune (there’s a hard-to-strike balance to find), and have been shown many times in the Information Retrieval (IR) literature to generalise worse outside of known benchmarks. This doesn’t mean they’re not a very useful tool, but there might be more suitable tools for retrieving your data.
In the IR literature again, late-interaction models, or “sparse embedding” approaches, like ColBERT or SparseEmbed, are clear winners. They train quickly, need less data, fine-tune relatively easily, and generalise very well (their zero-shot performance is never far behind fine-tuned performance!)
This is because these models don’t encode full documents: they create bags-of-embeddings! It’s a twist on the old-timey keyword-based retrieval, except instead of hardcoded keywords, we now use contextualised semantic keywords. The models capture the meaning of all the “small units of content” within their context.
From there, a document’s represented as the sum of its parts. At retrieval time, all you need to do is match your query’s “semantic keywords” to the ones in your documents. It’s much easier for the model to learn representations for these tiny units, and much easier to match them.
So what’s the catch? Why is this not everywhere? Because IR is not quite NLP: it hasn’t gone fully mainstream, and a lot of the IR frameworks are, quite frankly, a bit of a pain to work with in production. Some solid efforts to bridge the gap like Vespa [1] are gathering steam, but it’s not quite there.
[1] https://vespa.ai
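For the curious, here's what the late-interaction matching described above looks like in plain numpy. The token embeddings are random stand-ins (in ColBERT they come from a BERT encoder); the point is the scoring step, where each query token is matched to its best document token and the maxima are summed (the MaxSim operator from the ColBERT paper):

    import numpy as np

    # Toy contextualised token embeddings: one row per token, not one vector per document.
    rng = np.random.default_rng(0)
    query_tokens = rng.normal(size=(4, 128))   # 4 query tokens, 128 dims each
    doc_tokens = rng.normal(size=(50, 128))    # 50 document tokens

    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)

    q, d = normalize(query_tokens), normalize(doc_tokens)

    # Late interaction: token-to-token cosine similarities...
    similarity_matrix = q @ d.T                # shape (4, 50)

    # ...then, for each query token, keep only its best-matching document token,
    # and sum those maxima to get the document's relevance score.
    score = similarity_matrix.max(axis=1).sum()
    print(score)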