Replies: 4 comments 1 reply
-
Sure, sounds interesting! I'm very busy with other features right now, but I think this is an excellent case for a third-party plugin.
-
@do-me Very interesting! Let me know if you have some initial implementation - I will gladly join work and help you develop this solution (despite my very low programming experience 😇).
-
A quick update with a few thoughts for a potential mkdocs plugin after developing SemanticFinder and some other semantic search projects further:
The workflow would be pretty straightforward: create a mkdocs extension that triggers the indexing during the build. On the client side, load the model and the index file only on request; once loaded, both would be cached anyway and could be reused later. The search bar could allow for both full-text and semantic search. If I manage to find the time, I could put together a simple PoC, similar to the ones I already built. However, on the UI side (especially a smooth integration into the current search bar) I'd definitely appreciate some help if anyone is interested ;D
P.S. Of course, if you can afford a separate server with a vector DB and a model server, that would be much more client-friendly and the "proper" way to build such an integration.
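To make the build-time half of that workflow concrete, here is a minimal sketch of the indexing step such a plugin could run. Everything here is an assumption for illustration: `embed()` is a placeholder (a real plugin would call an actual sentence-embedding model, e.g. the same MiniLM model transformers.js can load in the browser), and the `(url, heading, text)` tuples stand in for whatever page data mkdocs hands the plugin during a build.

```python
# Sketch: build-time indexing for a hypothetical mkdocs semantic-search plugin.
# embed() is a PLACEHOLDER -- it just derives a deterministic pseudo-vector
# from a hash so the sketch is self-contained. A real plugin would run a
# sentence-embedding model here instead.
import hashlib
import json

def embed(text, dim=8):
    """Placeholder embedding: deterministic pseudo-vector from a SHA-256 hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]

def build_index(pages):
    """pages: list of (doc_url, heading, text) tuples, e.g. collected while
    mkdocs renders each page. Returns a JSON-serializable list of records,
    one per paragraph, each carrying its document/heading reference."""
    index = []
    for url, heading, text in pages:
        for paragraph in filter(None, (p.strip() for p in text.split("\n\n"))):
            index.append({
                "doc": url,
                "heading": heading,
                "text": paragraph,
                "embedding": embed(paragraph),
            })
    return index

if __name__ == "__main__":
    pages = [("getting-started/", "Installation",
              "Install with pip.\n\nThen run the dev server.")]
    # The plugin would write this JSON next to the normal search index,
    # to be fetched only when the user flips the semantic-search toggle.
    print(json.dumps(build_index(pages), indent=2))
```

The point of the sketch is only the shape of the pipeline: chunk at build time, embed once, ship a static JSON file that the browser fetches lazily.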
-
Another update: I just finished building an index import/export logic which I'd like to use as a basis for this mkdocs plugin. Once the index is loaded, the search is very fast (e.g. the whole Bible with 23,000 embeddings in <2 s) and also allows for super quick hybrid search. The only downside is the initial loading of the model and the index. The model is cached in the browser; the index is not yet (but that's planned). So in theory, the big no-no could become a small one ;D. I would like to run some tests against the mkdocs-material documentation (as it's fairly large), but I need a quick tip if possible @squidfunk:
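For readers wondering why search over a loaded index is fast, and what "hybrid search" means here: once the embeddings are in memory, each query is just a similarity scan plus an optional keyword score. Below is a minimal sketch in plain Python; the function names and the 50/50 score blend are my own assumptions, not the actual SemanticFinder implementation (which runs in JavaScript and would use typed arrays).

```python
# Sketch: in-memory hybrid search over a pre-built embedding index.
# Pure-Python cosine similarity plus a naive keyword-overlap score;
# a real implementation would vectorize this.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_search(query_vec, query_terms, index, weight=0.5, top_k=5):
    """index: list of {"text": ..., "embedding": ...} records.
    Blends semantic similarity with keyword overlap (both in [0, 1]);
    weight=1.0 is purely semantic, weight=0.0 purely keyword-based."""
    results = []
    for record in index:
        semantic = cosine(query_vec, record["embedding"])
        # Fraction of query terms that appear verbatim in the record.
        keyword = query_terms and sum(
            t.lower() in record["text"].lower() for t in query_terms
        ) / len(query_terms)
        score = weight * semantic + (1 - weight) * (keyword or 0.0)
        results.append((score, record))
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:top_k]
```

With a few tens of thousands of records this kind of linear scan stays well under a second even in JavaScript, which is consistent with the <2 s figure above (where most of the time goes to embedding the query itself).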
-
Hi everyone!
This is a feature idea.
I recently found transformers.js, which allows you to use any Hugging Face model right in your browser, client-side. I used it to build a client-side semantic search: SemanticFinder, which is quite performant even on mobile. The model itself weighs only 25 MB.
While developing it, I had the idea that it would be awesome to have a switch in the search bar for toggling between the normal full-text search as it is today and semantic search. This could allow users to find answers faster.
Both
a) mean embeddings for a full document and
b) paragraph- or sentence-wise embeddings
could be calculated server-side for any new content, once. E.g. a JSON file with document and heading references as well as the embeddings could be created (as for the normal search) and only loaded if the user activates the toggle.
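One possible shape for such a JSON file is sketched below. All field names are made up for illustration, and the embedding vector is truncated to three dimensions for brevity (a real MiniLM-style model produces 384):

```json
[
  {
    "doc": "getting-started/",
    "heading": "Installation",
    "text": "Example paragraph text from the page.",
    "embedding": [0.0213, -0.1187, 0.0542]
  }
]
```

Since this mirrors the structure of the existing search index (document and heading references per entry), search results could link straight to the right section of the right page.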
Also, the model would only be loaded on first activation, so the initial page load remains fast.
What do you think about it? Maybe it would make for a nice mkdocs extension.