<p align="center"> <a href="https://do-me.github.io/SemanticFinder/"> <img src="https://github.com/do-me/SemanticFinder/assets/47481567/4522ab9d-08f4-4f4c-92db-dbf14ccb2b70" width="320" alt="SemanticFinder"> </a> <h1 align="center">Frontend-only live semantic search and chat-with-your-documents built on transformers.js. Supports Wasm and WebGPU!</h1> </p>

Try the web app, install the Chrome extension or read the introduction blog post.

🔥 For best performance, try the WebGPU version here! 🔥

Semantic search right in your browser! Calculates the embeddings and cosine similarity client-side without any server-side inferencing, using transformers.js and the latest SOTA embedding models from Hugging Face.

Models

All transformers.js-compatible feature-extraction models are supported. A sortable, daily updated list is available, and you can download the compatible models table as xlsx, csv, json, parquet, or html here: https://github.com/do-me/trending-huggingface-models/. Note that the Wasm backend in transformers.js supports all of these models; if you want the best performance, make sure to use a WebGPU-compatible model.
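
As a minimal sketch, loading such a model looks roughly like this with transformers.js v3, where WebGPU is selected via the device option (the model name is just one example from the list; option names may differ in older versions):

```js
import { pipeline } from "@huggingface/transformers";

// Load a feature-extraction pipeline once; the model files are fetched
// from the Hugging Face Hub and cached by the browser afterwards.
const extractor = await pipeline(
  "feature-extraction",
  "Xenova/multilingual-e5-small", // any compatible model works here
  { device: "webgpu" }            // omit this option to use the default Wasm backend
);

// Mean-pooled, L2-normalized embedding for a piece of text.
const embedding = await extractor("Hello world", { pooling: "mean", normalize: true });
```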

Catalogue

You can use super-fast pre-indexed examples of really large books, like the Bible or Les Misérables with hundreds of pages, and search their content in less than 2 seconds 🚀. Try one of these and see for yourself:

| filesize (MB) | textTitle | textAuthor | textYear | textLanguage | URL | modelName | quantized | splitParam | splitType | characters | chunks | wordsToAvoidAll | wordsToCheckAll | wordsToAvoidAny | wordsToCheckAny | exportDecimals | lines | textNotes | textSourceURL | filename |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4.78 | Das Kapital | Karl Marx | 1867 | de | https://do-me.github.io/SemanticFinder/?hf=Das_Kapital_c1a84fba | Xenova/multilingual-e5-small | True | 80 | Words | 2003807 | 3164 |  |  |  |  | 5 | 28673 |  | https://ia601605.us.archive.org/13/items/KarlMarxDasKapitalpdf/KAPITAL1.pdf | Das_Kapital_c1a84fba.json.gz |
| 2.58 | Divina Commedia | Dante | 1321 | it | https://do-me.github.io/SemanticFinder/?hf=Divina_Commedia_d5a0fa67 | Xenova/multilingual-e5-base | True | 50 | Words | 383782 | 1179 |  |  |  |  | 5 | 6225 |  | http://www.letteratura-italiana.com/pdf/divina%20commedia/08%20Inferno%20in%20versione%20italiana.pdf | Divina_Commedia_d5a0fa67.json.gz |
| 11.92 | Don Quijote | Miguel de Cervantes | 1605 | es | https://do-me.github.io/SemanticFinder/?hf=Don_Quijote_14a0b44 | Xenova/multilingual-e5-base | True | 25 | Words | 1047150 | 7186 |  |  |  |  | 4 | 12005 |  | https://parnaseo.uv.es/lemir/revista/revista19/textos/quijote_1.pdf | Don_Quijote_14a0b44.json.gz |
| 0.06 | Hansel and Gretel | Brothers Grimm | 1812 | en | https://do-me.github.io/SemanticFinder/?hf=Hansel_and_Gretel_4de079eb | TaylorAI/gte-tiny | True | 100 | Chars | 5304 | 55 |  |  |  |  | 5 | 9 |  | https://www.grimmstories.com/en/grimm_fairy-tales/hansel_and_gretel | Hansel_and_Gretel_4de079eb.json.gz |
| 1.74 | IPCC Report 2023 | IPCC | 2023 | en | https://do-me.github.io/SemanticFinder/?hf=IPCC_Report_2023_2b260928 | Supabase/bge-small-en | True | 200 | Chars | 307811 | 1566 |  |  |  |  | 5 | 3230 | state of knowledge of climate change | https://report.ipcc.ch/ar6syr/pdf/IPCC_AR6_SYR_LongerReport.pdf | IPCC_Report_2023_2b260928.json.gz |
| 25.56 | King James Bible | None |  | en | https://do-me.github.io/SemanticFinder/?hf=King_James_Bible_24f6dc4c | TaylorAI/gte-tiny | True | 200 | Chars | 4556163 | 23056 |  |  |  |  | 5 | 80496 |  | https://www.holybooks.com/wp-content/uploads/2010/05/The-Holy-Bible-King-James-Version.pdf | King_James_Bible_24f6dc4c.json.gz |
| 11.45 | King James Bible | None |  | en | https://do-me.github.io/SemanticFinder/?hf=King_James_Bible_6434a78d | TaylorAI/gte-tiny | True | 200 | Chars | 4556163 | 23056 |  |  |  |  | 2 | 80496 |  | https://www.holybooks.com/wp-content/uploads/2010/05/The-Holy-Bible-King-James-Version.pdf | King_James_Bible_6434a78d.json.gz |
| 39.32 | Les Misérables | Victor Hugo | 1862 | fr | https://do-me.github.io/SemanticFinder/?hf=Les_Misérables_2239df51 | Xenova/multilingual-e5-base | True | 25 | Words | 3236941 | 19463 |  |  |  |  | 5 | 74491 | All five acts included | https://beq.ebooksgratuits.com/vents/Hugo-miserables-1.pdf | Les_Misérables_2239df51.json.gz |
| 0.46 | REGULATION (EU) 2023/138 | European Commission | 2022 | en | https://do-me.github.io/SemanticFinder/?hf=REGULATION_(EU)_2023_138_c00e7ff6 | Supabase/bge-small-en | True | 25 | Words | 76809 | 424 |  |  |  |  | 5 | 1323 |  | https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32023R0138&qid=1704492501351 | REGULATION_(EU)_2023_138_c00e7ff6.json.gz |
| 0.07 | Universal Declaration of Human Rights | United Nations | 1948 | en | https://do-me.github.io/SemanticFinder/?hf=Universal_Declaration_of_Human_Rights_0a7da79a | TaylorAI/gte-tiny | True | \nArticle | Regex | 8623 | 63 |  |  |  |  | 5 | 109 | 30 articles | https://www.un.org/en/about-us/universal-declaration-of-human-rights | Universal_Declaration_of_Human_Rights_0a7da79a.json.gz |

Import & Export

You can create indices yourself with one or two clicks and save them. If it's something private, keep it to yourself; if it's a classic book or something you think others might be interested in, consider a PR on the Hugging Face repo or get in touch with us. Book requests are happily met if you provide a good source link we can copy & paste from. Simply open an issue here with a [Book Request] tag or similar, or contact us.
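
For orientation, this is roughly what an exported index might carry, judging from the catalogue columns above. The metadata field names come straight from that table, but the chunk/embedding layout shown here is an assumption, so inspect a real .json.gz export from the Hugging Face repo before building on it:

```js
// Hypothetical shape of a decompressed .json.gz index export.
// Metadata fields mirror the catalogue columns; the "chunks" layout
// is an assumption, not the documented format.
const exportedIndex = {
  textTitle: "Hansel and Gretel",
  textAuthor: "Brothers Grimm",
  textYear: 1812,
  textLanguage: "en",
  modelName: "TaylorAI/gte-tiny",
  quantized: true,
  splitParam: 100,
  splitType: "Chars",
  exportDecimals: 5,
  textSourceURL: "https://www.grimmstories.com/en/grimm_fairy-tales/hansel_and_gretel",
  // Presumably one entry per segment: the text plus its embedding,
  // rounded to exportDecimals decimal places.
  chunks: [
    { text: "Hard by a great forest dwelt a poor wood-cutter...", embedding: [0.01543, -0.08921 /* ... */] },
    // ...
  ],
};
```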

It goes without saying that no discriminatory content will be tolerated.

Installation

Clone the repository and install dependencies with

npm install

Then run with

npm run start

If you want to build instead, run

npm run build

Afterwards, you'll find index.html, main.css, and bundle.js in the dist directory.

Browser extension

Download the Chrome extension from the Chrome Web Store and pin it. Right-click the extension icon for options:

Local build

If you want to build the browser extension locally, clone the repo, cd into the extension directory, and run the commands below:
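
Assuming the extension mirrors the main app's npm setup (an assumption; check its package.json for the actual script names):

npm install

npm run build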

Speed

Tested on the entire book of Moby Dick: 660,000 characters, ~13,000 lines, or ~111,000 words. Initial embedding generation takes 1-2 minutes on my old i7-8550U CPU with a segment size of 1,000 characters, i.e. roughly 660 segments to embed. Subsequent queries take only ~2 seconds! If you want to query larger texts or keep an entire library of books indexed, use a proper vector database instead.

Features

You can customize everything!

Usage ideas

Future ideas

Logic

Transformers.js is doing all the heavy lifting of tokenizing the input and running the model. Without it, this demo would have been impossible.

Input

Output

Pipeline

  1. All scripts are loaded. The model is loaded once from Hugging Face and cached in the browser afterwards.
  2. A user inputs some text and a search term or phrase.
  3. Depending on the approximate length to consider (unit: characters), the text is split into segments. Words themselves are never split, which is why the length is approximate.
  4. The search term embedding is created.
  5. For each segment of the text, the embedding is created.
  6. Meanwhile, the cosine similarity is calculated between every segment embedding and the search term embedding. It's written to a dictionary with the segment as key and the score as value.
  7. For every iteration, the progress bar and the highlighted sections are updated in real time based on the highest scores so far.
  8. The embeddings are cached in the dictionary so that subsequent queries are quite fast. The calculation of the cosine similarity is fairly speedy in comparison to the embedding generation.
  9. Only if the user changes the segment length do the embeddings need to be recalculated. A condensed sketch of this pipeline follows below.
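
The sketch below compresses steps 2-8 into a few lines using transformers.js and its cos_sim helper; the function and variable names (segmentText, embed, inputText) are illustrative, not SemanticFinder's actual source:

```js
import { pipeline, cos_sim } from "@xenova/transformers";

const extractor = await pipeline("feature-extraction", "TaylorAI/gte-tiny");

// Split the text into segments of roughly targetChars characters without
// ever breaking a word, which is why segment lengths are only approximate (step 3).
function segmentText(text, targetChars = 1000) {
  const segments = [];
  let current = "";
  for (const word of text.split(/\s+/)) {
    if (current && current.length + word.length + 1 > targetChars) {
      segments.push(current);
      current = "";
    }
    current += (current ? " " : "") + word;
  }
  if (current) segments.push(current);
  return segments;
}

// Mean-pooled, normalized embedding (steps 4 and 5).
async function embed(text) {
  const { data } = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(data);
}

const inputText = "..."; // whatever text the user pasted
const queryEmbedding = await embed("some search phrase");

const scores = {}; // segment -> cosine similarity (step 6)
for (const seg of segmentText(inputText)) {
  const segEmbedding = await embed(seg); // cache these so later queries stay fast (step 8)
  scores[seg] = cos_sim(queryEmbedding, segEmbedding);
  // ...update the progress bar and highlights here (step 7)...
}
```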

Collaboration

PRs welcome!

To Dos (no prioritization)

Star History

Star History Chart

Gource Map


Gource image created with:

gource -1280x720 --title "SemanticFinder" --seconds-per-day 0.03 --auto-skip-seconds 0.03 --bloom-intensity 0.5 --max-user-speed 500 --highlight-dirs --multi-sampling --highlight-colour 00FF00