<p align="center">
  <a href="https://all-in-on-ai.vercel.app">
    <img alt="Search the All-In Podcast using AI" src="/public/social.jpg" width="600">
  </a>
</p>

# YouTube Semantic Search <!-- omit in toc -->

> OpenAI-powered semantic search for any YouTube playlist — featuring the All-In Podcast 🔥
## Intro
I love the All-In Podcast. But search and discovery with podcasts can be really challenging.
I built this project to solve this problem... and I also wanted to play around with cool AI stuff. 😂
This project uses the latest models from OpenAI to build a semantic search index across every episode of the Pod. It allows you to find your favorite moments with Google-level accuracy and rewatch the exact clips you're interested in.
You can use it to power advanced search across any YouTube channel or playlist. The demo uses the All-In Podcast because it's my favorite 💕, but it's designed to work with any playlist.
## How to get started
- Clone the repository to your local machine.
- Navigate to the root directory of the repository in your terminal.
- Run `npm install` to install all the necessary dependencies.
- Run `npx tsx src/bin/resolve-yt-playlist.ts` to download the English transcripts for each episode of the target playlist (in this case, the All-In Podcast Episodes Playlist).
- Run `npx tsx src/bin/process-yt-playlist.ts` to pre-process the transcripts, fetch embeddings from OpenAI, and insert them into a Pinecone search index.
- Run `npx tsx src/bin/query.ts` to query the Pinecone search index.
- (Optional) Run `npx tsx src/bin/generate-thumbnails.ts` to generate timestamped thumbnails of each video in the playlist. This step takes ~2 hours and requires a stable internet connection.
- The frontend is a Next.js webapp deployed to Vercel that uses the Pinecone index as its primary data store. Run `npm run dev` to start the development server and view the webapp locally.
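Note that the scripts need API credentials for OpenAI, Pinecone, and YouTube. A minimal `.env` sketch is below; the variable names are assumptions, so check the repo's config for the exact names it expects:

```bash
# Hypothetical .env — variable names are illustrative,
# not necessarily the exact names the scripts read
OPENAI_API_KEY='sk-...'
PINECONE_API_KEY='...'
YOUTUBE_API_KEY='...'
```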
Note that a few episodes may not have automated English transcriptions available, and the project currently relies on a hacky HTML scraping solution to fetch them; using Whisper to transcribe each episode's audio would be a better approach. That, along with sorting by recency vs. relevancy, is tracked in the TODO section below.
## Example Queries
- sweater karen
- best advice for founders
- poker story from last night
- crypto scam ponzi scheme
- luxury sweater chamath
- phil helmuth
- intellectual honesty
- sbf ftx
- science corner
## Screenshots

<p align="center">
  <a href="https://all-in-on-ai.vercel.app">
    <img alt="Desktop light mode" src="/public/images/screenshot-desktop-light.jpg" width="45%">
  </a>
  <a href="https://all-in-on-ai.vercel.app">
    <img alt="Desktop dark mode" src="/public/images/screenshot-desktop-dark.jpg" width="45%">
  </a>
</p>

## How It Works
Under the hood, it uses:
- OpenAI - We're using the brand new `text-embedding-ada-002` embedding model, which captures deeper information about text in a latent space with 1536 dimensions
  - This allows us to go beyond keyword search and search by higher-level topics.
- Pinecone - Hosted vector search which enables us to efficiently perform k-NN searches across these embeddings
- Vercel - Hosting and API functions
- Next.js - React web framework
We use Node.js and the YouTube API v3 to fetch the videos of our target playlist. In this case, we're focused on the All-In Podcast Episodes Playlist, which contains 108 videos at the time of writing.
```bash
npx tsx src/bin/resolve-yt-playlist.ts
```
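For reference, paging through a playlist with the official `googleapis` Node client looks roughly like this. This is a minimal sketch, not the repo's exact code, and the helper name is illustrative:

```ts
import { google, youtube_v3 } from 'googleapis'

// Sketch: page through a playlist with the YouTube Data API v3;
// an API key is assumed to be available in the environment
const youtube = google.youtube({ version: 'v3', auth: process.env.YOUTUBE_API_KEY })

async function fetchPlaylistItems(playlistId: string) {
  const items: youtube_v3.Schema$PlaylistItem[] = []
  let pageToken: string | undefined

  do {
    // Each page returns at most 50 playlist items
    const res = await youtube.playlistItems.list({
      part: ['snippet', 'contentDetails'],
      playlistId,
      maxResults: 50,
      pageToken
    })
    items.push(...(res.data.items ?? []))
    pageToken = res.data.nextPageToken ?? undefined
  } while (pageToken)

  return items
}
```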
We download the English transcripts for each episode using a hacky HTML scraping solution, since the YouTube API doesn't allow non-OAuth access to captions. Note that a few episodes don't have automated English transcriptions available, so we're just skipping them at the moment. A better solution would be to use Whisper to transcribe each episode's audio.
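As a sketch of what that Whisper-based alternative could look like (it's not part of the current codebase), here's the hosted transcription endpoint from the current `openai` Node SDK, assuming each episode's audio has already been downloaded:

```ts
import fs from 'node:fs'
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

// Hypothetical alternative to the HTML scraping step: transcribe an
// episode's downloaded audio with Whisper instead of scraping captions
async function transcribeEpisode(audioPath: string): Promise<string> {
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream(audioPath),
    model: 'whisper-1'
  })
  return transcription.text
}
```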
Once we have all of the transcripts and metadata downloaded locally, we pre-process each video's transcript, breaking it up into reasonably sized chunks of ~100 tokens, and fetch each chunk's `text-embedding-ada-002` embedding from OpenAI. This results in ~200 embeddings per episode.
All of these embeddings are then upserted into a Pinecone search index with a dimensionality of 1536. There are ~17,575 embeddings in total across ~108 episodes of the All-In Podcast.
```bash
npx tsx src/bin/process-yt-playlist.ts
```
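Conceptually, the pre-processing step boils down to something like the following sketch, written against the current `openai` and `@pinecone-database/pinecone` SDKs; the function and index names are illustrative, not the repo's actual code:

```ts
import OpenAI from 'openai'
import { Pinecone } from '@pinecone-database/pinecone'

const openai = new OpenAI()
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! })
const index = pinecone.index('all-in-podcast') // illustrative index name

// `chunks` are the ~100-token transcript segments for one episode,
// each tagged with its start time so results can link to the exact clip
async function embedAndUpsert(
  videoId: string,
  chunks: Array<{ text: string; start: number }>
) {
  // One batched call embeds all of an episode's chunks at once
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input: chunks.map((chunk) => chunk.text)
  })

  // Upsert one 1536-dimensional vector per chunk, keeping enough
  // metadata to reconstruct the matching transcript segment later
  await index.upsert(
    data.map((d, i) => ({
      id: `${videoId}:${i}`,
      values: d.embedding,
      metadata: { videoId, text: chunks[i].text, start: chunks[i].start }
    }))
  )
}
```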
Once our Pinecone search index is set up, we can start querying it either via the webapp or via the example CLI:
```bash
npx tsx src/bin/query.ts
```
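Querying is the same pattern in reverse: embed the query text, then ask Pinecone for its nearest neighbors. A rough sketch, under the same assumptions as above:

```ts
import OpenAI from 'openai'
import { Pinecone } from '@pinecone-database/pinecone'

const openai = new OpenAI()
const index = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! })
  .index('all-in-podcast') // same illustrative index name as above

// Embed the user's query, then run a k-NN search over the index
async function search(query: string, topK = 10) {
  const { data } = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input: query
  })

  const res = await index.query({
    vector: data[0].embedding,
    topK,
    includeMetadata: true // returns the transcript text + timestamp too
  })

  return res.matches // scored chunks, most similar first
}
```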
We also support generating timestamp-based thumbnails of every YouTube video in the playlist. Thumbnails are generated using headless Puppeteer and are uploaded to Google Cloud Storage. We also post-process each thumbnail with lqip-modern to generate nice preview placeholder images.
If you want to generate thumbnails (optional), run:
```bash
npx tsx src/bin/generate-thumbnails.ts
```
Note that thumbnail generation takes ~2 hours and requires a pretty stable internet connection.
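For a rough idea of what that pipeline involves, here's a stripped-down sketch using Puppeteer and `lqip-modern`; the function is hypothetical and the Google Cloud Storage upload is omitted:

```ts
import puppeteer from 'puppeteer'
import lqip from 'lqip-modern'

// Simplified sketch: capture a frame of a video at a given timestamp
// and build a tiny blurred placeholder image for it
async function captureThumbnail(videoId: string, seconds: number) {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.setViewport({ width: 1280, height: 720 })

  // YouTube's embed player accepts a start offset (in seconds) and autoplay
  await page.goto(
    `https://www.youtube.com/embed/${videoId}?start=${seconds}&autoplay=1`,
    { waitUntil: 'networkidle2' }
  )

  const screenshot = Buffer.from(await page.screenshot({ type: 'jpeg' }))
  await browser.close()

  // lqip-modern produces a small base64 data URI to use as a preview placeholder
  const preview = await lqip(screenshot)
  return { screenshot, placeholder: preview.metadata.dataURIBase64 }
}
```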
The frontend is a Next.js webapp deployed to Vercel that uses our Pinecone index as a primary data store.
## TODO
- Use Whisper for better transcriptions
- Support sorting by recency vs relevancy
## Feedback
Have an idea on how this webapp could be improved? Find a particularly fun search query?
Feel free to send me feedback, either on GitHub or Twitter. 💯
## Credit
- Inspired by Riley Tomasek's project for searching the Huberman YouTube Channel
- Note that this project is not affiliated with the All-In Podcast. It just pulls data from their YouTube channel and processes it using AI.
## License
MIT © Travis Fischer
If you found this project interesting, please consider sponsoring me or <a href="https://twitter.com/transitive_bs">following me on twitter <img src="https://storage.googleapis.com/saasify-assets/twitter-logo.svg" alt="twitter" height="24px" align="center"></a>
The API and server costs add up over time, so if you can spare it, sponsoring me on GitHub is greatly appreciated. 💕