
<p align="center"> <a href="https://all-in-on-ai.vercel.app"> <img alt="Search the All-In Podcast using AI" src="/public/social.jpg" width="600"> </a> </p>

# YouTube Semantic Search <!-- omit in toc -->

OpenAI-powered semantic search for any YouTube playlist — featuring the All-In Podcast 🔥


## Intro

I love the All-In Podcast. But search and discovery with podcasts can be really challenging.

I built this project to solve this problem... and I also wanted to play around with cool AI stuff. 😂

This project uses the latest models from OpenAI to build a semantic search index across every episode of the Pod. It allows you to find your favorite moments with Google-level accuracy and rewatch the exact clips you're interested in.

You can use it to power advanced search across any YouTube channel or playlist. The demo uses the All-In Podcast because it's my favorite 💕, but it's designed to work with any playlist.

## How to get started

Note that a few episodes may not have automated English transcriptions available, and that the project currently uses a hacky HTML scraping solution to fetch them; a better solution would be to use Whisper to transcribe each episode's audio. Also, the project supports sorting by recency vs relevancy.

## Example Queries

## Screenshots

<p align="center"> <a href="https://all-in-on-ai.vercel.app"> <img alt="Desktop light mode" src="/public/images/screenshot-desktop-light.jpg" width="45%"> </a> &nbsp; &nbsp; &nbsp; &nbsp; <a href="https://all-in-on-ai.vercel.app"> <img alt="Desktop dark mode" src="/public/images/screenshot-desktop-dark.jpg" width="45%"> </a> </p>

## How It Works

Under the hood, it uses:

- OpenAI (`text-embedding-ada-002` embeddings)
- Pinecone (hosted vector search index)
- Node.js and the YouTube API v3 (playlist and transcript retrieval)
- Puppeteer (timestamped thumbnail generation)
- Next.js and Vercel (the webapp frontend)

We use Node.js and the YouTube API v3 to fetch the videos of our target playlist. In this case, we're focused on the All-In Podcast Episodes Playlist, which contains 108 videos at the time of writing.

```bash
npx tsx src/bin/resolve-yt-playlist.ts
```
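
As a rough sketch (not the project's actual script), paging through a playlist with the official `googleapis` client might look like this; the `YOUTUBE_API_KEY` environment variable and the function shape are assumptions:

```ts
import { google, youtube_v3 } from 'googleapis'

const youtube = google.youtube({ version: 'v3', auth: process.env.YOUTUBE_API_KEY })

// Page through a playlist 50 videos at a time (the API's maximum page size).
async function resolvePlaylist(playlistId: string) {
  const videos: youtube_v3.Schema$PlaylistItem[] = []
  let pageToken: string | undefined
  do {
    const res = await youtube.playlistItems.list({
      part: ['snippet', 'contentDetails'],
      playlistId,
      maxResults: 50,
      pageToken
    })
    videos.push(...(res.data.items ?? []))
    pageToken = res.data.nextPageToken ?? undefined
  } while (pageToken)
  return videos
}
```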

We download the English transcripts for each episode using a hacky HTML scraping solution, since the YouTube API doesn't allow non-OAuth access to captions. Note that a few episodes don't have automated English transcriptions available, so we're just skipping them at the moment. A better solution would be to use Whisper to transcribe each episode's audio.
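
For the curious, that kind of scraping generally looks like the sketch below. YouTube's embedded player payload is undocumented and changes without notice, so treat this as illustrative rather than exact:

```ts
// Illustrative only: pull caption track info out of the watch page's embedded
// player payload, then fetch the English track as timed caption XML.
async function fetchTranscriptXml(videoId: string): Promise<string> {
  const html = await (await fetch(`https://www.youtube.com/watch?v=${videoId}`)).text()
  const match = html.match(/"captionTracks":(\[.*?\])/)
  if (!match) throw new Error(`no captions found for video ${videoId}`)
  const tracks = JSON.parse(match[1])
  const track = tracks.find((t: any) => t.languageCode === 'en') ?? tracks[0]
  return (await fetch(track.baseUrl)).text()
}
```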

Once we have all of the transcripts and metadata downloaded locally, we pre-process each video's transcript, breaking it up into reasonably sized chunks of ~100 tokens, and fetch each chunk's `text-embedding-ada-002` embedding from OpenAI. This results in ~200 embeddings per episode.
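
A minimal sketch of the embedding step, using the official `openai` SDK (the chunking itself is assumed to happen elsewhere):

```ts
import OpenAI from 'openai'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

// Embed a single ~100-token transcript chunk; returns a 1536-dimensional vector.
async function embedChunk(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-ada-002',
    input: text
  })
  return res.data[0].embedding
}
```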

All of these embeddings are then upserted into a Pinecone search index with a dimensionality of 1536. There are ~17,575 embeddings in total across ~108 episodes of the All-In Podcast.
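
The upsert step is sketched below with the official Pinecone JavaScript client; the index name and metadata shape are assumptions:

```ts
import { Pinecone } from '@pinecone-database/pinecone'

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! })
const index = pinecone.index('all-in-podcast') // assumed name; 1536-dimension index

// Store one embedded chunk with enough metadata to link back to the source video.
async function upsertChunk(
  id: string,
  embedding: number[],
  metadata: { videoId: string; start: number; text: string }
) {
  await index.upsert([{ id, values: embedding, metadata }])
}
```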

```bash
npx tsx src/bin/process-yt-playlist.ts
```

Once our Pinecone search index is set up, we can start querying it either via the webapp or via the example CLI:

```bash
npx tsx src/bin/query.ts
```
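
Querying boils down to embedding the user's query with the same model and asking Pinecone for its nearest neighbors. Reusing the hypothetical helpers from the sketches above:

```ts
// Embed the query, then fetch the most similar transcript chunks from Pinecone.
async function semanticSearch(query: string, topK = 10) {
  const embedding = await embedChunk(query)
  const res = await index.query({
    vector: embedding,
    topK,
    includeMetadata: true
  })
  return res.matches ?? []
}
```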

We also support generating timestamp-based thumbnails of every YouTube video in the playlist. Thumbnails are generated using headless Puppeteer and are uploaded to Google Cloud Storage. We also post-process each thumbnail with lqip-modern to generate nice preview placeholder images.

If you want to generate thumbnails (optional), run:

```bash
npx tsx src/bin/generate-thumbnails.ts
```

Note that thumbnail generation takes ~2 hours and requires a pretty stable internet connection.
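
For reference, the capture-and-placeholder step might look roughly like the following (the Google Cloud Storage upload is omitted, and `lqip-modern`'s result shape is assumed from its docs):

```ts
import puppeteer from 'puppeteer'
import lqip from 'lqip-modern'

// Screenshot a video page at a given timestamp, then build a tiny blurred placeholder.
async function captureThumbnail(videoId: string, seconds: number) {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.setViewport({ width: 1280, height: 720 })
  await page.goto(`https://www.youtube.com/watch?v=${videoId}&t=${seconds}s`)
  const screenshot = await page.screenshot({ type: 'jpeg' })
  await browser.close()

  const preview = await lqip(Buffer.from(screenshot))
  return { screenshot, placeholder: preview.metadata.dataURIBase64 }
}
```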

The frontend is a Next.js webapp deployed to Vercel that uses our Pinecone index as a primary data store.
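
One plausible shape for the frontend's search endpoint is a Next.js API route that delegates to the query logic sketched earlier (the route and parameter names are hypothetical):

```ts
import type { NextApiRequest, NextApiResponse } from 'next'

// Hypothetical endpoint: GET /api/search?query=...
export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  const query = req.query.query as string
  if (!query) return res.status(400).json({ error: 'missing query' })
  const matches = await semanticSearch(query) // from the sketch above
  res.status(200).json(matches)
}
```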

## TODO

## Feedback

Have an idea on how this webapp could be improved? Find a particularly fun search query?

Feel free to send me feedback, either on GitHub or Twitter. 💯

## Credit

## License

MIT © Travis Fischer

If you found this project interesting, please consider sponsoring me or <a href="https://twitter.com/transitive_bs">following me on Twitter <img src="https://storage.googleapis.com/saasify-assets/twitter-logo.svg" alt="twitter" height="24px" align="center"></a>

The API and server costs add up over time, so if you can spare it, sponsoring on GitHub is greatly appreciated. 💕