Reader

Your LLMs deserve better input.

Reader does two things:

Read: it converts any URL into an LLM-friendly input by prepending https://r.jina.ai/ to it.
Search: it searches the web for a given query via https://s.jina.ai/ and returns LLM-friendly results.

Check out the live demo

Or just visit these URLs (Read) https://r.jina.ai/https://github.com/jina-ai/reader, (Search) https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F and see for yourself.

Feel free to use the Reader API in production. It is free, stable, and scalable, and we maintain it actively as one of the core products of Jina AI. Check out the rate limit.

<img width="973" alt="image" src="https://github.com/jina-ai/reader/assets/2041322/2067c7a2-c12e-4465-b107-9a16ca178d41"> <img width="973" alt="image" src="https://github.com/jina-ai/reader/assets/2041322/675ac203-f246-41c2-b094-76318240159f">

Updates

Usage

Using r.jina.ai for single URL fetching

Simply prepend https://r.jina.ai/ to any URL. For example, to convert the URL https://en.wikipedia.org/wiki/Artificial_intelligence to an LLM-friendly input, use the following URL:

https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
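The same conversion works from the command line, for example:

```bash
# Fetch the LLM-friendly version of the Wikipedia article on artificial intelligence
curl https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
```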

Using r.jina.ai for full website fetching (Google Colab)

Using s.jina.ai for web search

Simply prepend https://s.jina.ai/ to your search query. Note that if you are calling this from code, make sure to URL-encode your search query first. For example, if your query is Who will win 2024 US presidential election?, your URL should look like:

https://s.jina.ai/Who%20will%20win%202024%20US%20presidential%20election%3F
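If you are building the URL in a shell script, one way to percent-encode the query is with jq (a sketch that assumes jq is installed; any URL-encoding utility works just as well):

```bash
# Percent-encode the query string, then prepend https://s.jina.ai/
QUERY=$(jq -rn --arg q 'Who will win 2024 US presidential election?' '$q|@uri')
curl "https://s.jina.ai/$QUERY"
```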

Behind the scenes, Reader searches the web, fetches the top 5 results, visits each URL, and applies r.jina.ai to it. This is different from the web-search function calling in many agent/RAG frameworks, which often returns only the title, URL, and description provided by the search engine API. If you want to read one result more deeply, you have to fetch the content yourself from that URL. With Reader, http://s.jina.ai automatically fetches the content from the top 5 search result URLs for you (reusing the tech stack behind http://r.jina.ai). This means you don't have to handle browser rendering, blocking, or any issues related to JavaScript and CSS yourself.

Using s.jina.ai for in-site search

Simply specify one or more site values in the query parameters, for example:

curl 'https://s.jina.ai/When%20was%20Jina%20AI%20founded%3F?site=jina.ai&site=github.com'

Interactive Code Snippet Builder

We highly recommend using the code builder to explore different parameter combinations of the Reader API.

<a href="https://jina.ai/reader#apiform"><img width="973" alt="image" src="https://github.com/jina-ai/reader/assets/2041322/a490fd3a-1c4c-4a3f-a95a-c481c2a8cc8f"></a>

Using request headers

As you have already seen above, one can control the behavior of the Reader API using request headers. The sections below describe the supported headers in more detail.
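For example, several of these headers (all described further below) can be combined in a single request:

```bash
# Bypass the cache and request VLM-generated alt text in one call;
# both headers are covered in the sections that follow
curl -H 'x-no-cache: true' \
     -H 'x-with-generated-alt: true' \
     https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
```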

Using r.jina.ai for single page application (SPA) fetching

Many websites nowadays rely on JavaScript frameworks and client-side rendering; these are usually known as single-page applications (SPAs). Thanks to Puppeteer and a headless Chrome browser, Reader natively supports fetching these websites. However, due to the specific way some SPAs are built, there may be some extra precautions to take.

SPAs with hash-based routing

By web standards, the content after # in a URL (the fragment) is not sent to the server. To mitigate this, use the POST method with a url parameter in the body.

curl -X POST 'https://r.jina.ai/' -d 'url=https://example.com/#/route' 

SPAs with preloading contents

Some SPAs, and even some websites that are not strictly SPAs, may show placeholder content before loading the main content dynamically. In this case, Reader may capture the preload content instead of the main content. To mitigate this, here are some possible solutions:

Specifying x-timeout

When x-timeout is explicitly specified, Reader will not attempt to return early; instead, it waits for the network to become idle, up to the specified timeout. This is useful when the target website eventually reaches network idle.

curl 'https://r.jina.ai/https://example.com/' -H 'x-timeout: 30'
Specifying x-wait-for-selector

When x-wait-for-selector is explicitly specified, Reader will wait for the specified CSS selector to appear, until the timeout is reached. This is useful when you know exactly which element to wait for.

curl 'https://r.jina.ai/https://example.com/' -H 'x-wait-for-selector: #content'

Streaming mode

Streaming mode is useful when you find that the standard mode provides an incomplete result. This is because in streaming mode, Reader waits a bit longer until the page is stably rendered. Use the Accept header to toggle streaming mode:

curl -H "Accept: text/event-stream" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page

The data comes in as a stream; each subsequent chunk contains more complete information, and the last chunk should provide the most complete and final result. If you come from an LLM background, note that this is different behavior from LLMs' text-generation streaming.

For example, compare the two curl commands below. You can see that the streaming one eventually gives you the complete information, whereas the standard mode does not. This is because content loading on this particular site is triggered by some JavaScript after the page is fully loaded, and the standard mode returns the page "too soon".

curl -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853

Note: -H 'x-no-cache: true' is used only for demonstration purposes to bypass the cache.

Streaming mode is also useful if your downstream LLM/agent system requires immediate content delivery or needs to process data in chunks to interleave I/O and LLM processing times. This allows for quicker access and more efficient data handling:

Reader API:  streamContent1 ----> streamContent2 ----> streamContent3 ---> ... 
                          |                    |                     |
                          v                    |                     |
Your LLM:                 LLM(streamContent1)  |                     |
                                               v                     |
                                               LLM(streamContent2)   |
                                                                     v
                                                                     LLM(streamContent3)

Note that in terms of completeness: ... > streamContent3 > streamContent2 > streamContent1, each subsequent chunk contains more complete information.
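If your pipeline consumes the stream directly, a minimal shell sketch looks like this (the only assumption beyond the curl call above is that the response follows the standard server-sent-events data: line format):

```bash
# -N disables curl's output buffering so chunks are handled as soon as they arrive
curl -sN -H "Accept: text/event-stream" \
  https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page |
while IFS= read -r line; do
  case "$line" in
    # Each data: line carries a progressively more complete snapshot of the page
    data:*) printf 'received a chunk of %s characters\n' "${#line}" ;;
  esac
done
```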

JSON mode

This is still very early, and the result is not a particularly "useful" JSON: it contains only three fields, url, title, and content. Nonetheless, you can use the Accept header to control the output format:

curl -H "Accept: application/json" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page

JSON mode is probably more useful for s.jina.ai than for r.jina.ai. For s.jina.ai with JSON mode, it returns 5 results in a list, each with the structure {'title', 'content', 'url'}.
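For instance, piping the response through jq makes the structure easy to inspect (this sketch assumes jq is installed; adjust the filter once you have confirmed where the url, title, and content fields sit in the response):

```bash
# Pretty-print the JSON result; swap `jq .` for e.g. `jq -r '.content'`
# after checking the actual field layout
curl -s -H "Accept: application/json" \
  https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page | jq .
```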

Generated alt

All images on the page that lack an alt tag can be auto-captioned by a VLM (vision language model) and formatted as !(Image [idx]: [VLM_caption])[img_URL]. This should give your downstream text-only LLM just enough hints to include those images in its reasoning, selection, and summarization. Use the x-with-generated-alt header to toggle this behavior:

curl -H "X-With-Generated-Alt: true" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page

Install

You will need the following tools to run the project:

For the backend, go to the backend/functions directory and install the npm dependencies.

git clone git@github.com:jina-ai/reader.git
cd reader/backend/functions
npm install

What is thinapps-shared submodule?

You might notice a reference to the thinapps-shared submodule, an internal package we use to share code across our products. While it's not open-sourced and isn't integral to Reader's functionality, it mainly helps with decorators, logging, secrets management, etc. Feel free to ignore it for now.

That said, this is the single codebase behind https://r.jina.ai, so every time we commit here, we deploy the new version to https://r.jina.ai.

Having trouble on some websites?

Please raise an issue with the URL you are having trouble with. We will look into it and try to fix it.

License

Reader is backed by Jina AI and licensed under Apache-2.0.