Home

Awesome

ChatWeb

Open In Colab

English Doc 中文文档

ChatWeb can crawl any webpage or extract text from PDF, DOCX, TXT files, and generate an embedded summary. It can also answer your questions based on the content of the text. It is implemented using the chatAPI and embeddingAPI based on gpt3.5, as well as a vector database.

Basic Principle

The basic principle is similar to existing projects such as chatPDF and automated customer service AI.

Crawl web pages Extract text content Use GPT3.5's embedding API to generate vectors for each paragraph Calculate the similarity score between each paragraph's vector and the entire text's vector to generate a summary Store the vector-text mapping in a vector database Generate keywords from user input Generate a vector from the keywords Use the vector database to perform a nearest neighbor search and return a list of the most similar texts Use GPT3.5's chat API to design a prompt that answers the user's question based on the most similar texts in the list. The idea is to extract relevant content from a large amount of text and then answer questions based on that content, which can achieve a similar effect to breaking through token limits.

An improvement was made to generate vectors based on keywords rather than the user's question, which increases the accuracy of searching for relevant texts.

Getting Started

Manual installation:

Docker:

if you prefer, you can also run this project using docker:

Set language

Mode Selection

Stream Mode

Setting the Temperature

OpenAI Proxy Settings

"open_ai_proxy": {
  "http": "socks5://127.0.0.1:1081",
  "https": "socks5://127.0.0.1:1081"
}

Install PostgreSQL (Optional)

Compile and install the extension (support Postgres 11+).

git clone --branch v0.4.0 https://github.com/pgvector/pgvector.git
cd pgvector
make
make install # may need sudo

Then load it in the database you want to use it in

CREATE EXTENSION vector;

Example

Please enter the link to the article or the file path of the PDF/TXT/DOCX document: https://gutenberg.ca/ebooks/hemingwaye-oldmanandthesea/hemingwaye-oldmanandthesea-00-e.html
Please wait for 10 seconds until the webpage finishes loading.
The article has been retrieved, and the number of text fragments is: 663
...
=====================================
Query fragments used tokens: 7219, cost: $0.0028876
Query fragments used tokens: 7250, cost: $0.0029000000000000002
Query fragments used tokens: 7188, cost: $0.0028752
Query fragments used tokens: 7177, cost: $0.0028708
Query fragments used tokens: 2378, cost: $0.0009512000000000001
Embeddings have been created with 663 embeddings, using 31212 tokens, costing $0.0124848
The embeddings have been saved.
=====================================
Please enter your query (/help to view commands):

TODO

Star History