
Standard Template Construct

Welcome, developer! You've arrived at the repository for STC: a library, search engine, and set of AI tools offering free access to academic knowledge and works of fiction.

STC | Help Center

Getting Started

Details

In essence, STC is the Summa search engine coupled with databanks. These databanks reside on IPFS in a format that allows searching without downloading the entire dataset. The search engine library can run as a standalone server, as an embeddable Python library (requiring no additional software!), or as a WASM-compiled module that runs in a browser. The last option makes it possible to embed the search engine in a static site that can itself be deployed over IPFS. This is how Web STC is served.

Putting everything on IPFS lets you open STC in your browser or on your server and avoid centralized servers that may lose or censor data.
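The idea of searching a remote databank without downloading all of it can be illustrated with a toy example. The sketch below is not Summa's actual index format or API. `RangeReader`, `build_index`, and `search` are hypothetical names invented for illustration, and the in-memory blob stands in for a file served over IPFS that supports byte-range reads. It only shows the general principle: read a small directory first, then fetch just the bytes that the query needs.

```python
from __future__ import annotations

import json
import struct


class RangeReader:
    """Hypothetical stand-in for a remote file that supports byte-range reads."""

    def __init__(self, blob: bytes) -> None:
        self._blob = blob
        self.bytes_fetched = 0  # how much a real client would have transferred

    def size(self) -> int:
        return len(self._blob)

    def read(self, offset: int, length: int) -> bytes:
        self.bytes_fetched += length
        return self._blob[offset:offset + length]


def build_index(docs: dict[int, str]) -> bytes:
    """Pack a toy inverted index as: postings | JSON directory | 16-byte footer."""
    postings: dict[str, list[int]] = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            postings.setdefault(term, []).append(doc_id)

    body = bytearray()
    directory: dict[str, tuple[int, int]] = {}
    for term in sorted(postings):
        ids = sorted(postings[term])
        encoded = struct.pack(f"<{len(ids)}I", *ids)
        directory[term] = (len(body), len(encoded))  # (offset, length)
        body += encoded

    directory_bytes = json.dumps(directory).encode("utf-8")
    # The footer records where the directory starts and how long it is.
    footer = struct.pack("<QQ", len(body), len(directory_bytes))
    return bytes(body) + directory_bytes + footer


def search(reader: RangeReader, term: str) -> list[int]:
    """Return matching document ids using three small range reads."""
    dir_offset, dir_len = struct.unpack("<QQ", reader.read(reader.size() - 16, 16))
    directory = json.loads(reader.read(dir_offset, dir_len))
    entry = directory.get(term.lower())
    if entry is None:
        return []
    offset, length = entry
    raw = reader.read(offset, length)  # only this term's postings are fetched
    return list(struct.unpack(f"<{length // 4}I", raw))


if __name__ == "__main__":
    docs = {
        1: "open access to science",
        2: "science fiction novels",
        3: "open source search engine",
    }
    reader = RangeReader(build_index(docs))
    print(search(reader, "science"), reader.bytes_fetched, "of", reader.size(), "bytes")
```

In this toy layout the directory is read in full on every query, so the savings are modest. A production index would keep its term dictionary compact and cache frequently used ranges.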

Components

Roadmap

| Part | Task | Description |
|---|---|---|
| Library Stewardship | ✅ Assimilation of LibGen corpus | Transition of all items to nexus_science |
| | 🚧 Assimilation of SciMag corpus | Significant task of transferring the scimag corpus to IPFS |
| | ✅ Structured content | Enhance GROBID extraction (headers + content) and store content in the structured_content JSON column. Extract entities for cross-linking in Web STC |
| | 🚧 Implementing classification (articles, books) | |
| Web STC | UX improvement | STC often needs to load large chunks of data, which is currently indicated only by a spinner. The UX needs improvement. Once structured content is in place, we can highlight headers and generate cross-links in abstracts/content |
| | Enhancing availability | Further testing needed on diverse devices and networks |
| | Bookshelf | STC has all the tools for generating bookshelves that can offer users high-quality reading suggestions |
| Cybrex AI | First-class support of local LLM | Extensive testing of prompts with documents is required to identify the smallest model capable of efficiently executing QA and summarization tasks. Most 13-15B models are currently failing (quantized, on CPU) |
| | Building an embeddings dataset | The goal is to build a comprehensive dataset with DOIs and document embeddings. Currently, the Instructor XL model appears most promising, but further testing is necessary |
| | Refining and fixing metadata (cleaning content) | Areas for improvement include: detected language, tags, keywords, automated abstracts, Dewey classification |
| | Build QA on local LLM | Such a system should be independently operable and also accessible via Telegram |
| | Fine-tuning LLMs on STC | |
| Distribution | Building STC Box | Develop and maintain a definitive guide and scripts for replicating and launching STC on compact devices like Pi computers or TV boxes |
| | Global replication | The goal is to replicate STC (including the search database and papers) a minimum of 100 times across at least 30 countries |
| | Establishing Frontier Outposts | Investigate strategies to replicate STC on an orbiting satellite or another planet in the solar system (Mars or Europa preferred) |
| Communities | Forming Science Communities on Telegram | Initiate the first version of Telegram-based forums focusing on specific scientific topics |
| | Addressing Copyright Issues | Organize more activities aimed at challenging the copyright laws for scholarly and educational writings |