url_cache
This is currently very alpha and in development, so expect changes to the API/interface. It aims to walk the line between extracting enough text/data to be useful, and not so much that it takes enormous amounts of space.
As it stands, I'm sort of pessimistic this will ever be a silver bullet; getting useful info out of arbitrary HTML is hard, so you're sort of stuck writing parsers for each website you're interested in. However, I still use this frequently, especially as a cache for API information, as described below.
Current TODOs:
- Add more sites using the abstract interface, to get more info from sites I use commonly. Ideally, this should be able to reuse common scraper/parser/API interface libraries in python, instead of recreating everything from scratch
- Create a (separate repo/project) daemon which handles configuring this and slowly requests things in the background as they become available through given sources; allow the user to provide generators/inputs and define include/exclude lists/regexes. Probably just integrate with promnesia to avoid duplicating the work of searching for URLs on disk
Installation
Requires python3.8+
To install with pip, run:
python3 -m pip install url_cache
As this is still in development, to get the latest changes install from git:
python3 -m pip install git+https://github.com/purarue/url_cache
Rationale
A file system cache which saves URL metadata and summarizes content
This is meant to provide more context to any of my tools which use URLs. If I watched some youtube video and I have a URL, I'd like to have the subtitles for it, so I can do a text-search over all the videos I've watched. If I read an article, I want the article text! This requests, parses and abstracts away that data for me locally, so I can do something like:
>>> from url_cache.core import URLCache
>>> u = URLCache()
>>> data = u.get("https://purarue.xyz/")
>>> data.metadata["images"][-1]
{'src': 'https://raw.githubusercontent.com/purarue/glue/master/assets/screenshot.png', 'alt': 'screenshot', 'type': 'body_image', 'width': 600}
>>> data.metadata["description"]
"purarue's Home Page"
If I ever request that URL again, the information is grabbed from a local cache instead.
Generally, this uses:
- lassie to get generic metadata: the title, description, opengraph information, and links to images/videos on the page
- readability to convert/compress HTML to a summary of the HTML content
Site-Specific Extractors:
- Youtube: to get manual/auto-generated captions (converted to an .srt file) from Youtube URLs
- Stackoverflow (just a basic URL preprocessor to reduce the possibility of conflicts/duplicate data)
- MyAnimeList (using Jikan v4)
This is meant to be extensible -- it's possible for you to write your own extractors/file loaders/dumpers (for new formats, e.g. srt) for sites you use commonly and pass those to url_cache.core.URLCache to extract richer data for those sites. Otherwise, it saves the information from lassie and the summarized HTML from readability for each URL.
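For reference, this is roughly what those two underlying libraries provide on their own; a standalone sketch, not url_cache's internal code:

import lassie                     # pip install lassie
import requests
from readability import Document  # pip install readability-lxml

url = "https://purarue.xyz/"
# generic metadata: title, description, opengraph info, images/videos on the page
metadata = lassie.fetch(url)
print(metadata.get("title"))
# "reader mode"-style summary of the HTML content
summary_html = Document(requests.get(url).text).summary()
print(len(summary_html))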
To avoid scope creep, this probably won't support:
- Converting the HTML summary to text (use something like the lynx command below)
- Minimizing HTML -- run something like:
  find ~/.local/share/url_cache/ -name '*.html' -exec <some tool/script that minimizes in place> \;
  instead; the data is just stored in individual files in the data directory
Usage:
In Python, this can be configured by using the url_cache.core.URLCache class. For example:
import logging
from url_cache.core import URLCache
# make requests every 2 seconds
# debug logs
# save to a folder in my home directory
cache = URLCache(loglevel=logging.DEBUG, sleep_time=2, cache_dir="~/Documents/urldata")
c = cache.get("https://github.com/purarue")
# just request information, don't read/save to cache
data = cache.request_data("https://www.wikipedia.org/")
For more information, see the docs
The CLI interface provides some utility commands to get/list information from the cache.
Usage: url_cache [OPTIONS] COMMAND [ARGS]...
Options:
--cache-dir PATH Override default cache directory location
--debug / --no-debug Increase log verbosity
--sleep-time INTEGER How long to sleep between requests
--summarize-html / --no-summarize-html
Use readability to summarize html. Otherwise
saves the entire HTML document
--skip-subtitles / --no-skip-subtitles
Skip downloading Youtube Subtitles
--subtitle-language TEXT Subtitle language for Youtube Subtitles
--help Show this message and exit.
Commands:
cachedir Prints the location of the local cache directory
export Print all cached information as JSON
get Get information for one or more URLs Prints results as JSON
in-cache Prints if a URL is already cached
list List all cached URLs
An environment variable URL_CACHE_DIR can be set, which changes the default cache directory.
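For example, to point the CLI at a different cache directory for a single run (the directory here is just a placeholder):

$ URL_CACHE_DIR=/tmp/url_cache_test url_cache list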
API Cache Examples
I've also successfully used this to cache responses from API results in some of my projects, by subclassing and overriding the request_data function. I just make a request and return a summary, and it transparently caches the rest.
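A rough sketch of that pattern; the Summary import path and its constructor arguments are assumptions here, and the request_data signature is simplified (check the docs for the real names):

import requests
from url_cache.core import URLCache
from url_cache.summary import Summary  # assumed import location

class APICache(URLCache):
    def request_data(self, url):
        # hit a JSON API directly instead of requesting/summarizing the HTML page
        resp = requests.get(url)
        resp.raise_for_status()
        # return only a summary of the response; URLCache.get caches whatever this returns
        return Summary(url=url, metadata={"api_response": resp.json()})

cache = APICache(cache_dir="~/Documents/apidata")
data = cache.get("https://api.jikan.moe/v4/anime/1")  # hypothetical endpoint; cached after the first call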
CLI Examples
The get command emits JSON, so it can be used with other tools (e.g. jq) like:
$ url_cache get "https://click.palletsprojects.com/en/7.x/arguments/" | \
jq -r '.[] | .html_summary' | lynx -stdin -dump | head -n 5
Arguments
Arguments work similarly to [1]options but are positional. They also
only support a subset of the features of options due to their
syntactical nature. Click will also not attempt to document arguments
$ url_cache export | jq -r '.[] | .metadata | .title'
purarue - Overview
Arguments — Click Documentation (7.x)
$ url_cache list --location
/home/username/.local/share/url_cache/data/2/c/7/6284b2f664f381372fab3276449b2/000
/home/username/.local/share/url_cache/data/7/5/1/70fc230cd88f32e475ff4087f81d9/000
# to make a backup of the cache directory
$ tar -cvzf url_cache.tar.gz "$(url_cache cachedir)"
Accessible through the url_cache script and python3 -m url_cache
Implementation Notes
This stores all of this information as individual files in a cache directory. In particular, it MD5 hashes the URL and stores information like:
.
└── a
└── a
└── e
└── cf0118bb22340e18fff20f2db8abd
└── 000
├── data
│ └── subtitles.srt
├── key
├── metadata.json
└── timestamp.datetime.txt
In other words, this is a file system hash table which implements separate chaining.
You're free to delete any of the directories in the cache if you want; this doesn't maintain a strict index, it uses a hash of the URL and then searches for a matching key file.
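As a sketch, the on-disk location appears to be derived roughly like this (inferred from the example paths above, not copied from url_cache's source):

import hashlib
from pathlib import Path

def key_dir(base, url, chain_index=0):
    digest = hashlib.md5(url.encode()).hexdigest()  # 32 hex characters
    # the first three characters become nested directories, the rest one directory,
    # and the zero-padded index is the separate-chaining bucket used on collisions
    return Path(base).expanduser() / digest[0] / digest[1] / digest[2] / digest[3:] / format(chain_index, "03d")

print(key_dir("~/.local/share/url_cache/data", "https://purarue.xyz/"))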
By default this waits 5 seconds between requests. Since all the info is cached, I use this by requesting all the info from one data source (e.g. my bookmarks, or videos I've watched recently) in a loop in the background, which saves all the information to my computer. The next time I do that same loop, it doesn't have to make any requests and it just grabs all the info from local cache.
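That background loop can be as simple as something like this (the URL list here is a stand-in for whatever data source you're iterating over):

from url_cache.core import URLCache

# stand-in for URLs pulled from some data source (bookmarks, watch history, ...)
urls = [
    "https://purarue.xyz/",
    "https://www.wikipedia.org/",
]

cache = URLCache(sleep_time=5)
for url in urls:
    cache.get(url)  # network request the first time, local cache on later runs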
Originally created for HPI.
Testing
git clone 'https://github.com/purarue/url_cache'
cd ./url_cache
pip install '.[testing]'
mypy ./src/url_cache
flake8 ./src/url_cache
pytest