Awesome
PyPruningRadixTrie
Python Port of Pruning Radix Trie by Wolf Garbe.
Changes compared to original version
- Removed parameter to disable pruning behavior.
- See
test/non_pruning_radix_trie.py
for a non-pruning version that you can use to see the speed improvement.
- See
- Added outline for more generic
InputProvider
and providers that read CSV or JSON as examples
What and Why
A Trie is a tree data structure that is commonly used for searching terms
that start with a given prefix.
It starts with an empty string at the base of the trie, the root node.
Adding a new entry to the trie creates a new branch. This branch shares already present characters with existing nodes
and creates new nodes when it's prefix diverges from the present entries.
# trie containing flower & flowchart (1 char = 1 node)
'' - f - l - o - w - e - r
|
c - h - a - r - t
A RadixTrie is the space optimized version of a Trie.
It combines the nodes with only one sub-node into one, containing more than one character.
# radix trie containing flower & flowchart
'' - flow - er
|
chart
The prefix Pruning references the algorithm to query the RadixTrie.
In order for the pruning to work, the nodes in RadixTrie are stored in a sorted manner.
This structure allows cutting off unpromising branches during querying the trie which makes the algorithm way faster
compared to a non-pruning trie.
Usage
Get from PyPI:
pip install pypruningradixtrie
Create the PRT:
# empty trie
trie = PruningRadixTrie()
# fill with data from CSV file on creation
trie = PruningRadixTrie('./test_data.csv', CSVInputProvider(',', lambda x: float(x[1])))
Add entries:
CSV:
# fill with data from CSV file, score is at position 1, term at position 0
fill_trie_from_file(trie, './test_data.csv', CSVInputProvider(',', lambda x: float(x[1]), 0))
JSON:
# define a functon to calculate the score out of a JSON entry
def score_fun(json_entry: Dict[str, Any]) -> float:
return json_entry["pages"] * json_entry["year"] / 10.0
# "title" = key for term to insert into PRT
fill_trie_from_file(trie, './test_data.json', JSONInputProvider("title", score_fun))
Single Entry:
# insert single entry
insert_term(trie, term="flower", score=20)
Use the PRT:
# get the top 10 entries that start with 'flower'
trie.get_top_k_for_prefix('flower', 10)