# lucene-grep

Grep-like utility based on Lucene Monitor, compiled with GraalVM native-image.
## Features

- Supports Lucene query syntax as described here
- Multiple queries can be provided
- Queries can be loaded from a file
- Supports Lucene text analysis configuration for:
  - char filters
  - tokenizers
  - token filters
  - stemmers for multiple languages
  - predefined analyzers
- Supports multiple query parsers (classic, complex phrase, standard, simple, and surround)
- Text output is colored or separated with customizable tags
- Supports printing file names as hyperlinks for click-to-open (check support for your terminal here)
- Text output supports templates
- Scoring mode (disables highlighting for now)
- Output can be formatted as JSON or EDN
- Supports text input from STDIN
- Supports filtering files with a GLOB file pattern
- Supports excluding files from processing with a GLOB
- Compiled with the GraalVM native-image tool
- Supports Linux, macOS, and Windows
- Fast startup, which makes it usable as a CLI utility
Startup time and memory usage as measured with the `time` utility on my Linux laptop:

<img src="docs/time-memory-usage.png" alt="Startup time and memory usage" title="Startup time and memory usage" />
The default output format is: `[FILE_PATH]:[LINE_NUMBER]:[LINE_WITH_A_COLORED_HIGHLIGHT]`

NOTE: the output is not compatible with `grep`. Compared with `grep`, the functionality is limited in most aspects.
## Quickstart

### Brew

MacOS and Linux binaries are provided via brew.

Install:

```bash
brew install dainiusjocas/brew/lmgrep
```

Upgrade:

```bash
brew upgrade lmgrep
```

### Docker

`lmgrep` is deployed to Docker Hub:

```bash
echo "Lucene is awesome" | docker run -i dainiusjocas/lmgrep /lmgrep lucene
```
### Windows

On Windows you can install using scoop and the scoop-clojure bucket.

Or just follow these concrete steps:

```powershell
# Note: if you get an error you might need to change the execution policy (i.e. enable PowerShell) with
# Set-ExecutionPolicy RemoteSigned -scope CurrentUser
Invoke-Expression (New-Object System.Net.WebClient).DownloadString('https://get.scoop.sh')
scoop bucket add scoop-clojure https://github.com/littleli/scoop-clojure
scoop bucket add extras
scoop install lmgrep
```
### Other platforms

Just grab a binary from the GitHub releases, extract it, and place it anywhere on the `$PATH`.

If you're running macOS, grant run permissions to the executable binary:

```bash
sudo xattr -r -d com.apple.quarantine lmgrep
```

Then run it:

```bash
echo "Lucene is awesome" | ./lmgrep "Lucene"
```
## Examples

Example of running `lmgrep`:

```bash
./lmgrep "main" "*.{clj,edn}"
```
=>
```
./src/core.clj:44:(defn -main [& args]
./deps.edn:22: :main-opts ["-m" "cognitect.test-runner"]}
./deps.edn:24: {:main-opts ["-m" "clj-kondo.main --lint src test"]
./deps.edn:28: {:main-opts ["-m clj.native-image core"
```

The default output is somewhat similar to `grep`, for example:

```bash
grep -n -R --include=\*.{edn,clj} "main" ./
```
=>
```
./deps.edn:22: :main-opts ["-m" "cognitect.test-runner"]}
./deps.edn:24: {:main-opts ["-m" "clj-kondo.main --lint src test"]
./deps.edn:26: :jvm-opts ["-Dclojure.main.report=stderr"]}
./deps.edn:28: {:main-opts ["-m clj.native-image core"
```
Supports input from STDIN:

```bash
cat README.md | ./lmgrep "monitor lucene"
```

TIP: write your Lucene query within double quotes.

Various options with a GLOB file pattern example:

```bash
./lmgrep --case-sensitive\?=false --ascii-fold\?=true --stem\?=true --tokenizer=whitespace "lucene" "**/*.md"
```

TIP: write GLOB file patterns within double quotes.

Files can also be excluded with a GLOB pattern:

```bash
./lmgrep "lucene" "**/*.md" --excludes="README.md"
```

TIP: a GLOB pattern is treated as recursive if it contains "**"; otherwise the GLOB is matched only against the file name, as illustrated below.
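A quick illustration of the difference (a sketch, assuming Markdown files exist both in the current directory and in subdirectories):

```bash
./lmgrep "lucene" "*.md"    # no "**": the GLOB is matched against file names only
./lmgrep "lucene" "**/*.md" # recursive: subdirectories such as docs/ are searched too
```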
Provide multiple queries:

```bash
echo "Lucene is\n awesome" | lmgrep --query=lucene --query=awesome
```
=>
```
*STDIN*:1:Lucene is
*STDIN*:2: awesome
```

Provide Lucene queries in a file:

```bash
echo "The quick brown fox jumps over the lazy dog" | ./lmgrep --queries-file=test/resources/queries.json --format=json
```
=>
```
{"line-number":1,"line":"The quick brown fox jumps over the lazy dog"}
```
The contents of the Lucene queries file are in JSON format, e.g.:

```json
[
  {
    "query": "fox"
  },
  {
    "query": "dog"
  }
]
```

NOTE: whether the Lucene queries are specified as a positional argument, with the `-q` or `--query` params, or with `--queries-file`, all the queries are concatenated into one list.
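For example, a query passed with `--query` and the queries from a queries file are merged into a single query list (a sketch reusing the queries file from above):

```bash
# "fox" and "dog" come from the queries file; "lazy" is added on top
echo "The quick brown fox jumps over the lazy dog" | \
  ./lmgrep --query="lazy" --queries-file=test/resources/queries.json
```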
## Deviations from Lucene query syntax

- Field names are not supported because there are no field names in a line of text.
## Supported options

```text
Usage: lmgrep [OPTIONS] LUCENE_QUERY [FILES]
Supported options:
  -q, --query QUERY                          Lucene query string(s). If specified then all the positional arguments are interpreted as files.
      --query-parser QUERY_PARSER            Which query parser to use, one of: [classic complex-phrase simple standard surround]
      --queries-file QUERIES_FILE            A file path to the Lucene query strings with their config. If specified then all the positional arguments are interpreted as files.
      --queries-index-dir QUERIES_INDEX_DIR  A directory where Lucene Monitor queries are stored.
      --tokenizer TOKENIZER                  Tokenizer to use, one of: [keyword letter standard unicode-whitespace whitespace]
      --case-sensitive? CASE_SENSITIVE       If text should be case sensitive
      --ascii-fold? ASCII_FOLDED             If text should be ascii folded
      --stem? STEMMED                        If text should be stemmed
      --stemmer STEMMER                      Which stemmer to use for token stemming, one of: [arabic armenian basque catalan danish dutch english estonian finnish french german german2 hungarian irish italian kp lithuanian lovins norwegian porter portuguese romanian russian spanish swedish turkish]
      --presearcher PRESEARCHER  no-filtering  Which Lucene Monitor Presearcher to use, one of: [multipass-term-filtered no-filtering term-filtered]
      --with-score                           If the matching score should be computed
      --format FORMAT                        How the output should be formatted, one of: [edn json string]
      --template TEMPLATE                    The template for the output string, e.g.: file={{file}} line-number={{line-number}} line={{line}}
      --pre-tags PRE_TAGS                    A string that the highlighted text is wrapped in, use in conjunction with --post-tags
      --post-tags POST_TAGS                  A string that the highlighted text is wrapped in, use in conjunction with --pre-tags
      --excludes EXCLUDES                    A GLOB that filters out files that were matched with a GLOB
      --skip-binary-files                    If a file that is detected to be binary should be skipped. Available for Linux and MacOS only.
      --[no-]hidden                          Search in hidden files. Default: true.
      --max-depth N                          In case of a recursive GLOB, how deep to search for input files.
      --with-empty-lines                     When provided, write an empty line to STDOUT for input that does not match.
      --with-scored-highlights               ALPHA: Instructs to highlight with scoring.
      --[no-]split                           If a file (or STDIN) should be split by newline.
      --hyperlink                            If file paths should be printed as hyperlinks.
      --with-details                         For JSON and EDN output adds the raw highlights list.
      --word-delimiter-graph-filter WDGF     WordDelimiterGraphFilter configurationFlags as per https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html
      --show-analysis-components             Just print out the available analysis components in JSON.
      --only-analyze                         When provided, the output will be the analyzed text.
      --explain                              Modifies --only-analyze. Output is detailed token info, similar to the Elasticsearch Analyze API.
      --graph                                Modifies --only-analyze. Output is a string that can be fed to the `dot` program.
      --analysis ANALYSIS  {}                The analysis chain configuration
      --query-parser-conf CONF               The configuration for the query parser.
      --concurrency CONCURRENCY  8           How many concurrent threads to use for processing.
      --queue-size SIZE  1024                Number of lines read before being processed.
      --reader-buffer-size BUFFER_SIZE       Buffer size of the BufferedReader in bytes.
      --writer-buffer-size BUFFER_SIZE       Buffer size of the BufferedWriter in bytes.
      --[no-]preserve-order                  If the input order should be preserved.
      --config-dir DIR                       A base directory from which to load text analysis resources, e.g. synonym files. Default: current dir.
      --analyzers-file FILE                  A file that contains definitions of text analyzers. Works in combination with the --config-dir flag.
      --query-update-buffer-size NUMBER      Number of queries to be buffered in memory before being committed to the query index. Default: 100000.
      --streamed                             Listens on STDIN for JSON with both a query and a piece of text to be analyzed
  -h, --help
```
NOTE: question marks in the zsh shell must be escaped, e.g. `--case-sensitive\?=true`, or the flag must be written within double quotes, e.g. `"--case-sensitive?=true"`.
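Both forms in action:

```bash
./lmgrep --case-sensitive\?=true "Lucene" "**/*.md"
./lmgrep "--case-sensitive?=true" "Lucene" "**/*.md"
```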
## Text Analysis

The text analysis pipeline can be declaratively specified with the `--analysis` flag, e.g.:

```bash
echo "<p>foo bars baz</p>" | \
  ./lmgrep \
  --only-analyze \
  --analysis='
  {
    "char-filters": [
      {"name": "htmlStrip"},
      {
        "name": "patternReplace",
        "args": {
          "pattern": "foo",
          "replacement": "bar"
        }
      }
    ],
    "tokenizer": {"name": "standard"},
    "token-filters": [
      {"name": "englishMinimalStem"},
      {"name": "uppercase"}
    ]
  }
  '
```
=>
```
["BAR","BAR","BAZ"]
```
The action inside `lmgrep` is as follows:

1. Char filters are applied in order:
   1. `htmlStrip` is applied, which removes `<p>` and `</p>` from the string (i.e. `foo bars baz`)
   2. `patternReplace` is applied, which replaces `foo` with `bar` (i.e. `bar bars baz`)
2. Tokenization is performed (i.e. `[bar bars baz]`)
3. Token filters are applied in order:
   1. `englishMinimalStem` is applied, which stems the tokens (i.e. `[bar bar baz]`)
   2. `uppercase` is applied (i.e. `[BAR BAR BAZ]`)
4. The resulting list of tokens is printed to STDOUT.

You can peel the analysis config layer by layer and inspect the intermediate results.
For the full list of supported analysis components, see the documentation.
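You can also list the analysis components available in your build straight from the CLI (piped through `jq` here just for readability):

```bash
./lmgrep --show-analysis-components | jq
```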
### Default text analysis

If analysis is not specified, then the default analysis pipeline is used, which looks like:

```bash
--analysis='
{
  "tokenizer": {
    "name": "standard"
  },
  "token-filters": [
    {
      "name": "lowercase"
    },
    {
      "name": "asciifolding"
    },
    {
      "name": "englishMinimalStem"
    }
  ]
}
'
```
### Predefined analyzers

```bash
echo "dogs and cats" | ./lmgrep --only-analyze --analysis='{"analyzer": {"name": "English"}}'
```
=>
```
["dog","cat"]
```

Note that stopwords were removed and stemming was applied.
The full list of predefined analyzers can be found here.
### Tips on Analysis Configuration

The analysis configuration must be valid JSON, and for your use case it might make sense to store it in a file.

Store the analysis config in a file:

```bash
echo '{"analyzer": {"name": "English"}}' > analysis-conf.json
```

Run the text analysis:

```bash
echo "dogs and cats" | ./lmgrep --only-analyze --analysis="$(cat analysis-conf.json)"
```

If your JSON spans multiple lines, ask for a little help from `jq`:

```bash
echo "dogs and cats" | ./lmgrep --only-analyze --analysis=$(jq -c . analysis-conf.json)
```
### What about resources for analyzers?

Some token filters require a file as an argument, e.g. `StopFilterFactory` requires `words`, which is a file.
By default, Lucene would load the file specified under `words` from the classpath.
However, `lmgrep` is a single binary, and there the notion of a classpath makes little sense.
To support analysis components that expect files, the Lucene class that loads files was patched to support arbitrary files.
E.g. create a stopwords file:

```bash
echo "foo\nbar" > my-stopwords.txt
```

Run the analysis:

```bash
echo "foo bar baz" | \
  ./lmgrep \
  --only-analyze \
  --analysis='
  {
    "token-filters": [
      {
        "name": "stop",
        "args": {
          "words": "my-stopwords.txt"
        }
      }
    ]
  }
  '
```
=>
```
["baz"]
```

See, the custom stopwords were removed.
Creating files in arbitrary places might be OK for one-off scripts, but it creates some mess.
Therefore, consider creating a folder for your analysis component resources, such as `$HOME/.lmgrep`:

```bash
export LMGREP_HOME=$HOME/.lmgrep
echo "foo\nbar" > $LMGREP_HOME/my-stopwords.txt
echo "foo bar baz" | \
  ./lmgrep \
  --only-analyze \
  --analysis='
  {
    "token-filters": [
      {
        "name": "stop",
        "args": {
          "words": "'$LMGREP_HOME'/my-stopwords.txt"
        }
      }
    ]
  }
  '
```
=>
```
["baz"]
```
### Analysis in the queries file

Every query in the queries file can provide its own analysis configuration, e.g.:

```json
[
  {
    "id": "0",
    "query": "dogs",
    "analysis": {
      "analyzer": {
        "name": "English"
      }
    }
  },
  {
    "id": "1",
    "query": "dogs",
    "analysis": {
      "tokenizer": {
        "name": "whitespace"
      }
    }
  }
]
```
For each unique analysis configuration, a pair of a Lucene Analyzer and an internal field name is created. Then the Lucene Monitor is run over all queries, each against its respective field with its own analyzer, for every text input.
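A sketch of the consequence, assuming the two queries above are saved to a hypothetical file named `per-query-analysis.json`: the English analyzer stems both the query and the input, while the whitespace tokenizer matches only the exact token.

```bash
# matches query "0" (the English analyzer stems "dogs" -> "dog"),
# but not query "1" (the whitespace tokenizer keeps the exact token "dogs")
echo "dog" | ./lmgrep --queries-file=per-query-analysis.json
```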
## WordDelimiterGraphFilter

Using the WordDelimiterGraphFilter might help to tokenize text in various ways, e.g.:

```bash
echo "test class" | ./lmgrep "TestClass" --word-delimiter-graph-filter=99
```
=>
```
*STDIN*:1:test class
```

```bash
echo "TestClass" | ./lmgrep "test class" --word-delimiter-graph-filter=99
```
=>
```
*STDIN*:1:TestClass
```

The number 99 is a sum of the configuration flags as described here.
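For reference, assuming the flag values from the linked Lucene javadoc, 99 decomposes as:

```bash
# 99 =  1 (GENERATE_WORD_PARTS)
#    +  2 (GENERATE_NUMBER_PARTS)
#    + 32 (PRESERVE_ORIGINAL)
#    + 64 (SPLIT_ON_CASE_CHANGE)
```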
## Phrase Matching with Slop

To match a phrase you need to put it in double quotes:

```bash
echo "GraalVM is awesome" | ./lmgrep "\"graalvm is\""
```
=>
```
*STDIN*:1:GraalVM is awesome
```

By default, when the phrase terms are not exactly one after another, there is no match, e.g.:

```bash
echo "GraalVM is awesome" | ./lmgrep "\"graalvm awesome\""
```
=>

We can provide a slop parameter, e.g. `~2`, to allow some number of "substitutions" of terms in the document text, e.g.:

```bash
echo "GraalVM is awesome" | ./lmgrep "\"graalvm awesome\"~2"
```
=>
```
*STDIN*:1:GraalVM is awesome
```

As a side effect, when the slop is big enough, terms can match out of order, e.g.:

```bash
echo "GraalVM is awesome" | ./lmgrep "\"awesome graalvm\"~3"
```
=>
```
*STDIN*:1:GraalVM is awesome
```

However, if order is important, there is no way to enforce it in the Lucene query syntax.
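For contrast, the same out-of-order phrase without a slop produces no match:

```bash
echo "GraalVM is awesome" | ./lmgrep "\"awesome graalvm\""
# => no output: the terms are out of order and no slop is given
```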
## Lucene query parsers

Currently, 5 Lucene query parsers are supported: classic, complex-phrase, simple, standard, and surround.
### Query parser configuration

Additional configuration for query parsers can be passed with the `--query-parser-conf` flag, e.g.:

```bash
./lmgrep "query" --query-parser-conf='{"allow-leading-wildcard": false}'
```

The value must be a JSON string. For the supported configuration values, consult the documentation.
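As a hedged sketch of the effect, assuming the default (classic) query parser honors this setting: enabling `allow-leading-wildcard` makes a query that starts with a wildcard legal.

```bash
echo "Lucene is awesome" | \
  ./lmgrep "*ucene" --query-parser-conf='{"allow-leading-wildcard": true}'
```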
## Development

Requirements:

- Clojure CLI
- Babashka
- Maven
- GraalVM with the `native-image` tool installed and on `$PATH`
- GNU Make
- Docker (just for rebuilding the Linux native image)
Build an executable for your platform:

```bash
make build
```

It will create an executable binary file named `lmgrep` in the root directory of the repository.

Run the tests:

```bash
make test
```

Lint the code with clj-kondo:

```bash
bb lint
```
## Print results with a custom format

```bash
./lmgrep --template="FILE={{file}} LINE_NR={{line-number}} LINE={{highlighted-line}}" "test" "**.md"
```

| Template Variable      | Notes                                                 |
|------------------------|-------------------------------------------------------|
| `{{file}}`             | File name                                             |
| `{{line-number}}`      | Line number where the text matched the query          |
| `{{highlighted-line}}` | Line that matched the query with highlighters applied |
| `{{line}}`             | Line that matched the query                           |
| `{{score}}`            | Score of the match (summed)                           |
When `{{highlighted-line}}` is used, the `--pre-tags` and `--post-tags` options are available, e.g.:

```bash
echo "some text to to match" | lmgrep "text" --pre-tags="<em>" --post-tags="</em>" --template="{{highlighted-line}}"
```
=>
```
some <em>text</em> to to match
```
## Scoring

The main thing to understand is that scoring is computed for every line separately, with that one line treated as the whole corpus.
Another consideration is that the scores of all matches are summed up for every line. E.g. the query "one two" is rewritten by Lucene into two term queries, and their scores are added together.
Each individual score is BM25, which is the default similarity in Lucene.
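A sketch of the summing (the exact score depends on the line, so it is omitted here):

```bash
# both term queries ("one" and "two") match this line, so the reported
# score is the sum of their individual BM25 scores
echo "one two" | ./lmgrep --query="one two" --with-score --format=json
```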
## `--only-analyze`

Great for debugging.
The output is a list of tokens after analyzing the text, e.g.:

```bash
echo "Dogs and CAt" | ./lmgrep --only-analyze
# => ["dog","and","cat"]
```
In combination with the `--explain` flag, it outputs the detailed analyzed text, similar to the Elasticsearch Analyze API, e.g.:

```bash
echo "Dogs and CAt" | ./lmgrep --only-analyze --explain | jq
```
=>
```json
[
  {
    "token": "dog",
    "position": 0,
    "positionLength": 1,
    "type": "<ALPHANUM>",
    "end_offset": 4,
    "start_offset": 0
  },
  {
    "end_offset": 8,
    "positionLength": 1,
    "position": 1,
    "start_offset": 5,
    "type": "<ALPHANUM>",
    "token": "and"
  },
  {
    "position": 2,
    "token": "cat",
    "positionLength": 1,
    "end_offset": 12,
    "type": "<ALPHANUM>",
    "start_offset": 9
  }
]
```
To draw a token graph, you can use the `--graph` flag, e.g.:

```bash
echo "FooBar-Baz" | ./lmgrep --word-delimiter-graph-filter=99 --only-analyze --graph
```
=>
```
digraph tokens {
  graph [ fontsize=30 labelloc="t" label="" splines=true overlap=false rankdir = "LR" ];
  // A2 paper size
  size = "34.4,16.5";
  edge [ fontname="Helvetica" fontcolor="red" color="#606060" ]
  node [ style="filled" fillcolor="#e8e8f0" shape="Mrecord" fontname="Helvetica" ]
  0 [label="0"]
  -1 [shape=point color=white]
  -1 -> 0 []
  0 -> 2 [ label="foobar / FooBar"]
  0 -> 1 [ label="foo / Foo"]
  1 [label="1"]
  1 -> 2 [ label="bar / Bar"]
  2 [label="2"]
  2 -> 3 [ label="baz / Baz"]
  -2 [shape=point color=white]
  3 -> -2 []
}
```
The `--graph` flag turns the text analysis output into a valid GraphViz program that can be fed to `dot`, which draws a picture out of the text. Magic.

If you have GraphViz installed on your machine, here is a one-liner to save the image of the token graph:

```bash
echo "FooBar-Baz" | ./lmgrep --word-delimiter-graph-filter=99 --only-analyze --graph | dot -Tpng -o token-graph.png
```

The output image should look like:

<img src="docs/token-graph.png" alt="Token Graph" title="Token Graph" />

If you also have ImageMagick installed, you can preview the token graph with this one-liner on Ubuntu:

```bash
echo "FooBar-Baz" | ./lmgrep --word-delimiter-graph-filter=99 --only-analyze --graph | dot -Tpng | display
```

Or on macOS:

```bash
echo "FooBar-Baz" | ./lmgrep --word-delimiter-graph-filter=99 --only-analyze --graph | dot -Tpng | open -a Preview.app -f
```
## Streamed matching

Start the `lmgrep` process once and wait for input from STDIN that includes both the text and the query.
This technique avoids "cold start" issues in streaming scenarios where the query becomes known only together with the text.

Example:

```bash
echo '{"query": "nike~", "text": "I am selling nikee"}' | ./lmgrep --streamed --with-score --format=json --query-parser=simple
#=> {"line-number":1,"line":"I am selling nikee","score":0.09807344}
```

This is equivalent to:

```bash
echo "I am selling nikee" | ./lmgrep --query="nike~" --with-score --format=json --query-parser=simple
#=> {"line-number":1,"line":"I am selling nikee","score":0.09807344}
```

All other options are also applicable.
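Since the process stays alive, multiple query/text pairs can be streamed, one JSON object per line (a sketch):

```bash
printf '%s\n' \
  '{"query": "fox", "text": "the quick brown fox"}' \
  '{"query": "dog", "text": "the lazy dog"}' \
  | ./lmgrep --streamed --format=json
```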
## Custom Builds

### Raudikko or Voikko stemming for the Finnish language

NOTE: the project is re-architected in such a way that the Raudikko token filter definition lives in a subdirectory and was put under a deps.edn alias. However clever this change is, the uberjar builder has a hard time with it. The solution for now is to modify the deps.edn file so that the Raudikko dependency is put under the top-level `:deps`. Tools.build also has a hard time building an uberjar.

```bash
(export LMGREP_FEATURE_RAUDIKKO=true && bb generate-reflection-config && make build)
```
## Environment variables

Check the docs.

## Future work

- Optimize matching by processing lines in batches

## License

Copyright © 2022 Dainius Jocas.

Distributed under The Apache License, Version 2.0.