Scrape files for sensitive information, and generate an interactive HTML report. Based on Rabin2.

Customize the tool to your liking!

Tested on Kali Linux v2023.4 (64-bit).

Made for educational purposes. I hope it will help!

How to Install

Install Radare2

On Kali Linux, run:

apt-get -y install radare2

On Windows, download and unpack radareorg/radare2, then add the bin directory to the Windows PATH environment variable.

On macOS, run:

brew install radare2

Standard Install

pip3 install --upgrade file-scraper

Build and Install From the Source

git clone https://github.com/ivan-sincek/file-scraper && cd file-scraper

python3 -m pip install --upgrade build

python3 -m build

python3 -m pip install dist/file_scraper-3.1-py3-none-any.whl

Build the Template & Run

Prepare a template:

      "query":"[^\\w\\d\\n]+(?:basic|bearer)\\ .+",
      "query":"(?:access|account|admin|basic|bearer|card|conf|cred|customer|email|history|id|info|jwt|key|kyc|log|otp|pass|pin|priv|refresh|salt|secret|seed|setting|sign|token|transaction|transfer|user)[\\w\\d]*(?:\\\"\\ *\\:|\\ *\\=).+",
      "query":"[^\\w\\d\\n]+(?:bug|comment|fix|issue|note|problem|to(?:\\_|\\ |)do|work)[^\\w\\d\\n]+.+",
      "query":"-----BEGIN (?:CERTIFICATE|PRIVATE KEY)-----[\\s\\S]+?-----END (?:CERTIFICATE|PRIVATE KEY)-----",

Make sure your regular expressions return only one capturing group, i.e., a flat list such as [1, 2, 3, 4], and not a list of tuples such as [(1, 2), (3, 4)].
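The difference between a flat list and a list of tuples can be seen with Python's `re.findall`. This is only a minimal sketch: it assumes the tool collects matches in a `findall`-like way, and the sample text and patterns are made up for illustration.

```python
import re

# Made-up sample input, not taken from the tool.
text = "user=alice pass=secret1\nuser=bob pass=secret2"

# One capturing group: findall returns a flat list of strings.
single = re.findall(r"pass=(\w+)", text)

# Two capturing groups: findall returns a list of tuples.
double = re.findall(r"(user|pass)=(\w+)", text)

# Non-capturing groups (?:...) group alternatives without adding a capture,
# so the result is a flat list of whole matches.
flat = re.findall(r"(?:user|pass)=\w+", text)

print(single)
print(double)
print(flat)
```

Switching extra groups to the non-capturing form `(?:...)`, as the template queries above do, is usually the simplest way to keep the result flat.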

Make sure to properly escape regular-expression-specific symbols in your template file, e.g., escape a dot . as \\. and a forward slash / as \\/.
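The doubled backslashes are needed because the JSON parser consumes one level of escaping before the regular expression engine sees the pattern. The sketch below shows this with Python's `json` and `re` modules; the entry name `example` and the surrounding template layout are hypothetical and only serve the illustration.

```python
import json
import re

# Hypothetical one-entry template; the raw string keeps the doubled
# backslashes exactly as they would appear in a template file on disk.
template_text = r'''
{
    "example": {
        "query": "https?:\\/\\/example\\.com\\/[\\w\\d]+"
    }
}
'''

template = json.loads(template_text)
pattern = template["example"]["query"]

# After JSON parsing, each "\\" has become a single backslash.
print(pattern)

# The escaped dot matches only a literal dot, so the second URL is skipped.
sample = "see https://example.com/abc123 and https://exampleXcom/abc"
found = re.findall(pattern, sample)
print(found)
```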

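| Name       | Type    | Required | Description                                                   |
| ---------- | ------- | -------- | ------------------------------------------------------------- |
| query      | text    | yes      | Regular expression query.                                     |
| search     | boolean | no       | Highlight matches within output; otherwise, extract matches.  |
| ignorecase | boolean | no       | Case-insensitive search.                                      |
| minimum    | integer | no       | Show only matches longer than int characters.                 |
| maximum    | integer | no       | Show only matches shorter than int characters.                |
| decode     | boolean | no       | Decode matches. Available decodings: url, base64, hex, cert.  |
| unique     | boolean | no       | Filter out duplicates.                                        |
| collect    | boolean | no       | Collect all matches in one place.                             |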
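To make the options more concrete, here is a rough sketch of how `ignorecase`, `minimum`, `maximum`, and `unique` could be applied to a query. The function `run_query` is hypothetical and is not the tool's actual code; it also assumes the length bounds are exclusive, following the wording "longer than" and "shorter than".

```python
import re

def run_query(text, entry):
    """Hypothetical helper: apply one template entry to a block of text."""
    flags = re.IGNORECASE if entry.get("ignorecase") else 0
    matches = re.findall(entry["query"], text, flags)

    # Assumption: both bounds are exclusive.
    if "minimum" in entry:
        matches = [m for m in matches if len(m) > entry["minimum"]]
    if "maximum" in entry:
        matches = [m for m in matches if len(m) < entry["maximum"]]

    # Drop duplicates while keeping the order of first appearance.
    if entry.get("unique"):
        matches = list(dict.fromkeys(matches))
    return matches

# Made-up entry and input for illustration.
entry = {
    "query": r"token=[\w\d%]+",
    "ignorecase": True,
    "minimum": 8,
    "maximum": 40,
    "unique": True,
}
text = "TOKEN=abc%20def\ntoken=x\nTOKEN=abc%20def"
result = run_query(text, entry)
print(result)
```

Here the short match `token=x` is dropped by `minimum`, and the repeated match is collapsed by `unique`.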

How I run the tool most of the time:

file-scraper -dir directory -o results.html -e default

Default (built-in) excluded file types are as follows:

car, css, gif, jpeg, jpg, mp3, mp4, nib, ogg, otf, png, storyboard, strings, svg, ttf, webp, woff, woff2, xib
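As an illustration of how an excludes value such as `default,jpeg,jpg,png` could be expanded and applied, consider the sketch below. The helper names `parse_excludes` and `is_excluded` are hypothetical, and the case-insensitive comparison is an assumption rather than documented behavior.

```python
from pathlib import Path

# Built-in list copied from the section above.
DEFAULT_EXCLUDES = [
    "car", "css", "gif", "jpeg", "jpg", "mp3", "mp4", "nib", "ogg", "otf",
    "png", "storyboard", "strings", "svg", "ttf", "webp", "woff", "woff2", "xib",
]

def parse_excludes(value):
    """Expand a comma-separated value, replacing 'default' with the built-in list."""
    extensions = set()
    for item in value.split(","):
        item = item.strip().lower().lstrip(".")
        if item == "default":
            extensions.update(DEFAULT_EXCLUDES)
        elif item:
            extensions.add(item)
    return extensions

def is_excluded(path, excludes):
    """Return True if the file extension is in the excludes set (case-insensitive)."""
    return Path(path).suffix.lower().lstrip(".") in excludes

excludes = parse_excludes("default,java")
print(is_excluded("assets/logo.PNG", excludes))
print(is_excluded("src/Main.kt", excludes))
```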


File Scraper v3.1 ( github.com/ivan-sincek/file-scraper )

Usage:   file-scraper -dir directory -o out          [-t template     ] [-e excludes    ] [-th threads]
Example: file-scraper -dir decoded   -o results.html [-t template.json] [-e jpeg,jpg,png] [-th 10     ]

    Scrape files for sensitive information
    Directory containing files, or a single file to scrape
    -dir, --directory = decoded | files | test.exe | etc.
    Template file with extraction details, or a single RegEx to use
    Default: built-in JSON template file
    -t, --template = template.json | "secret\: [\w\d]+" | etc.
    Exclude all files that end with the specified extension
    Specify 'default' to load the built-in list
    Use comma-separated values
    -e, --excludes = mp3 | default,jpeg,jpg,png | etc.
    Include all files that end with the specified extension
    Overrides excludes
    Use comma-separated values
    -i, --includes = java | json,xml,yaml | etc.
    Beautify [minified] JavaScript (.js) files
    -b, --beautify
    Number of parallel threads to run
    Default: 30
    -th, --threads = 10 | etc.
    Output HTML file
    -o, --out = results.html | etc.
    Debug output
    -dbg, --debug


