Introduction

This is a Python program that scans WARC (web archive) files for viruses and NSFW (not-safe-for-work) content.

You can run it either in test mode, which checks an individual WARC file, or in server mode, which is meant for easy integration into existing workflows. Server mode requires that the server can read the WARC files from its file system.

The program accepts both compressed and uncompressed WARC files. The AI NSFW model works with the TIFF, JPEG, PNG, SVG and WEBP formats, although images with exotic dimensions or formats might not be fully supported.

Installation

Please use Python 3.9+. You can install the requirements as usual:

pip install -r requirements.txt

If you want to use the antivirus feature, you will need to install the clamd antivirus daemon. On Ubuntu, you can do so like this:

apt-get install clamav clamav-daemon -y

The first setup of clamd requires you to stop, update and start the service:

systemctl stop clamav-freshclam
freshclam
systemctl start clamav-freshclam

Usage

The tool can be used in two ways: test mode and server mode.

Note that on its first run, the application automatically downloads the classifier model to the current user's home folder. This might take a few seconds (or minutes) depending on your connection. You can check the progress in stdout.

Test mode

You can start the application in test mode from the command-line as follows:

python app.py --test-av </path/to/warc>
python app.py --test-nsfw </path/to/warc>

The first example above runs the antivirus scan and the second the NSFW classifier.
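To run both checks on the same archive, the two commands above can be wrapped in a small script. The following is a minimal sketch, assuming it is started from the directory that contains app.py; the helper names build_test_command and run_both_checks are hypothetical and not part of the tool.

```python
import subprocess
import sys


def build_test_command(warc_path, mode):
    """Build the argument list for one test-mode run.

    mode is "av" or "nsfw", matching the --test-av and --test-nsfw
    flags shown above. This helper is hypothetical, not part of app.py.
    """
    if mode not in ("av", "nsfw"):
        raise ValueError(f"unknown mode: {mode!r}")
    return [sys.executable, "app.py", f"--test-{mode}", str(warc_path)]


def run_both_checks(warc_path):
    """Run the antivirus scan, then the NSFW classifier, on one file."""
    for mode in ("av", "nsfw"):
        # check=False: the exit codes of app.py are not documented here,
        # so a non-zero status is left for the caller to interpret.
        subprocess.run(build_test_command(warc_path, mode), check=False)
```

Calling run_both_checks("/path/to/warc") then runs the two scans one after the other and prints their output to the terminal.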

Server mode

You can start the application as a server like so:

python app.py --server <port>

The application in server mode exposes the following endpoints:

All these endpoints are POST and take a single argument, file_path, which is the absolute path to the WARC that you want to analyze (it can be compressed or uncompressed).

Here is an example request with curl:

curl -X POST -H "Content-Type: application/json" -d '{"file_path": "/my/path/my.warc.gz"}' localhost:8123/test_all
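The same request can be sent from Python with the standard library only. This is a sketch under the assumption that the server was started on port 8123, as in the curl example; the function names build_scan_request and scan are hypothetical.

```python
import json
import urllib.request


def build_scan_request(file_path, endpoint="test_all", host="localhost", port=8123):
    """Build a POST request that mirrors the curl example above.

    host and port are assumptions; use whatever was passed to --server.
    """
    body = json.dumps({"file_path": file_path}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scan(file_path, **kwargs):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_scan_request(file_path, **kwargs)) as resp:
        return json.load(resp)
```

With the server running, scan("/my/path/my.warc.gz") returns the parsed JSON response as a Python dictionary.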

Return values

All endpoints return JSON. The root element is results, which is an object containing the WARC records together with their filter results. Each entry is keyed by its WARC-Record-ID. Here is an example:

{
  "results": {
    "<urn:uuid:ec2aa5f2-391e-530a-a9ed-b44f944fdcb9>": {
      "av_details": null,
      "av_res": "OK",
      "filename": "picture.jpg",
      "mime": "image/jpeg",
      "nsfw_res": "SFW",
      "nsfw_score": 0.35693745957662754
    },
    ...
  }
}

The fields available for each record are av_details, av_res, filename, mime, nsfw_res and nsfw_score, as shown in the example above.
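A response of this shape can be filtered with a few lines of Python. The sketch below assumes that "OK" is the only clean value of av_res and that a missing or null field means the corresponding check did not run; neither is confirmed here, and flagged_records is a hypothetical helper.

```python
def flagged_records(response):
    """Return the WARC-Record-IDs that look suspicious.

    Assumptions: any av_res other than "OK" (or null/missing) marks an
    antivirus hit, and nsfw_res == "NSFW" marks a classifier hit.
    """
    flagged = []
    for record_id, record in response.get("results", {}).items():
        av_hit = record.get("av_res") not in (None, "OK")
        nsfw_hit = record.get("nsfw_res") == "NSFW"
        if av_hit or nsfw_hit:
            flagged.append(record_id)
    return flagged


# The single record from the example response above.
example = {
    "results": {
        "<urn:uuid:ec2aa5f2-391e-530a-a9ed-b44f944fdcb9>": {
            "av_details": None,
            "av_res": "OK",
            "filename": "picture.jpg",
            "mime": "image/jpeg",
            "nsfw_res": "SFW",
            "nsfw_score": 0.35693745957662754,
        }
    }
}
print(flagged_records(example))  # prints []
```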

NSFW scoring

The nsfw_score is a floating-point value between 0 (not NSFW at all) and 1 (certainly NSFW). The nsfw_res field is either NSFW or SFW, depending on what the classifier has detected.
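If the built-in NSFW/SFW decision is too strict or too lenient for your use case, you can apply your own cut-off to nsfw_score. The following sketch uses an arbitrary threshold of 0.8 for illustration; the cut-off that the tool itself uses for nsfw_res is not stated here, and is_above_threshold is a hypothetical helper.

```python
def is_above_threshold(record, threshold=0.8):
    """Return True if the record's nsfw_score reaches the threshold.

    Assumption: records without a score (for example non-image
    records) carry a null or missing nsfw_score and are not flagged.
    """
    score = record.get("nsfw_score")
    if score is None:
        return False
    return score >= threshold


record = {"nsfw_res": "SFW", "nsfw_score": 0.35693745957662754}
print(is_above_threshold(record))                 # prints False
print(is_above_threshold(record, threshold=0.3))  # prints True
```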

Updating your antivirus database

It makes sense to update your ClamAV signature database from time to time. You can do so by running:

freshclam

You might also want to restart the service with:

systemctl restart clamav-freshclam