


Harvester is a tool for crawling websites, running OCR on documents, and extracting their metadata, all through a graphical interface. The goal is to let journalists, activists, and researchers rapidly collect open-source intelligence (OSINT) from public websites and convert any set of documents into machine-readable form without programming or complex technical setup.

Harvester requires DocManager so that it can index the data with Elasticsearch. Harvester can also be used with LookingGlass to seamlessly generate searchable archives of crawled data and processed documents.



Setup Instructions

  1. Install the dependencies
  1. Install Tika & Tesseract (optional)

NOTE: By default, document conversion (PDF, docs, etc.) is handled by GiveMeText, which sends your documents over the open internet. DO NOT USE THIS with sensitive documents; instead, install Tika & Tesseract as described below.

  1. Get Harvester
  1. Run Harvester
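
Before processing sensitive documents, it can help to confirm that the local conversion tools are actually installed. The following is a minimal sketch, not part of Harvester itself: the `missing_tools` helper and the `LOCAL_TOOLS` list are hypothetical, and it assumes that Tesseract is available as the `tesseract` executable and that Tika needs a `java` runtime on your PATH.

```python
import shutil

# Assumed executable names (not taken from Harvester's docs):
# `tesseract` is the Tesseract CLI; Tika runs on the JVM, so `java` is checked.
LOCAL_TOOLS = ["tesseract", "java"]


def missing_tools(names=LOCAL_TOOLS, which=shutil.which):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Local conversion tools found.")
```

If the script reports missing tools, local conversion is probably not set up, so documents may still be handled by the default GiveMeText path.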