eBook Tools

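This is a collection of bash shell scripts for automated and semi-automated organization and management of large ebook collections. It contains the following tools: organize-ebooks.sh, interactive-organizer.sh, find-isbns.sh, convert-to-txt.sh, rename-calibre-library.sh and split-into-folders.sh.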

All of the tools use a library file lib.sh that has useful functions for building other ebook management scripts. More details for the different script options and parameters can be found in the Usage, options and configuration section.

Installation and dependencies

There are two ways to install and use the tools in this repository: directly, or via Docker images.

Since all of the tools are shell scripts, you should be able to use them directly from source in most up-to-date GNU/Linux distributions, as long as you have the needed dependencies installed. They should also be usable in the Windows Subsystem for Linux and on other *nix systems like OS X and *BSD, provided the GNU versions of the dependencies are installed.

However, since non-Linux systems are officially unsupported and may have unexpected issues, Docker containers are the preferred way to use the scripts on those systems. The Docker images may also be easier to use than the bare scripts on non-GNU Linux distributions or on older Linux distributions like some LTS releases.

Shell scripts

To install and use the bare shell scripts, follow these steps:

  1. Install the dependencies below.
  2. Make sure that your system has a UTF-8 locale.
  3. Clone the repository or download a release archive and extract it.
  4. For convenience, you may want to add the scripts folder to the PATH environment variable.
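
Steps 2 and 4 can be checked and applied from a shell. The following is a minimal sketch and not part of the repository: is_utf8_charmap is a hypothetical helper, and EBOOK_TOOLS_DIR is a placeholder for wherever you cloned or extracted the scripts.

```shell
# Hypothetical helper: succeeds when the given charmap name denotes UTF-8.
is_utf8_charmap() {
  case "$1" in
    UTF-8|utf-8|UTF8|utf8) return 0 ;;
    *) return 1 ;;
  esac
}

# Step 2: warn if the active locale does not use a UTF-8 charmap.
if ! is_utf8_charmap "$(locale charmap 2>/dev/null)"; then
  echo "warning: the current locale is not UTF-8" >&2
fi

# Step 4: EBOOK_TOOLS_DIR is a placeholder path (assumption),
# not a variable that the scripts themselves read.
EBOOK_TOOLS_DIR="${EBOOK_TOOLS_DIR:-$HOME/ebook-tools}"
export PATH="$PATH:$EBOOK_TOOLS_DIR"
```

Putting the last two lines in your shell startup file would make the change persistent.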

You need recent versions of the packages named in the install commands below.

The scripts are only tested on Linux, though they should work on any *nix system that has the needed dependencies. On Arch Linux, you can install everything needed with this command:

pacman -S file less bash coreutils gawk sed grep calibre p7zip tesseract tesseract-data-eng python2-lxml poppler catdoc djvulibre

Note: you can probably get much better OCR results by using the unstable 4.0 version of Tesseract. It is available in the AUR, or you can build a package for it yourself.

Here is how to install the packages on Debian (and Debian-based distributions like Ubuntu):

apt-get install file less bash coreutils gawk sed grep calibre p7zip-full tesseract-ocr tesseract-ocr-osd tesseract-ocr-eng python-lxml poppler-utils catdoc djvulibre-bin

Keep in mind that many Debian-based distributions do not have up-to-date packages, and the scripts work best when calibre's version is at least 2.84. For earlier versions, you have to set ISBN_METADATA_FETCH_ORDER and ORGANIZE_WITHOUT_ISBN_SOURCES to empty strings.
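
One way to apply this workaround automatically is sketched below. It is an illustration, not something the scripts do themselves: version_at_least is a hypothetical helper built on GNU sort -V, and the sketch assumes that ebook-meta --version prints a string containing "calibre <version>".

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v ebook-meta >/dev/null 2>&1; then
  # Assumption: the version output looks like "ebook-meta (calibre 3.48.0)".
  calibre_version=$(ebook-meta --version | sed -n 's/.*calibre \([0-9][0-9.]*\).*/\1/p')
  if [ -n "$calibre_version" ] && ! version_at_least "$calibre_version" 2.84; then
    # Older calibre: set both options to empty strings, as described above.
    export ISBN_METADATA_FETCH_ORDER=""
    export ORGANIZE_WITHOUT_ISBN_SOURCES=""
  fi
fi
```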

Docker

The Docker image includes all of the needed dependencies, even the extra calibre plugins. There is an automatically built image on Docker Hub, which you can pull locally with:

docker pull ebooktools/scripts

You can also easily build the image yourself: clone this repository (or download the latest release archive and extract it) and then run this command in the folder:

docker build -t ebooktools/scripts:latest .

Here are some Docker-specific usage details:

For more Docker details, read the docker documentation and docker run reference specifically.
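
As a rough illustration of running a script from the image, the sketch below wraps docker run in a small function. Several things here are assumptions rather than documented behavior: run_in_docker and PRINT_ONLY are hypothetical names, /books is an arbitrary mount point inside the container, and the sketch assumes the image accepts a script name followed by its arguments as the container command.

```shell
# Hypothetical wrapper: mount a host folder at /books and run a script in the container.
run_in_docker() {
  local host_dir=$1
  shift
  local cmd=(docker run -it --rm -v "$host_dir:/books" ebooktools/scripts "$@")
  if [ -n "${PRINT_ONLY:-}" ]; then
    # Only show the command that would be executed.
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

For example, PRINT_ONLY=1 run_in_docker /path/to/library organize-ebooks.sh --dry-run /books prints the full docker run command without starting a container. Files created inside the container may end up owned by a different user than on the host, so check the ownership of the output.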

Usage, options and configuration

Scripts that work with multiple files recursively scan the supplied folders and assume that one file is one ebook. Ebooks that consist of multiple files should be compressed into a single archive file. The archive type does not matter: .zip, .rar, .tar, .7z and any other archive type supported by 7zip are fine.

All of the options documented below can either be passed to the scripts via command-line parameters or via environment variables. Command-line parameters supersede environment variables. Most parameters are not required and if nothing is specified, the default value will be used.
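
The precedence described above (command-line parameter, then environment variable, then default) can be sketched as follows. resolve_option is a hypothetical helper, not a function from lib.sh, and the default value "false" used in the usage note is only illustrative. The sketch distinguishes an unset variable from one set to an empty string, since empty strings are meaningful for some options.

```shell
# Hypothetical helper: print the effective value of an option.
#   $1 = environment variable name
#   $2 = default value
#   $3 = value from the command line (optional; wins when present)
resolve_option() {
  local name=$1 default=$2
  if [ "$#" -ge 3 ]; then
    printf '%s\n' "$3"            # command-line parameter supersedes everything
  elif [ -n "${!name+x}" ]; then
    printf '%s\n' "${!name}"      # environment variable is set (possibly to "")
  else
    printf '%s\n' "$default"      # nothing specified: fall back to the default
  fi
}
```

With OCR_ENABLED=true in the environment, resolve_option OCR_ENABLED false prints true, while resolve_option OCR_ENABLED false always prints always.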

General options

All of these options are part of the common library and may affect some or all of the scripts.

General control flags:

Options related to extracting ISBNs from files and finding metadata by ISBN:

Options for OCR:

Options related to extracting and searching for non-ISBN metadata:

Options related to the input and output files:

Miscellaneous options

Script usage and options

organize-ebooks.sh [<OPTIONS>] folder-to-organize [...]

Description

This is probably the most versatile script in the repository. It can automatically organize folders with huge quantities of unorganized ebook files. This is done by extracting ISBNs and/or metadata from the ebook files, downloading their full and hopefully correct metadata from online sources, renaming the unorganized files with full and correct names, and moving them to specified folders. It supports virtually all ebook types, including ebooks in arbitrary or even nested archives (like the other scripts, it assumes that one file is one ebook, even if it's a huge archive). OCR can be used for scanned ebooks, and corrupt ebooks and non-ebook documents (pamphlets) can be separated into specified folders. Most of the general options and flags above affect how this script operates, but there are also some specific options for it.

Specific options for organizing files

Output options

interactive-organizer.sh [<OPTIONS>] folder-to-organize [...]

Description

This script can be used to manually organize ebook files quickly. It can also be used to semi-automatically verify the ebooks organized by organize-ebooks.sh if the KEEP_METADATA option was enabled so the new filenames can be compared with the old ones.

Options

find-isbns.sh [<OPTIONS>] [filename]

This script tries to find valid ISBNs inside a file, or in stdin if no file is specified. Searching for ISBNs in files uses progressively more resource-intensive methods until some ISBNs are found; see the Searching for ISBNs in files section below for more details.

Some global options affect this script (especially the ones related to extracting ISBNs from files), but the only script-specific option is:

convert-to-txt.sh [<OPTIONS>] filename

This script converts the supplied file to a text file. It can optionally also use OCR for .pdf, .djvu and image files. There are no local options, but some of the global options affect this script's behavior a lot, especially the OCR ones.
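
The implementation details further down mention that calibre's ebook-convert is used unless a faster tool is available for the file type. A minimal sketch of that kind of dispatch is shown here, without the OCR path. pick_converter and convert_to_txt are hypothetical names, and the MIME type patterns are assumptions about what the file utility reports.

```shell
# Hypothetical helper: map a MIME type to the preferred text extraction tool.
pick_converter() {
  case "$1" in
    application/pdf)    echo pdftotext ;;
    application/msword) echo catdoc ;;
    image/vnd.djvu*)    echo djvutxt ;;
    *)                  echo ebook-convert ;;
  esac
}

# Convert the file $1 to the text file $2 with the chosen tool.
convert_to_txt() {
  local input=$1 output=$2 mime tool
  mime=$(file --brief --mime-type -- "$input")
  tool=$(pick_converter "$mime")
  # Fall back to ebook-convert when the faster tool is not installed.
  command -v "$tool" >/dev/null 2>&1 || tool=ebook-convert
  case "$tool" in
    pdftotext) pdftotext "$input" "$output" ;;
    catdoc)    catdoc "$input" > "$output" ;;
    djvutxt)   djvutxt "$input" "$output" ;;
    *)         ebook-convert "$input" "$output" ;;
  esac
}
```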

rename-calibre-library.sh [<OPTIONS>] calibre-folder [...]

This script traverses a calibre library folder and renames all the book files in it by reading their metadata from calibre's metadata.opf files.
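
To illustrate the idea, the sketch below reads the title and author from a metadata.opf file and builds a file name from them. It is a deliberately naive, line-based approach and not the script's actual logic: opf_field and new_name_from_opf are hypothetical helpers, the "Author - Title.ext" template is an assumption, and XML entities or characters that are unsafe in file names are not handled.

```shell
# Print the text of the first <dc:TAG> element that fits on one line of an OPF file.
opf_field() {
  local opf_file=$1 tag=$2
  sed -n "s|.*<dc:${tag}[^>]*>\([^<]*\)</dc:${tag}>.*|\1|p" "$opf_file" | head -n 1
}

# Build a new file name from the metadata ("Author - Title.ext" is an assumed template).
new_name_from_opf() {
  local opf_file=$1 extension=$2 title author
  title=$(opf_field "$opf_file" title)
  author=$(opf_field "$opf_file" creator)
  printf '%s - %s.%s\n' "${author:-Unknown}" "${title:-Untitled}" "$extension"
}
```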

split-into-folders.sh [<OPTIONS>] folder-with-books [...]

This script recursively scans the supplied folders for files and splits the found files (and the accompanying metadata files if present) into folders with consecutive names that each contain the specified number of files.
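
The core idea can be sketched in a few lines of bash. This is an illustration under assumptions, not the script itself: split_into_folders is a hypothetical function, the zero-padded folder names (0001, 0002, ...) are an arbitrary choice, and accompanying metadata files are not treated specially.

```shell
# Move every file found under $1 into numbered folders under $2,
# placing at most $3 files in each folder.
split_into_folders() {
  local src=$1 dest=$2 per_folder=$3
  local count=0 folder_index=0 target file
  while IFS= read -r -d '' file; do
    if (( count % per_folder == 0 )); then
      # The current folder is full (or this is the first file): start a new one.
      folder_index=$(( folder_index + 1 ))
      target=$(printf '%s/%04d' "$dest" "$folder_index")
      mkdir -p "$target"
    fi
    # -n refuses to overwrite, so a file whose name is already taken stays in place.
    mv -n -- "$file" "$target/"
    count=$(( count + 1 ))
  done < <(find "$src" -type f -print0 | sort -z)
}
```

For example, split_into_folders ./unsorted ./sorted 1000 would create ./sorted/0001, ./sorted/0002 and so on, each holding up to 1000 files.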

Implementation details

Searching for ISBNs in files

There are several different ways that a specific file can be searched for ISBNs. Each step requires progressively more "expensive" operations. If ISBNs are found at some point, they are returned or used without trying the remaining strategies. These are the actual steps in order:

  1. Check the supplied file name for ISBNs (the path is ignored).
  2. If the MIME type of the file matches ISBN_DIRECT_GREP_FILES, search the file contents directly for ISBNs. If the MIME type matches ISBN_IGNORED_FILES, the search stops with no results.
  3. Check the file metadata from calibre's ebook-meta tool for ISBNs.
  4. Try to extract the file as an archive with 7z. If successful, recursively repeat all of these steps for all the extracted files. This is very useful for normal archives, but it also works for .epub, .chm, and other ebook filetypes that are actually archives of some type that 7z supports. This method of searching them for ISBNs is typically much faster than the next step.
  5. If the file is not an archive, try to convert it to a .txt file. Use calibre's ebook-convert unless a faster alternative is present - pdftotext from poppler for .pdf files, catdoc for .doc files or djvutxt for .djvu files. Search the resulting .txt file for ISBNs directly.
  6. If OCR is enabled and the simple conversion to .txt fails or if its result is empty, try OCR-ing the file. If the simple conversion result is non-empty but does not contain ISBNs and OCR_ENABLED is set to always, run OCR as well.
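
The "cheapest strategy first, stop at the first hit" pattern behind these steps can be sketched as follows. Only two simplified strategies are shown, all function names are hypothetical, and SIMPLE_ISBN_PATTERN is a deliberately simplified stand-in for the real ISBN_REGEX in lib.sh.

```shell
# Simplified pattern (assumption): 13 digits, or 9 digits followed by a digit or X.
SIMPLE_ISBN_PATTERN='[0-9]{13}|[0-9]{9}[0-9Xx]'

# Strategy 1: look only at the file name, ignoring the path.
isbns_from_filename() {
  basename -- "$1" | grep -oE "$SIMPLE_ISBN_PATTERN"
}

# Strategy 2: search the file contents directly.
isbns_from_contents() {
  grep -aoE "$SIMPLE_ISBN_PATTERN" -- "$1"
}

# Run the given strategies in order and print the output of the first one that finds anything.
find_first_result() {
  local input=$1 strategy result
  shift
  for strategy in "$@"; do
    result=$("$strategy" "$input") || continue
    if [ -n "$result" ]; then
      printf '%s\n' "$result"
      return 0
    fi
  done
  return 1
}
```

A call like find_first_result book.pdf isbns_from_filename isbns_from_contents never touches the file contents when the name already contains a candidate.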

The regular expression used for matching ISBNs (ISBN_REGEX in lib.sh) is purposefully a bit broad, to catch as many potential ISBNs as possible. To reduce false positives, all matched numbers are verified for correct ISBN check digits. Additionally, by default all matched numbers that pass the checks are filtered with the ISBN_BLACKLIST_REGEX regular expression. That way technically valid but probably wrong ISBNs like 0123456789, 0000000000, 1111111111, etc. are discarded.
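
The check digit verification and blacklist filtering can be illustrated with the standard ISBN-10 and ISBN-13 checksum rules. This is a sketch, not the code from lib.sh: is_valid_isbn and is_blacklisted are hypothetical names, and the regular expression assigned to ISBN_BLACKLIST_REGEX here is an assumed value that merely covers the examples mentioned above.

```shell
# Succeeds when $1 is an ISBN-10 or ISBN-13 with a correct check digit.
is_valid_isbn() {
  local isbn=${1//[- ]/} sum=0 i digit
  isbn=${isbn^^}
  if [[ $isbn =~ ^[0-9]{9}[0-9X]$ ]]; then
    # ISBN-10: weights 10 down to 1, X stands for 10, sum must be divisible by 11.
    for (( i = 0; i < 10; i++ )); do
      digit=${isbn:i:1}
      [[ $digit == X ]] && digit=10
      sum=$(( sum + digit * (10 - i) ))
    done
    (( sum % 11 == 0 ))
  elif [[ $isbn =~ ^[0-9]{13}$ ]]; then
    # ISBN-13: alternating weights 1 and 3, sum must be divisible by 10.
    for (( i = 0; i < 13; i++ )); do
      digit=${isbn:i:1}
      sum=$(( sum + digit * (i % 2 == 0 ? 1 : 3) ))
    done
    (( sum % 10 == 0 ))
  else
    return 1
  fi
}

# Assumed blacklist: 0123456789 and any 10-character run of one repeated digit.
ISBN_BLACKLIST_REGEX='^(0123456789|([0-9xX])\2{9})$'

# Succeeds when $1 matches the blacklist (uses GNU grep back-references).
is_blacklisted() {
  printf '%s\n' "$1" | grep -qE "$ISBN_BLACKLIST_REGEX"
}
```

Note that 0123456789, 0000000000 and 1111111111 all pass the ISBN-10 checksum, which is why a separate blacklist is needed to reject them.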

Limitations

Roadmap

Security and safety

Please keep in mind that this is beta-quality software. To avoid data loss, make sure that you have a backup of any files you want to organize. You may also want to run the scripts with the --dry-run or --symlink-only option the first time to make sure that they would do what you expect them to do.

Also keep in mind that these shell scripts parse and extract complex arbitrary media and archive files and pass them to other external programs written in memory-unsafe languages. This is not very safe and specially-crafted malicious ebook files can probably compromise your system when you use these scripts. If you are cautious and want to organize untrusted or unknown ebook files, use something like QubesOS or at least do it in a separate VM/jail/container/etc.

License

These scripts are licensed under the GNU General Public License v3.0. For more details see the LICENSE file in the repository.