AUTO-TEXT

This is a Common Lisp library intended for working with unknown "real-life" text files that hopefully hold data in table form (e.g. CSV files, fixed-width files, etc).

In these cases, given a text file, it's useful to find out the following:

Plus

Caveats

The file encoding detection is able to discriminate between:

This library is intended for working with English or "Latin-1" (Spanish, French, Portuguese, Italian) data, hence the choice of encodings.

However, more encoding detection rules can be added or edited easily; see encoding.lisp, whose code is simple to understand.

For detecting languages written in other scripts, such as Chinese, Japanese, Korean, Russian, Arabic, or Greek, see the inquisitor lib. Inquisitor doesn't work with Latin or Western European languages.

Usage

Main usage:

(ql:quickload "auto-text")

(auto-text:analyze "my-file.txt") 

Sample output:

CL-USER> (auto-text:analyze #P"my.txt" :silent nil)
Reading file for analysis... my.txt
Eol-type: CRLF
Likely delimiter? #\,  
BOM: NIL 
Possible encodings:  UTF-8 
Sampling 10 rows as CSV for checking width...
(:SAME-NUMBER-OF-COLUMNS T :DELIMITER #\, :EOL-TYPE :CRLF :BOM-TYPE NIL
 :ENCODING :UTF-8)
 
CL-USER> (auto-text:analyze #P"file.txt" :silent nil)
Reading file for analysis... file.txt
Eol-type: CRLF
Likely delimiter? #\Return  
BOM: NIL 
Possible encodings:  WINDOWS-1252 
(:DELIMITER #\Return :EOL-TYPE :CRLF :BOM-TYPE NIL :ENCODING :WINDOWS-1252)

Silence these messages by passing :silent T.

The analyze function returns a property list (plist). The keys that appear in the sample output above are :SAME-NUMBER-OF-COLUMNS, :DELIMITER, :EOL-TYPE, :BOM-TYPE and :ENCODING.
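
Since the result is an ordinary plist, individual values can be read with getf. The following is a minimal sketch that uses a literal copy of the first sample result instead of calling the library; `*result*` and `describe-analysis` are hypothetical names for illustration, not part of auto-text.

```lisp
;; *RESULT* is a literal copy of the plist printed by the first sample run,
;; standing in for the value returned by (auto-text:analyze ...).
(defparameter *result*
  '(:same-number-of-columns t :delimiter #\, :eol-type :crlf
    :bom-type nil :encoding :utf-8))

;; DESCRIBE-ANALYSIS is a hypothetical helper, not part of auto-text.
(defun describe-analysis (plist)
  "Return a one-line summary string for an analysis plist."
  (format nil "encoding=~A eol=~A delimiter=~S bom=~A"
          (getf plist :encoding)
          (getf plist :eol-type)
          (getf plist :delimiter)
          (getf plist :bom-type)))

(describe-analysis *result*)
```

Note that getf returns NIL for a missing key, so a key such as :SAME-NUMBER-OF-COLUMNS, which is absent from the second sample result, can be read safely.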

To sample an arbitrary number of rows (lines) from a file of any size, see sample-rows-string:

(defun sample-rows-string (path &key (eol-type :crlf)
                                     (encoding :utf-8)
                                     (sample-size 10))
  "Sample some rows from path, return as list of strings.
Each line does not include the EOL"
  (loop for rows in (sample-rows-bytes path :eol-type eol-type
                                            :sample-size sample-size)
        collecting 
        (bytes-to-string rows encoding)))

Obtain a histogram (a frequency hash table) that tells you how many times each character appears in the whole file. NOTE: this reads the file in byte mode. See histogram-binary-file in histogram.lisp.
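
To illustrate the idea of a byte-mode histogram, here is a minimal sketch in plain Common Lisp. It is not the library's histogram-binary-file, whose signature and return value are not shown here; `byte-histogram` is a hypothetical name, and the 4096-byte buffer size is an arbitrary choice.

```lisp
;; BYTE-HISTOGRAM is a hypothetical function, not the library's
;; histogram-binary-file; it only sketches the same general idea.
(defun byte-histogram (path)
  "Return a hash table mapping each byte value found in PATH to its count."
  (let ((table (make-hash-table))
        (buffer (make-array 4096 :element-type '(unsigned-byte 8))))
    ;; Open the file as a stream of octets, so no encoding is assumed.
    (with-open-file (in path :element-type '(unsigned-byte 8))
      ;; READ-SEQUENCE returns how many buffer positions were filled;
      ;; zero means end of file.
      (loop for end = (read-sequence buffer in)
            until (zerop end)
            do (loop for i below end
                     do (incf (gethash (aref buffer i) table 0)))))
    table))
```

With this sketch, `(gethash (char-code #\,) (byte-histogram "my-file.txt") 0)` would give the number of comma bytes in the file. Counting bytes rather than characters means the file can be inspected before its encoding is known.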

Config

See config.lisp to configure:

Hacking

Implementations

So far it works on SBCL, CCL, CLISP, and ABCL.

It should run on any implementation where BABEL and CL-CSV work fine!

Author

Flavio E. also known as defunkydrummer

License

MIT

See also

The aforementioned inquisitor lib. This library shares no code with inquisitor.