Awesome
DLMReader
An efficient multi-threaded package for reading(writing) delimited files. It is designed as a file parser for InMemoryDatasets.jl
.
DLMReader writes and reads AbstractDatasets types, i.e. other types must be converted to/from AbstractDatasets.
It works very well for huge files (long or/and wide).
DLMReader
does not guessdelimiter
and if it is different from,
, it must be passed via thedelimiter
keyword argument. By default, theDLMReader
package assumes Strings are not quoted, if they are quoted, user must pass the quote character via thequotechar
keyword argument.
Documentation
Features
DLMReader.jl
has some interesting features which distinguish it from other packages for reading delimited files. In what follows, we list few of them;
-
Informats
: TheDLMReader
package usesinformats
to call a class of functions on the raw text before parsing its value(s). This provides a flexible and extendable approach to parse values with special patterns. For instance, using the predefined informatCOMMA!
allows users to read a numeric column with "thousands separator" and/or the dollar sign, e.g. using this informat, the raw text like "$12,000.00
" will be parsed as "12000.00
". Moreover,informat
s support function composing, e.g.COMMA! ∘ ACC!
parses "$(12,000.00)
" as "-12000.00
", i.e.ACC!
is first applied and thenCOMMA!
is applied on its result.- Additionally,
informats
can be applied on whole line before processing individual values.
- Additionally,
-
Fixed-width text: If users pass the columns locations via the
fixed
keyword argument, the package reads those columns as fixed-width format. For instance, passingfixed = Dict(1=>1:1, 2=>2:2)
helps to parse "10
" as "[1,0]
". Mixing fixed-width format and delimited format is also allowed. -
Multiple observations per line: The package allows reading more than one observation per line. This can be done by passing the
multiple_obs = true
keyword argument. The multithreading feature (plus some other features) will be switched off if this option is set. -
Fast file writer: The
DLMReader
package exploits thebyrow
function fromInMemoryDatasets.jl
to write delimited files into disk. This enablesDLMReader
to convert values to string using multiple threads. -
Alternative delimiters: User can pass a vector of delimiters to the function. In this case,
filereader
treats any of the passed delimiters as field delimiter. -
Multiple Date formats: User can pass different date formats for different columns.
-
Different integer base: The
DLMReader
package allows users pass the integer base if it is different from 10 when parsing integers. -
String as delimiter: User can pass a string as delimiter of values. This must be passed via the
dlmstr
keyword argument. -
Informative warnings/info: If something goes wrong during the reading phase, the package will provide detailed warnings/info to help user investigate the issue.
Benchmarks
See here for some benchmarks.
Examples
The following files will be used during the examples, it is assumed that the files are located in the current working directory
ex1.csv
a, b, c
1,2,NA
2,3,2001-1-2
2,4,2020-4-2
1,2,2000-12-1
ex2.csv
a::b::C::DD
12::1345::15::15
12::13::15::15
12::13::15::15
12::13::15::15
12::13::15::15
12::13::15::15
12::13::15::15
12::13::::15
12::13::15::15
12::13::15::157
ex3.csv
1
2
4;5
6
8;9
1
4;
ex4.csv
1 3,5
2 4,6
33 5,7
ex5.csv
x1;x2:x3,x4
1;2;123;3
2;4,4,5
ex6.csv
id1 $2,000,000 3
id2 $34,000 4
id3 $200,000 1
And the code to read them into Julia
julia> using DLMReader
julia> filereader("ex1.csv", dtformat = Dict(3 => dateformat"y-m-d"))
julia> filereader("ex2.csv", dlmstr = "::")
julia> filereader("ex3.csv", types = [Int, Int, Int], header = false, linebreak = ';', delimiter = '\n')
julia> filereader("ex4.csv", fixed = Dict(1 => 1:4), header = false)
julia> filereader("ex5.csv", delimiter = [';', ':', ','])
julia> filereader("ex6.csv", delimiter = ' ', informat = Dict(2=>COMMA!), header = [:ID, :price, :quarter])
COMMA!
is a built-in informat which removes the comma from numbers. If number contains dollar or sterling signs, it also removes them. The trimmed text is sent to the parser for converting to a number.
Extra examples
julia> filereader(IOBuffer("1,2,3,4,5\n6,7,8\n10\n"),
header = [:x1, :x2],
types = [Int, Int],
multiple_obs = true)
5×2 Dataset
Row │ x1 x2
│ identity identity
│ Int64? Int64?
─────┼────────────────────
1 │ 1 2
2 │ 3 4
3 │ 5 6
4 │ 7 8
5 │ 10 missing
julia> filereader(IOBuffer(""" name1 name2 avg1 avg2 y
0 A D 75 5 32
1 A D 75 5 32
2 D L 32 7 12
3 F C 99 8 42
4 F C 99 8 42
5 C A 43 6 39
6 C A 43 6 39
7 L R 53 3 11
8 R F 21 2 25
9 R F 21 2 25
"""), delimiter = ' ', ignorerepeated = true, emptycolname = true)
10×6 Dataset
Row │ NONAME1 name1 name2 avg1 avg2 y
│ identity identity identity identity identity identity
│ Int64? String? String? Int64? Int64? Int64?
─────┼────────────────────────────────────────────────────────────
1 │ 0 A D 75 5 32
2 │ 1 A D 75 5 32
3 │ 2 D L 32 7 12
4 │ 3 F C 99 8 42
5 │ 4 F C 99 8 42
6 │ 5 C A 43 6 39
7 │ 6 C A 43 6 39
8 │ 7 L R 53 3 11
9 │ 8 R F 21 2 25
10 │ 9 R F 21 2 25