An Introduction to DataFrames.jl

Bogumił Kamiński, February 13, 2023

The tutorial is for DataFrames.jl 1.5.0

A brief introduction to basic usage of DataFrames.

The tutorial specifies the project environment under which it should be run. To prepare this environment, run the following command in the project folder before using the tutorial notebooks:

julia -e 'using Pkg; Pkg.activate("."); Pkg.instantiate()'

Tested under Julia 1.9.0. The project dependencies are the following:

  [69666777] Arrow v2.4.3
  [6e4b80f9] BenchmarkTools v1.3.2
  [336ed68f] CSV v0.10.9
  [324d7699] CategoricalArrays v0.10.7
  [8be319e6] Chain v0.5.0
  [944b1d66] CodecZlib v0.7.1
  [a93c6f00] DataFrames v1.5.0
  [1313f7d8] DataFramesMeta v0.13.0
  [5789e2e9] FileIO v1.16.0
  [da1fdf0e] FreqTables v0.4.5
  [7073ff75] IJulia v1.24.0
  [babc3d20] JDF v0.5.1
  [9da8a3cd] JLSO v2.7.0
  [b9914132] JSONTables v1.0.3
  [86f7a689] NamedArrays v0.9.6
  [2dfb63ee] PooledArrays v1.4.2
  [f3b207a7] StatsPlots v0.15.4
  [bd369af6] Tables v1.10.0
  [a5390f91] ZipFile v0.10.1
  [9a3f8284] Random
  [10745b16] Statistics v1.9.0

I will try to keep the material up to date as the packages evolve.

This tutorial covers DataFrames.jl and CategoricalArrays.jl, which together form the core of working with DataFrames.jl, along with selected file reading and writing packages.

The last part, extras, mentions selected functionality of packages that I find useful for data manipulation. Currently these are FreqTables.jl, DataFramesMeta.jl, and StatsPlots.jl.

TOC

| File | Topic |
|------|-------|
| 01_constructors.ipynb | Creating DataFrame and conversion |
| 02_basicinfo.ipynb | Getting summary information |
| 03_missingvalues.ipynb | Handling missing values |
| 04_loadsave.ipynb | Loading and saving DataFrames |
| 05_columns.ipynb | Working with columns of DataFrame |
| 06_rows.ipynb | Working with rows of DataFrame |
| 07_factors.ipynb | Working with categorical data |
| 08_joins.ipynb | Joining DataFrames |
| 09_reshaping.ipynb | Reshaping DataFrames |
| 10_transforms.ipynb | Transforming DataFrames |
| 11_performance.ipynb | Performance tips |
| 12_pitfalls.ipynb | Possible pitfalls |
| 13_extras.ipynb | Additional interesting packages |

Changelog:

| Date | Changes |
|------|---------|
| 2017-12-05 | Initial release |
| 2017-12-06 | Added description of insert!, merge!, empty!, categorical!, delete!, DataFrames.index |
| 2017-12-09 | Added performance tips |
| 2017-12-10 | Added pitfalls |
| 2017-12-18 | Added additional worthwhile packages: FreqTables and DataFramesMeta |
| 2017-12-29 | Added description of filter and filter! |
| 2017-12-31 | Added description of conversion to Matrix |
| 2018-04-06 | Added example of extracting a row from a DataFrame |
| 2018-04-21 | Major update of whole tutorial |
| 2018-05-01 | Added byrow! example |
| 2018-05-13 | Added StatPlots package to extras |
| 2018-05-23 | Improved comments in sections 1 to 5 by Jane Herriman |
| 2018-07-25 | Update to 0.11.7 release |
| 2018-08-25 | Update to Julia 1.0 release: sections 1 to 10 |
| 2018-08-29 | Update to Julia 1.0 release: sections 11, 12 and 13 |
| 2018-09-05 | Update to Julia 1.0 release: FreqTables section |
| 2018-09-10 | Added CSVFiles section to chapter on load/save |
| 2018-09-26 | Updated to DataFrames 0.14.0 |
| 2018-10-04 | Updated to DataFrames 0.14.1, added haskey and repeat |
| 2018-12-08 | Updated to DataFrames 0.15.2 |
| 2019-01-03 | Updated to DataFrames 0.16.0, added serialization instructions |
| 2019-01-18 | Updated to DataFrames 0.17.0, added passmissing |
| 2019-01-27 | Added Feather.jl file read/write |
| 2019-01-30 | Renamed StatPlots.jl to StatsPlots.jl and added Tables.jl |
| 2019-02-08 | Added groupvars and groupindices functions |
| 2019-04-27 | Updated to DataFrames 0.18.0, dropped JLD2.jl |
| 2019-04-30 | Updated handling of missing values description |
| 2019-07-16 | Updated to DataFrames 0.19.0 |
| 2019-08-14 | Added JSONTables.jl and Tables.columnindex |
| 2019-08-16 | Added Project.toml and Manifest.toml |
| 2019-08-26 | Update to Julia 1.2 and DataFrames 0.19.3 |
| 2019-08-29 | Add example how to compress/decompress CSV file using CodecZlib |
| 2019-08-30 | Add examples of JLSO.jl and ZipFile.jl by xiaodaigh |
| 2019-11-03 | Add examples of JDF.jl by xiaodaigh |
| 2019-12-08 | Updated to DataFrames 0.20.0 |
| 2020-05-06 | Updated to DataFrames 0.21.0 (except load/save and extras) |
| 2020-11-20 | Updated to DataFrames 0.22.0 (except DataFramesMeta.jl which does not work yet) |
| 2020-11-26 | Updated to DataFramesMeta.jl 0.6; update by @pdeffebach |
| 2021-05-15 | Updated to DataFrames.jl 1.1.1 |
| 2021-05-15 | Updated to DataFrames.jl 1.2 and DataFramesMeta.jl 0.8, added Chain.jl instead of Pipe.jl |
| 2021-12-12 | Updated to DataFrames.jl 1.3 |
| 2022-10-05 | Updated to DataFrames.jl 1.4 |
| 2023-02-13 | Updated to DataFrames.jl 1.5 |

Core functions summary

  1. Constructors: DataFrame, DataFrame!, Tables.rowtable, Tables.columntable, Matrix, eachcol, eachrow, Tables.namedtupleiterator, empty, empty!
  2. Getting summary: size, nrow, ncol, describe, names, eltypes, first, last, getindex, setindex!, @view, isapprox, metadata, metadata!, colmetadata, colmetadata!
  3. Handling missing: missing (singleton instance of Missing), ismissing, nonmissingtype, skipmissing, replace, replace!, coalesce, allowmissing, allowmissing!, disallowmissing, disallowmissing!, completecases, dropmissing, dropmissing!, passmissing
  4. Loading and saving: CSV (package), CSVFiles (package), Serialization (module), CSV.read, CSV.write, save, load, serialize, deserialize, Arrow.write, Arrow.Table (from Arrow.jl package), JSONTables (package), arraytable, objecttable, jsontable, CodecZlib (module), GzipCompressorStream, GzipDecompressorStream, JDF.jl (package), JDF.save, JDF.load, JLSO.jl (package), JLSO.save, JLSO.load, ZipFile.jl (package), ZipFile.reader, ZipFile.writer, ZipFile.addfile
  5. Working with columns: rename, rename!, hcat, insertcols!, categorical!, columnindex, hasproperty, select, select!, transform, transform!, combine, Not, All, Between, ByRow, AsTable
  6. Working with rows: sort!, sort, issorted, append!, vcat, push!, view, filter, filter!, deleteat!, unique, nonunique, unique!, allunique, repeat, parent, parentindices, flatten, @chain (from Chain.jl package), only, subset, subset!, shuffle, prepend!, pushfirst!, insert!, keepat!
  7. Working with categorical: categorical, cut, isordered, ordered!, levels, unique, levels!, droplevels!, unwrap, recode, recode!
  8. Joining: innerjoin, leftjoin, leftjoin!, rightjoin, outerjoin, semijoin, antijoin, crossjoin
  9. Reshaping: stack, unstack
  10. Transforming: groupby, mapcols, parent, groupcols, valuecols, groupindices, keys (for GroupedDataFrame), combine, select, select!, transform, transform!, @chain (from Chain.jl package)
  11. Extras:
    • FreqTables: freqtable, prop, Name
    • DataFramesMeta: @with, @subset, @select, @transform, @orderby, @by, @combine, @eachrow, @newcol, ^, $
    • StatsPlots: @df, plot, density, histogram, boxplot, violin
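
To give a feel for how a few of the functions listed above fit together, here is a minimal sketch. It is not taken from the notebooks: the toy data and the variable names (`df`, `complete`, `labels`, `wide`, and so on) are made up for illustration, and it assumes the DataFrames.jl 1.5 environment described above.

```julia
using DataFrames
using Statistics  # provides mean

# 1. Constructors: build a small DataFrame from keyword arguments (toy data).
df = DataFrame(id = 1:4,
               group = ["a", "b", "a", "b"],
               x = [1.0, missing, 3.0, 4.0])

# 2. Getting summary: dimensions and column names.
dims = size(df)    # (number of rows, number of columns)
cols = names(df)   # column names as strings

# 3. Handling missing: drop the rows in which :x is missing.
complete = dropmissing(df, :x)

# 5. Working with columns: add a derived column, replacing missing with 0.0.
filled = transform(df, :x => ByRow(v -> coalesce(v, 0.0)) => :x_filled)

# 6. Working with rows: keep rows with x > 2.0, then sort in descending order.
big = sort(subset(complete, :x => ByRow(>(2.0))), :x, rev = true)

# 8. Joining: attach a label to each row using the :group key.
labels = DataFrame(group = ["a", "b"], label = ["first", "second"])
joined = innerjoin(complete, labels, on = :group)

# 9. Reshaping: wide to long with stack, and back to wide with unstack.
wide = DataFrame(id = 1:2, p = [10, 20], q = [30, 40])
long = stack(wide, [:p, :q])              # columns: id, variable, value
back = unstack(long, :variable, :value)   # columns: id, p, q

# 10. Transforming: split-apply-combine, here the mean of :x per group.
means = combine(groupby(complete, :group), :x => mean => :x_mean)
```

Each step returns a new DataFrame and leaves its input unchanged; the variants ending in `!` (for example `dropmissing!` or `transform!`) modify their argument in place instead.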