Typed DataFrames

Pandas DataFrame subclasses that self-organize and serialize robustly.

import typeddfs

Film = typeddfs.typed("Film").require("name", "studio", "year").build()
df = Film.read_csv("file.csv")
assert df.columns.tolist() == ["name", "studio", "year"]
type(df)  # Film

Your types remember how to be read, including columns, dtypes, indices, and custom requirements. No index_col=, header=, set_index, or astype needed.

Read and write any format:

path = input("input file? [.csv/.tsv/.tab/.json/.xml.bz2/.feather/.snappy.h5/...]")
df = Film.read_file(path)
df.write_file("output.snappy")

Need dataclasses?

instances = df.to_dataclass_instances()
Film.from_dataclass_instances(instances)
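
The round trip above can be pictured with the standard library alone. This is a minimal sketch, not how typeddfs implements it: the `FilmRow` class and the sample records are hypothetical, and rows are modeled as plain dicts rather than a DataFrame.

```python
from dataclasses import asdict, make_dataclass

# Column names taken from the Film example; FilmRow is a hypothetical name.
columns = ["name", "studio", "year"]
FilmRow = make_dataclass("FilmRow", columns, frozen=True)

# Placeholder records standing in for DataFrame rows.
records = [
    {"name": "Film A", "studio": "Studio X", "year": 2001},
    {"name": "Film B", "studio": "Studio Y", "year": 2002},
]

# Rows -> dataclass instances -> rows again.
instances = [FilmRow(**record) for record in records]
round_tripped = [asdict(instance) for instance in instances]
```

Frozen dataclasses are used here so that each row is immutable and hashable, which is one plausible reason to want dataclass instances in the first place.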

Save metadata?

df = df.set_attrs(dataset="piano")
df.write_file("df.csv", attrs=True)
df = Film.read_file("df.csv", attrs=True)
print(df.attrs)  # e.g. {"dataset": "piano"}

Make dirs? Don’t overwrite?

df.write_file("df.csv", mkdirs=True, overwrite=False)
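
For intuition, the two flags correspond to behavior that can be sketched with `pathlib`. The helper `write_text_safely` below is hypothetical and only illustrates the semantics one would expect from `mkdirs` and `overwrite`; it is not typeddfs's implementation.

```python
import tempfile
from pathlib import Path


def write_text_safely(path, text, mkdirs=False, overwrite=True):
    """Write text, optionally creating parent dirs and refusing to overwrite."""
    path = Path(path)
    if mkdirs:
        path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError if the file already exists.
    mode = "w" if overwrite else "x"
    with path.open(mode, encoding="utf-8") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "nested" / "dir" / "df.csv"
    write_text_safely(target, "name,studio,year\n", mkdirs=True, overwrite=False)
    try:
        write_text_safely(target, "replaced\n", overwrite=False)
    except FileExistsError:
        print("refused to overwrite", target.name)
```

Opening with mode `"x"` makes the existence check and the file creation a single step, which avoids a race between checking and writing.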

Write / verify checksums?

df.write_file("df.csv", file_hash=True)
df = Film.read_file("df.csv", file_hash=True)  # fails if wrong
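
The general write-then-verify pattern can be sketched with `hashlib`. This is an illustration under assumptions: the sidecar naming (`df.csv.sha256`), the choice of SHA-256, and the helper names are all hypothetical and may differ from what typeddfs does.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_path_for(path):
    # Hypothetical naming convention: "df.csv" -> "df.csv.sha256"
    return path.with_name(path.name + ".sha256")


def write_with_hash(path, data):
    """Write the data and a sidecar file holding its SHA-256 hex digest."""
    path.write_bytes(data)
    digest = hashlib.sha256(data).hexdigest()
    hash_path_for(path).write_text(digest, encoding="utf-8")


def read_with_hash(path):
    """Read the data, raising ValueError if it does not match the sidecar."""
    data = path.read_bytes()
    expected = hash_path_for(path).read_text(encoding="utf-8").strip()
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected:
        raise ValueError(f"Checksum mismatch for {path}")
    return data


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "df.csv"
    write_with_hash(target, b"name,studio,year\n")
    print(len(read_with_hash(target)), "bytes verified")
```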

Get example datasets?

print(ExampleDfs.penguins().df)
#    species     island  bill_length_mm  ...  flipper_length_mm  body_mass_g     sex
# 0    Adelie  Torgersen            39.1  ...              181.0       3750.0    MALE

Pretty-print the obvious way?

df.pretty_print(to="all_data.md.zip")
wiki_txt = df.pretty_print(fmt="mediawiki")

All standard DataFrame methods remain available. Use .of(df) to convert to your type, or .vanilla() for a plain DataFrame.

Read the docs 📚 for more info and examples.

πŸ› Pandas serialization bugs fixed

Pandas has several issues with serialization.

<details> <summary><em>See: Fixed issues</em></summary> Depending on the format and columns, these issues occur: </details>

🎁 Other features

See more in the guided walkthrough ✏️

<details> <summary><em>See: Short feature list</em></summary> </details>

💔 Limitations

<details> <summary><em>See: List of limitations</em></summary> </details>

🔌 Serialization support

TypedDfs provides the methods read_file and write_file, which guess the format from the filename extension. For example, this will convert a gzipped, tab-delimited file to Feather:

TastyDf = typeddfs.typed("TastyDf").build()
TastyDf.read_file("myfile.tab.gz").write_file("myfile.feather")
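
Guessing a format from a filename can be sketched with `pathlib`. The suffix tables and the `guess_format` helper below are hypothetical and cover only a handful of extensions for illustration; they are not typeddfs's actual registry or logic.

```python
from pathlib import Path

# Hypothetical suffix tables for illustration only.
COMPRESSION_SUFFIXES = {".gz", ".zip", ".bz2", ".xz", ".zst"}
FORMAT_SUFFIXES = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".tab": "tsv",
    ".json": "json",
    ".xml": "xml",
    ".feather": "feather",
}


def guess_format(path):
    """Return (format, compression) guessed from the filename suffixes."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    compression = None
    # A trailing compression suffix wraps the real format suffix.
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        compression = suffixes.pop()
    if not suffixes or suffixes[-1] not in FORMAT_SUFFIXES:
        raise ValueError(f"Unrecognized format for {path!r}")
    return FORMAT_SUFFIXES[suffixes[-1]], compression


print(guess_format("myfile.tab.gz"))
print(guess_format("myfile.feather"))
```

Peeling off the compression suffix first is what lets a name like `myfile.tab.gz` resolve to a tab-delimited format plus gzip, rather than being treated as an unknown `.gz` file.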

Pandas does most of the serialization, but some formats require extra packages. Typed-dfs defines extras that install the required packages at compatible versions.

Here are the extras:

For example, for Feather and TOML support, use typeddfs[feather,toml]. As a shorthand for all formats, use typeddfs[all].

📊 Serialization in-depth

<details> <summary><em>See: Full table</em></summary>
format  changes  packages  extra  sanity  speed  bitrate
Feather  fixed  pyarrow  feather  +++++++++
Parquet  fixed  pyarrow  parquet *++ †++++++
csv/tsv  fixed  -−  text
flexwf ‡  new  -−  text
.fwf  +read  -−  text
json  fixed  -−−  text
xml  fixed  lxml  xml  −−−  text
.properties  new  -−  text
toml  new  tomlkit  toml  -−  text
yaml  new  ruamel.yaml  yaml  --  text
INI  new  --−  text
.lines  new  -−  text
.npy  −++++
.npz  −++++
.html  html5lib,beautifulsoup4  html  −−−−  text
pickle  -−−-
XLSX  fixed  openpyxl,defusedxml  excel  +−-
ODS  fixed  openpyxl,defusedxml  excel  +−-
XLS  fixed  openpyxl,defusedxml  excel  −−−-
XLSB  pyxlsb  xlsb  −−−+
HDF5  tables  none  -−+
GZIP  N/A  -++
ZIP §  N/A  -++
BZIP2  N/A  --+++
XZ  N/A  --+++
ZSTD  zstandard  N/A  ++++++
</details> <details> <summary><em>See: serialization notes</em></summary>

Feather offers far better performance than CSV, gzipped CSV, and HDF5 in read speed, write speed, memory overhead, and compression ratio. Parquet typically produces smaller files than Feather at some cost in speed. Feather is the preferred format for most cases.

</details>

🔒 Security

Refer to the security policy.

πŸ“ Extra notes

<details> <summary><em>See: Pinned versions</em></summary>

Dependencies in the extras only have version minimums, not maximums. For example, typed-dfs requires pyarrow >= 4. natsort is also only assigned a minimum version number. This means that the result of typed-dfs’s sort_natural could change between natsort releases. To fix this, pin natsort to a specific major version; e.g. natsort = "^8" with Poetry or natsort>=8,<9 with pip.

</details>

🍁 Contributing

Typed-Dfs is licensed under the Apache License, version 2.0. New issues and pull requests are welcome. Please refer to the contributing guide. Generated with Tyrannosaurus.