Awesome
goodtables-pandas-py
Warning: Not an official frictionlessdata package
This package reads and validates a Frictionless Data Tabular Data Package using pandas. It is about ~10x faster than the official frictionlessdata/frictionless-py, at the expense of higher memory usage.
Usage
pip install goodtables-pandas-py
import goodtables_pandas as goodtables
report = goodtables.validate(source='datapackage.json')
Implementation notes
Limitations
- Only fields of type
string
,number
,integer
,boolean
,date
,datetime
,year
, andgeopoint
are currently supported. Other types can easily be supported with additionalparse_*
functions inparse.py
. - Memory use could be greatly minimized by reading, parsing, and checking tables in chunks (using
pandas.read_csv(chunksize=)
), and storing only field values for unique and foreign key checks.
Uniqueness of null
Pandas chooses to treat missing values (null
) as regular values, meaning that they are equal to themselves. How uniqueness is defined as a result is illustrated in the following examples.
unique | not unique |
---|---|
(1) , (null) | (1) , (null) , (null) |
(1, 1) , (1, null) | (1, 1) , (1, null) , (1, null) |
As the following script demonstrates, pandas considers the repeated rows (1, null)
to be duplicates, and thus not unique.
import pandas
import numpy as np
pandas.DataFrame(dict(x=[1, 1, 1], y=[1, np.nan, np.nan])).duplicated()
0 False 1 False 2 True dtype: bool
Although this behavior matches some SQL implementations (namely Microsoft SQL Server), others (namely PostgreSQL and SQLite) choose to treat null
as unique. See this dbfiddle.
Key constraints
primaryKey
Fields in primaryKey
cannot contain missing values (equivalent to required: true
).
See https://github.com/frictionlessdata/specs/issues/593.
uniqueKey
The uniqueKeys
property provides support for one or more row uniqueness
constraints which, unlike primaryKey
, do support null
values. Uniqueness is determined as described above.
{
"uniqueKeys": [
["x", "y"],
["y", "z"]
]
}
See https://github.com/frictionlessdata/specs/issues/593.
foreignKey
The reference key of a foreignKey
must meet the requirements of uniqueKey
: it must be unique but can contain null
. The local key must be present in the reference key, unless one of the fields is null.
reference | local: valid | local: invalid |
---|---|---|
(1) | (1) , (null) | (2) |
(1) , (null) | (1) , (null) | (2) |
(1, 1) | (1, 1) , (1, null) , (2, null) | (1, 2) |
De-duplication of key constraints
To avoid duplicate key checks, the various key constraints are expanded as follows:
- Reference foreign keys (
foreignKey.reference.fields
) are added (if not already present) to the unique keys (uniqueKeys
) of the reference resource. TheforeignKey
check only considers whether the local key is in the reference key. - The primary key (
primaryKey
) is moved (if not already present) to the unique keys (uniqueKeys
) and the fields in the key become required (field.constraints.required: true
) if not already. - Single-field unique keys (
uniqueKeys
) are dropped and the fields become unique (field.constraints.unique: true
) if not already.
The following example illustrates the transformation in terms of Table Schema descriptor.
Original
{
"fields": [
{
"name": "x",
"required": true
},
{
"name": "y",
"required": true
},
{
"name": "x2"
}
],
"primaryKey": ["x", "y"],
"uniqueKeys": [
["x", "y"],
["x"]
],
"foreignKeys": [
{
"fields": ["x2"],
"reference": {
"resource": "",
"fields": ["x"]
}
}
]
}
Checked
{
"fields": [
{
"name": "x",
"required": true,
"unique": true
},
{
"name": "y",
"required": true
},
{
"name": "x2"
}
],
"uniqueKeys": [
["x", "y"]
],
"foreignKeys": [
{
"fields": ["x2"],
"reference": {
"resource": "",
"fields": ["x"]
}
}
]
}