goodtables-pandas-py

Warning: Not an official frictionlessdata package

This package reads and validates a Frictionless Data Tabular Data Package using pandas. It is roughly 10x faster than the official frictionlessdata/frictionless-py, at the expense of higher memory usage.

Usage

pip install goodtables-pandas-py
import goodtables_pandas as goodtables

report = goodtables.validate(source='datapackage.json')

Implementation notes

Limitations

Uniqueness of null

Pandas treats missing values (null) as regular values, meaning that null is considered equal to null. The resulting definition of uniqueness is illustrated in the following examples.

unique              | not unique
(1), (null)         | (1), (null), (null)
(1, 1), (1, null)   | (1, 1), (1, null), (1, null)

As the following script demonstrates, pandas considers the repeated rows (1, null) to be duplicates, and thus not unique.

import pandas
import numpy as np

pandas.DataFrame(dict(x=[1, 1, 1], y=[1, np.nan, np.nan])).duplicated()

0    False
1    False
2     True
dtype: bool

Although this behavior matches some SQL implementations (namely Microsoft SQL Server), others (namely PostgreSQL and SQLite) choose to treat null as unique. See this dbfiddle.
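
For comparison, the PostgreSQL/SQLite behavior (each null distinct, so rows containing null never count as duplicates) could be approximated in pandas by excluding rows with null from the duplicate check. A minimal sketch, not part of this package:

import pandas
import numpy as np

df = pandas.DataFrame(dict(x=[1, 1, 1], y=[1, np.nan, np.nan]))

# pandas semantics: null equals null, so the repeated (1, null) row is a duplicate
print(df.duplicated().tolist())  # [False, False, True]

# PostgreSQL/SQLite-style semantics: rows containing null never violate uniqueness
print(df.dropna().duplicated().reindex(df.index, fill_value=False).tolist())  # [False, False, False]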

Key constraints

primaryKey

Fields in primaryKey cannot contain missing values (equivalent to required: true).

See https://github.com/frictionlessdata/specs/issues/593.
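
In pandas terms, a primaryKey check therefore amounts to a required check plus a uniqueness check on the key fields. A rough sketch of the idea (not the package's actual implementation; the field names are illustrative):

import pandas
import numpy as np

df = pandas.DataFrame(dict(x=[1, 2, np.nan], y=["a", "a", "b"]))
key = ["x", "y"]

missing = df[key].isna().any(axis=1)    # violates required: true
duplicated = df.duplicated(subset=key)  # violates uniqueness
print(not missing.any() and not duplicated.any())  # False: the last row is missing x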

uniqueKeys

The uniqueKeys property defines one or more row uniqueness constraints which, unlike primaryKey, support null values. Uniqueness is determined as described above.

{
  "uniqueKeys": [
    ["x", "y"],
    ["y", "z"]
  ]
}

See https://github.com/frictionlessdata/specs/issues/593.
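
Each uniqueKeys entry can be thought of as an independent duplicate check on the corresponding column subset, using the null semantics described above. A hedged sketch (field names follow the descriptor above):

import pandas
import numpy as np

df = pandas.DataFrame(dict(x=[1, 1], y=[np.nan, np.nan], z=[1, 2]))
unique_keys = [["x", "y"], ["y", "z"]]

for key in unique_keys:
    print(key, "unique:", not df.duplicated(subset=key).any())
# ['x', 'y'] unique: False  (null equals null, so (1, null) repeats)
# ['y', 'z'] unique: True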

foreignKeys

The reference key of a foreignKey must meet the requirements of a uniqueKeys entry: it must be unique but may contain null. Every local key must match one of the reference keys, unless one of its fields is null.

reference    | local: valid                 | local: invalid
(1)          | (1), (null)                  | (2)
(1), (null)  | (1), (null)                  | (2)
(1, 1)       | (1, 1), (1, null), (2, null) | (1, 2)
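
In other words, only local keys whose fields are all non-null need to match a reference key. A rough pandas sketch of this check (not the package's actual code), using the third row of the table above:

import pandas
import numpy as np

reference = pandas.DataFrame(dict(x=[1], y=[1.0]))                    # reference key: (1, 1)
local = pandas.DataFrame(dict(x=[1, 1, 2], y=[1.0, np.nan, np.nan]))  # local keys: (1, 1), (1, null), (2, null)

complete = local.notna().all(axis=1)  # only keys with no null fields must match
matched = local.merge(reference.drop_duplicates(), how="left", indicator=True)["_merge"] == "both"
print((matched | ~complete).all())  # True: (1, null) and (2, null) are exempt, and (1, 1) matches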

De-duplication of key constraints

To avoid duplicate key checks, the various key constraints are expanded and de-duplicated before validation: primaryKey is replaced by required: true on its fields and an equivalent uniqueKeys entry, single-field unique keys are replaced by unique: true on the field, and duplicate keys are dropped.

The following example illustrates the transformation in terms of a Table Schema descriptor.

Original

{
  "fields": [
    {
      "name": "x",
      "required": true
    },
    {
      "name": "y",
      "required": true
    },
    {
      "name": "x2"
    }
  ],
  "primaryKey": ["x", "y"],
  "uniqueKeys": [
    ["x", "y"],
    ["x"]
  ],
  "foreignKeys": [
    {
      "fields": ["x2"],
      "reference": {
        "resource": "",
        "fields": ["x"]
      }
    }
  ]
}

Checked

{
  "fields": [
    {
      "name": "x",
      "required": true,
      "unique": true
    },
    {
      "name": "y",
      "required": true
    },
    {
      "name": "x2"
    }
  ],
  "uniqueKeys": [
    ["x", "y"]
  ],
  "foreignKeys": [
    {
      "fields": ["x2"],
      "reference": {
        "resource": "",
        "fields": ["x"]
      }
    }
  ]
}
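
For illustration, the consolidation shown above could be sketched roughly as follows (inferred from the example, not the package's actual code):

def expand_key_constraints(schema: dict) -> dict:
    """Consolidate primaryKey and uniqueKeys so each key is checked only once (illustrative sketch)."""
    fields = {field["name"]: field for field in schema.get("fields", [])}
    unique_keys = [list(key) for key in schema.get("uniqueKeys", [])]
    # primaryKey: mark its fields as required and fold the key into uniqueKeys
    primary_key = schema.pop("primaryKey", None)
    if primary_key:
        for name in primary_key:
            fields[name]["required"] = True
        unique_keys.append(list(primary_key))
    # Drop duplicate keys and replace single-field keys with unique: true on the field
    checked_keys = []
    for key in unique_keys:
        if key in checked_keys:
            continue
        if len(key) == 1:
            fields[key[0]]["unique"] = True
        else:
            checked_keys.append(key)
    schema["uniqueKeys"] = checked_keys
    return schema

Applied to the original descriptor above, this sketch yields the checked descriptor.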