Home

Awesome

<div align="center"><img width="100%" src="https://github.com/sfu-db/dataprep/raw/develop/assets/logo.png"/></div>
<p align="center"> <a href="LICENSE"><img src="https://img.shields.io/pypi/l/dataprep?style=flat-square"/></a> <a href="https://sfu-db.github.io/dataprep/"><img src="https://img.shields.io/badge/dynamic/json?color=blue&label=docs&prefix=v&query=%24.info.version&url=https%3A%2F%2Fpypi.org%2Fpypi%2Fdataprep%2Fjson&style=flat-square"/></a> <a href="https://pypi.org/project/dataprep/"><img src="https://img.shields.io/pypi/pyversions/dataprep?style=flat-square"/></a> <a href="https://www.codacy.com/gh/sfu-db/dataprep?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=sfu-db/dataprep&amp;utm_campaign=Badge_Coverage"><img src="https://app.codacy.com/project/badge/Coverage/ed658f08dcce4f088c850253475540ba"/></a> <!-- <a href="https://codecov.io/gh/sfu-db/dataprep"><img src="https://img.shields.io/codecov/c/github/sfu-db/dataprep?style=flat-square"/></a> --> <a href="https://www.codacy.com/gh/sfu-db/dataprep?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=sfu-db/dataprep&amp;utm_campaign=Badge_Grade"><img src="https://app.codacy.com/project/badge/Grade/ed658f08dcce4f088c850253475540ba"/></a> <a href="https://discord.gg/xwbkFNk"><img src="https://img.shields.io/discord/702765817154109472?style=flat-square"/></a> </p> <p align="center"> <a href="https://sfu-db.github.io/dataprep/">Documentation</a> | <a href="https://discord.gg/xwbkFNk">Discord</a> | <a href="https://github.com/sfu-db/dataprep/discussions">Forum</a> </p>

DataPrep lets you prepare your data using a single library with a few lines of code.

Currently, you can use DataPrep to:

Releases

<div align="center"> <table> <tr> <th>Repo</th> <th>Version</th> <th>Downloads</th> </tr> <tr> <td>PyPI</td> <td><a href="https://pypi.org/project/dataprep/"><img src="https://img.shields.io/pypi/v/dataprep?style=flat-square"/></a></td> <td><a href="https://pepy.tech/project/dataprep"><img src="https://pepy.tech/badge/dataprep"/></a></td> </tr> <tr> <td>conda-forge</td> <td><a href="https://anaconda.org/conda-forge/dataprep"><img src="https://img.shields.io/conda/vn/conda-forge/dataprep.svg"/></a></td> <td><a href="https://anaconda.org/conda-forge/dataprep"><img src="https://img.shields.io/conda/dn/conda-forge/dataprep.svg"/></a></td> </tr> </table> </div>

Installation

pip install -U dataprep

EDA

DataPrep.EDA is the fastest and the easiest EDA (Exploratory Data Analysis) tool in Python. It allows you to understand a Pandas/Dask DataFrame with a few lines of code in seconds.

Create Profile Reports, Fast

You can create a beautiful profile report from a Pandas/Dask DataFrame with the create_report function. DataPrep.EDA has the following advantages compared to other tools:

The following code demonstrates how to use DataPrep.EDA to create a profile report for the titanic dataset.

from dataprep.datasets import load_dataset
from dataprep.eda import create_report
df = load_dataset("titanic")
create_report(df).show_browser()

Click here to see the generated report of the above code.

Click here to see the benchmark result.

Try DataPrep.EDA Online: DataPrep.EDA Demo in Colab

Innovative System Design

DataPrep.EDA is the only task-centric EDA system in Python. It is carefully designed to improve usability.

Learn DataPrep.EDA in 2 minutes:

<a href="https://youtu.be/nSkQy3ew3EI"><img src="assets/eda_video_cover.png"/></a>

Click here to check all the supported tasks.

Check plot, plot_correlation, plot_missing and create_report to see how each function works.

Clean

DataPrep.Clean contains about 140+ functions designed for cleaning and validating data in a DataFrame. It provides

The following example shows how to clean and standardize a column of country names.

from dataprep.clean import clean_country
import pandas as pd
df = pd.DataFrame({'country': ['USA', 'country: Canada', '233', ' tr ', 'NA']})
df2 = clean_country(df, 'country')
df2
           country  country_clean
0              USA  United States
1  country: Canada         Canada
2              233        Estonia
3              tr          Turkey
4               NA            NaN

Type validation is also supported:

from dataprep.clean import validate_country
series = validate_country(df['country'])
series
0     True
1    False
2     True
3     True
4    False
Name: country, dtype: bool

Check Documentation of Dataprep.Clean to see how each function works.

Connector

Connector now supports loading data from both web API and databases.

Web API

Connector is an intuitive, open-source API wrapper that speeds up development by standardizing calls to multiple APIs as a simple workflow.

Connector provides a simple wrapper to collect structured data from different Web APIs (e.g., Twitter, Spotify), making web data collection easy and efficient, without requiring advanced programming skills.

Do you want to leverage the growing number of websites that are opening their data through public APIs? Connector is for you!

Let's check out the several benefits that Connector offers:

How to fetch all publications of Andrew Y. Ng?

from dataprep.connector import connect
conn_dblp = connect("dblp", _concurrency = 5)
df = await conn_dblp.query("publication", author = "Andrew Y. Ng", _count = 2000)

Here, you can find detailed Examples.

Connector is designed to be easy to extend. If you want to connect with your own web API, you just have to write a simple configuration file to support it. This configuration file describes the API's main attributes like the URL, query parameters, authorization method, pagination properties, etc.

Database

Connector now has adopted connectorx in order to enable loading data from databases (Postgres, Mysql, SQLServer, etc.) into Python dataframes (pandas, dask, modin, arrow, polars) in the fastest and most memory efficient way. [Benchmark]

What you need to do is just install connectorx (pip install -U connectorx) and run one line of code:

from dataprep.connector import read_sql
read_sql("postgresql://username:password@server:port/database", "SELECT * FROM lineitem")

Check out here for supported databases and dataframes and more examples usages.

Documentation

The following documentation can give you an impression of what DataPrep can do:

Contribute

There are many ways to contribute to DataPrep.

Please take a look at our wiki for development documentations!

Acknowledgement

Some functionalities of DataPrep are inspired by the following packages.

Citing DataPrep

If you use DataPrep, please consider citing the following paper:

Jinglin Peng, Weiyuan Wu, Brandon Lockhart, Song Bian, Jing Nathan Yan, Linghao Xu, Zhixuan Chi, Jeffrey M. Rzeszotarski, and Jiannan Wang. DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python. SIGMOD 2021.

BibTeX entry:

@inproceedings{dataprepeda2021,
  author    = {Jinglin Peng and Weiyuan Wu and Brandon Lockhart and Song Bian and Jing Nathan Yan and Linghao Xu and Zhixuan Chi and Jeffrey M. Rzeszotarski and Jiannan Wang},
  title     = {DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python},
  booktitle = {Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21), June 20--25, 2021, Virtual Event, China},
  year      = {2021}
}