Home

Awesome

<!--- BADGES: START --->

GitHub - License PyPI - Python Version PyPI - Package Version Conda - Platform Conda (channel only) Docs - GitHub.io

<!--- BADGES: END ---> <img src="https://github.com/koaning/doubtlab/raw/main/docs/doubt.png" width=140 height=140 align="right">

doubtlab

A lab for bad labels. Learn more here.

This repository contains general tricks that may help you find bad, or noisy, labels in your dataset. The hope is that this repository makes it easier for folks to quickly check their own datasets before they invest too much time and compute on gridsearch.

Install

You can install the tool via pip or conda.

Install with pip

python -m pip install doubtlab

Install with conda

conda install -c conda-forge doubtlab

Quickstart

Doubtlab allows you to define "reasons" for a row of data to deserve another look. These reasons can form a pipeline which can be used to retreive a sorted list of examples worth checking again.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from doubtlab.ensemble import DoubtEnsemble
from doubtlab.reason import ProbaReason, WrongPredictionReason

# Let's say we have some dataset/model already
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1_000)
model.fit(X, y)

# Next we can add reasons for doubt. In this case we're saying
# that examples deserve another look if the associated proba values
# are low or if the model output doesn't match the associated label.
reasons = {
    'proba': ProbaReason(model=model),
    'wrong_pred': WrongPredictionReason(model=model)
}

# Pass these reasons to a doubtlab instance.
doubt = DoubtEnsemble(**reasons)

# Get the ordered indices of examples worth checking again
indices = doubt.get_indices(X, y)
# Get dataframe with "reason"-ing behind the sorting
predicates = doubt.get_predicates(X, y)

Features

The library implemented many "reasons" for doubt.

General Reasons

Classification Reasons

Regression Reasons

Feedback

It is early days for the project. The project should be plenty useful as-is, but we prefer to be honest. Feedback and anecdotes are very welcome!

Related Projects