<div align="center"> <img src="dataqa-ui/public/images/protractor.png?raw=true" width="200" height="200"/> <h1 align="center">DataQA</h1> </div> <div align="center"> <img src="https://img.shields.io/pypi/pyversions/dataqa"/> <img src="https://img.shields.io/github/license/dataqa/dataqa?color=success"/> <img src="https://img.shields.io/pypi/v/dataqa.svg?label=PyPI&logo=PyPI&logoColor=white&color=success"/> <img src="https://github.com/dataqa/dataqa/actions/workflows/github-actions.yml/badge.svg?&color=success"/> </div>

 

DataQA is a tool to label and explore unstructured documents. It uses rules-based weak supervision to significantly reduce the number of labels needed compared to other tools. Here are a few things you can do with it:

... and it's all available with a simple pip command!

 

<div align="center"> <img src="github_images/merged.gif" width="800" align="center"/> </div>

 

Installation

Pre-requisites:

Installing from PyPI

To run with Docker

Usage

In the terminal, type dataqa run. The first start can take a few minutes while everything is set up.

Doing this will run a server locally and open a browser window at port 5000. If the application does not open the browser automatically, open localhost:5000 in your browser. You need to keep the terminal open.
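If the browser does not open by itself, one option is to poll the port until the server answers and then open the page. The following is a minimal sketch, not part of DataQA: the helper names `is_port_open` and `wait_and_open` and the timeout values are assumptions made for illustration, and only the address localhost:5000 comes from the text above.

```python
import socket
import time
import webbrowser


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_and_open(host: str = "localhost", port: int = 5000,
                  max_wait_s: float = 300.0, poll_s: float = 2.0) -> bool:
    """Poll until the port accepts connections, then open the browser.

    Returns False if the server is still unreachable after max_wait_s.
    """
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if is_port_open(host, port):
            webbrowser.open(f"http://{host}:{port}")
            return True
        time.sleep(poll_s)
    return False


if __name__ == "__main__":
    # Run this in a second terminal while `dataqa run` is starting up.
    print("opened" if wait_and_open() else "server not reachable")
```

The script has to run in a separate terminal, because the terminal running the application must stay open.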

To quit the application, press Ctrl-C in the terminal. To resume the application, type dataqa run again. Running the application creates a folder at $HOME/.dataqa_data.

Uploading data

The input must be a CSV file in UTF-8 encoding, up to 30MB in size, with a column named "text" that contains the main text. All other columns are ignored.
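The requirements above can be checked before uploading. This is a rough sketch using only the Python standard library. It is not DataQA's own validation: the function name `check_upload_file` is hypothetical, and reading "30MB" as 30 * 1024 * 1024 bytes is an assumption.

```python
import csv
import os
from typing import List

# Assumption: "30MB" is interpreted as 30 MiB; the tool's exact limit may differ.
MAX_BYTES = 30 * 1024 * 1024


def check_upload_file(path: str) -> List[str]:
    """Return a list of problems found; an empty list means the file looks OK."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than 30MB")
    try:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header is None:
                problems.append("file is empty")
            elif "text" not in header:
                problems.append('no column named "text"')
            for _ in reader:
                pass  # read every row so decoding errors surface
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    except csv.Error as exc:
        problems.append(f"file could not be parsed as CSV: {exc}")
    return problems
```

Calling `check_upload_file("my_data.csv")` on a hypothetical file returns an empty list when none of these checks fail.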

This step runs some analysis on your text and might take up to 5 minutes.

Uninstall

In the terminal:

Does this tool need an internet connection?

Nope. No data will ever leave your local machine.

Troubleshooting

If the project data does not load, go to the homepage at http://localhost:5000 and navigate to the project from there.

Try running dataqa test to get more information about the error. Bug reports are very welcome!

To test the application, you can upload a CSV file that contains a column named "__LABEL__". The ground-truth labels will then be displayed during labelling, and the real performance will be shown in brackets in the performance table.
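A test file of this shape can be produced with the standard `csv` module. The sketch below is only illustrative: the helper name `write_labelled_csv` is hypothetical, and writing the labels as plain strings is an assumption, since the expected label format is not described here.

```python
import csv
from typing import Iterable, Tuple


def write_labelled_csv(path: str, rows: Iterable[Tuple[str, str]]) -> None:
    """Write (text, label) pairs to a UTF-8 CSV with "text" and "__LABEL__" columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "__LABEL__"])
        for text, label in rows:
            # Assumption: labels are written as-is; adapt to the format the tool expects.
            writer.writerow([text, label])


if __name__ == "__main__":
    write_labelled_csv(
        "labelled_sample.csv",
        [("I want a refund", "complaint"), ("Thanks, great service", "praise")],
    )
```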

Documentation

Documentation at: https://dataqa.ai/docs/.

What is weak supervision and why does it work?

Weak supervision is a set of techniques to produce noisy labels for large quantities of data. It has gained popularity in recent years because ML systems typically need large amounts of labelled data. The annotator can encode any prior domain knowledge they have in the form of rules. Even though these rules can be noisy, the algorithm learns how to weigh them accordingly and uses them as signals to extract patterns from the data.
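To make the idea of weighing noisy rules concrete, here is a toy sketch. It is not DataQA's algorithm: it swaps in a much simpler technique, where each rule's weight is its accuracy on a handful of hand-labelled examples and the final label is a weighted vote. All names (`contains_rule`, `estimate_weights`, `weighted_vote`) and the example texts and labels are made up for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

ABSTAIN = None  # a rule returns this when it has no opinion on a text
Rule = Callable[[str], Optional[str]]


def contains_rule(keyword: str, label: str) -> Rule:
    """Build a rule that votes `label` whenever `keyword` appears in the text."""
    def rule(text: str) -> Optional[str]:
        return label if keyword in text.lower() else ABSTAIN
    return rule


def estimate_weights(rules: Sequence[Rule],
                     labelled: Sequence[Tuple[str, str]]) -> List[float]:
    """Weight each rule by its accuracy on the labelled examples where it fires.

    A rule that never fires gets a neutral weight of 0.5 (an arbitrary choice).
    """
    weights = []
    for rule in rules:
        fired = correct = 0
        for text, gold in labelled:
            vote = rule(text)
            if vote is not ABSTAIN:
                fired += 1
                correct += int(vote == gold)
        weights.append(correct / fired if fired else 0.5)
    return weights


def weighted_vote(text: str, rules: Sequence[Rule],
                  weights: Sequence[float]) -> Optional[str]:
    """Sum the weights of the rules voting for each label and return the top label."""
    scores: Dict[str, float] = {}
    for rule, weight in zip(rules, weights):
        vote = rule(text)
        if vote is not ABSTAIN:
            scores[vote] = scores.get(vote, 0.0) + weight
    if not scores:
        return ABSTAIN
    return max(scores, key=lambda label: scores[label])


if __name__ == "__main__":
    rules = [
        contains_rule("refund", "complaint"),
        contains_rule("thanks", "praise"),
        contains_rule("great", "praise"),
        contains_rule("not", "complaint"),  # deliberately noisy rule
    ]
    labelled = [
        ("I want a refund", "complaint"),
        ("thanks, great service", "praise"),
        ("not bad at all, thanks", "praise"),
        ("this is not working", "complaint"),
    ]
    weights = estimate_weights(rules, labelled)
    print(weights)
    print(weighted_vote("not great, I want a refund", rules, weights))
```

In this toy data the noisy "not" rule is right only half of the time it fires, so it ends up with a lower weight than the other rules and has less influence on the final vote.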

<div align="center"> <h4>Creating a rule for classification</h4> <img src="github_images/rule_creation.gif" width="800" align="center"/> <p>&nbsp;</p> <h4>Creating a rule for NER</h4> <img src="github_images/ner_rule.gif" width="800" align="center"/> </div>

Contact

For any feedback, please contact us at contact@dataqa.ai. You can also follow me for more updates and content around ML and labelling.