Training Set Cleaning

When dealing with massive datasets, noise becomes inevitable. This is an increasingly common problem for ML training, and noise can enter a dataset from many sources.

If trained on these noisy datasets, ML models may suffer not only lower quality but also risks along other quality dimensions such as fairness. Careful data cleaning can mitigate this; however, it becomes very expensive if we must investigate and clean every example. With a more data-centric approach, we hope to direct human attention and cleaning effort toward the examples that matter most for improving the ML model.

In this data cleaning challenge, we invite participants to design and experiment with data-centric approaches to strategic cleaning of the training set of an image classification model. As a participant, you will rank the samples in the entire training set; we then clean them one by one and evaluate the model's performance after each fix. The earlier the model reaches a high enough accuracy, the better your submission ranks.
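To make the ranking task concrete, here is a minimal sketch of one common baseline heuristic: order training examples by the per-example loss of a trained model, so that suspected-noisy (high-loss) examples are cleaned first. This is only an illustration; the challenge's actual submission format and baseline algorithms are defined in the repository, and the function below is a hypothetical helper.

```python
# A minimal sketch of a loss-based cleaning-priority baseline.
# NOTE: `rank_by_loss` is a hypothetical helper, not the official
# submission API; consult the challenge repo for the real format.
import numpy as np

def rank_by_loss(losses) -> list:
    """Return training-example indices ordered by descending loss,
    so examples the model struggles with most are cleaned first."""
    return np.argsort(-np.asarray(losses, dtype=float)).tolist()

# Example: per-example cross-entropy losses from a trained model
losses = [0.1, 2.3, 0.05, 1.7]
print(rank_by_loss(losses))  # → [1, 3, 0, 2]
```

Any signal can be substituted for the loss here (e.g., prediction disagreement across models); the key idea is that the ranking determines the order in which human cleaning effort is spent.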

Similar to the other DataPerf challenges, the cleaning challenge comes in two flavors: an open division and a closed division.

How to Participate

To make participation as easy as possible, we provide two tools that ease the process of iterating and submitting: MLCube and Dynabench. MLCube helps you get started on your local machine: it downloads the datasets, runs baseline algorithms, evaluates your submission alongside the baselines, and plots the results. Once you are satisfied with your results, you can submit them to Dynabench, the platform where we evaluate your submission and host the leaderboard for this challenge.

Offline Evaluation with MLCube

The evaluation code of the challenge is entirely open at https://github.com/DS3Lab/dataperf-vision-debugging, where you can run baselines and evaluate your algorithms locally. Below are instructions on how to set up the environment and run them locally.

Prerequisite

In order to perform offline evaluation with MLCube, you will need to have Docker installed on your system. Please follow the installation steps for your platform:

Install on Mac

Project setup


# Fetch the vision selection repo
git clone https://github.com/DS3Lab/dataperf-vision-debugging && cd ./dataperf-vision-debugging

# Create Python environment and install MLCube Docker runner
python3 -m venv ./venv && source ./venv/bin/activate && pip install mlcube-docker

Execute single tasks with MLCube
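Once the environment is set up, individual steps (downloading data, evaluating, plotting) are each run as a separate MLCube task. As a rough illustration of that workflow, the sketch below builds the corresponding MLCube Docker-runner command lines from Python. The task names used here are assumptions for illustration; consult the repository README for the actual task list and flags.

```python
# Sketch: driving MLCube tasks from Python via subprocess.
# ASSUMPTION: task names ("download", "evaluate", "plot") are
# illustrative only; check the repo README for the real ones.
import subprocess

def mlcube_cmd(task: str) -> list:
    """Build a command line for running one MLCube task with the
    Docker runner installed during project setup."""
    return ["mlcube", "run", "--mlcube=.", f"--task={task}"]

if __name__ == "__main__":
    for task in ["download", "evaluate", "plot"]:
        print(" ".join(mlcube_cmd(task)))
        # To actually execute (requires Docker and the repo checkout):
        # subprocess.run(mlcube_cmd(task), check=True)
```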