Active learning: deduplication example

Github
https://github.com/romankazinnik/romankazinnik_blog/tree/master/active_learning

Roman Kazinnik

7 min readJun 2, 2021

Active Learning: main components

By the end of this post we will learn the following components of Active Machine Learning:

Blocking
Iterative labeling:

Select a small number of data points for manual labeling
Stopping criteria
Update Training data and Classifier

3. Run Classifier for all blocks

Active Learning: high-level diagrams

Deduplication with Active Learning includes (1) training a new classifier and (2) applying the classifier for the full data the following:

(1) Create Classifier with Iterative Active Learning

(2) Parallel data processing pipeline: Apache Beam ParDo-GroupBy / Map-reduce, or Python-3 multiprocessing pool (used in this blog)

Input Data, Output Data, Training Data, and data points.

I will explain this entire concept behind deduping with active learning with the help of a simple example. Let’s assume, I am given 1M records of restaurants. Each record has multiple columns containing information about a restaurant, such as a restaurant location, name, and name of the restaurant listing marketplace. All the columns are assumed to be noisy, such as the same restaurants may have similar but different names and slightly different locations in different marketplaces. Deduping helps us out by splitting the input data into two datasets of unique restaurants and non-unique restaurants. Active Learning approaches operate with a classifier trained for training data. Training data consists of pairs of records from the Input Data. The total number of all possible pairs is too high and computationally infeasible (½ 1M * 1M = .5e+12). A naive approach would require manually labeling the entire Training Data if some 1e+12 data points. Active learning allows the creation of reasonably good classifiers with below a hundred manual labelings. But how to choose these one hundred data points?

Active Learning: results with one hundred manual labeling

High Confidence

these pairs were predicted by the classifier as duplicates, which can be clearly seen as correct predictions.

Low confidence ‘noise’

example of data points from ‘noise’ interval when it is impossible to guess the correct labels

Blocks

Our task is to find duplicates in one million records and some 1e+12 pairs (data points). Without a classifier, one would need to manually label every possible pair which could be autonomous number T=1e+12 and computationally not feasible even for fastest classifiers. In addition, one would need to run a classifier for 1e+12 data points, which is not feasible.

The blocking approach solves the problem of a high number of pairs T by splitting all the records into thousands of smaller sets of records. We must assume records in different blocks can’t be duplicated. For example, restaurants with locations too far from each other must be different restaurants.

As a result, one can ignore pairs coming from different blocks and process only pairs where both restaurants belong to the same block. I use a quadtree to create thousands of blocks with at most 300 records per block and with about 100 records on average. The block boundaries must overlap (one can think about this question!) so that the total number of records in all the blocks will increase. However, the total number of pairs T as a result of splitting into B blocks with maximum R records will be limited by B*½ R² which is much smaller than ½ 10¹²

Classifier

The classifier is trained with labeled pairs of records (datapoints of training data) and computes a probability that a pair of two restaurant records denote a single (‘duplicate’) restaurant.

At each iterative step of manual labeling, one creates a new training set and a new classifier.

Specifically, at each step, training data is added with new manually labeled data points, usually a small number of less than ten.

After that, a new classifier is trained with the newly labeled data, whereas the manually labeled data points have higher weights in training.

Next, I run a classifier for another subset of unlabeled data points, and the highest confidence points are added to the training data automatically, without manual labeling. These automatically selected new data points will have lower weight.

After a few iterations of adding manually labeled data points and points with high confidence, the accumulated training data becomes large enough. This way, a classifier can utilize deep learning architectures.

Sampling and Manual Labeling

The iterative process of manual labeling adds more data to the training data. Instead of selecting the random pairs for manual labeling, Active Learning selects pairs with the highest impact with the goal to minimize the number of manual labeling iterations required to create an accurate classifier.

Manual iterative labeling of one positive- and one negative-class data points

The sampling strategy helps to select a small number of pairs for manual labeling from large unlabeled data. At each iteration K+1, we use Classifer_K to select a small number of pairs of records (as small as five) that would benefit the most in training a new classifier.

There are two steps:

(1) selecting the five pairs for manual labeling is still a research challenge. There are three main strategies. One strategy suggests select data with the highest uncertainty, such as closest to the probability of fifty percent for each class. In this work, I used another strategy. I observed that the data points at the fifty percent boundary are simply ‘noise’ and it is impossible to guess the labels for them. I sample data points far enough from the ‘noise’ boundary and from the ‘too certain’ sides.

Figure-1: Histogram of Classifier predicted probabilities — Figure 1 Classifiers output histogram

(2) The newly manually labeled records are added to the training with higher weights. The new classifier denoted Classifeir_K+1/2. After that, I extend the training data with unlabeled data points that have high confidence under Classifier_k+½, and create a new Classifer_k+1.

Figure 2: Hypothesis — Figure 2 Hypothesis of Three confidence intervals

Figure-1 shows a histogram of probabilities at the 10th iteration (outputs by Classifier-10 for Training Data-10), which confirms observation from another Active Learning work (Figure-2): both show that the shape has three following phases: Noise, Medium, and High Confidence.

Algorithm

Create Blocks
Initialize: Classifier-0 and Training-Data-0
Select M unlabeled data points. Apply Classifier-k to M datapoints.
Select D datapoints from M, both positive and negative ranges, using Sampling Strategy from the Medium Certainty range P-medium.
Quit if Classifier-K accuracy looks sufficiently high for the selected D data points. Otherwise, label manually and add the M newly labeled data points with higher weights to create Training Data-k+½
Run Classifier-k+½ for M more unlabeled datapoints and add the data points with the high confidence greater P-high to create Training Data-k+1

Parameters: D (as small as five data points), M (a thousand of data points), P-Medium confidence range, P-high probability threshold, Classifier architecture (Logistic Regression, RandomForest or DNN)

Computational Steps

20 mins: Build docker image
10 min Step-1:
Create blocks
15 min Step-2:
Create Initial Labels and initial Classifier
10 mins: Iterate until Classifier accuracy is sufficient
Sampling Strategy: Pull from Unlabeled and Label manually
Retrain Classifier with newly added Labels
100 min Step-3:
Run Classifier Inferences for All Block of Pairs.
32K blocks processed 10 blocks per min x32 cores
3 min Step-4:
Calibration: select threshold to decide duplicates
Use up to ten binary splits to decide on the best threshold value
Using the selected threshold, create two files with unique and duplicated restaurant names

Calibration selects the optimal threshold by showing five data points selected at random up to ten times

Docker file with four computations steps

Future work

Accuracy — How to estimate the actual classification accuracy attained by this algorithm?

Experimentation — try new features and model architectures and improve the classification accuracy.

Benchmark — multiple deduplication packages available open-source. Run some of them and compare accuracy

Production grade — current algorithms can run parallel and can persist its computations but it doesn’t robust to failures. For example, one needs to re-run if even one block out of thousands fails to compute. Implement an optimal reducer to aggregate all the results. Perhaps it makes to utilize available MLOps platforms, such as MetaFlow, MLFlow, Kubeflow.

Originally published at https://www.romankazinnik.com on June 2, 2021.