# Calculating the Similarity Between Two Machine Learning Datasets – Visual Studio Magazine

The Data Science Lab

### Calculating the Similarity Between Two Machine Learning Datasets

At first glance, calculating the similarity / distance between two sets of data seems easy, but in fact the problem is extremely difficult, says Dr. James McCaffrey of Microsoft Research.

A fairly common sub-problem in many machine learning and data science scenarios is the need to calculate the similarity (or difference, or distance) between two datasets. For example, if you select a sample from a large training dataset, you probably want to know how similar the sample is to the source dataset. Or, before training a very deep neural network, you might want to find an existing model that was trained on a dataset most similar to your new dataset.

At first glance, calculating the similarity / distance between two sets of data seems easy, but in fact the problem is extremely difficult. If you try to compare individual rows between sets of data, you quickly run into the problem of the combinatorial explosion: there are just too many comparisons. There are also related issues with handling different sizes of datasets and dealing with non-numeric data. The bottom line is that in most scenarios there is no easy way to calculate the distance between two sets of data.

This article explains how to calculate the distance between two sets of data using a combination of neural and classical statistical techniques. A good way to see where this article is going is to take a look at the screenshot of a demo program in **Figure 1**.

The objective of the demo is to calculate the distance between a dataset P, which is 100 rows of the UCI Digits dataset, and a dataset Q, which is the same as P except that 50 percent of its data rows have been randomized. The calculated distance between the two datasets is 1.6625, where larger values indicate greater dissimilarity.

Each row of data in the P and Q datasets represents a crude 8×8 image of a handwritten digit. Each line has 65 numbers: the first 64 are grayscale pixel values from 0 to 16, and the last number is the associated class label, from 0 to 9. The demo loads the P and Q data into memory, then uses dataset P to create a 65-32-4-32-65 deep neural autoencoder. The result is an autoencoder that converts each data item (including its label) into a vector of four values, where each value is a number between 0.0 and 1.0.

Then the demo runs each data item in P and Q through the autoencoder to convert it to a four-valued vector. Next, each of the four latent components of the two datasets is binned into a 10-bin frequency distribution. The last step is to calculate the four distances between the four pairs of frequency distributions, which gives (2.18, 1.06, 2.68, 0.73). The four distances are averaged to give the final distance of 1.6625 between P and Q. The process is much simpler than it first appears.
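The final averaging step can be checked directly. This minimal snippet just reproduces the arithmetic on the four per-component distances reported by the demo:

```python
import numpy as np

# The four per-component frequency-distribution distances from the demo
component_dists = np.array([2.18, 1.06, 2.68, 0.73])

# The overall dataset distance is their simple average
overall = component_dists.mean()
print(f"{overall:.4f}")  # 1.6625
```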

This article assumes you have intermediate or better programming skill with a C-family language, preferably Python, and a basic familiarity with the PyTorch code library. The complete source code for the demo program is presented in this article, and the code is also available in the accompanying download. The data for the two files used by the demo program is embedded as comments at the end of the source code download file.

To run the demo program, you must have Python and PyTorch installed on your machine. The demo was developed on Windows 10 using the 64-bit Anaconda 2020.02 distribution (which contains Python 3.7.6) and PyTorch version 1.8.0 for CPU, installed via pip. Installation is not trivial; you can find detailed step-by-step installation instructions in my blog post.

**The autoencoder component**

The definition of the demo program's autoencoder is presented in **Listing 1**. The architecture is 65-32-4-32-65. The input layer has 65 nodes, one for each value in a line of the UCI Digits data. The latent dimension is a hyperparameter and is set to 4. You can think of the latent dimension as the number of abstract attributes that represent a data item. For example, each digit image might be represented by "roundness", "sharpness", "thickness" and "laterality".

**Listing 1:** The autoencoder definition

```python
import numpy as np
import torch as T

device = T.device("cpu")

class Autoencoder(T.nn.Module):  # 65-32-4-32-65
  def __init__(self):
    super(Autoencoder, self).__init__()
    self.layer1 = T.nn.Linear(65, 32)  # includes labels
    self.layer2 = T.nn.Linear(32, 4)
    self.layer3 = T.nn.Linear(4, 32)
    self.layer4 = T.nn.Linear(32, 65)
    self.latent_dim = 4

  def encode(self, x):
    z = T.tanh(self.layer1(x))
    z = T.sigmoid(self.layer2(z))
    return z

  def decode(self, x):
    z = T.tanh(self.layer3(x))
    z = T.sigmoid(self.layer4(z))
    return z

  def forward(self, x):
    z = self.encode(x)
    oupt = self.decode(z)
    return oupt
```

The encode() method applies sigmoid activation to the latent layer so that all latent values are in the range [0.0, 1.0]. This allows the frequency distributions to be computed easily. The decode() method also uses sigmoid activation, on the output layer. This means the input data must be normalized to a [0.0, 1.0] range so that inputs can be directly compared to outputs. Therefore, the 64 input pixel values, which are all between 0 and 16, are divided by 16, and the class labels, which are between 0 and 9, are divided by 9.
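As a minimal check of the normalization scheme (pixel values divided by 16, class labels divided by 9), where the specific values here are made up for illustration:

```python
import numpy as np

# A few example pixel values (0-16) followed by a class label (0-9)
row = np.array([0.0, 8.0, 16.0, 7.0])

pixels = row[0:3] / 16.0   # normalize pixels into [0.0, 1.0]
label = row[3] / 9.0       # normalize label into [0.0, 1.0]
print(pixels)  # [0.  0.5 1. ]
```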

The program-defined train() function is presented in **Listing 2**. It accepts a PyTorch Dataset object and uses Adam optimization with a hard-coded learning rate of 0.001. You might want to parameterize the learning rate to make it easier to experiment and find a good value for your particular problem scenario.

**Listing 2:** Autoencoder training

```python
def train(ae, ds, bs, me, le):
  # train autoencoder ae with dataset ds using batch
  # size bs, with max epochs me, log_every le
  data_ldr = T.utils.data.DataLoader(ds, batch_size=bs,
    shuffle=True)
  loss_func = T.nn.MSELoss()
  opt = T.optim.Adam(ae.parameters(), lr=0.001)
  print("Starting training")
  for epoch in range(0, me):
    for (b_idx, batch) in enumerate(data_ldr):
      opt.zero_grad()
      X = batch
      oupt = ae(X)
      loss_val = loss_func(oupt, X)  # note X not Y
      loss_val.backward()
      opt.step()
    if epoch > 0 and epoch % le == 0:
      print("epoch = %6d" % epoch, end="")
      print("  curr batch loss = %7.4f" % loss_val.item(), end="")
      print("")
  print("Training complete ")
```

The Dataset definition is presented in **Listing 3**. Notice that the 64 pixel values and the class label are programmatically normalized to a [0.0, 1.0] range. If your data has categorical variables, such as a color variable with possible values (red, blue, green), you can use one-hot encoding: red = (1, 0, 0), blue = (0, 1, 0), green = (0, 0, 1). One-hot encoding solves the non-numeric data problem and the normalization problem at the same time. If your data contains Boolean values, you should encode them using 0-1 encoding rather than minus-one-plus-one encoding.

**Listing 3:** The Dataset definition

```python
class Digits_Dataset(T.utils.data.Dataset):
  # for an Autoencoder (not a classifier)
  # assumes data is:
  # 64 pixel values (0-16) (comma) label (0-9)
  # [0] [1] . . [63]  [64]

  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(65),
      delimiter=",", comments="#", dtype=np.float32)
    self.xy_data = T.tensor(all_xy,
      dtype=T.float32).to(device)
    self.xy_data[:, 0:64] /= 16.0  # normalize pixels
    self.xy_data[:, 64] /= 9.0     # normalize labels

  def __len__(self):
    return len(self.xy_data)

  def __getitem__(self, idx):
    xy = self.xy_data[idx]
    return xy
```
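The one-hot encoding idea mentioned above can be sketched in a few lines; the color variable and the helper expression here are purely illustrative and are not part of the demo program:

```python
import numpy as np

# One-hot encode a categorical column with values (red, blue, green)
categories = ["red", "blue", "green"]
colors = ["red", "blue", "green", "red"]  # example raw data

one_hot = np.array([[1.0 if c == cat else 0.0 for cat in categories]
                    for c in colors])
print(one_hot[0])  # [1. 0. 0.]
```

The encoded values are already in [0.0, 1.0], so no further normalization is needed for such a column.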

**Converting data to frequency distributions**

The autoencoder component converts each data item into a vector of four latent values, such as [0.54, 0.93, 0.11, 0.63]. Each of the four latent components generates a frequency distribution. The work is done by a program-defined make_freq_mat() function and a value_to_bin() helper function. These two functions are presented in **Listing 4**.

**Listing 4:** Functions to construct frequency distributions

```python
def value_to_bin(x):
  # x is in [0.0, 1.0]
  if x >= 0.0 and x < 0.1: return 0
  elif x >= 0.1 and x < 0.2: return 1
  elif x >= 0.2 and x < 0.3: return 2
  elif x >= 0.3 and x < 0.4: return 3
  elif x >= 0.4 and x < 0.5: return 4
  elif x >= 0.5 and x < 0.6: return 5
  elif x >= 0.6 and x < 0.7: return 6
  elif x >= 0.7 and x < 0.8: return 7
  elif x >= 0.8 and x < 0.9: return 8
  else: return 9

# -----------------------------------------

def make_freq_mat(ae, ds):
  d = ae.latent_dim
  result = np.zeros((d, 10), dtype=np.int64)
  n = len(ds)
  for i in range(n):
    x = ds[i]
    with T.no_grad():
      latents = ae.encode(x)
    latents = latents.numpy()
    for j in range(d):
      bin = value_to_bin(latents[j])
      result[j][bin] += 1
  result = (result * 1.0) / n
  return result
```

The code is fairly straightforward, but the idea is a bit subtle. Suppose the latent dimension is only two (instead of four as in the demo) and you have just eight data items (instead of 100). Each data item yields a vector of two values. Suppose they are:

(0.23, 0.67) (0.93, 0.55) (0.11, 0.38) (0.28, 0.72) (0.53, 0.83) (0.39, 0.21) (0.84, 0.79) (0.22, 0.61)

The first frequency distribution looks only at the first values: (0.23, 0.93, 0.11, 0.28, 0.53, 0.39, 0.84, 0.22). With 10 bins, the bins are [0.0, 0.1), [0.1, 0.2), . . . [0.9, 1.0]. The bin counts for the first latent dim are (0, 1, 3, 1, 0, 1, 0, 0, 1, 1). Because there are eight items, the frequency distribution is (0.000, 0.125, 0.375, 0.125, 0.000, 0.125, 0.000, 0.000, 0.125, 0.125).

Likewise, the bin counts for the second latent dim are (0, 0, 1, 1, 0, 1, 2, 2, 1, 0) and the frequency distribution is (0.000, 0.000, 0.125, 0.125, 0.000, 0.125, 0.250, 0.250, 0.125, 0.000).
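The worked example above can be verified with a short snippet that bins both latent components into 10 equal-width bins (the arithmetic bin index here is a compact equivalent of the value_to_bin() cascade, clamping a value of exactly 1.0 into the last bin):

```python
import numpy as np

# The eight two-value latent vectors from the example above
latents = np.array([
    [0.23, 0.67], [0.93, 0.55], [0.11, 0.38], [0.28, 0.72],
    [0.53, 0.83], [0.39, 0.21], [0.84, 0.79], [0.22, 0.61]])

counts = np.zeros((2, 10), dtype=np.int64)
for row in latents:
    for j in range(2):
        b = min(int(row[j] * 10), 9)  # bin index; clamp 1.0 into bin 9
        counts[j][b] += 1
freqs = counts / len(latents)

print(counts[0])  # [0 1 3 1 0 1 0 0 1 1]
print(counts[1])  # [0 0 1 1 0 1 2 2 1 0]
```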

**Comparison of frequency distributions**

There are several ways to calculate the distance between two frequency distributions. Common distance functions include Kullback-Leibler divergence, Jensen-Shannon distance, and Hellinger distance. The demo program uses a highly simplified variant of the Wasserstein distance function, called the information transfer distance. In short, the information transfer distance is the minimum amount of work required to transform one frequency distribution into the other.

The distance measure is implemented as a program-defined info_transfer_distance() function plus two helper functions, move_info() and first_nonzero(). The implementations are presented in **Listing 5**.

**Listing 5:** Information transfer distance functions

```python
def first_nonzero(vec):
  dim = len(vec)
  for i in range(dim):
    if vec[i] > 0.0:
      return i
  return -1  # no non-zero cell found

def move_info(src, si, dest, di):
  # move as much of src at [si] as possible to dest[di]
  if src[si] <= dest[di]:    # move all of src
    mass = src[si]
    src[si] = 0.0            # all of src got moved
    dest[di] -= mass         # less to fill now
  elif src[si] > dest[di]:   # use just part of src
    mass = dest[di]          # fill remainder of dest
    src[si] -= mass          # less info left in src
    dest[di] = 0.0           # dest is filled
  dist = np.abs(si - di)
  return mass * dist         # weighted info transfer

def info_transfer_distance(p, q):
  # distance between two p-dists p, q
  # highly simplified Wasserstein distance
  source = np.copy(p)
  destination = np.copy(q)
  tot_info = 0.0
  while True:  # TODO: add sanity counter check
    from_idx = first_nonzero(source)
    to_idx = first_nonzero(destination)
    if from_idx == -1 or to_idx == -1:
      break
    info = move_info(source, from_idx,
      destination, to_idx)
    tot_info += info
  return tot_info
```

You can think of info_transfer_distance() as a black box because it does not need to change regardless of your data. If two frequency distributions P and Q are identical, info_transfer_distance(P, Q) = 0. The greater the difference between P and Q, the larger the value of the function. If the frequencies are grouped into 10 bins, the maximum possible value of the information transfer distance is 9.0.
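For proper frequency distributions (each summing to 1.0) over equally spaced bins, the greedy transfer scheme above agrees with the classic one-dimensional earth-mover distance, which can also be computed from cumulative sums. This equivalent form (an alternative formulation, not part of the demo code) makes the two boundary cases easy to check:

```python
import numpy as np

def emd_1d(p, q):
  # 1-D earth-mover distance over equally spaced bins:
  # sum of absolute differences of the cumulative sums
  return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

p = np.zeros(10); p[0] = 1.0  # all mass in bin 0
q = np.zeros(10); q[9] = 1.0  # all mass in bin 9

print(emd_1d(p, p))  # 0.0 -- identical distributions
print(emd_1d(p, q))  # 9.0 -- maximum possible with 10 bins
```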

**Putting it all together**

The demo program starts by specifying the names of the files that hold the two datasets to compare:

```python
def main():
  # 0. get started
  print("Begin UCI Digits dataset distance demo ")
  T.manual_seed(1)
  np.random.seed(1)

  p_file = ".\\Data\\uci_digits_train_100.txt"
  q_file = ".\\Data\\digits_100_noise_50.txt"
. . .
```

The P file contains 100 randomly selected lines of the UCI Digits dataset. The Q file is the same as P except that 50 percent of its lines have been replaced with random values: 0 to 16 for the 64 pixel values and 0 to 9 for the class label.
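The noised Q data is supplied in the download, but the construction just described can be sketched in memory. The seed, array shapes, and generation details here are assumptions for illustration, not the code that produced the actual demo file:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the 100-row P data: 64 pixels (0-16) plus a label (0-9)
p = rng.integers(0, 17, size=(100, 65))
p[:, 64] = rng.integers(0, 10, size=100)

# Q: same as P, but 50 randomly chosen rows get random pixels and labels
q = p.copy()
noisy_rows = rng.choice(100, size=50, replace=False)
q[noisy_rows, 0:64] = rng.integers(0, 17, size=(50, 64))
q[noisy_rows, 64] = rng.integers(0, 10, size=50)
```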

Then the two datasets are loaded into memory:

```python
  print("P file = " + str(p_file))
  print("Q file = " + str(q_file))
  print("First five lines of Q: ")
  show_uci_file(q_file, 5)

  # 1. create Dataset objects
  print("Loading P and Q Datasets into memory ")
  p_ds = Digits_Dataset(p_file)
  q_ds = Digits_Dataset(q_file)
```

The show_uci_file() function is program-defined and is not essential for calculating dataset similarity; a simple print() statement would work fine. Next, an autoencoder is created and trained using the reference dataset P:

```python
  # 2. create and train autoencoder model using parent
  print("Creating 65-32-4-32-65 autoencoder using P ")
  autoenc = Autoencoder()  # 65-32-4-32-65
  autoenc.train()          # set mode
  bat_size = 10
  max_epochs = 100
  log_every = 20
  train(autoenc, p_ds, bat_size, max_epochs, log_every)
```