Tutorial 1: Simple Gaussian Mutual Information

In this tutorial, we will generate some random data from a Gaussian distribution to see how to compute the mutual information (MI) for a set of data. A Jupyter notebook of this tutorial can be found in the repository https://github.com/netscianalysis/ in the tutorials/ folder.

Make sure to activate the NetSci conda environment:

conda activate netsci

Two Independent Gaussians

The following Python code will sample data from two independent Gaussian distributions, with mean zero and standard deviation of 1, creating a file named ‘sample.dat’:

import numpy as np

# 100 rows
N = 100

# 2 columns
M = 2

gaussian_2D = np.zeros((M, N)).astype(np.float32)
for i in range(M):
    gaussian_2D[i,:] = np.random.normal(size=N)

with open("sample.dat", "w") as f:
    f.write("column1\tcolumn2\n")
    for i in range(N):
        f.write(str(gaussian_2D[0,i])+"\t"+str(gaussian_2D[1,i])+"\n")

Inside sample.dat, there are two columns of data, each containing 100 rows (not including the header):

column1     column2
-0.7315501  -0.2413269
-1.012331   0.03488477
-0.5544968  -1.5803745
-0.86383635 0.91249526
2.1049206   -0.46482915
0.84087217  1.1912068
2.6172605   1.1999675
-0.52112085 0.1250357
-0.9300766  0.73021877
1.1873163   2.590006
-1.0747769  -0.05441373
...

Now, let us use NetSci to compute the MI between these two distributions:

import numpy as np

import netcalc
import cuarray

# k-nearest neighbors number
k = 1

# Dimensionality of data
d = 1

# Two dimensions of distributions
xd = 2

# Number of data points in 2D distribution
N = 100

# Define all pair correlations that will be computed
num_nodes = 2
num_node_pairs = num_nodes**2
ab = cuarray.IntCuArray()
ab.init(num_node_pairs, 2)
for i in range(num_nodes):
    for j in range(num_nodes):
        node_pair_index = i*num_nodes + j
        ab[node_pair_index][0] = i
        ab[node_pair_index][1] = j

# Load the data into a numpy array
data_list = []
with open("sample.dat", "r") as f:
    for i, line in enumerate(f.readlines()):
        if i == 0:
            # Skip the data header
            continue

        line = line.strip().split()
        col1_datum = line[0]
        col2_datum = line[1]
        data_list.append([col1_datum, col2_datum])

assert len(data_list) == N

two_columns_of_data = np.array(data_list, dtype=np.float32).T

# The input array
X = cuarray.FloatCuArray()
X.fromNumpy2D(two_columns_of_data)

# The output array
I = cuarray.FloatCuArray()

netcalc.mutualInformation(X, I, ab, k, N, xd, d, netcalc.GPU_PLATFORM)

mutual_information = I[0][1]

print("predicted mutual information for sample.dat:", mutual_information)

Let us consider the inputs to this script later. First, let’s examine the output:

predicted mutual information for sample.dat: -0.015671253204345703

The exact MI for two independent Gaussian distributions can be found exactly:

\[I_{exact}(X,Y)=-\frac{1}{2}\log{(1-r^2)}\]

Where \(I_{exact}(X,Y)\) is the MI of the two Gaussian distributions \(X\) and \(Y\), and \(r\) is the covariance of the two distributions, which in this case, is zero. Therefore, for \(r=0\), \(I_{exact}(X,Y)=0\), which is fairly close to the value we obtained in this case.

Let us consider the inputs to the script above:

k: This is the “k” of “k-nearest-neighbors” - that is - how many neighbors to each point are taken in Kraskov’s algorithm. Kraskov recommends a value of “k” between 2 and 4, although situations where k=6 have also been used. In general, a lower value of “k” reduces bias, but increases noise, and a higher value of “k” reduces noise, but increases bias for non-independent distributions. If testing for independence, bias is not such an issue, so k can be as large as N/2, where N is the number of data points.
d: Dimensionality of the data. Since these are 1D Gaussians, the dimensionality is equal to one. In contrast, for example, the positions of atoms in a protein would be three-dimensional data.
xd: The number of distributions per MI calculation. At this time, only the MI of two distributions can be calculated at a time in NetSci, so xd=2.
N: The number of data points sampled from each distribution.
ab: An array that represents which pairs of distributions to compute. This can be adjusted to exclude the computation of certain pairs of distributions, if desired.
X: The input data, converted to a form that NetSci can use directly.
I: The output MI, with entries for each pair of distributions specified for an MI calculation.
netcalc.GPU_PLATFORM: Instruct NetSci to use the GPU device to perform the MI calculation. Alternatively, one may use the GPU by passing netcalc.CPU_PLATFORM as this argument.

Two Dependent Gaussians

Let us modify our data to include a strong dependency between the Gaussian distributions:

import numpy as np

# 100 rows
N = 100

# 2 columns
M = 2

covariance_value = 0.9

def make_covariance_matrix(covariance_value):
    cov = np.array([[1.0, covariance_value], [covariance_value, 1.0]])
    return cov

gaussian_2D_mean = np.zeros(2)
gaussian_2D_cov = make_covariance_matrix(covariance_value)
gaussian_2D = np.random.multivariate_normal(
    mean=gaussian_2D_mean,
    cov=gaussian_2D_cov,
    size=N,
).T.astype(np.float32)

with open("sample.dat", "w") as f:
    f.write("column1\tcolumn2\n")
    for i in range(N):
        f.write(str(gaussian_2D[0,i])+"\t"+str(gaussian_2D[1,i])+"\n")

This time, when one runs the MI calculation, the predicted MI is much larger:

predicted mutual information for sample.dat: 1.1769

This is due to the large covariance value between the two Gaussian distributions.

In Closing

This tutorial demonstrates the MI calculation on a set of data that was generated artificially by sampling from a Gaussian distribution. One could repeat the same MI analysis on data that was generated from any other source and benefit from the large speedup enabled by GPU acceleration.