Tutorial 1: Simple Gaussian Mutual Information
In this tutorial, we will generate some random data from a Gaussian distribution to see how to compute the mutual information (MI) for a set of data. A Jupyter notebook of this tutorial can be found in the repository https://github.com/netscianalysis/ in the tutorials/ folder.
Make sure to activate the NetSci conda environment:
conda activate netsci
Two Independent Gaussians
The following Python code will sample data from two independent Gaussian distributions, with mean zero and standard deviation of 1, creating a file named ‘sample.dat’:
import numpy as np
# 100 rows
N = 100
# 2 columns
M = 2
gaussian_2D = np.zeros((M, N)).astype(np.float32)
for i in range(M):
gaussian_2D[i,:] = np.random.normal(size=N)
with open("sample.dat", "w") as f:
f.write("column1\tcolumn2\n")
for i in range(N):
f.write(str(gaussian_2D[0,i])+"\t"+str(gaussian_2D[1,i])+"\n")
Inside sample.dat, there are two columns of data, each containing 100 rows (not including the header):
column1 column2
-0.7315501 -0.2413269
-1.012331 0.03488477
-0.5544968 -1.5803745
-0.86383635 0.91249526
2.1049206 -0.46482915
0.84087217 1.1912068
2.6172605 1.1999675
-0.52112085 0.1250357
-0.9300766 0.73021877
1.1873163 2.590006
-1.0747769 -0.05441373
...
Now, let us use NetSci to compute the MI between these two distributions:
import numpy as np
import netcalc
import cuarray
# k-nearest neighbors number
k = 1
# Dimensionality of data
d = 1
# Two dimensions of distributions
xd = 2
# Number of data points in 2D distribution
N = 100
# Define all pair correlations that will be computed
num_nodes = 2
num_node_pairs = num_nodes**2
ab = cuarray.IntCuArray()
ab.init(num_node_pairs, 2)
for i in range(num_nodes):
for j in range(num_nodes):
node_pair_index = i*num_nodes + j
ab[node_pair_index][0] = i
ab[node_pair_index][1] = j
# Load the data into a numpy array
data_list = []
with open("sample.dat", "r") as f:
for i, line in enumerate(f.readlines()):
if i == 0:
# Skip the data header
continue
line = line.strip().split()
col1_datum = line[0]
col2_datum = line[1]
data_list.append([col1_datum, col2_datum])
assert len(data_list) == N
two_columns_of_data = np.array(data_list, dtype=np.float32).T
# The input array
X = cuarray.FloatCuArray()
X.fromNumpy2D(two_columns_of_data)
# The output array
I = cuarray.FloatCuArray()
netcalc.mutualInformation(X, I, ab, k, N, xd, d, netcalc.GPU_PLATFORM)
mutual_information = I[0][1]
print("predicted mutual information for sample.dat:", mutual_information)
Let us consider the inputs to this script later. First, let’s examine the output:
predicted mutual information for sample.dat: -0.015671253204345703
The exact MI for two independent Gaussian distributions can be found exactly:
Where \(I_{exact}(X,Y)\) is the MI of the two Gaussian distributions \(X\) and \(Y\), and \(r\) is the covariance of the two distributions, which in this case, is zero. Therefore, for \(r=0\), \(I_{exact}(X,Y)=0\), which is fairly close to the value we obtained in this case.
Let us consider the inputs to the script above:
- k
This is the “k” of “k-nearest-neighbors” - that is - how many neighbors to each point are taken in Kraskov’s algorithm. Kraskov recommends a value of “k” between 2 and 4, although situations where k=6 have also been used. In general, a lower value of “k” reduces bias, but increases noise, and a higher value of “k” reduces noise, but increases bias for non-independent distributions. If testing for independence, bias is not such an issue, so k can be as large as N/2, where N is the number of data points.
- d
Dimensionality of the data. Since these are 1D Gaussians, the dimensionality is equal to one. In contrast, for example, the positions of atoms in a protein would be three-dimensional data.
- xd
The number of distributions per MI calculation. At this time, only the MI of two distributions can be calculated at a time in NetSci, so xd=2.
- N
The number of data points sampled from each distribution.
- ab
An array that represents which pairs of distributions to compute. This can be adjusted to exclude the computation of certain pairs of distributions, if desired.
- X
The input data, converted to a form that NetSci can use directly.
- I
The output MI, with entries for each pair of distributions specified for an MI calculation.
- netcalc.GPU_PLATFORM
Instruct NetSci to use the GPU device to perform the MI calculation. Alternatively, one may use the GPU by passing netcalc.CPU_PLATFORM as this argument.
Two Dependent Gaussians
Let us modify our data to include a strong dependency between the Gaussian distributions:
import numpy as np
# 100 rows
N = 100
# 2 columns
M = 2
covariance_value = 0.9
def make_covariance_matrix(covariance_value):
cov = np.array([[1.0, covariance_value], [covariance_value, 1.0]])
return cov
gaussian_2D_mean = np.zeros(2)
gaussian_2D_cov = make_covariance_matrix(covariance_value)
gaussian_2D = np.random.multivariate_normal(
mean=gaussian_2D_mean,
cov=gaussian_2D_cov,
size=N,
).T.astype(np.float32)
with open("sample.dat", "w") as f:
f.write("column1\tcolumn2\n")
for i in range(N):
f.write(str(gaussian_2D[0,i])+"\t"+str(gaussian_2D[1,i])+"\n")
This time, when one runs the MI calculation, the predicted MI is much larger:
predicted mutual information for sample.dat: 1.1769
This is due to the large covariance value between the two Gaussian distributions.
In Closing
This tutorial demonstrates the MI calculation on a set of data that was generated artificially by sampling from a Gaussian distribution. One could repeat the same MI analysis on data that was generated from any other source and benefit from the large speedup enabled by GPU acceleration.