Tutorial 3: Error Estimation with Bootstrapping
In this tutorial, we will estimate the error margins of a MI calculation by employing a bootstrapping method. A Jupyter notebook of this tutorial can be found in the repository https://github.com/netscianalysis/ in the tutorials/ folder.
Make sure to activate the NetSci conda environment:
conda activate netsci
Bootstrapping Analysis
The following Python code will sample data from two independent Gaussian distributions, with mean zero and standard deviation of 1, creating a file named ‘sample.dat’ (This is the same method as in tutorial 1: Simple Gaussian MI):
import numpy as np
# 100 rows
N = 100
# 2 columns
M = 2
gaussian_2D = np.zeros((M, N)).astype(np.float32)
for i in range(M):
gaussian_2D[i,:] = np.random.normal(size=N)
with open("sample.dat", "w") as f:
f.write("column1\tcolumn2\n")
for i in range(N):
f.write(str(gaussian_2D[0,i])+"\t"+str(gaussian_2D[1,i])+"\n")
Inside sample.dat, there are two columns of data, each containing 100 rows (not including the header):
column1 column2
-0.7315501 -0.2413269
-1.012331 0.03488477
-0.5544968 -1.5803745
-0.86383635 0.91249526
2.1049206 -0.46482915
0.84087217 1.1912068
2.6172605 1.1999675
-0.52112085 0.1250357
-0.9300766 0.73021877
1.1873163 2.590006
-1.0747769 -0.05441373
...
Now, let us compute the MI, and also run a bootstrapping analysis for an error estimate:
import numpy as np
import netcalc
import cuarray
# k-nearest neighbors number
k = 1
# Dimensionality of data
d = 1
# Two dimensions of distributions
xd = 2
# Number of data points in 2D distribution
N = 100
# Number of bootstrap samples of the data to compute
NUM_BOOTSTRAP_SAMPLES = 10
# Define all pair correlations that will be computed
num_nodes = 2
num_node_pairs = num_nodes**2
ab = cuarray.IntCuArray()
ab.init(num_node_pairs, 2)
for i in range(num_nodes):
for j in range(num_nodes):
node_pair_index = i*num_nodes + j
ab[node_pair_index][0] = i
ab[node_pair_index][1] = j
# Load the data into a numpy array
data_list = []
with open("sample.dat", "r") as f:
for i, line in enumerate(f.readlines()):
if i == 0:
# Skip the data header
continue
line = line.strip().split()
col1_datum = line[0]
col2_datum = line[1]
data_list.append([col1_datum, col2_datum])
assert len(data_list) == N
two_columns_of_data = np.array(data_list, dtype=np.float32).T
# The input array
X = cuarray.FloatCuArray()
X.fromNumpy2D(two_columns_of_data)
# The output array
I = cuarray.FloatCuArray()
netcalc.mutualInformation(X, I, ab, k, N, xd, d, netcalc.GPU_PLATFORM)
mutual_information = I[0][1]
bootstrap_MI_array = np.zeros(NUM_BOOTSTRAP_SAMPLES)
for b in range(NUM_BOOTSTRAP_SAMPLES):
# The input array
X = cuarray.FloatCuArray()
# Shuffle the data with replacement
shuffled_two_columns_of_data = np.zeros(two_columns_of_data.shape, dtype=np.float32)
indices = np.random.choice(np.arange(N), size=N, replace=True)
shuffled_two_columns_of_data[0,:] = two_columns_of_data[0, indices]
shuffled_two_columns_of_data[1,:] = two_columns_of_data[1, indices]
X.fromNumpy2D(shuffled_two_columns_of_data)
# The output array
I = cuarray.FloatCuArray()
netcalc.mutualInformation(X, I, ab, k, N, xd, d, netcalc.GPU_PLATFORM)
bootstrap_MI_array[b] = I[0][1]
bootstrap_error = np.std(bootstrap_MI_array)
print("predicted mutual information for sample.dat:", mutual_information, "+/-", bootstrap_error)