Tutorial 3: Error Estimation with Bootstrapping =============================================== In this tutorial, we will estimate the error margins of a MI calculation by employing a bootstrapping method. A Jupyter notebook of this tutorial can be found in the repository https://github.com/netscianalysis/ in the tutorials/ folder. Make sure to activate the NetSci conda environment:: conda activate netsci Bootstrapping Analysis ---------------------- The following Python code will sample data from two independent Gaussian distributions, with mean zero and standard deviation of 1, creating a file named 'sample.dat' (This is the same method as in tutorial 1: Simple Gaussian MI):: import numpy as np # 100 rows N = 100 # 2 columns M = 2 gaussian_2D = np.zeros((M, N)).astype(np.float32) for i in range(M): gaussian_2D[i,:] = np.random.normal(size=N) with open("sample.dat", "w") as f: f.write("column1\tcolumn2\n") for i in range(N): f.write(str(gaussian_2D[0,i])+"\t"+str(gaussian_2D[1,i])+"\n") Inside sample.dat, there are two columns of data, each containing 100 rows (not including the header):: column1 column2 -0.7315501 -0.2413269 -1.012331 0.03488477 -0.5544968 -1.5803745 -0.86383635 0.91249526 2.1049206 -0.46482915 0.84087217 1.1912068 2.6172605 1.1999675 -0.52112085 0.1250357 -0.9300766 0.73021877 1.1873163 2.590006 -1.0747769 -0.05441373 ... Now, let us compute the MI, and also run a bootstrapping analysis for an error estimate:: import numpy as np import netcalc import cuarray # k-nearest neighbors number k = 1 # Dimensionality of data d = 1 # Two dimensions of distributions xd = 2 # Number of data points in 2D distribution N = 100 # Number of bootstrap samples of the data to compute NUM_BOOTSTRAP_SAMPLES = 10 # Define all pair correlations that will be computed num_nodes = 2 num_node_pairs = num_nodes**2 ab = cuarray.IntCuArray() ab.init(num_node_pairs, 2) for i in range(num_nodes): for j in range(num_nodes): node_pair_index = i*num_nodes + j ab[node_pair_index][0] = i ab[node_pair_index][1] = j # Load the data into a numpy array data_list = [] with open("sample.dat", "r") as f: for i, line in enumerate(f.readlines()): if i == 0: # Skip the data header continue line = line.strip().split() col1_datum = line[0] col2_datum = line[1] data_list.append([col1_datum, col2_datum]) assert len(data_list) == N two_columns_of_data = np.array(data_list, dtype=np.float32).T # The input array X = cuarray.FloatCuArray() X.fromNumpy2D(two_columns_of_data) # The output array I = cuarray.FloatCuArray() netcalc.mutualInformation(X, I, ab, k, N, xd, d, netcalc.GPU_PLATFORM) mutual_information = I[0][1] bootstrap_MI_array = np.zeros(NUM_BOOTSTRAP_SAMPLES) for b in range(NUM_BOOTSTRAP_SAMPLES): # The input array X = cuarray.FloatCuArray() # Shuffle the data with replacement shuffled_two_columns_of_data = np.zeros(two_columns_of_data.shape, dtype=np.float32) indices = np.random.choice(np.arange(N), size=N, replace=True) shuffled_two_columns_of_data[0,:] = two_columns_of_data[0, indices] shuffled_two_columns_of_data[1,:] = two_columns_of_data[1, indices] X.fromNumpy2D(shuffled_two_columns_of_data) # The output array I = cuarray.FloatCuArray() netcalc.mutualInformation(X, I, ab, k, N, xd, d, netcalc.GPU_PLATFORM) bootstrap_MI_array[b] = I[0][1] bootstrap_error = np.std(bootstrap_MI_array) print("predicted mutual information for sample.dat:", mutual_information, "+/-", bootstrap_error)