Converting a 1.2 GB edge list into a sparse matrix

I have a 1.2 GB list of edges of a graph in a text file. My Ubuntu machine has 8 GB of RAM. Each line of the input looks like

287111206 357850135 

I would like to convert it to a sparse adjacency matrix and output it to a file.

Some statistics for my data:

  • Number of edges: around 62,500,000
  • Number of vertices: around 31,250,000

I asked the same question before at qaru.site/questions/1253879 / ... and got a good answer. The problem is that I cannot get it to work.

At first I tried np.loadtxt to load the file, but it was very slow and used a huge amount of memory. So I switched to pandas.read_csv, which is very fast, but it brought its own problems. This is my current code:

import pandas
import numpy as np
from scipy import sparse

data = pandas.read_csv("edges.txt", sep=" ", header=None, dtype=np.uint32)
A = data.as_matrix()
print type(A)
k1, k2, k3 = np.unique(A, return_inverse=True, return_index=True)
rows, cols = k3.reshape(A.shape).T
M = sparse.coo_matrix((np.ones(rows.shape, int), (rows, cols)))
print type(M)

The problem is that the pandas DataFrame is huge, and as_matrix() effectively makes a copy of it in A, which is inefficient. Worse, the code then fails with:

<type 'instancemethod'>
Traceback (most recent call last):
  File "make-sparse-matrix.py", line 13, in <module>
    rows,cols=k3.reshape(A.shape).T
AttributeError: 'function' object has no attribute 'shape'

raph@raph-desktop:~/python$ python make-sparse-matrix.py
<type 'numpy.ndarray'>
Traceback (most recent call last):
  File "make-sparse-matrix.py", line 12, in <module>
    k1,k2,k3=np.unique(A,return_inverse=True,return_index=True)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/arraysetops.py", line 209, in unique
    iflag = np.cumsum(flag) - 1
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2115, in cumsum
    return cumsum(axis, dtype, out)
MemoryError

So my questions are:

  • Can I avoid holding both the 1.2 GB pandas DataFrame and the 1.2 GB numpy array in memory at the same time?
  • Is there a way to get the code to run in 8 GB of RAM?

You can reproduce a test input of the size I am trying to handle with:

import random

# Number of edges, vertices
m = 62500000
n = m / 2
for i in xrange(m):
    fromnode = str(random.randint(0, n - 1)).zfill(9)
    tonode = str(random.randint(0, n - 1)).zfill(9)
    print fromnode, tonode

Update

Now I have tried several different approaches, all of which failed. Here is a summary.

  • Using igraph with g = Graph.Read_Ncol('edges.txt'). It uses a huge amount of RAM, which crashes my computer.
  • Using networkit with G = networkit.graphio.readGraph("edges.txt", networkit.Format.EdgeList, separator=" ", continuous=False). It uses a huge amount of RAM, which crashes my computer.
  • The code above in this question, but using np.loadtxt("edges.txt") instead of pandas. It uses a huge amount of RAM, which crashes my computer.

I then wrote separate code that remaps all node names to numbers in 1..|V|, where |V| is the total number of vertices. This spares the code that imports the edge list from having to build a lookup table of vertex names (a sketch of this remapping step is shown after the list below). Using this, I tried:

  1. Using this new remapped edge list file, I again used igraph with g = Graph.Read_Edgelist("edges-contig.txt"). This now works, but takes 4 GB of RAM (more than the theoretical minimum). However, there is no igraph function to write a sparse adjacency matrix out from a graph. The recommended solution is to convert the graph to a coo_matrix. Unfortunately, that uses a huge amount of memory, which crashes my computer.
  2. Using the remapped edge list file, I used networkit with G = networkit.readGraph("edges-contig.txt", networkit.Format.EdgeListSpaceOne). This also works, using less than the 4 GB that igraph needs. networkit also has a function for writing Matlab files (which is a form of sparse adjacency matrix that scipy can read). However, networkit.graphio.writeMat(G,"test.mat") uses a huge amount of RAM, which crashes my computer.
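
A rough sketch of the remapping step mentioned above (the file names are placeholders, vertices are numbered from 0 here, and the roughly 31M-entry dict is assumed to fit in RAM):

mapping = {}          # original node name -> new contiguous id
next_id = 0
with open('edges.txt') as fin, open('edges-contig.txt', 'w') as fout:
    for line in fin:
        u, v = line.split()
        if u not in mapping:
            mapping[u] = next_id
            next_id += 1
        if v not in mapping:
            mapping[v] = next_id
            next_id += 1
        fout.write('%d %d\n' % (mapping[u], mapping[v]))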

Finally, sascha's answer below does complete, but it takes about 40 minutes.

+8
5 answers

Here is my solution:

import numpy as np
import pandas as pd
import scipy.sparse as ss

def read_data_file_as_coo_matrix(filename='edges.txt'):
    "Read data file and return sparse matrix in coordinate format."
    data = pd.read_csv(filename, sep=' ', header=None, dtype=np.uint32)
    rows = data[0]  # Not a copy, just a reference.
    cols = data[1]
    ones = np.ones(len(rows), np.uint32)
    matrix = ss.coo_matrix((ones, (rows, cols)))
    return matrix

Pandas does the heavy lifting of parsing with read_csv, and pandas already stores the data in columnar format. data[0] and data[1] just get references, not copies. Then I pass them to coo_matrix. Benchmarked locally:

In [1]: %timeit -n1 -r5 read_data_file_as_coo_matrix()
1 loop, best of 5: 14.2 s per loop

Then, to save the csr-matrix to a file:

def save_csr_matrix(filename, matrix):
    """Save compressed sparse row (csr) matrix to file.

    Based on http://stackoverflow.com/a/8980156/232571
    """
    assert filename.endswith('.npz')
    attributes = {
        'data': matrix.data,
        'indices': matrix.indices,
        'indptr': matrix.indptr,
        'shape': matrix.shape,
    }
    np.savez(filename, **attributes)

Checked locally:

In [3]: %timeit -n1 -r5 save_csr_matrix('edges.npz', matrix.tocsr())
1 loop, best of 5: 13.4 s per loop

And later load it back from the file:

def load_csr_matrix(filename):
    """Load compressed sparse row (csr) matrix from file.

    Based on http://stackoverflow.com/a/8980156/232571
    """
    assert filename.endswith('.npz')
    loader = np.load(filename)
    args = (loader['data'], loader['indices'], loader['indptr'])
    matrix = ss.csr_matrix(args, shape=loader['shape'])
    return matrix

Checked locally:

In [4]: %timeit -n1 -r5 load_csr_matrix('edges.npz')
1 loop, best of 5: 881 ms per loop
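
If your SciPy is new enough (save_npz/load_npz were added around version 0.19), the same round trip can be done with the built-in helpers; a possible equivalent sketch:

import scipy.sparse as ss

# Assumes SciPy >= 0.19, where save_npz/load_npz were introduced.
def save_csr_matrix_builtin(filename, matrix):
    ss.save_npz(filename, matrix)   # stores data, indices, indptr and shape for us

def load_csr_matrix_builtin(filename):
    return ss.load_npz(filename)    # returns the matrix in its saved sparse format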

Finally, check all this:

 def test(): "Test data file parsing and matrix serialization." coo_matrix = read_data_file_as_coo_matrix() csr_matrix = coo_matrix.tocsr() save_csr_matrix('edges.npz', csr_matrix) loaded_csr_matrix = load_csr_matrix('edges.npz') # Comparison based on http://stackoverflow.com/a/30685839/232571 assert (csr_matrix != loaded_csr_matrix).nnz == 0 if __name__ == '__main__': test() ." def test(): "Test data file parsing and matrix serialization." coo_matrix = read_data_file_as_coo_matrix() csr_matrix = coo_matrix.tocsr() save_csr_matrix('edges.npz', csr_matrix) loaded_csr_matrix = load_csr_matrix('edges.npz') # Comparison based on http://stackoverflow.com/a/30685839/232571 assert (csr_matrix != loaded_csr_matrix).nnz == 0 if __name__ == '__main__': test() 

Running test() takes about 30 seconds:

$ time python so_38688062.py

real    0m30.401s
user    0m27.257s
sys     0m2.759s

And the peak memory usage (high-water mark) was ~1.79 GB.

Note that once "edges.txt" has been converted to "edges.npz" in CSR-matrix format, loading it takes less than a second.
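
For instance, a later analysis script might only need something along these lines (a sketch reusing load_csr_matrix from above; the vertex index 0 is just a placeholder):

matrix = load_csr_matrix('edges.npz')
print(matrix.shape, matrix.nnz)        # dimensions and number of stored edges
i = 0                                  # placeholder: any vertex/row index of interest
neighbors = matrix[i].nonzero()[1]     # column indices of nonzero entries in row i
print(neighbors)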

+12

Updated version

As mentioned in the comments, this approach is not suitable for your use case. Let's make some changes:

  • use pandas to read in the data (instead of numpy: I'm very surprised np.loadtxt performs so badly!)
  • use the external library sortedcontainers for a more memory-efficient mapping (instead of a dictionary)
  • the basic approach stays the same

This approach takes ~45 minutes (it is slow, but you can save the result, so you only need to do this once) and ~5 GB of memory to prepare the sparse matrix for your data, generated with:

import random

N = 62500000
for i in xrange(N):
    print random.randint(10**8, 10**9-1), random.randint(10**8, 10**9-1)

Code

import numpy as np
from scipy.sparse import coo_matrix
import pandas as pd
from sortedcontainers import SortedList
import time

# Read data
# global memory usage after: one big array
df = pd.read_csv('EDGES.txt', delimiter=' ', header=None, dtype=np.uint32)
data = df.as_matrix()
df = None
n_edges = data.shape[0]

# Learn mapping to range(0, N_VERTICES)  # N_VERTICES unknown
# global memory usage after: one big array + one big searchtree
print('fit mapping')
start = time.time()
observed_vertices = SortedList()
mappings = np.arange(n_edges*2, dtype=np.uint32)  # upper bound on vertices
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        if val not in observed_vertices:
            observed_vertices.add(val)
mappings = mappings[:len(observed_vertices)]
n_vertices = len(observed_vertices)
end = time.time()
print(' secs: ', end-start)

print('transform mapping')
# Map original data (in-place!)
# global memory usage after: one big array + one big searchtree (can be deleted!)
start = time.time()
for column in range(data.shape[1]):
    for row in range(data.shape[0]):
        # double-loop: slow, but easy to understand space-complexity
        val = data[row, column]
        mapper_pos = observed_vertices.index(val)
        data[row, column] = mappings[mapper_pos]
end = time.time()
print(' secs: ', end-start)
observed_vertices = None  # if not needed anymore
mappings = None           # if not needed anymore

# Create sparse matrix (only caring about a single triangular part for now)
# if needed: delete the mapping structures before, as they are not needed anymore!
sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])),
                    shape=(n_vertices, n_vertices))

First version

The code below is very simple and very inefficient (in both time and space) at building the sparse matrix. I post it because I believe it is important to understand the basic parts before using them in something bigger.

Let's see if this code is efficient enough for your use case or if it needs work. From a distance that is hard to say, because we do not have your data.

The dictionary used for the mapping is the candidate most likely to blow up your memory. But there is no point optimizing it without knowing whether that is necessary at all. Moreover, this part of the code depends on the number of vertices in your graph (and I do not know that cardinality).

 """ itertools.count usage here would need changes for py2 """ import numpy as np from itertools import count from scipy.sparse import coo_matrix # Read data # global memory usage after: one big array data = np.loadtxt('edges.txt', np.uint32) n_edges = data.shape[0] #print(data) #print(data.shape) # Learn mapping to range(0, N_VERTICES) # N_VERTICES unknown # global memory usage after: one big array + one big dict index_gen = count() mapper = {} for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity val = data[row, column] if val not in mapper: mapper[val] = next(index_gen) n_vertices = len(mapper) # Map original data (in-place !) # global memory usage after: one big array + one big dict (can be deleted!) for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity data[row, column] = mapper[data[row, column]] #print(data) # Create sparse matrix (only caring about a single triangular part for now) # if needed: delete dictionary before as it not needed anymore! sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices)) #print(sp_mat) changes for py2" "" """ itertools.count usage here would need changes for py2 """ import numpy as np from itertools import count from scipy.sparse import coo_matrix # Read data # global memory usage after: one big array data = np.loadtxt('edges.txt', np.uint32) n_edges = data.shape[0] #print(data) #print(data.shape) # Learn mapping to range(0, N_VERTICES) # N_VERTICES unknown # global memory usage after: one big array + one big dict index_gen = count() mapper = {} for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity val = data[row, column] if val not in mapper: mapper[val] = next(index_gen) n_vertices = len(mapper) # Map original data (in-place !) # global memory usage after: one big array + one big dict (can be deleted!) for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity data[row, column] = mapper[data[row, column]] #print(data) # Create sparse matrix (only caring about a single triangular part for now) # if needed: delete dictionary before as it not needed anymore! sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices)) #print(sp_mat) np.uint32) """ itertools.count usage here would need changes for py2 """ import numpy as np from itertools import count from scipy.sparse import coo_matrix # Read data # global memory usage after: one big array data = np.loadtxt('edges.txt', np.uint32) n_edges = data.shape[0] #print(data) #print(data.shape) # Learn mapping to range(0, N_VERTICES) # N_VERTICES unknown # global memory usage after: one big array + one big dict index_gen = count() mapper = {} for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity val = data[row, column] if val not in mapper: mapper[val] = next(index_gen) n_vertices = len(mapper) # Map original data (in-place !) # global memory usage after: one big array + one big dict (can be deleted!) 
for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity data[row, column] = mapper[data[row, column]] #print(data) # Create sparse matrix (only caring about a single triangular part for now) # if needed: delete dictionary before as it not needed anymore! sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices)) #print(sp_mat) array + one big dict """ itertools.count usage here would need changes for py2 """ import numpy as np from itertools import count from scipy.sparse import coo_matrix # Read data # global memory usage after: one big array data = np.loadtxt('edges.txt', np.uint32) n_edges = data.shape[0] #print(data) #print(data.shape) # Learn mapping to range(0, N_VERTICES) # N_VERTICES unknown # global memory usage after: one big array + one big dict index_gen = count() mapper = {} for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity val = data[row, column] if val not in mapper: mapper[val] = next(index_gen) n_vertices = len(mapper) # Map original data (in-place !) # global memory usage after: one big array + one big dict (can be deleted!) for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity data[row, column] = mapper[data[row, column]] #print(data) # Create sparse matrix (only caring about a single triangular part for now) # if needed: delete dictionary before as it not needed anymore! sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices)) #print(sp_mat) ) """ itertools.count usage here would need changes for py2 """ import numpy as np from itertools import count from scipy.sparse import coo_matrix # Read data # global memory usage after: one big array data = np.loadtxt('edges.txt', np.uint32) n_edges = data.shape[0] #print(data) #print(data.shape) # Learn mapping to range(0, N_VERTICES) # N_VERTICES unknown # global memory usage after: one big array + one big dict index_gen = count() mapper = {} for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity val = data[row, column] if val not in mapper: mapper[val] = next(index_gen) n_vertices = len(mapper) # Map original data (in-place !) # global memory usage after: one big array + one big dict (can be deleted!) for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity data[row, column] = mapper[data[row, column]] #print(data) # Create sparse matrix (only caring about a single triangular part for now) # if needed: delete dictionary before as it not needed anymore! sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices)) #print(sp_mat) !) 
""" itertools.count usage here would need changes for py2 """ import numpy as np from itertools import count from scipy.sparse import coo_matrix # Read data # global memory usage after: one big array data = np.loadtxt('edges.txt', np.uint32) n_edges = data.shape[0] #print(data) #print(data.shape) # Learn mapping to range(0, N_VERTICES) # N_VERTICES unknown # global memory usage after: one big array + one big dict index_gen = count() mapper = {} for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity val = data[row, column] if val not in mapper: mapper[val] = next(index_gen) n_vertices = len(mapper) # Map original data (in-place !) # global memory usage after: one big array + one big dict (can be deleted!) for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity data[row, column] = mapper[data[row, column]] #print(data) # Create sparse matrix (only caring about a single triangular part for now) # if needed: delete dictionary before as it not needed anymore! sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices)) #print(sp_mat) easy to understand space-complexity """ itertools.count usage here would need changes for py2 """ import numpy as np from itertools import count from scipy.sparse import coo_matrix # Read data # global memory usage after: one big array data = np.loadtxt('edges.txt', np.uint32) n_edges = data.shape[0] #print(data) #print(data.shape) # Learn mapping to range(0, N_VERTICES) # N_VERTICES unknown # global memory usage after: one big array + one big dict index_gen = count() mapper = {} for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity val = data[row, column] if val not in mapper: mapper[val] = next(index_gen) n_vertices = len(mapper) # Map original data (in-place !) # global memory usage after: one big array + one big dict (can be deleted!) for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity data[row, column] = mapper[data[row, column]] #print(data) # Create sparse matrix (only caring about a single triangular part for now) # if needed: delete dictionary before as it not needed anymore! sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices)) #print(sp_mat) a single triangular part for now) """ itertools.count usage here would need changes for py2 """ import numpy as np from itertools import count from scipy.sparse import coo_matrix # Read data # global memory usage after: one big array data = np.loadtxt('edges.txt', np.uint32) n_edges = data.shape[0] #print(data) #print(data.shape) # Learn mapping to range(0, N_VERTICES) # N_VERTICES unknown # global memory usage after: one big array + one big dict index_gen = count() mapper = {} for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity val = data[row, column] if val not in mapper: mapper[val] = next(index_gen) n_vertices = len(mapper) # Map original data (in-place !) # global memory usage after: one big array + one big dict (can be deleted!) 
for column in range(data.shape[1]): for row in range(data.shape[0]): # double-loop: slow, but easy to understand space-complexity data[row, column] = mapper[data[row, column]] #print(data) # Create sparse matrix (only caring about a single triangular part for now) # if needed: delete dictionary before as it not needed anymore! sp_mat = coo_matrix((np.ones(n_edges, dtype=bool), (data[:, 0], data[:, 1])), shape=(n_vertices, n_vertices)) #print(sp_mat) 

Output for edges-10.txt:

[[287111206 357850135]
 [512616930 441657273]
 [530905858 562056765]
 [524113870 320749289]
 [149911066 964526673]
 [169873523 631128793]
 [646151040 986572427]
 [105290138 382302570]
 [194873438 968653053]
 [912211115 195436728]]
(10, 2)
[[ 0 10]
 [ 1 11]
 [ 2 12]
 [ 3 13]
 [ 4 14]
 [ 5 15]
 [ 6 16]
 [ 7 17]
 [ 8 18]
 [ 9 19]]
  (0, 10)  True
  (1, 11)  True
  (2, 12)  True
  (3, 13)  True
  (4, 14)  True
  (5, 15)  True
  (6, 16)  True
  (7, 17)  True
  (8, 18)  True
  (9, 19)  True
+3

I tried a variety of methods besides the ones already mentioned. I found the following work well.

Method 1: read the file into a string, then parse the string into a 1-dimensional array with numpy's fromstring.

import numpy as np
import scipy.sparse as sparse

def readEdges():
    with open('edges.txt') as f:
        data = f.read()
    edges = np.fromstring(data, dtype=np.int32, sep=' ')
    edges = np.reshape(edges, (edges.shape[0]/2, 2))
    ones = np.ones(len(edges), np.uint32)
    cooMatrix = sparse.coo_matrix((ones, (edges[:,0], edges[:,1])))

%timeit -n5 readEdges()

Output:

 5 loops, best of 3: 13.6 s per loop 

Method 2: same as method 1, but instead of loading the file into a string, use the memory-mapped interface.

import contextlib
import mmap
# uses np and sparse imported in Method 1 above

def readEdgesMmap():
    with open('edges.txt') as f:
        with contextlib.closing(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) as m:
            edges = np.fromstring(m, dtype=np.int32, sep=' ')
            edges = np.reshape(edges, (edges.shape[0]/2, 2))
            ones = np.ones(len(edges), np.uint32)
            cooMatrix = sparse.coo_matrix((ones, (edges[:,0], edges[:,1])))

%timeit -n5 readEdgesMmap()

Output:

 5 loops, best of 3: 12.7 s per loop 

Tracking usage with /usr/bin/time, both methods use a maximum of ~2 GB of memory.
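
If you prefer to measure this from inside Python instead of /usr/bin/time, the resource module can report the peak resident set size; a small sketch (note that ru_maxrss is in kilobytes on Linux but in bytes on macOS):

import resource

def peak_memory_mb():
    # Peak resident set size of this process so far (kilobytes on Linux).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

readEdges()   # or readEdgesMmap()
print('peak memory: %.1f MB' % peak_memory_mb())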

A few notes:

  • This seems to do a little better than pandas read_csv. Using pandas read_csv, the output on the same machine is

    5 loops, best of 3: 16.2 s per loop

  • Converting from COO to CSR/CSC also takes considerable time. In @GrantJ's answer it takes less time because the COO matrix initialization is incorrect: the argument must be passed as a tuple (see the small example after this list). I wanted to leave a comment there, but I do not have commenting rights yet.

  • My guess for why this does slightly better than pandas read_csv is the prior assumption of 1D data.
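
To make the tuple-argument point concrete, here is a tiny toy illustration (made-up values, not the real data):

import numpy as np
from scipy import sparse

rows = np.array([0, 1, 2], dtype=np.int32)
cols = np.array([10, 11, 12], dtype=np.int32)
vals = np.ones(len(rows), dtype=np.uint32)

# Correct: the (data, (row, col)) triplet is passed as ONE tuple argument.
m_ok = sparse.coo_matrix((vals, (rows, cols)))

# Incorrect: passed as separate arguments, the second one is interpreted
# as the `shape` parameter, which is not what is intended here.
# m_bad = sparse.coo_matrix(vals, (rows, cols))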

+3

In my answer I consider the case where the node IDs are 9-character strings over [0-9A-Za-z]. These n node identifiers should be mapped to the values [0, n-1] (which may not be necessary for your application, but is still of general interest).

The following considerations, which I am sure you already know, are here for the sake of completeness:

  • Memory is the bottleneck.
  • There are about 10^8 lines in the file.
  • A 9-character string + int32 value pair costs about 120 bytes in a dictionary, which leads to about 12 GB of memory for the file.
  • A string identifier from the file can be mapped to an int64: there are 62 different symbols, so each can be encoded with 6 bits, and nine symbols per string give 6 * 9 = 54 <= 64 bits (a quick check of this follows after the list). See also the toInt64() method below.
  • int64 + int32 = 12 bytes of "real" information, so ca. 1.2 GB could suffice, but the cost of such a pair in a dictionary is about 60 bytes (which needs about 6 GB of RAM).
  • Creating many small objects (on the heap) leads to a lot of memory overhead, so bundling these objects into arrays is beneficial. Interesting information about the memory used by Python objects can be found in this article. Interesting experiences with reducing memory usage are described in this blog post.
  • Python's list is out of the question as a data structure, and so is a dictionary. array.array could be an alternative, but we use np.array (because there are sorting algorithms for np.array but not for array.array).
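
A quick check of the bit-count argument above (plain arithmetic, nothing specific to the data):

print(62**9)   # 13537086546263552
print(2**54)   # 18014398509481984  -> 62**9 < 2**54, so nine base-62 symbols fit in an int64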

Step 1: read the file and map the strings to int64. It is a pain to let an np.array grow dynamically, so we assume the number of edges in the file is known (it would be nice to have it in a header, but it can also be deduced from the file size):

import numpy as np

def read_nodes(filename, EDGE_CNT):
    nodes = np.zeros(EDGE_CNT*2, dtype=np.int64)
    cnt = 0
    for line in open(filename, "r"):
        nodes[cnt:cnt+2] = map(toInt64, line.split())  # use map(int, line.split()) for cases without letters
        cnt += 2
    return nodes

Step 2: convert the int64 values to values in [0, n-1]:

Possibility A, requires 3 * 0.8 GB:

def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes = read_nodes(filename, EDGE_CNT)
    # return_inverse gives, for every entry, its index in unique_ids, i.e. the mapping to [0, n-1]
    unique_ids, nodes = np.unique(nodes, return_inverse=True)
    return (len(unique_ids), nodes)

Possibility B, needs only 2 * 0.8 GB, but is slightly slower:

def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes = read_nodes(filename, EDGE_CNT)
    unique_map = np.unique(nodes)
    for i in xrange(len(nodes)):
        node_id = np.searchsorted(unique_map, nodes[i])  # faster than bisect.bisect
        nodes[i] = node_id
    return (len(unique_map), nodes)

Step 3: put it all into a coo_matrix:

from scipy import sparse

def data_as_coo_matrix(filename, EDGE_CNT):
    node_cnt, nodes = maps_to_ids(filename, EDGE_CNT)
    rows = nodes[::2]   # it is only a view, not a copy
    cols = nodes[1::2]  # it is only a view, not a copy
    return sparse.coo_matrix((np.ones(len(rows), dtype=bool), (rows, cols)),
                             shape=(node_cnt, node_cnt))

Calling data_as_coo_matrix("data.txt", 62500000), the peak memory usage is 2.5 GB (but with int32 instead of int64 only 1.5 GB is needed). It took about 5 minutes on my machine, but my machine is quite slow...

So what is different from your solution?

  • I get only the unique values from np.unique (and not all the indices and the inverse), so some memory is saved; I can replace the old identifiers with new ones in place.
  • I have no experience with pandas, so maybe there is some copying between pandas and numpy data structures?

How does it differ from sascha's solution?

  • There is no need to keep the list sorted the whole time; it is enough to sort once all the items are in the list, which is what np.unique() does. sascha's solution keeps the list sorted at all times, and you pay for that with at least a constant factor, even if the running time stays O(n log(n)). I assumed the add operation would be O(n), but as pointed out it is O(log(n)).

How does it differ from GrantJ's solution?

  • The resulting sparse matrix is NxN, with N the number of different nodes, rather than 2^54 x 2^54 (with a very large number of empty rows and columns).

PS:
Here is my idea of how a 9-character string identifier can be mapped to an int64 value; I suspect this function, as written, may be the bottleneck and should be optimized.

def toInt64(string):
    res = 0L
    for ch in string:
        res *= 62
        if ch <= '9':
            res += ord(ch)-ord('0')
        elif ch <= 'Z':
            res += ord(ch)-ord('A')+10
        else:
            res += ord(ch)-ord('a')+36
    return res
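
For completeness, a possible inverse (recovering the 9-character ID from its int64 code) could look like the sketch below; fromInt64 and ALPHABET are names made up for this sketch, not part of the code above:

ALPHABET = ('0123456789'
            'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
            'abcdefghijklmnopqrstuvwxyz')

def fromInt64(value, length=9):
    # Undo toInt64: peel off base-62 digits from the least significant end.
    chars = []
    for _ in range(length):
        value, rem = divmod(value, 62)
        chars.append(ALPHABET[rem])
    return ''.join(reversed(chars))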
+2

You could look at the igraph project, a GPL library written in C that is designed for this kind of task and has a nice Python API. I think in your case the Python code would be something like

from igraph import Graph

g = Graph.Read_Edgelist('edges.txt')
g.write_adjacency('adjacency_matrix.txt')
0
