In my answer I consider the case where the node IDs are given as strings of length 9, with each character drawn from [0-9A-Za-z]. These n node identifiers should be mapped onto the values [0, n-1] (which may not be necessary for your application, but is still of general interest).
The following considerations are, I'm sure, already known to you, but are listed here for the sake of completeness:
- Memory is the bottleneck.
- There are 10^8 lines in the file.
- A key-value pair of a 9-character string + int32 costs about 120 bytes in a dictionary, which leads to roughly 12 GB of memory usage for the file.
- A string identifier from the file can be mapped onto an int64: there are 62 different characters, so each can be encoded with 6 bits, and 9 characters in a row need 6 * 9 = 54 < 64 bits. See also the method toInt64() below.
- There are int64 + int32 = 12 bytes of "real" information per pair, so ca. 1.2 GB could suffice, but the cost of such a pair in a dictionary is about 60 bytes (so about 6 GB of RAM are needed).
- Creating small objects (on the heap) leads to a lot of memory overhead, so bundling these objects into arrays pays off (a quick way to check such numbers yourself is sketched after this list). Interesting information about the memory used by Python objects can be found in this article. Interesting experiences with reducing memory usage are published in this blog post.
- A Python list is out of the question as a data structure, and so is a dictionary. array.array could be an alternative, but we use np.array (because there are sorting algorithms for np.array, but not for array.array).
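A rough way to check such overhead numbers on your own machine is sketched below; the exact byte counts vary between Python versions and builds, so treat them only as estimates:

import sys
import numpy as np

key = "abcdefghi"                 # a 9-character node id
print(sys.getsizeof(key))         # the string object alone already costs tens of bytes
print(sys.getsizeof({key: 1}))    # a dict adds its own per-entry overhead on top of that

packed = np.zeros(10**6, dtype=np.int64)
print(packed.nbytes)              # 8 bytes per id and nothing else: 8000000 bytes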
Step 1: reading the file and mapping the strings onto int64. It is a pain to let an np.array grow dynamically, so we assume that the number of edges in the file is known beforehand (it would be nice to have it in a header, but it can also be deduced from the file size, see the sketch after the code):
import numpy as np

def read_nodes(filename, EDGE_CNT):
    nodes = np.zeros(EDGE_CNT*2, dtype=np.int64)
    cnt = 0
    for line in open(filename, "r"):
        # use map(int, line.split()) for ids without letters
        nodes[cnt:cnt+2] = map(toInt64, line.split())
        cnt += 2
    return nodes
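For deducing the number of edges from the file size, a minimal sketch could look like this (it assumes every line consists of exactly two 9-character ids, one separator and one newline, i.e. 20 bytes; adjust line_len for other separators or \r\n endings):

import os

def edge_count(filename, line_len=9 + 1 + 9 + 1):
    # 9 chars + separator + 9 chars + newline = 20 bytes per line (assumed format)
    return os.path.getsize(filename) // line_len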
Step 2: converting the int64 values into values in [0, n-1]:
Possibility A requires 3 * 0.8 GB:
def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes = read_nodes(filename, EDGE_CNT)
    # return_inverse gives, for every element, its index in the sorted unique array,
    # i.e. exactly the mapping onto [0, n-1]
    unique_ids, nodes = np.unique(nodes, return_inverse=True)
    return (len(unique_ids), nodes)
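As a quick illustration of what return_inverse hands back (my own toy values, not data from the file):

import numpy as np

vals = np.array([700, 300, 500, 300], dtype=np.int64)
unique_ids, mapped = np.unique(vals, return_inverse=True)
# unique_ids -> [300, 500, 700]
# mapped     -> [2, 0, 1, 0]: every value is replaced by its index in the
# sorted array of unique values, i.e. a value in [0, n-1]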
Possibility B needs only 2 * 0.8 GB, but is somewhat slower:
def maps_to_ids(filename, EDGE_CNT):
    """ return number of different node ids, and the mapped nodes"""
    nodes = read_nodes(filename, EDGE_CNT)
    unique_map = np.unique(nodes)
    for i in xrange(len(nodes)):
        # overwrite every id with its index in the sorted unique array, in place
        nodes[i] = np.searchsorted(unique_map, nodes[i])
    return (len(unique_map), nodes)
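With the same toy values as above, np.searchsorted into the sorted unique array yields exactly the same mapping, just one element at a time:

import numpy as np

vals = np.array([700, 300, 500, 300], dtype=np.int64)
unique_map = np.unique(vals)                 # [300, 500, 700]
print(np.searchsorted(unique_map, 700))      # -> 2, the same index return_inverse gives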
Step 3: putting everything into a coo_matrix:
from scipy import sparse

def data_as_coo_matrix(filename, EDGE_CNT):
    node_cnt, nodes = maps_to_ids(filename, EDGE_CNT)
    rows = nodes[::2]   # a view, not a copy
    cols = nodes[1::2]  # a view, not a copy
    # entries are just True flags; adapt the data values if your edges carry weights
    return sparse.coo_matrix((np.ones(len(rows), dtype=bool), (rows, cols)), shape=(node_cnt, node_cnt))
Calling data_as_coo_matrix("data.txt", 62500000), the peak memory usage is 2.5 GB (but with int32 instead of int64 only 1.5 GB are needed). It took about 5 minutes on my machine, but my machine is quite slow ...
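To make it concrete what kind of matrix comes out, here is a toy example with made-up edges (not taken from the data above): three edges over four distinct, already mapped node ids.

import numpy as np
from scipy import sparse

rows = np.array([0, 2, 3])
cols = np.array([1, 0, 2])
m = sparse.coo_matrix((np.ones(len(rows), dtype=bool), (rows, cols)), shape=(4, 4))
print(m.toarray().astype(int))
# [[0 1 0 0]
#  [0 0 0 0]
#  [1 0 0 0]
#  [0 0 1 0]]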
So what is different from your solution?
- I get only the unique values from np.unique (and not all the indices and the inverse), so some memory is saved.
- I can replace the old identifiers with the new ones in place.
- I have no experience with pandas, so maybe there is some copying between the pandas and numpy data structures involved?
What is different from sascha's solution?
- There is no need to keep the list sorted the whole time: it is enough to sort once after all items are in the list, which is what np.unique() does. sascha's solution keeps the list sorted all the time, and you pay for that with at least a constant factor, even if the running time stays O(n log(n)). I assumed an add operation would be O(n), but as was pointed out it is O(log(n)).
What is different from GrantJ's solution?
- The resulting sparse matrix has size NxN, with N the number of different nodes, and not 2^54 x 2^54 (with a very large number of empty rows and columns).
PS:
Here is my idea of how a 9-character string identifier can be mapped onto an int64 value; I guess this function could become a bottleneck as written and should be optimized (see the sketch after the code for one possible direction).
def toInt64(string):
    res = 0L
    for ch in string:
        res *= 62
        if ch <= '9':
            res += ord(ch) - ord('0')
        elif ch <= 'Z':
            res += ord(ch) - ord('A') + 10
        else:
            res += ord(ch) - ord('a') + 36
    return res
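One possible direction for optimizing it (this is only my sketch of an idea, not measured): replace the per-character branching by a lookup table that is built once; the character order below reproduces the same 0-61 values as toInt64 above.

import string

# value of every allowed character, built once
_CHAR_VALUE = {}
for _i, _ch in enumerate(string.digits + string.ascii_uppercase + string.ascii_lowercase):
    _CHAR_VALUE[_ch] = _i

def toInt64_table(s):
    res = 0
    for ch in s:
        res = res * 62 + _CHAR_VALUE[ch]
    return res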