Question:
I've profiled my Python program to death, and there is one function that slows everything down. It makes heavy use of Python dictionaries, so I may not have used them in the best way. If I can't get it running faster, I will have to re-write it in C++, so is there anyone who can help me optimise it in Python?
I hope I've given a clear enough explanation, and that you can follow my code! Thanks in advance for any help.
My code is:
This is the offending function, profiled with line_profiler and kernprof. I am running Python 2.7.
I am particularly puzzled by things like lines 363, 389, and 405, where an if statement comparing two variables seems to take an inordinate amount of time.
I have considered using NumPy (since these are sparse matrices), but I don't think it's appropriate because: (1) I'm not indexing my matrix with integers (I'm using object instances); and (2) I'm not storing simple data types in the matrix (I'm storing tuples of a float and an object instance). But I'm willing to be convinced about NumPy. If anyone knows about the comparative performance of small NumPy matrices versus Python hash tables, I'd be interested.
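To illustrate the kind of comparison I mean, here is a rough micro-benchmark sketch (my own, not from the question; the sizes and repeat counts are arbitrary assumptions) timing lookups through a plain dict against fancy-indexing a small NumPy array:

```python
import timeit
import numpy as np

# Sketch only: compare 1000 dict lookups vs. one NumPy fancy-indexing call.
# Sizes and repeat counts are arbitrary; real node keys are object instances.
n = 1000
d = dict((i, float(i)) for i in range(n))
arr = np.arange(n, dtype=float)
keys = list(range(n))

t_dict = timeit.timeit(lambda: [d[k] for k in keys], number=100)
t_numpy = timeit.timeit(lambda: arr[keys], number=100)
print("dict: %.4fs  numpy: %.4fs" % (t_dict, t_numpy))
```

The absolute numbers will vary by machine; the point is only to measure the two access patterns side by side.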
Sorry I haven't provided a simple example that you can run, but this function is tied into a much larger project, and I couldn't work out how to set up a simple example to test it without giving you half of my code base!
Timer unit: 3.33366e-10 s

File: routing_distances.py
Function: propagate_distances_node at line 328
Total time: 807.234 s

Line #      Hits          Time  Per Hit   % Time  Line Contents
   328                                            @profile
   329                                            def propagate_distances_node(self, node_a, cutoff_distance=200):
   330
   331                                                # a makes sure its immediate neighbours are correctly in its distance table
   332                                                # because its immediate neighbours may change as binds/folding change
   333    737753    3733642341   5060.8      0.2      for (node_b, neighbour_distance_b_a) in self.neighbours[node_a].iteritems():
   334    512120    2077788924   4057.2      0.1          use_neighbour_link = False
   335
   336    512120    2465798454   4814.9      0.1          if(node_b not in self.node_distances[node_a]): # a doesn't know distance to b
   337     15857      66075687   4167.0      0.0              use_neighbour_link = True
   338                                                    else: # a does know distance to b
   339    496263    2390534838   4817.1      0.1              (node_distance_b_a, next_node) = self.node_distances[node_a][node_b]
   340    496263    2058112872   4147.2      0.1              if(node_distance_b_a > neighbour_distance_b_a): # neighbour distance is shorter
   341        81        331794   4096.2      0.0                  use_neighbour_link = True
   342    496182    2665644192   5372.3      0.1              elif((None == next_node) and (float('+inf') == neighbour_distance_b_a)): # direct route that has just broken
   343        75        313623   4181.6      0.0                  use_neighbour_link = True
   344
   345    512120    1992514932   3890.7      0.1          if(use_neighbour_link):
   346     16013      78149007   4880.3      0.0              self.node_distances[node_a][node_b] = (neighbour_distance_b_a, None)
   347     16013      83489949   5213.9      0.0              self.nodes_changed.add(node_a)
   348
   349                                                        ## Affinity distances update
   350     16013      86020794   5371.9      0.0              if((node_a.type == Atom.BINDING_SITE) and (node_b.type == Atom.BINDING_SITE)):
   351       164       3950487  24088.3      0.0                  self.add_affinityDistance(node_a, node_b, self.chemistry.affinity(node_a.data, node_b.data))
   352
   353                                                # a sends its table to all its immediate neighbours
   354    737753    3549685140   4811.5      0.1      for (node_b, neighbour_distance_b_a) in self.neighbours[node_a].iteritems():
   355    512120    2129343210   4157.9      0.1          node_b_changed = False
   356
   357                                                    # b integrates a's distance table with its own
   358    512120    2203821081   4303.3      0.1          node_b_chemical = node_b.chemical
   359    512120    2409257898   4704.5      0.1          node_b_distances = node_b_chemical.node_distances[node_b]
   360
   361                                                    # For all b's routes (to c) that go to a first, update their distances
   362  41756882  183992040153   4406.3      7.6          for node_c, (distance_b_c, node_after_b) in node_b_distances.iteritems(): # Think it's ok to modify items while iterating over them (just not insert/delete) (seems to work ok)
   363  41244762  172425596985   4180.5      7.1              if(node_after_b == node_a):
   364
   365  16673654   64255631616   3853.7      2.7                  try:
   366  16673654   88781802534   5324.7      3.7                      distance_b_a_c = neighbour_distance_b_a + self.node_distances[node_a][node_c][0]
   367    187083     929898684   4970.5      0.0                  except KeyError:
   368    187083    1056787479   5648.8      0.0                      distance_b_a_c = float('+inf')
   369
   370  16673654   69374705256   4160.7      2.9                  if(distance_b_c != distance_b_a_c): # a's distance to c has changed
   371    710083    3136751361   4417.4      0.1                      node_b_distances[node_c] = (distance_b_a_c, node_a)
   372    710083    2848845276   4012.0      0.1                      node_b_changed = True
   373
   374                                                                ## Affinity distances update
   375    710083    3484577241   4907.3      0.1                      if((node_b.type == Atom.BINDING_SITE) and (node_c.type == Atom.BINDING_SITE)):
   376     99592    1591029009  15975.5      0.1                          node_b_chemical.add_affinityDistance(node_b, node_c, self.chemistry.affinity(node_b.data, node_c.data))
   377
   378                                                            # If distance got longer, then ask b's neighbours to update
   379                                                            ## TODO: document this!
   380  16673654   70998570837   4258.1      2.9                  if(distance_b_a_c > distance_b_c):
   381                                                                #for (node, neighbour_distance) in node_b_chemical.neighbours[node_b].iteritems():
   382   1702852    7413182064   4353.4      0.3                      for node in node_b_chemical.neighbours[node_b]:
   383   1204903    5912053272   4906.7      0.2                          node.chemical.nodes_changed.add(node)
   384
   385                                                    # Look for routes from a to c that are quicker than ones b knows already
   386  42076729  184216680432   4378.1      7.6          for node_c, (distance_a_c, node_after_a) in self.node_distances[node_a].iteritems():
   387
   388  41564609  171150289218   4117.7      7.1              node_b_update = False
   389  41564609  172040284089   4139.1      7.1              if(node_c == node_b): # ab path
   390    512120    2040112548   3983.7      0.1                  pass
   391  41052489  169406668962   4126.6      7.0              elif(node_after_a == node_b): # abab path
   392  16251407   63918804600   3933.1      2.6                  pass
   393  24801082  101577038778   4095.7      4.2              elif(node_c in node_b_distances): # b can already get to c
   394  24004846  103404357180   4307.6      4.3                  (distance_b_c, node_after_b) = node_b_distances[node_c]
   395  24004846  102717271836   4279.0      4.2                  if(node_after_b != node_a): # b doesn't already go to a first
   396   7518275   31858204500   4237.4      1.3                      distance_b_a_c = neighbour_distance_b_a + distance_a_c
   397   7518275   33470022717   4451.8      1.4                      if(distance_b_a_c < distance_b_c): # quicker to go via a
   398    225357     956440656   4244.1      0.0                          node_b_update = True
   399                                                        else: # b can't already get to c
   400    796236    3415455549   4289.5      0.1                  distance_b_a_c = neighbour_distance_b_a + distance_a_c
   401    796236    3412145520   4285.3      0.1                  if(distance_b_a_c < cutoff_distance): # not too far to go
   402    593352    2514800052   4238.3      0.1                      node_b_update = True
   403
   404                                                        ## Affinity distances update
   405  41564609  164585250189   3959.7      6.8              if node_b_update:
   406    818709    3933555120   4804.6      0.2                  node_b_distances[node_c] = (distance_b_a_c, node_a)
   407    818709    4151464335   5070.7      0.2                  if((node_b.type == Atom.BINDING_SITE) and (node_c.type == Atom.BINDING_SITE)):
   408    104293    1704446289  16342.9      0.1                      node_b_chemical.add_affinityDistance(node_b, node_c, self.chemistry.affinity(node_b.data, node_c.data))
   409    818709    3557529531   4345.3      0.1                  node_b_changed = True
   410
   411                                                    # If any of node b's rows have exceeded the cutoff distance, then remove them
   412  42350234  197075504439   4653.5      8.1          for node_c, (distance_b_c, node_after_b) in node_b_distances.items(): # Can't use iteritems() here, as deleting from the dictionary
   413  41838114  180297579789   4309.4      7.4              if(distance_b_c > cutoff_distance):
   414    206296     894881754   4337.9      0.0                  del node_b_distances[node_c]
   415    206296     860508045   4171.2      0.0                  node_b_changed = True
   416
   417                                                            ## Affinity distances update
   418    206296    4698692217  22776.5      0.2                  node_b_chemical.del_affinityDistance(node_b, node_c)
   419
   420                                                    # If we've modified node_b's distance table, tell its chemical to update accordingly
   421    512120    2130466347   4160.1      0.1          if(node_b_changed):
   422    217858    1201064454   5513.1      0.0              node_b_chemical.nodes_changed.add(node_b)
   423
   424                                                # Remove any neighbours that have infinite distance (have just unbound)
   425                                                ## TODO: not sure what difference it makes to do this here rather than above (after updating self.node_distances for neighbours)
   426                                                ## but doing it above seems to break the walker movement
   427    737753    3830386968   5192.0      0.2      for (node_b, neighbour_distance_b_a) in self.neighbours[node_a].items(): # Can't use iteritems() here, as deleting from the dictionary
   428    512120    2249770068   4393.1      0.1          if(neighbour_distance_b_a > cutoff_distance):
   429       150        747747   4985.0      0.0              del self.neighbours[node_a][node_b]
   430
   431                                                        ## Affinity distances update
   432       150       2148813  14325.4      0.0              self.del_affinityDistance(node_a, node_b)
Explanation of my code:
This function maintains a sparse distance matrix, representing the network distance (the sum of the edge weights along the shortest path) between nodes in a (very large) network. Working with the full table and using the Floyd-Warshall algorithm would be very slow. (I tried this first, and it was an order of magnitude slower than the current version.) So my code uses a sparse matrix to represent a thresholded version of the full distance matrix (any paths with a distance greater than 200 units are ignored). The network topology changes over time, so this distance matrix needs updating over time. To do this, I'm using a rough implementation of a distance-vector routing protocol: each node in the network knows the distance to every other node and the next node on the path. When a topology change happens, the node(s) associated with that change update their distance tables accordingly, and tell their immediate neighbours. The information spreads through the network by nodes sending their distance tables to their neighbours, who update their distance tables and spread them to their neighbours.
There is one object representing the distance matrix: self.node_distances. This is a dictionary mapping nodes to routing tables. A node is an object that I've defined. A routing table is a dictionary mapping nodes to (distance, next_node) tuples. distance is the graph distance from node_a to node_b, and next_node is the neighbour of node_a that you should go to first on the path between node_a and node_b. A next_node of None means that node_a and node_b are graph neighbours. For example, a sample of the distance matrix could be:
self.node_distances = { node_1 : { node_2 : (2.0, None),
                                   node_3 : (5.7, node_2),
                                   node_5 : (22.9, node_2) },
                        node_2 : { node_1 : (2.0, None),
                                   node_3 : (3.7, None),
                                   node_5 : (20.9, node_7)},
                        ...etc...
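To make the (distance, next_node) convention concrete, here is a minimal sketch of my own (using strings as stand-in nodes, not my actual atom objects) showing how a full path can be recovered by following the next_node entries:

```python
# Sketch only: string "nodes" instead of atom instances.
# next_node of None means the two nodes are direct graph neighbours.
node_distances = {
    'a': {'b': (2.0, None), 'c': (5.7, 'b')},
    'b': {'a': (2.0, None), 'c': (3.7, None)},
    'c': {'a': (5.7, 'b'), 'b': (3.7, None)},
}

def path(node_distances, src, dst):
    """Hop through next_node pointers from src until a direct edge to dst."""
    hops = [src]
    current = src
    while True:
        distance, next_node = node_distances[current][dst]
        if next_node is None:   # current and dst are direct neighbours
            hops.append(dst)
            return hops
        hops.append(next_node)
        current = next_node

print(path(node_distances, 'a', 'c'))  # -> ['a', 'b', 'c']
```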
Because of topology changes, two nodes that were far apart (or not connected at all) can become close. When this happens, entries are added to this matrix. Because of the thresholding, two nodes can become too far apart to care about. When this happens, entries are deleted from this matrix.
The self.neighbours matrix is similar to self.node_distances, but it contains information about the direct links (edges) in the network. self.neighbours is constantly being changed externally to this function, by the chemical reaction. This is where the network topology changes come from.
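For the same toy network as in the example above (again with hypothetical string nodes), self.neighbours would look something like this; only direct edges appear, and the values are edge weights rather than (distance, next_node) tuples:

```python
# Sketch only: direct links of an undirected graph, values are edge weights.
neighbours = {
    'node_1': {'node_2': 2.0},
    'node_2': {'node_1': 2.0, 'node_3': 3.7},
    'node_3': {'node_2': 3.7},
}

# Each edge is stored in both directions, so the structure is symmetric.
print(all(neighbours[b][a] == w
          for a, links in neighbours.items()
          for b, w in links.items()))
```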
The actual function I'm having problems with: propagate_distances_node() performs one step of the distance-vector routing protocol. Given a node, node_a, the function makes sure that node_a's neighbours are correctly in the distance matrix (topology changes). The function then sends node_a's routing table to all of node_a's immediate neighbours in the network: it integrates node_a's routing table with each neighbour's own routing table.
In the rest of my program, the propagate_distances_node() function is called repeatedly until the distance matrix converges. A set, self.nodes_changed, is maintained of the nodes that have changed their routing table since they last propagated it. On every iteration of my algorithm, a random subset of these nodes is chosen and propagate_distances_node() is called on them. This means the nodes spread their routing tables asynchronously and stochastically. The algorithm has converged on the true distance matrix when the set self.nodes_changed becomes empty.
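That outer loop isn't shown in the question; as a hedged sketch (the function name converge and the batch_size parameter are my own invention, not the actual code), it looks roughly like this:

```python
import random

def converge(chemical, batch_size=10):
    """Sketch: keep propagating from random subsets of the changed nodes
    until the changed-set empties, i.e. the distance matrix has converged.
    Returns how many iterations it took."""
    iterations = 0
    while chemical.nodes_changed:
        k = min(batch_size, len(chemical.nodes_changed))
        batch = random.sample(list(chemical.nodes_changed), k)
        for node in batch:
            chemical.nodes_changed.discard(node)
            chemical.propagate_distances_node(node)  # may re-add nodes
        iterations += 1
    return iterations
```

propagate_distances_node() may add nodes back into nodes_changed, so the loop keeps going until an iteration produces no further changes.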
The "affinity distances" parts (add_affinityDistance and del_affinityDistance) maintain a cache of a (small) sub-matrix of the distance matrix, which is used by another part of the program.
The reason I'm doing this is that I'm simulating computational analogues of chemicals taking part in reactions, as part of my PhD. A "chemical" is a graph of "atoms" (nodes in the graph). Two chemicals bonding together is simulated as their two graphs being joined by new edges. A chemical reaction happens (by an elaborate process that isn't relevant here), changing the topology of the graph. But what happens in the reaction depends on how far apart the different atoms that make up the chemicals are. So for each atom in the simulation, I want to know which other atoms it is close to. A sparse, thresholded distance matrix is the most efficient way to store this information. Since the topology of the network changes as the reaction happens, I need to update the matrix. A distance-vector routing protocol is the fastest way I can think of to do this. I don't need a more complicated routing protocol, because things like routing loops don't happen in my particular application (because of how my chemicals are structured). The reason I'm doing it stochastically is so that I can interleave the chemical reaction processing with the distance spreading, and simulate a chemical gradually changing shape over time as the reaction happens (rather than changing shape instantly).
The self in this function is an object representing a chemical. The nodes in self.node_distances.keys() are the atoms that make up the chemical. The nodes in self.node_distances[node_x].keys() are nodes of the chemical, and potentially nodes of any chemicals that the chemical has bonded to (and is reacting with).
Update:
I tried replacing every instance of node_x == node_y with node_x is node_y (as per @Sven Marnach's comment), but it slowed things down! (I wasn't expecting that!) My original profile took 807.234s to run, but with this modification it increased to 895.895s. Sorry, I was doing the profiling wrong! I was using line_profiler, which (on my code) has far too much variance (that ~90 second difference was all in the noise). When profiling it properly, is is indeed faster than ==. Using cProfile, my code with == took 34.394s, but with is, it took 33.535s (which I can confirm is outside the noise).
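For anyone wanting to reproduce this kind of comparison in isolation, here is a minimal micro-benchmark sketch (using the Python 3 timeit API; the repeat counts are arbitrary). For objects that don't define __eq__, == falls back to an identity comparison anyway, so any remaining difference is interpreter dispatch overhead:

```python
import timeit

class Node(object):
    pass  # no __eq__, so == falls back to comparing identity

ns = {'a': Node(), 'b': Node()}
t_eq = timeit.timeit('a == b', globals=ns, number=100000)
t_is = timeit.timeit('a is b', globals=ns, number=100000)
print("==: %.4fs  is: %.4fs" % (t_eq, t_is))
```

As the update above found, differences this small are easily swamped by profiler variance, so cProfile over the whole run is the more trustworthy measurement.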
Update: Existing Libraries
I'm not sure whether there's an existing library that can do what I want, since my requirements are unusual: I need to compute the shortest-path lengths between all pairs of nodes in a weighted, undirected graph. I only care about path lengths that are below a threshold value. After computing the path lengths, I make a small change to the network topology (adding or removing an edge), and then I want to re-compute the path lengths. My graphs are huge compared to the threshold value (from a given node, most of the graph is further away than the threshold), so topology changes don't affect most of the shortest-path lengths. This is why I'm using the routing algorithm: it spreads information about topology changes through the graph structure, so I can stop spreading it when it gets further than the threshold. That is, I don't need to re-compute all the paths every time; I can use the previous path information (from before the topology change) to speed up the calculation. This is why I think my algorithm will be faster than any library implementation of shortest-path algorithms. I've never seen routing algorithms used outside of actually routing packets through physical networks (but if anyone has, then I'd be interested).
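For comparison, the conventional library-style approach I'd be replacing is a cutoff-bounded Dijkstra run from each affected node. A minimal sketch of that standard algorithm (my own illustration, not my actual code) that simply stops expanding paths beyond the cutoff:

```python
import heapq

def bounded_dijkstra(adj, source, cutoff):
    """Shortest distances from source, ignoring anything beyond cutoff.
    adj maps node -> {neighbour: edge_weight} (undirected: store both ways)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue  # stale heap entry, already found a shorter route
        for v, w in adj[u].items():
            nd = d + w
            if nd <= cutoff and nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

adj = {'a': {'b': 2.0}, 'b': {'a': 2.0, 'c': 3.7}, 'c': {'b': 3.7}}
print(bounded_dijkstra(adj, 'a', cutoff=5.0))  # c at 5.7 is beyond the cutoff
```

The drawback, as the paragraph above explains, is that this recomputes from scratch after every topology change instead of reusing the previous paths, which is exactly what the routing approach avoids.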
NetworkX was suggested by @Thomas K. It has lots of algorithms for calculating shortest paths. It has an algorithm for computing all-pairs shortest path lengths with a cutoff (which is exactly what I want), but it only works on unweighted graphs (mine are weighted). Unfortunately, its algorithms for weighted graphs don't allow the use of a cutoff (which might make them slow for my graphs). And none of its algorithms appear to support the use of pre-calculated paths on a very similar network (i.e. the routing stuff).
igraph is another graph library that I know of, but looking at its documentation, I can't find anything about shortest paths. But I might have missed it: its documentation doesn't seem very comprehensive.
NumPy might be possible, thanks to @9000's comment. I can store my sparse matrix in a NumPy array if I assign a unique integer to each instance of my nodes. I can then index a NumPy array with integers instead of node instances. I will also need two NumPy arrays: one for the distances and one for the next_node references. This might be faster than using Python dictionaries (I don't know yet).
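A sketch of what that representation might look like (strings stand in for node instances; using -1 as the sentinel for "direct neighbour / no intermediate node" is my own choice, not anything fixed):

```python
import numpy as np

# Sketch only: map node objects to dense integer ids.
nodes = ['n1', 'n2', 'n3']                       # stand-ins for node instances
index = dict((node, i) for i, node in enumerate(nodes))

n = len(nodes)
distances = np.full((n, n), np.inf)              # inf = beyond the threshold
np.fill_diagonal(distances, 0.0)
next_node = np.full((n, n), -1, dtype=np.intp)   # -1 = direct link / unknown

# e.g. distance n1 -> n3 is 5.7, going via n2 first:
i, j = index['n1'], index['n3']
distances[i, j] = 5.7
next_node[i, j] = index['n2']
```

One consequence of this layout is that "no entry" has to be encoded with sentinels (inf and -1 here) rather than by a key simply being absent, and deleting/adding nodes means managing the id mapping.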
Does anyone know of any other libraries that might be useful?
Update: Memory Usage
I'm running Windows (XP), so here is some information about memory usage, from Process Explorer. CPU usage is at 50% because I have a dual-core machine.
[Process Explorer screenshots: memory usage numbers and I/O graph]
My program isn't running out of RAM and hitting swap. You can see that from the numbers, and from the I/O graph, which shows no activity. The spikes on the I/O graph are where the program prints to the screen to say how it's doing.
However, my program does keep using more and more RAM over time, which is probably not a good thing (but it's not using much RAM overall, which is why I hadn't noticed the increase until now).
And the distance between the spikes on the I/O graph increases over time. This is bad: my program prints to the screen every 100,000 iterations, so that means each iteration is taking longer to execute as time goes on... I've confirmed this with a long run of my program, measuring the time between print statements (the time between each 10,000 iterations of the program). This should be constant, but as you can see from the graph, it increases linearly... so something's up there. (The noise on this graph is because my program uses lots of random numbers, so the time for each iteration varies.)
[graph: time between print-outs vs. iteration count, increasing linearly]
After my program has been running for a long time, the memory usage looks like this (so it's definitely not running out of RAM):
[Process Explorer screenshot: memory usage after a long run]