Question:
I've profiled my Python program to death, and there is one function that slows everything down. It makes heavy use of Python dictionaries, so I may not have used them in the best way. If I can't get it running faster, I will have to re-write it in C++, so is there anyone who can help me optimise it in Python?
I hope I've given a clear enough explanation, and that you can follow my code! Thanks in advance for any help.
My code is:
This is the offending function, profiled with line_profiler and kernprof. I am running Python 2.7.
I am particularly puzzled by things like lines 363, 389, and 405, where an if statement comparing two variables seems to take an inordinate amount of time.
I have considered using NumPy (since these are sparse matrices), but I don't think it's appropriate because: (1) I'm not indexing my matrix with integers (I'm using object instances); and (2) I'm not storing simple data types in the matrix (I'm storing tuples of a float and an object instance). But I'm willing to be convinced about NumPy. If anyone knows about the comparative performance of small NumPy matrices versus Python hash tables, I'd be interested.
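To illustrate the kind of comparison I mean, here is a rough micro-benchmark sketch (my own, not from the question; the sizes and repeat counts are arbitrary assumptions) timing lookups through a plain dict against fancy-indexing a small NumPy array:

```python
import timeit
import numpy as np

# Sketch only: compare 1000 dict lookups vs. one NumPy fancy-indexing call.
# Sizes and repeat counts are arbitrary; real node keys are object instances.
n = 1000
d = dict((i, float(i)) for i in range(n))
arr = np.arange(n, dtype=float)
keys = list(range(n))

t_dict = timeit.timeit(lambda: [d[k] for k in keys], number=100)
t_numpy = timeit.timeit(lambda: arr[keys], number=100)
print("dict: %.4fs  numpy: %.4fs" % (t_dict, t_numpy))
```

The absolute numbers will vary by machine; the point is only to measure the two access patterns side by side.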
Sorry I haven't provided a simple example that you can run, but this function is tied into a much larger project, and I couldn't work out how to set up a simple example to test it without giving you half of my code base!
Timer unit: 3.33366e-10 s

File: routing_distances.py
Function: propagate_distances_node at line 328
Total time: 807.234 s

Line #      Hits          Time  Per Hit   % Time  Line Contents
   328                                            @profile
   329                                            def propagate_distances_node(self, node_a, cutoff_distance=200):
   330
   331                                                # a makes sure its immediate neighbours are correctly in its distance table
   332                                                # because its immediate neighbours may change as binds/folding change
   333    737753    3733642341   5060.8      0.2      for (node_b, neighbour_distance_b_a) in self.neighbours[node_a].iteritems():
   334    512120    2077788924   4057.2      0.1          use_neighbour_link = False
   335
   336    512120    2465798454   4814.9      0.1          if(node_b not in self.node_distances[node_a]): # a doesn't know distance to b
   337     15857      66075687   4167.0      0.0              use_neighbour_link = True
   338                                                    else: # a does know distance to b
   339    496263    2390534838   4817.1      0.1              (node_distance_b_a, next_node) = self.node_distances[node_a][node_b]
   340    496263    2058112872   4147.2      0.1              if(node_distance_b_a > neighbour_distance_b_a): # neighbour distance is shorter
   341        81        331794   4096.2      0.0                  use_neighbour_link = True
   342    496182    2665644192   5372.3      0.1              elif((None == next_node) and (float('+inf') == neighbour_distance_b_a)): # direct route that has just broken
   343        75        313623   4181.6      0.0                  use_neighbour_link = True
   344
   345    512120    1992514932   3890.7      0.1          if(use_neighbour_link):
   346     16013      78149007   4880.3      0.0              self.node_distances[node_a][node_b] = (neighbour_distance_b_a, None)
   347     16013      83489949   5213.9      0.0              self.nodes_changed.add(node_a)
   348
   349                                                        ## Affinity distances update
   350     16013      86020794   5371.9      0.0              if((node_a.type == Atom.BINDING_SITE) and (node_b.type == Atom.BINDING_SITE)):
   351       164       3950487  24088.3      0.0                  self.add_affinityDistance(node_a, node_b, self.chemistry.affinity(node_a.data, node_b.data))
   352
   353                                                # a sends its table to all its immediate neighbours
   354    737753    3549685140   4811.5      0.1      for (node_b, neighbour_distance_b_a) in self.neighbours[node_a].iteritems():
   355    512120    2129343210   4157.9      0.1          node_b_changed = False
   356
   357                                                    # b integrates a's distance table with its own
   358    512120    2203821081   4303.3      0.1          node_b_chemical = node_b.chemical
   359    512120    2409257898   4704.5      0.1          node_b_distances = node_b_chemical.node_distances[node_b]
   360
   361                                                    # For all b's routes (to c) that go to a first, update their distances
   362  41756882  183992040153   4406.3      7.6          for node_c, (distance_b_c, node_after_b) in node_b_distances.iteritems(): # Think it's ok to modify items while iterating over them (just not insert/delete) (seems to work ok)
   363  41244762  172425596985   4180.5      7.1              if(node_after_b == node_a):
   364
   365  16673654   64255631616   3853.7      2.7                  try:
   366  16673654   88781802534   5324.7      3.7                      distance_b_a_c = neighbour_distance_b_a + self.node_distances[node_a][node_c][0]
   367    187083     929898684   4970.5      0.0                  except KeyError:
   368    187083    1056787479   5648.8      0.0                      distance_b_a_c = float('+inf')
   369
   370  16673654   69374705256   4160.7      2.9                  if(distance_b_c != distance_b_a_c): # a's distance to c has changed
   371    710083    3136751361   4417.4      0.1                      node_b_distances[node_c] = (distance_b_a_c, node_a)
   372    710083    2848845276   4012.0      0.1                      node_b_changed = True
   373
   374                                                                ## Affinity distances update
   375    710083    3484577241   4907.3      0.1                      if((node_b.type == Atom.BINDING_SITE) and (node_c.type == Atom.BINDING_SITE)):
   376     99592    1591029009  15975.5      0.1                          node_b_chemical.add_affinityDistance(node_b, node_c, self.chemistry.affinity(node_b.data, node_c.data))
   377
   378                                                            # If distance got longer, then ask b's neighbours to update
   379                                                            ## TODO: document this!
   380  16673654   70998570837   4258.1      2.9                  if(distance_b_a_c > distance_b_c):
   381                                                                #for (node, neighbour_distance) in node_b_chemical.neighbours[node_b].iteritems():
   382   1702852    7413182064   4353.4      0.3                      for node in node_b_chemical.neighbours[node_b]:
   383   1204903    5912053272   4906.7      0.2                          node.chemical.nodes_changed.add(node)
   384
   385                                                    # Look for routes from a to c that are quicker than ones b knows already
   386  42076729  184216680432   4378.1      7.6          for node_c, (distance_a_c, node_after_a) in self.node_distances[node_a].iteritems():
   387
   388  41564609  171150289218   4117.7      7.1              node_b_update = False
   389  41564609  172040284089   4139.1      7.1              if(node_c == node_b): # ab path
   390    512120    2040112548   3983.7      0.1                  pass
   391  41052489  169406668962   4126.6      7.0              elif(node_after_a == node_b): # abab path
   392  16251407   63918804600   3933.1      2.6                  pass
   393  24801082  101577038778   4095.7      4.2              elif(node_c in node_b_distances): # b can already get to c
   394  24004846  103404357180   4307.6      4.3                  (distance_b_c, node_after_b) = node_b_distances[node_c]
   395  24004846  102717271836   4279.0      4.2                  if(node_after_b != node_a): # b doesn't already go to a first
   396   7518275   31858204500   4237.4      1.3                      distance_b_a_c = neighbour_distance_b_a + distance_a_c
   397   7518275   33470022717   4451.8      1.4                      if(distance_b_a_c < distance_b_c): # quicker to go via a
   398    225357     956440656   4244.1      0.0                          node_b_update = True
   399                                                        else: # b can't already get to c
   400    796236    3415455549   4289.5      0.1                  distance_b_a_c = neighbour_distance_b_a + distance_a_c
   401    796236    3412145520   4285.3      0.1                  if(distance_b_a_c < cutoff_distance): # not too far to go
   402    593352    2514800052   4238.3      0.1                      node_b_update = True
   403
   404                                                        ## Affinity distances update
   405  41564609  164585250189   3959.7      6.8              if node_b_update:
   406    818709    3933555120   4804.6      0.2                  node_b_distances[node_c] = (distance_b_a_c, node_a)
   407    818709    4151464335   5070.7      0.2                  if((node_b.type == Atom.BINDING_SITE) and (node_c.type == Atom.BINDING_SITE)):
   408    104293    1704446289  16342.9      0.1                      node_b_chemical.add_affinityDistance(node_b, node_c, self.chemistry.affinity(node_b.data, node_c.data))
   409    818709    3557529531   4345.3      0.1                  node_b_changed = True
   410
   411                                                    # If any of node b's rows have exceeded the cutoff distance, then remove them
   412  42350234  197075504439   4653.5      8.1          for node_c, (distance_b_c, node_after_b) in node_b_distances.items(): # Can't use iteritems() here, as deleting from the dictionary
   413  41838114  180297579789   4309.4      7.4              if(distance_b_c > cutoff_distance):
   414    206296     894881754   4337.9      0.0                  del node_b_distances[node_c]
   415    206296     860508045   4171.2      0.0                  node_b_changed = True
   416
   417                                                            ## Affinity distances update
   418    206296    4698692217  22776.5      0.2                  node_b_chemical.del_affinityDistance(node_b, node_c)
   419
   420                                                    # If we've modified node_b's distance table, tell its chemical to update accordingly
   421    512120    2130466347   4160.1      0.1          if(node_b_changed):
   422    217858    1201064454   5513.1      0.0              node_b_chemical.nodes_changed.add(node_b)
   423
   424                                                # Remove any neighbours that have infinite distance (have just unbound)
   425                                                ## TODO: not sure what difference it makes to do this here rather than above (after updating self.node_distances for neighbours)
   426                                                ## but doing it above seems to break the walker movement
   427    737753    3830386968   5192.0      0.2      for (node_b, neighbour_distance_b_a) in self.neighbours[node_a].items(): # Can't use iteritems() here, as deleting from the dictionary
   428    512120    2249770068   4393.1      0.1          if(neighbour_distance_b_a > cutoff_distance):
   429       150        747747   4985.0      0.0              del self.neighbours[node_a][node_b]
   430
   431                                                        ## Affinity distances update
   432       150       2148813  14325.4      0.0              self.del_affinityDistance(node_a, node_b)
Explanation of my code:
This function maintains a sparse distance matrix, representing the network distance (the sum of the edge weights along the shortest path) between nodes in a (very large) network. Working with the full table and using the Floyd-Warshall algorithm would be very slow. (I tried this first, and it was an order of magnitude slower than the current version.) So my code uses a sparse matrix to represent a thresholded version of the full distance matrix (any paths with a distance greater than 200 units are ignored). The network topology changes over time, so this distance matrix needs updating over time. To do this, I'm using a rough implementation of a distance-vector routing protocol: each node in the network knows the distance to every other node and the next node on the path. When a topology change happens, the node(s) associated with that change update their distance tables accordingly, and tell their immediate neighbours. The information spreads through the network by nodes sending their distance tables to their neighbours, who update their distance tables and spread them to their neighbours.
There is one object representing the distance matrix: self.node_distances. This is a dictionary mapping nodes to routing tables. A node is an object that I've defined. A routing table is a dictionary mapping nodes to (distance, next_node) tuples. distance is the graph distance from node_a to node_b, and next_node is the neighbour of node_a that you should go to first on the path between node_a and node_b. A next_node of None means that node_a and node_b are graph neighbours. For example, a sample of the distance matrix could be:
self.node_distances = { node_1 : { node_2 : (2.0, None),
                                   node_3 : (5.7, node_2),
                                   node_5 : (22.9, node_2) },
                        node_2 : { node_1 : (2.0, None),
                                   node_3 : (3.7, None),
                                   node_5 : (20.9, node_7)},
                        ...etc...
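To make the (distance, next_node) convention concrete, here is a minimal sketch of my own (using strings as stand-in nodes, not my actual atom objects) showing how a full path can be recovered by following the next_node entries:

```python
# Sketch only: string "nodes" instead of atom instances.
# next_node of None means the two nodes are direct graph neighbours.
node_distances = {
    'a': {'b': (2.0, None), 'c': (5.7, 'b')},
    'b': {'a': (2.0, None), 'c': (3.7, None)},
    'c': {'a': (5.7, 'b'), 'b': (3.7, None)},
}

def path(node_distances, src, dst):
    """Hop through next_node pointers from src until a direct edge to dst."""
    hops = [src]
    current = src
    while True:
        distance, next_node = node_distances[current][dst]
        if next_node is None:   # current and dst are direct neighbours
            hops.append(dst)
            return hops
        hops.append(next_node)
        current = next_node

print(path(node_distances, 'a', 'c'))  # -> ['a', 'b', 'c']
```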
Because of topology changes, two nodes that were far apart (or not connected at all) can become close. When this happens, entries are added to this matrix. Because of the thresholding, two nodes can become too far apart to care about. When this happens, entries are deleted from this matrix.
The self.neighbours matrix is similar to self.node_distances, but it contains information about the direct links (edges) in the network. self.neighbours is constantly being changed externally to this function, by the chemical reaction. This is where the network topology changes come from.
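For the same toy network as in the example above (again with hypothetical string nodes), self.neighbours would look something like this; only direct edges appear, and the values are edge weights rather than (distance, next_node) tuples:

```python
# Sketch only: direct links of an undirected graph, values are edge weights.
neighbours = {
    'node_1': {'node_2': 2.0},
    'node_2': {'node_1': 2.0, 'node_3': 3.7},
    'node_3': {'node_2': 3.7},
}

# Each edge is stored in both directions, so the structure is symmetric.
print(all(neighbours[b][a] == w
          for a, links in neighbours.items()
          for b, w in links.items()))
```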
The actual function I'm having problems with: propagate_distances_node() performs one step of the distance-vector routing protocol. Given a node, node_a, the function makes sure that node_a's neighbours are correctly in the distance matrix (topology changes). The function then sends node_a's routing table to all of node_a's immediate neighbours in the network: it integrates node_a's routing table with each neighbour's own routing table.
In the rest of my program, the propagate_distances_node() function is called repeatedly until the distance matrix converges. A set, self.nodes_changed, is maintained of the nodes that have changed their routing table since they last propagated it. On every iteration of my algorithm, a random subset of these nodes is chosen and propagate_distances_node() is called on them. This means the nodes spread their routing tables asynchronously and stochastically. The algorithm has converged on the true distance matrix when the set self.nodes_changed becomes empty.
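That outer loop isn't shown in the question; as a hedged sketch (the function name converge and the batch_size parameter are my own invention, not the actual code), it looks roughly like this:

```python
import random

def converge(chemical, batch_size=10):
    """Sketch: keep propagating from random subsets of the changed nodes
    until the changed-set empties, i.e. the distance matrix has converged.
    Returns how many iterations it took."""
    iterations = 0
    while chemical.nodes_changed:
        k = min(batch_size, len(chemical.nodes_changed))
        batch = random.sample(list(chemical.nodes_changed), k)
        for node in batch:
            chemical.nodes_changed.discard(node)
            chemical.propagate_distances_node(node)  # may re-add nodes
        iterations += 1
    return iterations
```

propagate_distances_node() may add nodes back into nodes_changed, so the loop keeps going until an iteration produces no further changes.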
The "affinity distances" parts (add_affinityDistance and del_affinityDistance) maintain a cache of a (small) sub-matrix of the distance matrix, which is used by another part of the program.
The reason I'm doing this is that I'm simulating computational analogues of chemicals taking part in reactions, as part of my PhD. A "chemical" is a graph of "atoms" (nodes in the graph). Two chemicals bonding together is simulated as their two graphs being joined by new edges. A chemical reaction happens (by an elaborate process that isn't relevant here), changing the topology of the graph. But what happens in the reaction depends on how far apart the different atoms that make up the chemicals are. So for each atom in the simulation, I want to know which other atoms it is close to. A sparse, thresholded distance matrix is the most efficient way to store this information. Since the topology of the network changes as the reaction happens, I need to update the matrix. A distance-vector routing protocol is the fastest way I can think of to do this. I don't need a more complicated routing protocol, because things like routing loops don't happen in my particular application (because of how my chemicals are structured). The reason I'm doing it stochastically is so that I can interleave the chemical reaction processing with the distance spreading, and simulate a chemical gradually changing shape over time as the reaction happens (rather than changing shape instantly).
The self in this function is an object representing a chemical. The nodes in self.node_distances.keys() are the atoms that make up the chemical. The nodes in self.node_distances[node_x].keys() are nodes of the chemical, and potentially nodes of any chemicals that the chemical has bonded to (and is reacting with).
Update:
I tried replacing every instance of node_x == node_y with node_x is node_y (as per @Sven Marnach's comment), but it slowed things down! (I wasn't expecting that!) My original profile took 807.234s to run, but with this modification it increased to 895.895s. Sorry, I was doing the profiling wrong! I was using line_profiler, which (on my code) has far too much variance (that ~90 second difference was all in the noise). When profiling it properly, is is indeed faster than ==. Using cProfile, my code with == took 34.394s, but with is, it took 33.535s (which I can confirm is outside the noise).
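For anyone wanting to reproduce this kind of comparison in isolation, here is a minimal micro-benchmark sketch (using the Python 3 timeit API; the repeat counts are arbitrary). For objects that don't define __eq__, == falls back to an identity comparison anyway, so any remaining difference is interpreter dispatch overhead:

```python
import timeit

class Node(object):
    pass  # no __eq__, so == falls back to comparing identity

ns = {'a': Node(), 'b': Node()}
t_eq = timeit.timeit('a == b', globals=ns, number=100000)
t_is = timeit.timeit('a is b', globals=ns, number=100000)
print("==: %.4fs  is: %.4fs" % (t_eq, t_is))
```

As the update above found, differences this small are easily swamped by profiler variance, so cProfile over the whole run is the more trustworthy measurement.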
Update: Existing Libraries
I'm not sure whether there's an existing library that can do what I want, since my requirements are unusual: I need to compute the shortest-path lengths between all pairs of nodes in a weighted, undirected graph. I only care about path lengths that are below a threshold value. After computing the path lengths, I make a small change to the network topology (adding or removing an edge), and then I want to re-compute the path lengths. My graphs are huge compared to the threshold value (from a given node, most of the graph is further away than the threshold), so topology changes don't affect most of the shortest-path lengths. This is why I'm using the routing algorithm: it spreads information about topology changes through the graph structure, so I can stop spreading it when it gets further than the threshold. That is, I don't need to re-compute all the paths every time; I can use the previous path information (from before the topology change) to speed up the calculation. This is why I think my algorithm will be faster than any library implementation of shortest-path algorithms. I've never seen routing algorithms used outside of actually routing packets through physical networks (but if anyone has, then I'd be interested).
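For comparison, the conventional library-style approach I'd be replacing is a cutoff-bounded Dijkstra run from each affected node. A minimal sketch of that standard algorithm (my own illustration, not my actual code) that simply stops expanding paths beyond the cutoff:

```python
import heapq

def bounded_dijkstra(adj, source, cutoff):
    """Shortest distances from source, ignoring anything beyond cutoff.
    adj maps node -> {neighbour: edge_weight} (undirected: store both ways)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue  # stale heap entry, already found a shorter route
        for v, w in adj[u].items():
            nd = d + w
            if nd <= cutoff and nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

adj = {'a': {'b': 2.0}, 'b': {'a': 2.0, 'c': 3.7}, 'c': {'b': 3.7}}
print(bounded_dijkstra(adj, 'a', cutoff=5.0))  # c at 5.7 is beyond the cutoff
```

The drawback, as the paragraph above explains, is that this recomputes from scratch after every topology change instead of reusing the previous paths, which is exactly what the routing approach avoids.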
NetworkX was suggested by @Thomas K. It has lots of algorithms for calculating shortest paths. It has an algorithm for computing all-pairs shortest path lengths with a cutoff (which is exactly what I want), but it only works on unweighted graphs (mine are weighted). Unfortunately, its algorithms for weighted graphs don't allow the use of a cutoff (which might make them slow for my graphs). And none of its algorithms appear to support the use of pre-calculated paths on a very similar network (i.e. the routing stuff).
igraph is another graph library that I know of, but looking at its documentation, I can't find anything about shortest paths. But I might have missed it: its documentation doesn't seem very comprehensive.
NumPy might be possible, thanks to @9000's comment. I can store my sparse matrix in a NumPy array if I assign a unique integer to each instance of my nodes. I can then index a NumPy array with integers instead of node instances. I will also need two NumPy arrays: one for the distances and one for the next_node references. This might be faster than using Python dictionaries (I don't know yet).
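A sketch of what that representation might look like (strings stand in for node instances; using -1 as the sentinel for "direct neighbour / no intermediate node" is my own choice, not anything fixed):

```python
import numpy as np

# Sketch only: map node objects to dense integer ids.
nodes = ['n1', 'n2', 'n3']                       # stand-ins for node instances
index = dict((node, i) for i, node in enumerate(nodes))

n = len(nodes)
distances = np.full((n, n), np.inf)              # inf = beyond the threshold
np.fill_diagonal(distances, 0.0)
next_node = np.full((n, n), -1, dtype=np.intp)   # -1 = direct link / unknown

# e.g. distance n1 -> n3 is 5.7, going via n2 first:
i, j = index['n1'], index['n3']
distances[i, j] = 5.7
next_node[i, j] = index['n2']
```

One consequence of this layout is that "no entry" has to be encoded with sentinels (inf and -1 here) rather than by a key simply being absent, and deleting/adding nodes means managing the id mapping.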
Does anyone know of any other libraries that might be useful?
Update: Memory Usage
I'm running Windows (XP), so here is some information about memory usage, from Process Explorer. CPU usage is at 50% because I have a dual-core machine.
[Process Explorer screenshots: memory usage numbers and I/O graph]
My program isn't running out of RAM and hitting swap. You can see that from the numbers, and from the I/O graph, which shows no activity. The spikes on the I/O graph are where the program prints to the screen to say how it's doing.
However, my program does keep using more and more RAM over time, which is probably not a good thing (but it's not using much RAM overall, which is why I hadn't noticed the increase until now).
And the distance between the spikes on the I/O graph increases over time. This is bad: my program prints to the screen every 100,000 iterations, so that means each iteration is taking longer to execute as time goes on... I've confirmed this with a long run of my program, measuring the time between print statements (the time between each 10,000 iterations of the program). This should be constant, but as you can see from the graph, it increases linearly... so something's up there. (The noise on this graph is because my program uses lots of random numbers, so the time for each iteration varies.)
[graph: time between print-outs vs. iteration count, increasing linearly]
After my program has been running for a long time, the memory usage looks like this (so it's definitely not running out of RAM):
[Process Explorer screenshot: memory usage after a long run]