How to make Python / PostgreSQL faster?

Right now I have a log parser reading through 515 MB of plain-text files (a file for each day over the past 4 years). My code currently stands as this: http://gist.github.com/12978 . I have used psyco (as seen in the code) and I am also compiling it and using the compiled version. It is doing about 100 lines every 0.3 seconds. The machine is a standard 15-inch MacBook Pro (2.4 GHz C2D, 2 GB of RAM).

Is it possible for this to go faster, or is this a limitation of the language / database?

+6
python postgresql
5 answers

Do not waste time profiling. The time is always in the database operations. Do as few as possible. Just the minimum number of inserts.

Three things.

One. Do not SELECT over and over again to conform the Date, Hostname, and Person dimensions. Fetch all the data ONCE into a Python dictionary and use it in memory. Do not do repeated singleton selects. Use Python.
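For example, a minimal sketch of the "fetch once, use in memory" idea; the table and column names here are assumptions, not necessarily the OP's actual schema:

    # Assumed schema: people(id, name). Load the whole dimension table once,
    # then resolve ids from the dictionary instead of issuing a SELECT per line.
    people = {}
    cursor.execute("SELECT name, id FROM people")
    for name, person_id in cursor.fetchall():
        people[name] = person_id

    # Later, inside the parsing loop: no SELECT per log line.
    person_id = people.get(person_name)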

Two. Do not update.

In particular, do not do this. This is bad code for two reasons.

cursor.execute("UPDATE people SET chats_count = chats_count + 1 WHERE id = '%s'" % person_id) 

It gets replaced by a simple SELECT COUNT(*) FROM .... Never update a counter. Just count the rows that are there with a SELECT statement. [If you cannot do this with a simple SELECT COUNT or SELECT COUNT(DISTINCT), you are missing some data; your data model should always provide correct, complete counts. Never update.]

And. Never build SQL using string substitution. Completely dumb.

If for some reason SELECT COUNT(*) is not fast enough (benchmark first, before doing anything lame), you can cache the result of the count in another table. AFTER all of the loads. Do a SELECT COUNT(*) FROM whatever GROUP BY whatever and insert this into a table of counts. Do not update. Ever.
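A hedged sketch of what "count after the load" could look like; the chat_counts summary table is hypothetical and not part of the OP's schema:

    # Rebuild the summary table once, after all inserts are done.
    # Never increment it row by row during the load.
    cursor.execute("DELETE FROM chat_counts")
    cursor.execute("""
        INSERT INTO chat_counts (person_id, chats_count)
        SELECT person_id, COUNT(*)
        FROM chats
        GROUP BY person_id
    """)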

Three. Use bind variables. Always.

 cursor.execute( "INSERT INTO ... VALUES( %(x)s, %(y)s, %(z)s )", {'x':person_id, 'y':time_to_string(time), 'z':channel,} ) 

The SQL never changes. The values being bound in change, but the SQL never changes. This is MUCH faster. Never build SQL statements dynamically. Never.

+7

Use bind variables instead of literal values in the SQL statements, and create a cursor for each unique SQL statement so that the statement does not need to be re-prepared the next time it is used. From the Python DB API doc:

Prepare and execute a database operation (query or command). Parameters may be provided as a sequence or mapping and will be bound to variables in the operation. Variables are specified in a database-specific notation (see the module's paramstyle attribute for details). [5]

A reference to the operation will be retained by the cursor. If the same operation object is passed in again, then the cursor can optimize its behavior. This is most effective for algorithms where the same operation is used, but different parameters are bound to it (many times).

ALWAYS ALWAYS ALWAYS use bind variables.

+3

In the for loop, you are inserting into the chats table repeatedly, so you only need a single SQL statement with bind variables, to be executed with different values. So you could put this before the for loop:

    insert_statement = """
        INSERT INTO chats (person_id, message_type, created_at, channel)
        VALUES (:person_id, :message_type, :created_at, :channel)
    """

Then, in place of each SQL statement you execute, put this:

    cursor.execute(insert_statement, person_id='person', message_type='msg', created_at=some_date, channel=3)

This will speed things up because:

  • The cursor object will not have to re-parse the statement each time.
  • The db server does not have to create a new execution plan, since it can use the one that it created earlier.
  • You will not need to call santitize(), since special characters in the bind variables will not become part of the SQL statement that gets executed.

Note: the bind variable syntax I used above is Oracle-specific. You will need to check the psycopg2 library documentation for the exact syntax.
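For reference, psycopg2 uses the pyformat paramstyle, so the same idea would look roughly like this (a sketch, not tested against the OP's schema):

    insert_statement = """
        INSERT INTO chats (person_id, message_type, created_at, channel)
        VALUES (%(person_id)s, %(message_type)s, %(created_at)s, %(channel)s)
    """

    # Inside the loop: the SQL string never changes, only the bound values do.
    cursor.execute(insert_statement, {
        'person_id': person_id,
        'message_type': 'msg',
        'created_at': some_date,   # a datetime parsed from the log line
        'channel': 3,
    })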

Other optimizations:

  • You are incrementing with the UPDATE people SET chats_count command after each loop iteration. Keep a dictionary mapping the user to chat_count, and then execute one statement with the total count you have seen (see the sketch after this list). This will be faster than hitting the db after every record.
  • Use bind variables for ALL of your queries, not just the insert statement; I picked that as an example.
  • Change all the find_*() functions that do db lookups to cache their results so that they do not hit the db every time.
  • psyco optimizes Python programs that perform a large number of numeric operations. The script is IO-bound, not CPU-bound, so I would not expect it to give you much, if any, speedup.
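A rough sketch of the chat-count tally mentioned in the first bullet; it reuses the people.chats_count column from the OP's UPDATE above and assumes person_id is available for each parsed line:

    from collections import defaultdict

    chat_counts = defaultdict(int)

    for line in logfile:
        # ... parse the line, insert the chat row, resolve person_id ...
        chat_counts[person_id] += 1

    # After the loop: one statement per person instead of one per log line.
    for person_id, count in chat_counts.items():
        cursor.execute(
            "UPDATE people SET chats_count = chats_count + %(n)s WHERE id = %(id)s",
            {'n': count, 'id': person_id},
        )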
+3

As Mark suggested, use bind variables. The database only has to prepare each statement once and then "fill in the blanks" for each execution. As a nice side effect, it will automatically take care of string-quoting issues (which your program is not handling).

Turn on transactions (if they are not already on) and do a single commit at the end of the program. The database does not have to write anything to disk until all the data needs to be committed. And if your program encounters an error, none of the rows will be committed, which allows you to simply re-run the program once the problem has been corrected.
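With psycopg2 this mostly means not committing inside the loop: the connection starts an implicit transaction, so a single commit() at the end is enough. A sketch with hypothetical connection parameters:

    import psycopg2

    conn = psycopg2.connect("dbname=logs user=me")   # hypothetical DSN
    cursor = conn.cursor()

    try:
        for line in logfile:
            # ... parse the line and cursor.execute(...) the inserts ...
            pass
        conn.commit()      # one commit at the very end
    except Exception:
        conn.rollback()    # nothing is persisted if anything went wrong
        raise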

Your log_hostname, log_person, and log_date functions are doing needless SELECTs on the tables. Make the appropriate table attributes PRIMARY KEY or UNIQUE. Then, instead of checking for the presence of the key before you INSERT, just do the INSERT. If the person / date / hostname already exists, the INSERT will fail from the constraint violation. (This will not work if you use a transaction with a single commit, as suggested above.)

Alternatively, if you know that you are the only one INSERTing into the tables while your program is running, then create parallel data structures in memory and maintain them while you do your INSERTs. For example, read all the hostnames from the table into an associative array at the start of the program. When you want to know whether to do an INSERT, just do an array lookup. If no entry is found, do the INSERT and update the array appropriately. (This suggestion is compatible with transactions and a single commit, but requires more code. It will be wickedly faster, though.)
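A small sketch of that lookup-or-insert pattern, assuming a hostnames dictionary was preloaded from the table as described and that the table has name and id columns; RETURNING is PostgreSQL-specific:

    def host_id_for(name):
        # In-memory lookup first; INSERT only when the name has never been seen.
        if name not in hostnames:
            cursor.execute(
                "INSERT INTO hostnames (name) VALUES (%(name)s) RETURNING id",
                {'name': name},
            )
            hostnames[name] = cursor.fetchone()[0]
        return hostnames[name]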

+2

In addition to the many great suggestions @Mark Roddy gave, do the following:

  • do not use readlines; you can iterate over file objects directly
  • try to use executemany rather than execute: do batch inserts rather than single inserts, which is usually faster because there is less overhead. It also reduces the number of commits (see the sketch below).
  • str.rstrip will work just fine instead of stripping the newline with a regex

Batching the inserts will temporarily use more memory, but that should be fine as long as you do not read the whole file into memory.
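A sketch tying these together; parse_line() is a hypothetical helper that returns the bind-variable dict for one log line, and insert_statement is the parameterized INSERT from the earlier answers:

    batch = []
    for line in logfile:                                  # iterate the file object, no readlines()
        batch.append(parse_line(line.rstrip('\n')))       # str.rstrip instead of a regex
        if len(batch) >= 1000:                            # flush in chunks to bound memory use
            cursor.executemany(insert_statement, batch)
            batch = []

    if batch:                                             # flush whatever is left
        cursor.executemany(insert_statement, batch)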

+1
