Handling very large log files with Python

I am using Python scripts to compute statistics from log files. There are two kinds of logs: the one I call A has entries of the format:

[2012-09-12 12:23:33] SOME_UNIQ_ID filesize 

The other logs, which I call B, have the format:

 [2012-09-12 12:24:00] SOME_UNIQ_ID 

I need to count how many entries in log A also appear in log B, and compute the time interval between the two entries that share the same ID. My current implementation loads all the B-log IDs into a map and then scans the A logs, checking whether each ID exists in the map. The problem is that this uses far too much memory, because there are almost 100 million entries in the B logs. Any suggestions for improving performance and memory usage? Thanks.

+6
source share
6 answers

You could try reversing the lookup, depending on which side fits into memory: load "A" (the smaller log) into memory and scan "B" sequentially.
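
For illustration only, a minimal sketch of that idea; the file names and the parse() helper are assumptions, not part of the original answer:

from datetime import datetime

TIME_FORMAT = '[%Y-%m-%d %H:%M:%S]'

def parse(line):
    # '[2012-09-12 12:23:33] SOME_UNIQ_ID ...' -> ('SOME_UNIQ_ID', datetime)
    stamp, rest = line.split('] ', 1)
    return rest.split()[0], datetime.strptime(stamp + ']', TIME_FORMAT)

# Load the smaller log (A) into a dict keyed by ID ...
with open('a.log') as fa:
    a_times = dict(parse(line) for line in fa)

# ... then stream the huge B log once, never holding it in memory.
matches = 0
with open('b.log') as fb:
    for line in fb:
        uniq_id, ts_b = parse(line)
        if uniq_id in a_times:
            matches += 1
            interval = abs(ts_b - a_times[uniq_id])  # time between the two entries

print(matches, 'IDs of A were also found in B')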

Otherwise, load the log files into an SQLite3 database with two tables (log_a, log_b) containing (timestamp, uniq_id, rest_of_line), then run an SQL join on uniq_id and do whatever processing you need on the results. This keeps the memory overhead low and lets the SQL engine do the join, but it does, of course, require effectively duplicating the log files on disk (which is usually not a problem on most systems).

Example

import sqlite3
from datetime import datetime

db = sqlite3.connect(':memory:')

db.execute('create table log_a (timestamp, uniq_id, filesize)')
a = ['[2012-09-12 12:23:33] SOME_UNIQ_ID filesize']
for line in a:
    timestamp, uniq_id, filesize = line.rsplit(' ', 2)
    db.execute('insert into log_a values(?, ?, ?)', (timestamp, uniq_id, filesize))
db.commit()

db.execute('create table log_b (timestamp, uniq_id)')
b = ['[2012-09-12 13:23:33] SOME_UNIQ_ID']
for line in b:
    timestamp, uniq_id = line.rsplit(' ', 1)
    db.execute('insert into log_b values(?, ?)', (timestamp, uniq_id))
db.commit()

TIME_FORMAT = '[%Y-%m-%d %H:%M:%S]'
for matches in db.execute('select * from log_a join log_b using (uniq_id)'):
    log_a_ts = datetime.strptime(matches[0], TIME_FORMAT)
    log_b_ts = datetime.strptime(matches[3], TIME_FORMAT)
    print(matches[1], 'has a difference of', abs(log_a_ts - log_b_ts))
    # 'SOME_UNIQ_ID has a difference of 1:00:00'
    # '1:00:00' == datetime.timedelta(0, 3600)

Note that:

  • sqlite3.connect should be given a file name (not ':memory:') for data sets of this size
  • a and b should be your actual log files
+3
source

Try the following:

  • Sort both files externally (on SOME_UNIQ_ID)
  • Read one entry from the A log and keep its SOME_UNIQ_ID (A)
  • Read one entry from the B log and keep its SOME_UNIQ_ID (B)
  • Compare SOME_UNIQ_ID (B) with SOME_UNIQ_ID (A)
    • If it is smaller, read the next entry from the B log.
    • If it is larger, read the next entry from the A log and compare it with the saved SOME_UNIQ_ID (B).
    • If they are equal, compute the time span.

Assuming the external sort is efficient, you end up reading each file only once.
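
A minimal sketch of the merge step under the assumption that both files are already sorted by ID; the file names and the parse_line() helper are placeholders, not part of the original answer:

from datetime import datetime

TIME_FORMAT = '[%Y-%m-%d %H:%M:%S]'

def parse_line(line):
    # '[2012-09-12 12:23:33] SOME_UNIQ_ID ...' -> (uniq_id, datetime)
    stamp, rest = line.split('] ', 1)
    return rest.split()[0], datetime.strptime(stamp + ']', TIME_FORMAT)

def merge_join(sorted_a_path, sorted_b_path):
    """Yield (uniq_id, interval) for IDs present in both pre-sorted files."""
    with open(sorted_a_path) as fa, open(sorted_b_path) as fb:
        line_a, line_b = fa.readline(), fb.readline()
        while line_a and line_b:
            id_a, ts_a = parse_line(line_a)
            id_b, ts_b = parse_line(line_b)
            if id_b < id_a:            # B is behind: advance B
                line_b = fb.readline()
            elif id_b > id_a:          # A is behind: advance A
                line_a = fa.readline()
            else:                      # equal: report the time span
                yield id_a, abs(ts_a - ts_b)
                line_a, line_b = fa.readline(), fb.readline()

# for uniq_id, interval in merge_join('a_sorted.log', 'b_sorted.log'):
#     print(uniq_id, interval)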

+1
source

First, what is the format of the ID? Is it globally unique?

I would choose one of these three options.

  • Use a database
  • Join the two sets of identifiers in memory
  • Use Unix tools

I assume you would prefer the second option. Load only the identifiers from A and B. Assuming each identifier fits into a 32-bit integer, memory usage would be less than 1 GB. Then load the date and time only for the identifiers present in both sets and compute the intervals. The first option, though, would probably best fit the requirements.
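
As an illustration only (not code from the original answer), a rough sketch of the two-pass set approach; the file names and the extract_id() helper are assumptions, and a plain Python set will use more memory than the 1 GB estimate, which assumes a packed 32-bit representation:

def extract_id(line):
    # '[2012-09-12 12:23:33] SOME_UNIQ_ID ...' -> 'SOME_UNIQ_ID'
    return line.split('] ', 1)[1].split()[0]

# First pass: collect only the IDs from the B log.
with open('b.log') as fb:
    b_ids = set(extract_id(line) for line in fb)

# Second pass: count the A entries whose ID also appears in B.
with open('a.log') as fa:
    common = sum(1 for line in fa if extract_id(line) in b_ids)

print(common, 'entries of A also appear in B')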

0
source

If the unique identifiers can be sorted (for example, alphabetically or numerically), you can do the comparison in batches.

Suppose, for example, that the IDs are numeric with a range of 1 to 10^7. Then you could first put the IDs of the first batch of 10^6 into your hash table, do a sequential scan through the second file to find the matching entries, and then repeat for the remaining batches.

In pseudo-Python (I have not tested this):

hash_table = {}
for i in range(0, 10):                # one batch of 10**6 IDs per pass
    for line in file1:                # re-open or seek(0) on both files each pass
        time, id = line.split(']')
        id = int(id)
        if i * 10**6 < id <= (i + 1) * 10**6:
            hash_table[id] = time
    for line in file2:
        time, id = line.split(']')    # needs a second split to get the id
        id = int(id)
        if id in hash_table:
            pass                      # compare timestamps here
    hash_table.clear()                # free the batch before the next pass

If the identifiers are not numeric, you can form the batches using a letter key:

if id.startswith(a_letter_from_the_alphabet):
    hash_table[id] = time
0
source

Since the bottleneck turned out to be converting the timestamps, I split that work across the many separate machines that generate the A and B logs. Those machines now convert each line's timestamp to epoch time, and the central machine that uses all of these logs to compute my result now needs almost 1/20 of the time of the original approach. I am posting my solution here; thanks to all of you guys.
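
For reference only (not code from the original poster), a minimal sketch of the kind of pre-conversion described above:

import time

TIME_FORMAT = '[%Y-%m-%d %H:%M:%S]'

def to_epoch(stamp):
    # '[2012-09-12 12:23:33]' -> seconds since the epoch (local time),
    # so the central machine can subtract plain integers instead of parsing dates
    return int(time.mktime(time.strptime(stamp, TIME_FORMAT)))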

0
source

I suggest using a database that supports both datetime and uniqueidentifier types for the unique ID in the form it appears. That type comes from Windows, and if you are using Windows for this task you can use, for example, the Microsoft SQL Server 2008 R2 Express edition (free). The two tables would not use any keys.

You can use the bcp utility of MS SQL, which is probably one of the fastest ways to insert data from a text file (or BULK INSERT).

Indexes on the uniqueidentifier column should be created only after all records have been inserted; otherwise the presence of the indexes slows the inserts down. After that, the inner join should be about as fast as technically possible.
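
A hedged sketch of the post-load steps, assuming pyodbc, a local SQL Server Express instance, and hypothetical table and column names (log_a/log_b, uniq_id, ts); the bcp / BULK INSERT loading itself is not shown:

import pyodbc

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=localhost\\SQLEXPRESS;DATABASE=logs;Trusted_Connection=yes;'
)
cur = conn.cursor()

# Create the indexes only after bcp / BULK INSERT has loaded both tables.
cur.execute('CREATE INDEX ix_log_a_id ON log_a (uniq_id)')
cur.execute('CREATE INDEX ix_log_b_id ON log_b (uniq_id)')
conn.commit()

# Inner join on the identifier; DATEDIFF gives the interval in seconds.
cur.execute(
    'SELECT a.uniq_id, DATEDIFF(second, a.ts, b.ts) '
    'FROM log_a AS a INNER JOIN log_b AS b ON b.uniq_id = a.uniq_id'
)
for uniq_id, gap_seconds in cur:
    print(uniq_id, gap_seconds)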

0
source

Source: https://habr.com/ru/post/925455/

