Python synchronized reading of sorted files

I have two groups of files that contain data in CSV format with a shared key (Timestamp). I need to go through all the records chronologically.

  • Group A: Environmental Data

    • File names are in the format A_0001.csv, A_0002.csv, etc.
    • Pre-sorted Ascending
    • The key is a timestamp in the format YYYY-MM-DD HH:MM:SS
    • Contains environmental data in CSV format (multiple columns)
    • Very large: a few GB of data
  • Group B: Event Data

    • File names are in the format B_0001.csv, B_0002.csv
    • Pre-sorted Ascending
    • The key is a timestamp in the format YYYY-MM-DD HH:MM:SS
    • Contains event-based data in CSV format (multiple columns)
    • Relatively small compared to the Group A files, around 100 MB

What is the best approach?

  • Pre-merge: use one of the various recipes out there to merge the files into a single sorted output, then read that file for processing.
  • Real-time merge: merge the files on the fly as they are read, with no intermediate file.

I will run many post-processing iterations. Any thoughts or suggestions? I am using Python.

+4
5 answers

I would suggest a pre-merge.

Reading a file takes a lot of processing time, and reading two files takes twice as much. Since your program will be dealing with a large amount of input (many files, especially in Group A), I think it is better to work from a single input file that holds all the relevant data. It also reduces the number of variables and read statements you will need.

This will improve the runtime of your algorithm, and I think that is a good enough reason in this scenario to choose this approach.

Hope this helps

0

I'm thinking that importing the data into a database (MySQL, SQLite, etc.) will give better performance than merging it in a script. A database usually has optimized routines for loading CSV, and the join is likely to be as fast as, or much faster than, merging two dicts (one of them very large) in Python.
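
If you go that route, here is a minimal sketch using the standard-library sqlite3 module. The table names, the column layout, and the idea of collapsing the non-key columns into a single payload field are my assumptions, not something from the question:

    import csv
    import sqlite3

    conn = sqlite3.connect("merged.db")
    conn.execute("CREATE TABLE IF NOT EXISTS env (ts TEXT, payload TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS event (ts TEXT, payload TEXT)")

    def load(table, filenames):
        # Bulk-load each CSV file; column 0 is assumed to be the timestamp,
        # the remaining columns are stored as a single text payload.
        for name in filenames:
            with open(name, newline="") as f:
                rows = ((r[0], ",".join(r[1:])) for r in csv.reader(f))
                conn.executemany("INSERT INTO " + table + " VALUES (?, ?)", rows)
        conn.commit()

    load("env", ["A_0001.csv", "A_0002.csv"])
    load("event", ["B_0001.csv", "B_0002.csv"])
    conn.execute("CREATE INDEX IF NOT EXISTS env_ts ON env (ts)")

    # One chronological pass over both groups.
    for ts, payload in conn.execute(
            "SELECT ts, payload FROM env UNION ALL "
            "SELECT ts, payload FROM event ORDER BY ts"):
        pass  # your processing goes here

Since you will be running many post-processing iterations, loading once and re-querying the indexed database each time is where this approach pays off.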

+2

“YYYY-MM-DD HH:MM:SS” timestamps can be compared with a plain ASCII string comparison, so why not reuse external merge logic? If the first field is the key:

    import os

    # "sort -m" only merges inputs that are already sorted;
    # "-t, -k1,1" keys on the first comma-separated field.
    for entry in os.popen("sort -m -t, -k1,1 file1 file2"):
        process(entry)
+2

This is like a relational join. Since your timestamps do not have to match exactly, it is called a non-equijoin.

Sort-merge is one of several popular join algorithms, and it works well for non-equijoins. I think this is what you called "pre-merge". I don't know exactly what you mean by "real-time merge", but I suspect it is still a simple sort-merge, which is a fine technique, used heavily by real databases.

A nested-loop join can also work. In this case you read the smaller table in the outer loop, and in the inner loop you find all the "matching" rows from the larger table. This is effectively a sort-merge, but with the assumption that several rows from the large table will match each row of the small table.

This, by the way, makes it easier to attribute the relationship between event data and environmental data correctly. Instead of reading the output of a massive merge sort and trying to work out which kind of record you are looking at, the nested loops handle that naturally.
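
A rough sketch of that idea, assuming column 0 of every row is the timestamp and pairing each environmental row with the first event at or after it; the matching rule and handle_pair() are placeholders for your own logic:

    import csv

    def rows(filenames):
        # Lazily yield rows from a list of pre-sorted CSV files.
        for name in filenames:
            with open(name, newline="") as f:
                for row in csv.reader(f):
                    yield row

    env = rows(["A_0001.csv", "A_0002.csv"])      # big, sorted stream
    env_row = next(env, None)

    # Outer loop: the small event table.  Inner loop: advance through the
    # big environmental stream until we pass the current event's timestamp.
    for event in rows(["B_0001.csv", "B_0002.csv"]):
        while env_row is not None and env_row[0] <= event[0]:
            handle_pair(event, env_row)           # placeholder
            env_row = next(env, None)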

Alternatively, you can do lookups into the smaller table while reading the larger table.

That is awkward when you are doing non-equality comparisons, because you do not have a proper key for a simple dict lookup. However, you can easily extend dict (overriding __contains__ and __getitem__) to compare a key against ranges instead of doing simple equality tests.
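
A minimal sketch of such a dict subclass, assuming the event keys are stored as (start, end) timestamp pairs; the class name and sample values are made up for illustration, and each lookup is a linear scan:

    class RangeDict(dict):
        # Keys are (start, end) tuples; a lookup succeeds for any key
        # that falls inside one of the stored ranges.
        def __contains__(self, key):
            return any(start <= key <= end for start, end in self)

        def __getitem__(self, key):
            for (start, end), value in self.items():
                if start <= key <= end:
                    return value
            raise KeyError(key)

    events = RangeDict()
    events[("2010-06-01 00:00:00", "2010-06-01 00:05:00")] = "event 42"

    print("2010-06-01 00:03:10" in events)   # True
    print(events["2010-06-01 00:03:10"])     # event 42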

+1

You can read from the files in chunks of, say, 10,000 records (or whatever number profiling tells you is optimal) and merge them on the fly. A custom class can encapsulate the IO; the actual records are then accessed through the generator protocol (__iter__ + next).

This is memory friendly, probably quite good in terms of the total time to complete the operation, and it lets you produce output incrementally.

Sketch:

    class Foo(object):
        def __init__(self, env_filenames=[], event_filenames=[]):
            # open the files etc.
            self._cache = []

        def __iter__(self):
            return self

        def next(self):
            if not self._cache:
                # take care of reading more records into the cache
                pass
            # return the first record and pop it from the cache
            return self._cache.pop(0)

        # ... other stuff you need ...
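
For the actual on-the-fly merging of the pre-sorted streams, the standard library's heapq.merge is a natural fit; a sketch under the assumption that column 0 is the timestamp (the key= argument needs Python 3.5+, and the file names and process() are placeholders):

    import csv
    import heapq

    def records(filenames):
        # Lazily yield rows from a list of pre-sorted CSV files, in order.
        for name in filenames:
            with open(name, newline="") as f:
                for row in csv.reader(f):
                    yield row

    env = records(["A_0001.csv", "A_0002.csv"])
    events = records(["B_0001.csv", "B_0002.csv"])

    # Both streams are already sorted on the timestamp in column 0, so
    # heapq.merge yields one chronological stream without loading it all.
    for row in heapq.merge(env, events, key=lambda r: r[0]):
        process(row)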
0
