Design patterns for aggregating heterogeneous tabular data

I am working on some C++ code that combines information from several dozen CSV files. They all contain some timestamped data that I want to extract, but the representation in each file is slightly different. The differences between files go beyond column order and column names; for example, what is a single row with multiple columns in one file may be spread across multiple lines in another.

So I need some custom handling for each file in order to build a single data structure that contains the necessary information from all the files. My question is: is there a preferred design pattern that keeps this kind of complexity manageable and the code elegant? Or, if there is good prior work, what should I study to see how this complexity has been handled in the past?

(I understand this might be simpler in a scripting language like Perl, but the project is in C++. Also, my question is more about whether there is a design pattern for handling this, so the answer needn't be too language-specific.)

+7
2 answers

There are several phrases in your question that stand out to me: custom handling for each file, representation is slightly different, keep the complexity manageable. Given that you need different parsing algorithms depending on each CSV file's format, and that (from what I can tell) you want your parsing machinery loosely coupled, I would recommend the Strategy pattern.

The Strategy pattern separates the parsing machinery from the consumers of the data contained in the CSV files. The consumers do not care about a file's format; they only care about the information inside it, which makes Strategy an excellent fit. If there is commonality between your parsers, you can combine the Template Method and Strategy patterns to reduce duplication and take advantage of inheritance.
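
A minimal sketch of how that might look in C++ (the Record fields, strategy names, and parsing details are placeholders I've invented for illustration, not anything from your question):

```cpp
#include <istream>
#include <string>
#include <vector>

// Hypothetical unified record: the shape the rest of the program
// consumes, independent of any particular file's layout.
struct Record {
    std::string timestamp;
    double      value = 0.0;
};

// Strategy interface: one implementation per CSV dialect.
class ParseStrategy {
public:
    virtual ~ParseStrategy() = default;
    virtual std::vector<Record> parse(std::istream& in) const = 0;
};

// Layout where each record occupies a single line.
class SingleLineStrategy : public ParseStrategy {
public:
    std::vector<Record> parse(std::istream& in) const override {
        std::vector<Record> out;
        std::string line;
        while (std::getline(in, line)) {
            // file-specific field splitting would go here
        }
        return out;
    }
};

// Layout where one logical record spans several physical lines.
class MultiLineStrategy : public ParseStrategy {
public:
    std::vector<Record> parse(std::istream& in) const override {
        std::vector<Record> out;
        // accumulate consecutive lines into one Record here
        return out;
    }
};

// Consumers only ever see Records; the CSV details stay behind
// the strategy interface.
std::vector<Record> load(std::istream& in, const ParseStrategy& strategy) {
    return strategy.parse(in);
}
```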

Using the Strategy pattern also means you can later extract strategy creation into a Factory Method or an Abstract Factory if you find it necessary, so that clients are decoupled from the choice of parsing strategy as well.
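
For example, building on the types in the sketch above, a simple factory might map a file-type tag to a concrete strategy (the "vendorA"/"vendorB" tags are invented for the example):

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical factory: clients request a strategy by tag and never
// learn which concrete parser they receive.
std::unique_ptr<ParseStrategy> makeStrategy(const std::string& fileType) {
    static const std::map<std::string,
        std::function<std::unique_ptr<ParseStrategy>()>> registry = {
        {"vendorA", [] { return std::make_unique<SingleLineStrategy>(); }},
        {"vendorB", [] { return std::make_unique<MultiLineStrategy>(); }},
    };
    auto it = registry.find(fileType);
    if (it == registry.end())
        throw std::runtime_error("unknown file type: " + fileType);
    return it->second();
}
```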

+3

I'm not quite sure what you want to do with the different files. If the idea is to use them as database tables, and you have several keys with attached information scattered across several files, you may want to look at something like MapReduce, where you first extract pieces of information from each file and then, in a second step, combine the pieces that share the same key.

As for data structures, it depends on the layout of your files. I would probably write a dedicated reader for each file type that stores its contents in a data structure representing that file's information. You can attach a key to each piece of information and then use a reduce operation to merge all the pieces that share the same key into a single aggregate structure, as sketched below.
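
A rough sketch of that reduce step, assuming the timestamp acts as the shared key and each reader emits partial records (the type names here are invented):

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical partial record: whatever one file contributed for a key.
struct Partial {
    std::string timestamp;                 // the shared key
    std::map<std::string, double> fields;  // this file's columns
};

// key -> merged columns gathered from every file
using Aggregate = std::map<std::string, std::map<std::string, double>>;

// "Reduce": fold all pieces sharing the same timestamp into one row.
Aggregate combine(const std::vector<Partial>& pieces) {
    Aggregate result;
    for (const auto& p : pieces) {
        auto& row = result[p.timestamp];
        row.insert(p.fields.begin(), p.fields.end());
    }
    return result;
}
```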

On the other hand, if the idea is to create identical objects from different serialization formats (i.e. the files are independent but represent the same data type with different layouts), without knowing in advance which format was used, I'm afraid the only solution left is brute-force deserialization: keep a set of readers, one per known input format, and try them on the file one after another; if one fails, move on to the next, until you either find a suitable reader or conclude that you have a new file format on your hands. I don't think there is a named pattern for that.
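
That brute-force loop could look something like this (the Reader interface, tryParse name, and error handling are all assumptions for illustration):

```cpp
#include <memory>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

struct Record { /* the common data type all the files represent */ };

// Hypothetical reader interface: tryParse() yields records on success
// and std::nullopt when the file does not match this reader's format.
class Reader {
public:
    virtual ~Reader() = default;
    virtual std::optional<std::vector<Record>>
    tryParse(const std::string& path) const = 0;
};

// Try each known reader in turn until one accepts the file.
std::vector<Record>
parseAny(const std::vector<std::unique_ptr<Reader>>& readers,
         const std::string& path) {
    for (const auto& reader : readers) {
        if (auto records = reader->tryParse(path))
            return *records;
    }
    throw std::runtime_error("no known reader accepts " + path);
}
```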

0
