What is the best way to handle large CSV files?

I have a third-party system that generates a large amount of data every day (CSV files delivered over FTP). Three types of files are created:

  • every 15 minutes (2 files); these files are quite small (~2 MB)
  • every day at 5 pm (~200-300 MB)
  • every midnight (this CSV file is about 1 GB)

In total, these 4 CSV files come to about 1.5 GB, but keep in mind that some of the files are created every 15 minutes. The data also has to be aggregated (not a complicated process, but it will definitely take time), and queries on it need to be fast. I am thinking about how to store this data and about the overall implementation.

We have a Java stack and an MS SQL Standard database. From my measurements, MS SQL Standard, shared with our other applications, will not handle such a load. What comes to my mind:

  • Upgrading to MS SQL Enterprise on a separate server.
  • Using PostgreSQL on a separate server. I am currently working on a PoC for this approach.

What do you recommend here? There are probably better alternatives.

Edit #1

These large files are new data for every day.

+6

4 answers

OK. After spending some time with this problem (reading, consulting, experimenting, and doing several PoCs), I came up with the following solution.

TL;DR

Database: PostgreSQL, as it has good CSV support and is free and open source.

Tool: Apache Spark is a good fit for this kind of task and has good performance.

Database

As for the database, it is an important decision: what to choose, and how it will cope with this amount of data in the future. It should definitely be a separate server instance, so as not to put extra load on the main database instance and not block other applications.

NoSQL

I thought about using Cassandra here, but that solution would be too complicated right now. Cassandra does not support ad-hoc queries. Its data storage tier is basically a key-value store, which means you have to model your data around the queries you need, not around the structure of the data itself.

RDBMS

I did not want to over-engineer here, so the choice came down to the two options below.

MS SQL Server

This would be a way to go, but the big drawback is the price. It is quite expensive: the Enterprise edition costs a lot of money given our hardware. Regarding pricing, you can read the licensing policy document.

Another drawback was CSV support, since CSV files will be our main data source here. MS SQL Server can neither import nor export CSV properly:

  • MS SQL Server silently truncates text fields.

  • MS SQL Server's text encoding handling goes wrong.

MS SQL Server throws an error message because it does not understand quoting or escaping. More details can be found in the comparison article PostgreSQL vs. MS SQL Server.

PostgreSQL

This database is a mature product and well battle-tested. I have heard a lot of positive feedback about it from others (of course, there are some trade-offs). It has a more classical SQL syntax, good CSV support, and, moreover, it is open source.
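As an illustration of that CSV support, here is a minimal sketch of loading one of these files into PostgreSQL from Java through the JDBC driver's CopyManager (the connection details, table name and file path are placeholders, not part of the original setup):

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;

public class PostgresCsvLoad {
    public static void main(String[] args) throws Exception {
        // Connection details and table name are placeholders for your environment.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://db-host:5432/analytics", "etl", "secret")) {
            CopyManager copyManager = conn.unwrap(PGConnection.class).getCopyAPI();
            try (BufferedReader reader =
                         new BufferedReader(new FileReader("/path/to/csv/dir/daily.csv"))) {
                // COPY ... FROM STDIN streams the file through the JDBC connection,
                // which is much faster than row-by-row INSERTs.
                long rows = copyManager.copyIn(
                        "COPY daily_measurements FROM STDIN WITH (FORMAT csv, HEADER true)",
                        reader);
                System.out.println("Loaded " + rows + " rows");
            }
        }
    }
}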

It is worth noting that SSMS is better than pgAdmin. SSMS has autocompletion and shows multiple result sets (when you run several queries you get all of their results at once, whereas in pgAdmin you only get the last one).

Anyway, right now I'm using DataGrip from JetBrains.

Processing tool

I looked at Spring Batch and Apache Spark. Spring Batch is a bit too low-level for this task, while Apache Spark also provides scalability if it is needed in the future. In any case, Spring Batch could also do the job.

As for an Apache Spark example, the code can be found in the learning-spark repository. My choice is Apache Spark.
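As a rough sketch of what such a Spark job could look like (the paths, column names and connection settings below are assumptions for illustration, not taken from the learning-spark examples):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.sum;

public class DailyCsvAggregation {
    public static void main(String[] args) {
        // Local session for the sketch; in production this would point at a cluster.
        SparkSession spark = SparkSession.builder()
                .appName("daily-csv-aggregation")
                .master("local[*]")
                .getOrCreate();

        // Read the daily CSV drop; path and columns are placeholders.
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/path/to/csv/dir/daily.csv");

        // Example aggregation: total "amount" per "customer_id" (hypothetical columns).
        Dataset<Row> aggregated = raw.groupBy("customer_id")
                .agg(sum("amount").alias("total_amount"));

        // Write the result to PostgreSQL over JDBC (requires the PostgreSQL driver on the classpath).
        aggregated.write()
                .format("jdbc")
                .option("url", "jdbc:postgresql://db-host:5432/analytics")
                .option("dbtable", "daily_totals")
                .option("user", "etl")
                .option("password", "secret")
                .mode("append")
                .save();

        spark.stop();
    }
}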

+1

You might consider the Apache Spark project. After the data is converted and validated, you can use Presto to run queries on it.
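A minimal sketch of querying such data through Presto's JDBC driver from Java (the coordinator host, catalog, schema and table name are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoQueryExample {
    public static void main(String[] args) throws Exception {
        // Presto JDBC URL format: jdbc:presto://<coordinator>:<port>/<catalog>/<schema>
        String url = "jdbc:presto://presto-coordinator:8080/postgresql/public";
        // A user name is required; no password on an unsecured cluster.
        try (Connection conn = DriverManager.getConnection(url, "etl", null);
             Statement stmt = conn.createStatement();
             // Hypothetical table produced by the aggregation step.
             ResultSet rs = stmt.executeQuery(
                     "SELECT customer_id, total_amount FROM daily_totals LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("customer_id")
                        + " -> " + rs.getDouble("total_amount"));
            }
        }
    }
}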

+1

You can use uniVocity-parsers to process the CSV as quickly as possible, since this library comes with the fastest CSV parser around. I am the author of this library, and it is open source and free (Apache 2.0 license).
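For example, parsing one of the files with uniVocity-parsers could look roughly like this (the file path is a placeholder):

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.FileReader;
import java.io.Reader;
import java.util.List;

public class UnivocityCsvExample {
    public static void main(String[] args) throws Exception {
        CsvParserSettings settings = new CsvParserSettings();
        // Treat the first row as a header instead of data.
        settings.setHeaderExtractionEnabled(true);

        CsvParser parser = new CsvParser(settings);

        // Path is a placeholder for one of the daily CSV drops.
        try (Reader reader = new FileReader("/path/to/csv/dir/daily.csv")) {
            List<String[]> rows = parser.parseAll(reader);
            System.out.println("Parsed " + rows.size() + " rows");
        }
    }
}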

Now, for loading the data into the database, you can try our univocity framework (commercial). We use it to load huge amounts of data into databases such as SQL Server and PostgreSQL very quickly: from 25 thousand to 200 thousand rows per second, depending on the database and its configuration.

Here is a simple example of what the code to migrate from CSV would look like:

public static void main(String... args) {
    // Configure the CSV input directory
    CsvDataStoreConfiguration csv = new CsvDataStoreConfiguration("csv");
    csv.addEntitiesFromDirectory(new File("/path/to/csv/dir/"), "ISO-8859-1");

    // Grab column names from the CSV files
    csv.getDefaultEntityConfiguration().setHeaderExtractionEnabled(true);

    javax.sql.DataSource dataSource = connectToDatabaseAndGetDataSource(); // specific to your environment

    // Configure the target database
    JdbcDataStoreConfiguration database = new JdbcDataStoreConfiguration("database", dataSource);

    // Use only for PostgreSQL - its JDBC driver requires us to convert the input Strings
    // from the CSV to the correct column types.
    database.getDefaultEntityConfiguration().setParameterConversionEnabled(true);

    DataIntegrationEngine engine = Univocity.getEngine(new EngineConfiguration(csv, database));

    // Create a mapping between data stores "csv" and "database"
    DataStoreMapping mapping = engine.map(csv, database);

    // If the names of the CSV files and their columns match the database tables and their columns,
    // we can detect the mappings from one to the other automatically.
    mapping.autodetectMappings();

    // Load the database.
    engine.executeCycle();
}

To improve performance, the framework lets you manage the database schema and perform operations such as dropping constraints and indexes, loading the data, and re-creating them afterwards. Data and schema transformations are also very well supported if you need them.

Hope this helps.

+1

Pentaho Data Integration (or a similar ETL tool) can handle importing the data into an SQL database and can perform aggregation on the fly. PDI has a community edition and can run standalone or be driven through its Java API.
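A rough sketch of kicking off a PDI transformation from Java through the Kettle API (the .ktr file is assumed to have been designed beforehand in Spoon; the path is a placeholder):

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunPdiTransformation {
    public static void main(String[] args) throws Exception {
        // Initialise the Kettle/PDI engine (loads core plugins and steps).
        KettleEnvironment.init();

        // Placeholder path to a transformation that loads and aggregates the CSV files.
        TransMeta transMeta = new TransMeta("/path/to/load_and_aggregate.ktr");
        Trans trans = new Trans(transMeta);

        // Run the transformation and wait for it to finish.
        trans.execute(null);
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            throw new IllegalStateException("Transformation finished with errors");
        }
    }
}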

0
