OK. After spending some time on this problem (reading, consulting, experimenting, and building several PoCs), I came up with the following solution.
TL;DR
Database: PostgreSQL, as it handles CSV well and is free and open source.
Tool: Apache Spark, as it is well suited to such tasks and performs well.
DB
The database is an important decision: what to choose, and how it will cope in the future with this much data. It should definitely be a separate server instance, so that it neither adds load to the main database instance nor blocks other applications.
NoSQL
I was thinking about using Cassandra here, but that solution would be too complicated right now. Cassandra does not support ad hoc queries. Its data storage layer is basically a key-value storage system. This means that you have to "model" your data around the queries you need, and not around the structure of the data itself.
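To illustrate what modeling around queries means, here is a minimal sketch in plain Python. It does not use Cassandra or its driver: the dictionaries only stand in for tables, and the sensor-reading data and all names are hypothetical.

```python
from collections import defaultdict

# Each dict stands in for one table; its key plays the role of the partition key.
readings_by_sensor = defaultdict(list)  # serves: "all readings for sensor X"
readings_by_day = defaultdict(list)     # serves: "all readings on day Y"


def save_reading(sensor_id, day, value):
    """Write the same row once per query it has to serve (denormalization)."""
    row = {"sensor_id": sensor_id, "day": day, "value": value}
    readings_by_sensor[sensor_id].append(row)
    readings_by_day[day].append(row)


save_reading("s1", "2020-01-01", 20.5)
save_reading("s1", "2020-01-02", 21.0)
save_reading("s2", "2020-01-01", 19.0)

# Both lookups are single key reads; neither needs a scan or a join.
s1_values = [r["value"] for r in readings_by_sensor["s1"]]
day_sensors = [r["sensor_id"] for r in readings_by_day["2020-01-01"]]
```

A question that was not planned for, such as "all readings above 20", would need yet another table or a pass over all the data, which is the cost of this model.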
RDBMS
I did not want to over-engineer here, so this is where I settled.
MS SQL Server
This is one way to go, but the big drawback here is the price. It is quite expensive: the Enterprise edition costs a lot of money given our hardware. Regarding pricing, you can read this policy document.
Another drawback was CSV support. CSV files will be our main data source, and MS SQL Server can neither import nor export CSV reliably.
It fails with an error on such files because it does not understand quoting or escaping. More details about this comparison can be found in the article PostgreSQL and MS SQL Server.
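To show why quoting and escaping matter, here is a small sketch using Python's standard csv module. It says nothing about how either database is implemented; it only contrasts a naive comma split with a parser that honors quotes. The sample row is made up.

```python
import csv
import io

# One header line and one data row; the row has a comma inside a quoted
# field and a doubled quote ("") used as an escaped quote character.
raw = 'id,name,comment\n1,"Smith, John","said ""hi"""\n'

# Naive approach: split the data row on every comma.
naive = raw.splitlines()[1].split(",")

# Quote-aware approach: let a CSV parser handle quoting and escaping.
parsed = list(csv.reader(io.StringIO(raw)))[1]

print(len(naive), naive)    # the quoted comma splits one field in two
print(len(parsed), parsed)  # three fields, quotes resolved
```

An importer that behaves like the naive split sees four columns where the header declares three, which is the kind of mismatch that ends in an error.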
PostgreSQL
This database is a mature, battle-tested product. I have heard many positive reviews of it from others (of course, there are some tradeoffs). It has a more classical SQL syntax and good CSV support, and it is open source.
It is worth noting that SSMS is better than pgAdmin. SSMS has autocomplete and multiple result sets (when you run several queries, you get all of the results at once, whereas in pgAdmin you get only the last one).
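As a rough idea of what CSV loading looks like in PostgreSQL, here is a sketch built on its COPY command, driven from Python. It assumes the third-party psycopg2 driver; the helper names, table, columns and connection string are all hypothetical, and this is not necessarily how the import was done in this project.

```python
import re

# Conservative check so table/column names can be placed in the SQL text.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_copy_sql(table, columns):
    """Build a COPY statement that reads CSV with a header row from STDIN."""
    for name in [table, *columns]:
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    cols = ", ".join(columns)
    return f"COPY {table} ({cols}) FROM STDIN WITH (FORMAT csv, HEADER true)"


def load_csv(dsn, table, columns, path):
    """Stream a CSV file into a table (assumes psycopg2 is installed)."""
    import psycopg2  # third-party driver, imported lazily

    conn = psycopg2.connect(dsn)
    try:
        # "with conn" commits on success and rolls back on error.
        with conn, conn.cursor() as cur, open(path, newline="", encoding="utf-8") as f:
            cur.copy_expert(build_copy_sql(table, columns), f)
    finally:
        conn.close()
```

A call might look like `load_csv("dbname=test", "events", ["id", "name"], "events.csv")`, with every argument replaced by real values.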
Anyway, right now I'm using DataGrip from JetBrains.
Processing tool
I looked at Spring Batch and Apache Spark. Spring Batch is too low-level for this task, whereas Apache Spark also provides scalability if it is needed in the future. That said, Spring Batch can also do the job.
As for an Apache Spark example, the code can be found in learning-spark. My choice is Apache Spark.
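For a sense of what such a job could look like, here is a minimal PySpark sketch that reads a CSV file and appends it to a PostgreSQL table over JDBC. It is not the code from the repository mentioned above. The function names, paths and credentials are hypothetical, and it assumes that pyspark is installed and that the PostgreSQL JDBC driver jar is on Spark's classpath.

```python
def build_jdbc_url(host, port, database):
    """Build a PostgreSQL JDBC URL of the form jdbc:postgresql://host:port/db."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def csv_to_postgres(csv_path, jdbc_url, table, user, password):
    """Read a CSV with a header row and append its rows to a table."""
    from pyspark.sql import SparkSession  # third-party, imported lazily

    spark = SparkSession.builder.appName("csv-to-postgres").getOrCreate()
    try:
        # inferSchema makes Spark take an extra pass to guess column types.
        df = spark.read.csv(csv_path, header=True, inferSchema=True)
        df.write.jdbc(
            url=jdbc_url,
            table=table,
            mode="append",
            properties={
                "user": user,
                "password": password,
                "driver": "org.postgresql.Driver",
            },
        )
    finally:
        spark.stop()
```

With real values this would be called as `csv_to_postgres("data.csv", build_jdbc_url("localhost", 5432, "test"), "events", "user", "secret")`.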