How can I store extremely large amounts of traffic data for easy retrieval?

For a traffic accounting system I need to store a large number of records about Internet packets sent through our gateway router (containing a timestamp, user ID, destination IP, source IP, number of bytes, and so on).

The data needs to be kept for some time, at least several days, and it must be easy to search.

What is a good way to do this? I already have some ideas:

  • Create a file per user per day and append each record to it.

    • Benefit: possibly very fast, and the data is easy to find given a consistent file format.
    • Disadvantage: it's not easy to see, for example, all UDP traffic across all users.
  • Use a database.

    • Benefit: it is very easy to find specific data with the right SQL query.
    • Disadvantage: I'm not sure whether there is a database engine that can efficiently handle a table with perhaps hundreds of millions of records.
  • Perhaps it is possible to combine the two approaches: use one SQLite database file per user (a rough sketch follows this list).

    • Advantage: it would be easy to get information about a single user with SQL queries against their file.
    • Disadvantage: getting overall information across users would still be difficult.
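
To make the third idea concrete, here is a rough sketch of what one per-user SQLite file could look like (all directory, table, and column names are just placeholders I made up for illustration):

```python
# Sketch of idea 3: one SQLite file per user with a simple packet-log table.
# All names (directory layout, table, columns) are made up for illustration.
import sqlite3
from pathlib import Path

def open_user_db(user_id: str, base_dir: str = "traffic_db") -> sqlite3.Connection:
    Path(base_dir).mkdir(exist_ok=True)
    conn = sqlite3.connect(Path(base_dir) / f"{user_id}.sqlite")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS packets (
            ts       INTEGER NOT NULL,   -- Unix timestamp
            src_ip   TEXT    NOT NULL,
            dst_ip   TEXT    NOT NULL,
            protocol TEXT    NOT NULL,   -- e.g. 'TCP', 'UDP'
            bytes    INTEGER NOT NULL
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_packets_ts ON packets(ts)")
    return conn

# Per-user queries are easy; cross-user queries (e.g. all UDP traffic) would
# require opening and aggregating over every file -- the drawback noted above.
conn = open_user_db("alice")
with conn:  # the connection context manager commits the insert
    conn.execute(
        "INSERT INTO packets VALUES (1700000000, '10.0.0.5', '93.184.216.34', 'UDP', 512)"
    )
conn.close()
```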

But maybe someone has a better idea?

Thank you very much in advance.

+6
database sqlite storage
3 answers

First of all, before you do anything else, read The Data Warehouse Toolkit.

You are doing data warehousing work, so you need to tackle it as a data warehousing problem. You will want to read up on the proper design patterns for this kind of thing.

[Note: data warehouse does not mean insanely big, expensive, or complex. It means star schema and smart ways to handle large volumes of data that are never updated.]

  • SQL databases are slow, but that slowness buys you flexible searching.

  • The file system is fast. It's a terrible thing to update, but you are not updating, you are just accumulating.

A typical DW approach for this is the following:

  • Define a "star schema" for your data: the measurable facts and the attributes ("dimensions") of those facts. Your fact looks like it is the number of bytes. Everything else (address, timestamp, user ID, etc.) is a dimension of that fact.

  • Build the dimension data in a master dimension database. It is relatively small (IP addresses, users, a date dimension, etc.). Each dimension will have all the attributes you might ever want to know about. It grows; people are always adding attributes to dimensions.

  • Create a load process that takes your logs, resolves the dimensions (times, addresses, users, etc.) and merges the dimension keys in with the measures (byte counts). This may update a dimension to add a new user or a new address. Generally you read fact rows, do lookups, and write fact rows that have all the proper FKs attached to them (a rough sketch follows this list).

  • Save these load files on disk. These files are never updated; they just accumulate. Use a simple notation like CSV so you can easily bulk-load them.
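
As a rough illustration of that load step (just a minimal sketch; every table, column, and file name below is an assumption invented for this example):

```python
# Minimal sketch of the load step: resolve dimension keys (users, addresses)
# and append fact rows with FKs to a CSV load file. All names are invented.
import csv
import sqlite3

dims = sqlite3.connect("dimensions.sqlite")  # small master dimension database
dims.executescript("""
    CREATE TABLE IF NOT EXISTS dim_user    (user_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS dim_address (addr_id INTEGER PRIMARY KEY, ip   TEXT UNIQUE);
""")

def dim_key(table: str, col: str, value: str) -> int:
    """Look up a dimension row, inserting it first if it is new."""
    dims.execute(f"INSERT OR IGNORE INTO {table} ({col}) VALUES (?)", (value,))
    (key,) = dims.execute(f"SELECT rowid FROM {table} WHERE {col} = ?", (value,)).fetchone()
    return key

def load_log(log_rows, out_path="facts_2024-01-01.csv"):
    """Turn raw log rows into fact rows: dimension FKs plus the byte count."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ts", "user_fk", "src_addr_fk", "dst_addr_fk", "bytes"])
        for ts, user, src_ip, dst_ip, nbytes in log_rows:
            writer.writerow([ts,
                             dim_key("dim_user", "name", user),
                             dim_key("dim_address", "ip", src_ip),
                             dim_key("dim_address", "ip", dst_ip),
                             nbytes])
    dims.commit()

load_log([(1700000000, "alice", "10.0.0.5", "93.184.216.34", 512)])
```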

When someone wants to do an analysis, build them a datamart.

For the selected IP address or time frame or whatever, pull all the relevant facts plus the associated master dimension data and bulk-load them into a datamart.
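
For example, a throwaway datamart could be built by bulk-loading the relevant fact file(s) together with the master dimension data. Again, this is only a sketch reusing the invented names from the example above:

```python
# Sketch: build a throwaway datamart for one analysis by bulk-loading the
# relevant fact file plus the master dimension data. Names are assumptions.
import csv
import sqlite3

mart = sqlite3.connect("datamart.sqlite")
mart.executescript("""
    CREATE TABLE fact_traffic (ts INTEGER, user_fk INTEGER, src_addr_fk INTEGER,
                               dst_addr_fk INTEGER, bytes INTEGER);
""")

# Copy the (small) dimension tables wholesale from the master dimension DB.
mart.execute("ATTACH DATABASE 'dimensions.sqlite' AS dims")
mart.executescript("""
    CREATE TABLE dim_user    AS SELECT * FROM dims.dim_user;
    CREATE TABLE dim_address AS SELECT * FROM dims.dim_address;
""")

# Bulk-load only the fact files that cover the requested time frame.
for path in ["facts_2024-01-01.csv"]:
    with open(path, newline="") as f:
        rows = ((r["ts"], r["user_fk"], r["src_addr_fk"], r["dst_addr_fk"], r["bytes"])
                for r in csv.DictReader(f))
        mart.executemany("INSERT INTO fact_traffic VALUES (?, ?, ?, ?, ?)", rows)
mart.commit()
```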

You can run all the SQL queries you want against this mart. Most queries will devolve into SELECT COUNT(*) and SELECT SUM(*) with various GROUP BY, HAVING, and WHERE clauses.
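
Against the mart sketched above, those queries might look like this (names still invented):

```python
# Typical mart queries: counts and sums with GROUP BY / HAVING / WHERE,
# run against the datamart sketched above. All names are assumptions.
import sqlite3

mart = sqlite3.connect("datamart.sqlite")
rows = mart.execute("""
    SELECT u.name,
           COUNT(*)     AS packets,
           SUM(f.bytes) AS total_bytes
    FROM fact_traffic AS f
    JOIN dim_user     AS u ON u.user_id = f.user_fk
    WHERE f.ts BETWEEN 1700000000 AND 1700086400
    GROUP BY u.name
    HAVING SUM(f.bytes) > 0
    ORDER BY total_bytes DESC
""").fetchall()
for name, packets, total_bytes in rows:
    print(name, packets, total_bytes)
```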

+4

I think the right answer really depends on the definition of a "data set". As you mention in your question, you are storing several pieces of information for each record: timestamp, user ID, destination IP, source IP, number of bytes, etc.

SQL Server is perfectly capable of handling this kind of data store with hundreds of millions of records without any real difficulty. Granted, this kind of logging will need some good hardware behind it, but it shouldn't be too complex.

Any other solution, in my opinion, is going to make reporting very difficult, and from the sounds of it that is an important requirement.
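
As a rough, engine-agnostic sketch of the kind of table this means (the answer above refers to SQL Server; SQLite is used here only to keep the snippet self-contained, and every name is invented), the key point is indexes that match your common report filters:

```python
# Sketch of the "one big indexed table" approach. The answer above refers to
# SQL Server; SQLite is used here only to keep the example self-contained.
# All table, column, and index names are invented for illustration.
import sqlite3

db = sqlite3.connect("traffic_log.sqlite")
db.executescript("""
    CREATE TABLE IF NOT EXISTS traffic_log (
        ts       INTEGER NOT NULL,
        user_id  TEXT    NOT NULL,
        src_ip   TEXT    NOT NULL,
        dst_ip   TEXT    NOT NULL,
        protocol TEXT    NOT NULL,
        bytes    INTEGER NOT NULL
    );
    -- Reporting over hundreds of millions of rows hinges on indexes that
    -- match the common filters (time ranges, per-user reports).
    CREATE INDEX IF NOT EXISTS idx_log_ts      ON traffic_log(ts);
    CREATE INDEX IF NOT EXISTS idx_log_user_ts ON traffic_log(user_id, ts);
""")
```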

0

So you are in one of those cases where you have much more write activity than reads, you want your writes not to block you, and you want your reads to be "fast enough" but not critical. This is the typical business intelligence use case.

You should probably use a database and store your data in a "denormalized" schema to avoid complex joins and multiple inserts per record. Think of your table as a huge log file.

In this case, some of the "new and fancy" NoSQL databases are probably what you are looking for: they offer relaxed ACID guarantees, which you shouldn't mind too much here (in case of a crash you might lose the last lines of your log), but they perform much better on inserts because they don't sync the journal to disk on every transaction.
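
As a rough illustration of the same trade-off in a relational setting (this is not a NoSQL example; it only shows the denormalized "log file" table with batched inserts and deliberately relaxed durability, and every name is invented):

```python
# Illustration of the trade-off described above, using SQLite rather than a
# NoSQL store: one denormalized "log file" table, inserts batched into large
# transactions, and durability deliberately relaxed so the writer does not
# wait for a disk sync on every record. All names are invented.
import sqlite3

db = sqlite3.connect("packet_log.sqlite")
db.execute("PRAGMA synchronous = OFF")   # trade durability for insert speed
db.execute("PRAGMA journal_mode = WAL")
db.execute("""
    CREATE TABLE IF NOT EXISTS packet_log (
        ts INTEGER, user_id TEXT, src_ip TEXT, dst_ip TEXT,
        protocol TEXT, bytes INTEGER
    )
""")

def append_batch(records):
    """Insert many records in one transaction instead of one sync per row."""
    with db:  # single transaction for the whole batch
        db.executemany("INSERT INTO packet_log VALUES (?, ?, ?, ?, ?, ?)", records)

append_batch([(1700000000, "alice", "10.0.0.5", "93.184.216.34", "UDP", 512)])
```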

0
