Daily aggregation over millions of rows, taking the client's time zone into account

Suppose I have a table that stores information about site visitors, with the following fields:

  • ID
  • visitor_id
  • visit_time (stored as milliseconds in UTC since '1970-01-01 00:00:00')

The table already contains millions of rows, and it keeps growing.
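
For concreteness, here is a sketch of such a table (SQLite through Python; the DDL and names are my own illustration, since only the three fields above are given):

    import sqlite3

    # Hypothetical DDL matching the fields described above.
    conn = sqlite3.connect("visits.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS visits (
            id         INTEGER PRIMARY KEY,
            visitor_id INTEGER NOT NULL,
            visit_time INTEGER NOT NULL  -- ms since 1970-01-01 00:00:00 UTC
        );
        CREATE INDEX IF NOT EXISTS idx_visits_time ON visits (visit_time);
    """)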

If I want to see a report of visitors per day in an arbitrary time zone, one possible solution is:

Solution No. 1:

  • Get the time zone of the report viewer (the client)
  • Aggregate the data from this table, applying the client’s time zone
  • Show the result per day (see the sketch below)
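
A minimal sketch of Solution No. 1, assuming the table sketched above (the zone name is just an example). Note that it touches every raw row, which is exactly where the performance cost comes from:

    from collections import Counter
    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    def visits_per_local_day(conn, tz_name):
        # Bucket raw visits by calendar day in the report viewer's zone.
        tz = ZoneInfo(tz_name)
        per_day = Counter()
        for (ms,) in conn.execute("SELECT visit_time FROM visits"):
            utc = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
            per_day[utc.astimezone(tz).date()] += 1
        return sorted(per_day.items())

    # visits_per_local_day(conn, "Asia/Kolkata")  # a UTC+05:30 viewer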

But in this case performance will suffer, since every report must process the raw rows. Another possible solution:

Solution No. 2:

  • Use pre-aggregated / summary tables that ignore the client’s time zone (see the sketch below)
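
And a matching sketch of Solution No. 2 (again my own illustration): collapse the raw rows into one row per UTC day once, then serve every report from the small summary table. Fast, but the day boundaries are UTC midnights rather than the client's:

    MS_PER_DAY = 24 * 60 * 60 * 1000

    def build_daily_utc_rollup(conn):
        # One summary row per UTC day; reports then read this tiny table
        # instead of the millions of raw rows.
        conn.execute("""
            CREATE TABLE IF NOT EXISTS visits_daily_utc (
                day_start_ms INTEGER PRIMARY KEY,  -- UTC midnight, epoch ms
                visit_count  INTEGER NOT NULL
            )
        """)
        conn.execute("""
            INSERT OR REPLACE INTO visits_daily_utc
            SELECT (visit_time / ?) * ?, COUNT(*)
            FROM visits
            GROUP BY visit_time / ?
        """, (MS_PER_DAY, MS_PER_DAY, MS_PER_DAY))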

But either way, there is a trade-off between performance and correctness.

Solution No. 1 gives correctness; Solution No. 2 gives the best performance.

What is the best practice in this particular scenario?

1 answer

The problem of handling time comes up in quantity as soon as you get into distributed systems, users in different places, and mapping events between different data sources.

I strongly recommend making sure all of your logging systems use UTC. That lets you bring together data from servers (which, one hopes, agree on the current UTC time) located anywhere in the world.
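
In terms of the schema in the question, that recommendation is already satisfied: epoch milliseconds carry no zone at all. A one-line sketch of writing such a timestamp:

    import time

    # Epoch milliseconds are UTC by construction; no zone is attached.
    visit_time = int(time.time() * 1000)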

Then, as requests come in, you can convert from the user's time zone to UTC. At that point you face the same choice as above: run the query in real time, or perhaps hit some previously summarized data.
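
A sketch of that conversion, assuming Python's standard zoneinfo: turn one calendar day in the viewer's zone into a UTC millisecond range that can be compared against visit_time directly. The zone and date are illustrative:

    from datetime import date, datetime, time, timedelta
    from zoneinfo import ZoneInfo

    def local_day_to_utc_ms(day, tz_name):
        # Map one local calendar day to a half-open [start, end) epoch-ms range.
        tz = ZoneInfo(tz_name)
        start = datetime.combine(day, time.min, tzinfo=tz)
        end = datetime.combine(day + timedelta(days=1), time.min, tzinfo=tz)
        return int(start.timestamp() * 1000), int(end.timestamp() * 1000)

    # local_day_to_utc_ms(date(2020, 1, 15), "Asia/Kolkata")
    # -> the range starting at 2020-01-14 18:30:00 UTC (a +05:30 zone)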

Whether you want to pre-aggregate the data will depend on many things: whether it lets you reduce the volume of stored data, whether it cuts the processing needed to serve the queries, how often the queries will run, and even the cost of building such a system compared to how much use it will actually get.

As far as best practices go: keep display concerns (like the time zone) separate from the data processing.

If you have not already done so, consider the lifetime of the data you store. Do you need data from ten years ago? Hopefully not. Do you have a strategy for discarding old data once it is no longer needed? Do you know how much data you will accumulate if you keep every record (estimate it under several different traffic growth rates)?
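
A back-of-the-envelope sketch of that last estimate (every figure here is an assumption, not a number from the question):

    ROW_BYTES = 50  # assumed average on-disk size per visit row, index included

    def rows_after(years, visits_per_day, annual_growth):
        # Compound the daily traffic once per year and sum the rows written.
        total = 0
        for _ in range(years):
            total += visits_per_day * 365
            visits_per_day *= 1 + annual_growth
        return total

    for growth in (0.0, 0.25, 1.0):  # flat, +25% per year, doubling yearly
        rows = rows_after(5, 100_000, growth)
        print(f"growth {growth:.0%}: {rows:,.0f} rows, ~{rows * ROW_BYTES / 1e9:.1f} GB")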

Again, the best practice for large data sets is to understand, up front, how you will deal with their size and how you will manage the data over time as it accumulates. That may include long-term archival, disposal, or reduction to a summarized form.
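
One summarized form that fits this question particularly well (my suggestion, blending the asker's two solutions): roll up per UTC hour rather than per UTC day. The summary stays tiny compared to the raw rows, yet its buckets can be regrouped into local days for any whole-hour-offset zone; zones offset by 30 or 45 minutes would need 15-minute buckets instead. A sketch:

    from collections import Counter
    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    MS_PER_HOUR = 60 * 60 * 1000

    def build_hourly_rollup(conn):
        # Same idea as the daily rollup above, but one row per UTC hour.
        conn.execute("""
            CREATE TABLE IF NOT EXISTS visits_hourly (
                hour_start_ms INTEGER PRIMARY KEY,  -- UTC hour start, epoch ms
                visit_count   INTEGER NOT NULL
            )
        """)
        conn.execute("""
            INSERT OR REPLACE INTO visits_hourly
            SELECT (visit_time / ?) * ?, COUNT(*)
            FROM visits
            GROUP BY visit_time / ?
        """, (MS_PER_HOUR, MS_PER_HOUR, MS_PER_HOUR))

    def hourly_visits_per_local_day(conn, tz_name):
        # Regroup the small hourly table into the viewer's local days.
        tz, per_day = ZoneInfo(tz_name), Counter()
        rows = conn.execute("SELECT hour_start_ms, visit_count FROM visits_hourly")
        for ms, n in rows:
            local = datetime.fromtimestamp(ms / 1000, tz=timezone.utc).astimezone(tz)
            per_day[local.date()] += n
        return sorted(per_day.items())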

Oh, and to slip in a Matrix analogy: what is really going to bake your noodle about "correctness" is that correctness is not actually the problem here. Each time zone sees a different shape of traffic over its own local "day", and every one of those views is "correct". Even the time zones whose offset from yours is not a whole number of hours.
