Design - how to handle timestamps (storage) when performing calculations; python

Question

I am trying to determine how best to store my data and work with it, since my application deals with a lot of data from different sources, in different time zones, formats, and so on.

For example, should everything be stored as UTC? That means that when I ingest data, I need to determine what time zone it is in and, if it is not UTC, convert it. (Note that I'm in EST.)

Then, when I perform calculations on the data, should I pull it out (as UTC) and convert it to my time zone (EST) so that it makes sense when I look at it? Or should I keep it in UTC and do all my calculations in UTC?

Much of this data is time series and will be plotted, and the graphs will be in EST.

This is a Python project, so let's say I have a data structure that looks like this:

    "id1": {
        "interval": 60,  <-- seconds, subDict['interval']
        "last": "2013-01-29 02:11:11.151996+00:00"  <-- UTC, subDict['last']
    },

And I need to operate on this, determining whether the current time (now()) is greater than last + interval (that is, whether 60 seconds have passed). So in the code:

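    import datetime
    import dateutil.parser
    from dateutil import tz

    # Parse the stored UTC string, then compare against an aware "now" in UTC.
    lastTime = dateutil.parser.parse(subDict['last'])
    utcNow = datetime.datetime.now(tz.tzutc())
    if lastTime + datetime.timedelta(seconds=subDict['interval']) < utcNow:
        print("Time elapsed, do something!")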

Does this make sense? I am working with UTC everywhere, both in storage and in computation ...

In addition, if anyone has links to good write-ups on how to work with timestamps in software, I would love to read them. Perhaps something like a Joel on Software article on using timestamps in applications?

+4

python design-patterns datetime architecture system-design

mr-sk Jan 30 '13 at 3:39

4 answers

Since, as far as I can see, you don't have a problem with the implementation, I will focus on the design aspects rather than on the code and the timestamp format. I took part in developing the network support for a navigation system implemented as a distributed system on a local network. The nature of that system is such that a lot of data (often conflicting) comes from different sources, so resolving potential conflicts and maintaining data integrity is rather tricky. Here are some thoughts based on that experience.

Timekeeping, even in a distributed system involving many computers, is usually not a problem as long as you do not need a finer resolution than the system time functions provide or more accurate time synchronization than your OS components provide.

In the simplest case, using UTC is quite reasonable, and for most tasks it is sufficient. However, it is important to understand the purpose of timestamps in your system from the very beginning of the design. Time values (whether Unix time or formatted UTC strings) can sometimes be equal. If you need to resolve data conflicts based on timestamps (that is, to always choose the newer (or older) value among several received from different sources), you need to understand whether an incorrectly resolved conflict (which usually means a conflict that can be resolved in more than one way because the timestamps are equal) is a fatal problem for your system design or not. The possible options are:

If 99.99% of the conflicts are resolved the same way on all nodes, and you do not care about the remaining 0.01% because they do not break data integrity, you can safely continue to use something like UTC.
If strict resolution of all conflicts is a must, you have to design your own timestamping system. Timestamps may include a time (possibly not the system time but some higher-resolution timer), a sequence number (so that timestamps stay unique even when the time resolution is not enough), and a node identifier (so that different nodes of your system generate completely unique timestamps).
Finally, you may not need time-based timestamps at all. Do you really need to be able to calculate the time difference between a pair of timestamps? Isn't it enough to be able to order timestamps, without tying them to real-time moments? If you need only comparisons and no time calculations, timestamps based on sequential counters rather than on real time are a good choice (see Lamport timestamps for more details).
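
To illustrate the counter-based option, here is a minimal sketch of a Lamport-style logical clock. The class and method names (`LamportClock`, `tick`, `receive`) are made up for this example; it only shows the ordering rule, not a complete messaging layer.

```python
class LamportClock:
    """Logical clock: orders events without referring to wall-clock time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event (or before sending a message)."""
        self.time += 1
        return self.time

    def receive(self, remote_time):
        """Merge the counter carried by an incoming message, then advance."""
        self.time = max(self.time, remote_time) + 1
        return self.time


node_a = LamportClock()
node_b = LamportClock()

sent = node_a.tick()             # node A stamps an outgoing message: 1
received = node_b.receive(sent)  # node B receives it: 2
reply = node_b.tick()            # node B stamps its reply: 3
final = node_a.receive(reply)    # node A receives the reply: 4
```

The counters say nothing about how much real time passed between events; they only guarantee that an event caused by another one gets a larger stamp.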

If you need strict conflict resolution or a very high time resolution, you will probably have to write your own timestamp service.
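
As a rough sketch of what such a service could hand out, the following combines a time, a sequence number, and a node identifier into one comparable value. The names `Stamp` and `StampSource` are hypothetical, and using `time.time_ns` as the timer is an assumption; a real system might use a different clock.

```python
import time
from typing import NamedTuple


class Stamp(NamedTuple):
    """Composite timestamp; tuples compare field by field, left to right."""
    wall_ns: int   # time component, in nanoseconds
    seq: int       # tie-breaker when the time component did not advance
    node_id: str   # tie-breaker between different nodes


class StampSource:
    def __init__(self, node_id, clock=time.time_ns):
        self.node_id = node_id
        self._clock = clock
        self._last_ns = 0
        self._seq = 0

    def next(self):
        now = self._clock()
        if now > self._last_ns:
            self._last_ns = now
            self._seq = 0
        else:
            # The clock stood still or went backwards: reuse the last time
            # and bump the sequence number so stamps stay strictly increasing.
            self._seq += 1
        return Stamp(self._last_ns, self._seq, self.node_id)
```

Two stamps from the same source are never equal, and two stamps from different nodes differ at least in `node_id` (assuming node identifiers are unique), so every conflict has exactly one winner.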

Many ideas and hints can be borrowed from A. Tanenbaum’s book "Distributed Systems: Principles and Paradigms". When I faced such problems, it helped me a lot, and it has a separate chapter on timestamp generation.

+2

Ellioh Jan 30 '13 at 6:09

I think the best approach is to store all timestamp data as UTC. When you read data in, immediately convert it to UTC; right before displaying it, convert from UTC to your local time zone.

You might even want your code to print all timestamps twice, once in local time and once in UTC; it depends on how much data you need to fit on the screen at once.

I am a big fan of the RFC 3339 timestamp format. It is unambiguous to both humans and machines. Best of all, almost nothing in it is optional, so it always looks the same:

 2013-01-29T19:46:00.00-08:00

I prefer to convert timestamps to single float values for storage and calculations, and then convert back to a date-and-time format for display. I wouldn't keep money in floats, but timestamp values are well within the precision of float values!
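
As a quick check of the precision claim, this sketch asks how far apart adjacent float values are near a 2013-era epoch timestamp. The sample value is only an illustration, and `math.ulp` needs Python 3.9 or later.

```python
import math

ts = 1359521160.62           # an epoch timestamp from January 2013
resolution = math.ulp(ts)    # gap between ts and the next representable float

# For values in this range the gap is 2**-22 seconds, about 0.24 microseconds.
sub_microsecond = resolution < 1e-6
```

So a double-precision float keeps better than microsecond resolution for present-day timestamps, which is usually finer than the system clock is accurate.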

Working with time as floats makes much of the code very simple:

    if time_now() >= last_time + interval:
        print("interval has elapsed")

It looks like you are already doing it pretty much this way, so I can't suggest any dramatic improvements.

I wrote some library functions to parse timestamps into Python float time values, and to convert float time values back to timestamp strings. Maybe something in here will be useful to you:

http://home.blarg.net/~steveha/pyfeed.html

I suggest you look at feed.date.rfc3339. It is BSD-licensed, so you can just use the code if you want.

EDIT: Question: how does this help with time zones?

Answer: If every timestamp you store is in UTC as a Python float time value (the number of seconds since the epoch, with an optional fractional part), you can compare them directly, subtract one from another to find the interval between them, and so on. If you use RFC 3339 timestamps, then every timestamp string carries its time zone right in the string, so your code can convert it to UTC correctly. If you convert from a float value to a timestamp string right before display, the time zone will be correct for local time.
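
A minimal sketch of that round trip using only the standard library (not the feed.date.rfc3339 module mentioned above) could look like this. The function names are hypothetical, and `datetime.fromisoformat` accepts only a subset of RFC 3339 on older Python versions, so the inputs are assumed to carry an explicit numeric offset.

```python
from datetime import datetime, timedelta, timezone


def rfc3339_to_epoch(text):
    """Parse a timestamp string with an explicit offset into epoch seconds."""
    dt = datetime.fromisoformat(text)
    if dt.tzinfo is None:
        raise ValueError("timestamp has no UTC offset: %r" % text)
    return dt.timestamp()


def epoch_to_rfc3339(seconds, tz=timezone.utc):
    """Format epoch seconds as a timestamp string in the given zone."""
    return datetime.fromtimestamp(seconds, tz).isoformat()


# The same instant written with two different offsets yields the same float.
pacific = rfc3339_to_epoch("2013-01-29T19:46:00-08:00")
utc = rfc3339_to_epoch("2013-01-30T03:46:00+00:00")

# Fixed UTC-5 offset used as a stand-in for EST (ignores daylight saving).
est = timezone(timedelta(hours=-5))
shown = epoch_to_rfc3339(pacific, est)
```

Comparisons and subtraction happen on the floats; the zone only matters at the two string boundaries.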

In addition, as I said, it looks like you are already doing it pretty much right, so I don’t think I can give any amazing advice.

+1

steveha Jan 30 '13 at 3:54

Personally, I use the Unix time standard. It is very convenient for storage because of its simple representation: it is just a number. Since it is UTC internally, you have to make sure it is generated correctly (converted from other timestamp formats) before storing it, and then format it for whatever time zone you want on output.

Once the backend data has a common, time-zone-independent timestamp format, displaying the data is very simple, since it is just a matter of setting the target TZ.

As an example:

    import time
    import datetime
    import pytz

    # print a pre-encoded date in your local time from the Unix epoch
    example = {"id1": {"interval": 60, "last": 1359521160.62}}

    # this uses your system time zone
    print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(example['id1']['last'])))

    # this uses an ISO country code to pick the target time zone
    countrytz = pytz.country_timezones['BR'][0]
    it = pytz.timezone(countrytz)
    # convert the epoch value into that zone (it.localize() on a UTC
    # datetime would relabel the time instead of converting it)
    print(datetime.datetime.fromtimestamp(example['id1']['last'], it))

+1

Martino dino Jan 30 '13 at 5:19

Austin phillips · Accepted Answer · 2013-01-30T04:31:35+0000

It seems to me that you are already doing it the right way. Users are expected to interact in their local time zone (input and output), but it is usual to store dates normalized to UTC so that they are unambiguous and calculations are simpler. So, normalize to UTC as early as possible and localize as late as possible.
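
One way to picture that rule is as three small functions at the edges of the program. This is only a sketch: the function names are made up, and EST is modelled as a fixed UTC-5 offset, which ignores daylight saving.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for the display zone.
EST = timezone(timedelta(hours=-5), "EST")


def normalize(text):
    """Input boundary: parse a string with an offset and convert it to UTC."""
    dt = datetime.fromisoformat(text)
    if dt.tzinfo is None:
        raise ValueError("refusing to guess the zone of %r" % text)
    return dt.astimezone(timezone.utc)


def is_due(last_utc, interval_seconds, now_utc):
    """Core logic: works on UTC values only."""
    return now_utc >= last_utc + timedelta(seconds=interval_seconds)


def localize(dt_utc, tz=EST):
    """Output boundary: convert to the display zone just before formatting."""
    return dt_utc.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S %Z")


last = normalize("2013-01-28 21:11:11-05:00")
due = is_due(last, 60, last + timedelta(seconds=61))
label = localize(last)
```

Everything between `normalize` and `localize` can then compare and subtract values without ever asking which zone they came from.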

A little information about time zone handling in Python can be found here:

My personal preference is to store dates as Unix timestamp tv_sec values in backend storage and convert them to Python datetime.datetime objects during processing. Processing is usually done with datetime objects in the UTC time zone, which are then converted to the user's local time zone just before output. I find that having a rich object such as datetime.datetime helps with debugging.

Time zones are a nuisance to deal with, and you will probably need to decide case by case whether it is worth the effort to support them correctly.

For example, let's say you calculate the daily bandwidth used. Some questions that may arise are as follows:

What happens at a daylight saving boundary? Should you just assume, for ease of calculation, that a day is always 24 hours, or do you need to check for every daily calculation whether that day has fewer or more hours because of the daylight saving change?
When presenting a localized time, does it matter if a time is repeated? For example, if you show an hourly report in local time without the time zone attached, will it confuse the user to have a missing hour of data or a repeated hour around the daylight saving change?
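
To make the first question concrete, here is a small sketch that measures how long a local calendar day really lasts. It assumes Python 3.9+ with the `zoneinfo` module and an available time zone database (the system one or the `tzdata` package); the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def local_day_length_hours(year, month, day, tz_name):
    """Elapsed hours between local midnight and the next local midnight."""
    tz = ZoneInfo(tz_name)
    start = datetime(year, month, day, tzinfo=tz)
    end = start + timedelta(days=1)  # wall-clock arithmetic: next local midnight
    # Subtracting two datetimes that share a tzinfo ignores the UTC offset,
    # so convert both to UTC before measuring the elapsed time.
    elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return elapsed.total_seconds() / 3600


spring = local_day_length_hours(2013, 3, 10, "America/New_York")   # 23.0
autumn = local_day_length_hours(2013, 11, 3, "America/New_York")   # 25.0
normal = local_day_length_hours(2013, 1, 29, "America/New_York")   # 24.0
```

A daily bandwidth total bucketed by local date therefore covers 23 or 25 hours on those two days, which is worth deciding about explicitly rather than discovering in a graph.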
