When building the infrastructure for one of my current projects, I ran into the problem of replacing existing HDFS files. More precisely, the setup is the following:
We have several machines (log servers) that continuously generate logs. We have a dedicated machine (the log preprocessor) that is responsible for receiving log chunks from the log servers (each chunk is about 30 minutes long and 500-800 MB), preprocessing them and uploading them to the HDFS of our Hadoop cluster.
Preliminary processing is carried out in 3 stages:
- for each log server: filter the received log chunk (in parallel); the output file is about 60-80 MB
- merge (merge-sort) all the output files from step 1 and do some additional filtering (the 30-minute files are also combined into 1-hour files)
- using the current mapping from an external database, process the file from step 2 to produce the final log file, and put this file into HDFS
The final log files are used as input for several periodic Hadoop applications that run on the Hadoop cluster. In HDFS the files are stored as:
hdfs:/spool/.../logs/YYYY-MM-DD.HH.MM.log
Description of the problem:
The mapping used in step 3 changes over time, and we need to reflect these changes by re-running step 3 and replacing the old HDFS files with new ones. This update is performed at some interval (for example, every 10-15 minutes), at least for the files of the last 12 hours. Note that when the mapping changes, the result of applying step 3 to the same input file may be significantly different (it will not be just a superset or subset of the previous result). Therefore, we need to overwrite the existing files in HDFS.
However, we cannot just use hadoop fs -rm followed by hadoop fs -copyFromLocal, because if some Hadoop application is reading the file at the moment it is deleted, that application may fail. The solution I am using adds a new file next to the old one; the files have the same name but different suffixes indicating the version of the file. Now the layout is as follows:
hdfs:/spool/.../logs/2012-09-26.09.00.log.v1
hdfs:/spool/.../logs/2012-09-26.09.00.log.v2
hdfs:/spool/.../logs/2012-09-26.09.00.log.v3
hdfs:/spool/.../logs/2012-09-26.10.00.log.v1
hdfs:/spool/.../logs/2012-09-26.10.00.log.v2
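To make the scheme more concrete, the uploader side looks roughly like this (a simplified sketch rather than my real code; the directory and file names in main are placeholders, since I cannot show the real paths):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class VersionedUploader {
        // Find the highest existing ".vN" suffix for this hourly file in HDFS
        // and upload the freshly generated local result as ".v(N+1)".
        public static void uploadNewVersion(FileSystem fs, String hdfsDir,
                                            String baseName, Path localFile) throws Exception {
            FileStatus[] existing = fs.globStatus(new Path(hdfsDir, baseName + ".v*"));
            int maxVersion = 0;
            if (existing != null) {
                for (FileStatus st : existing) {
                    String name = st.getPath().getName();    // e.g. "2012-09-26.09.00.log.v3"
                    maxVersion = Math.max(maxVersion,
                            Integer.parseInt(name.substring(name.lastIndexOf(".v") + 2)));
                }
            }
            Path target = new Path(hdfsDir, baseName + ".v" + (maxVersion + 1));
            // Old versions stay in place; note that readers can see this file while it is
            // still being copied -- that is exactly question 2 below.
            fs.copyFromLocalFile(localFile, target);
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // placeholder paths, just for illustration
            uploadNewVersion(fs, "/spool/myproject/logs", "2012-09-26.09.00.log",
                             new Path("file:///tmp/2012-09-26.09.00.log"));
        }
    }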
During its launch (configuration), each Hadoop application selects the files with the latest versions and works with them. So even if an update happens in the meantime, the application will not run into any problems, because its input file is never deleted.
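The selection logic on the application side is roughly the following (again a simplified sketch; the class and method names are mine, not part of Hadoop, and the rest of the job setup is omitted):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class LatestVersionInput {
        // At job-configuration time: for every hourly log file, keep only the
        // path with the highest ".vN" suffix and register it as job input.
        public static void addLatestVersions(Job job, String hdfsDir) throws Exception {
            FileSystem fs = FileSystem.get(job.getConfiguration());
            FileStatus[] matches = fs.globStatus(new Path(hdfsDir, "*.log.v*"));
            if (matches == null) {
                return;                                        // nothing in the directory yet
            }
            Map<String, Path> latest = new HashMap<>();        // base name -> newest path
            Map<String, Integer> latestVersion = new HashMap<>();
            for (FileStatus st : matches) {
                String name = st.getPath().getName();          // e.g. "2012-09-26.09.00.log.v3"
                if (name.startsWith("_") || name.startsWith(".")) {
                    continue;                                  // skip temporary / hidden files
                }
                int cut = name.lastIndexOf(".v");
                String base = name.substring(0, cut);
                int version = Integer.parseInt(name.substring(cut + 2));
                Integer best = latestVersion.get(base);
                if (best == null || version > best) {
                    latestVersion.put(base, version);
                    latest.put(base, st.getPath());
                }
            }
            for (Path p : latest.values()) {
                FileInputFormat.addInputPath(job, p);
            }
        }
    }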
Questions:
Do you know a simpler approach to this problem that doesn't use this complicated/ugly file versioning?
Some applications may start using an HDFS file that is still being uploaded and is therefore incomplete (the applications see the file in HDFS but don't know whether it is complete). In the case of gzip files, this can lead to errors. Could you advise how I can deal with this problem? I know that for local file systems I can do something like:
cp infile /finaldir/outfile.tmp && mv /finaldir/outfile.tmp /finaldir/outfile
This works because mv is an atomic operation; however, I am not sure whether the same holds for HDFS. Could you please advise whether HDFS supports an atomic operation like mv on ordinary local file systems?
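For reference, this is the pattern I have in mind for HDFS, assuming FileSystem.rename behaves like mv on a local file system (whether it really gives me the atomicity I need is exactly what I am asking; the paths in main are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AtomicPublish {
        // Upload under a temporary name that readers skip,
        // then rename it into its final, versioned name.
        public static void publish(FileSystem fs, Path localFile, Path finalHdfsPath) throws Exception {
            Path tmp = new Path(finalHdfsPath.getParent(), "_tmp." + finalHdfsPath.getName());
            fs.copyFromLocalFile(localFile, tmp);           // the slow part: 60-80 MB copy
            if (!fs.rename(tmp, finalHdfsPath)) {           // is this atomic on HDFS?
                throw new java.io.IOException("rename failed: " + tmp + " -> " + finalHdfsPath);
            }
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // placeholder paths, just for illustration
            publish(fs,
                    new Path("file:///tmp/2012-09-26.09.00.log"),
                    new Path("/spool/myproject/logs/2012-09-26.09.00.log.v4"));
        }
    }

As far as I know, FileInputFormat skips files whose names start with "_" or "." by default, so a temporary name like this should not be picked up by the jobs accidentally, but I would like to be sure about the rename itself.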
Thanks in advance!