When building the infrastructure for one of my current projects, I ran into the problem of replacing existing HDFS files. More precisely, the setup is the following:
We have several machines (log servers) that continuously generate logs. We have a dedicated machine (the log preprocessor) that is responsible for receiving log chunks from the log servers (each chunk is about 30 minutes long and 500-800 MB), preprocessing them and uploading them to the HDFS of our Hadoop cluster.
Preliminary processing is carried out in 3 stages:
- for each log server: filter the received log chunk (in parallel); the output file is about 60-80 MB
- merge (merge-sort) all the output files from step 1 and do some additional filtering (the 30-minute files are also combined into 1-hour files)
- using the current mapping from an external database, process the file from step 2 to produce the final log file, and put this file into HDFS
The final log files are used as input for several periodic Hadoop applications that run on the Hadoop cluster. In HDFS the files are stored as:
hdfs:/spool/.../logs/YYYY-MM-DD.HH.MM.log
Description of the problem:
The mapping used in step 3 changes over time, and we need to reflect these changes by re-running step 3 and replacing the old HDFS files with new ones. This update is performed at some interval (for example, every 10-15 minutes), at least for the files of the last 12 hours. Note that when the mapping changes, the result of applying step 3 to the same input file may be significantly different (it will not be just a superset or subset of the previous result). Therefore, we need to overwrite the existing files in HDFS.
However, we cannot just use hadoop fs -rm followed by hadoop fs -copyFromLocal, because if some Hadoop application is reading the file at the moment it is deleted, that application may fail. The solution I am using adds a new file next to the old one; the files have the same name but different suffixes indicating the version of the file. Now the layout is as follows:
hdfs:/spool/.../logs/2012-09-26.09.00.log.v1
hdfs:/spool/.../logs/2012-09-26.09.00.log.v2
hdfs:/spool/.../logs/2012-09-26.09.00.log.v3
hdfs:/spool/.../logs/2012-09-26.10.00.log.v1
hdfs:/spool/.../logs/2012-09-26.10.00.log.v2
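To make the scheme more concrete, the uploader side looks roughly like this (a simplified sketch rather than my real code; the directory and file names in main are placeholders, since I cannot show the real paths):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class VersionedUploader {
        // Find the highest existing ".vN" suffix for this hourly file in HDFS
        // and upload the freshly generated local result as ".v(N+1)".
        public static void uploadNewVersion(FileSystem fs, String hdfsDir,
                                            String baseName, Path localFile) throws Exception {
            FileStatus[] existing = fs.globStatus(new Path(hdfsDir, baseName + ".v*"));
            int maxVersion = 0;
            if (existing != null) {
                for (FileStatus st : existing) {
                    String name = st.getPath().getName();    // e.g. "2012-09-26.09.00.log.v3"
                    maxVersion = Math.max(maxVersion,
                            Integer.parseInt(name.substring(name.lastIndexOf(".v") + 2)));
                }
            }
            Path target = new Path(hdfsDir, baseName + ".v" + (maxVersion + 1));
            // Old versions stay in place; note that readers can see this file while it is
            // still being copied -- that is exactly question 2 below.
            fs.copyFromLocalFile(localFile, target);
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // placeholder paths, just for illustration
            uploadNewVersion(fs, "/spool/myproject/logs", "2012-09-26.09.00.log",
                             new Path("file:///tmp/2012-09-26.09.00.log"));
        }
    }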
During its launch (configuration), each Hadoop application selects the files with the latest versions and works with them. So even if an update happens in the meantime, the application will not run into any problems, because its input file is never deleted.
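The selection logic on the application side is roughly the following (again a simplified sketch; the class and method names are mine, not part of Hadoop, and the rest of the job setup is omitted):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class LatestVersionInput {
        // At job-configuration time: for every hourly log file, keep only the
        // path with the highest ".vN" suffix and register it as job input.
        public static void addLatestVersions(Job job, String hdfsDir) throws Exception {
            FileSystem fs = FileSystem.get(job.getConfiguration());
            FileStatus[] matches = fs.globStatus(new Path(hdfsDir, "*.log.v*"));
            if (matches == null) {
                return;                                        // nothing in the directory yet
            }
            Map<String, Path> latest = new HashMap<>();        // base name -> newest path
            Map<String, Integer> latestVersion = new HashMap<>();
            for (FileStatus st : matches) {
                String name = st.getPath().getName();          // e.g. "2012-09-26.09.00.log.v3"
                if (name.startsWith("_") || name.startsWith(".")) {
                    continue;                                  // skip temporary / hidden files
                }
                int cut = name.lastIndexOf(".v");
                String base = name.substring(0, cut);
                int version = Integer.parseInt(name.substring(cut + 2));
                Integer best = latestVersion.get(base);
                if (best == null || version > best) {
                    latestVersion.put(base, version);
                    latest.put(base, st.getPath());
                }
            }
            for (Path p : latest.values()) {
                FileInputFormat.addInputPath(job, p);
            }
        }
    }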
Questions:
Do you know a simpler approach to this problem that doesn't use this complicated/ugly file versioning?
Some applications may start using an HDFS file that is still being uploaded and is therefore incomplete (the applications see the file in HDFS but don't know whether it is complete). In the case of gzip files, this can lead to errors. Could you advise how I can deal with this problem? I know that for local file systems I can do something like:
cp infile /finaldir/outfile.tmp && mv /finaldir/outfile.tmp /finaldir/outfile
This works because mv is an atomic operation; however, I am not sure whether the same holds for HDFS. Could you please advise whether HDFS supports an atomic operation like mv on ordinary local file systems?
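For reference, this is the pattern I have in mind for HDFS, assuming FileSystem.rename behaves like mv on a local file system (whether it really gives me the atomicity I need is exactly what I am asking; the paths in main are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AtomicPublish {
        // Upload under a temporary name that readers skip,
        // then rename it into its final, versioned name.
        public static void publish(FileSystem fs, Path localFile, Path finalHdfsPath) throws Exception {
            Path tmp = new Path(finalHdfsPath.getParent(), "_tmp." + finalHdfsPath.getName());
            fs.copyFromLocalFile(localFile, tmp);           // the slow part: 60-80 MB copy
            if (!fs.rename(tmp, finalHdfsPath)) {           // is this atomic on HDFS?
                throw new java.io.IOException("rename failed: " + tmp + " -> " + finalHdfsPath);
            }
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // placeholder paths, just for illustration
            publish(fs,
                    new Path("file:///tmp/2012-09-26.09.00.log"),
                    new Path("/spool/myproject/logs/2012-09-26.09.00.log.v4"));
        }
    }

As far as I know, FileInputFormat skips files whose names start with "_" or "." by default, so a temporary name like this should not be picked up by the jobs accidentally, but I would like to be sure about the rename itself.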
Thanks in advance!