After upgrading our small Cloudera Hadoop cluster to CDH 5, deleting files no longer frees up storage space. Even though we delete more data than we add, the file system keeps filling up.
Cluster setup
We run a four-node cluster on dedicated physical hardware with a total storage capacity of about 110 TB. On April 3 we upgraded the CDH software from version 5.0.0-beta2 to version 5.0.0-1.
Previously we ingested log data into HDFS as plain text files at a rate of about 700 GB/day. On April 1 we switched to importing the data as .gz files, which reduced the daily ingest to about 130 GB.
Since we only want to keep data up to a certain age, a nightly job deletes outdated files. The effect of this used to be clearly visible in the HDFS capacity monitoring chart, but it can no longer be seen.
Since we import 570 GB less data per day than we delete, we would expect used capacity to decrease. Instead, our reported HDFS usage has been growing steadily ever since the cluster software was upgraded.
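For reference, the expected daily trend works out like this (a back-of-the-envelope calculation using the figures above; actual numbers fluctuate day to day):

```shell
# Net logical change per day: 130 GB ingested minus ~700 GB deleted.
# At replication factor 3, that should free about 1.7 TB of raw capacity daily.
echo "$(( (700 - 130) * 3 )) GB/day of raw capacity expected to be freed"
```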
Description of the problem
Running hadoop fs -du -h / gives the following output:
0       /system
1.3 T   /tmp
24.3 T  /user
This is consistent with what we expect to see, given the size of the imported files. With a replication factor of 3, this should correspond to a physical disk usage of about 76.8 TB.
Running hdfs dfsadmin -report, however, gives a different result:
Configured Capacity: 125179101388800 (113.85 TB)
Present Capacity: 119134820995005 (108.35 TB)
DFS Remaining: 10020134191104 (9.11 TB)
DFS Used: 109114686803901 (99.24 TB)
DFS Used%: 91.59%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Here DFS Used is reported as 99.24 TB, which matches what we see in the monitoring chart. Where is all that data coming from?
What we tried
The first thing we suspected was that automatic trash emptying was not working, but that does not seem to be the case: only the most recently deleted files are in the trash, and they disappear automatically after a day.
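This is how we checked the trash (a sketch; it assumes the default trash location under each user's home directory and requires a running cluster):

```shell
# Per-user trash usage; default trash location is /user/<name>/.Trash.
hadoop fs -du -h /user/*/.Trash

# Trash retention is controlled by fs.trash.interval (minutes) in core-site.xml.
hdfs getconf -confKey fs.trash.interval
```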
Our problem looks very similar to what would happen if an HDFS metadata upgrade had been started but never finalized. I don't think such an upgrade is required between these versions, but I still performed both steps "just in case".
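The standard finalization command is the following (run as the HDFS superuser; exactly which steps the upgrade procedure requires for this version pair is an assumption on my part):

```shell
# Tell the NameNode to finalize any pending upgrade. Once finalized, the
# datanodes should delete their retained pre-upgrade block copies in the
# background, reclaiming local disk space.
hdfs dfsadmin -finalizeUpgrade
```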
There is a lot of data under `previous/finalized` on the DN storage volumes in the local file system. I know too little about the details of the HDFS implementation to judge how significant this is, but it may indicate that the upgrade finalization never propagated to the datanodes.
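This is roughly how we measured it on one datanode (the data directory paths below are placeholders; substitute your dfs.datanode.data.dir values):

```shell
# Hypothetical datanode data dirs; adjust to match dfs.datanode.data.dir.
for vol in /dfs/dn1 /dfs/dn2; do
  # Sum up everything retained under previous/ from the last upgrade.
  find "$vol" -type d -name previous -exec du -sh {} \;
done
```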
The cluster will run out of disk space soon, so any help would be much appreciated.