Removing files from HDFS does not free up disk space

After upgrading our small Cloudera Hadoop cluster to CDH 5, deleting files no longer frees up storage space. Even though we delete more data than we add, the file system keeps filling up.

Cluster setup

We run a four-node cluster on dedicated physical hardware with a total storage capacity of about 110 TB. On April 3 we upgraded the CDH software from version 5.0.0-beta2 to version 5.0.0-1.

We used to import log data into HDFS as plain text files at a rate of about 700 GB/day. On April 1 we switched to importing the data as .gz files, which reduced the daily ingestion rate to 130 GB.

Since we only want to keep data up to a certain age, a nightly job deletes obsolete files. The effect of this used to be clearly visible in the HDFS capacity monitoring chart, but it can no longer be seen.

Since we import 570 GB less data than we delete every day, we would expect the used capacity to decrease. Instead, our reported HDFS usage has been growing steadily ever since the cluster software was upgraded.

Description of the problem

Running hadoop fs -du -h / gives the following result:

 0       /system
 1.3 T   /tmp
 24.3 T  /user

This is consistent with what we expect to see, given the size of the imported files. With a replication factor of 3, this should correspond to a physical disk usage of about 76.8 TB.
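
As a rough sanity check (assuming the default replication factor of 3), the expected physical usage works out to roughly:

 (1.3 TB + 24.3 TB) x 3 = 25.6 TB x 3 = 76.8 TB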

Running hdfs dfsadmin -report gives a different result:

 Configured Capacity: 125179101388800 (113.85 TB)
 Present Capacity: 119134820995005 (108.35 TB)
 DFS Remaining: 10020134191104 (9.11 TB)
 DFS Used: 109114686803901 (99.24 TB)
 DFS Used%: 91.59%
 Under replicated blocks: 0
 Blocks with corrupt replicas: 0
 Missing blocks: 0

Here, DFS Used is reported as 99.24 TB, which matches what we see in the monitoring chart. Where did all this data come from?

What we tried

The first thing we suspected was that automatic emptying of the trash was not working, but that does not seem to be the case: only the most recently deleted files are in the trash, and they disappear automatically after a day.
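
One way to double-check this (assuming the default per-user trash location under /user/<username>/.Trash) is:

 # size of the current user's HDFS trash
 hadoop fs -du -h /user/$(whoami)/.Trash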

Our problem sounds a lot like what would happen if an HDFS metadata upgrade was started but never finalized. I do not think this is required when upgrading between these versions, but I have performed both steps "just in case".

There is a lot of data in the `previous/finalized` directories on the DN storage volumes in the local file system. I know too little about the details of the HDFS implementation to tell how significant this is, but it could indicate that the finalization is out of sync.

Soon the cluster will run out of disk space, so any help would be appreciated.

hadoop hdfs cloudera-cdh
1 answer

I found a similar problem in our cluster, which was probably due to a failed upgrade.

First, make sure the upgrade has been finalized on the namenode:

 hdfs dfsadmin -finalizeUpgrade 

I found that the datanodes, for some reason, had not finalized their directories at all.

On your datanodes you should see the following directory layout:

 /{mountpoint}/dfs/dn/current/{blockpool}/current

and

 /{mountpoint}/dfs/dn/current/{blockpool}/previous

If the upgrade has not been finalized, the previous directory still contains all the block data that existed before the upgrade. Deleting files does not remove anything from it, so your storage usage never decreases.
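
To see how much space is tied up there, something like the following can be run on each datanode (using the same {mountpoint} and {blockpool} placeholders as above; substitute your actual dfs.datanode.data.dir locations):

 # total size of the un-finalized 'previous' directories, one per block pool
 du -sh /{mountpoint}/dfs/dn/current/*/previous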

In fact, the simplest solution was enough:

Restart the namenode.

Keep an eye on the datanode logs; you should see something like this:

 INFO org.apache.hadoop.hdfs.server.common.Storage: Finalizing upgrade for storage directory 
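
If you want to confirm this on each datanode without tailing the log interactively, a grep along these lines should work (the log path is an assumption for a typical CDH package install; adjust it to your setup):

 # look for the finalization message in the datanode logs (path assumed)
 grep "Finalizing upgrade for storage directory" /var/log/hadoop-hdfs/*datanode*.log*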

The previous directories will then be cleaned up in the background, and the storage space will be reclaimed.
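
Once that has happened, a quick way to verify the result (a sketch, using the same placeholder paths as above) is to check that the previous directories are gone and that DFS Used has dropped back to the expected level:

 # on each datanode: should print nothing once finalization has cleaned up
 ls -d /{mountpoint}/dfs/dn/current/*/previous 2>/dev/null
 # on any node with an HDFS client: DFS Used should fall back toward ~3x the logical size
 hdfs dfsadmin -report | grep "DFS Used"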
