Distributed File System Health Check

I need a distributed file system that has to scale to very large sizes (a realistic maximum of about 100 TB). File sizes are mostly in the 10-1500 KB range, although some files can reach around 250 MB.

I really like the idea of systems like GFS with built-in redundancy for backup, which would - statistically - make file loss a thing of the past.

I have a couple of requirements:

  • Open source
  • No SPOFs
  • Automatic file replication (i.e. no RAID needed)
  • Managed client access
  • Flat file namespace, preferably
  • Built-in versioning / delayed deletion
  • Proven deployments

I have taken a serious look at MogileFS, as it fulfills most of my requirements. It does not have managed clients, but it should be fairly simple to port the Java client. However, it has no built-in versioning. Without versioning, I will have to do regular backups in addition to the file replication built into MogileFS.

Mostly I need protection from a programming error that suddenly purges a lot of files it should not have. While MogileFS protects me from disk and machine errors by replicating my files across X devices, it does not save me if I perform an unwarranted delete.

I would like to be able to specify that a delete operation does not actually take effect until after Y days. The delete will have been performed logically, but I can restore the file state for Y days until it is actually removed. MogileFS also does not have the ability to check for disk corruption on writes - although, again, this could be added.
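
To make the intent concrete, here is a rough sketch (in Java, since a Java client port is already on the table) of the delete semantics I have in mind; the in-memory maps are just hypothetical stand-ins for whatever backend actually sits underneath:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the delete semantics described above: a delete only
// places a tombstone, the file stays restorable for Y days, and a periodic
// sweep removes files whose tombstone has expired.
public class DelayedDeleteStore {
    private final Map<String, byte[]> files = new HashMap<>();       // stand-in for the real backend
    private final Map<String, Instant> tombstones = new HashMap<>(); // key -> time of logical delete
    private final Duration grace;                                    // the "Y days" window

    public DelayedDeleteStore(Duration grace) {
        this.grace = grace;
    }

    public void put(String key, byte[] data) {
        files.put(key, data);
        tombstones.remove(key);
    }

    // Logical delete only: the bytes stay around until the sweep runs.
    public void delete(String key) {
        tombstones.put(key, Instant.now());
    }

    // Tombstoned files are hidden from normal reads.
    public byte[] get(String key) {
        return tombstones.containsKey(key) ? null : files.get(key);
    }

    // Undo a delete at any time within the grace window.
    public void restore(String key) {
        tombstones.remove(key);
    }

    // Run periodically (e.g. nightly): physically remove files whose grace period has passed.
    public void sweep() {
        Instant cutoff = Instant.now().minus(grace);
        tombstones.entrySet().removeIf(e -> {
            if (e.getValue().isBefore(cutoff)) {
                files.remove(e.getKey());
                return true;
            }
            return false;
        });
    }
}
```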

Since we are a Microsoft shop (Windows, .NET, MSSQL), I would optimally like the core parts to run on Windows for ease of maintenance, while the storage nodes run *nix (or a combination) for licensing reasons.

Before I even consider rolling my own, do you have any suggestions for me to look at? I have also checked out HadoopFS, OpenAFS, Lustre, and GFS, but none of them seem to fit my requirements.

+3
3 answers

Do you need to host this on your own servers? Much of what you need could be provided by Amazon S3. Lazy deletion could be implemented by writing deletes to a SimpleDB table and running a garbage-collection pass periodically to purge the files once the waiting period has expired.
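
A rough sketch of how that could look, with SimpleDbTable and S3Bucket as hypothetical thin wrappers around the actual AWS calls (the pending_deletes domain name is likewise just a placeholder):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical thin wrappers standing in for the actual AWS SDK calls.
interface SimpleDbTable {
    void put(String item, String attribute, String value);
    List<String> select(String query);   // returns matching item names
    void deleteItem(String item);
}

interface S3Bucket {
    void deleteObject(String key);
}

// Lazy deletion: a "delete" is just a row in a SimpleDB domain; a periodic
// garbage-collection pass removes the S3 objects once the grace period expires.
public class LazyDeleteGc {
    private final SimpleDbTable pendingDeletes;  // e.g. a "pending_deletes" domain
    private final S3Bucket bucket;
    private final Duration grace;                // how long a delete stays reversible

    public LazyDeleteGc(SimpleDbTable pendingDeletes, S3Bucket bucket, Duration grace) {
        this.pendingDeletes = pendingDeletes;
        this.bucket = bucket;
        this.grace = grace;
    }

    // Called instead of deleting from S3 directly.
    public void markDeleted(String key) {
        pendingDeletes.put(key, "deletedAt", timestamp(Instant.now()));
    }

    // Restoring is just dropping the pending-delete row before GC gets to it.
    public void restore(String key) {
        pendingDeletes.deleteItem(key);
    }

    // Run from a scheduled job; purges everything older than the grace period.
    public void collectGarbage() {
        String cutoff = timestamp(Instant.now().minus(grace));
        String query = "select itemName() from pending_deletes where deletedAt < '" + cutoff + "'";
        for (String key : pendingDeletes.select(query)) {
            bucket.deleteObject(key);
            pendingDeletes.deleteItem(key);
        }
    }

    // SimpleDB compares attribute values as strings, so zero-pad the epoch millis.
    private static String timestamp(Instant t) {
        return String.format("%020d", t.toEpochMilli());
    }
}
```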

There would still be a single point of failure if you rely on a single Internet connection. And of course you could consider Amazon itself to be a point of failure, but their failure rate will always be far lower because of their scale.

And hopefully you can see the other benefits: the ability to scale to any capacity, no need for IT staff to replace failed disks or systems, and usage costs that continually drop as disk capacity and bandwidth get cheaper (while the disks you buy depreciate in value).

It is also possible to take a hybrid approach and use S3 as a durable back-end archive while keeping hot data locally, then work out the caching strategy that best fits your usage model. This can greatly reduce bandwidth usage and improve I/O, especially if data changes infrequently.
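
A minimal sketch of such a read-through / write-through setup, with BlobStore as a hypothetical abstraction over both the local tier and S3:

```java
import java.util.Optional;

// Hypothetical store interface: 'local' is the hot on-site tier, 'remote' is the S3 archive.
interface BlobStore {
    Optional<byte[]> read(String key);
    void write(String key, byte[] data);
}

// Hybrid approach: writes go to both tiers, reads are served locally when
// possible and pulled from S3 only on a cache miss.
public class HybridStore implements BlobStore {
    private final BlobStore local;
    private final BlobStore remote;

    public HybridStore(BlobStore local, BlobStore remote) {
        this.local = local;
        this.remote = remote;
    }

    @Override
    public void write(String key, byte[] data) {
        local.write(key, data);   // hot copy for low-latency reads
        remote.write(key, data);  // durable archive copy (could be done asynchronously)
    }

    @Override
    public Optional<byte[]> read(String key) {
        Optional<byte[]> hit = local.read(key);
        if (hit.isPresent()) {
            return hit;
        }
        Optional<byte[]> pulled = remote.read(key);        // miss: fall back to S3
        pulled.ifPresent(data -> local.write(key, data));  // repopulate the local tier
        return pulled;
    }
}
```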

Downsides:

  • Files on S3 are immutable; they can only be replaced wholesale or deleted. This is great for caching, but not so good for efficiency when making small changes to large files.
  • Latency and bandwidth are limited by your network connection. Caching can help mitigate this, but you will never get the same level of performance as local storage.

Versioning would also be a custom solution, but it could be implemented using SimpleDB alongside S3 to track sets of revisions to a file. Whether that works well really depends on your use case.
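
For illustration, a sketch of that idea with hypothetical wrapper types: every write creates a new immutable S3 object and a SimpleDB row pointing at it, so older revisions stay retrievable until you decide to prune them:

```java
import java.time.Instant;
import java.util.List;

// Hypothetical wrappers standing in for the actual AWS SDK calls.
interface S3Objects {
    void put(String key, byte[] data);
    byte[] get(String key);
}

interface RevisionIndex {
    void record(String fileKey, int revision, String s3Key, Instant when);
    List<String> s3KeysFor(String fileKey);   // ordered oldest to newest
    int latestRevision(String fileKey);       // 0 if the file has never been written
}

// Versioning sketch: S3 holds one immutable object per revision, SimpleDB
// holds the revision history for each logical file key.
public class VersionedStore {
    private final S3Objects s3;
    private final RevisionIndex index;

    public VersionedStore(S3Objects s3, RevisionIndex index) {
        this.s3 = s3;
        this.index = index;
    }

    public void write(String fileKey, byte[] data) {
        int revision = index.latestRevision(fileKey) + 1;
        String s3Key = fileKey + ".v" + revision;   // new, version-suffixed object
        s3.put(s3Key, data);
        index.record(fileKey, revision, s3Key, Instant.now());
    }

    public byte[] readLatest(String fileKey) {
        List<String> keys = index.s3KeysFor(fileKey);
        return keys.isEmpty() ? null : s3.get(keys.get(keys.size() - 1));
    }
}
```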

+1

You could try running a source control system on top of a reliable file system. The problem then becomes how to expire old check-ins after the timeout. You could set up an Apache server with DAV_SVN and commit every change through the DAV interface. I am not sure how well this scales to the large file sizes you describe.
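
For reference, a minimal httpd configuration along those lines might look something like this (the location and repository path are placeholders); with SVNAutoversioning enabled, plain WebDAV writes are committed as revisions automatically:

```apache
# Load mod_dav and mod_dav_svn (module filenames and paths vary by distribution)
LoadModule dav_module     modules/mod_dav.so
LoadModule dav_svn_module modules/mod_dav_svn.so

<Location /files>
    DAV svn
    # Placeholder repository path
    SVNPath /var/svn/filestore
    # Turn plain DAV PUTs from any WebDAV client into Subversion commits
    SVNAutoversioning on
</Location>
```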

0

@tweakt
I have also considered S3 extensively, but I do not think it will be satisfactory for us in the long run. We have many files that must be stored securely - not through file ACLs, but through our application layer. While this could also be done through S3, we would have a bit less control over our file storage. There would also be a significant downside in the form of latency when we perform file operations - both on initial saves (which could be done asynchronously), but also when we later read the files and need to perform operations on them.

Regarding SPOFs, that is not really an issue. We have redundant connections to our data center, and although I do not want any SPOFs, the small amount of downtime S3 has had is acceptable.

Unlimited scalability and no need for maintenance are definitely advantages.

Regarding the hybrid approach: if we are to serve directly from S3 - which would be the case unless we wanted to store everything locally anyway (and just use S3 as a backup) - the bandwidth prices are simply too steep once we add S3 + CloudFront (CloudFront would be necessary, as we have customers all over the world). Currently we host everything from our data center in Europe, and we have our own reverse Squid setup in the US for low-cost CDN functionality.

Although this is very domain dependent, it is not a problem for us. We may replace files (that is, key X gets new content), but we would never make minor modifications to a file. All our files are blobs.

0
source
