I need a distributed file system that can scale to very large sizes (around 100 TB as a realistic maximum). File sizes are mostly in the 10-1500 KB range, although some files can be up to around 250 MB.
I really like the idea of systems like GFS with built-in redundancy for backup, which would, statistically, make file loss a thing of the past.
I have a couple of requirements:
- Open source
- No SPOFs
- Automatic file replication (i.e. no RAID needed)
- Managed client access
- Flat file namespace, preferably
- Built-in versioning / delayed deletes
- Proven deployments
I have taken a serious look at MogileFS, as it fulfills most of these requirements. It has no managed clients, but it should be fairly simple to write a Java client port. However, there is no built-in versioning. Without versioning I will have to do normal backups in addition to the file replication that MogileFS provides.
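For context, writing such a port mostly means speaking the tracker's plain-text, line-based protocol. This is the kind of minimal sketch I have in mind, assuming a tracker on the default port 7001 and a domain/key that already exist (illustrative only, not a full client):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.net.URLEncoder;

public class MogileGetPaths {
    public static void main(String[] args) throws Exception {
        // Assumed setup: a tracker on localhost:7001, domain "mydomain", key "mykey".
        try (Socket s = new Socket("localhost", 7001)) {
            Writer out = new OutputStreamWriter(s.getOutputStream(), "US-ASCII");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(s.getInputStream(), "US-ASCII"));

            // A tracker request is a command word followed by URL-encoded key=value pairs.
            String req = "get_paths domain=" + URLEncoder.encode("mydomain", "UTF-8")
                    + "&key=" + URLEncoder.encode("mykey", "UTF-8") + "\r\n";
            out.write(req);
            out.flush();

            // The response is a single line: "OK <args>" or "ERR <code> <message>".
            String resp = in.readLine();
            if (resp != null && resp.startsWith("OK ")) {
                // On success the payload lists paths=N plus path1..pathN (HTTP URLs on storage nodes).
                System.out.println("Tracker replied: " + resp);
            } else {
                System.err.println("Tracker error: " + resp);
            }
        }
    }
}
```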
Mostly I need protection from a programming error that suddenly purges a lot of files it should not have. While MogileFS protects me from disk and machine failures by replicating my files across X devices, it does not save me if I perform an unwarranted delete.
I would like to be able to specify that a delete does not actually take effect until Y days later: the delete is performed logically, but I can restore the file within those Y days, after which it really disappears. In addition, MogileFS does not have the ability to checksum files at write time so that disk corruption can be detected later, although, again, this could be added.
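The delayed delete would have to live in my own layer on top anyway. Roughly what I have in mind is something like this sketch, where a delete only records a tombstone, a restore removes it, and a periodic job purges anything older than Y days (the class and interface names here are made up for illustration):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical soft-delete layer in front of the real store (e.g. a MogileFS client). */
public class DelayedDeleteStore {

    /** Minimal stand-in for whatever actually stores the bytes; purely illustrative. */
    public interface StorageClient {
        void deleteNow(String key);   // the real, irreversible delete
    }

    private final StorageClient storage;
    private final Duration retention;                        // the "Y days" window
    private final Map<String, Instant> tombstones = new ConcurrentHashMap<>();

    public DelayedDeleteStore(StorageClient storage, Duration retention) {
        this.storage = storage;
        this.retention = retention;
    }

    /** Logical delete: record a tombstone instead of touching the file. */
    public void delete(String key) {
        tombstones.put(key, Instant.now());
    }

    /** Undo a logical delete within the retention window. */
    public boolean restore(String key) {
        return tombstones.remove(key) != null;
    }

    /** A key counts as deleted for readers while its tombstone exists. */
    public boolean isDeleted(String key) {
        return tombstones.containsKey(key);
    }

    /** Run periodically (e.g. from a scheduled job): physically delete expired tombstones. */
    public void purgeExpired() {
        Instant cutoff = Instant.now().minus(retention);
        tombstones.forEach((key, deletedAt) -> {
            if (deletedAt.isBefore(cutoff)) {
                storage.deleteNow(key);      // only now does the file really disappear
                tombstones.remove(key);
            }
        });
    }
}
```

In practice the tombstones (and a checksum captured at write time, for the corruption check) would go into a database table rather than memory, so they survive restarts and can be scrubbed against the stored files later.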
Since we are a Microsoft shop (Windows, .NET, MSSQL), I would ideally have the core parts running on Windows for ease of maintenance, while the storage nodes run *nix (or a mix) due to licensing.
Before I even consider rolling my own, do you have any suggestions for me to look at? I have also checked out Hadoop's HDFS, OpenAFS, Lustre, and GFS, but none of them seem to fit my requirements.