Can HDF5 be used for large amounts of text data?

Suppose I'm going to programmatically retrieve a hundred thousand open-access books as text strings from the Internet. My intention is to do some analysis on them (using pandas). I already use MongoDB in some parts of my application, but I don't think it's easy to put a MongoDB database on a pen drive and move it to another machine. SQLite is portable, but I hate writing SQL. The other options I've seen are simply to store the books in the file system as separate text files, or in something called HDF5.

Is HDF5 a good fit for this kind of text data? If not, what other options are available?

+4
2 answers

Yes, you can, but if I were you, I would just use separate text files and a zip of the containing directory. Here's why:

Large arrays of numbers (HDF5's bread and butter) can be stored efficiently in binary form, but there is no binary representation of text, so HDF5 offers no space advantage for text. Yes, you can enable compression in HDF5 files, but you can just as easily compress plain text files.

Both text files and zip archives are about as universal as formats get these days, so there is nothing to be gained in terms of portability either.

Here is one example of something trivial that you cannot easily do in HDF5: deleting a dataset and reclaiming its space.
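To make the recommendation concrete, here is a minimal sketch of the text-files-plus-zip approach in Python. The books dict, directory name, and archive name are placeholders for illustration, not anything from the question:

```python
import zipfile
from pathlib import Path

import pandas as pd

# Placeholder corpus: in practice this would be the 100k downloaded books.
books = {
    "book_0001": "full text of the first book...",
    "book_0002": "full text of the second book...",
}

# 1. Write each book to its own UTF-8 text file.
out_dir = Path("books")
out_dir.mkdir(exist_ok=True)
for book_id, text in books.items():
    (out_dir / f"{book_id}.txt").write_text(text, encoding="utf-8")

# 2. Zip the directory for transfer; deflate compresses plain text well.
with zipfile.ZipFile("books.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for path in sorted(out_dir.glob("*.txt")):
        zf.write(path, arcname=path.name)

# 3. Read the texts straight out of the zip into a pandas DataFrame.
with zipfile.ZipFile("books.zip") as zf:
    records = [
        {"book_id": Path(name).stem, "text": zf.read(name).decode("utf-8")}
        for name in zf.namelist()
    ]
df = pd.DataFrame(records)
print(df.head())
```

The single books.zip is trivial to copy to a pen drive, and pandas sees the corpus as an ordinary DataFrame with one row per book.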


+5

Yes, it can.

From the HDF Group's own description of HDF5: "HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data."

Source: http://www.hdfgroup.org/HDF5/
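In case it helps, here is a minimal sketch of storing text in HDF5 via h5py; the file name, dataset name, and sample texts are made up for illustration. HDF5 represents text as variable-length strings:

```python
import h5py

# Placeholder texts standing in for the downloaded books.
texts = ["full text of book one...", "full text of book two..."]

# Variable-length UTF-8 string dtype (h5py >= 2.10; older versions
# use h5py.special_dtype(vlen=str) instead).
dt = h5py.string_dtype(encoding="utf-8")

with h5py.File("books.h5", "w") as f:
    ds = f.create_dataset("books", shape=(len(texts),), dtype=dt)
    ds[:] = texts  # each element holds one whole book

with h5py.File("books.h5", "r") as f:
    first = f["books"][0]  # h5py 3.x returns bytes for string data
    print(first.decode("utf-8")[:40])
```

One caveat worth knowing: as far as I understand, HDF5's dataset filters (such as gzip) do not compress variable-length string payloads, which supports the first answer's point about space.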

Good luck!

+1
