Pickled scipy sparse matrix as input?

Question

I am working on a multiclass classification problem that consists of classifying resumes.

I used sklearn and its TfidfVectorizer to get a large scipy sparse matrix, which I feed into a TensorFlow model after pickling it. On my local machine, I load it, convert a small batch to dense numpy arrays, and fill a feed dict. Everything works great.

Now I would like to do the same on Cloud ML. My pickle is stored at gs://my-bucket/path/to/pickle, but when I run my trainer, the pickle file cannot be found at this URI (IOError: [Errno 2] No such file or directory). I use pickle.load(open('gs://my-bucket/path/to/pickle', 'rb')) to retrieve my data. I suspect that this is not the right way to open a file on GCS, but I am completely new to Google Cloud and cannot find the proper way to do it.

In addition, I read that one must use TFRecords or a CSV format for input data, but I do not understand why my method could not work. CSV is ruled out, since the dense representation of the matrix would be too large to fit in memory. Can TFRecords encode sparse data efficiently? And is it possible to read data from a pickle file?

google-cloud-ml

Thomas reynaud Oct 19 '16 at 13:46

1 answer

rhaertel80 · Accepted Answer · 2016-10-19T15:27:07+0000

You are correct that Python "open" will not work with GCS out of the box. Given that you are using TensorFlow, you can use the file_io library instead, which will work with both local files and files in GCS.

    import pickle
    from tensorflow.python.lib.io import file_io
    # Read the whole file (local path or GCS URI) into memory, then unpickle it.
    pickle.loads(file_io.read_file_to_string('gs://my-bucket/path/to/pickle'))

NB: pickle.load(file_io.FileIO('gs://..', 'r')) does not work.

You can use whatever data format works for you and are not limited to CSV or TFRecord (do you mind pointing to the place in the documentation that makes that claim?). If the data fits into memory, then your approach is reasonable.
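For the in-memory case, the batching step described in the question might look roughly like the sketch below. It assumes the unpickled object is a scipy sparse matrix; the helper name dense_batches, the batch size, and the small stand-in matrix are hypothetical and only there to make the example self-contained.

```python
import numpy as np
from scipy import sparse


def dense_batches(matrix, batch_size):
    """Yield consecutive row slices of a sparse matrix as dense float32 arrays."""
    matrix = matrix.tocsr()  # CSR format makes row slicing cheap
    for start in range(0, matrix.shape[0], batch_size):
        # Only the current slice is densified, so memory use stays bounded.
        yield matrix[start:start + batch_size].toarray().astype(np.float32)


# Hypothetical stand-in for the unpickled TF-IDF matrix (10 documents, 50 terms).
tfidf = sparse.csr_matrix(np.eye(10, 50))
batches = list(dense_batches(tfidf, batch_size=4))
```

Each yielded array can then be placed in the feed dict for one training step, so the full dense matrix never has to exist at once.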

If the data does not fit into memory, you will most likely want to use TensorFlow's reader framework, for which the most convenient formats tend to be CSV and TFRecords. A TFRecord file is simply a container of byte strings. Most commonly, it holds serialized tf.Example data, which does support sparse data (it is essentially a map). See tf.parse_example for more information on parsing tf.Example data.
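As a rough illustration of the tf.Example point above, the sketch below stores only the non-zero column indices and values of one sparse row, then parses them back into a dense vector. It assumes a TensorFlow 2.x install, where the parsing functions live under tf.io (the tf.parse_example mentioned above is the older name). The feature keys "indices" and "values", the constant VOCAB_SIZE, and both helper names are hypothetical.

```python
import tensorflow as tf

VOCAB_SIZE = 10  # hypothetical number of TF-IDF columns


def row_to_example(indices, values):
    """Serialize one sparse row as a tf.Example holding only its non-zero entries."""
    feature = {
        "indices": tf.train.Feature(int64_list=tf.train.Int64List(value=indices)),
        "values": tf.train.Feature(float_list=tf.train.FloatList(value=values)),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


def parse_row(serialized):
    """Parse one serialized tf.Example back into a dense vector of length VOCAB_SIZE."""
    spec = {
        "indices": tf.io.VarLenFeature(tf.int64),
        "values": tf.io.VarLenFeature(tf.float32),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    idx = tf.sparse.to_dense(parsed["indices"])
    val = tf.sparse.to_dense(parsed["values"])
    # Scatter the stored values into their column positions; all other entries are zero.
    shape = tf.constant([VOCAB_SIZE], dtype=tf.int64)
    return tf.scatter_nd(tf.expand_dims(idx, 1), val, shape)


serialized = row_to_example([1, 4, 7], [0.5, 0.25, 0.125])
dense = parse_row(serialized).numpy()
```

The serialized byte strings are what would be written to a TFRecord file, one record per row, so the file size scales with the number of non-zero entries rather than with the full matrix dimensions.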
