I am working on a multiclass classification problem that consists of classifying resumes.
I used sklearn and its TfIdfVectorizer to get a large, sparse matrix that I feed into a TensorFlow model after pickling it. On my local machine, I load the pickle, convert a small batch to dense numpy arrays, and feed them to the model. Everything works great.
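For concreteness, here is a simplified sketch of that local workflow (the variable names, file names, and the commented sess.run call are illustrative, not my actual code):

    import pickle
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = ["first resume text", "second resume text", "third resume text"]

    # Vectorize and pickle the resulting scipy.sparse matrix
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(documents)   # large, sparse scipy matrix
    with open('tfidf.pkl', 'wb') as f:
        pickle.dump(X, f)

    # Later: load the pickle and densify only one small batch at a time
    with open('tfidf.pkl', 'rb') as f:
        X = pickle.load(f)
    batch = X[0:2].toarray()                  # small dense numpy array
    # sess.run(train_op, feed_dict={x_ph: batch, ...})  # feed into the TF model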
Now I would like to do the same on Cloud ML. My pickle is stored at gs://my-bucket/path/to/pickle, but when I run my trainer, the pickle file cannot be found at this URI (IOError: [Errno 2] No such file or directory). I use pickle.load(open('gs://my-bucket/path/to/pickle', 'rb')) to retrieve my data. I suspect that this is not the right way to open a file on GCS, but I am completely new to Google Cloud and I cannot find the proper way to do it.
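From what I can tell, the builtin open() only handles local paths, which would explain the IOError. One thing I have seen suggested (I am not sure it is the canonical approach) is TensorFlow's file_io module, which understands gs:// paths; the snippet below is a sketch under that assumption, based on the TF 1.x API:

    import pickle
    from tensorflow.python.lib.io import file_io  # ships with TensorFlow, handles gs://

    # Open the GCS object through TensorFlow's file abstraction instead of open()
    with file_io.FileIO('gs://my-bucket/path/to/pickle', mode='rb') as f:
        data = pickle.load(f)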
In addition, I have read that it is necessary to use TFRecords or CSV format for data input, but I do not understand why my method would not work. CSV is ruled out, since the dense representation of the matrix would be too large to fit in memory. Can TFRecords encode sparse data efficiently? And is it possible to read data from a pickle file?
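To make the sparse-data question concrete, here is roughly how I imagine one row could be stored in a TFRecord, keeping only the non-zero column indices and values. The feature names and the use of tf.python_io are my assumptions (TF 1.x style), not something I have verified on Cloud ML:

    import scipy.sparse as sp
    import tensorflow as tf

    X = sp.random(4, 1000, density=0.01, format='csr')  # stand-in for my TF-IDF matrix

    def row_to_example(row):
        # Keep only the non-zero entries of one row: column indices and values
        coo = row.tocoo()
        return tf.train.Example(features=tf.train.Features(feature={
            'indices': tf.train.Feature(int64_list=tf.train.Int64List(value=coo.col.tolist())),
            'values': tf.train.Feature(float_list=tf.train.FloatList(value=coo.data.tolist())),
        }))

    with tf.python_io.TFRecordWriter('resumes.tfrecord') as writer:
        for i in range(X.shape[0]):
            writer.write(row_to_example(X[i]).SerializeToString())

I assume reading it back would involve tf.parse_single_example with a VarLenFeature or tf.SparseFeature, but I have not verified that part either.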
google-cloud-ml
Thomas reynaud