Google Storage (gs) wrapper for file input/output in Cloud ML?

Google recently announced Cloud ML, https://cloud.google.com/ml/ , and it's very useful. However, one limitation is that the input/output of a TensorFlow program must support gs://.

If we use TensorFlow's file APIs for reading/writing, this should be OK, since these APIs support gs://.

However, if we use native Python file I/O APIs such as open, this does not work, because they do not understand gs://.

For example:

    with open(vocab_file, 'wb') as f:
        cPickle.dump(self.words, f)

This code will not work in Google Cloud ML.

However, changing every file I/O call in TensorFlow or in Google's Python libraries is very tedious. Is there an easy way to do this? Are there any wrappers that support Google Storage (gs://) on top of the native file I/O?

As suggested in Pickled scipy sparse matrix as input data?, maybe we can use file_io.read_file_to_string('gs://...') , but this still requires significant code modification.
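For example, assuming a hypothetical bucket path, each read site would have to be rewritten into something like this sketch:

    import cPickle

    from tensorflow.python.lib.io import file_io

    # Read the whole pickled object into memory, then unpickle it.
    # 'gs://my-bucket/vocab.pickled' is a hypothetical path.
    data = file_io.read_file_to_string('gs://my-bucket/vocab.pickled')
    words = cPickle.loads(data)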

google-cloud-storage tensorflow google-cloud-ml
3 answers

One solution is to copy all the data to the local disk when the program starts. You can do this by calling gsutil from the Python script that gets run, something like:

    import os
    import subprocess
    import cPickle

    vocab_file = 'vocab.pickled'
    # Copy the input from GCS to local disk, then use native file I/O on it.
    subprocess.check_call(['gsutil', '-m', 'cp', '-r',
                           os.path.join('gs://path/to/', vocab_file), '/tmp'])
    with open(os.path.join('/tmp', vocab_file), 'rb') as f:
        self.words = cPickle.load(f)

And if you have any outputs, you can write them to the local disk and gsutil rsync them, as sketched below. (But be careful to handle restarts correctly, because you may be placed on a different machine.)
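A minimal sketch of that output pattern, with hypothetical paths, might be:

    import os
    import subprocess
    import cPickle

    # Hypothetical local staging directory and bucket path.
    output_dir = '/tmp/outputs'
    with open(os.path.join(output_dir, 'vocab.pickled'), 'wb') as f:
        cPickle.dump(self.words, f)
    # Sync the local directory up to GCS.
    subprocess.check_call(['gsutil', '-m', 'rsync', '-r',
                           output_dir, 'gs://path/to/outputs'])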

Another solution is to monkey-patch open (note: untested):

    import __builtin__

    from tensorflow.python.lib.io import file_io

    # NB: not all modes are compatible; this should be handled more carefully.
    # Probably should be reported on
    # https://github.com/tensorflow/tensorflow/issues/4357
    def new_open(name, mode='r', buffering=-1):
        return file_io.FileIO(name, mode)

    __builtin__.open = new_open

Just be sure to do this before any module tries to read from GCS.
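Once the patch is applied, the question's original snippet should work unchanged, even with a gs:// path (bucket path hypothetical):

    vocab_file = 'gs://my-bucket/vocab.pickled'
    with open(vocab_file, 'wb') as f:
        cPickle.dump(self.words, f)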


Do it like this:

    import cPickle

    from tensorflow.python.lib.io import file_io

    with file_io.FileIO('gs://.....', mode='w+') as f:
        cPickle.dump(self.words, f)

Or you can read the pickle file as follows:

    # train_file may be a gs:// or local path defined elsewhere.
    file_stream = file_io.FileIO(train_file, mode='r')
    x_train, y_train, x_test, y_test = pickle.load(file_stream)

apache_beam has a gcsio module that can be used to return a standard Python file object for reading/writing GCS objects. You can use this object anywhere a Python file object works. For example:

    import logging
    import time

    # Import path in recent Apache Beam releases (older releases exposed
    # apache_beam.io.gcsio directly).
    from apache_beam.io.gcp import gcsio


    def open_local_or_gcs(path, mode):
        """Opens the given path, using GCS I/O for gs:// paths."""
        if path.startswith('gs://'):
            try:
                return gcsio.GcsIO().open(path, mode)
            except Exception as e:  # pylint: disable=broad-except
                # Currently we retry exactly once, to work around flaky gcs calls.
                logging.error('Retrying after exception reading gcs file: %s', e)
                time.sleep(10)
                return gcsio.GcsIO().open(path, mode)
        else:
            return open(path, mode)


    with open_local_or_gcs(vocab_file, 'wb') as f:
        cPickle.dump(self.words, f)
