Hadoop Read-Only Access to Google Storage Buckets

I am trying to access a Google Cloud Storage bucket from a Hadoop cluster deployed on Google Cloud using the bdutil script. It does not work if access to the bucket is read-only.

What I am doing:

  • Deploy the cluster with

    bdutil deploy -e datastore_env.sh
    
  • On the master node:

    vgorelik@vgorelik-hadoop-m:~$ hadoop fs -ls gs://pgp-harvard-data-public 2>&1 | head -10
    14/08/14 14:34:21 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.8-hadoop1
    14/08/14 14:34:25 WARN gcsio.GoogleCloudStorage: Repairing batch of 174 missing directories.
    14/08/14 14:34:26 ERROR gcsio.GoogleCloudStorage: Failed to repair some missing directories.
    java.io.IOException: Multiple IOExceptions.
    java.io.IOException: Multiple IOExceptions.
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createCompositeException(GoogleCloudStorageExceptions.java:61)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:361)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:372)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listObjectInfo(GoogleCloudStorageImpl.java:914)
        at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.listObjectInfo(CacheSupplementedGoogleCloudStorage.java:455)
    

Looking at the GCS connector's Java source code, it appears that the Google Cloud Storage Connector for Hadoop needs empty "directory" objects, which it can create on its own if the bucket is writable; otherwise it fails. Setting fs.gs.implicit.dir.repair.enable=false leads to an "Error retrieving object" error.
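For reference, that repair behavior is controlled by a Hadoop property; a minimal core-site.xml fragment (property name from the question above, placement assumed) would look like:

```xml
<!-- core-site.xml fragment: disable the GCS connector's implicit
     "directory repair" step mentioned above. Note the question reports
     that this setting leads to a different error instead. -->
<property>
  <name>fs.gs.implicit.dir.repair.enable</name>
  <value>false</value>
</property>
```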

Is it possible to use a read-only bucket as MapReduce input?

Does access via gsutil work?

+4

This is a current limitation of the Google Cloud Storage connector for Hadoop.

For example, a streaming job like this:

./hadoop-install/bin/hadoop \
  jar ./hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
  -input gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* \
  -mapper cgi-mapper.py -file cgi-mapper.py --numReduceTasks 0 \
  -output gs://big-data-roadshow/output

hits the same problem.

When expanding the glob (*), the Google Cloud Storage connector for Hadoop tries to create the missing "placeholder" directory objects.

The workaround is to use gsutil (which does not need the "placeholder" objects) to expand the glob yourself, and then pass the already-expanded list of paths to hadoop.

(In other words, do the glob expansion with gsutil instead of letting hadoop do it.)
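The workaround above can be sketched as a shell snippet. The gsutil step is shown as a comment (it assumes an installed, authenticated gsutil and uses the glob from the question); the runnable part demonstrates only the join step on two hypothetical object paths, on the assumption that hadoop streaming accepts a comma-separated -input list:

```shell
# Sketch of the workaround: expand the glob with gsutil (which needs no
# "placeholder" objects), then hand hadoop an explicit list of paths.
#
# In a real run (assumes gsutil is installed and authenticated):
#   INPUTS=$(gsutil ls 'gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master*' \
#            | paste -sd, -)
#   hadoop jar ./hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
#     -input "$INPUTS" -mapper cgi-mapper.py -file cgi-mapper.py \
#     -numReduceTasks 0 -output gs://big-data-roadshow/output

# The join step itself, demonstrated on two made-up paths: paste -sd, -
# turns newline-separated matches into one comma-separated string.
INPUTS=$(printf '%s\n' \
  'gs://example-bucket/a/ASM/master1' \
  'gs://example-bucket/b/ASM/master2' \
  | paste -sd, -)
echo "$INPUTS"
```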

+5
