Reading in a CSV file as dataframe from hdfs

Question

I use pydoop to read in a file from hdfs, and when I use:

import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    print f.read()

It shows me the file in stdout.

Is there a way to read in this file as a dataframe? I tried using pandas' read_csv("/home/file.csv"), but it tells me that the file was not found. Exact code and error:

>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
  File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist

python pandas hadoop hdfs

lordingtar Feb 26 '16 at 1:57

1 answer

hpaulj · Accepted Answer · 2016-02-26T05:25:51+0000

I don't know anything about hdfs, but I'm wondering if the following can be done:

import pandas as pd
import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    df = pd.read_csv(f)

I assume read_csv works with a file handle, or in fact with any iterable that will feed it lines. I know the numpy csv readers do.
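To illustrate the idea of handing read_csv an open handle instead of a path, here is a minimal sketch written for Python 3. It uses io.StringIO as a stand-in for the object returned by hd.open, so it runs without a Hadoop cluster; the sample data and the names csv_text and buffer are made up for the example. Whether pydoop's handle behaves the same way is an assumption that would need checking on a real cluster.

```python
import io

import pandas as pd

# Made-up sample data; a real run would get this from hd.open(...) instead.
csv_text = "id,name\n1,alice\n2,bob\n"

# io.StringIO is an in-memory file-like object with read(), standing in
# for the HDFS file handle (an assumption for illustration only).
buffer = io.StringIO(csv_text)

# read_csv accepts a file-like object in place of a path string.
df = pd.read_csv(buffer)

print(df)
```

If the handle is accepted, df is an ordinary DataFrame and nothing about it depends on where the bytes came from.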

pd.read_csv("/home/file.csv") would work if the regular Python open worked on that path, i.e. if the file could be read as an ordinary local file:

with open("/home/file.csv") as f: 
    print f.read()

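But evidently hd.open uses some other location or protocol, so the file is not local. If my suggestion does not work, then you (or we) will need to dig further into the hdfs documentation.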
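The local-versus-HDFS distinction can be checked without a cluster. The sketch below (Python 3) assumes nothing about pydoop: it only shows that a path string passed to read_csv is resolved against the local filesystem, so a path that does not exist there fails no matter where else the file lives. The scratch directory and both file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical paths inside a scratch directory on the local disk.
workdir = tempfile.mkdtemp()
local_path = os.path.join(workdir, "file.csv")
missing_path = os.path.join(workdir, "not_here.csv")

with open(local_path, "w") as f:
    f.write("id,name\n1,alice\n")

# A path that exists locally parses fine.
df_local = pd.read_csv(local_path)

# A path that does not exist locally raises an I/O error,
# which is what the traceback in the question shows.
try:
    pd.read_csv(missing_path)
    error_name = None
except (IOError, OSError) as exc:
    error_name = type(exc).__name__

print(len(df_local), error_name)
```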
