I have text files with the following format:
000423|东阿阿胶| 300|1|0.15000| |
000425|徐工机械| 600|1|0.15000| |
000503|海虹控股| 400|1|0.15000| |
000522|白云山A| |2| | 1982.080|
000527|美的电器| 900|1|0.15000| |
000528|柳 工| 300|1|0.15000| |
When I use read_csv to load them into a DataFrame, it does not infer the correct dtype for some columns. For example, the first column is parsed as int rather than unicode str (so the leading zeros are lost), and the third column is parsed as unicode str rather than int because some rows have no value there. Is there a way to pre-set the dtype of each column, the way numpy.genfromtxt allows?
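A minimal reproduction of the inference behavior described above, using a tiny made-up sample in the same shape (Python 3 imports; on Python 2 use `from StringIO import StringIO`):

```python
from io import StringIO
import pandas as pd

# Two rows shaped like the data above; the tickers have leading zeros.
sample = "000423|foo|300\n000425|bar|600\n"
df = pd.read_csv(StringIO(sample), sep='|', header=None,
                 names=['ticker', 'name', 'vol'])

print(df['ticker'].dtype)   # int64 -- '000423' has become 423
```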
Update: I used read_csv like this, which caused the problem:
data = pandas.read_csv(StringIO(etf_info), sep='|', skiprows=14, index_col=0, skip_footer=1, names=['ticker', 'name', 'vol', 'sign', 'ratio', 'cash', 'price'], encoding='gbk')
To work around the dtype and encoding problems, I currently have to go through unicode() and numpy.genfromtxt first:
etf_info = unicode(urllib2.urlopen(etf_url).read(), 'gbk')
nd_data = np.genfromtxt(StringIO(etf_info), delimiter='|', skiprows=14,
                        skip_footer=1, dtype=ETF_DTYPE)
data = pandas.DataFrame(nd_data, index=nd_data['ticker'],
                        columns=['name', 'vol', 'sign', 'ratio', 'cash', 'price'])
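The post uses the name ETF_DTYPE without showing its definition, so here is a hypothetical reconstruction of what such a structured dtype might look like (the field types and string widths are guesses), run on one row in the file's format. genfromtxt fills empty float fields with NaN by default, and reading the ticker as a string keeps its leading zeros:

```python
from io import StringIO  # Python 3; use `from StringIO import StringIO` on Python 2
import numpy as np

# Hypothetical ETF_DTYPE -- field names match the post, types are assumptions.
ETF_DTYPE = np.dtype([('ticker', 'U6'), ('name', 'U16'),
                      ('vol', 'f8'), ('sign', 'i4'),
                      ('ratio', 'f8'), ('cash', 'f8'), ('price', 'f8')])

# One row in the file's format; the empty float fields become NaN.
row = "000423|foo|300|1|0.15000||\n"
nd = np.genfromtxt(StringIO(row), delimiter='|', dtype=ETF_DTYPE)

print(nd['ticker'])   # '000423' -- the leading zeros survive
```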
It would be nice if read_csv gained dtype and usecols arguments. Sorry for my greed. ^_^
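For what it's worth, later pandas releases did add both keywords to read_csv. A sketch using the column names from the post (the sample row here is abbreviated and made up):

```python
from io import StringIO
import pandas as pd

row = "000423|foo|300|1|0.15000||\n"
data = pd.read_csv(StringIO(row), sep='|', header=None,
                   names=['ticker', 'name', 'vol', 'sign',
                          'ratio', 'cash', 'price'],
                   dtype={'ticker': str},      # keep the leading zeros
                   usecols=['ticker', 'name', 'vol'])

print(data['ticker'].iloc[0])   # '000423'
```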