Python: handling a large dataset. SciPy or Rpy? And how?

In my python environment, Rpy and Scipy packages are already installed.

The problem I want to solve is this:

1) A huge set of financial data is stored in a text file; loading it into Excel is not possible.

2) I need to sum certain fields and get the totals.

3) I need to show the top 10 rows, ranked by those totals.

Which package (SciPy or Rpy) is better suited for this task?

And can you give me a few pointers (documentation or an online example) that would help me implement a solution?

Speed is a concern. Ideally, SciPy or Rpy should be able to process large files, even files so large that they cannot fit in memory.

+7

6 answers

As @gsk3 noted, bigmemory is a great package for this, along with biganalytics and bigtabulate (there are more, but these are worth checking out). There is also ff, although it is not as easy to use.

What R and Python have in common is HDF5 support (via the ncdf4 or NetCDF4 packages in R), which makes accessing massive datasets on disk very fast and easy. Personally, I primarily use bigmemory, but that is R-specific. Since HDF5 is available in Python and is very, very fast, it is probably your best bet in Python.
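On the Python side, a minimal sketch of chunked HDF5 access using h5py (one common Python HDF5 binding); the file name "finance.h5" and dataset name "prices" are hypothetical:

    import h5py
    import numpy as np

    # Open an HDF5 file for reading; the dataset lives on disk, not in memory
    with h5py.File("finance.h5", "r") as f:
        dset = f["prices"]                  # hypothetical 2-D numeric dataset
        totals = np.zeros(dset.shape[1])
        # Slice the dataset in chunks so the whole file never has to fit in memory
        for start in range(0, dset.shape[0], 100000):
            totals += dset[start:start + 100000].sum(axis=0)
    print(totals)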

+2

Neither Rpy nor SciPy is needed, although numpy can make this a little easier. This problem seems ideal for a line-by-line parser: simply open the file, read a line, parse the line into an array (see numpy.fromstring), update the running totals, and move on to the next line.
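A minimal sketch of that loop, assuming comma-separated numeric columns in a hypothetical file data.txt:

    import numpy as np

    totals = None
    with open("data.txt") as f:                 # hypothetical file name
        for line in f:                          # one line at a time; the file is never fully loaded
            if not line.strip():
                continue                        # skip blank lines
            row = np.fromstring(line, sep=",")  # parse the line into a float array
            totals = row if totals is None else totals + row  # update running totals
    print(totals)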

+5

Python's file I/O does not have bad performance, so you can just use the built-in file type directly. You can see what functions are available on it by typing help(file) in the interactive interpreter. Opening a file is part of the core language, so you do not need to import anything.

Something like:

    f = open(r"C:\BigScaryFinancialData.txt", "r")
    for line in f:  # iterate lazily; readlines() would pull the whole file into memory
        # line is a str; do whatever you want on a per-line basis here, for example:
        print(len(line))

Disclaimer: this is a Python 2 answer. I am not 100% sure it works in Python 3.

I will leave it to you to figure out how to show the top 10 rows and compute the sums; that is simple program logic that should not require any special libraries. Of course, if the rows have some complicated formatting that makes it difficult to parse out the values, you could use a parsing module such as re (type help(re) in the interactive interpreter); see the sketch below.
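For instance, a sketch of pulling numeric fields out of an awkwardly formatted line with re; the sample line and pattern here are purely illustrative:

    import re

    line = 'ACME Corp;  "1,234.56" ; 789.00'  # a made-up example of messy formatting
    # Grab anything that looks like a number, then strip the thousands separators
    values = [float(v.replace(",", "")) for v in re.findall(r'\d[\d,]*\.?\d*', line)]
    print(values)  # [1234.56, 789.0]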

+3

How huge is your data? Is it bigger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load the text data into a numpy array, e.g.:

    import numpy as np

    with open("data.csv", "rb") as f:
        title = f.readline()                 # skip the title line, if your data has one
        data = np.loadtxt(f, delimiter=",")  # if your data is comma-separated
    print(np.sum(data, axis=0))              # sum along axis 0 to get the total of each column
+2

I don't know anything about Rpy. I do know that SciPy gets used for serious number crunching on truly large data sets, so it should work for your problem.

As marshmallow noted, you may not need either one; if you just need to keep some running totals, you can probably do it in plain Python. If the data is a CSV file or some other common format, check whether there is a Python module that will parse it, and then write a loop that sums the appropriate values.
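A sketch of that loop using the standard csv module; the file name "trades.csv" and the "symbol" and "amount" column names are hypothetical:

    import csv
    from collections import defaultdict

    totals = defaultdict(float)
    with open("trades.csv", "rb") as f:    # "rb" for Python 2's csv module
        for row in csv.DictReader(f):      # one row at a time, so memory use stays flat
            totals[row["symbol"]] += float(row["amount"])
    print(dict(totals))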

I am not sure how you should get the top ten rows. Can you collect them on the fly as you go, or do you need to compute all the totals first and then select the rows? To collect them as you go, you could use a dictionary to track the current top 10 rows, with the keys storing the metric you used to rank them (so it is easy to find and discard a row when another row displaces it). If you need to find the rows after the computation is done, slurp all the data into a numpy.array, or just make a second pass through the file to pull out the ten rows.
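One way to collect them on the fly, sketched here with a heap instead of the dictionary described above (heapq makes keep-the-best-N bookkeeping simple); ranking by the last comma-separated field is a made-up example:

    import heapq

    top10 = []  # a min-heap of (score, line) pairs, never more than 10 entries
    with open("data.txt") as f:                     # hypothetical file name
        for line in f:
            score = float(line.rsplit(",", 1)[-1])  # hypothetical ranking metric
            if len(top10) < 10:
                heapq.heappush(top10, (score, line))
            elif score > top10[0][0]:
                heapq.heapreplace(top10, (score, line))  # evict the current worst

    for score, line in sorted(top10, reverse=True):
        print(line)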

+1
