NumPy for R users?

I'm a long-term user of R and Python. I use R for daily data analysis and Python for tasks that are harder in R, such as text processing and shell scripting. I work with ever-larger data sets, and these files often arrive as binary or text files. I usually apply statistical / machine learning algorithms and in most cases create statistical graphs. I sometimes use R with SQLite and write C for iteration-intensive tasks; before looking into Hadoop, I'm considering investing in NumPy / SciPy, because I've heard it has better memory management (and switching to NumPy / SciPy for someone with my background doesn't seem like a big leap). I wonder if anyone with experience of both options could comment on the memory-management improvements, and whether NumPy has idioms that deal with this problem. (I also know about Rpy2, but I wonder if NumPy / SciPy alone can handle most of my needs.) Thanks.

+6
python numpy scipy r
3 answers

R's main advantage, when you are looking for an environment for machine learning and statistics, is of course the diversity of its libraries. As far as I know, SciPy + SciKits cannot replace CRAN.

Regarding memory usage, R uses pass-by-value semantics, while Python uses pass-by-reference. Passing by value can lead to more "intuitive" code; passing by reference helps optimize memory usage. NumPy also lets you have "views" on arrays (sub-arrays without copying).
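To make the point concrete, here is a minimal sketch of NumPy views (standard NumPy behavior; the variable names are my own):

```python
import numpy as np

a = np.arange(10)
b = a[2:5]          # a view: shares memory with 'a', nothing is copied
b[0] = 99           # writing through the view modifies 'a' as well

print(a[2])         # 99
print(b.base is a)  # True: 'b' borrows its data buffer from 'a'
```

Slicing always returns a view; use `a[2:5].copy()` when you actually want an independent array.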

Regarding speed, pure Python is faster than pure R for accessing individual elements in an array, but this advantage disappears when working with NumPy arrays (see the benchmarks). Fortunately, Cython makes it easy to get serious speed improvements.

If you work with big data, I find support for disk-based arrays better with Python (HDF5, via h5py or PyTables).
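As one concrete option for disk-backed arrays, here is a minimal sketch using NumPy's own `memmap` (HDF5 proper would go through h5py or PyTables; the file path is purely illustrative):

```python
import os
import tempfile
import numpy as np

# create a disk-backed array; the data live in the file, not in RAM
path = os.path.join(tempfile.mkdtemp(), "big.dat")
m = np.memmap(path, dtype="float64", mode="w+", shape=(1000, 100))
m[:] = 1.0          # writes go through to the file
m.flush()

# reopen read-only; only the pages actually touched are loaded
r = np.memmap(path, dtype="float64", mode="r", shape=(1000, 100))
print(r[:5, :5].sum())  # 25.0
```

The same slicing/ufunc syntax works on the memmap as on an in-memory ndarray, which is what makes this approach convenient for files larger than RAM.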

I'm not sure you should drop one for the other, but rpy2 can help you explore your options for a possible transition (arrays can be shared between R and NumPy without copying).

+10

I use NumPy daily, and R almost as often.

For heavy number crunching, I prefer NumPy to R by a large margin (including R packages such as "Matrix"). I find the syntax cleaner, the function set larger, and the computation faster (although I don't find R slow by any means). NumPy's broadcasting, for example, has no analog in R as far as I know.
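A small sketch of what broadcasting means in practice (standard NumPy; the array is made up for illustration):

```python
import numpy as np

X = np.arange(12.0).reshape(3, 4)   # a 3x4 matrix
col_means = X.mean(axis=0)          # shape (4,)

# broadcasting: the length-4 vector is "stretched" across
# all three rows of X, with no explicit loop or tiling
centered = X - col_means
print(centered.mean(axis=0))        # every column now has mean 0
```

In R you would typically reach for `sweep()` or rely on recycling rules to express the same thing.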

For example, to read a data set from a csv file and "normalize" it for input to an ML algorithm (for example, mean-center, then rescale each dimension), only the following is required:

import numpy as NP

data = NP.loadtxt(data1, delimiter=",")  # 'data1' is the csv file name; 'data' is a NumPy array
data -= NP.mean(data, axis=0)            # mean-center each column
data /= NP.max(data, axis=0)             # rescale each column

In addition, I find that when coding ML algorithms, I need data structures that I can operate on element-wise and that also understand linear algebra (for example, matrix multiplication, transpose, etc.). NumPy provides this and makes it easy to build such hybrid structures (no operator overloading or subclassing needed).
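For instance (a toy example of mine, not from the answer above), the same ndarray supports both element-wise and linear-algebra operations:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([1.0, -1.0])

print(A * 2)    # element-wise scaling
print(A @ v)    # matrix-vector product -> [-1. -1.]
print(A.T)      # transpose
```

No conversion between "matrix" and "element-wise" types is needed; one array type covers both styles of computation.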

You will not be disappointed with NumPy / SciPy; more likely, you will be amazed.

So, a few recommendations, both in general and in particular given the facts in your question:

  • install both NumPy and SciPy. As a rough guide, NumPy provides the basic data structures (in particular the ndarray) and SciPy (which is actually several times larger than NumPy) provides the domain-specific functions (for example, statistics, signal processing, integration).

  • install the repository versions, especially w/r/t NumPy, because the dev version is 2.0. Matplotlib and NumPy are tightly integrated; you can use one without the other, but both are best-in-class among Python libraries. You can get all three with easy_install; I assume you already have it.

  • NumPy / SciPy have several modules specifically directed at Machine Learning / Statistics, including the Clustering package and the Statistics package.

  • As well as packages aimed at general computing that nevertheless make coding ML algorithms much faster, in particular Optimization and Linear Algebra.

  • There are also SciKits, not included in the base NumPy or SciPy libraries; you need to install them separately. Generally speaking, each SciKit is a set of convenient wrappers for coding in a given domain. The SciKits you are likely to find most relevant are: ann (approximate nearest neighbor) and learn (a set of ML algorithms for regression, statistics, and classification, for example logistic regression, multi-layer perceptron, support vector machines).
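As a small sketch of the clustering support mentioned above (assuming SciPy is installed; the data are synthetic and the blob centers are made up):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
# two well-separated synthetic blobs of 50 points each
pts = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
                 rng.normal(5.0, 0.3, (50, 2))])

# k-means with k=2, initialized from randomly chosen data points
centroids, labels = kmeans2(pts, 2, minit="points")
print(centroids.shape)   # (2, 2): one 2-D centroid per cluster
```

`labels` assigns each of the 100 points to one of the two centroids; with blobs this well separated, each blob lands in its own cluster.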

+11

I cannot comment on R, but here are a few links on NumPy / SciPy and ML:

And a book (I have only looked at some of its code): Marsland, Machine Learning: An Algorithmic Perspective (with NumPy), 2009, 406 pp., ISBN 1420067184.

If you could collect some notes on your experience of the NumPy / SciPy learning curve, they might be useful to others.

+2
