Python - What is the main Pandas improvement over Numpy / Scipy

I use numpy / scipy for data analysis. I recently started to learn Pandas.

I have looked through several guides, and I am trying to understand what the main improvements of Pandas over Numpy / Scipy are.

It seems to me that the key idea of Pandas is to wrap various numpy arrays in a DataFrame, with some utility functions around it.

Is there something revolutionary about Pandas that I just stupidly missed?

+5
3 answers

Pandas is not particularly revolutionary: it uses the NumPy and SciPy ecosystem to achieve its goals, along with some key Cython code. It can be seen as a simpler API to that functionality, with the addition of key utilities such as joins and simpler group-by capabilities, which are especially useful for people with tabular data or time series. But, although not revolutionary, Pandas does have key benefits.

For a while I also perceived Pandas as just utilities on top of NumPy for those who liked the DataFrame interface. However, I now see Pandas as providing these key features (the list is not exhaustive):

  • Column-wise storage of heterogeneous data (each column is stored independently by type, instead of the contiguous record storage of structured arrays in NumPy). In many cases this speeds up processing.
  • Simpler interfaces for common operations (file loading, plotting, selecting, and joining / aligning data) make it easy to do a lot of work in little code.
  • Index arrays, which mean that operations are always aligned by label, instead of you having to keep track of alignment yourself.
  • Split-Apply-Combine, which is a powerful way of thinking about and implementing data processing.
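
As a minimal sketch of the last two points (the Series, column names, and values here are made up for illustration), label alignment and split-apply-combine might look like this:

```python
import pandas as pd

# Two Series with the same labels in a different order (illustrative data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "x", "y"])

# Addition matches values by label, not by position:
# x: 1 + 20, y: 2 + 30, z: 3 + 10
aligned_sum = a + b

# Split-apply-combine: split rows by "group", apply a mean to each part,
# and combine the per-group results into one Series.
df = pd.DataFrame({"group": ["a", "b", "a", "b"],
                   "value": [1.0, 2.0, 3.0, 4.0]})
group_means = df.groupby("group")["value"].mean()

print(aligned_sum)
print(group_means)
```

With plain NumPy arrays, `a + b` would add element by element in positional order, so the reordering would have to be tracked by hand.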

However, there are downsides to Pandas:

  • Pandas is basically a user-interface library and is not particularly suited to writing library code. The "automatic" features can lull you into using them repeatedly even when you don't need to, slowing down code that gets called over and over again.
  • Pandas usually takes up more memory, because it is generous in creating object arrays to solve otherwise sticky problems such as string handling.
  • If your use case is outside the scope of what Pandas was designed for, it quickly becomes awkward. But within what it was designed to do, Pandas is powerful and easy to use for quick data analysis.
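
A rough way to see the memory point, as a sketch under stated assumptions: the strings below are invented, and `dtype=object` is requested explicitly because the default string dtype may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# 3000 unique five-character strings (illustrative data).
words = [f"w{i:04d}" for i in range(3000)]

# NumPy fixed-width unicode array: every element occupies the same number of bytes.
fixed = np.array(words, dtype="U5")

# Object-dtype Series: an array of pointers to separate Python str objects.
# dtype=object is set explicitly (assumption: we want the object-array case).
as_objects = pd.Series(words, dtype=object)

fixed_bytes = fixed.nbytes
# deep=True also counts the memory of the Python objects being pointed to.
object_bytes = as_objects.memory_usage(index=False, deep=True)

print(fixed_bytes, object_bytes)
```

The exact numbers depend on the Python build, but the object-dtype Series should come out noticeably larger, in exchange for handling strings of any length.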
+10

I feel that characterizing Pandas as "improving on" Numpy / SciPy misses most of the point. Numpy / SciPy are quite focused on efficient numerical calculation and on solving the kind of numerical problems that scientists and engineers often face. If your problem starts with formulas and proceeds to a numerical solution from there, you are probably fine with those two.

Pandas is much better suited to problems that start with data stored in files or databases and that contain strings as well as numbers. Consider the problem of reading data from a database query. In Pandas, you can call read_sql_query directly and have a usable version of the data in one line. Numpy / SciPy has no equivalent functionality.
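
A small sketch of that one-liner, assuming an in-memory SQLite database with a made-up `readings` table (any supported connection would do):

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database (table and values are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("b", 2.5), ("a", 3.0)],
)
conn.commit()

# The one line in question: a query result becomes a DataFrame
# with named, typed columns.
df = pd.read_sql_query("SELECT sensor, value FROM readings", conn)
conn.close()

print(df)
```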

For data containing strings, or discrete rather than continuous values, Numpy / SciPy has no equivalent to groupby, or to database-style joins of tables on matching values.
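
For instance, a database-style join on a string key followed by a groupby could be sketched as follows (both tables are invented for the example):

```python
import pandas as pd

# Two small tables sharing a string key (illustrative data).
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cy"],
    "amount": [10, 20, 5, 7],
})
customers = pd.DataFrame({
    "customer": ["ann", "bob", "cy"],
    "city": ["Oslo", "Rome", "Oslo"],
})

# Join on matching values of "customer", like a SQL LEFT JOIN.
joined = orders.merge(customers, on="customer", how="left")

# Aggregate the joined rows by a discrete (string) column.
totals = joined.groupby("city")["amount"].sum()

print(totals)
```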

For time series, there is the massive advantage of handling the data with a datetime index, which lets you resample smoothly to different intervals, fill in missing values, and plot incredibly easily.
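
A minimal sketch of resampling and filling with a datetime index (dates and values are made up; plotting is left out to keep the example dependency-free):

```python
import pandas as pd

# A daily series with one day missing (2024-01-03); illustrative data.
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"])
ts = pd.Series([1.0, 2.0, 4.0], index=idx)

# Resample to a regular daily grid, forward-filling the missing day.
filled = ts.resample("D").ffill()

# Downsample to two-day bins and sum the values in each bin.
two_day = ts.resample("2D").sum()

print(filled)
print(two_day)
```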

Since many of my problems start life in spreadsheets, I am also very grateful for the relatively transparent handling of Excel files in both .xls and .xlsx formats through a uniform interface.

There is also a wider ecosystem, with packages such as seaborn that allow smoother statistical analysis and model fitting than is possible with the basic numpy / scipy tools.

+8

The main point is that it introduces new data structures, such as dataframes, panels, etc., and has good interfaces to other structures and libraries. So overall it is more a great extension of the Python ecosystem than an improvement over other libraries. For me, it is a great tool among others such as numpy and bcolz. I often use it to reshape my data and get an overview before starting on data mining and the like.

0