Pandas: fetch data for every nth row

I have a script that reads syslog files into pandas dataframes and produces diagrams from them. The graphs are fine for small data sets, but when I get large data sets, due to long data-collection time frames, the charts become too crowded to read.

I plan to resample the dataframe so that, once the data set exceeds a certain size, it is reduced to at most SIZE_LIMIT rows. This means I need to filter the dataframe so that every n = actual_size / SIZE_LIMIT rows are aggregated into one row of a new dataframe. The aggregation can be either an average, or simply the nth row taken as-is.
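Roughly, this is the kind of reduction I have in mind (a sketch only; df, the column name and SIZE_LIMIT are placeholders, not my real script):

 import numpy as np
 import pandas as pd

 df = pd.DataFrame({'value': np.random.rand(10000)})  # stand-in for the parsed syslog data
 SIZE_LIMIT = 1000

 n = max(len(df) // SIZE_LIMIT, 1)  # number of rows to collapse into one output row

 every_nth = df.iloc[::n]                               # take every nth row as-is
 averaged = df.groupby(np.arange(len(df)) // n).mean()  # or average each block of n rows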

I am not fully versed in pandas, so I might have missed some obvious tools.

+4
2 answers

You can use the pandas.qcut function on the index to divide it into equal-sized quantile bins. The number of quantiles you pass to qcut would be SIZE_LIMIT, so that each bin covers roughly actual_size/SIZE_LIMIT rows.

 In [1]: from pandas import *

 In [2]: df = DataFrame({'a': range(10000)})

 In [3]: df.head()
 Out[3]:
    a
 0  0
 1  1
 2  2
 3  3
 4  4

Here, grouping by qcut(df.index, 5) splits the index into 5 equally sized bins. Then I take the mean of each group.

 In [4]: df.groupby(qcut(df.index, 5)).mean()
 Out[4]:
                        a
 [0, 1999.8]        999.5
 (1999.8, 3999.6]  2999.5
 (3999.6, 5999.4]  4999.5
 (5999.4, 7999.2]  6999.5
 (7999.2, 9999]    8999.5
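Applied to the question, the same idea with SIZE_LIMIT bins instead of 5 would look roughly like this (a sketch; SIZE_LIMIT and the toy dataframe are placeholders, and it assumes a plain integer index):

 import pandas as pd

 SIZE_LIMIT = 1000
 df = pd.DataFrame({'a': range(10000)})

 # SIZE_LIMIT quantile bins over the index, then one mean per bin
 reduced = df.groupby(pd.qcut(df.index, SIZE_LIMIT)).mean()
 len(reduced)  # -> SIZE_LIMIT rows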
+5

Actually, I think you should not change the data itself, but rather build a view of it at the right interval for plotting. That view would contain the actual data points to be plotted.

A naive approach would be, for a computer screen for example, to compare how many points are in your interval with how many pixels you have available. So, to plot a frame of 10,000 points in a window 1,000 pixels wide, you take a slice with a step of 10, using this syntax (here whole_data is just a 1D array for the sake of the example):

 data_to_plot = whole_data[::10] 

This can have undesirable effects, in particular masking short peaks that can "slip through" the slicing operation unnoticed. An alternative would be to split your data into bins and then compute one data point (the maximum value, for example) for each bin. I believe these operations can be fast thanks to numpy/pandas array operations.
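A minimal sketch of that bin-and-reduce idea, assuming whole_data is a pandas Series and taking the maximum of each bin (bin_count is just a placeholder for the number of points you want to plot):

 import numpy as np
 import pandas as pd

 whole_data = pd.Series(np.random.randn(10000).cumsum())  # stand-in for the real series
 bin_count = 1000

 bin_size = max(len(whole_data) // bin_count, 1)
 bin_ids = np.arange(len(whole_data)) // bin_size   # bin label for every point
 data_to_plot = whole_data.groupby(bin_ids).max()   # keep the peak of each bin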

Hope this helps!

+12
