Pandas: fetch data for every nth row

I have a script that reads syslog files into pandas dataframes and produces diagrams from them. The graphs are fine for small data sets, but when I get large data sets, due to long data-collection time frames, the charts become too crowded to read.

I plan to resample the dataframe so that, once the data set exceeds a certain size, it is reduced to at most SIZE_LIMIT rows. This means I need to filter the dataframe so that every n = actual_size / SIZE_LIMIT rows are aggregated into one row of a new dataframe. The aggregation can be either an average, or simply the nth row taken as-is.
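Roughly, this is the kind of reduction I have in mind (a sketch only; df, the column name and SIZE_LIMIT are placeholders, not my real script):

 import numpy as np
 import pandas as pd

 df = pd.DataFrame({'value': np.random.rand(10000)})  # stand-in for the parsed syslog data
 SIZE_LIMIT = 1000

 n = max(len(df) // SIZE_LIMIT, 1)  # number of rows to collapse into one output row

 every_nth = df.iloc[::n]                               # take every nth row as-is
 averaged = df.groupby(np.arange(len(df)) // n).mean()  # or average each block of n rows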

I am not fully versed in pandas, so I might have missed some obvious tools.

+4
2 answers

You can use the pandas.qcut function on the index to divide it into equal-sized quantile bins. The number of quantiles you pass to qcut would be SIZE_LIMIT, so that each bin covers roughly actual_size/SIZE_LIMIT rows.

 In [1]: from pandas import *

 In [2]: df = DataFrame({'a': range(10000)})

 In [3]: df.head()
 Out[3]:
    a
 0  0
 1  1
 2  2
 3  3
 4  4

Here, grouping by qcut(df.index, 5) splits the index into 5 equally sized bins. Then I take the mean of each group.

 In [4]: df.groupby(qcut(df.index, 5)).mean()
 Out[4]:
                        a
 [0, 1999.8]        999.5
 (1999.8, 3999.6]  2999.5
 (3999.6, 5999.4]  4999.5
 (5999.4, 7999.2]  6999.5
 (7999.2, 9999]    8999.5
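Applied to the question, the same idea with SIZE_LIMIT bins instead of 5 would look roughly like this (a sketch; SIZE_LIMIT and the toy dataframe are placeholders, and it assumes a plain integer index):

 import pandas as pd

 SIZE_LIMIT = 1000
 df = pd.DataFrame({'a': range(10000)})

 # SIZE_LIMIT quantile bins over the index, then one mean per bin
 reduced = df.groupby(pd.qcut(df.index, SIZE_LIMIT)).mean()
 len(reduced)  # -> SIZE_LIMIT rows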
+5

Actually, I think you should not change the data itself, but rather build a view of it at the right interval for plotting. That view would contain the actual data points to be plotted.

A naive approach would be, for a computer screen for example, to compare how many points are in your interval with how many pixels you have available. So, to plot a frame of 10,000 points in a window 1,000 pixels wide, you take a slice with a step of 10, using this syntax (here whole_data is just a 1D array for the sake of the example):

 data_to_plot = whole_data[::10] 

This can have undesirable effects, in particular masking short peaks that can "slip through" the slicing operation unnoticed. An alternative would be to split your data into bins and then compute one data point (the maximum value, for example) for each bin. I believe these operations can be fast thanks to numpy/pandas array operations.
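A minimal sketch of that bin-and-reduce idea, assuming whole_data is a pandas Series and taking the maximum of each bin (bin_count is just a placeholder for the number of points you want to plot):

 import numpy as np
 import pandas as pd

 whole_data = pd.Series(np.random.randn(10000).cumsum())  # stand-in for the real series
 bin_count = 1000

 bin_size = max(len(whole_data) // bin_count, 1)
 bin_ids = np.arange(len(whole_data)) // bin_size   # bin label for every point
 data_to_plot = whole_data.groupby(bin_ids).max()   # keep the peak of each bin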

Hope this helps!

+12
