How to specify the number of rows that will be in the pandas data frame?

Question

I have a pandas DataFrame to which I add a row of data every second, as shown below.

df.loc[time.strftime("%Y-%m-%d %H:%M:%S")] = [reading1, reading2, reading3]
>>>df
                     sensor1 sensor2 sensor3
2015-04-14 08:50:23    5.4     5.6     5.7
2015-04-14 08:50:24    5.5     5.6     5.8
2015-04-14 08:50:26    5.2     5.3     5.4

If I keep doing this, I will eventually run into memory problems (each append works on the entire DataFrame).

I only need to keep the last X rows of data. That is, after the operation it would be:

>>>df
                     sensor1 sensor2 sensor3
(this row is gone)
2015-04-14 08:50:24    5.5     5.6     5.8
2015-04-14 08:50:26    5.2     5.3     5.4
2015-04-14 08:50:27    5.2     5.4     5.6

Is there a way to specify a maximum number of rows, so that adding a new row deletes the oldest row at the same time, WITHOUT doing "check the length of the DataFrame; if the length of the DataFrame > X, delete the first row; append the new row"?

Similar to this, but for Pandas DataFrame: stack overflow

python pandas dataframe data-analysis real-time-data

ps.george · Apr 13 '15 15:23

TheBlackCat · Answer 1 · 2015-04-13T15:28:29+0000

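pandas stores its data in arrays. Because the data sits in contiguous (or strided) memory, removing a row from the beginning and adding one at the end means copying everything to a new region of memory. There is no way around that, so you need a different data structure.

Edit: thinking about it a bit more, I see two approaches.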

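The simplest is a collections.deque with maxlen set. You append each new item to the end and, once the deque is full, the oldest item is dropped automatically. At the end you convert the contents to a DataFrame. The for loop is only an example, since your data presumably arrives some other way: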

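import time
from collections import deque

import pandas as pd

maxlen = 1000

# A deque with maxlen discards the oldest entry once it is full
dq = deque(maxlen=maxlen)

# `readings` stands for whatever source yields the three sensor values
for reading1, reading2, reading3 in readings:
    dq.append(pd.Series([reading1, reading2, reading3], 
                        index=['sensor1', 'sensor2', 'sensor3'], 
                        name=time.strftime("%Y-%m-%d %H:%M:%S")))

# Each Series becomes a column, so transpose to get one row per reading
df = pd.concat(dq, axis=1).T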
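A self-contained sketch of the same deque idea may make the behaviour easier to see. It stores plain tuples rather than Series, and the `readings` list, the small `maxlen`, and the fixed timestamps are illustrative assumptions, not part of the original answer:

```python
from collections import deque

import pandas as pd

maxlen = 3  # small cap so the effect is easy to see
sensor_cols = ["sensor1", "sensor2", "sensor3"]

# Hypothetical readings: (timestamp, sensor1, sensor2, sensor3)
readings = [
    ("2015-04-14 08:50:23", 5.4, 5.6, 5.7),
    ("2015-04-14 08:50:24", 5.5, 5.6, 5.8),
    ("2015-04-14 08:50:26", 5.2, 5.3, 5.4),
    ("2015-04-14 08:50:27", 5.2, 5.4, 5.6),
]

dq = deque(maxlen=maxlen)
for row in readings:
    dq.append(row)  # once the deque is full, the oldest tuple is dropped

# Build the DataFrame only at the point where it is actually needed
df = pd.DataFrame(list(dq), columns=["time"] + sensor_cols).set_index("time")
print(df)
```

Appending to the deque is cheap; the cost of building a DataFrame is paid only when you ask for one.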

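The other approach is to use a fixed-size DataFrame and write each new row at position i % maxlen, while also storing the item number in the DataFrame. You can then sort by item number. In your case you could sort by time instead, but this is more general. As before, the for loop is only for illustration. I also assume you have no real iterable to pass to enumerate; if you do, you don't need to track the index by hand as I do here: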

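import time

import numpy as np
import pandas as pd

maxlen = 1000

# Pre-allocate maxlen rows; object dtype lets strings and numbers share the frame
df = pd.DataFrame(np.full((maxlen, 5), np.nan, dtype=object),
                  columns=['index', 'time', 
                           'sensor1', 'sensor2', 'sensor3'])

i = 0
for reading1, reading2, reading3 in readings:
    # Overwrite the slot at i % maxlen and record the item number i
    df.loc[i%maxlen, :] = [i, time.strftime("%Y-%m-%d %H:%M:%S"),
                           reading1, reading2, reading3]
    i+=1

# Restore insertion order (df.sort was removed; sort_values replaces it)
df.sort_values('index', inplace=True)
del df['index']
df.set_index('time', drop=True, inplace=True)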
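For the case where the data really is an iterable, the counter can come from enumerate. The following is a minimal sketch under that assumption; the `readings` list and the tiny `maxlen` are made up for illustration:

```python
import numpy as np
import pandas as pd

maxlen = 3
columns = ["index", "time", "sensor1", "sensor2", "sensor3"]

# Hypothetical readings: (timestamp, sensor1, sensor2, sensor3)
readings = [
    ("2015-04-14 08:50:23", 5.4, 5.6, 5.7),
    ("2015-04-14 08:50:24", 5.5, 5.6, 5.8),
    ("2015-04-14 08:50:26", 5.2, 5.3, 5.4),
    ("2015-04-14 08:50:27", 5.2, 5.4, 5.6),
]

# Pre-allocated buffer; object dtype so strings and numbers can coexist
buf = pd.DataFrame(np.full((maxlen, len(columns)), np.nan, dtype=object),
                   columns=columns)

for i, (stamp, r1, r2, r3) in enumerate(readings):
    buf.loc[i % maxlen, :] = [i, stamp, r1, r2, r3]  # overwrite the oldest slot

# Drop unused slots, restore insertion order, and tidy up the result
df = (buf.dropna(subset=["index"])
         .sort_values("index")
         .drop(columns="index")
         .set_index("time")
         .astype(float))
print(df)
```

The buffer never grows beyond `maxlen` rows; sorting is deferred until the ordered view is needed.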

Alexander · Answer 2 · 2015-04-13T21:04:53+0000

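This solution initializes a DataFrame of the desired size filled with Nones. It then iterates over the new rows, shifting the DataFrame up by one each time and writing the new row into the last position.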

import numpy as np
import pandas as pd

max_rows = 5
cols = list('AB')

# Initialize a DataFrame of max_rows rows filled with None
df = pd.DataFrame({c: np.repeat([None], [max_rows]) for c in cols})

new_rows = [pd.DataFrame({'A': [1], 'B': [10]}), 
            pd.DataFrame({'A': [2], 'B': [11]}),
            pd.DataFrame({'A': [3], 'B': [12]}),
            pd.DataFrame({'A': [4], 'B': [13]}),
            pd.DataFrame({'A': [5], 'B': [14]}),
            pd.DataFrame({'A': [6], 'B': [15]}),
            pd.DataFrame({'A': [7], 'B': [16]})]

for row in new_rows:
    # Shift every row up by one (dropping the oldest), then fill the last row
    df = df.shift(-1)
    df.iloc[-1, :] = row.values[0]

>>> df
   A   B
0  3  12
1  4  13
2  5  14
3  6  15
4  7  16

Here is a more involved example using AAPL stock prices.

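from datetime import timedelta

import numpy as np
import pandas as pd
# Assumption: DataReader comes from the separate pandas-datareader package
from pandas_datareader.data import DataReader

aapl = DataReader("AAPL", data_source="yahoo", start="2014-1-1", end="2015-1-1")
cols = aapl.columns
df = pd.DataFrame({c: np.repeat([None], [max_rows]) for c in aapl.columns})[cols]
# Initialize a placeholder datetime index ending the day before the first data point
df.index = pd.date_range(end=aapl.index[0] + timedelta(days=-1), periods=max_rows, freq='D')

for timestamp, row in aapl.iterrows():
    df = df.shift(-1)
    df.iloc[-1, :] = row.values
    # Only the last label is replaced, so earlier rows keep their placeholder
    # dates; use df.index[1:] here to shift the labels along with the data
    idx = df.index[:-1].tolist()
    idx.append(timestamp)
    df.index = idx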

>>> df
              Open    High     Low   Close       Volume Adj Close
2013-12-28  112.58  112.71  112.01  112.01  1.44796e+07    111.57
2013-12-29   112.1  114.52  112.01  113.99   3.3721e+07    113.54
2013-12-30  113.79  114.77   113.7  113.91  2.75989e+07    113.46
2013-12-31  113.64  113.92  112.11  112.52  2.98815e+07    112.08
2014-12-31  112.82  113.13  110.21  110.38  4.14034e+07    109.95

S anand · Answer 3 · 2015-04-14T02:30:43+0000

One simple way is to pre-allocate the DataFrame and overwrite its rows in a circular fashion.

import numpy as np
import pandas as pd

# Say we want to limit to a thousand rows
N = 1000

# Create the DataFrame with N rows and 5 columns -- all NaNs
# (pd.np has been removed from pandas, so NumPy is imported directly)
data = pd.DataFrame(np.full((N, 5), np.nan))

# To check the length of the DataFrame, we'll need to .dropna().
len(data.dropna())              # Returns 0

# Keep a running counter of the next index to insert into
counter = 0

# Insertion always happens at that counter
data.loc[counter, :] = np.random.rand(5)

# ... and increment the counter, but when it exceeds N, set it to 0
counter = (counter + 1) % N

# Now, the DataFrame contains one row
len(data.dropna())              # Returns 1

# We can add several rows one after another. Let's add twice as many as N
for row in np.random.rand(2 * N, 5):
    data.loc[counter, :] = row
    counter = (counter + 1) % N

# Now that we added them, we still have only the last N rows
len(data)                       # Returns N

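This avoids resizing the data at all. It does have two caveats, though:

1. The rows are not stored in insertion order; to recover the order you need both data and counter.
2. Until N rows have been inserted, you have to call .dropna() to get the actual length (as shown above).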
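If insertion order does matter, it can be recovered from the buffer and the counter. The following is a sketch under the same setup, with a small hypothetical buffer and made-up values; `ordered` is a name introduced here for illustration:

```python
import numpy as np
import pandas as pd

N = 4
data = pd.DataFrame(np.full((N, 2), np.nan), columns=["a", "b"])
counter = 0

# Insert six hypothetical rows into a buffer that holds only four
for k in range(6):
    data.loc[counter, :] = [k, 10 * k]
    counter = (counter + 1) % N

# Rows from `counter` onward are the oldest; rows before it are the newest.
# dropna() removes slots that were never filled (note that it would also
# drop genuine rows containing NaN readings).
ordered = (pd.concat([data.iloc[counter:], data.iloc[:counter]])
             .dropna()
             .reset_index(drop=True))
print(ordered)
```

This makes a copy, so it is best done only when the ordered view is actually needed rather than on every insert.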

In most scenarios where I need a capped append like this, neither caveat applies, but your scenario may be different. In that case, @Alexander has a good solution using .shift().
