Memory optimization when selecting columns from a pandas DataFrame

I have a rather large pandas DataFrame (1.7 GB) from which I select a few columns to do a calculation (find the maximum value of the three selected columns). This operation seems to be memory-intensive, and I am trying to find a way to avoid the extra memory usage.

For the purposes of this question, I have simplified the data frame and used fake data. My code and the memory profile are shown below:

from memory_profiler import profile
import pandas as pnd
import random


@profile
def main():
    cols = [chr(i) for i in range(65,91)]
    d = {}
    n = 1000000
    for c in cols:
        d[c] = [random.randint(0,100) for i in range(n)]
    df = pnd.DataFrame(d)
    items = ['A','F','G']
    a = df[items]
    b = a.max(axis=0)


if __name__ == "__main__":
    main()


Line #    Mem usage    Increment   Line Contents
================================================
     6     42.3 MiB      0.0 MiB   @profile
     7                             def main():
     8     42.3 MiB      0.0 MiB       cols = [chr(i) for i in range(65,91)]
     9     42.3 MiB      0.0 MiB       d = {}
    10     42.3 MiB      0.0 MiB       n = 1000000
    11    240.6 MiB    198.3 MiB       for c in cols:
    12    240.6 MiB      0.0 MiB           d[c] = [random.randint(0,100) for i in range(n)]
    13    446.7 MiB    206.1 MiB       df = pnd.DataFrame(d)
    14    446.7 MiB      0.0 MiB       items = ['A','F','G']
    15    469.7 MiB     23.1 MiB       a = df[items]
    16    469.8 MiB      0.1 MiB       b = a.max(axis=0)

In the operation above, df[items] appears to use 23 MiB of memory. I suspect this is because it copies the selected columns of df into 'a'.
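The size of the increment is consistent with a copy: assuming the columns are stored as int64, three columns of 1,000,000 rows take 3 × 8 × 10^6 bytes, which is about 22.9 MiB. A minimal sketch of how this could be checked on a smaller frame is shown here. The frame size and variable names are illustrative assumptions, and whether the selection shares memory with the original may depend on the pandas version and its copy-on-write settings.

```python
import numpy as np
import pandas as pd

# Small stand-in for the real frame (sizes are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 101, size=(1000, 5)), columns=list("ABCDE"))

# Select a subset of columns with a list, as in the question.
subset = df[["A", "C"]]

# If this is False, the selection holds its own data, i.e. it is a copy.
shares = np.shares_memory(df["A"].to_numpy(), subset["A"].to_numpy())

# Bytes held by the selected columns (index excluded): rows * columns * 8 for int64.
subset_bytes = subset.memory_usage(index=False).sum()

print("shares memory with df:", shares)
print("bytes in subset:", subset_bytes)
```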

Is there any way to avoid this memory overhead when selecting columns?




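If you are willing to spend some extra CPU time to save memory, you can compute the maxes for all columns first and then select the ones you need: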

df.max()[['A','F','G']]
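As a minimal sketch, the reduce-then-select expression above can be compared with the select-then-reduce version from the question on a small fake frame. The per-column loop is an additional variant not taken from the original answer: it only reduces the columns of interest, and selecting a single column typically does not copy its data, though this is an assumption that may vary with the pandas version.

```python
import numpy as np
import pandas as pd

# Small fake frame with columns 'A'..'Z' (sizes are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 101, size=(1000, 26)),
    columns=[chr(i) for i in range(65, 91)],
)
items = ["A", "F", "G"]

# Baseline from the question: select the columns first, then reduce.
baseline = df[items].max(axis=0)

# Reduce every column, then select from the small result Series.
all_then_select = df.max()[items]

# Variant: reduce only the needed columns, one at a time.
per_column = pd.Series({c: df[c].max() for c in items})

print(per_column)
```

All three produce the same maxima; they differ only in how much intermediate data is materialized and how many columns are scanned.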

