Problem: I have data stored in csv files with the columns date / id / value. I have 15 files, each of which contains about 10-20 million lines. Each csv file covers a certain period, so the time indices do not overlap, but the columns do (new identifiers appear from time to time, old ones disappear). I initially ran the script without the pivot call, but then I ran into memory problems on my local machine (only 8 GB). Since there is a lot of redundancy in each file, the pivot helps a lot at first (about 2/3 less data), but now the concat is the problem. If I run the following script, the concat function runs "forever" (I always interrupt it manually after a while, > 2 h). Does concat/append have size limits (I have approximately 10,000-20,000 columns), or am I missing something? Any suggestions?
import pandas as pd

path = 'D:\\'
data = pd.DataFrame()
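import glob

# Rough sketch of the loop described above, not the exact original code
# (the file pattern and the assumption of a header row with columns
# date, id, value are mine): read each csv in long form, pivot it to
# wide form, and append it to one growing frame.
for f in glob.glob(path + '*.csv'):
    raw = pd.read_csv(f)                               # columns: date, id, value
    wide = raw.pivot(index='date', columns='id', values='value')
    data = pd.concat([data, wide])                     # the step that runs "forever"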
EDIT I: To clarify, each csv file has about 10-20 million lines and three columns; after the pivot is applied, this shrinks to about 2,000 lines but results in roughly 10,000 columns.
I can work around the memory problem by simply splitting the complete set of identifiers into subsets and running the necessary calculations for each subset separately, since they are independent for each identifier. I know this forces me to reload the same files n times, where n is the number of subsets, but it is still reasonably fast. A sketch of that workaround is below. I am still wondering why the concat/append never finishes.
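For reference, the workaround looks roughly like this (a sketch only; the chunk count, file pattern, and column names are assumptions, not my actual code):

import glob
import pandas as pd

path = 'D:\\'
files = glob.glob(path + '*.csv')
n_subsets = 10                                   # illustrative, not tuned

# First pass: collect the complete set of identifiers across all files.
all_ids = set()
for f in files:
    all_ids.update(pd.read_csv(f, usecols=['id'])['id'].unique())
all_ids = sorted(all_ids)

# Second pass: reload every file once per subset, keeping only the ids
# of the current subset so the pivoted frame stays small enough for 8 GB.
for chunk in (all_ids[i::n_subsets] for i in range(n_subsets)):
    pieces = []
    for f in files:
        raw = pd.read_csv(f)                     # columns: date, id, value
        raw = raw[raw['id'].isin(chunk)]
        pieces.append(raw.pivot(index='date', columns='id', values='value'))
    subset_data = pd.concat(pieces)
    # ...per-identifier calculations on subset_data go here...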
EDIT II: I tried to recreate the file structure with a simulation that is as close as possible to the actual data structure. I hope this is clear; I did not spend much time minimizing the simulation time, but it runs quickly on my machine.
import string
import random
import pandas as pd
import numpy as np
import math
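# Scaled-down sketch of the generator (sizes, date ranges and file names
# here are made up, just to reproduce the layout: long-format files with
# disjoint date ranges and an id population that drifts between files).
n_files = 3
ids_per_file = 50
dates_per_file = 100

def random_id(length=8):
    # build an identifier from random upper-case letters
    return ''.join(random.choice(string.ascii_uppercase) for _ in range(length))

id_pool = [random_id() for _ in range(2 * ids_per_file)]

for k in range(n_files):
    # consecutive, non-overlapping date ranges so the files do not overlap in time
    dates = pd.date_range('2015-01-01', periods=dates_per_file, freq='D') + pd.Timedelta(days=k * dates_per_file)
    ids = random.sample(id_pool, ids_per_file)   # the id set drifts from file to file
    frame = pd.DataFrame(
        [(d, i, np.random.randn()) for d in dates for i in ids],
        columns=['date', 'id', 'value'],
    )
    frame.to_csv('sim_{}.csv'.format(k), index=False)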