Memory error using openpyxl and big data

I wrote a script that should read many excel files from a folder (about 10,000). This script loads an excel file (some of them have more than 2000 lines) and reads one column to count the number of lines (material check). If the number of lines is not equal to the specified number, it writes a warning to the log.

The problem occurs when a script reads over 1000 excel files. Then, when it throws a memory error, and I don't know where the problem might be. Previously, the script read two csv files with 14,000 lines and saved them in a list. These lists contain the identifier for the excel file and its corresponding number of lines. If this number of lines is not equal to the number of lines in the excel file, it writes a warning. Maybe the problem with reading these lists?

I use openpyxl to download books, do I need to close them before opening the next one?

This is my code:

# -*- coding: utf-8 -*- import os from openpyxl import Workbook import glob import time import csv from time import gmtime,strftime from openpyxl import load_workbook folder = '' conditions = 0 a = 0 flight_error = 0 condition_error = 0 typical_flight_error = 0 SP_error = 0 cond_numbers = [] with open('Conditions.csv','rb') as csv_name: # Abre el fichero csv donde estarán las equivalencias csv_read = csv.reader(csv_name,delimiter='\t') for reads in csv_read: cond_numbers.append(reads) flight_TF = [] with open('vuelo-TF.csv','rb') as vuelo_TF: csv_read = csv.reader(vuelo_TF,delimiter=';') for reads in csv_read: flight_TF.append(reads) excel_files = glob.glob('*.xlsx') for excel in excel_files: print "Leyendo excel: "+excel wb = load_workbook(excel) ws = wb.get_sheet_by_name('Control System') flight = ws.cell('A7').value typical_flight = ws.cell('B7').value a = 0 for row in range(6,ws.get_highest_row()): conditions = conditions + 1 value_flight = int(ws.cell(row=row,column=0).value) value_TF = ws.cell(row=row,column=1).value value_SP = int(ws.cell(row=row,column=4).value) if value_flight == '': break if value_flight != flight: flight_error = 1 # Si no todos los flight numbers dentro del vuelo son iguales if value_TF != typical_flight: typical_flight_error = 2 # Si no todos los typical flight dentro del vuelo son iguales if value_SP != 100: SP_error = 1 for cond in cond_numbers: if int(flight) == int(cond[0]): conds = int(cond[1]) if conds != int(conditions): condition_error = 1 # Si el número de condiciones no se corresponde con el esperado for vuelo_TF in flight_TF: if int(vuelo_TF[0]) == int(flight): TF = vuelo_TF[1] if typical_flight != TF: typical_flight_error = 1 # Si el vuelo no coincide con el respectivo typical flight if flight_error == 1: today = datetime.datetime.today() time = today.strftime(" %Y-%m-%d %H.%M.%S") log = open('log.txt','aw') message = time+': Los flight numbers del vuelo '+str(flight)+' no coinciden.\n' log.write(message) log.close() flight_error = 0 if condition_error == 1: today = datetime.datetime.today() time = today.strftime(" %Y-%m-%d %H.%M.%S") log = open('log.txt','aw') message = time+': El número de condiciones del vuelo '+str(flight)+' no coincide. Condiciones esperadas: '+str(int(conds))+'. Condiciones obtenidas: '+str(int(conditions))+'.\n' log.write(message) log.close() condition_error = 0 if typical_flight_error == 1: today = datetime.datetime.today() time = today.strftime(" %Y-%m-%d %H.%M.%S") log = open('log.txt','aw') message = time+': El vuelo '+str(flight)+' no coincide con el typical flight. Typical flight respectivo: '+TF+'. Typical flight obtenido: '+typical_flight+'.\n' log.write(message) log.close() typical_flight_error = 0 if typical_flight_error == 2: today = datetime.datetime.today() time = today.strftime(" %Y-%m-%d %H.%M.%S") log = open('log.txt','aw') message = time+': Los typical flight del vuelo '+str(flight)+' no son todos iguales.\n' log.write(message) log.close() typical_flight_error = 0 if SP_error == 1: today = datetime.datetime.today() time = today.strftime(" %Y-%m-%d %H.%M.%S") log = open('log.txt','aw') message = time+': Hay algún Step Percentage del vuelo '+str(flight)+' menor que 100.\n' log.write(message) log.close() SP_error = 0 conditions = 0 

If statements to check and write warning logs.

I am using Windows XP with 8 GB of RAM and Intel xeon w3505 (dual core, 2.53 GHz).

+6
source share
2 answers

By default, the openpyxl implementation will store all available cells in memory. I suggest you use the Optimized Reader (link - https://openpyxl.readthedocs.org/en/latest/optimized.html ) instead

In code: -

 wb = load_workbook(file_path, use_iterators = True) 

When downloading a book, go use_iterators = True . Then enter the sheet and cells, for example:

 for row in sheet.iter_rows(): for cell in row: cell_text = cell.value 

This will reduce the memory area by 5-10%

UPDATE: in version 2.4.0 use_iterators = True option is removed. In newer versions, openpyxl.writer.write_only.WriteOnlyWorksheet is introduced to dump large amounts of data.

 from openpyxl import Workbook wb = Workbook(write_only=True) ws = wb.create_sheet() # now we'll fill it with 100 rows x 200 columns for irow in range(100): ws.append(['%d' % i for i in range(200)]) # save the file wb.save('new_big_file.xlsx') 

The code below has just not been verified, just copied from the link above.

Thanks @SdaliM for the info.

+9
source

In recent versions of openpyxl, you have to download and read a huge source book with the argument read_only=True and create / write a huge destination book with write_only=True mode:

https://openpyxl.readthedocs.io/en/latest/optimized.html

0
source

All Articles