Python: find the highest row in a given column

I am new to Stack Overflow and only recently learned basic Python. This is my first time using openpyxl. I used xlrd and xlsxwriter before and managed to write some useful programs with them, but now I need a .xlsx reader and writer.

There is a file that I need to read and then edit with data already stored in the code. Suppose the .xlsx file has five columns with data: A, B, C, D, E. Column A has more than 1000 rows of data. Column D has 150 rows of data.

Basically, I want the program to find the last row with data in a given column (say, D), then write the stored data variable to the next available row (last row + 1) in column D.

The problem is that I cannot use ws.get_highest_row(), because it returns row 1000, the last row of column A.

For now, this is all I have:

    from openpyxl import load_workbook

    data = 'xxx'
    wb = load_workbook('book.xlsx', use_iterators=True)
    ws = wb.get_sheet_by_name('Sheet1')
    last_row = ws.get_highest_row()

Obviously this does not do what I want: last_row comes back as 1000.

5 answers

Here's how to do it using Pandas.

It's easy to get the last non-empty row in Pandas using last_valid_index.

There may be better ways to write the resulting DataFrame back to your xlsx file, but according to the docs, this very dumb way is actually how it is done in openpyxl.
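As a small illustration of what last_valid_index returns, here is a sketch on a made-up DataFrame (the column names and values are invented for the example, not taken from the asker's file). It assumes the default consecutive integer index that read_excel normally produces.

```python
import pandas as pd

# Hypothetical data: column A is full, column D stops after two values.
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "D": [10.0, 20.0, None, None],
})

# Index label of the last non-null entry in column D.
last_label = df["D"].last_valid_index()

# With a default 0-based consecutive index, the next free position
# is simply the label plus one.
next_free_position = last_label + 1

print(last_label, next_free_position)
```

If the DataFrame came from read_excel with a header row, the matching worksheet row would be the 0-based position plus 2 (one for the header, one because Excel rows start at 1).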

Let's say you start with this simple sheet:

Original worksheet

Say we want to put xxx in column C:

    import openpyxl as xl
    import pandas as pd

    wb = xl.load_workbook('deleteme.xlsx')
    ws = wb['Sheet1']
    df = pd.read_excel('deleteme.xlsx')

    def replace_first_null(df, col_name, value):
        """
        Return a copy of df with `value` written into the row just
        after the last non-null entry of df.`col_name`.
        """
        return_df = df.copy()
        idx = list(df.index)
        last_valid = df[col_name].last_valid_index()
        last_valid_row_number = idx.index(last_valid)
        # This next line has mixed number and string indexing
        # but it should be ok, since df is coming from an
        # Excel sheet and should have a consecutive index
        return_df.loc[last_valid_row_number + 1, col_name] = value
        return return_df

    def write_df_to_worksheet(ws, df):
        """
        Write the values in df to the worksheet ws in place
        """
        for i, col in enumerate(df):
            for j, val in enumerate(df[col]):
                if not pd.isnull(val):
                    # Python is zero indexed, so add one
                    # (plus an extra one to take account
                    # of the header row!)
                    ws.cell(row=j + 2, column=i + 1).value = val

    # Here the actual replacing happens
    replaced = replace_first_null(df, 'C', 'xxx')
    write_df_to_worksheet(ws, replaced)
    wb.save('changed.xlsx')

This results in:

Edited excel file


The problem is that get_highest_row() itself uses RowDimension instances to determine the maximum row in the sheet. RowDimension has no information about columns, which means we cannot use it to solve your problem and have to approach it differently.

Here is a somewhat "ugly" openpyxl-specific option that does not work if use_iterators=True is used:

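    from openpyxl.utils import coordinate_from_string

    def get_maximum_row(ws, column):
        # ws._cells is private; this code assumes its keys are coordinate
        # strings such as "D150". Compare the parsed column letter so that
        # "A" does not also match "AA", "AB", ...
        return max(coordinate_from_string(cell)[-1]
                   for cell in ws._cells
                   if coordinate_from_string(cell)[0] == column)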

Usage:

    print(get_maximum_row(ws, "A"))
    print(get_maximum_row(ws, "B"))
    print(get_maximum_row(ws, "C"))
    print(get_maximum_row(ws, "D"))

Aside from this, I would follow @LondonRob's suggestion to parse the contents with pandas and let it do the job.


If this is an openpyxl limitation, you can try one of the following methods:

  • Convert the Excel file to CSV and use the Python csv module.
  • Unzip the Excel file using zipfile, then go to the "xl/worksheets" subfolder of the uncompressed file, where you will find an XML file for each of your worksheets. From there, you can parse and update it using BeautifulSoup or lxml.
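The CSV route from the first bullet could look roughly like the sketch below. The function name last_filled_row and the sample data are made up for illustration, and the sketch assumes the sheet has already been exported to CSV by some other means.

```python
import csv
import io

def last_filled_row(csv_file, column_index):
    """Return the 1-based number of the last row whose cell in the
    given 0-based column is non-empty, or 0 if the column is empty."""
    last = 0
    for row_number, row in enumerate(csv.reader(csv_file), start=1):
        # Short rows simply have no cell in this column.
        if column_index < len(row) and row[column_index].strip():
            last = row_number
    return last

# Hypothetical export: column D (index 3) has data in rows 1 and 2 only.
sample = io.StringIO("a,b,c,d\n1,2,3,4\n5,,,\n6,,,\n")
last = last_filled_row(sample, 3)
print(last, last + 1)  # last filled row, next free row
```

Note that this only locates the row. Writing the value back into the .xlsx file still needs a spreadsheet library.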

The xlsx Excel format is a compressed (zipped) folder containing a tree of XML files. You can find the specification here.
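A minimal sketch of the unzip-and-parse idea follows. It uses the standard library xml.etree.ElementTree instead of BeautifulSoup or lxml, purely to stay dependency-free. The function name, the sheet path "xl/worksheets/sheet1.xml", and the tiny in-memory archive are assumptions for the demonstration. It also assumes every cell element carries an r attribute such as "D2", which Excel normally writes but which is optional in the format.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by worksheet XML parts.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

def last_row_in_column(xlsx_file, sheet_path, column):
    """Return the highest row number holding a value in `column`
    (a letter such as "D"), or 0 if the column has no values."""
    with zipfile.ZipFile(xlsx_file) as archive:
        root = ET.fromstring(archive.read(sheet_path))
    last = 0
    for cell in root.iterfind(".//m:c", NS):
        match = re.fullmatch(r"([A-Z]+)(\d+)", cell.get("r", ""))
        # A cell counts as filled if it has a <v> value or an inline string.
        has_value = (cell.find("m:v", NS) is not None
                     or cell.find("m:is", NS) is not None)
        if match and match.group(1) == column and has_value:
            last = max(last, int(match.group(2)))
    return last

# Build a tiny stand-in archive in memory (not a complete workbook).
SHEET_XML = (
    '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    "<sheetData>"
    '<row r="1"><c r="A1"><v>1</v></c><c r="D1"><v>10</v></c></row>'
    '<row r="2"><c r="A2"><v>2</v></c><c r="D2"><v>20</v></c></row>'
    '<row r="3"><c r="A3"><v>3</v></c></row>'
    "</sheetData></worksheet>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/worksheets/sheet1.xml", SHEET_XML)

print(last_row_in_column(buffer, "xl/worksheets/sheet1.xml", "D"))
```

Reading is the easy half. Updating the XML by hand also means keeping shared strings and other parts of the archive consistent, so this route is mostly useful for inspection.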


Figured I would start giving back to the Stack Overflow community. alecxe's solution did not work for me, and I did not want to use Pandas etc., so I did this instead. It checks from the end of the sheet and gives the next available (empty) row in column D.

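    def unassigned_row_in_column_D():
        # Start at the last used row of the whole sheet and walk upwards
        # until a non-empty cell in column D is found.
        ws_max_row = int(ws.max_row)
        cell_coord = 'D' + str(ws_max_row)
        # Stop at row 0 so an entirely empty column D yields 'D1'.
        while ws_max_row > 0 and ws[cell_coord].value is None:
            ws_max_row -= 1
            cell_coord = 'D' + str(ws_max_row)
        ws_max_row += 1
        return 'D' + str(ws_max_row)

    # then add variable data = 'xxx' to that cell
    ws[unassigned_row_in_column_D()].value = data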

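alecxe's solution does not work for me. This is probably a matter of the openpyxl version. I'm on 2.4.1, and this is what worked after a little tweaking:

    def get_max_row_in_col(ws, column):
        # In this version the keys of the private ws._cells dict are
        # (row, column) tuples, so `column` is a 1-based column number.
        return max([cell[0] for cell in ws._cells if cell[1] == column])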
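Since _cells is private and its key format evidently differs between versions, here is a sketch that relies only on public openpyxl calls (ws.max_row, ws.cell and column_index_from_string). It assumes a reasonably recent openpyxl and a normal, not read-only, workbook. The helper name last_row_in_column and the in-memory demo data are made up for the example.

```python
from openpyxl import Workbook
from openpyxl.utils import column_index_from_string

def last_row_in_column(ws, column_letter):
    """Return the last row with a non-empty cell in the given column,
    or 0 if the column is empty. Scans upwards from ws.max_row."""
    col = column_index_from_string(column_letter)
    for row in range(ws.max_row, 0, -1):
        if ws.cell(row=row, column=col).value is not None:
            return row
    return 0

# Demo workbook mirroring the question: A goes to row 1000, D to row 150.
wb = Workbook()
ws = wb.active
ws["A1000"] = "a"
for r in range(1, 151):
    ws.cell(row=r, column=4, value=r)

last = last_row_in_column(ws, "D")
# Write the stored value into the next free row of column D.
ws.cell(row=last + 1, column=4, value="xxx")
print(last, ws["D151"].value)
```

The scan touches every row between ws.max_row and the last filled cell, which is fine for a few thousand rows but may be slow on very large sheets.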
