Apache POI Streaming (SXSSF) for reading

I need to read large Excel files and import their data into my application.

Since POI needs a large amount of heap to work, often throwing OutOfMemory errors, I found that there is a Streaming API for processing Excel data serially (rather than loading the whole file into memory).

I created an xlsx workbook with one worksheet, typed several values into its cells, and came up with the following code to try reading it:

    public static void main(String[] args) throws Throwable {
        // keep 100 rows in memory, exceeding rows will be flushed to disk
        SXSSFWorkbook wb = new SXSSFWorkbook(new XSSFWorkbook(new FileInputStream("C:\\test\\tst.xlsx")));
        SXSSFSheet sheet = (SXSSFSheet) wb.getSheetAt(0);
        Row row = sheet.getRow(0); // row is always null
        while (row.iterator().hasNext()) { // -> NullPointerException
            System.out.println(row.getCell(0).getStringCellValue());
        }
    }

However, although it retrieves the worksheet correctly, the sheet always returns empty (null) rows.

I have looked through several examples of the Streaming API on the Internet, but none of them read existing files; they are all about generating Excel files.

Is it possible to read data from existing .xlsx files in a stream?

1 answer

After digging a bit more, I came across this library:

If you have used Apache POI in the past to read Excel files, you probably noticed that it is not very memory efficient. Reading in an entire workbook causes a severe spike in memory usage, which can wreak havoc on a server.

There are plenty of good reasons why Apache POI has to read in the whole workbook, but most of them come down to the fact that the library lets you read and write at random addresses. If (and only if) you just want to read the contents of an Excel file in a fast and memory-efficient way, you probably don't need this ability. Unfortunately, the only thing in the POI library for reading a workbook as a stream requires your code to use a SAX-like parser. All of the friendly classes, such as Row and Cell, are missing from that API.

This library serves as a wrapper around that streaming API while preserving the syntax of the standard POI API. Read on to see if it is right for you.

    InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
    StreamingReader reader = StreamingReader.builder()
            .rowCacheSize(100)     // number of rows to keep in memory (defaults to 10)
            .bufferSize(4096)      // buffer size to use when reading InputStream to file (defaults to 1024)
            .sheetIndex(0)         // index of sheet to use (defaults to 0)
            .sheetName("sheet1")   // name of sheet to use (overrides sheetIndex)
            .read(is);             // InputStream or File for XLSX file (required)

POI also has a SAX event API, which reads a document and parses its contents through events.

If memory footprint is an issue, then for XSSF you can get at the underlying XML data and process it yourself. This is intended for intermediate developers who are willing to learn a little of the low-level structure of .xlsx files and who are happy processing XML in Java. It is relatively simple to use, but requires a basic understanding of the file structure. The advantage is that you can read an XLSX file with a relatively small memory footprint.
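To make the "process the XML yourself" idea concrete, here is a minimal sketch that uses only the JDK (java.util.zip plus the built-in SAX parser) instead of POI's own event API. The class name SheetSaxSketch, the CellSink callback and the part name xl/worksheets/sheet1.xml are assumptions made for illustration: a real workbook maps sheet names to parts through xl/workbook.xml and its relationships, and this sketch does not interpret dates, styles or formulas. It also assumes the XML uses un-prefixed element names, which is what Excel normally writes.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of POI: streams cell values out of an .xlsx.
class SheetSaxSketch {

    /** Called once per cell, so no row or sheet is ever held in memory. */
    public interface CellSink {
        void cell(String ref, String value);
    }

    /** Reads xl/sharedStrings.xml; note that this table is kept in memory in full. */
    public static List<String> readSharedStrings(InputStream in) throws Exception {
        List<String> strings = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inT;

            @Override
            public void startElement(String uri, String local, String qName, Attributes a) {
                if (qName.equals("si")) text.setLength(0); // a new string item begins
                inT = qName.equals("t");                   // only <t> holds text
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                if (inT) text.append(ch, start, len);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                inT = false;
                if (qName.equals("si")) strings.add(text.toString());
            }
        });
        return strings;
    }

    /** Streams one worksheet part and reports each cell reference with its value. */
    public static void readSheet(InputStream in, List<String> shared, CellSink sink)
            throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private String ref;
            private String type;
            private boolean capture;

            @Override
            public void startElement(String uri, String local, String qName, Attributes a) {
                if (qName.equals("c")) {          // <c r="A1" t="s">
                    ref = a.getValue("r");
                    type = a.getValue("t");
                    text.setLength(0);
                }
                // <v> holds the stored value, <t> the text of an inline string
                capture = qName.equals("v") || qName.equals("t");
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                if (capture) text.append(ch, start, len);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                capture = false;
                if (qName.equals("c")) {
                    String raw = text.toString();
                    // t="s" means the value is an index into the shared strings table
                    boolean isShared = "s".equals(type) && !raw.isEmpty();
                    sink.cell(ref, isShared ? shared.get(Integer.parseInt(raw)) : raw);
                }
            }
        });
    }

    /** Opens the .xlsx as a ZIP archive and streams the given worksheet part. */
    public static void readXlsx(String path, String sheetPart, CellSink sink) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            List<String> shared = new ArrayList<>();
            ZipEntry sst = zip.getEntry("xl/sharedStrings.xml");
            if (sst != null) {
                try (InputStream in = zip.getInputStream(sst)) {
                    shared = readSharedStrings(in);
                }
            }
            ZipEntry sheet = zip.getEntry(sheetPart);
            if (sheet == null) throw new IllegalArgumentException("No such part: " + sheetPart);
            try (InputStream in = zip.getInputStream(sheet)) {
                readSheet(in, shared, sink);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java SheetSaxSketch <file.xlsx>");
            return;
        }
        // Assumed part name; the first sheet is usually stored here.
        readXlsx(args[0], "xl/worksheets/sheet1.xml",
                (ref, value) -> System.out.println(ref + " = " + value));
    }
}
```

Only the shared strings table and the cell currently being parsed are held in memory, which is why this style scales to large sheets. The price is that there is no random access: cells arrive strictly in document order.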
