How to parse UTF-8 characters in Excel files using POI

I am using Apache POI to parse XLS and XLSX files. However, I cannot correctly extract non-ASCII characters, such as Chinese or Japanese text, from an Excel spreadsheet. I figured out how to extract data from a UTF-8 encoded CSV or tab-delimited file, but have had no luck with Excel files. Can anyone help?

(Edit: code snippet from comments)

    HSSFSheet sheet = workbook.getSheet(worksheet);
    HSSFEvaluationWorkbook ewb = HSSFEvaluationWorkbook.create(workbook);
    while (rowCtr <= lastRow && !rowBreakOut) {
        Row row = sheet.getRow(rowCtr); // rows.next();
        for (int col = firstCell; col < lastCell && !breakOut; col++) {
            Cell cell;
            cell = row.getCell(col, Row.RETURN_BLANK_AS_NULL);
            if (ctype == Cell.CELL_TYPE_STRING) {
                sValue = cell.getStringCellValue();
                log.warn("String value = " + sValue);
                String encoded = URLEncoder.encode(sValue, "UTF-8");
                log.warn("URL-encoded with UTF-8: " + encoded);
                ....
+7
4 answers

I had the same problem when extracting Persian text from an Excel file. I was using Eclipse, so I went to Project -> Properties and changed the "text file encoding" to UTF-8, which solved the problem.
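
The Eclipse setting only changes how text is encoded on output, which suggests the strings were read correctly and got damaged when printed. Outside Eclipse, a rough sketch of the same idea is to print through a PrintStream with an explicit UTF-8 charset. The class name, the helper and the sample text below are made up for illustration; the text stands in for a value read from a cell.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of POI.
class Utf8OutputDemo {

    // Prints the text through a PrintStream that uses the named charset
    // and returns the raw bytes that were written.
    static byte[] printWith(String text, String charsetName) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "\u0633\u0644\u0627\u0645"; // Persian "salam"

        // US-ASCII cannot represent these characters, so each one is written as '?'.
        byte[] ascii = printWith(text, "US-ASCII");
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));

        // UTF-8 keeps them: decoding the bytes gives the original text back.
        byte[] utf8 = printWith(text, "UTF-8");
        System.out.println(text.equals(new String(utf8, StandardCharsets.UTF_8)));

        // Make System.out itself write UTF-8 from here on.
        System.setOut(new PrintStream(System.out, true, "UTF-8"));
        System.out.println(text);
    }
}
```

Whether the last line displays correctly still depends on the terminal or console being set to UTF-8 as well.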

+7

In POI you can do it like this:

    Workbook wb = new HSSFWorkbook();
    Sheet sheet = wb.createSheet("new sheet");

    // Create a row and put some cells in it. Rows are 0 based.
    Row row = sheet.createRow(1);

    // Create a new font and alter it.
    Font font = wb.createFont();
    font.setCharSet(FontCharset.ARABIC.getValue());
    font.setFontHeightInPoints((short) 24);
    font.setFontName("B Nazanin");
    font.setItalic(true);
    font.setStrikeout(true);

    // Fonts are set into a style so create a new one to use.
    CellStyle style = wb.createCellStyle();
    style.setFont(font);

    // Create a cell and put a value in it.
    Cell cell = row.createCell(1);
    cell.setCellValue("سلام");
    cell.setCellStyle(style);

    // Write the output to a file
    FileOutputStream fileOut = new FileOutputStream("workbook.xls");
    wb.write(fileOut);
    fileOut.close();

You can also pick a different character set from FontCharset.

+3

The solution is simple. To read string values in any encoding (non-English characters), just use the following method:

 sValue = cell.getRichStringCellValue().getString(); 

instead of:

 sValue = cell.getStringCellValue(); 

This applies to UTF-8 encoded characters such as Chinese, Arabic, or Japanese.

PS: if anyone uses the nullpunkt/excel-to-json command-line utility, which is built on the Apache POI library, edit the converter file ExcelToJsonConverter.java and replace the occurrences of getStringCellValue() so that non-English characters are no longer read as "??".
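
When the output shows "??", it helps to check whether the question marks are really in the string or only appear when it is displayed. A small sketch for that, assuming nothing about POI: dump the code points of the value. The class and method names here are made up for illustration.

```java
// Hypothetical diagnostic helper; not part of POI or excel-to-json.
class CodePointDump {

    // Formats every code point of the text as U+XXXX, separated by spaces.
    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Intact Japanese text shows its real code points.
        System.out.println(dump("\u65E5\u672C")); // U+65E5 U+672C

        // Text that was already lost shows literal question marks.
        System.out.println(dump("??"));           // U+003F U+003F
    }
}
```

If the dump shows the real code points, the value was read correctly and the damage happens later, when it is printed or written out.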

+1

Get the bytes as UTF-8 as follows:

 cell.getStringCellValue().getBytes(Charset.forName("UTF-8")); 
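
A minimal sketch of the round trip, using StandardCharsets.UTF_8 in place of Charset.forName("UTF-8") (both refer to the same charset). The class name and the sample value are made up; the value stands in for the result of cell.getStringCellValue().

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class Utf8BytesDemo {

    // Encodes the string as UTF-8 bytes.
    static byte[] toUtf8(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes back into a String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String value = "\u4E2D\u6587"; // Chinese sample text, two characters
        byte[] bytes = toUtf8(value);
        System.out.println(bytes.length);                  // 6: three bytes per character here
        System.out.println(fromUtf8(bytes).equals(value)); // true
    }
}
```

The bytes are only useful if whatever consumes them also decodes them as UTF-8.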
0