How to parse UTF-8 characters in Excel files using POI

Question

I am using POI to parse XLS and XLSX files. However, I cannot correctly extract non-ASCII text, such as Chinese or Japanese characters, from an Excel spreadsheet. I figured out how to extract data from a UTF-8 encoded CSV or tab-delimited file, but have had no luck with Excel files. Can anyone help?

(Edit: code snippet from comments)

    HSSFSheet sheet = workbook.getSheet(worksheet);
    HSSFEvaluationWorkbook ewb = HSSFEvaluationWorkbook.create(workbook);
    while (rowCtr <= lastRow && !rowBreakOut) {
        Row row = sheet.getRow(rowCtr); // rows.next();
        for (int col = firstCell; col < lastCell && !breakOut; col++) {
            Cell cell;
            cell = row.getCell(col, Row.RETURN_BLANK_AS_NULL);
            if (ctype == Cell.CELL_TYPE_STRING) {
                sValue = cell.getStringCellValue();
                log.warn("String value = " + sValue);
                String encoded = URLEncoder.encode(sValue, "UTF-8");
                log.warn("URL-encoded with UTF-8: " + encoded);
                ....
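
A note on reading the two log lines above: the URL-encoded line does not depend on how the log or console renders characters, so it can help tell whether the string is already damaged in memory or only garbled on display. The following is a minimal JDK-only sketch of that check. The class and method names are hypothetical and are not part of POI.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of POI: shows what URL-encoding reveals.
class UrlEncodeCheck {
    // URL-encodes text as UTF-8. Every JVM supports UTF-8, so the checked
    // exception is converted to an unchecked one.
    static String encodeUtf8(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String intact = "\u4e2d\u6587"; // two CJK characters, written as escapes
        String damaged = "??";          // what a lossy conversion leaves behind
        System.out.println(encodeUtf8(intact));  // %E4%B8%AD%E6%96%87
        System.out.println(encodeUtf8(damaged)); // %3F%3F
    }
}
```

If the log shows multi-byte escapes such as %E4%B8%AD, the value read from the cell is probably intact and the problem lies in how it is displayed. If it shows %3F, the characters were likely lost before this point.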

java excel utf-8 cjk apache-poi

user1198370 Feb 08 '12 at 22:28

4 answers

Mohsen · Answer 1 · 2012-02-25T21:15:08+0000

I had the same problem when extracting Persian text from an Excel file. I was using Eclipse, and simply going to Project -> Properties and changing "Text file encoding" to UTF-8 solved the problem.
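
One plausible reason this setting helps is that the text was read correctly and only garbled when printed, because the output stream used a charset that cannot represent it. Here is a minimal JDK-only sketch of that effect. The class and method names are hypothetical, and the in-memory buffer stands in for a console.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical demo: the same String printed through two different charsets.
class ConsoleEncodingDemo {
    // Prints text through a PrintStream that encodes with charsetName,
    // then decodes the produced bytes as UTF-8 to show what a viewer sees.
    static String printWith(String text, String charsetName) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(buffer, true, charsetName);
            out.print(text);
            out.flush();
            return buffer.toString("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u0633\u0644\u0627\u0645"; // four Persian/Arabic letters
        System.out.println(printWith(text, "US-ASCII"));           // ????
        System.out.println(text.equals(printWith(text, "UTF-8"))); // true
    }
}
```

The string itself never changes. Only the stream's charset decides whether the characters survive or turn into question marks.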

oveis beheshti · Answer 2 · 2013-11-28T18:09:26+0000

In POI you can do it like this:

    Workbook wb = new HSSFWorkbook();
    Sheet sheet = wb.createSheet("new sheet");

    // Create a row and put some cells in it. Rows are 0 based.
    Row row = sheet.createRow(1);

    // Create a new font and alter it.
    Font font = wb.createFont();
    font.setCharSet(FontCharset.ARABIC.getValue());
    font.setFontHeightInPoints((short)24);
    font.setFontName("B Nazanin");
    font.setItalic(true);
    font.setStrikeout(true);

    // Fonts are set into a style so create a new one to use.
    CellStyle style = wb.createCellStyle();
    style.setFont(font);

    // Create a cell and put a value in it.
    Cell cell = row.createCell(1);
    cell.setCellValue("سلام");
    cell.setCellStyle(style);

    // Write the output to a file
    FileOutputStream fileOut = new FileOutputStream("workbook.xls");
    wb.write(fileOut);
    fileOut.close();

You can also pick a different character set from FontCharset.

Yacoub oweis · Answer 3 · 2017-02-14T11:48:08+0000

The solution is simple. To read string values that contain non-English characters, just use the following method:

 sValue = cell.getRichStringCellValue().getString();

instead of:

 sValue = cell.getStringCellValue();

This applies to non-ASCII text such as Chinese, Arabic, or Japanese characters.

P.S. If you use the nullpunkt/excel-to-json command-line utility, which is built on the Apache POI library, edit the file converter/ExcelToJsonConverter.java and replace the calls to getStringCellValue() so that non-English characters are no longer read as "???".

ybn · Answer 4 · 2014-06-25T14:42:01+0000

Get the bytes as UTF-8 as follows:

 cell.getStringCellValue().getBytes(Charset.forName("UTF-8"));
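
The returned byte array is mostly useful for inspection or for writing to a byte stream. A Java String is held as UTF-16 in memory, so the bytes only become UTF-8 at this conversion step. Below is a minimal JDK-only sketch for dumping those bytes as hex. The class and method names are hypothetical, and a string literal stands in for the cell value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: renders the UTF-8 bytes of a String as hex pairs.
class Utf8Bytes {
    static String toHex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex("\u4e2d")); // E4 B8 AD
        System.out.println(toHex("?"));      // 3F
    }
}
```

A three-byte sequence such as E4 B8 AD for a single CJK character indicates the value is intact, while 3F means it had already been replaced by a question mark.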
