Java使用poi将office文件转为html

    技术2022-07-11  141

    一、前言

    功能需求:上传office文档,并提供文件在线预览。

    解决方案:

    使用Aspose.cells.jar包,将文档转换为pdf格式;使用libreOffice,将文档转换为pdf格式;使用poi将文档转换为html格式。 方案一:通过Aspose的方式,该功能是付费版,需要破解,所以是能抛弃。方案二,使用libreOffice,需要安装使用libreOffice,linux还需要装unoconv,需要使用commons-io的pom依赖,之前maven官方库查询不到这个pom依赖所以放弃了这个方案,刚才准备查询资料时发现这个依赖已经可以使用,估计是前段时间maven官方库出现问题。方案三,只需要添加所需的依赖就可以使用,但是转换出的html会有一些格式问题,等下会再下面讲到。

    二、添加依赖

    <dependencies> <dependency> <groupId>org.apache.poi</groupId> <artifactId>poi</artifactId> <version>3.12</version> </dependency> <dependency> <groupId>org.apache.poi</groupId> <artifactId>poi-scratchpad</artifactId> <version>3.12</version> </dependency> <dependency> <groupId>fr.opensagres.xdocreport</groupId> <artifactId>fr.opensagres.xdocreport.document</artifactId> <version>1.0.5</version> </dependency> <dependency> <groupId>fr.opensagres.xdocreport</groupId> <artifactId>org.apache.poi.xwpf.converter.xhtml</artifactId> <version>1.0.4</version> </dependency> <dependency> <groupId>fr.opensagres.xdocreport</groupId> <artifactId>fr.opensagres.xdocreport.core</artifactId> <version>2.0.1</version> </dependency> </dependencies>

    三、word文档转html

    一般word文件后缀有doc、docx两种。docx是word2007以及以后版本文档的扩展名,doc是word2003文档保存的扩展名。对于这两种格式的word转换成html需要使用不同的方法。

    转换问题: 转换后,对于2003版本word,自动生成的目录会显示有错误;2003版本和2007版本对于特殊字符都有可能显示不出来,不过问题不是很明显;文章中的图片一定要是常用的图片格式(jpeg,jpg,png等),不然无法显示。

    2003版本word转换成html(.doc)

    public static boolean word2003ToHtml(Map params) { logger.debug("***** word2003ToHtml start params:{}", params); try { //图片存放路径 String fileImg = params.get("fileImg").toString(); //转换html后,html中图片的url前缀 String viewImgPath = params.get("viewImgPath").toString(); //html文件 File htmlFile = new File(params.get("htmlFile").toString()); File file = new File(params.get("filePath").toString() + params.get("FILE_NAME").toString()); // 1) 加载word文档生成 HWPFDocument对象 InputStream inputStream = new FileInputStream(file); HWPFDocument wordDocument = new HWPFDocument(inputStream); WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument()); //设置图片存放的位置 wordToHtmlConverter.setPicturesManager(new PicturesManager() { public String savePicture(byte[] content, PictureType pictureType, String suggestedName, float widthInches, float heightInches) { File imgPath = new File(fileImg); if (!imgPath.exists()) {//图片目录不存在则创建 imgPath.mkdirs(); } File file = new File(fileImg + suggestedName); try { OutputStream os = new FileOutputStream(file); os.write(content); os.close(); } catch (FileNotFoundException e) { e.printStackTrace(); } catch (IOException e) { e.printStackTrace(); } //这里可以指定word文档中图片的路径。 return viewImgPath + "/" + suggestedName; } }); //解析word文档 wordToHtmlConverter.processDocument(wordDocument); Document htmlDocument = wordToHtmlConverter.getDocument(); OutputStream outputStream = new FileOutputStream(htmlFile); DOMSource domSource = new DOMSource(htmlDocument); StreamResult streamResult = new StreamResult(outputStream); TransformerFactory factory = TransformerFactory.newInstance(); Transformer serializer = factory.newTransformer(); serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8"); serializer.setOutputProperty(OutputKeys.INDENT, "yes"); serializer.setOutputProperty(OutputKeys.METHOD, "html"); serializer.transform(domSource, streamResult); outputStream.close(); } catch (Exception e) { e.printStackTrace(); return false; } return true; }

    3、2007版本word转换成html(.docx)

    public static boolean word2007ToHtml(Map params) throws Exception { logger.debug("***** word2007ToHtml start params:{}", params); try { //转换html后,html中图片的url前缀 String viewImgPath = params.get("viewImgPath").toString(); //图片存放路径 String fileImg = params.get("fileImg").toString(); File file = new File(params.get("filePath").toString() + params.get("FILE_NAME").toString()); // 1) 加载word文档生成 XWPFDocument对象 InputStream inputStream = new FileInputStream(file); XWPFDocument document = new XWPFDocument(inputStream); // 2) 解析 XHTML配置 (URIResolver来设置图片存放的目录) XHTMLOptions options = XHTMLOptions.create(); options.URIResolver(new BasicURIResolver(viewImgPath)); FileImageExtractor extractor = new FileImageExtractor(new File(fileImg)); options.setExtractor(extractor); // 3) 将 XWPFDocument转换成XHTML File htmlFile = new File(params.get("htmlFile").toString()); OutputStream outputStream = new FileOutputStream(htmlFile); XHTMLConverter.getInstance().convert(document, outputStream, options); } catch (Exception e) { e.printStackTrace(); return false; } return true; }

    四、Excel文档转Html

    POI中将Excel转换为HTML方法仅能转换HSSFWorkBook类型(即03版xls),故可以先将读取的xlsx文件转换成xls文件再调用该方法统一处理

    2003版本excel转换html(excel中不可包括图片,因为poi没提供将excel中图片转换的方法)

    public static boolean excelToHtml(Map params) { try { String file = params.get("FILE_NAME").toString(); String filePath = params.get("filePath").toString() + file; InputStream input = new FileInputStream(filePath); HSSFWorkbook excelBook = new HSSFWorkbook(); //判断Excel文件将07+版本转换为03版本 if (file.endsWith(EXCEL_XLS)) { //Excel 2003 excelBook = new HSSFWorkbook(input); } else if (file.endsWith(EXCEL_XLSX)) { // Excel 2007/2010 ExcelTransFormUtil xls = new ExcelTransFormUtil(); XSSFWorkbook workbookOld = new XSSFWorkbook(input); xls.transformXSSF(workbookOld, excelBook); } ExcelToHtmlConverter excelToHtmlConverter = new ExcelToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument()); //去掉Excel头行 excelToHtmlConverter.setOutputColumnHeaders(false); //去掉Excel行号 excelToHtmlConverter.setOutputRowNumbers(false); excelToHtmlConverter.processWorkbook(excelBook); Document htmlDocument = excelToHtmlConverter.getDocument(); ByteArrayOutputStream outStream = new ByteArrayOutputStream(); DOMSource domSource = new DOMSource(htmlDocument); StreamResult streamResult = new StreamResult(outStream); TransformerFactory tf = TransformerFactory.newInstance(); Transformer serializer = tf.newTransformer(); serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); serializer.setOutputProperty(OutputKeys.INDENT, "yes"); serializer.setOutputProperty(OutputKeys.METHOD, "html"); serializer.transform(domSource, streamResult); outStream.close(); String content = new String(outStream.toByteArray()); FileUtils.writeStringToFile(new File(params.get("htmlFile").toString()), content, "UTF-8"); } catch (Exception e) { e.printStackTrace(); return false; } return true; }

    将2007版本转换成2003版本

    public void transformXSSF(XSSFWorkbook workbookOld, HSSFWorkbook workbookNew) { HSSFSheet sheetNew; XSSFSheet sheetOld; workbookNew.setMissingCellPolicy(workbookOld.getMissingCellPolicy()); for (int i = 0; i < workbookOld.getNumberOfSheets(); i++) { sheetOld = workbookOld.getSheetAt(i); sheetNew = workbookNew.getSheet(sheetOld.getSheetName()); sheetNew = workbookNew.createSheet(sheetOld.getSheetName()); this.transform(workbookOld, workbookNew, sheetOld, sheetNew); } } private void transform(XSSFWorkbook workbookOld, HSSFWorkbook workbookNew, XSSFSheet sheetOld, HSSFSheet sheetNew) { sheetNew.setDisplayFormulas(sheetOld.isDisplayFormulas()); sheetNew.setDisplayGridlines(sheetOld.isDisplayGridlines()); sheetNew.setDisplayGuts(sheetOld.getDisplayGuts()); sheetNew.setDisplayRowColHeadings(sheetOld.isDisplayRowColHeadings()); sheetNew.setDisplayZeros(sheetOld.isDisplayZeros()); sheetNew.setFitToPage(sheetOld.getFitToPage()); sheetNew.setHorizontallyCenter(sheetOld.getHorizontallyCenter()); sheetNew.setMargin(Sheet.BottomMargin, sheetOld.getMargin(Sheet.BottomMargin)); sheetNew.setMargin(Sheet.FooterMargin, sheetOld.getMargin(Sheet.FooterMargin)); sheetNew.setMargin(Sheet.HeaderMargin, sheetOld.getMargin(Sheet.HeaderMargin)); sheetNew.setMargin(Sheet.LeftMargin, sheetOld.getMargin(Sheet.LeftMargin)); sheetNew.setMargin(Sheet.RightMargin, sheetOld.getMargin(Sheet.RightMargin)); sheetNew.setMargin(Sheet.TopMargin, sheetOld.getMargin(Sheet.TopMargin)); sheetNew.setPrintGridlines(sheetNew.isPrintGridlines()); sheetNew.setRightToLeft(sheetNew.isRightToLeft()); sheetNew.setRowSumsBelow(sheetNew.getRowSumsBelow()); sheetNew.setRowSumsRight(sheetNew.getRowSumsRight()); sheetNew.setVerticallyCenter(sheetOld.getVerticallyCenter()); HSSFRow rowNew; for (Row row : sheetOld) { rowNew = sheetNew.createRow(row.getRowNum()); if (rowNew != null) this.transform(workbookOld, workbookNew, (XSSFRow) row, rowNew); } for (int i = 0; i < this.lastColumn; i++) { sheetNew.setColumnWidth(i, sheetOld.getColumnWidth(i)); sheetNew.setColumnHidden(i, sheetOld.isColumnHidden(i)); } for (int i = 0; i < sheetOld.getNumMergedRegions(); i++) { CellRangeAddress merged = sheetOld.getMergedRegion(i); sheetNew.addMergedRegion(merged); } } private void transform(XSSFWorkbook workbookOld, HSSFWorkbook workbookNew, XSSFRow rowOld, HSSFRow rowNew) { HSSFCell cellNew; rowNew.setHeight(rowOld.getHeight()); for (Cell cell : rowOld) { cellNew = rowNew.createCell(cell.getColumnIndex(), cell.getCellType()); if (cellNew != null) this.transform(workbookOld, workbookNew, (XSSFCell) cell, cellNew); } this.lastColumn = Math.max(this.lastColumn, rowOld.getLastCellNum()); } private void transform(XSSFWorkbook workbookOld, HSSFWorkbook workbookNew, XSSFCell cellOld, HSSFCell cellNew) { cellNew.setCellComment(cellOld.getCellComment()); Integer hash = cellOld.getCellStyle().hashCode(); if (this.styleMap != null && !this.styleMap.containsKey(hash)) { this.transform(workbookOld, workbookNew, hash, cellOld.getCellStyle(), (HSSFCellStyle) workbookNew.createCellStyle()); } cellNew.setCellStyle(this.styleMap.get(hash)); switch (cellOld.getCellType()) { case Cell.CELL_TYPE_BLANK: break; case Cell.CELL_TYPE_BOOLEAN: cellNew.setCellValue(cellOld.getBooleanCellValue()); break; case Cell.CELL_TYPE_ERROR: cellNew.setCellValue(cellOld.getErrorCellValue()); break; case Cell.CELL_TYPE_FORMULA: cellNew.setCellValue(cellOld.getCellFormula()); break; case Cell.CELL_TYPE_NUMERIC: cellNew.setCellValue(cellOld.getNumericCellValue()); break; case Cell.CELL_TYPE_STRING: cellNew.setCellValue(cellOld.getStringCellValue()); break; default: } } private void transform(XSSFWorkbook workbookOld, HSSFWorkbook workbookNew, Integer hash, XSSFCellStyle styleOld, HSSFCellStyle styleNew) { styleNew.setAlignment(styleOld.getAlignment()); styleNew.setBorderBottom(styleOld.getBorderBottom()); styleNew.setBorderLeft(styleOld.getBorderLeft()); styleNew.setBorderRight(styleOld.getBorderRight()); styleNew.setBorderTop(styleOld.getBorderTop()); styleNew.setDataFormat(this.transform(workbookOld, workbookNew, styleOld.getDataFormat())); styleNew.setFillBackgroundColor(styleOld.getFillBackgroundColor()); styleNew.setFillForegroundColor(styleOld.getFillForegroundColor()); styleNew.setFillPattern(styleOld.getFillPattern()); styleNew.setFont(this.transform(workbookNew, (XSSFFont) styleOld.getFont())); styleNew.setHidden(styleOld.getHidden()); styleNew.setIndention(styleOld.getIndention()); styleNew.setLocked(styleOld.getLocked()); styleNew.setVerticalAlignment(styleOld.getVerticalAlignment()); styleNew.setWrapText(styleOld.getWrapText()); this.styleMap.put(hash, styleNew); } private short transform(XSSFWorkbook workbookOld, HSSFWorkbook workbookNew, short index) { DataFormat formatOld = workbookOld.createDataFormat(); DataFormat formatNew = workbookNew.createDataFormat(); return formatNew.getFormat(formatOld.getFormat(index)); } private HSSFFont transform(HSSFWorkbook workbookNew, XSSFFont fontOld) { HSSFFont fontNew = workbookNew.createFont(); fontNew.setBoldweight(fontOld.getBoldweight()); fontNew.setCharSet(fontOld.getCharSet()); fontNew.setColor(fontOld.getColor()); fontNew.setFontName(fontOld.getFontName()); fontNew.setFontHeight(fontOld.getFontHeight()); fontNew.setItalic(fontOld.getItalic()); fontNew.setStrikeout(fontOld.getStrikeout()); fontNew.setTypeOffset(fontOld.getTypeOffset()); fontNew.setUnderline(fontOld.getUnderline()); return fontNew; } 手动捂脸,有点太多了。大家凑合着看。。。

    五、PPT文档转Html

    就是将ppt转换成一张张图片再放入html。

    注: 1) ppt中文字会有中文显示问题,可以将中文文字库加入到服务器中即可。 2) pptx2007版本文件不能显示表格

    2003版本ppt转html

    public static boolean ppt2003Tohtml(Map params) { try { String imgPath = params.get("fileImg").toString(); File file = new File(params.get("filePath").toString() + params.get("FILE_NAME").toString()); InputStream inputStream = new FileInputStream(file); SlideShow ppt = new SlideShow(inputStream); inputStream.close(); Dimension pgsize = ppt.getPageSize(); org.apache.poi.hslf.model.Slide[] slide = ppt.getSlides(); FileOutputStream out = null; String imghtml = ""; String viewImgPath = params.get("viewImgPath").toString(); for (int i = 0; i < slide.length; i++) { logger.debug("第" + i + "页。"); TextRun[] truns = slide[i].getTextRuns(); for (int k = 0; k < truns.length; k++) { RichTextRun[] rtruns = truns[k].getRichTextRuns(); for (int l = 0; l < rtruns.length; l++) { rtruns[l].setFontIndex(1); rtruns[l].setFontName("宋体"); } } BufferedImage img = new BufferedImage(pgsize.width, pgsize.height, BufferedImage.TYPE_INT_RGB); Graphics2D graphics = img.createGraphics(); graphics.setPaint(Color.BLUE); graphics.fill(new Rectangle2D.Float(0, 0, pgsize.width, pgsize.height)); slide[i].draw(graphics); // 这里设置图片的存放路径和图片的格式(jpeg,png,bmp等等) out = new FileOutputStream(imgPath + (i + 1) + ".jpeg"); javax.imageio.ImageIO.write(img, "jpeg", out); //图片在html加载路径 String imgs = viewImgPath + "/" + (i + 1) + ".jpeg"; imghtml += "<img src=\'" + imgs + "\' style=\'width:960px;height:530px;vertical-align:text-bottom;\'><br><br><br><br>"; DOMSource domSource = new DOMSource(); StreamResult streamResult = new StreamResult(out); TransformerFactory tf = TransformerFactory.newInstance(); Transformer serializer = tf.newTransformer(); serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8"); serializer.setOutputProperty(OutputKeys.INDENT, "yes"); serializer.setOutputProperty(OutputKeys.METHOD, "html"); serializer.transform(domSource, streamResult); out.close(); String ppthtml = "<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"></head><body>" + imghtml + "</body></html>"; FileUtils.writeStringToFile(new File(params.get("htmlFile").toString()), ppthtml, "utf-8"); } } catch (Exception e) { e.printStackTrace(); return false; } return true; }

    2007版本ppt转换html

    public static boolean ppt2007Tohtml(Map params) { try { String imgPath = params.get("fileImg").toString(); File file = new File(params.get("filePath").toString() + params.get("FILE_NAME").toString()); InputStream inputStream = new FileInputStream(file); XMLSlideShow ppt = new XMLSlideShow(inputStream); inputStream.close(); Dimension pgsize = ppt.getPageSize(); XSLFSlide[] pptPageXSLFSLiseList = ppt.getSlides(); FileOutputStream out = null; String imghtml = ""; String viewImgPath = params.get("viewImgPath").toString(); for (int i = 0; i < pptPageXSLFSLiseList.length; i++) { try { for (XSLFShape shape : pptPageXSLFSLiseList[i].getShapes()) { if (shape instanceof XSLFTextShape) { XSLFTextShape tsh = (XSLFTextShape) shape; for (XSLFTextParagraph p : tsh) { for (XSLFTextRun r : p) { r.setFontFamily("宋体"); } } } } BufferedImage img = new BufferedImage(pgsize.width, pgsize.height, BufferedImage.TYPE_INT_RGB); Graphics2D graphics = img.createGraphics(); // clear the drawing area graphics.setPaint(Color.white); graphics.fill(new Rectangle2D.Float(0, 0, pgsize.width, pgsize.height)); // render pptPageXSLFSLiseList[i].draw(graphics); // String Imgname = imgPath + (i + 1) + ".jpeg"; out = new FileOutputStream(Imgname); javax.imageio.ImageIO.write(img, "jpeg", out); //图片在html加载路径 String imgs = viewImgPath + "/" + (i + 1) + ".jpeg"; imghtml += "<img src=\'" + imgs + "\' style=\'width:960px;height:530px;vertical-align:text-bottom;\'><br><br><br><br>"; } catch (Exception e) { System.out.println(e); System.out.println("第" + i + "张ppt转换出错"); } } DOMSource domSource = new DOMSource(); StreamResult streamResult = new StreamResult(out); TransformerFactory tf = TransformerFactory.newInstance(); Transformer serializer = tf.newTransformer(); serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8"); serializer.setOutputProperty(OutputKeys.INDENT, "yes"); serializer.setOutputProperty(OutputKeys.METHOD, "html"); serializer.transform(domSource, streamResult); out.close(); String ppthtml = "<html><head><META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"></head><body>" + imghtml + "</body></html>"; FileUtils.writeStringToFile(new File(params.get("htmlFile").toString()), ppthtml, "utf-8"); } catch (Exception e) { e.printStackTrace(); return false; } return true; }

    六、预览

    每次预览将源文件转换成html后,将html文件上传nginx服务器目录下,用nginx代理访问,以防出现浏览器缓存问题。 我这里是将html文件和图片路径打包,通过sftp上传nginx服务器,然后解压。使用iframe标签访问html文件。这里需要注意的是预览pdf文件,会出现下载打印的按钮,如果需求是文件只可缓存不可下载,就不可以使用iframe标签预览pdf。
    Processed: 0.009, SQL: 9