java 图片识别文字(中英文混合)
调用 tess4j 库来识别图片文字
依赖的maven库
org.slf4j slf4j-api 1.7.26 org.slf4j slf4j-simple 1.7.26 net.sourceforge.tess4j tess4j 5.1.1
图片识别文字
package com; import net.sourceforge.tess4j.ITesseract; import net.sourceforge.tess4j.Tesseract1; import net.sourceforge.tess4j.TesseractException; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import java.io.File; public class TestOCR { private static final Logger logger = LoggerFactory.getLogger(TestOCR.class); public static void main(String[] args) { String result = doOCR("字库位置", "要识别的图片地址"); System.out.println(result); } private static String doOCR(String dataPath, String imgPath) { File imageFile = new File(imgPath); ITesseract instance = new Tesseract1(); //字库位置 instance.setDatapath(dataPath); //eng+chi_sim代表中英文混合 instance.setLanguage("eng+chi_sim");//eng :英文 chi_sim :简体中文 try { return instance.doOCR(imageFile); } catch (TesseractException e) { logger.error("", e); } return ""; } }
字库下载
下载中文包:https://github.com/tesseract-ocr/tessdata 选择chi_sim.traineddata文件进行下载,英文包在tess4j jar包中可以获取。