OCR Text Extraction – IYATT-yx's Blog

Last updated: 2024-05-05 14:19

Test environment:

Ubuntu 20.04 x86_64
Python 3.9.10
opencv-python 4.5.5.64
pytesseract 0.3.9
jupyter 1.0.0
matplotlib 3.5.1

pytesseract depends on tesseract-ocr, an open-source OCR engine. Project page: https://github.com/tesseract-ocr/tesseract

I use version 5.1.0, built and installed from source as follows:

# Install the build dependencies
sudo apt update
sudo apt install -y git build-essential autoconf automake libtool pkg-config libpng-dev libjpeg8-dev libtiff5-dev zlib1g-dev libicu-dev libpango1.0-dev libcairo2-dev

# Fetch the source
cd /tmp
git clone https://github.com/tesseract-ocr/tesseract.git --depth=1 --branch=5.1.0

# Build and install
cd tesseract
./autogen.sh
./configure --prefix=$HOME/local/
make -j8
make install

Add the tesseract command to PATH:

echo 'export PATH="$PATH:$HOME/local/bin"' >> ~/.bashrc
source ~/.bashrc
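
To confirm that the PATH change took effect, a small check along these lines may help. It is only a sketch: the helper names `find_tesseract` and `tesseract_version` are made up for illustration, and it assumes the `$HOME/local` prefix used in the build above.

```python
import os
import shutil
import subprocess
from typing import Optional


def find_tesseract(search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path of the tesseract executable, or None if absent.

    search_path is a PATH-style string; None means the current $PATH.
    """
    return shutil.which("tesseract", path=search_path)


def tesseract_version(executable: str) -> str:
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Read whichever stream carries the version banner.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    # Assumption: the install prefix is $HOME/local, as in the build above.
    expected_dir = os.path.join(os.path.expanduser("~"), "local", "bin")
    found = find_tesseract()
    if found is None:
        print(f"tesseract not found on PATH; is {expected_dir} listed?")
    else:
        print(found, "->", tesseract_version(found))
```

If the script reports that tesseract is missing, opening a new terminal (so that ~/.bashrc is read again) is the first thing to try.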

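Next, add the model files. The project publishes several sets of trained models; the two considered here are:

Best (most accurate) trained LSTM models: https://github.com/tesseract-ocr/tessdata_best
Trained models that support both the legacy and the LSTM OCR engines: https://github.com/tesseract-ocr/tessdata

I use the best models. There is no need to download every language: Chinese and English recognition is usually enough, so download only chi_sim.traineddata and eng.traineddata (both are also included in this post's resource files) and copy the two files into $HOME/local/share/tessdata.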
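
A quick way to check that both model files ended up in the right place is sketched below. The function name `missing_models` is hypothetical, and the directory is assumed to be `$HOME/local/share/tessdata`, which matches the `--prefix=$HOME/local/` used when configuring the build.

```python
from pathlib import Path
from typing import Iterable, List


def missing_models(tessdata_dir: Path, languages: Iterable[str]) -> List[str]:
    """Return the language codes whose <code>.traineddata file is absent."""
    return [
        lang
        for lang in languages
        if not (tessdata_dir / f"{lang}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Assumption: tesseract was configured with --prefix=$HOME/local.
    tessdata = Path.home() / "local" / "share" / "tessdata"
    missing = missing_models(tessdata, ["chi_sim", "eng"])
    if missing:
        print("Missing in", tessdata, ":", ", ".join(missing))
    else:
        print("chi_sim and eng models are in place")
```

Running `tesseract --list-langs` afterwards should also list both language codes.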

________________________________________________________________________________________

Usage example:

Resource files for this post: https://pan.baidu.com/s/12BXjUnWrCHn3zIM_gV8Ybg?pwd=4nf8

Open ocr.ipynb in Jupyter and run the whole notebook to see the OCR results.
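
For readers who cannot download the notebook, the core of such an OCR run with opencv-python and pytesseract might look roughly like the following. This is a minimal sketch and not the notebook's actual code: the helper names `lang_spec` and `extract_text` and the file name `sample.png` are placeholders chosen for illustration.

```python
from typing import Sequence


def lang_spec(languages: Sequence[str]) -> str:
    """Join language codes with '+', the form tesseract accepts for several languages."""
    return "+".join(languages)


def extract_text(image_path: str, languages: Sequence[str] = ("chi_sim", "eng")) -> str:
    """Read an image with OpenCV and return the text tesseract recognises."""
    # Imported here so the helper above stays usable without these packages.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"cannot read image: {image_path}")
    # OpenCV loads pixels in BGR order; pytesseract expects RGB.
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(rgb, lang=lang_spec(languages))


if __name__ == "__main__":
    # "sample.png" is a placeholder file name, not one of the post's resources.
    print(extract_text("sample.png"))
```

Passing both `chi_sim` and `eng` lets tesseract handle mixed Chinese and English text in a single pass, which is why the two model files installed earlier are combined here.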

Figure: image preview

Figure: text extraction
