Extracting text from images

① Add the PPA
add-apt-repository ppa:alex-p/tesseract-ocr
⇒
If this fails:
　　・add-apt-repository is not installed
　　⇒
　　apt-get install software-properties-common
　　apt-get install gnupg-agent　⇒　may not be needed
　　apt-get update
　　
　　・"S.gpg-agent.browser not found" error while adding a PPA on Debian stretch
　　⇒
　　add-apt-repository ppa:pi-rho/dev　⇒　may not be needed
　　apt-get update

After running the above:
cd /root
add-apt-repository ppa:alex-p/tesseract-ocr

② Install Tesseract
apt install tesseract-ocr
apt install libtesseract-dev
tesseract -v

③ Install the trained models
apt install tesseract-ocr-jpn tesseract-ocr-jpn-vert
apt install tesseract-ocr-script-jpan tesseract-ocr-script-jpan-vert
tesseract --list-langs
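
The check above can also be scripted. This is a minimal sketch, not part of the original notes: it parses the text printed by `tesseract --list-langs` and reports which required languages are missing. The function names (parse_langs, missing_langs, list_langs_output) and the regular expression for language names are assumptions made for illustration.

```python
import re
import shutil
import subprocess

# Assumption: language names consist of letters, digits, "_" and "/" only,
# so the "List of available languages ..." header line is filtered out.
LANG_NAME = re.compile(r"^[A-Za-z0-9_/]+$")


def parse_langs(output):
    """Return the language names found in `tesseract --list-langs` output."""
    names = []
    for line in output.splitlines():
        line = line.strip()
        if LANG_NAME.match(line):
            names.append(line)
    return names


def missing_langs(required, output):
    """Return the entries of `required` that are not listed in `output`."""
    installed = set(parse_langs(output))
    return [lang for lang in required if lang not in installed]


def list_langs_output():
    """Run `tesseract --list-langs`; return "" if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return ""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Read both streams, since the list may be printed to either one
    # depending on the Tesseract version.
    return result.stdout + result.stderr
```

Example: missing_langs(["jpn", "jpn_vert"], list_langs_output()) returns an empty list when both models from step ③ are installed.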

④ Run Tesseract
tesseract /usr/src/misc/imageToText/test.png /usr/src/misc/imageToText/order3_out -l jpn
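
The same command can be driven from a script without pytesseract. This is a rough sketch under the assumption that the tesseract binary is on PATH; the helper names (build_command, output_text_path, run_ocr) are made up for illustration. Note that Tesseract appends ".txt" to the output base name, so the command above writes order3_out.txt.

```python
import subprocess
from pathlib import Path


def build_command(image_path, output_base, lang="jpn"):
    """Build the argument list for: tesseract IMAGE OUTPUT_BASE -l LANG"""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def output_text_path(output_base):
    """Return the text file Tesseract writes for a given output base name."""
    # Tesseract appends ".txt" to the output base for plain-text output.
    return Path(str(output_base) + ".txt")


def run_ocr(image_path, output_base, lang="jpn"):
    """Run tesseract and return the recognized text (requires tesseract)."""
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    return output_text_path(output_base).read_text(encoding="utf-8")
```

Example: run_ocr("/usr/src/misc/imageToText/test.png", "/usr/src/misc/imageToText/order3_out") runs the command from step ④ and returns the contents of order3_out.txt.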

Running from Python

⑤ Install pytesseract
pip install pytesseract

⑥ Source

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

# Image file containing Japanese text
FILENAME = '/usr/src/misc/imageToText/test.png'

# Runs with the default language (English), so the result is meaningless for a Japanese image
print(pytesseract.image_to_string(Image.open(FILENAME)))

# Text output in Japanese
print(pytesseract.image_to_string(Image.open(FILENAME), lang='jpn'))

# Box output (one line per character, with coordinates)
print(pytesseract.image_to_boxes(Image.open(FILENAME), lang='jpn'))

# TSV output (probably the most detailed information)
print(pytesseract.image_to_data(Image.open(FILENAME), lang='jpn'))

# OSD (orientation and script detection)
print(pytesseract.image_to_osd(Image.open(FILENAME), lang='jpn'))

# hOCR output (returned as bytes, so decode before printing; lang is needed here too)
print(pytesseract.image_to_pdf_or_hocr(FILENAME, extension='hocr', lang='jpn').decode('utf-8'))
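
The TSV text returned by image_to_data() is easier to use once parsed. The following is a minimal sketch, not part of the original notes: it keeps only word rows at or above a confidence threshold. The function name words_from_tsv and the min_conf parameter are hypothetical; the column names (left, top, width, height, conf, text) are the ones in the TSV header.

```python
import csv
import io


def words_from_tsv(tsv_text, min_conf=0.0):
    """Return recognized words from image_to_data() TSV text as dicts."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        try:
            conf = float(row.get("conf") or -1)
        except ValueError:
            continue  # skip rows whose confidence is not numeric
        # Page, block, paragraph and line rows have conf = -1 and no text.
        if not text or conf < min_conf:
            continue
        words.append({
            "text": text,
            "conf": conf,
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
        })
    return words
```

Example: words_from_tsv(pytesseract.image_to_data(Image.open(FILENAME), lang='jpn'), min_conf=60) keeps only words with confidence 60 or higher. pytesseract can also return this data as a dict directly with output_type=pytesseract.Output.DICT.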

⑦ Run
python /usr/src/misc/imageToText/test.py

Original page
https://qiita.com/FukuharaYohei/items/e09049c8d312eaf166a5