Extracting PDF text to txt with Python, with correct handling of two-column pages

A quick search turned up the following libraries that can convert a PDF to txt:

1. PyPDF/PyPDF2 (as of 2024-03-28 these two have been merged into one package) pypdf · PyPI

2. pdfplumber GitHub - jsvine/pdfplumber: Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

3. PyMuPDF PyMuPDF · PyPI

4. PDFMiner (not updated in 5 years, so not recommended) GitHub - euske/pdfminer: Python PDF Parser (Not actively maintained). Check out pdfminer.six.

5. pdftotext (installation failed on macOS, so I did not try it) GitHub - jalan/pdftotext: Simple PDF text extraction

The PDF to be converted contains a page laid out in two columns.

The PyPDF and pdfplumber code is very similar, since both call extract_text, while PyMuPDF uses get_text:

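import pdfplumber
from pypdf import PdfReader
import fitz  # PyMuPDF

fname = "26.pdf"

# pdfplumber: collect every page first, then write once.
# Opening the file with 'w' inside the loop would keep only the last page.
with pdfplumber.open(fname) as pdf:
    print(len(pdf.pages))
    texts = [page.extract_text() or "" for page in pdf.pages]  # extract text
print("\n".join(texts))
with open("1.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts))


# pypdf: same pattern, same method name
reader = PdfReader(fname)
print(len(reader.pages))
texts = [page.extract_text() or "" for page in reader.pages]
print("\n".join(texts))
with open("2.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts))


# PyMuPDF: join the pages with a form feed, chr(12)
with fitz.open(fname) as pdf:
    text = chr(12).join([page.get_text() for page in pdf])
with open("3.txt", "w", encoding="utf-8") as f:
    f.write(text)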
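
The PyMuPDF snippet joins pages with chr(12), which is the ASCII form feed character ("\f"), so page boundaries survive in the txt file. The following is a minimal sketch, assuming the text was written that way, of how the pages could be recovered later. The helper names join_pages and split_pages are hypothetical and not part of any library.

```python
from __future__ import annotations

FORM_FEED = chr(12)  # the same character as "\f"


def join_pages(pages: list[str]) -> str:
    """Join per-page text with a form feed, as the PyMuPDF snippet does."""
    return FORM_FEED.join(pages)


def split_pages(text: str) -> list[str]:
    """Split text produced by join_pages back into per-page strings."""
    return text.split(FORM_FEED)


if __name__ == "__main__":
    pages = ["first page\n", "second page\n"]
    text = join_pages(pages)
    # Round trip: splitting on the form feed gives the original pages back
    print(split_pages(text) == pages)
```

This only works reliably if the page text itself contains no form feed characters.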

Running the code produced three outputs to compare, one each from pdfplumber, PyPDF, and PyMuPDF.

The comparison showed:

1. pdfplumber did not handle the two-column layout correctly

2. PyPDF did not recognize line breaks correctly
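
When an extractor returns the text of a two-column page in the wrong order, one possible workaround is to reorder positioned text blocks yourself: left column top to bottom, then right column. The sketch below is only an illustration under stated assumptions, not the method used in this post. It assumes each block is a tuple starting with (x0, y0, x1, y1, text), which is the shape PyMuPDF's page.get_text("blocks") returns, and that the columns split at the horizontal middle of the page. The function name order_two_columns is hypothetical.

```python
from __future__ import annotations

from typing import Sequence


def order_two_columns(blocks: list[Sequence], page_width: float) -> list[str]:
    """Return block texts in reading order for a simple two-column page.

    Assumption: a block belongs to the left column when its horizontal
    centre lies left of the page middle, otherwise to the right column.
    """
    middle = page_width / 2
    left, right = [], []
    for block in blocks:
        x0, y0, x1, _y1, text = block[:5]  # extra fields are ignored
        centre = (x0 + x1) / 2
        (left if centre < middle else right).append((y0, x0, text))
    # Sort each column top to bottom (y grows downward in page coordinates)
    ordered = sorted(left) + sorted(right)
    return [text for _, _, text in ordered]


if __name__ == "__main__":
    # Synthetic blocks on a 600-unit-wide page, deliberately out of order
    sample = [
        (320, 50, 580, 90, "right top"),
        (20, 200, 280, 240, "left bottom"),
        (20, 50, 280, 90, "left top"),
        (320, 200, 580, 240, "right bottom"),
    ]
    print(order_two_columns(sample, page_width=600))
```

With PyMuPDF, the input could come from page.get_text("blocks") and page.rect.width. Full-width headings or pages with more than two columns would need extra handling that this sketch does not attempt.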

Since PyMuPDF showed neither problem on this file, I chose it to extract the text from PDFs and wrapped it in a script (pdf2txt.py):

#!/usr/bin/env python
"""Convert a PDF to txt

Usage::
    $ python pdf2txt.py <pdf>
"""
import os
import sys
from pathlib import Path

# pip install PyMuPDF
import fitz  # type:ignore[import-untyped]


def pdf2text(fname: str) -> str:
    if "~" in fname:
        fname = os.path.expanduser(fname)
    with fitz.open(fname) as doc:  # open document
        text = chr(12).join([page.get_text() for page in doc])
    return text


def main() -> None:
    if not sys.argv[1:]:
        if "PYCHARM_HOSTED" not in os.environ:
            print(__doc__)
            return
        fname = input("Enter the path to the PDF file: ")
    else:
        fname = sys.argv[1]
    text = pdf2text(fname)
    new_name = Path(fname).stem + ".txt"
    size = Path(new_name).write_bytes(text.encode())
    print(f"Save to {new_name} with {size=}")


if __name__ == "__main__":  # pragma: no cover
    main()
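
One detail of the script worth knowing: Path(fname).stem + ".txt" keeps only the file name, so the txt file is written into the current working directory rather than next to the PDF. Below is a small sketch of the difference using a hypothetical input path. with_suffix is shown as a possible alternative if the output should sit beside the source file; it is not what the script above does.

```python
from pathlib import PurePosixPath

# Hypothetical input path, used only for illustration
src = PurePosixPath("docs/papers/26.pdf")

in_cwd = src.stem + ".txt"            # what the script does: file name only
beside_pdf = src.with_suffix(".txt")  # alternative: same folder as the PDF

print(in_cwd)
print(beside_pdf)
```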
