A Roundup of Python-Based PDF Parsers

2024-06-12 07:42:04

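Most scientific literature today is distributed as PDF, a lightweight and ubiquitous file format that preserves a consistent text layout and formatting. For human readers, the clean and consistent layout makes reading easy: one can skim a paper and quickly pick out headings and figures. For a computer, however, a PDF is a very noisy file that contains no information about the structure of the text. Extracting text, images, tables, annotations, tables of contents, and other data from published PDF documents is therefore a necessary step in building structured data for machine learning, for example for natural language processing and large language models, which currently demand large amounts of text.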

1. pdfminer.six

GitHub address: pdfminer.six
Latest release: December 28, 2023

1.1 Installation

pip install pdfminer.six

1.2 Test

from pdfminer.high_level import extract_text

text = extract_text("example.pdf")
print(text)

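1.3 Features

Supports various font types (Type1, TrueType, Type3, and CID).
Supports image extraction (JPG, JBIG2, bitmaps).
Supports various compression filters (ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode).
Supports RC4 and AES encryption.
Supports extraction of AcroForm interactive forms.
Extracts the table of contents.
Extracts tagged content.
Performs automatic layout analysis.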
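
The automatic layout analysis listed above can be explored with `extract_pages`, which yields one layout tree per page. The following is a minimal sketch rather than anything from the original article: the helper names `sort_reading_order` and `iter_text_blocks` are hypothetical, and the top-to-bottom sort is a simplifying assumption that ignores multi-column layouts.

```python
from typing import Iterator, List, Tuple

# (x0, y0, x1, y1, text); pdfminer coordinates have their origin at the bottom-left
Block = Tuple[float, float, float, float, str]


def sort_reading_order(blocks: List[Block]) -> List[Block]:
    """Sort blocks top-to-bottom, then left-to-right (single-column assumption)."""
    return sorted(blocks, key=lambda b: (-b[3], b[0]))


def iter_text_blocks(pdf_path: str) -> Iterator[Tuple[int, Block]]:
    """Yield (page_number, block) for every text container pdfminer.six finds."""
    # Imported here so that sort_reading_order works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        blocks = [
            (*element.bbox, element.get_text().strip())
            for element in page_layout
            if isinstance(element, LTTextContainer)
        ]
        for block in sort_reading_order(blocks):
            yield page_number, block
```

Calling `iter_text_blocks("example.pdf")` and printing each block's text should give the page content roughly in reading order, together with the bounding box of every block.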

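2. PyPDF4 (pypdf)

GitHub address: PyPDF4
Latest release: August 8, 2018

Note: the installation command and the example in this section use pypdf, the actively maintained package of the same family, rather than PyPDF4 itself.

2.1 Installation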

pip install pypdf

2.2 Test

from pypdf import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())

3. pdfrw

3.1 Installation

pip install pdfrw

3.2 Test

from pdfrw import PdfReader
def get_pdf_info(path):
    pdf = PdfReader(path)

    print(pdf.keys())
    print(pdf.Info)
    print(pdf.Root.keys())
    print('PDF has {} pages'.format(len(pdf.pages)))

if __name__ == '__main__':
    get_pdf_info('example.pdf')

4. PDFQuery

4.1 Installation

pip install pdfquery

4.2 Test

from pdfquery import PDFQuery

pdf = PDFQuery('example.pdf')
pdf.load()

# Use CSS-like selectors to locate the elements
text_elements = pdf.pq('LTTextLineHorizontal')

# Extract the text from the elements
text = [t.text for t in text_elements]

print(text)

5. Nougat

Developed by Meta. Nougat (Neural Optical Understanding for Academic Documents) is based on a ViT (Vision Transformer) model and uses optical character recognition (OCR) to convert scientific papers into a markup language.

Latest release: August 22, 2023
GitHub address: Nougat (https://github.com/facebookresearch/nougat)
Project page: Nougat

5.1 Installation

# from pip:
pip install nougat-ocr

# or from github repository
pip install git+https://github.com/facebookresearch/nougat

5.2 Test

nougat path/to/file.pdf --out output_directory

5.3 Usage

usage: nougat [-h] [--batchsize BATCHSIZE] [--checkpoint CHECKPOINT] [--model MODEL] [--out OUT]
              [--recompute] [--markdown] [--no-skipping] pdf [pdf ...]

positional arguments:
  pdf                   PDF(s) to process.

options:
  -h, --help            show this help message and exit
  --batchsize BATCHSIZE, -b BATCHSIZE
                        Batch size to use.
  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Path to checkpoint directory.
  --model MODEL_TAG, -m MODEL_TAG
                        Model tag to use.
  --out OUT, -o OUT     Output directory.
  --recompute           Recompute already computed PDF, discarding previous predictions.
  --full-precision      Use float32 instead of bfloat16. Can speed up CPU conversion for some setups.
  --no-markdown         Do not add postprocessing step for markdown compatibility.
  --markdown            Add postprocessing step for markdown compatibility (default).
  --no-skipping         Don't apply failure detection heuristic.
  --pages PAGES, -p PAGES
                        Provide page numbers like '1-4,7' for pages 1 through 4 and page 7. Only works
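
The `--pages` option above takes a specification such as '1-4,7'. Purely as an illustration of that format, the sketch below expands such a string into page numbers. `parse_page_spec` is a hypothetical helper written for this article; it is not Nougat's own parsing code.

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a spec like '1-4,7' into a sorted list of 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


print(parse_page_spec("1-4,7"))  # [1, 2, 3, 4, 7]
```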

6. SciPDF Parser

Based on GROBID (GeneRation Of BIbliographic Data).

GitHub address: SciPDF Parser
Latest release:

6.1 Installation

# from pip
pip install scipdf-parser

# or from github repository
pip install git+https://github.com/titipata/scipdf_parser

6.2 Test

GROBID must be running before any PDF is parsed:

bash serve_grobid.sh

The script starts GROBID on the default port, 8070.
The following Python script parses a PDF file.

import scipdf
article_dict = scipdf.parse_pdf_to_dict('example_data/futoma2017improved.pdf') # return dictionary

# option to parse directly from URL to PDF, if as_list is set to True, output 'text' of parsed section will be in a list of paragraphs instead
article_dict = scipdf.parse_pdf_to_dict('https://www.biorxiv.org/content/biorxiv/early/2018/11/20/463760.full.pdf', as_list=False)

# output example
>> {
    'title': 'Proceedings of Machine Learning for Healthcare',
    'abstract': '...',
    'sections': [
        {'heading': '...', 'text': '...'},
        {'heading': '...', 'text': '...'},
        ...
    ],
    'references': [
        {'title': '...', 'year': '...', 'journal': '...', 'author': '...'},
        ...
    ],
    'figures': [
        {'figure_label': '...', 'figure_type': '...', 'figure_id': '...', 'figure_caption': '...', 'figure_data': '...'},
        ...
    ],
    'doi': '...'
}

xml = scipdf.parse_pdf('example.pdf', soup=True) # option to parse full XML from GROBID
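
Since the goal stated in the introduction is to build structured text for machine learning, a dictionary of the shape shown above can be flattened into one record per section. This is a sketch under the assumption that the dictionary carries the `title` and `sections` keys from the output example; `sections_to_records` is a hypothetical helper and the sample data is made up.

```python
from typing import Dict, List


def sections_to_records(article: Dict) -> List[Dict[str, str]]:
    """Turn a parsed article into one flat record per non-empty section."""
    title = article.get("title", "")
    records = []
    for section in article.get("sections", []):
        text = section.get("text", "")
        # With as_list=True the text is a list of paragraphs; join it back up.
        if isinstance(text, list):
            text = "\n".join(text)
        if text.strip():
            records.append(
                {"title": title, "heading": section.get("heading", ""), "text": text}
            )
    return records


# Made-up data in the same shape as the output example, for illustration only.
sample = {
    "title": "A sample paper",
    "sections": [
        {"heading": "Introduction", "text": "First paragraph."},
        {"heading": "Methods", "text": ["Step one.", "Step two."]},
        {"heading": "Empty", "text": ""},
    ],
}
for record in sections_to_records(sample):
    print(record["heading"], "->", record["text"])
```

Empty sections are dropped so that downstream tokenization does not receive blank records.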

7. pdfplumber

GitHub address: pdfplumber
Latest release: March 7, 2024

7.1 Installation

pip install pdfplumber

7.2 Test

pdfplumber < example.pdf > background-checks.csv

7.3 Usage

Argument	Description
`--format [format]`	`csv` or `json`. The `json` format returns more information; it includes PDF-level and page-level metadata, plus dictionary-nested attributes.
`--pages [list of pages]`	A space-delimited, `1`-indexed list of pages or hyphenated page ranges. E.g., `1, 11-15`, which would return data for pages 1, 11, 12, 13, 14, and 15.
`--types [list of object types to extract]`	Choices are `char`, `rect`, `line`, `curve`, `image`, `annot`, et cetera. Defaults to all available.
`--laparams`	A JSON-formatted string (e.g., `'{"detect_vertical": true}'`) to pass to `pdfplumber.open(..., laparams=...)`.
`--precision [integer]`	The number of decimal places to round floating-point numbers. Defaults to no rounding.

7.4 Python package usage

import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])
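
Each entry in `page.chars` is a dictionary describing one character. Assuming the `text`, `x0`, and `top` keys that pdfplumber documents for character objects, the following sketch groups characters into lines of text. `chars_to_lines` and its `tolerance` parameter are hypothetical; in practice pdfplumber's own `extract_text()` is the better choice, and this only illustrates what the raw character data looks like.

```python
from typing import Dict, List


def chars_to_lines(chars: List[Dict], tolerance: float = 2.0) -> List[str]:
    """Group character dicts into lines by their distance from the page top."""
    lines: List[List[Dict]] = []
    for char in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        # Stay on the current line while the character is within `tolerance`
        # of the line's first character; otherwise start a new line.
        if lines and abs(char["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(char)
        else:
            lines.append([char])
    return [
        "".join(c["text"] for c in sorted(line, key=lambda c: c["x0"]))
        for line in lines
    ]


# Synthetic characters for illustration; real ones come from page.chars.
sample_chars = [
    {"text": "i", "x0": 16.0, "top": 10.3},
    {"text": "H", "x0": 10.0, "top": 10.0},
    {"text": "!", "x0": 10.0, "top": 30.0},
]
print(chars_to_lines(sample_chars))  # ['Hi', '!']
```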

8. borb

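8.0 Overview

borb is a pure-Python library for reading, writing, and manipulating PDF documents. It represents a PDF document as a JSON-like data structure of nested lists, dictionaries, and primitives (numbers, strings, booleans, and so on).

GitHub address: borb
Latest release: May 2024

8.1 Installation

Download: borb (PyPI)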

# from pip
pip install borb

# reinstall the latest version (bypassing pip's cache)
pip uninstall borb
pip install --no-cache-dir borb

8.2 Test (creating a PDF)

from pathlib import Path

from borb.pdf import Document
from borb.pdf import Page
from borb.pdf import SingleColumnLayout
from borb.pdf import Paragraph
from borb.pdf import PDF

# create an empty Document
pdf = Document()

# add an empty Page
page = Page()
pdf.add_page(page)

# use a PageLayout (SingleColumnLayout in this case)
layout = SingleColumnLayout(page)

# add a Paragraph object
layout.add(Paragraph("Hello World!"))

# store the PDF
with open(Path("output.pdf"), "wb") as pdf_file_handle:
    PDF.dumps(pdf_file_handle, pdf)

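8.3 Features

Read a PDF and extract its metadata
Modify metadata
Extract text from a PDF
Extract images from a PDF
Change images in a PDF
Add annotations (notes, links, and so on) to a PDF
Add text to a PDF
Add tables to a PDF
Add lists to a PDF
Use a page layout manager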

9. ScienceBeam Parser

GitHub address: ScienceBeam

9.1 Installation

pip install sciencebeam-parser

9.2 Test

Python API: starting the server

from sciencebeam_parser.config.config import AppConfig
from sciencebeam_parser.resources.default_config import DEFAULT_CONFIG_FILE
from sciencebeam_parser.service.server import create_app


config = AppConfig.load_yaml(DEFAULT_CONFIG_FILE)
app = create_app(config)
app.run(port=8080, host='127.0.0.1', threaded=True)

Python API: parsing a PDF file

from sciencebeam_parser.resources.default_config import DEFAULT_CONFIG_FILE
from sciencebeam_parser.config.config import AppConfig
from sciencebeam_parser.utils.media_types import MediaTypes
from sciencebeam_parser.app.parser import ScienceBeamParser

config = AppConfig.load_yaml(DEFAULT_CONFIG_FILE)

# the parser contains all of the models
sciencebeam_parser = ScienceBeamParser.from_config(config)

# a session provides a scope and temporary directory for intermediate files
# it is recommended to create a separate session for every document
with sciencebeam_parser.get_new_session() as session:
    session_source = session.get_source(
        'example.pdf',
        MediaTypes.PDF
    )
    converted_file = session_source.get_local_file_for_response_media_type(
        MediaTypes.TEI_XML
    )
    # Note: the converted file will be in the temporary directory of the session
    print('converted file:', converted_file)
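
The converted file is TEI XML, which can be read with the standard library. The following is a minimal sketch, assuming the usual TEI layout in which the title sits under `teiHeader/fileDesc/titleStmt/title` and section headings are `head` elements inside `div`. The sample document is made up and `summarize_tei` is a hypothetical helper; the exact structure ScienceBeam emits may differ.

```python
import xml.etree.ElementTree as ET
from typing import Dict

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def summarize_tei(xml_text: str) -> Dict:
    """Return the document title and section headings from a TEI XML string."""
    root = ET.fromstring(xml_text)
    title = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", TEI_NS)
    heads = root.findall(".//tei:div/tei:head", TEI_NS)
    return {
        "title": (title.text or "").strip() if title is not None else "",
        "headings": [(h.text or "").strip() for h in heads],
    }


# Made-up TEI document for illustration only.
sample_tei = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>A sample paper</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Introduction</head><p>Some text.</p></div>
    <div><head>Methods</head><p>More text.</p></div>
  </body></text>
</TEI>"""
print(summarize_tei(sample_tei))
```

For a real run, the file content would have to be read inside the session's `with` block, because the converted file lives in the session's temporary directory.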

Original article: https://blog.csdn.net/MurphyStar/article/details/139599866
