
PDF: A Complete Toolkit for Document Processing

pdf is a Swiss Army knife for PDF processing — covering reading, extraction, merging, splitting, rotation, watermarks, form filling, and OCR through 8 independent Python scripts.

  • 📖 Basic PDF operations with pypdf (read/merge/split/rotate/encrypt)
  • 🔍 Text and table extraction with pdfplumber
  • 📝 PDF creation with reportlab
  • 📋 Form processing subsystem (5 scripts: check fillable fields, extract field structure, fill forms, validate bounding boxes, create validation images)
  • 🖼️ PDF to image conversion + OCR
  • 🔐 Password protection and watermarking

Use it when the user mentions PDF files, or needs to process PDF tables, fill PDF forms, merge or split PDFs, or OCR scanned PDFs.

The pdf skill follows a “Script Toolkit” pattern: a core SKILL.md provides the overview and quick reference, forms.md and reference.md provide extended guidance, and a scripts/ directory holds 8 independent Python scripts, each handling one PDF operation.

SKILL.md runs to about 315 lines, with these core sections:

  1. Quick Start: pypdf-based quick start code
  2. Python Libraries: Usage instructions for 3 libraries
    • pypdf: Basic operations (merge, split, rotate, metadata, watermark, encrypt)
    • pdfplumber: Text and table extraction
    • reportlab: PDF creation (Canvas/Platypus)
  3. Command-Line Tools: Alternatives with pdftotext, qpdf, pdftk
  4. Common Tasks: Scanned PDF OCR, watermarking, image extraction, password protection
  5. Quick Reference: Tool recommendation table for 7 task types

TRIGGER conditions are extremely broad — “user wants to do anything with PDF files”, including reading, extraction, merging, splitting, rotation, watermarking, creation, form filling, encryption/decryption, image extraction, OCR. This broad trigger strategy suits a domain that is comprehensive but relatively standardized.

The 8 scripts are independent, each handling a specific PDF operation. 5 of them form a form processing subsystem. SKILL.md provides tool selection guidance (pypdf vs pdfplumber vs reportlab), but specific instructions are provided inline within SKILL.md rather than through separate reference documents.
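The tool selection guidance can be pictured as a simple lookup from task to library. The sketch below is illustrative only: the task names and the `recommend_tool` helper are hypothetical and are not taken from SKILL.md; only the pairing of libraries with operation types follows the description above.

```python
# Hypothetical task names; the library pairing mirrors the description above
# (pypdf for basic operations, pdfplumber for extraction, reportlab for creation).
TOOL_BY_TASK = {
    "merge": "pypdf",
    "split": "pypdf",
    "rotate": "pypdf",
    "encrypt": "pypdf",
    "extract_text": "pdfplumber",
    "extract_tables": "pdfplumber",
    "create": "reportlab",
}


def recommend_tool(task):
    """Return the library suggested for a task, or raise for unknown tasks."""
    try:
        return TOOL_BY_TASK[task]
    except KeyError:
        raise ValueError(f"no recommendation for task: {task!r}") from None
```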

pdf is a typical “Script Toolkit” Skill:

| Feature | Description |
| --- | --- |
| Script Count | 8 independent Python scripts |
| Reference Docs | SKILL.md + forms.md + reference.md |
| Tool Selection | Built into the SKILL.md Quick Reference table |
| Script Relationships | Independent; some are linked by the form processing subsystem |
| Library Strategy | Three libraries: pypdf (basic) + pdfplumber (extraction) + reportlab (creation) |

pdf contains 8 Python scripts, all independent and each handling a specific PDF operation.

fill_pdf_form_with_annotations.py is the core form-filling script. It implements “annotation-style” form filling, which suits PDFs that don’t support standard AcroForm.

Core logic in three steps:

  1. Coordinate transformation: Supports two coordinate systems (image → PDF and PDF → pypdf) via transform_from_image_coords and transform_from_pdf_coords functions
  2. Annotation creation: Uses pypdf’s FreeText annotation class, creating text annotations with correct font, size, and color at each field location
  3. Write output: Merges original PDF with all annotations into a new file

The key design highlight is the coordinate transformation layer — it handles the mapping from “visual space” to “PDF internal space”.
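The two transforms named above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names and argument order come from the excerpt of the script, but the bodies are guesses, as is the assumption that input boxes are `[left, top, right, bottom]` measured from the top-left corner and that output boxes are `[left, bottom, right, top]` in PDF points measured from the bottom-left corner.

```python
def transform_from_image_coords(bbox, image_width, image_height,
                                pdf_width, pdf_height):
    """Map a box measured on a rendered page image to PDF page space.

    Assumed layout: bbox is [left, top, right, bottom] in pixels, with the
    origin at the top-left corner of the image.
    """
    left, top, right, bottom = bbox
    x_scale = pdf_width / image_width
    y_scale = pdf_height / image_height
    # Scale to PDF points, then flip the Y axis (PDF origin is bottom-left).
    return [
        left * x_scale,
        pdf_height - bottom * y_scale,
        right * x_scale,
        pdf_height - top * y_scale,
    ]


def transform_from_pdf_coords(bbox, pdf_height):
    """Flip a top-left-origin box already in PDF points to bottom-left origin."""
    left, top, right, bottom = bbox
    return [left, pdf_height - bottom, right, pdf_height - top]
```

With a 1000 x 1000 pixel image of a 500 x 500 point page, a box at pixels `[100, 200, 300, 250]` lands at `[50.0, 375.0, 150.0, 400.0]`: both axes are halved, and the Y values are measured up from the bottom edge instead of down from the top.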

| Script | Function | Dependency |
| --- | --- | --- |
| fill_pdf_form_with_annotations.py | Fill forms via FreeText annotations | pypdf |
| fill_fillable_fields.py | Fill standard fillable form fields | pypdf |
| extract_form_field_info.py | Extract form field structure (with positioning) | pypdf |
| extract_form_structure.py | Extract overall form structure | pypdf |
| check_fillable_fields.py | Check if PDF has fillable fields | pypdf |
| check_bounding_boxes.py | Validate field bounding boxes | pypdf |
| create_validation_image.py | Create form fill validation images | pypdf, Pillow |
| convert_pdf_to_images.py | PDF to PNG conversion (with size scaling) | pdf2image, Pillow |

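fill_pdf_form_with_annotations.py (source file excerpt)

```python
import json

from pypdf import PdfReader, PdfWriter
from pypdf.annotations import FreeText

# transform_from_pdf_coords and transform_from_image_coords are defined
# elsewhere in the script and are not part of this excerpt.


def fill_pdf_form(input_pdf_path, fields_json_path, output_pdf_path):
    with open(fields_json_path, "r") as f:
        fields_data = json.load(f)

    reader = PdfReader(input_pdf_path)
    writer = PdfWriter()
    writer.append(reader)

    # Cache each page's media box size, keyed by 1-based page number.
    pdf_dimensions = {}
    for i, page in enumerate(reader.pages):
        mediabox = page.mediabox
        pdf_dimensions[i + 1] = [mediabox.width, mediabox.height]

    annotations = []
    for field in fields_data["form_fields"]:
        page_num = field["page_number"]
        page_info = next(p for p in fields_data["pages"]
                         if p["page_number"] == page_num)
        pdf_width, pdf_height = pdf_dimensions[page_num]

        # Boxes already in PDF units are converted using only the page height;
        # boxes measured on a rendered image also need both sizes.
        if "pdf_width" in page_info:
            transformed_entry_box = transform_from_pdf_coords(
                field["entry_bounding_box"], float(pdf_height))
        else:
            # The excerpt omits where the image size comes from; reading it
            # from page_info under these key names is an assumption.
            image_width = page_info["image_width"]
            image_height = page_info["image_height"]
            transformed_entry_box = transform_from_image_coords(
                field["entry_bounding_box"],
                image_width, image_height,
                float(pdf_width), float(pdf_height))

        # The excerpt also omits how the text and font settings are read;
        # the key names and fallback values below are assumptions.
        entry_text = field["entry_text"]
        font_name = entry_text.get("font", "Helvetica")
        font_size = entry_text.get("font_size", "14pt")
        font_color = entry_text.get("font_color", "000000")

        annotation = FreeText(
            text=entry_text["text"],
            rect=transformed_entry_box,
            font=font_name, font_size=font_size,
            font_color=font_color,
        )
        annotations.append(annotation)
        writer.add_annotation(page_number=page_num - 1, annotation=annotation)

    with open(output_pdf_path, "wb") as output:
        writer.write(output)
```

Code walkthrough

  • Function entry point: takes three arguments, the input PDF path, the field-definition JSON path, and the output PDF path.
  • PdfReader + PdfWriter: the reader loads every page and the metadata of the original PDF; the writer builds the output file.
  • Page size prefetch: the loop over all pages stores each media box size in a dictionary for the coordinate transforms. Accurate coordinates depend on this step.
  • Coordinate system branch: depending on where the field data came from (extracted directly from the PDF or measured on an image), the code picks the PDF or the image coordinate transform.
  • FreeText annotation creation: pypdf's high-level API inserts rich text inside the given rectangle. Compared with editing the low-level PDF stream, this approach is safer and easier to reverse.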

convert_pdf_to_images.py is a PDF-to-image utility that uses pdf2image to render PDF pages as PNG images.

Core logic: Converts PDF to PIL Image objects via pdf2image, proportionally scales images exceeding maximum size, then saves as PNG files. Supports custom output directory and maximum dimension limits (default 1000px).

This script’s output plays a key role in the form processing visual verification pipeline — the converted images are used by create_validation_image.py for visual comparison.
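The proportional scaling step can be sketched in plain Python. This is an assumption about how such a limit is typically applied, not the script's own code: the `scaled_size` helper and its `max_dim` parameter are hypothetical names, and only the 1000px default comes from the description above.

```python
def scaled_size(width, height, max_dim=1000):
    """Return (width, height) shrunk so the longer side is at most max_dim.

    Images that already fit are returned unchanged; the aspect ratio is kept.
    """
    largest = max(width, height)
    if largest <= max_dim:
        return width, height
    scale = max_dim / largest
    # Round to whole pixels and never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

A 2000 x 1000 render comes back as 1000 x 500, while an 800 x 600 render is left alone. With Pillow, the resulting tuple could be passed to `Image.resize` before saving.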

extract_form_field_info.py is a form structure parsing tool that analyzes AcroForm fields in PDFs.

Key features:

  • Recursively traverses the field tree, extracting fully qualified field names
  • Parses field types (text/checkbox/radio group/dropdown)
  • Locates field positions on pages (Rect coordinates)
  • For radio groups, extracts position coordinates of all options
  • Sorts output by page + Y-axis position

This script is the starting point of the form filling pipeline — its JSON output serves as input to fill_pdf_form_with_annotations.py.
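The recursive traversal and the final sort can be illustrated on plain dictionaries. This sketch does not use pypdf and is not the script's code: the nested dicts stand in for a field tree (`T` for a partial name and `Kids` for children, as in the PDF field dictionary), while the `page` and `rect` keys and both helper names are hypothetical. Fully qualified names join partial names with periods, and "sorted by Y" is assumed to mean top of the page first.

```python
def collect_fields(node, parent_name=""):
    """Yield (fully_qualified_name, node) for every leaf of a field tree."""
    partial = node.get("T", "")
    if parent_name and partial:
        name = f"{parent_name}.{partial}"
    else:
        name = parent_name or partial
    kids = node.get("Kids", [])
    if not kids:
        yield name, node
        return
    for kid in kids:
        yield from collect_fields(kid, name)


def sort_fields(fields):
    """Order fields by page, then top-to-bottom.

    Assumes rect is [left, bottom, right, top] with a bottom-left origin,
    so a larger top value means higher on the page.
    """
    return sorted(fields, key=lambda f: (f["page"], -f["rect"][3]))


tree = {
    "T": "applicant",
    "Kids": [
        {"T": "address", "Kids": [
            {"T": "city", "page": 1, "rect": [50, 650, 200, 670]},
            {"T": "zip", "page": 2, "rect": [50, 700, 100, 720]},
        ]},
        {"T": "name", "page": 1, "rect": [50, 700, 200, 720]},
    ],
}

fields = [{"name": name, "page": leaf["page"], "rect": leaf["rect"]}
          for name, leaf in collect_fields(tree)]
ordered_names = [f["name"] for f in sort_fields(fields)]
```

In a real PDF the children of a radio group are widget annotations rather than named fields, so an actual implementation needs extra handling that this sketch leaves out.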

  • check_bounding_boxes.py: Validates that field bounding boxes are within page boundaries
  • check_fillable_fields.py: Detects if a PDF contains fillable fields
  • create_validation_image.py: Visualizes filled forms for manual verification
  • extract_form_structure.py: Extracts overall form structure (hierarchy, grouping, nesting)
  • fill_fillable_fields.py: Fills form fields using standard AcroForm API (complementary to annotation method)

5 scripts form a complete pipeline: check → extract structure → extract field details → fill (2 methods) → validate. This is the most sophisticated subsystem design in the pdf skill.
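The validation step for bounding boxes, as described for check_bounding_boxes.py, amounts to comparing each box with its page size. The sketch below is a hedged illustration rather than the script itself: the `find_invalid_boxes` helper, the `[left, top, right, bottom]` box layout, and the shape of `page_sizes` are assumptions, while the `page_number` and `entry_bounding_box` keys follow the field JSON shown in the excerpt above.

```python
def find_invalid_boxes(fields, page_sizes):
    """Return (index, reason) for every field whose box is unusable.

    fields: list of dicts with "page_number" and "entry_bounding_box".
    page_sizes: dict mapping page number to (width, height), in the same
    units as the boxes (an assumption of this sketch).
    """
    problems = []
    for index, field in enumerate(fields):
        left, top, right, bottom = field["entry_bounding_box"]
        width, height = page_sizes[field["page_number"]]
        if left >= right or top >= bottom:
            problems.append((index, "empty or inverted"))
        elif left < 0 or top < 0 or right > width or bottom > height:
            problems.append((index, "outside page"))
    return problems


sample_fields = [
    {"page_number": 1, "entry_bounding_box": [50, 100, 200, 120]},
    {"page_number": 1, "entry_bounding_box": [500, 100, 700, 120]},
    {"page_number": 2, "entry_bounding_box": [50, 120, 200, 100]},
]
sample_sizes = {1: (612, 792), 2: (612, 792)}
sample_problems = find_invalid_boxes(sample_fields, sample_sizes)
```

Running a check like this before filling catches coordinate mistakes while they are still cheap to fix, which is the point of placing validation at the end of the pipeline.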

  1. “Script Toolkit” Pattern: 8 independent, single-purpose scripts with no complex module dependencies. Users can run only the one they need without unrelated dependencies
  2. Form Processing Subsystem: 5 scripts build a complete pipeline from analysis to validation, demonstrating how independent tools combine into a workflow
  3. Triple Library Strategy: Rather than using one library for everything, choose the best tool per operation type — pypdf for basic, pdfplumber for extraction, reportlab for creation
  4. Coordinate Transformation Layer: The coordinate transform abstraction in form filling scripts shields differences between image and PDF coordinate systems

“If you want to create a similar processing Skill for another file format (e.g., images, audio, video)…”

  1. Analyze operation types: List all operations users might need (read, convert, edit, validate)
  2. Create independent scripts per operation: Follow “one script, one operation” principle
  3. Establish core reference: SKILL.md provides quick reference table, linking to each script
  4. Identify subsystems: If certain operations form a workflow (analyze → edit → validate), organize them as subsystems
  5. Choose the right libraries: Like pdf’s triple library strategy, don’t try to solve everything with one tool

⚠️ pypdf version compatibility: API changes significantly between pypdf versions (especially annotation API) — specify version ranges

⚠️ Form type detection: Not all PDFs use standard AcroForm — some use XFA forms (unsupported by pypdf), requiring fallback to annotation-style filling

⚠️ Coordinate system confusion: PDF uses bottom-left based coordinate system while images use top-left — coordinate conversion errors are the #1 source of form processing bugs

⚠️ reportlab font limitations: Built-in fonts don’t support Unicode subscript/superscript characters — must use XML tags or manual positioning

| Pattern | Description | Applies to... |
| --- | --- | --- |
| Script Toolkit | Collection of independent, single-purpose scripts with no complex dependencies | Domains with diverse but independent operations |
| Pipeline Workflow | Multiple tool scripts chained into a complete process | Complex tasks requiring analysis → processing → validation |
| Triple Library Strategy | Different libraries for different operation types | Scenarios where no single library covers all needs |
| Coordinate Abstraction Layer | Shields differences between coordinate systems | Scenarios involving image/document coordinate conversion |