PDF: A Complete Document Processing Toolkit

One-Line Summary

pdf is a Swiss Army knife for PDF processing: it covers reading, extraction, merging, splitting, rotation, watermarking, form filling, and OCR, combining library guidance in SKILL.md with 8 standalone Python scripts.

Core Capabilities

📖 Basic PDF operations with pypdf (read/merge/split/rotate/encrypt)
🔍 Text and table extraction with pdfplumber
📝 PDF creation with reportlab
📋 Form processing subsystem (5 scripts: check fillable fields, extract field structure, fill forms, validate bounding boxes, create validation images)
🖼️ PDF to image conversion + OCR
🔐 Password protection and watermarking

Trigger Scenarios

Triggered when the user mentions PDF files or needs to process PDF tables, fill PDF forms, merge or split PDFs, or OCR scanned PDFs.

File Inventory

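pdf
- SKILL.md: main entry point, PDF processing guide
- forms.md: dedicated form processing guide
- reference.md: advanced reference (pypdfium2, pdf-lib)
- LICENSE.txt: license
- scripts
  - fill_pdf_form_with_annotations.py: form filling (annotations)
  - fill_fillable_fields.py: form filling (standard fields)
  - extract_form_field_info.py: extract form field information
  - extract_form_structure.py: extract form structure
  - check_fillable_fields.py: check for fillable fields
  - check_bounding_boxes.py: check bounding boxes
  - create_validation_image.py: create validation image
  - convert_pdf_to_images.py: PDF to image conversion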

Directory Structure Analysis

The pdf skill follows a “Script Toolkit” pattern: a core SKILL.md provides the overview and quick reference, forms.md and reference.md give extended guidance, and a scripts/ directory holds 8 independent Python scripts, each handling one PDF operation.

SKILL.md Structure Analysis

SKILL.md runs to roughly 315 lines. Its core sections are:

Quick Start: pypdf-based quick start code
Python Libraries: Usage instructions for 3 libraries
- pypdf: Basic operations (merge, split, rotate, metadata, watermark, encrypt)
- pdfplumber: Text and table extraction
- reportlab: PDF creation (Canvas/Platypus)
Command-Line Tools: Alternatives with pdftotext, qpdf, pdftk
Common Tasks: Scanned PDF OCR, watermarking, image extraction, password protection
Quick Reference: Tool recommendation table for 7 task types

YAML Frontmatter Analysis

TRIGGER conditions are extremely broad: “user wants to do anything with PDF files”, including reading, extraction, merging, splitting, rotation, watermarking, creation, form filling, encryption/decryption, image extraction, and OCR. This broad trigger strategy suits a domain whose operations are wide-ranging but relatively standardized.

Module Relationships

The 8 scripts are independent, each handling a specific PDF operation. 5 of them form a form processing subsystem. SKILL.md provides tool selection guidance (pypdf vs pdfplumber vs reportlab), but specific instructions are provided inline within SKILL.md rather than through separate reference documents.

Design Pattern Classification

pdf is a typical “Script Toolkit” Skill:

Feature	Description
Script Count	8 independent Python scripts
Reference Docs	SKILL.md + forms.md + reference.md
Tool Selection	Built into SKILL.md Quick Reference table
Script Relationships	Independent, partially connected by form processing subsystem
Library Strategy	Triple library: pypdf (basic) + pdfplumber (extraction) + reportlab (creation)

Script Inventory

pdf contains 8 Python scripts, all independent and each handling a specific PDF operation.

Detailed Analysis

fill_pdf_form_with_annotations.py

This is the core form filling script. It implements “annotation-style” form filling, which suits PDFs that do not support standard AcroForm.

Core logic in three steps:

Coordinate transformation: Supports two coordinate systems (image → PDF and PDF → pypdf) via transform_from_image_coords and transform_from_pdf_coords functions
Annotation creation: Uses pypdf’s FreeText annotation class, creating text annotations with correct font, size, and color at each field location
Write output: Merges original PDF with all annotations into a new file

The key design highlight is the coordinate transformation layer — it handles the mapping from “visual space” to “PDF internal space”.
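The coordinate transformation layer can be sketched as follows. The function names come from the script, but the bodies are an assumption, not the script's actual code: this sketch assumes every input box is `[left, top, right, bottom]` measured from the top-left corner, and that the output is `[left, bottom, right, top]` measured from the bottom-left corner, as PDF rectangles are.

```python
def transform_from_image_coords(box, image_width, image_height,
                                pdf_width, pdf_height):
    """Map an image-space box (pixels, top-left origin) into PDF space.

    Assumed input layout: [left, top, right, bottom].
    Returns [left, bottom, right, top] in PDF units (bottom-left origin).
    """
    x_scale = pdf_width / image_width
    y_scale = pdf_height / image_height
    left, top, right, bottom = box
    return [
        left * x_scale,
        pdf_height - bottom * y_scale,  # flip the y axis after scaling
        right * x_scale,
        pdf_height - top * y_scale,
    ]


def transform_from_pdf_coords(box, pdf_height):
    """Flip a box already in PDF units but measured from the top edge.

    Assumed input layout: [left, top, right, bottom]; no scaling is needed.
    """
    left, top, right, bottom = box
    return [left, pdf_height - bottom, right, pdf_height - top]
```

Keeping the y-axis flip in two small functions means the rest of the script can work with a single convention, whichever source produced the field boxes.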

Script	Function	Dependency
`fill_pdf_form_with_annotations.py`	Fill forms via FreeText annotations	pypdf
`fill_fillable_fields.py`	Fill standard fillable form fields	pypdf
`extract_form_field_info.py`	Extract form field structure (with positioning)	pypdf
`extract_form_structure.py`	Extract overall form structure	pypdf
`check_fillable_fields.py`	Check if PDF has fillable fields	pypdf
`check_bounding_boxes.py`	Validate field bounding boxes	pypdf
`create_validation_image.py`	Create form fill validation images	pypdf, Pillow
`convert_pdf_to_images.py`	PDF to PNG conversion (with size scaling)	pdf2image, Pillow

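fill_pdf_form_with_annotations.py (source excerpt)

```python
import json

from pypdf import PdfReader, PdfWriter
from pypdf.annotations import FreeText

# transform_from_pdf_coords and transform_from_image_coords are defined
# elsewhere in the script and are not part of this excerpt.


def fill_pdf_form(input_pdf_path, fields_json_path, output_pdf_path):
    # Load the field definitions ("pages" and "form_fields") from JSON.
    with open(fields_json_path, "r") as f:
        fields_data = json.load(f)

    # Copy every page of the source PDF into the writer.
    reader = PdfReader(input_pdf_path)
    writer = PdfWriter()
    writer.append(reader)

    # Cache each page's media box size, keyed by 1-based page number.
    pdf_dimensions = {}
    for i, page in enumerate(reader.pages):
        mediabox = page.mediabox
        pdf_dimensions[i + 1] = [mediabox.width, mediabox.height]

    annotations = []
    for field in fields_data["form_fields"]:
        page_num = field["page_number"]
        page_info = next(p for p in fields_data["pages"]
                         if p["page_number"] == page_num)
        pdf_width, pdf_height = pdf_dimensions[page_num]

        if "pdf_width" in page_info:
            # Field boxes came from the PDF itself: only the y axis flips.
            transformed_entry_box = transform_from_pdf_coords(
                field["entry_bounding_box"], float(pdf_height))
        else:
            # Assumption: the excerpt omits these lookups; key names are guesses.
            image_width = page_info["image_width"]
            image_height = page_info["image_height"]
            # Field boxes came from a rendered image: scale, then flip.
            transformed_entry_box = transform_from_image_coords(
                field["entry_bounding_box"],
                image_width, image_height,
                float(pdf_width), float(pdf_height))

        # Assumption: the excerpt omits how the text and font settings are
        # read; the key names and defaults below are illustrative.
        entry_text = field["entry_text"]
        font_name = entry_text.get("font", "Helvetica")
        font_size = entry_text.get("font_size", "14pt")
        font_color = entry_text.get("font_color", "000000")

        annotation = FreeText(
            text=entry_text["text"],
            rect=transformed_entry_box,
            font=font_name, font_size=font_size,
            font_color=font_color,
        )
        annotations.append(annotation)
        # pypdf page indexes are 0-based; the JSON uses 1-based page numbers.
        writer.add_annotation(page_number=page_num - 1, annotation=annotation)

    with open(output_pdf_path, "wb") as output:
        writer.write(output)
```

Code walkthrough

Function entry point (fill_pdf_form): takes three arguments: the input PDF path, the field-definition JSON path, and the output PDF path.
PdfReader + PdfWriter: the reader loads all pages and metadata of the original PDF; the writer builds the output file.
Page-size pre-extraction (pdf_dimensions): loops over every page and stores the media box size in a dictionary for the coordinate transforms. Accurate coordinates depend on this step.
Coordinate-system branch: depending on where the field data came from (extracted directly from the PDF or derived from image analysis), the code calls transform_from_pdf_coords or transform_from_image_coords.
FreeText annotation creation: pypdf's high-level annotation API places text inside the given rectangle. Compared with editing the underlying PDF content stream, this approach is safer and easier to reverse.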

convert_pdf_to_images.py

A PDF to image utility using pdf2image to render PDF pages as PNG images.

Core logic: converts the PDF to PIL Image objects via pdf2image, proportionally scales down any image that exceeds the maximum size, then saves each page as a PNG file. Supports a custom output directory and a maximum dimension limit (default 1000px).

This script’s output plays a key role in the form processing visual verification pipeline — the converted images are used by create_validation_image.py for visual comparison.
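The proportional scaling step described above comes down to a small piece of arithmetic. The helper below is a hypothetical sketch rather than the script's code: `scaled_size` and its `max_dim` parameter are names chosen here, with the 1000px default taken from the description.

```python
def scaled_size(width, height, max_dim=1000):
    """Return (width, height) shrunk proportionally to fit within max_dim.

    Sizes that already fit are returned unchanged; nothing is enlarged.
    """
    largest = max(width, height)
    if largest <= max_dim:
        return width, height
    scale = max_dim / largest
    # round() avoids off-by-one results from floating point truncation.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

In a real conversion script, the resulting tuple would typically be passed to PIL's `Image.resize` before the page is saved as PNG.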

extract_form_field_info.py

A form structure parsing tool that analyzes AcroForm fields in PDFs.

Key features:

Recursively traverses the field tree, extracting fully qualified field names
Parses field types (text/checkbox/radio group/dropdown)
Locates field positions on pages (Rect coordinates)
For radio groups, extracts position coordinates of all options
Sorts output by page + Y-axis position

This script is the starting point of the form filling pipeline — its JSON output serves as input to fill_pdf_form_with_annotations.py.
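The traversal and sorting described above can be illustrated on plain dictionaries, without a PDF library. This is a sketch under stated assumptions: the node layout (`name`, `kids`, `page`, `rect`) is hypothetical, and the top-to-bottom sort direction is a guess at what "page + Y-axis position" means. Joining parent and child names with a period follows the PDF convention for fully qualified field names.

```python
def flatten_fields(node, prefix=""):
    """Yield terminal fields with fully qualified names, depth first."""
    name = node.get("name", "")
    full_name = f"{prefix}.{name}" if prefix and name else (prefix or name)
    kids = node.get("kids")
    if kids:
        # Non-terminal node: recurse, carrying the qualified prefix down.
        for kid in kids:
            yield from flatten_fields(kid, full_name)
    else:
        yield {"name": full_name, "page": node["page"], "rect": node["rect"]}


def sort_fields(fields):
    """Order fields by page, then from the top of the page downward.

    rect is [left, bottom, right, top]; PDF y grows upward, so a larger
    top value means the field sits higher on the page.
    """
    return sorted(fields, key=lambda f: (f["page"], -f["rect"][3]))


form_tree = {
    "name": "applicant",
    "kids": [
        {"name": "last", "page": 1, "rect": [50, 600, 200, 620]},
        {"name": "first", "page": 1, "rect": [50, 700, 200, 720]},
        {"name": "address", "kids": [
            {"name": "city", "page": 2, "rect": [50, 700, 200, 720]},
        ]},
    ],
}
ordered_names = [f["name"] for f in sort_fields(flatten_fields(form_tree))]
```

Here `ordered_names` lists `applicant.first` before `applicant.last` because it sits higher on page 1, and the nested `applicant.address.city` comes last because it is on page 2.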

Brief Summary of Remaining Scripts

check_bounding_boxes.py: Validates that field bounding boxes are within page boundaries
check_fillable_fields.py: Detects if a PDF contains fillable fields
create_validation_image.py: Visualizes filled forms for manual verification
extract_form_structure.py: Extracts overall form structure (hierarchy, grouping, nesting)
fill_fillable_fields.py: Fills form fields using standard AcroForm API (complementary to annotation method)
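As an illustration of the kind of check check_bounding_boxes.py is described as performing, the sketch below tests whether a box lies inside the page. It is not the script's code: the function name and the `[left, bottom, right, top]` box layout are assumptions made for this example.

```python
def box_within_page(box, page_width, page_height):
    """Return True if the box is well formed and lies entirely on the page.

    Assumed layout: [left, bottom, right, top] in PDF units, bottom-left origin.
    """
    left, bottom, right, top = box
    return (0 <= left < right <= page_width
            and 0 <= bottom < top <= page_height)
```

A check like this also catches inverted boxes (for example, top below bottom), which are a common symptom of a y-axis flip applied twice or not at all.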

Form Processing Subsystem

5 scripts form a complete pipeline: check → extract structure → extract field details → fill (2 methods) → validate. This is the most sophisticated subsystem design in the pdf skill.
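One way to read the pipeline above is as an ordered plan with a single branch at the fill stage. The sketch below only returns script names in order; it does not run them. The mapping of stages to scripts is an assumption drawn from the script table, and `plan_form_pipeline` is a hypothetical helper, not part of the skill.

```python
def plan_form_pipeline(has_fillable_fields):
    """Return the script names to run, in order, for one form filling job."""
    # Standard AcroForm fields use the field-based filler; anything else
    # falls back to drawing FreeText annotations at computed positions.
    fill_step = ("fill_fillable_fields.py" if has_fillable_fields
                 else "fill_pdf_form_with_annotations.py")
    return [
        "check_fillable_fields.py",    # check
        "extract_form_structure.py",   # extract structure
        "extract_form_field_info.py",  # extract field details
        fill_step,                     # fill (one of two methods)
        "check_bounding_boxes.py",     # validate
        "create_validation_image.py",  # validate visually
    ]
```

Treating the plan as data keeps each script independent while still making the workflow order explicit.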

Design Highlights

“Script Toolkit” Pattern: 8 independent, single-purpose scripts with no complex module dependencies. Users can run only the one they need without unrelated dependencies
Form Processing Subsystem: 5 scripts build a complete pipeline from analysis to validation, demonstrating how independent tools combine into a workflow
Triple Library Strategy: Rather than using one library for everything, choose the best tool per operation type — pypdf for basic, pdfplumber for extraction, reportlab for creation
Coordinate Transformation Layer: The coordinate transform abstraction in form filling scripts shields differences between image and PDF coordinate systems

Reusable Patterns

Porting Guide

“If you want to create a similar processing Skill for another file format (e.g., images, audio, video)…”

Analyze operation types: List all operations users might need (read, convert, edit, validate)
Create independent scripts per operation: Follow “one script, one operation” principle
Establish core reference: SKILL.md provides quick reference table, linking to each script
Identify subsystems: If certain operations form a workflow (analyze → edit → validate), organize them as subsystems
Choose the right libraries: Like pdf’s triple library strategy, don’t try to solve everything with one tool

Common Pitfalls

⚠️ pypdf version compatibility: API changes significantly between pypdf versions (especially annotation API) — specify version ranges

⚠️ Form type detection: Not all PDFs use standard AcroForm — some use XFA forms (unsupported by pypdf), requiring fallback to annotation-style filling

⚠️ Coordinate system confusion: PDF uses bottom-left based coordinate system while images use top-left — coordinate conversion errors are the #1 source of form processing bugs

⚠️ reportlab font limitations: Built-in fonts don’t support Unicode subscript/superscript characters — must use XML tags or manual positioning

Pattern	Description	Applies to...
Script Toolkit	Collection of independent, single-purpose scripts	Domains with diverse but independent operations
Pipeline Workflow	Multiple tool scripts chained into a complete process	Complex tasks requiring analysis → processing → validation
Triple Library Strategy	Different libraries for different operation types	Scenarios where no single library covers all needs
Coordinate Abstraction Layer	Shields differences between coordinate systems	Scenarios involving image/document coordinate conversion