跳转到内容

XLSX:Excel 电子表格处理与分析

xlsx 是电子表单处理的专家系统——核心价值不在脚本,而在 SKILL.md 中编码的行业标准格式化规则和零公式错误保证。

  • 📊 使用 openpyxl 创建和编辑电子表单
  • 🔢 Excel 公式优先策略——用公式而非硬编码值
  • 🔄 公式重算引擎——通过 LibreOffice 宏重算并检测错误
  • 📐 行业标准格式化规则(财务模型配色/数字格式)
  • 📋 模板保留——精确匹配现有格式和样式
  • ✅ 零公式错误保证

用户提及 .xlsx、.xlsm、.csv、.tsv 文件、需要创建/编辑电子表单、需要处理财政数据或公式。

xlsx is an expert system for spreadsheet processing — its core value lies not in scripts, but in the industry-standard formatting rules and zero-formula-error guarantee encoded in SKILL.md.

  • 📊 Create and edit spreadsheets with openpyxl
  • 🔢 Excel formula-first strategy — formulas instead of hardcoded values
  • 🔄 Formula recalculation engine — recalculates via LibreOffice macros and detects errors
  • 📐 Industry-standard formatting rules (financial model color coding / number formats)
  • 📋 Template preservation — exactly match existing format and styles
  • ✅ Zero formula error guarantee

User mentions .xlsx, .xlsm, .csv, .tsv files, needs to create/edit spreadsheets, or needs to process financial data with formulas.

xlsx 的结构特点是**“指令重于脚本”**——脚本数量最少(仅 1 个独特脚本),但 SKILL.md 中编码了最详尽的领域知识(财务建模标准、格式规则、验证清单)。

约 280 行的 SKILL.md,核心结构分为两大块:

第一部分:Requirements for Outputs —— 核心”规则墙”

  1. All Excel files 通用规则:

    • 使用专业字体(Arial、Times New Roman)
    • 零公式错误 —— 这是 xlsx 最核心的承诺
    • 编辑模板时精确保留现有格式和样式
  2. Financial models 财务模型专属规则——最精华部分:

    • 行业标准配色:蓝色=硬编码输入、黑色=公式、绿色=同工作簿引用、红色=外部引用、黄色=待更新假设
    • 数字格式标准:年份为文本、货币用 $#,##0、零显示为 ”-“、百分比 0.0%、倍数 0.0x、负数用括号
    • 公式构建规则:假设放在独立单元格、使用单元格引用而非硬编码值、公式避免循环引用
    • 硬编码文档:每个硬编码值必须标注来源(10-K、Bloomberg、FactSet 等)

第二部分:XLSX creation, editing, and analysis —— 操作指南

  1. Important Requirements:LibreOffice 用于公式重算
  2. Reading and analyzing data:使用 pandas 进行数据分析
  3. Excel File Workflows:公式优先策略、创建/编辑工作流
  4. Recalculating formulas:recalc.py 使用说明和 JSON 输出解析
  5. Formula Verification Checklist:详尽的自查清单
  6. Best Practices:库选择指南、openpyxl/pandas 使用技巧

TRIGGER 条件明确——“电子表单文件是主要输入或输出”。包含具体的 SKIP 条件:“交付物不是 Word 文档、HTML 报告、独立 Python 脚本、数据库流水线或 Google Sheets API 集成”。

与其他 Office skill 不同,xlsx 的核心价值不在脚本的复杂度:

  • 独特脚本recalc.py(公式重算)——这是 xlsx 最关键的脚本
  • 共享包office/(unpack/pack/validate/soffice)——与 docx、pptx 完全共享
  • 核心规则:SKILL.md 中的”要求”部分——这是真正的差异化优势

xlsx 是典型的**“领域专家”(Domain Expertise)** 型 Skill:

The xlsx skill is characterized by “instructions over scripts” — it has the fewest unique scripts (only 1), but its SKILL.md encodes the most detailed domain knowledge (financial modeling standards, formatting rules, verification checklists).

An ~280-line SKILL.md with two major sections:

Part 1: Requirements for Outputs — The Core “Rule Wall”

  1. All Excel files general rules:

    • Professional fonts (Arial, Times New Roman)
    • Zero formula errors — xlsx’s most core promise
    • When editing templates, exactly preserve existing format and styles
  2. Financial models rules — the most valuable part:

    • Industry-standard color coding: Blue=hardcoded inputs, Black=formulas, Green=same-workbook references, Red=external references, Yellow=assumptions needing attention
    • Number format standards: Years as text, currency with $#,##0, zeros as ”-”, percentages 0.0%, multiples 0.0x, negatives in parentheses
    • Formula construction rules: Assumptions in separate cells, cell references instead of hardcoded values, no circular references
    • Hardcode documentation: Each hardcoded value must cite source (10-K, Bloomberg, FactSet, etc.)

Part 2: XLSX creation, editing, and analysis — Operation Guide

  1. Important Requirements: LibreOffice required for formula recalculation
  2. Reading and analyzing data: pandas for data analysis
  3. Excel File Workflows: Formula-first strategy, create/edit workflows
  4. Recalculating formulas: recalc.py usage and JSON output parsing
  5. Formula Verification Checklist: Detailed self-check list
  6. Best Practices: Library selection guide, openpyxl/pandas tips

TRIGGER conditions are clear — “spreadsheet file is primary input or output”. Includes specific SKIP conditions: “deliverable is not Word document, HTML report, standalone Python script, database pipeline, or Google Sheets API integration”.

Unlike other Office skills, xlsx’s core value is not in script complexity:

  • Unique script: recalc.py (formula recalculation) — the most critical xlsx script
  • Shared package: office/ (unpack/pack/validate/soffice) — fully shared with docx, pptx
  • Core rules: The “Requirements” section in SKILL.md — the true differentiator

xlsx is a typical “Domain Expertise” Skill:

特征说明
核心价值SKILL.md 中的行业标准规则,而非脚本能力
脚本数仅 1 个独特脚本(recalc.py)
规则密度约 50% 的 SKILL.md 内容是格式化规则和验证清单
共享包office/ 与 docx/pptx 完全共享
质量保证零公式错误 + 详尽自查清单
FeatureDescription
Core ValueIndustry-standard rules in SKILL.md, not script capability
Script CountOnly 1 unique script (recalc.py)
Rule Density~50% of SKILL.md is formatting rules and verification checklists
Shared Packageoffice/ fully shared with docx/pptx
Quality AssuranceZero formula errors + detailed self-check checklist

xlsx 的脚本集合轻量但关键:

xlsx’s script collection is lightweight but critical:

脚本功能依赖
recalc.py公式重算 + 错误检测openpyxl, LibreOffice
ScriptFunctionDependency
recalc.pyFormula recalculation + error detectionopenpyxl, LibreOffice
模块功能
office/unpack.pyXLSX → XML 拆包
office/pack.pyXML → XLSX 打包
office/validate.py验证入口
office/soffice.pyLibreOffice 封装(沙箱兼容配置)
office/validators/Schema 验证器(共享)
ModuleFunction
office/unpack.pyXLSX → XML unpacking
office/pack.pyXML → XLSX packaging
office/validate.pyValidation entry point
office/soffice.pyLibreOffice wrapper (sandbox-compatible config)
office/validators/Schema validators (shared)

这是 xlsx 中唯一真正独特的脚本,负责解决 openpyxl 的核心限制:openpyxl 不会计算公式的值,因此创建带公式的 Excel 文件后需要一个外部引擎来执行重算。

核心设计包括五项:LibreOffice 宏自动部署(首次运行时自动创建 StarBasic 宏 RecalculateAndSave,调用 ThisComponent.calculateAll());跨平台支持(macOS 和 Linux,自动检测 gtimeout);错误检测系统(扫描 7 种 Excel 错误);JSON 输出(结构化报告含类型和位置);超时保护防止 LibreOffice 挂起。

关键设计亮点在于双层验证:第一层通过 LibreOffice 宏计算所有公式,第二层通过 openpyxl 逐单元格扫描错误,输出 JSON 格式便于 Claude 程序化处理。

recalc.py — Formula Recalculation Engine

Section titled “recalc.py — Formula Recalculation Engine”

This is the only truly unique script in xlsx, solving openpyxl’s core limitation: openpyxl does not calculate formula values, so an external engine is needed to perform recalculation after creating Excel files with formulas.

Core design includes five aspects: auto-deploy LibreOffice macro (creates StarBasic RecalculateAndSave macro on first run, calling ThisComponent.calculateAll()); cross-platform support (macOS and Linux, auto-detects gtimeout); error detection system (scans for 7 Excel error types); JSON output (structured report with types and locations); and timeout protection to prevent LibreOffice from hanging.

The key design highlight is the two-layer validation: Layer 1 calculates all formulas via LibreOffice macro, Layer 2 scans each cell for errors via openpyxl, outputting JSON for programmatic error handling by Claude.

recalc.py — 公式重算核心函数 ↗ 源文件
1 def recalc(filename, timeout=30): 2 if not Path(filename).exists(): 3 return {"error": f"File {filename} does not exist"} 4 5 abs_path = str(Path(filename).absolute()) 6 7 if not setup_libreoffice_macro(): 8 return {"error": "Failed to setup LibreOffice macro"} 9 10 cmd = ["soffice", "--headless", "--norestore", 11 "vnd.sun.star.script:Standard.Module1" 12 ".RecalculateAndSave?language=Basic&location=application", 13 abs_path] 14 15 result = subprocess.run(cmd, capture_output=True, text=True, 16 env=get_soffice_env()) 17 18 wb = load_workbook(filename, data_only=True) 19 for sheet_name in wb.sheetnames: 20 ws = wb[sheet_name] 21 for row in ws.iter_rows(): 22 for cell in row: 23 if cell.value and isinstance(cell.value, str): 24 for err in excel_errors: 25 if err in cell.value: 26 error_details[err].append( 27 f"{sheet_name}!{cell.coordinate}") 28 29 wb_formulas = load_workbook(filename, data_only=False) 30 # Count formulas for reporting 31 for sheet_name in wb_formulas.sheetnames: 32 for row in wb_formulas[sheet_name].iter_rows(): 33 for cell in row: 34 if cell.value and isinstance(cell.value, str) and cell.value.startswith("="): 35 formula_count += 1 36 37 return {"status": "success" if total_errors == 0 else "errors_found", 38 "total_errors": total_errors, 39 "total_formulas": formula_count, 40 "error_summary": error_summary}
代码解读
L1 函数入口:filename(Excel 路径) + timeout(超时秒数,默认 30)。返回 JSON 格式结果。 L6 setup_libreoffice_macro():首次运行时自动创建 StarBasic 宏,调用 ThisComponent.calculateAll() 和 ThisComponent.store()。 L9 关键命令:使用 LibreOffice 的 vnd.sun.star.script URI 直接调用 StarBasic 宏。这是最可靠的远程宏调用方式。 L16 第一层验证:宏执行完毕后,以 data_only=True 模式打开重算后的文件,读取每个单元格的计算值。 L20 错误扫描:逐单元格检查 7 种 Excel 错误字符串。记录每个错误的位置(SheetName!CellRef 格式)。 L26 第二层验证:以 data_only=False 模式重新打开,统计公式总数用于报告。注意:这必须是一个独立的 load_workbook 调用,避免 data_only 冲突。 L33 JSON 输出:状态(success/errors_found)、错误总数、公式总数、错误摘要(按类型分组,最多 20 个位置)。

xlsx 复用了 docx 中 scripts/office/validators/ 目录下的全部验证器。虽然部分验证器名称是”docx”或”pptx”,但 base.py 中的基础验证逻辑(命名空间检查、关系 ID 验证、内容类型验证)对 XLSX 同样有效。

xlsx reuses all validators from the scripts/office/validators/ directory in docx. Although some validators are named “docx” or “pptx”, the base validation logic in base.py (namespace checks, relationship ID validation, content type validation) works for XLSX as well.

Usage is identical to docx:

Terminal window
python scripts/office/validate.py output.xlsx

xlsx 只有 1 个独特脚本(recalc.py),这与其他 Office skill 形成鲜明对比(docx 有 comment.py 等,pptx 有 thumbnail.py 等)。这是有意为之的选择——所有”智能”都编码在 SKILL.md 的指令中:格式化规则在 SKILL.md 中,脚本只需遵守;公式行为由 openpyxl 处理,无需额外脚本;模板保留是 Claude 的执行要求,不可脚本化;唯一需要外部工具的部分是公式重算——所以只有 recalc.py。

这说明了一个重要设计原则:SKILL.md 是主要的”程序”,脚本只处理无法通过指令完成的操作。

xlsx has only 1 unique script (recalc.py), in stark contrast to other Office skills (docx has comment.py, etc., pptx has thumbnail.py, etc.). This is intentional — all “intelligence” is encoded in SKILL.md’s instructions: formatting rules are in SKILL.md, scripts just follow them; formula behavior is handled by openpyxl, no extra scripts needed; template preservation is a Claude execution requirement, not scriptable; the only external tooling needed is formula recalculation — hence only recalc.py.

This illustrates an important design principle: SKILL.md is the primary “program”, and scripts only handle operations that cannot be accomplished through instructions alone.

  1. “领域专家”模式:xlsx 的价值来自 SKILL.md 中编码的专业知识——财务模型的行业标准配色、数字格式、公式规范。这些规则是真正的”领域壁垒”
  2. 零公式错误保证:通过 recalc.py 的双层验证(LibreOffice 重算 + openpyxl 扫描),确保每个输出的公式都是无错误的
  3. 公式优先策略:SKILL.md 中明确禁止在 Python 中计算并硬编码结果——所有计算必须使用 Excel 公式,确保电子表单的动态性和可更新性
  4. 模板保留原则:编辑现有模板时,精确匹配已有格式和样式——不强行应用标准化格式
  5. 自部署宏机制:recalc.py 首次运行时自动创建 LibreOffice 宏,避免了繁琐的手动配置

“如果你想为电子表单或其他数据密集型格式创建类似的 Skill…”

  1. 识别领域标准:在你的领域中,什么是”好输出”的标准?对这些标准进行编码——这就是 xlsx 的”规则墙”
  2. 选择正确的库:基础操作用 pandas(数据分析),精细格式化用 openpyxl
  3. 建立质量保障机制:类似 recalc.py 的验证脚本,确保输出无错误
  4. 采用公式优先思路:如果你的格式有内置计算能力,优先使用而非在代码中预先计算
  5. 复用共享包:如果处理相关格式(CSV、TSV、ODS),提取共享处理代码

⚠️ data_only=True 陷阱:data_only=True 打开文件后保存,会永久丢失公式(公式被替换为计算值)。除非明确需要”转化为值”,否则永远不要保存 data_only=True 模式打开的文件

⚠️ 列号混淆: Excel 列号是 1-based,pandas DataFrame 的行是 0-based——在向量和坐标之间转换时很容易出错(如第 64 列是 BL 而非 BK)

⚠️ 公式的 #REF! 错误: 插入/删除行列后,公式可能引用被移除的单元格。recalc.py 的 JSON 输出可以定位这些错误,但修复需要人工判断

⚠️ 模板格式覆盖: 编辑现有模板时,最容易犯的错误是”格式化”(将所有表格改为标准样式),而非保留原有格式。必须显式禁止

⚠️ LibreOffice 宏首次配置: recalc.py 的宏自动部署依赖于 soffice --headless --terminate_after_init 的成功执行。在沙箱环境中,get_soffice_env() 的环境变量配置至关重要

  1. “Domain Expertise” Pattern: xlsx’s value comes from domain knowledge encoded in SKILL.md — financial model industry-standard color coding, number formats, formula conventions. These rules are the true “domain moat”
  2. Zero Formula Error Guarantee: Two-layer validation (LibreOffice recalculation + openpyxl scanning) ensures every output formula is error-free
  3. Formula-First Strategy: SKILL.md explicitly prohibits calculating in Python and hardcoding results — all calculations must use Excel formulas, ensuring spreadsheet dynamism and updatability
  4. Template Preservation Principle: When editing existing templates, exactly match existing format and styles — don’t force standard formatting
  5. Self-Deploying Macro Mechanism: recalc.py auto-creates LibreOffice macros on first run, avoiding tedious manual setup

“If you want to create a similar Skill for spreadsheets or other data-intensive formats…”

  1. Identify domain standards: What defines “good output” in your domain? Encode these standards — this is xlsx’s “rule wall”
  2. Choose the right libraries: pandas for basic operations (data analysis), openpyxl for fine formatting
  3. Establish quality assurance: A validation script like recalc.py ensuring error-free output
  4. Adopt formula-first approach: If your format has built-in computation, prefer it over pre-calculating in code
  5. Reuse shared packages: If processing related formats (CSV, TSV, ODS), extract shared processing code

⚠️ data_only=True trap: Opening a file with data_only=True and saving permanently loses formulas (replaced with calculated values). Never save a file opened in data_only=True mode unless explicitly converting to values

⚠️ Column number confusion: Excel columns are 1-based, pandas DataFrame rows are 0-based — easy to make mistakes when converting between vectors and coordinates (e.g., column 64 = BL, not BK)

⚠️ #REF! errors in formulas: After inserting/deleting rows/columns, formulas may reference removed cells. recalc.py’s JSON output can locate these errors, but repair requires human judgment

⚠️ Template format override: When editing existing templates, the most common mistake is “formatting” (converting all tables to standard styles) instead of preserving original format. Must explicitly prohibit

⚠️ LibreOffice first-time macro config: recalc.py’s macro auto-deployment depends on successful soffice --headless --terminate_after_init execution. In sandbox environments, get_soffice_env() environment variable configuration is critical

模式说明适用于...
领域专家将行业标准直接编码为指令任何需要专业领域知识的 skill
零错误保证通过双层验证确保输出质量错误代价高、需要精确输出的场景
公式优先优先使用领域自身的计算能力处理有内置计算引擎的格式
模板保留编辑时精确保留现有格式处理有复杂样式和约定的模板
自部署工具首次运行时自动配置依赖需要外部工具但不想手动配置的场景
PatternDescriptionApplies to...
Domain ExpertiseEncode industry standards as instructionsAny skill needing specialized domain knowledge
Zero-Error GuaranteeTwo-layer validation ensuring output qualityScenarios where errors are costly
Formula-FirstPrefer the format's own computation engineFormats with built-in calculation engines
Template PreservationExactly preserve existing format when editingTemplates with complex styles and conventions
Self-Deploying ToolsAuto-configure dependencies on first runScenarios needing external tools without manual setup