--- name: docx description: "当用户想要创建、读取、编辑或操作 Word 文档(.docx 文件)时使用此技能。触发条件包括:提及 \"Word 文档\"、\"word document\"、\".docx\",或请求制作具有目录、标题、页码或信笺等格式的专业文档。还可用于从 .docx 文件中提取或重组内容、在文档中插入或替换图像、在 Word 文件中执行查找替换、处理修订或评论、或将内容转换为精美的 Word 文档。如果用户要求 \"报告\"、\"备忘录\"、\"信函\"、\"模板\" 或类似的 Word 或 .docx 文件交付物,请使用此技能。不要用于 PDF、电子表格、Google 文档或与文档生成无关的通用编码任务。" license: Proprietary. LICENSE.txt has complete terms --- # DOCX creation, editing, and analysis ## Overview A .docx file is a ZIP archive containing XML files. ## Quick Reference | Task | Approach | |------|----------| | Read/analyze content | `pandoc` or unpack for raw XML | | Create new document | Use `docx-js` - see Creating New Documents below | | Edit existing document | Unpack → edit XML → repack - see Editing Existing Documents below | ### Converting .doc to .docx Legacy `.doc` files must be converted before editing: ```bash python scripts/office/soffice.py --headless --convert-to docx document.doc ``` ### Reading Content ```bash # Text extraction with tracked changes pandoc --track-changes=all document.docx -o output.md # Raw XML access python scripts/office/unpack.py document.docx unpacked/ ``` ### Converting to Images ```bash python scripts/office/soffice.py --headless --convert-to pdf document.docx pdftoppm -jpeg -r 150 document.pdf page ``` ### Accepting Tracked Changes To produce a clean document with all tracked changes accepted (requires LibreOffice): ```bash python scripts/accept_changes.py input.docx output.docx ``` --- ## Creating New Documents Generate .docx files with JavaScript, then validate. Install: `npm install -g docx` ### Setup ```javascript const { Document, Packer, Paragraph, TextRun, Table, TableRow, TableCell, ImageRun, Header, Footer, AlignmentType, PageOrientation, LevelFormat, ExternalHyperlink, InternalHyperlink, Bookmark, FootnoteReferenceRun, PositionalTab, PositionalTabAlignment, PositionalTabRelativeTo, PositionalTabLeader, TabStopType, TabStopPosition, Column, SectionType, TableOfContents, HeadingLevel, BorderStyle, WidthType, ShadingType, VerticalAlign, PageNumber, PageBreak } = require('docx'); const doc = new Document({ sections: [{ children: [/* content */] }] }); Packer.toBuffer(doc).then(buffer => fs.writeFileSync("doc.docx", buffer)); ``` ### Validation After creating the file, validate it. If validation fails, unpack, fix the XML, and repack. ```bash python scripts/office/validate.py doc.docx ``` ### Page Size ```javascript // CRITICAL: docx-js defaults to A4, not US Letter // Always set page size explicitly for consistent results sections: [{ properties: { page: { size: { width: 12240, // 8.5 inches in DXA height: 15840 // 11 inches in DXA }, margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } // 1 inch margins } }, children: [/* content */] }] ``` **Common page sizes (DXA units, 1440 DXA = 1 inch):** | Paper | Width | Height | Content Width (1" margins) | |-------|-------|--------|---------------------------| | US Letter | 12,240 | 15,840 | 9,360 | | A4 (default) | 11,906 | 16,838 | 9,026 | **Landscape orientation:** docx-js swaps width/height internally, so pass portrait dimensions and let it handle the swap: ```javascript size: { width: 12240, // Pass SHORT edge as width height: 15840, // Pass LONG edge as height orientation: PageOrientation.LANDSCAPE // docx-js swaps them in the XML }, // Content width = 15840 - left margin - right margin (uses the long edge) ``` ### Styles (Override Built-in Headings) Use Arial as the default font (universally supported). Keep titles black for readability. ```javascript const doc = new Document({ styles: { default: { document: { run: { font: "Arial", size: 24 } } }, // 12pt default paragraphStyles: [ // IMPORTANT: Use exact IDs to override built-in styles { id: "Heading1", name: "Heading 1", basedOn: "Normal", next: "Normal", quickFormat: true, run: { size: 32, bold: true, font: "Arial" }, paragraph: { spacing: { before: 240, after: 240 }, outlineLevel: 0 } }, // outlineLevel required for TOC { id: "Heading2", name: "Heading 2", basedOn: "Normal", next: "Normal", quickFormat: true, run: { size: 28, bold: true, font: "Arial" }, paragraph: { spacing: { before: 180, after: 180 }, outlineLevel: 1 } }, ] }, sections: [{ children: [ new Paragraph({ heading: HeadingLevel.HEADING_1, children: [new TextRun("Title")] }), ] }] }); ``` ### Lists (NEVER use unicode bullets) ```javascript // ❌ WRONG - never manually insert bullet characters new Paragraph({ children: [new TextRun("• Item")] }) // BAD new Paragraph({ children: [new TextRun("\u2022 Item")] }) // BAD // ✅ CORRECT - use numbering config with LevelFormat.BULLET const doc = new Document({ numbering: { config: [ { reference: "bullets", levels: [{ level: 0, format: LevelFormat.BULLET, text: "•", alignment: AlignmentType.LEFT, style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] }, { reference: "numbers", levels: [{ level: 0, format: LevelFormat.DECIMAL, text: "%1.", alignment: AlignmentType.LEFT, style: { paragraph: { indent: { left: 720, hanging: 360 } } } }] }, ] }, sections: [{ children: [ new Paragraph({ numbering: { reference: "bullets", level: 0 }, children: [new TextRun("Bullet item")] }), new Paragraph({ numbering: { reference: "numbers", level: 0 }, children: [new TextRun("Numbered item")] }), ] }] }); // ⚠️ Each reference creates INDEPENDENT numbering // Same reference = continues (1,2,3 then 4,5,6) // Different reference = restarts (1,2,3 then 1,2,3) ``` ### Tables **CRITICAL: Tables need dual widths** - set both `columnWidths` on the table AND `width` on each cell. Without both, tables render incorrectly on some platforms. ```javascript // CRITICAL: Always set table width for consistent rendering // CRITICAL: Use ShadingType.CLEAR (not SOLID) to prevent black backgrounds const border = { style: BorderStyle.SINGLE, size: 1, color: "CCCCCC" }; const borders = { top: border, bottom: border, left: border, right: border }; new Table({ width: { size: 9360, type: WidthType.DXA }, // Always use DXA (percentages break in Google Docs) columnWidths: [4680, 4680], // Must sum to table width (DXA: 1440 = 1 inch) rows: [ new TableRow({ children: [ new TableCell({ borders, width: { size: 4680, type: WidthType.DXA }, // Also set on each cell shading: { fill: "D5E8F0", type: ShadingType.CLEAR }, // CLEAR not SOLID margins: { top: 80, bottom: 80, left: 120, right: 120 }, // Cell padding (internal, not added to width) children: [new Paragraph({ children: [new TextRun("Cell")] })] }) ] }) ] }) ``` **Table width calculation:** Always use `WidthType.DXA` — `WidthType.PERCENTAGE` breaks in Google Docs. ```javascript // Table width = sum of columnWidths = content width // US Letter with 1" margins: 12240 - 2880 = 9360 DXA width: { size: 9360, type: WidthType.DXA }, columnWidths: [7000, 2360] // Must sum to table width ``` **Width rules:** - **Always use `WidthType.DXA`** — never `WidthType.PERCENTAGE` (incompatible with Google Docs) - Table width must equal the sum of `columnWidths` - Cell `width` must match corresponding `columnWidth` - Cell `margins` are internal padding - they reduce content area, not add to cell width - For full-width tables: use content width (page width minus left and right margins) ### Images ```javascript // CRITICAL: type parameter is REQUIRED new Paragraph({ children: [new ImageRun({ type: "png", // Required: png, jpg, jpeg, gif, bmp, svg data: fs.readFileSync("image.png"), transformation: { width: 200, height: 150 }, altText: { title: "Title", description: "Desc", name: "Name" } // All three required })] }) ``` ### Page Breaks ```javascript // CRITICAL: PageBreak must be inside a Paragraph new Paragraph({ children: [new PageBreak()] }) // Or use pageBreakBefore new Paragraph({ pageBreakBefore: true, children: [new TextRun("New page")] }) ``` ### Hyperlinks ```javascript // External link new Paragraph({ children: [new ExternalHyperlink({ children: [new TextRun({ text: "Click here", style: "Hyperlink" })], link: "https://example.com", })] }) // Internal link (bookmark + reference) // 1. Create bookmark at destination new Paragraph({ heading: HeadingLevel.HEADING_1, children: [ new Bookmark({ id: "chapter1", children: [new TextRun("Chapter 1")] }), ]}) // 2. Link to it new Paragraph({ children: [new InternalHyperlink({ children: [new TextRun({ text: "See Chapter 1", style: "Hyperlink" })], anchor: "chapter1", })]}) ``` ### Footnotes ```javascript const doc = new Document({ footnotes: { 1: { children: [new Paragraph("Source: Annual Report 2024")] }, 2: { children: [new Paragraph("See appendix for methodology")] }, }, sections: [{ children: [new Paragraph({ children: [ new TextRun("Revenue grew 15%"), new FootnoteReferenceRun(1), new TextRun(" using adjusted metrics"), new FootnoteReferenceRun(2), ], })] }] }); ``` ### Tab Stops ```javascript // Right-align text on same line (e.g., date opposite a title) new Paragraph({ children: [ new TextRun("Company Name"), new TextRun("\tJanuary 2025"), ], tabStops: [{ type: TabStopType.RIGHT, position: TabStopPosition.MAX }], }) // Dot leader (e.g., TOC-style) new Paragraph({ children: [ new TextRun("Introduction"), new TextRun({ children: [ new PositionalTab({ alignment: PositionalTabAlignment.RIGHT, relativeTo: PositionalTabRelativeTo.MARGIN, leader: PositionalTabLeader.DOT, }), "3", ]}), ], }) ``` ### Multi-Column Layouts ```javascript // Equal-width columns sections: [{ properties: { column: { count: 2, // number of columns space: 720, // gap between columns in DXA (720 = 0.5 inch) equalWidth: true, separate: true, // vertical line between columns }, }, children: [/* content flows naturally across columns */] }] // Custom-width columns (equalWidth must be false) sections: [{ properties: { column: { equalWidth: false, children: [ new Column({ width: 5400, space: 720 }), new Column({ width: 3240 }), ], }, }, children: [/* content */] }] ``` Force a column break with a new section using `type: SectionType.NEXT_COLUMN`. ### Table of Contents ```javascript // CRITICAL: Headings must use HeadingLevel ONLY - no custom styles new TableOfContents("Table of Contents", { hyperlink: true, headingStyleRange: "1-3" }) ``` ### Headers/Footers ```javascript sections: [{ properties: { page: { margin: { top: 1440, right: 1440, bottom: 1440, left: 1440 } } // 1440 = 1 inch }, headers: { default: new Header({ children: [new Paragraph({ children: [new TextRun("Header")] })] }) }, footers: { default: new Footer({ children: [new Paragraph({ children: [new TextRun("Page "), new TextRun({ children: [PageNumber.CURRENT] })] })] }) }, children: [/* content */] }] ``` ### Critical Rules for docx-js - **Set page size explicitly** - docx-js defaults to A4; use US Letter (12240 x 15840 DXA) for US documents - **Landscape: pass portrait dimensions** - docx-js swaps width/height internally; pass short edge as `width`, long edge as `height`, and set `orientation: PageOrientation.LANDSCAPE` - **Never use `\n`** - use separate Paragraph elements - **Never use unicode bullets** - use `LevelFormat.BULLET` with numbering config - **PageBreak must be in Paragraph** - standalone creates invalid XML - **ImageRun requires `type`** - always specify png/jpg/etc - **Always set table `width` with DXA** - never use `WidthType.PERCENTAGE` (breaks in Google Docs) - **Tables need dual widths** - `columnWidths` array AND cell `width`, both must match - **Table width = sum of columnWidths** - for DXA, ensure they add up exactly - **Always add cell margins** - use `margins: { top: 80, bottom: 80, left: 120, right: 120 }` for readable padding - **Use `ShadingType.CLEAR`** - never SOLID for table shading - **Never use tables as dividers/rules** - cells have minimum height and render as empty boxes (including in headers/footers); use `border: { bottom: { style: BorderStyle.SINGLE, size: 6, color: "2E75B6", space: 1 } }` on a Paragraph instead. For two-column footers, use tab stops (see Tab Stops section), not tables - **TOC requires HeadingLevel only** - no custom styles on heading paragraphs - **Override built-in styles** - use exact IDs: "Heading1", "Heading2", etc. - **Include `outlineLevel`** - required for TOC (0 for H1, 1 for H2, etc.) --- ## Editing Existing Documents **Follow all 3 steps in order.** ### Step 1: Unpack ```bash python scripts/office/unpack.py document.docx unpacked/ ``` Extracts XML, pretty-prints, merges adjacent runs, and converts smart quotes to XML entities (`“` etc.) so they survive editing. Use `--merge-runs false` to skip run merging. ### Step 2: Edit XML Edit files in `unpacked/word/`. See XML Reference below for patterns. **Use "Claude" as the author** for tracked changes and comments, unless the user explicitly requests use of a different name. **Use the Edit tool directly for string replacement. Do not write Python scripts.** Scripts introduce unnecessary complexity. The Edit tool shows exactly what is being replaced. **CRITICAL: Use smart quotes for new content.** When adding text with apostrophes or quotes, use XML entities to produce smart quotes: ```xml Here’s a quote: “Hello” ``` | Entity | Character | |--------|-----------| | `‘` | ‘ (left single) | | `’` | ’ (right single / apostrophe) | | `“` | “ (left double) | | `”` | ” (right double) | **Adding comments:** Use `comment.py` to handle boilerplate across multiple XML files (text must be pre-escaped XML): ```bash python scripts/comment.py unpacked/ 0 "Comment text with & and ’" python scripts/comment.py unpacked/ 1 "Reply text" --parent 0 # reply to comment 0 python scripts/comment.py unpacked/ 0 "Text" --author "Custom Author" # custom author name ``` Then add markers to document.xml (see Comments in XML Reference). ### Step 3: Pack ```bash python scripts/office/pack.py unpacked/ output.docx --original document.docx ``` Validates with auto-repair, condenses XML, and creates DOCX. Use `--validate false` to skip. **Auto-repair will fix:** - `durableId` >= 0x7FFFFFFF (regenerates valid ID) - Missing `xml:space="preserve"` on `` with whitespace **Auto-repair won't fix:** - Malformed XML, invalid element nesting, missing relationships, schema violations ### Common Pitfalls - **Replace entire `` elements**: When adding tracked changes, replace the whole `...` block with `......` as siblings. Don't inject tracked change tags inside a run. - **Preserve `` formatting**: Copy the original run's `` block into your tracked change runs to maintain bold, font size, etc. --- ## XML Reference ### Schema Compliance - **Element order in ``**: ``, ``, ``, ``, ``, `` last - **Whitespace**: Add `xml:space="preserve"` to `` with leading/trailing spaces - **RSIDs**: Must be 8-digit hex (e.g., `00AB1234`) ### Tracked Changes **Insertion:** ```xml inserted text ``` **Deletion:** ```xml deleted text ``` **Inside ``**: Use `` instead of ``, and `` instead of ``. **Minimal edits** - only mark what changes: ```xml The term is 30 60 days. ``` **Deleting entire paragraphs/list items** - when removing ALL content from a paragraph, also mark the paragraph mark as deleted so it merges with the next paragraph. Add `` inside ``: ```xml ... Entire paragraph content being deleted... ``` Without the `` in ``, accepting changes leaves an empty paragraph/list item. **Rejecting another author's insertion** - nest deletion inside their insertion: ```xml their inserted text ``` **Restoring another author's deletion** - add insertion after (don't modify their deletion): ```xml deleted text deleted text ``` ### Comments After running `comment.py` (see Step 2), add markers to document.xml. For replies, use `--parent` flag and nest markers inside the parent's. **CRITICAL: `` and `` are siblings of ``, never inside ``.** ```xml deleted more text text ``` ### Images 1. Add image file to `word/media/` 2. Add relationship to `word/_rels/document.xml.rels`: ```xml ``` 3. Add content type to `[Content_Types].xml`: ```xml ``` 4. Reference in document.xml: ```xml ``` --- ## Dependencies - **pandoc**: Text extraction - **docx**: `npm install -g docx` (new documents) - **LibreOffice**: PDF conversion (auto-configured for sandboxed environments via `scripts/office/soffice.py`) - **Poppler**: `pdftoppm` for images