Converting PDF Documents to JSON with AI

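I recently used AI to build a feature that analyzes PDF bank statements and outputs the transaction details as JSON. In short: PDF to JSON.

Here is what I learned along the way:

Choosing a model
Getting the AI to output JSON in a controlled format
The prompt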

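1. Choosing a model

The model has to meet two requirements: it can be called through an API, and it accepts PDF files as input.

When the PDF is sent to the model directly, the model sees both the text and its coordinates and can infer the table cells, so the vector text and the border structure are preserved. Converting the PDF to images or another format, or running your own OCR first, may lose that information.

The models I tested: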

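1.1 Alibaba Qwen

Qwen lets you upload a PDF document and then pass the returned file_id to the model for analysis and conversion. Its strength is an extremely long context window; its reasoning ability is average.

Qwen-Long offers a context window of up to 10 million tokens (roughly 15 million Chinese characters) and supports uploading documents and answering questions about them.

Code example: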

# Upload the document
curl 'https://dashscope.aliyuncs.com/compatible-mode/v1/files' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-F 'file=@"/d:/www/app/public/uploads/files/pdf/2024.1.pdf"' \
-F 'purpose="file-extract"'

# Analyze the document
curl 'https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions' \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
--data '{
    "model": "qwen-long",
    "messages": [
        {"role": "system","content": "You are a helpful assistant."},
        {"role": "system","content": "fileid://file-fe-xxx"},
        {"role": "user","content": "分析中文繁体和英文内容的银行月结账单，返回 JSON 格式流水明细等账单内容。"}
    ]
}'
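
The request body above can also be assembled in code. The following is a minimal Python sketch, assuming the file has already been uploaded and its file_id is known; `build_qwen_long_payload` is a hypothetical helper name, and actually sending the request is left out.

```python
import json


def build_qwen_long_payload(file_id: str, prompt: str) -> dict:
    """Build a chat-completions body that points qwen-long at an uploaded file."""
    return {
        "model": "qwen-long",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            # The uploaded document is referenced through a second system message.
            {"role": "system", "content": f"fileid://{file_id}"},
            {"role": "user", "content": prompt},
        ],
    }


if __name__ == "__main__":
    body = build_qwen_long_payload("file-fe-xxx", "Return the statement as JSON.")
    # ensure_ascii=False keeps any Chinese prompt text readable in the output.
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Keeping the payload in one function makes it easy to swap the prompt without touching the file reference.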

1.2 DeepSeek

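As of writing (2025-06-05), the official API does not support file input, so I dropped it.

As a workaround, OpenRouter can pre-process the PDF with its own PDF engine and then hand the result to the deepseek-r1 model.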

# PDF Support
# https://openrouter.ai/docs/features/images-and-pdfs#pdf-support
# pdf base64
# $upload_pdf = '/d:/www/app/public/uploads/files/pdf/2024.1.pdf';
# $file_base64 = base64_encode(file_get_contents($upload_pdf));
# $s = 'data:application/pdf;base64,'.$file_base64;
curl 'https://openrouter.ai/api/v1/chat/completions' \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
--data '{
  "model": "deepseek/deepseek-r1-0528:free",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "分析中文繁体和英文内容的银行月结账单，返回 JSON 格式流水明细等账单内容。"
        },
        {
          "type": "file",
          "file": {
            "filename": "2024.1.pdf",
            "file_data": "data:application/pdf;base64,xxxx"
          }
        }
      ]
    }
  ],
  "plugins": [
    {
      "id": "file-parser",
      "pdf": {
        "engine": "pdf-text"
      }
    }
  ]
}'
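
The commented PHP lines in the snippet above build the base64 data URL for the PDF. A rough Python equivalent, offered only as a sketch: `pdf_to_data_url` and `build_file_part` are hypothetical helper names, and the dictionary layout simply mirrors the "file" entry of the request body shown above.

```python
import base64
from pathlib import Path


def pdf_to_data_url(pdf_path: str) -> str:
    """Read a PDF from disk and return it as a base64 data URL."""
    raw = Path(pdf_path).read_bytes()
    encoded = base64.b64encode(raw).decode("ascii")
    return "data:application/pdf;base64," + encoded


def build_file_part(pdf_path: str) -> dict:
    """Build the "file" content part used in the messages array."""
    return {
        "type": "file",
        "file": {
            "filename": Path(pdf_path).name,
            "file_data": pdf_to_data_url(pdf_path),
        },
    }
```

Base64 inflates the payload by about a third, so large statements produce noticeably larger requests than the file_id approach.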

1.3 OpenAI

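GPT-4o / GPT-4.1. GPT-4.1 extracted the data more accurately.

The detour I took at first:

The Chat Completions API with models such as GPT-4 and GPT-3.5 only accepts text input (the prompt); it cannot read or search files on its own.

I then found that the Assistants API can do the job, but the process is tedious:

The Assistants API (Beta) relies on Tools such as file_search to run more complex tasks on uploaded files, for example file search, code interpretation, and function calling (which can wrap tasks such as PDF analysis or table processing).

File analysis workflow with the Assistants API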

1.3.1 Upload the PDF file to get a file_id: /v1/files
1.3.2 Create an Assistant: /v1/assistants
1.3.3 Create a Thread (conversation): /v1/threads
1.3.4 Add a message to the Thread, binding the file and the file_search tool in attachments: /v1/threads/{thread_id}/messages
1.3.5 Create a Run to start processing, specifying the file_search tool to analyze the file: /v1/threads/{thread_id}/runs
1.3.6 Poll the Run status until it moves from in_progress to completed: /v1/threads/{thread_id}/runs/{run_id}
1.3.7 Fetch the final result: /v1/threads/{thread_id}/messages
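
The polling step (1.3.6) is where most of the waiting happens. Below is a minimal sketch of such a loop in Python. It is deliberately decoupled from any HTTP client: `get_status` is a hypothetical callable that you would implement to fetch the Run and return its `status` string, and the interval and timeout defaults are arbitrary assumptions.

```python
import time
from typing import Callable

# Run statuses after which waiting longer will not help.
TERMINAL_FAILURES = {"failed", "cancelled", "expired", "incomplete"}


def wait_for_run(
    get_status: Callable[[], str],
    interval: float = 1.0,
    timeout: float = 120.0,
) -> str:
    """Poll get_status() until it reports "completed", or raise on failure/timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "completed":
            return status
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"run ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout} seconds")
        time.sleep(interval)
```

Treating failure statuses separately from the timeout avoids polling a Run that can never complete.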

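Note: the Assistants API is expected to be deprecated in the first half of 2026.

It can actually be done in a single step:

Since 2025-03-18 this has been supported natively: a chat.completions message can reference a file_id, and vision-capable models parse the text and layout automatically, with no need to split pages or run OCR yourself.

Note: as long as the model belongs to one of the three families GPT-4o, o-series, or GPT-4.1, it can read the PDF attached via file_id and extract the transactions; other models return an error along the lines of “Invalid chat format …”.

Code example: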

# Upload file
# https://platform.openai.com/docs/api-reference/files/create
curl https://api.openai.com/v1/files \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F 'purpose="assistants"' \
  -F 'file=@"/d:/www/app/public/uploads/files/pdf/2024.1.pdf"'

# Create chat completion
# https://platform.openai.com/docs/api-reference/chat/create
curl 'https://api.openai.com/v1/chat/completions' \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H 'Content-Type: application/json' \
--data '{
  "model": "gpt-4o",
  "response_format": { "type": "json_object" },
  "max_tokens": 4096,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "分析中文繁体和英文内容的银行月结账单，返回 JSON 格式流水明细等账单内容。"
        },
        {
          "type": "file",
          "file": {
            "file_id": "file-4ABeEXRAJ53CDpGrLG272Q"
          }
        }
      ]
    }
  ]
}'
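
Once the response comes back, the JSON still has to be pulled out of the message text. A minimal sketch, assuming the response has already been decoded into a Python dict: `extract_statement` is a hypothetical helper name, and the check on `finish_reason` is my own addition, motivated by the `max_tokens` limit in the request above, since a long statement could be cut off mid-JSON.

```python
import json


def extract_statement(response: dict) -> dict:
    """Return the parsed JSON object from a chat-completions response dict."""
    choice = response["choices"][0]
    # "length" means generation stopped at max_tokens, so the JSON is probably cut off.
    if choice.get("finish_reason") == "length":
        raise ValueError("output truncated by max_tokens; JSON is likely incomplete")
    # With response_format json_object the content should be a single JSON document.
    return json.loads(choice["message"]["content"])
```

If truncation shows up in practice, raising `max_tokens` or processing the statement in smaller pieces are the obvious options to try.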

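2. Getting the AI to output JSON in a controlled format

Because the whole pipeline is automated, the output has to be JSON with a predictable structure; otherwise the code that reads it would not know which key to look up or how to iterate over the data.

This is mainly enforced through the prompt.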
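
Since a prompt is a convention rather than a guarantee, it may be worth checking the shape of the result before using it. The sketch below is one possible check in Python, not part of the original pipeline: `validate_statement` is a hypothetical helper, and the key names (`file_summary`, `trade_records`, `category`, `transactions`, and the five transaction fields) are taken from the prompt in section 3.

```python
from typing import Any, List

REQUIRED_FIELDS = ("date", "transaction_details", "deposit", "withdrawal", "balance")


def validate_statement(data: Any) -> List[str]:
    """Return a list of problems; an empty list means the structure looks right."""
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]
    errors: List[str] = []
    if not isinstance(data.get("file_summary"), dict):
        errors.append("file_summary must be an object")
    records = data.get("trade_records")
    if not isinstance(records, list):
        errors.append("trade_records must be an array")
        return errors
    for i, group in enumerate(records):
        if not isinstance(group, dict):
            errors.append(f"trade_records[{i}] must be an object")
            continue
        if not isinstance(group.get("category"), str):
            errors.append(f"trade_records[{i}].category must be a string")
        rows = group.get("transactions")
        if not isinstance(rows, list):
            errors.append(f"trade_records[{i}].transactions must be an array")
            continue
        for j, row in enumerate(rows):
            # A row that is not an object is reported as missing every field.
            missing = [f for f in REQUIRED_FIELDS if not isinstance(row, dict) or f not in row]
            if missing:
                errors.append(f"trade_records[{i}].transactions[{j}] is missing {missing}")
    return errors
```

Returning a list of messages instead of raising on the first problem makes it easier to log everything that went wrong with a given statement.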

3. The prompt

The Chinese version of the prompt, verbatim:

## 任务目标
分析中文繁体和英文内容的银行月结账单，返回 JSON 格式流水明细等账单内容。

## 提取规则

1. 每一笔流水包含以下五列：Date 日期、Transaction_Details 進支詳情、Deposit 存入、Withdrawal 支出、Balance 結餘。

2. 所有数据均保留原始内容（中文、英文、数字均原样保留），缺失也保留空值，禁止将中文内容翻译或转换或删除。

3. 保持原始账单分类，还原每个分类的每一行列的表格数据。

4.【重要规则】每一笔交易有且仅有一行中文内容，禁止把两行中文内容数据合并成同一笔记录。请严格按照这一规则进行提取。

5. 所有收支明细存`trade_records`键名下，其值为多个账单分类数组。其中账单分类类别用`category`键名，交易流水用`transactions`键名，值为交易数组。

6. 文档其它非页脚、页头数据存`file_summary`键名下，值为交易数组。数组下分别为键名为“信息名称的英文字符“，值为“信息内容”的键值对，当值为多个数据则用数组表示。

7. 最大能力地保留原始文档所有数据和层级关系在 JSON 上，特别地“公司名稱”键名用`company_name`，“Account Number 戶口號碼”键名用`account_number`，“Statement Date 賬單日期”键名用`statement_date`，“Bank 分行”键名用`bank_branch`，“Bank Code 銀行編號”键名用`bank_code`，“HKD Deposit 港元存款”键名用`hkd_deposit`，“Foreign Currency 外幣存款”键名用`foreign_currency`。

8. JSON 键使用英文单词小写字母，有空格或者特殊字符使用下划线“_”代替。

9. 输出示例

```json
{"file_summary": {"company_name": "xxxx", "account_number": "xxxx", "statement_date": "xxxx", "other_data": [{"key1":"value1"}]}, "trade_records": [{"category":"hkd_statement_savings","transactions": []},{"category":"current","transactions": []}]}
```

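In practice, gpt-4.1 works better with an English prompt.

With the Chinese prompt, the output simply dropped the Chinese descriptions from the transaction details; after switching to English, the model recognized and preserved the Chinese content correctly.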

## Task Objective

Analyze bank monthly statements with Traditional Chinese and English content, and return all transaction details and other statement content in **JSON** format.

## Extraction Rules

1. Each transaction must contain the following five fields: `date`, `transaction_details`, `deposit`, `withdrawal`, `balance`.

2. All extracted data must **preserve the original content** (Traditional Chinese, English, and numbers as is). If a value is missing, leave it as an empty value. **Do NOT translate, convert, or remove any Chinese content.**

3. Keep the original statement categories. Restore the tabular data of each category, including every row and column.

4. **Important Rule:** Each transaction must correspond to exactly one line of Chinese content. Never merge two lines of Chinese into a single record. Strictly follow this rule for extraction.

5. All transaction records must be stored under the `trade_records` key, which is an array of statement categories. Each category should use the `category` key. The transactions in that category are under the `transactions` key as an array of records.

6. Other data (excluding page header and footer) must be stored under the `file_summary` key, as a key-value array. The key is the English name of the information, and the value is the content. If there are multiple values, use an array.

7. **Preserve all original data and hierarchy** from the document as much as possible in the JSON. Use the following key names for important fields:

  - `company_name` for 公司名稱
  - `account_number` for Account Number 戶口號碼
  - `statement_date` for Statement Date 賬單日期
  - `bank_branch` for Bank 分行
  - `bank_code` for Bank Code 銀行編號
  - `hkd_deposit` for HKD Deposit 港元存款
  - `foreign_currency` for Foreign Currency 外幣存款

8. JSON keys must use lowercase English letters, with underscores (`_`) for spaces or special characters.

9. Output Example:

```json
{"file_summary": {"company_name": "xxxx", "account_number": "xxxx", "statement_date": "xxxx", "other_data": [{"key1":"value1"}]}, "trade_records": [{"category":"hkd_statement_savings","transactions": []},{"category":"current","transactions": []}]}
```
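
The failure described earlier, where the Chinese descriptions disappear from the transaction details, can be caught automatically. Rule 4 says every transaction carries exactly one line of Chinese, so a row whose `transaction_details` contains no Chinese character is suspicious. The following Python sketch illustrates the idea; `rows_missing_chinese` is a hypothetical helper, and the character range is an assumption that covers the common CJK Unified Ideographs block only.

```python
import re
from typing import List, Optional, Tuple

# Basic CJK Unified Ideographs; covers common Traditional and Simplified characters.
CJK = re.compile(r"[\u4e00-\u9fff]")


def rows_missing_chinese(data: dict) -> List[Tuple[Optional[str], int]]:
    """Return (category, row index) for transactions whose details lack Chinese text."""
    missing = []
    for group in data.get("trade_records", []):
        for idx, row in enumerate(group.get("transactions", [])):
            details = str(row.get("transaction_details", ""))
            if not CJK.search(details):
                missing.append((group.get("category"), idx))
    return missing
```

A non-empty result does not prove the model was wrong, since some statements may contain English-only rows, but it is a cheap signal for when a prompt change starts losing content.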