[WIP] Add batch conversion files for PDF to Markdown #37

Draft
Copilot wants to merge 1 commit into main from copilot/add-batch-convert-files

Copilot AI commented Nov 14, 2025

Thanks for asking me to work on this. I will get started on it and keep this PR's description up to date as I form a plan and make progress.

*This pull request was created as a result of the following prompt from Copilot chat.*
> Add a set of files to make it easy for users to batch-convert PDF files to Markdown with thepipe, provide a reproducible Docker environment, a Makefile, a runbook, a YAML report template, and a manual GitHub Actions workflow to run the conversion in CI. Files to add (create new files in the repository):
> 
> 1) scripts/batch_convert.sh
> 
> ```
> #!/usr/bin/env bash
> set -euo pipefail
> 
> # Simple batch conversion script (Bash)
> # Usage:
> #   ./scripts/batch_convert.sh <input_dir> <output_dir>
> # Notes:
> # - Assumes the thepipe CLI provides: thepipe pdf2md <input.pdf> -o <output.md>
> # - If the CLI differs, edit the convert_cmd_template variable below
> # - Logs are written to logs/convert-YYYYMMDD-HHMMSS.log
> # - A short report is written to reports/report-YYYYMMDD-HHMMSS.yaml (one timestamped file per run)
> 
> if [ "$#" -ne 2 ]; then
>   echo "Usage: $0 <input_dir> <output_dir>"
>   exit 1
> fi
> 
> INPUT_DIR="$1"
> OUTPUT_DIR="$2"
> TIMESTAMP=$(date +"%Y%m%d-%H%M%S")
> LOG_DIR="logs"
> REPORT_DIR="reports"
> mkdir -p "${OUTPUT_DIR}" "${LOG_DIR}" "${REPORT_DIR}"
> 
> LOG_FILE="${LOG_DIR}/convert-${TIMESTAMP}.log"
> REPORT_FILE="${REPORT_DIR}/report-${TIMESTAMP}.yaml"
> 
> # Modify this command template if your thepipe CLI differs.
> # Use %IN% and %OUT% placeholders or replace directly with an actual command.
> convert_cmd_template='thepipe pdf2md "%IN%" -o "%OUT%"'
> 
> echo "Start conversion: $(date)" | tee "${LOG_FILE}"
> echo "input_dir: ${INPUT_DIR}" | tee -a "${LOG_FILE}"
> echo "output_dir: ${OUTPUT_DIR}" | tee -a "${LOG_FILE}"
> echo "report_file: ${REPORT_FILE}" | tee -a "${LOG_FILE}"
> 
> echo "files:" > "${REPORT_FILE}"
> 
> shopt -s nullglob
> count=0
> failed=0
> for pdf in "${INPUT_DIR}"/*.pdf; do
>   base="$(basename "$pdf" .pdf)"
>   out="${OUTPUT_DIR}/${base}.md"
>   echo "Converting: ${pdf} -> ${out}" | tee -a "${LOG_FILE}"
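>   # Substitute the placeholders; the template already supplies the surrounding quotes.
>   cmd=${convert_cmd_template//'%IN%'/"${pdf}"}
>   cmd=${cmd//'%OUT%'/"${out}"}
>   # Run the conversion. Paths containing double quotes, $ or backticks will still break eval.
>   if eval "${cmd}" >> "${LOG_FILE}" 2>&1; then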
>     echo "OK: ${out}" | tee -a "${LOG_FILE}"
>     echo "  - file: \"${pdf}\"" >> "${REPORT_FILE}"
>     echo "    output: \"${out}\"" >> "${REPORT_FILE}"
>     echo "    status: ok" >> "${REPORT_FILE}"
>   else
>     echo "FAILED: ${pdf}" | tee -a "${LOG_FILE}"
>     echo "  - file: \"${pdf}\"" >> "${REPORT_FILE}"
>     echo "    output: \"${out}\"" >> "${REPORT_FILE}"
>     echo "    status: failed" >> "${REPORT_FILE}"
>     failed=$((failed+1))
>   fi
>   count=$((count+1))
> done
> 
> if [ "$count" -eq 0 ]; then
>   echo "No PDF files found in ${INPUT_DIR}" | tee -a "${LOG_FILE}"
> else
>   echo "Finished converting ${count} PDF(s). ${failed} failed." | tee -a "${LOG_FILE}"
> fi
> 
> echo "End: $(date)" | tee -a "${LOG_FILE}"
> ```
> 
> 2) scripts/convert_pdfs.py
> 
> ```
> #!/usr/bin/env python3
> """
> Python bulk conversion wrapper.
> 
> Usage:
>   python3 scripts/convert_pdfs.py --input-dir pdfs --output-dir converted-md
> 
> Features:
>  - Iterates over every .pdf in input-dir (sorted alphabetically)
>  - Calls the thepipe CLI for each file (default: thepipe pdf2md <in> -o <out>)
>  - Records each file's status in reports/report-<timestamp>.yaml
>  - Exits with a non-zero code on errors (so schedulers/CI can detect failures)
>  - The command can be customized with --cmd-template (placeholders %IN% and %OUT%)
> """
> import argparse
> import subprocess
> import sys
> from pathlib import Path
> from datetime import datetime
> 
> 
> def run_conversion(pdf: Path, out_md: Path, cmd_template: str):
>     cmd = cmd_template.replace("%IN%", str(pdf)).replace("%OUT%", str(out_md))
>     try:
>         subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>         return True, ""
>     except subprocess.CalledProcessError as e:
>         return False, e.stderr.decode(errors="replace")
> 
> 
> def write_report(report_path: Path, entries):
>     report_path.parent.mkdir(parents=True, exist_ok=True)
>     try:
>         import yaml
>         with report_path.open("w", encoding="utf-8") as fh:
>             yaml.safe_dump({"generated_at": datetime.utcnow().isoformat()+"Z", "entries": entries}, fh, sort_keys=False, allow_unicode=True)
>     except Exception:
>         # fallback plain text
>         with report_path.open("w", encoding="utf-8") as fh:
>             fh.write(f"generated_at: {datetime.utcnow().isoformat()}Z\n")
>             for e in entries:
>                 fh.write(str(e) + "\n")
> 
> 
> def main():
>     p = argparse.ArgumentParser()
>     p.add_argument("--input-dir", "-i", required=True)
>     p.add_argument("--output-dir", "-o", required=True)
>     # argparse %-formats help strings, so a literal % must be written as %%.
>     p.add_argument("--cmd-template", default='thepipe pdf2md "%IN%" -o "%OUT%"',
>                    help='Command template to run; use %%IN%% and %%OUT%% placeholders.')
>     args = p.parse_args()
> 
>     input_dir = Path(args.input_dir)
>     output_dir = Path(args.output_dir)
>     if not input_dir.exists() or not input_dir.is_dir():
>         print(f"Input directory {input_dir} not found", file=sys.stderr)
>         sys.exit(2)
>     output_dir.mkdir(parents=True, exist_ok=True)
> 
>     pdfs = sorted(input_dir.glob("*.pdf"))
>     if not pdfs:
>         print(f"No PDF files found in {input_dir}")
>         return
> 
>     ts = datetime.utcnow().strftime("%Y%m%d-%H%M%S")
>     report_path = Path("reports") / f"report-{ts}.yaml"
>     entries = []
>     any_failed = False
> 
>     for pdf in pdfs:
>         out_md = output_dir / (pdf.stem + ".md")
>         print(f"Converting {pdf} -> {out_md}")
>         ok, err = run_conversion(pdf, out_md, args.cmd_template)
>         entry = {"file": str(pdf), "output": str(out_md), "status": "ok" if ok else "failed"}
>         if not ok:
>             entry["error"] = err[:1000]
>             any_failed = True
>             print(f"Failed: {pdf}. See report {report_path} for details.", file=sys.stderr)
>         entries.append(entry)
> 
>     write_report(report_path, entries)
>     print(f"Wrote report: {report_path}")
>     if any_failed:
>         sys.exit(3)
> 
> 
> if __name__ == "__main__":
>     main()
> ```
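
The wrapper above substitutes paths into a shell string and runs it with `shell=True`, so a filename containing quotes or `$` can change the command. One possible alternative, sketched here under the assumption that the template is a plain command without pipes or redirects, is to split the template into arguments first and substitute afterwards. `build_argv` is a hypothetical helper, not part of the PR.

```python
import shlex
from pathlib import Path


def build_argv(cmd_template: str, pdf: Path, out_md: Path) -> list[str]:
    """Split the template into arguments, then fill in the placeholders.

    Splitting before substitution keeps a path with spaces or quotes
    inside a single argument, so no shell is needed to run the result.
    """
    return [
        part.replace("%IN%", str(pdf)).replace("%OUT%", str(out_md))
        for part in shlex.split(cmd_template)
    ]


# Same default template as the scripts in this PR (the CLI itself is an assumption).
argv = build_argv(
    'thepipe pdf2md "%IN%" -o "%OUT%"',
    Path("pdfs/my file.pdf"),
    Path("converted-md/my file.md"),
)
print(argv)
```

The resulting list could be passed to `subprocess.run(argv, check=True)` without `shell=True`.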
> 
> 3) Dockerfile
> 
> ```
> FROM python:3.11-slim
> 
> # system deps: poppler-utils (pdftotext), tesseract (if OCR is needed)
> RUN apt-get update && apt-get install -y --no-install-recommends \
>     poppler-utils \
>     tesseract-ocr \
>     && rm -rf /var/lib/apt/lists/*
> 
> WORKDIR /app
> COPY . /app
> 
> # If thepipe is this repo's Python package, install it here:
> # RUN pip install -e .
> 
> # Install PyYAML (used by the scripts)
> RUN pip install --no-cache-dir pyyaml
> 
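> # No ENTRYPOINT: with ENTRYPOINT ["sh", "-c"], a command passed to `docker run` would be split and only its first word executed.
> CMD ["python3", "/app/scripts/convert_pdfs.py", "-i", "pdfs", "-o", "converted-md"]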
> ```
> 
> 4) Makefile
> 
> ```
> .PHONY: install convert docker-build docker-run
> 
> # Local install (optional)
> install:
> 	python3 -m venv .venv
> 	. .venv/bin/activate && pip install --upgrade pip
> 	. .venv/bin/activate && pip install -r requirements.txt || true
> 	@echo "Virtual environment created at .venv; activate it before running the scripts"
> 
> convert:
> 	python3 scripts/convert_pdfs.py --input-dir pdfs --output-dir converted-md
> 
> docker-build:
> 	docker build -t pdf2md-env .
> 
> docker-run:
> 	docker run --rm -v $(PWD)/pdfs:/data/pdfs -v $(PWD)/converted-md:/data/out pdf2md-env \
> 		python3 /app/scripts/convert_pdfs.py -i /data/pdfs -o /data/out
> ```
> 
> 5) docs/RUNBOOK.md
> 
> ````markdown
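> # PDF -> Markdown Batch Conversion Runbook
> 
> Purpose
> - Provide a reproducible, zero-to-one workflow so that anyone can batch-convert PDFs to Markdown locally or in a container, and record the run as a machine-readable report plus an auditable log.
> 
> Preparation (local)
> 1. Put the PDFs to convert in the `pdfs/` directory (or any directory; pass it when running the script).
> 2. Make sure the thepipe CLI is installed (or change the command template in the scripts to the conversion tool you actually use).
>    - Example (assuming thepipe is a Python package): run `pip install -e .` in the repository root.
> 3. Optional: create a Python virtual environment (see the `install` target in the Makefile).
> 
> Run (simple)
> - Bash script:
>   chmod +x scripts/batch_convert.sh
>   ./scripts/batch_convert.sh pdfs converted-md
> 
> - Python script:
>   python3 scripts/convert_pdfs.py --input-dir pdfs --output-dir converted-md
> 
> Run (container)
> - Build the image:
>   docker build -t pdf2md-env .
> - Run:
>   docker run --rm -v $(pwd)/pdfs:/data/pdfs -v $(pwd)/converted-md:/data/out pdf2md-env \
>     python3 /app/scripts/convert_pdfs.py -i /data/pdfs -o /data/out
> 
> Logs and reports
> - The Bash script writes a detailed log file under `logs/`.
> - The Python script writes a structured report to `reports/report-<timestamp>.yaml` with each file's status/output/error (on failure).
> - For a sample report, see templates/conversion_report.yaml.
> 
> Sharing and reuse
> - Zip the whole directory and send it to others (including scripts, Dockerfile, docs).
> - Or push the Docker image to a private registry so that others only need to run the image.
> - Consider including a small sample PDF (or a pointer to a publicly available one) for quick verification.
> 
> Recording changes (optional)
> - You can manage versions locally with git (pushing to GitHub is not required).
> - For each significant change (script edits, thepipe version upgrades), update CHANGELOG.md or add metadata to the reports.
> 
> Troubleshooting
> - If a script reports "thepipe: command not found":
>   - Check that thepipe is on PATH, activate the virtual environment, or install it in the Dockerfile.
> - If the output Markdown is empty or the text is out of order:
>   - Check thepipe's options (whether an OCR step is needed); try debugging a single file with pdftotext first.
> - When converting a large number of files:
>   - Process them in batches, or add parallelism to the scripts (GNU parallel or multiprocessing).
> 
> If you need it
> - I can turn the script into a concurrent (multi-process) version;
> - or, given thepipe's actual CLI, replace the command template in the scripts and test again;
> - or package these files as a zip for you to distribute directly.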
> ````
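
The runbook suggests adding parallelism for large batches. As a rough sketch, assuming the CLI is `thepipe pdf2md <in> -o <out>` as elsewhere in this PR, a thread pool may be enough: each conversion runs in its own subprocess, so threads only wait on it. This swaps threads in for the multiprocessing the runbook mentions. `convert_one`, `convert_all`, `workers`, and `fake_convert` are hypothetical names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def convert_one(pdf: Path, out_dir: Path) -> dict:
    """Run the assumed CLI for one file and return a report entry."""
    out_md = out_dir / (pdf.stem + ".md")
    entry = {"file": str(pdf), "output": str(out_md), "status": "ok"}
    # Assumption: the CLI is `thepipe pdf2md <in> -o <out>`.
    argv = ["thepipe", "pdf2md", str(pdf), "-o", str(out_md)]
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except FileNotFoundError as exc:  # CLI not installed or not on PATH
        entry.update(status="failed", error=str(exc))
        return entry
    if proc.returncode != 0:
        entry.update(status="failed", error=proc.stderr[:1000])
    return entry


def convert_all(pdfs, out_dir: Path, workers: int = 4, convert=convert_one) -> list:
    """Convert files concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pdf: convert(pdf, out_dir), pdfs))


def fake_convert(pdf: Path, out_dir: Path) -> dict:
    """Stand-in converter so the sketch runs without thepipe installed."""
    return {"file": str(pdf), "output": str(out_dir / (pdf.stem + ".md")), "status": "ok"}


entries = convert_all(
    [Path("pdfs/a.pdf"), Path("pdfs/b.pdf")], Path("converted-md"), convert=fake_convert
)
print([entry["file"] for entry in entries])
```

Because `pool.map` preserves input order, the resulting entries can be written to the report in the same alphabetical order the sequential script uses.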
> 
> 6) templates/conversion_report.yaml
> 
> ```
> # conversion report template (YAML)
> # generated_at: 2025-11-14T12:00:00Z
> # entries:
> #   - file: /path/to/pdfs/example.pdf
> #     output: converted-md/example.md
> #     status: ok
> #     error: ""
> #   - file: /path/to/pdfs/bad.pdf
> #     output: converted-md/bad.md
> #     status: failed
> #     error: "Error message or excerpt"
> ```
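
To illustrate how the `entries` structure in this template might be consumed, here is a small sketch that tallies statuses and lists failed files. It works on an already-parsed list of dictionaries (for example the output of a YAML loader); `summarize` is a hypothetical helper, and the sample data mirrors the placeholders in the template.

```python
from collections import Counter


def summarize(entries):
    """Count entries by status and collect the paths of failed files."""
    counts = Counter(entry["status"] for entry in entries)
    return {
        "total": len(entries),
        "ok": counts["ok"],
        "failed": counts["failed"],
        "failed_files": [e["file"] for e in entries if e["status"] == "failed"],
    }


# Sample entries copied from the template's placeholder values.
entries = [
    {"file": "/path/to/pdfs/example.pdf", "output": "converted-md/example.md",
     "status": "ok", "error": ""},
    {"file": "/path/to/pdfs/bad.pdf", "output": "converted-md/bad.md",
     "status": "failed", "error": "Error message or excerpt"},
]
summary = summarize(entries)
print(summary)
```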
> 
> 7) .github/workflows/convert-pdfs.yml
> 
> ```
> name: Convert PDFs to Markdown (manual)
> 
> on:
>   workflow_dispatch:
>     inputs:
>       pdf_dir:
>         description: 'Relative path to PDF directory (in repo)'
>         required: true
>         default: 'pdfs'
>       out_dir:
>         description: 'Output directory for markdown'
>         required: true
>         default: 'converted-md'
>       commit_results:
>         description: 'Commit converted files back to the repo?'
>         required: true
>         default: 'true'
> 
> jobs:
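>   convert:
>     runs-on: ubuntu-latest
>     # The commit step pushes to the repo, which needs a write-scoped GITHUB_TOKEN.
>     permissions:
>       contents: write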
>     steps:
>       - name: Checkout repository
>         uses: actions/checkout@v4
>         with:
>           fetch-depth: 0
> 
>       - name: Set up Python (if needed)
>         uses: actions/setup-python@v4
>         with:
>           python-version: '3.11'
> 
>       - name: Install repository (if thepipe is python package) and dependencies
>         run: |
>           # If thepipe is this repository's Python package, install it here
>           if [ -f "pyproject.toml" ] || [ -f "setup.py" ]; then
>             pip install -e .
>           fi
>           # Install any extra tools the scripts may need (adjust as required)
>           pip install -r requirements.txt || true
> 
>       - name: Run conversion script
>         env:
>           INPUT_DIR: ${{ github.event.inputs.pdf_dir }}
>           OUTPUT_DIR: ${{ github.event.inputs.out_dir }}
>         run: |
>           mkdir -p "${OUTPUT_DIR}"
>           chmod +x scripts/batch_convert.sh || true
>           ./scripts/batch_convert.sh "${INPUT_DIR}" "${OUTPUT_DIR}"
> 
>       - name: Commit converted files
>         if: ${{ github.event.inputs.commit_results == 'true' }}
>         env:
>           # Pass the input through the environment rather than interpolating it into the script.
>           OUTPUT_DIR: ${{ github.event.inputs.out_dir }}
>         run: |
>           git config user.name "github-actions[bot]"
>           git config user.email "github-actions[bot]@users.noreply.github.com"
>           git add "${OUTPUT_DIR}" || true
>           if git diff --staged --quiet; then
>             echo "No changes to commit"
>           else
>             git commit -m "chore: auto-convert PDFs to markdown (from workflow)"
>             git push
>           fi
> ```
> 
> Notes and assumptions:
> - I did not set a base branch; please let me know if you want the PR target to be a different branch than the repository default.
> - The scripts assume the CLI command is `thepipe pdf2md <in> -o <out>`; if the real CLI differs please provide the correct invocation and I will update the scripts.
> - The workflow is manual (workflow_dispatch) and by default attempts to commit results if the input commit_results is 'true'.
> 
> Please create a pull request with these additions in the repository emcf/thepipe. Thank you.
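
Since the scripts rest on the assumption that a `thepipe` executable is available, a preflight check before starting a batch may save a run of identical failures. The sketch below uses the standard-library `shutil.which`; `find_cli` is a hypothetical helper, and the executable name is the same unverified assumption stated in the notes above.

```python
import shutil
from typing import Optional


def find_cli(name: str = "thepipe") -> Optional[str]:
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


cli_path = find_cli()
print(cli_path or "thepipe not found on PATH; install it or pass a different command template")
```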

Copilot AI self-assigned this Nov 14, 2025
Copilot stopped work on behalf of 12poem due to an error November 14, 2025 12:07