
Support the JSONL format exported by qq-chat-exporter #9

@ElluIFX

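Problem description

When qq-chat-exporter exports a very large group chat history that it cannot process in one pass because of memory limits, JSONL is the only format it can use, but ChatLab does not natively support that format.

The export is laid out as follows: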

= OutputFolder
|- manifest.json
|= chunks
	|- chunk_0001.jsonl
	|- chunk_0002.jsonl
	|- ...

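The group chat metadata and the description of the chunk sequence are stored in manifest.json; each chunk_xxxx.jsonl holds only a list of chat messages.

An example manifest: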

{
  "metadata": {
    "exportTime": "2025-12-27T03:33:25.460Z",
    "version": "5.0.0",
    "format": "chunked-jsonl"
  },
  "chatInfo": {
    "name": "xxxxxxxxxxxx",
    "type": "group",
    "selfUid": "u_w-fLQ9bMQHLkZpHNzh_p6A",
    "selfUin": "123456",
    "selfName": "Ellu"
  },
  "statistics": {
    "totalMessages": 1768832,
    "chunkCount": 36
  },
  "chunked": {
    "format": "jsonl",
    "chunksDir": "chunks",
    "chunkFileExt": ".jsonl",
    "maxMessagesPerChunk": 50000,
    "maxBytesPerChunk": 52428800,
    "chunks": [
      {
        "file": "chunks/chunk_0001.jsonl",
        "messages": 50000,
        "bytes": 34676792,
        "startTime": 1765504601000,
        "endTime": 1766804055000
      },
      {
        "file": "chunks/chunk_0002.jsonl",
        "messages": 50000,
        "bytes": 31139309,
        "startTime": 1764508356000,
        "endTime": 1765546217000
      },
      ...
    ]
  }
}
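To illustrate how an importer might use this manifest, here is a minimal Python sketch that detects a chunked export and resolves its chunk files. The function name `list_chunk_files` is hypothetical, and two things are assumptions rather than documented behavior: that each `file` entry is relative to the folder containing manifest.json, and that sorting by `startTime` is wanted (in the sample above, chunk_0001 covers a later time range than chunk_0002).

```python
import json
from pathlib import Path


def list_chunk_files(export_dir):
    """Return the chunk paths listed in manifest.json, oldest chunk first.

    Returns an empty list when the manifest does not describe a
    chunked-jsonl export.
    """
    export_dir = Path(export_dir)
    manifest = json.loads(
        (export_dir / "manifest.json").read_text(encoding="utf-8")
    )

    # Only handle the format shown in the sample manifest.
    if manifest.get("metadata", {}).get("format") != "chunked-jsonl":
        return []

    chunks = manifest.get("chunked", {}).get("chunks", [])
    # Assumption: chronological order is desired, so sort by startTime.
    ordered = sorted(chunks, key=lambda c: c.get("startTime", 0))
    # Assumption: "file" is relative to the folder holding manifest.json.
    return [export_dir / c["file"] for c in ordered]
```

If the exporter's own order matters more than chronology, the `sorted` call can simply be dropped.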

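An example chunk.jsonl follows. It is not a single standard JSON document: each line is one complete JSON record, and records are separated by newlines.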

{"id":"7587998612871331832","seq":"7107700","timestamp":1766718601000,"time":"2025-12-26T03:10:01.000Z","sender":{"uid":"u_xxxxxx_xxxxxxxxx","uin":"xxxxxxxxxx","name":"用户A","nickname":"用户A的昵称","groupCard":"用户A的群名片"},"type":"text","content":{"text":"[表情5][表情5][表情5]","html":"[表情5][表情5][表情5]","elements":[{"type":"face","data":{"id":"5","name":"/流泪"}},{"type":"face","data":{"id":"5","name":"/流泪"}},{"type":"face","data":{"id":"5","name":"/流泪"}}],"resources":[],"mentions":[]},"recalled":false,"system":false}
{"id":"7587998630100730690","seq":"7107701","timestamp":1766718605000,"time":"2025-12-26T03:10:05.000Z","sender":{"uid":"u_xxxxxx_xxxxxxxxx","uin":"xxxxxxxxxx","name":"用户B","nickname":"用户B的昵称","groupCard":"用户B的群名片","remark":"用户B的备注"},"type":"text","content":{"text":"有啊","html":"有啊","elements":[{"type":"text","data":{"text":"有啊"}}],"resources":[],"mentions":[]},"recalled":false,"system":false}
{"id":"7587998635703326372","seq":"7107702","timestamp":1766718605000,"time":"2025-12-26T03:10:05.000Z","sender":{"uid":"u_xxxxxx_xxxxxxxxx","uin":"xxxxxxxxxx","name":"用户C","groupCard":"用户C的群名片"},"type":"text","content":{"text":"maa有烧水","html":"maa有烧水","elements":[{"type":"text","data":{"text":"maa有烧水"}}],"resources":[],"mentions":[]},"recalled":false,"system":false}
...
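Because every line is a standalone JSON value, a chunk can be read one record at a time without loading the whole file. A minimal Python sketch, assuming UTF-8 encoding and that blank lines can be skipped; the helper name `iter_jsonl` is hypothetical:

```python
import json


def iter_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines such as a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report which line failed so a broken export is easy to locate.
                raise ValueError(f"{path}:{line_no}: invalid JSON record") from exc
```

Since this is a generator, memory use stays roughly proportional to a single record rather than to the chunk size.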

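For now I have written a Python script that manually merges the JSONL chunks into one very large JSON file, and I am sharing it as a reference for anyone with the same problem. That said, I think this could be built into ChatLab: after reading the manifest, if it finds chunks, it could iterate over them one by one and add them to the database.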

merge_jsonl_chunks.py
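The attached script is not reproduced here. As a rough illustration of the built-in approach proposed above (read the manifest, then walk the chunks and insert them into a database), here is a minimal Python sketch using the standard `sqlite3` module. The table schema, the function name `import_chunked_export`, and the choice of SQLite are all assumptions made for illustration, not ChatLab's actual storage layer. `INSERT OR IGNORE` is used because the sample manifest shows adjacent chunks with overlapping time ranges, so duplicate ids are treated as possible.

```python
import json
import sqlite3
from pathlib import Path

# Hypothetical schema; a real importer would map onto its own tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id TEXT PRIMARY KEY,
    ts INTEGER,
    sender_uid TEXT,
    sender_name TEXT,
    type TEXT,
    text TEXT
)
"""
INSERT = "INSERT OR IGNORE INTO messages VALUES (?, ?, ?, ?, ?, ?)"


def import_chunked_export(export_dir, db_path, batch_size=1000):
    """Stream every chunk listed in manifest.json into SQLite.

    Returns the number of rows in the messages table afterwards.
    """
    export_dir = Path(export_dir)
    manifest = json.loads(
        (export_dir / "manifest.json").read_text(encoding="utf-8")
    )
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(SCHEMA)
        for chunk in manifest["chunked"]["chunks"]:
            batch = []
            # Assumption: "file" is relative to the folder with manifest.json.
            with open(export_dir / chunk["file"], encoding="utf-8") as f:
                for line in f:
                    if not line.strip():
                        continue
                    msg = json.loads(line)
                    sender = msg.get("sender", {})
                    batch.append((
                        msg["id"],
                        msg.get("timestamp"),
                        sender.get("uid"),
                        sender.get("name"),
                        msg.get("type"),
                        msg.get("content", {}).get("text"),
                    ))
                    if len(batch) >= batch_size:
                        conn.executemany(INSERT, batch)
                        batch.clear()
            if batch:  # flush the remainder of this chunk
                conn.executemany(INSERT, batch)
            conn.commit()  # one transaction per chunk
        return conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    finally:
        conn.close()
```

Processing one line and one batch at a time keeps memory use bounded, which avoids building the single oversized JSON file that the merge workaround requires.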

Anyway, thanks to the author for building such a fun group chat analysis tool.
