I tested with a 30-minute audio file and found the following segment in the result:
{
"startMs": 53420,
"endMs": 92090,
"text": "马友友先生您好,我们非常高兴能邀请到您。您今天早上过得怎么样?您喝咖啡了吗?还是您更喜欢喝茶?\n我什么都喝,无论是咖啡还是茶,这些我都是喝的。早上呢,我习惯性地会来杯咖啡,这可能就是一种习惯了。今天早上,我确实是起得特别早。早晨,嗯,确实是一个非常好的思考时机,因为你的思绪还没有被一天的琐事所占据,头脑会特别清醒。我不知道你喜欢怎么开始早晨,但对我来说,这确实是反思或深度思考的好时机。"
},
The text before \n was spoken by speaker A and the text after \n by speaker B, but in the result the two parts are joined by \n and the whole text is attributed to speaker A. How can I solve this?
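
For reference, a stopgap would be to split such a segment after the fact. The snippet below is only a minimal sketch, not a fix for the diarization itself. It assumes that every \n inside "text" marks a speaker change and that the time range can be apportioned by character count, neither of which the tool guarantees. The function name `split_segment_on_newline` and the `turnIndex` field are hypothetical; only `startMs`, `endMs`, and `text` come from the output above.

```python
def split_segment_on_newline(segment):
    """Split one transcript segment into one piece per newline-separated turn.

    Assumption: each "\n" in segment["text"] marks a speaker change.
    Assumption: time is apportioned in proportion to character count,
    which is only a rough estimate of the real turn boundary.
    """
    parts = [p for p in segment["text"].split("\n") if p.strip()]
    if len(parts) <= 1:
        # Nothing to split; return a copy of the segment unchanged.
        return [dict(segment)]

    start_ms = segment["startMs"]
    end_ms = segment["endMs"]
    duration = end_ms - start_ms
    total_chars = sum(len(p) for p in parts)

    pieces = []
    cursor = start_ms
    consumed = 0
    for index, part in enumerate(parts):
        consumed += len(part)
        if index == len(parts) - 1:
            # Pin the last piece to the original end time to avoid rounding drift.
            piece_end = end_ms
        else:
            piece_end = start_ms + round(duration * consumed / total_chars)
        pieces.append({
            "startMs": cursor,
            "endMs": piece_end,
            "text": part,
            "turnIndex": index,  # hypothetical field: position of the turn in the segment
        })
        cursor = piece_end
    return pieces
```

Applied to the segment above, this would yield two pieces, the first holding the question and the second the answer. Speaker labels would still have to be assigned separately, since the split alone does not know who is speaking.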