Audio discontinuity: skipping between paragraphs #71

@backcountrymountains

This is a great project. I'm having some issues with discontinuities in the audio output, and I don't know whether they come from the model itself, from how this project uses the model, or from my setup.

I'm running this on an 8 GB RTX 3070 in WSL using Orpheus-3b-FT-Q8_0.gguf with CUDA 12.8 (I changed the Dockerfile to use cu128).

root@56a71da09802:/app# nvidia-smi
Tue Jul 15 17:56:19 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 572.70         CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070        On  |   00000000:01:00.0  On |                  N/A |
| 53%   46C    P0             40W /  240W |    7120MiB /   8192MiB |      3%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A               1      C   /llama-server                         N/A      |
|    0   N/A  N/A               1      C   /python3.10                           N/A      |
|    0   N/A  N/A              30      C   /python3.10                           N/A      |
+-----------------------------------------------------------------------------------------+
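From inside the container, PyTorch also sees the card. A quick sanity check (I'm assuming the image's Python environment ships with torch, since the startup log below reports the device name and compute capability; expected outputs are based on my hardware):

```python
# Quick check from inside the orpheus-fastapi container that CUDA is usable.
import torch

print(torch.cuda.is_available())            # expect: True
print(torch.cuda.get_device_name(0))        # expect: NVIDIA GeForce RTX 3070
print(torch.cuda.get_device_capability(0))  # expect: (8, 6)
```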

I was just trying to convert some news to audio. The text below can be found via Google, so I don't think I'm violating any copyright by sharing this example.

Artillery, rocket launchers and self-propelled howitzers opened fire at a training area in northern Australia on Monday, kick-starting three weeks of military drills here between the U.S. and 18 allies.

The biennial exercise, called Talisman Sabre, is meant to send a message to China: The U.S. and its partners are ready to respond together to aggression from Beijing, which has been increasingly asserting itself in what it regards as its sphere of influence in the Asia-Pacific region.

“Our resolve to train, and will to prepare and will to fight, that’s not made up,” said Lt. Gen. J.B. Vowell, the deputy commanding general of U.S. Army Pacific, after he watched munitions hit a simulated enemy on a hillside in the training zone.

Audio: jumpshare.com link

The audio gets confused at “Talisman Sabre”: it jumps back to the first paragraph and is then garbled until “to train”.

I guess I'm just curious whether this is an issue with the model, with how this project uses the model, or with something in my setup. I otherwise like the fluency and natural sound of the output voices.
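As a stopgap, I'm experimenting with splitting the text on blank lines, calling /speak once per paragraph, and stitching the clips together client-side. A minimal sketch of that idea — the JSON payload and the response field here are my assumptions, not the documented API; the follow-up GET mirrors what the log below shows:

```python
import requests
import wave

API = "http://localhost:5005"  # orpheus-fastapi port from the startup log
VOICE = "mia"

def synthesize(paragraph: str, out_path: str) -> str:
    """POST one paragraph to /speak and download the resulting WAV."""
    # Payload shape and the "output_file" field are assumptions -- adjust to the real API.
    r = requests.post(f"{API}/speak", json={"text": paragraph, "voice": VOICE})
    r.raise_for_status()
    wav_name = r.json()["output_file"]  # hypothetical response field
    audio = requests.get(f"{API}/{wav_name}")  # log shows GET /<file>.wav after POST /speak
    audio.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(audio.content)
    return out_path

def concat_wavs(paths, out_path):
    """Concatenate WAV clips that share the same format and sample rate."""
    with wave.open(out_path, "wb") as out:
        for i, p in enumerate(paths):
            with wave.open(p, "rb") as clip:
                if i == 0:
                    out.setparams(clip.getparams())
                out.writeframes(clip.readframes(clip.getnframes()))

text = open("article.txt").read()
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
clips = [synthesize(p, f"clip_{i}.wav") for i, p in enumerate(paragraphs)]
concat_wavs(clips, "article.wav")
```

That avoids the paragraph boundary inside a single generation entirely, at the cost of losing whatever prosodic continuity the model carries across paragraphs.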

Thanks for making this, it's really impressive!

Log of speech generation:

orpheus-fastapi     | Starting speech generation for 'Artillery, rocket launchers and self-propelled how...'
orpheus-fastapi     | Using voice: mia, GPU acceleration: Yes (High-end)
orpheus-fastapi     | Generating speech for: <|audio|>mia: Artillery, rocket launchers and self-propelled howitzers opened fire at a training area in northern Australia on Monday, kick-starting three weeks of military drills here between the U.S. and 18 allies.
orpheus-fastapi     |
orpheus-fastapi     | The biennial exercise, called Talisman Sabre, is meant to send a message to China: The U.S. and its partners are ready to respond together to aggression from Beijing, which has been increasingly asserting itself in what it regards as its sphere of influence in the Asia-Pacific region.
orpheus-fastapi     |
orpheus-fastapi     | “Our resolve to train, and will to prepare and will to fight, that’s not made up,” said Lt. Gen. J.B. Vowell, the deputy commanding general of U.S. Army Pacific, after he watched munitions hit a simulated enemy on a hillside in the training zone.<|eot_id|>
orpheus-fastapi     | Using optimized parameters for high-end GPU
llama-cpp-server-1  | slot launch_slot_: id  0 | task 35253 | processing task
llama-cpp-server-1  | slot update_slots: id  0 | task 35253 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 167
llama-cpp-server-1  | slot update_slots: id  0 | task 35253 | kv cache rm [4, end)
llama-cpp-server-1  | slot update_slots: id  0 | task 35253 | prompt processing progress, n_past = 167, n_tokens = 163, progress = 0.976048
llama-cpp-server-1  | slot update_slots: id  0 | task 35253 | prompt done, n_past = 167, n_tokens = 163
orpheus-fastapi     | Processing first audio chunk with 7 tokens
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 28
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 56
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 84
orpheus-fastapi     | Progress: 127.1 tokens/sec, est. 1.1s audio generated, 255 tokens, 13 chunks in 2.0s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 112
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 140
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 168
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 196
orpheus-fastapi     | Audio generation rate: 8.04 chunks/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 224
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 252
orpheus-fastapi     | Progress: 137.8 tokens/sec, est. 3.1s audio generated, 553 tokens, 36 chunks in 4.0s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 280
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 308
orpheus-fastapi     | Token processing rate: 66.1 tokens/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 336
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 364
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 392
orpheus-fastapi     | Progress: 141.5 tokens/sec, est. 4.8s audio generated, 851 tokens, 57 chunks in 6.0s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 420
orpheus-fastapi     | Audio generation rate: 10.66 chunks/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 448
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 476
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 504
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 532
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 560
orpheus-fastapi     | Progress: 143.6 tokens/sec, est. 6.6s audio generated, 1152 tokens, 78 chunks in 8.0s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 588
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 616
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 644
orpheus-fastapi     | Audio generation rate: 10.99 chunks/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 672
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 700
orpheus-fastapi     | Progress: 145.1 tokens/sec, est. 8.5s audio generated, 1455 tokens, 100 chunks in 10.0s
orpheus-fastapi     | Token processing rate: 70.8 tokens/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 728
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 756
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 784
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 812
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 840
orpheus-fastapi     | Progress: 145.4 tokens/sec, est. 10.3s audio generated, 1751 tokens, 121 chunks in 12.0s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 868
orpheus-fastapi     | Audio generation rate: 10.32 chunks/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 896
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 924
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 952
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 980
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1008
orpheus-fastapi     | Progress: 145.6 tokens/sec, est. 12.1s audio generated, 2048 tokens, 142 chunks in 14.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1036
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1064
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1092
orpheus-fastapi     | Audio generation rate: 10.57 chunks/second
orpheus-fastapi     | Token processing rate: 72.2 tokens/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1120
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1148
orpheus-fastapi     | Progress: 146.1 tokens/sec, est. 13.9s audio generated, 2349 tokens, 164 chunks in 16.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1176
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1204
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1232
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1260
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1288
orpheus-fastapi     | Progress: 146.2 tokens/sec, est. 15.7s audio generated, 2643 tokens, 185 chunks in 18.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1316
orpheus-fastapi     | Audio generation rate: 10.52 chunks/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1344
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1372
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1400
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1428
orpheus-fastapi     | Progress: 146.1 tokens/sec, est. 17.4s audio generated, 2935 tokens, 205 chunks in 20.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1456
orpheus-fastapi     | Token processing rate: 72.2 tokens/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1484
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1512
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1540
orpheus-fastapi     | Audio generation rate: 10.42 chunks/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1568
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1596
orpheus-fastapi     | Progress: 146.2 tokens/sec, est. 19.2s audio generated, 3229 tokens, 226 chunks in 22.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1624
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1652
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1680
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1708
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1736
orpheus-fastapi     | Progress: 146.0 tokens/sec, est. 20.9s audio generated, 3519 tokens, 246 chunks in 24.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1764
orpheus-fastapi     | Audio generation rate: 10.36 chunks/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1792
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1820
orpheus-fastapi     | Token processing rate: 72.2 tokens/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1848
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1876
orpheus-fastapi     | Progress: 145.7 tokens/sec, est. 22.7s audio generated, 3803 tokens, 267 chunks in 26.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1904
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1932
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1960
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 1988
orpheus-fastapi     | Audio generation rate: 10.18 chunks/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2016
orpheus-fastapi     | Progress: 145.5 tokens/sec, est. 24.4s audio generated, 4091 tokens, 287 chunks in 28.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2044
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2072
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2100
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2128
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2156
orpheus-fastapi     | Progress: 145.3 tokens/sec, est. 26.2s audio generated, 4377 tokens, 308 chunks in 30.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2184
orpheus-fastapi     | Token processing rate: 72.1 tokens/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2212
orpheus-fastapi     | Audio generation rate: 10.16 chunks/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2240
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2268
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2296
orpheus-fastapi     | Progress: 145.0 tokens/sec, est. 28.0s audio generated, 4657 tokens, 329 chunks in 32.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2324
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2352
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2380
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2408
orpheus-fastapi     | Audio generation rate: 10.29 chunks/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2436
orpheus-fastapi     | Progress: 144.7 tokens/sec, est. 29.7s audio generated, 4937 tokens, 349 chunks in 34.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2464
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2492
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2520
orpheus-fastapi     | Token processing rate: 71.8 tokens/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2548
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2576
orpheus-fastapi     | Progress: 144.3 tokens/sec, est. 31.2s audio generated, 5215 tokens, 367 chunks in 36.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2604
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2632
orpheus-fastapi     | Audio generation rate: 9.42 chunks/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2660
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2688
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2716
orpheus-fastapi     | Progress: 143.9 tokens/sec, est. 33.0s audio generated, 5489 tokens, 388 chunks in 38.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2744
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2772
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2800
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2828
orpheus-fastapi     | Audio generation rate: 9.68 chunks/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2856
orpheus-fastapi     | Progress: 143.5 tokens/sec, est. 34.5s audio generated, 5759 tokens, 406 chunks in 40.1s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2884
orpheus-fastapi     | Token processing rate: 71.2 tokens/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2912
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2940
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2968
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 2996
orpheus-fastapi     | Progress: 142.9 tokens/sec, est. 36.3s audio generated, 6025 tokens, 427 chunks in 42.2s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 3024
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 3052
orpheus-fastapi     | Audio generation rate: 9.67 chunks/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 3080
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 3108
orpheus-fastapi     | Progress: 142.6 tokens/sec, est. 37.8s audio generated, 6299 tokens, 445 chunks in 44.2s
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 3136
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 3164
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 3192
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 3220
orpheus-fastapi     | Token processing rate: 70.8 tokens/second
orpheus-fastapi     | Processing buffer with 28 tokens, total collected: 3248
orpheus-fastapi     | Progress: 142.0 tokens/sec, est. 39.4s audio generated, 6556 tokens, 463 chunks in 46.2s
orpheus-fastapi     | Audio generation rate: 9.03 chunks/second
llama-cpp-server-1  | slot      release: id  0 | task 35253 | stop processing: n_past = 3478, truncated = 0
llama-cpp-server-1  | slot print_timing: id  0 | task 35253 |
llama-cpp-server-1  | prompt eval time =     223.01 ms /   163 tokens (    1.37 ms per token,   730.90 tokens per second)
llama-cpp-server-1  |        eval time =   46293.78 ms /  3312 tokens (   13.98 ms per token,    71.54 tokens per second)
llama-cpp-server-1  |       total time =   46516.79 ms /  3475 tokens
llama-cpp-server-1  | srv  update_slots: all slots are idle
orpheus-fastapi     | Token generation complete: 6581 tokens in 46.52s (141.5 tokens/sec)
orpheus-fastapi     | Producer completed - setting done event
orpheus-fastapi     | Received end-of-stream marker
orpheus-fastapi     | Waiting for token processor thread to complete...
orpheus-fastapi     | Final buffer flush: 12288 bytes
orpheus-fastapi     | Audio saved to outputs/mia_20250715_172336.wav
orpheus-fastapi     | Generated 463 audio segments
orpheus-fastapi     | Generated 39.51 seconds of audio in 46.52 seconds
orpheus-fastapi     | Realtime factor: 0.85x
orpheus-fastapi     | ⚠️ Warning: Generation is slower than realtime
orpheus-fastapi     | Total speech generation completed in 46.52 seconds
orpheus-fastapi     | INFO:     172.29.0.1:59936 - "POST /speak HTTP/1.1" 200 OK
orpheus-fastapi     | INFO:     172.29.0.1:59936 - "GET /mia_20250715_172336.wav HTTP/1.1" 200 OK
orpheus-fastapi     | INFO:     172.29.0.1:37472 - "GET /mia_20250715_172336.wav HTTP/1.1" 304 Not Modified
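If I'm reading the numbers right, the throughput side of the log is internally consistent, which makes me think the glitch isn't a performance problem. A back-of-envelope check (assuming, as I understand the Orpheus pipeline, 7 tokens per SNAC frame and 2048 output samples per frame at 24 kHz):

```python
# Back-of-envelope check of the generation log above.
# Assumptions: one audio segment = one SNAC frame = 7 tokens,
# and each frame decodes to 2048 samples at 24 kHz.
frames = 463                               # "Generated 463 audio segments"
audio_tokens = frames * 7                  # 3241 -- close to llama.cpp's 3312 eval tokens
audio_seconds = frames * 2048 / 24_000     # 39.51 -- matches "39.51 seconds of audio"
eval_tps = 71.54                           # llama.cpp: "71.54 tokens per second"
frames_per_sec = eval_tps / 7              # ~10.2 -- matches the "~10 chunks/second" lines
realtime = frames_per_sec * 2048 / 24_000  # ~0.87x -- close to the logged 0.85x factor
print(audio_tokens, round(audio_seconds, 2), round(realtime, 2))
```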

Log of the server starting:

llama-cpp-server-1  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
llama-cpp-server-1  | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
llama-cpp-server-1  | ggml_cuda_init: found 1 CUDA devices:
llama-cpp-server-1  |   Device 0: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes
llama-cpp-server-1  | load_backend: loaded CUDA backend from /app/libggml-cuda.so
llama-cpp-server-1  | load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
llama-cpp-server-1  | warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
llama-cpp-server-1  | build: 5897 (bdca3837) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
llama-cpp-server-1  | system info: n_threads = 10, n_threads_batch = 10, total_threads = 20
llama-cpp-server-1  |
llama-cpp-server-1  | system_info: n_threads = 10 (n_threads_batch = 10) / 20 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
llama-cpp-server-1  |
llama-cpp-server-1  | main: binding port with default address family
llama-cpp-server-1  | main: HTTP server is listening, hostname: 0.0.0.0, port: 5006, http threads: 19
llama-cpp-server-1  | main: loading model
llama-cpp-server-1  | srv    load_model: loading model '/models/Orpheus-3b-FT-Q8_0.gguf'
llama-cpp-server-1  | llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3070) - 7100 MiB free
llama-cpp-server-1  | llama_model_loader: loaded meta data with 41 key-value pairs and 256 tensors from /models/Orpheus-3b-FT-Q8_0.gguf (version GGUF V3 (latest))
llama-cpp-server-1  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama-cpp-server-1  | llama_model_loader: - kv   0:                       general.architecture str              = llama
llama-cpp-server-1  | llama_model_loader: - kv   1:                               general.type str              = model
llama-cpp-server-1  | llama_model_loader: - kv   2:                               general.name str              = Orpheus Tts 0.1 Pretrained
llama-cpp-server-1  | llama_model_loader: - kv   3:                            general.version str              = 0.1
llama-cpp-server-1  | llama_model_loader: - kv   4:                       general.organization str              = Canopylabs
llama-cpp-server-1  | llama_model_loader: - kv   5:                           general.finetune str              = ft
llama-cpp-server-1  | llama_model_loader: - kv   6:                           general.basename str              = orpheus
llama-cpp-server-1  | llama_model_loader: - kv   7:                         general.size_label str              = 3B
llama-cpp-server-1  | llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama-cpp-server-1  | llama_model_loader: - kv   9:                   general.base_model.count u32              = 2
llama-cpp-server-1  | llama_model_loader: - kv  10:                  general.base_model.0.name str              = Llama 3.2 3B Instruct
llama-cpp-server-1  | llama_model_loader: - kv  11:          general.base_model.0.organization str              = Meta Llama
llama-cpp-server-1  | llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/meta-llama/Lla...
llama-cpp-server-1  | llama_model_loader: - kv  13:                  general.base_model.1.name str              = Orpheus 3b 0.1 Pretrained
llama-cpp-server-1  | llama_model_loader: - kv  14:               general.base_model.1.version str              = 0.1
llama-cpp-server-1  | llama_model_loader: - kv  15:          general.base_model.1.organization str              = Canopylabs
llama-cpp-server-1  | llama_model_loader: - kv  16:              general.base_model.1.repo_url str              = https://huggingface.co/canopylabs/orp...
llama-cpp-server-1  | llama_model_loader: - kv  17:                               general.tags arr[str,1]       = ["text-to-speech"]
llama-cpp-server-1  | llama_model_loader: - kv  18:                          general.languages arr[str,1]       = ["en"]
llama-cpp-server-1  | llama_model_loader: - kv  19:                          llama.block_count u32              = 28
llama-cpp-server-1  | llama_model_loader: - kv  20:                       llama.context_length u32              = 131072
llama-cpp-server-1  | llama_model_loader: - kv  21:                     llama.embedding_length u32              = 3072
llama-cpp-server-1  | llama_model_loader: - kv  22:                  llama.feed_forward_length u32              = 8192
llama-cpp-server-1  | llama_model_loader: - kv  23:                 llama.attention.head_count u32              = 24
llama-cpp-server-1  | llama_model_loader: - kv  24:              llama.attention.head_count_kv u32              = 8
llama-cpp-server-1  | llama_model_loader: - kv  25:                       llama.rope.freq_base f32              = 500000.000000
llama-cpp-server-1  | llama_model_loader: - kv  26:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama-cpp-server-1  | llama_model_loader: - kv  27:                 llama.attention.key_length u32              = 128
llama-cpp-server-1  | llama_model_loader: - kv  28:               llama.attention.value_length u32              = 128
llama-cpp-server-1  | llama_model_loader: - kv  29:                           llama.vocab_size u32              = 156940
llama-cpp-server-1  | llama_model_loader: - kv  30:                 llama.rope.dimension_count u32              = 128
llama-cpp-server-1  | llama_model_loader: - kv  31:                       tokenizer.ggml.model str              = gpt2
llama-cpp-server-1  | llama_model_loader: - kv  32:                         tokenizer.ggml.pre str              = llama-bpe
llama-cpp-server-1  | llama_model_loader: - kv  33:                      tokenizer.ggml.tokens arr[str,156940]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama-cpp-server-1  | llama_model_loader: - kv  34:                  tokenizer.ggml.token_type arr[i32,156940]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama-cpp-server-1  | llama_model_loader: - kv  35:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama-cpp-server-1  | llama_model_loader: - kv  36:                tokenizer.ggml.bos_token_id u32              = 128000
llama-cpp-server-1  | llama_model_loader: - kv  37:                tokenizer.ggml.eos_token_id u32              = 128009
llama-cpp-server-1  | llama_model_loader: - kv  38:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama-cpp-server-1  | llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama-cpp-server-1  | llama_model_loader: - kv  40:                          general.file_type u32              = 7
llama-cpp-server-1  | llama_model_loader: - type  f32:   58 tensors
llama-cpp-server-1  | llama_model_loader: - type q8_0:  198 tensors
llama-cpp-server-1  | print_info: file format = GGUF V3 (latest)
llama-cpp-server-1  | print_info: file type   = Q8_0
llama-cpp-server-1  | print_info: file size   = 3.74 GiB (8.50 BPW)
llama-cpp-server-1  | load: special tokens cache size = 28940
llama-cpp-server-1  | load: token to piece cache size = 1.3364 MB
llama-cpp-server-1  | print_info: arch             = llama
llama-cpp-server-1  | print_info: vocab_only       = 0
llama-cpp-server-1  | print_info: n_ctx_train      = 131072
llama-cpp-server-1  | print_info: n_embd           = 3072
llama-cpp-server-1  | print_info: n_layer          = 28
llama-cpp-server-1  | print_info: n_head           = 24
llama-cpp-server-1  | print_info: n_head_kv        = 8
llama-cpp-server-1  | print_info: n_rot            = 128
llama-cpp-server-1  | print_info: n_swa            = 0
llama-cpp-server-1  | print_info: is_swa_any       = 0
llama-cpp-server-1  | print_info: n_embd_head_k    = 128
llama-cpp-server-1  | print_info: n_embd_head_v    = 128
llama-cpp-server-1  | print_info: n_gqa            = 3
llama-cpp-server-1  | print_info: n_embd_k_gqa     = 1024
llama-cpp-server-1  | print_info: n_embd_v_gqa     = 1024
llama-cpp-server-1  | print_info: f_norm_eps       = 0.0e+00
llama-cpp-server-1  | print_info: f_norm_rms_eps   = 1.0e-05
llama-cpp-server-1  | print_info: f_clamp_kqv      = 0.0e+00
llama-cpp-server-1  | print_info: f_max_alibi_bias = 0.0e+00
llama-cpp-server-1  | print_info: f_logit_scale    = 0.0e+00
llama-cpp-server-1  | print_info: f_attn_scale     = 0.0e+00
llama-cpp-server-1  | print_info: n_ff             = 8192
llama-cpp-server-1  | print_info: n_expert         = 0
llama-cpp-server-1  | print_info: n_expert_used    = 0
llama-cpp-server-1  | print_info: causal attn      = 1
llama-cpp-server-1  | print_info: pooling type     = 0
llama-cpp-server-1  | print_info: rope type        = 0
llama-cpp-server-1  | print_info: rope scaling     = linear
llama-cpp-server-1  | print_info: freq_base_train  = 500000.0
llama-cpp-server-1  | print_info: freq_scale_train = 1
llama-cpp-server-1  | print_info: n_ctx_orig_yarn  = 131072
llama-cpp-server-1  | print_info: rope_finetuned   = unknown
llama-cpp-server-1  | print_info: model type       = 3B
llama-cpp-server-1  | print_info: model params     = 3.78 B
llama-cpp-server-1  | print_info: general.name     = Orpheus Tts 0.1 Pretrained
llama-cpp-server-1  | print_info: vocab type       = BPE
llama-cpp-server-1  | print_info: n_vocab          = 156940
llama-cpp-server-1  | print_info: n_merges         = 280147
llama-cpp-server-1  | print_info: BOS token        = 128000 '<|begin_of_text|>'
llama-cpp-server-1  | print_info: EOS token        = 128009 '<|eot_id|>'
llama-cpp-server-1  | print_info: EOT token        = 128009 '<|eot_id|>'
llama-cpp-server-1  | print_info: EOM token        = 128008 '<|eom_id|>'
llama-cpp-server-1  | print_info: LF token         = 198 'Ċ'
llama-cpp-server-1  | print_info: EOG token        = 128001 '<|end_of_text|>'
llama-cpp-server-1  | print_info: EOG token        = 128008 '<|eom_id|>'
llama-cpp-server-1  | print_info: EOG token        = 128009 '<|eot_id|>'
llama-cpp-server-1  | print_info: max token length = 256
llama-cpp-server-1  | load_tensors: loading model tensors, this can take a while... (mmap = true)
orpheus-fastapi     | ✅ Created default .env file from .env.example and environment variables.
orpheus-fastapi     | 🖥️ Hardware: High-end CUDA GPU detected
orpheus-fastapi     | 📊 Device: NVIDIA GeForce RTX 3070
orpheus-fastapi     | 📊 VRAM: 8.00 GB
orpheus-fastapi     | 📊 Compute Capability: 8.6
orpheus-fastapi     | 🚀 Using high-performance optimizations
orpheus-fastapi     | Configuration loaded:
orpheus-fastapi     |   API_URL: http://llama-cpp-server:5006/v1/completions
orpheus-fastapi     |   MAX_TOKENS: 8192
orpheus-fastapi     |   TEMPERATURE: 0.6
orpheus-fastapi     |   TOP_P: 0.9
orpheus-fastapi     |   REPETITION_PENALTY: 1.1
llama-cpp-server-1  | load_tensors: offloading 28 repeating layers to GPU
llama-cpp-server-1  | load_tensors: offloading output layer to GPU
llama-cpp-server-1  | load_tensors: offloaded 29/29 layers to GPU
llama-cpp-server-1  | load_tensors:        CUDA0 model buffer size =  3345.19 MiB
llama-cpp-server-1  | load_tensors:   CPU_Mapped model buffer size =   488.52 MiB
llama-cpp-server-1  | .............................................................................
llama-cpp-server-1  | llama_context: constructing llama_context
llama-cpp-server-1  | llama_context: n_seq_max     = 1
llama-cpp-server-1  | llama_context: n_ctx         = 8192
llama-cpp-server-1  | llama_context: n_ctx_per_seq = 8192
llama-cpp-server-1  | llama_context: n_batch       = 2048
llama-cpp-server-1  | llama_context: n_ubatch      = 512
llama-cpp-server-1  | llama_context: causal_attn   = 1
llama-cpp-server-1  | llama_context: flash_attn    = 0
llama-cpp-server-1  | llama_context: freq_base     = 500000.0
llama-cpp-server-1  | llama_context: freq_scale    = 1
llama-cpp-server-1  | llama_context: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama-cpp-server-1  | llama_context:  CUDA_Host  output buffer size =     0.60 MiB
llama-cpp-server-1  | llama_kv_cache_unified:      CUDA0 KV buffer size =   896.00 MiB
llama-cpp-server-1  | llama_kv_cache_unified: size =  896.00 MiB (  8192 cells,  28 layers,  1 seqs), K (f16):  448.00 MiB, V (f16):  448.00 MiB
llama-cpp-server-1  | llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama-cpp-server-1  | llama_context:      CUDA0 compute buffer size =   424.00 MiB
llama-cpp-server-1  | llama_context:  CUDA_Host compute buffer size =    22.01 MiB
llama-cpp-server-1  | llama_context: graph nodes  = 1014
llama-cpp-server-1  | llama_context: graph splits = 2
llama-cpp-server-1  | common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
llama-cpp-server-1  | common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
llama-cpp-server-1  | srv          init: initializing slots, n_slots = 1
llama-cpp-server-1  | slot         init: id  0 | task -1 | new slot n_ctx_slot = 8192
llama-cpp-server-1  | main: model loaded
llama-cpp-server-1  | main: chat template, chat_template: {{- bos_token }}
llama-cpp-server-1  | {%- if custom_tools is defined %}
llama-cpp-server-1  |     {%- set tools = custom_tools %}
llama-cpp-server-1  | {%- endif %}
llama-cpp-server-1  | {%- if not tools_in_user_message is defined %}
llama-cpp-server-1  |     {%- set tools_in_user_message = true %}
llama-cpp-server-1  | {%- endif %}
llama-cpp-server-1  | {%- if not date_string is defined %}
llama-cpp-server-1  |     {%- if strftime_now is defined %}
llama-cpp-server-1  |         {%- set date_string = strftime_now("%d %b %Y") %}
llama-cpp-server-1  |     {%- else %}
llama-cpp-server-1  |         {%- set date_string = "26 Jul 2024" %}
llama-cpp-server-1  |     {%- endif %}
llama-cpp-server-1  | {%- endif %}
llama-cpp-server-1  | {%- if not tools is defined %}
llama-cpp-server-1  |     {%- set tools = none %}
llama-cpp-server-1  | {%- endif %}
llama-cpp-server-1  |
llama-cpp-server-1  | {#- This block extracts the system message, so we can slot it into the right place. #}
llama-cpp-server-1  | {%- if messages[0]['role'] == 'system' %}
llama-cpp-server-1  |     {%- set system_message = messages[0]['content']|trim %}
llama-cpp-server-1  |     {%- set messages = messages[1:] %}
llama-cpp-server-1  | {%- else %}
llama-cpp-server-1  |     {%- set system_message = "" %}
llama-cpp-server-1  | {%- endif %}
llama-cpp-server-1  |
llama-cpp-server-1  | {#- System message #}
llama-cpp-server-1  | {{- "<|start_header_id|>system<|end_header_id|>\n\n" }}
llama-cpp-server-1  | {%- if tools is not none %}
llama-cpp-server-1  |     {{- "Environment: ipython\n" }}
llama-cpp-server-1  | {%- endif %}
llama-cpp-server-1  | {{- "Cutting Knowledge Date: December 2023\n" }}
llama-cpp-server-1  | {{- "Today Date: " + date_string + "\n\n" }}
llama-cpp-server-1  | {%- if tools is not none and not tools_in_user_message %}
llama-cpp-server-1  |     {{- "You have access to the following functions. To call a function, please respond with JSON for a function call." }}
llama-cpp-server-1  |     {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
llama-cpp-server-1  |     {{- "Do not use variables.\n\n" }}
llama-cpp-server-1  |     {%- for t in tools %}
llama-cpp-server-1  |         {{- t | tojson(indent=4) }}
llama-cpp-server-1  |         {{- "\n\n" }}
llama-cpp-server-1  |     {%- endfor %}
llama-cpp-server-1  | {%- endif %}
llama-cpp-server-1  | {{- system_message }}
llama-cpp-server-1  | {{- "<|eot_id|>" }}
llama-cpp-server-1  |
llama-cpp-server-1  | {#- Custom tools are passed in a user message with some extra guidance #}
llama-cpp-server-1  | {%- if tools_in_user_message and not tools is none %}
llama-cpp-server-1  |     {#- Extract the first user message so we can plug it in here #}
llama-cpp-server-1  |     {%- if messages | length != 0 %}
llama-cpp-server-1  |         {%- set first_user_message = messages[0]['content']|trim %}
llama-cpp-server-1  |         {%- set messages = messages[1:] %}
llama-cpp-server-1  |     {%- else %}
llama-cpp-server-1  |         {{- raise_exception("Cannot put tools in the first user message when there's no first user message!") }}
llama-cpp-server-1  | {%- endif %}
llama-cpp-server-1  |     {{- '<|start_header_id|>user<|end_header_id|>\n\n' -}}
llama-cpp-server-1  |     {{- "Given the following functions, please respond with a JSON for a function call " }}
llama-cpp-server-1  |     {{- "with its proper arguments that best answers the given prompt.\n\n" }}
llama-cpp-server-1  |     {{- 'Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}.' }}
llama-cpp-server-1  |     {{- "Do not use variables.\n\n" }}
llama-cpp-server-1  |     {%- for t in tools %}
llama-cpp-server-1  |         {{- t | tojson(indent=4) }}
llama-cpp-server-1  |         {{- "\n\n" }}
llama-cpp-server-1  |     {%- endfor %}
llama-cpp-server-1  |     {{- first_user_message + "<|eot_id|>"}}
llama-cpp-server-1  | {%- endif %}
llama-cpp-server-1  |
llama-cpp-server-1  | {%- for message in messages %}
llama-cpp-server-1  |     {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}
llama-cpp-server-1  |         {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}
llama-cpp-server-1  |     {%- elif 'tool_calls' in message %}
llama-cpp-server-1  |         {%- if not message.tool_calls|length == 1 %}
llama-cpp-server-1  |             {{- raise_exception("This model only supports single tool-calls at once!") }}
llama-cpp-server-1  |         {%- endif %}
llama-cpp-server-1  |         {%- set tool_call = message.tool_calls[0].function %}
llama-cpp-server-1  |         {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' -}}
llama-cpp-server-1  |         {{- '{"name": "' + tool_call.name + '", ' }}
llama-cpp-server-1  |         {{- '"parameters": ' }}
llama-cpp-server-1  |         {{- tool_call.arguments | tojson }}
llama-cpp-server-1  |         {{- "}" }}
llama-cpp-server-1  |         {{- "<|eot_id|>" }}
llama-cpp-server-1  |     {%- elif message.role == "tool" or message.role == "ipython" %}
llama-cpp-server-1  |         {{- "<|start_header_id|>ipython<|end_header_id|>\n\n" }}
llama-cpp-server-1  |         {%- if message.content is mapping or message.content is iterable %}
llama-cpp-server-1  |             {{- message.content | tojson }}
llama-cpp-server-1  |         {%- else %}
llama-cpp-server-1  |             {{- message.content }}
llama-cpp-server-1  |         {%- endif %}
llama-cpp-server-1  |         {{- "<|eot_id|>" }}
llama-cpp-server-1  |     {%- endif %}
llama-cpp-server-1  | {%- endfor %}
llama-cpp-server-1  | {%- if add_generation_prompt %}
llama-cpp-server-1  |     {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
llama-cpp-server-1  | {%- endif %}
llama-cpp-server-1  | , example_format: '<|start_header_id|>system<|end_header_id|>
llama-cpp-server-1  |
llama-cpp-server-1  | You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
llama-cpp-server-1  |
llama-cpp-server-1  | Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>
llama-cpp-server-1  |
llama-cpp-server-1  | Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>
llama-cpp-server-1  |
llama-cpp-server-1  | How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
llama-cpp-server-1  |
llama-cpp-server-1  | '
llama-cpp-server-1  | main: server is listening on http://0.0.0.0:5006 - starting the main loop
llama-cpp-server-1  | srv  update_slots: all slots are idle
orpheus-fastapi     | INFO:     Started server process [1]
orpheus-fastapi     | INFO:     Waiting for application startup.
orpheus-fastapi     | INFO:     Application startup complete.
orpheus-fastapi     | INFO:     Uvicorn running on http://0.0.0.0:5005 (Press CTRL+C to quit)


Labels: question
