Codebase for the work "Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training", accepted at ASRU 2025
┌─────────────────┐
│ Raw Dataset(s) │
└────────┬────────┘
│
▼
┌─────────────────────────────────────┐
│ Standardize to Common JSON Format │
└────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ VAD Processing (Silero VAD) │
│ • Trim turn-level beg/end silence │
└────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Add Turn Annotations │
│ • Add <user_end>/<system_end> │
│ • Handle timing conflicts │
└────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Processed Training Data │
│ • Special tokens inserted │
│ • Ready for model training │
└─────────────────────────────────────┘
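For reference, a minimal sketch of the VAD trimming step in the pipeline above, using the public Silero VAD API (the file paths and sampling rate are placeholders, not values from this repo):

```python
import torch

# Load Silero VAD and its helper utilities from torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

wav = read_audio("/path/to/turn.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000)

if speech:
    # Keep everything between the first detected speech onset and the last offset,
    # i.e. trim only the leading and trailing silence of the turn.
    start, end = speech[0]["start"], speech[-1]["end"]  # sample indices
    save_audio("turn_trimmed.wav", wav[start:end], sampling_rate=16000)
```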
Each step produces intermediate outputs that can be cached:
| Step | Output Path | Override Flag |
|---|---|---|
| Standardization | preprocessed_{mode}.json | override_preprocessed_data |
| VAD Processing | vad_processed_{mode}/ | override_vad_data |
| Turn Annotations | processed_{mode}/ | override_processed_data |
| Length Filtering | filtered_{mode}/ | override_filtered_data |
The length-filtered files are used for training.
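Conceptually, each stage reuses its cached output unless the corresponding override flag is set. A minimal sketch of that pattern (the `run_stage` helper, the `standardize_raw_dataset` callable, and the `args` object are illustrative, not the repo's exact code):

```python
import json
import os

def run_stage(output_path, compute_fn, override=False):
    """Reuse a cached stage output unless it is missing or its override flag is set."""
    if os.path.exists(output_path) and not override:
        with open(output_path) as f:
            return json.load(f)
    result = compute_fn()
    with open(output_path, "w") as f:
        json.dump(result, f)
    return result

# e.g. re-run standardization only when override_preprocessed_data is requested:
# data = run_stage("preprocessed_train.json", standardize_raw_dataset,
#                  override=args.override_preprocessed_data)
```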
Example processed segment:
{
  "audio_filepath": "/path/to/audio.wav",
  "segments": [
    {"turn": "user", "start_time": 0.0, "end_time": 0.5, "text": "Hi"},
    {"turn": "user-end", "start_time": 0.5, "end_time": 2.5, "text": ""},
    {"turn": "system", "start_time": 2.5, "end_time": 3.2, "text": "Hello"},
    {"turn": "system-end", "start_time": 3.2, "end_time": 4.3, "text": ""},
    {"turn": "user", "start_time": 4.3, "end_time": 8.5, "text": "Hi, I need some help with .."},
    {"turn": "user-end", "start_time": 8.5, "end_time": 9.1, "text": ""},
    {"turn": "system", "start_time": 9.1, "end_time": 12.0, "text": "Sure, ..."}
  ]
}

| Dataset | Language(s) | Domain | Duration (hrs) |
|---|---|---|---|
| SpokenWoz | en | task-oriented dialogue | 30 |
| HumDial | en, zh | conversational | 50 |
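Going back to the processed segment format above, a minimal sketch of how turn-level segments could be expanded into frame-level training targets at the 12.5 Hz Mimi frame rate (the label ids and helper are illustrative assumptions, not the repo's exact code):

```python
FRAME_RATE = 12.5  # Mimi codec frames per second (80 ms per frame)
LABELS = {"user": 0, "user-end": 1, "system": 2, "system-end": 3}

def segments_to_frame_labels(segments):
    """Expand turn-level segments into one class label per codec frame."""
    n_frames = int(round(segments[-1]["end_time"] * FRAME_RATE))
    labels = [LABELS["user-end"]] * n_frames  # placeholder for any uncovered frames
    for seg in segments:
        start = int(round(seg["start_time"] * FRAME_RATE))
        end = min(int(round(seg["end_time"] * FRAME_RATE)), n_frames)
        labels[start:end] = [LABELS[seg["turn"]]] * (end - start)
    return labels
```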
Todo
- Fisher
- Switchboard
This endpointer is designed for conversational speech between two speakers (user and system). End-of-turn detection is optimised for user speech only. It supports the following configurations:

Used when both speakers' audio is present in a single audio stream. We are not interested in performing diarisation here, so we provide the system speech timings to the model, so that it can learn to predict turn-ends for the user.
config: configs/mimi/lstm_mimi_12.5hz_delay2f.yaml
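The delay2f suffix presumably corresponds to the label-delayed training from the paper title, i.e. shifting the frame-level targets by two codec frames (160 ms at 12.5 Hz) so the model sees extra acoustic context before it has to emit user-end. A minimal sketch of such a shift, under that assumption:

```python
import torch

def delay_labels(labels: torch.Tensor, delay_frames: int = 2) -> torch.Tensor:
    """Shift frame labels right by `delay_frames`, so the target at frame t
    is the label that originally belonged to frame t - delay_frames."""
    if delay_frames == 0:
        return labels
    pad = labels[:1].repeat(delay_frames)  # repeat the first label as padding
    return torch.cat([pad, labels[: labels.shape[0] - delay_frames]])
```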

Used when both speakers' audio is present in a single audio stream. We mask the loss for the system frames: while the model can still learn from the interaction with the system, the loss is only computed for user and user-end frames.
config: configs/mimi/lstm_mimi_12.5hz_delay2f_mask-system.yaml
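A minimal sketch of this loss masking, assuming frame-level cross-entropy and the illustrative label ids from the earlier sketch (system = 2, system-end = 3):

```python
import torch
import torch.nn.functional as F

def masked_ce_loss(logits: torch.Tensor, labels: torch.Tensor,
                   masked_labels=(2, 3)) -> torch.Tensor:
    """Frame-level cross-entropy over user/user-end frames only.
    logits: (T, num_classes), labels: (T,); system/system-end frames are excluded."""
    keep = torch.ones_like(labels, dtype=torch.bool)
    for lab in masked_labels:
        keep &= labels != lab
    return F.cross_entropy(logits[keep], labels[keep])
```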

Here, we have access to separate audio streams for the user and the system. We ignore the system stream and train the single-stream endpointer on user speech only.

Here, we have access to separate audio streams for the user and the system. We use a modified architecture with two LSTM modules to learn user- and system-specific features, and predict user, user-end, system, and system-end.
NOTE: This could be simplified by treating user and system as the same turn type. However, we avoid this because, depending on the dataset, user and system have different pause patterns: the system tends to have much longer pauses (e.g. while processing the user query, looking up information, etc.).
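A rough sketch of this two-stream variant: one LSTM per stream, with the hidden states concatenated before a shared classifier over the four turn labels (feature and hidden dimensions are illustrative assumptions, not the repo's exact configuration):

```python
import torch
import torch.nn as nn

class TwoStreamEndpointer(nn.Module):
    """Separate LSTMs over user and system codec features, joint prediction head."""
    def __init__(self, feat_dim=512, hidden=256, num_classes=4):
        super().__init__()
        self.user_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.system_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_classes)  # user/user-end/system/system-end

    def forward(self, user_feats, system_feats):
        u, _ = self.user_lstm(user_feats)      # (B, T, hidden)
        s, _ = self.system_lstm(system_feats)  # (B, T, hidden)
        return self.head(torch.cat([u, s], dim=-1))  # (B, T, num_classes)
```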
The choice of endpointer depends on the application. For example, a user-only endpointer could be used for speech recognition, while a user-system endpointer could be used for full-duplex integration (with the additional overhead of waiting for both user and system frames).
For any queries, feel free to reach out to udupa@fit.vutbr.cz / sathvikudupa66@gmail.com