Skip to content

Conversation

@woutd
Copy link

@woutd woutd commented Nov 12, 2025

Replaces gbooleans used for encoding/decoding status with mutexes to ensure all access to the participant encoder and decoder is thread-safe. Additionally, the reset status gboolean is now accessed atomically.

Currently only tested with participants using Opus encoder/decoder.

This needs to be carefully reviewed for performance issues and logic errors.

See also #3587

Replaces gbooleans used for encoding/decoding status with mutexes to ensure all access to the participant encoder and decoder is thread-safe.
Additionally, the reset status gboolean is now accessed atomically.
@januscla
Copy link

Thanks for your contribution, @woutd! Please make sure you sign our CLA, as it's a required step before we can merge this.

@lminiero
Copy link
Member

Thanks, I'll have a look when I can and get back to you!

Copy link
Member

@lminiero lminiero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a first quick review, so you can find some notes and comments inline, thanks!

janus_audiobridge_plainrtp_media plainrtp_media;
janus_mutex pmutex;
janus_mutex encoding_mutex; /* Encoding mutex to lock encoder instance */
janus_mutex decoding_mutex; /* Decoding mutex to lock decoder instance */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we could use a single mutex for both, considering that the same participant thread takes care of both encoding and decoding, and the only thing we need to ensure is that the instances exist, but that's something we can discuss at a later stage.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering the same. I also thought to maybe use the participant pmutex.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to use a different mutex than pmutex, as we don't want other operations to interfere with transcoding, but we can revisit this later.

bpkt = (janus_audiobridge_buffer_packet *)jbp.data;
if(!g_atomic_int_compare_and_exchange(&participant->decoding, 0, 1)) {
janus_mutex_lock(&participant->decoding_mutex);
if(!participant->decoder) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is broken. If this participant is using G.711 there won't be a decoder instance. This means that lock and check should only occur for Opus participants, and not G.711 participants (for which we use static tables for encoding/decoding that don't require locks).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. For sure the participant->decoder == NULL is wrong for G.711.

I used participant->decoding_mutex because there was a g_atomic_int_compare_and_exchange check on decoding at this point.

I will look into this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the lock so it only is used when doing Opus decoding. This means it will no longer do a "decoding check" for G.711 but I do not think this changes much compared to the original logic were participant->decoding is briefly set to 1 and back to 0 in janus_audiobridge_hangup_media_internal.

if(g_atomic_int_get(&participant->active) && (participant->codec == JANUS_AUDIOCODEC_PCMA ||
participant->codec == JANUS_AUDIOCODEC_PCMU) && g_atomic_int_compare_and_exchange(&participant->encoding, 0, 1)) {
participant->codec == JANUS_AUDIOCODEC_PCMU)) {
janus_mutex_lock(&participant->encoding_mutex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: the lock should not be used when we're encoding with G.711, but only when we're using Opus.

Copy link
Author

@woutd woutd Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason as above, I used a lock because of the original g_atomic_int_compare_and_exchange on participant->encoding. I will remove it.

Copy link
Member

@lminiero lminiero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies if this took a while, I checked your changes and, apart from the considerations we made on whether it makes more sense to keep two distinct mutexes for encoding/encoding, or use just one, it looks fine to me. I haven't made any test yet, but I think it's ok to remove the draft status for now, and let people play with it to provide feedback. Did you keep on testing this in your own setup, by the way? Anything worth noting or has it worked as expected so far?

janus_audiobridge_plainrtp_media plainrtp_media;
janus_mutex pmutex;
janus_mutex encoding_mutex; /* Encoding mutex to lock encoder instance */
janus_mutex decoding_mutex; /* Decoding mutex to lock decoder instance */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to use a different mutex than pmutex, as we don't want other operations to interfere with transcoding, but we can revisit this later.

@lminiero lminiero marked this pull request as ready for review December 2, 2025 07:37
@woutd
Copy link
Author

woutd commented Dec 2, 2025

No problem @lminiero, thanks for picking this up.
We are running this on development servers since and haven't had any issues so far.

@spscream
Copy link
Contributor

spscream commented Dec 6, 2025

yesterday we've got deadlock on audiobridge sessions_mutex, some of threads are waiting for sessions_mutex

Thread 3469 (LWP 2036810):
#0  0x00007ff99aa17158 in __syscall6 (a6=<optimized out>, a5=<optimized out>, a4=<optimized out>, a3=<optimized out>, a2=<optimized out>, a1=<optimized out>, n=<optimized out>) at ./arch/x86
_64/syscall_arch.h:59
#1  syscall (n=<optimized out>) at src/misc/syscall.c:20
#2  0x00007ff99a6ae81b in ?? () from /usr/lib/libglib-2.0.so.0
#3  0x00007ff999538fd1 in janus_audiobridge_hangup_media (handle=0x7ff9776d8740) at plugins/janus_audiobridge.c:6432
#4  0x00005563b98078d0 in janus_ice_outgoing_traffic_handle (handle=0x7ff9955867f0, pkt=0x5563b9885840 <janus_ice_hangup_peerconnection>) at ice.c:4631
#5  0x00005563b980ad41 in janus_ice_outgoing_traffic_dispatch (source=0x7ff997931e10, callback=<optimized out>, user_data=<optimized out>) at ice.c:528
#6  0x00007ff99a64b255 in ?? () from /usr/lib/libglib-2.0.so.0
#7  0x00007ff99a64e387 in ?? () from /usr/lib/libglib-2.0.so.0
#8  0x00007ff99a64ec77 in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#9  0x00005563b97fc2c8 in janus_ice_handle_thread (data=0x7ff9955867f0) at ice.c:1354
#10 0x00007ff99a67e5f0 in ?? () from /usr/lib/libglib-2.0.so.0
#11 0x00007ff99aa3a34f in start (p=0x7ff955a218e8) at src/thread/pthread_create.c:207
#12 0x00007ff99aa3c965 in __clone () at src/thread/x86_64/clone.s:22
Backtrace stopped: frame did not save the PC

one thread, which holding sessions_mutex are waiting for participant->decoding_mutex:

Thread 3360 (LWP 2034780):
#0  0x00007ff99aa17158 in __syscall6 (a6=<optimized out>, a5=<optimized out>, a4=<optimized out>, a3=<optimized out>, a2=<optimized out>, a1=<optimized out>, n=<optimized out>) at ./arch/x86
_64/syscall_arch.h:59
#1  syscall (n=<optimized out>) at src/misc/syscall.c:20
#2  0x00007ff99a6ae81b in ?? () from /usr/lib/libglib-2.0.so.0
#3  0x00007ff99953865e in janus_audiobridge_hangup_media_internal (handle=handle@entry=0x7ff9606b7bd0) at plugins/janus_audiobridge.c:6517
#4  0x00007ff999538fd9 in janus_audiobridge_hangup_media (handle=0x7ff9606b7bd0) at plugins/janus_audiobridge.c:6433
#5  0x00005563b98078d0 in janus_ice_outgoing_traffic_handle (handle=0x7ff9968589b0, pkt=0x5563b9885840 <janus_ice_hangup_peerconnection>) at ice.c:4631
#6  0x00005563b980ad41 in janus_ice_outgoing_traffic_dispatch (source=0x7ff997b8de90, callback=<optimized out>, user_data=<optimized out>) at ice.c:528
#7  0x00007ff99a64b255 in ?? () from /usr/lib/libglib-2.0.so.0
#8  0x00007ff99a64e387 in ?? () from /usr/lib/libglib-2.0.so.0
#9  0x00007ff99a64ec77 in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#10 0x00005563b97fc2c8 in janus_ice_handle_thread (data=0x7ff9968589b0) at ice.c:1354
#11 0x00007ff99a67e5f0 in ?? () from /usr/lib/libglib-2.0.so.0
#12 0x00007ff99aa3a34f in start (p=0x7ff957ed58e8) at src/thread/pthread_create.c:207
#13 0x00007ff99aa3c965 in __clone () at src/thread/x86_64/clone.s:22
Backtrace stopped: frame did not save the PC

full log is here:
gdb_0512.log

upd: found a deadlock with decoding_mutex, wrote a review comment on pr

if(!first && participant->codec == JANUS_AUDIOCODEC_OPUS && lost_packets_gap <= JITTER_BUFFER_MAX_GAP_SIZE && !participant->muted) {
lost_packets_gap++;
if(!g_atomic_int_compare_and_exchange(&participant->decoding, 0, 1)) {
janus_mutex_lock(&participant->decoding_mutex);
Copy link
Contributor

@spscream spscream Dec 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this lock is locked and isn't unlocked, while janus_audiobridge_participant_istalking(session, participant, NULL, NULL); on line 9187 locks audiobridge->mutex

lock on 6517 is taken while audiobridge->mutex is locked on line 6466. So this leads to deadlock I mentioned in comment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing this @spscream
I replaced the original atomic operations with mutex lock and unlock. But I overlooked that while participant->decoding_mutex is locked in janus_audiobridge_participant_thread there is a call to janus_audiobridge_participant_istalking which can lock audiobridge->mutex.
The order of locking is the reverse of the one in janus_audiobridge_hangup_media_internal.
I will push a fix to this PR which will unlock participant->decoding_mutex before calling janus_audiobridge_participant_istalking.

Copy link
Member

@lminiero lminiero Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't like this approach: plenty of things can happen between when you release and lock again. A better approach is to avoid the lock in that inner function if we have the mutex already (something we do in a few other places). It may also be we can get rid of that lock in the function altogether, if this is the only place we call it from. I'm travelling today so I'll check tomorrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checked and janus_audiobridge_participant_istalking is indeed only called from the participant thread, but that reverse order does make everything more complex. I guess that, if we want to avoid the release-and-lock-again of the mutex there (which I personally don't like) one possible approach is to move the notification part of janus_audiobridge_participant_istalking (which is what the room mutex is used for) out of that function and to something we can do later, e.g., something like (pseudocode)

  1. (changed, talking) = istalking(user)
  2. continue until mutex is released
  3. if changed --> notify(user, talking)

That way we'd ensure the two mutexes don't overlap the wrong way there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand. What do you mean with release-and-lock again? The proposed fix unlocks participant->decoding_mutex before it calls janus_audiobridge_participant_istalking instead of after. It does not lock it again.
Even if we moved the notification part out of janus_audiobridge_participant_istalking and do it later (after participant->decoding_mutex unlocks), we still need to lock audiobridge->mutex at that point too.

Copy link
Member

@lminiero lminiero Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, I get it now. Apologies, I thought the idea was to create a "hole" where you'd temporarily release the mutex to avoid the conflict, and then take it again. Checking the code again I can see that's indeed not the case. But in that case I think that means the same fix needs to be done in at least another place, ~L9243, where there's the same istalking done before a decode call (and so wrapped in the same reversed mutex lock).

Note: edited the line number, as it's different in your patch.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think participant->decoding_mutex is locked at that point, only after.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh, you're right again, I was looking at the main codebase to see where the istalking function was, rather than your patch... Three days at the Faroe Islands and I've forgotten how to check code 😁

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe, better we check one too many than one too few :)

…talking in janus_audiobridge_participant_thread.

This can result in a deadlock because participant->decoding_mutex and audiobridge->mutex are locked in reverse order in janus_audiobridge_hangup_media_internal.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants