-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Make participant encoder/decoder access thread-safe #3606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Replaces gbooleans used for encoding/decoding status with mutexes to ensure all access to the participant encoder and decoder is thread-safe. Additionally, the reset status gboolean is now accessed atomically.
|
Thanks, I'll have a look when I can and get back to you! |
lminiero
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done a first quick review, so you can find some notes and comments inline, thanks!
| janus_audiobridge_plainrtp_media plainrtp_media; | ||
| janus_mutex pmutex; | ||
| janus_mutex encoding_mutex; /* Encoding mutex to lock encoder instance */ | ||
| janus_mutex decoding_mutex; /* Decoding mutex to lock decoder instance */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we could use a single mutex for both, considering that the same participant thread takes care of both encoding and decoding, and the only thing we need to ensure is that the instances exist, but that's something we can discuss at a later stage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering the same. I also thought to maybe use the participant pmutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes more sense to use a different mutex than pmutex, as we don't want other operations to interfere with transcoding, but we can revisit this later.
src/plugins/janus_audiobridge.c
Outdated
| bpkt = (janus_audiobridge_buffer_packet *)jbp.data; | ||
| if(!g_atomic_int_compare_and_exchange(&participant->decoding, 0, 1)) { | ||
| janus_mutex_lock(&participant->decoding_mutex); | ||
| if(!participant->decoder) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is broken. If this participant is using G.711 there won't be a decoder instance. This means that lock and check should only occur for Opus participants, and not G.711 participants (for which we use static tables for encoding/decoding that don't require locks).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. For sure the participant->decoder == NULL is wrong for G.711.
I used participant->decoding_mutex because there was a g_atomic_int_compare_and_exchange check on decoding at this point.
I will look into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the lock so it only is used when doing Opus decoding. This means it will no longer do a "decoding check" for G.711 but I do not think this changes much compared to the original logic were participant->decoding is briefly set to 1 and back to 0 in janus_audiobridge_hangup_media_internal.
src/plugins/janus_audiobridge.c
Outdated
| if(g_atomic_int_get(&participant->active) && (participant->codec == JANUS_AUDIOCODEC_PCMA || | ||
| participant->codec == JANUS_AUDIOCODEC_PCMU) && g_atomic_int_compare_and_exchange(&participant->encoding, 0, 1)) { | ||
| participant->codec == JANUS_AUDIOCODEC_PCMU)) { | ||
| janus_mutex_lock(&participant->encoding_mutex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: the lock should not be used when we're encoding with G.711, but only when we're using Opus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason as above, I used a lock because of the original g_atomic_int_compare_and_exchange on participant->encoding. I will remove it.
lminiero
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies if this took a while, I checked your changes and, apart from the considerations we made on whether it makes more sense to keep two distinct mutexes for encoding/encoding, or use just one, it looks fine to me. I haven't made any test yet, but I think it's ok to remove the draft status for now, and let people play with it to provide feedback. Did you keep on testing this in your own setup, by the way? Anything worth noting or has it worked as expected so far?
| janus_audiobridge_plainrtp_media plainrtp_media; | ||
| janus_mutex pmutex; | ||
| janus_mutex encoding_mutex; /* Encoding mutex to lock encoder instance */ | ||
| janus_mutex decoding_mutex; /* Decoding mutex to lock decoder instance */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes more sense to use a different mutex than pmutex, as we don't want other operations to interfere with transcoding, but we can revisit this later.
|
No problem @lminiero, thanks for picking this up. |
|
yesterday we've got deadlock on audiobridge sessions_mutex, some of threads are waiting for sessions_mutex one thread, which holding sessions_mutex are waiting for participant->decoding_mutex: full log is here: upd: found a deadlock with decoding_mutex, wrote a review comment on pr |
| if(!first && participant->codec == JANUS_AUDIOCODEC_OPUS && lost_packets_gap <= JITTER_BUFFER_MAX_GAP_SIZE && !participant->muted) { | ||
| lost_packets_gap++; | ||
| if(!g_atomic_int_compare_and_exchange(&participant->decoding, 0, 1)) { | ||
| janus_mutex_lock(&participant->decoding_mutex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this lock is locked and isn't unlocked, while janus_audiobridge_participant_istalking(session, participant, NULL, NULL); on line 9187 locks audiobridge->mutex
lock on 6517 is taken while audiobridge->mutex is locked on line 6466. So this leads to deadlock I mentioned in comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing this @spscream
I replaced the original atomic operations with mutex lock and unlock. But I overlooked that while participant->decoding_mutex is locked in janus_audiobridge_participant_thread there is a call to janus_audiobridge_participant_istalking which can lock audiobridge->mutex.
The order of locking is the reverse of the one in janus_audiobridge_hangup_media_internal.
I will push a fix to this PR which will unlock participant->decoding_mutex before calling janus_audiobridge_participant_istalking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't like this approach: plenty of things can happen between when you release and lock again. A better approach is to avoid the lock in that inner function if we have the mutex already (something we do in a few other places). It may also be we can get rid of that lock in the function altogether, if this is the only place we call it from. I'm travelling today so I'll check tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checked and janus_audiobridge_participant_istalking is indeed only called from the participant thread, but that reverse order does make everything more complex. I guess that, if we want to avoid the release-and-lock-again of the mutex there (which I personally don't like) one possible approach is to move the notification part of janus_audiobridge_participant_istalking (which is what the room mutex is used for) out of that function and to something we can do later, e.g., something like (pseudocode)
- (changed, talking) = istalking(user)
- continue until mutex is released
- if changed --> notify(user, talking)
That way we'd ensure the two mutexes don't overlap the wrong way there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand. What do you mean with release-and-lock again? The proposed fix unlocks participant->decoding_mutex before it calls janus_audiobridge_participant_istalking instead of after. It does not lock it again.
Even if we moved the notification part out of janus_audiobridge_participant_istalking and do it later (after participant->decoding_mutex unlocks), we still need to lock audiobridge->mutex at that point too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, I get it now. Apologies, I thought the idea was to create a "hole" where you'd temporarily release the mutex to avoid the conflict, and then take it again. Checking the code again I can see that's indeed not the case. But in that case I think that means the same fix needs to be done in at least another place, ~L9243, where there's the same istalking done before a decode call (and so wrapped in the same reversed mutex lock).
Note: edited the line number, as it's different in your patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think participant->decoding_mutex is locked at that point, only after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh, you're right again, I was looking at the main codebase to see where the istalking function was, rather than your patch... Three days at the Faroe Islands and I've forgotten how to check code 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hehe, better we check one too many than one too few :)
…talking in janus_audiobridge_participant_thread. This can result in a deadlock because participant->decoding_mutex and audiobridge->mutex are locked in reverse order in janus_audiobridge_hangup_media_internal.
Replaces gbooleans used for encoding/decoding status with mutexes to ensure all access to the participant encoder and decoder is thread-safe. Additionally, the reset status gboolean is now accessed atomically.
Currently only tested with participants using Opus encoder/decoder.
This needs to be carefully reviewed for performance issues and logic errors.
See also #3587