MQTT reconnect issues on wifi to mobile data switch #2200

@tobru

I noticed that the MQTT connection does not always reconnect when switching from wifi to mobile data. Sometimes it works as expected; other times the connection goes stale without any visible indication. Only a full app restart brings the MQTT connection back. This makes the app quite unpredictable, because you can't tell whether the connection is good or not. Unfortunately, this led to a terrible "WAF" (wife acceptance factor).

This issue replaces #2183. I first thought it was just a friend list update issue, but it's really an MQTT connection issue.

To get an idea of the cause, I used Claude Code with a debugging skill to trace the problem, and the result looks quite interesting. I'll share it here; feel free to ignore it if it's of no use. I would love to contribute a fix, but I'm unable to do it on my own, and using Claude Code to propose a fix feels wrong to me. Please let me know if I should create a PR proposed by Claude Code anyway. I did get the app compiled and a debug version with a potential fix installed on my device, but I couldn't verify whether the fix solves the issue, as the debug version seems to behave differently from the released version in some way.

I'm running:

  • App version: 2.5.5
  • Android version: 16 (BP4A.260105.004.A2)
  • Device: Pixel 10 Pro
  • Connection mode: MQTT

Analysis by Claude Code

Race Condition 1: onAvailable fires before onLost during network switch

Location: MQTTMessageProcessorEndpoint.kt:92-112

When switching from wifi to mobile data, Android's ConnectivityManager.NetworkCallback can deliver onAvailable (for mobile) before onLost (for wifi). This is documented Android behavior for registerDefaultNetworkCallback.

The sequence:

  1. Mobile data becomes available → onAvailable() fires
  2. State is still CONNECTED (on wifi), so the check endpointStateRepo.endpointState.value == EndpointState.DISCONNECTED at line 100 evaluates to false → no reconnect is attempted
  3. Wifi is lost → onLost() fires → disconnect() is called → state becomes DISCONNECTED
  4. No further onAvailable event fires because mobile data was already reported as available
  5. Result: connection is stuck in DISCONNECTED state permanently

The Paho connectionLost() callback (line 297) might fire and call scheduleMqttReconnect(), but if disconnect() from onLost closes the client cleanly first, Paho won't fire connectionLost; it only fires on unexpected disconnections.
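
To make the sequence above easier to follow, here is a minimal sketch in plain Kotlin. It is a simplified model, not the app's code: `GatedEndpoint`, `simulateSwitch` and the string network names are hypothetical, and the real callbacks receive Android `Network` objects. It only reproduces the gate described for line 100.

```kotlin
// Simplified model of the gating logic described above; not the app's actual code.
enum class EndpointState { CONNECTED, DISCONNECTED }

class GatedEndpoint(var state: EndpointState = EndpointState.CONNECTED) {
    val log = mutableListOf<String>()

    // Mirrors the described gate: reconnect only when currently DISCONNECTED.
    fun onAvailable(network: String) {
        if (state == EndpointState.DISCONNECTED) {
            state = EndpointState.CONNECTED
            log.add("onAvailable($network): reconnected")
        } else {
            log.add("onAvailable($network): ignored, state is $state")
        }
    }

    // Mirrors the described onLost: a clean disconnect with no follow-up reconnect.
    fun onLost(network: String) {
        state = EndpointState.DISCONNECTED
        log.add("onLost($network): disconnected")
    }
}

// Replays the callback order from the list: onAvailable(mobile), then onLost(wifi).
fun simulateSwitch(): GatedEndpoint {
    val endpoint = GatedEndpoint()
    endpoint.onAvailable("mobile")
    endpoint.onLost("wifi")
    return endpoint
}
```

In this model, `simulateSwitch().state` ends up as `DISCONNECTED`, and nothing is left that would flip it back, which matches step 5.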

Race Condition 2: Concurrent reconnect and disconnect via the semaphore

Location: MQTTMessageProcessorEndpoint.kt:85, 102, 110

When onAvailable and onLost fire in close succession, both launch coroutines that compete for the connectingLock semaphore:

  1. onAvailable → reconnect() acquires lock → calls disconnect() then connectToBroker() → connection established
  2. onLost → disconnect() is waiting for the lock
  3. reconnect() finishes and releases the lock
  4. onLost acquires the lock → disconnects the freshly established connection
  5. No further reconnect is triggered
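
The lock handoff above can be reproduced with a small sketch. This is an assumption-laden model: it uses plain threads and `java.util.concurrent.Semaphore` instead of the app's coroutines, and `simulateLockHandoff` and the event strings are made up for illustration. A latch forces the order from the list (the reconnect path takes the lock first) so the outcome is deterministic.

```kotlin
import java.util.Collections
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Semaphore
import java.util.concurrent.atomic.AtomicBoolean
import kotlin.concurrent.thread

// Hypothetical model of two callbacks serialized by one lock; not the app's code.
fun simulateLockHandoff(): List<String> {
    val connectingLock = Semaphore(1)
    val reconnectHoldsLock = CountDownLatch(1)
    val connected = AtomicBoolean(true)
    val events = Collections.synchronizedList(mutableListOf<String>())

    // onAvailable path: reconnect() = disconnect, then connect, under the lock.
    val reconnect = thread {
        connectingLock.acquire()
        try {
            reconnectHoldsLock.countDown() // the onLost path may now queue up
            connected.set(false); events.add("reconnect: disconnect")
            connected.set(true); events.add("reconnect: connect")
        } finally {
            connectingLock.release()
        }
    }

    // onLost path: waits for the lock, then tears down whatever is connected.
    val lost = thread {
        reconnectHoldsLock.await()
        connectingLock.acquire()
        try {
            connected.set(false); events.add("onLost: disconnect")
        } finally {
            connectingLock.release()
        }
    }

    reconnect.join()
    lost.join()
    events.add("final: " + if (connected.get()) "CONNECTED" else "DISCONNECTED")
    return events.toList()
}
```

The lock serializes the two operations correctly, yet the last entry is `final: DISCONNECTED`: serialization alone does not help when the later operation is a disconnect for a network the connection no longer uses.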

Logic issue: ExistingWorkPolicy.KEEP prevents fresh reconnects

Location: Scheduler.kt:79

workManager.enqueueUniqueWork(ONETIME_TASK_MQTT_RECONNECT, ExistingWorkPolicy.KEEP, this)

ExistingWorkPolicy.KEEP means: if a reconnect work request is already enqueued, the new request is silently ignored. If a previous reconnect was scheduled but hasn't executed yet (e.g. waiting on its 10-second delay or network constraint), a new disconnect event won't schedule a fresh reconnect. The stale work item may complete without actually recovering the connection.
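
As a rough illustration of the difference between keeping and replacing pending work, here is a toy model. It does not use the WorkManager API; `Policy`, `UniqueWorkQueue` and `enqueueUnique` are hypothetical names, and requests are plain strings.

```kotlin
// Toy model of unique-work semantics; it does not use or reproduce WorkManager.
enum class Policy { KEEP, REPLACE }

class UniqueWorkQueue {
    private val pending = mutableMapOf<String, String>()

    // Returns true if the new request was stored, false if it was dropped.
    fun enqueueUnique(name: String, policy: Policy, request: String): Boolean {
        if (policy == Policy.KEEP && name in pending) return false
        pending[name] = request
        return true
    }

    // The request currently pending under this unique name, if any.
    fun pendingRequest(name: String): String? = pending[name]
}
```

In this model, enqueuing a second request under the same name with `Policy.KEEP` returns `false` and leaves the first request pending, while `Policy.REPLACE` swaps it for the new one.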

Potential fix directions

  1. In onLost, trigger a reconnect after disconnecting — after calling disconnect(), check if another network is currently available and initiate a reconnect
  2. In onAvailable, remove the == DISCONNECTED gate — always trigger a reconnect regardless of current state (the semaphore already serializes concurrent operations)
  3. Track specific Network objects — use onCapabilitiesChanged or associate the MQTT connection with a specific Network to properly detect when the active network changes rather than relying on generic available/lost events
  4. Change ExistingWorkPolicy.KEEP to REPLACE — ensure new reconnect requests supersede stale ones in the WorkManager queue
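
A minimal sketch of how directions 1 and 3 might look, again as a plain-Kotlin model rather than a patch: `NetworkAwareEndpoint`, `simulateFixedSwitch` and the string network names are hypothetical, and a real implementation would work with Android `Network` objects and the app's own reconnect logic.

```kotlin
// Hypothetical model of fix directions 1 and 3; not a patch for the app.
enum class LinkState { CONNECTED, DISCONNECTED }

class NetworkAwareEndpoint {
    var state = LinkState.CONNECTED
        private set
    var reconnects = 0
        private set

    // Networks currently reported as available (assumption: wifi at start).
    private val available = linkedSetOf("wifi")

    private fun reconnect() {
        state = LinkState.CONNECTED
        reconnects++
    }

    fun onAvailable(network: String) {
        available.add(network)
        if (state == LinkState.DISCONNECTED) reconnect()
    }

    // Direction 1: after disconnecting, reconnect if another network remains.
    fun onLost(network: String) {
        available.remove(network)
        state = LinkState.DISCONNECTED
        if (available.isNotEmpty()) reconnect()
    }
}

// Same callback order as in Race Condition 1: onAvailable(mobile), onLost(wifi).
fun simulateFixedSwitch(): NetworkAwareEndpoint {
    val endpoint = NetworkAwareEndpoint()
    endpoint.onAvailable("mobile")
    endpoint.onLost("wifi")
    return endpoint
}
```

With the same event order as before, this model ends in `CONNECTED` after exactly one reconnect, because `onLost` no longer depends on a later `onAvailable` to recover.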
