Conversation

@Dairus01 commented Dec 20, 2025

This PR introduces a centralized, opt-in retry and backoff
mechanism for network operations in the Bittensor SDK.

Key points:

  • Adds a shared retry utility with sync and async support
  • Integrates retries at outbound network boundaries only
    (Dendrite and Subtensor)
  • Retries are disabled by default to preserve existing behavior
  • Axon logic is intentionally untouched
  • Async streaming calls are excluded by design
  • Includes unit tests covering retry success, exhaustion,
    disabled behavior, and default exception safety

This change improves SDK reliability under transient network
failures without affecting protocol logic or consensus behavior.
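
As a rough sketch, the intended opt-in usage might look like the following (retry_async comes from this PR's bittensor/utils/retry.py; the keyword-argument names and the AsyncSubtensor call are illustrative assumptions, not the confirmed signature):

from bittensor.utils.retry import retry_async

async def fetch_neurons(async_subtensor, netuid: int):
    # A zero-argument async callable wrapping a single outbound call.
    async def call():
        return await async_subtensor.neurons(netuid=netuid)

    # Nothing retries unless the caller explicitly opts in like this.
    return await retry_async(call, max_attempts=3, base_delay=0.5, max_delay=10.0)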

Contribution by Gittensor, learn more at https://gittensor.io/

@Dairus01 (Author)

@basfroman, I would love to get your review on this.

@Dairus01 (Author)

All CI fixes have been applied:

  • Ruff formatting corrected
  • Subtensor unit test failure fixed via defensive handling of substrate.query
  • E2E Subtensor workflow fixed to avoid unauthenticated gh usage

Please let me know if you'd prefer these commits squashed.

@basfroman, could you please re-review?

@thewhaleking (Contributor)

The code seems decent, but I'm not sure this is a good thing.

Generally speaking, an SDK should not be opinionated. One of the big things we got away from in SDKv10 was error handling within the SDK. We decided it's best to leave it up to users how to handle exceptions/errors/retrying. This very much seems like a step back toward those opinionated versions of old.

@Dairus01 force-pushed the feat/centralized-retry branch from 1adebf1 to 5e3c65f on December 21, 2025 at 21:13
@Dairus01 (Author)

> The code seems decent, but I'm not sure this is a good thing.
>
> Generally speaking, an SDK should not be opinionated. One of the big things we got away from in SDKv10 was error handling within the SDK. We decided it's best to leave it up to users how to handle exceptions/errors/retrying. This very much seems like a step back toward those opinionated versions of old.

This PR has been updated to remove all default retry behavior from core SDK components, while preserving the retry utility as an optional helper.


What Changed

1. Retry Utility Remains Available

  • bittensor/utils/retry.py is kept intact
  • Provides:
    • retry_call (sync)
    • retry_async (async)
  • Configuration remains explicit and opt-in
  • This utility is not used internally by the SDK

The retry module is now a user-level helper, intended for developers who want a shared, consistent retry mechanism without re-implementing it themselves.
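
For illustration, a minimal sync sketch of that opt-in usage (retry_call is the helper above; the keyword-argument names and the Subtensor call are assumptions for illustration):

from bittensor.utils.retry import retry_call

def get_balance_with_retry(subtensor, address: str):
    # Wrap one outbound call; retries happen only because the caller
    # asked for them here, never implicitly inside the SDK.
    return retry_call(
        lambda: subtensor.get_balance(address),
        max_attempts=3,
        base_delay=0.5,
        max_delay=10.0,
    )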


2. Retry Removed from Core SDK Paths

The SDK itself no longer applies retries anywhere:

  • Dendrite
    • All retry wrapping has been removed
    • Network calls behave exactly as before
  • Subtensor
    • All retry wrapping has been removed
    • Direct calls to substrate.query are restored

This ensures:

  • No automatic retries
  • No implicit error handling
  • Exceptions always propagate exactly as they did prior to this PR

3. Defensive Bug Fix Retained

The fix to Subtensor.get_hyperparameter remains:

  • Guards against substrate.query returning None
  • Guards against results without a .value attribute
  • This was required to satisfy existing unit tests and prevents an AttributeError

This change is a bug fix, not retry logic, and is independent of the retry discussion.


Resulting Behavior

  • SDK behavior is fully non-opinionated
  • No retry, backoff, or error policy is enforced
  • Users who want retries must explicitly opt in by using the helper
  • Existing applications observe no behavioral change

@basfroman (Collaborator)

Hi @Dairus01, I'll review this PR when I get the time. But what @thewhaleking said in this comment is 100% in line with our plan.

@Dairus01 (Author)

> Hi @Dairus01, I'll review this PR when I get the time. But what @thewhaleking said in this comment is 100% in line with our plan.

Thanks, I look forward to your review. Following @thewhaleking's comment, I have now made the retry optional.

Instead of a centralized retry, it is now an optional retry that you can enable if you want it. When enabled, you can configure the maximum number of retry attempts, the base delay, the maximum delay, and other options.

I would be happy to improve anything required.

@basfroman (Collaborator)

> Thanks, I look forward to your review. Following @thewhaleking's comment, I have now made the retry optional.
>
> Instead of a centralized retry, it is now an optional retry that you can enable if you want it. When enabled, you can configure the maximum number of retry attempts, the base delay, the maximum delay, and other options.
>
> I would be happy to improve anything required.

Sounds good to me. I'll get back to this PR next Monday.

@Dairus01 (Author)

> Sounds good to me. I'll get back to this PR next Monday.

I'm looking forward to it.

@basfroman (Collaborator) left a review


The idea is good as a standalone feature, but not as an integrated SDK solution.
You still need to fix what I mentioned below to move forward.

Comment on lines +574 to +580
if result is None:
    return None

if hasattr(result, "value"):
    return result.value

return result
@basfroman (Collaborator):

Revert it.

Comment on lines +827 to +832
return self.substrate.runtime_call(
    api=runtime_api,
    method=method,
    params=params,
    block_hash=block_hash,
).value
@basfroman (Collaborator):

Good, but apply the same changes to async_subtensor's query_runtime_api.
Sync and async subtensors have to be consistent.
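
For clarity, a sketch of the mirrored async change being requested here (assuming the async substrate interface exposes an awaitable runtime_call with the same arguments):

result = await self.substrate.runtime_call(
    api=runtime_api,
    method=method,
    params=params,
    block_hash=block_hash,
)
return result.value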

import logging
from typing import Type, Tuple, Optional, Callable, Any, Union

logger = logging.getLogger(__name__)
@basfroman (Collaborator):

Suggested change
logger = logging.getLogger(__name__)
logger = logging.getLogger("bittensor.utils.retry")

Comment on lines +654 to +668
async def _make_stream_request():
    async with (await self.session).post(
        url,
        headers=synapse.to_headers(),
        json=synapse.model_dump(),
        timeout=aiohttp.ClientTimeout(total=timeout),
    ) as response:
        # Use synapse subclass' process_streaming_response method to yield the response chunks
        async for chunk in synapse.process_streaming_response(response):  # type: ignore
            yield chunk  # Yield each chunk as it's processed
        json_response = synapse.extract_response_json(response)

        # Process the server response
        self.process_server_response(response, json_response, synapse)

@basfroman (Collaborator):

Looks like an unfinished refactoring. Remove this from the current PR and create a separate PR for this logic, covered by tests that actually show a benefit when reading large responses.

@basfroman (Collaborator):

This file should be located at tests/unit_tests/utils/test_retry.py.

return None # Should not be reached


async def retry_async(
@basfroman (Collaborator):

What if func is not an async function?
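
One possible guard, as a sketch: reject non-async callables up front so misuse fails loudly instead of retrying a non-awaitable result.

import inspect

async def retry_async(func, *args, **kwargs):
    # Fail fast on plain functions. Note this check does not catch
    # lambdas that merely return coroutine objects.
    if not inspect.iscoroutinefunction(func):
        raise TypeError(
            "retry_async expects an async callable; use retry_call for sync functions."
        )
    ...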

return delay * (0.5 + random.random())


def retry_call(
@basfroman (Collaborator):

What if func is not a sync function?
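
And the mirror image for the sync wrapper, as a sketch:

import inspect

def retry_call(func, *args, **kwargs):
    # Refuse coroutine functions so callers are pointed at retry_async
    # instead of silently getting back an un-awaited coroutine object.
    if inspect.iscoroutinefunction(func):
        raise TypeError(
            "retry_call expects a sync callable; use retry_async for async functions."
        )
    ...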


if last_exception:
    raise last_exception
return None  # Should not be reached
@basfroman (Collaborator):

Replace with assert False, "Unreachable code" or remove it. The same below.

"""Calculates backoff time with exponential backoff and jitter."""
delay = min(max_delay, base_delay * (_RETRY_BACKOFF_FACTOR**attempt))
# Add jitter: random value between 0 and delay
return delay * (0.5 + random.random())
@basfroman (Collaborator):

This is a gamble, not a stable solution.
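
Two concrete issues with the snippet above: the inline comment promises jitter between 0 and delay, while 0.5 + random.random() actually yields a value in [0.5 * delay, 1.5 * delay], so the result can exceed max_delay. One well-known alternative is "full jitter" (sampling uniformly from zero up to the capped exponential delay); a sketch, not part of this PR:

import random

def backoff_full_jitter(attempt: int, base_delay: float, max_delay: float, factor: float = 2.0) -> float:
    # Uniform in [0, capped exponential delay]; the cap is always
    # respected and successive retries decorrelate naturally.
    return random.uniform(0.0, min(max_delay, base_delay * (factor ** attempt)))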

Comment on lines +59 to +62
"""
Synchronous retry wrapper.
If BT_RETRY_ENABLED is False, executes the function exactly once.
@basfroman (Collaborator):

Describe the docstring properly. The same for the async version.
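
For reference, a sketch of what a fuller docstring might look like (BT_RETRY_ENABLED is from this PR; parameter names and defaults are assumptions):

def retry_call(func, *, max_attempts=3, base_delay=0.5, max_delay=10.0, exceptions=(Exception,)):
    """Call ``func``, retrying with exponential backoff and jitter.

    If BT_RETRY_ENABLED is False, ``func`` is executed exactly once and
    any exception propagates unchanged.

    Args:
        func: Zero-argument synchronous callable to invoke.
        max_attempts: Total number of attempts before giving up.
        base_delay: Initial backoff delay, in seconds.
        max_delay: Upper bound for any single backoff delay, in seconds.
        exceptions: Exception types that trigger a retry; anything else
            propagates immediately.

    Returns:
        The return value of the first successful call.

    Raises:
        The last caught exception once ``max_attempts`` is exhausted.
    """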
