Skip to content

feat: adds support for image builder api v3#504

Open
daniel-white wants to merge 10 commits intomainfrom
feat/imagebuilderv3
Open

feat: adds support for image builder api v3#504
daniel-white wants to merge 10 commits intomainfrom
feat/imagebuilderv3

Conversation

@daniel-white
Copy link

@daniel-white daniel-white commented Jan 13, 2026

  • adds v3 endpoint suppport
  • new high level builder client
  • replaces existing image build

@daniel-white daniel-white changed the title WIP feat: adds support for image builder api v3 feat: adds support for image builder api v3 Jan 20, 2026
@daniel-white daniel-white marked this pull request as ready for review January 20, 2026 20:30

def __str__(self) -> str:
# Shows a simple human readable representation of the Function. Used in error messages.
function_name: str = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't change this please. This is handling cases of inconsistent misconfigured by user function objects.
When app is deployed it's fully validated and doesn't have any inconsistencies.

return self._awaitables_factory

@property
def function_name(self) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new getter is most probably not needed because we're already using this information provided by SDK.
I'll figure out how you use this and write an approach that aligns with current architecture.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was getting it via the private path

)
from tensorlake.applications.image_builder.client_v3 import (
ImageBuilderClientV3,
ImageBuilderClientV3Options,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ImageBuilderClient and ImageBuilder are in separate packages? Aren't all of this is an ImageBuilderClient stuff?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separation of concerns.

f"⚙️ Preparing deployment for application(s) from {application_file_path}"
)

opts = ImageBuilderClientV3Options.from_env()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a type hint to opts.


if not build_req.images:
click.secho(
f"❌ No images found for application '{fn_config.function_name}'. "
Copy link
Contributor

@eabatalov eabatalov Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already validated during application validation. No need to do this check again.

raise click.Abort

await builder.build(build_req)
except (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're rewriting ImageBuilder then it should provide a clear contract on what exception types it rases in which cases. We also shouldn't use click inside SDK where the new ImageBuilder was moved now as SDK doesn't depend on Click, only CLI.

except Exception:
# Error message and summary are already printed by builder.build()
# Print final error message
click.secho("\n❌ Image build(s) failed", err=True, fg="red")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We swallow the Exception here (not printing it) how we and user are going to debug such errors?

from typing import AsyncGenerator
from uuid import uuid4 as uuid

import click
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Click is a CLI only dependency, please don't add it into SDK.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets discuss tomorrow.

import tempfile
from datetime import datetime
from typing import AsyncGenerator
from uuid import uuid4 as uuid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically use nanoid (vendored module) but if you want we can use something else.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switching



class BuildRequest:
"""Represents a request to build an application with multiple container images.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need such verbose docstrings? Asking because these are internal non-public components.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea i thought these were public

@@ -0,0 +1,1122 @@
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reduce verbosity of docstrings and make them concise and clear.

"""

try:
v3_req = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add type hints to all local variables to keep the code base easy to read and enforce discipline and alignment with existing coding practices in the repo.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anyway to force that with a tool?

}
except ImageBuilderClientV3Error as e:
click.secho(str(e), err=True, fg="red")
raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these exceptions need to bubble to CLI and then logged their during click.echo.
ImageBuilder needs to document all exceptions it can raise with enough details to make exception handling by CLI meaningful. If CLI is not going to do different things depending on the exception type then just document "raises Exception on error".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea. im thinking we need some kind of verbose mode, otherwise showing full stack traces all the time is annoying

]
_ = await asyncio.gather(*process_log_events_tasks, return_exceptions=True)

except (asyncio.CancelledError, KeyboardInterrupt, click.Abort):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove except for these exceptions from cli code?

reporters.values(), log_streams, strict=True
)
]
_ = await asyncio.gather(*process_log_events_tasks, return_exceptions=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return_exceptions=True means that all exceptions raised are put into the returned list which is ignored here.
Is this correct?

self._handle_build_info_error(
reporter,
summary,
f"Error getting final build info for {reporter.image_build_id}: {e}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to write all these separate except blocks to add these "Error getting final build info" lines while a full backtrace and exception is available to us?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just different error messages. happy to learn

Args:
summary: Dictionary with counts for total, succeeded, failed, canceled, and unknown.
"""
total = summary["total"]
Copy link
Contributor

@eabatalov eabatalov Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use a @DataClass or anything similar instead of untyped dicts.
Also please don't print anything except one success message for all image builds if everything succeeded.
We need to reduce cognitive load on users, not increase it by printing more build stats than they need to see.
I think all these print methods can be removed unless you have a clear reason why to keep them.
Also we need to output a single final success/failure message for all applications. Thus this printing needs to be done in CLI.

"""
if not image_info.functions:
raise ValueError("image_info.functions cannot be empty")
self.image_info = image_info
Copy link
Contributor

@eabatalov eabatalov Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the whole purpose of this object is to wrap ImageInformation and provide a method that transforms it, why don't just replace it with private function def _v3_request(image_info) ? This would remove > 100 lines of docstrings and not useful code. Also ImageBuildRequest and BuildRequest are hard to understand entities. It sounds like the same things. If we keep ImageBuildRequest class then it needs to be clear what it is just from its name.

may consist of multiple container images. Each image can contain different
sets of functions and their dependencies.

Tensorlake applications can use multiple images to separate functions with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to explain Tensorlake Applications UX here as this class if for internal use and we have good public docs that are responsible for explaining all of this.

The v3 image build request.
"""
image = self.image_info.image
function_names = [func.function_name for func in self.image_info.functions]
Copy link
Contributor

@eabatalov eabatalov Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use func._function_config.function_name. It's okay because this code is part of SDK.
This is consistent with how function name is obtained currently all across SDK for properly initialized function objects. Also .function_name public property shouldn't be added to Function object because it makes it usable by users because they work with Function objects directly. We minimize API surface exposed to users to minimal to ease maintenance.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

felt kind of yucky reaching into its internals...

print_validation_messages,
validate_loaded_applications,
)
from tensorlake.builder.client_v2 import BuildContext, ImageBuilderV2Client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete all clients not used anymore in this PR.

self._print_build_summary(summary)
# Use os._exit() to bypass asyncio cleanup and avoid "unhandled exception" errors
# This exits immediately without triggering asyncio.run() cleanup issues
os._exit(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a hack, please fix this properly or explain why it's not possible at the moment.

f"Build service error getting final build info for {reporter.image_build_id}: {e}",
)
except Exception as e: # pylint: disable=broad-except
self._handle_build_info_error(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have many except blocks that do almost the same?
Also why do we call _handle_build_info_error instead of bubbling the exception up to the caller?


from tensorlake.cli._common import ASYNC_HTTP_EVENT_HOOKS

# Enable httpx debug logging if requested via environment variable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

httpx is used across SDK and CLI.
What side effects does this have on them?
Is it right to put this code to client_v3.py?
Is this approach documented in httpx docs?

return self._project_id

@classmethod
def from_env(cls) -> "ImageBuilderClientV3Options":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to me like we're copy-pasting CLI code here.
Why do we do that? Who's going to use from_env() ?

Validation is not performed in __init__. Call validate() to validate the configuration.
"""

_base_url: str = field(init=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a simple @DataClass then why do we make these fields private and add lots of getter properties for it? Can we remove the getter properties?

_api_key: str | None = field(default=None, init=False)
_pat: str | None = field(default=None, init=False)
_organization_id: str | None = field(default=None, init=False)
_project_id: str | None = field(default=None, init=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use for field here?



@dataclass
class ImageBuilderClientV3Options:
Copy link
Contributor

@eabatalov eabatalov Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this and just create ImageBuilderClientV3() directly in CLI?
We're essentially creating an abstraction here over concepts that ImageBuilderClient shouldn't know about like pat,api_key,organization_id, project_id. This is all managed by CLI. Long story short there's nothing ImageBuilder specific in ImageBuilderClientV3Options.

Also please extract common logic with APIClient, see https://github.com/tensorlakeai/tensorlake/blob/main/src/tensorlake/applications/remote/api_client.py#L157C9-L157C16 and https://github.com/tensorlakeai/tensorlake/blob/main/src/tensorlake/applications/remote/api_client.py#L490 and reuse it in both this client and APIClient. This will result in a more complete refactoring that you started already.

Copy link
Contributor

@eabatalov eabatalov Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you really want to keep ImageBuilderClientV3Options then it needs to be a common non-image builder specific options object accepted by both image builder and existing API clients. The options objects needs to be defined in a common package and filled by CLI which handles configuration of all these things.
Also if you use it then existing CLI code needs to use it too where applicable.

# ============================================================================

# Type aliases for clarity (these are just str at runtime)
ImageBuildId = str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we add these type aliases instead of using str directly?

Copy link
Contributor

@eabatalov eabatalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall feedback: please remove unnecessary docstrings and abstractions + address the comments I left so far.

@eabatalov
Copy link
Contributor

Another thing. Please move all image builder related stuff inside SDK under remote directory. This is where all Tensorlake Cloud (aka remote mode) code is strictly located. Everything else besides local and remote dirs is generic.

I didn't review the following but please ensure that it's correct:

  • Request timeouts for each httpx client request we're doing are correct for the expected duration of the request.
  • Every httpx.client request has a correct retry policy.
    See APIClient for examples.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants