diff --git a/README.md b/README.md
index 8ded88a..3eb23d9 100644
--- a/README.md
+++ b/README.md
@@ -19,6 +19,7 @@ Samples are organized by use case below:
 | Link | Description | Instance Type |
 | --- | --- | --- |
 | [SD inference](sd_hf_serve) | SD Inference workflow for creating an inference endpoint forwarded by ALB LoadBalancer powered by Karpenter's NodePool | Inf2 |
+| [Flux inference](flux_serve) | FLUX.1 dev Inference workflow for creating an inference endpoint forwarded by ALB LoadBalancer powered by Karpenter's NodePool and S3 mountpoints | Trn1/Inf2 |
 
 ## Getting Help
diff --git a/flux_serve/.gitignore b/flux_serve/.gitignore
new file mode 100644
index 0000000..a6cae42
--- /dev/null
+++ b/flux_serve/.gitignore
@@ -0,0 +1,5 @@
+package-lock.json
+.DS_Store
+node_modules/*
+oci-image-build/cdk.out/
+oci-image-build/node_modules/
diff --git a/flux_serve/README.md b/flux_serve/README.md
new file mode 100644
index 0000000..8b33e86
--- /dev/null
+++ b/flux_serve/README.md
@@ -0,0 +1,98 @@
+# Compile and serve ultra-large vision transformers like Flux on Neuron devices at scale
+
+[FLUX.1 dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) is a 12-billion-parameter rectified flow transformer capable of generating images from text. This example illustrates how to optimize its usage on the Neuron devices that power EC2 Inferentia (Inf2) and Trainium (Trn1) instances deployed on EKS. Since the FLUX.1 dev model cannot fit on a single Neuron device, neuronx_distributed splits the model transformer into `N=8` chunks along one dimension, so that each device holds only `1/N` of the tensor. By performing partial-chunk computation, the model produces partial outputs on all devices while maintaining accuracy and producing high-quality images. However, loading the traced model graph before serving significantly impacts startup time: the size of the produced graph traces (`*.pt`) grows with the `tp_degree` (`N`), and baking them into the OCI image and loading them during deployment leads to prolonged load times. To address this, we cache the model traces (on the order of 10 GB) in [S3 Mountpoints](https://docs.aws.amazon.com/eks/latest/userguide/s3-csi.html), decoupling the OCI image load time from the model trace load time.
+
+Next, we determine the compute type we intend to use. The initial step involves loading the model, executing a single inference, and assessing the image quality based on the number of inference steps. Subsequently, we determine the minimum inference latency on `Trn1` and `Inf2` under the specified acceptable quality parameters, currently controlled by `num_inference_steps`. Finally, we load each deployment unit, such as Flux on 8 Trn1 Neuron devices or Flux on 6 Inf2 Neuron devices, until it reaches the breaking point where latency exceeds the acceptable thresholds. We measure the Neuron core utilization using the built-in [Neuron monitoring tools](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/monitoring-tools.html#monitoring-tools) that are included in Container Insights on Amazon EKS.
+
+The rest of this post offers a code sample walkthrough that explains the optimization strategies we adopted to streamline the deployment process using the S3 CSI Driver and [Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). These strategies aim to reduce the cost of inference while enhancing model serving throughput.
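+
+Concretely, the serving Pods read the compiled traces directly from the S3 CSI mountpoint with `neuronx_distributed.trace.parallel_model_load`, the same call used in `app/flux_model_api.py`. The snippet below is a minimal sketch of that load path only, not the full application; the `/flux-compiled` default is an assumption for illustration, and in the Pod spec the path is supplied through the `COMPILER_WORKDIR_ROOT` environment variable.
+
+```python
+import os
+import torch_neuronx
+import neuronx_distributed
+
+# COMPILER_WORKDIR_ROOT points at the PVC backed by the S3 mountpoint (assumed /flux-compiled here).
+workdir = os.environ.get("COMPILER_WORKDIR_ROOT", "/flux-compiled")
+blocks_dir = os.path.join(workdir, "transformer/compiled_model/transformer_blocks")
+
+# Fail fast if the compile Job has not populated the bucket yet.
+if not os.path.isdir(blocks_dir):
+    raise RuntimeError(f"no compiled traces under {blocks_dir}; run the compile Job first")
+
+# Load one trace shard per NeuronCore; the ~10 GB of traces stay in S3 instead of the OCI image.
+with torch_neuronx.experimental.neuron_cores_context(start_nc=0, nc_count=8):
+    transformer_blocks = neuronx_distributed.trace.parallel_model_load(blocks_dir)
+```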
+
+
+## Walkthrough
+* [Create cluster with Karpenter node pools that provision `trn1` instances](https://karpenter.sh/docs/getting-started/getting-started-with-karpenter/)
+* [Set up Container Insights on Amazon EKS and Kubernetes](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-quickstart.html)
+* [Deploy your HuggingFace user token as a k8s secret](https://kubernetes.io/docs/concepts/configuration/secret/)
+```bash
+echo -n 'hf_myhftoken' | base64
+```
+Replace the `HUGGINGFACE_TOKEN` value with the encoded token and apply the secret to the cluster:
+```yaml
+apiVersion: v1
+kind: Secret
+type: Opaque
+metadata:
+  name: hf-secrets
+  namespace: default
+data:
+  HUGGINGFACE_TOKEN: encodedhfmyhftoken
+```
+* Create an S3 bucket to store the compiled model graph and [enable Amazon S3 objects with Mountpoint for Amazon S3 CSI driver](https://docs.aws.amazon.com/eks/latest/userguide/s3-csi.html)
+* [Deploy the OCI image pipeline](./oci-image-build)
+* [Deploy the AWS Load Balancer Controller](https://docs.aws.amazon.com/eks/latest/userguide/aws-load-balancer-controller.html) to enable public ingress access to the inference pods; export the k8s deployment as YAML and enforce a nodeSelector to the non-Neuron instances to avoid the IMDS v1 limitation.
+* [Deploy the Neuron device plugin and scheduler extension](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/kubernetes-getting-started.html#deploy-neuron-device-plugin)
+```bash
+helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
+  --set "npd.enabled=false"
+helm upgrade --install neuron-helm-chart oci://public.ecr.aws/neuron/neuron-helm-chart \
+  --set "scheduler.enabled=true" \
+  --set "npd.enabled=false"
+```
+* Deploy the Karpenter `NodeClass` and `NodePool` that provision `trn1` instances upon request (`nodeSelector:karpenter.sh/nodepool: amd-neuron-trn1`)
+```bash
+kubectl apply -f specs/amd-neuron-trn1-nodepool.yaml
+```
+* Deploy the S3 CSI driver PersistentVolume and PersistentVolumeClaim that store the compiled Flux graphs.
+Edit `bucketName` to match the bucket you created; `accessModes` is set to `ReadWriteMany` because we demonstrate both graph compilation (upload) and serving (download).
+Note the `PersistentVolumeClaim` name; we will need it for the app deployment.
+```bash
+kubectl apply -f specs/flux-model-s3-storage.yaml
+```
+* Compile the model for the required shapes. We will demonstrate three shapes: 1024x576, 256x144, and 512x512 with `bfloat16`
+```bash
+kubectl apply -f specs/compile-flux-1024x576.yaml
+kubectl apply -f specs/compile-flux-256x144.yaml
+kubectl apply -f specs/compile-flux-512x512.yaml
+```
+Note the three pending Jobs that Karpenter seeks to fulfill. The current setup requires `aws.amazon.com/neuron: 8`, which is half of a `trn1.32xlarge`, so expect two `trn1.32xlarge` instances to be launched.
+* Deploy the Flux serving backend, which loads the model from HuggingFace, uses the precompiled Neuron model graph from S3, and stands by for inference requests. The backend includes Deployment Pods and Services that route inference requests to the Pods, so each model-api shape scales horizontally (see the request sketch below).
+```bash
+kubectl apply -f specs/flux-neuron-1024x576-model-api.yaml
+kubectl apply -f specs/flux-neuron-256x144-model-api.yaml
+kubectl apply -f specs/flux-neuron-512x512-model-api.yaml
+```
+Note the three pending Deployment Pods that Karpenter seeks to fulfill. The current setup requires either `aws.amazon.com/neuron: 8`, which is half of a `trn1.32xlarge`, or `aws.amazon.com/neuron: 6`, which is a full `inf2.24xlarge` instance.
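+
+Each backend Service exposes the `/generate` contract implemented in `app/flux_model_api.py`: a JSON body with `prompt` and `num_inference_steps`, and a JSON response containing a base64-encoded PNG plus the server-side execution time. The sketch below is illustrative only; the Service DNS name and port are assumptions inferred from the `FLUX_NEURON_512X512_MODEL_API_SERVICE_HOST`/`_PORT` variables the Gradio frontend consumes, so check `kubectl get svc` for the real values.
+
+```python
+import base64
+import requests
+
+# Assumed in-cluster Service name and port for the 512x512 shape; verify with `kubectl get svc`.
+url = "http://flux-neuron-512x512-model-api:8000/generate"
+payload = {"prompt": "A cat holding a sign that says hello world", "num_inference_steps": 10}
+
+resp = requests.post(url, json=payload, timeout=120)
+resp.raise_for_status()
+data = resp.json()
+
+# The image comes back base64-encoded, exactly as the Gradio frontend decodes it.
+with open("flux-512x512.png", "wb") as f:
+    f.write(base64.b64decode(data["image"]))
+print(f"server-side execution time: {data['execution_time']:.2f}s")
+```
+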
+* Deploy the Flux serving frontend that includes the Gradio app.
+```bash
+kubectl apply -f specs/flux-neuron-gradio.yaml
+kubectl apply -f specs/flux-neuron-ingress.yaml
+```
+Discover the model serving endpoint:
+```bash
+kubectl get ingress
+NAME          CLASS   HOSTS   ADDRESS                                             PORTS   AGE
+flux-neuron   alb     *       flux-neuron-658286526.us-west-2.elb.amazonaws.com   80      7h20m
+```
+
+Use [flux-neuron-658286526.us-west-2.elb.amazonaws.com/serve/](flux-neuron-658286526.us-west-2.elb.amazonaws.com/serve/)
+![Figure 1-Quality tests](./figures/flux-quality-test.png)
+*Figure 1-Quality tests after deploying the model on Inf2 instances (similar results were observed on Trn1 instances)*
+
+We benchmarked the Flux serving latency on Trn1 and Inf2 by configuring the `nodeSelector` with the [Trn1 Karpenter nodepool](./specs/amd-neuron-trn1-nodepool.yaml) and the [Inf2 Karpenter nodepool](./specs/amd-neuron-inf2-nodepool.yaml) and launching the benchmark script [./app/benchmark-flux.py](./app/benchmark-flux.py), which already runs after the [compile job](./specs/compile-flux-256x144.yaml) completes. Below are the results pulled with `kubectl logs ...`:
+```
+RESULT FOR flux1-dev-50runs with dim 1024x576 on amd-neuron-trn1;num_inference_steps:10: Latency P0=3118.2 Latency P50=3129.3 Latency P90=3141.3 Latency P95=3147.2 Latency P99=3162.2 Latency P100=3162.2
+
+RESULT FOR flux1-dev-50runs with dim 256x144 on amd-neuron-trn1;num_inference_steps:10: Latency P0=585.6 Latency P50=588.1 Latency P90=592.2 Latency P95=592.5 Latency P99=597.8 Latency P100=597.8
+
+RESULT FOR flux1-dev-50runs with dim 1024x576 on amd-neuron-inf2;num_inference_steps:10: Latency P0=9040.2 Latency P50=9080.8 Latency P90=9115.6 Latency P95=9120.6 Latency P99=9123.7 Latency P100=9123.7
+
+RESULT FOR flux1-dev-50runs with dim 256x144 on amd-neuron-inf2;num_inference_steps:10: Latency P0=3067.9 Latency P50=3075.7 Latency P90=3079.8 Latency P95=3081.2 Latency P99=3088.5 Latency P100=3088.5
+```
+
+| **Dimension** | **Platform** | **Latency P0 (ms)** | **Latency P50 (ms)** | **Latency P90 (ms)** | **Latency P95 (ms)** | **Latency P99 (ms)** | **Latency P100 (ms)** |
+|---------------|-------------:|--------------------:|---------------------:|---------------------:|---------------------:|---------------------:|----------------------:|
+| 1024x576 | Trn1 | 3118.2 | 3129.3 | 3141.3 | 3147.2 | 3162.2 | 3162.2 |
+| 256x144 | Trn1 | 585.6 | 588.1 | 592.2 | 592.5 | 597.8 | 597.8 |
+| 1024x576 | Inf2 | 9040.2 | 9080.8 | 9115.6 | 9120.6 | 9123.7 | 9123.7 |
+| 256x144 | Inf2 | 3067.9 | 3075.7 | 3079.8 | 3081.2 | 3088.5 | 3088.5 |
+
+*Table 1-Latency benchmark (ms) between 8 Trn1 Neuron devices and 6 Inf2 Neuron devices with 10 inference steps over 50 iterations*
diff --git a/flux_serve/app/Dockerfile-assets b/flux_serve/app/Dockerfile-assets
new file mode 100644
index 0000000..e934b72
--- /dev/null
+++ b/flux_serve/app/Dockerfile-assets
@@ -0,0 +1,16 @@
+ARG image
+
+FROM public.ecr.aws/docker/library/python:latest as base
+RUN apt-get update -y --fix-missing
+RUN apt-get install -y python3-venv g++ gettext-base jq bc
+RUN mkdir -p /etc/apt/keyrings/
+RUN curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
+RUN echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /' | tee /etc/apt/sources.list.d/kubernetes.list
+RUN apt-get update
+RUN apt-get install -y kubectl
+RUN kubectl 
version --client +RUN python -m pip install wget +RUN python -m pip install awscli +RUN pip install boto3 +RUN mkdir /root/.aws +ADD config /root/.aws diff --git a/flux_serve/app/Dockerfile.template b/flux_serve/app/Dockerfile.template new file mode 100644 index 0000000..2d79b7e --- /dev/null +++ b/flux_serve/app/Dockerfile.template @@ -0,0 +1,21 @@ +FROM $BASE_IMAGE as base + +RUN apt-get update --fix-missing +RUN apt-get install -y apt-transport-https ca-certificates curl gpg net-tools gettext-base python3-venv g++ +RUN python -m pip install wget +RUN python -m pip install awscli +RUN python -m pip install gradio +RUN python -m pip install "uvicorn[standard]" +RUN python -m pip install fastapi +RUN python -m pip install matplotlib Pillow +RUN pip install diffusers==0.30.3 sentencepiece +RUN pip install httpx + +RUN apt-get update +RUN mkdir -p /etc/apt/keyrings/ +RUN curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg +RUN echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /' | tee /etc/apt/sources.list.d/kubernetes.list +RUN apt-get update +RUN apt-get install -y kubectl +RUN kubectl version --client +COPY . . diff --git a/flux_serve/app/benchmark-flux.py b/flux_serve/app/benchmark-flux.py new file mode 100644 index 0000000..6be481f --- /dev/null +++ b/flux_serve/app/benchmark-flux.py @@ -0,0 +1,266 @@ +import math +import time +import argparse +import torch +import torch.nn as nn +import torch_neuronx +import neuronx_distributed +import os + +from diffusers import FluxPipeline +from diffusers.models.modeling_outputs import Transformer2DModelOutput +from typing import Any, Dict, Optional, Union +from huggingface_hub import login +from huggingface_hub import whoami + +nodepool=os.environ['NODEPOOL'] +model_id=os.environ['MODEL_ID'] +hf_token=os.environ['HUGGINGFACE_TOKEN'].strip() +height=int(os.environ['HEIGHT']) +width=int(os.environ['WIDTH']) +max_sequence_length=int(os.environ['MAX_SEQ_LEN']) +guidance_scale=float(os.environ['GUIDANCE_SCALE']) + +prompt= "A cat holding a sign that says hello world" +num_inference_steps=10 + +hf_token=os.environ['HUGGINGFACE_TOKEN'].strip() +try: + user_info = whoami() + print(f"Already logged in as {user_info['name']}") +except: + login(hf_token,add_to_git_credential=True) + +COMPILER_WORKDIR_ROOT = os.environ['COMPILER_WORKDIR_ROOT'] + +TEXT_ENCODER_PATH = os.path.join( + COMPILER_WORKDIR_ROOT, + 'text_encoder_1/compiled_model/model.pt') +TEXT_ENCODER_2_PATH = os.path.join( + COMPILER_WORKDIR_ROOT, + 'text_encoder_2/compiled_model/model.pt') +VAE_DECODER_PATH = os.path.join( + COMPILER_WORKDIR_ROOT, + 'decoder/compiled_model/model.pt') + +TEXT_ENCODER_2_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'text_encoder_2/compiled_model/text_encoder_2') +EMBEDDERS_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'transformer/compiled_model/embedders') +OUT_LAYERS_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'transformer/compiled_model/out_layers') +SINGLE_TRANSFORMER_BLOCKS_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'transformer/compiled_model/single_transformer_blocks') +TRANSFORMER_BLOCKS_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'transformer/compiled_model/transformer_blocks') + +class CustomFluxPipeline(FluxPipeline): + @property + def _execution_device(self): + return torch.device("cpu") + +class TextEncoder2Wrapper(nn.Module): + def __init__(self, sharded_model,dtype=torch.bfloat16): + super().__init__() 
+ self.sharded_model = sharded_model + self.dtype = dtype + + def forward(self, input_ids, output_hidden_states=False, **kwargs): + attention_mask = (input_ids != 0).long() + output = self.sharded_model(input_ids,attention_mask) + last_hidden_state = output[0] + processed_output = last_hidden_state + return (processed_output,) + +class NeuronFluxTransformer2DModel(nn.Module): + def __init__( + self, + config, + x_embedder, + context_embedder + ): + super().__init__() + with torch_neuronx.experimental.neuron_cores_context(start_nc=0, nc_count=8): + self.embedders_model = neuronx_distributed.trace.parallel_model_load(EMBEDDERS_DIR) + self.out_layers_model = neuronx_distributed.trace.parallel_model_load(OUT_LAYERS_DIR) + self.transformer_blocks_model = neuronx_distributed.trace.parallel_model_load(TRANSFORMER_BLOCKS_DIR) + self.single_transformer_blocks_model = neuronx_distributed.trace.parallel_model_load(SINGLE_TRANSFORMER_BLOCKS_DIR) + + self.config = config + self.x_embedder = x_embedder + self.context_embedder = context_embedder + self.device = torch.device("cpu") + + def forward( + self, + hidden_states: torch.Tensor, + encoder_hidden_states: torch.Tensor = None, + pooled_projections: torch.Tensor = None, + timestep: torch.LongTensor = None, + img_ids: torch.Tensor = None, + txt_ids: torch.Tensor = None, + guidance: torch.Tensor = None, + joint_attention_kwargs: Optional[Dict[str, Any]] = None, + return_dict: bool = False, + ) -> Union[torch.FloatTensor, Transformer2DModelOutput]: + + hidden_states = self.x_embedder(hidden_states) + + hidden_states, temb, image_rotary_emb = self.embedders_model( + hidden_states, + timestep, + guidance, + pooled_projections, + txt_ids, + img_ids + ) + + encoder_hidden_states = self.context_embedder(encoder_hidden_states) + + image_rotary_emb = image_rotary_emb.type(torch.bfloat16) + + encoder_hidden_states, hidden_states = self.transformer_blocks_model( + hidden_states, + encoder_hidden_states, + temb, + image_rotary_emb + ) + + hidden_states = torch.cat([encoder_hidden_states, hidden_states], + dim=1) + + hidden_states = self.single_transformer_blocks_model( + hidden_states, + temb, + image_rotary_emb + ) + + hidden_states = hidden_states.to(torch.bfloat16) + + return self.out_layers_model( + hidden_states, + encoder_hidden_states, + temb + ) + + +class NeuronFluxCLIPTextEncoderModel(nn.Module): + def __init__(self, dtype, encoder): + super().__init__() + self.dtype = dtype + self.encoder = encoder + self.device = torch.device("cpu") + + def forward(self, emb, output_hidden_states): + output = self.encoder(emb) + output = CLIPEncoderOutput(output) + return output + + +class CLIPEncoderOutput(): + def __init__(self, dictionary): + self.pooler_output = dictionary["pooler_output"] + +def load_model( + prompt, + height, + width, + max_sequence_length, + num_inference_steps): + with torch_neuronx.experimental.neuron_cores_context(start_nc=8): + pipe = CustomFluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16) + + with torch_neuronx.experimental.neuron_cores_context(start_nc=8): + pipe.text_encoder = NeuronFluxCLIPTextEncoderModel( + pipe.text_encoder.dtype, + torch.jit.load(TEXT_ENCODER_PATH)) + + with torch_neuronx.experimental.neuron_cores_context(start_nc=8): + pipe.vae.decoder = torch.jit.load(VAE_DECODER_PATH) + + with torch_neuronx.experimental.neuron_cores_context(start_nc=0, nc_count=8): + sharded_text_encoder_2 = neuronx_distributed.trace.parallel_model_load(TEXT_ENCODER_2_DIR) + pipe.text_encoder_2 = 
TextEncoder2Wrapper(sharded_text_encoder_2) + + pipe.transformer = NeuronFluxTransformer2DModel( + pipe.transformer.config, + pipe.transformer.x_embedder, + pipe.transformer.context_embedder) + + return pipe + +def benchmark(n_runs, test_name, model, model_inputs): + if not isinstance(model_inputs, tuple): + model_inputs = model_inputs + + warmup_run = model(**model_inputs) + + latency_collector = LatencyCollector() + + for _ in range(n_runs): + latency_collector.pre_hook() + res = model(**model_inputs) + image=res.images[0] + #image.save(os.path.join("/tmp", "flux-dev.png")) + latency_collector.hook() + + p0_latency_ms = latency_collector.percentile(0) * 1000 + p50_latency_ms = latency_collector.percentile(50) * 1000 + p90_latency_ms = latency_collector.percentile(90) * 1000 + p95_latency_ms = latency_collector.percentile(95) * 1000 + p99_latency_ms = latency_collector.percentile(99) * 1000 + p100_latency_ms = latency_collector.percentile(100) * 1000 + + report_dict = dict() + report_dict["Latency P0"] = f'{p0_latency_ms:.1f}' + report_dict["Latency P50"]=f'{p50_latency_ms:.1f}' + report_dict["Latency P90"]=f'{p90_latency_ms:.1f}' + report_dict["Latency P95"]=f'{p95_latency_ms:.1f}' + report_dict["Latency P99"]=f'{p99_latency_ms:.1f}' + report_dict["Latency P100"]=f'{p100_latency_ms:.1f}' + + report = f'RESULT FOR {test_name}:' + for key, value in report_dict.items(): + report += f' {key}={value}' + print(report) + return report + +class LatencyCollector: + def __init__(self): + self.start = None + self.latency_list = [] + + def pre_hook(self, *args): + self.start = time.time() + + def hook(self, *args): + self.latency_list.append(time.time() - self.start) + + def percentile(self, percent): + latency_list = self.latency_list + pos_float = len(latency_list) * percent / 100 + max_pos = len(latency_list) - 1 + pos_floor = min(math.floor(pos_float), max_pos) + pos_ceil = min(math.ceil(pos_float), max_pos) + latency_list = sorted(latency_list) + return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor] + +def text2img(prompt,num_inference_steps): + start_time = time.time() + model_args={'prompt':prompt,'height':height,'width':width,'max_sequence_length':max_sequence_length,'num_inference_steps': int(num_inference_steps),'guidance_scale':guidance_scale} + image = model(**model_args).images[0] + total_time = time.time()-start_time + return image, str(total_time) + + +model=load_model(prompt,height,width,max_sequence_length,num_inference_steps) +model_inputs={'prompt':prompt,'height':height,'width':width,'max_sequence_length':max_sequence_length,'num_inference_steps': num_inference_steps,'guidance_scale':guidance_scale} +test_name=f"flux1-dev-50runs with dim {height}x{width} on {nodepool};num_inference_steps:{num_inference_steps}" +benchmark(50,test_name,model,model_inputs) diff --git a/flux_serve/app/build-assets.sh b/flux_serve/app/build-assets.sh new file mode 100755 index 0000000..ee2439a --- /dev/null +++ b/flux_serve/app/build-assets.sh @@ -0,0 +1,50 @@ +#!/bin/bash -x +DLC_ECR_ACCOUNT="763104351884" + +DLC_NEURON_IMAGE="pytorch-inference-neuronx" +DLC_NEURON_TAG=$BASE_IMAGE_TAG +DLC_NEURON_ECR=$DLC_ECR_ACCOUNT".dkr.ecr.us-west-2.amazonaws.com" +if [ "$IMAGE_TAG" == "amd64-neuron" ]; then + docker logout + aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin $DLC_NEURON_ECR + docker pull $DLC_NEURON_ECR/$DLC_NEURON_IMAGE:$DLC_NEURON_TAG + dlc_xla_image_id=$(docker images | grep $DLC_ECR_ACCOUNT | grep $DLC_NEURON_IMAGE 
| awk '{print $3}') + docker tag $dlc_xla_image_id $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$BASE_REPO:$DLC_NEURON_TAG + docker logout + aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$BASE_REPO + docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$BASE_REPO:$DLC_NEURON_TAG +fi +DLC_CUDA_IMAGE="pytorch-inference" +DLC_CUDA_TAG=$BASE_IMAGE_TAG +DLC_CUDA_ECR=$DLC_ECR_ACCOUNT".dkr.ecr.us-east-1.amazonaws.com" +if [ "$IMAGE_TAG" == "amd64-cuda" ]; then + docker logout + aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $DLC_CUDA_ECR + docker pull $DLC_CUDA_ECR/$DLC_CUDA_IMAGE:$DLC_CUDA_TAG + dlc_cuda_image_id=$(docker images | grep $DLC_ECR_ACCOUNT | grep $DLC_CUDA_TAG | awk '{print $3}') + docker tag $dlc_cuda_image_id $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$BASE_REPO:$DLC_CUDA_TAG + docker logout + aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$BASE_REPO + docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$BASE_REPO:$DLC_CUDA_TAG +fi +DLC_ARM_CPU_IMAGE="pytorch-inference-graviton" +DLC_ARM_CPU_TAG=$BASE_IMAGE_TAG +DLC_ARM_CPU_ECR=$DLC_ECR_ACCOUNT".dkr.ecr.us-east-1.amazonaws.com" +if [ "$IMAGE_TAG" == "aarch64-cpu" ]; then + docker logout + aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $DLC_ARM_CPU_ECR + docker pull $DLC_ARM_CPU_ECR/$DLC_ARM_CPU_IMAGE:$DLC_ARM_CPU_TAG + dlc_arm_cpu_image_id=$(docker images | grep $DLC_ECR_ACCOUNT | grep $DLC_ARM_CPU_TAG | awk '{print $3}') + docker tag $dlc_arm_cpu_image_id $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$BASE_REPO:$DLC_ARM_CPU_TAG + docker logout + aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$BASE_REPO + docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$BASE_REPO:$DLC_ARM_CPU_TAG +fi + +docker images + +ASSETS="-assets" +export IMAGE=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$BASE_REPO:$IMAGE_TAG$ASSETS +aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $IMAGE +docker build -t $IMAGE --build-arg ai_chip=$IMAGE_TAG -f Dockerfile-assets . +docker push $IMAGE diff --git a/flux_serve/app/build.sh b/flux_serve/app/build.sh new file mode 100755 index 0000000..3beda15 --- /dev/null +++ b/flux_serve/app/build.sh @@ -0,0 +1,22 @@ +#!/bin/bash -x + +export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account) + +ASSETS="-assets" +export BASE_IMAGE=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$BASE_REPO:$BASE_IMAGE_TAG +export ASSETS_IMAGE=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$BASE_REPO:$IMAGE_TAG$ASSETS +export IMAGE=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$BASE_REPO:$IMAGE_TAG + +#if [ "$IMAGE_TAG" == "1.13.1-neuronx-py310-sdk2.17.0-ubuntu20.04" ]; then +# docker tag $dlc_xla_image_id $BASE_IMAGE +#fi +#if [ "$IMAGE_TAG" == "2.0.1-gpu-py310-cu118-ubuntu20.04-ec2" ]; then +# docker tag $dlc_cuda_image_id $BASE_IMAGE +#fi +#docker images + +cat Dockerfile.template | envsubst > Dockerfile +cat Dockerfile +aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $BASE_IMAGE +docker build -t $IMAGE . 
+docker push $IMAGE diff --git a/flux_serve/app/config b/flux_serve/app/config new file mode 100644 index 0000000..539a043 --- /dev/null +++ b/flux_serve/app/config @@ -0,0 +1,4 @@ +[default] +retry_mode = adaptive +max_attempts = 10 +region = us-west-2 diff --git a/flux_serve/app/cova_gradio.py b/flux_serve/app/cova_gradio.py new file mode 100644 index 0000000..e9f22a1 --- /dev/null +++ b/flux_serve/app/cova_gradio.py @@ -0,0 +1,123 @@ +import gradio as gr +import requests +from PIL import Image +import io +import os +from fastapi import FastAPI +import base64 +import asyncio +import httpx + +app = FastAPI() + +models = [ + { + 'name': '512x512', + 'host_env': 'FLUX_NEURON_512X512_MODEL_API_SERVICE_HOST', + 'port_env': 'FLUX_NEURON_512X512_MODEL_API_SERVICE_PORT', + 'height': 512, + 'width': 512 + } +] + +for model in models: + host = os.environ[model['host_env']] + port = os.environ[model['port_env']] + model['url'] = f"http://{host}:{port}/generate" + +async def fetch_image(client, url, prompt, num_inference_steps): + payload = { + "prompt": prompt, + "num_inference_steps": int(num_inference_steps) + } + try: + response = await client.post(url, json=payload, timeout=60.0) + response.raise_for_status() + data = response.json() + image_bytes = base64.b64decode(data['image']) + image = Image.open(io.BytesIO(image_bytes)) + execution_time = data.get('execution_time', 0) + return image, f"{execution_time:.2f} seconds" + except httpx.RequestError as e: + return None, f"Request Error: {str(e)}" + except Exception as e: + return None, f"Error: {str(e)}" + +async def call_model_api(prompt, num_inference_steps): + async with httpx.AsyncClient() as client: + tasks = [ + fetch_image(client, model['url'], prompt, num_inference_steps) + for model in models + ] + results = await asyncio.gather(*tasks) + images = [] + exec_times = [] + for image, exec_time in results: + images.append(image) + exec_times.append(exec_time) + return images + exec_times + +@app.get("/health") +def healthy(): + return {"message": "Service is healthy"} + +@app.get("/readiness") +def ready(): + return {"message": "Service is ready"} + +with gr.Blocks() as interface: + gr.Markdown(f"# Image Generation App") + gr.Markdown("Enter a prompt and specify the number of inference steps to generate images in different shapes.") + + with gr.Row(): + with gr.Column(scale=1): + prompt = gr.Textbox( + label="Prompt", + lines=1, + placeholder="Enter your prompt here...", + elem_id="prompt-box" + ) + inference_steps = gr.Number( + label="Inference Steps", + value=10, + precision=0, + info="Enter the number of inference steps; higher number takes more time but produces better image", + elem_id="steps-number" + ) + generate_button = gr.Button("Generate Images", variant="primary") + + with gr.Column(scale=2): + image_components = [] + exec_time_components = [] + + with gr.Row(): + for idx, model in enumerate(models): + with gr.Column(): + # Title + gr.Markdown(f"**{model['name']}**") + + # Scale down the image + preview_height = int(model['height'] / 2) + preview_width = int(model['width'] / 2) + + img = gr.Image( + label="", + height=preview_height, + width=preview_width, + interactive=False + ) + # Use Markdown for simpler smaller text + exec_time = gr.Markdown(value="") + + image_components.append(img) + exec_time_components.append(exec_time) + + # callback for the button + generate_button.click( + fn=call_model_api, + inputs=[prompt, inference_steps], + outputs=image_components + exec_time_components, + api_name="generate_images" + ) + +app 
= gr.mount_gradio_app(app, interface, path="/serve") diff --git a/flux_serve/app/cova_gradio_m.py b/flux_serve/app/cova_gradio_m.py new file mode 100644 index 0000000..000c82a --- /dev/null +++ b/flux_serve/app/cova_gradio_m.py @@ -0,0 +1,139 @@ +import os, io, base64, asyncio, json +from typing import Tuple, List +import gradio as gr +import httpx +from PIL import Image +from fastapi import FastAPI + +app = FastAPI() + +# --------------------------------------------------------------------------- +# ❶ Model definitions -– extend with more rows if you deploy more shapes later +# --------------------------------------------------------------------------- +IMAGE_MODELS = [ + dict( + name="512 × 512", + host_env="FLUX_NEURON_512X512_MODEL_API_SERVICE_HOST", + port_env="FLUX_NEURON_512X512_MODEL_API_SERVICE_PORT", + height=512, + width=512, + # caption backend for this image size + caption_host_env="MLLAMA_32_11B_VLLM_TRN1_SERVICE_HOST", + caption_port_env="MLLAMA_32_11B_VLLM_TRN1_SERVICE_PORT", + # number of caption tokens to ask for per request + caption_max_new_tokens=64, + ) +] + +for m in IMAGE_MODELS: + m["image_url"] = f'http://{os.environ[m["host_env"]]}:{os.environ[m["port_env"]]}/generate' + m["caption_url"] = f'http://{os.environ[m["caption_host_env"]]}:{os.environ[m["caption_port_env"]]}/generate' + +# --------------------------------------------------------------------------- +# ❷ Helpers +# --------------------------------------------------------------------------- +async def post_json(client: httpx.AsyncClient, url: str, payload: dict, timeout: float = 60.0): + """Small wrapper that returns (json, elapsed_seconds) or raises.""" + start = asyncio.get_event_loop().time() + r = await client.post(url, json=payload, timeout=timeout) + r.raise_for_status() + elapsed = asyncio.get_event_loop().time() - start + return r.json(), elapsed + +def pil_to_base64(img: Image.Image) -> str: + buf = io.BytesIO() + img.save(buf, format="PNG") + return base64.b64encode(buf.getvalue()).decode() + +async def fetch_end_to_end( + client : httpx.AsyncClient, + model_cfg : dict, + prompt : str, + num_steps : int +) -> Tuple[Image.Image, str, str]: + """ + Returns (image, latency_str, caption_str, caption_latency_str) + """ + # ① Generate the image + img_payload = {"prompt": prompt, "num_inference_steps": int(num_steps)} + img_json, img_latency = await post_json(client, model_cfg["image_url"], img_payload) + image = Image.open(io.BytesIO(base64.b64decode(img_json["image"]))) + + # ② Ask vLLM to describe that image + img_b64 = pil_to_base64(image) + caption_prompt = f"Describe the content of this image (base64 PNG follows): {img_b64}" + cap_payload = {"prompt": caption_prompt, + "max_new_tokens": model_cfg["caption_max_new_tokens"]} + cap_json, cap_latency = await post_json(client, model_cfg["caption_url"], cap_payload) + caption = base64.b64decode(cap_json["text"]).decode() + + return image, f"{img_latency:.2f}s", caption, f"{cap_latency:.2f}s" + +async def orchestrate_calls(prompt: str, num_steps: int): + async with httpx.AsyncClient() as client: + tasks = [fetch_end_to_end(client, cfg, prompt, num_steps) for cfg in IMAGE_MODELS] + results = await asyncio.gather(*tasks) + + # Flatten results for gradio → [img, img_lat, caption, cap_lat] * N + flat: List = [] + for tup in results: + flat.extend(tup) + return flat + +# --------------------------------------------------------------------------- +# ❸ Gradio UI +# --------------------------------------------------------------------------- +with gr.Blocks() as 
interface: + gr.Markdown("# ⚡ Flux Image-Gen + vLLM Caption Demo") + gr.Markdown("Enter a text prompt ➜ model draws an image ➜ LLM describes the image.") + + with gr.Row(): + # user controls + with gr.Column(scale=1): + prompt_in = gr.Textbox(lines=1, label="Prompt") + steps_in = gr.Number(label="Inference Steps", value=10, precision=0) + btn_generate = gr.Button("Generate", variant="primary") + + # results + with gr.Column(scale=2): + img_out_components: list = [] + img_lat_components: list = [] + cap_out_components: list = [] + cap_lat_components: list = [] + + for cfg in IMAGE_MODELS: + with gr.Group(): + gr.Markdown(f"### {cfg['name']}") + img = gr.Image(height=cfg["height"]//2, + width=cfg["width"]//2, + interactive=False) + lat = gr.Markdown() + cap = gr.Markdown() + cap_lat = gr.Markdown() + img_out_components.append(img) + img_lat_components.append(lat) + cap_out_components.append(cap) + cap_lat_components.append(cap_lat) + + # wire them all up + btn_generate.click( + orchestrate_calls, + inputs=[prompt_in, steps_in], + outputs=( + img_out_components + + img_lat_components + + cap_out_components + + cap_lat_components + ), + api_name="generate_and_caption", + ) + +app = gr.mount_gradio_app(app, interface, path="/serve") + +@app.get("/health") +def healthy(): + return {"message": "Service is healthy"} + +@app.get("/readiness") +def ready(): + return {"message": "Service is ready"} diff --git a/flux_serve/app/download_hf_model.py b/flux_serve/app/download_hf_model.py new file mode 100644 index 0000000..202a32a --- /dev/null +++ b/flux_serve/app/download_hf_model.py @@ -0,0 +1,9 @@ +from huggingface_hub import login,snapshot_download +import os +repo_id=os.environ['MODEL_ID'] +os.environ['NEURON_COMPILED_ARTIFACTS']=repo_id +hf_token=os.environ['HUGGINGFACE_TOKEN'].strip() +login(hf_token, add_to_git_credential=True) +snapshot_download(repo_id=repo_id,local_dir="/"+repo_id,token=hf_token) +print(f"Repository '{repo_id}' downloaded to '{repo_id}'.") + diff --git a/flux_serve/app/flux_gradio.py b/flux_serve/app/flux_gradio.py new file mode 100644 index 0000000..36696cb --- /dev/null +++ b/flux_serve/app/flux_gradio.py @@ -0,0 +1,177 @@ +import gradio as gr +import requests +from PIL import Image +import io +import os +from fastapi import FastAPI +import base64 +import asyncio +import httpx + +app = FastAPI() + +model_id=os.environ['MODEL_ID'] + +models = [ + { + 'name': '256x144', + 'host_env': 'FLUX_NEURON_256X144_MODEL_API_SERVICE_HOST', + 'port_env': 'FLUX_NEURON_256X144_MODEL_API_SERVICE_PORT', + 'height': 256, + 'width': 144 + }, + { + 'name': '1024x576', + 'host_env': 'FLUX_NEURON_1024X576_MODEL_API_SERVICE_HOST', + 'port_env': 'FLUX_NEURON_1024X576_MODEL_API_SERVICE_PORT', + 'height': 1024, + 'width': 576 + }, + { + 'name': '512x512', + 'host_env': 'FLUX_NEURON_512X512_MODEL_API_SERVICE_HOST', + 'port_env': 'FLUX_NEURON_512X512_MODEL_API_SERVICE_PORT', + 'height': 512, + 'width': 512 + } +] + +for model in models: + host = os.environ[model['host_env']] + port = os.environ[model['port_env']] + model['url'] = f"http://{host}:{port}/generate" + +async def fetch_image(client, url, prompt, num_inference_steps): + payload = { + "prompt": prompt, + "num_inference_steps": int(num_inference_steps) + } + try: + response = await client.post(url, json=payload, timeout=60.0) + response.raise_for_status() + data = response.json() + image_bytes = base64.b64decode(data['image']) + image = Image.open(io.BytesIO(image_bytes)) + execution_time = data.get('execution_time', 0) + return image, 
f"{execution_time:.2f} seconds" + except httpx.RequestError as e: + return None, f"Request Error: {str(e)}" + except Exception as e: + return None, f"Error: {str(e)}" + +async def call_model_api(prompt, num_inference_steps): + async with httpx.AsyncClient() as client: + tasks = [ + fetch_image(client, model['url'], prompt, num_inference_steps) + for model in models + ] + results = await asyncio.gather(*tasks) + images = [] + exec_times = [] + for image, exec_time in results: + images.append(image) + exec_times.append(exec_time) + return images + exec_times + +@app.get("/health") +def healthy(): + return {"message": "Service is healthy"} + +@app.get("/readiness") +def ready(): + return {"message": "Service is ready"} +''' +with gr.Blocks() as interface: + gr.Markdown(f"# {model_id} Image Generation App") + gr.Markdown("Enter a prompt and specify the number of inference steps to generate images in different shapes.") + + with gr.Row(): + with gr.Column(scale=1): + prompt = gr.Textbox(label="Prompt", lines=1, placeholder="Enter your prompt here...",elem_id="prompt-box") + inference_steps = gr.Number( + label="Inference Steps", + value=10, + precision=0, + info="Enter the number of inference steps; higher number takes more time but produces better image", + elem_id="steps-number" + ) + generate_button = gr.Button("Generate Images",variant="primary") + + with gr.Column(scale=2): + image_components = [] + exec_time_components = [] + + with gr.Row(equal_height=True): + for idx, model in enumerate(models): + with gr.Column(scale=1,min_width=300): + img = gr.Image(label=f"{model['name']}",height=model['height'],width=model['width'],interactive=False) + exec_time = gr.Textbox(label=f"Execution Time ({model['name']})",interactive=False,lines=1,placeholder="Execution time will appear here...") + image_components.append(img) + exec_time_components.append(exec_time) + + # callback for the button + generate_button.click( + fn=call_model_api, + inputs=[prompt, inference_steps], + outputs=image_components + exec_time_components, + api_name="generate_images" + ) +''' +with gr.Blocks() as interface: + gr.Markdown(f"# {model_id} Image Generation App") + gr.Markdown("Enter a prompt and specify the number of inference steps to generate images in different shapes.") + + with gr.Row(): + with gr.Column(scale=1): + prompt = gr.Textbox( + label="Prompt", + lines=1, + placeholder="Enter your prompt here...", + elem_id="prompt-box" + ) + inference_steps = gr.Number( + label="Inference Steps", + value=10, + precision=0, + info="Enter the number of inference steps; higher number takes more time but produces better image", + elem_id="steps-number" + ) + generate_button = gr.Button("Generate Images", variant="primary") + + with gr.Column(scale=2): + image_components = [] + exec_time_components = [] + + with gr.Row(): + for idx, model in enumerate(models): + with gr.Column(): + # Title + gr.Markdown(f"**{model['name']}**") + + # Scale down the image + preview_height = int(model['height'] / 2) + preview_width = int(model['width'] / 2) + + img = gr.Image( + label="", + height=preview_height, + width=preview_width, + interactive=False + ) + # Use Markdown for simpler smaller text + exec_time = gr.Markdown(value="") + + image_components.append(img) + exec_time_components.append(exec_time) + + # callback for the button + generate_button.click( + fn=call_model_api, + inputs=[prompt, inference_steps], + outputs=image_components + exec_time_components, + api_name="generate_images" + ) + +app = gr.mount_gradio_app(app, interface, 
path="/serve") + +app = gr.mount_gradio_app(app, interface, path="/serve") diff --git a/flux_serve/app/flux_model_api.py b/flux_serve/app/flux_model_api.py new file mode 100644 index 0000000..a3ac2e6 --- /dev/null +++ b/flux_serve/app/flux_model_api.py @@ -0,0 +1,342 @@ +import math +import boto3 +import time +import argparse +import torch +import torch.nn as nn +import torch_neuronx +import neuronx_distributed +import os +from fastapi import FastAPI, HTTPException +from pydantic import BaseModel, Field +from typing import Any, Dict, Optional, Union +from huggingface_hub import login +from diffusers import FluxPipeline +from diffusers.models.modeling_outputs import Transformer2DModelOutput +from starlette.responses import StreamingResponse +import base64 + +cw_namespace='hw-agnostic-infer' +cloudwatch = boto3.client('cloudwatch', region_name='us-west-2') + +# Initialize FastAPI app +app = FastAPI() + +# Environment Variables +app_name=os.environ['APP'] +nodepool=os.environ['NODEPOOL'] +model_id = os.environ['MODEL_ID'] +device = os.environ["DEVICE"] +pod_name = os.environ['POD_NAME'] +hf_token = os.environ['HUGGINGFACE_TOKEN'].strip() +height = int(os.environ['HEIGHT']) +width = int(os.environ['WIDTH']) +max_sequence_length = int(os.environ['MAX_SEQ_LEN']) +guidance_scale = float(os.environ['GUIDANCE_SCALE']) +COMPILER_WORKDIR_ROOT = os.environ['COMPILER_WORKDIR_ROOT'] + +DTYPE=torch.bfloat16 + +def cw_pub_metric(metric_name,metric_value,metric_unit): + response = cloudwatch.put_metric_data( + Namespace=cw_namespace, + MetricData=[ + { + 'MetricName':metric_name, + 'Value':metric_value, + 'Unit':metric_unit, + }, + ] + ) + print(f"in pub_deployment_counter - response:{response}") + return response + +# Model Paths +TEXT_ENCODER_PATH = os.path.join( + COMPILER_WORKDIR_ROOT, + 'text_encoder_1/compiled_model/model.pt') +VAE_DECODER_PATH = os.path.join( + COMPILER_WORKDIR_ROOT, + 'decoder/compiled_model/model.pt') + +TEXT_ENCODER_2_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'text_encoder_2/compiled_model/text_encoder_2') +EMBEDDERS_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'transformer/compiled_model/embedders') +OUT_LAYERS_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'transformer/compiled_model/out_layers') +SINGLE_TRANSFORMER_BLOCKS_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'transformer/compiled_model/single_transformer_blocks') +TRANSFORMER_BLOCKS_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'transformer/compiled_model/transformer_blocks') + +# Login to Hugging Face +login(hf_token, add_to_git_credential=True) + +class CustomFluxPipeline(FluxPipeline): + @property + def _execution_device(self): + return torch.device("cpu") + +class TextEncoder2Wrapper(nn.Module): + def __init__(self, sharded_model,dtype=torch.bfloat16): + super().__init__() + self.sharded_model = sharded_model + self.dtype = dtype + + def forward(self, input_ids, output_hidden_states=False, **kwargs): + attention_mask = (input_ids != 0).long() + output = self.sharded_model(input_ids,attention_mask) + last_hidden_state = output[0] + processed_output = last_hidden_state + return (processed_output,) + +class GenerateImageRequest(BaseModel): + prompt: str + num_inference_steps: int + +class GenerateImageResponse(BaseModel): + image: str = Field(..., description="Base64-encoded image") + execution_time: float + +class NeuronFluxTransformer2DModel(nn.Module): + def __init__( + self, + config, + x_embedder, + context_embedder + ): + super().__init__() + with 
torch_neuronx.experimental.neuron_cores_context(start_nc=4, + nc_count=8): + self.embedders_model = \ + neuronx_distributed.trace.parallel_model_load(EMBEDDERS_DIR) + self.transformer_blocks_model = \ + neuronx_distributed.trace.parallel_model_load( + TRANSFORMER_BLOCKS_DIR) + self.single_transformer_blocks_model = \ + neuronx_distributed.trace.parallel_model_load( + SINGLE_TRANSFORMER_BLOCKS_DIR) + self.out_layers_model = \ + neuronx_distributed.trace.parallel_model_load( + OUT_LAYERS_DIR) + self.config = config + self.x_embedder = x_embedder + self.context_embedder = context_embedder + self.device = torch.device("cpu") + + def forward( + self, + hidden_states: torch.Tensor, + encoder_hidden_states: torch.Tensor = None, + pooled_projections: torch.Tensor = None, + timestep: torch.LongTensor = None, + img_ids: torch.Tensor = None, + txt_ids: torch.Tensor = None, + guidance: torch.Tensor = None, + joint_attention_kwargs: Optional[Dict[str, Any]] = None, + return_dict: bool = False, + ) -> Union[torch.FloatTensor, Transformer2DModelOutput]: + + hidden_states = self.x_embedder(hidden_states) + + hidden_states, temb, image_rotary_emb = self.embedders_model( + hidden_states, + timestep, + guidance, + pooled_projections, + txt_ids, + img_ids + ) + + encoder_hidden_states = self.context_embedder(encoder_hidden_states) + + image_rotary_emb = image_rotary_emb.type(DTYPE) + + encoder_hidden_states, hidden_states = self.transformer_blocks_model( + hidden_states, + encoder_hidden_states, + temb, + image_rotary_emb + ) + + hidden_states = torch.cat([encoder_hidden_states, hidden_states], + dim=1) + + hidden_states = self.single_transformer_blocks_model( + hidden_states, + temb, + image_rotary_emb + ) + + hidden_states = hidden_states.to(DTYPE) + + return self.out_layers_model( + hidden_states, + encoder_hidden_states, + temb + ) + + +class NeuronFluxCLIPTextEncoderModel(nn.Module): + def __init__(self, dtype, encoder): + super().__init__() + self.dtype = dtype + self.encoder = encoder + self.device = torch.device("cpu") + + def forward(self, emb, output_hidden_states): + output = self.encoder(emb) + output = CLIPEncoderOutput(output) + return output + + +class CLIPEncoderOutput(): + def __init__(self, dictionary): + self.pooler_output = dictionary["pooler_output"] + + +class NeuronFluxT5TextEncoderModel(nn.Module): + def __init__(self, dtype, encoder): + super().__init__() + self.dtype = dtype + self.encoder = encoder + self.device = torch.device("cpu") + + def forward(self, emb, output_hidden_states): + return torch.unsqueeze(self.encoder(emb)["last_hidden_state"], 1) + +def benchmark(n_runs, test_name, model, model_inputs): + if not isinstance(model_inputs, tuple): + model_inputs = model_inputs + + warmup_run = model(**model_inputs) + + latency_collector = LatencyCollector() + + for _ in range(n_runs): + latency_collector.pre_hook() + res = model(**model_inputs) + image=res.images[0] + #image.save(os.path.join("/tmp", "flux-dev.png")) + latency_collector.hook() + p0_latency_ms = latency_collector.percentile(0) * 1000 + p50_latency_ms = latency_collector.percentile(50) * 1000 + p90_latency_ms = latency_collector.percentile(90) * 1000 + p95_latency_ms = latency_collector.percentile(95) * 1000 + p99_latency_ms = latency_collector.percentile(99) * 1000 + p100_latency_ms = latency_collector.percentile(100) * 1000 + + report_dict = dict() + report_dict["Latency P0"] = f'{p0_latency_ms:.1f}' + report_dict["Latency P50"]=f'{p50_latency_ms:.1f}' + report_dict["Latency P90"]=f'{p90_latency_ms:.1f}' + 
report_dict["Latency P95"]=f'{p95_latency_ms:.1f}' + report_dict["Latency P99"]=f'{p99_latency_ms:.1f}' + report_dict["Latency P100"]=f'{p100_latency_ms:.1f}' + + report = f'RESULT FOR {test_name}:' + for key, value in report_dict.items(): + report += f' {key}={value}' + print(report) + return report + +class LatencyCollector: + def __init__(self): + self.start = None + self.latency_list = [] + + def pre_hook(self, *args): + self.start = time.time() + + def hook(self, *args): + self.latency_list.append(time.time() - self.start) + + def percentile(self, percent): + latency_list = self.latency_list + pos_float = len(latency_list) * percent / 100 + max_pos = len(latency_list) - 1 + pos_floor = min(math.floor(pos_float), max_pos) + pos_ceil = min(math.ceil(pos_float), max_pos) + latency_list = sorted(latency_list) + return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor] + +# Load the model pipeline +def load_model(): + with torch_neuronx.experimental.neuron_cores_context(start_nc=8): + pipe = CustomFluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16) + + with torch_neuronx.experimental.neuron_cores_context(start_nc=8): + pipe.text_encoder = NeuronFluxCLIPTextEncoderModel( + pipe.text_encoder.dtype, + torch.jit.load(TEXT_ENCODER_PATH)) + + with torch_neuronx.experimental.neuron_cores_context(start_nc=8): + pipe.vae.decoder = torch.jit.load(VAE_DECODER_PATH) + + with torch_neuronx.experimental.neuron_cores_context(start_nc=0, nc_count=8): + sharded_text_encoder_2 = neuronx_distributed.trace.parallel_model_load(TEXT_ENCODER_2_DIR) + pipe.text_encoder_2 = TextEncoder2Wrapper(sharded_text_encoder_2) + + pipe.transformer = NeuronFluxTransformer2DModel( + pipe.transformer.config, + pipe.transformer.x_embedder, + pipe.transformer.context_embedder) + return pipe + +model = load_model() +#Warmup +prompt= "A cat holding a sign that says hello world" +num_inference_steps=10 +model_inputs={'prompt':prompt,'height':height,'width':width,'max_sequence_length':max_sequence_length,'num_inference_steps': num_inference_steps,'guidance_scale':guidance_scale} +test_name=f"flux1-dev-50runs with dim {height}x{width} on {nodepool};num_inference_steps:{num_inference_steps}" +benchmark(10,test_name,model,model_inputs) + +# Define the image generation endpoint +@app.post("/generate", response_model=GenerateImageResponse) +def generate_image(request: GenerateImageRequest): + start_time = time.time() + try: + model_args = { + 'prompt': request.prompt, + 'height': height, + 'width': width, + 'max_sequence_length': max_sequence_length, + 'num_inference_steps': request.num_inference_steps, + 'guidance_scale': guidance_scale + } + with torch.no_grad(): + output = model(**model_args) + image = output.images[0] + # Save image to bytes + from io import BytesIO + buf = BytesIO() + image.save(buf, format='PNG') + image_bytes = buf.getvalue() + image_base64 = base64.b64encode(image_bytes).decode('utf-8') + total_time = time.time() - start_time + counter_metric=app_name+'-counter' + cw_pub_metric(counter_metric,1,'Count') + counter_metric=nodepool + cw_pub_metric(counter_metric,1,'Count') + latency_metric=app_name+'-latency' + cw_pub_metric(latency_metric,total_time,'Seconds') + return GenerateImageResponse(image=image_base64, execution_time=total_time) + except Exception as e: + raise HTTPException(status_code=500, detail=f"Image serialization failed: {img_err}") + +# Health and readiness endpoints +@app.get("/health") +def healthy(): + return {"message": 
f"{pod_name} is healthy"} + +@app.get("/readiness") +def ready(): + return {"message": f"{pod_name} is ready"} diff --git a/flux_serve/app/mllama-offline.py b/flux_serve/app/mllama-offline.py new file mode 100644 index 0000000..6f657b5 --- /dev/null +++ b/flux_serve/app/mllama-offline.py @@ -0,0 +1,113 @@ +import math +import time +import torch +import os +import sys +import yaml +import requests +from PIL import Image +from vllm import LLM, SamplingParams, TextPrompt +from neuronx_distributed_inference.models.mllama.utils import add_instruct +from huggingface_hub import create_repo,upload_folder,login,snapshot_download + +hf_token = os.environ['HUGGINGFACE_TOKEN'].strip() +repo_id=os.environ['MODEL_ID'] +os.environ['NEURON_COMPILED_ARTIFACTS']=repo_id +os.environ['VLLM_NEURON_FRAMEWORK']='neuronx-distributed-inference' +login(hf_token,add_to_git_credential=True) + +config_path = "/vllm_config.yaml" +with open(config_path, 'r') as f: + model_vllm_config_yaml = f.read() + +model_vllm_config = yaml.safe_load(model_vllm_config_yaml) + +class LatencyCollector: + def __init__(self): + self.latency_list = [] + + def record(self, latency_sec): + self.latency_list.append(latency_sec) + + def percentile(self, percent): + if not self.latency_list: + return 0.0 + latency_list = sorted(self.latency_list) + pos_float = len(latency_list) * percent / 100 + max_pos = len(latency_list) - 1 + pos_floor = min(math.floor(pos_float), max_pos) + pos_ceil = min(math.ceil(pos_float), max_pos) + return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor] + + def report(self, test_name="Batch Inference"): + print(f"\n📊 LATENCY REPORT for {test_name}") + for p in [0, 50, 90, 95, 99, 100]: + value = self.percentile(p) * 1000 + print(f"Latency P{p}: {value:.2f} ms") + + +def get_image(image_url): + image = Image.open(requests.get(image_url, stream=True).raw) + return image + +# Model Inputs +PROMPTS = ["What is in this image? Tell me a story", + "What is the recipe of mayonnaise in two sentences?" , + "Describe this image", + "What is the capital of Italy famous for?", + ] +IMAGES = [get_image("https://github.com/meta-llama/llama-models/blob/main/models/resources/dog.jpg?raw=true"), + torch.empty((0,0)), + get_image("https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_images/nxd-inference-block-diagram.jpg"), + torch.empty((0,0)), + ] +SAMPLING_PARAMS = [dict(top_k=1, temperature=1.0, top_p=1.0, max_tokens=256), + dict(top_k=1, temperature=0.9, top_p=1.0, max_tokens=256), + dict(top_k=10, temperature=0.9, top_p=0.5, max_tokens=512), + dict(top_k=10, temperature=0.75, top_p=0.5, max_tokens=1024), + ] + + +def get_VLLM_mllama_model_inputs(prompt, single_image, sampling_params): + input_image = single_image + has_image = torch.tensor([1]) + if isinstance(single_image, torch.Tensor) and single_image.numel() == 0: + has_image = torch.tensor([0]) + + instruct_prompt = add_instruct(prompt, has_image) + inputs = TextPrompt(prompt=instruct_prompt) + inputs["multi_modal_data"] = {"image": input_image} + # Create a sampling params object. + sampling_params = SamplingParams(**sampling_params) + return inputs, sampling_params + +def print_outputs(outputs): + # Print the outputs. 
+ for output in outputs: + prompt = output.prompt + generated_text = output.outputs[0].text + print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") + + +llm_model = LLM(**model_vllm_config) +latency_collector = LatencyCollector() + +assert len(PROMPTS) == len(IMAGES) == len(SAMPLING_PARAMS), \ +f"""Text, image prompts and sampling parameters should have the same batch size, + got {len(PROMPTS)}, {len(IMAGES)}, and {len(SAMPLING_PARAMS)}""" + +batched_inputs = [] +batched_sample_params = [] +for i in range(1,21): + for pmpt, img, params in zip(PROMPTS, IMAGES, SAMPLING_PARAMS): + inputs, sampling_params = get_VLLM_mllama_model_inputs(pmpt, img, params) + # test batch-size = 1 + start_time = time.time() + outputs = llm_model.generate(inputs, sampling_params) + latency_sec = time.time() - start_time + latency_collector.record(latency_sec) + print_outputs(outputs) + batched_inputs.append(inputs) + batched_sample_params.append(sampling_params) + +latency_collector.report("MLLAMA") diff --git a/flux_serve/app/src/README.md b/flux_serve/app/src/README.md new file mode 100644 index 0000000..f796700 --- /dev/null +++ b/flux_serve/app/src/README.md @@ -0,0 +1,11 @@ +# flux-enigma + +clean previous compilation +```bash +rm -rf decoder/compiler_workdir/ decoder/__pycache__/ text_encoder_1/compiler_workdir text_encoder_2/compiler_workdir/ text_encoder_2/compiler_workdir/ text_encoder_2/__pycache__/ transformer/compiler_workdir/ transformer/__pycache__/ transformer/compiled_model decoder/compiled_model text_encoder_1/compiled_model text_encoder_2/compiled_model/ text_encoder_1/__pycache__/ +``` +modify the shapes and:.... + +```bash +./compile.sh > compile.log 2>&1 & +``` diff --git a/flux_serve/app/src/compile.sh b/flux_serve/app/src/compile.sh new file mode 100755 index 0000000..75cf813 --- /dev/null +++ b/flux_serve/app/src/compile.sh @@ -0,0 +1,9 @@ +find . -type d \( -name "compiled_model" -o -name "__pycache__" -o -name "compiler_workdir" \) -exec rm -rf {} + + +python "./text_encoder_1/compile.py" +python "./text_encoder_2/compile.py" -m 32 +python "./transformer/compile.py" -hh $HEIGHT -w $WIDTH -m 32 +python "./decoder/compile.py" -hh $HEIGHT -w $WIDTH +echo "done compiling; cleaning compiler_workdir" +find . 
-type d \( -name "__pycache__" -o -name "compiler_workdir" \) -exec rm -rf {} + + diff --git a/flux_serve/app/src/decoder/compile.py b/flux_serve/app/src/decoder/compile.py new file mode 100644 index 0000000..0ee3830 --- /dev/null +++ b/flux_serve/app/src/decoder/compile.py @@ -0,0 +1,70 @@ +import argparse +import copy +import os +import torch +import torch_neuronx +from diffusers import FluxPipeline +from model import TracingVAEDecoderWrapper +from huggingface_hub import login +from huggingface_hub import whoami +hf_token=os.environ['HUGGINGFACE_TOKEN'].strip() +try: + user_info = whoami() + print(f"Already logged in as {user_info['name']}") +except: + login(hf_token,add_to_git_credential=True) + +COMPILER_WORKDIR_ROOT = os.path.dirname(__file__) +DTYPE=torch.bfloat16 + +def trace_vae(height, width): + pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.float32) + decoder = copy.deepcopy(pipe.vae.decoder) + decoder = TracingVAEDecoderWrapper(decoder) + del pipe + + latents = torch.rand([1, 16, height // 8, width // 8], + dtype=torch.bfloat16) + + decoder_neuron = torch_neuronx.trace( + decoder, + latents, + compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, + 'compiler_workdir'), + compiler_args="""--model-type=unet-inference""" + ) + + torch_neuronx.async_load(decoder_neuron) + + compiled_model_path = os.path.join(COMPILER_WORKDIR_ROOT, 'compiled_model') + if not os.path.exists(compiled_model_path): + os.mkdir(compiled_model_path) + decoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, + 'compiled_model/model.pt') + torch.jit.save(decoder_neuron, decoder_filename) + + del decoder + del decoder_neuron + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument( + "-hh", + "--height", + type=int, + default=1024, + help="height of images to be generated by compilation of this model" + ) + parser.add_argument( + "-w", + "--width", + type=int, + default=1024, + help="width of images to be generated by compilation of this model" + ) + args = parser.parse_args() + trace_vae(args.height, args.width) + diff --git a/flux_serve/app/src/decoder/model.py b/flux_serve/app/src/decoder/model.py new file mode 100644 index 0000000..43c24f6 --- /dev/null +++ b/flux_serve/app/src/decoder/model.py @@ -0,0 +1,19 @@ +import torch +import torch.nn as nn + +DTYPE=torch.bfloat16 + +class TracingVAEDecoderWrapper(nn.Module): + def __init__(self, decoder): + super().__init__() + self.decoder = decoder + + def forward( + self, + latents: torch.Tensor + ): + latents = latents.to(torch.float32) + return self.decoder( + latents + ) + diff --git a/flux_serve/app/src/inference.py b/flux_serve/app/src/inference.py new file mode 100644 index 0000000..7530000 --- /dev/null +++ b/flux_serve/app/src/inference.py @@ -0,0 +1,251 @@ +import argparse +import torch +import torch.nn as nn +import torch_neuronx +import neuronx_distributed +import os + +from diffusers import FluxPipeline +from diffusers.models.modeling_outputs import Transformer2DModelOutput +from typing import Any, Dict, Optional, Union + +COMPILER_WORKDIR_ROOT = os.path.dirname(__file__) + +TEXT_ENCODER_PATH = os.path.join( + COMPILER_WORKDIR_ROOT, + 'text_encoder_1/compiled_model/model.pt') +TEXT_ENCODER_2_PATH = os.path.join( + COMPILER_WORKDIR_ROOT, + 'text_encoder_2/compiled_model/model.pt') +VAE_DECODER_PATH = os.path.join( + COMPILER_WORKDIR_ROOT, + 'decoder/compiled_model/model.pt') + +TEXT_ENCODER_2_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 
'text_encoder_2/compiled_model/text_encoder_2') + +EMBEDDERS_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'transformer/compiled_model/embedders') +OUT_LAYERS_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'transformer/compiled_model/out_layers') +SINGLE_TRANSFORMER_BLOCKS_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'transformer/compiled_model/single_transformer_blocks') +TRANSFORMER_BLOCKS_DIR = os.path.join( + COMPILER_WORKDIR_ROOT, + 'transformer/compiled_model/transformer_blocks') + +class CustomFluxPipeline(FluxPipeline): + @property + def _execution_device(self): + return torch.device("cpu") + +class TextEncoder2Wrapper(nn.Module): + def __init__(self, sharded_model,dtype=torch.bfloat16): + super().__init__() + self.sharded_model = sharded_model + self.dtype = dtype + + def forward(self, input_ids, output_hidden_states=False, **kwargs): + attention_mask = (input_ids != 0).long() + output = self.sharded_model(input_ids,attention_mask) + last_hidden_state = output[0] + processed_output = last_hidden_state + return (processed_output,) + + +class NeuronFluxTransformer2DModel(nn.Module): + def __init__( + self, + config, + x_embedder, + context_embedder + ): + super().__init__() + with torch_neuronx.experimental.neuron_cores_context(start_nc=0, + nc_count=8): + self.embedders_model = \ + neuronx_distributed.trace.parallel_model_load(EMBEDDERS_DIR) + self.transformer_blocks_model = \ + neuronx_distributed.trace.parallel_model_load( + TRANSFORMER_BLOCKS_DIR) + self.single_transformer_blocks_model = \ + neuronx_distributed.trace.parallel_model_load( + SINGLE_TRANSFORMER_BLOCKS_DIR) + self.out_layers_model = \ + neuronx_distributed.trace.parallel_model_load( + OUT_LAYERS_DIR) + self.config = config + self.x_embedder = x_embedder + self.context_embedder = context_embedder + self.device = torch.device("cpu") + + def forward( + self, + hidden_states: torch.Tensor, + encoder_hidden_states: torch.Tensor = None, + pooled_projections: torch.Tensor = None, + timestep: torch.LongTensor = None, + img_ids: torch.Tensor = None, + txt_ids: torch.Tensor = None, + guidance: torch.Tensor = None, + joint_attention_kwargs: Optional[Dict[str, Any]] = None, + return_dict: bool = False, + ) -> Union[torch.FloatTensor, Transformer2DModelOutput]: + + hidden_states = self.x_embedder(hidden_states) + + hidden_states, temb, image_rotary_emb = self.embedders_model( + hidden_states, + timestep, + guidance, + pooled_projections, + txt_ids, + img_ids + ) + + encoder_hidden_states = self.context_embedder(encoder_hidden_states) + + image_rotary_emb = image_rotary_emb.type(torch.bfloat16) + + encoder_hidden_states, hidden_states = self.transformer_blocks_model( + hidden_states, + encoder_hidden_states, + temb, + image_rotary_emb + ) + + hidden_states = torch.cat([encoder_hidden_states, hidden_states], + dim=1) + + hidden_states = self.single_transformer_blocks_model( + hidden_states, + temb, + image_rotary_emb + ) + + hidden_states = hidden_states.to(torch.bfloat16) + + return self.out_layers_model( + hidden_states, + encoder_hidden_states, + temb + ) + + +class NeuronFluxCLIPTextEncoderModel(nn.Module): + def __init__(self, dtype, encoder): + super().__init__() + self.dtype = dtype + self.encoder = encoder + self.device = torch.device("cpu") + + def forward(self, emb, output_hidden_states): + output = self.encoder(emb) + output = CLIPEncoderOutput(output) + return output + + +class CLIPEncoderOutput(): + def __init__(self, dictionary): + self.pooler_output = dictionary["pooler_output"] + + +class 
NeuronFluxT5TextEncoderModel(nn.Module): + def __init__(self, dtype, encoder): + super().__init__() + self.dtype = dtype + self.encoder = encoder + self.device = torch.device("cpu") + + def forward(self, emb, output_hidden_states): + return torch.unsqueeze(self.encoder(emb)["last_hidden_state"], 1) + + +def run_inference( + prompt, + height, + width, + max_sequence_length, + num_inference_steps): + with torch_neuronx.experimental.neuron_cores_context(start_nc=8): + pipe = CustomFluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16) + + with torch_neuronx.experimental.neuron_cores_context(start_nc=8): + pipe.text_encoder = NeuronFluxCLIPTextEncoderModel( + pipe.text_encoder.dtype, + torch.jit.load(TEXT_ENCODER_PATH)) + + with torch_neuronx.experimental.neuron_cores_context(start_nc=8): + pipe.vae.decoder = torch.jit.load(VAE_DECODER_PATH) + + with torch_neuronx.experimental.neuron_cores_context(start_nc=0, nc_count=8): + sharded_text_encoder_2 = neuronx_distributed.trace.parallel_model_load(TEXT_ENCODER_2_DIR) + pipe.text_encoder_2 = TextEncoder2Wrapper(sharded_text_encoder_2) + + pipe.transformer = NeuronFluxTransformer2DModel( + pipe.transformer.config, + pipe.transformer.x_embedder, + pipe.transformer.context_embedder) + + image = pipe( + prompt, + height=height, + width=width, + guidance_scale=3.5, + num_inference_steps=num_inference_steps, + max_sequence_length=max_sequence_length + ).images[0] + image.save(os.path.join(COMPILER_WORKDIR_ROOT, "flux-dev.png")) + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument( + "-p", + "--prompt", + type=str, + default="A cat holding a sign that says hello world", + help="prompt for image to be generated; generates cat by default" + ) + parser.add_argument( + "-hh", + "--height", + type=int, + default=1024, + help="height of images to be generated by compilation of this model" + ) + parser.add_argument( + "-w", + "--width", + type=int, + default=1024, + help="width of images to be generated by compilation of this model" + ) + parser.add_argument( + "-m", + "--max_sequence_length", + type=int, + default=512, + help="maximum sequence length for the text embeddings" + ) + parser.add_argument( + "-n", + "--num_inference_steps", + type=int, + default=50, + help="number of inference steps to run in generating image" + ) + args = parser.parse_args() + run_inference( + args.prompt, + args.height, + args.width, + args.max_sequence_length, + args.num_inference_steps) + diff --git a/flux_serve/app/src/inference.sh b/flux_serve/app/src/inference.sh new file mode 100755 index 0000000..4c3e7db --- /dev/null +++ b/flux_serve/app/src/inference.sh @@ -0,0 +1,3 @@ +#!/bin/bash + +python "./inference.py" -p "A cat holding a sign that says hello world" -hh $HEIGHT -w $WIDTH -m 32 -n 50 diff --git a/flux_serve/app/src/text_encoder_1/compile.py b/flux_serve/app/src/text_encoder_1/compile.py new file mode 100644 index 0000000..871f388 --- /dev/null +++ b/flux_serve/app/src/text_encoder_1/compile.py @@ -0,0 +1,53 @@ +import copy +import os +import torch +import torch_neuronx +from diffusers import FluxPipeline +from model import TracingCLIPTextEncoderWrapper +from huggingface_hub import login +from huggingface_hub import whoami +hf_token=os.environ['HUGGINGFACE_TOKEN'].strip() +try: + user_info = whoami() + print(f"Already logged in as {user_info['name']}") +except: + login(hf_token,add_to_git_credential=True) + +COMPILER_WORKDIR_ROOT = os.path.dirname(__file__) +DTYPE=torch.bfloat16 + +def 
trace_text_encoder(): + pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16) + text_encoder = copy.deepcopy(pipe.text_encoder) + del pipe + + text_encoder = TracingCLIPTextEncoderWrapper(text_encoder) + + emb = torch.zeros((1, 77), dtype=torch.int64) + + text_encoder_neuron = torch_neuronx.trace( + text_encoder.neuron_text_encoder, + emb, + compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, + 'compiler_workdir'), + compiler_args=["--enable-fast-loading-neuron-binaries"] + ) + + torch_neuronx.async_load(text_encoder_neuron) + + compiled_model_path = os.path.join(COMPILER_WORKDIR_ROOT, 'compiled_model') + if not os.path.exists(compiled_model_path): + os.mkdir(compiled_model_path) + text_encoder_filename = os.path.join(COMPILER_WORKDIR_ROOT, + 'compiled_model/model.pt') + torch.jit.save(text_encoder_neuron, text_encoder_filename) + + del text_encoder + del text_encoder_neuron + + +if __name__ == '__main__': + trace_text_encoder() + diff --git a/flux_serve/app/src/text_encoder_1/model.py b/flux_serve/app/src/text_encoder_1/model.py new file mode 100644 index 0000000..13151c4 --- /dev/null +++ b/flux_serve/app/src/text_encoder_1/model.py @@ -0,0 +1,34 @@ +import torch +import torch.nn as nn +from transformers.modeling_outputs \ + import BaseModelOutputWithPooling +from typing import Optional, Union, Tuple + + +class TracingCLIPTextEncoderWrapper(nn.Module): + def __init__(self, text_encoder): + super().__init__() + self.neuron_text_encoder = text_encoder + self.config = text_encoder.config + self.dtype = text_encoder.dtype + self.device = text_encoder.device + + def forward( + self, + input_ids: Optional[torch.Tensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = False, + return_dict: Optional[bool] = True + ) -> Union[Tuple, BaseModelOutputWithPooling]: + + return self.neuron_text_encoder( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=True, + ) + diff --git a/flux_serve/app/src/text_encoder_2/compile.py b/flux_serve/app/src/text_encoder_2/compile.py new file mode 100644 index 0000000..15e11c4 --- /dev/null +++ b/flux_serve/app/src/text_encoder_2/compile.py @@ -0,0 +1,105 @@ +import argparse +import copy +import os +import torch +import torch_neuronx +from diffusers import FluxPipeline +from model import TracingT5TextEncoderWrapper +import neuronx_distributed +from transformers import T5EncoderModel +from model import ( + TracingT5TextEncoderWrapper, + init_text_encoder_2, +) +from huggingface_hub import login +from huggingface_hub import whoami +hf_token=os.environ['HUGGINGFACE_TOKEN'].strip() +try: + user_info = whoami() + print(f"Already logged in as {user_info['name']}") +except: + login(hf_token,add_to_git_credential=True) + +COMPILER_WORKDIR_ROOT = os.path.dirname(__file__) +TP_DEGREE=8 +DTYPE=torch.bfloat16 + +def build_text_encoder_2(): + pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev",torch_dtype=DTYPE) + text_encoder_2 = copy.deepcopy(pipe.text_encoder_2) + del pipe + + init_text_encoder_2(text_encoder_2) + wrapper = TracingT5TextEncoderWrapper(text_encoder_2) + return wrapper, {} + +def trace_text_encoder_2(max_sequence_length=512): + input_ids = torch.zeros((1, max_sequence_length), dtype=torch.int64) + attention_mask = torch.ones((1, 
max_sequence_length), dtype=torch.int64) + + sample_inputs = (input_ids, attention_mask) + + model = neuronx_distributed.trace.parallel_model_trace( + build_text_encoder_2, + sample_inputs, + tp_degree=TP_DEGREE, + compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, "compiler_workdir"), + compiler_args=["--enable-fast-loading-neuron-binaries"], + ) + + torch_neuronx.async_load(model) + + compiled_model_path = os.path.join(COMPILER_WORKDIR_ROOT, "compiled_model") + if not os.path.exists(compiled_model_path): + os.mkdir(compiled_model_path) + + model_filename = os.path.join(compiled_model_path, "text_encoder_2") + neuronx_distributed.trace.parallel_model_save(model, model_filename) + + del model + +''' +def trace_text_encoder_2(max_sequence_length): + pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16) + text_encoder_2 = copy.deepcopy(pipe.text_encoder_2) + del pipe + + text_encoder_2 = TracingT5TextEncoderWrapper(text_encoder_2) + + emb = torch.zeros((1, max_sequence_length), dtype=torch.int64) + + text_encoder_2_neuron = torch_neuronx.trace( + text_encoder_2.neuron_text_encoder, + emb, + compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, + 'compiler_workdir'), + compiler_args=["--enable-fast-loading-neuron-binaries"] + ) + + torch_neuronx.async_load(text_encoder_2_neuron) + + compiled_model_path = os.path.join(COMPILER_WORKDIR_ROOT, 'compiled_model') + if not os.path.exists(compiled_model_path): + os.mkdir(compiled_model_path) + text_encoder_2_filename = os.path.join(COMPILER_WORKDIR_ROOT, + 'compiled_model/model.pt') + torch.jit.save(text_encoder_2_neuron, text_encoder_2_filename) + + del text_encoder_2 + del text_encoder_2_neuron +''' + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument( + "-m", + "--max_sequence_length", + type=int, + default=512, + help="maximum sequence length for the text embeddings" + ) + args = parser.parse_args() + trace_text_encoder_2(args.max_sequence_length) + diff --git a/flux_serve/app/src/text_encoder_2/model.py b/flux_serve/app/src/text_encoder_2/model.py new file mode 100644 index 0000000..c63caed --- /dev/null +++ b/flux_serve/app/src/text_encoder_2/model.py @@ -0,0 +1,189 @@ +import torch +import torch.nn as nn + +from transformers.modeling_outputs import BaseModelOutput +from typing import Optional, Union, Tuple + +from neuronx_distributed.parallel_layers import parallel_state +from neuronx_distributed.parallel_layers.layers import ( + ColumnParallelLinear, + RowParallelLinear, +) +DTYPE=torch.bfloat16 + +def get_sharded_data(data: torch.Tensor, dim: int) -> torch.Tensor: + tp_rank = parallel_state.get_tensor_model_parallel_rank() + tp_size = parallel_state.get_tensor_model_parallel_size() + per_partition_size = data.shape[dim] // tp_size + + if dim == 0: + return ( + data[per_partition_size * tp_rank : per_partition_size * (tp_rank + 1)] + .clone() + .to(DTYPE) + ) + elif dim == 1: + return ( + data[:, per_partition_size * tp_rank : per_partition_size * (tp_rank + 1)] + .clone() + .to(DTYPE) + ) + else: + raise ValueError("Partition dimension must be 0 or 1.") + +def shard_t5_attention(t5_attention): + # Shard q + orig_q = t5_attention.q + t5_attention.q = ColumnParallelLinear( + orig_q.in_features, + orig_q.out_features, + bias=(orig_q.bias is not None), + gather_output=True + ) + t5_attention.q.weight.data = get_sharded_data(orig_q.weight.data, 0) + # T5 uses bias=False by default, but we handle it just in case + if orig_q.bias is not None: + 
t5_attention.q.bias.data = get_sharded_data(orig_q.bias.data, 0) + del orig_q + + # Shard k + orig_k = t5_attention.k + t5_attention.k = ColumnParallelLinear( + orig_k.in_features, + orig_k.out_features, + bias=(orig_k.bias is not None), + gather_output=True + ) + t5_attention.k.weight.data = get_sharded_data(orig_k.weight.data, 0) + if orig_k.bias is not None: + t5_attention.k.bias.data = get_sharded_data(orig_k.bias.data, 0) + del orig_k + + # Shard v + orig_v = t5_attention.v + t5_attention.v = ColumnParallelLinear( + orig_v.in_features, + orig_v.out_features, + bias=(orig_v.bias is not None), + gather_output=True + ) + t5_attention.v.weight.data = get_sharded_data(orig_v.weight.data, 0) + if orig_v.bias is not None: + t5_attention.v.bias.data = get_sharded_data(orig_v.bias.data, 0) + del orig_v + + # Shard o + orig_o = t5_attention.o + t5_attention.o = RowParallelLinear( + orig_o.in_features, + orig_o.out_features, + bias=(orig_o.bias is not None), + input_is_parallel=False + ) + t5_attention.o.weight.data = get_sharded_data(orig_o.weight.data, 1) + if orig_o.bias is not None: + t5_attention.o.bias.data = orig_o.bias.data.detach() + del orig_o + +def shard_t5_ff(ff_block): + # Helper function for ColumnParallel + def make_column_parallel(orig_layer): + from neuronx_distributed.parallel_layers.layers import ColumnParallelLinear + new_layer = ColumnParallelLinear( + orig_layer.in_features, + orig_layer.out_features, + bias=(orig_layer.bias is not None), + gather_output=False + ) + new_layer.weight.data = get_sharded_data(orig_layer.weight.data, 0) + if orig_layer.bias is not None: + new_layer.bias.data = get_sharded_data(orig_layer.bias.data, 0) + return new_layer + + # Helper function for RowParallel + def make_row_parallel(orig_layer): + from neuronx_distributed.parallel_layers.layers import RowParallelLinear + new_layer = RowParallelLinear( + orig_layer.in_features, + orig_layer.out_features, + bias=(orig_layer.bias is not None), + input_is_parallel=True + ) + # For RowParallel, we shard dimension=1 + new_layer.weight.data = get_sharded_data(orig_layer.weight.data, 1) + if orig_layer.bias is not None: + new_layer.bias.data = orig_layer.bias.data.detach() + return new_layer + + if hasattr(ff_block, "wi") and hasattr(ff_block, "wo"): + orig_wi = ff_block.wi + ff_block.wi = make_column_parallel(orig_wi) + del orig_wi + + orig_wo = ff_block.wo + ff_block.wo = make_row_parallel(orig_wo) + del orig_wo + + elif hasattr(ff_block, "wi_0") and hasattr(ff_block, "wi_1") and hasattr(ff_block, "wo"): + orig_wi_0 = ff_block.wi_0 + ff_block.wi_0 = make_column_parallel(orig_wi_0) + del orig_wi_0 + + orig_wi_1 = ff_block.wi_1 + ff_block.wi_1 = make_column_parallel(orig_wi_1) + del orig_wi_1 + + orig_wo = ff_block.wo + ff_block.wo = make_row_parallel(orig_wo) + del orig_wo + + else: + raise ValueError( + f"Unsupported T5 FF block type: {type(ff_block).__name__}. " + f"Expected T5DenseReluDense or T5DenseGatedActDense." 
+ ) + +def init_text_encoder_2(t5_encoder): + encoder_stack = t5_encoder.encoder # T5Stack + for block in encoder_stack.block: + # block.layer[0] => T5LayerSelfAttention + # block.layer[1] => T5LayerFF + attn = block.layer[0].SelfAttention + shard_t5_attention(attn) + ff = block.layer[1].DenseReluDense + shard_t5_ff(ff) + + +class TracingT5TextEncoderWrapper(nn.Module): + def __init__(self, text_encoder): + super().__init__() + self.neuron_text_encoder = text_encoder + self.config = text_encoder.config + self.dtype = text_encoder.dtype + self.device = text_encoder.device + + def forward( + self, + input_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = False, + return_dict: Optional[bool] = False, + ) -> Union[Tuple[torch.FloatTensor], BaseModelOutput]: + return_dict = return_dict if return_dict is not None \ + else self.config.use_return_dict + + encoder_outputs = self.neuron_text_encoder( + input_ids=input_ids, + attention_mask=attention_mask, + inputs_embeds=inputs_embeds, + head_mask=head_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + return encoder_outputs + diff --git a/flux_serve/app/src/transformer/compile.py b/flux_serve/app/src/transformer/compile.py new file mode 100644 index 0000000..21fcd18 --- /dev/null +++ b/flux_serve/app/src/transformer/compile.py @@ -0,0 +1,220 @@ +import argparse +import copy +import neuronx_distributed +import os +import torch +import torch_neuronx +from diffusers import FluxPipeline +from diffusers.models.transformers.transformer_flux \ + import FluxTransformer2DModel +from model import (TracingTransformerEmbedderWrapper, + TracingTransformerBlockWrapper, + TracingSingleTransformerBlockWrapper, + TracingTransformerOutLayerWrapper, + init_transformer) +from huggingface_hub import login +from huggingface_hub import whoami +hf_token=os.environ['HUGGINGFACE_TOKEN'].strip() +try: + user_info = whoami() + print(f"Already logged in as {user_info['name']}") +except: + login(hf_token,add_to_git_credential=True) + +COMPILER_WORKDIR_ROOT = os.path.dirname(__file__) +TP_DEGREE=8 + + +def trace_transformer_embedders(): + pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16) + transformer: FluxTransformer2DModel = copy.deepcopy(pipe.transformer) + del pipe + init_transformer(transformer) + + mod_pipe_transformer_f = TracingTransformerEmbedderWrapper( + transformer.x_embedder, transformer.context_embedder, + transformer.time_text_embed, transformer.pos_embed) + return mod_pipe_transformer_f, {} + + +def trace_transformer_blocks(): + pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16) + transformer: FluxTransformer2DModel = copy.deepcopy(pipe.transformer) + del pipe + init_transformer(transformer) + + mod_pipe_transformer_f = TracingTransformerBlockWrapper( + transformer, transformer.transformer_blocks) + return mod_pipe_transformer_f, {} + + +def trace_single_transformer_blocks(): + pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16) + transformer: FluxTransformer2DModel = copy.deepcopy(pipe.transformer) + del pipe + init_transformer(transformer) + + mod_pipe_transformer_f = TracingSingleTransformerBlockWrapper( + 
transformer, transformer.single_transformer_blocks) + return mod_pipe_transformer_f, {} + + +def trace_transformer_out_layers(): + pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16) + transformer: FluxTransformer2DModel = copy.deepcopy(pipe.transformer) + del pipe + init_transformer(transformer) + + mod_pipe_transformer_f = TracingTransformerOutLayerWrapper( + transformer.norm_out, transformer.proj_out) + return mod_pipe_transformer_f, {} + + +def trace_transformer(height, width, max_sequence_length): + hidden_states = torch.rand([1, height * width // 256, 3072], + dtype=torch.bfloat16) + timestep = torch.rand([1], dtype=torch.bfloat16) + guidance = torch.rand([1], dtype=torch.float32) + pooled_projections = torch.rand([1, 768], dtype=torch.bfloat16) + txt_ids = torch.rand([1, max_sequence_length, 3], dtype=torch.bfloat16) + img_ids = torch.rand([1, height * width // 256, 3], dtype=torch.bfloat16) + sample_inputs = hidden_states, timestep, guidance, pooled_projections, \ + txt_ids, img_ids + + model = neuronx_distributed.trace.parallel_model_trace( + trace_transformer_embedders, + sample_inputs, + tp_degree=TP_DEGREE, + compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, + 'compiler_workdir'), + compiler_args="""--model-type=unet-inference""" + ) + + torch_neuronx.async_load(model) + + compiled_model_path = os.path.join(COMPILER_WORKDIR_ROOT, 'compiled_model') + if not os.path.exists(compiled_model_path): + os.mkdir(compiled_model_path) + model_filename = os.path.join(COMPILER_WORKDIR_ROOT, + 'compiled_model/embedders') + neuronx_distributed.trace.parallel_model_save(model, model_filename) + + del model + + hidden_states = torch.rand([1, height * width // 256, 3072], + dtype=torch.bfloat16) + encoder_hidden_states = torch.rand([1, max_sequence_length, 3072], + dtype=torch.bfloat16) + temb = torch.rand([1, 3072], dtype=torch.bfloat16) + image_rotary_emb = torch.rand( + [1, 1, height * width // 256 + max_sequence_length, 64, 2, 2], + dtype=torch.bfloat16) + sample_inputs = hidden_states, encoder_hidden_states, \ + temb, image_rotary_emb + + model = neuronx_distributed.trace.parallel_model_trace( + trace_transformer_blocks, + sample_inputs, + tp_degree=TP_DEGREE, + compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, + 'compiler_workdir'), + compiler_args="""--model-type=unet-inference""" + ) + + torch_neuronx.async_load(model) + + model_filename = os.path.join(COMPILER_WORKDIR_ROOT, + 'compiled_model/transformer_blocks') + neuronx_distributed.trace.parallel_model_save(model, model_filename) + + del model + + hidden_states = torch.rand( + [1, height * width // 256 + max_sequence_length, 3072], + dtype=torch.bfloat16) + temb = torch.rand([1, 3072], dtype=torch.bfloat16) + image_rotary_emb = torch.rand( + [1, 1, height * width // 256 + max_sequence_length, 64, 2, 2], + dtype=torch.bfloat16) + sample_inputs = hidden_states, temb, image_rotary_emb + + model = neuronx_distributed.trace.parallel_model_trace( + trace_single_transformer_blocks, + sample_inputs, + tp_degree=TP_DEGREE, + compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, + 'compiler_workdir'), + compiler_args="""--model-type=unet-inference""" + ) + + torch_neuronx.async_load(model) + + model_filename = os.path.join(COMPILER_WORKDIR_ROOT, + 'compiled_model/single_transformer_blocks') + neuronx_distributed.trace.parallel_model_save(model, model_filename) + + del model + + hidden_states = torch.rand( + [1, height * width // 256 + max_sequence_length, 3072], + dtype=torch.bfloat16) + 
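+    # Shape note: the VAE downsamples by 8 and the transformer packs 2x2 latent
+    # patches, so an HxW image maps to (H/16)*(W/16) = H*W/256 image tokens; the
+    # joint sequence additionally carries max_sequence_length text tokens, and
+    # 3072 is the FLUX.1 transformer hidden size, which is why the dummy tracing
+    # inputs here use these dimensions.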
encoder_hidden_states = torch.rand([1, max_sequence_length, 3072], + dtype=torch.bfloat16) + temb = torch.rand([1, 3072], dtype=torch.bfloat16) + sample_inputs = hidden_states, encoder_hidden_states, temb + + model = neuronx_distributed.trace.parallel_model_trace( + trace_transformer_out_layers, + sample_inputs, + tp_degree=TP_DEGREE, + compiler_workdir=os.path.join(COMPILER_WORKDIR_ROOT, + 'compiler_workdir'), + compiler_args="""--model-type=unet-inference""" + ) + + torch_neuronx.async_load(model) + + model_filename = os.path.join(COMPILER_WORKDIR_ROOT, + 'compiled_model/out_layers') + neuronx_distributed.trace.parallel_model_save(model, model_filename) + + del model + + +if __name__ == '__main__': + parser = argparse.ArgumentParser() + parser.add_argument( + "-hh", + "--height", + type=int, + default=1024, + help="height of images to be generated by compilation of this model" + ) + parser.add_argument( + "-w", + "--width", + type=int, + default=1024, + help="width of images to be generated by compilation of this model" + ) + parser.add_argument( + "-m", + "--max_sequence_length", + type=int, + default=512, + help="maximum sequence length for the text embeddings" + ) + args = parser.parse_args() + trace_transformer( + args.height, + args.width, + args.max_sequence_length) + diff --git a/flux_serve/app/src/transformer/model.py b/flux_serve/app/src/transformer/model.py new file mode 100644 index 0000000..3689de4 --- /dev/null +++ b/flux_serve/app/src/transformer/model.py @@ -0,0 +1,448 @@ +import torch +import torch.nn as nn +from diffusers.models.attention import FeedForward +from diffusers.models.attention_processor import Attention +from diffusers.models.embeddings import TimestepEmbedding, \ + PixArtAlphaTextProjection +from neuronx_distributed.parallel_layers import parallel_state +from neuronx_distributed.parallel_layers.layers \ + import ColumnParallelLinear, RowParallelLinear + +DTYPE=torch.bfloat16 + +class TracingTransformerEmbedderWrapper(nn.Module): + def __init__( + self, + x_embedder, + context_embedder, + time_text_embed, + pos_embed): + super().__init__() + self.x_embedder = x_embedder + self.context_embedder = context_embedder + self.time_text_embed = time_text_embed + self.pos_embed = pos_embed + + def forward( + self, + hidden_states, + timestep, + guidance, + pooled_projections, + txt_ids, + img_ids): + + timestep = timestep.to(hidden_states.dtype) * 1000 + if guidance is not None: + guidance = guidance.to(hidden_states.dtype) * 1000 + else: + guidance = None + temb = ( + self.time_text_embed(timestep, pooled_projections) + if guidance is None + else self.time_text_embed(timestep, guidance, pooled_projections) + ) + + ids = torch.cat((txt_ids, img_ids), dim=1) + image_rotary_emb = self.pos_embed(ids) + return hidden_states, temb, image_rotary_emb + + +class TracingTransformerBlockWrapper(nn.Module): + def __init__(self, transformer, transformerblock): + super().__init__() + self.transformerblock = transformerblock + self.config = transformer.config + self.dtype = transformer.dtype + self.device = transformer.device + + def forward( + self, + hidden_states, + encoder_hidden_states, + temb, + image_rotary_emb): + for block in self.transformerblock: + encoder_hidden_states, hidden_states = block( + hidden_states=hidden_states, + encoder_hidden_states=encoder_hidden_states, + temb=temb, + image_rotary_emb=image_rotary_emb + ) + return encoder_hidden_states, hidden_states + + +class TracingSingleTransformerBlockWrapper(nn.Module): + def __init__(self, transformer, 
transformerblock): + super().__init__() + self.transformerblock = transformerblock + self.config = transformer.config + self.dtype = transformer.dtype + self.device = transformer.device + + def forward(self, hidden_states, temb, image_rotary_emb): + for block in self.transformerblock: + hidden_states = block( + hidden_states=hidden_states, + temb=temb, + image_rotary_emb=image_rotary_emb + ) + return hidden_states + + +class TracingTransformerOutLayerWrapper(nn.Module): + def __init__(self, norm_out, proj_out): + super().__init__() + self.norm_out = norm_out + self.proj_out = proj_out + + def forward(self, hidden_states, encoder_hidden_states, temb): + hidden_states = hidden_states[:, encoder_hidden_states.shape[1]:, ...] + + hidden_states = self.norm_out(hidden_states, temb) + return (self.proj_out(hidden_states),) + + +class TracingSingleTransformerBlock(nn.Module): + def __init__(self): + super().__init__() + self.mlp_hidden_dim = None + + self.norm = None + self.proj_mlp = None + self.act_mlp = None + self.proj_out = None + self.proj_out_2 = None + + self.attn = None + + def forward( + self, + hidden_states: torch.FloatTensor, + temb: torch.FloatTensor, + image_rotary_emb=None, + ): + residual = hidden_states + norm_hidden_states, gate = self.norm(hidden_states, emb=temb) + mlp_hidden_states = self.act_mlp(self.proj_mlp(norm_hidden_states)) + + attn_output = self.attn( + hidden_states=norm_hidden_states, + image_rotary_emb=image_rotary_emb, + ) + gate = gate.unsqueeze(1) + hidden_states = gate * (self.proj_out(attn_output) + + self.proj_out_2(mlp_hidden_states)) + hidden_states = residual + hidden_states + if hidden_states.dtype == torch.float16: + hidden_states = hidden_states.clip(-65504, 65504) + + return hidden_states + + +def get_sharded_data(data, dim): + tp_rank = parallel_state.get_tensor_model_parallel_rank() + per_partition_size = \ + data.shape[dim] // parallel_state.get_tensor_model_parallel_size() + if dim == 0: + return data[ + per_partition_size * tp_rank: per_partition_size * (tp_rank + 1) + ].clone().to(torch.bfloat16) + elif dim == 1: + return data[:, + per_partition_size * tp_rank: per_partition_size * + (tp_rank + 1) + ].clone().to(torch.bfloat16) + else: + raise Exception( + f"Partiton value of 0,1 are supported, found {dim}." 
+ ) + + +def shard_attn(attn: Attention): + attn.heads = 3 + + orig_q = attn.to_q + attn.to_q = ColumnParallelLinear( + attn.to_q.in_features, + attn.to_q.out_features, + bias=(attn.to_q.bias is not None), + gather_output=False) + attn.to_q.weight.data = get_sharded_data(orig_q.weight.data, 0) + if attn.to_q.bias is not None: + attn.to_q.bias.data = get_sharded_data(orig_q.bias.data, 0) + del (orig_q) + + orig_k = attn.to_k + attn.to_k = ColumnParallelLinear( + attn.to_k.in_features, + attn.to_k.out_features, + bias=(attn.to_k.bias is not None), + gather_output=False) + attn.to_k.weight.data = get_sharded_data(orig_k.weight.data, 0) + if attn.to_k.bias is not None: + attn.to_k.bias.data = get_sharded_data(orig_k.bias.data, 0) + del (orig_k) + + orig_v = attn.to_v + attn.to_v = ColumnParallelLinear( + attn.to_v.in_features, + attn.to_v.out_features, + bias=(attn.to_v.bias is not None), + gather_output=False) + attn.to_v.weight.data = get_sharded_data(orig_v.weight.data, 0) + if attn.to_v.bias is not None: + attn.to_v.bias.data = get_sharded_data(orig_v.bias.data, 0) + del (orig_v) + + orig_q_proj = attn.add_q_proj + attn.add_q_proj = ColumnParallelLinear( + attn.add_q_proj.in_features, + attn.add_q_proj.out_features, + bias=(attn.add_q_proj.bias is not None), + gather_output=False) + attn.add_q_proj.weight.data = get_sharded_data(orig_q_proj.weight.data, 0) + if attn.add_q_proj.bias is not None: + attn.add_q_proj.bias.data = get_sharded_data(orig_q_proj.bias.data, 0) + del (orig_q_proj) + + orig_k_proj = attn.add_k_proj + attn.add_k_proj = ColumnParallelLinear( + attn.add_k_proj.in_features, + attn.add_k_proj.out_features, + bias=(attn.add_k_proj.bias is not None), + gather_output=False) + attn.add_k_proj.weight.data = get_sharded_data(orig_k_proj.weight.data, 0) + if attn.add_k_proj.bias is not None: + attn.add_k_proj.bias.data = get_sharded_data(orig_k_proj.bias.data, 0) + del (orig_k_proj) + + orig_v_proj = attn.add_v_proj + attn.add_v_proj = ColumnParallelLinear( + attn.add_v_proj.in_features, + attn.add_v_proj.out_features, + bias=(attn.add_v_proj.bias is not None), + gather_output=False) + attn.add_v_proj.weight.data = get_sharded_data(orig_v_proj.weight.data, 0) + if attn.add_v_proj.bias is not None: + attn.add_v_proj.bias.data = get_sharded_data(orig_v_proj.bias.data, 0) + del (orig_v_proj) + + orig_out = attn.to_out[0] + attn.to_out[0] = RowParallelLinear( + attn.to_out[0].in_features, + attn.to_out[0].out_features, + bias=(attn.to_out[0].bias is not None), + input_is_parallel=True) + attn.to_out[0].weight.data = get_sharded_data(orig_out.weight.data, 1) + if attn.to_out[0].bias is not None: + attn.to_out[0].bias.data = orig_out.bias.data.detach() + del (orig_out) + + orig_out = attn.to_add_out + attn.to_add_out = RowParallelLinear( + attn.to_add_out.in_features, + attn.to_add_out.out_features, + bias=(attn.to_add_out.bias is not None), + input_is_parallel=True) + attn.to_add_out.weight.data = get_sharded_data(orig_out.weight.data, 1) + if attn.to_add_out.bias is not None: + attn.to_add_out.bias.data = orig_out.bias.data.detach() + del (orig_out) + return attn + + +def shard_attn_lite(block): + attn = block.attn + attn.heads = 3 + + orig_q = attn.to_q + attn.to_q = ColumnParallelLinear( + attn.to_q.in_features, + attn.to_q.out_features, + bias=(attn.to_q.bias is not None), + gather_output=False) + attn.to_q.weight.data = get_sharded_data(orig_q.weight.data, 0) + if attn.to_q.bias is not None: + attn.to_q.bias.data = get_sharded_data(orig_q.bias.data, 0) + del (orig_q) + + orig_k = 
attn.to_k + attn.to_k = ColumnParallelLinear( + attn.to_k.in_features, + attn.to_k.out_features, + bias=(attn.to_k.bias is not None), + gather_output=False) + attn.to_k.weight.data = get_sharded_data(orig_k.weight.data, 0) + if attn.to_k.bias is not None: + attn.to_k.bias.data = get_sharded_data(orig_k.bias.data, 0) + del (orig_k) + + orig_v = attn.to_v + attn.to_v = ColumnParallelLinear( + attn.to_v.in_features, + attn.to_v.out_features, + bias=(attn.to_v.bias is not None), + gather_output=False) + attn.to_v.weight.data = get_sharded_data(orig_v.weight.data, 0) + if attn.to_v.bias is not None: + attn.to_v.bias.data = get_sharded_data(orig_v.bias.data, 0) + del (orig_v) + + orig_mlp = block.proj_mlp + block.proj_mlp = ColumnParallelLinear( + block.proj_mlp.in_features, + block.proj_mlp.out_features, + bias=(block.proj_mlp.bias is not None), + gather_output=False) + block.proj_mlp.weight.data = get_sharded_data(orig_mlp.weight.data, 0) + if block.proj_mlp.bias is not None: + block.proj_mlp.bias.data = get_sharded_data(orig_mlp.bias.data, 0) + del (orig_mlp) + + orig_out = block.proj_out + out_features = block.proj_out.out_features + bias = block.proj_out.bias + block.proj_out = RowParallelLinear( + 3072, + out_features, + bias=(bias is not None), + input_is_parallel=True) + block.proj_out.weight.data = get_sharded_data( + orig_out.weight.data[..., 0:3072], 1) + if block.proj_out.bias is not None: + block.proj_out.bias.data = orig_out.bias.data.detach() + + block.proj_out_2 = RowParallelLinear( + 12288, + out_features, + bias=False, + input_is_parallel=True) + block.proj_out_2.weight.data = get_sharded_data( + orig_out.weight.data[..., 3072:15360], 1) + del (orig_out) + + return attn + + +def shard_ff(ff: FeedForward) -> FeedForward: + orig_proj = ff.net[0].proj + ff.net[0].proj = ColumnParallelLinear( + ff.net[0].proj.in_features, + ff.net[0].proj.out_features, + bias=(ff.net[0].proj.bias is not None), + gather_output=False) + ff.net[0].proj.weight.data = get_sharded_data(orig_proj.weight.data, 0) + if ff.net[0].proj.bias is not None: + ff.net[0].proj.bias.data = get_sharded_data(orig_proj.bias.data, 0) + del (orig_proj) + orig_linear = ff.net[2] + ff.net[2] = RowParallelLinear( + ff.net[2].in_features, + ff.net[2].out_features, + bias=(ff.net[2].bias is not None), + input_is_parallel=True) + if ff.net[2].bias is not None: + ff.net[2].bias.data = orig_linear.bias.data.detach() + ff.net[2].weight.data = get_sharded_data(orig_linear.weight.data, 1) + del (orig_linear) + return ff + + +def init_transformer(transformer): + timestep_embedder: TimestepEmbedding = \ + transformer.time_text_embed.timestep_embedder + orig_linear_1 = timestep_embedder.linear_1 + timestep_embedder.linear_1 = ColumnParallelLinear( + timestep_embedder.linear_1.in_features, + timestep_embedder.linear_1.out_features, + bias=(timestep_embedder.linear_1.bias is not None), + gather_output=False) + timestep_embedder.linear_1.weight.data = \ + get_sharded_data(orig_linear_1.weight.data, 0) + if timestep_embedder.linear_1.bias is not None: + timestep_embedder.linear_1.bias.data = \ + get_sharded_data(orig_linear_1.bias.data, 0) + del (orig_linear_1) + orig_linear_2 = timestep_embedder.linear_2 + timestep_embedder.linear_2 = RowParallelLinear( + timestep_embedder.linear_2.in_features, + timestep_embedder.linear_2.out_features, + bias=(timestep_embedder.linear_2.bias is not None), + input_is_parallel=True) + if timestep_embedder.linear_2.bias is not None: + timestep_embedder.linear_2.bias.data = orig_linear_2.bias.data.detach() + 
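+    # Megatron-style row parallelism: the weight is sharded along its input
+    # dimension (dim=1) across the tp_degree ranks, while the bias is kept whole
+    # on every rank because RowParallelLinear applies it after the all-reduce.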
timestep_embedder.linear_2.weight.data = \ + get_sharded_data(orig_linear_2.weight.data, 1) + del (orig_linear_2) + + guidance_embedder: TimestepEmbedding = \ + transformer.time_text_embed.guidance_embedder + orig_linear_1 = guidance_embedder.linear_1 + guidance_embedder.linear_1 = ColumnParallelLinear( + guidance_embedder.linear_1.in_features, + guidance_embedder.linear_1.out_features, + bias=(guidance_embedder.linear_1.bias is not None), + gather_output=False) + guidance_embedder.linear_1.weight.data = \ + get_sharded_data(orig_linear_1.weight.data, 0) + if guidance_embedder.linear_1.bias is not None: + guidance_embedder.linear_1.bias.data = \ + get_sharded_data(orig_linear_1.bias.data, 0) + del (orig_linear_1) + orig_linear_2 = guidance_embedder.linear_2 + guidance_embedder.linear_2 = RowParallelLinear( + guidance_embedder.linear_2.in_features, + guidance_embedder.linear_2.out_features, + bias=(guidance_embedder.linear_2.bias is not None), + input_is_parallel=True) + if guidance_embedder.linear_2.bias is not None: + guidance_embedder.linear_2.bias.data = orig_linear_2.bias.data.detach() + guidance_embedder.linear_2.weight.data = \ + get_sharded_data(orig_linear_2.weight.data, 1) + del (orig_linear_2) + + text_embedder: PixArtAlphaTextProjection = \ + transformer.time_text_embed.text_embedder + orig_linear_1 = text_embedder.linear_1 + text_embedder.linear_1 = ColumnParallelLinear( + text_embedder.linear_1.in_features, + text_embedder.linear_1.out_features, + bias=(text_embedder.linear_1.bias is not None), + gather_output=False) + text_embedder.linear_1.weight.data = \ + get_sharded_data(orig_linear_1.weight.data, 0) + if text_embedder.linear_1.bias is not None: + text_embedder.linear_1.bias.data = \ + get_sharded_data(orig_linear_1.bias.data, 0) + del (orig_linear_1) + orig_linear_2 = text_embedder.linear_2 + text_embedder.linear_2 = RowParallelLinear( + text_embedder.linear_2.in_features, + text_embedder.linear_2.out_features, + bias=(text_embedder.linear_2.bias is not None), + input_is_parallel=True) + if text_embedder.linear_2.bias is not None: + text_embedder.linear_2.bias.data = orig_linear_2.bias.data.detach() + text_embedder.linear_2.weight.data = \ + get_sharded_data(orig_linear_2.weight.data, 1) + del (orig_linear_2) + + for block_idx, block in enumerate(transformer.transformer_blocks): + block.attn = shard_attn(block.attn) + block.ff = shard_ff(block.ff) + block.ff_context = shard_ff(block.ff_context) + + for block_idx, block in enumerate(transformer.single_transformer_blocks): + newblock = TracingSingleTransformerBlock() + newblock.mlp_hidden_dim = block.mlp_hidden_dim + newblock.norm = block.norm + newblock.proj_mlp = block.proj_mlp + newblock.act_mlp = block.act_mlp + newblock.proj_out = block.proj_out + newblock.proj_out_2 = None + newblock.attn = block.attn + transformer.single_transformer_blocks[block_idx] = newblock + block = newblock + block.attn = shard_attn_lite(block) + diff --git a/flux_serve/app/t5_model_api.py b/flux_serve/app/t5_model_api.py new file mode 100644 index 0000000..731e895 --- /dev/null +++ b/flux_serve/app/t5_model_api.py @@ -0,0 +1,172 @@ +import traceback +import math +import boto3 +import time +import argparse +import torch +import os +from fastapi import FastAPI, HTTPException +from pydantic import BaseModel, Field +from typing import Any, Dict, Optional, Union +from huggingface_hub import login,snapshot_download +import base64 +import torch +from transformers import T5Tokenizer +from neuronx_distributed.trace import parallel_model_load + 
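+# This service embeds prompts with a T5 text encoder: the pre-compiled Neuron
+# graph shards (tp_*.pt) are pulled from the COMPILED_MODEL_ID Hugging Face repo
+# via snapshot_download and loaded with parallel_model_load. APP, NODEPOOL,
+# POD_NAME, MODEL_ID, COMPILED_MODEL_ID, MAX_SEQ_LEN and HUGGINGFACE_TOKEN are
+# expected as environment variables.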
+cw_namespace='hw-agnostic-infer'
+default_max_new_tokens=50
+cloudwatch = boto3.client('cloudwatch', region_name='us-west-2')
+
+app_name=os.environ['APP']
+nodepool=os.environ['NODEPOOL']
+pod_name = os.environ['POD_NAME']
+hf_token = os.environ['HUGGINGFACE_TOKEN'].strip()
+
+model_id=os.environ['MODEL_ID']
+repo_id=os.environ['COMPILED_MODEL_ID']
+local_dir=snapshot_download(repo_id,allow_patterns="tp_*.pt")
+max_sequence_length = int(os.environ['MAX_SEQ_LEN'])
+
+tokenizer = T5Tokenizer.from_pretrained(model_id)
+tokenizer.model_max_length = max_sequence_length
+model = parallel_model_load(local_dir)
+
+def gentext(prompt,max_new_tokens):
+    start_time = time.time()
+    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding="max_length", max_length=max_new_tokens)
+    with torch.no_grad():
+        output = model(inputs["input_ids"], inputs["attention_mask"])
+    if isinstance(output, dict):
+        last_hidden_state = output["last_hidden_state"]
+    else:
+        last_hidden_state = output
+    embeddings = last_hidden_state.mean(dim=1).squeeze().to(torch.float32).cpu().numpy()
+    total_time = time.time()-start_time
+    return str(embeddings), float(total_time)
+
+def cw_pub_metric(metric_name,metric_value,metric_unit):
+    response = cloudwatch.put_metric_data(
+        Namespace=cw_namespace,
+        MetricData=[
+            {
+                'MetricName':metric_name,
+                'Value':metric_value,
+                'Unit':metric_unit,
+            },
+        ]
+    )
+    print(f"in pub_deployment_counter - response:{response}")
+    return response
+
+login(hf_token, add_to_git_credential=True)
+
+def benchmark(n_runs,test_name,prompt,max_new_tokens):
+    latency_collector = LatencyCollector()
+
+    for _ in range(n_runs):
+        latency_collector.pre_hook()
+        gentext(prompt,max_new_tokens)
+        latency_collector.hook()
+
+    p0_latency_ms = latency_collector.percentile(0) * 1000
+    p50_latency_ms = latency_collector.percentile(50) * 1000
+    p90_latency_ms = latency_collector.percentile(90) * 1000
+    p95_latency_ms = latency_collector.percentile(95) * 1000
+    p99_latency_ms = latency_collector.percentile(99) * 1000
+    p100_latency_ms = latency_collector.percentile(100) * 1000
+
+    report_dict = dict()
+    report_dict["Latency P0"] = f'{p0_latency_ms:.1f}'
+    report_dict["Latency P50"]=f'{p50_latency_ms:.1f}'
+    report_dict["Latency P90"]=f'{p90_latency_ms:.1f}'
+    report_dict["Latency P95"]=f'{p95_latency_ms:.1f}'
+    report_dict["Latency P99"]=f'{p99_latency_ms:.1f}'
+    report_dict["Latency P100"]=f'{p100_latency_ms:.1f}'
+
+    report = f'RESULT FOR {test_name}:'
+    for key, value in report_dict.items():
+        report += f' {key}={value}'
+    print(report)
+    return report
+
+class LatencyCollector:
+    def __init__(self):
+        self.start = None
+        self.latency_list = []
+
+    def pre_hook(self, *args):
+        self.start = time.time()
+
+    def hook(self, *args):
+        self.latency_list.append(time.time() - self.start)
+
+    def percentile(self, percent):
+        latency_list = self.latency_list
+        pos_float = len(latency_list) * percent / 100
+        max_pos = len(latency_list) - 1
+        pos_floor = min(math.floor(pos_float), max_pos)
+        pos_ceil = min(math.ceil(pos_float), max_pos)
+        latency_list = sorted(latency_list)
+        return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor]
+
+class GenerateRequest(BaseModel):
+    max_new_tokens: int
+    prompt: str
+
+class GenerateBenchmarkRequest(BaseModel):
+    n_runs: int
+    max_new_tokens: int
+    prompt: str
+
+class GenerateResponse(BaseModel):
+    text: str = Field(..., description="Base64-encoded text")
+    execution_time: float
+
+class GenerateBenchmarkResponse(BaseModel):
+    report: str = Field(..., description="Benchmark report")
+
+prompt = "What model are you?"
+benchmark(10,"warmup",prompt,default_max_new_tokens)
+app = FastAPI()
+
+@app.post("/benchmark",response_model=GenerateBenchmarkResponse)
+def generate_benchmark_report(request: GenerateBenchmarkRequest):
+    print(f'DEBUG: GenerateBenchmarkRequest:{request}')
+    try:
+        with torch.no_grad():
+            test_name=f'benchmark:{app_name} on {nodepool} with {request.max_new_tokens} output tokens'
+            response_report=benchmark(request.n_runs,test_name,request.prompt,request.max_new_tokens)
+            report_base64 = base64.b64encode(response_report.encode()).decode()
+            return GenerateBenchmarkResponse(report=report_base64)
+    except Exception as e:
+        traceback.print_exc()
+        raise HTTPException(status_code=500, detail=f"{e}")
+
+@app.post("/generate", response_model=GenerateResponse)
+def generate_text_post(request: GenerateRequest):
+    try:
+        with torch.no_grad():
+            response_text,total_time=gentext(request.prompt,request.max_new_tokens)
+        counter_metric=app_name+'-counter'
+        cw_pub_metric(counter_metric,1,'Count')
+        counter_metric=nodepool
+        cw_pub_metric(counter_metric,1,'Count')
+        latency_metric=app_name+'-latency'
+        cw_pub_metric(latency_metric,total_time,'Seconds')
+        text_base64 = base64.b64encode(response_text.encode()).decode()
+        return GenerateResponse(text=text_base64, execution_time=total_time)
+    except Exception as e:
+        traceback.print_exc()
+        raise HTTPException(status_code=500, detail=f"text serialization failed: {e}")
+
+# Health and readiness endpoints
+@app.get("/health")
+def healthy():
+    return {"message": f"{pod_name} is healthy"}
+
+@app.get("/readiness")
+def ready():
+    return {"message": f"{pod_name} is ready"}
diff --git a/flux_serve/app/vllm_model_api.py b/flux_serve/app/vllm_model_api.py
new file mode 100644
index 0000000..5c18dd7
--- /dev/null
+++ b/flux_serve/app/vllm_model_api.py
@@ -0,0 +1,175 @@
+import traceback
+import math
+import boto3
+import time
+import argparse
+import torch
+import torch.nn as nn
+#import torch_neuronx
+#import neuronx_distributed
+import os
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel, Field
+from typing import Any, Dict, Optional, Union
+from huggingface_hub import login
+from starlette.responses import StreamingResponse
+import base64
+from vllm import LLM, SamplingParams
+from sentence_transformers import SentenceTransformer
+import yaml
+
+cw_namespace='hw-agnostic-infer'
+default_max_new_tokens=50
+cloudwatch = boto3.client('cloudwatch', region_name='us-west-2')
+sampling_params = SamplingParams(temperature=0.7,top_k=50,top_p=0.9,max_tokens=128,)
+
+app_name=os.environ['APP']
+nodepool=os.environ['NODEPOOL']
+pod_name = os.environ['POD_NAME']
+hf_token = os.environ['HUGGINGFACE_TOKEN'].strip()
+
+repo_id=os.environ['MODEL_ID']
+os.environ['NEURON_COMPILED_ARTIFACTS']=repo_id
+
+with open("/vllm_config.yaml", "r") as file:
+    vllm_config=yaml.safe_load(file)
+
+login(hf_token, add_to_git_credential=True)
+
+def gentext(prompt,max_new_tokens):
+    start_time = time.time()
+    outputs = model.generate(prompt,sampling_params)
+    response = outputs[0].outputs[0].text
+    total_time = time.time()-start_time
+    return str(response), float(total_time)
+
+def cw_pub_metric(metric_name,metric_value,metric_unit):
+    response = cloudwatch.put_metric_data(
+        Namespace=cw_namespace,
+        MetricData=[
+            {
+                'MetricName':metric_name,
+
'Value':metric_value, + 'Unit':metric_unit, + }, + ] + ) + print(f"in pub_deployment_counter - response:{response}") + return response + +login(hf_token, add_to_git_credential=True) + +# TBD change to text from image +def benchmark(n_runs, test_name,model,prompt,max_new_tokens): + warmup_run = model.generate(prompt,sampling_params) + latency_collector = LatencyCollector() + + for _ in range(n_runs): + latency_collector.pre_hook() + res = model.generate(prompt,sampling_params) + latency_collector.hook() + + p0_latency_ms = latency_collector.percentile(0) * 1000 + p50_latency_ms = latency_collector.percentile(50) * 1000 + p90_latency_ms = latency_collector.percentile(90) * 1000 + p95_latency_ms = latency_collector.percentile(95) * 1000 + p99_latency_ms = latency_collector.percentile(99) * 1000 + p100_latency_ms = latency_collector.percentile(100) * 1000 + + report_dict = dict() + report_dict["Latency P0"] = f'{p0_latency_ms:.1f}' + report_dict["Latency P50"]=f'{p50_latency_ms:.1f}' + report_dict["Latency P90"]=f'{p90_latency_ms:.1f}' + report_dict["Latency P95"]=f'{p95_latency_ms:.1f}' + report_dict["Latency P99"]=f'{p99_latency_ms:.1f}' + report_dict["Latency P100"]=f'{p100_latency_ms:.1f}' + + report = f'RESULT FOR {test_name}:' + for key, value in report_dict.items(): + report += f' {key}={value}' + print(report) + return report + +class LatencyCollector: + def __init__(self): + self.start = None + self.latency_list = [] + + def pre_hook(self, *args): + self.start = time.time() + + def hook(self, *args): + self.latency_list.append(time.time() - self.start) + + def percentile(self, percent): + latency_list = self.latency_list + pos_float = len(latency_list) * percent / 100 + max_pos = len(latency_list) - 1 + pos_floor = min(math.floor(pos_float), max_pos) + pos_ceil = min(math.ceil(pos_float), max_pos) + latency_list = sorted(latency_list) + return latency_list[pos_ceil] if pos_float - pos_floor > 0.5 else latency_list[pos_floor] + +class GenerateRequest(BaseModel): + max_new_tokens: int + prompt: str + +class GenerateBenchmarkRequest(BaseModel): + n_runs: int + max_new_tokens: int + prompt: str + +class GenerateResponse(BaseModel): + text: str = Field(..., description="Base64-encoded text") + execution_time: float + +class GenerateBenchmarkResponse(BaseModel): + report: str = Field(..., description="Benchmark report") + +def load_model(): + model = LLM(**vllm_config) + return model + +model = load_model() +prompt= "What model are you?" 
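+# Warm up at startup so that model initialization and Neuron graph loading are
+# complete before the first client request is served.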
+benchmark(10,"warmup",model,prompt,default_max_new_tokens) +app = FastAPI() + +@app.post("/benchmark",response_model=GenerateBenchmarkResponse) +def generate_benchmark_report(request: GenerateBenchmarkRequest): + print(f'DEBUG: GenerateBenchmarkRequest:{request}') + try: + with torch.no_grad(): + test_name=f'benchmark:{app_name} on {nodepool} with {request.max_new_tokens} output tokens' + response_report=benchmark(request.n_runs,test_name,model,request.prompt,request.max_new_tokens) + report_base64 = base64.b64encode(response_report.encode()).decode() + return GenerateBenchmarkResponse(report=report_base64) + except Exception as e: + traceback.print_exc() + raise HTTPException(status_code=500, detail=f"{e}") + +@app.post("/generate", response_model=GenerateResponse) +def generate_text_post(request: GenerateRequest): + try: + with torch.no_grad(): + response_text,total_time=gentext(request.prompt,request.max_new_tokens) + counter_metric=app_name+'-counter' + cw_pub_metric(counter_metric,1,'Count') + counter_metric=nodepool + cw_pub_metric(counter_metric,1,'Count') + latency_metric=app_name+'-latency' + cw_pub_metric(latency_metric,total_time,'Seconds') + text_base64 = base64.b64encode(response_text.encode()).decode() + return GenerateResponse(text=text_base64, execution_time=total_time) + except Exception as e: + traceback.print_exc() + raise HTTPException(status_code=500, detail=f"text serialization failed: {e}") + +# Health and readiness endpoints +@app.get("/health") +def healthy(): + return {"message": f"{pod_name} is healthy"} + +@app.get("/readiness") +def ready(): + return {"message": f"{pod_name} is ready"} diff --git a/flux_serve/figures/flux-quality-test.png b/flux_serve/figures/flux-quality-test.png new file mode 100644 index 0000000..a40ce49 Binary files /dev/null and b/flux_serve/figures/flux-quality-test.png differ diff --git a/flux_serve/oci-image-build/README.md b/flux_serve/oci-image-build/README.md new file mode 100644 index 0000000..a6f87cf --- /dev/null +++ b/flux_serve/oci-image-build/README.md @@ -0,0 +1,29 @@ + +* Fork https://github.com/aws-samples/scalable-hw-agnostic-inference and populate the `GITHUB_USER` and `GITHUB_OAUTH_TOKEN` based on `Settings/Developer Settings/Personal access tokens`. +* Check the latest [DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) for `BASE_IMAGE_AMD_XLA_TAG` and `BASE_IMAGE_AMD_CUD_TAG` values. +* Export the following variables: +```bash +export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account) +export AWS_REGION=us-west-2 +export BASE_IMAGE_AMD_XLA_TAG=2.5.1-neuronx-py310-sdk2.22.0-ubuntu22.04 +export IMAGE_AMD_XLA_TAG=amd64-neuron +export BASE_REPO=model +export BASE_TAG=multiarch-ubuntu +export BASE_AMD_TAG=amd64 +export GITHUB_BRANCH=master +export GITHUB_USER=yahavb +export GITHUB_REPO=aws-neuron-eks-samples +export CF_STACK=flux-oci-image-inference-cdk +``` +* Install needed packages + +```bash +npm uninstall -g aws-cdk +npm install -g aws-cdk +``` + +* Deploy the pipeline + +```bash +./deploy-pipeline.sh +``` diff --git a/flux_serve/oci-image-build/deploy-pipeline.sh b/flux_serve/oci-image-build/deploy-pipeline.sh new file mode 100755 index 0000000..6389271 --- /dev/null +++ b/flux_serve/oci-image-build/deploy-pipeline.sh @@ -0,0 +1,12 @@ +#!/bin/bash +rm -rf cdk.* package* node_modules/ +npm install -g aws-cdk +npm install aws-cdk-lib +npm install ts-node typescript +npm install typescript --save-dev +npx tsc --init +npx tsc +. 
~/.bash_profile +cdk bootstrap aws://$AWS_ACCOUNT_ID/$AWS_REGION +npm install +cdk deploy --app "npx ts-node --prefer-ts-exts ./pipeline.ts" --parameters BASEIMAGEAMDXLATAG=$BASE_IMAGE_AMD_XLA_TAG --parameters BASEIMAGEAMDCUDTAG=$BASE_IMAGE_AMD_CUD_TAG --parameters BASEREPO=$BASE_REPO --parameters IMAGEAMDXLATAG=$IMAGE_AMD_XLA_TAG --parameters IMAGEAMDCUDTAG=$IMAGE_AMD_CUD_TAG --parameters GITHUBREPO=$GITHUB_REPO --parameters GITHUBUSER=$GITHUB_USER --parameters GITHUBBRANCH=$GITHUB_BRANCH --parameters GITHUBOAUTHTOKEN=$GITHUB_OAUTH_TOKEN --parameters BASEIMAGEARMCPUTAG=$BASE_IMAGE_ARM_CPU_TAG --parameters IMAGEARMCPUTAG=$IMAGE_ARM_CPU_TAG diff --git a/flux_serve/oci-image-build/package-lock.json b/flux_serve/oci-image-build/package-lock.json new file mode 100644 index 0000000..825107e --- /dev/null +++ b/flux_serve/oci-image-build/package-lock.json @@ -0,0 +1,615 @@ +{ + "name": "oci-image-build", + "lockfileVersion": 3, + "requires": true, + "packages": { + "": { + "dependencies": { + "aws-cdk-lib": "^2.173.4", + "ts-node": "^10.9.2" + }, + "devDependencies": { + "typescript": "^5.7.2" + } + }, + "node_modules/@aws-cdk/asset-awscli-v1": { + "version": "2.2.217", + "resolved": "https://registry.npmjs.org/@aws-cdk/asset-awscli-v1/-/asset-awscli-v1-2.2.217.tgz", + "integrity": "sha512-vqMxZaMO3ILc7OuPGH59KryvGqY1wNx7RYLfxM4aMk6uda5eG/rCo1jGRovB1fXXQCPd9NedicJz3n+DkhxIzw==", + "license": "Apache-2.0" + }, + "node_modules/@aws-cdk/asset-kubectl-v20": { + "version": "2.1.3", + "resolved": "https://registry.npmjs.org/@aws-cdk/asset-kubectl-v20/-/asset-kubectl-v20-2.1.3.tgz", + "integrity": "sha512-cDG1w3ieM6eOT9mTefRuTypk95+oyD7P5X/wRltwmYxU7nZc3+076YEVS6vrjDKr3ADYbfn0lDKpfB1FBtO9CQ==", + "license": "Apache-2.0" + }, + "node_modules/@aws-cdk/asset-node-proxy-agent-v6": { + "version": "2.1.0", + "resolved": "https://registry.npmjs.org/@aws-cdk/asset-node-proxy-agent-v6/-/asset-node-proxy-agent-v6-2.1.0.tgz", + "integrity": "sha512-7bY3J8GCVxLupn/kNmpPc5VJz8grx+4RKfnnJiO1LG+uxkZfANZG3RMHhE+qQxxwkyQ9/MfPtTpf748UhR425A==", + "license": "Apache-2.0" + }, + "node_modules/@aws-cdk/cloud-assembly-schema": { + "version": "38.0.1", + "resolved": "https://registry.npmjs.org/@aws-cdk/cloud-assembly-schema/-/cloud-assembly-schema-38.0.1.tgz", + "integrity": "sha512-KvPe+NMWAulfNVwY7jenFhzhuLhLqJ/OPy5jx7wUstbjnYnjRVLpUHPU3yCjXFE0J8cuJVdx95BJ4rOs66Pi9w==", + "bundleDependencies": [ + "jsonschema", + "semver" + ], + "license": "Apache-2.0", + "dependencies": { + "jsonschema": "^1.4.1", + "semver": "^7.6.3" + } + }, + "node_modules/@aws-cdk/cloud-assembly-schema/node_modules/jsonschema": { + "version": "1.4.1", + "inBundle": true, + "license": "MIT", + "engines": { + "node": "*" + } + }, + "node_modules/@aws-cdk/cloud-assembly-schema/node_modules/semver": { + "version": "7.6.3", + "inBundle": true, + "license": "ISC", + "bin": { + "semver": "bin/semver.js" + }, + "engines": { + "node": ">=10" + } + }, + "node_modules/@cspotcode/source-map-support": { + "version": "0.8.1", + "resolved": "https://registry.npmjs.org/@cspotcode/source-map-support/-/source-map-support-0.8.1.tgz", + "integrity": "sha512-IchNf6dN4tHoMFIn/7OE8LWZ19Y6q/67Bmf6vnGREv8RSbBVb9LPJxEcnwrcwX6ixSvaiGoomAUvu4YSxXrVgw==", + "license": "MIT", + "dependencies": { + "@jridgewell/trace-mapping": "0.3.9" + }, + "engines": { + "node": ">=12" + } + }, + "node_modules/@jridgewell/resolve-uri": { + "version": "3.1.2", + "resolved": "https://registry.npmjs.org/@jridgewell/resolve-uri/-/resolve-uri-3.1.2.tgz", + "integrity": 
"sha512-bRISgCIjP20/tbWSPWMEi54QVPRZExkuD9lJL+UIxUKtwVJA8wW1Trb1jMs1RFXo1CBTNZ/5hpC9QvmKWdopKw==", + "license": "MIT", + "engines": { + "node": ">=6.0.0" + } + }, + "node_modules/@jridgewell/sourcemap-codec": { + "version": "1.5.0", + "resolved": "https://registry.npmjs.org/@jridgewell/sourcemap-codec/-/sourcemap-codec-1.5.0.tgz", + "integrity": "sha512-gv3ZRaISU3fjPAgNsriBRqGWQL6quFx04YMPW/zD8XMLsU32mhCCbfbO6KZFLjvYpCZ8zyDEgqsgf+PwPaM7GQ==", + "license": "MIT" + }, + "node_modules/@jridgewell/trace-mapping": { + "version": "0.3.9", + "resolved": "https://registry.npmjs.org/@jridgewell/trace-mapping/-/trace-mapping-0.3.9.tgz", + "integrity": "sha512-3Belt6tdc8bPgAtbcmdtNJlirVoTmEb5e2gC94PnkwEW9jI6CAHUeoG85tjWP5WquqfavoMtMwiG4P926ZKKuQ==", + "license": "MIT", + "dependencies": { + "@jridgewell/resolve-uri": "^3.0.3", + "@jridgewell/sourcemap-codec": "^1.4.10" + } + }, + "node_modules/@tsconfig/node10": { + "version": "1.0.11", + "resolved": "https://registry.npmjs.org/@tsconfig/node10/-/node10-1.0.11.tgz", + "integrity": "sha512-DcRjDCujK/kCk/cUe8Xz8ZSpm8mS3mNNpta+jGCA6USEDfktlNvm1+IuZ9eTcDbNk41BHwpHHeW+N1lKCz4zOw==", + "license": "MIT" + }, + "node_modules/@tsconfig/node12": { + "version": "1.0.11", + "resolved": "https://registry.npmjs.org/@tsconfig/node12/-/node12-1.0.11.tgz", + "integrity": "sha512-cqefuRsh12pWyGsIoBKJA9luFu3mRxCA+ORZvA4ktLSzIuCUtWVxGIuXigEwO5/ywWFMZ2QEGKWvkZG1zDMTag==", + "license": "MIT" + }, + "node_modules/@tsconfig/node14": { + "version": "1.0.3", + "resolved": "https://registry.npmjs.org/@tsconfig/node14/-/node14-1.0.3.tgz", + "integrity": "sha512-ysT8mhdixWK6Hw3i1V2AeRqZ5WfXg1G43mqoYlM2nc6388Fq5jcXyr5mRsqViLx/GJYdoL0bfXD8nmF+Zn/Iow==", + "license": "MIT" + }, + "node_modules/@tsconfig/node16": { + "version": "1.0.4", + "resolved": "https://registry.npmjs.org/@tsconfig/node16/-/node16-1.0.4.tgz", + "integrity": "sha512-vxhUy4J8lyeyinH7Azl1pdd43GJhZH/tP2weN8TntQblOY+A0XbT8DJk1/oCPuOOyg/Ja757rG0CgHcWC8OfMA==", + "license": "MIT" + }, + "node_modules/@types/node": { + "version": "22.10.4", + "resolved": "https://registry.npmjs.org/@types/node/-/node-22.10.4.tgz", + "integrity": "sha512-99l6wv4HEzBQhvaU/UGoeBoCK61SCROQaCCGyQSgX2tEQ3rKkNZ2S7CEWnS/4s1LV+8ODdK21UeyR1fHP2mXug==", + "license": "MIT", + "peer": true, + "dependencies": { + "undici-types": "~6.20.0" + } + }, + "node_modules/acorn": { + "version": "8.14.0", + "resolved": "https://registry.npmjs.org/acorn/-/acorn-8.14.0.tgz", + "integrity": "sha512-cl669nCJTZBsL97OF4kUQm5g5hC2uihk0NxY3WENAC0TYdILVkAyHymAntgxGkl7K+t0cXIrH5siy5S4XkFycA==", + "license": "MIT", + "bin": { + "acorn": "bin/acorn" + }, + "engines": { + "node": ">=0.4.0" + } + }, + "node_modules/acorn-walk": { + "version": "8.3.4", + "resolved": "https://registry.npmjs.org/acorn-walk/-/acorn-walk-8.3.4.tgz", + "integrity": "sha512-ueEepnujpqee2o5aIYnvHU6C0A42MNdsIDeqy5BydrkuC5R1ZuUFnm27EeFJGoEHJQgn3uleRvmTXaJgfXbt4g==", + "license": "MIT", + "dependencies": { + "acorn": "^8.11.0" + }, + "engines": { + "node": ">=0.4.0" + } + }, + "node_modules/arg": { + "version": "4.1.3", + "resolved": "https://registry.npmjs.org/arg/-/arg-4.1.3.tgz", + "integrity": "sha512-58S9QDqG0Xx27YwPSt9fJxivjYl432YCwfDMfZ+71RAqUrZef7LrKQZ3LHLOwCS4FLNBplP533Zx895SeOCHvA==", + "license": "MIT" + }, + "node_modules/aws-cdk-lib": { + "version": "2.173.4", + "resolved": "https://registry.npmjs.org/aws-cdk-lib/-/aws-cdk-lib-2.173.4.tgz", + "integrity": "sha512-0reN94TzkWmyVZDDBlYB4qzJUig8wTHEe82YLHlWRUhrU78fT+drVGUr+lYZwwETaZ+8fLdCOl9ULvFNq7iczQ==", + "bundleDependencies": [ + 
"@balena/dockerignore", + "case", + "fs-extra", + "ignore", + "jsonschema", + "minimatch", + "punycode", + "semver", + "table", + "yaml", + "mime-types" + ], + "license": "Apache-2.0", + "dependencies": { + "@aws-cdk/asset-awscli-v1": "^2.2.208", + "@aws-cdk/asset-kubectl-v20": "^2.1.3", + "@aws-cdk/asset-node-proxy-agent-v6": "^2.1.0", + "@aws-cdk/cloud-assembly-schema": "^38.0.1", + "@balena/dockerignore": "^1.0.2", + "case": "1.6.3", + "fs-extra": "^11.2.0", + "ignore": "^5.3.2", + "jsonschema": "^1.4.1", + "mime-types": "^2.1.35", + "minimatch": "^3.1.2", + "punycode": "^2.3.1", + "semver": "^7.6.3", + "table": "^6.8.2", + "yaml": "1.10.2" + }, + "engines": { + "node": ">= 14.15.0" + }, + "peerDependencies": { + "constructs": "^10.0.0" + } + }, + "node_modules/aws-cdk-lib/node_modules/@balena/dockerignore": { + "version": "1.0.2", + "inBundle": true, + "license": "Apache-2.0" + }, + "node_modules/aws-cdk-lib/node_modules/ajv": { + "version": "8.17.1", + "inBundle": true, + "license": "MIT", + "dependencies": { + "fast-deep-equal": "^3.1.3", + "fast-uri": "^3.0.1", + "json-schema-traverse": "^1.0.0", + "require-from-string": "^2.0.2" + }, + "funding": { + "type": "github", + "url": "https://github.com/sponsors/epoberezkin" + } + }, + "node_modules/aws-cdk-lib/node_modules/ansi-regex": { + "version": "5.0.1", + "inBundle": true, + "license": "MIT", + "engines": { + "node": ">=8" + } + }, + "node_modules/aws-cdk-lib/node_modules/ansi-styles": { + "version": "4.3.0", + "inBundle": true, + "license": "MIT", + "dependencies": { + "color-convert": "^2.0.1" + }, + "engines": { + "node": ">=8" + }, + "funding": { + "url": "https://github.com/chalk/ansi-styles?sponsor=1" + } + }, + "node_modules/aws-cdk-lib/node_modules/astral-regex": { + "version": "2.0.0", + "inBundle": true, + "license": "MIT", + "engines": { + "node": ">=8" + } + }, + "node_modules/aws-cdk-lib/node_modules/balanced-match": { + "version": "1.0.2", + "inBundle": true, + "license": "MIT" + }, + "node_modules/aws-cdk-lib/node_modules/brace-expansion": { + "version": "1.1.11", + "inBundle": true, + "license": "MIT", + "dependencies": { + "balanced-match": "^1.0.0", + "concat-map": "0.0.1" + } + }, + "node_modules/aws-cdk-lib/node_modules/case": { + "version": "1.6.3", + "inBundle": true, + "license": "(MIT OR GPL-3.0-or-later)", + "engines": { + "node": ">= 0.8.0" + } + }, + "node_modules/aws-cdk-lib/node_modules/color-convert": { + "version": "2.0.1", + "inBundle": true, + "license": "MIT", + "dependencies": { + "color-name": "~1.1.4" + }, + "engines": { + "node": ">=7.0.0" + } + }, + "node_modules/aws-cdk-lib/node_modules/color-name": { + "version": "1.1.4", + "inBundle": true, + "license": "MIT" + }, + "node_modules/aws-cdk-lib/node_modules/concat-map": { + "version": "0.0.1", + "inBundle": true, + "license": "MIT" + }, + "node_modules/aws-cdk-lib/node_modules/emoji-regex": { + "version": "8.0.0", + "inBundle": true, + "license": "MIT" + }, + "node_modules/aws-cdk-lib/node_modules/fast-deep-equal": { + "version": "3.1.3", + "inBundle": true, + "license": "MIT" + }, + "node_modules/aws-cdk-lib/node_modules/fast-uri": { + "version": "3.0.3", + "inBundle": true, + "license": "BSD-3-Clause" + }, + "node_modules/aws-cdk-lib/node_modules/fs-extra": { + "version": "11.2.0", + "inBundle": true, + "license": "MIT", + "dependencies": { + "graceful-fs": "^4.2.0", + "jsonfile": "^6.0.1", + "universalify": "^2.0.0" + }, + "engines": { + "node": ">=14.14" + } + }, + "node_modules/aws-cdk-lib/node_modules/graceful-fs": { + "version": 
"4.2.11", + "inBundle": true, + "license": "ISC" + }, + "node_modules/aws-cdk-lib/node_modules/ignore": { + "version": "5.3.2", + "inBundle": true, + "license": "MIT", + "engines": { + "node": ">= 4" + } + }, + "node_modules/aws-cdk-lib/node_modules/is-fullwidth-code-point": { + "version": "3.0.0", + "inBundle": true, + "license": "MIT", + "engines": { + "node": ">=8" + } + }, + "node_modules/aws-cdk-lib/node_modules/json-schema-traverse": { + "version": "1.0.0", + "inBundle": true, + "license": "MIT" + }, + "node_modules/aws-cdk-lib/node_modules/jsonfile": { + "version": "6.1.0", + "inBundle": true, + "license": "MIT", + "dependencies": { + "universalify": "^2.0.0" + }, + "optionalDependencies": { + "graceful-fs": "^4.1.6" + } + }, + "node_modules/aws-cdk-lib/node_modules/jsonschema": { + "version": "1.4.1", + "inBundle": true, + "license": "MIT", + "engines": { + "node": "*" + } + }, + "node_modules/aws-cdk-lib/node_modules/lodash.truncate": { + "version": "4.4.2", + "inBundle": true, + "license": "MIT" + }, + "node_modules/aws-cdk-lib/node_modules/mime-db": { + "version": "1.52.0", + "inBundle": true, + "license": "MIT", + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/aws-cdk-lib/node_modules/mime-types": { + "version": "2.1.35", + "inBundle": true, + "license": "MIT", + "dependencies": { + "mime-db": "1.52.0" + }, + "engines": { + "node": ">= 0.6" + } + }, + "node_modules/aws-cdk-lib/node_modules/minimatch": { + "version": "3.1.2", + "inBundle": true, + "license": "ISC", + "dependencies": { + "brace-expansion": "^1.1.7" + }, + "engines": { + "node": "*" + } + }, + "node_modules/aws-cdk-lib/node_modules/punycode": { + "version": "2.3.1", + "inBundle": true, + "license": "MIT", + "engines": { + "node": ">=6" + } + }, + "node_modules/aws-cdk-lib/node_modules/require-from-string": { + "version": "2.0.2", + "inBundle": true, + "license": "MIT", + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/aws-cdk-lib/node_modules/semver": { + "version": "7.6.3", + "inBundle": true, + "license": "ISC", + "bin": { + "semver": "bin/semver.js" + }, + "engines": { + "node": ">=10" + } + }, + "node_modules/aws-cdk-lib/node_modules/slice-ansi": { + "version": "4.0.0", + "inBundle": true, + "license": "MIT", + "dependencies": { + "ansi-styles": "^4.0.0", + "astral-regex": "^2.0.0", + "is-fullwidth-code-point": "^3.0.0" + }, + "engines": { + "node": ">=10" + }, + "funding": { + "url": "https://github.com/chalk/slice-ansi?sponsor=1" + } + }, + "node_modules/aws-cdk-lib/node_modules/string-width": { + "version": "4.2.3", + "inBundle": true, + "license": "MIT", + "dependencies": { + "emoji-regex": "^8.0.0", + "is-fullwidth-code-point": "^3.0.0", + "strip-ansi": "^6.0.1" + }, + "engines": { + "node": ">=8" + } + }, + "node_modules/aws-cdk-lib/node_modules/strip-ansi": { + "version": "6.0.1", + "inBundle": true, + "license": "MIT", + "dependencies": { + "ansi-regex": "^5.0.1" + }, + "engines": { + "node": ">=8" + } + }, + "node_modules/aws-cdk-lib/node_modules/table": { + "version": "6.8.2", + "inBundle": true, + "license": "BSD-3-Clause", + "dependencies": { + "ajv": "^8.0.1", + "lodash.truncate": "^4.4.2", + "slice-ansi": "^4.0.0", + "string-width": "^4.2.3", + "strip-ansi": "^6.0.1" + }, + "engines": { + "node": ">=10.0.0" + } + }, + "node_modules/aws-cdk-lib/node_modules/universalify": { + "version": "2.0.1", + "inBundle": true, + "license": "MIT", + "engines": { + "node": ">= 10.0.0" + } + }, + "node_modules/aws-cdk-lib/node_modules/yaml": { + "version": "1.10.2", + "inBundle": true, + 
"license": "ISC", + "engines": { + "node": ">= 6" + } + }, + "node_modules/constructs": { + "version": "10.4.2", + "resolved": "https://registry.npmjs.org/constructs/-/constructs-10.4.2.tgz", + "integrity": "sha512-wsNxBlAott2qg8Zv87q3eYZYgheb9lchtBfjHzzLHtXbttwSrHPs1NNQbBrmbb1YZvYg2+Vh0Dor76w4mFxJkA==", + "license": "Apache-2.0", + "peer": true + }, + "node_modules/create-require": { + "version": "1.1.1", + "resolved": "https://registry.npmjs.org/create-require/-/create-require-1.1.1.tgz", + "integrity": "sha512-dcKFX3jn0MpIaXjisoRvexIJVEKzaq7z2rZKxf+MSr9TkdmHmsU4m2lcLojrj/FHl8mk5VxMmYA+ftRkP/3oKQ==", + "license": "MIT" + }, + "node_modules/diff": { + "version": "4.0.2", + "resolved": "https://registry.npmjs.org/diff/-/diff-4.0.2.tgz", + "integrity": "sha512-58lmxKSA4BNyLz+HHMUzlOEpg09FV+ev6ZMe3vJihgdxzgcwZ8VoEEPmALCZG9LmqfVoNMMKpttIYTVG6uDY7A==", + "license": "BSD-3-Clause", + "engines": { + "node": ">=0.3.1" + } + }, + "node_modules/make-error": { + "version": "1.3.6", + "resolved": "https://registry.npmjs.org/make-error/-/make-error-1.3.6.tgz", + "integrity": "sha512-s8UhlNe7vPKomQhC1qFelMokr/Sc3AgNbso3n74mVPA5LTZwkB9NlXf4XPamLxJE8h0gh73rM94xvwRT2CVInw==", + "license": "ISC" + }, + "node_modules/ts-node": { + "version": "10.9.2", + "resolved": "https://registry.npmjs.org/ts-node/-/ts-node-10.9.2.tgz", + "integrity": "sha512-f0FFpIdcHgn8zcPSbf1dRevwt047YMnaiJM3u2w2RewrB+fob/zePZcrOyQoLMMO7aBIddLcQIEK5dYjkLnGrQ==", + "license": "MIT", + "dependencies": { + "@cspotcode/source-map-support": "^0.8.0", + "@tsconfig/node10": "^1.0.7", + "@tsconfig/node12": "^1.0.7", + "@tsconfig/node14": "^1.0.0", + "@tsconfig/node16": "^1.0.2", + "acorn": "^8.4.1", + "acorn-walk": "^8.1.1", + "arg": "^4.1.0", + "create-require": "^1.1.0", + "diff": "^4.0.1", + "make-error": "^1.1.1", + "v8-compile-cache-lib": "^3.0.1", + "yn": "3.1.1" + }, + "bin": { + "ts-node": "dist/bin.js", + "ts-node-cwd": "dist/bin-cwd.js", + "ts-node-esm": "dist/bin-esm.js", + "ts-node-script": "dist/bin-script.js", + "ts-node-transpile-only": "dist/bin-transpile.js", + "ts-script": "dist/bin-script-deprecated.js" + }, + "peerDependencies": { + "@swc/core": ">=1.2.50", + "@swc/wasm": ">=1.2.50", + "@types/node": "*", + "typescript": ">=2.7" + }, + "peerDependenciesMeta": { + "@swc/core": { + "optional": true + }, + "@swc/wasm": { + "optional": true + } + } + }, + "node_modules/typescript": { + "version": "5.7.2", + "resolved": "https://registry.npmjs.org/typescript/-/typescript-5.7.2.tgz", + "integrity": "sha512-i5t66RHxDvVN40HfDd1PsEThGNnlMCMT3jMUuoh9/0TaqWevNontacunWyN02LA9/fIbEWlcHZcgTKb9QoaLfg==", + "license": "Apache-2.0", + "bin": { + "tsc": "bin/tsc", + "tsserver": "bin/tsserver" + }, + "engines": { + "node": ">=14.17" + } + }, + "node_modules/undici-types": { + "version": "6.20.0", + "resolved": "https://registry.npmjs.org/undici-types/-/undici-types-6.20.0.tgz", + "integrity": "sha512-Ny6QZ2Nju20vw1SRHe3d9jVu6gJ+4e3+MMpqu7pqE5HT6WsTSlce++GQmK5UXS8mzV8DSYHrQH+Xrf2jVcuKNg==", + "license": "MIT", + "peer": true + }, + "node_modules/v8-compile-cache-lib": { + "version": "3.0.1", + "resolved": "https://registry.npmjs.org/v8-compile-cache-lib/-/v8-compile-cache-lib-3.0.1.tgz", + "integrity": "sha512-wa7YjyUGfNZngI/vtK0UHAN+lgDCxBPCylVXGp0zu59Fz5aiGtNXaq3DhIov063MorB+VfufLh3JlF2KdTK3xg==", + "license": "MIT" + }, + "node_modules/yn": { + "version": "3.1.1", + "resolved": "https://registry.npmjs.org/yn/-/yn-3.1.1.tgz", + "integrity": "sha512-Ux4ygGWsu2c7isFWe8Yu1YluJmqVhxqK2cLXNQA5AcC3QfbGNpM7fu0Y8b/z16pXLnFxZYvWhd3fhBY9DLmC6Q==", + 
"license": "MIT", + "engines": { + "node": ">=6" + } + } + } +} diff --git a/flux_serve/oci-image-build/package.json b/flux_serve/oci-image-build/package.json new file mode 100644 index 0000000..8926107 --- /dev/null +++ b/flux_serve/oci-image-build/package.json @@ -0,0 +1,9 @@ +{ + "dependencies": { + "aws-cdk-lib": "^2.173.4", + "ts-node": "^10.9.2" + }, + "devDependencies": { + "typescript": "^5.7.2" + } +} diff --git a/flux_serve/oci-image-build/pipeline-stack.js b/flux_serve/oci-image-build/pipeline-stack.js new file mode 100644 index 0000000..f886219 --- /dev/null +++ b/flux_serve/oci-image-build/pipeline-stack.js @@ -0,0 +1,185 @@ +"use strict"; +var __createBinding = (this && this.__createBinding) || (Object.create ? (function(o, m, k, k2) { + if (k2 === undefined) k2 = k; + var desc = Object.getOwnPropertyDescriptor(m, k); + if (!desc || ("get" in desc ? !m.__esModule : desc.writable || desc.configurable)) { + desc = { enumerable: true, get: function() { return m[k]; } }; + } + Object.defineProperty(o, k2, desc); +}) : (function(o, m, k, k2) { + if (k2 === undefined) k2 = k; + o[k2] = m[k]; +})); +var __setModuleDefault = (this && this.__setModuleDefault) || (Object.create ? (function(o, v) { + Object.defineProperty(o, "default", { enumerable: true, value: v }); +}) : function(o, v) { + o["default"] = v; +}); +var __importStar = (this && this.__importStar) || (function () { + var ownKeys = function(o) { + ownKeys = Object.getOwnPropertyNames || function (o) { + var ar = []; + for (var k in o) if (Object.prototype.hasOwnProperty.call(o, k)) ar[ar.length] = k; + return ar; + }; + return ownKeys(o); + }; + return function (mod) { + if (mod && mod.__esModule) return mod; + var result = {}; + if (mod != null) for (var k = ownKeys(mod), i = 0; i < k.length; i++) if (k[i] !== "default") __createBinding(result, mod, k[i]); + __setModuleDefault(result, mod); + return result; + }; +})(); +Object.defineProperty(exports, "__esModule", { value: true }); +exports.PipelineStack = void 0; +const aws_cdk_lib_1 = require("aws-cdk-lib"); +const ecr = __importStar(require("aws-cdk-lib/aws-ecr")); +const codebuild = __importStar(require("aws-cdk-lib/aws-codebuild")); +const codepipeline = __importStar(require("aws-cdk-lib/aws-codepipeline")); +const codepipeline_actions = __importStar(require("aws-cdk-lib/aws-codepipeline-actions")); +const iam = __importStar(require("aws-cdk-lib/aws-iam")); +const secretsmanager = __importStar(require("aws-cdk-lib/aws-secretsmanager")); +const cdk = __importStar(require("aws-cdk-lib/core")); +class PipelineStack extends aws_cdk_lib_1.Stack { + constructor(scope, id, props) { + super(scope, id, props); + const BASE_REPO = new aws_cdk_lib_1.CfnParameter(this, "BASEREPO", { type: "String" }); + const BASE_IMAGE_AMD_XLA_TAG = new aws_cdk_lib_1.CfnParameter(this, "BASEIMAGEAMDXLATAG", { type: "String" }); + const BASE_IMAGE_AMD_CUD_TAG = new aws_cdk_lib_1.CfnParameter(this, "BASEIMAGEAMDCUDTAG", { type: "String" }); + const BASE_IMAGE_ARM_CPU_TAG = new aws_cdk_lib_1.CfnParameter(this, "BASEIMAGEARMCPUTAG", { type: "String" }); + const IMAGE_AMD_XLA_TAG = new aws_cdk_lib_1.CfnParameter(this, "IMAGEAMDXLATAG", { type: "String" }); + const IMAGE_AMD_CUD_TAG = new aws_cdk_lib_1.CfnParameter(this, "IMAGEAMDCUDTAG", { type: "String" }); + const IMAGE_ARM_CPU_TAG = new aws_cdk_lib_1.CfnParameter(this, "IMAGEARMCPUTAG", { type: "String" }); + const GITHUB_OAUTH_TOKEN = new aws_cdk_lib_1.CfnParameter(this, "GITHUBOAUTHTOKEN", { type: "String" }); + const GITHUB_USER = new 
aws_cdk_lib_1.CfnParameter(this, "GITHUBUSER", { type: "String" }); + const GITHUB_REPO = new aws_cdk_lib_1.CfnParameter(this, "GITHUBREPO", { type: "String" }); + const GITHUB_BRANCH = new aws_cdk_lib_1.CfnParameter(this, "GITHUBBRANCH", { type: "String" }); + /* uncomment when you test the stack and dont want to manually delete the ecr registry + const base_registry = new ecr.Repository(this,`base_repo`,{ + repositoryName:BASE_REPO.valueAsString, + imageScanOnPush: true + });*/ + const base_registry = ecr.Repository.fromRepositoryName(this, `base_repo`, BASE_REPO.valueAsString); + //create a roleARN for codebuild + const buildRole = new iam.Role(this, 'BaseCodeBuildDeployRole', { + roleName: 'fluxneuronBaseCodeBuildDeployRole', + assumedBy: new iam.ServicePrincipal('codebuild.amazonaws.com'), + }); + buildRole.addToPolicy(new iam.PolicyStatement({ + resources: ['*'], + actions: ['ssm:*', 's3:*'], + })); + const githubSecret = new secretsmanager.Secret(this, 'githubSecret', { + secretObjectValue: { + token: aws_cdk_lib_1.SecretValue.unsafePlainText(GITHUB_OAUTH_TOKEN.valueAsString) + }, + }); + const githubOAuthToken = aws_cdk_lib_1.SecretValue.secretsManager(githubSecret.secretArn, { jsonField: 'token' }); + new cdk.CfnOutput(this, 'githubOAuthTokenRuntimeOutput1', { + //value: SecretValue.secretsManager("githubtoken",{jsonField: "token"}).toString() + value: githubSecret.secretValueFromJson('token').toString() + }); + new cdk.CfnOutput(this, 'githubOAuthTokenRuntimeOutput2', { + value: aws_cdk_lib_1.SecretValue.secretsManager(githubSecret.secretArn, { jsonField: "token" }).toString() + }); + const base_image_amd_xla_build = new codebuild.Project(this, `ImageXlaAmdBuild`, { + environment: { privileged: true, buildImage: codebuild.LinuxBuildImage.AMAZON_LINUX_2_3 }, + cache: codebuild.Cache.local(codebuild.LocalCacheMode.DOCKER_LAYER, codebuild.LocalCacheMode.CUSTOM), + role: buildRole, + buildSpec: codebuild.BuildSpec.fromObject({ + version: "0.2", + env: { + 'exported-variables': [ + 'AWS_ACCOUNT_ID', 'AWS_REGION', 'BASE_REPO', 'IMAGE_AMD_XLA_TAG', 'BASE_IMAGE_AMD_XLA_TAG' + ], + }, + phases: { + build: { + commands: [ + `export AWS_ACCOUNT_ID="${this.account}"`, + `export AWS_REGION="${this.region}"`, + `export BASE_REPO="${BASE_REPO.valueAsString}"`, + `export IMAGE_TAG="${IMAGE_AMD_XLA_TAG.valueAsString}"`, + `export BASE_IMAGE_TAG="${BASE_IMAGE_AMD_XLA_TAG.valueAsString}"`, + `cd flux_serve/app`, + `chmod +x ./build.sh && ./build.sh` + ], + } + }, + artifacts: { + files: ['imageDetail.json'] + }, + }), + }); + const assets_image_xla_amd_build = new codebuild.Project(this, `AssetsImageXlaAmdBuild`, { + environment: { privileged: true, buildImage: codebuild.LinuxBuildImage.AMAZON_LINUX_2_3 }, + cache: codebuild.Cache.local(codebuild.LocalCacheMode.DOCKER_LAYER, codebuild.LocalCacheMode.CUSTOM), + role: buildRole, + buildSpec: codebuild.BuildSpec.fromObject({ + version: "0.2", + env: { + 'exported-variables': [ + 'AWS_ACCOUNT_ID', 'AWS_REGION', 'BASE_REPO', 'IMAGE_AMD_XLA_TAG', 'BASE_IMAGE_AMD_XLA_TAG' + ], + }, + phases: { + build: { + commands: [ + `export AWS_ACCOUNT_ID="${this.account}"`, + `export AWS_REGION="${this.region}"`, + `export BASE_REPO="${BASE_REPO.valueAsString}"`, + `export IMAGE_TAG="${IMAGE_AMD_XLA_TAG.valueAsString}"`, + `export BASE_IMAGE_TAG="${BASE_IMAGE_AMD_XLA_TAG.valueAsString}"`, + `cd flux_serve/app`, + `chmod +x ./build-assets.sh && ./build-assets.sh` + ], + } + }, + artifacts: { + files: ['imageDetail.json'] + }, + }), + }); + //we allow the buildProject 
principal to push images to ecr + base_registry.grantPullPush(assets_image_xla_amd_build.grantPrincipal); + base_registry.grantPullPush(base_image_amd_xla_build.grantPrincipal); + // here we define our pipeline and put together the assembly line + const sourceOutput = new codepipeline.Artifact(); + const basebuildpipeline = new codepipeline.Pipeline(this, `BuildBasePipeline`); + basebuildpipeline.addStage({ + stageName: 'Source', + actions: [ + new codepipeline_actions.GitHubSourceAction({ + actionName: 'GitHub_Source', + owner: GITHUB_USER.valueAsString, + repo: GITHUB_REPO.valueAsString, + branch: GITHUB_BRANCH.valueAsString, + output: sourceOutput, + oauthToken: aws_cdk_lib_1.SecretValue.secretsManager("githubtoken", { jsonField: "token" }), + trigger: codepipeline_actions.GitHubTrigger.WEBHOOK, + //oauthToken: SecretValue.unsafePlainText(GITHUB_OAUTH_TOKEN.valueAsString) + }) + ] + }); + basebuildpipeline.addStage({ + stageName: 'ImageBuild', + actions: [ + new codepipeline_actions.CodeBuildAction({ + actionName: 'AssetsImageXlaAmdBuild', + input: sourceOutput, + runOrder: 1, + project: assets_image_xla_amd_build + }), + new codepipeline_actions.CodeBuildAction({ + actionName: 'BaseImageAmdXlaBuild', + input: sourceOutput, + runOrder: 2, + project: base_image_amd_xla_build + }) + ] + }); + } +} +exports.PipelineStack = PipelineStack; diff --git a/flux_serve/oci-image-build/pipeline-stack.ts b/flux_serve/oci-image-build/pipeline-stack.ts new file mode 100644 index 0000000..c6a4d6b --- /dev/null +++ b/flux_serve/oci-image-build/pipeline-stack.ts @@ -0,0 +1,166 @@ +import { Stack, StackProps,CfnParameter,SecretValue} from 'aws-cdk-lib'; +import { Construct } from 'constructs' +import * as codecommit from 'aws-cdk-lib/aws-codecommit'; +import * as ecr from 'aws-cdk-lib/aws-ecr'; +import * as codebuild from 'aws-cdk-lib/aws-codebuild'; +import * as codepipeline from 'aws-cdk-lib/aws-codepipeline'; +import * as codepipeline_actions from 'aws-cdk-lib/aws-codepipeline-actions'; +import * as iam from "aws-cdk-lib/aws-iam"; +import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager'; +import * as cdk from 'aws-cdk-lib/core'; +import * as cfn from 'aws-cdk-lib/aws-cloudformation'; + +export class PipelineStack extends Stack { + constructor(scope: Construct, id: string, props?: StackProps) { + super(scope, id, props); + const BASE_REPO = new CfnParameter(this,"BASEREPO",{type:"String"}); + const BASE_IMAGE_AMD_XLA_TAG = new CfnParameter(this,"BASEIMAGEAMDXLATAG",{type:"String"}); + const BASE_IMAGE_AMD_CUD_TAG = new CfnParameter(this,"BASEIMAGEAMDCUDTAG",{type:"String"}); + const BASE_IMAGE_ARM_CPU_TAG = new CfnParameter(this,"BASEIMAGEARMCPUTAG",{type:"String"}); + const IMAGE_AMD_XLA_TAG = new CfnParameter(this,"IMAGEAMDXLATAG",{type:"String"}); + const IMAGE_AMD_CUD_TAG = new CfnParameter(this,"IMAGEAMDCUDTAG",{type:"String"}); + const IMAGE_ARM_CPU_TAG = new CfnParameter(this,"IMAGEARMCPUTAG",{type:"String"}); + const GITHUB_OAUTH_TOKEN = new CfnParameter(this,"GITHUBOAUTHTOKEN",{type:"String"}); + const GITHUB_USER = new CfnParameter(this,"GITHUBUSER",{type:"String"}); + const GITHUB_REPO = new CfnParameter(this,"GITHUBREPO",{type:"String"}); + const GITHUB_BRANCH = new CfnParameter(this,"GITHUBBRANCH",{type:"String"}); + /* uncomment when you test the stack and dont want to manually delete the ecr registry + const base_registry = new ecr.Repository(this,`base_repo`,{ + repositoryName:BASE_REPO.valueAsString, + imageScanOnPush: true + });*/ + + const base_registry = 
ecr.Repository.fromRepositoryName(this,`base_repo`,BASE_REPO.valueAsString) + + //create a roleARN for codebuild + const buildRole = new iam.Role(this, 'BaseCodeBuildDeployRole',{ + roleName: 'fluxneuronBaseCodeBuildDeployRole', + assumedBy: new iam.ServicePrincipal('codebuild.amazonaws.com'), + }); + + buildRole.addToPolicy(new iam.PolicyStatement({ + resources: ['*'], + actions: ['ssm:*','s3:*'], + })); + + const githubSecret = new secretsmanager.Secret(this, 'githubSecret', { + secretObjectValue: { + token: SecretValue.unsafePlainText(GITHUB_OAUTH_TOKEN.valueAsString) + }, + }); + const githubOAuthToken = SecretValue.secretsManager(githubSecret.secretArn,{jsonField:'token'}); + new cdk.CfnOutput(this, 'githubOAuthTokenRuntimeOutput1', { + //value: SecretValue.secretsManager("githubtoken",{jsonField: "token"}).toString() + value: githubSecret.secretValueFromJson('token').toString() + }); + new cdk.CfnOutput(this, 'githubOAuthTokenRuntimeOutput2', { + value: SecretValue.secretsManager(githubSecret.secretArn,{jsonField: "token"}).toString() + }); + + + const base_image_amd_xla_build = new codebuild.Project(this, `ImageXlaAmdBuild`, { + environment: {privileged:true,buildImage: codebuild.LinuxBuildImage.AMAZON_LINUX_2_3}, + cache: codebuild.Cache.local(codebuild.LocalCacheMode.DOCKER_LAYER, codebuild.LocalCacheMode.CUSTOM), + role: buildRole, + buildSpec: codebuild.BuildSpec.fromObject( + { + version: "0.2", + env: { + 'exported-variables': [ + 'AWS_ACCOUNT_ID','AWS_REGION','BASE_REPO','IMAGE_AMD_XLA_TAG','BASE_IMAGE_AMD_XLA_TAG' + ], + }, + phases: { + build: { + commands: [ + `export AWS_ACCOUNT_ID="${this.account}"`, + `export AWS_REGION="${this.region}"`, + `export BASE_REPO="${BASE_REPO.valueAsString}"`, + `export IMAGE_TAG="${IMAGE_AMD_XLA_TAG.valueAsString}"`, + `export BASE_IMAGE_TAG="${BASE_IMAGE_AMD_XLA_TAG.valueAsString}"`, + `cd flux_serve/app`, + `chmod +x ./build.sh && ./build.sh` + ], + } + }, + artifacts: { + files: ['imageDetail.json'] + }, + } + ), + }); + + const assets_image_xla_amd_build = new codebuild.Project(this, `AssetsImageXlaAmdBuild`, { + environment: {privileged:true,buildImage: codebuild.LinuxBuildImage.AMAZON_LINUX_2_3}, + cache: codebuild.Cache.local(codebuild.LocalCacheMode.DOCKER_LAYER, codebuild.LocalCacheMode.CUSTOM), + role: buildRole, + buildSpec: codebuild.BuildSpec.fromObject( + { + version: "0.2", + env: { + 'exported-variables': [ + 'AWS_ACCOUNT_ID','AWS_REGION','BASE_REPO','IMAGE_AMD_XLA_TAG','BASE_IMAGE_AMD_XLA_TAG' + ], + }, + phases: { + build: { + commands: [ + `export AWS_ACCOUNT_ID="${this.account}"`, + `export AWS_REGION="${this.region}"`, + `export BASE_REPO="${BASE_REPO.valueAsString}"`, + `export IMAGE_TAG="${IMAGE_AMD_XLA_TAG.valueAsString}"`, + `export BASE_IMAGE_TAG="${BASE_IMAGE_AMD_XLA_TAG.valueAsString}"`, + `cd flux_serve/app`, + `chmod +x ./build-assets.sh && ./build-assets.sh` + ], + } + }, + artifacts: { + files: ['imageDetail.json'] + }, + } + ), + }); + + //we allow the buildProject principal to push images to ecr + base_registry.grantPullPush(assets_image_xla_amd_build.grantPrincipal); + base_registry.grantPullPush(base_image_amd_xla_build.grantPrincipal); + + // here we define our pipeline and put together the assembly line + const sourceOutput = new codepipeline.Artifact(); + const basebuildpipeline = new codepipeline.Pipeline(this,`BuildBasePipeline`); + basebuildpipeline.addStage({ + stageName: 'Source', + actions: [ + new codepipeline_actions.GitHubSourceAction({ + actionName: 'GitHub_Source', + owner: 
GITHUB_USER.valueAsString, + repo: GITHUB_REPO.valueAsString, + branch: GITHUB_BRANCH.valueAsString, + output: sourceOutput, + oauthToken: SecretValue.secretsManager("githubtoken",{jsonField: "token"}), + trigger: codepipeline_actions.GitHubTrigger.WEBHOOK, + //oauthToken: SecretValue.unsafePlainText(GITHUB_OAUTH_TOKEN.valueAsString) + }) + ] + }); + + basebuildpipeline.addStage({ + stageName: 'ImageBuild', + actions: [ + new codepipeline_actions.CodeBuildAction({ + actionName: 'AssetsImageXlaAmdBuild', + input: sourceOutput, + runOrder: 1, + project: assets_image_xla_amd_build + }), + new codepipeline_actions.CodeBuildAction({ + actionName: 'BaseImageAmdXlaBuild', + input: sourceOutput, + runOrder: 2, + project: base_image_amd_xla_build + }) + ] + }); + } +} diff --git a/flux_serve/oci-image-build/pipeline.js b/flux_serve/oci-image-build/pipeline.js new file mode 100644 index 0000000..d8dc6e9 --- /dev/null +++ b/flux_serve/oci-image-build/pipeline.js @@ -0,0 +1,44 @@ +#!/usr/bin/env node +"use strict"; +var __createBinding = (this && this.__createBinding) || (Object.create ? (function(o, m, k, k2) { + if (k2 === undefined) k2 = k; + var desc = Object.getOwnPropertyDescriptor(m, k); + if (!desc || ("get" in desc ? !m.__esModule : desc.writable || desc.configurable)) { + desc = { enumerable: true, get: function() { return m[k]; } }; + } + Object.defineProperty(o, k2, desc); +}) : (function(o, m, k, k2) { + if (k2 === undefined) k2 = k; + o[k2] = m[k]; +})); +var __setModuleDefault = (this && this.__setModuleDefault) || (Object.create ? (function(o, v) { + Object.defineProperty(o, "default", { enumerable: true, value: v }); +}) : function(o, v) { + o["default"] = v; +}); +var __importStar = (this && this.__importStar) || (function () { + var ownKeys = function(o) { + ownKeys = Object.getOwnPropertyNames || function (o) { + var ar = []; + for (var k in o) if (Object.prototype.hasOwnProperty.call(o, k)) ar[ar.length] = k; + return ar; + }; + return ownKeys(o); + }; + return function (mod) { + if (mod && mod.__esModule) return mod; + var result = {}; + if (mod != null) for (var k = ownKeys(mod), i = 0; i < k.length; i++) if (k[i] !== "default") __createBinding(result, mod, k[i]); + __setModuleDefault(result, mod); + return result; + }; +})(); +Object.defineProperty(exports, "__esModule", { value: true }); +require("source-map-support/register"); +const cdk = __importStar(require("aws-cdk-lib")); +const pipeline_stack_1 = require("./pipeline-stack"); +const app = new cdk.App(); +let stack = process.env.CF_STACK; +new pipeline_stack_1.PipelineStack(app, stack, { + env: { account: process.env.AWS_ACCOUNT_ID, region: process.env.AWS_REGION }, +}); diff --git a/flux_serve/oci-image-build/pipeline.ts b/flux_serve/oci-image-build/pipeline.ts new file mode 100644 index 0000000..f5f0bb6 --- /dev/null +++ b/flux_serve/oci-image-build/pipeline.ts @@ -0,0 +1,10 @@ +#!/usr/bin/env node +import 'source-map-support/register'; +import * as cdk from 'aws-cdk-lib'; +import { PipelineStack } from './pipeline-stack'; + +const app = new cdk.App(); +let stack = process.env.CF_STACK as string; +new PipelineStack(app,stack,{ + env: { account: process.env.AWS_ACCOUNT_ID, region: process.env.AWS_REGION}, +}); diff --git a/flux_serve/oci-image-build/tsconfig.json b/flux_serve/oci-image-build/tsconfig.json new file mode 100644 index 0000000..c9c555d --- /dev/null +++ b/flux_serve/oci-image-build/tsconfig.json @@ -0,0 +1,111 @@ +{ + "compilerOptions": { + /* Visit https://aka.ms/tsconfig to read more about this file */ + + 
/* Projects */ + // "incremental": true, /* Save .tsbuildinfo files to allow for incremental compilation of projects. */ + // "composite": true, /* Enable constraints that allow a TypeScript project to be used with project references. */ + // "tsBuildInfoFile": "./.tsbuildinfo", /* Specify the path to .tsbuildinfo incremental compilation file. */ + // "disableSourceOfProjectReferenceRedirect": true, /* Disable preferring source files instead of declaration files when referencing composite projects. */ + // "disableSolutionSearching": true, /* Opt a project out of multi-project reference checking when editing. */ + // "disableReferencedProjectLoad": true, /* Reduce the number of projects loaded automatically by TypeScript. */ + + /* Language and Environment */ + "target": "es2016", /* Set the JavaScript language version for emitted JavaScript and include compatible library declarations. */ + // "lib": [], /* Specify a set of bundled library declaration files that describe the target runtime environment. */ + // "jsx": "preserve", /* Specify what JSX code is generated. */ + // "experimentalDecorators": true, /* Enable experimental support for legacy experimental decorators. */ + // "emitDecoratorMetadata": true, /* Emit design-type metadata for decorated declarations in source files. */ + // "jsxFactory": "", /* Specify the JSX factory function used when targeting React JSX emit, e.g. 'React.createElement' or 'h'. */ + // "jsxFragmentFactory": "", /* Specify the JSX Fragment reference used for fragments when targeting React JSX emit e.g. 'React.Fragment' or 'Fragment'. */ + // "jsxImportSource": "", /* Specify module specifier used to import the JSX factory functions when using 'jsx: react-jsx*'. */ + // "reactNamespace": "", /* Specify the object invoked for 'createElement'. This only applies when targeting 'react' JSX emit. */ + // "noLib": true, /* Disable including any library files, including the default lib.d.ts. */ + // "useDefineForClassFields": true, /* Emit ECMAScript-standard-compliant class fields. */ + // "moduleDetection": "auto", /* Control what method is used to detect module-format JS files. */ + + /* Modules */ + "module": "commonjs", /* Specify what module code is generated. */ + // "rootDir": "./", /* Specify the root folder within your source files. */ + // "moduleResolution": "node10", /* Specify how TypeScript looks up a file from a given module specifier. */ + // "baseUrl": "./", /* Specify the base directory to resolve non-relative module names. */ + // "paths": {}, /* Specify a set of entries that re-map imports to additional lookup locations. */ + // "rootDirs": [], /* Allow multiple folders to be treated as one when resolving modules. */ + // "typeRoots": [], /* Specify multiple folders that act like './node_modules/@types'. */ + // "types": [], /* Specify type package names to be included without being referenced in a source file. */ + // "allowUmdGlobalAccess": true, /* Allow accessing UMD globals from modules. */ + // "moduleSuffixes": [], /* List of file name suffixes to search when resolving a module. */ + // "allowImportingTsExtensions": true, /* Allow imports to include TypeScript file extensions. Requires '--moduleResolution bundler' and either '--noEmit' or '--emitDeclarationOnly' to be set. */ + // "rewriteRelativeImportExtensions": true, /* Rewrite '.ts', '.tsx', '.mts', and '.cts' file extensions in relative import paths to their JavaScript equivalent in output files. 
*/ + // "resolvePackageJsonExports": true, /* Use the package.json 'exports' field when resolving package imports. */ + // "resolvePackageJsonImports": true, /* Use the package.json 'imports' field when resolving imports. */ + // "customConditions": [], /* Conditions to set in addition to the resolver-specific defaults when resolving imports. */ + // "noUncheckedSideEffectImports": true, /* Check side effect imports. */ + // "resolveJsonModule": true, /* Enable importing .json files. */ + // "allowArbitraryExtensions": true, /* Enable importing files with any extension, provided a declaration file is present. */ + // "noResolve": true, /* Disallow 'import's, 'require's or ''s from expanding the number of files TypeScript should add to a project. */ + + /* JavaScript Support */ + // "allowJs": true, /* Allow JavaScript files to be a part of your program. Use the 'checkJS' option to get errors from these files. */ + // "checkJs": true, /* Enable error reporting in type-checked JavaScript files. */ + // "maxNodeModuleJsDepth": 1, /* Specify the maximum folder depth used for checking JavaScript files from 'node_modules'. Only applicable with 'allowJs'. */ + + /* Emit */ + // "declaration": true, /* Generate .d.ts files from TypeScript and JavaScript files in your project. */ + // "declarationMap": true, /* Create sourcemaps for d.ts files. */ + // "emitDeclarationOnly": true, /* Only output d.ts files and not JavaScript files. */ + // "sourceMap": true, /* Create source map files for emitted JavaScript files. */ + // "inlineSourceMap": true, /* Include sourcemap files inside the emitted JavaScript. */ + // "noEmit": true, /* Disable emitting files from a compilation. */ + // "outFile": "./", /* Specify a file that bundles all outputs into one JavaScript file. If 'declaration' is true, also designates a file that bundles all .d.ts output. */ + // "outDir": "./", /* Specify an output folder for all emitted files. */ + // "removeComments": true, /* Disable emitting comments. */ + // "importHelpers": true, /* Allow importing helper functions from tslib once per project, instead of including them per-file. */ + // "downlevelIteration": true, /* Emit more compliant, but verbose and less performant JavaScript for iteration. */ + // "sourceRoot": "", /* Specify the root path for debuggers to find the reference source code. */ + // "mapRoot": "", /* Specify the location where debugger should locate map files instead of generated locations. */ + // "inlineSources": true, /* Include source code in the sourcemaps inside the emitted JavaScript. */ + // "emitBOM": true, /* Emit a UTF-8 Byte Order Mark (BOM) in the beginning of output files. */ + // "newLine": "crlf", /* Set the newline character for emitting files. */ + // "stripInternal": true, /* Disable emitting declarations that have '@internal' in their JSDoc comments. */ + // "noEmitHelpers": true, /* Disable generating custom helper functions like '__extends' in compiled output. */ + // "noEmitOnError": true, /* Disable emitting files if any type checking errors are reported. */ + // "preserveConstEnums": true, /* Disable erasing 'const enum' declarations in generated code. */ + // "declarationDir": "./", /* Specify the output directory for generated declaration files. */ + + /* Interop Constraints */ + // "isolatedModules": true, /* Ensure that each file can be safely transpiled without relying on other imports. 
*/ + // "verbatimModuleSyntax": true, /* Do not transform or elide any imports or exports not marked as type-only, ensuring they are written in the output file's format based on the 'module' setting. */ + // "isolatedDeclarations": true, /* Require sufficient annotation on exports so other tools can trivially generate declaration files. */ + // "allowSyntheticDefaultImports": true, /* Allow 'import x from y' when a module doesn't have a default export. */ + "esModuleInterop": true, /* Emit additional JavaScript to ease support for importing CommonJS modules. This enables 'allowSyntheticDefaultImports' for type compatibility. */ + // "preserveSymlinks": true, /* Disable resolving symlinks to their realpath. This correlates to the same flag in node. */ + "forceConsistentCasingInFileNames": true, /* Ensure that casing is correct in imports. */ + + /* Type Checking */ + "strict": true, /* Enable all strict type-checking options. */ + // "noImplicitAny": true, /* Enable error reporting for expressions and declarations with an implied 'any' type. */ + // "strictNullChecks": true, /* When type checking, take into account 'null' and 'undefined'. */ + // "strictFunctionTypes": true, /* When assigning functions, check to ensure parameters and the return values are subtype-compatible. */ + // "strictBindCallApply": true, /* Check that the arguments for 'bind', 'call', and 'apply' methods match the original function. */ + // "strictPropertyInitialization": true, /* Check for class properties that are declared but not set in the constructor. */ + // "strictBuiltinIteratorReturn": true, /* Built-in iterators are instantiated with a 'TReturn' type of 'undefined' instead of 'any'. */ + // "noImplicitThis": true, /* Enable error reporting when 'this' is given the type 'any'. */ + // "useUnknownInCatchVariables": true, /* Default catch clause variables as 'unknown' instead of 'any'. */ + // "alwaysStrict": true, /* Ensure 'use strict' is always emitted. */ + // "noUnusedLocals": true, /* Enable error reporting when local variables aren't read. */ + // "noUnusedParameters": true, /* Raise an error when a function parameter isn't read. */ + // "exactOptionalPropertyTypes": true, /* Interpret optional property types as written, rather than adding 'undefined'. */ + // "noImplicitReturns": true, /* Enable error reporting for codepaths that do not explicitly return in a function. */ + // "noFallthroughCasesInSwitch": true, /* Enable error reporting for fallthrough cases in switch statements. */ + // "noUncheckedIndexedAccess": true, /* Add 'undefined' to a type when accessed using an index. */ + // "noImplicitOverride": true, /* Ensure overriding members in derived classes are marked with an override modifier. */ + // "noPropertyAccessFromIndexSignature": true, /* Enforces using indexed accessors for keys declared using an indexed type. */ + // "allowUnusedLabels": true, /* Disable error reporting for unused labels. */ + // "allowUnreachableCode": true, /* Disable error reporting for unreachable code. */ + + /* Completeness */ + // "skipDefaultLibCheck": true, /* Skip type checking .d.ts files that are included with TypeScript. */ + "skipLibCheck": true /* Skip type checking all .d.ts files. 
*/ + } +} diff --git a/flux_serve/specs/amd-neuron-inf2-nodepool.yaml b/flux_serve/specs/amd-neuron-inf2-nodepool.yaml new file mode 100644 index 0000000..201eebb --- /dev/null +++ b/flux_serve/specs/amd-neuron-inf2-nodepool.yaml @@ -0,0 +1,48 @@ +apiVersion: karpenter.sh/v1 +kind: NodePool +metadata: + name: amd-neuron-inf2 +spec: + template: + spec: + requirements: + - key: kubernetes.io/arch + operator: In + values: ["amd64"] + - key: karpenter.k8s.aws/instance-family + operator: In + values: ["inf2"] + - key: karpenter.sh/capacity-type + operator: In + values: ["on-demand"] + nodeClassRef: + group: karpenter.k8s.aws + kind: EC2NodeClass + name: amd-neuron-al2023 + expireAfter: 720h # 30 * 24h = 720h + limits: + cpu: 1000 + disruption: + consolidationPolicy: WhenEmptyOrUnderutilized + consolidateAfter: 10m +--- +apiVersion: karpenter.k8s.aws/v1 +kind: EC2NodeClass +metadata: + name: amd-neuron-al2023 +spec: + amiSelectorTerms: + - alias: "al2023@v20250501" + role: "KarpenterNodeRole-cova-use1" + subnetSelectorTerms: + - tags: + karpenter.sh/discovery: "cova-use1" + securityGroupSelectorTerms: + - tags: + karpenter.sh/discovery: "cova-use1" + blockDeviceMappings: + - deviceName: /dev/xvda + ebs: + volumeSize: 900Gi + volumeType: gp3 + encrypted: true diff --git a/flux_serve/specs/amd-neuron-trn1-nodepool.yaml b/flux_serve/specs/amd-neuron-trn1-nodepool.yaml new file mode 100644 index 0000000..c8176d5 --- /dev/null +++ b/flux_serve/specs/amd-neuron-trn1-nodepool.yaml @@ -0,0 +1,48 @@ +apiVersion: karpenter.sh/v1 +kind: NodePool +metadata: + name: amd-neuron-trn1 +spec: + template: + spec: + requirements: + - key: kubernetes.io/arch + operator: In + values: ["amd64"] + - key: karpenter.k8s.aws/instance-family + operator: In + values: ["trn1"] + - key: karpenter.sh/capacity-type + operator: In + values: ["on-demand"] + nodeClassRef: + group: karpenter.k8s.aws + kind: EC2NodeClass + name: amd-neuron-al2023 + expireAfter: 720h # 30 * 24h = 720h + limits: + cpu: 1000 + disruption: + consolidationPolicy: WhenEmptyOrUnderutilized + consolidateAfter: 10m +--- +apiVersion: karpenter.k8s.aws/v1 +kind: EC2NodeClass +metadata: + name: amd-neuron-al2023 +spec: + amiSelectorTerms: + - alias: "al2023@v20250501" + role: "KarpenterNodeRole-cova-use1" + subnetSelectorTerms: + - tags: + karpenter.sh/discovery: "cova-use1" + securityGroupSelectorTerms: + - tags: + karpenter.sh/discovery: "cova-use1" + blockDeviceMappings: + - deviceName: /dev/xvda + ebs: + volumeSize: 900Gi + volumeType: gp3 + encrypted: true diff --git a/flux_serve/specs/compile-flux-1024x576.yaml b/flux_serve/specs/compile-flux-1024x576.yaml new file mode 100644 index 0000000..1055b01 --- /dev/null +++ b/flux_serve/specs/compile-flux-1024x576.yaml @@ -0,0 +1,64 @@ +apiVersion: batch/v1 +kind: Job +metadata: + name: compile-1024x576 +spec: + template: + spec: + restartPolicy: OnFailure + nodeSelector: + karpenter.sh/nodepool: amd-neuron-trn1 + #serviceAccountName: appsimulator + schedulerName: my-scheduler + containers: + - name: app + image: 920372998901.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + imagePullPolicy: Always + volumeMounts: + - name: dshm + mountPath: /dev/shm + - name: s3-flux-pvc + mountPath: /model + command: + - /bin/bash + - "-exc" + - | + set -x + mkdir -p /model/1024x576 + cd /src + ./compile.sh + cp -r * /model/1024x576 + python /benchmark-flux.py + while true; do sleep 3600; done + resources: + limits: + aws.amazon.com/neuron: 8 + requests: + aws.amazon.com/neuron: 8 + env: + - name: NODEPOOL + value: 
"amd-neuron-trn1" + - name: COMPILER_WORKDIR_ROOT + value: "/model/1024x576" + - name: HEIGHT + value: "1024" + - name: WIDTH + value: "576" + - name: MAX_SEQ_LEN + value: "32" + - name: GUIDANCE_SCALE + value: "3.5" + - name: MODEL_ID + value: "black-forest-labs/FLUX.1-dev" + - name: HUGGINGFACE_TOKEN + valueFrom: + secretKeyRef: + name: hf-secrets + key: HUGGINGFACE_TOKEN + volumes: + - name: dshm + emptyDir: + medium: Memory + - name: s3-flux-pvc + persistentVolumeClaim: + claimName: s3-flux-pvc diff --git a/flux_serve/specs/compile-flux-256x144.yaml b/flux_serve/specs/compile-flux-256x144.yaml new file mode 100644 index 0000000..5f7b3d1 --- /dev/null +++ b/flux_serve/specs/compile-flux-256x144.yaml @@ -0,0 +1,65 @@ +apiVersion: batch/v1 +kind: Job +metadata: + name: compile-256x256 +spec: + template: + spec: + restartPolicy: OnFailure + nodeSelector: + karpenter.sh/nodepool: amd-neuron-trn1 + serviceAccountName: appsimulator + schedulerName: my-scheduler + containers: + - name: app + image: 891377065549.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + #image: 920372998901.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + imagePullPolicy: Always + volumeMounts: + - name: dshm + mountPath: /dev/shm + #- name: s3-flux-pvc + # mountPath: /model + command: + - /bin/bash + - "-exc" + - | + set -x + while true; do sleep 3600; done + mkdir -p /model/256x256 + cd /src + ./compile.sh + cp -r * /model/256x256 + python /benchmark-flux.py + resources: + limits: + aws.amazon.com/neuron: 8 + requests: + aws.amazon.com/neuron: 8 + env: + - name: NODEPOOL + value: "amd-neuron-trn1" + - name: COMPILER_WORKDIR_ROOT + value: "/model/256x256" + - name: HEIGHT + value: "256" + - name: WIDTH + value: "256" + - name: MAX_SEQ_LEN + value: "32" + - name: GUIDANCE_SCALE + value: "3.5" + - name: MODEL_ID + value: "black-forest-labs/FLUX.1-dev" + - name: HUGGINGFACE_TOKEN + valueFrom: + secretKeyRef: + name: hf-secrets + key: HUGGINGFACE_TOKEN + volumes: + - name: dshm + emptyDir: + medium: Memory + #- name: s3-flux-pvc + # persistentVolumeClaim: + # claimName: s3-flux-pvc diff --git a/flux_serve/specs/compile-flux-512x512.yaml b/flux_serve/specs/compile-flux-512x512.yaml new file mode 100644 index 0000000..1572338 --- /dev/null +++ b/flux_serve/specs/compile-flux-512x512.yaml @@ -0,0 +1,65 @@ +apiVersion: batch/v1 +kind: Job +metadata: + name: compile-512x512 +spec: + template: + spec: + restartPolicy: OnFailure + nodeSelector: + karpenter.sh/nodepool: amd-neuron-trn1 + #serviceAccountName: appsimulator + schedulerName: my-scheduler + containers: + - name: app + #image: 891377065549.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + image: 920372998901.dkr.ecr.us-east-1.amazonaws.com/model:amd64-neuron + imagePullPolicy: Always + volumeMounts: + - name: dshm + mountPath: /dev/shm + - name: s3-flux-pvc + mountPath: /model + command: + - /bin/bash + - "-exc" + - | + set -x + #mkdir -p /model/512x512 + #cd /src + #./compile.sh + #cp -r * /model/512x512 + #python /benchmark-flux.py + while true; do sleep 3600; done + resources: + limits: + aws.amazon.com/neuron: 16 + requests: + aws.amazon.com/neuron: 16 + env: + - name: NODEPOOL + value: "amd-neuron-trn1" + - name: COMPILER_WORKDIR_ROOT + value: "/model/512x512" + - name: HEIGHT + value: "512" + - name: WIDTH + value: "512" + - name: MAX_SEQ_LEN + value: "32" + - name: GUIDANCE_SCALE + value: "3.5" + - name: MODEL_ID + value: "black-forest-labs/FLUX.1-dev" + - name: HUGGINGFACE_TOKEN + valueFrom: + secretKeyRef: + name: hf-secrets + key: HUGGINGFACE_TOKEN + 
volumes: + - name: dshm + emptyDir: + medium: Memory + - name: s3-flux-pvc + persistentVolumeClaim: + claimName: s3-flux-pvc diff --git a/flux_serve/specs/compile.yaml b/flux_serve/specs/compile.yaml new file mode 100644 index 0000000..2556a4e --- /dev/null +++ b/flux_serve/specs/compile.yaml @@ -0,0 +1,72 @@ +apiVersion: batch/v1 +kind: Job +metadata: + name: compile +spec: + template: + spec: + restartPolicy: OnFailure + nodeSelector: + karpenter.sh/nodepool: amd-neuron-trn1 + #serviceAccountName: appsimulator + schedulerName: my-scheduler + containers: + - name: app + image: 920372998901.dkr.ecr.us-east-1.amazonaws.com/model:amd64-neuron + imagePullPolicy: Always + volumeMounts: + - name: dshm + mountPath: /dev/shm + - name: neuron-bits-use1-pvc + mountPath: /model + command: + - /bin/bash + - "-exc" + - | + set -x + python /download_hf_model.py + find /model -type f -name '*.deb' -exec sh -c ' + for pkg; do + cp "$pkg" / && + apt install -y "/$(basename "$pkg")" + done + ' _ {} + + find /model -type f -name '*.whl' -exec sh -c ' + for pkg; do + cp "$pkg" / && + pip install "/$(basename "$pkg")" + done + ' _ {} + + while true; do sleep 3600; done + resources: + limits: + aws.amazon.com/neuron: 16 + requests: + aws.amazon.com/neuron: 16 + env: + - name: NODEPOOL + value: "amd-neuron-trn1" + - name: COMPILER_WORKDIR_ROOT + value: "/model" + - name: HEIGHT + value: "512" + - name: WIDTH + value: "512" + - name: MAX_SEQ_LEN + value: "32" + - name: GUIDANCE_SCALE + value: "3.5" + - name: MODEL_ID + value: "black-forest-labs/FLUX.1-dev" + - name: HUGGINGFACE_TOKEN + valueFrom: + secretKeyRef: + name: hf-secrets + key: HUGGINGFACE_TOKEN + volumes: + - name: dshm + emptyDir: + medium: Memory + - name: neuron-bits-use1-pvc + persistentVolumeClaim: + claimName: neuron-bits-use1-pvc diff --git a/flux_serve/specs/cova-gradio-config.yaml b/flux_serve/specs/cova-gradio-config.yaml new file mode 100644 index 0000000..6b34cbb --- /dev/null +++ b/flux_serve/specs/cova-gradio-config.yaml @@ -0,0 +1,18 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: cova-gradio-config +data: + models.json: | + [ + { + "name": "512 × 512", + "host_env": "FLUX_NEURON_512X512_MODEL_API_SERVICE_HOST", + "port_env": "FLUX_NEURON_512X512_MODEL_API_SERVICE_PORT", + "height": 512, + "width": 512, + "caption_host_env": "MLLAMA_32_11B_VLLM_TRN1_SERVICE_HOST", + "caption_port_env": "MLLAMA_32_11B_VLLM_TRN1_SERVICE_PORT", + "caption_max_new_tokens": 1024 + } + ] diff --git a/flux_serve/specs/cova-gradio-deploy.yaml b/flux_serve/specs/cova-gradio-deploy.yaml new file mode 100644 index 0000000..c84d003 --- /dev/null +++ b/flux_serve/specs/cova-gradio-deploy.yaml @@ -0,0 +1,64 @@ +apiVersion: v1 +kind: Service +metadata: + name: cova-gradio +spec: + selector: + app: cova-gradio + ports: + - protocol: TCP + port: 8000 + targetPort: 8000 + type: ClusterIP +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + labels: + app: cova-gradio + name: cova-gradio +spec: + selector: + matchLabels: + app: cova-gradio + template: + metadata: + labels: + app: cova-gradio + spec: + nodeSelector: + alpha.eksctl.io/nodegroup-name: kub316-ng + #serviceAccountName: appsimulator + containers: + - name: app + image: 891377065549.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + #image: 920372998901.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + imagePullPolicy: Always + volumeMounts: + - name: cova-gradio-volume + mountPath: /app + #command: ["sh", "-c", "while true; do sleep 3600; done"] + command: ["sh", "-c", "uvicorn cova_gradio_m:app 
--host=0.0.0.0"] + ports: + - containerPort: 8000 + protocol: TCP + readinessProbe: + httpGet: + path: /readiness + port: 8000 + initialDelaySeconds: 60 + periodSeconds: 10 + env: + - name: MODELS_FILE_PATH + value: "/app/models.json" + - name: POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + volumes: + - name: cova-gradio-volume + configMap: + name: cova-gradio-config + items: + - key: models.json + path: models.json diff --git a/flux_serve/specs/cova-ingress.yaml b/flux_serve/specs/cova-ingress.yaml new file mode 100644 index 0000000..928c6d7 --- /dev/null +++ b/flux_serve/specs/cova-ingress.yaml @@ -0,0 +1,34 @@ +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: cova + annotations: + kubernetes.io/ingress.class: alb + alb.ingress.kubernetes.io/scheme: internet-facing + alb.ingress.kubernetes.io/target-type: ip + alb.ingress.kubernetes.io/healthcheck-path: /health + alb.ingress.kubernetes.io/healthcheck-interval-seconds: '10' + alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '9' + alb.ingress.kubernetes.io/healthy-threshold-count: '2' + alb.ingress.kubernetes.io/unhealthy-threshold-count: '10' + alb.ingress.kubernetes.io/success-codes: '200-301' + alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80}]' + alb.ingress.kubernetes.io/backend-protocol: HTTP + alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=10 + alb.ingress.kubernetes.io/load-balancer-name: cova + alb.ingress.kubernetes.io/actions.weighted-routing: > + {"type":"forward","forwardConfig":{"targetGroups":[{"serviceName":"cova-gradio","servicePort":8000,"weight":100}],"targetGroupStickinessConfig":{"enabled":true,"durationSeconds":200}}} + labels: + app: cova-neuron-gradio +spec: + ingressClassName: alb + rules: + - http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: weighted-routing + port: + name: use-annotation diff --git a/flux_serve/specs/flux-karpenter-inline-sts.json b/flux_serve/specs/flux-karpenter-inline-sts.json new file mode 100644 index 0000000..7642272 --- /dev/null +++ b/flux_serve/specs/flux-karpenter-inline-sts.json @@ -0,0 +1,13 @@ +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "sts:AssumeRole", + "sts:AssumeRoleWithWebIdentity" + ], + "Resource": "*" + } + ] +} diff --git a/flux_serve/specs/flux-model-s3-storage.yaml b/flux_serve/specs/flux-model-s3-storage.yaml new file mode 100644 index 0000000..884eb75 --- /dev/null +++ b/flux_serve/specs/flux-model-s3-storage.yaml @@ -0,0 +1,34 @@ +apiVersion: v1 +kind: PersistentVolume +metadata: + name: s3-flux-pv +spec: + capacity: + storage: 1200Gi + accessModes: + - ReadWriteMany + storageClassName: "" + claimRef: + namespace: default + name: s3-flux-pvc + mountOptions: + - region=us-west-2 + csi: + driver: s3.csi.aws.com + volumeHandle: s3-csi-driver-volume + volumeAttributes: + bucketName: flux1-dev-neuron2 + #bucketName: flux1-dev-neuron +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: s3-flux-pvc +spec: + accessModes: + - ReadWriteMany # Supported options: ReadWriteMany / ReadOnlyMany + storageClassName: "" # Required for static provisioning + resources: + requests: + storage: 1200Gi + volumeName: s3-flux-pv diff --git a/flux_serve/specs/flux-neuron-1024x576-model-api.yaml b/flux_serve/specs/flux-neuron-1024x576-model-api.yaml new file mode 100644 index 0000000..9ae6c4b --- /dev/null +++ b/flux_serve/specs/flux-neuron-1024x576-model-api.yaml @@ -0,0 +1,93 @@ +apiVersion: v1 +kind: 
Service +metadata: + name: flux-neuron-1024x576-model-api +spec: + selector: + app: flux-neuron-1024x576-model-api + ports: + - protocol: TCP + port: 8000 + targetPort: 8000 + type: ClusterIP +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + labels: + app: flux-neuron-1024x576-model-api + name: flux-neuron-1024x576-model-api +spec: + selector: + matchLabels: + app: flux-neuron-1024x576-model-api + template: + metadata: + labels: + app: flux-neuron-1024x576-model-api + spec: + nodeSelector: + karpenter.sh/nodepool: amd-neuron-inf2 + serviceAccountName: flux-serviceaccount + schedulerName: my-scheduler + containers: + - name: app + image: 920372998901.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + imagePullPolicy: Always + volumeMounts: + - name: dshm + mountPath: /dev/shm + - name: s3-flux-pvc + mountPath: /model + command: ["sh", "-c", "uvicorn flux_model_api:app --host=0.0.0.0"] + resources: + requests: + aws.amazon.com/neuron: 6 + limits: + aws.amazon.com/neuron: 6 + ports: + - containerPort: 8000 + protocol: TCP + readinessProbe: + httpGet: + path: /readiness + port: 8000 + initialDelaySeconds: 60 + periodSeconds: 10 + env: + - name: APP + value: "flux1.1-dev-1024x576-inf2" + - name: NODEPOOL + value: "amd-neuron-inf2" + - name: DEVICE + value: "xla" + - name: MODEL_ID + value: "black-forest-labs/FLUX.1-dev" + - name: COMPILER_WORKDIR_ROOT + value: "/model/1024x576" + - name: HEIGHT + value: "1024" + - name: WIDTH + value: "576" + - name: MAX_SEQ_LEN + value: "32" + - name: GUIDANCE_SCALE + value: "3.5" + - name: HUGGINGFACE_TOKEN + valueFrom: + secretKeyRef: + name: hf-secrets + key: HUGGINGFACE_TOKEN + - name: POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + volumes: + - name: workdir + emptyDir: {} + - name: dshm + emptyDir: + medium: Memory + - name: s3-flux-pvc + persistentVolumeClaim: + claimName: s3-flux-pvc diff --git a/flux_serve/specs/flux-neuron-256x144-model-api.yaml b/flux_serve/specs/flux-neuron-256x144-model-api.yaml new file mode 100644 index 0000000..1ada152 --- /dev/null +++ b/flux_serve/specs/flux-neuron-256x144-model-api.yaml @@ -0,0 +1,105 @@ +apiVersion: v1 +kind: Service +metadata: + name: flux-neuron-256x144-model-api +spec: + selector: + app: flux-neuron-256x144-model-api + ports: + - protocol: TCP + port: 8000 + targetPort: 8000 + type: ClusterIP +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + labels: + app: flux-neuron-256x144-model-api + name: flux-neuron-256x144-model-api +spec: + selector: + matchLabels: + app: flux-neuron-256x144-model-api + template: + metadata: + labels: + app: flux-neuron-256x144-model-api + spec: + nodeSelector: + karpenter.sh/nodepool: amd-neuron-inf2 + serviceAccountName: appsimulator + #serviceAccountName: flux-serviceaccount + schedulerName: my-scheduler + containers: + - name: app + image: 891377065549.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + #image: 920372998901.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + imagePullPolicy: Always + volumeMounts: + - name: dshm + mountPath: /dev/shm + - name: s3-flux-pvc + mountPath: /model + command: + - /bin/bash + - "-exc" + - | + set -x + pip install --upgrade pip + pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com + pip install --upgrade "neuronx-distributed==0.11.0" --extra-index-url https://pip.repos.neuron.amazonaws.com + pip install --upgrade "neuronx-distributed-inference>=0.2.0" --extra-index-url https://pip.repos.neuron.amazonaws.com + pip install --upgrade tenacity diffusers + uvicorn flux_model_api:app 
--host=0.0.0.0 + resources: + requests: + aws.amazon.com/neuron: 6 + limits: + aws.amazon.com/neuron: 6 + ports: + - containerPort: 8000 + protocol: TCP + readinessProbe: + httpGet: + path: /readiness + port: 8000 + initialDelaySeconds: 60 + periodSeconds: 10 + env: + - name: APP + value: "flux1.1-dev-256x144-inf2" + - name: NODEPOOL + value: "amd-neuron-inf2" + - name: DEVICE + value: "xla" + - name: MODEL_ID + value: "black-forest-labs/FLUX.1-dev" + - name: COMPILER_WORKDIR_ROOT + value: "/model/256x144" + - name: HEIGHT + value: "256" + - name: WIDTH + value: "144" + - name: MAX_SEQ_LEN + value: "32" + - name: GUIDANCE_SCALE + value: "3.5" + - name: HUGGINGFACE_TOKEN + valueFrom: + secretKeyRef: + name: hf-secrets + key: HUGGINGFACE_TOKEN + - name: POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + volumes: + - name: workdir + emptyDir: {} + - name: dshm + emptyDir: + medium: Memory + - name: s3-flux-pvc + persistentVolumeClaim: + claimName: s3-flux-pvc diff --git a/flux_serve/specs/flux-neuron-512x512-model-api.yaml b/flux_serve/specs/flux-neuron-512x512-model-api.yaml new file mode 100644 index 0000000..a2cc706 --- /dev/null +++ b/flux_serve/specs/flux-neuron-512x512-model-api.yaml @@ -0,0 +1,105 @@ +apiVersion: v1 +kind: Service +metadata: + name: flux-neuron-512x512-model-api +spec: + selector: + app: flux-neuron-512x512-model-api + ports: + - protocol: TCP + port: 8000 + targetPort: 8000 + type: ClusterIP +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + labels: + app: flux-neuron-512x512-model-api + name: flux-neuron-512x512-model-api +spec: + selector: + matchLabels: + app: flux-neuron-512x512-model-api + template: + metadata: + labels: + app: flux-neuron-512x512-model-api + spec: + nodeSelector: + karpenter.sh/nodepool: amd-neuron-inf2 + serviceAccountName: flux-serviceaccount + #serviceAccountName: appsimulator + schedulerName: my-scheduler + containers: + - name: app + #image: 891377065549.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + image: 920372998901.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + imagePullPolicy: Always + volumeMounts: + - name: dshm + mountPath: /dev/shm + - name: s3-flux-pvc + mountPath: /model + command: + - /bin/bash + - "-exc" + - | + set -x + pip install --upgrade pip + pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com + pip install --upgrade "neuronx-distributed==0.11.0" --extra-index-url https://pip.repos.neuron.amazonaws.com + pip install --upgrade "neuronx-distributed-inference>=0.2.0" --extra-index-url https://pip.repos.neuron.amazonaws.com + pip install --upgrade tenacity diffusers + uvicorn flux_model_api:app --host=0.0.0.0 + resources: + requests: + aws.amazon.com/neuron: 8 + limits: + aws.amazon.com/neuron: 8 + ports: + - containerPort: 8000 + protocol: TCP + readinessProbe: + httpGet: + path: /readiness + port: 8000 + initialDelaySeconds: 460 + periodSeconds: 10 + env: + - name: APP + value: "flux1.1-dev-512x512-inf2" + - name: NODEPOOL + value: "amd-neuron-inf2" + - name: DEVICE + value: "xla" + - name: MODEL_ID + value: "black-forest-labs/FLUX.1-dev" + - name: COMPILER_WORKDIR_ROOT + value: "/model/512x512" + - name: HEIGHT + value: "512" + - name: WIDTH + value: "512" + - name: MAX_SEQ_LEN + value: "32" + - name: GUIDANCE_SCALE + value: "3.5" + - name: HUGGINGFACE_TOKEN + valueFrom: + secretKeyRef: + name: hf-secrets + key: HUGGINGFACE_TOKEN + - name: POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + volumes: + - name: workdir + emptyDir: {} + - name: dshm + 
emptyDir: + medium: Memory + - name: s3-flux-pvc + persistentVolumeClaim: + claimName: s3-flux-pvc diff --git a/flux_serve/specs/flux-neuron-gradio.yaml b/flux_serve/specs/flux-neuron-gradio.yaml new file mode 100644 index 0000000..843c348 --- /dev/null +++ b/flux_serve/specs/flux-neuron-gradio.yaml @@ -0,0 +1,51 @@ +apiVersion: v1 +kind: Service +metadata: + name: flux-neuron-gradio +spec: + selector: + app: flux-neuron-gradio + ports: + - protocol: TCP + port: 8000 + targetPort: 8000 + type: ClusterIP +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + labels: + app: flux-neuron-gradio + name: flux-neuron-gradio +spec: + selector: + matchLabels: + app: flux-neuron-gradio + template: + metadata: + labels: + app: flux-neuron-gradio + spec: + nodeSelector: + alpha.eksctl.io/nodegroup-name: cova-ng + #serviceAccountName: appsimulator + containers: + - name: app + image: 920372998901.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + imagePullPolicy: Always + #command: ["sh", "-c", "while true; do sleep 3600; done"] + command: ["sh", "-c", "uvicorn cova_gradio:app --host=0.0.0.0"] + ports: + - containerPort: 8000 + protocol: TCP + readinessProbe: + httpGet: + path: /readiness + port: 8000 + initialDelaySeconds: 60 + periodSeconds: 10 + env: + - name: POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name diff --git a/flux_serve/specs/flux-neuron-ingress.yaml b/flux_serve/specs/flux-neuron-ingress.yaml new file mode 100644 index 0000000..d81c8a2 --- /dev/null +++ b/flux_serve/specs/flux-neuron-ingress.yaml @@ -0,0 +1,34 @@ +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: cova + annotations: + kubernetes.io/ingress.class: alb + alb.ingress.kubernetes.io/scheme: internet-facing + alb.ingress.kubernetes.io/target-type: ip + alb.ingress.kubernetes.io/healthcheck-path: /health + alb.ingress.kubernetes.io/healthcheck-interval-seconds: '10' + alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '9' + alb.ingress.kubernetes.io/healthy-threshold-count: '2' + alb.ingress.kubernetes.io/unhealthy-threshold-count: '10' + alb.ingress.kubernetes.io/success-codes: '200-301' + alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80}]' + alb.ingress.kubernetes.io/backend-protocol: HTTP + alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=10 + alb.ingress.kubernetes.io/load-balancer-name: cova + alb.ingress.kubernetes.io/actions.weighted-routing: > + {"type":"forward","forwardConfig":{"targetGroups":[{"serviceName":"flux-neuron-gradio","servicePort":8000,"weight":100}],"targetGroupStickinessConfig":{"enabled":true,"durationSeconds":200}}} + labels: + app: cova-neuron-gradio +spec: + ingressClassName: alb + rules: + - http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: weighted-routing + port: + name: use-annotation diff --git a/flux_serve/specs/flux-sa.yaml b/flux_serve/specs/flux-sa.yaml new file mode 100644 index 0000000..9f711b9 --- /dev/null +++ b/flux_serve/specs/flux-sa.yaml @@ -0,0 +1,7 @@ +apiVersion: v1 +kind: ServiceAccount +metadata: + name: flux-serviceaccount + namespace: default + annotations: + eks.amazonaws.com/role-arn: arn:aws:iam::920372998901:role/KarpenterNodeRole-flux-usw2 diff --git a/flux_serve/specs/load.yaml b/flux_serve/specs/load.yaml new file mode 100644 index 0000000..d845383 --- /dev/null +++ b/flux_serve/specs/load.yaml @@ -0,0 +1,31 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + labels: + app: load + name: load +spec: + selector: + matchLabels: + app: load 
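+  # Pod template: an endless curl loop that POSTs a text prompt to the
+  # flux-neuron-256x144-model-api Service every few seconds, generating
+  # sustained inference load against the 256x144 deployment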
+ template: + metadata: + labels: + app: load + spec: + nodeSelector: + alpha.eksctl.io/nodegroup-name: flux-usw2-ng + containers: + - name: load + image: public.ecr.aws/docker/library/python + imagePullPolicy: Always + command: + - /bin/bash + - -c + - -x + - | + SLEEP_TIME=2 + while true; do + curl -X POST -H "Content-Type: application/json" -d '{"prompt": "A majestic mountainscape in a surreal style","num_inference_steps": 10}' http://$FLUX_NEURON_256X144_MODEL_API_SERVICE_HOST:8000/generate + sleep $SLEEP_TIME + done diff --git a/flux_serve/specs/mllama-32-11b-vllm-trn1-config.yaml b/flux_serve/specs/mllama-32-11b-vllm-trn1-config.yaml new file mode 100644 index 0000000..adbed23 --- /dev/null +++ b/flux_serve/specs/mllama-32-11b-vllm-trn1-config.yaml @@ -0,0 +1,24 @@ +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: mllama-32-11b-vllm-trn1-config +data: + vllm_config.yaml: | + model: "yahavb/Llama-3.2-11B-Vision-Instruct-neuron-checkpoint" + tensor_parallel_size: 32 + max_num_seqs: 1 + block_size: 4096 + max_model_len: 128000 + override_neuron_config: + skip_warmup: true + context_encoding_buckets: [1024, 16384] + token_generation_buckets: [1024, 16384] + sequence_parallel_enabled: False + is_continuous_batching: True + on_device_sampling_config: + global_topk: 64 + dynamic: True + deterministic: False + device: "neuron" +--- diff --git a/flux_serve/specs/mllama-32-11b-vllm-trn1-deploy.yaml b/flux_serve/specs/mllama-32-11b-vllm-trn1-deploy.yaml new file mode 100644 index 0000000..2837eea --- /dev/null +++ b/flux_serve/specs/mllama-32-11b-vllm-trn1-deploy.yaml @@ -0,0 +1,110 @@ +apiVersion: v1 +kind: Service +metadata: + name: mllama-32-11b-vllm-trn1 +spec: + selector: + app: mllama-32-11b-vllm-trn1 + ports: + - port: 8000 + targetPort: 8000 + type: NodePort +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + labels: + app: mllama-32-11b-vllm-trn1 + name: mllama-32-11b-vllm-trn1 +spec: + selector: + matchLabels: + app: mllama-32-11b-vllm-trn1 + template: + metadata: + labels: + app: mllama-32-11b-vllm-trn1 + spec: + nodeSelector: + karpenter.sh/nodepool: amd-neuron-trn1 + serviceAccountName: appsimulator + #serviceAccountName: flux-serviceaccount + schedulerName: my-scheduler + volumes: + - name: dshm + emptyDir: + medium: Memory + containers: + - name: app + image: 891377065549.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + #image: 920372998901.dkr.ecr.us-west-2.amazonaws.com/model:amd64-neuron + imagePullPolicy: Always + volumeMounts: + - mountPath: /vllm_config.yaml + name: vllm-config-volume + subPath: vllm_config.yaml + - mountPath: /dev/shm + name: dshm + command: + - /bin/bash + - "-exc" + - | + set -x + pip install --upgrade pip + pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com + pip install --upgrade "neuronx-distributed==0.11.0" --extra-index-url https://pip.repos.neuron.amazonaws.com + pip install --upgrade "neuronx-distributed-inference>=0.2.0" --extra-index-url https://pip.repos.neuron.amazonaws.com + pip install --upgrade transformers accelerate protobuf sentence_transformers tenacity torch-neuron + git clone -b neuron-2.22-vllm-v0.7.2 https://github.com/aws-neuron/upstreaming-to-vllm.git + cd upstreaming-to-vllm + pip install -r requirements-neuron.txt + VLLM_TARGET_DEVICE="neuron" pip install -e . 
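+          # the Neuron fork of vLLM is now installed in editable mode;
+          # fetch the Hugging Face checkpoint, then start the API server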
+ cd / + python /download_hf_model.py + #python /mllama-offline.py + uvicorn vllm_model_api_m:app --host=0.0.0.0 + resources: + requests: + aws.amazon.com/neuron: 16 + limits: + aws.amazon.com/neuron: 16 + env: + - name: VLLM_NEURON_FRAMEWORK + value: "neuronx-distributed-inference" + - name: MODEL_ID + value: "yahavb/Llama-3.2-11B-Vision-Instruct" + - name: APP + value: "Llama-3.2-11B-Vision-Instruct-NxDI-TRN1" + - name: NODEPOOL + value: "amd-neuron-trn1" + - name: POD_NAME + valueFrom: + fieldRef: + fieldPath: metadata.name + - name: HUGGINGFACE_TOKEN + valueFrom: + secretKeyRef: + name: hf-secrets + key: HUGGINGFACE_TOKEN + - name: MAX_NEW_TOKENS + value: "50" + ports: + - containerPort: 8000 + protocol: TCP + readinessProbe: + failureThreshold: 3 + httpGet: + path: /readiness + port: 8000 + scheme: HTTP + initialDelaySeconds: 280 + periodSeconds: 10 + successThreshold: 1 + timeoutSeconds: 5 + volumes: + - name: vllm-config-volume + configMap: + name: mllama-32-11b-vllm-trn1-config + - name: dshm + emptyDir: + medium: Memory diff --git a/flux_serve/specs/neuron-bits-s3-storage.yaml b/flux_serve/specs/neuron-bits-s3-storage.yaml new file mode 100644 index 0000000..c26379b --- /dev/null +++ b/flux_serve/specs/neuron-bits-s3-storage.yaml @@ -0,0 +1,33 @@ +apiVersion: v1 +kind: PersistentVolume +metadata: + name: neuron-bits-use1-pv +spec: + capacity: + storage: 1200Gi + accessModes: + - ReadWriteMany + storageClassName: "" + claimRef: + namespace: default + name: neuron-bits-use1-pvc + mountOptions: + - region=us-east-1 + csi: + driver: s3.csi.aws.com + volumeHandle: s3-csi-driver-volume + volumeAttributes: + bucketName: neuron-bits-use1 +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: neuron-bits-use1-pvc +spec: + accessModes: + - ReadWriteMany # Supported options: ReadWriteMany / ReadOnlyMany + storageClassName: "" # Required for static provisioning + resources: + requests: + storage: 1200Gi + volumeName: neuron-bits-use1-pv
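+# Static provisioning: the PV's claimRef and the PVC's volumeName bind this
+# pair one-to-one (storageClassName stays empty). Pods consume the bucket by
+# setting persistentVolumeClaim.claimName: neuron-bits-use1-pvc, the same
+# pattern the model-api Deployments use with the s3-flux-pvc claim.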