A high-performance, OpenAI-compatible embedding service powered by the state-of-the-art Qwen/Qwen3-Embedding-8B model. Designed for RunPod Serverless and standard Docker deployments.
In the era of RAG (Retrieval-Augmented Generation) and semantic search, having a powerful embedding model is crucial. The Qwen3-Embedding-8B model is a powerhouse, offering a long 32K-token context window and strong semantic understanding.
However, deploying such a large model can be tricky and expensive. This project solves that by providing a production-ready, dual-mode API that runs anywhere:
- Serverless: Optimized for RunPod Serverless to scale to zero and save costs.
- Standalone: Standard Docker container for dedicated GPU instances or local testing.
It provides a drop-in replacement for OpenAI's embedding API (/v1/embeddings), making it instantly compatible with LangChain, LlamaIndex, and other frameworks.
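To illustrate that drop-in compatibility, here is a minimal sketch using the official OpenAI Python SDK against a self-hosted instance. The base URL and port (9292) match the Docker example later in this README and may differ for your deployment; the API key is just a placeholder value.

```python
# Minimal sketch: the OpenAI Python SDK pointed at the self-hosted
# /v1/embeddings endpoint. Adjust base_url for your deployment; the
# api_key below is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9292/v1", api_key="sk-placeholder")

response = client.embeddings.create(
    model="qwen3-embedding-8b",
    input=["Hello world", "Machine learning is amazing"],
)

for item in response.data:
    print(item.index, len(item.embedding))
```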
- 🔥 State-of-the-Art Model: Defaults to `Qwen/Qwen3-Embedding-8B` (configurable).
- ⚡ Serverless Native: Built-in `handler.py` optimized for RunPod's job queue architecture (see the sketch after this list).
- 🐳 Docker Ready: Production-grade `Dockerfile` with CUDA support.
- 🔌 OpenAI Compatible: Standard `/v1/embeddings` endpoint.
- 🎛️ Highly Configurable: Control quantization (4-bit/8-bit), batch sizes, and sequence lengths via environment variables.
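For orientation, a RunPod queue handler generally follows the shape below. This is only an illustrative sketch, not this repo's actual `handler.py`; `embed_texts` is a hypothetical stand-in for the real Qwen3 encoding step.

```python
# Illustrative sketch of a RunPod Serverless handler (not this repo's
# actual handler.py). embed_texts() is a hypothetical stand-in for the
# real Qwen3-Embedding-8B forward pass.
import runpod

def embed_texts(texts):
    # Placeholder: the real service encodes the texts with the model here.
    return [[0.0] * 1024 for _ in texts]

def handler(job):
    """Handle one job from RunPod's queue and return embedding vectors."""
    job_input = job["input"]
    texts = job_input.get("input", [])
    if isinstance(texts, str):
        texts = [texts]
    vectors = embed_texts(texts)
    return {
        "object": "list",
        "data": [
            {"object": "embedding", "index": i, "embedding": vec}
            for i, vec in enumerate(vectors)
        ],
        "model": job_input.get("model", "qwen3-embedding-8b"),
    }

# Register the handler with RunPod's serverless runtime.
runpod.serverless.start({"handler": handler})
```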
RunPod Serverless allows you to pay only for the seconds your model is actually generating embeddings.
First, click on +New Endpoint and select Repository. In our case, choose the embedding-api repository.
Leave the branch as main and keep the default Dockerfile path, or provide your own.
- By doing this, you don't need to build the image yourself; RunPod builds it for you.
- If you have a private repository, you can provide credentials in the repository settings.
Proceed to the next step by clicking Next.
Choose the options below:
- Select a GPU (RTX 3090/4090 or A100 recommended; anything with more than 24 GB of VRAM works well). Leave the rest as is, then set the following environment variables:
  - `MODEL_NAME`: `Qwen/Qwen3-Embedding-8B`
  - `QUANTIZATION`: `8bit`
Then click on Deploy Endpoint.
By default, RunPod sets the GPU worker count to 3; you can change the worker settings as below:
- Max workers: 1
- Active workers: 1
- GPU count: 1
Click on Save.
It will take a few minutes to build the image and start it on a GPU.
Go to Settings in the left sidebar --> click API Keys --> click Create API Key --> copy the API key.
Once the endpoint is ready, copy your Endpoint ID and API Key.
curl https://api.runpod.ai/v2/{YOUR_ENDPOINT_ID}/runsync \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer {YOUR_API_KEY}" \
  -d '{
    "input": {
      "input": ["Hello world", "Machine learning is amazing"],
      "model": "qwen3-embedding-8b"
    }
  }'
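The same request can be made from Python with the `requests` library. The endpoint ID and API key below are placeholders; since the exact shape of the handler's output is deployment-specific, the sketch simply prints the raw JSON response.

```python
# Sketch: calling the RunPod /runsync endpoint from Python.
# ENDPOINT_ID and API_KEY are placeholders to fill in.
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"
API_KEY = "YOUR_API_KEY"

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    json={
        "input": {
            "input": ["Hello world", "Machine learning is amazing"],
            "model": "qwen3-embedding-8b",
        }
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```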
Perfect for local development or sustained high-throughput workloads on a rented GPU server.

- Run Container

  docker run --gpus all -p 9292:9292 \
    -e MODEL_NAME="Qwen/Qwen3-Embedding-8B" \
    -e QUANTIZATION="8bit" \
    yourusername/qwen3-embedding:latest

  Note: To run the standard server, you can override the command with `python main.py` if needed, but the image is set up to be flexible.

- Invoke

  curl http://localhost:9292/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{ "input": "Hello world", "model": "qwen3-embedding-8b" }'
All settings are managed via environment variables.
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `Qwen/Qwen3-Embedding-8B` | The HuggingFace model ID to load. |
| `QUANTIZATION` | `8bit` | Memory optimization: `4bit`, `8bit`, or `none`. |
| `EMBED_BATCH_SIZE` | `16` | Number of texts to process in parallel. |
| `MAX_SEQ_LENGTH` | `32768` | Maximum token context length. |
| `MAX_EMBED_DIM` | `1024` | Output embedding dimension size. |
| `DEVICE` | `auto` | `cuda` or `cpu`. |
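For reference, the table above maps to defaults roughly like the sketch below. This is illustrative only, not the project's actual configuration code; it simply mirrors the variable names and defaults listed in the table.

```python
# Illustrative sketch of reading the documented settings from the
# environment, with the defaults from the table above.
import os

MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen3-Embedding-8B")
QUANTIZATION = os.environ.get("QUANTIZATION", "8bit")        # "4bit", "8bit", or "none"
EMBED_BATCH_SIZE = int(os.environ.get("EMBED_BATCH_SIZE", "16"))
MAX_SEQ_LENGTH = int(os.environ.get("MAX_SEQ_LENGTH", "32768"))
MAX_EMBED_DIM = int(os.environ.get("MAX_EMBED_DIM", "1024"))
DEVICE = os.environ.get("DEVICE", "auto")                    # "cuda" or "cpu"
```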
This project is licensed under the Apache 2.0 License.