A high-performance, OpenAI-compatible embedding service powered by the state-of-the-art Qwen/Qwen3-Embedding-8B model. Designed for RunPod Serverless and standard Docker deployments.
In the era of RAG (Retrieval-Augmented Generation) and semantic search, having a powerful embedding model is crucial. The Qwen3-Embedding-8B model is a powerhouse, offering a long 32K-token context window and strong semantic understanding.
However, deploying such a large model can be tricky and expensive. This project solves that by providing a production-ready, dual-mode API that runs anywhere:
- Serverless: Optimized for RunPod Serverless to scale to zero and save costs.
- Standalone: Standard Docker container for dedicated GPU instances or local testing.
It provides a drop-in replacement for OpenAI's embedding API (/v1/embeddings), making it instantly compatible with LangChain, LlamaIndex, and other frameworks.
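To illustrate that drop-in compatibility, here is a minimal sketch using the official OpenAI Python SDK against a self-hosted instance. The base URL and port (9292) match the Docker example later in this README and may differ for your deployment; the API key is just a placeholder value.

```python
# Minimal sketch: the OpenAI Python SDK pointed at the self-hosted
# /v1/embeddings endpoint. Adjust base_url for your deployment; the
# api_key below is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9292/v1", api_key="sk-placeholder")

response = client.embeddings.create(
    model="qwen3-embedding-8b",
    input=["Hello world", "Machine learning is amazing"],
)

for item in response.data:
    print(item.index, len(item.embedding))
```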
- 🔥 State-of-the-Art Model: Defaults to `Qwen/Qwen3-Embedding-8B` (configurable).
- ⚡ Serverless Native: Built-in `handler.py` optimized for RunPod's job queue architecture (see the sketch after this list).
- 🐳 Docker Ready: Production-grade `Dockerfile` with CUDA support.
- 🔌 OpenAI Compatible: Standard `/v1/embeddings` endpoint.
- 🎛️ Highly Configurable: Control quantization (4-bit/8-bit), batch sizes, and sequence lengths via environment variables.
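For orientation, a RunPod queue handler generally follows the shape below. This is only an illustrative sketch, not this repo's actual `handler.py`; `embed_texts` is a hypothetical stand-in for the real Qwen3 encoding step.

```python
# Illustrative sketch of a RunPod Serverless handler (not this repo's
# actual handler.py). embed_texts() is a hypothetical stand-in for the
# real Qwen3-Embedding-8B forward pass.
import runpod

def embed_texts(texts):
    # Placeholder: the real service encodes the texts with the model here.
    return [[0.0] * 1024 for _ in texts]

def handler(job):
    """Handle one job from RunPod's queue and return embedding vectors."""
    job_input = job["input"]
    texts = job_input.get("input", [])
    if isinstance(texts, str):
        texts = [texts]
    vectors = embed_texts(texts)
    return {
        "object": "list",
        "data": [
            {"object": "embedding", "index": i, "embedding": vec}
            for i, vec in enumerate(vectors)
        ],
        "model": job_input.get("model", "qwen3-embedding-8b"),
    }

# Register the handler with RunPod's serverless runtime.
runpod.serverless.start({"handler": handler})
```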
RunPod Serverless allows you to pay only for the seconds your model is actually generating embeddings.
First, click on +New Endpoint and select Repository. In our case, choose the embedding-api repository.
Leave the branch as main and keep the default Dockerfile path, or provide your own.
- By doing this, you don't need to build the image yourself; RunPod builds it for you.
- If you have a private repository, you can provide credentials in the repository settings.
Proceed to the next step by clicking Next.
Choose the options below:
- Select a GPU (RTX 3090/4090 or A100 recommended; anything with more than 24 GB of VRAM works well). Leave the rest as is, then set the following environment variables:
  - `MODEL_NAME`: `Qwen/Qwen3-Embedding-8B`
  - `QUANTIZATION`: `8bit`
Then click on Deploy Endpoint.
By default, RunPod sets the GPU worker count to 3; you can change the worker settings as below:
- Max workers: 1
- Active workers: 1
- GPU count: 1
Click on Save.
It will take a few minutes to build the image and start it on a GPU.
Go to Settings in the left sidebar --> click API Keys --> click Create API Key --> copy the API key.
Once the endpoint is ready, copy your Endpoint ID and API Key.
curl https://api.runpod.ai/v2/{YOUR_ENDPOINT_ID}/runsync \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer {YOUR_API_KEY}" \
  -d '{
    "input": {
      "input": ["Hello world", "Machine learning is amazing"],
      "model": "qwen3-embedding-8b"
    }
  }'
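The same request can be made from Python with the `requests` library. The endpoint ID and API key below are placeholders; since the exact shape of the handler's output is deployment-specific, the sketch simply prints the raw JSON response.

```python
# Sketch: calling the RunPod /runsync endpoint from Python.
# ENDPOINT_ID and API_KEY are placeholders to fill in.
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"
API_KEY = "YOUR_API_KEY"

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    json={
        "input": {
            "input": ["Hello world", "Machine learning is amazing"],
            "model": "qwen3-embedding-8b",
        }
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```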
Perfect for local development or sustained high-throughput workloads on a rented GPU server.

- Run Container

  docker run --gpus all -p 9292:9292 \
    -e MODEL_NAME="Qwen/Qwen3-Embedding-8B" \
    -e QUANTIZATION="8bit" \
    yourusername/qwen3-embedding:latest

  Note: To run the standard server, you can override the command with `python main.py` if needed, but the image is set up to be flexible.

- Invoke

  curl http://localhost:9292/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{ "input": "Hello world", "model": "qwen3-embedding-8b" }'
All settings are managed via environment variables.
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `Qwen/Qwen3-Embedding-8B` | The HuggingFace model ID to load. |
| `QUANTIZATION` | `8bit` | Memory optimization: `4bit`, `8bit`, or `none`. |
| `EMBED_BATCH_SIZE` | `16` | Number of texts to process in parallel. |
| `MAX_SEQ_LENGTH` | `32768` | Maximum token context length. |
| `MAX_EMBED_DIM` | `1024` | Output embedding dimension size. |
| `DEVICE` | `auto` | `cuda` or `cpu`. |
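For reference, the table above maps to defaults roughly like the sketch below. This is illustrative only, not the project's actual configuration code; it simply mirrors the variable names and defaults listed in the table.

```python
# Illustrative sketch of reading the documented settings from the
# environment, with the defaults from the table above.
import os

MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen3-Embedding-8B")
QUANTIZATION = os.environ.get("QUANTIZATION", "8bit")        # "4bit", "8bit", or "none"
EMBED_BATCH_SIZE = int(os.environ.get("EMBED_BATCH_SIZE", "16"))
MAX_SEQ_LENGTH = int(os.environ.get("MAX_SEQ_LENGTH", "32768"))
MAX_EMBED_DIM = int(os.environ.get("MAX_EMBED_DIM", "1024"))
DEVICE = os.environ.get("DEVICE", "auto")                    # "cuda" or "cpu"
```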
This project is licensed under the Apache 2.0 License.