
🚀 Serverless Qwen3 Embedding API


A high-performance, OpenAI-compatible embedding service powered by the state-of-the-art Qwen/Qwen3-Embedding-8B model. Designed for RunPod Serverless and standard Docker deployments.


📖 Introduction

In the era of RAG (Retrieval-Augmented Generation) and semantic search, having a powerful embedding model is crucial. The Qwen3-Embedding-8B model is a beast, offering a massive context window and superior semantic understanding.

However, deploying such a large model can be tricky and expensive. This project solves that by providing a production-ready, dual-mode API that runs anywhere:

  1. Serverless: Optimized for RunPod Serverless to scale to zero and save costs.
  2. Standalone: Standard Docker container for dedicated GPU instances or local testing.

It provides a drop-in replacement for OpenAI's embedding API (/v1/embeddings), making it instantly compatible with LangChain, LlamaIndex, and other frameworks.
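Because the endpoint mirrors /v1/embeddings, any OpenAI client should work unchanged. Below is a minimal sketch using the official openai Python package, assuming the service is reachable at http://localhost:9292 (see the standalone Docker option below); the api_key value is a placeholder, on the assumption that the local server does not validate it.

    from openai import OpenAI

    # Point the standard OpenAI client at this service instead of api.openai.com
    client = OpenAI(
        base_url="http://localhost:9292/v1",
        api_key="not-needed",  # assumption: the local server ignores the key
    )

    response = client.embeddings.create(
        model="qwen3-embedding-8b",
        input=["Hello world", "Machine learning is amazing"],
    )
    print(len(response.data[0].embedding))  # vector dimension (1024 by default)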


✨ Key Features

  • 🔥 State-of-the-Art Model: Defaults to Qwen/Qwen3-Embedding-8B (configurable).
  • ⚡ Serverless Native: Built-in handler.py optimized for RunPod's job queue architecture (see the sketch after this list).
  • 🐳 Docker Ready: Production-grade Dockerfile with CUDA support.
  • 🔌 OpenAI Compatible: Standard /v1/embeddings endpoint.
  • 🎛️ Highly Configurable: Control quantization (4-bit/8-bit), batch sizes, and sequence lengths via environment variables.
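
For context, here is a hedged sketch of the RunPod handler pattern this project builds on. The actual handler.py will differ; the embed() helper is a hypothetical stand-in for the real model call, and the OpenAI-shaped return value is an assumption based on the compatibility described above.

    import runpod

    def embed(texts):
        # Hypothetical stand-in: the real handler runs the texts through
        # the loaded Qwen/Qwen3-Embedding-8B model here.
        return [[0.0] * 1024 for _ in texts]

    def handler(job):
        """Receive a job from RunPod's queue and return embeddings."""
        payload = job["input"]       # e.g. {"input": [...], "model": "..."}
        texts = payload["input"]
        if isinstance(texts, str):   # accept a single string, as OpenAI does
            texts = [texts]
        vectors = embed(texts)
        return {
            "object": "list",
            "data": [
                {"object": "embedding", "index": i, "embedding": v}
                for i, v in enumerate(vectors)
            ],
            "model": payload.get("model", "qwen3-embedding-8b"),
        }

    # Register the handler with RunPod's serverless job queue
    runpod.serverless.start({"handler": handler})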

🚀 Deployment Guide

Option 1: RunPod Serverless (Recommended)

RunPod Serverless allows you to pay only for the seconds your model is actually generating embeddings.

Step 1: Create Endpoint and Select Repository

First, click on + New Endpoint and select Repository. In our case, choose the embedding-api repository.

*(Screenshot: Template Configuration)*

Step 2: Configure Repository

Leave the branch as main and the Dockerfile path as is, or provide your own Dockerfile path.

  • This way, you don't need to build the image yourself; RunPod builds it for you.
  • If your repository is private, you can provide credentials in the repository settings.

Proceed to the next step by clicking Next.

*(Screenshot: Template Configuration)*

Step 3: GPU Selection and Environment Variables

Choose the following options:

  1. Select a GPU (RTX 3090/4090 or A100 recommended; anything with more than 24 GB of VRAM works well). Leave the rest as is, then set the environment variables below.
  2. MODEL_NAME: Qwen/Qwen3-Embedding-8B
  3. QUANTIZATION: 8bit

Then click on Deploy Endpoint.

*(Screenshot: Endpoint Deployment)*

Step 4: Edit endpoint to configure default GPU settings

By default, RunPod sets the GPU workers to 3; you can change them as follows:

  1. Max workers: 1
  2. Active workers: 1
  3. GPU count: 1

Leave the rest as is.

*(Screenshot: Endpoint Edit)*

Click on Save.

It will take a few minutes to build the image and then start it on a GPU.

Step 5: Get API Key

Go to Settings in the left sidebar --> click API Keys --> click Create API Key --> copy the API key.

*(Screenshot: API Key)*

Step 6: Invoke API

Once the endpoint is ready, copy your Endpoint ID and API Key.

*(Screenshot: API Invocation)*

curl https://api.runpod.ai/v2/{YOUR_ENDPOINT_ID}/runsync \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer {YOUR_API_KEY}" \
  -d '{
    "input": {
        "input": ["Hello world", "Machine learning is amazing"],
        "model": "qwen3-embedding-8b"
    }
  }'
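
The same call from Python, as a minimal sketch using requests. RunPod's runsync wraps the handler's result in an output field; the exact shape of output below is an assumption based on the OpenAI-compatible response format.

    import os
    import requests

    ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]  # your endpoint ID
    API_KEY = os.environ["RUNPOD_API_KEY"]          # your RunPod API key

    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "input": {
                "input": ["Hello world", "Machine learning is amazing"],
                "model": "qwen3-embedding-8b",
            }
        },
        timeout=300,
    )
    resp.raise_for_status()
    job = resp.json()

    # Assumption: job["output"] follows the OpenAI embeddings response shape
    embeddings = [item["embedding"] for item in job["output"]["data"]]
    print(len(embeddings), "vectors of dimension", len(embeddings[0]))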

Option 2: Standalone Docker (Dedicated GPU)

Perfect for local development or sustained high-throughput workloads on a rented GPU server.

  1. Run Container

    docker run --gpus all -p 9292:9292 \
      -e MODEL_NAME="Qwen/Qwen3-Embedding-8B" \
      -e QUANTIZATION="8bit" \
      yourusername/qwen3-embedding:latest

    Note: If needed, you can override the container command with python main.py to run the standard server; the image is set up to be flexible.

  2. Invoke

    curl http://localhost:9292/v1/embeddings \
      -H "Content-Type: application/json" \
      -d '{
        "input": "Hello world",
        "model": "qwen3-embedding-8b"
      }'
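
Since the endpoint is OpenAI-compatible, frameworks such as LangChain can use it directly. A minimal sketch, assuming the langchain-openai package and the local container from step 1; api_key is a placeholder, and check_embedding_ctx_length=False skips client-side tiktoken length checks that do not apply to a non-OpenAI model.

    from langchain_openai import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings(
        model="qwen3-embedding-8b",
        base_url="http://localhost:9292/v1",
        api_key="not-needed",               # assumption: the local server ignores the key
        check_embedding_ctx_length=False,   # send raw strings instead of tiktoken tokens
    )

    vectors = embeddings.embed_documents(["Hello world", "Machine learning is amazing"])
    query_vector = embeddings.embed_query("What is machine learning?")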

⚙️ Configuration

All settings are managed via environment variables.

| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `Qwen/Qwen3-Embedding-8B` | The Hugging Face model ID to load. |
| `QUANTIZATION` | `8bit` | Memory optimization: `4bit`, `8bit`, or `none`. |
| `EMBED_BATCH_SIZE` | `16` | Number of texts to process in parallel. |
| `MAX_SEQ_LENGTH` | `32768` | Maximum token context length. |
| `MAX_EMBED_DIM` | `1024` | Output embedding dimension size. |
| `DEVICE` | `auto` | `cuda` or `cpu`. |

📄 License

This project is licensed under the Apache 2.0 License.
