
Photofield AI

Experimental machine learning API supporting Photofield.

Report Bug · Request Feature

Table of Contents
  1. About
  2. Getting Started
  3. Usage
  4. Configuration
  5. Troubleshooting
  6. Development
  7. Contributing
  8. License
  9. Acknowledgements

About

Photofield AI is a machine learning companion service for Photofield, providing semantic image search capabilities through OpenAI CLIP embeddings. It's a separate REST API service designed to keep the main app lightweight while leveraging Python's AI ecosystem.

Quick Start: docker run -it -p 8081:8081 ghcr.io/smilyorg/photofield-ai:latest

Features

  • Fast CLIP Embeddings - Convert images and text to semantic vectors for similarity search
  • High Performance - ~20 req/sec (i7-5820K CPU), ~200 req/sec (GTX 1070 Ti GPU)
  • Multiple Models - Support for various CLIP model sizes and quantization levels
  • Easy Integration - Simple REST API with multipart image uploads
  • Docker Ready - Pre-built images available on GitHub Container Registry
  • Modern Stack - Built with FastAPI, ONNX Runtime, and Python 3.13+

Run uv run python benchmark.py to benchmark on your own hardware.

Limitations

The current REST API is designed for Photofield integration. The CLIP model itself has some limitations as noted by OpenAI:

CLIP currently struggles with respect to certain tasks such as fine grained classification and counting objects. CLIP also poses issues with regards to fairness and bias which we discuss in the paper and briefly in the next section.

See the CLIP: Model Use section for more details on responsible model usage.

Built With

  • FastAPI
  • ONNX Runtime
  • Python 3.13+

Getting Started

Choose one of the following methods to run Photofield AI:

Quick Start with Docker (Recommended)

The easiest way to get started is using Docker:

docker run -it -p 8081:8081 ghcr.io/smilyorg/photofield-ai:latest

The clip-vit-base-patch32-(visual|textual)-float16 models are bundled for an out-of-the-box experience.

Note: The Docker image is currently CPU-only. GPU support contributions are welcome!

Verify it's running

Open http://localhost:8081/docs in your browser to see the API documentation.
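
You can also hit the API directly from the command line, e.g. with curl (assuming the default port; the payload matches the /text-embeddings example further below):

curl -X POST http://localhost:8081/text-embeddings \
  -H "Content-Type: application/json" \
  -d '{"texts": ["hawk"]}'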

Connect with Photofield

Add the following to your Photofield configuration.yaml:

ai:
  # photofield-ai API server URL
  host: http://localhost:8081

Install with uv (Development)

For development or customization, install from source using uv:

Prerequisites

  • Python 3.13 or later
  • uv (recommended) or pip

Quick Install

# Clone the repository
git clone https://github.com/smilyorg/photofield-ai.git
cd photofield-ai

# Install dependencies
uv sync

# Run the server
uv run python main.py

Or use Task for convenience:

task run    # Run the server
task watch  # Run with auto-reload

First Run

On first run, the server will download the default models (~300MB) and then start:

❯ uv run python main.py
Available providers: CPUExecutionProvider
Using providers: CPUExecutionProvider
Loading visual model: models/clip-vit-base-patch32-visual-float16.onnx
Loading textual model: models/clip-vit-base-patch32-textual-float16.onnx
Visual inference ready, input size 224, type tensor(float16)
Textual inference ready, input size 77, type tensor(int32)
Listening on 0.0.0.0:8081

Build Your Own Docker Image

If you want to build the Docker image locally:

# Build the image
task docker

# Or manually
docker build -t photofield-ai .

# Run your local build
docker run -it -p 8081:8081 photofield-ai

Usage

Some request/response examples are listed below. If you use the neat REST Client extension for VSCode you can even execute them directly if you open the README 😎. See examples.http for more.

{{api}} refers to the root URL of the API; the following defines it for the REST Client extension:

@api = http://localhost:8081

Embed Text

The /text-embeddings endpoint accepts a list of text strings that are converted to embeddings by the textual model.

Request

POST {{api}}/text-embeddings HTTP/1.1
Content-Type: application/json

{
    "texts": ["hawk"]
}

Response

{
  "texts": [
    {
      "text": "hawk",
      "embedding_f16_b64": "/7SvLO0xxqbgs4urVqq/vT21ozEAsGcu060XMxY0WyJbOBczIrJ0rkM4UB+Mrag007PhMuspxzTLshAxEjDgtXOsOzbrMRYwii2UKwS1IzNlsKUum6posPIwTS4KqG60arLbJgWu/K8hNaW1ry1QMx0hPpposuox2jMSsCixL6KNJQCw/LGttB4xwyoxM72QUa14NrsyMapespypirRHuHW177F5MNI0JiGfH+CxsbATMSqzIa94Mpy+eDcONnWvki27s4AqK7MlNsKgnjUpJ6Cy8y5snAqtTrB4JASxHCr4NF4wa6lMtHS2PK3HL1CtUjLKtXcuyTQlMQY0h7JZqU+1bTMpryYwArD7RSoonrR4rFEugTVaNssw8bBKKsyxzbKTLT22Jy8Rscy1BCmvsTSwT654pwkrii8gLkK116kDpvCwJ63frxYl9K+2sCk02DQ+LdS0C6dSrH0uOrUqL3e3vqa7sA4snjVgM48xeicYMMEoyy6AMAsjB69ft7OiDzC1MLKwtKbVIpqn6KjatmIsrjD8rY2xWa7MKW0fXLePuTWpdza/MYi5aB3atZivSqyYqC4kI60pKda0qbHbqLM126IFNA4xtrXKojyvDCynNGQioi4pLU01sp5/tcmoyasdODmuN7StLBawsqZKuYSsULSgtZE18i9SOLy5a7JksecrrS2dLGwx5q4etdm1rrQELLovxbY+prI0I7dLuXoxH66JKEkwszQitCSyia1cMfgwNK3lqDg0XzHlNrooTbhos9KyOR6GMYM1gbC8KyiwM7MoM+Oo7a7RLZ4mSx2wpSKy2qccMgEyFTS1tYMc+EWqNgayvzIfMrKrlDg3mV44GTBDKDg42jIOLBewtChhLNbAerFTNGywyDP3NjkxnTCgr8kyQrPoKLytxCo/tKYsQrDoJJSx2K7jNO8pjLL8KpirojVoMxWtV6msMjcyYDIqsIypYDSvNSO2ZK4+rg21EKE3owC0ALMSMwMu2KeKtXE0rbLPpsW20yUiLWuv/bUdsVuoKDYtKBSv9K4FJ560arXxNVs1RKwCsHCnkjhktam1rTGUn6yvdimYOKUHRDbHsHU38TdGKhE1UTKsLuYqxyVGq5UwaglCtEu0HiVasgg5ubQdrrUzLCw2Ji8yCa3ksnYpYzJlsF4y3beDKlaxFrRLnX+0W6IsOHO1TaE2s4gwKrEUNMWv5yFhsRosAzTVNtoxJrZKMCIoaavTM5UvMyyvLfM0D6zTtc60cK7Hlm62/iubLZgyODibrOgwN62auL0opzPiMvcvsK3KMpmtEbNIta+lYSysOM+vEDPqMOIzJKkOtCsq6bS3MqM3XixvsEC0Ni2HrbC1Bay4rg==",
      "embedding_inv_norm_f16_uint16": 11810
    }
  ]
}
  • embedding_f16_b64 - the embedding produced by the machine learning model. It is a base64-encoded list of 512 (or more, depending on the model) float16 2-byte floating-point values. You can compare this embedding to any other text or image embedding via cosine similarity (normalized dot product) to get the semantic similarity between them.

  • embedding_inv_norm_f16_uint16 - the Euclidean / L2 norm of the embedding (vector length), inverted, converted to float16 and written out as a uint16 integer holding the raw float16 bits. This precomputed value comes in handy when computing cosine similarity for semantic image search, as you can skip recomputing the norm for each image embedding (see the sketch below).
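
As a minimal illustration of how these two fields fit together, here is a Python sketch (not part of this repository; it assumes the requests and numpy packages, the default server URL, and little-endian float16 byte order) that embeds two texts and compares them with cosine similarity:

import base64

import numpy as np
import requests

API = "http://localhost:8081"

def text_embeddings(texts):
    # Ask the textual model for embeddings of the given strings.
    resp = requests.post(f"{API}/text-embeddings", json={"texts": texts})
    resp.raise_for_status()
    results = []
    for item in resp.json()["texts"]:
        # Decode the base64 payload into a float16 vector.
        vec = np.frombuffer(base64.b64decode(item["embedding_f16_b64"]), dtype=np.float16)
        # Reinterpret the uint16 value as the precomputed float16 inverse norm.
        inv_norm = np.array([item["embedding_inv_norm_f16_uint16"]], dtype=np.uint16).view(np.float16)[0]
        results.append((vec.astype(np.float32), float(inv_norm)))
    return results

(a, inv_a), (b, inv_b) = text_embeddings(["hawk", "eagle"])

# Cosine similarity via the precomputed inverse norms, so neither
# vector norm has to be recomputed here.
print(float(np.dot(a, b)) * inv_a * inv_b)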

Embed Images

The /image-embeddings endpoint accepts one or more multipart form image uploads and computes an embedding for each using the visual model.

@image = heavy-industry.jpg

Request

POST {{api}}/image-embeddings
Content-Type: multipart/form-data; boundary=------------------------23f534be8db8eca0

--------------------------23f534be8db8eca0
Content-Disposition: form-data; name="image"; filename="heavy-industry.jpg"
Content-Type: image/jpeg

< {{image}}

--------------------------23f534be8db8eca0

Response

{
  "images": [
    {
      "field": "image",
      "filename": "heavy-industry.jpg",
      "embedding_f16_b64": "VykzsNCu7CQ2NLCqkjQjspWtlDeXrmm4z6htMZ6z7ayNO2M2xbVrqjU0gK3uN3c0hZiRMDY6jDTXm2GkUrZUrXarLzZINGonQ7ScuI64QTk+tH64LbhrNOU2BLU6tfYyjrT1t5QtYKgOMfi3b6lxty2437JWNA+x1TceMtQtNbBVqXegt6ttMMUtXLFaseo0H7BEt+U1bzF+tPGs0iAmsPAtCDFus/m0m64SNo8ujTDDJoIodi/vMY3Fa5meLMy3/LECOUWv/7QXIcOwO7o2K+03nLIev0A1YR1KNBkwnrbptuszlDWcrrkyOrP4Mdk7dzCGrk8tmTciMvWyti5iNGieNTMyMyWsJrmdOoAulyzGOHEvoLOYrE8nvy9TtQsym7HGJDS4MDVcsRsuyR2CN/wo7bIHODc4HDvPNWgiCy+kqkSkmrdtsuuyVDXhu+41r7FWLKc44a1rsFSx8bOJMnywRjLrNt4xNyYSpmgoLrSnsLA0KDUbq1gp7rSCLZkg+bExOIq01rAaqtuxM6yqtB4wELQ4sqC2UrYSrCe1WqzdtGOsEzbZqYE0oqBZMSo087VBNBO6UqorNuo0iLiktLO5gjYtNz+237LLomupHTgTMWkrQRyDPC80WjFQtCEwxjqaME23rytRr0E7Oy+SMma6+7B9ILwhaTjJM2Mwkjc+tKOwWT+SN5k2JKsAMu89LK4ttxqty7QVL7OuobUkolikITQ3trutbbHSGEK1cjQ5MwMszq5Itqq01LBzPFE0abfSJwe0XDB1t9M2orNytEg2QClbNW00Frmer9uz5qIVLPO0+CRxLUk1rTIOMBUwmzqrNlg3pjOGrz824bhutI40RT0stMS1yLK9LjKtwLAvuHevtzAlOioz/LEVuJa2CLgDMbY2PycplR00VLGPHvEws7gTMTMuxyh2NmI1srE6qFMse6yYtKm1QDUcs6Esz7jmsM2zUa8ltA4t/p0buGM2f6pHpw+127S7I4cwKa+1sMGhRbKiNWi06ytKtqCkh7Y7NoqxdCIJM+i3wLSYs+uvUTTPtFU1mSC4tE83UjMdMJitpzUZM7Qz8hn4tiMsbbJjK+m6BTDUsmc0+7MqLiI23DS2LO8mUCi+rB00zS1JrZKvKLDkM1S0mjCEOMqy3zeZKWk2mbD0uZWwDjXbpzU0/zoeMhS56qi5NIywXbilMi619TaCrrO3rawoqoOyOTHarBolarJFMGQySrf/sVWn5Tm0LXqvHLeCsssuVixJOF2uN5DMJi4vL7SSNGwqeLOEtHmtnThJOSowWjWsMEQ4bbiVK+qsmzNFuvwoM7amtjeqC7VWJh847rgvpIE0OToiuDSzE6R8M8U4wbQ1sA==",
      "embedding_inv_norm_f16_uint16": 11922
    }
  ]
}
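
The same upload can be driven from Python as well; this sketch (the field name and file path are only examples, and it assumes the requests and numpy packages) posts one image and decodes its embedding:

import base64

import numpy as np
import requests

API = "http://localhost:8081"

# Upload one image as a multipart form field; the field name ("image" here)
# is echoed back as "field" in the response.
with open("heavy-industry.jpg", "rb") as f:
    resp = requests.post(
        f"{API}/image-embeddings",
        files={"image": ("heavy-industry.jpg", f, "image/jpeg")},
    )
resp.raise_for_status()

item = resp.json()["images"][0]
embedding = np.frombuffer(base64.b64decode(item["embedding_f16_b64"]), dtype=np.float16)
print(item["filename"], embedding.shape)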

Configuration

You can configure the app via the environment variables listed below; an example invocation follows the list.

  • PHOTOFIELD_AI_HOST (default: 0.0.0.0) - The host the server will listen on.
  • PHOTOFIELD_AI_PORT (default: 8081) - The port the server will listen on.
  • PHOTOFIELD_AI_MODELS_DIR (default: models/) - The directory models will be downloaded to if a URL is provided.
  • PHOTOFIELD_AI_VISUAL_MODEL (default: https://huggingface.co/mlunar/clip-variants/resolve/main/models/clip-vit-base-patch32-visual-float16.onnx) - URL or local file path to the visual ONNX CLIP model used for image embedding. If a URL is provided, the model is first downloaded to PHOTOFIELD_AI_MODELS_DIR if it doesn't exist there already. If a local path is provided, the model is used as is.
  • PHOTOFIELD_AI_TEXTUAL_MODEL (default: https://huggingface.co/mlunar/clip-variants/resolve/main/models/clip-vit-base-patch32-textual-float16.onnx) - Same as PHOTOFIELD_AI_VISUAL_MODEL, but for the textual model used for text embedding.
  • PHOTOFIELD_AI_RUNTIME (default: all) - all enables all available ONNX Runtime providers, making use of any GPU or other accelerator device if you have the right ONNX Runtime prerequisites installed. cpu forces CPU-only execution, which is faster to start up and develop with, but is usually ~10x slower than a GPU at inference; cpu is a shortcut for PHOTOFIELD_AI_PROVIDERS=CPUExecutionProvider.
  • PHOTOFIELD_AI_PROVIDERS (default: unset) - If PHOTOFIELD_AI_RUNTIME is not set, use this to specify the ONNX providers to use directly, comma-delimited. For example: CUDAExecutionProvider,CPUExecutionProvider.
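
For example, to run a CPU-only server on a different port (the values are only illustrative):

# Local run with uv
PHOTOFIELD_AI_RUNTIME=cpu PHOTOFIELD_AI_PORT=8082 uv run python main.py

# The same configuration via Docker
docker run -it -p 8082:8082 \
  -e PHOTOFIELD_AI_RUNTIME=cpu -e PHOTOFIELD_AI_PORT=8082 \
  ghcr.io/smilyorg/photofield-ai:latest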

Models

For PHOTOFIELD_AI_VISUAL_MODEL and PHOTOFIELD_AI_TEXTUAL_MODEL you can use any model from the clip-variants collection.

The bigger models are likely to be better, though it probably depends on your use case. Different model types (e.g. different patch sizes) most likely won't be compatible with each other, but combining different data types (e.g. a float16 visual model with a float32 textual model) might work fine.

Note that the qint8 models don't seem to work right now, so use the quint8 ones instead.
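
For example, to try the patch16 variants instead of the bundled patch32 ones (the URLs below assume the clip-variants filenames follow the same pattern as the defaults; adjust them to the actual file names in that repository):

PHOTOFIELD_AI_VISUAL_MODEL=https://huggingface.co/mlunar/clip-variants/resolve/main/models/clip-vit-base-patch16-visual-float16.onnx \
PHOTOFIELD_AI_TEXTUAL_MODEL=https://huggingface.co/mlunar/clip-variants/resolve/main/models/clip-vit-base-patch16-textual-float16.onnx \
uv run python main.py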

Troubleshooting

Server won't start

Issue: Server fails to start or crashes immediately

Solutions:

  • Ensure you have Python 3.13 or later: python --version
  • Make sure dependencies are installed: uv sync
  • Check if port 8081 is already in use: lsof -i :8081 (Linux/Mac) or netstat -ano | findstr :8081 (Windows)

Models not downloading

Issue: First run doesn't download models or fails partway through

Solutions:

  • Check your internet connection
  • Verify you have ~300MB of free disk space
  • Try downloading the models manually from the clip-variants repository and placing them in the models/ directory

GPU not being used

Issue: Server runs but only uses CPU

Solutions:

  • Verify GPU support: the server prints the available providers on startup (you can also check directly, as shown after this list)
  • Install GPU-specific dependencies for your hardware (CUDA for NVIDIA, etc.)
  • Set PHOTOFIELD_AI_RUNTIME=all to enable all available providers
  • Check ONNX Runtime documentation for GPU prerequisites
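
To check what ONNX Runtime itself sees, independently of the server, you can run a quick check in the same Python environment:

# Lists the execution providers ONNX Runtime can use in this environment;
# for NVIDIA GPUs, CUDAExecutionProvider should appear in the output.
import onnxruntime
print(onnxruntime.get_available_providers())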

Out of Memory errors

Issue: Server crashes with memory errors during inference

Solutions:

  • Use smaller models (e.g. patch32 instead of patch16)
  • Reduce batch sizes if processing multiple images
  • Switch to CPU execution: PHOTOFIELD_AI_RUNTIME=cpu

Connection issues with Photofield

Issue: Photofield can't connect to the AI server

Solutions:

  • Verify the server is running: curl http://localhost:8081/docs
  • Check the host URL in Photofield's configuration.yaml
  • If using Docker, ensure the port mapping is correct: -p 8081:8081
  • Check firewall settings if running on different machines

Development

For development work on Photofield AI itself:

Prerequisites

  • Python 3.13 or later
  • uv - for dependency management
  • Task - for running development tasks

Setup

git clone https://github.com/smilyorg/photofield-ai.git
cd photofield-ai
uv sync

Available Tasks

Run task to see all available tasks, or use these common ones:

  • task run - Run the server
  • task watch - Run with auto-reload for development
  • task benchmark - Run performance benchmarks
  • task docker - Build Docker image
  • task sync - Install/sync dependencies
  • task update - Update dependencies to latest versions
  • task clean - Clean build artifacts and cache

Testing & Benchmarks

  • Run task benchmark to benchmark performance on your hardware (requires server running in another terminal)
  • Test API endpoints with example requests in examples.http using the REST Client extension for VSCode

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgements
