Self-hosted, multi-model AI inference server. Runs LLMs alongside specialized models (TTS, speech-to-text, embeddings, image generation) on GPU or CPU, exposing an OpenAI-compatible API. Built on Ray Serve with pluggable inference backends: vLLM for high-throughput GPU inference, HuggingFace Transformers for CPU and lightweight GPU workloads, llama.cpp for high-efficiency GGUF models on CPU, Diffusers for image generation, and a plugin system for custom backends.
Most self-hosted inference tools focus on running a single model. Modelship is for when you need multiple models running simultaneously — an LLM, a TTS engine, a speech-to-text model, an embedding model, and an image generator — all behind a single OpenAI-compatible API, with fine-grained control over GPU memory allocation across them.
- One server, many models — run a full AI stack (chat + TTS + STT + embeddings + image gen) on a single machine instead of juggling separate services
- GPU memory control — allocate exact GPU fractions per model (e.g. 70% for the LLM, 5% for TTS) so everything fits on your hardware
- Mix and match backends — use vLLM for high-throughput GPU inference, Transformers or llama.cpp for CPU-only workloads, Diffusers for images, and plugins for custom backends — in the same deployment
- Drop-in OpenAI replacement — any OpenAI SDK client works out of the box, making it easy to integrate with existing apps and tools like Home Assistant
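To make the GPU memory bullet concrete, here is a minimal Python sketch of the arithmetic behind fitting several models on shared GPUs. The plan structure and function names are hypothetical (this is not Modelship's configuration format), and the percentages reuse the example split from the architecture diagram in this README.

```python
from collections import defaultdict

# Hypothetical plan: (model name, GPU index, percent of that GPU's memory).
# The numbers mirror the example split in this README's diagram.
PLAN = [
    ("llm", 0, 70),
    ("tts", 0, 5),
    ("stt", 1, 50),
    ("embeddings", 1, 50),
    ("image-gen", 2, 35),
]


def gpu_usage(plan):
    """Sum the requested memory percentage for each GPU index."""
    totals = defaultdict(int)
    for _name, gpu, percent in plan:
        totals[gpu] += percent
    return dict(totals)


def overcommitted(plan):
    """Return the GPU indices whose requests add up to more than 100%."""
    return sorted(gpu for gpu, total in gpu_usage(plan).items() if total > 100)


print(gpu_usage(PLAN))
print(overcommitted(PLAN))
```

A plan is only feasible when `overcommitted` returns an empty list; real deployments usually also need some headroom beyond the raw sum.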
graph TD
Client["Client (OpenAI SDK / curl)"]
API["FastAPI Gateway<br/>OpenAI-compatible API<br/>:8000"]
Client -->|HTTP| API
API -->|round-robin| LLM_GPU
API -->|round-robin| LLM_CPU
API -->|round-robin| TTS
API -->|round-robin| STT
API -->|round-robin| EMB
API -->|round-robin| IMG
subgraph GPU0["GPU 0 — vLLM"]
LLM_GPU["LLM Deployment<br/>e.g. Llama 3.1 8B<br/>70% GPU"]
TTS["TTS Deployment<br/>e.g. Kokoro 82M<br/>5% GPU"]
end
subgraph GPU1["GPU 1 — Mixed backends"]
STT["STT Deployment (vLLM)<br/>e.g. Whisper Large<br/>50% GPU"]
EMB["Embedding Deployment<br/>e.g. Nomic Embed<br/>50% GPU"]
end
subgraph CPU["CPU — Transformers / llama.cpp"]
LLM_CPU["LLM Deployment<br/>e.g. Qwen3-0.6B<br/>CPU-only"]
STT_CPU["STT Deployment<br/>e.g. Whisper Small<br/>CPU-only"]
end
subgraph GPU2["GPU 2 — Diffusers"]
IMG["Image Generation<br/>e.g. SDXL Turbo<br/>35% GPU"]
end
Each model runs as an isolated Ray Serve deployment with its own lifecycle, health checks, and resource budget. Five inference backends are available:
| Backend | Best for | GPU required |
|---|---|---|
| vLLM | High-throughput chat, embeddings, transcription | Yes |
| llama.cpp | High-efficiency quantized GGUF models (chat, embeddings) | No |
| Transformers | Chat, embeddings, transcription, TTS on CPU or lightweight GPU | No |
| Diffusers | Image generation | Yes |
| Custom (plugins) | TTS backends (Kokoro ONNX, Bark, Orpheus), STT backends (whisper.cpp) | No |
Models can be deployed across multiple GPUs, run on CPU-only, or both — multiple deployments of the same model (e.g. one on GPU via vLLM, one on CPU via Transformers) are load-balanced with round-robin routing. Each deployment can also scale horizontally with num_replicas.
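As an illustration of the round-robin behaviour described above, the following Python sketch alternates between two deployments registered under the same model name. It is not Modelship's router: the class name, the deployment labels, and the in-process design are assumptions made for the example.

```python
from itertools import cycle


class RoundRobinRouter:
    """Cycle through the deployments registered under each model name."""

    def __init__(self, deployments):
        # deployments maps a model name to a list of deployment labels.
        self._cycles = {name: cycle(targets) for name, targets in deployments.items()}

    def pick(self, model):
        """Return the next deployment for `model`, wrapping around at the end."""
        if model not in self._cycles:
            raise KeyError(f"unknown model: {model}")
        return next(self._cycles[model])


# Hypothetical labels: one GPU deployment and one CPU deployment of "chat".
router = RoundRobinRouter({"chat": ["gpu-vllm", "cpu-transformers"]})
picks = [router.pick("chat") for _ in range(4)]
print(picks)
```

Successive requests for `chat` alternate between the two deployments, so neither one receives two requests in a row.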
...
- Docker (or Python 3.12+ with `uv` for local development)
- NVIDIA GPU (optional) — 16 GB+ VRAM recommended for a full stack (LLM + TTS + STT + embeddings) via vLLM; 8 GB is sufficient for lighter setups. Not required when using the Transformers backend on CPU
- NVIDIA Container Toolkit — required only when running GPU models in Docker
- HuggingFace token for gated models
- Multi-model, multi-GPU — run chat, embedding, STT, TTS, and image generation models simultaneously across one or more GPUs with tunable per-model GPU memory allocation
- CPU-only support — run models without a GPU using the Transformers backend (chat, embeddings, transcription, TTS) or llama.cpp (quantized GGUF chat and embedding models). Useful for development, testing, or small models that don't need GPU acceleration
- Multiple inference backends — vLLM for high-throughput GPU inference, HuggingFace Transformers for CPU and lightweight GPU workloads, llama.cpp for GGUF models on CPU, Diffusers for image generation, and a plugin system for custom backends
- Zero-downtime hot-reloads — modify your `models.yaml` and run a cluster reconcile; changes are applied incrementally without interrupting the API gateway or unchanged models
- Advanced agentic capabilities — native support for DeepSeek-style reasoning (`<think>` blocks parsed into `reasoning_content`) and universal tool/function calling across vLLM, GGUF (llama.cpp), and Transformers backends
- Per-model isolated deployments — each model runs in its own Ray Serve deployment with independent lifecycle, health checks, failure isolation, and configurable replica count
- OpenAI-compatible API — drop-in replacement for any OpenAI SDK client
- Streaming — SSE streaming for chat completions and TTS audio
- Plugin system — opt-in TTS and STT backends installed as isolated uv workspace packages
- Multi-GPU & hybrid routing — assign models to specific GPUs or run them on CPU-only; deploy the same model on both GPU and CPU and requests are load-balanced via round-robin; full tensor parallelism support for large models spanning multiple GPUs
- Client disconnect detection — cancels in-flight inference when the client disconnects, freeing GPU resources immediately
- Built-in observability — Prometheus metrics, custom `modelship:*` metrics, vLLM engine stats, Ray cluster metrics, structured JSON logging, and OpenTelemetry log export; pre-built Grafana dashboard and alerting rules included
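The reasoning feature in the list above separates `<think>` blocks from the final answer. The Python sketch below shows one way such a split could work; it is an assumption-laden illustration, not Modelship's parser, and the function name `split_reasoning` is hypothetical.

```python
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw):
    """Split raw model output into (reasoning_content, content)."""
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(raw))
    content = THINK_BLOCK.sub("", raw).strip()
    # Use None when the model produced no reasoning block at all.
    return (reasoning or None), content


raw_output = "<think>9.9 is 9.90, and 9.90 > 9.11.</think>\n9.9 is larger."
reasoning_content, content = split_reasoning(raw_output)
print(reasoning_content)
print(content)
```

A production parser would also need to handle streamed output, where a `<think>` tag can be split across chunks; this sketch only covers complete responses.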
| Endpoint | Usecase |
|---|---|
| `POST /v1/chat/completions` | Chat / text generation (streaming and non-streaming) |
| `POST /v1/embeddings` | Text embeddings |
| `POST /v1/audio/transcriptions` | Speech-to-text |
| `POST /v1/audio/translations` | Audio translation |
| `POST /v1/audio/speech` | Text-to-speech (SSE streaming or single-response) |
| `POST /v1/images/generations` | Image generation |
| `GET /v1/models` | List available models |
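As a sketch of how a client might target one of these endpoints without any third-party packages, the Python below builds a chat completion request with the standard library. The helper names are hypothetical, the base URL assumes the default port 8000, and the model name must match a `name` entry in your `models.yaml` (`reasoning-qwen` is the one used in the quick start).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed default address of the gateway


def build_chat_request(model, prompt, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request("reasoning-qwen", "Which is larger, 9.11 or 9.9?")
print(request.full_url)
```

Calling `send(request)` against a running server returns the parsed JSON response. An OpenAI SDK client should work the same way once its base URL points at the server's `/v1` prefix.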
The fastest way to try Modelship: run a tiny reasoning model on a laptop — no GPU required. Copy-paste this block and you'll have an OpenAI-compatible API on http://localhost:8000 in a few minutes.
mkdir -p models-cache && cat > models.yaml <<'EOF'
models:
  - name: reasoning-qwen
    model: "lmstudio-community/Qwen3-0.6B-GGUF:*Q4_K_M.gguf"
    usecase: generate
    loader: llama_cpp
    num_cpus: 3
    llama_cpp_config:
      n_ctx: 4096  # Give reasoning space to think
EOF
docker run --rm --shm-size=8g \
-v ./models.yaml:/modelship/config/models.yaml \
-v ./models-cache:/.cache \
-p 8000:8000 \
ghcr.io/alez007/modelship:latest-cpu

Images are multi-arch (amd64 + arm64), so this works on Apple Silicon and ARM Linux hosts too.
Once the server is up (look for Deployed app 'modelship api' successfully), call it and watch the model think:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "reasoning-qwen",
"messages": [{"role": "user", "content": "Which is larger, 9.11 or 9.9?"}]
}'

For high-throughput GPU inference, use the standard image and add `--gpus all`. You'll also need the NVIDIA Container Toolkit and an `HF_TOKEN` for gated models. Example `models.yaml` entries for vLLM, Diffusers, and multi-GPU setups live in `docs/model-configuration.md`; ready-to-run configs are in `config/examples/`.
docker run --rm --shm-size=8g --gpus all \
-e HF_TOKEN=your_token_here \
-v ./models.yaml:/modelship/config/models.yaml \
-v ./models-cache:/.cache \
-p 8000:8000 \
ghcr.io/alez007/modelship:latest

Tip: Always set `--shm-size=8g` (or higher) when running the docker container to prevent PyTorch from hitting shared memory limits during multi-process operations.
Hitting an error? Check docs/troubleshooting.md.
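The curl call in the quick start returns a single JSON body. For streamed chat completions, OpenAI-compatible servers send Server-Sent Events instead. The Python sketch below assumes Modelship follows the OpenAI chunk layout (`choices[0].delta.content`, terminated by `data: [DONE]`); the function name and the sample lines are hypothetical and were not captured from a real server.

```python
import json


def iter_deltas(lines):
    """Yield text fragments from OpenAI-style SSE chat completion lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel in the OpenAI streaming format
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines for illustration only.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "9.9 is "}}]}',
    'data: {"choices": [{"delta": {"content": "larger."}}]}',
    "data: [DONE]",
]
answer = "".join(iter_deltas(sample))
print(answer)
```

With a live server, the same function can consume the decoded lines of a response to a request that sets `"stream": true`.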
Modelship's TTS and STT systems are built around a plugin architecture — each backend is an opt-in package with its own isolated dependencies. Plugins ship inside this repo (plugins/) or can be installed from PyPI.
Built-in plugins:
- Kokoro ONNX — lightweight TTS via ONNX Runtime (CPU or GPU)
- Bark — multilingual TTS by Suno (GPU recommended)
- Orpheus — expressive TTS
- whisper.cpp — CPU-only STT via `pywhispercpp`
To enable plugins for local development, pass them as extras at sync time:
uv sync --extra kokoroonnx
uv sync --extra kokoroonnx --extra whispercpp  # multiple plugins

For deployment, plugins are automatically loaded from standalone Python wheels via Ray's `runtime_env` when referenced in `models.yaml`. This ensures that complex backend dependencies don't pollute the main API gateway or other deployments.
For a full guide on writing your own plugin, see Plugin Development.
- Development — dev environment setup, building, and running locally
- Model Configuration — full `models.yaml` reference, GPU pinning, environment variables
- Architecture — system design, request lifecycle, plugin loading
- Plugin Development — writing custom TTS/STT backends
- Home Assistant Integration — Wyoming protocol setup for voice automation
- Monitoring & Logging — Prometheus metrics, Grafana dashboard, structured logging, health checks
- Troubleshooting — common first-run errors and fixes
- Roadmap — what's planned next and where to contribute
Modelship exposes Prometheus metrics (Ray cluster, Ray Serve, vLLM, and custom modelship:* metrics) through a single scrape endpoint on port 8079. Metrics are enabled by default — set MSHIP_METRICS=false to disable. A pre-built Grafana dashboard and Prometheus alerting rules are included in the repository.
Logging supports structured JSON output (MSHIP_LOG_FORMAT=json) and request ID correlation across Ray actor boundaries. Logs can be shipped to a remote syslog server (--log-target syslog://host:514) or an OpenTelemetry collector (--otel-endpoint http://collector:4317). Set MSHIP_LOG_LEVEL to TRACE for full request/response payloads, or DEBUG for detailed diagnostics without payloads.
See Monitoring & Logging for full details.
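To illustrate what structured JSON logging with request ID correlation can look like, here is a minimal Python sketch built on the standard `logging` module. The field names (`level`, `logger`, `message`, `request_id`) and the formatter class are assumptions for the example; Modelship's actual log schema may differ.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # request_id is a hypothetical attribute attached to the record.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


# Build a record by hand so the example runs without configuring handlers.
record = logging.LogRecord(
    name="gateway", level=logging.INFO, pathname="example.py", lineno=0,
    msg="request finished in %d ms", args=(42,), exc_info=None,
)
record.request_id = "req-123"
line = JsonFormatter().format(record)
print(line)
```

Carrying the same `request_id` on every record emitted while serving one request is what lets log lines from different processes be joined afterwards.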
Modelship is actively used and designed for stability in multi-tenant setups. Key guarantees include:
- Mutex-backed deployments: A cluster-wide deploy coordinator prevents VRAM exhaustion by ensuring models are never loaded concurrently if resources are tight.
- Comprehensive HTTP-level tests: The `tests/test_integration.py` suite validates chat, reasoning, tool-calling, and streaming across all loaders using real (small) models.
- Payload & concurrency limits: Built-in safeguards against large payloads (`MSHIP_MAX_REQUEST_BODY_BYTES`) and configurable limits per backend.
- Observability: Deep integration with Prometheus, OpenTelemetry, and structured logging.
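The payload safeguard in the list above can be pictured with a short Python sketch. Only the environment variable name `MSHIP_MAX_REQUEST_BODY_BYTES` comes from this README; the fallback value, the function names, and the choice of HTTP 413 as the rejection status are assumptions made for illustration.

```python
import os

# Hypothetical fallback used only in this sketch; not Modelship's real default.
DEFAULT_MAX_BODY_BYTES = 10 * 1024 * 1024


def max_body_bytes(env=os.environ):
    """Read the body-size cap from MSHIP_MAX_REQUEST_BODY_BYTES."""
    return int(env.get("MSHIP_MAX_REQUEST_BODY_BYTES", DEFAULT_MAX_BODY_BYTES))


def check_body(body, env=os.environ):
    """Return an HTTP status code: 413 if the body is too large, else 200."""
    return 413 if len(body) > max_body_bytes(env) else 200


# Pass a plain dict instead of os.environ to try different limits.
print(check_body(b"x" * 10, {"MSHIP_MAX_REQUEST_BODY_BYTES": "8"}))
print(check_body(b"x" * 8, {"MSHIP_MAX_REQUEST_BODY_BYTES": "8"}))
```

Rejecting oversized bodies before they reach a model deployment keeps one large upload from consuming memory needed by other requests.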
We are currently working towards Kubernetes-native hardening (Helm charts, GPU-aware probes) and rate-limiting. See the full Production Readiness Plan for the scorecard and roadmap.
See CONTRIBUTING.md for guidelines on setting up the dev environment, code style, and submitting pull requests.