Modelship

Self-hosted, multi-model AI inference server. Runs LLMs alongside specialized models (TTS, speech-to-text, embeddings, image generation) on GPU or CPU, exposing an OpenAI-compatible API. Built on Ray Serve with pluggable inference backends: vLLM for high-throughput GPU inference, HuggingFace Transformers for CPU and lightweight GPU workloads, llama.cpp for high-efficiency GGUF models on CPU, Diffusers for image generation, and a plugin system for custom backends.

Why Modelship?

Most self-hosted inference tools focus on running a single model. Modelship is for when you need multiple models running simultaneously — an LLM, a TTS engine, a speech-to-text model, an embedding model, and an image generator — all behind a single OpenAI-compatible API, with fine-grained control over GPU memory allocation across them.

One server, many models — run a full AI stack (chat + TTS + STT + embeddings + image gen) on a single machine instead of juggling separate services
GPU memory control — allocate exact GPU fractions per model (e.g. 70% for the LLM, 5% for TTS) so everything fits on your hardware
Mix and match backends — use vLLM for high-throughput GPU inference, Transformers or llama.cpp for CPU-only workloads, Diffusers for images, and plugins for custom backends — in the same deployment
Drop-in OpenAI replacement — any OpenAI SDK client works out of the box, making it easy to integrate with existing apps and tools like Home Assistant
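
To make the GPU memory point concrete, here is a minimal sketch of the arithmetic behind per-model fractions: the fractions assigned to one GPU should not sum past 100%. The helper names and the `plan` list are hypothetical and are not part of Modelship; they only illustrate the budgeting idea using the example percentages above.

```python
from collections import defaultdict


def gpu_usage(allocations):
    """Sum the requested GPU memory fraction per GPU index."""
    totals = defaultdict(float)
    for name, gpu_index, fraction in allocations:
        if not 0.0 < fraction <= 1.0:
            raise ValueError(f"{name}: fraction must be in (0, 1], got {fraction}")
        totals[gpu_index] += fraction
    return dict(totals)


def overcommitted(allocations, limit=1.0):
    """Return the GPU indices whose summed fractions exceed the limit."""
    usage = gpu_usage(allocations)
    # Small tolerance so that 0.5 + 0.5 is not flagged by float rounding.
    return sorted(gpu for gpu, total in usage.items() if total > limit + 1e-9)


# Hypothetical plan: (model name, GPU index, fraction of that GPU's memory).
plan = [
    ("llm", 0, 0.70),
    ("tts", 0, 0.05),
    ("stt", 1, 0.50),
    ("embeddings", 1, 0.50),
]
```

Calling `overcommitted(plan)` returns an empty list for this plan; adding a third model at 35% on GPU 0 would push that GPU over budget and report index 0.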

Architecture

graph TD
    Client["Client (OpenAI SDK / curl)"]
    API["FastAPI Gateway<br/>OpenAI-compatible API<br/>:8000"]

    Client -->|HTTP| API
    API -->|round-robin| LLM_GPU
    API -->|round-robin| LLM_CPU
    API -->|round-robin| TTS
    API -->|round-robin| STT
    API -->|round-robin| EMB
    API -->|round-robin| IMG
    API -->|round-robin| STT_CPU

    subgraph GPU0["GPU 0 — vLLM"]
        LLM_GPU["LLM Deployment<br/>e.g. Llama 3.1 8B<br/>70% GPU"]
        TTS["TTS Deployment<br/>e.g. Kokoro 82M<br/>5% GPU"]
    end

    subgraph GPU1["GPU 1 — Mixed backends"]
        STT["STT Deployment (vLLM)<br/>e.g. Whisper Large<br/>50% GPU"]
        EMB["Embedding Deployment<br/>e.g. Nomic Embed<br/>50% GPU"]
    end

    subgraph CPU["CPU — Transformers / llama.cpp"]
        LLM_CPU["LLM Deployment<br/>e.g. Qwen3-0.6B<br/>CPU-only"]
        STT_CPU["STT Deployment<br/>e.g. Whisper Small<br/>CPU-only"]
    end

    subgraph GPU2["GPU 2 — Diffusers"]
        IMG["Image Generation<br/>e.g. SDXL Turbo<br/>35% GPU"]
    end

Each model runs as an isolated Ray Serve deployment with its own lifecycle, health checks, and resource budget. Five inference backends are available:

| Backend | Best for | GPU required |
| --- | --- | --- |
| vLLM | High-throughput chat, embeddings, transcription | Yes |
| llama.cpp | High-efficiency quantized GGUF models (chat, embeddings) | No |
| Transformers | Chat, embeddings, transcription, TTS on CPU or lightweight GPU | No |
| Diffusers | Image generation | Yes |
| Custom (plugins) | TTS backends (Kokoro ONNX, Bark, Orpheus), STT backends (whisper.cpp) | No |

Models can be deployed across multiple GPUs, run on CPU-only, or both — multiple deployments of the same model (e.g. one on GPU via vLLM, one on CPU via Transformers) are load-balanced with round-robin routing. Each deployment can also scale horizontally with num_replicas. ...
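
The round-robin behaviour described above can be pictured with a small sketch. This is not Modelship's routing code; `RoundRobinRouter` and the deployment IDs are hypothetical names used only to show how requests for one model name alternate across its deployments.

```python
import itertools


class RoundRobinRouter:
    """Cycle through the deployments registered under one model name."""

    def __init__(self):
        self._deployments = {}
        self._cycles = {}

    def register(self, model_name, deployment_id):
        self._deployments.setdefault(model_name, []).append(deployment_id)
        # Rebuild the cycle so the new deployment joins the rotation.
        self._cycles[model_name] = itertools.cycle(self._deployments[model_name])

    def pick(self, model_name):
        if model_name not in self._cycles:
            raise KeyError(f"no deployment registered for {model_name!r}")
        return next(self._cycles[model_name])
```

Registering a GPU deployment and a CPU deployment under the same model name and calling `pick` repeatedly returns them in alternation, which is the effect a client would observe when the same model is deployed twice.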

Requirements

Docker (or Python 3.12+ with uv for local development)
NVIDIA GPU (optional): 16 GB+ VRAM recommended for a full stack (LLM + TTS + STT + embeddings) via vLLM; 8 GB is sufficient for lighter setups. Not required when running on CPU with the Transformers or llama.cpp backends
NVIDIA Container Toolkit — required only when running GPU models in Docker
HuggingFace token for gated models

Features

Multi-model, multi-GPU — run chat, embedding, STT, TTS, and image generation models simultaneously across one or more GPUs with tunable per-model GPU memory allocation
CPU-only support: run models without a GPU using the Transformers backend (chat, embeddings, transcription, TTS) or llama.cpp (quantized GGUF chat and embedding models). Useful for development, testing, or small models that don't need GPU acceleration
Multiple inference backends: vLLM for high-throughput GPU inference, HuggingFace Transformers for CPU and lightweight GPU workloads, llama.cpp for quantized GGUF models on CPU, Diffusers for image generation, and a plugin system for custom backends
Zero-downtime hot-reloads — modify your models.yaml and run a cluster reconcile; changes are applied incrementally without interrupting the API gateway or unchanged models
Advanced agentic capabilities — native support for DeepSeek-style reasoning (<think> blocks parsed into reasoning_content) and universal tool/function calling across vLLM, GGUF (llama.cpp), and Transformers backends
Per-model isolated deployments — each model runs in its own Ray Serve deployment with independent lifecycle, health checks, failure isolation, and configurable replica count
OpenAI-compatible API — drop-in replacement for any OpenAI SDK client
Streaming — SSE streaming for chat completions and TTS audio
Plugin system — opt-in TTS and STT backends installed as isolated uv workspace packages
Multi-GPU & hybrid routing — assign models to specific GPUs or run them on CPU-only; deploy the same model on both GPU and CPU and requests are load-balanced via round-robin; full tensor parallelism support for large models spanning multiple GPUs
Client disconnect detection — cancels in-flight inference when the client disconnects, freeing GPU resources immediately
Built-in observability — Prometheus metrics, custom modelship:* metrics, vLLM engine stats, Ray cluster metrics, structured JSON logging, and OpenTelemetry log export; pre-built Grafana dashboard and alerting rules included
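
The reasoning support listed above (`<think>` blocks parsed into `reasoning_content`) can be illustrated with a short sketch. This is an assumption about the general technique, not the parser Modelship ships; `split_reasoning` is a hypothetical helper and the output dictionary shape is chosen only for illustration.

```python
import re

# Non-greedy match so several <think> blocks are captured separately.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw_text):
    """Separate <think>...</think> blocks from the visible answer."""
    reasoning_parts = [part.strip() for part in THINK_BLOCK.findall(raw_text)]
    content = THINK_BLOCK.sub("", raw_text).strip()
    # None signals that the model produced no reasoning block at all.
    reasoning = "\n".join(part for part in reasoning_parts if part) or None
    return {"content": content, "reasoning_content": reasoning}
```

Given raw model output that starts with a `<think>` block, the helper returns the answer text in `content` and the block's body in `reasoning_content`; output without such a block leaves `reasoning_content` as `None`.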

Supported OpenAI Endpoints

| Endpoint | Use case |
| --- | --- |
| `POST /v1/chat/completions` | Chat / text generation (streaming and non-streaming) |
| `POST /v1/embeddings` | Text embeddings |
| `POST /v1/audio/transcriptions` | Speech-to-text |
| `POST /v1/audio/translations` | Audio translation |
| `POST /v1/audio/speech` | Text-to-speech (SSE streaming or single-response) |
| `POST /v1/images/generations` | Image generation |
| `GET /v1/models` | List available models |
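
For the streaming variant of `POST /v1/chat/completions`, a client has to reassemble the answer from SSE events. The sketch below assumes the server follows the OpenAI streaming convention (`data:` lines carrying JSON chunks with `choices[].delta.content`, terminated by `data: [DONE]`); that is an assumption based on the API being OpenAI-compatible, and the helper names are hypothetical.

```python
import json


def iter_sse_payloads(lines):
    """Yield decoded JSON payloads from SSE `data:` lines until [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        yield json.loads(data)


def collect_text(lines):
    """Concatenate the delta.content fragments of a streamed chat completion."""
    pieces = []
    for payload in iter_sse_payloads(lines):
        for choice in payload.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                pieces.append(piece)
    return "".join(pieces)
```

`collect_text` accepts any iterable of decoded text lines, so it can be fed the line iterator of an HTTP response opened with `"stream": true` in the request body.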

Quick Start

The fastest way to try Modelship: run a tiny reasoning model on a laptop — no GPU required. Copy-paste this block and you'll have an OpenAI-compatible API on http://localhost:8000 in a few minutes.

mkdir -p models-cache && cat > models.yaml <<'EOF'
models:
  - name: reasoning-qwen
    model: "lmstudio-community/Qwen3-0.6B-GGUF:*Q4_K_M.gguf"
    usecase: generate
    loader: llama_cpp
    num_cpus: 3
    llama_cpp_config:
      n_ctx: 4096  # Give reasoning space to think
EOF

docker run --rm --shm-size=8g \
  -v ./models.yaml:/modelship/config/models.yaml \
  -v ./models-cache:/.cache \
  -p 8000:8000 \
  ghcr.io/alez007/modelship:latest-cpu

Images are multi-arch (amd64 + arm64), so this works on Apple Silicon and ARM Linux hosts too.

Once the server is up (look for Deployed app 'modelship api' successfully), call it and watch the model think:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "reasoning-qwen",
    "messages": [{"role": "user", "content": "Which is larger, 9.11 or 9.9?"}]
  }'
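
The same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl command above; the helper names are hypothetical, and reading the reasoning from `message["reasoning_content"]` is an assumption about where the parsed `<think>` text appears in the response.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build the same POST request as the curl command, without sending it."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def read_message(response_body):
    """Pull the answer and optional reasoning out of a chat completion body."""
    message = response_body["choices"][0]["message"]
    # Assumption: reasoning is exposed as message["reasoning_content"].
    return message.get("content"), message.get("reasoning_content")


def ask(base_url, model, prompt, timeout=120):
    """Send the request to a running server and return (answer, reasoning)."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return read_message(json.load(response))
```

With the Quick Start container running, `ask("http://localhost:8000", "reasoning-qwen", "Which is larger, 9.11 or 9.9?")` should return the answer together with the reasoning text, if the model produced any.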

GPU (vLLM, Diffusers)

For high-throughput GPU inference, use the standard image and add --gpus all. You'll also need the NVIDIA Container Toolkit and an HF_TOKEN for gated models. Example models.yaml entries for vLLM, Diffusers, and multi-GPU setups live in docs/model-configuration.md; ready-to-run configs are in config/examples/.

docker run --rm --shm-size=8g --gpus all \
  -e HF_TOKEN=your_token_here \
  -v ./models.yaml:/modelship/config/models.yaml \
  -v ./models-cache:/.cache \
  -p 8000:8000 \
  ghcr.io/alez007/modelship:latest

Tip

Always set --shm-size=8g (or higher) when running the Docker container to prevent PyTorch from hitting shared memory limits during multi-process operations.

Hitting an error? Check docs/troubleshooting.md.

Plugin Support

Modelship's TTS and STT systems are built around a plugin architecture — each backend is an opt-in package with its own isolated dependencies. Plugins ship inside this repo (plugins/) or can be installed from PyPI.

Built-in plugins:

Kokoro ONNX — lightweight TTS via ONNX Runtime (CPU or GPU)
Bark — multilingual TTS by Suno (GPU recommended)
Orpheus — expressive TTS
whisper.cpp — CPU-only STT via pywhispercpp

To enable plugins for local development, pass them as extras at sync time:

uv sync --extra kokoroonnx
uv sync --extra kokoroonnx --extra whispercpp  # multiple plugins

For deployment, plugins are automatically loaded from standalone Python wheels via Ray's runtime_env when referenced in models.yaml. This ensures that complex backend dependencies don't pollute the main API gateway or other deployments.

For a full guide on writing your own plugin, see Plugin Development.

Documentation

Development — dev environment setup, building, and running locally
Model Configuration — full models.yaml reference, GPU pinning, environment variables
Architecture — system design, request lifecycle, plugin loading
Plugin Development — writing custom TTS/STT backends
Home Assistant Integration — Wyoming protocol setup for voice automation
Monitoring & Logging — Prometheus metrics, Grafana dashboard, structured logging, health checks
Troubleshooting — common first-run errors and fixes
Roadmap — what's planned next and where to contribute

Monitoring

Modelship exposes Prometheus metrics (Ray cluster, Ray Serve, vLLM, and custom modelship:* metrics) through a single scrape endpoint on port 8079. Metrics are enabled by default — set MSHIP_METRICS=false to disable. A pre-built Grafana dashboard and Prometheus alerting rules are included in the repository.
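
To pick the custom `modelship:*` series out of a scrape, the Prometheus text exposition format can be filtered by metric name prefix. The sketch below is a simplified parser, assuming label values contain no embedded newlines; `samples_with_prefix` is a hypothetical helper and no real Modelship metric names are implied.

```python
import re

# metric name, optional {labels}, value, optional integer timestamp
SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$")


def samples_with_prefix(exposition_text, prefix="modelship:"):
    """Return {series: value} for samples whose metric name starts with prefix."""
    found = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines plus HELP/TYPE comments
        match = SAMPLE.match(line)
        if match and match.group(1).startswith(prefix):
            name, labels, value = match.groups()
            found[name + (labels or "")] = float(value)
    return found
```

The input would be the body returned by the scrape endpoint on port 8079; the exact URL path (commonly `/metrics`) is an assumption here and should be checked against the monitoring documentation.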

Logging supports structured JSON output (MSHIP_LOG_FORMAT=json) and request ID correlation across Ray actor boundaries. Logs can be shipped to a remote syslog server (--log-target syslog://host:514) or an OpenTelemetry collector (--otel-endpoint http://collector:4317). Set MSHIP_LOG_LEVEL to TRACE for full request/response payloads, or DEBUG for detailed diagnostics without payloads.

See Monitoring & Logging for full details.

Production Readiness

Modelship is actively used and designed for stability in multi-tenant setups. Key guarantees include:

Mutex-backed deployments: A cluster-wide deploy coordinator prevents VRAM exhaustion by ensuring models are never loaded concurrently if resources are tight.
Comprehensive HTTP-level tests: The tests/test_integration.py suite validates chat, reasoning, tool-calling, and streaming across all loaders using real (small) models.
Payload & concurrency limits: Built-in safeguards against large payloads (MSHIP_MAX_REQUEST_BODY_BYTES) and configurable limits per backend.
Observability: Deep integration with Prometheus, OpenTelemetry, and structured logging.

We are currently working towards Kubernetes-native hardening (Helm charts, GPU-aware probes) and rate-limiting. See the full Production Readiness Plan for the scorecard and roadmap.

Contributing

See CONTRIBUTING.md for guidelines on setting up the dev environment, code style, and submitting pull requests.
