OTLP HTTP endpoint hangs in certain conditions #6326

@oleksandr-zhyhalo

Summary

In a single-node Quickwit 0.9.0-nightly deployment with a Postgres metastore and an SQS file source running at ~1500 files/minute, the OTLP HTTP endpoint (POST /api/v1/otlp/v1/logs) hangs indefinitely. Quickwit logs report:

ERROR quickwit_serve::otlp_api::rest_handler:
  otlp internal error: status: 'The service is currently unavailable',
  self: "ingest service is unavailable (no shards available)"

ERROR quickwit_ingest::ingest_v2::router:
  ingest request should not timeout as there is a timeout on independent ingest requests too.
  timeout after 35000

ERROR quickwit_actors::actor_handle: actor-timeout actor="ControlPlane-..."

However, the chitchat state shows that an _ingest-source shard is created and assigned to the indexer, and that ingester.status=ready. The router nonetheless does not see the shard as assignable.

Environment

| Item | Value |
|---|---|
| Image | quickwit/quickwit:v0.9.0-rc (published 2026-04-19 on Docker Hub) |
| Deployment | Single-node via docker-compose, command: run |
| Host | AWS EC2, 4 vCPUs, not resource-constrained (CPU/memory under-utilised) |
| Target | aarch64-unknown-linux-gnu |
| Metastore | PostgreSQL |
| Storage | S3 (s3://…/indexes/, region eu-west-1) |
| enabled_services (chitchat) | metastore,searcher,control_plane,janitor,indexer |
| ingester.status (chitchat) | ready |
| readiness (chitchat) | READY |

(Yes, I did restart the container.)

Workload

  • Noisy index: ***-logs with an SQS file source (***-sqs-filesource) consuming S3 notifications.

  • Sustained rate: ~1500 files per minute. Each S3 file becomes a distinct shard in the metastore:

    INFO quickwit_metastore::metastore::postgres::metastore:
      opened shard index_uid=***-logs:01KNF3635YNGTBZCWQEY6943JP
      source_id=***-sqs-filesource
      shard_id=s3://***-logs-prod/.../1776710923-xxxxx.ndjson.gz
      leader_id= follower_id=None
    

    at roughly 15–20 log lines per second.

  • Target index for OTLP: otel-logs-v0_9, ingest_settings.min_shards=1, _ingest-source (ingest-v2) present alongside _ingest-api-source.
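To put the two rates quoted above side by side, here is a minimal arithmetic sketch. It assumes one "opened shard" log line per file, which is how the report describes the workload; the constant and function names are hypothetical.

```python
FILES_PER_MINUTE = 1500  # sustained SQS file-source rate quoted in this report

# One shard per S3 file, so the file rate is also the shard-creation rate.
files_per_second = FILES_PER_MINUTE / 60
shards_per_hour = FILES_PER_MINUTE * 60


def log_rate_to_files_per_minute(lines_per_second: float) -> float:
    """Convert an observed 'opened shard' log rate into files per minute."""
    return lines_per_second * 60


# The observed 15-20 "opened shard" lines per second, expressed per minute.
observed_range = (log_rate_to_files_per_minute(15), log_rate_to_files_per_minute(20))
```

Under that assumption, the nominal rate is 25 new shards per second (90,000 per hour), while the observed log rate corresponds to 900 to 1200 files per minute, which is somewhat below the nominal 1500.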

Recurring error pattern in Quickwit logs

ERROR quickwit_actors::actor_handle: actor-timeout actor="ControlPlane-purple-6GTo"
ERROR quickwit_serve::otlp_api::rest_handler: otlp internal error: ... "ingest service is unavailable (no shards available)"
ERROR quickwit_ingest::ingest_v2::router: ingest request should not timeout... timeout after 35000
ERROR quickwit_indexing::source::queue_sources::shared_state: failed to prune shards error=TooManyRequests
ERROR quickwit_serve::rest: failed to serve connection: connection closed before message completed

The TooManyRequests from queue-sources correlates with the S3/SQS shard churn.
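To check that correlation over a longer log capture, the errors can be tallied per module. The following is a rough sketch, assuming each log line has been trimmed to the "LEVEL target: message" shape quoted above (real Quickwit output usually carries a timestamp prefix that would need stripping first). The function name `tally_errors` is hypothetical, and the sample lines are the ones from this report.

```python
import re
from collections import Counter

# Matches lines shaped like "ERROR quickwit_serve::rest: message".
LINE_RE = re.compile(r"^(?P<level>[A-Z]+)\s+(?P<target>[\w:]+):\s*(?P<message>.*)$")


def tally_errors(lines):
    """Count ERROR lines per Rust module path (the log target)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[match.group("target")] += 1
    return counts


SAMPLE = [
    'ERROR quickwit_actors::actor_handle: actor-timeout actor="ControlPlane-purple-6GTo"',
    'ERROR quickwit_serve::otlp_api::rest_handler: otlp internal error: ... "ingest service is unavailable (no shards available)"',
    'ERROR quickwit_ingest::ingest_v2::router: ingest request should not timeout... timeout after 35000',
    'ERROR quickwit_indexing::source::queue_sources::shared_state: failed to prune shards error=TooManyRequests',
    'ERROR quickwit_serve::rest: failed to serve connection: connection closed before message completed',
]

for target, count in tally_errors(SAMPLE).most_common():
    print(f"{count:5d}  {target}")
```

Running the same tally over fixed time windows (for example, one minute of logs at a time) would show whether the `queue_sources::shared_state` errors and the router timeouts rise and fall together.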

Direct repro against Quickwit (bypassing any proxy)

curl -i --max-time 20 -X POST 'http://localhost:7280/api/v1/otlp/v1/logs' \
  -H 'content-type: application/x-protobuf' \
  -H 'qw-otel-logs-index: otel-logs-v0_9' \
  --data-binary @valid-otlp.bin
# curl: (28) Operation timed out after 20001 milliseconds with 0 bytes received

A malformed or truncated protobuf body (\x00, or the first 50/100 bytes of a valid body) returns 400 "failed to decode Protobuf message" instantly, which shows that the request reaches Quickwit and that the parse path is fast. Only full, valid bodies hang.
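The probes described above (a single NUL byte, truncated prefixes, and the full body) can be scripted so that the comparison is repeatable. This is a minimal sketch using only the Python standard library; the helper names `truncated_bodies`, `probe`, and `run_all` are hypothetical, while the URL, headers, and 20-second timeout are copied from the curl command in this report.

```python
import socket
import urllib.error
import urllib.request

# URL and headers are taken from the curl command in this report.
URL = "http://localhost:7280/api/v1/otlp/v1/logs"
HEADERS = {
    "content-type": "application/x-protobuf",
    "qw-otel-logs-index": "otel-logs-v0_9",
}


def truncated_bodies(payload: bytes, sizes=(50, 100)) -> dict:
    """Build the probe bodies: one NUL byte, prefixes of the payload, and the full payload."""
    bodies = {"nul": b"\x00"}
    for size in sizes:
        bodies[f"first_{size}"] = payload[:size]
    bodies["full"] = payload
    return bodies


def probe(body: bytes, timeout: float = 20.0) -> str:
    """POST one body and return the HTTP status, or 'timeout' if no response arrives in time."""
    request = urllib.request.Request(URL, data=body, headers=HEADERS, method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return str(response.status)
    except urllib.error.HTTPError as err:
        return str(err.code)  # e.g. the 400 reported for malformed bodies
    except (socket.timeout, TimeoutError):
        return "timeout"  # raised directly when the response never arrives
    except urllib.error.URLError as err:
        # urllib wraps connect-phase failures, including timeouts, in URLError.
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return f"error: {err.reason}"


def run_all(path: str) -> dict:
    """Probe every body derived from the payload file and collect the outcomes."""
    with open(path, "rb") as handle:
        payload = handle.read()
    return {name: probe(body) for name, body in truncated_bodies(payload).items()}
```

Calling `run_all("valid-otlp.bin")` against the affected node would be expected to reproduce the pattern reported here: a fast 400 for the malformed bodies and a timeout for the full one.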

Disabling the SQS file source unblocks OTLP

I temporarily disabled SQS file sources, then retried the Node OTLP client. OTLP started working within seconds — the log record landed in otel-logs-v0_9 on the first attempt.

After re-enabling the SQS source, the endpoint is currently responsive. I will post any additional updates.

Labels: bug
