Add Anthropic AI Gateway stream regression tests #21

Open
jhaynie wants to merge 1 commit into main from test/anthropic-aigateway-stream-regression
Conversation

@jhaynie (Member) commented May 26, 2026

Summary

  • add opt-in live coverage for Anthropic Messages streaming through llmproxy
  • add opt-in live coverage for deployed Agentuity AI Gateway Anthropic Messages streaming
  • assert message_stop, text deltas, sentinel text, and no thinking_delta when thinking is disabled

Verification

  • LLMPROXY_LIVE_AIGATEWAY_REGRESSION=1 gluon run -s ion -- go test -tags=integration . -run TestLiveAnthropicMessagesStreamCompletes -count=1 -v
  • go test -tags=integration . -run TestLiveAnthropicMessagesStreamCompletes -count=1 -v
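
The first command sets LLMPROXY_LIVE_AIGATEWAY_REGRESSION=1, which suggests the live tests are skipped unless that variable is present. The PR does not show how the gate is implemented, so the following is only a minimal sketch of one way to do it; `liveRegressionEnabled` and the injected `getenv` parameter are hypothetical names, and treating only the exact value "1" as enabled is an assumption.

```go
package main

// liveRegressionEnvVar is the opt-in variable named in the verification command.
const liveRegressionEnvVar = "LLMPROXY_LIVE_AIGATEWAY_REGRESSION"

// liveRegressionEnabled reports whether the opt-in flag is set.
// getenv is injected (normally os.Getenv) so the check can be exercised
// without mutating the process environment.
// Assumption: only the exact value "1" enables the live tests.
func liveRegressionEnabled(getenv func(string) string) bool {
	return getenv(liveRegressionEnvVar) == "1"
}
```

A test would typically call `t.Skip` with a short message when `liveRegressionEnabled(os.Getenv)` returns false, so that a plain `go test -tags=integration` run stays offline.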

Summary by CodeRabbit

  • Tests
    • Enhanced integration test coverage for Anthropic message streaming functionality, including validation of stream completion and sentinel signals across multiple routing configurations.

@coderabbitai (Bot) commented May 26, 2026

📝 Walkthrough

This PR adds live integration tests for Anthropic message streaming to aigateway_regression_live_test.go. The change includes two test functions (one using the internal router, one using a direct gateway URL), a stream parser that extracts SSE events and delta fields, a summary type with formatted output, and validation helpers that check event counts and stream content.

Changes

Anthropic Streaming Live Regression Tests

Layer / File(s) | Summary
Stream parsing infrastructure (aigateway_regression_live_test.go) | Imports (sort, strconv, strings, providers/anthropic) enable streaming support. The anthropicLiveStreamSummary struct holds aggregated event type counts, text/thinking delta counts, and concatenated text. summarizeAnthropicLiveStream parses raw SSE payloads by splitting blocks, extracting event: and data: fields, decoding JSON, and aggregating counts and text content.
Stream validation and utilities (aigateway_regression_live_test.go) | assertAnthropicLiveStream validates parsed streams by checking for message_stop events, text deltas, a required sentinel string, and absence of thinking deltas. fmtInt and firstNonEmptyLiveEnv helpers format integers and select the first non-empty environment variable from a list of candidate names.
Integration tests (aigateway_regression_live_test.go) | TestLiveAnthropicMessagesStreamCompletes routes an Anthropic /v1/messages streaming request through the internal auto-router with an Accept: text/event-stream header, reads the response, and validates stream content. TestLiveAgentuityAIGatewayAnthropicMessagesStreamCompletes sends a direct HTTP request to an AI Gateway URL (base URL and API key from environment variables), includes authorization headers, and validates the stream behavior using the same assertion logic.
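
The validation and environment helpers described above can be sketched roughly as follows. This is not the code from the PR: `streamSummary`, `checkStream`, and `firstNonEmpty` are hypothetical stand-ins for anthropicLiveStreamSummary, assertAnthropicLiveStream, and firstNonEmptyLiveEnv, and returning a list of problems instead of calling `t.Fatalf` is an assumption made to keep the sketch testable on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// streamSummary mirrors the fields the walkthrough attributes to
// anthropicLiveStreamSummary; the exact field names are assumed.
type streamSummary struct {
	EventTypes     map[string]int
	TextDeltas     int
	ThinkingDeltas int
	Text           string
}

// checkStream returns one message per violated expectation.
// An empty result means the stream passed all four checks.
func checkStream(s streamSummary, sentinel string) []string {
	var problems []string
	if s.EventTypes["message_stop"] == 0 {
		problems = append(problems, "missing message_stop event")
	}
	if s.TextDeltas == 0 {
		problems = append(problems, "no text_delta events")
	}
	if !strings.Contains(s.Text, sentinel) {
		problems = append(problems, fmt.Sprintf("text %q does not contain sentinel %q", s.Text, sentinel))
	}
	if s.ThinkingDeltas > 0 {
		problems = append(problems, fmt.Sprintf("got %d thinking_delta events with thinking disabled", s.ThinkingDeltas))
	}
	return problems
}

// firstNonEmpty returns the first non-empty value among the named
// variables, or "" if none is set. getenv is normally os.Getenv.
func firstNonEmpty(getenv func(string) string, names ...string) string {
	for _, name := range names {
		if v := strings.TrimSpace(getenv(name)); v != "" {
			return v
		}
	}
	return ""
}
```

Collecting every violation before failing tends to make a live-test failure easier to diagnose, since a truncated stream usually trips more than one check at once.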
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit’s high-level summary is enabled.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
aigateway_regression_live_test.go (2)

128-128: ⚡ Quick win

Consider reducing max_tokens for this smoke test.

The request allows up to 64,000 tokens but the prompt explicitly asks for a minimal reply ("Reply with GENESIS_DRIVER_SMOKE_OK and nothing else"). A much smaller value like 256 or 512 would suffice and reduce API costs.

💰 Proposed fix to reduce max_tokens
-	body := `{"model":"` + model + `","stream":true,"max_tokens":64000,"thinking":{"type":"disabled"},"messages":[{"role":"user","content":[{"type":"text","text":"Reply with GENESIS_DRIVER_SMOKE_OK and nothing else."}]}]}`
+	body := `{"model":"` + model + `","stream":true,"max_tokens":512,"thinking":{"type":"disabled"},"messages":[{"role":"user","content":[{"type":"text","text":"Reply with GENESIS_DRIVER_SMOKE_OK and nothing else."}]}]}`
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@aigateway_regression_live_test.go` at line 128, Reduce the unnecessary token
budget in the test payload: in aigateway_regression_live_test.go update the JSON
assigned to the body variable (the line that builds the request body string
containing "max_tokens":64000) to use a much smaller value such as 256 or 512
(e.g., "max_tokens":512) since the prompt expects a single short reply, which
lowers API cost while preserving test intent.

178-178: ⚡ Quick win

Consider reducing max_tokens for this smoke test.

Similar to the internal router test, this request allows up to 64,000 tokens for a minimal reply. A smaller value like 512 would be sufficient and more cost-effective.

💰 Proposed fix to reduce max_tokens
-		"max_tokens": 64_000,
+		"max_tokens": 512,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@aigateway_regression_live_test.go` at line 178, Reduce the excessive token
allowance in the smoke test by changing the JSON request field "max_tokens" from
64_000 to a smaller value (e.g., 512) in the test that builds the request body
(the line containing "max_tokens": 64_000); update any related test expectations
or comments as needed to reflect the lower limit so the test remains valid and
cost-efficient.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 92d057cc-e600-4e96-8ecf-b02b81f80197

📥 Commits

Reviewing files that changed from the base of the PR and between 610f141 and e1930d9.

📒 Files selected for processing (1)
  • aigateway_regression_live_test.go
📜 Review details
🔇 Additional comments (4)
aigateway_regression_live_test.go (4)

8-8: LGTM!

Also applies to: 13-14, 20-20


223-238: LGTM!


240-254: LGTM!


303-305: LGTM!

Also applies to: 315-322

Comment on lines +256 to +301
func summarizeAnthropicLiveStream(raw []byte) anthropicLiveStreamSummary {
	summary := anthropicLiveStreamSummary{
		EventTypes: make(map[string]int),
	}
	for _, block := range strings.Split(string(raw), "\n\n") {
		var eventType string
		var data string
		for _, line := range strings.Split(block, "\n") {
			line = strings.TrimSpace(line)
			if strings.HasPrefix(line, "event:") {
				eventType = strings.TrimSpace(strings.TrimPrefix(line, "event:"))
			}
			if strings.HasPrefix(line, "data:") {
				data = strings.TrimSpace(strings.TrimPrefix(line, "data:"))
			}
		}
		if eventType != "" {
			summary.EventTypes[eventType]++
		}
		if data == "" {
			continue
		}
		var payload struct {
			Type  string `json:"type"`
			Delta struct {
				Type     string `json:"type"`
				Text     string `json:"text"`
				Thinking string `json:"thinking"`
			} `json:"delta"`
		}
		if err := json.Unmarshal([]byte(data), &payload); err != nil {
			continue
		}
		if payload.Type != "" {
			summary.EventTypes[payload.Type]++
		}
		switch payload.Delta.Type {
		case "text_delta":
			summary.TextDeltas++
			summary.Text += payload.Delta.Text
		case "thinking_delta":
			summary.ThinkingDeltas++
		}
	}
	return summary
}
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Event types may be double-counted in the summary.

The function counts event types from both the SSE event: line (line 273) and the JSON type field (line 290). In Anthropic's SSE format, these typically have the same value:

event: content_block_delta
data: {"type":"content_block_delta",...}

This results in each event being counted twice in EventTypes. While current assertions only check for presence (> 0), the counts themselves are misleading.

🔧 Proposed fix to avoid double-counting

Count only from the JSON payload to match the actual event semantics:

 	for _, block := range strings.Split(string(raw), "\n\n") {
-		var eventType string
 		var data string
 		for _, line := range strings.Split(block, "\n") {
 			line = strings.TrimSpace(line)
-			if strings.HasPrefix(line, "event:") {
-				eventType = strings.TrimSpace(strings.TrimPrefix(line, "event:"))
-			}
 			if strings.HasPrefix(line, "data:") {
 				data = strings.TrimSpace(strings.TrimPrefix(line, "data:"))
 			}
 		}
-		if eventType != "" {
-			summary.EventTypes[eventType]++
-		}
 		if data == "" {
 			continue
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func summarizeAnthropicLiveStream(raw []byte) anthropicLiveStreamSummary {
	summary := anthropicLiveStreamSummary{
		EventTypes: make(map[string]int),
	}
	for _, block := range strings.Split(string(raw), "\n\n") {
		var data string
		for _, line := range strings.Split(block, "\n") {
			line = strings.TrimSpace(line)
			if strings.HasPrefix(line, "data:") {
				data = strings.TrimSpace(strings.TrimPrefix(line, "data:"))
			}
		}
		if data == "" {
			continue
		}
		var payload struct {
			Type  string `json:"type"`
			Delta struct {
				Type     string `json:"type"`
				Text     string `json:"text"`
				Thinking string `json:"thinking"`
			} `json:"delta"`
		}
		if err := json.Unmarshal([]byte(data), &payload); err != nil {
			continue
		}
		if payload.Type != "" {
			summary.EventTypes[payload.Type]++
		}
		switch payload.Delta.Type {
		case "text_delta":
			summary.TextDeltas++
			summary.Text += payload.Delta.Text
		case "thinking_delta":
			summary.ThinkingDeltas++
		}
	}
	return summary
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@aigateway_regression_live_test.go` around lines 256 - 301, In
summarizeAnthropicLiveStream, avoid double-counting EventTypes: stop
unconditionally incrementing summary.EventTypes[eventType] from the SSE "event:"
line and instead increment summary.EventTypes only from the parsed JSON
payload.Type (payload.Type) when present; if payload.Type is empty, you may fall
back to using the SSE eventType once. Update logic around eventType and
payload.Type (references: summarizeAnthropicLiveStream,
anthropicLiveStreamSummary, EventTypes, payload.Type) so counts come from the
JSON payload first and the SSE event line only as a fallback.
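
The committable suggestion drops the SSE event: line entirely, while the prompt above describes a fallback to it when the JSON type is missing. A minimal sketch of that fallback variant follows, assuming each SSE block is separated by a blank line as in the original parser; `liveStreamSummary` and `summarizeStream` are hypothetical names used so the sketch compiles on its own.

```go
package main

import (
	"encoding/json"
	"strings"
)

// liveStreamSummary stands in for anthropicLiveStreamSummary.
type liveStreamSummary struct {
	EventTypes     map[string]int
	TextDeltas     int
	ThinkingDeltas int
	Text           string
}

// summarizeStream counts each SSE block exactly once: by the JSON "type"
// field when the data line decodes, otherwise by the SSE "event:" name.
func summarizeStream(raw []byte) liveStreamSummary {
	summary := liveStreamSummary{EventTypes: make(map[string]int)}
	for _, block := range strings.Split(string(raw), "\n\n") {
		var eventType, data string
		for _, line := range strings.Split(block, "\n") {
			line = strings.TrimSpace(line)
			if strings.HasPrefix(line, "event:") {
				eventType = strings.TrimSpace(strings.TrimPrefix(line, "event:"))
			}
			if strings.HasPrefix(line, "data:") {
				data = strings.TrimSpace(strings.TrimPrefix(line, "data:"))
			}
		}
		var payload struct {
			Type  string `json:"type"`
			Delta struct {
				Type     string `json:"type"`
				Text     string `json:"text"`
				Thinking string `json:"thinking"`
			} `json:"delta"`
		}
		decoded := data != "" && json.Unmarshal([]byte(data), &payload) == nil
		switch {
		case decoded && payload.Type != "":
			summary.EventTypes[payload.Type]++ // preferred source
		case eventType != "":
			summary.EventTypes[eventType]++ // fallback, counted once
		}
		if !decoded {
			continue
		}
		switch payload.Delta.Type {
		case "text_delta":
			summary.TextDeltas++
			summary.Text += payload.Delta.Text
		case "thinking_delta":
			summary.ThinkingDeltas++
		}
	}
	return summary
}
```

Compared with the suggestion above, this keeps a count for blocks that carry only an event: line or an undecodable data: line, which may help when a proxy rewrites or truncates payloads.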
