MM-67433, MM-66612, MM-67291: Add new ES/OS changes #8882
agarciamontoro wants to merge 2 commits into master
📝 Walkthrough
Added documentation describing how Mattermost v11.7+ automatically handles Elasticsearch/OpenSearch outages: periodic health checks, immediate fallback to database search on failure, reconnection attempts with exponential backoff, automatic recovery, re-evaluation triggers on config/license changes, a new Prometheus metric, and the expected log messages.
Sequence Diagram(s)
sequenceDiagram
    actor Client
    participant Mattermost_Server as "Mattermost Server"
    participant Search_Engine as "Elasticsearch/OpenSearch"
    participant Database as "Database"
    participant Prometheus as "Prometheus"
    Client->>Mattermost_Server: Search request
    Mattermost_Server->>Search_Engine: Forward query (if engine healthy)
    alt Search engine responds
        Search_Engine-->>Mattermost_Server: Search results
        Mattermost_Server-->>Client: Return results
    else Search engine unavailable
        Mattermost_Server->>Database: Fallback to Database Search
        Database-->>Mattermost_Server: Search results
        Mattermost_Server-->>Client: Return results
    end
    Note over Mattermost_Server,Search_Engine: Periodic health checks (every 60s)
    Mattermost_Server->>Search_Engine: Health check ping
    alt Failure(s)
        Mattermost_Server-->>Mattermost_Server: Mark unhealthy after first failure for monitoring / stop engine after 3 consecutive failures
        Mattermost_Server->>Search_Engine: Reconnection attempts (exponential backoff 15s→5min)
        Prometheus-->>Mattermost_Server: Scrape `mattermost_search_engine_status`
    else Recovery
        Search_Engine-->>Mattermost_Server: Healthy response
        Mattermost_Server-->>Prometheus: metric = 1
    end
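The failure handling in the diagram can be sketched roughly as follows. This is an illustrative model only, not the Mattermost implementation: the 60s interval, the three-failure threshold, and the 15s to 5min backoff bounds come from the diagram, while the doubling factor, the class and function names, and the metric value of 0 for an unhealthy engine are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass

HEALTH_CHECK_INTERVAL_S = 60  # from the diagram
FAILURE_THRESHOLD = 3         # from the diagram: stop engine after 3 consecutive failures
BACKOFF_START_S = 15          # from the diagram
BACKOFF_MAX_S = 300           # from the diagram (5 min)
BACKOFF_FACTOR = 2            # assumption: the source does not state the growth factor


def backoff_schedule(attempts: int) -> list[int]:
    """Return the delay (seconds) before each of the first `attempts` reconnection tries."""
    delays = []
    delay = BACKOFF_START_S
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * BACKOFF_FACTOR, BACKOFF_MAX_S)
    return delays


@dataclass
class EngineMonitor:
    """Hypothetical model of the health-check bookkeeping."""

    consecutive_failures: int = 0
    engine_running: bool = True

    def record_check(self, ok: bool) -> int:
        """Record one health check; return the assumed metric value (1 healthy, 0 not)."""
        if ok:
            self.consecutive_failures = 0
            return 1
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            # Threshold reached: the engine is stopped and reconnection
            # attempts (see backoff_schedule) would take over from here.
            self.engine_running = False
        return 0
```

Under the doubling assumption, `backoff_schedule(7)` gives 15, 30, 60, 120, 240, 300, 300 seconds, so the cap is reached on the sixth attempt.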
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
🧹 Nitpick comments (1)
source/administration-guide/scale/enterprise-search.rst (1)
142-160: Add direct operator actions next to each outage log message. The table explains meanings, but it doesn’t tell admins what to do for each case. Add a short “Recommended action” column (or follow-up bullets) so troubleshooting is immediately actionable.
As per coding guidelines, "Cover common error messages and provide direct solutions in troubleshooting sections to prevent silent failures".
Suggested doc tweak
 .. list-table::
-   :widths: 50 50
+   :widths: 35 35 30
    :header-rows: 1

    * - Log message
      - Meaning
+     - Recommended action
    * - ``Search engine health check failed``
      - An individual health check failed. Includes a ``consecutive_failures`` count.
+     - Verify search cluster health endpoint and network path; monitor if failures continue.
    * - ``Search engine health check failed repeatedly; stopping engine``
      - The failure threshold was reached and the engine has been stopped. Search falls back to the database.
+     - Expect DB-search fallback; investigate cluster outage and node availability immediately.
    * - ``Search engine watcher: Start() failed, will retry``
      - A reconnection attempt failed. Includes a ``next_backoff`` field indicating the time until the next retry.
+     - No restart needed; fix connectivity/auth config and wait for next retry window.
    * - ``Search engine watcher: engine started successfully``
      - The engine has recovered and is active again.
+     - Confirm search latency/accuracy returned to normal; close incident if stable.
    * - ``Search engine watcher: engine disabled, parking``
      - The health monitor is idle because the search engine is disabled in configuration.
+     - Re-enable enterprise search settings if this is unexpected.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/administration-guide/scale/enterprise-search.rst` around lines 142 - 160, Add a "Recommended action" column (or follow-up bullet) next to each log entry so operators know immediate steps; for each specific message—"Search engine health check failed" (check engine logs, inspect consecutive_failures, restart engine process if transient), "Search engine health check failed repeatedly; stopping engine" (verify configuration, restart engine manually, investigate root cause and increase resources or tune thresholds), "Search engine watcher: Start() failed, will retry" (inspect next_backoff, check network/credentials and retry logs, consider manual restart or backoff tuning), "Search engine watcher: engine started successfully" (verify search functionality and clear any alerts), and "Search engine watcher: engine disabled, parking" (confirm config disable was intentional or re-enable in config and restart watcher); update the table rows to include these concise operator steps mapped to the exact log strings so troubleshooting is immediately actionable.
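As a rough illustration of how the proposed message-to-action mapping could be used during triage, the sketch below looks up a recommended action for a log line. The log strings and action texts are the ones from the suggested table; the `ACTIONS` and `recommend` names and the sample log-line format are hypothetical. Because the first message is a prefix of the second, the lookup picks the longest matching message rather than the first hit.

```python
from typing import Optional

# Log message -> suggested operator action, taken from the review suggestion.
ACTIONS = {
    "Search engine health check failed":
        "Verify search cluster health endpoint and network path; monitor if failures continue.",
    "Search engine health check failed repeatedly; stopping engine":
        "Expect DB-search fallback; investigate cluster outage and node availability immediately.",
    "Search engine watcher: Start() failed, will retry":
        "No restart needed; fix connectivity/auth config and wait for next retry window.",
    "Search engine watcher: engine started successfully":
        "Confirm search latency/accuracy returned to normal; close incident if stable.",
    "Search engine watcher: engine disabled, parking":
        "Re-enable enterprise search settings if this is unexpected.",
}


def recommend(log_line: str) -> Optional[str]:
    """Return the action for the most specific known message found in log_line."""
    matches = [msg for msg in ACTIONS if msg in log_line]
    if not matches:
        return None
    # Prefer the longest match so the "failed repeatedly" message is not
    # misclassified as a single failed health check.
    return ACTIONS[max(matches, key=len)]
```

For example, a line containing `Search engine health check failed repeatedly; stopping engine` resolves to the fallback/outage action, not the single-failure one.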
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2260267d-8934-4bda-a93c-ca5ae7107b23
📒 Files selected for processing (1)
source/administration-guide/scale/enterprise-search.rst
Newest code from mattermost has been published to preview environment for Git SHA 451b8e9
Newest code from mattermost has been published to preview environment for Git SHA 6cadebf
@agarciamontoro can you please rebase this to the 11.7 branch once it's cut in the next few days?
@esethna Sure, will do!
6cadebf to 2e5d1d3
Newest code from mattermost has been published to preview environment for Git SHA 2e5d1d3
🧹 Nitpick comments (1)
source/administration-guide/scale/common-configure-mattermost-for-enterprise-search.rst (1)
38-38: Split this warning into structured bullets and define “health check” inline. This block is hard to scan in one paragraph, and “health check” is introduced without a quick definition. Consider 2–3 bullets (recommendation, fallback impact, v11.7 behaviour) and define health checks as periodic connectivity checks to the search backend.
As per coding guidelines, “Define technical terms briefly inline on first use rather than assuming familiarity…” and “Avoid walls of text by using clear structure, breaks, and formatting to make documentation easy to follow while performing tasks.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@source/administration-guide/scale/common-configure-mattermost-for-enterprise-search.rst` at line 38, Split the paragraph into 2–3 structured bullets: (1) Recommendation — "Disable Database Search once Elasticsearch/OpenSearch is fully configured and running" (include reason), (2) Fallback impact — "Mattermost Server will fall back to Database Search if Elasticsearch/OpenSearch are unavailable, which can cause performance degradation on high post volume deployments", and (3) v11.7 behavior — "From Mattermost v11.7, the server proactively detects outages via periodic health checks and falls back on the first health check failure rather than waiting for request timeouts." Inline after the first mention of "health check" add a brief definition such as "health checks are periodic connectivity checks to the search backend" and ensure each bullet is concise and scan-friendly; update the sentence that references the outage handling FAQ to remain as a final reference link.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 590b92c5-aa11-410e-ada3-9b3598858f25
📒 Files selected for processing (3)
source/administration-guide/scale/common-configure-mattermost-for-enterprise-search.rst
source/administration-guide/scale/enterprise-search.rst
source/administration-guide/scale/performance-monitoring-metrics.rst
✅ Files skipped from review due to trivial changes (2)
- source/administration-guide/scale/performance-monitoring-metrics.rst
- source/administration-guide/scale/enterprise-search.rst
Grouped all ES/OS-related changes from MM-67433, MM-66612 and MM-67291 in this PR.
Summary
Add the `mattermost_search_engine_status` metric to the list of metrics.
Ticket Link
https://mattermost.atlassian.net/browse/MM-67433
https://mattermost.atlassian.net/browse/MM-66612
https://mattermost.atlassian.net/browse/MM-67291
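For readers who want to check the `mattermost_search_engine_status` metric by hand, a minimal sketch of reading it from Prometheus text-format output follows. Only the metric name and the value 1 for a healthy engine come from this PR; the function name, the sample text, the gauge type, and the absence of labels are assumptions.

```python
from typing import Optional

METRIC = "mattermost_search_engine_status"

# Hypothetical scrape output; the real help text and type may differ.
SAMPLE = """\
# HELP mattermost_search_engine_status (hypothetical help text)
# TYPE mattermost_search_engine_status gauge
mattermost_search_engine_status 1
"""


def read_status(exposition: str) -> Optional[float]:
    """Return the first sample value of METRIC, or None if it is absent."""
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split()[0]
        if name != METRIC:
            continue
        # The value is the first token after the optional {labels} section;
        # an optional timestamp may follow it.
        rest = line.rsplit("}", 1)[1] if "}" in line else line[len(name):]
        return float(rest.split()[0])
    return None
```

A result of `1.0` corresponds to the healthy state shown in the sequence diagram; how other states are encoded should be confirmed against the published metrics documentation.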