MM-67433, MM-66612, MM-67291: Add new ES/OS changes by agarciamontoro · Pull Request #8882 · mattermost/docs

agarciamontoro · 2026-04-13T09:44:28Z

Summary

Add a new FAQ explaining how Mattermost behaves when the Elasticsearch or OpenSearch server suffers an outage. The FAQ describes the health monitor and the new fail-fast behaviour for ES/OS queries while the server is down.
Add the new `mattermost_search_engine_status` metric to the list of metrics.

Ticket Link

https://mattermost.atlassian.net/browse/MM-67433
https://mattermost.atlassian.net/browse/MM-66612
https://mattermost.atlassian.net/browse/MM-67291

coderabbitai · 2026-04-13T09:46:21Z

Walkthrough

Added documentation describing Mattermost v11.7+ automatic handling of Elasticsearch/OpenSearch outages: periodic health checks, immediate fallback to database search on failure, exponential-backoff reconnection attempts, automatic recovery, re-evaluation triggers on config/license changes, and a new Prometheus metric plus expected log messages.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Enterprise Search FAQ `source/administration-guide/scale/enterprise-search.rst` | Added FAQ describing automatic outage detection and handling for Elasticsearch/OpenSearch: 60s health-check interval, 3 consecutive failures to stop engine (first failure marks unhealthy for monitoring), exponential-backoff reconnection (15s → 5min), automatic resume, re-evaluation on config/license changes, and expected log messages. |
| Configuration guidance `source/administration-guide/scale/common-configure-mattermost-for-enterprise-search.rst` | Updated fallback wording to explicitly state Mattermost falls back to Database Search when Elasticsearch/OpenSearch are unavailable and added cross-reference to the outage-handling FAQ with version note (v11.7 behavior). |
| Monitoring metrics `source/administration-guide/scale/performance-monitoring-metrics.rst` | Added documented metric `mattermost_search_engine_status` (numeric: `1` = healthy or not configured, `0` = configured but unavailable) for alerting on search engine outages. |
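
To make the reconnection schedule in the table concrete, here is a minimal sketch of an exponential backoff that starts at 15s and is capped at 5min. The 15s and 5min bounds come from the summary above; the doubling factor and the `backoff_delays` helper are assumptions for illustration, not the server's actual implementation.

```python
from itertools import islice
from typing import Iterator

INITIAL_BACKOFF_S = 15    # from the summary: first retry after 15 seconds
MAX_BACKOFF_S = 5 * 60    # from the summary: delays are capped at 5 minutes
BACKOFF_FACTOR = 2        # assumption: the multiplier is not stated in this PR


def backoff_delays(
    initial: int = INITIAL_BACKOFF_S,
    maximum: int = MAX_BACKOFF_S,
    factor: int = BACKOFF_FACTOR,
) -> Iterator[int]:
    """Yield retry delays in seconds, growing by `factor` until capped."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)


if __name__ == "__main__":
    # First seven delays under the assumed doubling factor.
    print(list(islice(backoff_delays(), 7)))
```

Under these assumptions the delay reaches the 5min cap on the sixth attempt and stays there until a reconnection succeeds.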

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Mattermost_Server as "Mattermost Server"
    participant Search_Engine as "Elasticsearch/OpenSearch"
    participant Database as "Database"
    participant Prometheus as "Prometheus"

    Client->>Mattermost_Server: Search request
    Mattermost_Server->>Search_Engine: Forward query (if engine healthy)
    alt Search engine responds
        Search_Engine-->>Mattermost_Server: Search results
        Mattermost_Server-->>Client: Return results
    else Search engine unavailable
        Mattermost_Server->>Database: Fallback to Database Search
        Database-->>Mattermost_Server: Search results
        Mattermost_Server-->>Client: Return results
    end

    Note over Mattermost_Server,Search_Engine: Periodic health checks (every 60s)
    Mattermost_Server->>Search_Engine: Health check ping
    alt Failure(s)
        Mattermost_Server-->>Mattermost_Server: Mark unhealthy after first failure for monitoring / stop engine after 3 consecutive failures
        Mattermost_Server->>Search_Engine: Reconnection attempts (exponential backoff 15s→5min)
        Prometheus-->>Mattermost_Server: Scrape `mattermost_search_engine_status`
    else Recovery
        Search_Engine-->>Mattermost_Server: Healthy response
        Mattermost_Server-->>Prometheus: metric = 1
    end
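
The flow in the diagram above can be sketched as a small state holder. This is a simplified model under stated assumptions, not the server code: the `SearchHealth` class and its method names are hypothetical, while the threshold of 3 consecutive failures, the fallback on the first failure, and the metric values (`1` = healthy or not configured, `0` = configured but unavailable) are taken from this review's summary.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 3  # from the summary: engine is stopped after 3 consecutive failures


@dataclass
class SearchHealth:
    """Hypothetical model of the health monitor's observable state."""

    configured: bool = True
    healthy: bool = True
    engine_running: bool = True
    consecutive_failures: int = 0

    def record_check(self, ok: bool) -> None:
        """Apply the result of one periodic health check."""
        if ok:
            self.consecutive_failures = 0
            self.healthy = True
            return
        self.consecutive_failures += 1
        self.healthy = False  # the first failure already marks the engine unhealthy
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            self.engine_running = False  # stop the engine; reconnection takes over

    def record_restart(self) -> None:
        """A reconnection attempt succeeded and the engine is active again."""
        self.engine_running = True
        self.healthy = True
        self.consecutive_failures = 0

    def backend(self) -> str:
        """Where a search request would be routed in this model."""
        usable = self.configured and self.healthy and self.engine_running
        return "search_engine" if usable else "database"

    def metric(self) -> int:
        """Value modelled after mattermost_search_engine_status."""
        if not self.configured:
            return 1
        return 1 if (self.healthy and self.engine_running) else 0


if __name__ == "__main__":
    state = SearchHealth()
    state.record_check(False)
    print(state.backend(), state.metric(), state.engine_running)
```

In this model a single failed check is enough to route searches to the database and drop the metric to `0`, while the engine itself is only stopped once the third consecutive failure is recorded.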

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title references three ticket numbers but only vaguely mentions 'ES/OS changes' without clearly identifying the main change: documenting automatic Elasticsearch/OpenSearch outage handling. | Revise the title to be more descriptive and specific, such as 'Document Elasticsearch/OpenSearch outage handling and health checks' to better convey the primary content. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description check | ✅ Passed | The pull request description accurately describes the changeset, detailing the new FAQ about Elasticsearch/OpenSearch outage behaviour and the addition of the new search engine status metric. |

coderabbitai

🧹 Nitpick comments (1)

source/administration-guide/scale/enterprise-search.rst (1)

142-160: Add direct operator actions next to each outage log message.

The table explains meanings, but it doesn’t tell admins what to do for each case. Add a short “Recommended action” column (or follow-up bullets) so troubleshooting is immediately actionable.

As per coding guidelines, "Cover common error messages and provide direct solutions in troubleshooting sections to prevent silent failures".

Suggested doc tweak

 .. list-table::
-   :widths: 50 50
+   :widths: 35 35 30
    :header-rows: 1

    * - Log message
      - Meaning
+     - Recommended action
    * - ``Search engine health check failed``
      - An individual health check failed. Includes a ``consecutive_failures`` count.
+     - Verify search cluster health endpoint and network path; monitor if failures continue.
    * - ``Search engine health check failed repeatedly; stopping engine``
      - The failure threshold was reached and the engine has been stopped. Search falls back to the database.
+     - Expect DB-search fallback; investigate cluster outage and node availability immediately.
    * - ``Search engine watcher: Start() failed, will retry``
      - A reconnection attempt failed. Includes a ``next_backoff`` field indicating the time until the next retry.
+     - No restart needed; fix connectivity/auth config and wait for next retry window.
    * - ``Search engine watcher: engine started successfully``
      - The engine has recovered and is active again.
+     - Confirm search latency/accuracy returned to normal; close incident if stable.
    * - ``Search engine watcher: engine disabled, parking``
      - The health monitor is idle because the search engine is disabled in configuration.
+     - Re-enable enterprise search settings if this is unexpected.
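
For operators who want to act on these messages automatically, the table above can be turned into a small lookup. The sketch below is illustrative only: the message strings and meanings are copied from the table, but the `classify` helper and the sample log line format are hypothetical. Note that the "failed repeatedly" message contains the shorter "health check failed" message as a prefix, so the sketch tries longer strings first.

```python
from typing import Optional

# Message strings and meanings as documented in the table above.
LOG_MEANINGS = {
    "Search engine health check failed": "an individual health check failed",
    "Search engine health check failed repeatedly; stopping engine":
        "engine stopped; search falls back to the database",
    "Search engine watcher: Start() failed, will retry":
        "a reconnection attempt failed",
    "Search engine watcher: engine started successfully":
        "the engine has recovered and is active again",
    "Search engine watcher: engine disabled, parking":
        "monitor idle; search engine disabled in configuration",
}


def classify(line: str) -> Optional[str]:
    """Return the documented meaning of a log line, or None if unrelated."""
    # Longest message first, so the 'repeatedly' variant is not
    # mistaken for a single failed health check.
    for message in sorted(LOG_MEANINGS, key=len, reverse=True):
        if message in line:
            return LOG_MEANINGS[message]
    return None


if __name__ == "__main__":
    # Hypothetical log line; the real log format may differ.
    sample = 'level=warn msg="Search engine health check failed" consecutive_failures=1'
    print(classify(sample))
```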

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@source/administration-guide/scale/enterprise-search.rst` around lines 142 -
160, Add a "Recommended action" column (or follow-up bullet) next to each log
entry so operators know immediate steps; for each specific message—"Search
engine health check failed" (check engine logs, inspect consecutive_failures,
restart engine process if transient), "Search engine health check failed
repeatedly; stopping engine" (verify configuration, restart engine manually,
investigate root cause and increase resources or tune thresholds), "Search
engine watcher: Start() failed, will retry" (inspect next_backoff, check
network/credentials and retry logs, consider manual restart or backoff tuning),
"Search engine watcher: engine started successfully" (verify search
functionality and clear any alerts), and "Search engine watcher: engine
disabled, parking" (confirm config disable was intentional or re-enable in
config and restart watcher); update the table rows to include these concise
operator steps mapped to the exact log strings so troubleshooting is immediately
actionable.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2260267d-8934-4bda-a93c-ca5ae7107b23

📥 Commits

Reviewing files that changed from the base of the PR and between 24b9fb6 and 451b8e9.

📒 Files selected for processing (1)

source/administration-guide/scale/enterprise-search.rst

github-actions · 2026-04-13T09:47:57Z

Newest code from mattermost has been published to preview environment for Git SHA 451b8e9

github-actions · 2026-04-13T16:11:35Z

Newest code from mattermost has been published to preview environment for Git SHA 6cadebf

esethna · 2026-04-13T21:55:33Z

@agarciamontoro can you please rebase this to the 11.7 branch once it's cut in the next few days?

agarciamontoro · 2026-04-14T10:05:28Z

@esethna Sure, will do!

github-actions · 2026-04-14T10:51:45Z

Newest code from mattermost has been published to preview environment for Git SHA 2e5d1d3

coderabbitai

🧹 Nitpick comments (1)

source/administration-guide/scale/common-configure-mattermost-for-enterprise-search.rst (1)
38-38: Split this warning into structured bullets and define “health check” inline.

This block is hard to scan in one paragraph, and “health check” is introduced without a quick definition. Consider 2–3 bullets (recommendation, fallback impact, v11.7 behaviour) and define health checks as periodic connectivity checks to the search backend.

As per coding guidelines, “Define technical terms briefly inline on first use rather than assuming familiarity…” and “Avoid walls of text by using clear structure, breaks, and formatting to make documentation easy to follow while performing tasks.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/administration-guide/scale/common-configure-mattermost-for-enterprise-search.rst`
at line 38, Split the paragraph into 2–3 structured bullets: (1) Recommendation
— "Disable Database Search once Elasticsearch/OpenSearch is fully configured and
running" (include reason), (2) Fallback impact — "Mattermost Server will fall
back to Database Search if Elasticsearch/OpenSearch are unavailable, which can
cause performance degradation on high post volume deployments", and (3) v11.7
behavior — "From Mattermost v11.7, the server proactively detects outages via
periodic health checks and falls back on the first health check failure rather
than waiting for request timeouts." Inline after the first mention of "health
check" add a brief definition such as "health checks are periodic connectivity
checks to the search backend" and ensure each bullet is concise and
scan-friendly; update the sentence that references the outage handling FAQ to
remain as a final reference link.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 590b92c5-aa11-410e-ada3-9b3598858f25

📥 Commits

Reviewing files that changed from the base of the PR and between 451b8e9 and 2e5d1d3.

📒 Files selected for processing (3)

source/administration-guide/scale/common-configure-mattermost-for-enterprise-search.rst
source/administration-guide/scale/enterprise-search.rst
source/administration-guide/scale/performance-monitoring-metrics.rst

✅ Files skipped from review due to trivial changes (2)

source/administration-guide/scale/performance-monitoring-metrics.rst
source/administration-guide/scale/enterprise-search.rst

agarciamontoro · 2026-04-14T11:04:38Z

Grouped all ES/OS-related changes from MM-67433, MM-66612 and MM-67291 in this PR.

coderabbitai bot reviewed Apr 13, 2026

coderabbitai bot approved these changes Apr 13, 2026

agarciamontoro mentioned this pull request Apr 13, 2026

MM-67433: Elasticsearch health monitor mattermost/mattermost#35747

Merged

esethna approved these changes Apr 13, 2026

esethna added the "Do Not Merge/Awaiting Next Release" label (to be merged with the next release, e.g. API documentation updates) Apr 13, 2026

esethna added this to the v11.7.0 milestone Apr 13, 2026

agarciamontoro added 2 commits April 14, 2026 12:47

Add new ES/OS behaviour on outages

275a7f7

Document new fast-fail behaviour and health metric

2e5d1d3

agarciamontoro force-pushed the MM-67433.es.startup.fail branch from 6cadebf to 2e5d1d3 April 14, 2026 10:48

coderabbitai bot reviewed Apr 14, 2026

agarciamontoro changed the title ~~MM-67433: Add new ES/OS behaviour on outages~~ MM-67433, MM-66612, MM-67291: Add new ES/OS changes Apr 14, 2026

agarciamontoro wants to merge 2 commits into master from MM-67433.es.startup.fail