MM-67433, MM-66612, MM-67291: Add new ES/OS changes #8882

Open
agarciamontoro wants to merge 2 commits into master from MM-67433.es.startup.fail

Conversation

@agarciamontoro
Member

@agarciamontoro agarciamontoro commented Apr 13, 2026

Summary

  • Add a new FAQ explaining how Mattermost behaves when the Elasticsearch or OpenSearch server suffers an outage, describing the health monitor and the new fail-fast behaviour for ES queries when the server is down.
  • Add the new mattermost_search_engine_status metric to the list of metrics.

Ticket Link

https://mattermost.atlassian.net/browse/MM-67433
https://mattermost.atlassian.net/browse/MM-66612
https://mattermost.atlassian.net/browse/MM-67291

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

Walkthrough

Added documentation describing Mattermost v11.7+ automatic handling of Elasticsearch/OpenSearch outages: periodic health checks, immediate fallback to database search on failure, exponential-backoff reconnection attempts, automatic recovery, re-evaluation triggers on config/license changes, and a new Prometheus metric plus expected log messages.
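The reconnection schedule mentioned in the walkthrough can be sketched as follows. This is a minimal illustration rather than the Mattermost implementation: the 15-second start and 5-minute cap come from the change summary in this PR, while the doubling factor and the `backoff_schedule` name are assumptions made for the example.

```python
from itertools import islice
from typing import Iterator


def backoff_schedule(initial: float = 15.0, cap: float = 300.0,
                     factor: float = 2.0) -> Iterator[float]:
    """Yield retry delays in seconds, growing geometrically up to a cap.

    The 15 s start and 300 s (5 min) cap follow the PR summary; the
    doubling factor is an assumption made for illustration only.
    """
    delay = initial
    while True:
        yield delay
        # Grow the delay, but never beyond the configured cap.
        delay = min(delay * factor, cap)


# First seven delays an operator would see under these assumptions.
first_delays = list(islice(backoff_schedule(), 7))
print(first_delays)
```

Under these assumptions the delay reaches the cap after a handful of attempts and stays there, which is the usual reason for pairing exponential growth with an upper bound.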

Changes

| Cohort / File(s) | Summary |
|---|---|
| Enterprise Search FAQ: `source/administration-guide/scale/enterprise-search.rst` | Added FAQ describing automatic outage detection and handling for Elasticsearch/OpenSearch: 60s health-check interval, 3 consecutive failures to stop engine (first failure marks unhealthy for monitoring), exponential-backoff reconnection (15s → 5min), automatic resume, re-evaluation on config/license changes, and expected log messages. |
| Configuration guidance: `source/administration-guide/scale/common-configure-mattermost-for-enterprise-search.rst` | Updated fallback wording to explicitly state Mattermost falls back to Database Search when Elasticsearch/OpenSearch are unavailable and added cross-reference to the outage-handling FAQ with version note (v11.7 behavior). |
| Monitoring metrics: `source/administration-guide/scale/performance-monitoring-metrics.rst` | Added documented metric `mattermost_search_engine_status` (numeric: 1 = healthy or not configured, 0 = configured but unavailable) for alerting on search engine outages. |
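As a rough illustration of how the new metric could be consumed, the sketch below reads a Prometheus text-format payload and reports the value of `mattermost_search_engine_status`. The metric name and the meaning of 1 and 0 come from this PR; the sample payload, the helper names, and whether the real metric carries labels are all assumptions.

```python
from typing import Optional

METRIC = "mattermost_search_engine_status"


def search_engine_status(exposition: str) -> Optional[float]:
    """Return the first sample value for METRIC, or None if it is absent."""
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split()[0]
        if name != METRIC:
            continue
        # The sample value follows the optional {labels} block.
        rest = line.rsplit("}", 1)[-1] if "{" in line else line[len(name):]
        return float(rest.split()[0])
    return None


def engine_is_down(exposition: str) -> bool:
    """True only when the engine is configured but unavailable (value 0)."""
    return search_engine_status(exposition) == 0.0


# Hypothetical scrape output, written by hand for this example.
sample = (
    "# TYPE mattermost_search_engine_status gauge\n"
    "mattermost_search_engine_status 0\n"
)
print(engine_is_down(sample))
```

In practice an alerting rule in the monitoring system would express the same condition; the Python form is only meant to make the 1/0 semantics concrete.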

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Mattermost_Server as "Mattermost Server"
    participant Search_Engine as "Elasticsearch/OpenSearch"
    participant Database as "Database"
    participant Prometheus as "Prometheus"

    Client->>Mattermost_Server: Search request
    Mattermost_Server->>Search_Engine: Forward query (if engine healthy)
    alt Search engine responds
        Search_Engine-->>Mattermost_Server: Search results
        Mattermost_Server-->>Client: Return results
    else Search engine unavailable
        Mattermost_Server->>Database: Fallback to Database Search
        Database-->>Mattermost_Server: Search results
        Mattermost_Server-->>Client: Return results
    end

    Note over Mattermost_Server,Search_Engine: Periodic health checks (every 60s)
    Mattermost_Server->>Search_Engine: Health check ping
    alt Failure(s)
        Mattermost_Server-->>Mattermost_Server: Mark unhealthy after first failure for monitoring / stop engine after 3 consecutive failures
        Mattermost_Server->>Search_Engine: Reconnection attempts (exponential backoff 15s→5min)
        Prometheus-->>Mattermost_Server: Scrape `mattermost_search_engine_status`
    else Recovery
        Search_Engine-->>Mattermost_Server: Healthy response
        Mattermost_Server-->>Prometheus: metric = 1
    end
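The flow in the diagram can be approximated in a few lines. This is a simplified model under stated assumptions, not the server code: the threshold of 3 consecutive failures and the fallback on the first failed check come from this PR, while `EngineHealth`, `search`, and the callables passed to it are hypothetical names, and recovery is modelled as a single successful check rather than the separate reconnection loop.

```python
from dataclasses import dataclass
from typing import Callable, List

FAILURE_THRESHOLD = 3  # consecutive failed checks before the engine is stopped


@dataclass
class EngineHealth:
    consecutive_failures: int = 0
    healthy: bool = True   # flips to False on the first failed check
    stopped: bool = False  # flips to True once the threshold is reached

    def record(self, check_ok: bool) -> None:
        """Update state from one health check result."""
        if check_ok:
            # Simplification: any success fully restores the engine.
            self.consecutive_failures = 0
            self.healthy = True
            self.stopped = False
            return
        self.consecutive_failures += 1
        self.healthy = False
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            self.stopped = True


def search(query: str, health: EngineHealth,
           engine_search: Callable[[str], List[str]],
           db_search: Callable[[str], List[str]]) -> List[str]:
    """Route to the search engine while healthy, otherwise to the database."""
    if health.healthy:
        return engine_search(query)
    return db_search(query)


# Stand-in backends that only tag where the query was served from.
health = EngineHealth()
engine = lambda q: [f"engine:{q}"]
database = lambda q: [f"db:{q}"]

print(search("hello", health, engine, database))
health.record(False)
print(search("hello", health, engine, database))
```

Keeping the "unhealthy" and "stopped" flags separate mirrors the distinction drawn in the summary: queries stop reaching the engine after one failure, while the engine itself is only stopped after the third.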

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title references three ticket numbers but only vaguely mentions 'ES/OS changes' without clearly identifying the main change: documenting automatic Elasticsearch/OpenSearch outage handling. | Revise the title to be more descriptive and specific, such as 'Document Elasticsearch/OpenSearch outage handling and health checks' to better convey the primary content. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description check | ✅ Passed | The pull request description accurately describes the changeset, detailing the new FAQ about Elasticsearch/OpenSearch outage behaviour and the addition of the new search engine status metric. |


Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
source/administration-guide/scale/enterprise-search.rst (1)

142-160: Add direct operator actions next to each outage log message.

The table explains meanings, but it doesn’t tell admins what to do for each case. Add a short “Recommended action” column (or follow-up bullets) so troubleshooting is immediately actionable.

As per coding guidelines, "Cover common error messages and provide direct solutions in troubleshooting sections to prevent silent failures".

Suggested doc tweak
 .. list-table::
-   :widths: 50 50
+   :widths: 35 35 30
    :header-rows: 1

    * - Log message
      - Meaning
+     - Recommended action
    * - ``Search engine health check failed``
      - An individual health check failed. Includes a ``consecutive_failures`` count.
+     - Verify search cluster health endpoint and network path; monitor if failures continue.
    * - ``Search engine health check failed repeatedly; stopping engine``
      - The failure threshold was reached and the engine has been stopped. Search falls back to the database.
+     - Expect DB-search fallback; investigate cluster outage and node availability immediately.
    * - ``Search engine watcher: Start() failed, will retry``
      - A reconnection attempt failed. Includes a ``next_backoff`` field indicating the time until the next retry.
+     - No restart needed; fix connectivity/auth config and wait for next retry window.
    * - ``Search engine watcher: engine started successfully``
      - The engine has recovered and is active again.
+     - Confirm search latency/accuracy returned to normal; close incident if stable.
    * - ``Search engine watcher: engine disabled, parking``
      - The health monitor is idle because the search engine is disabled in configuration.
+     - Re-enable enterprise search settings if this is unexpected.
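To make the table easier to apply during an incident, the log messages can be matched programmatically. The sketch below is only illustrative: the message strings and their meanings are taken from the table above, while the `explain` helper and the shape of the sample log line are assumptions, as is the idea that the message appears verbatim inside each log line.

```python
from typing import Optional

# Message text and meanings copied from the FAQ table in this PR.
LOG_MEANINGS = {
    "Search engine health check failed":
        "An individual health check failed.",
    "Search engine health check failed repeatedly; stopping engine":
        "The failure threshold was reached and the engine has been stopped.",
    "Search engine watcher: Start() failed, will retry":
        "A reconnection attempt failed.",
    "Search engine watcher: engine started successfully":
        "The engine has recovered and is active again.",
    "Search engine watcher: engine disabled, parking":
        "The health monitor is idle because the search engine is disabled.",
}


def explain(log_line: str) -> Optional[str]:
    """Return the meaning of the first known message found in log_line."""
    # Check longer messages first so that the "failed repeatedly" entry is
    # not shadowed by the shorter "health check failed" prefix.
    for message in sorted(LOG_MEANINGS, key=len, reverse=True):
        if message in log_line:
            return LOG_MEANINGS[message]
    return None


# Hypothetical log line; the real field layout may differ.
line = 'level=error msg="Search engine health check failed repeatedly; stopping engine"'
print(explain(line))
```

Sorting by length matters here because one message is a strict prefix of another, so a naive first-match lookup could report the wrong meaning.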
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@source/administration-guide/scale/enterprise-search.rst` around lines 142 -
160, Add a "Recommended action" column (or follow-up bullet) next to each log
entry so operators know immediate steps; for each specific message—"Search
engine health check failed" (check engine logs, inspect consecutive_failures,
restart engine process if transient), "Search engine health check failed
repeatedly; stopping engine" (verify configuration, restart engine manually,
investigate root cause and increase resources or tune thresholds), "Search
engine watcher: Start() failed, will retry" (inspect next_backoff, check
network/credentials and retry logs, consider manual restart or backoff tuning),
"Search engine watcher: engine started successfully" (verify search
functionality and clear any alerts), and "Search engine watcher: engine
disabled, parking" (confirm config disable was intentional or re-enable in
config and restart watcher); update the table rows to include these concise
operator steps mapped to the exact log strings so troubleshooting is immediately
actionable.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2260267d-8934-4bda-a93c-ca5ae7107b23

📥 Commits

Reviewing files that changed from the base of the PR and between 24b9fb6 and 451b8e9.

📒 Files selected for processing (1)
  • source/administration-guide/scale/enterprise-search.rst

@github-actions
Contributor

Newest code from mattermost has been published to preview environment for Git SHA 451b8e9

@github-actions
Contributor

Newest code from mattermost has been published to preview environment for Git SHA 6cadebf

@esethna
Contributor

esethna commented Apr 13, 2026

@agarciamontoro can you please rebase this to the 11.7 branch once it's cut in the next few days?

@esethna added the "Do Not Merge/Awaiting Next Release" label (to be merged with the next release, e.g. API documentation updates) Apr 13, 2026
@esethna added this to the v11.7.0 milestone Apr 13, 2026
@agarciamontoro
Member Author

@esethna Sure, will do!

@agarciamontoro force-pushed the MM-67433.es.startup.fail branch from 6cadebf to 2e5d1d3 on April 14, 2026 10:48
@github-actions
Contributor

Newest code from mattermost has been published to preview environment for Git SHA 2e5d1d3

Contributor

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
source/administration-guide/scale/common-configure-mattermost-for-enterprise-search.rst (1)

38-38: Split this warning into structured bullets and define “health check” inline.

This block is hard to scan in one paragraph, and “health check” is introduced without a quick definition. Consider 2–3 bullets (recommendation, fallback impact, v11.7 behaviour) and define health checks as periodic connectivity checks to the search backend.

As per coding guidelines, “Define technical terms briefly inline on first use rather than assuming familiarity…” and “Avoid walls of text by using clear structure, breaks, and formatting to make documentation easy to follow while performing tasks.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@source/administration-guide/scale/common-configure-mattermost-for-enterprise-search.rst`
at line 38, Split the paragraph into 2–3 structured bullets: (1) Recommendation
— "Disable Database Search once Elasticsearch/OpenSearch is fully configured and
running" (include reason), (2) Fallback impact — "Mattermost Server will fall
back to Database Search if Elasticsearch/OpenSearch are unavailable, which can
cause performance degradation on high post volume deployments", and (3) v11.7
behavior — "From Mattermost v11.7, the server proactively detects outages via
periodic health checks and falls back on the first health check failure rather
than waiting for request timeouts." Inline after the first mention of "health
check" add a brief definition such as "health checks are periodic connectivity
checks to the search backend" and ensure each bullet is concise and
scan-friendly; update the sentence that references the outage handling FAQ to
remain as a final reference link.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 590b92c5-aa11-410e-ada3-9b3598858f25

📥 Commits

Reviewing files that changed from the base of the PR and between 451b8e9 and 2e5d1d3.

📒 Files selected for processing (3)
  • source/administration-guide/scale/common-configure-mattermost-for-enterprise-search.rst
  • source/administration-guide/scale/enterprise-search.rst
  • source/administration-guide/scale/performance-monitoring-metrics.rst
✅ Files skipped from review due to trivial changes (2)
  • source/administration-guide/scale/performance-monitoring-metrics.rst
  • source/administration-guide/scale/enterprise-search.rst

@agarciamontoro changed the title from "MM-67433: Add new ES/OS behaviour on outages" to "MM-67433, MM-66612, MM-67291: Add new ES/OS changes" Apr 14, 2026
@agarciamontoro
Member Author

Grouped all ES/OS-related changes from MM-67433, MM-66612 and MM-67291 in this PR.
