libpcp: recover from corrupted archives in multi-archive contexts#2586
Conversation
|
@natoscott , @kmcdonell May I ask you for a review, please? |
|
[babble from slack] If they are seeing corruption that is not of the "truncation" form I'd be very interested to know the details of the corruption. |
ebae07d to
989ad25
Compare
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughThis PR makes multi-archive log handling resilient: init skips unreadable archives with warnings and can rebuild shared state, read removes corrupted archives and retries, fetch treats corruption non-fatally when multiple archives exist, and QA normalizes stderr output for stable tests. ChangesMulti-archive corruption recovery
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/libpcp/src/context.c`:
- Around line 947-957: The current code in context.c that calls
__pmFindOrOpenArchive(ctxp, current, multi_arch) treats every negative sts as a
recoverable "skip and continue" when multi_arch is true; instead, only treat
truly corrupt/unreadable errors as recoverable and propagate
semantic/incompatibility errors. Change the multi_arch branch to inspect sts and
only call pmNotifyErr(... "skipping corrupt/unreadable archive ...") and
continue when sts matches the specific recoverable error codes (e.g., corrupt or
I/O related codes returned by __pmFindOrOpenArchive); for all other negative sts
(label/host/schema mismatches or other semantic incompatibilities) log an error
and return sts (or otherwise fail the open) so the caller sees the failure.
Update the block around __pmFindOrOpenArchive, sts, multi_arch, current, end,
pmNotifyErr and pmErrStr to implement this conditional handling and propagate
non-recoverable errors instead of silently skipping them.
In `@src/libpcp/src/logutil.c`:
- Around line 1994-2003: The retry path inside the PM_MODE_FORW branch fails to
refresh the archive file offset after swapping archives: when
__pmLogChangeArchive(ctxp, corrupt_idx) succeeds we update lcp and f but leave
acp->ac_offset pointing to the removed archive; modify the branch that sets lcp
= acp->ac_log and f = acp->ac_mfp to also reset acp->ac_offset =
__pmLogLabelSize(lcp) (matching the other archive-switch paths) before setting
acp->ac_vol = acp->ac_curvol and goto again so subsequent seeks/read recovery
use the correct offset for the new file.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a19c6ec9-8e44-41f7-8a44-15f2e38fe877
⛔ Files ignored due to path filters (2)
qa/1671.outis excluded by!**/*.outqa/722.outis excluded by!**/*.out
📒 Files selected for processing (5)
qa/1671qa/722src/libpcp/src/context.csrc/libpcp/src/interp.csrc/libpcp/src/logutil.c
ad73707 to
9e214bd
Compare
When reading multi-archive PCP logs a corrupted volume no longer terminates the entire replay. Instead, corrupted archives are removed from ac_log_list, so that data from healthy volumes can still be read. Tests qa/1671 and qa/722 are modified accordingly to reflect this change.
9e214bd to
20605cc
Compare
When reading multi-archive PCP logs (e.g. Cockpit metrics history), a corrupted volume no longer terminates the entire replay. Instead, corrupted archives are flagged and skipped so that data from healthy volumes can still be read.
Resolves: RHEL-169855
Summary by CodeRabbit
Bug Fixes
Tests