Commit dd77c15

docs/src: Add 2021-07-17 postmortem
1 parent e239532 commit dd77c15

2 files changed

Lines changed: 115 additions & 0 deletions

docs/src/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -33,3 +33,4 @@
 - [Image Building](./upkeep/images.md)
 - [Postmortems](./postmortem/index.md)
 - [2021-06-06: a-hel-fi Failure](./postmortem/2021-06-06.md)
+- [2021-07-17: package signing issues](./postmortem/2021-07-17.md)

docs/src/postmortem/2021-07-17.md

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
# Post Mortem 2021-07-17

## Incident summary

For approximately 2 days the Nomad-managed tasks on a-fsn-de were
unavailable, including repository management tasks. Resolution was,
and continues to be, hampered by a distinct and ongoing outage.

### Amended 2021-07-31

CAS service was restored on 2021-07-22. Any point below that mentions
an ongoing outage was written while the incident was still in progress.

## Leadup

At this time we do not know what caused the reboot of a-fsn-de.
However, we do know that a parallel and ongoing incident was occurring
at the same time. Void uses a centralized authentication service (CAS) to
manage access to our machines, and like many secure services this one
relies on TLS certificates. The CAS certificate expired without being
noticed, which prevented what would otherwise have been a quick
recovery: logging in and bouncing some services.

Additionally, for when the CAS is unavailable, we maintain a break-glass
login capability for a handful of extremely trusted maintainers (1:1
with the people that have access to the package signing key). This
access was discovered to be impaired: one developer's key was missing
entirely, and another developer had failed to rotate their key. The
third developer was on vacation but was able to log in and rectify the
keying problem.

## Fault

The Nomad outage was caused by an unexpected restart of a-fsn-de.
When Nomad hosts reboot, a known defect means runit may bring up the
services in a race that leaves Nomad unusable until the services are
restarted in a specific order.

The unavailability of the CAS system means that we still cannot log in
to all hosts normally.

The issues with the break-glass keys slowed the recovery of both the
specific Nomad host and the CAS server.

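The ordered restart mentioned in this section could be scripted so that recovery does not depend on someone remembering the sequence. The following is a minimal Python sketch, not Void's actual tooling. The service names in `RESTART_ORDER` are hypothetical placeholders, since this postmortem does not record the real order, and the sketch assumes runit's `sv restart` command is available on the host.

```python
import subprocess

# Hypothetical order: this postmortem does not name the services or their
# required sequence. Replace with the real runit service names.
RESTART_ORDER = ["example-dependency", "nomad"]


def restart_commands(services):
    """Build one `sv restart <service>` command per service, in order."""
    return [["sv", "restart", name] for name in services]


def restart_in_order(services, run=subprocess.run):
    """Restart services one at a time, stopping at the first failure.

    `run` is injectable so the sequencing can be exercised without
    touching a real host. With the default subprocess.run, check=True
    raises CalledProcessError if a restart exits non-zero.
    """
    for cmd in restart_commands(services):
        run(cmd, check=True)
```

Calling `restart_in_order(RESTART_ORDER)` on the affected host would then walk the list from first to last. Keeping the order in one list makes the known defect's workaround reviewable alongside the rest of the documentation.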
## Impact

Publicly Visible:

* Repository signing unavailable
* Builds not completing normally
* musl and aarch64 appeared to lag behind glibc

Internally Visible:

* CAS logins not available
* Detailed failure logs not available in Grafana (requires CAS login)
* Couldn't make control requests to Nomad (requires Vault CAS integration)

## Detection

The Nomad failure was detected by external observation that an update
to the `less` package was not signed, which was preventing installs
from proceeding.

The CAS failure was discovered a few days before the signing issue but
was deemed non-critical, as it was an inconvenience that could be fixed
within a week.

## Response

The ability to merge changes in GitHub was restricted to prevent new
builds from running that might further complicate recovery efforts.

@the-maldridge and @Gottox were recalled from vacation to recover
access to the system.

## Recovery

@Gottox used break-glass access both to restore break-glass access for
@the-maldridge and @duncaen and to restart the stuck signing process.

## What went well

* Excellent internal communication kept everyone in the loop as to
  what was broken, what was being done to fix it, and who was
  responsible for taking action.

* Break-glass access, once used, did work effectively.

## What could be done better

* External communication was not great. A post went up on
  voidlinux.org, but no Twitter notification was made, and the post
  was not widely shared on our other channels.

* Break-glass connectivity existed, but did not work for two of the
  three maintainers who are supposed to have it.

* Initially, recalling critical team members from vacation was an
  ad-hoc process.

## Lessons learned

* Having the capability isn't enough. Break-glass access needs to be
  regularly tested to be effective.

* For foundational infrastructure that has very infrequent updates,
  such as long-lived TLS certificates, we should ensure multiple
  people are aware of the expiry date, and make use of multiple
  calendars to ensure critical life-cycle events are not missed.

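Calendar reminders could be backed by an automated check of the certificate's expiry date. The following is a minimal Python sketch using only the standard library, offered as one possible approach and not as something Void runs. The function names and the 30-day threshold are assumptions made for illustration, and the CAS host name is not given in this postmortem, so none is hard-coded.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return whole days until the certificate's notAfter time.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT". A negative result means
    the certificate has already expired.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return int((expires - now) // 86400)


def needs_attention(not_after, warn_days=30, now=None):
    """True when fewer than `warn_days` days remain (assumed threshold)."""
    return days_until_expiry(not_after, now) < warn_days


def fetch_not_after(host, port=443, timeout=10.0):
    """Fetch the notAfter field of the certificate served at host:port.

    With the default context, an already-expired certificate makes the
    handshake raise ssl.SSLCertVerificationError, which is itself a
    signal worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Run on a schedule, `needs_attention(fetch_not_after(host))` could feed an alert to several maintainers, so that noticing the expiry does not rest on a single person's calendar.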
## Timeline

No timeline is provided for this incident.
