| 1 | +# Post Mortem 2021-06-06 |
| 2 | + |
| 3 | +## Incident summary |
| 4 | + |
| 5 | +For approximately 2 days the Nomad-managed tasks on a-fsn-de were |
| 6 | +unavailable, including repository management tasks. Resolution was |
| 7 | +and continues to be hampered by a distinct and ongoing outage. |
| 8 | + |
| 9 | +### Amended 2021-07-31 |
| 10 | + |
| 11 | +CAS service was restored on 2021-07-22. Any point below that mentions |
| 12 | +an ongoing outage was written while the incident was still in progress. |
| 13 | + |
| 14 | +## Leadup |
| 15 | + |
| 16 | +At this time we do not know what caused the reboot of a-fsn-de. |
| 17 | +We do know that a parallel and ongoing incident was occurring |
| 18 | +at the same time. Void uses a centralized authentication service (CAS) to |
| 19 | +manage access to our machines, and like many secure services this one |
| 20 | +relies on TLS certificates. The CAS certificate expired without being |
| 21 | +noticed, which prevented what would otherwise have been a quick |
| 22 | +recovery: logging in and bouncing some services. |
| 23 | + |
| 24 | +Additionally, when the CAS is unavailable, we maintain a break-glass |
| 25 | +login capability for a handful of extremely trusted maintainers (1:1 |
| 26 | +with the people that have access to the package signing key). This |
| 27 | +access was discovered to be impaired by one developer's key missing |
| 28 | +entirely, and one developer having failed to rotate their key. The |
| 29 | +third developer was on vacation but was able to log in and rectify the |
| 30 | +keying problem. |
| 31 | + |
| 32 | +## Fault |
| 33 | + |
| 34 | +The Nomad outage was caused by an unexpected restart of a-fsn-de. |
| 35 | +When Nomad hosts reboot there is a known defect in that runit may |
| 36 | +bring up the services in a race that results in Nomad not being usable |
| 37 | +until services are restarted in a specific order. |
| 38 | + |
| 39 | +The unavailability of the CAS system means that we still cannot log in |
| 40 | +to all hosts normally. |
| 41 | + |
| 42 | +The issues with the break-glass keys caused the recovery of both the |
| 43 | +specific Nomad host and the CAS server to be slow. |
| 44 | + |
| 45 | +## Impact |
| 46 | + |
| 47 | +Publicly Visible: |
| 48 | + |
| 49 | + * Repository signing unavailable |
| 50 | + * Builds not completing normally |
| 51 | + * musl and aarch64 appeared to lag behind glibc |
| 52 | + |
| 53 | +Internally Visible: |
| 54 | + |
| 55 | + * CAS logins not available |
| 56 | + * Detailed failure logs not available in Grafana (requires CAS login) |
| 57 | + * Couldn't make control requests to Nomad (requires Vault CAS integration) |
| 58 | + |
| 59 | +## Detection |
| 60 | + |
| 61 | +The Nomad failure was detected by external observation that an update |
| 62 | +to the `less` package was not signed, and was preventing installs |
| 63 | +from proceeding. |
| 64 | + |
| 65 | +The CAS failure was discovered a few days before the signing issue but |
| 66 | +was deemed non-critical as it was an inconvenience that could be fixed |
| 67 | +within a week. |
| 68 | + |
| 69 | +## Response |
| 70 | + |
| 71 | +The ability to merge changes in GitHub was restricted to prevent new |
| 72 | +builds from running that might further complicate recovery efforts. |
| 73 | + |
| 74 | +@the-maldridge and @Gottox were recalled from vacation to recover |
| 75 | +access to the system. |
| 76 | + |
| 77 | +## Recovery |
| 78 | + |
| 79 | +@Gottox used break-glass access to both restore break-glass access for |
| 80 | +@the-maldridge and @duncaen, and to restart the stuck signing process. |
| 81 | + |
| 82 | + |
| 83 | +## What went well |
| 84 | + |
| 85 | + * Excellent internal communication kept everyone in the loop as to |
| 86 | + what was broken, what was being done to fix it, and who was |
| 87 | + responsible for taking action. |
| 88 | + |
| 89 | + * Break-glass access, once used, did work effectively. |
| 90 | + |
| 91 | +## What could be done better |
| 92 | + |
| 93 | + * External communication was not great. A post went up on |
| 94 | + voidlinux.org, but no Twitter notification was made, and the post |
| 95 | + was not widely shared on our other channels. |
| 96 | + |
| 97 | + * Break-glass connectivity existed, but did not work when first needed. |
| 98 | + |
| 99 | + * Initially recalling critical team members from vacation was an |
| 100 | + ad-hoc process. |
| 101 | + |
| 102 | +## Lessons learned |
| 103 | + |
| 104 | + * Having the capability isn't enough. Break-glass access needs to be |
| 105 | + regularly tested to be effective. |
| 106 | + |
| 107 | + * For foundational infrastructure that has very infrequent updates, |
| 108 | + such as long-lived TLS certificates, we should ensure multiple |
| 109 | + people are aware of the expiry date, and make use of multiple |
| 110 | + calendars to ensure critical life-cycle events are not missed. |
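
As a complement to calendar reminders, an automated expiry check could catch a lapsing certificate before it causes an outage. The following is a minimal sketch, not a description of Void's actual tooling: the function names, the 30-day warning threshold, and the hostname in the usage note are all assumptions made for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and a certificate's notAfter.

    `not_after` is in the format returned by SSLSocket.getpeercert(),
    for example "Jan  1 00:00:00 2030 GMT". `now` should be a
    timezone-aware datetime so that timestamp() is unambiguous.
    """
    expiry = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    return (expiry - now.timestamp()) / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires within `warn_days` (hypothetical threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the notAfter field of the certificate served at host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `needs_renewal(fetch_not_after("cas.example.org"), datetime.now(timezone.utc))`, where the hostname is a placeholder, and alert several maintainers when it returns true. Because the default context verifies the certificate, an already expired certificate makes the handshake raise `ssl.SSLCertVerificationError`; a caller should treat that as an alert too.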
| 111 | + |
| 112 | +## Timeline |
| 113 | + |
| 114 | +No timeline is provided for this incident. |