Commit 35ee4ec

docs: add Post Mortem section, document incident on 2021-0606 (#118)
1 parent 58734b8 commit 35ee4ec

2 files changed

Lines changed: 86 additions & 0 deletions

docs/src/postmortem/2021-06-06.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
# Post Mortem 2021-06-06

## Incident summary

A hardware defect on a-hel-fi caused service degradation in various systems.

## Leadup

The server a-hel-fi had shown strange behavior for about a week before the
incident.

## Fault

The datacenter reported faulty hardware.

## Impact

* build.voidlinux.org was down
* docs.voidlinux.org was down
* alpha.de.repo.voidlinux.org was down
* man.voidlinux.org was down
* package search on voidlinux.org was unavailable

## Detection

The issue was reported on IRC 4 minutes after monitoring and automation
detected the fault. No automatic alerts were raised.

## Response

* hardware reset from Hetzner's Robot web interface
* hardware reset to rescue system from Hetzner's Robot web interface
* ticket was opened

## Recovery

The datacenter moved the HDDs to new hardware.

## What went well

Communication with the datacenter was good. It took only an hour from the
initial report to the fix, and most of the delay was on our side.

The handling of the incident was good and the response time was fast. We also
shared the state of the incident via Twitter and Reddit, which helped users
understand the downtime.

## What could be done better

It was just luck that Gottox was available. He was the only one who was able
to interact with the web interface.

## Lessons learned

* putting too many services on one host isn't the best idea
* access to Hetzner's web interface should be available to more people

## Timeline

Timestamps are in GMT+00.

* 2021-06-06 09:30: The machine stopped replying to heartbeats.
* 2021-06-06 09:54: The issue was reported on IRC by maldridge.
* 2021-06-06 09:58: A hardware reset was issued by Gottox from the Robot
  web interface.
* 2021-06-06 10:13: A hardware reset to the rescue system was issued by Gottox
  from the Robot web interface.
* 2021-06-06 10:23: maldridge was given access to the Robot web interface for
  that specific server.
* 2021-06-06 10:26: A ticket was opened with the datacenter.
* 2021-06-06 10:31: A remote power button press was initiated by maldridge, as
  the web interface reported 'power off'.
* 2021-06-06 10:58: A hardware reset was initiated by Hetzner support.
* 2021-06-06 11:14: Gottox reported back to support that the server was still
  not reachable.
* 2021-06-06 11:24: The server started responding to pings again.
* 2021-06-06 11:25: Hetzner support reported back that the server had been
  hanging in POST and that they had replaced the hardware.
* 2021-06-06 11:25: The following services were restarted by Gottox, as
  reported by maldridge: wireguard, unbound, consul, nomad.
* 2021-06-06 12:08: nginx was restarted by Gottox, as Firefox reported
  certificate issues.
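
The intervals between timeline entries can be checked with a few lines of
Python. This is only a minimal sketch and not part of any incident tooling: the
event labels and the `minutes_between` helper are names made up for
illustration, and the timestamps are copied from the timeline entries.

```python
from datetime import datetime

# Timestamps copied from the timeline (all on 2021-06-06, GMT+00).
# The dictionary keys are illustrative labels, not names used elsewhere.
EVENTS = {
    "heartbeat_lost": "09:30",
    "reported_on_irc": "09:54",
    "ticket_opened": "10:26",
    "server_pinging": "11:24",
    "nginx_restarted": "12:08",
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    pairs = [
        ("heartbeat_lost", "reported_on_irc"),
        ("ticket_opened", "server_pinging"),
        ("heartbeat_lost", "server_pinging"),
        ("heartbeat_lost", "nginx_restarted"),
    ]
    for first, second in pairs:
        elapsed = minutes_between(EVENTS[first], EVENTS[second])
        print(f"{first} -> {second}: {elapsed} min")
```

With these timestamps, the ticket-to-ping interval comes out at 58 minutes,
which matches the "only an hour" noted under "What went well". The interval
from the lost heartbeat to the nginx restart is 158 minutes.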

docs/src/postmortem/index.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
# Post Mortem

In this section we collect Post Mortem Documentation to incrementally harden and
improve our infrastructure.