---
title: "Infrastructure Week - Day 4: Downtime Both Planned and Not"
layout: post
---

# Downtime Both Planned and Not

Yesterday we looked at how Void monitors the various systems that
provide our services, and how we can be alerted when issues occur. An
alert usually means that whatever has gone wrong needs to be handled
by a human, but not always: sometimes an alert trips because systems
are down for planned maintenance. During these windows, we
intentionally take down services in order to repair, replace, or
upgrade components so that we don't have unexpected breakage later.

## Planned Downtime

When possible, we prefer for services to go down during a planned
maintenance window. This allows services to come down cleanly and
gives the people involved time to plan for the work required to make
changes to the system. We take planned downtime when it's not
possible, or would be unsafe, to make a change to a system while it's
up. Examples of planned downtime include kernel upgrades, major
version changes of container runtimes, and major package upgrades.

When we plan for an interruption, the relevant people agree on a
date, usually at least a week in the future, and talk through what
the impacts will be. Based on these conversations the team then
decides whether to post a blog post or a notification to social media
that an interruption is coming. Most of the changes we make don't
warrant this, but some changes interrupt services either in an
unintuitive way or for an extended period of time. Rebooting a mirror
server usually doesn't warrant a notification, but suspending the
sync to one for a few days would.

## Unplanned Downtime

Unplanned downtime is usually much more exciting because it is by
definition unexpected. These events happen when something breaks. By
and large the most common way that things break for Void is running
out of space on disk. This happens because while disk drives are
cheap, getting a drive that can survive years powered on under heavy
read/write load is still not a straightforward ask, especially if you
also want high throughput with low latency. The build servers need
large volumes of scratch space while building certain packages, due
to the need to maintain large caches or lots of object files prior to
linking. These large, elastic use cases mean that we can have
hundreds of gigabytes of free space and then run out over the course
of a single build.

When this happens, we have to log on to a box, look at where we can
reclaim some space, and possibly dispatch builds back through the
system one architecture at a time so that their space requirements
stay low enough to complete. We also have to make sure that when we
clean space, we're not cleaning files that will be immediately
redownloaded; one of the easiest places to claim space back from,
after all, is the cache of downloaded files. The primary complication
in this workflow can be getting a build to restart. Sometimes builds
get submitted in specific orders, and when a crash occurs in the
middle we may need to re-queue the builds to ensure dependencies get
built in the right order.
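
As a rough illustration of that first survey pass, something like the
short script below can rank a download cache by size and staleness
before anything gets deleted. The cache path, the age threshold, and
the reliance on last-access times are assumptions made for the
example, not details of our actual build hosts.

```python
#!/usr/bin/env python3
"""Rank cache files by reclaimable size and staleness (sketch only)."""
import time
from pathlib import Path

CACHE_DIR = Path("/var/cache/builder/sources")  # hypothetical location
MIN_AGE_DAYS = 14  # skip files likely to be re-downloaded soon

def candidates(root: Path, min_age_days: int):
    """Yield (size, atime, path) for files not read recently."""
    cutoff = time.time() - min_age_days * 86400
    for path in root.rglob("*"):
        if path.is_file() and not path.is_symlink():
            st = path.stat()
            if st.st_atime < cutoff:
                yield st.st_size, st.st_atime, path

if __name__ == "__main__":
    total = 0
    for size, atime, path in sorted(candidates(CACHE_DIR, MIN_AGE_DAYS),
                                    reverse=True):
        total += size
        day = time.strftime("%Y-%m-%d", time.localtime(atime))
        print(f"{size / 2**20:8.1f} MiB  last read {day}  {path}")
    print(f"reclaimable (estimate): {total / 2**30:.1f} GiB")
```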

Sometimes downtime occurs due to network partitions. Void runs in
many datacenters around the globe, and incidents ranging from street
repaving to literal ship anchors can disrupt the fiber optic cables
connecting our network sites. When this happens, we often end up in a
state where people can see both sides of the split, but our machines
can't see each other anymore. Sometimes we're able to fix this by
manually reloading routes or cycling tunnels between machines, but
often it's easier for us to just drain services from an affected
location and wait out the issue using our remaining capacity
elsewhere.

## Lessening the Effects of Downtime

As alluded to with network partitions, we take a lot of steps to
mitigate downtime and the effects of unplanned incidents. A large
part of this effort goes into making as much content as possible
static so that it can be served from minimal infrastructure, usually
nothing more than an nginx instance. This is how the docs,
infrastructure docs, main website, and a number of services like
xlocate work. A batch task runs to refresh the information, the
result gets copied to multiple servers, and as long as at least one
of those servers remains up, the service remains up.
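
The example below sketches the copy half of that job: push an
already-built tree to several frontends and treat only a total
failure as fatal. The host names, paths, and use of rsync are
stand-ins for the example, not a description of our actual tooling.

```python
#!/usr/bin/env python3
"""Push a freshly built static tree to several web frontends (sketch)."""
import subprocess
import sys

BUILD_DIR = "/srv/build/site/"   # output of the refresh step (hypothetical)
DEST = "/srv/www/site/"
TARGETS = ["web1.example.org", "web2.example.org", "web3.example.org"]

def push(host: str) -> bool:
    """Copy the tree to one host; a failure here must not stop the rest."""
    cmd = ["rsync", "-a", "--delete", BUILD_DIR, f"{host}:{DEST}"]
    return subprocess.run(cmd).returncode == 0

if __name__ == "__main__":
    ok = [h for h in TARGETS if push(h)]
    print(f"updated {len(ok)}/{len(TARGETS)} frontends: {', '.join(ok)}")
    # The service stays up as long as at least one frontend has a copy,
    # so only a total failure is treated as fatal.
    sys.exit(0 if ok else 1)
```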

Mirrors, of course, are highly available by being byte-for-byte
copies of each other. Since the mirrors are static files, they're
easy to make available redundantly. We also configure all mirrors to
be able to serve under any name, so during an extended outage the DNS
entry for a given name can be changed and the traffic serviced by
another mirror. This allows us to present the illusion that the
mirrors don't go down when we perform longer maintenance, at the cost
of some complexity in the DNS layer. The mirrors don't just host
static content though. We also serve the <https://man.voidlinux.org>
site from the mirrors, which requires a CGI executable and a
collection of static man pages to be available. The nginx frontends
on each mirror are configured to first seek out their local services,
but if those are unavailable they will reach across Void's private
network to find an instance of the service that is up.
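
That "any mirror under any name" property is easy to spot-check from
the outside by asking one mirror to answer for another's name. A
minimal sketch, with made-up host names and plain HTTP for brevity:

```python
import http.client

def check(mirror: str, served_name: str, path: str = "/") -> int:
    """Ask one physical mirror to answer for a (possibly different) name."""
    conn = http.client.HTTPConnection(mirror, timeout=10)
    try:
        conn.request("GET", path, headers={"Host": served_name})
        return conn.getresponse().status
    finally:
        conn.close()

# e.g. confirm a hypothetical second mirror answers for the first one's name
print(check("mirror-b.example.org", "mirror-a.example.org"))
```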

This private network is a mesh of wireguard tunnels that spans all
our different machines and different providers. You can think of it
like a multi-cloud VPC, which lets us ignore a lot of the complexity
that would otherwise manifest when operating in a multi-cloud design
pattern. The private network also allows us to use distributed
service instances while still fronting them through relatively few
points. This improves security because very few people and places
need access to the certificates for voidlinux.org, as opposed to the
certificates having to be present on every machine.

For services that are containerized, we have an additional set of
tricks that can lessen the effects of a downed server. As long as the
task in question doesn't require access to specific disks or data
that are not available elsewhere, Nomad can reschedule the task to
some other machine and update its entry in our internal service
catalog so that other services know where to find it. This allows us
to move things like our IRC bots and some parts of our mirror control
infrastructure around when hosts are unavailable, rather than those
services being unavailable for the duration of a host-level outage.
If we know that the downtime is coming in advance, we can instruct
Nomad to smoothly remove services from the specific machine in
question and relocate them somewhere else. When the relocation is
handled as a specific event rather than as the result of a machine
going away, the service interruption is measured in seconds.
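
For the planned case, that graceful removal is a node drain. A small
wrapper around Nomad's `node drain` command shows the shape of it;
the node ID and deadline below are placeholders, and the exact flags
you want may vary by Nomad version.

```python
#!/usr/bin/env python3
"""Drain a host ahead of planned maintenance (sketch only).

Assumes the `nomad` CLI is on PATH and the operator supplies a node ID.
"""
import subprocess
import sys

def drain(node_id: str, deadline: str = "30m") -> None:
    # Mark the node as draining: Nomad migrates its allocations elsewhere
    # and stops placing new work on it until the drain is lifted.
    subprocess.run(
        ["nomad", "node", "drain", "-enable",
         "-deadline", deadline, "-yes", node_id],
        check=True,
    )

if __name__ == "__main__":
    drain(sys.argv[1])
```

Once the drain completes, nothing user-facing is still scheduled on
the host, and maintenance can proceed at whatever pace it needs.
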
## Design Choices and Tradeoffs

Of course there is no free lunch, and these choices come with
tradeoffs. Some of the design choices we've made have to do with the
difference in effort required to test a service locally versus
debugging it remotely. Containers help a lot with this, since it's
possible to run the exact same image with the exact same code in it
as what is running in the production instance. This also lets us
insulate Void's infrastructure from breakage caused by a bad update,
since each service is encapsulated and can be updated independently.
We review each service's behavior as it is updated, which gives us a
clean migration path from one version to another. If we do discover a
problem, the infrastructure is checked into git and the old versions
of the containers are retained, so we can easily roll back.

We leverage the containers to make the workflows easier to debug in
the general case, but of course the complexity doesn't go away. It's
important to understand that container orchestrators don't remove
complexity; quite to the contrary, they increase it. What they do is
shift and concentrate the complexity from one group of people
(application developers) to another (infrastructure teams). This
shift allows fewer people to have to care about the specifics of
running applications or deploying servers, since developers truly can
say "well, it works on my machine" and be reasonably confident that
the same container will work when deployed on the fleet.

The last major tradeoff that we weigh when deciding where to run
something is how hard it will be to move later if we decide we're
unhappy with the provider. At the time of writing, Void is in the
process of migrating our email server from one host to another due to
IP reputation issues at our previous hosting provider. To make the
migration easier, we originally deployed the mail server as a
container via Nomad, which means that standing up the new mail server
is as easy as moving the DNS entries and telling Nomad that the old
mail server should be drained of workload.

Our infrastructure only works as well as the software running on it,
but we do spend a lot of time making sure that the experience of
developing and deploying that software is as easy as possible.

---

This has been day four of Void's infrastructure week. Tomorrow we'll
wrap up the series with a look at how we make distributed
infrastructure work for our distributed team. This post was authored
by `maldridge`, who runs most of the day-to-day operations of the
Void fleet. Feel free to ask questions on [GitHub
Discussions](https://github.com/void-linux/void-packages/discussions/45140)
or in [IRC](https://web.libera.chat/?nick=Guest?#voidlinux).
