Commit 4454ba5

_posts: Add day 5 of infra week
1 parent 39e3ea8 commit 4454ba5

1 file changed

Lines changed: 254 additions & 0 deletions

@@ -0,0 +1,254 @@
---
title: "Infrastructure Week - Day 5: Making Distributed Infrastructure Work for Distributed Teams"
layout: post
---

Void runs a distributed team of maintainers and contributors. Making
infrastructure work for any team is a confluence of goals, user
experience choices, and hard requirements. Making infrastructure work
for a distributed team adds the complexity of accessing everything
securely over the open internet, and doing so in a way that is still
convenient and easy to set up. After all, a light switch that is
difficult to use is likely to lead to lights being left on.

We keep several design criteria in mind when designing new systems
and services that make Void work. We also periodically re-evaluate
systems that have been built to ensure that they still follow good
design practices in a way that we are able to maintain, and that they
do what we want. Let's dive into some of these design practices.

## No Maintainer Facing VPNs

VPNs, or Virtual Private Networks, are ways of interconnecting systems
such that the network in between appears to vanish beneath a layer of
abstraction. WireGuard, OpenVPN, and IPsec are examples of VPNs. With
OpenVPN and IPsec, a client program handles encryption and decryption
of traffic on a tunnel or tap device that translates packets into and
out of the kernel network stack. If you work in a field that involves
using a computer for your job, your employer may make use of a VPN to
grant your device connectivity to their corporate network environment
without you having to be physically present in a building. VPN
technologies can also be used to make multiple physical sites appear
to all be on the same network.

Void uses WireGuard to provide machine-to-machine connectivity for our
fleet, but only within our fleet. Maintainers always access services
without a VPN. Why do we do this, and how do we do it? First the
why. We operate in this way because corporate VPNs are often
cumbersome, require split-horizon DNS (where you get different DNS
answers depending on where you resolve from), and require careful
planning to make sure no subnet overlap occurs between the VPN, the
network you are connecting to, and your local network. If there were
an overlap, the kernel would be unable to determine where to send the
packets, since it would have multiple routes for the same subnets.
There are cases where this is a valid network topology (ECMP), but
that is not what is being discussed here. We also have no reason to
use a VPN. Most of the use cases that still require a VPN have to do
with transporting arbitrary TCP streams across a network, but for Void
this is unnecessary: all our services are either HTTP based or are
transported over SSH.
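
The overlap problem described above can be checked ahead of time. The
following is a minimal sketch using Python's standard `ipaddress`
module; the network names and address ranges are hypothetical and are
not Void's actual address plan.

```python
import ipaddress


def find_overlaps(networks):
    """Return (name, name) pairs whose address ranges overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    names = sorted(parsed)
    overlaps = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if parsed[first].overlaps(parsed[second]):
                overlaps.append((first, second))
    return overlaps


# Hypothetical address plan: a home LAN, a VPN tunnel, and a remote office.
networks = {
    "home_lan": "192.168.1.0/24",
    "vpn_tunnel": "10.8.0.0/24",
    "office_lan": "192.168.0.0/16",
}
print(find_overlaps(networks))
```

Here the home LAN sits inside the office range, which is exactly the
situation where routes for the two networks would conflict.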

For almost all our systems that we interact with daily, either a web
interface or HTTP-based API is provided. For the devspace file
hosting system, maintainers can use SFTP via SSH. Both HTTP and SSH
have robust, extremely well tested authentication and encryption
options. When designing a system for secure access, defense in depth
is important, but so is trust that the cryptographic primitives you
have selected actually work. We trust that HTTPS works, and so there
is no need to wrap the connection in an additional layer of
encryption. The same goes for SSH, for which we exclusively use
public-key authentication. This choice is sometimes challenging
to maintain, since it means that we need to ensure highly available
HTTP proxies and secure, easily maintained SSH key implementations,
but we have found it works well for us. In addition to the static
files that all our tier 1 mirrors serve, the mirrors are also capable
of acting as proxies. This allows us to terminate the externally
trusted TLS session at a webserver running nginx, and then pass the
traffic over our internal encrypted fabric to the destination service.

For SSH we simply make use of `AuthorizedKeysCommand` to summon keys
from NetAuth, allowing authorized maintainers to log onto servers or
SSH-enabled services wherever their keys are validated. For the
devspace service, which has a broader ACL than our base hardware, we
can enhance its separation by running an SFTP server distinct from the
host sshd. This allows us to ensure that it is impossible for a key
validated for devspace to inadvertently authorize a shell login to the
underlying host.
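
To illustrate the contract that `AuthorizedKeysCommand` relies on:
sshd runs the configured program with the login name as its argument
(the default when no arguments are configured) and treats each line
printed on standard output as an `authorized_keys` entry. The sketch
below is a hypothetical helper, not Void's actual integration; the
`KEY_DIRECTORY` dict stands in for a lookup against an identity
service such as NetAuth, and the key shown is a placeholder.

```python
import sys

# Hypothetical stand-in for a directory lookup. A real helper would query
# the identity service instead of a hard-coded dict. The key is a placeholder.
KEY_DIRECTORY = {
    "alice": ["ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPlaceholderKeyOnly alice@laptop"],
}


def authorized_keys_for(username, directory=KEY_DIRECTORY):
    """Return the authorized_keys lines for a user, or an empty list."""
    return list(directory.get(username, []))


def main(argv):
    """Print one authorized_keys line per key for the user named in argv[1]."""
    if len(argv) != 2:
        print("usage: keys-helper USERNAME", file=sys.stderr)
        return 1
    for line in authorized_keys_for(argv[1]):
        print(line)
    return 0


# Demonstration call; sshd would instead execute the script with the login name.
main(["keys-helper", "alice"])
```

In `sshd_config`, a helper like this would be referenced by
`AuthorizedKeysCommand` together with `AuthorizedKeysCommandUser`.
Printing nothing means this source offers no keys for that user.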

For all other services, we make use of service-level authentication
as and when required. We use combinations of native NetAuth, LDAP
proxies, and PAM helpers to make all access seamless for maintainers
via our single sign-on system. Removing the barrier of a VPN also
means that during an outage, there's one less component we need to
troubleshoot and debug, and one less place for systems to break.

## Use of Composable Systems

Distributed systems are often made up of complex, interdependent
sub-assemblies. This level of complexity is fine for dedicated teams
who are paid to maintain systems day in and day out, but is difficult
to pull off with an all-volunteer team that works on Void in their
free time. Distributed systems are also best understood on a
whiteboard, and this doesn't lend itself well to making a change on a
laptop from a train, or reviewing a delta from a tablet between other
tasks. While substantive changes are almost always made from a full
terminal, the ratio of substantive changes to items requiring only
quick verification is significant, and it's important to maintain a
level of understandability.

To keep the infrastructure understandable with a reasonable time
investment, we make use of composable systems. Composable systems can
best be thought of as infrastructure built out of common
sub-assemblies. Think Lego blocks for servers. This allows us to have
a common base library of components, for example webservers,
synchronization primitives, and timers, and then build these into
complex systems by joining their functionality together.

We primarily use containers to achieve this composability. Each
container performs a single task or a well defined sub-process in a
larger workflow. For example, we can look at the workflow required to
serve <https://man.voidlinux.org/>. In this workflow, a task runs
periodically to extract all man pages from all packages, then another
process runs to copy those files to the mirrors, and finally a process
runs to produce an HTTP response to a given man page request. Notice
that it's an HTTP response, but the man site is served securely
over HTTPS. This is because across all of our web-based services we
make use of common infrastructure such as load balancers and our
internal network. This allows applications to focus on their
individual functions without needing to think about the complexity of
serving an encrypted connection to the outside world.
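
As a rough illustration of the man page workflow above, the following
sketch models each stage as a small function with a single job. The
function names, the package data, and the in-memory `mirror` dict are
all hypothetical; in the real system each stage is a separate
container, not a Python function.

```python
def extract_man_pages(packages):
    """Stage 1: collect every man page from every package."""
    pages = {}
    for package in packages:
        pages.update(package["man_pages"])
    return pages


def sync_to_mirror(pages, mirror):
    """Stage 2: copy the extracted pages onto a mirror's storage."""
    mirror.update(pages)
    return mirror


def serve(mirror, path):
    """Stage 3: answer one plain-HTTP request. TLS is handled elsewhere."""
    if path in mirror:
        return 200, mirror[path]
    return 404, "not found"


# Hypothetical input data for the first stage.
packages = [
    {"name": "example-pkg", "man_pages": {"/man1/example.1": "EXAMPLE(1) ..."}},
]

mirror = sync_to_mirror(extract_man_pages(packages), {})
print(serve(mirror, "/man1/example.1"))
print(serve(mirror, "/man1/missing.1"))
```

Because each stage only depends on the output of the previous one, any
single stage can be swapped out or run on its own.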

By designing our systems this way, we also gain another neat feature:
local testing. Since applications can be broken down into smaller
building blocks, we can take just the single building block under
scrutiny and run it locally. Likewise, we can upgrade individual
components of the system to determine if they improve or worsen a
problem. With some clever configuration, we can even upgrade half of
a system that's highly available and compare the old and new
implementations side by side to see if we like one over the other.
This composability enables us to configure complex systems as
individual, understandable components.

It's worth clarifying, though, that this is not necessarily a
microservices architecture. We don't really have any services that
could be defined as microservices in the conventional sense. Instead,
this architecture should be thought of as the Unix Philosophy as
applied to infrastructure components. Each component has a single
well understood goal and that's all it does. Other goals are
accomplished by other services.

We assemble all our various composed services into the service suite
that Void provides via our orchestration system (Nomad) and our load
balancers (nginx), which allow us to present the various disparate
systems as though they were one to the outside world, while still
maintaining them as separate service "verticals" side by side
internally.

## Everything in Git

Void's packages repo is a large git repo with hundreds of contributors
and many maintainers. This package bazaar contains all manner of
different software that is updated, verified, and accepted by a team
that spans the globe. Our infrastructure is no different, but
involves fewer people. We make use of two key systems to enable our
Infrastructure as Code (IaC) approach.

The first of these tools is Ansible. Ansible is a configuration
management utility written in Python that can programmatically SSH
into machines, template files, install and remove packages, and more.
Ansible takes its instructions as collections of YAML files called
roles that are assembled into playbooks (composability!). These
roles come either from the main void-infrastructure repo or as
individual modules from the void-ansible-roles organization on GitHub.
Since this is code checked into Git, we can use ansible-lint to ensure
that the code is consistent and lint-free. We can then review the
changes as a diff, and work on various features on branches just like
changes to void-packages. The ability to review what changed is also
a powerful debugging tool that lets us see whether a configuration
delta led to or resolved a problem, and whether we've encountered a
similar kind of change in the past.

The second tool we use regularly is Terraform. Whereas Ansible
configures servers, Terraform configures services. We can apply
Terraform to almost any service that has an API, as most popular
services that Void consumes have Terraform providers. We use
Terraform to manage our policy files that are loaded into Nomad,
Consul, and Vault; we use it to provision and deprovision machines on
DigitalOcean, Google, and AWS; and we use it to update our DNS records
as services change. Just like Ansible, Terraform has a linter, a
robust module system for code re-use, and a really convenient system
for producing a diff between what the files say the service should be
doing and what it actually is doing.
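
The idea of diffing declared state against observed state can be
sketched in a few lines. This is not how Terraform is implemented; it
is a simplified illustration, and the record names and addresses
(taken from the documentation range 192.0.2.0/24) are made up.

```python
def plan(desired, actual):
    """Compare declared state with observed state and list the changes."""
    changes = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            changes.append(("create", name, desired[name]))
        elif name not in desired:
            changes.append(("delete", name, actual[name]))
        elif desired[name] != actual[name]:
            changes.append(("update", name, desired[name]))
    return changes


# Hypothetical DNS records: what the files in Git declare...
desired = {"www": "192.0.2.10", "man": "192.0.2.20"}
# ...and what the provider currently reports.
actual = {"www": "192.0.2.10", "man": "192.0.2.99", "old": "192.0.2.30"}

for change in plan(desired, actual):
    print(change)
```

Reviewing a list like this before applying it is what makes an
asynchronous review of infrastructure changes practical.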

Perhaps the most important use of Terraform for us is the formalized
onboarding and offboarding process for maintainers. When a new
maintainer is proposed and has been accepted through discussion within
the Void team, we'll privately reach out to them to ask if they want
to join the project. Once a candidate accepts the offer to join
the group of pkg-committers, the action that formally brings them on
to the team is a patch applied to the Terraform that manages our
GitHub organization and its members. We can then log approvals,
welcome the new contributor to our team with suitable emoji, and grant
access all in one convenient place.

Infrastructure as Code allows our distributed team to easily maintain
our complex systems with a written record that we can refer back to.
The ability to defer changes to an asynchronous review is imperative
to manage the workflows of a distributed team.

## Good Lines of Communication

Of course, all the infrastructure in the world doesn't help if the
people using it can't effectively communicate. To make sure this
issue doesn't occur for Void, we have multiple forms of communication
with different features. For real-time discussions and even some
slower ones, we make use of IRC on Libera.chat. Though many
communities appear to be moving away from synchronous text, we find
that it works well for us. IRC is a great protocol that allows each
member of the team to connect using the interface that they believe is
the best for them, and allows our automated systems to connect too.

For conversations that need more time or are generally going to be
longer, we make use of email or a group-scoped discussion on GitHub.
This allows for threaded messaging and a topic that can persist for
days or weeks if needed. Maintaining a long running thread can help
us tease apart complicated issues or ensure everyone's voice is heard.
Long-time users of Void may remember our forum, which has since been
supplanted by a subreddit and most recently GitHub Discussions. These
threaded message boards are also examples of places that we converse
and exchange status information, but in a more social context.

For discussion that needs to pertain directly to our infrastructure,
we open tickets against the infrastructure repo. This provides an
extremely clear place to report issues, discuss fixes, and collate
information relating to ongoing work. It also allows us to leverage
GitHub's commit message parsing to automatically resolve a discussion
thread once a fix has been applied by closing the issue. For really
large changes, we can also use GitHub projects, though in recent years
we have not made use of this particular organization system for
issues (we use tags).
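
As an illustration of the commit message parsing mentioned above, the
sketch below looks for GitHub's documented closing keywords followed
by an issue number. The pattern is deliberately simplified compared to
what GitHub accepts, and the commit message and issue numbers are
hypothetical.

```python
import re

# Closing keywords documented by GitHub for linking commits to issues.
CLOSING_KEYWORDS = (
    "close", "closes", "closed",
    "fix", "fixes", "fixed",
    "resolve", "resolves", "resolved",
)

# Simplified pattern: keyword, whitespace, then "#<number>".
PATTERN = re.compile(
    r"\b(?:" + "|".join(CLOSING_KEYWORDS) + r")\s+#(\d+)\b",
    re.IGNORECASE,
)


def issues_closed_by(commit_message):
    """Return the issue numbers a commit message asks to close."""
    return [int(number) for number in PATTERN.findall(commit_message)]


# Hypothetical commit message.
message = "nginx: raise proxy timeout\n\nFixes #42 and resolves #57."
print(issues_closed_by(message))
```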

No matter where we converse, though, it's always important to make
sure we converse clearly and concisely. Void's team speaks a variety
of languages, though we mostly converse in English, which is not known
for its intuitive clarity. When making hazardous changes, we often
push changes to a central location and ask for explicit review of
dangerous parts, and call out clearly what the concerns are and what
requires review. In this way we ensure that all of Void's various
services stay up, and our team members stay informed.

---

This post was authored by `maldridge` who runs most of the day to day
operations of the Void fleet. On behalf of the entire Void team, I
hope you have enjoyed this week's dive into the infrastructure that
makes Void happen, and have learned some new things. We're always
working to improve systems and make them easier to maintain or provide
more useful features, so if you want to contribute, join us in IRC.
Feel free to ask questions about this post or any of our others this
week on [GitHub
Discussions](https://github.com/void-linux/void-packages/discussions/45165)
or in [IRC](https://web.libera.chat/?nick=Guest?#voidlinux).
