|
| 1 | +--- |
| 2 | +title: Infrastructure Week - Day 2: What Do We Do With Our Infrastructure? |
| 3 | +layout: post |
| 4 | +--- |
| 5 | + |
| 6 | +## What Does Void Do With the Infrastructure? |
| 7 | + |
| 8 | +Yesterday we looked at what kinds of infrastructure Void has, how it's |
| 9 | +managed, and what makes each kind unique and differently suited. |
| 10 | +Today we'll look at what runs on the infrastructure, and what it does. |
| 11 | +We'll then finally look at how we make sure it keeps running in the |
| 12 | +event of an error or disruption. |
| 13 | + |
| 14 | +## Two Kinds of Services |
| 15 | + |
| 16 | +Void runs, broadly speaking, two different categories of services. In |
| 17 | +the first category, we have the tooling that supports maintainers and |
| 18 | +makes it easier or in some cases possible to work on Void. These are |
| 19 | +services that most users are unaware of, and in general don't interact |
| 20 | +with. In the second category of services are systems that general end |
| 21 | +users of Void interact with and are more likely to know about or |
| 22 | +recognize. |
| 23 | + |
| 24 | +## Public Services |
| 25 | + |
| 26 | +We'll first talk about public services that are broadly available to |
| 27 | +both maintainers and general consumers of Void Linux. Almost all of |
| 28 | +these, though not every one, are web-based services accessed via a |
| 29 | +browser. See how many of these services you recognize. |
| 30 | + |
| 31 | +### Void's Website |
| 32 | + |
| 33 | +Void's website (the one you are reading right now) is a GitHub Pages |
| 34 | +Jekyll site. This content is checked into `git`, rendered by a worker |
| 35 | +process in the GitHub network, and then published to a CDN where you |
| 36 | +can read it. Additionally, the Jekyll software produces feeds suitable |
| 37 | +for consumption in an RSS reader. The website is probably our |
| 38 | +simplest service and the easiest to copy on your own since it requires |
| 39 | +no special infrastructure, just a GitHub account to set up. |
| 40 | + |
| 41 | +### Void Mirrors |
| 42 | + |
| 43 | +Void's mirrors are simple nginx webservers that host static copies of |
| 44 | +all our software. The mirrors also serve some other sites whose |
| 45 | +content we host ourselves, such as the [docs |
| 46 | +site](https://docs.voidlinux.org) and the dedicated [infrastructure |
| 47 | +docs site](https://infradocs.voidlinux.org). We host these from our |
| 48 | +own system since they both use mdbook, which is not as straightforward |
| 49 | +to use with a hosting service like GitHub Pages. We also run these |
| 50 | +sites this way so that they are broadly copied in the event of a |
| 51 | +failure in any of our systems. Did you know you can go to `/docs` on |
| 52 | +any mirror to read the Void handbook? |
| 53 | + |
| 54 | +### Popcorn |
| 55 | + |
| 56 | +Popcorn is a package statistics service that provides information |
| 57 | +about the popularity of packages as provided by systems that have |
| 58 | +opted in to have their package information reported. Though we are |
| 59 | +evaluating ways to replace the data provided by Popcorn, it still |
| 60 | +provides good real-world data on package installs. You can learn more |
| 61 | +about Popcorn [in the |
| 62 | +handbook](https://docs.voidlinux.org/contributing/index.html#usage-statistics). |
| 63 | + |
| 64 | +### Sources Site |
| 65 | + |
| 66 | +The Sources Site (<https://sources.voidlinux.org>) provides a copy of |
| 67 | +all the sources as our build servers consumed them. This gives us a |
| 68 | +quick and easy way to make sure that we have the same source on hand |
| 69 | +when troubleshooting a bad build, in cases where finding the fault |
| 70 | +requires more than just the build error logs. |
| 71 | + |
| 72 | +### xq-api |
| 73 | + |
| 74 | +Some functionality on our website requires the ability to query the |
| 75 | +Void repository data. This is accomplished by fronting the repository |
| 76 | +data with a service called `xq-api`, which provides query functionality |
| 77 | +on top of the repodata files. The data is refreshed frequently, so |
| 78 | +new packages quickly show up in the website search results and |
| 79 | +packages that are no longer available in our repos are removed |
| 80 | +promptly. |
| 81 | + |
| 82 | +### Old Wiki Snapshot |
| 83 | + |
| 84 | +At one time prior to the introduction of our docs site, Void |
| 85 | +maintained a MediaWiki instance. While MediaWiki is extremely |
| 86 | +powerful software and is a great choice for hosting a wiki, Void found |
| 87 | +that our wiki was being slowly filled with hyper-specific guides, lots |
| 88 | +of abandoned pages, and lower quality versions of pages that exist on |
| 89 | +the [Arch Linux Wiki](https://wiki.archlinux.org/). While we ported |
| 90 | +over a large number of pages to the docs that remained generally |
| 91 | +applicable, we also felt it was important to archive the entire wiki |
| 92 | +as it appeared before releasing the resources powering it. This was |
| 93 | +accomplished using a wiki crawler which could convert the wiki itself |
| 94 | +into an archive format that we now serve with kiwix server. You can |
| 95 | +find that old content at <https://wiki.voidlinux.org> should it |
| 96 | +interest you. |
| 97 | + |
| 98 | +### Online Man Pages |
| 99 | + |
| 100 | +Void makes available a copy of all the contents of our man page |
| 101 | +database online so users can easily search for commands even when not |
| 102 | +on a Void enabled system, such as during install time when internet |
| 103 | +access may not be available yet from a Void device. This service |
| 104 | +involves a task which routinely extracts the man pages from all |
| 105 | +packages using a program that is specific to XBPS, and then the files |
| 106 | +are arranged on disk to be served by the `mdocml` man page server, |
| 107 | +which is a program we obtain from OpenBSD. You can browse our online |
| 108 | +manuals at <https://man.voidlinux.org>. |
| 109 | + |
| 110 | +## Services That Help Maintainers |
| 111 | + |
| 112 | +Not all services are meant for public consumption. A number of Void's |
| 113 | +services are meant to help maintainers be more productive, produce |
| 114 | +build artifacts, or generally make our workflows easier to accomplish. |
| 115 | + |
| 116 | +### Build Pipeline |
| 117 | + |
| 118 | +The build pipeline was discussed in detail in [another |
| 119 | +post](/news/2023/02/1-new-repo-fastly.html), but we'll recap that post |
| 120 | +here. In general, we have a handful of powerful servers on which |
| 121 | +automated build tasks run `xbps-src` whenever the contents of |
| 122 | +`void-packages` are updated. Once the packages are built, they are |
| 123 | +collected to a central point, signed cryptographically to attest that |
| 124 | +they are in fact packages produced by Void, and then they are copied |
| 125 | +out to mirrors around the world for users to download. |
| 126 | + |
| 127 | +The build pipeline is the single largest collection of moving parts |
| 128 | +within our infrastructure, and is usually the component that breaks |
| 129 | +the most often as it has many exciting failure modes. Some of the |
| 130 | +author's favorites include running out of disk, stuck connection poll |
| 131 | +loops, and rsync just wandering off instead of synchronizing packages. |
| 132 | + |
| 133 | +### Email |
| 134 | + |
| 135 | +Void maintainers have access to email on the voidlinux.org domain. To |
| 136 | +provide this service, Void runs an email server. We make use of |
| 137 | +[maddy](https://maddy.email), which provides a convenient all-in-one |
| 138 | +mail server. It works well at our scale, and doesn't require a |
| 139 | +significant amount of maintainer time to make work. Though most of us |
| 140 | +access the mail using a combination of desktop and CLI clients, we |
| 141 | +also run a copy of the [Alps](https://git.sr.ht/~migadu/alps) web |
| 142 | +frontend which allows quick and easy access to mail when away from |
| 143 | +normal console services. |
| 144 | + |
| 145 | +### DevSpace |
| 146 | + |
| 147 | +Sometimes when preparing a fix or updating a package, a maintainer |
| 148 | +will want to share this newly built artifact with others to gather |
| 149 | +feedback or see if the fix works. To enable this quickly and easily, |
| 150 | +we have a dedicated webserver and SFTP share box for these files. You |
| 151 | +can see things we're currently working on or haven't yet cleaned up at |
| 152 | +<https://devspace.voidlinux.org/> where the files are organized by |
| 153 | +maintainer. |
| 154 | + |
| 155 | +Sometimes end users will be asked to fetch a build from devspace when |
| 156 | +filing an issue ticket to verify that a particular fix works, or that |
| 157 | +a given problem continues to exist when rebuilding a package or disk |
| 158 | +image from clean sources. |
| 159 | + |
| 160 | +### void-robot and void-fleet |
| 161 | + |
| 162 | +Void's team communicates primarily via IRC. In order to allow our |
| 163 | +infrastructure to communicate with us, we have a pair of IRC bots that |
| 164 | +inform us of status changes. The more chatty of the bots, |
| 165 | +`void-robot`, tells us when PRs change status or when references change |
| 166 | +on Void's many git repos. This allows us to know when changes are |
| 167 | +going out, and it's not uncommon for a maintainer to just ping someone |
| 168 | +else with a single `^` to gesture at a push or reference the bot has |
| 169 | +printed to the channel. |
| 170 | + |
| 171 | +The second bot, `void-fleet`, speaks on behalf of our monitoring |
| 172 | +infrastructure and notifies us when things break or when they're |
| 173 | +resolved. We'll take a deeper look at monitoring in a future post and |
| 174 | +look more at what this bot does then. |
| 175 | + |
| 176 | +### Nomad, Consul & Vault |
| 177 | + |
| 178 | +Many of Void's more modern services run on top of containers managed |
| 179 | +by HashiCorp Nomad. These services retrieve secrets from HashiCorp |
| 180 | +Vault, and can locate each other using HashiCorp Consul. The use of |
| 181 | +these tools allows us to largely abstract out what provider any given |
| 182 | +software is running on and where it resides in the world. This also |
| 183 | +makes it much easier when we need to replace a host or take one down |
| 184 | +for maintenance without interrupting access to user-facing services. |
| 185 | + |
| 186 | +The use of well understood tools like the Hashistack also makes it |
| 187 | +much easier for us to subdivide systems and check components locally. |
| 188 | + |
| 189 | +### NetAuth |
| 190 | + |
| 191 | +With all these services, it would be inconvenient for maintainers to |
| 192 | +need to maintain separate usernames and passwords for everything. In |
| 193 | +order to avoid this, we use Single Sign-On concepts where all services |
| 194 | +that support it reach out to NetAuth, a centralized secure |
| 195 | +authentication service. You can read more about NetAuth at <https://netauth.org>. |
| 196 | + |
| 197 | + |
| 198 | +## How Does All This Get Run? |
| 199 | + |
| 200 | +For some of Void's older services, notably the build farm itself, our |
| 201 | +services are configured, provisioned, and maintained using Ansible |
| 202 | +just like the underlying OS configuration. This works well, but has |
| 203 | +some drawbacks in being difficult to test, difficult to change in an |
| 204 | +idempotent way, and difficult to explain to others since it's firmly |
| 205 | +in the realm of infrastructure engineering. Trying to explain to someone |
| 206 | +how a hundred lines of YAML get converted into a working webserver |
| 207 | +requires detours through a number of other assorted technologies. |
| 208 | + |
| 209 | +Void's newer services run uniformly as containers and are managed by |
| 210 | +Nomad. This enables us to dynamically move workloads around, have |
| 211 | +machines self-heal and update in coordination with the fleet, and to |
| 212 | +provide a lens into our infrastructure for people to see. You can |
| 213 | +explore all our running containers in a limited read-only context by |
| 214 | +looking at the [nomad dashboard](https://nomad.voidlinux.org). Before |
| 215 | +you go trying to open a security notice though, we're aware that |
| 216 | +buttons that shouldn't be visible look like they're clickable. Rest |
| 217 | +assured that the anonymous policy that provides the view access can't |
| 218 | +actually stop jobs or drain nodes (we've reported this UI bug a few |
| 219 | +times already). |
| 220 | + |
| 221 | +What Nomad does under the hood is actually really clever. It assesses |
| 222 | +what we want to run, and what resources we have available to run it. |
| 223 | +It then applies any constraints we've set on the services themselves. |
| 224 | +These constraints encode information like requiring locality to a |
| 225 | +particular disk in the fleet, or requiring that two copies of a |
| 226 | +service reside on different hosts. This then gets converted into a |
| 227 | +plan of what services will run where, and the workload of applications |
| 228 | +is distributed to all machines in the fleet. If a server fails to |
| 229 | +check in periodically, the workload on it is considered "lost" and can |
| 230 | +be restarted elsewhere if allowed. When we need to move between |
| 231 | +providers or update hardware, Nomad gives us a quick and easy way to |
| 232 | +work out how much of a machine we're actually consuming, and it also |
| 233 | +performs the actual movement of the services from one location to |
| 234 | +another. |
| 235 | + |
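To make the scheduling description above more concrete, here is a toy placement planner. It is only a sketch under stated assumptions: `Node`, `Task`, `plan`, and the example fleet are hypothetical names invented for illustration, the greedy first-fit strategy is a stand-in rather than Nomad's actual algorithm, and none of this is Nomad's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    cpu_free: int                               # free CPU in illustrative units
    volumes: set = field(default_factory=set)   # disks that live on this host


@dataclass
class Task:
    name: str
    cpu: int
    needs_volume: Optional[str] = None     # locality constraint: must sit next to this disk
    distinct_group: Optional[str] = None   # tasks sharing a group must be on different hosts


def plan(tasks, nodes):
    """Greedy first-fit placement. Returns {task name: node name}.

    Tasks that fit nowhere are left out of the result.
    """
    free = {n.name: n.cpu_free for n in nodes}
    groups_on = {n.name: set() for n in nodes}
    placement = {}
    for task in tasks:
        for node in nodes:
            if free[node.name] < task.cpu:
                continue  # not enough resources left on this host
            if task.needs_volume and task.needs_volume not in node.volumes:
                continue  # required disk is not on this host
            if task.distinct_group and task.distinct_group in groups_on[node.name]:
                continue  # another copy of this service already runs here
            free[node.name] -= task.cpu
            if task.distinct_group:
                groups_on[node.name].add(task.distinct_group)
            placement[task.name] = node.name
            break
    return placement


# Hypothetical two-host fleet and three tasks.
fleet = [Node("a", 1000, {"mirror-disk"}), Node("b", 1000)]
jobs = [
    Task("mirror", 500, needs_volume="mirror-disk"),
    Task("web-1", 300, distinct_group="web"),
    Task("web-2", 300, distinct_group="web"),
]

healthy = plan(jobs, fleet)
# Pretend host "a" stopped checking in and plan again on the survivors.
after_loss = plan(jobs, [n for n in fleet if n.name != "a"])
print(healthy)
print(after_loss)
```

In this toy run the two `web` copies end up on different hosts, and once host `a` is treated as lost only the work that can legally move is placed again: the `mirror` task stays unplaced because its disk went away with the host.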
| 236 | +While Nomad is very clever and makes a lot of things much easier, we |
| 237 | +do still have a number of services that run directly on the Void |
| 238 | +system installed on the machines. For services that run directly on |
| 239 | +the metal, we almost always use runit to supervise the tasks and |
| 240 | +restart them when they crash. This works well, but does tightly |
| 241 | +couple the service to the machine on which it is installed, and |
| 242 | +requires coordination with Ansible to make sure that restarts happen |
| 243 | +when they are supposed to during maintenance activities. For services |
| 244 | +that run in containers, we can simply set the restart policy on the |
| 245 | +container and allow the runtime to supervise the services as well as |
| 246 | +any cascading restarts that need to happen, such as when certificates |
| 247 | +are renewed or rotated. |
| 248 | + |
| 249 | +In general, all our services have at least one layer of service |
| 250 | +supervision in the form of Void's runit-based init system, and in many |
| 251 | +cases more application-specific supervision occurs, often with |
| 252 | +status checks to validate and check assumptions made about the |
| 253 | +readiness of a service. |
| 254 | + |
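As a rough illustration of what a readiness status check can look like, the sketch below polls a check function a bounded number of times before giving up. The helper name `wait_until_ready` and its parameters are assumptions made for this example and do not describe how Void's checks are actually implemented.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool], attempts: int = 5, delay: float = 1.0) -> bool:
    """Call `check` up to `attempts` times, sleeping `delay` seconds between tries.

    Returns True as soon as the check passes, False if it never does.
    """
    for attempt in range(attempts):
        try:
            if check():
                return True
        except OSError:
            pass  # e.g. connection refused: treat as "not ready yet"
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A supervisor could call this with a function that opens a socket or fetches a status page, and only mark the service as healthy once it returns `True`.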
| 255 | +--- |
| 256 | + |
| 257 | +This has been day two of Void's infrastructure week. Check back |
| 258 | +tomorrow to learn about how we know that the services we run are up, |
| 259 | +and how we verify that once they're up, they're behaving as expected. |
| 260 | +This post was authored by `maldridge` who runs most of the day-to-day |
| 261 | +operations of the Void fleet. Feel free to ask questions on [GitHub |
| 262 | +Discussions](https://github.com/void-linux/void-packages/discussions/45099) |
| 263 | +or in [IRC](https://web.libera.chat/?nick=Guest?#voidlinux). |