Hey everybody,
I’m sure there are some ongoing questions about what happened Thursday night, and why stuff is still being kinda buggy. If you hung out in our Discord at all (or followed our status page), you probably saw that around 2021-04-09T06:00:00Z, we had a LOT of stuff break. Because I believe in transparency, I’m sharing a brief (yes, this is brief in comparison) explanation of what happened:
- I ran a recommended Discourse upgrade because our installation was out of date. This is pretty normal; most of the time when the forum goes down for maintenance, this is what I’m doing. If I’m already going to be taking the forum offline for a few minutes, I usually use this opportunity to add new features (like the reactions emojis or follow function). In this case, I wasn’t planning on doing anything like that, at least at first.
- This Discourse update included an update to something called Postgres, which is the kind of database that holds all our stuff (posts, user accounts, etc). I was unaware at the time that this update was included, as I hadn’t looked through the Discourse update documentation.
- The Postgres update was pushed through, failed, and broke our entire database, taking down the whole site.
- There was no immediate documentation on this error, so I spent several hours just trying to figure out why the failure happened. We were getting permission denied errors (explain that one to me since I’m the owner of the server lol). I could not get Discourse back up and running with the database so broken.
- At this point, I assume something fried our Discourse installation. So I move our existing forum installation and the messed-up database off the main site and reinstall Discourse, with the intention of just bringing our stuff back over once the new installation is up and working. (If you tried to log in yesterday and got a “User doesn’t exist” error, that would have been about the time I was working on this step.)
- The new installation failed. About 20 times. The site would load and hang indefinitely. I couldn’t bring anything over from the old forum. I run backups weekly, and we had one from April 6th. I knew I could restore from that without too much data loss since our original forum installation was broken and the new installation wouldn’t install.
- I go to grab this existing backup from the server, only to find out it’s not the kind of backup we need (it held our database contents, which were posts, user accounts, themes etc, but not images). I try to pull it off the server and restore the forum, only for it to fail repeatedly. Again. That’s when I reached out to Discourse.
- They told me it was possible to get the backup off the server and restore from it. I download it, only to be hit with MORE permissions errors every time I try to restore it. Access denied repeatedly. So I have to figure out a way to circumvent that. (There was a way, I just didn’t get an answer right away. Discourse eventually helped with this.)
- Once I finally get the permissions switched to the appropriate settings and restore the backup, NOW the site won’t load. Won’t load. Won’t load. Won’t load. If you tried to login or access the site and got a “site won’t connect/won’t load” error, that was about this time.
- So now, I’m thinking the whole server where Discourse lives is a loss as a result of this update, so I decide to spin up a new one. It’s not as fast or robust as our old server, but it’s easy to upgrade once we get things up and running (and it keeps costs down while I’m trying to fix this). I install Discourse again and restore the backup again, only to find that the site still won’t load, even on a completely fresh server that is less than an hour old.
- Now I’m reaching out to our server hosts to find an answer. Every test I’m running shows that the server is running, the Discourse instance is running, and finally, our restore is running too, but the site is still down.
- I finally, finally learn that, during all of these new Discourse instances/new servers spun up/moving around Discourse instances, we’ve been requesting new SSL certificates (the thing that gives us the padlock in the browser) from our provider. Requesting certificates is intentional. Requesting them 4 dozen times for the same website is not. I install certificates across all our sites to make everything secure, but there’s a limit to how many you can get at one time. So unwittingly, I’m requesting brand new certificates every time I move something around or try to fix something. Once we hit that limit, the provider who issues us those certificates shuts us down to prevent us from making more requests (this is rate limiting). I go back into the Discourse app and temporarily disable all of our SSL certificates to make the site accessible. We still have security through our service that A) caches our site to make it load faster for all users across the world, B) protects us from brute force attacks (amongst other things), and C) is the page that shows those 521 errors when the site is down. Finally, the site loads.
- Now, the Discourse site is loading FINALLY. But all the images are gone, which you may have noticed. I keep our uploads (images etc) saved on a different storage server to help alleviate costs. But because we’ve had to move the forum around, Discourse can’t find them. This means now I have to point Discourse to where those files live to display the images. But it also means basically “reloading” all our images, which requires refreshing nearly 500k posts to make everything work again (you may have seen that some of the images are slowly coming back but some aren’t; that’s because this isn’t done yet). This last step is still running and will be for some time.
- Once the forum is mostly functional again, I’ll be able to run another backup to get to the actual reason for my maintenance, which is to update our server to something faster and with better storage. You may have noticed things are laggy right now - it’s because we’re temporarily on a shared server, meaning we don’t have dedicated resources right now.
- When that is done, I’ll be moving our images again, which should be marginally less painful than this time around. This is because in about 6 months, we’ve already eaten through about 50GB of space just in uploads (not counting the other 20GB of resources), and we need to switch to a storage solution that will A) be cheaper and B) allow us to load images and other resources more quickly.
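
For anyone curious how the certificate rate limiting in the list above works, here is a minimal sketch. I don’t know how our provider actually implements it, so everything here is an assumption for illustration: the `CertificateRateLimiter` class, the limit of 5 certificates per 7 days, and the 30-minute gap between reinstalls are all made-up values, not the provider’s real numbers.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


class CertificateRateLimiter:
    """Hypothetical sliding-window limiter: at most `limit` certificates per `window`."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.issued = deque()  # timestamps of requests that were granted

    def request(self, now):
        # Forget certificates that were issued longer ago than the window.
        while self.issued and now - self.issued[0] >= self.window:
            self.issued.popleft()
        # Refuse the request once the window is full.
        if len(self.issued) >= self.limit:
            return False
        self.issued.append(now)
        return True


def simulate(reinstalls, limit=5, window=timedelta(days=7), gap=timedelta(minutes=30)):
    """Request one certificate per reinstall; the limit, window and gap are assumed values."""
    limiter = CertificateRateLimiter(limit, window)
    start = datetime(2021, 4, 9, 6, 0, tzinfo=timezone.utc)
    return [limiter.request(start + i * gap) for i in range(reinstalls)]
```

With these assumed numbers, `simulate(8)` grants the first five requests and refuses the remaining three, which is roughly what happened to us: every reinstall asked for a fresh certificate until the provider stopped answering.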
As a result of this, yes, we lost about 2-3 days’ worth of posts, threads, and user sign-ups. (If you had to sign up again, please message me and let me know. I have a badge for you.) The alternative would have been starting from scratch, including everyone having to sign up again.
With this new storage solution I’m working toward (which was one of the two main reasons for all this in the first place), we’ll be able to maintain daily backups (rather than weekly, which is what we have now) to ensure we don’t have this kind of data loss again. I’ll also be updating our server again so we get back to pre-outage speeds. That’s why we’re still so laggy, but I don’t want to move us again until I get our storage figured out so we have backups I can restore from in real-time.
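
To put rough numbers on why daily backups matter, here is a small sketch that finds the most recent backup before a failure and works out how much data would be lost. The `latest_backup_before` helper is hypothetical, and the 12:00 UTC backup time is an assumption for illustration (I only know the backup was from April 6th); the outage time is the one from the top of this post.

```python
from datetime import datetime, timedelta, timezone


def latest_backup_before(first_backup, interval, failure):
    """Return the time of the last scheduled backup at or before `failure`."""
    if failure < first_backup:
        raise ValueError("no backup exists before the failure")
    # Dividing one timedelta by another with // gives a whole number of intervals.
    completed = (failure - first_backup) // interval
    return first_backup + completed * interval


# Assumed time of day for the April 6th backup; the real time is not known here.
first_backup = datetime(2021, 4, 6, 12, 0, tzinfo=timezone.utc)
failure = datetime(2021, 4, 9, 6, 0, tzinfo=timezone.utc)

weekly_loss = failure - latest_backup_before(first_backup, timedelta(days=7), failure)
daily_loss = failure - latest_backup_before(first_backup, timedelta(days=1), failure)
```

Under these assumptions, the weekly schedule loses 2 days and 18 hours of posts, while a daily schedule would have lost 18 hours.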
Like I said, yes, this is a brief explanation of what’s happened the last couple of days. We will still be having partial outages over the next 24 hours or so while I get everything patched up, so you will see the forums go into read-only mode a few times while I’m making changes. There’s also a chance you’ll see images disappear again. Don’t worry, though, all those uploads are safe: we just have to tell Discourse where to find them.
I really appreciate everyone’s patience while I work to get everything back up and running. If you have additional questions, please ask below.
Love, CJ