CJ, why did everything break? [[Forum outage and issues April 8-11]]

CJtheSiteWizard · April 10, 2021, 3:35pm

Hey everybody,

I’m sure there are some ongoing questions about what happened Thursday night, and why stuff is still being kinda buggy. If you hung out in our Discord at all (or followed our status page), you probably saw that around 2021-04-09T06:00:00Z, we had a LOT of stuff break. Because I believe in transparency, I’m sharing a brief (yes, this is brief in comparison) explanation of what happened:

I ran a recommended Discourse upgrade because our installation was out of date. This is pretty normal; most of the time when the forum goes down for maintenance, this is what I’m doing. If I’m already going to be taking the forum offline for a few minutes, I usually use this opportunity to add new features (like the reactions emojis or follow function). In this case, I wasn’t planning on doing anything like that, at least at first.
Bundled into this Discourse update was an update to something called Postgres, which is the kind of database that holds all our stuff (images, etc). I was unaware at the time that this update was included, as I hadn’t looked through the Discourse update documentation.
The Postgres update was pushed through, failed, and broke our entire database, taking down the whole site.
There was no immediate documentation on this error, so I spent several hours just trying to figure out why the failure happened. We were getting permission denied errors (explain that one to me since I’m the owner of the server lol). I could not get Discourse back up and running with the database so broken.
At this point, I assume something fried our Discourse installation. So I move our existing forum installation and messed-up database off the main site and reinstall Discourse, with the intention of just bringing our stuff back over once the new installation is up and working. (If you tried to log in yesterday and got a “User doesn’t exist” error, that would have been about the time I was working on this step.)
The new installation failed. About 20 times. The site would load and hang indefinitely. I couldn’t bring anything over from the old forum. I run backups weekly, and we had one from April 6th. I knew I could restore from that without too much data loss since our original forum installation was broken and the new installation wouldn’t install.
I go to grab this existing backup from the server, only to find out it’s not the kind of backup we need (it held our database contents, which were posts, user accounts, themes etc, but not images). I try to pull it off the server and restore the forum, only for it to fail repeatedly. Again. That’s when I reached out to Discourse.
They told me it was possible to get the backup off the server and restore from it. I downloaded it, only to be hit with MORE permissions errors every time I tried to restore it. Access denied repeatedly. So I had to figure out a way to circumvent that. (There was a way, I just didn’t get an answer right away. Discourse eventually helped with this.)
Once I finally get the permissions switched to the appropriate settings and restore the backup, NOW the site won’t load. Won’t load. Won’t load. Won’t load. If you tried to login or access the site and got a “site won’t connect/won’t load” error, that was about this time.
So now, I’m thinking the whole server where Discourse lives is a loss as a result of this update, so I decide to spin up a new one. It’s not as fast or robust as our old server (to keep costs down while I’m trying to fix this), but it’s easy to upgrade once we get things up and running. I install Discourse again and restore the backup again, only to find that the site still won’t load, even on a completely fresh server that is less than an hour old.
Now I’m reaching out to our server hosts to find an answer. Every test I’m running shows that the server is running, the Discourse instance is running, and finally, our restore is running too, but the site is still down.
I finally learn that, while spinning up all of these new Discourse instances and servers and moving things around, we’ve been requesting new SSL certificates (the thing that gives us the padlock in the browser) from our provider. Requesting certificates is intentional. Requesting them 4 dozen times for the same website is not. I install certificates across all our sites to make everything secure, but there’s a limit to how many you can get within a certain period of time. So unwittingly, I’m requesting brand new certificates every time I move something around or try to fix something. Once we hit that limit, the provider who issues us those certificates shut us down to prevent us from making more requests (this is rate limiting). I go back into the Discourse app and temporarily disable all of our SSL certificates to make the site accessible. We still have security through our service that A) caches our site to make it load faster for all users across the world, B) protects us from brute force attacks (amongst other things), and C) is the page that shows those 521 errors when the site is down. Finally, the site loads.
Now, the Discourse site is loading FINALLY. But all the images are gone, which you may have noticed. I keep our uploads (images etc) saved on a different storage server to help alleviate costs. But because we’ve had to move the forum around, Discourse can’t find them. This means now I have to point Discourse to where those files live to display the images. But it also means basically “reloading” all our images, which requires refreshing nearly 500k posts to make everything work again (you may have seen that some of the images are slowly coming back but some aren’t–it’s because this isn’t done yet). This last step is still running and will be for some time.
Once the forum is mostly functional again, I’ll be able to run another backup to get to the actual reason for my maintenance, which is to update our server to something faster and with better storage. You may have noticed things are laggy right now - it’s because we’re temporarily on a shared server, meaning we don’t have dedicated resources right now.
When that is done, I’ll be moving our images again, which should be marginally less painful than this time around. This is because in about 6 months, we’ve already eaten through about 50GB of space just in uploads (not counting the other 20GB of resources), and we need to switch to a storage solution that will A) be cheaper and B) allow us to load images and other resources more quickly.
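The rate limiting described in the certificate step above can be sketched in a few lines of Python. This is only an illustration: the `SlidingWindowLimiter` class, the limit of 5 certificates per 7 days, and the one-request-every-30-minutes pattern are all made-up assumptions, not the certificate provider’s actual rules or code.

```python
from collections import deque
from datetime import datetime, timedelta


class SlidingWindowLimiter:
    """Allow at most `limit` requests within any rolling `window` (hypothetical helper)."""

    def __init__(self, limit: int, window: timedelta):
        self.limit = limit
        self.window = window
        self._granted = deque()  # timestamps of requests that were allowed

    def allow(self, now: datetime) -> bool:
        # Forget grants that have aged out of the rolling window.
        while self._granted and now - self._granted[0] >= self.window:
            self._granted.popleft()
        if len(self._granted) < self.limit:
            self._granted.append(now)
            return True
        return False  # over the limit: the request is refused


# Assumed numbers: 5 certificates per 7 days, one request per reinstall attempt.
limiter = SlidingWindowLimiter(limit=5, window=timedelta(days=7))
start = datetime(2021, 4, 9, 6, 0)
results = [limiter.allow(start + timedelta(minutes=30 * i)) for i in range(48)]
granted = sum(results)
print(f"granted {granted} of {len(results)} requests")  # granted 5 of 48 requests
```

Under these assumed numbers, only the first handful of requests succeed and every later reinstall is refused until the oldest grants fall out of the window, which is why a brand-new server did not help.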

As a result of this, yes, we lost about 2-3 days’ worth of posts, threads, and user sign-ups. (if you had to sign up again, please message me and let me know. I have a badge for you.) The alternative was we would have needed to start from scratch, including everyone having to sign up again.

With this new storage solution I’m working toward (which was one of the two main reasons for all this in the first place), we’ll be able to maintain daily backups (rather than weekly, which is what we have now) to ensure we don’t have this kind of data loss again. I’ll also be updating our server again so we get back to pre-outage speeds. That’s why we’re still so laggy, but I don’t want to move us again until I get our storage figured out so we have backups I can restore from in real-time.

Like I said, yes, this is a brief explanation of what’s happened the last couple of days. We will still be having partial outages over the next 24 hours or so while I get everything patched up, so you will see the forums go into read-only mode a few times while I’m making changes. There’s also a chance you’ll see images disappear again. Don’t worry, though, all those uploads are safe: we just have to tell Discourse where to find them.

I really appreciate everyone’s patience while I work to get everything back up and running. If you have additional questions, please ask below.

Love, CJ

alexlight · April 10, 2021, 4:05pm

Hello CJ, I just wanted to ask if anyone else can’t see people’s profile pictures?

alcoholandcaffeine · April 10, 2021, 4:09pm

@alexlight

CJtheSiteWizard · April 10, 2021, 8:17pm

@eggs

Tagging everybody here so everyone knows: I’m about to put the forums in read-only again and the forums will probably be down for about an hour or so. Feel free to hang out in our Discord, if you want to keep chatting! I’ll make a couple extra channels today.

anon25068527 · April 11, 2021, 9:49am

@CJtheSiteWizard I just want to say thank you for all you do to try and keep the forums running no matter what.

Novel_Worm · April 11, 2021, 11:21am

@CJtheSiteWizard Everything seems to be working almost perfectly again! I’m seeing images and avatars load, but some things are a little laggy, which I assume will get fixed soon as the forum comes back to normal.

Novel_Worm · April 11, 2021, 11:55am

I just can’t seem to see the Wacky logo when I enter a topic…

CJtheSiteWizard · April 11, 2021, 6:35pm

Update 2021-04-11T18:32:00Z:

The server is functioning at full capacity at this time. Discourse is up to date. We are running another couple of batches of image processing, so you may still see some problems there. If you encounter a problem that is NOT related to lag or pictures, please post about it in Site Bugs.

Thank you for your patience while we continue to monitor progress.

CJtheSiteWizard · April 11, 2021, 10:32pm

DollyTH · April 11, 2021, 10:36pm

SHE’S SO CUTE!

ShinobiSakura · April 12, 2021, 6:55pm

So this is why I had to log back in… Well, I’m glad you were able to get the site up and running again… does this mean I need to reupload my pfp?

CJtheSiteWizard · April 12, 2021, 7:07pm

Yes, at this point in time, you should. The avatars aren’t stored in the same places the rest of our uploads are, and for some reason Discourse hasn’t been very forthcoming about where they’re kept by default.

ShinobiSakura · April 12, 2021, 7:10pm

Alright… hopefully discourse will help out more. I’m just glad wackywriters is back

CJtheSiteWizard · April 12, 2021, 10:04pm

They’ve been pretty helpful but the only answer I’ve received so far is “they’re stored with the rest of your uploads.”

Ugh yes, you and me both.

FallenDreamm · April 12, 2021, 10:41pm

Yes! The forums are so important. I am so glad I found this site

ShinobiSakura · April 13, 2021, 1:01am

That reply from them is cryptic. I hate cryptic answers.

On a side note, love the new color scheme

ShinobiSakura · April 13, 2021, 1:02am

You and me both. Especially since I just told a friend about this site too

CJtheSiteWizard · April 13, 2021, 2:16am

For pfp woes, I’m running a kind of “compare and update” process that’s different from the other image migration we did a few days ago to see if I can catch missing files (of which there are apparently many missing). It’s a slower process than a simple migration, but I’m hopeful we’ll start to see avatars coming back with this method because it compares the entire file directory and adds any files that are missing (or may have been overlooked or failed migrating with the other method).
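As a rough illustration of the “compare and update” idea (not the actual tool being run here), the Python sketch below walks a source directory and copies over any file that is missing from the destination or differs from it. The function name `copy_missing` and the directory layout are assumptions made for the example.

```python
import filecmp
import shutil
from pathlib import Path


def copy_missing(source: Path, dest: Path) -> list[Path]:
    """Copy files that are absent from dest or differ from the source copy.

    Returns the relative paths that were copied. Hypothetical helper.
    """
    copied = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue  # directories are created on demand below
        rel = src_file.relative_to(source)
        target = dest / rel
        # shallow=False compares file contents, not just size and timestamps.
        if target.exists() and filecmp.cmp(src_file, target, shallow=False):
            continue  # already present and identical, nothing to do
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, target)
        copied.append(rel)
    return copied
```

Comparing every file is slower than a blind bulk copy, but running it a second time copies nothing, so it is safe to repeat until no files are reported missing.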

CJtheSiteWizard · April 13, 2021, 10:43pm

Update 2021-04-13T22:42:00Z:

While looking through our uploads and trying to figure out why profile pictures haven’t been loading (among other things), I found a corrupted uploads database file which is preventing both read and write access (meaning we can’t save new files to it and the site can’t find the images). So I will be creating a different server to host those, then copying over what is already functional and recopying the files that were corrupted the first time to make sure they go over cleanly.

The good news is I can do this behind the scenes without taking down the forum until we need to switch over to the new space. After that, it’ll just require another rebuild to update where the images are, which should at that point change over seamlessly. After that, we shouldn’t have any more trouble with uploads.

CJtheSiteWizard · April 14, 2021, 6:49am

The forums will be going into read-only mode for a few hours early evening CST on April 14th (2021-04-15T00:00:00Z -ish), followed by a rebuild of the forums that will likely mean the forums are down for a few minutes while everything gets re-aligned. If all goes well, this should be the last read-only and rebuild I’ll need to do for a while. It should, in theory, restore missing avatars and, again, if all goes according to plan (which I’m specifying repeatedly because as we saw over the weekend, things “going to plan” didn’t work out for us very well lol), we should be up and running on this nice new server with all our pictures and such, and won’t need to move servers or upgrade our storage for a long while.

Cross your fingers. (and yes, I have a backup just in case lol)