Just a quick post – I recently had a friend who was moving from a WP MultiUser setup, where each blog had its own subdomain, to one consolidated blog. He wanted to maintain links pointed to his old blogs, so he needed to 301 redirect all the pages on the old subdomains to the appropriate pages on the main domain using his .htaccess file. The code to put it together wasn’t too tricky:
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_HOST} !^yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://yourdomain.com/$1 [L,R=301]
This will take any request your site receives, and make sure that the URL starts (after the http://) with yourdomain.com. If it starts with anything else (say blog.yourdomain.com, or even www.yourdomain.com), it will be redirected to http://yourdomain.com. This code also allows for deep redirection – that is, if the user typed in
blog.yourdomain.com/my-favorite-page/
they’ll be redirected to
yourdomain.com/my-favorite-page/
which is definitely what you want.
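If the rewrite syntax is hard to read, the same logic can be sketched in a few lines of Python. This is only an illustration of what the rule does, not something Apache runs: the function name redirect_target and the CANONICAL_HOST constant are made up for the example, and the sketch ignores ports and query strings.

```python
from typing import Optional

# Placeholder domain, same as in the rule above.
CANONICAL_HOST = "yourdomain.com"


def redirect_target(host: str, path: str) -> Optional[str]:
    """Return the 301 target for a request, or None if no redirect is needed."""
    # Mirrors the RewriteCond with [NC]: compare the host case-insensitively.
    if host.lower() == CANONICAL_HOST:
        return None
    # Mirrors the RewriteRule: keep the requested path, swap the host.
    return "http://" + CANONICAL_HOST + "/" + path.lstrip("/")


if __name__ == "__main__":
    print(redirect_target("blog.yourdomain.com", "/my-favorite-page/"))
    print(redirect_target("yourdomain.com", "/my-favorite-page/"))
```

A request that is already on the main domain returns None, which is the "condition did not match, leave it alone" case.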
As I’ve mentioned before, I’m in charge of all things technical over at The Keyword Academy. Last weekend, we had one massive site outage. For nearly 36 hours, the site was COMPLETELY inaccessible. Not-responding-to-pings inaccessible. All of my knowledge about how to run a website comes from experience, so this was a great, and terrible, learning experience. I figured it would be a good idea to write down what I learned from the mess – both to clarify it in my own head, and to give anyone else some insight into what they might need to watch out for.
The Keyword Academy is a respectable program – we have somewhere between 1000 and 1500 paying members, many of whom are on the site every single day – checking the forum, using our tools, or reading up on content. Going down for any length of time means that we’re not only losing out on potential new members arriving on the site, but we’re also annoying our faithful members a lot, by throwing a wrench in their daily work (or procrastination, depending on what they’re planning on site) habits. Like any other site, we don’t want to be down. Ever.
Just after 1pm on Friday, 10/29, things started getting weird with the server (more on the actual server setup in a minute). Database connection errors started showing up in some places, but not others. Soon, no connection could be made to the database. After sshing into the server to see if I could figure out why the DB was down, things got weirder. The “top” command wasn’t recognized. Neither was “vi”. I couldn’t even list directory contents with “ls”. Things were going bad, in a hurry. After a quick reboot of the server (I’m a terrible nerd – reboot is very high on my “potential solutions” list), the problem remained.
So, I contacted support. We’d been hosting the site (actually the collection of sites – we’ve got a couple of tools on different domains/servers) with MediaTemple, and we’ve been very pleased with them – the pricing is reasonable, the service is predictable, and, while I can count the number of legitimate support tickets I’ve ever opened with them on one hand, support has always been pretty good. Obviously, this was a big deal, so I pinged them on Twitter immediately after opening the support request (insider’s tip: pinging @mediatemple after opening a ticket seems to get it looked at much more quickly), and soon after, I had a response: “Not our fault. Try reformatting your server” (more extra info: This site was a VPS, meaning we have full root access to it, but we don’t get the whole machine. MT has a convenient “revert” feature that puts everything back to its original state and lets you start over. Also, they used nicer words, but that was the gist of it). Not what I wanted to hear, but at least it meant I could resolve it on my own. A few minutes later, however, another email came in – something to the effect of “Actually, there might be something to this. We’ll look into it”.
Then came the waiting. And waiting. Even if it’s a hardware problem, how long could it take to fix, right? I’m not a sysadmin, and I’ve never worked at a hosting company, so I have no idea if that sentiment is valid or not, but – surely this can’t take more than a few hours? So we kept waiting.
Updates were scarce, and vague. While I was envisioning a tech jumping into action to replace the hardware, and get my service restored, the reality looks to have been a little different. 2+ hours later, I got my first status update (which I expected to be “All good, go nuts!”): “… emergency maintenance for vz439 has been scheduled …”.
Wait, what? 2 hours later, and you’ve managed to schedule maintenance? Surely that’s just some sort of automated message meaning “Our team of ninjas is replacing hard drives at breakneck speed”, or at least “We’re at best buy, reading the back of hard drive boxes. We’ll call you when we’re back”, right? So, I held on. Surely it would be resolved in a few more hours?
I’ll cut the narrative short at this point, because I could rant for 1000 more words about the rest of the ordeal. Suffice it to say, this update came about 1:00 Sunday morning, with 1 update in between. 1 update. In 35 hours of downtime.
To make matters worse, my server was actually still offline. I updated my support ticket to reflect that fact, and got no response. Finally, I reverted the server, and it started working again. At this point, it was midday Sunday, a full 48 hours since the site had gone down. Oh – and when I did revert, I managed to get into the data they had worked so diligently to “restore”. It was from July. I can’t imagine any business where 4-month-old data would be a good thing, but in ours, I have little doubt that it would have been the end of the business. We can’t lose 4 months of customer data and shrug it off; we would literally be out of business.
Fortunately, I had lost faith in MT’s ability to fix the problem sometime early Saturday, and started the arduous process of moving to a new host. Because of DNS resolution, the site was down for anywhere between about 30 hours and 40 hours for our members, so we’ll call it 36. We’ve got the main site on a Rackspace cloud server at the moment, while we figure out if that’s where we want to stay. We’re not sure what we’ll do, but we do know what we won’t do: stay with MediaTemple. While their service record has been great for us for over a year, they lost us far too much money (in missed sales, potential cancels, and my time, which is billed hourly to the owners of TKA) to stick with them. On top of that, service was atrocious. If I had been updated hourly, and given a reasonable ETA (even if it was 48 hours) during this ordeal, I might stick with them. As it was, they absolutely hung us out to dry. It’s inexcusable.
Fortunately, there was lots of learning to be done in this experience. Here are the main points:
We’d been with MT in our current setup for over a year, and I’ve had smaller sites hosted with them for much longer. They weren’t always great performance wise, but the control panel was nice, and it worked well enough. What could we have to worry about?
As I mentioned, this was a VPS server with MT, meaning I share its resources with at least a few other users. I don’t know how many people were on my particular box, but I know that there was only one other person demanding updates on Twitter. If the outage had affected thousands of users, I’m guessing updates would have been flowing a little more reliably.
If we had been on a fully dedicated server (where we were actually the only user on the server), what would the result have been? The level of public outcry is important to getting your problem addressed. Next time, I’ll inform the general membership who to yell at to get things going.
July. MT gave us data from July. That is borderline useless to me. Aside from the months and months of user data that wasn’t there, we’ve also made countless code changes since July. Without a backup, all of that would have been lost. Again – this probably would have been a business ender.
There are a number of “Set and forget” backup solutions out there, even for fairly large servers. We use one of these, because I don’t want to be saddled with the additional worry of knowing whether or not my code is backing us up properly – I want someone else to worry about that. We use Jungle Disk. When it came time to get my backup restored onto the new server, I was terrified. I had messed around with restoring backups to the same server using their handy desktop client before, but moving the data to a new server was a different animal, and significantly harder to do. Documentation was scarce. It all worked out, and we’re staying with JungleDisk – but I’d have felt much more comfortable if I knew what I was doing BEFORE the disaster.
We were backing up the database every 6 hours. That saves you from a real catastrophe, but it doesn’t exactly make for a seamless resolution when things go wrong. We now back up hourly, and we’re looking into what we can do to have an up-to-date database ready and waiting for a problem to happen, 24/7.
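A backup schedule only helps if somebody notices when it quietly stops running. Here is a minimal sketch of a freshness check, assuming your backups land as files in a single directory; the function names and the one-hour default are my own choices for the example, not part of any backup tool.

```python
import os
import time


def newest_backup_age_hours(backup_dir, now=None):
    """Age in hours of the most recently modified file, or None if there are no files."""
    now = time.time() if now is None else now
    mtimes = [
        entry.stat().st_mtime
        for entry in os.scandir(backup_dir)
        if entry.is_file()
    ]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_stale(backup_dir, max_age_hours=1.0, now=None):
    """True if there is no backup at all, or the newest one is older than max_age_hours."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is None or age > max_age_hours
```

Run something like this from a machine other than the one being backed up, so the check still works when the server itself is the thing that died.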
(sidenote: We use MySQL for everything) I’ll admit: We cheat on our database backups. We don’t run mysqldump, because it’s too slow for the amount of data we’re backing up – so I just back up the database files and hope for the best. This hasn’t bitten us yet, except in one spot: a couple of places in the database use InnoDB tables instead of MyISAM tables. MyISAM tables are easy to recover from files – you just plop them into your MySQL folder, and they’re good to go. InnoDB tables, however, are messier. Depending on your configuration, they’re probably stored in one big file, and that one big file will probably keep MySQL from starting if you don’t get it, and its required supporting files, into your MySQL directory before trying to recover. Know how to handle this beforehand.
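One cheap way to catch the InnoDB problem ahead of time is to check that a file-level backup actually contains the shared tablespace and its log files before you try to restore from it. The sketch below assumes the common default file names (ibdata1, ib_logfile0, ib_logfile1); your my.cnf may use different names or locations, so treat the list as a placeholder. The function name is made up for the example.

```python
import os

# Assumed default names for the shared InnoDB tablespace and its log files.
# Check your own MySQL configuration; these are placeholders, not a guarantee.
REQUIRED_INNODB_FILES = ("ibdata1", "ib_logfile0", "ib_logfile1")


def missing_innodb_files(backup_dir, required=REQUIRED_INNODB_FILES):
    """Return the names from `required` that are not present as files in backup_dir."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(backup_dir, name))
    ]
```

An empty list only tells you the files are there. It says nothing about whether they were copied in a consistent state, so a real test restore is still the only proof.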
Since we came back online, we’ve been battling countless tiny bugs that pop up because of inconsistencies between the old server and the new server. PHP versions, MySQL versions, installed libraries – all of these things can cause problems. If you’re planning a server switch, you can take your time to iron these out. If you’re suddenly forced to make the switch unexpectedly, your users are going to see, and have to deal with all your dirty laundry as you fix it up.
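If you write down what the old server is running while it is still alive, comparing it against a new box takes seconds. A minimal sketch, assuming you have collected the versions into plain dictionaries by hand or with your own script; the component names and version numbers in the usage example are invented for illustration.

```python
def diff_versions(old, new):
    """Return {component: (old_version, new_version)} for every mismatch.

    A component missing on one side shows up with None for that side.
    """
    mismatches = {}
    for name in sorted(set(old) | set(new)):
        old_version = old.get(name)
        new_version = new.get(name)
        if old_version != new_version:
            mismatches[name] = (old_version, new_version)
    return mismatches


if __name__ == "__main__":
    # Hypothetical inventories, not our real servers.
    old_server = {"php": "5.2", "mysql": "5.0", "curl": "7.19"}
    new_server = {"php": "5.3", "mysql": "5.0"}
    for name, (was, now) in diff_versions(old_server, new_server).items():
        print(name, was, "->", now)
```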
Email:
The problem (ok, one of the problems) with your average VPS setup is that everything runs on that box – including your mail server. If your server is down for 36 hours, you can’t receive email at your normal @mydomain.com address. Worse – if all your user data is tied up in the server that went down, you don’t know who to email to give them updates, and you can’t send it from whatever mail system you have on the server. As it happens, just as the server went down, we were getting set up on a new account with MailChimp. We hadn’t imported our email list yet, but I was able to get that from a db backup, allowing Mark (one of the owners) to import to MailChimp, and send out an email giving everyone an update.
Twitter:
Twitter is a great way to keep up with users in a situation like this. Fortunately, we have a support Twitter account. Unfortunately, we hadn’t done a huge amount of promotion to get all our users following it beforehand. Twitter is great for giving out updates. Get an account, and make sure your users know about it.
The real lesson is: stuff breaks. Failures happen. I place a lot of the blame for what happened to us on MediaTemple – but the fact is: I’m the one in charge of keeping our site up and running. I have far more to lose than they do if they lose all data since July. While it feels great to say it’s their fault, it would feel even better to say “I was prepared, so our users barely noticed”. Be ready with a backup plan. Figure out what you’re going to do when you run into a massive failure like this, and do it before you’re forced to, because it’s happening.