MySQL and Offsite Failover: Part I

February 6th, 2010 By randy Categories: System Administration

In a previous post, “DNS and Offsite Failover“, I documented the implementation of an automated, offsite, fail-over using DNS. For static websites there is nothing left to do except perhaps use `rsync` to keep the files up to date.

Unfortunately there is a lot more to think about for dynamic websites and applications that make use of a database.

  1. The databases need to be sync’d realtime.
  2. Keeping track of state is important, each server needs to know when it and its counter-parts are up, down, active or standby.
  3. In the event of a failover, they need to coordinate cleanup, merge changes and start replicating again.

With the level of complexity involved small businesses typically accept defeat and embrace the potential downtime. Some even plan for a manual failover in the event of a disaster.

There are two automated solutions that can be weighed in the balances.

The Reduced Functionality Offsite Failover
The easiest solution by far is to implement what I call the “Read-Only” or reduced functionality offsite failover. In this configuration we’ll keep another database sync’d offsite. In the event of downtime the offsite backup takes over but in a reduced functionality mode. If the site supports user login or performs transactions, they can be disabled temporarily. Interactive sites effectively become “read-only” for the time being.

This poses a problem for commercial sites whose business is to perform transactions. Potential sales are lost, but at least new and old customers can still learn about products and get information. Nobody receives an ugly “Server not found” browser error; in-fact a custom message can be crafted explaining that full functionality will return shortly. This would be great for university sites, magazines, journals or even popular blogs where the primary purpose of the site is to get information.

This makes the life of the system administrator easy because the primary server doesn’t need to keep track of state, never needs to merge changes later on and can continue as it was when and if the network connection is restored.

In fact, if it’s acceptable for the offsite server to be a day behind the primary, implementation can be as simple as our prior DNS solution and a nightly cronjob that looks something like this:

ssh 'mysqldump --single-transaction -uUSERNAME -pPASSWORD -h DBSERVER DATABASE' | /usr/local/bin/mysql -uLOCALUSERNAME -pLOCALPASSWORD LOCALDATABASE
delete from shared_sessions;
delete from main_cache_page;
delete from groups_cache_page;
insert into main_access (mask, type, status) values ('%', 'user', 0);
insert into groups_access (mask, type, status) values ('%', 'user', 0);

The above is an example of a Drupal site being placed into “read-only” mode. The ENTIRE database is pulled from the primary location via SSH (shared key is used instead of a password). The session table and cache is cleared and Drupal’s access table is modified to block all but the admin user from logging in. If the database is large it’s best to avoid this method and stick to replication. Replication will also give you real-time updates. We’ll cover that in the “fully functional implementation” in Part II. You can borrow elements from both while developing your own custom solution.

You can even set server variables in Apache:

setenv readonly yes

And in the theme layer detect the server variable ‘readonly’ to know when to hide the login box or display your custom reduced functionality nofication so that two separate code bases do not need to be maintained. A Drupal example:

  if ($region == 'login_slide' && !$_SERVER['readonly']) {    // Don't display the login field if the server is in read-only mode.
    drupal_set_content('login_slide', pframe_login_slide());

Fully Functional Offsite Failover
There are few sites that can afford to deny customer transactions. Reduced functionality mode is the easiest solution to implement but isn’t very practical for most business models. It’s harder to implement a fully functional offsite failover. As stated prior this requires both machines to keep track of state and resolve data conflicts when the primary changes from down to up. The logic is increasingly more complicated when you take into account the transition back from secondary to primary. The delays introduced by DNS caching and ISPs not respecting TTLs can potentially lead visitors to both primary and secondary locations at the same time and throw our databases out of sync. This method requires a lot more thought.

We’ll address the implementation of the fully functional offsite failover in Part II.

No comments yet.

Leave a Comment