Website Disaster Recovery Playbook: When Everything Goes Wrong (Part 20 of 21)


Most website owners think about disaster recovery after their site is already down. By then, every minute of panic decision-making costs money, reputation, and customer trust. The sites that recover fastest are not the ones with the most technical expertise in the moment – they are the ones with a plan written down and tested before the crisis happened.

This guide is that plan. It covers every major disaster type, what to do in the first 15 minutes, how to build true resilience into your infrastructure, and how to communicate professionally through an incident. It also includes templates you can use directly – the disaster recovery plan, the incident response playbook, the communication template, and the post-mortem format. Read it before you need it.


Disaster Types and Their Impact

Not all disasters are equal. Some are fast and obvious; others are slow and silent. Understanding the type of disaster determines how you respond.

| Disaster Type | Detection Speed | Business Impact | Recovery Time (Typical) | Prevention Feasibility |
|---|---|---|---|---|
| Hosting outage | Fast (uptime monitors) | High (site down) | Hours (dependent on host) | Low (use reliable host + CDN failover) |
| Hack / Malware | Slow (often days) | Very High (SEO damage + data risk) | Days to weeks | Medium (security hardening) |
| Database corruption | Fast (site breaks immediately) | Very High (data loss) | Hours if recent backup exists | Medium (automated backups) |
| DNS hijacking | Medium (customers notice first) | Very High (traffic, email stolen) | Hours to days (DNS propagation) | Medium (registrar 2FA, registry lock) |
| DDoS attack | Fast | Medium-High | Hours (with CDN mitigation) | High (Cloudflare) |
| Plugin conflict | Fast (immediate break) | Medium (specific features broken) | Minutes (rollback) | High (staging + update protocols) |
| Traffic spike | Fast | Medium | Minutes to hours | High (CDN, auto-scaling) |
| Human error | Fast to medium | Varies | Minutes if backup is recent | High (staging, confirm before delete) |

Hosting Outage

A hosting outage is the simplest disaster conceptually – your server is down and you cannot control it. Your options are to wait for the host to restore service, or to have a failover system that switches traffic to a secondary location automatically. Without failover, you are entirely dependent on your host’s incident response speed and communication quality. Check your host’s historical uptime and incident response records before a crisis happens.

Hack and Malware Infection

Malware infections are often silent for days or weeks. Your site appears to work normally while injected code harvests credit card data, redirects visitors to spam sites, or uses your server resources for cryptocurrency mining. You often discover it when Google Search Console sends a security warning, a visitor reports a redirect, or your host detects the malware and suspends the account.

The recovery from a malware infection is more complex than a hosting outage because the infection often has multiple footholds – backdoors in multiple files, compromised admin accounts, and sometimes injected database content. Simply restoring a backup may restore the infected code if the breach occurred before the backup.

Database Corruption

WordPress databases corrupt most commonly from an abrupt server shutdown during a write operation, running out of disk space during a WordPress update, or a MySQL crash. The symptoms are usually immediate and obvious – white screen of death, database connection errors, or partial content appearing. Recovery requires either a clean database backup or running MySQL repair operations.

DNS Hijacking

DNS hijacking – when an attacker takes control of your domain’s DNS records – is one of the most damaging disasters. Attackers can redirect your domain to their own servers (capturing traffic, login credentials, and email), issue fraudulent SSL certificates for your domain, or simply point your DNS to nothing. Recovery requires regaining control of your domain registrar account and restoring correct DNS records, which then must propagate globally over 24-48 hours.


RTO and RPO: Define Your Recovery Targets

Before you can build a recovery plan, you need to define what recovery means in terms of time and data. RTO and RPO are the two key metrics.

RTO: Recovery Time Objective

RTO is the maximum amount of time your site can be offline before the business impact becomes unacceptable. A brochure website might have an RTO of 24 hours – a day of downtime is bad but survivable. A WooCommerce store processing $10,000/day of orders might have an RTO of 1 hour, because every additional hour costs $416 in lost revenue and compounds customer trust damage.

RPO: Recovery Point Objective

RPO is the maximum amount of data loss that is acceptable. If your RPO is 1 hour, your backups must run at least every hour so that in the worst case you lose at most 1 hour of data. A content site publishing a few articles per day might accept a 24-hour RPO. A WooCommerce store processing orders continuously needs an RPO of minutes – losing orders is directly losing customer money and requires refunds or manual reconciliation.

Calculating RTO and RPO for Your Site

Ask yourself: if my site went down right now and came back 4 hours later, what is the business impact? What if I lost the last 2 hours of data (orders, form submissions, user registrations)? The answers tell you your acceptable RTO and RPO respectively. Document these numbers – they drive every infrastructure and budget decision in your recovery plan.
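The arithmetic behind these targets fits in a few lines. This is an illustrative sketch, not part of any standard; it assumes revenue and orders are spread evenly across the day, which understates the cost of a peak-hour outage and overstates an overnight one. Plug in your own numbers.

```python
def downtime_cost(daily_revenue: float, outage_hours: float) -> float:
    """Rough direct revenue loss for an outage of the given length,
    assuming revenue is evenly distributed across 24 hours."""
    return daily_revenue / 24 * outage_hours

def max_orders_at_risk(orders_per_day: float, rpo_hours: float) -> float:
    """Worst-case orders lost if you restore to a backup that is
    rpo_hours old, under the same even-distribution assumption."""
    return orders_per_day / 24 * rpo_hours

if __name__ == "__main__":
    # The $10,000/day store from the RTO example: ~$417 per hour down.
    print(f"1-hour outage: ${downtime_cost(10_000, 1):,.2f}")
    print(f"Orders at risk with a 1-hour RPO: {max_orders_at_risk(240, 1):.0f}")
```

Run this against your real daily revenue and order volume, then compare the per-hour loss to the monthly cost of failover infrastructure and more frequent backups.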


The Disaster Recovery Plan Template

A disaster recovery plan is a documented, tested set of procedures for restoring your website to normal operation after a disruptive event. Copy and adapt this template for your site.

Section 1: Inventory and Dependencies

Document everything your site depends on: primary hosting provider and account details, domain registrar and account details, DNS provider and nameservers, CDN provider, SSL certificate provider, database server location and access credentials, backup storage location (separate from primary host), email service provider, key third-party APIs (payment gateways, email, SMS). Store this in an encrypted password manager (1Password, Bitwarden) accessible to all team members who might need it during an incident.

Section 2: Backup Verification Schedule

A backup that has never been tested is not a backup – it is a hope. Document: what is backed up (files and database), backup frequency, backup retention period, backup storage location (must be separate from primary server), and the test restoration schedule. Restore from backup to a staging environment at minimum once per quarter.
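A freshness check like this can run from cron and alert you when backups silently stop. It is a minimal sketch assuming your backups land as files in a single local or mounted directory; the path, naming, and alerting hook are yours to fill in (it does not verify that a backup actually restores, which only a test restoration proves).

```python
import os
import time

def newest_backup_age_hours(backup_dir: str) -> float:
    """Age in hours of the most recently modified file in backup_dir."""
    paths = [os.path.join(backup_dir, f) for f in os.listdir(backup_dir)]
    files = [p for p in paths if os.path.isfile(p)]
    if not files:
        raise FileNotFoundError(f"no backup files found in {backup_dir}")
    newest = max(os.path.getmtime(p) for p in files)
    return (time.time() - newest) / 3600

def meets_rpo(backup_dir: str, rpo_hours: float) -> bool:
    """True if the newest backup is recent enough to satisfy your RPO."""
    return newest_backup_age_hours(backup_dir) <= rpo_hours
```

Schedule it to run hourly and page someone when `meets_rpo` returns False; a backup job that fails quietly for weeks is a classic contributing factor in post-mortems.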

Section 3: Escalation Contacts

Name and contact information for: primary technical contact (who gets the first call), backup technical contact (if primary is unavailable), hosting support contact number and support ticket URL, domain registrar support, and business owner or decision-maker who needs to be informed of any P1 incident.

Section 4: Recovery Procedures by Scenario

For each disaster type, document the exact recovery steps specific to your site and infrastructure. Generic advice is not useful at 3 AM when you are panicking – you need: the exact command to restore the database, the exact steps to connect to your server, the exact location of your backups. Pre-written procedures are the difference between a 30-minute recovery and a 6-hour one.


Immediate Response Checklist: The First 15 Minutes

The first 15 minutes of an incident set the tone for everything that follows. Panic decisions in this window – like pushing untested fixes to production, or restarting servers without understanding the cause – often make things worse. Follow this structured checklist instead.

  • 0-2 minutes – Verify: Confirm the site is actually down. Check from multiple devices and connections. Check your uptime monitor dashboard. Rule out that it is just your local internet or browser cache.
  • 2-4 minutes – Characterize: What exactly is broken? Whole site down (server error)? Specific page? Specific feature? Check your hosting control panel, server error logs, and WordPress error log immediately.
  • 4-6 minutes – Check recent changes: What changed in the last 24 hours? Plugin update? WordPress core update? Theme update? Server change? Code deployment? The most likely cause is the most recent change.
  • 6-8 minutes – Escalate appropriately: Is this within your ability to fix in the next 30 minutes? If not, contact your hosting support now – the earlier you open a ticket, the sooner they are working on it.
  • 8-10 minutes – Communicate: Post to your status page: “We are aware of an issue with [site/service] and are investigating. We will post an update within 30 minutes.” Do not wait until you have answers – communicate that you know and are working on it.
  • 10-15 minutes – Isolate and act: If the cause is a recent plugin update, deactivate it. If the database is down, check disk space and MySQL status. If it is a hosting issue, work with support while investigating client-side options like serving a maintenance page from Cloudflare.
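For the "characterize" step, the HTTP status code your site returns (e.g. from `curl -s -o /dev/null -w '%{http_code}' https://example.com`) narrows the cause quickly. The mapping below is a heuristic triage table, not exhaustive, and the 52x/530 codes only apply if your site sits behind Cloudflare.

```python
def triage(status: int) -> str:
    """Map an HTTP status code to the most likely first place to look."""
    if status == 500:
        return "application error - check PHP/WordPress error logs first"
    if status in (502, 504):
        return "upstream timeout - PHP-FPM or app server likely down or overloaded"
    if status == 503:
        return "service unavailable - maintenance mode, overload, or host throttling"
    if status in (521, 522, 523):
        return "Cloudflare cannot reach origin - origin server or firewall issue"
    if status == 530:
        return "Cloudflare origin DNS error - check DNS records"
    if 200 <= status < 400:
        return "origin responding - problem may be a specific page or feature"
    return "uncommon status - check server access logs"
```

A 500 points you at application logs; a 522 tells you to skip WordPress entirely and check whether the origin server is reachable at all.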

Hosting-Level Failover

True hosting resilience requires redundancy – a second environment that can serve traffic if the primary fails. The options range from simple CDN-based failover to full multi-region active-active architectures.

Multi-Region on AWS

AWS allows you to run your WordPress site across multiple Availability Zones (AZs) within a region (protecting against data center failures) or across multiple regions (protecting against regional outages). Multi-AZ is achievable for most WordPress sites using Amazon Aurora Multi-AZ for the database and EC2 instances across multiple AZs behind an Application Load Balancer.

Multi-region is more complex and expensive. It requires database replication across regions (Aurora Global Database), CDN routing logic to detect region failures and redirect traffic, and session and file storage that works across regions (Amazon S3 for media, Amazon ElastiCache for sessions). This level of infrastructure is appropriate for enterprise sites but generally overkill for most WordPress operations.

Cloudflare Load Balancing

Cloudflare Load Balancing ($5/month) provides geographic load distribution and automatic failover between multiple origin servers. Configure your primary and backup servers as pool members. Cloudflare health checks each server regularly and routes traffic away from unhealthy origins automatically. This is the most accessible path to multi-origin failover for WordPress sites – it does not require the same origin server infrastructure as AWS multi-region, just two server instances that can both serve your site.


DNS Failover Setup

DNS failover automatically points your domain to a backup server if the primary server goes down. This works at the DNS resolution layer, so visitors never see the failover happen – their requests are transparently routed to the working server.

Cloudflare Health Checks

Cloudflare health checks continuously probe your origin servers. Paired with Cloudflare Load Balancing (both are paid features), a failed health check automatically shifts traffic to the backup origin. For many sites, this provides the fastest and most accessible DNS-level failover without managing Route 53 or another dedicated DNS failover service.

Amazon Route 53

Route 53’s health-based routing evaluates the health of your endpoints and routes traffic to healthy ones. You can configure primary/secondary routing (Route 53 sends traffic to the primary unless it fails, then routes to secondary) or latency-based routing (always routes to the lowest-latency healthy endpoint). Route 53 health checks cost $0.50/check/month – a negligible cost for the protection they provide.
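Creating a Route 53 health check is a small API call. This sketch assumes boto3 (`pip install boto3`) and configured AWS credentials; the domain and path are placeholders, and the interval/threshold values shown are illustrative defaults, not requirements.

```python
import uuid

def health_check_config(domain: str, path: str = "/") -> dict:
    """HealthCheckConfig for an HTTPS probe of the given domain."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": domain,
        "Port": 443,
        "ResourcePath": path,
        "RequestInterval": 30,   # seconds between probes
        "FailureThreshold": 3,   # consecutive failures before "unhealthy"
    }

def create_health_check(domain: str):
    """Register the health check with Route 53 (requires AWS credentials)."""
    import boto3  # third-party: pip install boto3
    client = boto3.client("route53")
    return client.create_health_check(
        CallerReference=str(uuid.uuid4()),  # idempotency token
        HealthCheckConfig=health_check_config(domain),
    )
```

You then attach the returned health check ID to your primary record in a failover routing policy, so Route 53 answers with the secondary record whenever the check reports unhealthy.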


CDN as a Failover Layer

When your origin server is completely unreachable, Cloudflare’s “Always Online” feature serves a cached version of your site to visitors. This is not full functionality – visitors cannot log in, submit forms, or complete purchases – but it shows content rather than an error page, which maintains the user experience for most visitors during a brief outage.

Configure Cloudflare’s cache aggressively for static pages so Always Online has content to serve. For a content site, this means most pages are available even if your origin is down. For a WooCommerce site, the shop can appear functional while the actual purchasing flow will fail – combine this with a maintenance page notice that explains the situation and directs customers to an email address or phone number for urgent orders.


Incident Response Playbooks by Scenario

Pre-written playbooks for your specific infrastructure are far more useful than general advice during an actual incident. Here are templates for the most common WordPress disaster scenarios – customize them for your specific server setup, hosting provider, and access methods.

Playbook: Plugin Conflict Breaking Site

  • Enable WordPress maintenance mode via FTP if admin is inaccessible (create a .maintenance file in the WordPress root containing <?php $upgrading = time(); – WordPress reads the timestamp and shows the maintenance screen for 10 minutes from that time)
  • Access the server via FTP or SSH
  • Navigate to /wp-content/plugins/
  • Rename the plugin directory for any recently updated plugin (e.g., rename /woocommerce/ to /woocommerce-disabled/)
  • Test if site loads – if yes, the renamed plugin was the cause
  • If site is still broken, rename the entire /plugins/ directory to /plugins-disabled/ to disable all plugins simultaneously
  • Re-enable plugins one at a time to identify the conflict
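The rename steps above can be scripted if you have filesystem access (SSH, or an SFTP mount). This is a minimal sketch; `wp_root` is a placeholder for your WordPress installation path, and the `-disabled` suffix is just the convention used in the playbook.

```python
import os

def disable_plugin(wp_root: str, plugin_slug: str) -> str:
    """Rename wp-content/plugins/<slug> so WordPress deactivates it."""
    src = os.path.join(wp_root, "wp-content", "plugins", plugin_slug)
    dst = src + "-disabled"
    os.rename(src, dst)
    return dst

def disable_all_plugins(wp_root: str) -> str:
    """Rename the entire plugins directory to rule out plugin conflicts."""
    src = os.path.join(wp_root, "wp-content", "plugins")
    dst = src + "-disabled"
    os.rename(src, dst)
    return dst
```

Renaming back (or recreating an empty `plugins` directory and moving plugins in one at a time) is how you re-enable and bisect to the conflicting plugin. If WP-CLI is available, `wp plugin deactivate <slug>` achieves the same without touching the filesystem.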

Playbook: Database Connection Error

  • Check disk space first – a full disk is the most common cause
  • Check MySQL/MariaDB service status via hosting control panel or SSH
  • Verify database credentials in wp-config.php match your database user credentials
  • Try repairing the database: add define('WP_ALLOW_REPAIR', true); to wp-config.php, navigate to /wp-admin/maint/repair.php, run the repair, then remove the line
  • If repair fails, restore from the most recent backup
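The disk-space check in step one is scriptable with the standard library. The 5% threshold here is an illustrative default, not a standard; set it to whatever headroom your database and backup jobs actually need.

```python
import shutil

def disk_free_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100

def disk_looks_full(path: str = "/", min_free_percent: float = 5.0) -> bool:
    """True if free space has dropped below the given threshold."""
    return disk_free_percent(path) < min_free_percent
```

A database that cannot write because the disk is full produces the same "error establishing a database connection" screen as a crashed MySQL service, so checking this first avoids a misdirected hour of MySQL debugging.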

Playbook: Malware Infection

  • Take an immediate backup (even of the infected site – for forensics)
  • Put site in maintenance mode to prevent further visitor exposure
  • Change all passwords immediately: WordPress admin accounts, database, FTP/SSH, hosting control panel
  • Run malware scan via your security plugin (Wordfence, Sucuri)
  • Remove identified malicious files – do not just quarantine, delete them
  • Check for unauthorized admin accounts in WordPress Users – delete any you do not recognize
  • Reinstall WordPress core files from a fresh download (do not trust existing core files)
  • Verify all plugins and themes against known-good versions
  • Restore database from a pre-infection backup if possible
  • Harden: update all credentials again, implement 2FA, install a WAF
  • Request Google Search Console security review removal after cleanup
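A helper like this supports the scan step by listing PHP files modified recently, since injected backdoors usually have fresh modification times. It is a starting point for manual review, not proof of compromise: attackers can fake timestamps, so combine it with your security plugin's scan and a checksum verification of core files (e.g. `wp core verify-checksums` if WP-CLI is available).

```python
import os
import time

def recently_modified_php(root: str, days: int = 7) -> list[str]:
    """PHP files under `root` modified within the last `days` days."""
    cutoff = time.time() - days * 86400
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".php"):
                path = os.path.join(dirpath, name)
                if os.path.getmtime(path) > cutoff:
                    hits.append(path)
    return sorted(hits)
```

Pay particular attention to PHP files in places they do not belong, such as /wp-content/uploads/, which should contain only media.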

Playbook: DDoS Attack

  • Enable Cloudflare Under Attack Mode immediately (Security > Settings > Security Level: Under Attack)
  • This adds a 5-second JavaScript challenge to all visitors, blocking most bot traffic
  • If not using Cloudflare: contact your host’s support immediately, they can implement upstream blocking
  • Enable rate limiting rules in Cloudflare for suspicious traffic patterns
  • If attack is targeted at specific URLs (common for login page attacks): block those URLs at Cloudflare level for non-logged-in visitors
  • Monitor traffic patterns to understand the attack vector – is it volumetric (bandwidth saturation), protocol-level (e.g. SYN floods), or application-layer (HTTP floods against specific endpoints)?
  • Once attack subsides, analyze and implement permanent protections
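Under Attack Mode can also be toggled programmatically, which is useful when the dashboard is slow to load during an attack. This is a sketch against Cloudflare's v4 API using only the standard library; it assumes an API token with zone-settings edit permission, and `zone_id` is a placeholder for your zone's ID.

```python
import json
import urllib.request

API = "https://api.cloudflare.com/client/v4"

def under_attack_request(zone_id: str, token: str) -> urllib.request.Request:
    """Build the PATCH request that sets security_level to under_attack."""
    body = json.dumps({"value": "under_attack"}).encode()
    return urllib.request.Request(
        f"{API}/zones/{zone_id}/settings/security_level",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

def enable_under_attack(zone_id: str, token: str) -> dict:
    """Send the request and return Cloudflare's JSON response."""
    with urllib.request.urlopen(under_attack_request(zone_id, token)) as resp:
        return json.load(resp)
```

Setting `"value"` back to `"medium"` (or your normal level) through the same endpoint ends the challenge once the attack subsides.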

Communication Plan Template

Professional incident communication reduces customer anxiety, decreases support ticket volume, and signals that your organization is competent and in control. Use these templates for different communication channels.

Status Page Updates (Every 30 Minutes During P1)

Initial (first 15 minutes): “We are investigating an issue affecting [service/site]. Our team is aware and actively working on a resolution. We will post an update within 30 minutes.”

Progress update: “We have identified the cause of the issue [brief description, no technical jargon]. We are working on a resolution and estimate service restoration within [timeframe]. Affected features: [list]. Unaffected features: [list].”

Resolution: “The issue has been resolved as of [time]. Normal service has been restored. We will publish a full incident report within 48 hours. We apologize for the disruption and any inconvenience caused.”

Customer Email Template

Send to affected customers if the outage lasted longer than your SLA commitments, involved data exposure, or impacted orders/purchases:

Subject: [Site name] – Service Update Regarding [Date] Incident

Body: On [date] from [time] to [time] [timezone], [site name] experienced a service disruption affecting [what was affected]. We understand this may have caused inconvenience and we are writing to keep you informed. [Brief, non-technical explanation of what happened.] All [orders/accounts/data] are intact. [Specific action you are taking for affected customers, e.g., extending subscriptions, refunding affected orders.] To prevent future incidents, we have [brief description of improvement]. Please contact us at [email] if you have any questions or if anything about your account does not look right.


Using Your Staging Site as Emergency Fallback

A staging site is primarily a testing environment, but in an emergency it can serve as a read-only fallback while you restore the production site. If your production site is down for more than 30 minutes, consider temporarily pointing your domain (or a subdomain) to your staging environment with a notice explaining that some features may be unavailable.

This only works if staging is kept reasonably current (synced from production regularly) and if your content is the primary value of the site. For WooCommerce sites where the site being available is about the ability to transact, a read-only staging fallback is of limited use – customers cannot actually purchase. For content sites and blogs, serving read-only content from staging while production is being restored is significantly better than an error page.


Tabletop Exercise Guide

A tabletop exercise is a structured discussion where your team walks through a hypothetical disaster scenario to test your response plan without actually breaking anything. Run one quarterly to keep your team prepared and your plan current.

How to Run a Tabletop Exercise

Block 90 minutes with your technical team and a business stakeholder. Choose a scenario: “It is 2 PM Friday. We receive an alert that the site is down. Uptime monitor shows 503 errors. A developer pushed a plugin update 20 minutes ago.” Work through the scenario step by step using your documented playbook. After the walkthrough, debrief: What was unclear? What information was missing? What would you do differently? Update your playbook based on the discussion.

Scenarios to rotate through quarterly:

  • Site down due to hosting outage (host is unresponsive)
  • Malware discovered by Google Search Console notification
  • Database corruption from plugin update conflict
  • DDoS attack during a sale event
  • Key team member unavailable during a P1 incident (test if others can execute the plan)

Post-Mortem Template and Process

Complete this within 48 hours of every P1 incident. Store the document in a shared location accessible to all team members. The goal is learning and improvement, never blame.

Post-Mortem Structure

Incident Summary
Date, duration, services affected, user impact (estimated users affected, orders lost/impacted, revenue impact).

Timeline
Timestamped sequence from first symptom to full resolution.

Root Cause
The technical root cause of the incident in plain language.

Contributing Factors
What conditions allowed the root cause to create an incident? Missing monitoring? No staging test? Inadequate backup frequency?

What Went Well
Acknowledge what your team did right. Fast detection? Good communication? Quick isolation of the cause?

What Could Be Improved
Specific, actionable observations without assigning individual blame.

Action Items
Each item must have: a specific owner, a due date, and a measurable definition of done. No vague commitments like “improve monitoring” – instead, “Set up Checkly transaction monitor for checkout flow by [date]. Owner: [name].”


Insurance and Managed DR Services

Business interruption insurance covers revenue losses from unexpected outages that are outside your control. Cyber liability insurance covers costs of a data breach including notification requirements, legal fees, and regulatory fines. If your site generates significant revenue, consult with a business insurance broker about appropriate coverage – the premium cost is typically a fraction of the cost of a single serious incident.

Managed disaster recovery services (offered by many managed WordPress hosts including WP Engine, Kinsta, and Cloudways) include daily backup management, one-click restores, and often SLA guarantees on restore times. If you do not have in-house technical capacity to execute recovery procedures, a managed host that includes DR as part of their service is worth the premium over cheap shared hosting.


Testing Your DR Plan: A Simulation Walkthrough

Testing your disaster recovery plan is not optional – it is the entire point. A plan that has never been tested has unknown failure modes. Here is a practical testing schedule:

Monthly Tests

  • Verify backup files exist and are dated correctly
  • Test one backup restoration to staging (complete files + database)
  • Verify all recovery-related credentials are current (server SSH keys, hosting panel passwords)

Quarterly Tests

  • Full tabletop exercise with team
  • Test DNS failover if you have a configured failover environment
  • Review and update all documentation (has anything changed in your infrastructure?)
  • Verify escalation contact list is current

Annual Tests

  • Full DR simulation: actually execute a recovery from backup in a test environment and time it
  • Verify your RTO target is achievable based on actual recovery time from the simulation
  • Review insurance coverage
  • Update business impact assessment (your revenue and cost structures may have changed)

Business Continuity Planning

Disaster recovery focuses on restoring your technical systems. Business continuity planning focuses on how the business keeps operating while systems are being restored. For most small to mid-size website operations, BCP is simple: have a way to take orders by phone or email, communicate to customers where they can reach you, and have a team member who can handle customer communication while the technical team works on restoration.

For sites where the website IS the product (SaaS, membership sites, e-commerce), the distinction between technical recovery and business continuity is blurry – restoring the site IS the business continuity plan. The key addition is customer communication: keeping affected customers informed, providing alternatives where possible (phone orders, manual subscription extensions), and compensating impacted customers appropriately once service is restored.


The Cost of Not Having a DR Plan

The average cost of downtime for small businesses is $427 per minute according to ITIC research. For a site doing $50,000/month in revenue, even a conservative estimate puts a 4-hour outage at $5,000-$15,000 in combined direct revenue loss, staff incident response cost, and SEO impact.

Beyond the direct financial cost: customer trust that takes months to rebuild, Google rankings that penalize sites with repeated crawl errors, and the reputational damage of an incident handled poorly versus one handled professionally with timely communication. The difference between a 2-hour outage with clear communication and a 2-hour outage with no communication is enormous in terms of customer retention.

Incidents are inevitable. How prepared you are determines whether they are minor disruptions or existential threats to your business.

Building a disaster recovery plan takes 4-8 hours of focused work. Running quarterly tabletop exercises takes 90 minutes each. Testing backup restorations takes 30 minutes per month. This is a small time investment relative to the protection it provides against events that WILL happen to your site eventually – not if, but when.


Next Steps

Start with the two highest-impact actions: verify your backups are working and document your escalation contacts. These two things alone significantly improve your starting position in any incident. Then build out the full plan over the next month – scenario playbooks, communication templates, and a scheduled quarterly tabletop exercise.

Pair your disaster recovery plan with a solid monitoring setup so you detect incidents as fast as possible, and with a backup strategy that meets your RPO requirements. Monitoring without DR planning means you know about problems quickly but are not prepared to fix them. DR planning without monitoring means you might not even know there is a problem until customers tell you.


Website Owner’s Toolkit – 21-Part Series

This post is part of the Website Owner’s Toolkit – a 21-part series covering everything you need to run a professional website. View the full series index.

Related posts in this series:

Part 16: Website Downtime Costs You Money – How to Monitor and Fix Issues Fast

Part 14: The 3-2-1 Backup Strategy Every Website Owner Needs

Part 2: Shared vs Managed vs Cloud Hosting – Which Is Right for Your Site?

Part 13: Website Security Checklist – Keep Hackers Out
