Introduction
Backup and recovery capability is the ultimate security backstop. Preventive controls can fail. Detection can be delayed. But an organization with tested, current backups of its OT systems and practiced recovery procedures can survive even a catastrophic cyber incident — ransomware, destructive malware, or targeted sabotage — and restore operations with defined, predictable timelines.
OT backup and recovery is not the same as IT backup and recovery, and IT-focused backup programs do not adequately protect OT environments. An enterprise backup solution that correctly images every Windows server may completely ignore the PLC logic that actually runs the process, the HMI project files that define the operator interface, and the historian database that contains years of process history. When a ransomware attack encrypts the SCADA server and corrupts the PLC configurations, the IT backup program provides servers that boot correctly into an environment where the control systems no longer work.
This guide provides a complete framework for OT backup and recovery — from understanding what needs to be backed up and why, through backup methodology, offline storage requirements, recovery time objectives, procedure development, and testing.
What Needs to Be Backed Up in OT
The OT backup inventory goes far beyond the servers that enterprise backup tools typically manage.
Tier 1: Control System Logic and Configuration
These backups are irreplaceable if lost and are the foundation of any OT recovery.
PLC and RTU Logic Programs: The ladder logic, function block diagrams, and structured text programs running on every PLC, RTU, and DCS controller in the facility. These programs represent years of engineering effort — the accumulated logic of process optimization, safety interlock development, and operational refinement. Without current backups, recovery from a logic-modifying attack requires re-engineering the control logic from process documentation, which may take weeks and may not perfectly replicate the tuned parameters of the original.
DCS Controller Configurations: Distributed Control System controller databases, loop configurations, control module hierarchies, and faceplates. DCS vendors (Emerson DeltaV and Ovation, Honeywell Experion, Yokogawa CENTUM, ABB 800xA) each have vendor-specific backup procedures that export the controller configuration in a format that can be fully restored.
Safety Instrumented System (SIS) Logic and Configuration: SIS configurations are the most critical backup in the OT environment. A SIS that cannot be restored to its certified configuration following a cyber incident cannot be returned to service without extensive re-certification testing. Back up SIS configurations under version control and store them offline.
Network Device Configurations: Managed switch configurations, industrial firewall rule sets, router configurations, and wireless access point configurations. Network device configuration backup is frequently overlooked in OT backup programs, but a facility where every PLC is restored but the network is misconfigured cannot resume operations.
Field Instrument and Device Configurations: Process transmitters, analyzers, intelligent motor controllers, and variable frequency drives that have network-configurable parameters. These are often not backed up because they are considered "field instruments" rather than "control systems." In practice, reconfiguring dozens of field instruments after a cyber incident can add days to the recovery timeline.
Tier 2: Windows-Based OT Systems
SCADA Server Images: Full system images of SCADA servers, including operating system, SCADA application, tag database, alarm configuration, and historian connections. A SCADA server image backup allows restoration to a known-good state without re-installing and re-configuring from scratch.
HMI Workstation Images: Full images of HMI servers and operator workstations, including the HMI project files (process graphics, navigation structure, alarm displays), the HMI software installation, and configurations. HMI images allow restoration of the operator interface to an operational state.
Engineering Workstation Images: Full images of engineering workstations, including the vendor engineering software (TIA Portal, Studio 5000, etc.), project files, and configurations. Engineering workstation images are critical for recovery — the workstation is the tool needed to restore PLC configurations, and if it is compromised, the recovery process requires rebuilding or using a clean spare.
Historian Databases: Process historian databases (OSIsoft PI, Canary, Ignition Historian) contain the operational data record of the facility. The database itself can be restored from backup; the ongoing data collection can resume after server recovery. Historian database backup frequency should match the value of the historical data — daily backups for most facilities.
Tier 3: Documentation and Configuration Records
As-Built Network Diagrams: Current network diagrams showing all device connections, IP addresses, VLAN assignments, and zone boundaries. Without current network documentation, rebuilding the network topology after a destructive incident is complex and error-prone.
Firewall Rule Sets and Access Control Lists: Exported rule set configurations for all industrial firewalls and network ACLs. In many cases, firewall management systems maintain version-controlled rule sets; confirm that the management system's configuration database is also backed up.
Vendor Licenses and Keys: Software license keys, hardware keys (dongles), and vendor-issued certificates for OT software. A SCADA server restored from image backup may fail to start if the license management server is not also restored or the license is not available. Document and store all license keys offline.
Vendor Contact Information and Escalation Contacts: During an incident, the vendor emergency contact list is critical. Store a current copy offline and ensure it includes direct technical support contacts, not just sales or service desk numbers.
Backup Methodology by Component Type
PLC and Controller Logic Backup
Method: Use the vendor's engineering software to export PLC logic in a backup-compatible format:
- Siemens S7: TIA Portal > Project > Archive project as .zap17 (TIA Portal V17) or equivalent. Full project archive includes all PLC programs, hardware configurations, and network configurations.
- Rockwell Studio 5000: File > Save As creates .ACD or .L5K (text format) files. The .L5K format is human-readable and version-controllable in a code repository.
- Schneider EcoStruxure Control Expert: File > Save Archive creates .STA archive files; the project export function produces .XEF (XML) files.
- GE Proficy Machine Edition: File > Export creates backup archives.
Frequency: After every authorized change. PLC logic should be backed up as part of the change management process — take the backup before making changes (pre-change baseline) and after changes are validated (post-change authorized baseline). Automated backup tools can also perform periodic scheduled backups.
Automation tools:
- Rockwell FactoryTalk AssetCentre: Automated backup and change detection for Rockwell controller logic
- Siemens SINEMA Server (succeeded by SINEC NMS): Network management with configuration backup for Siemens networking infrastructure
- Platform-agnostic tools (Tenable OT Security, formerly Indegy; Claroty) can back up controller logic from multiple vendor platforms
Version control: Store PLC logic backups in a version-controlled system (Git or a document management system with version history). Version control allows comparison between current and previous logic versions to identify unauthorized changes.
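Where the export format is text, as with Rockwell's .L5K, a plain unified diff between the last authorized baseline and the newest automated backup surfaces unauthorized changes directly. A minimal sketch (the logic fragment below is illustrative, not real exported code):

```python
import difflib

def diff_logic_exports(baseline_text: str, current_text: str) -> list[str]:
    """Unified diff between two text-format logic exports (e.g. the
    contents of two Rockwell .L5K files). Empty list means no change."""
    return list(difflib.unified_diff(
        baseline_text.splitlines(), current_text.splitlines(),
        fromfile="authorized_baseline", tofile="latest_backup", lineterm=""))

# Illustrative fragment: a timer preset silently changed from 5000 ms to 500 ms
baseline = "TON(Timer1, 5000);\nOTE(Pump_Run);"
latest = "TON(Timer1, 500);\nOTE(Pump_Run);"
changes = diff_logic_exports(baseline, latest)
if changes:
    print(f"{len(changes)} diff lines -- review before accepting as new baseline")
```

Any non-empty diff should be routed to engineering review before the new export is accepted as the authorized baseline.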
Windows-Based System Backup
Image-based backup: Use image-based backup tools to capture full system images of SCADA servers, HMI servers, and engineering workstations:
- Veeam Backup & Replication: Full image backup with bare-metal restore capability. Widely used for OT Windows systems.
- Acronis Cyber Backup: Image-based backup with encrypted storage and offline backup capability.
- Windows Server Backup (native): Built-in to Windows Server; acceptable for basic image backup where third-party tools are not available.
Frequency:
- SCADA servers: Weekly full image backup; daily incremental if change frequency warrants it
- HMI servers: Weekly full image backup; after every significant HMI project change
- Engineering workstations: Weekly full image backup; after every authorized software or configuration change
Retention: Retain at least 4 weeks of image history. This provides the ability to restore to a pre-compromise state if an incident is discovered after some delay.
Isolation of backup storage: Backup storage must be isolated from both the OT network and the IT network. Ransomware that encrypts OT systems will also encrypt any backup storage it can reach. Backup storage isolation options:
- Offline storage (removable media or offline tape that is physically disconnected from all networks)
- Write-once storage (WORM media or object storage with object lock enabled)
- Backup infrastructure on a dedicated, isolated network segment with no access from OT or IT production networks
Historian Database Backup
Most historian platforms include native backup capabilities:
- OSIsoft PI: PI Backup utility creates full backups of the PI Data Archive, including all points and historical data. Schedule automatic backups and verify completion.
- Canary Labs: CSV export or native backup of the Canary data store.
- Ignition: The Gateway backup (.gwbk) captures projects and configuration; historian data stored in an external SQL database must be backed up with that database's own tools.
Historian database backups should be stored on isolated backup infrastructure, not on the OT network storage that is also the target of potential ransomware.
For historian databases with years of process history, the backup volume may be large. Verify that the backup infrastructure has sufficient capacity and that restore time from backup is within the recovery time objective.
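Whether a multi-terabyte restore fits the RTO is simple arithmetic that should be done before the incident, not during it. An illustrative calculation (the 1.5x overhead factor for verification, decompression, and index rebuild is an assumption, not a benchmark):

```python
def restore_hours(backup_size_gb: float, throughput_mb_s: float,
                  overhead_factor: float = 1.5) -> float:
    """Estimated wall-clock restore time: raw transfer time scaled by an
    overhead factor for verification, decompression, and index rebuild."""
    seconds = (backup_size_gb * 1024) / throughput_mb_s
    return seconds * overhead_factor / 3600

# e.g. a 2 TB historian backup restored at a sustained 100 MB/s
print(f"Estimated restore: {restore_hours(2048, 100):.1f} h")
```

If the estimate exceeds the historian RTO, the options are faster restore infrastructure, a tiered restore (server first, historical data backfilled later), or a revised RTO.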
Offline Storage Requirements
The most critical backup media must be isolated from all networks to ensure ransomware cannot encrypt or corrupt backups.
Storage Architecture
Tier 1 — Offline / Air-Gapped Storage: PLC logic backups, SIS configurations, and "golden images" of critical OT servers should be stored on media that is physically disconnected from all networks. Options:
- External hard drives stored in a secure location (fireproof safe, operations building safe)
- Write-protected USB drives stored offline
- Offline NAS device that is powered off and physically disconnected between backup events
Tier 2 — Isolated Network Storage: Regular backup data (weekly images, daily historian backups) may be stored on backup infrastructure that is network-connected but isolated from both OT and IT networks. The backup network should have no routing to OT or IT; the only connections should be the scheduled backup agents that push data to the backup system.
Tier 3 — Offsite Storage: A copy of the most critical backup data should be maintained offsite — either at a secondary facility, a secure storage service, or an encrypted cloud backup. Offsite storage protects against physical events (fire, flood) that could destroy all on-site backup media.
Backup Media Management
- Label all backup media with the backup date, the system backed up, and the backup type
- Maintain a backup media log: what is stored on each piece of media, its location, and its retention period
- Rotate backup media: do not overwrite the most recent backup until the previous backup is verified as complete and readable
- Test media periodically: confirm that backup media is readable and that the backup data is restorable before you need it in an emergency
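The periodic media test can be partly automated by recording a checksum in the media log when the backup is written and recomputing it at each test. A minimal sketch (the log format and file names are hypothetical):

```python
import hashlib
from pathlib import Path

def verify_media_entry(backup_file: Path, logged_sha256: str) -> bool:
    """Re-hash a backup file and compare against the checksum recorded
    in the media log when the backup was taken. False means the file is
    unreadable or corrupt and the backup must be retaken."""
    digest = hashlib.sha256()
    with backup_file.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == logged_sha256
```

A checksum match confirms the data is intact; it does not replace an actual restore test, which also exercises the restore tooling and procedures.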
Recovery Time Objectives
Define Recovery Time Objectives (RTO) for each OT system category:
| System Category | Typical RTO | Recovery Method |
|---|---|---|
| Safety Instrumented System | 4-24 hours | Logic restore from backup + SIS functional testing before restart |
| Primary DCS Controllers | 8-24 hours | Controller restore from backup + process checkout |
| SCADA Server | 2-8 hours | Bare-metal restore from image backup |
| HMI Server | 1-4 hours | Bare-metal restore from image backup |
| Engineering Workstation | 4-8 hours | Bare-metal restore or spare EWS deployment |
| Field Instruments | 1-48 hours (scales with quantity) | Configuration restore using engineering workstation |
| Historian Server | 4-24 hours (data recovery ongoing) | Server restore from image; historical data restore from DB backup |
| Network Infrastructure | 2-8 hours | Configuration restore to spare hardware |
RTOs are targets, not guarantees. Actual recovery times depend on:
- Whether backup media is current and accessible
- Whether recovery procedures are documented and practiced
- Whether spare hardware is available for failed equipment
- Whether vendor support is available for complex restorations
Test your RTOs annually against actual recovery exercises. Untested RTOs are estimates, not commitments.
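Measured exercise times can be checked against the RTO table mechanically. A sketch with hypothetical exercise results (RTO values taken from the table above):

```python
RTO_HOURS = {"SCADA server": 8, "HMI server": 4, "Engineering workstation": 8}

def rto_gaps(measured_hours: dict[str, float]) -> dict[str, float]:
    """Map each system whose measured recovery time exceeded its RTO
    to the overrun in hours; an empty dict means all targets were met."""
    return {system: hours - RTO_HOURS[system]
            for system, hours in measured_hours.items()
            if hours > RTO_HOURS[system]}

# Illustrative exercise results
print(rto_gaps({"SCADA server": 11.5, "HMI server": 3.0}))  # -> {'SCADA server': 3.5}
```

Each overrun becomes an action item: either close the gap (better procedures, spares, vendor support) or formally revise the RTO.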
Recovery Procedure Development
Recovery procedures must be documented at sufficient detail that personnel who were not involved in the original system configuration can execute them correctly under pressure.
Structure of an OT Recovery Procedure
Each recovery procedure should include:
1. Scope and Applicability: What system(s) does this procedure cover? Under what circumstances should it be invoked?
2. Prerequisites: What must be in place before starting? (Backup media available, spare hardware in place, vendor contact list available, process in safe state, operations team notified)
3. Recovery Team: Who must be involved? (OT security engineer, control system engineer, vendor contact, operations supervisor) What are their roles and communication contacts?
4. Step-by-Step Procedure: Each action described in sufficient detail for execution by a qualified engineer without additional reference material. For complex steps, include screenshots or configuration excerpts.
5. Verification Steps: After each major step, what checks confirm the step completed successfully before proceeding?
6. Process Restart Criteria: What conditions must be verified before the process can be restarted? This section is typically written jointly with the operations team and may include physical inspections, instrument calibration checks, and loop functional tests.
7. Rollback Options: If the recovery procedure fails at a specific step, what options exist?
8. Escalation Contacts: Who to call at each stage if problems arise that the team cannot resolve. Include vendor emergency contacts with direct phone numbers, not just general support lines.
Procedure Testing
Recovery procedures must be tested before they are needed in an emergency. Testing options:
Tabletop Exercise: Walk through the procedure with the full recovery team. Identify gaps, ambiguities, and missing prerequisites. A tabletop exercise reveals most procedure problems without requiring a production system to be taken offline.
Component Restoration Test: Restore a non-critical OT system component (a decommissioned HMI server, a development PLC, a spare engineering workstation) from backup in a test environment. Verify that the restored system comes up correctly and the backup data is complete and valid.
Full Recovery Exercise: At planned maintenance or shutdown, execute a complete recovery of a section of the OT environment from backup. This provides the highest confidence that the procedures work but requires coordination with operations and a production maintenance window.
Conduct tabletop exercises at least annually. Component restoration tests should be performed at least semi-annually. Full recovery exercises should occur at least once every 18-24 months for critical OT systems.
Ransomware-Specific Recovery Considerations
Ransomware attacks on OT environments present specific recovery challenges:
Assessment before restoration: Before restoring from backup, assess whether the adversary still has access. Restoring encrypted systems to a network where the attacker maintains persistence allows re-encryption. The restoration sequence should include re-establishing the network architecture from a clean state, not simply restoring systems to the existing compromised network.
Clean restore environment: Restore OT systems to a clean, rebuilt network environment. If the IT network is compromised, keep the OT restoration network completely isolated from IT until the IT investigation is complete.
PLC and controller logic verification: After restoring PLC logic from backup, verify the restored logic against the backup record using hash comparison or automated logic comparison tools. If the backup was made before the attacker modified the logic, restoring it restores the correct logic. If the backup was made after modification, it may contain the attacker's changes.
Recovery sequencing: Restore and verify each layer in sequence:
- Network infrastructure (switches, firewalls) — the network must be correct before systems are connected
- Engineering workstations — needed to restore and verify controller configurations
- Safety systems — must be restored and verified before any process restart
- DCS and SCADA systems — restore before returning operators to automated control
- Historians and reporting systems — restore after the process is stable
Preserve evidence: Before restoring, capture forensic images of affected systems where possible. The encrypted systems and ransom notes contain forensic evidence relevant to the investigation, insurance claim, and regulatory reporting.
Business Continuity: Operating Through a Cyber Incident
For extended incidents where full OT restoration takes days or weeks, business continuity planning addresses how to maintain safe plant operation during the recovery period.
Manual operations procedures: Document manual operating procedures for each process area — how operators maintain safe control of the process if SCADA visibility is lost or HMIs are unavailable. Manual operations are cognitively demanding and error-prone; practiced procedures significantly reduce the risk of operator error during a crisis.
Degraded mode operations: Define acceptable reduced-capacity operations that can be maintained with degraded automation. For example, a process that normally runs at 100% capacity under automated control may be maintainable at 60% under manual control until the SCADA system is restored.
Communication alternatives: If normal OT communications are disrupted, what alternatives exist? Panel-mounted instruments (local gauges, local indicators) that are independent of the SCADA network provide operator visibility into process parameters without relying on the SCADA network.
Customer and stakeholder communication: Prepare communication templates for notifying customers of production delays, regulatory bodies of incidents as required, and internal leadership of status and recovery timelines.
Conclusion
OT backup and recovery is not an insurance policy that is set up and forgotten. It requires ongoing maintenance — keeping backups current, testing restoration procedures, updating RTOs as systems change, and practicing the coordination required for an effective recovery response.
The organizations that recover fastest from OT cyber incidents are those that have made the investment before the incident: current, tested backups; practiced recovery procedures; and a recovery team that knows its roles without having to figure it out under pressure. That preparation is achievable, and its value is proven every time it converts a potential catastrophe into a manageable recovery event.
Beacon Security provides OT backup program design, recovery procedure development, restoration testing, and ransomware readiness assessments for industrial organizations. Contact us to assess the completeness and test status of your OT backup and recovery capability.

