IT disaster recovery

IT disaster recovery (also simply disaster recovery, or DR) is the process of maintaining or reestablishing vital infrastructure and systems following a natural or human-induced disaster, such as a storm or battle. DR employs policies, tools, and procedures with a focus on IT systems supporting critical business functions.[1] Because it contributes to keeping all essential aspects of a business functioning despite significant disruptive events, it can be considered a subset of business continuity (BC).[2][3] DR assumes that the primary site is not immediately recoverable and restores data and services to a secondary site.

IT service continuity

IT service continuity (ITSC) is a subset of business continuity planning (BCP),[4] which relies on the metrics (frequently used as key risk indicators) of recovery point and recovery time objectives. It encompasses IT disaster recovery planning and the wider IT resilience planning. It also incorporates IT infrastructure and services related to communications, such as telephony and data communications.[5][6]

Principles of backup sites

Planning includes arranging for backup sites, whether they are "hot" (operating prior to a disaster), "warm" (ready to begin operating), or "cold" (requires substantial work to begin operating), and standby sites with hardware as needed for continuity.

In 2008, the British Standards Institution launched BS 25777, a standard supporting the business continuity standard BS 25999, specifically to align computer continuity with business continuity. It was withdrawn following the publication in March 2011 of ISO/IEC 27031, "Security techniques — Guidelines for information and communication technology readiness for business continuity."[7]

ITIL has defined some of these terms.[8]

Recovery Time Objective

The Recovery Time Objective (RTO)[9][10] is the targeted duration of time and a service level within which a business process must be restored after a disruption in order to avoid a break in business continuity.[11]

According to business continuity planning methodology, the RTO is established during the business impact analysis (BIA) by the owner(s) of the process, including identifying time frames for alternate or manual workarounds.

[Figure: schematic representation of the terms RPO and RTO. The example shows longer "actual" times that do not meet either the RPO or the RTO (the "objectives").]

RTO is a complement of RPO. The limits of acceptable or "tolerable" ITSC performance are measured by RTO and RPO in terms of time lost from normal business process functioning and data lost or not backed up during that period.[11][12]

Recovery Time Actual

Recovery Time Actual (RTA) is the critical metric for business continuity and disaster recovery.[9]

The business continuity group conducts timed rehearsals (or actuals), during which the RTA is determined and refined as needed.[9]
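As a minimal sketch of how a timed rehearsal might be scored, the Python below compares a measured recovery time (the RTA) and a measured data-loss window against the RTO and RPO. The class name, function name, and sample durations are illustrative assumptions, not part of any standard or of the cited sources.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical container for the objectives set during the BIA."""
    rto: timedelta  # maximum tolerable time to restore the process
    rpo: timedelta  # maximum tolerable window of lost data


def evaluate_rehearsal(objectives: RecoveryObjectives,
                       recovery_time_actual: timedelta,
                       data_loss_window: timedelta) -> dict:
    """Report whether a rehearsal's measured values met each objective."""
    return {
        "rto_met": recovery_time_actual <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Illustrative values only: a 4-hour RTO and a 15-minute RPO.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
result = evaluate_rehearsal(objectives,
                            recovery_time_actual=timedelta(hours=5),
                            data_loss_window=timedelta(minutes=10))
print(result)
```

With these sample values the rehearsal misses the RTO (five hours against a four-hour target) but stays within the RPO.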

Recovery Point Objective

A Recovery Point Objective (RPO) is the maximum acceptable interval during which transactional data is lost from an IT service.[11]

For example, if RPO is measured in minutes, then in practice, off-site mirrored backups must be continuously maintained as a daily off-site backup will not suffice.[13]

Relationship to RTO

A recovery that is not instantaneous restores transactional data over some interval without incurring significant risks or losses.[11]

RPO measures the maximum period over which recent data might have been permanently lost; it is not a direct measure of the quantity lost. For instance, if the BC plan is to restore up to the last available backup, then the RPO is the interval between such backups.

RPO is not determined by the existing backup regime. Instead BIA determines RPO for each service. When off-site data is required, the period during which data might be lost may start when backups are prepared, not when the backups are secured off-site.[12]
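The relationship between a backup regime and the RPO can be sketched as follows. The model is an assumption made for illustration: backups are prepared at a fixed interval, and each backup only becomes usable for recovery after a fixed delay while it is secured off-site. Under that model, the worst case is a disaster striking just before the newest backup is secured, so the exposure is the interval plus the delay. The function and parameter names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         offsite_delay: timedelta = timedelta(0)) -> timedelta:
    """Worst-case age of the newest usable backup at the moment of a disaster.

    Assumes backups are prepared every `backup_interval` and each one becomes
    usable for recovery `offsite_delay` after it was prepared.
    """
    return backup_interval + offsite_delay


def regime_meets_rpo(rpo: timedelta,
                     backup_interval: timedelta,
                     offsite_delay: timedelta = timedelta(0)) -> bool:
    """Check a proposed backup regime against an RPO set by the BIA."""
    return worst_case_data_loss(backup_interval, offsite_delay) <= rpo


# Illustrative regime: daily backups that take 6 hours to reach the off-site vault.
daily = timedelta(hours=24)
courier = timedelta(hours=6)
print(worst_case_data_loss(daily, courier))                # exposure window
print(regime_meets_rpo(timedelta(minutes=15), daily))      # RPO measured in minutes
print(regime_meets_rpo(timedelta(hours=36), daily, courier))
```

The check runs in one direction only: the RPO comes from the BIA, and the backup regime is then tested against it, not the other way round.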

Mean times

The recovery metrics can be converted to/used alongside failure metrics. Common measurements include mean time between failures (MTBF), mean time to first failure (MTFF), mean time to repair (MTTR), and mean down time (MDT).
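One commonly used conversion, not stated in the sources cited here, is the steady-state availability relation A = MTBF / (MTBF + MTTR). The sketch below applies it; the function name and the sample figures are illustrative assumptions.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a repairable system is expected to be up.

    Uses the standard relation A = MTBF / (MTBF + MTTR).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures: a failure every 999 hours on average, 1 hour to repair.
availability = steady_state_availability(mtbf_hours=999, mttr_hours=1)
print(f"{availability:.3%}")
```

An observed MTTR is an average over past incidents, whereas an RTO is a target, so the two are better compared than treated as interchangeable.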

Data synchronization points

A data synchronization point[14] is the point in time at which a backup is completed. Update processing is halted while a disk-to-disk copy is made. The backup[15] copy reflects the state of the data at the time of that copy operation, not at the later time when the data is copied to tape or transmitted elsewhere.
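The idea of halting updates while a copy is made can be illustrated with a toy in-memory store. This is a sketch under stated assumptions: the `Store` class is hypothetical, a lock stands in for whatever mechanism a real system uses to pause its update queue, and a deep copy stands in for the disk-to-disk copy.

```python
import copy
import threading
from datetime import datetime, timezone


class Store:
    """Toy in-memory data store; a stand-in for a real database."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def update(self, key, value):
        # Updates wait here while a snapshot holds the lock.
        with self._lock:
            self._data[key] = value

    def snapshot(self):
        """Halt updates, copy the data, and record the synchronization point."""
        with self._lock:
            sync_point = datetime.now(timezone.utc)
            return sync_point, copy.deepcopy(self._data)


store = Store()
store.update("order-1", "paid")
sync_point, backup = store.snapshot()
store.update("order-2", "paid")  # made after the synchronization point

# Shipping `backup` off-site later does not change what it contains:
# it still reflects the data as of `sync_point`.
print(sync_point.isoformat(), backup)
```

The later update to `order-2` is absent from the copy, which is why the recorded synchronization point, and not the shipping time, marks the start of the data-loss window.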

System design

RTO and the RPO must be balanced, taking business risk into account, along with other system design criteria.[16]

RPO is tied to the times backups are secured offsite. Sending synchronous copies to an offsite mirror allows for most unforeseen events. The use of physical transportation for tapes (or other transportable media) is common. Recovery can be activated at a predetermined site. Shared offsite space and hardware complete the package.[17]

For high volumes of high-value transaction data, hardware can be split across multiple sites.

History

Planning for disaster recovery and information technology (IT) developed in the mid to late 1970s as computer center managers began to recognize the dependence of their organizations on their computer systems.

At that time, most systems were batch-oriented mainframes. An offsite mainframe could be loaded from backup tapes pending recovery of the primary site; downtime was relatively less critical.

The disaster recovery industry[18][19] developed to provide backup computer centers. Sungard Availability Services was one of the earliest such centers, located in Sri Lanka (1978).[20][21]

During the 1980s and 90s, computing grew exponentially, including internal corporate timesharing, online data entry and real-time processing. Availability of IT systems became more important.

Regulatory agencies became involved; availability objectives of two, three, four or five nines (from 99% up to 99.999%) were often mandated, and high-availability solutions for hot-site facilities were sought.[citation needed]
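The "nines" shorthand translates directly into a yearly downtime budget. The following sketch shows the arithmetic, assuming a 365-day year; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def availability_for_nines(nines: int) -> float:
    """Availability as a fraction, e.g. 3 nines -> 0.999."""
    return 1 - 10 ** -nines


def allowed_downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per year implied by an N-nines availability target."""
    return MINUTES_PER_YEAR * 10 ** -nines


for n in (2, 3, 4, 5):
    print(f"{n} nines: {availability_for_nines(n):.5%} available, "
          f"{allowed_downtime_minutes_per_year(n):.2f} minutes of downtime per year")
```

Five nines leaves roughly 5.26 minutes of downtime per year, while two nines leaves about 87.6 hours.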

IT service continuity became essential as part of Business Continuity Management (BCM) and Information Security Management (ISM), as specified in ISO 22301 and ISO/IEC 27001 respectively.

The rise of cloud computing since 2010 created new opportunities for system resiliency. Service providers absorbed the responsibility for maintaining high service levels, including availability and reliability. They offered highly resilient network designs. Recovery as a Service (RaaS) is widely available and promoted by the Cloud Security Alliance.[22]

Classification

Disasters can be the result of three broad categories of threats and hazards.

  • Natural hazards include acts of nature such as floods, hurricanes, tornadoes, earthquakes, and epidemics.
  • Technological hazards include accidents or the failures of systems and structures such as pipeline explosions, transportation accidents, utility disruptions, dam failures, and accidental hazardous material releases.
  • Human-caused threats include intentional acts such as active assailant attacks, chemical or biological attacks, cyber attacks against data or infrastructure, sabotage, and war.

Preparedness measures for all categories and types of disasters fall into the five mission areas of prevention, protection, mitigation, response, and recovery.[23]

Planning

Research supports the idea that implementing a more holistic pre-disaster planning approach is more cost-effective. Every $1 spent on hazard mitigation (such as a disaster recovery plan) saves society $4 in response and recovery costs.[24]

2015 disaster recovery statistics suggest that downtime lasting for one hour can cost[25]

  • small companies $8,000,
  • mid-size organizations $74,000, and
  • large enterprises $700,000 or more.

As IT systems have become increasingly critical to the smooth operation of a company, and arguably the economy as a whole, the importance of ensuring the continued operation of those systems, and their rapid recovery, has increased.[26]

Control measures

Control measures are steps or mechanisms that can reduce or eliminate threats. The choice of mechanisms is reflected in a disaster recovery plan (DRP).

Control measures can be classified as controls aimed at preventing an event from occurring, controls aimed at detecting or discovering unwanted events, and controls aimed at correcting or restoring the system after a disaster or an event.

These controls are documented and exercised regularly using so-called "DR tests".

Strategies

The disaster recovery strategy derives from the business continuity plan.[27] Metrics for business processes are then mapped to systems and infrastructure.[28] A cost-benefit analysis highlights which disaster recovery measures are appropriate. Different strategies make sense based on the cost of downtime compared to the cost of implementing a particular strategy.
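A cost-benefit comparison of this kind can be sketched by adding each strategy's yearly cost to the downtime cost it is expected to leave behind. The strategy costs and expected downtime hours below are invented purely for illustration; only the hourly downtime figures ($8,000, $74,000 and $700,000) are taken from the statistics quoted earlier in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    annual_cost: float              # yearly cost of running the strategy
    expected_downtime_hours: float  # expected downtime per year with it in place


def total_annual_cost(strategy: Strategy, downtime_cost_per_hour: float) -> float:
    """Strategy cost plus the expected cost of the downtime that remains."""
    return (strategy.annual_cost
            + strategy.expected_downtime_hours * downtime_cost_per_hour)


def cheapest(strategies, downtime_cost_per_hour: float) -> Strategy:
    """Pick the strategy with the lowest total expected annual cost."""
    return min(strategies, key=lambda s: total_annual_cost(s, downtime_cost_per_hour))


# Hypothetical figures, chosen only to make the trade-off visible.
strategies = [
    Strategy("tape backups sent off-site", 20_000, 12.0),
    Strategy("off-site replication", 150_000, 4.0),
    Strategy("high-availability systems", 600_000, 0.5),
]

for hourly_cost in (8_000, 74_000, 700_000):
    best = cheapest(strategies, hourly_cost)
    print(f"${hourly_cost:,}/hour -> {best.name}")
```

Under these assumed numbers, the cheapest option shifts from tape backups to replication to high availability as the hourly cost of downtime rises, which is the sense in which different strategies suit different organizations.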

Common strategies include:

  • backups made to tape and sent off-site
  • backups to disk on-site (copied to off-site disk) or off-site
  • replication of data off-site, so that only the systems then need to be restored or synchronized, possibly via storage area network technology
  • private cloud solutions that replicate metadata (VMs, templates and disks) into the private cloud. Metadata are configured as an XML representation called Open Virtualization Format, and can be easily restored
  • hybrid cloud solutions that replicate both on-site and to off-site data centers. This provides instant fail-over to on-site hardware or to cloud data centers.
  • high availability systems which keep both the data and system replicated off-site, enabling continuous access to systems and data, even after a disaster (often associated with cloud storage).[29]

Precautionary strategies may include:

  • local mirrors of systems and/or data and use of disk protection technology such as RAID
  • surge protectors, to minimize the effect of power surges on delicate electronic equipment
  • use of an uninterruptible power supply (UPS) and/or backup generator to keep systems going in the event of a power failure
  • fire prevention/mitigation systems such as alarms and fire extinguishers
  • anti-virus software and other security measures.

Disaster recovery as a service

[Figure: a modular data center connected to the power grid at a utility substation.]

Disaster recovery as a service (DRaaS) is an arrangement with a third party vendor to perform some or all DR functions for scenarios such as power outages, equipment failures, cyber attacks, and natural disasters.[30]


Disaster recovery for cloud systems

The following best practices can strengthen a disaster recovery strategy for cloud-hosted systems:[31][32][33]

  1. Flexibility: The disaster recovery strategy should be adaptable to support both partial failures (such as recovering specific files) and full environment failures.
  2. Regular testing: Regular testing of the disaster recovery plan can verify its effectiveness and identify any weaknesses or gaps.
  3. Clear roles and permissions: It should be clearly defined who is authorized to execute the disaster recovery plan, with separate access and permissions for these individuals. Implementing a clear separation of permissions between those who can execute the recovery and those who have access to backup data helps minimize the risk of unauthorized actions.
  4. Documentation: The plan should be well documented and easy to follow, so that operators can execute it reliably in stressful situations.
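The separation of permissions described in item 3 can be checked mechanically. The sketch below is a minimal illustration that assumes each role is represented as a set of principal names; the role sets, the account names, and the function name are all hypothetical and do not correspond to any particular cloud provider's access model.

```python
def separation_violations(recovery_executors: set, backup_data_readers: set) -> set:
    """Return the principals that hold both roles; ideally this set is empty."""
    return recovery_executors & backup_data_readers


# Hypothetical principals, for illustration only.
recovery_executors = {"dr-operator-1", "dr-operator-2"}
backup_data_readers = {"backup-auditor", "dr-operator-2"}

violations = separation_violations(recovery_executors, backup_data_readers)
if violations:
    print("Principals holding both roles:", sorted(violations))
else:
    print("Roles are fully separated.")
```

A check like this could be run as part of the regular testing in item 2, so that drift in role assignments is noticed before a real recovery is needed.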

References

  1. ^ "Systems and Operations Continuity: Disaster Recovery". Georgetown University - University Information Services. Archived from the original on 26 Feb 2012. Retrieved 20 July 2024.
  2. ^ "Disaster Recovery and Business Continuity". IBM. Archived from the original on January 11, 2013. Retrieved 20 July 2024.
  3. ^ "What is Business Continuity Management?". Disaster Recovery Institute International. Retrieved 20 July 2024.
  4. ^ "Defending The Data Strata". ForbesMiddleEast.com. December 24, 2013.[permanent dead link]
  5. ^ M. Niemimaa; Steven Buchanan (March 2017). "Information systems continuity process". ACM.com (ACM Digital Library).
  6. ^ "2017 IT Service Continuity Directory" (PDF). Disaster Recovery Journal. Archived from the original (PDF) on 2018-11-30. Retrieved 2018-11-30.
  7. ^ "ISO 22301 to be published Mid May - BS 25999-2 to be withdrawn". Business Continuity Forum. 2012-05-03. Retrieved 2021-11-20.
  8. ^ "Browse the Resource Hub for all the latest content | Axelos". www.axelos.com.
  9. ^ a b c "Like The NFL Draft, Is The Clock The Enemy Of Your Recovery Time". Forbes. April 30, 2015.
  10. ^ "Three Reasons You Can't Meet Your Disaster Recovery Time". Forbes. October 10, 2013.
  11. ^ a b c d "Understanding RPO and RTO". DRUVA. 2008. Retrieved February 13, 2013.
  12. ^ a b "How to fit RPO and RTO into your backup and recovery plans". SearchStorage. Retrieved 2019-05-20.
  13. ^ Richard May. "Finding RPO and RTO". Archived from the original on 2016-03-03.
  14. ^ "Data transfer and synchronization between mobile systems". May 14, 2013.
  15. ^ "Amendment #5 to S-1". SEC.gov. real-time ... provide redundancy and back-up to ...
  16. ^ Peter H. Gregory (2011-03-03). "Setting the Maximum Tolerable Downtime -- setting recovery objectives". IT Disaster Recovery Planning For Dummies. Wiley. pp. 19–22. ISBN 978-1118050637.
  17. ^ William Caelli; Denis Longley (1989). Information Security for Managers. Springer. p. 177. ISBN 1349101370.
  18. ^ "Catastrophe? It Can't Possibly Happen Here". The New York Times. January 29, 1995. .. patient records
  19. ^ "Commercial Property/Disaster Recovery". The New York Times. October 9, 1994. ...the disaster-recovery industry has grown to
  20. ^ Charlie Taylor (June 30, 2015). "US tech firm Sungard announces 50 jobs for Dublin". The Irish Times. Sungard .. founded 1978
  21. ^ Cassandra Mascarenhas (November 12, 2010). "SunGard to be a vital presence in the banking industry". Wijeya Newspapers Ltd. SunGard ... Sri Lanka's future.
  22. ^ SecaaS Category 9 // BCDR Implementation Guidance CSA, retrieved 14 July 2014.
  23. ^ "Threat and Hazard Identification and Risk Assessment (THIRA) and Stakeholder Preparedness Review (SPR): Guide Comprehensive Preparedness Guide (CPG) 201, 3rd Edition" (PDF). US Department of Homeland Security. May 2018.
  24. ^ "Post-Disaster Recovery Planning Forum: How-To Guide, Prepared by Partnership for Disaster Resilience". University of Oregon's Community Service Center, (C) 2007, www.OregonShowcase.org. Retrieved October 29, 2018.[permanent dead link]
  25. ^ "The Importance of Disaster Recovery". Retrieved October 29, 2018.
  26. ^ "IT Disaster Recovery Plan". FEMA. 25 October 2012. Retrieved 11 May 2013.
  27. ^ "Use of the Professional Practices framework to develop, implement, maintain a business continuity program can reduce the likelihood of significant gaps". DRI International. 2021-08-16. Retrieved 2021-09-02.
  28. ^ Gregory, Peter. CISA Certified Information Systems Auditor All-in-One Exam Guide, 2009. ISBN 978-0-07-148755-9. Page 480.
  29. ^ Brandon, John (23 June 2011). "How to Use the Cloud as a Disaster Recovery Strategy". Inc. Retrieved 11 May 2013.
  30. ^ "What Is Disaster Recovery as a Service (DRaaS)? | Definition from TechTarget". Disaster Recovery.
  31. ^ Engineering Resilient Systems on AWS. O'Reilly Media. ISBN 9781098162399.
  32. ^ Cloud Application Architectures Building Applications and Infrastructure in the Cloud. O'Reilly Media. ISBN 9780596555481.
  33. ^ Site Reliability Engineering How Google Runs Production Systems. O'Reilly Media. ISBN 9781491951170.
