Last updated on Oct 4, 2024

Multiple critical systems crash simultaneously. How do you decide which system recovery tasks to prioritize?

When multiple systems fail, what drives your decision-making process? Share your approach to prioritizing recovery tasks.

Operating Systems

116 answers

Agha Muhsin

Technical Project Manager | Expert in Data & Information Security | Leading Teams to Deliver Secure, Compliant, and Innovative Solutions
When multiple critical systems crash at once, the priority is to restore those that impact essential functions and core operations, especially those affecting customers, safety, or security. I’d start by identifying which systems are foundational for others, so they’re restored first if other systems rely on them. Systems with the highest risk of data loss or corruption would also be prioritized. To speed things up, I'd try to have separate teams or automated processes handle lower-impact systems in parallel. Throughout the recovery, I’d keep stakeholders informed on progress and adjust priorities if anything changes. This way, we can quickly restore the most important services and minimize disruption.
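
The dependency-first idea in this answer can be sketched as a topological sort that breaks ties by impact. This is only an illustration: the system names, the dependency map, and the 1-5 impact ratings are hypothetical, and every dependency is assumed to appear as a key in the map.

```python
import heapq

def recovery_order(depends_on, impact):
    """Return a restore order in which each system follows the systems it
    depends on; among systems that are ready, higher impact goes first."""
    # Count unmet dependencies and record which systems wait on each one.
    pending = {name: len(deps) for name, deps in depends_on.items()}
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    # Min-heap keyed on negated impact, so the highest impact pops first.
    ready = [(-impact[name], name) for name, n in pending.items() if n == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, name = heapq.heappop(ready)
        order.append(name)
        for waiter in dependents[name]:
            pending[waiter] -= 1
            if pending[waiter] == 0:
                heapq.heappush(ready, (-impact[waiter], waiter))
    if len(order) != len(depends_on):
        raise ValueError("circular dependency between systems")
    return order

# Hypothetical environment: each system lists what it needs to be running.
systems = {
    "network": [],
    "database": ["network"],
    "auth": ["network", "database"],
    "billing": ["database", "auth"],
    "reporting": ["database"],
}
impact = {"network": 5, "database": 5, "auth": 4, "billing": 5, "reporting": 1}
print(recovery_order(systems, impact))
# ['network', 'database', 'auth', 'billing', 'reporting']
```

Low-impact systems such as "reporting" end up last in the list, which makes them natural candidates to hand to a separate team or an automated process in parallel.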

Sachin Trambadiya

Technical Project Manager @ focus | M.tech
Recovery Time Objective (RTO): Prioritize systems with the shortest RTOs to minimize downtime.
Recovery Point Objective (RPO): Prioritize systems with lower RPOs to minimize data loss.
Resource Availability: Consider the resources needed for each recovery task and allocate them efficiently.
Dependency: Account for systems that rely on others for recovery.
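
As a rough sketch of the RTO and RPO criteria, the recovery queue can be sorted by the shortest RTO first, with RPO as the tie-breaker. The `RecoveryTarget` type, the system names, and all the minute values below are made up for illustration; resource availability and dependencies are not modeled here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecoveryTarget:
    name: str
    rto_minutes: int  # maximum tolerable downtime
    rpo_minutes: int  # maximum tolerable window of lost data

def by_objectives(targets):
    """Order targets by tightest RTO, then by tightest RPO."""
    return sorted(targets, key=lambda t: (t.rto_minutes, t.rpo_minutes))

# Hypothetical objectives; real values would come from the DR plan or SLAs.
targets = [
    RecoveryTarget("reporting", 1440, 1440),
    RecoveryTarget("payments", 15, 0),
    RecoveryTarget("email", 240, 60),
    RecoveryTarget("order-db", 15, 5),
]
queue = [t.name for t in by_objectives(targets)]
print(queue)
# ['payments', 'order-db', 'email', 'reporting']
```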

Huzefa Ali

Enterprise Architect ,Togaf Certified,ITIL,System Engineering,Service Assurance,IOT,Smart Home
When multiple systems fail at once, it is purely due to a lack of governance or change management, non-compliance with high-availability (HA) or disaster-recovery (DR) requirements, or possibly a single point of failure (SPOF). In that kind of environment, if this happens during working hours, there will be enormous escalation. The first thing is to isolate your technical team from the escalation so it can work on identifying the issues in the core business applications that impact customers. In parallel, move to the business continuity plan to keep operations running, which might mean switching to paper-based processes if required. Note: priorities will change with the type of industry and the time of day the failure occurs.

Leandro Stein

Occupational Health and Safety Coordinator | Occupational Safety Engineering Consultant | Production Engineer | Green Belt | Lean Six Sigma
Looking at this from my own field, occupational safety, I prioritize the failures that pose a serious and imminent risk to the physical integrity of any employee. Once safety is guaranteed, all efforts turn to quickly building a criticality matrix of the remaining problems and the resources they need, so that work can proceed on several fronts, starting with the most critical and moving on to the least critical (black).

Phillip Kolale

Computer Technician at the University of Eldoret and Website and Graphics Design Consultant.
Identify which systems are essential to core business operations. Some systems may depend on others to function. Start with foundational systems to prevent rework and ensure other systems can be restored effectively. Consider systems with compliance obligations and prioritize those to minimize legal or financial risk. If SLAs exist, prioritize systems with stricter recovery time objectives (RTOs) and recovery point objectives (RPOs) to meet contractual obligations. If certain systems impact a larger number of users or high-priority users, prioritize their recovery to reduce operational disruption. Check for available resources, including personnel with specific expertise and tools.
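
One way to combine several of these criteria is a weighted score. The sketch below is an assumption-laden illustration, not a standard formula: the criterion names, the weights, the 0-5 rating scale, and the example systems are all hypothetical, and dependencies and staffing would still need to be checked separately.

```python
# Hypothetical weights; a real team would agree on these in advance.
WEIGHTS = {
    "business_critical": 4,
    "compliance": 3,
    "sla_breach_risk": 3,
    "users_affected": 2,
}

def priority_score(ratings, weights=WEIGHTS):
    """Sum each 0-5 rating multiplied by its weight; missing ratings count as 0."""
    return sum(weight * ratings.get(criterion, 0)
               for criterion, weight in weights.items())

# Hypothetical ratings for three failed systems.
failed = {
    "erp": {"business_critical": 5, "compliance": 4,
            "sla_breach_risk": 5, "users_affected": 4},
    "wiki": {"business_critical": 1, "sla_breach_risk": 1, "users_affected": 3},
    "patient-records": {"business_critical": 4, "compliance": 5,
                        "sla_breach_risk": 3, "users_affected": 2},
}
ranked = sorted(failed, key=lambda name: priority_score(failed[name]), reverse=True)
print(ranked)
# ['erp', 'patient-records', 'wiki']
```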

Givanildo Melo
1. Business impact analysis: identify which systems must be recovered in the shortest possible time. 2. System restoration: keep communication with the IT team clear and objective to ensure alignment on the actions and speed of execution.

Dawn De Maroussem

"Driven leader passionate about fostering growth and motivating teams in dynamic environments. Embracing fast-paced challenges to unlock potential and drive success together! #Leadership #Growth #Motivation"
I would consider which system's outage would hurt the business and customers the most and make restoring it the priority. Quick fixes would be the next option. Another consideration is why this happened: are we fixing something that perhaps needs a full overhaul to stop it from happening again? Is it happening because of outdated equipment? The last consideration is whether the users of these critical systems are properly trained. Is it a user issue or a system issue?

Tamer Taani

Manager - Data Centers Europe at velia.net Internetdienste GmbH
1. Get the core devices up somehow (routers, switches, crucial internal systems, etc.). Don't focus on how cleanly you do it, but on how fast you get them back live.
2. Premium customers' systems, regardless of what systems they are.
3. High-revenue devices, regardless of the customer.
4. Handle the rest.
5. Clean up the mess caused.
6. Turn the temporary solutions into permanent ones, and focus on doing it neatly.
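
The first four steps amount to sorting failed devices into tiers. A minimal sketch, assuming each device record carries a core flag, a premium-customer flag, and a monthly revenue figure; the field names, the revenue threshold, and the sample devices are hypothetical.

```python
HIGH_REVENUE_THRESHOLD = 10_000  # hypothetical cut-off, in whatever currency applies

def tier(device):
    """Map a device to a recovery tier; lower numbers are restored first."""
    if device["core"]:
        return 1  # routers, switches, crucial internal systems
    if device["premium_customer"]:
        return 2  # premium customers' systems, whatever they are
    if device["monthly_revenue"] >= HIGH_REVENUE_THRESHOLD:
        return 3  # high-revenue devices, whoever the customer is
    return 4      # everything else

# Hypothetical inventory of failed devices.
devices = [
    {"name": "web-42", "core": False, "premium_customer": False, "monthly_revenue": 300},
    {"name": "db-7", "core": False, "premium_customer": False, "monthly_revenue": 25_000},
    {"name": "edge-router-1", "core": True, "premium_customer": False, "monthly_revenue": 0},
    {"name": "vip-app-3", "core": False, "premium_customer": True, "monthly_revenue": 900},
]
# sorted() is stable, so devices in the same tier keep their original order.
plan = [d["name"] for d in sorted(devices, key=tier)]
print(plan)
# ['edge-router-1', 'vip-app-3', 'db-7', 'web-42']
```

Steps 5 and 6 (cleaning up and making temporary fixes permanent) happen after service is restored, so they are left out of the ordering.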

Edward Rebudan

Platform Operations (DevSecOps, AVD) - Infrastructure Reliability Engineer , Global Operations Command Center (Manulife) | Certified in Cybersecurity, ISC2 | ITILv4 Certified
If I find myself in multiple system crashes, I will request communication to ensure all stakeholders and customers are aware of and have transparency into the current issue. I will then identify the core systems critical to business operations and activate disaster recovery procedures. I will gather stakeholders and initiate a Major Incident in Incident Management to expedite the necessary stakeholder, support, and technical personnel. I will engage them in a technical bridge call. If I encounter critical data corruption, I will recover from our backup management system. And if changes are needed to recover from the crash, we will initiate an Emergency Change Request (ECR).
