Multiple critical systems crash simultaneously. How do you decide which system recovery tasks to prioritize?
When multiple systems fail, what drives your decision-making process? Share your approach to prioritizing recovery tasks.
-
When multiple critical systems crash at once, the priority is to restore those that impact essential functions and core operations, especially those affecting customers, safety, or security. I’d start by identifying which systems are foundational for others, so they’re restored first if other systems rely on them. Systems with the highest risk of data loss or corruption would also be prioritized. To speed things up, I'd try to have separate teams or automated processes handle lower-impact systems in parallel. Throughout the recovery, I’d keep stakeholders informed on progress and adjust priorities if anything changes. This way, we can quickly restore the most important services and minimize disruption.
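The "foundational systems first" idea above can be sketched as a dependency-aware ordering. This is only an illustration under assumptions: the system names, the `depends_on` map, and the `impact` scores are hypothetical, and a real incident would draw them from a CMDB or runbook. Every system is restored after the systems it relies on, and among systems that are ready, the one with the highest impact goes first.

```python
import heapq


def recovery_order(depends_on, impact):
    """Return a restore order in which every system follows its dependencies.

    Among systems whose dependencies are already restored, the one with the
    highest impact score is picked first (ties broken by name).
    """
    systems = set(depends_on) | {d for deps in depends_on.values() for d in deps}
    unmet = {s: 0 for s in systems}        # unrestored dependencies per system
    dependents = {s: [] for s in systems}  # reverse map: who waits on whom
    for system, deps in depends_on.items():
        for dep in set(deps):
            unmet[system] += 1
            dependents[dep].append(system)

    # Min-heap keyed on negative impact, so the highest impact pops first.
    ready = [(-impact.get(s, 0), s) for s in systems if unmet[s] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, system = heapq.heappop(ready)
        order.append(system)
        for waiting in dependents[system]:
            unmet[waiting] -= 1
            if unmet[waiting] == 0:
                heapq.heappush(ready, (-impact.get(waiting, 0), waiting))

    if len(order) != len(systems):
        raise ValueError("circular dependency between systems")
    return order


# Hypothetical example data, not taken from any real environment.
depends_on = {
    "network": [],
    "database": ["network"],
    "auth": ["network", "database"],
    "storefront": ["auth", "database"],
    "reporting": ["database"],
}
impact = {"network": 5, "database": 5, "auth": 4, "storefront": 5, "reporting": 1}

print(recovery_order(depends_on, impact))
# ['network', 'database', 'auth', 'storefront', 'reporting']
```

Note that the low-impact reporting system lands last even though it only needs the database, which is where a separate team or an automated process could pick it up in parallel.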
-
Recovery Time Objective (RTO): Prioritize systems with the shortest RTOs to minimize downtime. Recovery Point Objective (RPO): Prioritize systems with lower RPOs to minimize data loss. Resource Availability: Consider the resources needed for each recovery task and allocate them efficiently. Dependency: Account for systems that rely on others for recovery.
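As a rough sketch of the RTO and RPO rule above, the snippet below sorts systems by shortest RTO and then by lowest RPO. The `System` class, its field names, and the sample values are assumptions made for illustration; resource availability and dependencies are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    rto_minutes: int  # maximum tolerable downtime
    rpo_minutes: int  # maximum tolerable window of data loss


def by_objectives(systems):
    """Order systems by shortest RTO first, then by lowest RPO."""
    return sorted(systems, key=lambda s: (s.rto_minutes, s.rpo_minutes))


# Hypothetical objectives; real values come from SLAs or a business impact analysis.
systems = [
    System("reporting", rto_minutes=480, rpo_minutes=1440),
    System("payments", rto_minutes=15, rpo_minutes=0),
    System("email", rto_minutes=60, rpo_minutes=60),
    System("orders", rto_minutes=15, rpo_minutes=5),
]

print([s.name for s in by_objectives(systems)])
# ['payments', 'orders', 'email', 'reporting']
```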
-
When multiple systems fail at once, it is purely due to a lack of governance or change management, non-compliance with high-availability (HA) or disaster-recovery (DR) requirements, or possibly a single point of failure (SPOF). In this type of environment, if the failure occurs during working hours, there will be enormous escalation. The first thing is to isolate your technical team from that escalation so they can work on identifying the issues in the core business applications that are impacting customers. In parallel, move to the business continuity plan to keep operations running; this might mean switching to paper-based processes if required. Note: priorities will change depending on the type of industry and the time of day the failure occurs.
-
Looking at this from my own field, occupational safety, I prioritize the failures that pose a serious and imminent risk to the physical integrity of any employee. Once safety is guaranteed, all efforts turn to quickly building a criticality matrix of the remaining problems and the resources they require, so that work can proceed on several fronts, starting with the most critical and moving down to the least critical (black).
-
Identify which systems are essential to core business operations. Some systems may depend on others to function. Start with foundational systems to prevent rework and ensure other systems can be restored effectively. Consider systems with compliance obligations and prioritize those to minimize legal or financial risk. If SLAs exist, prioritize systems with stricter recovery time objectives (RTOs) and recovery point objectives (RPOs) to meet contractual obligations. If certain systems impact a larger number of users or high-priority users, prioritize their recovery to reduce operational disruption. Check for available resources, including personnel with specific expertise and tools.
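One hedged way to combine the factors listed above is a simple weighted score. The factor names and weights below are illustrative assumptions, not an established standard; each organization would choose its own.

```python
# Illustrative weights (assumptions): a higher weight means the factor
# counts more toward recovery priority.
WEIGHTS = {
    "business_critical": 4,
    "compliance": 3,
    "strict_sla": 2,
    "many_users": 1,
}


def score(factors):
    """Sum the weights of the factors that apply to a system."""
    return sum(WEIGHTS[f] for f in factors)


def rank(systems):
    """Return system names ordered from highest to lowest score."""
    return sorted(systems, key=lambda name: score(systems[name]), reverse=True)


# Hypothetical systems and the factors that apply to each.
systems = {
    "intranet": {"many_users"},
    "billing": {"business_critical", "compliance", "strict_sla"},
    "crm": {"business_critical", "many_users"},
}

print(rank(systems))
# ['billing', 'crm', 'intranet']
```

A score like this only ranks systems against each other; dependencies and the availability of people with the right expertise still have to be checked before work is assigned.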
-
1. Business impact analysis: identify which systems are important to recover in the shortest possible time. 2. System restoration: maintain clear, objective communication with the IT team to ensure alignment on actions and speed of execution.
-
I would consider which system's outage would hurt the business and customers the most and make restoring it the priority. Quick fixes would be the next option. Another consideration is why this happened: are we fixing something that actually needs a full overhaul to stop it from happening again? Is it happening because of outdated equipment? The final consideration is whether the users of these critical systems are properly trained. Is it a user issue or a systems issue?
-
1. Get the core devices up somehow (routers, switches, crucial internal systems, etc.); don't focus on how cleanly you do it, but on how fast you get them back live. 2. Premium customers' systems, regardless of what systems they are. 3. High-revenue devices, regardless of the customer. 4. Handle the rest. 5. Clean up the mess caused. 6. Turn the temporary solutions into permanent ones, and focus on doing it neatly.
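The first four steps above amount to a fixed tier order, which could be sketched as follows. The `Asset` fields and the sample assets are hypothetical; steps 5 and 6 (clean-up and making temporary fixes permanent) happen after recovery and are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    core: bool = False              # routers, switches, crucial internal systems
    premium_customer: bool = False
    high_revenue: bool = False


def tier(asset):
    """Map an asset to its recovery tier: 1 is restored first, 4 last."""
    if asset.core:
        return 1
    if asset.premium_customer:
        return 2
    if asset.high_revenue:
        return 3
    return 4


def restore_queue(assets):
    # sorted() is stable, so assets in the same tier keep their input order.
    return sorted(assets, key=tier)


# Hypothetical assets for illustration only.
assets = [
    Asset("marketing-site"),
    Asset("ad-server", high_revenue=True),
    Asset("core-router", core=True),
    Asset("acme-portal", premium_customer=True),
]

print([a.name for a in restore_queue(assets)])
# ['core-router', 'acme-portal', 'ad-server', 'marketing-site']
```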
-
If I find myself facing multiple system crashes, I will first request a communication so that all stakeholders and customers are aware of the current issue and have transparency into it. I will then identify the core systems critical to business operations and activate disaster recovery procedures. I will declare a Major Incident through Incident Management to quickly bring in the necessary stakeholders, support staff, and technical personnel, and engage them on a technical bridge call. If I encounter critical data corruption, I will recover from our backup management system. If changes are needed to recover from the crash, we will raise an Emergency Change Request (ECR).