You're facing system downtime. How can you prioritize tasks with conflicting priorities?
Curious about navigating system outages? Share your strategies for tackling tasks when everything's on the line.
-
System outages are a nightmare. Many engineers may never deal with one directly in their entire career, but if you do, you are surely working on something impactful. When an outage occurs, immediately pull up all dashboards and monitors to see the impact, including the impact on dependencies. Keep the relevant engineering teams informed and focus on what can be done right away to restore the system. The dashboards will hint at where the root cause might lie, so you know where to start looking for possible mitigations. Throughout the process, remember that the goal is to get the system up and running; don’t delve into the “whys” or a retrospective at this point! Your business is losing money, and customers are losing money and trust!
-
When facing system downtime and conflicting priorities, there are many things to consider before taking action. I personally would follow these key steps. 1) Assess the impact: Focus on areas where downtime affects critical services and prioritize the tasks that reduce the largest risks first. 2) Identify task dependencies: Prioritize tasks that unblock others or restore key systems, as they may accelerate overall recovery. 3) Implement temporary solutions: In some cases, a quick fix or workaround can restore partial functionality while you work on long-term solutions. 4) Communicate: Keep all teams and stakeholders informed about what’s being prioritized and why.
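Steps 1 and 2 can be sketched as a simple sort. This is only an illustration: the `RecoveryTask` class, the 1-5 impact scale, and the task names are hypothetical, and a real incident would need judgment that a score cannot capture.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryTask:
    name: str
    impact: int  # hypothetical 1-5 score; 5 = a critical service is down
    blocks: list = field(default_factory=list)  # names of tasks this one unblocks


def prioritize(tasks):
    """Order tasks by impact first, then by how many other tasks they unblock."""
    return sorted(tasks, key=lambda t: (-t.impact, -len(t.blocks), t.name))


# Made-up example tasks, for illustration only.
tasks = [
    RecoveryTask("update status page", impact=2),
    RecoveryTask("restore database", impact=5, blocks=["restart API", "replay queue"]),
    RecoveryTask("restart API", impact=5),
    RecoveryTask("replay queue", impact=3),
]

ordered = [t.name for t in prioritize(tasks)]
print(ordered)
```

Sorting on impact before the unblock count reflects the order of the steps above; swapping the two keys would favor unblocking work over raw impact.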
-
- I would ask myself whether the issue is related to hardware, to software, or to a mix of both. This helps narrow down the root cause of the outage, which in turn makes it easier to decide exactly which task should come first to restore the system to its operational state. - Using tools to monitor the system and record logs helps us see whether performance is degrading over time. We can then make the necessary arrangements, such as tweaking configurations or upgrading hardware, to prevent a gradual failure that could bring the business to a standstill.
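One way to spot gradual degradation in logged measurements is to compare recent samples against an earlier baseline. The sketch below assumes latency samples in milliseconds have already been extracted from the logs; the function name, window size, and threshold are illustrative choices, not recommendations.

```python
from statistics import mean


def is_degrading(latencies_ms, window=5, threshold=1.5):
    """Return True when the mean of the last `window` samples exceeds
    `threshold` times the mean of all earlier samples."""
    if len(latencies_ms) < 2 * window:
        return False  # not enough history to compare
    baseline = mean(latencies_ms[:-window])
    recent = mean(latencies_ms[-window:])
    return recent > threshold * baseline


# Made-up samples: stable around 100 ms, then climbing.
samples = [100, 105, 98, 102, 101, 99, 150, 170, 190, 210, 240]
print(is_degrading(samples))
```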
-
Identify the Issue: Determine if the downtime is due to a specific application, network problem, or server issue. Restart Devices: Sometimes a simple restart of the device or application can resolve temporary issues. Check Connections: Ensure that all cables and connections are secure. For network issues, check the Wi-Fi or ethernet connections. Consult Status Pages: Check the service provider's status page or social media for updates on outages. Use Alternative Tools: If possible, switch to backup systems or tools while the main system is down.
-
A practical approach is to assess the impact and urgency of each task. Focus on critical issues affecting the most users or business functions first. Then, communicate clearly with stakeholders to align priorities and ensure the most significant problems are resolved quickly.
-
If I am facing system downtime, I will do the following: 1. Set an alarm on the system's database so that I am notified of downtime; on AWS, for example, we can set up a custom alarm. 2. Break down my tasks into categories (such as urgent and important). 3. Inform my team and other stakeholders about the downtime and the task prioritization. It builds trust. 4. If possible, delegate less critical tasks to other team members so that I can concentrate on the most critical ones. 5. Set a timeframe for every single task, to avoid chaos. 6. Regularly review and revisit the priorities. Hope this helps.
-
When facing system downtime, I prioritize tasks by assessing the impact on critical operations. I identify the affected systems and gather information from stakeholders to understand urgency. Using a priority matrix, I sort tasks by urgency and importance, separating urgent and important issues from less critical ones. For example, when the Patient Management System at Al Rahma Hospital went down, I prioritized fixing the database connection, as it was critical for patient care. I communicated updates to staff while focusing on immediate fixes. After resolving the issue, I reviewed the incident to identify lessons learned for future improvements.
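A priority matrix like the one mentioned above can be sketched as a small function. The quadrant labels and the example tasks other than the database connection are assumptions for illustration, not a record of how the incident was actually triaged.

```python
def quadrant(urgent, important):
    """Map a task's urgency and importance to a priority-matrix quadrant."""
    if urgent and important:
        return "do now"
    if important:
        return "schedule"
    if urgent:
        return "delegate"
    return "defer"


# Hypothetical tasks: (name, urgent, important).
tasks = [
    ("fix database connection", True, True),
    ("review backup policy", False, True),
    ("answer status requests", True, False),
    ("tidy dashboard layout", False, False),
]

matrix = {name: quadrant(urgent, important) for name, urgent, important in tasks}
print(matrix)
```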
-
- The first step is to stay calm and active. A system can be down for many reasons, and a quick restart without investigation can turn out to be the wrong move later. - Inform the teams, the clients and, if needed, the users. - Go through the logs, alerts and monitoring to determine why the system is down. Try to find the cause and, most importantly, the effect a restart or reboot would have. - Decide with the team which servers to restart, where to find and fix the problem before bringing the server back up, where a rollback is needed, which servers need to be deployed quickly, how much time each step will take, and so on. Bear in mind that this is not straightforward; think before you take each step.
-
To clarify, interrupts are signals that a device, such as a mouse or hard drive, sends to the CPU to tell it to immediately stop what it is doing and do something else. When a specific interrupt arises, the CPU looks for an entry for that specific interrupt in a table provided by the OS. Once the CPU finds the entry for the interrupt, it jumps to the code that the entry points to. This code that runs in response to the interrupt is known as an interrupt service routine (ISR) or interrupt handler. In this case, I recommend applying IRQ (“interrupt request”) by performing the following steps:
-
1. We generate an IRQ in a process that responds to the event behind the problem (e.g. a key press). 2. When the interrupt is taken, the CPU changes mode (from user mode to kernel mode). 3. The IRQ is then processed by the OS. The function that handles the key press is looked up in the interrupt descriptor table (IDT). IMPORTANT: The IDT contains routines to handle not only hardware interrupts (IRQs) but also software interrupts. 4. The routine is then executed in kernel mode to process the IRQ. 5. Finally, control is returned to the context that was interrupted. Keep in mind that the code residing in these routines could be altered, intercepted, etc.
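The flow in these steps can be mimicked with a toy simulation. This is not kernel code: `ToyCPU`, the handler, and the vector number 33 are all hypothetical, and a real IDT is a hardware-defined structure rather than a dictionary.

```python
class ToyCPU:
    """Toy model of interrupt dispatch; a dict stands in for the IDT."""

    def __init__(self):
        self.mode = "user"
        self.idt = {}   # vector number -> handler function
        self.log = []   # records which handler ran and in which mode

    def register(self, vector, handler):
        self.idt[vector] = handler

    def interrupt(self, vector):
        saved_mode = self.mode       # remember the interrupted context
        self.mode = "kernel"         # step 2: switch to kernel mode
        try:
            handler = self.idt[vector]  # step 3: look up the routine
            handler(self)               # step 4: run it in kernel mode
        finally:
            self.mode = saved_mode      # step 5: return to the interrupted context


def keyboard_isr(cpu):
    """Hypothetical interrupt service routine for a key press."""
    cpu.log.append(("keyboard", cpu.mode))


cpu = ToyCPU()
cpu.register(33, keyboard_isr)  # vector 33 is an arbitrary choice for this sketch
cpu.interrupt(33)
print(cpu.log, cpu.mode)
```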