Written by Maciej Gajdzica (Senior Software Developer)
While designing safety-critical systems, we need to make sure that the devices perform their assigned tasks properly while keeping the risk of any dangerous occurrence reasonably low.
It does not mean that such projects are exempt from limited budgets and fixed schedules. On the contrary, developing such systems is a difficult compromise: a strong focus on safety means higher complexity and more safety functions, while limited time and resources increase the risk of mistakes. The way to identify possible problems and choose the right countermeasures for the assumed safety level is a proper risk analysis.
The first and most important rule of risk analysis is doing it right from the beginning of the project. We simply cannot guarantee sufficient safety if we bolt it on in the final stages. Neither can we succeed if we design individual components separately and try to integrate them at the end while hoping to retain the overall safety level – integration mistakes are among the biggest sources of software vulnerabilities. That is why we have no choice in the matter: we need to begin analysing risk right from the start. But how?
Risk analysis includes two stages:

1. Compiling a list of all the events which might impact the system. This should be done right at the beginning, at the stage of forming the general concept of the product and laying out the requirements. In the course of the project the list should be expanded and filled with more detail in accordance with the stages of designing, implementing, testing, and even documenting.
2. Analysing each and every entry in the list – this is the crux of the process.
We need to take every single occurrence from the list and consider its impact on the system’s operation, its consequences, the probability of it actually happening, and the ways to prevent or counteract it. This step should be performed repeatedly: in the earlier stages we may not have the full view of the problem and may assess the impact wrongly. What is more, a scrupulous analysis may indicate the need for changes in the approach early in the project, so that the potential problems it uncovers can be avoided.
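One entry of such a hazard list, together with the repeated re-assessment described above, could be sketched as a simple record. The field names and sample values here are purely illustrative, not taken from any standard:

```python
from dataclasses import dataclass, field

@dataclass
class HazardEntry:
    """One entry in the hazard list, revisited at every project stage."""
    event: str                   # what could happen
    impact: str                  # effect on the system's operation
    consequences: str            # what the outcome would be
    probability: str             # e.g. "frequent" ... "improbable"
    countermeasures: list = field(default_factory=list)

    def revisit(self, new_probability=None, extra_countermeasure=None):
        # Re-assessment as the project progresses and our view improves.
        if new_probability is not None:
            self.probability = new_probability
        if extra_countermeasure is not None:
            self.countermeasures.append(extra_countermeasure)

entry = HazardEntry(
    event="loss of sensor signal",
    impact="controller receives stale data",
    consequences="actuator may act on outdated state",
    probability="occasional",
)
# Later stage: a countermeasure lowers the assessed probability.
entry.revisit(new_probability="remote",
              extra_countermeasure="redundant sensor with plausibility check")
print(entry.probability, len(entry.countermeasures))  # remote 1
```

The point of keeping such records is that each one is re-opened at every stage, rather than assessed once and forgotten.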
During the analysis of the particular risks from the list, the risk matrix is prepared – a table juxtaposing how severe the consequences of an event would be and how probable the event is. Every element from the list is assigned these two values – from catastrophic to negligible, and from frequent to improbable – and placed within the grid.
The matrix’s coloured fields indicate how dangerous a risk is, and depending on the required SIL (Safety Integrity Level) some fields are not allowed to be occupied by any records. If a risk falls into one of those fields, we need to make the effort to move it to an acceptable area – to the right by reducing its negative consequences, or down by reducing or eliminating the probability of it happening.
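The classification a risk matrix performs can be sketched in a few lines. This is only an illustration: the article names just the endpoint categories (catastrophic/negligible, frequent/improbable), so the intermediate names, the scoring, and the thresholds below are assumptions; a real project takes its matrix regions from its standard and the assumed SIL:

```python
from enum import IntEnum

class Severity(IntEnum):
    # Ordered from worst to mildest; intermediate names are illustrative.
    CATASTROPHIC = 4
    CRITICAL = 3
    MARGINAL = 2
    NEGLIGIBLE = 1

class Probability(IntEnum):
    FREQUENT = 5
    PROBABLE = 4
    OCCASIONAL = 3
    REMOTE = 2
    IMPROBABLE = 1

def risk_class(severity: Severity, probability: Probability) -> str:
    # Hypothetical thresholds standing in for the coloured matrix regions.
    score = severity * probability
    if score >= 15:
        return "unacceptable"
    if score >= 8:
        return "undesirable"
    if score >= 4:
        return "tolerable"
    return "acceptable"

# A risk in a forbidden cell must be moved: here, reducing the event's
# probability (moving "down" in the matrix) makes it tolerable.
print(risk_class(Severity.CATASTROPHIC, Probability.OCCASIONAL))  # undesirable
print(risk_class(Severity.CATASTROPHIC, Probability.IMPROBABLE))  # tolerable
```

Which of the resulting classes are permitted at all is exactly what the chosen SIL dictates.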
The actions we can take involve three areas: hardware, software, and people or procedures. The hardware solutions are the preferred ones, as machines are quick, consistent, predictable in their failures, and easily tested for proper operation. The second choice would be the software solutions. Software is equally fast, can be fully automated, and can perform much more sophisticated operations than machines. Yet it is also much more complex, making it more difficult to prove the system works properly and increasing the risk of introducing new errors with code changes. The least valued solutions involve people: people often work less efficiently and less consistently than machines, and they suffer from fatigue, stress, and boredom.
Importantly, their mistakes may lead to legal actions against them – as Nancy Leveson, a recognised expert in system safety, states: “insisting that operators always follow procedures does not guarantee safety although it does usually guarantee that there is someone to blame – either for following the procedures or for not following them – when things go wrong.”
Despite human unreliability, the operator is indispensable: their decision is needed when the safety-critical system detects a potential risk, enters the safe state, and refuses further operation. In such a case it will not attempt to continue its work on its own. Instead, it will wait for the operator’s decision. It might turn out to be a false alarm, but for the system it is better to announce a false alarm than to miss a real threat. The operator then assesses the situation and resumes the system’s operation. Let’s just remember not to overuse this mechanism – if false positives happen too often, people will stop paying attention to them and leave the door open to a real problem.
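The safe-state behaviour described above is essentially a small latching state machine: a detected fault locks the device, and only an explicit operator acknowledgement unlocks it. A minimal sketch, with class and method names invented for illustration:

```python
from enum import Enum, auto

class State(Enum):
    OPERATIONAL = auto()
    SAFE_STATE = auto()

class SafetyDevice:
    """Sketch of the fail-safe pattern: on a detected fault the device
    latches into the safe state and refuses to work until an operator
    explicitly acknowledges the situation."""

    def __init__(self) -> None:
        self.state = State.OPERATIONAL
        self.last_fault = None

    def report_fault(self, description: str) -> None:
        # Any potential risk latches the safe state; the device does not
        # try to guess on its own whether the alarm is false.
        self.state = State.SAFE_STATE
        self.last_fault = description

    def perform_task(self) -> bool:
        # In the safe state every request is refused.
        return self.state is State.OPERATIONAL

    def operator_acknowledge(self) -> None:
        # Only an explicit operator decision resumes operation.
        self.state = State.OPERATIONAL

device = SafetyDevice()
device.report_fault("sensor disagreement")
print(device.perform_task())   # False: waiting for the operator
device.operator_acknowledge()
print(device.perform_task())   # True: operation resumed
```

Note that the device never transitions out of the safe state by itself; the asymmetry (automatic entry, manual exit) is what makes the pattern fail-safe.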