Development process for safety-critical systems

Written by Maciej Gajdzica (Senior Software Developer)

The Therac-25 case – a lesson learned

Therac-25 was a radiotherapy device from the 1980s, installed in hospitals across the USA and Canada. It has earned its reputation as the most infamous case of software errors resulting in human deaths. Between 1985 and 1987 it was involved in six accidents, each time delivering a massive radiation overdose to a patient; several of these overdoses proved fatal.


The experts designated by the court found the exact lines of code responsible for the accidents, yet they did not stop there – they concluded that the flaws in the code stemmed directly from flaws in the software development process. Further investigation revealed that the whole code had been written by one person, without any validation, verification, or review by other developers. The process also lacked basic documentation, such as an architecture specification and requirements.

 

These were not the only problems. A faulty risk assessment had led the manufacturer to drop the mechanical safeguards that would have prevented excessive radiation. No tests were performed before the device was introduced into clinical use. Reports of malfunctions were dismissed by the manufacturer, and the hospital staff were blamed instead. In the end, the manufacturer was forced to pay heavy fines, lost its reputation, and had to withdraw from the medical market.

 

The case of Therac-25 was an important lesson for all developers working on life-and-death systems. It made us understand that relying on individual skill and experience is not enough to ensure a sufficient level of safety. Everyone makes mistakes.

 

A safety net against programming mistakes

The software development process therefore has to act as a safety net against programming mistakes. Such mistakes are bound to occur, but they need not severely impact the project: a well-organized team can intercept and remove them before any faulty code reaches the final product.

 

A safety net of this kind catches mistakes and errors at different stages and does not let them slip through to the final product. The process built around this idea is the V-model, specified in norms such as IEC 62304 (medical devices), DO-178C (avionics), and ISO 26262 (automotive).

 

The V-model consists of three parts – Design, Implementation, and Verification. During design, a general concept of the system is developed, requirements are listed, the system architecture is defined, and the detailed behaviour of each module is specified. Implementation means transforming this design into source code. Finally, verification involves tests at different levels – unit, integration, and system testing – followed by certification, in which an independent institution reviews the project’s compliance with the norms.

The model owes its name to its characteristic graphic representation – the shape of the letter “V”, where each element on the right-hand side verifies the product against the corresponding part of the design stage on the left-hand side. In this scheme the implementation details are checked by unit tests, the cooperation of modules is checked by integration tests, and the requirements are checked by acceptance tests.
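To make the lowest level of that pairing concrete, here is a minimal sketch of a unit test in plain C. The dose_within_limits() function, its limits, and the assert-based checks are purely hypothetical, invented only to show how implementation details are verified against the detailed design.

    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical module function from the detailed design: a requested
     * dose value is accepted only if it lies within the configured limits. */
    static bool dose_within_limits(int requested, int min, int max)
    {
        return (requested >= min) && (requested <= max);
    }

    /* Unit test checking the implementation against the detailed design,
     * i.e. the lowest rung of the V-model pairing described above. */
    int main(void)
    {
        assert(dose_within_limits(50, 1, 100));    /* nominal value accepted    */
        assert(!dose_within_limits(0, 1, 100));    /* below lower limit: reject */
        assert(!dose_within_limits(101, 1, 100));  /* above upper limit: reject */
        return 0;
    }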

 

Safety-critical software documentation

It seems obvious that such a process has to be complemented with extensive documentation. At each stage of design and verification several documents are created – plans, specifications, reports, risk assessments, and so on. Preparing these documents is a crucial step, necessary for certifying the product and finalizing the project.

The V-model is easily recognized as similar to the Waterfall model, which has been considered inefficient for the last twenty years. The Big Design Up Front approach leads to endless prolonging of the project, an inability to predict many problems, and a tedious process of introducing changes. As a result, the quality of a system produced this way is rarely satisfactory. Why, then, should the most safety-sensitive systems be developed in this fashion?

In reality it looks a bit different. As it is aptly put in the medical norm IEC 62304:

“It does not require that any particular life-cycle model is used, but it does require that the plan include certain ACTIVITIES and have certain ATTRIBUTES.”

In practice, the norm does not impose any particular approach to the product life-cycle. Instead, it grants us considerable freedom, provided that we carry out certain activities and thereby produce the required documents. It has therefore become much more common to run the development in iterations and to check its compliance with the norms at the certification stage.

 


 

(Originally published on LinkedIn)


Multi-processor solutions in safety-critical systems

Written by Maciej Gajdzica (Senior Software Developer)

Multi-processor solutions are overwhelmingly popular in safety-critical systems. Contrary to popular belief, increased performance is usually not the reason.


The two prevalent reasons are increased safety by means of redundancy, and simplified development by means of moving complicated, yet less critical, elements to separate processors.

Let’s start with safety – the system needs to remain safe at all times, even when a processor malfunctions or fails. In case of a severe failure the processor might not be able to switch the system to the safe state on its own. Processor redundancy is therefore indispensable, as it provides a second processing unit that can detect the problem and react.

The simplest redundant configuration uses a supervising processor. The main processor implements all the major functions, while the additional processor merely monitors its operation and intervenes when it detects significant anomalies.
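A minimal sketch of such a supervising processor is shown below. The helper functions (read_main_cpu_heartbeat(), read_main_cpu_output(), force_safe_state()) and the thresholds are hypothetical placeholders for whatever the real hardware provides; only the supervision logic is the point.

    #include <stdint.h>

    /* Hypothetical I/O helpers of the supervising processor; on real hardware
     * these would read a heartbeat line / shared values and drive a safety relay. */
    extern uint32_t read_main_cpu_heartbeat(void);
    extern int32_t  read_main_cpu_output(void);
    extern void     force_safe_state(void);

    #define HEARTBEAT_TIMEOUT_TICKS  10
    #define OUTPUT_LIMIT             1000

    /* Periodic supervision task: the main processor does the real work,
     * the supervisor only checks for anomalies and can shut the system down. */
    void supervisor_tick(void)
    {
        static uint32_t last_heartbeat;
        static uint32_t ticks_since_change;

        uint32_t hb = read_main_cpu_heartbeat();
        if (hb == last_heartbeat) {
            ticks_since_change++;
        } else {
            ticks_since_change = 0;
            last_heartbeat = hb;
        }

        /* Main processor stopped responding or produced an implausible output. */
        if (ticks_since_change > HEARTBEAT_TIMEOUT_TICKS ||
            read_main_cpu_output() > OUTPUT_LIMIT) {
            force_safe_state();
        }
    }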

Using two independent processing channels is a more advanced solution. The channels have separate inputs and outputs, but each of them is capable of triggering the safe state of the device if the need arises. The channels exchange information, so they can easily spot discrepancies between the values they process. A single channel usually cannot determine whether the difference stems from its own error or its counterpart’s. That does not matter, though – it knows that something is wrong and that the device must enter the safe state. Such a system can detect much more complex problems than the supervising-processor solution.
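The cross-check itself can be very simple. The sketch below assumes hypothetical helpers for reading the channel’s own value, receiving the other channel’s value, and entering the safe state; the tolerance is likewise an arbitrary example.

    #include <stdint.h>

    /* Hypothetical helpers: each channel reads its own input and can
     * independently switch the whole device into the safe state. */
    extern int32_t read_own_channel_value(void);
    extern int32_t receive_value_from_other_channel(void);
    extern void    enter_safe_state(void);

    #define MAX_ALLOWED_DIFFERENCE 5

    /* Cyclic cross-check running on each channel: the channel does not need
     * to know which side is wrong, only that the two values disagree. */
    void cross_check(void)
    {
        int32_t own   = read_own_channel_value();
        int32_t other = receive_value_from_other_channel();
        int32_t diff  = (own > other) ? (own - other) : (other - own);

        if (diff > MAX_ALLOWED_DIFFERENCE) {
            enter_safe_state();   /* discrepancy detected, cause unknown */
        }
    }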

More advanced still is the voting system. In this case more independent channels are present – three or five, as long as the number is odd. The voter collects information from all the channels and decides on the output based on a chosen strategy. Of course, when designing such a system we need to provide redundancy for the voter itself, so that it does not become the weakest link, that is a “single point of failure.”
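As an illustration, a 2-out-of-3 majority voter can be sketched in a few lines of C. The data types and the exact voting strategy are assumptions made for the example, not taken from any particular standard.

    #include <stdint.h>

    /* Minimal 2-out-of-3 majority voter: the output is accepted only if at
     * least two of the three independent channels agree. The voter itself
     * would also have to be made redundant in a real design, so that it
     * does not become a single point of failure. */
    typedef struct {
        int32_t value;
        int     valid;   /* 1 if a majority was found, 0 otherwise */
    } vote_result_t;

    vote_result_t vote_2oo3(int32_t a, int32_t b, int32_t c)
    {
        vote_result_t result = { 0, 0 };

        if (a == b || a == c) {
            result.value = a;
            result.valid = 1;
        } else if (b == c) {
            result.value = b;
            result.valid = 1;
        }
        /* No two channels agree: the caller should switch to the safe state. */
        return result;
    }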

Voting systems and two-channel systems can easily be combined into hybrid solutions. For example, each channel of a voting system can itself consist of two separate channels, or the other way round – each of the two separate channels can be realized as a voting system.

 

When multiple channels perform the same task, a topic worth mentioning is diverse programming. This approach assumes that each of the independent channels is developed by a different team, so the probability of the same software mistake occurring in all channels is much lower. The teams work from the same documentation but do not share their code, or even their ideas on how to implement the solution. The diversification can be increased further by using different types of processors, different programming languages, techniques, or methodologies.

 

The second reason for using multi-processor solutions is the wish to separate the less critical part of the system and develop it at a lower safety level, against less stringent standards, and in a much shorter time. What is more, it allows the use of ready-made components that do not meet the requirements of the higher safety levels.

 

This approach is especially tempting if we want to equip the system with modules such as a TCP/IP stack or display support. These modules usually require a lot of memory, often with dynamic allocation, and can consume much of the processor’s resources. Moreover, they carry a higher risk of errors such as memory leaks, stack overflows, or deadlocks. Separating these elements solves a lot of problems.

 

For the processor performing the safety-critical functions, all the additional processors can be transparent. The processor handling the network link can be treated as part of the network infrastructure – an element of the so-called “black channel.” Similarly, the processor responsible for displaying data reacts to specific commands, so it can be treated as part of the HMI.
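The idea behind the black channel is that the safety-relevant data is protected end to end, so the safety processor can verify it regardless of what happens in between. The sketch below assumes a hypothetical telegram layout and a generic crc16() helper; real safety protocols define their own, more elaborate formats.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical safety telegram protected end to end. Everything between
     * the two safety processors (communication CPU, network stack, cabling)
     * is treated as an untrusted "black channel", so the receiving safety
     * processor checks the data itself. */
    typedef struct {
        uint16_t sequence;   /* running number to detect lost or repeated frames */
        int32_t  payload;    /* the actual safety-relevant value */
        uint16_t crc;        /* checksum over sequence and payload */
    } safety_telegram_t;

    /* Assumed helper, e.g. a standard CRC-16 implementation. */
    extern uint16_t crc16(const uint8_t *data, uint32_t len);

    static uint16_t telegram_crc(const safety_telegram_t *t)
    {
        uint8_t buf[sizeof t->sequence + sizeof t->payload];
        memcpy(buf, &t->sequence, sizeof t->sequence);
        memcpy(buf + sizeof t->sequence, &t->payload, sizeof t->payload);
        return crc16(buf, sizeof buf);
    }

    bool telegram_is_valid(const safety_telegram_t *t, uint16_t expected_sequence)
    {
        /* Corruption, loss or reordering introduced anywhere in the black
         * channel shows up here and the value is rejected. */
        return (telegram_crc(t) == t->crc) && (t->sequence == expected_sequence);
    }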

 

By separating tasks and assigning them to different processors it is also possible to avoid the complicated integration that arises when multiple programmers work on the same code base. Integrating code from separate processors that communicate over a strictly defined interface is much easier.

 

