Testing fundamentals: reliability

Anton Malinskiy
Published in MarathonLabs
5 min read · Apr 20, 2023


Working with any codebase requires verifying that the result of a change, or even the current state, is valid. To do that, we interact with our software either through a direct programming interface, such as calling methods, or through the customer-facing surface, such as a CLI or a GUI. Sometimes we even automate such interactions with other code that we call automated tests. Unfortunately, when executing such code, we often face unexpected behaviour. A perfect system has no unexpected behaviour, but this is almost impossible to achieve at scale. If the amount of unexpected behaviour stays under a certain threshold, the system is considered reliable. Let’s understand why unexpected behaviour exists and what we can do about it to make our systems more reliable.


Determinism

Implementing any piece of code can be done in several ways. Still, one property that is usually desirable is determinism: executing an action with the same inputs should produce the same outcome. Determinism is easy to achieve when dealing with abstract code based on some mathematical computation. On the other hand, making practical application code deterministic is either hard to accomplish or undesirable. A simple example of undesirable determinism is using a dark theme for the UI at night and a light theme during the day, because there is no real control over the time of day.
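
To make behaviour like this testable, the usual trick is to turn the hidden input into an explicit one. Below is a minimal Kotlin sketch, with assumed names and an assumed daytime window (`themeFor`, `7..18` are illustrations, not from any real codebase), that injects the clock so a test can pin the time of day:

```kotlin
import java.time.Clock
import java.time.Instant
import java.time.LocalTime
import java.time.ZoneOffset

enum class Theme { LIGHT, DARK }

// The theme depends on the time of day, an input we normally don't control.
// Accepting a Clock makes that input explicit and therefore controllable.
fun themeFor(clock: Clock): Theme {
    val hour = LocalTime.now(clock).hour
    return if (hour in 7..18) Theme.LIGHT else Theme.DARK
}

fun main() {
    // A fixed Clock pins the uncontrolled input: the same input now
    // always produces the same outcome, i.e. the code is deterministic.
    val night = Clock.fixed(Instant.parse("2023-04-20T23:00:00Z"), ZoneOffset.UTC)
    check(themeFor(night) == Theme.DARK)
}
```

In production the caller passes `Clock.systemDefaultZone()`; in tests a fixed clock removes the non-determinism entirely.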

In abstract terms, any executed code has a set of inputs and outputs. Apart from the stable inputs you provide for verification, the two main groups of input essential to analysing determinism are system components and time.

Flakiness source: System components


The larger your solution, the more non-deterministic behaviour you will experience. A software solution is not just your code: it includes all the code you depend on, such as the API client implementation, the emulator software, or even the OpenGL implementation used for rendering graphics. Hardware is not removed from the equation either.

Some problems in these components can be solved by your team, your department, or even your business. The non-determinism of remote components accessed over the network is usually addressed by setting up a local instance of the API server. Mocking and stubbing are also widespread techniques for isolating your code from an external source of non-determinism.
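
As a concrete illustration, here is a minimal Kotlin sketch with assumed names (`PriceApi` and its implementations are not a real library): production code depends on an interface, so a test can swap the flaky network client for a deterministic fake:

```kotlin
import java.io.IOException

// The production code depends on this interface, not on a concrete client.
interface PriceApi {
    fun latestPrice(symbol: String): Double
}

// Real implementation: goes over the network, a source of non-determinism.
class NetworkPriceApi : PriceApi {
    override fun latestPrice(symbol: String): Double {
        throw IOException("network unavailable") // placeholder for a real HTTP call
    }
}

// Fake implementation: the same input always yields the same output.
class FakePriceApi(private val prices: Map<String, Double>) : PriceApi {
    override fun latestPrice(symbol: String): Double = prices.getValue(symbol)
}

fun main() {
    val api: PriceApi = FakePriceApi(mapOf("ACME" to 42.0))
    check(api.latestPrice("ACME") == 42.0) // stable, repeatable verification
}
```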

Unfortunately, not everything can or should be mocked. Fixing external components is a people problem: the owner of a particular component needs to spend time on it. If this interaction happens inside the same business, it’s much easier to deliver a solution. But what if the source of non-deterministic behaviour is some other company entirely, and you don’t even know the owner?

An example of something you can’t really solve properly yourself is n-bit data corruption happening in RAM. It is usually addressed with a form of error-correcting codes, for example by using ECC memory modules.

Flakiness source: Time


A critical input into the execution of any code is time. No, I don’t mean the system timer that gives you the number of milliseconds elapsed since the epoch.

One typical pattern where it’s difficult to predict the behaviour of a system is asynchronous execution: dealing with many threads or coroutines simultaneously leads to complicated synchronisation problems. Even performing a single touch input in the UI can be a challenging task: when a list of items is being lazily loaded, what is the right moment to touch a UI element?

Fixing such issues requires a good grasp of the states your system transitions through, as in a finite state machine. Unfortunately, since most of our code is not written that way, most testing frameworks resort to a for loop with a thread sleep and an if condition that throws a timeout exception when a certain requirement has not been met. What this code really means is that the framework has no idea about the determinism of what’s being checked: when, or even whether, a certain condition will be met.
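
In code, that fallback looks roughly like the Kotlin sketch below; `waitUntil` and its parameters are assumed names, not any specific framework’s API:

```kotlin
import java.util.concurrent.TimeoutException

// Poll the condition until it holds or the deadline passes. The framework
// has no model of when (or whether) the condition becomes true, so it
// sleeps, checks, and eventually gives up.
fun waitUntil(timeoutMs: Long = 10_000, pollMs: Long = 100, condition: () -> Boolean) {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (System.currentTimeMillis() < deadline) {
        if (condition()) return // the condition happened to be met in time
        Thread.sleep(pollMs)    // otherwise: sleep and try again
    }
    throw TimeoutException("Condition not met within $timeoutMs ms")
}

// Hypothetical usage: touch the list item only after lazy loading has
// (hopefully) finished. `screen.hasItem` is an assumed helper.
// waitUntil { screen.hasItem("Checkout") }
```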

Solutions

An engineer with an idealistic mindset will surely try to make all of the code deterministic. Still, while the technical problems owned by the team can be solved, the further the problem area is from the engineer in terms of ownership, the harder it will be to fix. There is no expectation, for example, that a frontend engineer can patch a container runtime implementation, let alone the OS kernel.

If the ideal solution is impossible, there are pragmatic solutions such as retries. The idea here is to sacrifice execution cost in exchange for the success criterion `if one out of N attempts passes, then the action passes`. Of course, this is nothing more than throwing money at the problem, but it is an industry-accepted solution: it just requires cost management and a bit of monitoring. What you get in return is a buffer for the engineer to fix the real issue, or to work on more urgent tasks first and delay the actual fix.
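
As a back-of-the-envelope illustration with assumed numbers: if attempts are independent and a single attempt passes with probability p, then at least one of N attempts passes with probability 1 − (1 − p)^N. A test that passes 90% of the time (p = 0.9) reaches 1 − 0.1^3 = 0.999, i.e. 99.9%, with just N = 3 attempts, at three times the worst-case execution cost.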

If we expect a particular code to be unreliable, we can do three types of retries:

  1. Post-factum
  2. Preventive
  3. Uncompleted

When talking about retries in testing, most retries you see are post-factum: the test code fails, and only then does the execution flow decide to add more attempts.

If we accept that testing code is inherently non-deterministic, we can start planning for it. For example, if we know that a particular test fails sometimes, why not add retries from the start and run them in parallel? These are preventive retries in practice.

Running non-deterministic code can also produce no result at all. This might indicate a substantial underlying problem, but at the same time such uncompleted test runs can be retried as well.
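
For illustration, here is a rough Kotlin sketch of the first two retry types; the function names are assumptions rather than Marathon’s API, and the preventive variant assumes the kotlinx-coroutines library:

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope

// Post-factum: attempts run one after another, and a retry is scheduled
// only after a failure has actually been observed.
fun retryPostFactum(attempts: Int, action: () -> Boolean): Boolean {
    repeat(attempts) { if (action()) return true }
    return false
}

// Preventive: for a test known to be flaky, schedule N copies up front and
// run them in parallel; the test passes if at least one attempt passes.
suspend fun retryPreventive(attempts: Int, action: suspend () -> Boolean): Boolean =
    coroutineScope {
        (1..attempts).map { async { action() } }.awaitAll().any { it }
    }
```

Uncompleted retries follow the same shape: an attempt that ends without a verdict is simply scheduled again, usually against a separate quota so a hanging test cannot consume the budget indefinitely.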

Conclusion

The reliability of testing code depends entirely on its determinism; code that is not deterministic is called flaky. Causes of flakiness may be under your control or outside of it. Solving flakiness means gaining control over the behaviour and logic of your code, including all inputs and the implementation. If such a fix is not an option, a pragmatic approach with retries helps mitigate flakiness at the price of additional execution cost.

Implementing such logic can be done in the testing framework, but it’s much better to implement it once per technical stack and reuse it via a general-purpose test runner. Marathon implements all of the above retry types and their cost controls via a configurable flakiness strategy, retry strategy, and uncompleted-tests quota. Most of the solutions you can find today implement only post-factum retries, which results in sub-optimal test execution time and higher cost.

You can find more information on how to make your test runs more reliable and performant in our documentation: https://docs.marathonlabs.io/

