This is a postmortem of the incident impacting mCare services on the 13th of November 2024.
04:50 - Testing complete. Due to the time of day and the belief that the impact was limited to infections, the release was made ready and held until the next morning
19:30 - Release rolled out to test customers
20:08 - Expedited release process starts rolling out to shards
20:32 - Release to all shards completed, service returned to normal
The root cause of this incident was a new field being added to the wound care / infection service to record the reason for a change of status. This field should have been behind a release control. However, the release control did not function correctly, which caused wounds/infections to fail to save because no reason had been supplied in the new field.
Once we had identified that the release control was not working correctly, we were able to correct the issue and prove through testing that saving wounds / infections would work.
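The failure mode and its fix can be illustrated with a minimal sketch. This is not the mCare code: the flag name `status_change_reason`, the `StatusChange` type, and the `save` function are hypothetical names chosen for illustration. The sketch assumes the intended behaviour is that the new reason field is only required when its release control is switched on, and that a missing flag is treated as off.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class ValidationError(Exception):
    """Raised when a record cannot be saved."""


@dataclass
class StatusChange:
    record_id: str
    new_status: str
    reason: Optional[str] = None  # the new field (hypothetical name)


def validate(change: StatusChange, flags: Dict[str, bool]) -> None:
    # Only require the new field when its release control is on.
    # A missing flag is treated as "off" so the default is the old behaviour.
    if flags.get("status_change_reason", False) and not change.reason:
        raise ValidationError("A reason is required for a change of status.")


def save(change: StatusChange, flags: Dict[str, bool],
         store: Dict[str, StatusChange]) -> None:
    # Validate first; the record is only stored if validation passes.
    validate(change, flags)
    store[change.record_id] = change


store: Dict[str, StatusChange] = {}
# With the control off, a change without a reason still saves.
save(StatusChange("wound-1", "healed"), {"status_change_reason": False}, store)
```

In this sketch, the incident corresponds to the check in `validate` running regardless of the flag, so every save without a reason was rejected.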
Due to the time of day, we did not want to introduce any other potential issues. Combined with our limited view of the scope of the impact at that point, this led us to hold the release until the next morning, when all staff would be available to support it.
The next day we released the fix following an expedited rollout process to reach customers as soon as possible. During this, the full scope of the impact also became clear from additional support tickets that had been raised throughout the night.
The issue presented as a failure to save both infections and wounds, leaving staff unable to make changes.
By the time the issue was identified and the fix fully tested, it was too late in the day to release safely. This decision was partially based on the believed scope of impact being limited to just infections.
The next day we released as soon as safely possible to restore functionality.
It was only on this second day that the impact on wounds was also fully understood. If this had been known at the time, an emergency release could have been considered.
We are replacing the release control functionality within mCare to allow for more granular control and checking. This will ensure that mistakes of this kind, where functionality is not correctly guarded, do not happen again.
A replacement solution has been chosen and will be implemented over the next few weeks. The implementation will not directly impact customers using our solutions.
We are also changing our release process. The purpose is to ensure more detailed testing of situations where functionality is locked down, so that the risk of issues arising is reduced even with release control in place. To support this, an additional test environment is being created to allow for testing with everything purposefully turned off.
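As a rough illustration of what testing with everything turned off could look like, the sketch below runs each save path with every known release control set to off and reports any path that fails. All names here (`find_unguarded`, `guarded_save`, `unguarded_save`, the `status_change_reason` flag) are hypothetical and stand in for real save operations. It is a sketch under those assumptions, not a description of the new environment.

```python
from typing import Callable, Dict, Iterable, List

Flags = Dict[str, bool]


def all_flags_off(known_flags: Iterable[str]) -> Flags:
    # Build a configuration with every known release control disabled.
    return {name: False for name in known_flags}


def find_unguarded(save_paths: Dict[str, Callable[[Flags], None]],
                   known_flags: Iterable[str]) -> List[str]:
    # Run each save path with everything off; a failure suggests that
    # some new behaviour is not properly behind its release control.
    flags = all_flags_off(known_flags)
    failures = []
    for name, save in save_paths.items():
        try:
            save(flags)
        except Exception:
            failures.append(name)
    return failures


def guarded_save(flags: Flags) -> None:
    reason = None  # no reason supplied, as in pre-release usage
    if flags.get("status_change_reason", False) and reason is None:
        raise ValueError("reason required")


def unguarded_save(flags: Flags) -> None:
    reason = None
    if reason is None:  # ignores the flag: the incident's failure mode
        raise ValueError("reason required")


failing = find_unguarded(
    {"wound": guarded_save, "infection": unguarded_save},
    ["status_change_reason"],
)
```

Here `failing` lists only the save path that enforces the new field regardless of the flag, which is the kind of regression an all-off environment is meant to surface before release.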
We will also be looking at how we can better identify the scope of impact if such an issue occurs again, to better inform the priority of releasing and to allow us to make better informed decisions. We will review all of our build and release processes to identify safer routes for releasing in such cases, and we will be looking at improving our policies around such emergency releases.