Database Outage
Incident Report for UpGuard CyberRisk
Postmortem

PIR Date: 17th June, 2022
Incident Date: June 6th, 2022
Incident Time: 3:33 UTC
Incident Number: INCI-159
Severity Level: 1 - Blocker
Affected Services: UpGuard CyberRisk, Web App, External API, Authentication services
Outage Duration: 30 Minutes

Incident Summary

On Monday, June 6th at 3:33 UTC, the UpGuard CyberRisk, Web App, External API & Authentication services experienced a 30-minute outage. Recovery from this outage led to the loss of 18 hours of data, affecting <1% of our customers.

Fault

A database maintenance task commenced on UpGuard CyberRisk. The production database was incorrectly overwritten, halting access to UpGuard CyberRisk, Web App, External API & Authentication services.

Detection

Internal alerting systems notified internal channels immediately of the service disruption across UpGuard CyberRisk, Web App, External API & Authentication services.

Impact

  1. Outage: UpGuard CyberRisk, Web App, External API & Authentication services were unavailable for 30 minutes. All transactions within the product were halted as a result.
  2. Data loss: The database backup that was restored was taken the previous day, Sunday, June 5th at 7:00 UTC. Data entered into UpGuard CyberRisk during the previous 18 hours was lost, affecting <1% of our customers.

Recovery

  1. Due to the low number of transactions, UpGuard CyberRisk, Web App, External API & Authentication services were restored and brought back online with the last available backup from Sunday, June 5th at 7:00 UTC.
  2. Analysis was conducted to review changes that occurred between Sunday, June 5th at 7:00 UTC and Monday, June 6th at 3:33 UTC on UpGuard CyberRisk.

Timeline

3:30 UTC: Database maintenance commenced.

3:33 UTC: It was identified that, due to human error, the database maintenance had been carried out on the production database instead of the test database. This halted access to UpGuard CyberRisk, Web App, External API & Authentication services.

3:45 UTC: An incident response group was formed.

3:55 UTC: A decision was made to restore from the last available full backup, given the low impact of the transactions that had been executed.

4:03 UTC: UpGuard CyberRisk, Web App, External API & Authentication services were restored from backup data as of Sunday, June 5th at 7:00 UTC. The restoration was successful and within our Hosted Services Agreement. Data entered into UpGuard CyberRisk during the previous 18 hours was lost, affecting <1% of our customers.

Root Cause

It was concluded that the root cause was human error, along with insufficient testing and verification of the maintenance work. In addition, the change type (restoring a database image into a non-production copy) represents a unique case for our change control procedures and was not classified as a production change. Although the change was performed from the production environment, it was classified as non-production because the classification was based on the destination (a non-production copy of the database) rather than the source.
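
To illustrate the classification gap described above, the following is a minimal sketch of a rule that treats a change as production-impacting if any environment it touches is production, not only its destination. The `Change` type, its field names, and the environment labels are hypothetical and are not taken from our actual change control tooling.

```python
from dataclasses import dataclass

PRODUCTION = "production"  # hypothetical environment label


@dataclass(frozen=True)
class Change:
    """Hypothetical description of a planned change."""
    description: str
    source_env: str       # environment the data is read from
    destination_env: str  # environment the data is written to
    executed_from: str    # environment the operator runs the change in


def requires_production_change_control(change: Change) -> bool:
    """Return True if any environment involved in the change is production.

    Classifying on the destination alone would miss a restore that is run
    from production into a non-production copy.
    """
    involved = (change.source_env, change.destination_env, change.executed_from)
    return PRODUCTION in involved


restore_into_copy = Change(
    description="Restore database image into non-production copy",
    source_env="production",
    destination_env="test",
    executed_from="production",
)
print(requires_production_change_control(restore_into_copy))
```

Under this rule, a restore run from production into a test copy falls under the formal change control process even though its destination is non-production.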

Corrective Actions

As a result of this incident, we have analyzed all of the transactions within UpGuard CyberRisk between Sunday, June 5th at 7:00 UTC and Monday, June 6th at 3:33 UTC so that we can notify affected customers with a description of the data loss.

Effective immediately: This category of change will be aligned with all other types of change that could impact our customer and production data. This means that it will follow the formal change control process, which requires review, testing, and approval.

We will increase the frequency of our backups. We will require a backup before any major change to the production environment.
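
As a rough sketch of how a "backup before any major change" requirement could be enforced in a script, the check below refuses to proceed when the most recent backup is older than a threshold. The one-hour threshold, the function name, and the exception type are assumptions made for illustration and do not describe our actual backup policy.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold for illustration only.
MAX_BACKUP_AGE = timedelta(hours=1)


class StaleBackupError(RuntimeError):
    """Raised when the latest backup is too old to proceed with a change."""


def assert_fresh_backup(last_backup_at, now=None, max_age=MAX_BACKUP_AGE):
    """Return the backup age, or raise if it exceeds max_age.

    Both timestamps are expected to be timezone-aware (UTC).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - last_backup_at
    if age > max_age:
        raise StaleBackupError(
            f"Latest backup is {age} old; take a new backup before proceeding."
        )
    return age


# Timestamps from this incident: backup on June 5th at 7:00 UTC,
# maintenance started on June 6th at 3:30 UTC.
backup_time = datetime(2022, 6, 5, 7, 0, tzinfo=timezone.utc)
change_time = datetime(2022, 6, 6, 3, 30, tzinfo=timezone.utc)
try:
    assert_fresh_backup(backup_time, now=change_time)
except StaleBackupError as err:
    print(f"Change blocked: {err}")
```

With the timestamps from this incident, the check raises and the change would not proceed until a new backup had been taken.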

Targeting completion within 1 month:

We are reviewing all other types of changes that fall outside our regular change control process to verify coverage at the appropriate level of control.

For any change that requires any manual process, we will ensure that:

  1. A scripted solution is present that allows for review and testing. This will include a scripted and tested database restore function.
  2. For changes that cannot be scripted, a documented playbook is available that allows for peer review and testing.
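
One possible shape for the scripted restore function mentioned in item 1 is sketched below. It only validates the request and returns a plan, and it refuses a production target unless the operator explicitly repeats the target name. The database names, the `plan_restore` function, and the confirmation mechanism are all hypothetical.

```python
from typing import AbstractSet, Optional


class UnsafeRestoreTarget(RuntimeError):
    """Raised when a restore would overwrite production without confirmation."""


def plan_restore(
    backup_id: str,
    target_db: str,
    production_dbs: AbstractSet[str],
    confirm_target: Optional[str] = None,
) -> dict:
    """Validate a restore request and return a plan.

    This function does not touch any database; it only decides whether the
    restore is allowed to proceed.
    """
    overwrites_production = target_db in production_dbs
    if overwrites_production and confirm_target != target_db:
        raise UnsafeRestoreTarget(
            f"'{target_db}' is a production database; "
            "repeat its name in confirm_target to proceed."
        )
    return {
        "backup_id": backup_id,
        "target_db": target_db,
        "overwrites_production": overwrites_production,
    }


# Hypothetical database names for illustration.
PRODUCTION_DBS = frozenset({"cyberrisk_prod"})

print(plan_restore("backup-001", "cyberrisk_test", PRODUCTION_DBS))
try:
    plan_restore("backup-001", "cyberrisk_prod", PRODUCTION_DBS)
except UnsafeRestoreTarget as err:
    print(f"Refused: {err}")
```

Because the validation is separate from the restore itself, it can be peer reviewed and tested without access to any database.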

We will be reviewing our external communications plan for our customers to ensure that relevant and active users are communicated with and that there is an opt-out function.

Posted Jun 15, 2022 - 05:51 UTC

Resolved
Systems are stable, and affected users will be contacted soon.
Posted Jun 06, 2022 - 10:23 UTC
Monitoring
Systems are back online and we are currently investigating logs as to which users have been affected.
Posted Jun 06, 2022 - 04:07 UTC
Identified
A database backup has been restored and we will begin bringing systems back online.
Posted Jun 06, 2022 - 03:59 UTC
Update
We are continuing to investigate this issue.
Posted Jun 06, 2022 - 03:49 UTC
Investigating
We are currently investigating this issue.
Posted Jun 06, 2022 - 03:35 UTC
This incident affected: UpGuard CyberRisk (Web App, External API, Authentication).