End users experiencing issues when interacting with courses

Incident Report for Northpass

Postmortem

On September 11, 2025, our platform experienced intermittent disruptions impacting our AWS-hosted customers. These disruptions affected our end-users' ability to interact with courses. The issue was traced to an automated deployment process that unintentionally updated certain backend services to incompatible versions. This created temporary mismatches between services and led to periodic failures. Azure-hosted customers were not affected.

Impact

  • Duration: 2 hours (16:24 - 18:23 EDT).
  • Pattern: Cyclic issues every few minutes with brief automatic recovery periods.
  • Symptoms: Users experienced difficulties accessing courses, completing activities, and general course interactions.
  • Affected users: All AWS-hosted customers, whenever an incompatible combination of service versions was deployed.

Root Cause

Our deployment automation tool queries the Docker registry for the latest images. During the incident it received inconsistent results, which we suspect is a registry API limitation triggered by the large number of images in our registry. As a result, the automation cycled through different combinations of microservice images approximately every 10 minutes. Some of those combinations paired incompatible service versions, which disrupted the interdependent functionality required for course interactions.
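
As an illustration only, the sketch below shows one way an incomplete registry listing can change which image is considered "latest". It assumes a paginated listing and tags of the form build-<number>; neither is confirmed as the mechanism behind this incident, and the names list_all_tags, latest_tag, and PAGES are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# One page of results: (tags on this page, cursor for the next page or None).
Page = Tuple[List[str], Optional[str]]


def list_all_tags(fetch_page: Callable[[Optional[str]], Page]) -> List[str]:
    """Follow pagination to the end so the caller sees every tag."""
    tags: List[str] = []
    cursor: Optional[str] = None
    while True:
        page_tags, cursor = fetch_page(cursor)
        tags.extend(page_tags)
        if cursor is None:
            return tags


def latest_tag(tags: List[str]) -> str:
    """Pick the highest build number; assumes tags look like 'build-<int>'."""
    return max(tags, key=lambda tag: int(tag.rsplit("-", 1)[1]))


# Hypothetical two-page listing standing in for a large registry.
PAGES = {
    None: (["build-101", "build-102"], "page2"),
    "page2": (["build-103", "build-104"], None),
}

first_page_only = PAGES[None][0]
complete = list_all_tags(lambda cursor: PAGES[cursor])

print(latest_tag(first_page_only))  # build-102
print(latest_tag(complete))  # build-104
```

A caller that sees only part of the listing resolves "latest" to a different image than one that sees all of it, so two services resolved at different moments can end up on versions that were never meant to run together.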

Resolution

Our team cleaned up the Docker registry by removing old images, significantly reducing the total number of images. This stabilized our deployment automation process and eliminated the cycling behavior.

Next steps to prevent recurrence

Implement an automated Docker image cleanup policy to maintain registry hygiene.
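
A minimal sketch of what such a retention rule could look like, assuming a policy that always keeps the most recent images and deletes the rest only once they pass an age limit. The Image type, the select_for_deletion function, and the thresholds are hypothetical and do not describe the policy that will actually be deployed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class Image:
    tag: str
    pushed_at: datetime


def select_for_deletion(
    images: List[Image], keep_last: int, max_age: timedelta, now: datetime
) -> List[Image]:
    """Return images that are outside the newest `keep_last` AND older than `max_age`."""
    newest_first = sorted(images, key=lambda image: image.pushed_at, reverse=True)
    cutoff = now - max_age
    # The newest `keep_last` images are always retained, whatever their age.
    return [image for image in newest_first[keep_last:] if image.pushed_at < cutoff]


# Hypothetical repository contents, with ages relative to `now`.
now = datetime(2025, 9, 12, tzinfo=timezone.utc)
images = [
    Image("build-1", now - timedelta(days=90)),
    Image("build-2", now - timedelta(days=40)),
    Image("build-3", now - timedelta(days=5)),
    Image("build-4", now - timedelta(days=1)),
]

stale = select_for_deletion(images, keep_last=2, max_age=timedelta(days=30), now=now)
print([image.tag for image in stale])  # ['build-2', 'build-1']
```

Requiring both conditions before deleting keeps a rollback target available even for a repository that has not been pushed to in a long time.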

Posted Sep 12, 2025 - 10:38 EDT

Resolved

This incident has now been resolved. We apologize for the service interruption.
In keeping with our commitment to transparency, we will provide a postmortem of this issue within 48 hours.
Posted Sep 11, 2025 - 18:23 EDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Sep 11, 2025 - 18:14 EDT

Update

We are continuing to work on resolving this issue.
Posted Sep 11, 2025 - 17:43 EDT

Identified

We've identified the issue and are working towards a resolution.
Posted Sep 11, 2025 - 16:42 EDT

Investigating

End users are experiencing issues when interacting with courses.
Posted Sep 11, 2025 - 16:24 EDT
This incident affected: Northpass App - AWS.