On September 11, 2025, our platform experienced intermittent disruptions impacting our AWS-hosted customers. These disruptions affected our end-users' ability to interact with courses. The issue was traced to an automated deployment process that unintentionally updated certain backend services to incompatible versions. This created temporary mismatches between services and led to periodic failures. Azure-hosted customers were not affected.
Our deployment automation tool was querying the docker registry for the latest images, but received inconsistent results due to the large number of images in our registry (suspected registry API limitation). This caused the automation to cycle through different microservice image combinations approximately every 10 minutes, creating incompatible service versions that disrupted the interdependent functionality required for course interactions.
Our team cleaned up the docker registry by removing old images, significantly reducing the total number of images. This stabilized our deployment automation process and eliminated the cycling behavior.
Implement automated docker image cleanup policy to maintain registry hygiene.