jsDelivr May outage postmortem
On the night of May 2, 2024, the jsDelivr CDN domain cdn.jsdelivr.net started serving an expired SSL certificate to clients connecting from certain regions.
The outage lasted for more than 5 hours and affected users mostly in Africa, Asia, and certain countries in Europe and Latin America.
Users from the USA, Canada, western Europe, Brazil, and many other countries were not affected.
This disparity was due to how we route traffic between our main CDN providers. Users who were routed to our Fastly CDN endpoint were unaffected.
The root cause of the outage was Cloudflare’s switch from DigiCert’s certificate authority to Google Trust Services. While the switch itself was benign, it also changed the domain validation method.
Since we’re a multi-CDN service, with traffic being routed between providers based on our own internal rules, we can’t use Cloudflare DNS hosting. Instead, we have a special setup where Cloudflare only acts as the CDN, with DNS being hosted elsewhere.
To allow Cloudflare to automatically issue and manage our certificates, we added the proper DNS validation records at our third-party DNS providers. This system has worked well for almost 10 years.
Unfortunately, the certificate authority migration also made those validation records obsolete, and Cloudflare switched to HTTP validation instead. This would never work in our case: depending on where the verification request came from, it could hit a different CDN provider and fail.
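To illustrate why HTTP validation is fragile in a multi-CDN setup, here is a minimal Python sketch. It is a toy model, not Cloudflare's or any certificate authority's actual implementation: the token, the region names, and the routing table are all hypothetical, and the only assumption carried over from the incident is that a verification request can land on a provider that knows nothing about the challenge.

```python
from typing import Dict, Optional

# Hypothetical token that the certificate authority expects to read back.
EXPECTED_TOKEN = "example-validation-token"


def serve_challenge(provider: str) -> Optional[str]:
    """Return what a provider would answer for the challenge request.

    Assumption: only the provider that requested the certificate
    (Cloudflare here) knows the token; any other provider has nothing to serve.
    """
    return EXPECTED_TOKEN if provider == "cloudflare" else None


def http_validation_passes(probe_region: str, routing: Dict[str, str]) -> bool:
    """Simulate one verification request coming from a given region."""
    provider = routing[probe_region]  # our own rules decide who answers
    return serve_challenge(provider) == EXPECTED_TOKEN


# Hypothetical routing table: which provider serves which region.
ROUTING = {"region-a": "cloudflare", "region-b": "fastly"}

if __name__ == "__main__":
    for region in ROUTING:
        print(region, http_validation_passes(region, ROUTING))
```

In this model the outcome depends entirely on where the verification request originates. DNS-based validation does not have this problem, because the record is answered by the DNS provider regardless of which CDN would serve the HTTP request.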
We were not aware of this change, so what ended up happening was the following:
- The previous, working DigiCert certificate was set to expire, as had happened many times over the years
- Shortly before the expiration, Cloudflare tried to issue the new certificate using HTTP validation and failed
- After it failed, it reverted to an old certificate that expired in 2020
- None of us or our systems ever expected the managed SSL system to fail
- This resulted in an error message for all of our users who were routed to Cloudflare’s CDN
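A failure like the sequence above can be caught by an external check that looks at the certificate actually being served. The following is a minimal sketch using only the Python standard library; it is not our production monitoring, and the function names `days_until_expiry` and `check_host` are made up for this example.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter date (negative if expired).

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "May 12 00:00:00 2024 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def check_host(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Do a verified TLS handshake and report the certificate status."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except ssl.SSLCertVerificationError as exc:
        # An expired certificate fails verification and ends up here.
        return f"FAIL: {exc.verify_message}"
    return f"OK: {days_until_expiry(cert['notAfter']):.1f} days left"


if __name__ == "__main__":
    print(check_host("cdn.jsdelivr.net"))
```

Note that a single check only tests whichever provider the routing sends that probe to. To cover every provider, the same check would have to run from locations that are routed to each of them.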
I want to note that Cloudflare is not at fault here, and they have no obligation towards anyone to validate rare client configs, especially free sponsored ones.
I, @jimaek, take full responsibility for this outage. I should not have made any assumptions about the migration and should have taken extra precautions.
This outage is embarrassing, and I apologize to everyone who was affected!
The goal of jsDelivr has always been to be the fastest and most reliable production-ready open-source CDN for anyone to use. This has never changed and never will.
The next steps are:
- Short-term:
  - Review the full system to ensure that similar issues are automatically handled and re-routed
  - From now on, any critical change by a CDN provider will result in its immediate deactivation from jsDelivr, followed by manual verification to ensure the CDN’s stability with our specific setup
- Long-term:
  - Optimize, automate, and simplify our DNS, load-balancing, and failover systems and components to prevent other theoretically possible but unknown edge cases
  - Integrate our own Globalping service for even better and more reliable monitoring and failover. Consider joining the project by hosting a Docker probe on your network!
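The automatic re-routing mentioned in the steps above can be sketched roughly as follows. This is only an illustration of the idea, not our actual load-balancing logic: `select_provider`, the preference order, and the health map are hypothetical, and in practice the health data would come from external checks.

```python
from typing import Dict, List


def select_provider(candidates: List[str], healthy: Dict[str, bool]) -> str:
    """Return the first candidate that is currently marked healthy.

    `candidates` is the preference order for a region. A provider that is
    missing from `healthy` is treated as unhealthy, so an unverified
    provider never receives traffic by default.
    """
    for provider in candidates:
        if healthy.get(provider, False):
            return provider
    raise RuntimeError("no healthy CDN provider available")


if __name__ == "__main__":
    # Hypothetical state: the certificate check for one provider has failed.
    health = {"cloudflare": False, "fastly": True}
    print(select_provider(["cloudflare", "fastly"], health))
```

Treating an unknown provider as unhealthy matches the short-term rule above: a provider is taken out of rotation first and only re-enabled after verification.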
Thank you to everyone for being with us for so many years. We hope you will all continue this journey with us.