Upgrades are a critical part of any cloud-based Identity and Access Management (IAM) system. At Cloud-IAM, we strive to make upgrades both fast and safe, with minimal (or zero) downtime. Our customers, who may run different types of IAM (CIAM, WIAM, etc.), expect uninterrupted service even when new versions roll out.
In this article, we’ll walk through how we design and execute Keycloak upgrades in Cloud-IAM. While the focus is on Keycloak, the principles can be adapted to almost any service architecture at scale.
Our design principles prioritize no service interruptions, automatic updates, seamless rollback capabilities, and overall simplicity. These principles are driven by the following constraints:
1. We Are a Global SaaS (IAMaaS) Provider
2. We Manage Diverse IAM Types
3. We Offer Co-Managed Clusters with Clients
At a high level, our architecture splits each Keycloak cluster into two distinct parts:
By separating these concerns, we can manage each category of component differently. The next sections walk through how this design translates into a practical upgrade workflow.
Originally known as the "creation job," our automated procedure has grown beyond its initial scope. Its capabilities now include:
Consequently, we have adopted the term "set cluster state" job to better represent its function. It acts as a central orchestrator, ensuring a Keycloak cluster reaches a specified version and condition, regardless of whether it involves a new deployment, upgrade, or restoration.
The "creation job" name persists for historical reasons, but its role is to bring the cluster to the intended final state, whichever path that takes.
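To illustrate the idea of a single job that converges on a desired state, here is a minimal Python sketch. It is not our actual job: `ClusterState`, `version_key`, and `plan_action` are hypothetical names, and the version strings are assumed to be plain dotted numbers. The point is that the caller only declares the target version and backup, and the label (create, upgrade, rollback, restore) falls out of comparing current and desired state.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical description of a cluster: Keycloak version plus source backup."""
    version: str
    backup_id: Optional[str] = None


def version_key(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '25.0.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def plan_action(current: Optional[ClusterState], desired: ClusterState) -> str:
    """Name the kind of work needed to move `current` to `desired`."""
    if current is None:
        return "create"      # nothing deployed yet
    if current == desired:
        return "noop"        # already in the intended state
    if version_key(desired.version) > version_key(current.version):
        return "upgrade"     # target version is newer
    if version_key(desired.version) < version_key(current.version):
        return "rollback"    # target version is older
    return "restore"         # same version, different backup
```

Whatever label comes back, the same provisioning procedure would run; the label is only useful for logging and reporting.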
The "set cluster state" job leverages Terraform/Ansible scripts to execute the initial "Creation" phase, which involves provisioning a new cluster. This process entails:
This workflow resembles a conventional automated deployment. The key difference is that the same procedure is used not only for new deployments but also for upgrades and rollbacks.
Keycloak uses Liquibase for database migrations, which means the application itself handles schema changes. A key consequence is that two Keycloak clusters created from the same backup and the same version end up in the same state.
When a client creates a cluster in Cloud-IAM (what we call a “deployment”), it’s associated with a major version of Keycloak. During an upgrade, we effectively just change that version and let Terraform and Ansible apply the change to the existing state.
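The "upgrade is just a version change" idea can be sketched in Python as a declarative spec with a single field modified. This is an illustration only: `DeploymentSpec`, its fields, the deployment name, and the version numbers are hypothetical placeholders, and the real change is applied by Terraform and Ansible rather than by code like this.

```python
from dataclasses import asdict, dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class DeploymentSpec:
    """Hypothetical declarative description of one deployment."""
    name: str
    keycloak_version: str
    region: str


def spec_diff(old: DeploymentSpec, new: DeploymentSpec) -> Dict[str, Tuple[str, str]]:
    """Return only the fields that differ, as {field: (old_value, new_value)}."""
    old_d, new_d = asdict(old), asdict(new)
    return {key: (old_d[key], new_d[key]) for key in old_d if old_d[key] != new_d[key]}


# Illustrative values, not a real deployment.
current = DeploymentSpec(name="acme-prod", keycloak_version="24.0.0", region="eu-west")

# An upgrade is the same spec with only the version replaced.
upgraded = replace(current, keycloak_version="25.0.0")

changes = spec_diff(current, upgraded)
print(changes)
```

Because the spec is immutable and the diff contains a single field, the tooling that applies it has exactly one thing to reconcile.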
When a Keycloak cluster update would require a complete shutdown, and therefore downtime, we use a Disaster Recovery (DR) approach to avoid any interruption. This method involves:
Because we treat Keycloak as stateless and rely on backups for the stateful part, we can stand up a second cluster in parallel. Switching DNS to the new cluster is then effectively instant, which yields a zero-downtime update.
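The parallel-cluster switch can be sketched as a health-gated pointer swap. The Python below is a simplified model under stated assumptions: the DNS zone is a plain dictionary, `switch_when_healthy` and the cluster names are hypothetical, and the health check is injected by the caller. In a real setup, how quickly clients follow the change also depends on the TTL of the DNS record.

```python
from typing import Callable, Dict


def switch_when_healthy(
    dns: Dict[str, str],
    hostname: str,
    new_target: str,
    is_healthy: Callable[[str], bool],
) -> bool:
    """Point `hostname` at `new_target` only if the new cluster passes its health check.

    Returns True when the switch happened, False when the old target was kept.
    """
    if not is_healthy(new_target):
        return False          # keep serving from the old cluster
    dns[hostname] = new_target
    return True


# In-memory stand-in for a DNS zone (hypothetical names).
zone = {"auth.example.com": "cluster-old"}

# The new cluster runs in parallel; traffic moves only once it reports healthy.
switched = switch_when_healthy(zone, "auth.example.com", "cluster-new", lambda target: True)
```

Gating the switch on a health check means a failed parallel deployment never receives traffic, and the old cluster stays available until it is explicitly decommissioned.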
Rollbacks are essentially the reverse of the upgrade process.
1. Tear Down Problematic Stateful Components
2. Restore from Backup
For zero-downtime rollbacks, an alternative Disaster Recovery (DR) approach can be used. This involves deploying the old version alongside the new, switching the DNS back to the old version, and then decommissioning the new, problematic version.
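The two rollback paths described above can be written down as ordered step lists. This Python sketch only encodes the ordering from the text; `rollback_steps` is a hypothetical helper, and the step strings are descriptive labels rather than real commands.

```python
from typing import List


def rollback_steps(zero_downtime: bool, old_version: str) -> List[str]:
    """Return the ordered rollback steps for the chosen strategy."""
    if zero_downtime:
        # DR-style rollback: the old version runs alongside the new one.
        return [
            f"deploy version {old_version} alongside the new version",
            "switch DNS back to the old-version cluster",
            "decommission the problematic new-version cluster",
        ]
    # In-place rollback: the reverse of the upgrade.
    return [
        "tear down problematic stateful components",
        f"restore from backup at version {old_version}",
    ]


for step in rollback_steps(zero_downtime=True, old_version="24.0.0"):
    print(step)
```

In the DR variant the order matters: the DNS switch comes before decommissioning, so the problematic cluster is removed only after traffic has left it.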
By following these principles, we ensure our IAM platform remains up to date, consistent, and resilient, no matter how large or varied our clients’ needs become.