Zero-Downtime Upgrades in Cloud-IAM: How We Keep Keycloak Clusters Always On

July 1, 2025

Upgrades are a critical part of any cloud-based Identity and Access Management (IAM) system. In Cloud-IAM, we strive to make upgrades both fast and safe, with minimal (or zero) downtime. Our customers—who may be using different types of IAM (CIAM, WIAM, etc.)—expect uninterrupted services, even when new versions roll out.

In this article, we’ll walk through how we design and execute Keycloak upgrades in Cloud-IAM. While the focus is on Keycloak, the principles can be adapted to almost any service architecture at scale.

Context and Constraints

Our design principle prioritizes no service interruptions, automatic updates, seamless rollback capabilities, and overall simplicity. This principle is driven by the following constraints:

1. We Are a Global SaaS (IAMaaS) Provider

Operating worldwide requires automating processes to minimize manual intervention.
Supporting customer deployments across various cloud environments requires a flexible, broadly applicable solution.

2. We Manage Diverse IAM Types

Consumer-facing IAM (CIAM) has different requirements compared to internal workforce IAM (WIAM).
These diverse usage scenarios demand minimal-downtime solutions. For example, VOD platforms must always be up, while companies using workforce IAM might accommodate downtime outside of working hours.

3. We Offer Co-Managed Clusters with Clients

A key feature of Cloud-IAM is the support for custom Keycloak extensions (CE).
We do not manage or have visibility into these CEs; clients develop and deploy them independently.
A quick and smooth rollback mechanism is crucial for clients if an upgrade disrupts a custom extension.

The Design Overview

At a high level, our architecture splits each Keycloak cluster into two distinct parts:

Stateless Components
- Components that produce the same results regardless of how many times they run (e.g., the Keycloak application binaries, the web server itself, or other horizontally scalable services).
Stateful Components
- Components where running them multiple times yields different results (e.g., the database, custom configuration, elements of the WAF, or the load balancer configuration).

By separating these concerns, we can manage each category of component differently. The next sections walk through how this design translates into a practical upgrade workflow.
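
As a rough illustration of this split, the sketch below tags each component with a kind and derives a handling strategy from that tag alone. The component names, the `Kind` enum, and the strategy labels are hypothetical and chosen for this example; they are not taken from the Cloud-IAM codebase.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    STATELESS = "stateless"  # safe to destroy and recreate at any time
    STATEFUL = "stateful"    # carries data that must survive a replacement


@dataclass(frozen=True)
class Component:
    name: str
    kind: Kind


# Hypothetical inventory, loosely following the examples in the text.
CLUSTER = [
    Component("keycloak-app", Kind.STATELESS),
    Component("webserver", Kind.STATELESS),
    Component("database", Kind.STATEFUL),
    Component("custom-configuration", Kind.STATEFUL),
    Component("load-balancer-config", Kind.STATEFUL),
]


def upgrade_strategy(component: Component) -> str:
    """Pick a handling strategy based only on the component's kind."""
    if component.kind is Kind.STATELESS:
        return "replace"          # redeploy from scratch at the new version
    return "backup-then-restore"  # preserve state through a backup


strategies = {c.name: upgrade_strategy(c) for c in CLUSTER}
```

The point of the sketch is that the upgrade logic never needs to know what a component does, only which of the two categories it belongs to.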

Rethinking the Automation Process: From "Creation Job" to "Set Cluster State"

Originally known as the "creation job," our automated procedure has grown beyond its initial scope. Its capabilities now include:

Deploying a completely new Keycloak cluster.
Performing upgrades to existing clusters.
Executing restoration and rollback operations using backups.
Duplicating clusters for disaster recovery purposes.

Consequently, we have adopted the term "set cluster state" job to better represent its function. It acts as a central orchestrator, ensuring a Keycloak cluster reaches a specified version and condition, regardless of whether it involves a new deployment, upgrade, or restoration.

While the "creation job" moniker persists due to past usage, it is crucial to understand that its primary role is to guarantee the cluster achieves the intended final state, irrespective of the method employed.
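
One way to picture a "set cluster state" job is as a function that compares the current state with the desired one and returns the actions needed to close the gap. The sketch below is a simplified model under that assumption; `ClusterState`, `plan_actions`, the action strings, and the version and backup identifiers are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical description of a cluster: only the fields this sketch needs."""
    version: str
    backup_id: Optional[str] = None  # backup to restore from, if any


def plan_actions(current: Optional[ClusterState], desired: ClusterState) -> List[str]:
    """Return the ordered actions needed to move from `current` to `desired`."""
    if current is None:
        # New deployment (or a clone, when a backup is given).
        actions = ["provision"]
        if desired.backup_id is not None:
            actions.append(f"restore:{desired.backup_id}")
        actions.append(f"deploy:{desired.version}")
        return actions
    if desired.backup_id is not None:
        # Restoration or rollback: rebuild the stateful part from the backup.
        return [f"restore:{desired.backup_id}", f"deploy:{desired.version}"]
    if current.version != desired.version:
        return [f"deploy:{desired.version}"]  # plain upgrade
    return []  # already in the desired state


new_cluster = plan_actions(None, ClusterState("v2"))
upgrade = plan_actions(ClusterState("v1"), ClusterState("v2"))
rollback = plan_actions(ClusterState("v2"), ClusterState("v1", backup_id="backup-001"))
no_op = plan_actions(ClusterState("v2"), ClusterState("v2"))
```

The caller only states where the cluster should end up; whether that means a new deployment, an upgrade, or a restoration falls out of the comparison.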

Step 1: Cluster Provisioning

The "set cluster state" job leverages Terraform/Ansible scripts to execute the initial "Creation" phase, which involves provisioning a new cluster. This process entails:

Infrastructure Provisioning: Activating the required cloud resources, including servers, networking, and load balancers.
Keycloak Deployment: Installing the "stateless" application, Keycloak.
DNS Configuration: Establishing a new DNS route directed at the freshly provisioned cluster.
Cluster Registration: Informing the backend or orchestrator to manage the cluster's lifecycle.

This workflow is similar to conventional automated deployment. However, critically, this same procedure is applied not only for new deployments but also for upgrades and rollbacks.
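
The four phases above can be modeled as an ordered list of steps applied to a cluster record. The sketch below is a toy model, not the actual Terraform/Ansible code: the step functions, the dictionary keys, and the version string are assumptions made for the example. Each step is written to be idempotent, which is what allows the same pipeline to serve both new deployments and re-runs against an existing cluster.

```python
def provision_infrastructure(cluster: dict) -> None:
    # Only create resources that do not exist yet.
    cluster.setdefault("resources", ["servers", "networking", "load-balancer"])


def deploy_keycloak(cluster: dict) -> None:
    # Install the stateless application at the requested version.
    cluster["keycloak_version"] = cluster["target_version"]


def configure_dns(cluster: dict) -> None:
    # Point the DNS route at this cluster.
    cluster["dns_target"] = cluster["name"]


def register_cluster(cluster: dict) -> None:
    # Tell the backend that it now manages this cluster's lifecycle.
    cluster["registered"] = True


STEPS = [provision_infrastructure, deploy_keycloak, configure_dns, register_cluster]


def run_creation(cluster: dict) -> dict:
    """Apply every step in order and return the resulting cluster record."""
    for step in STEPS:
        step(cluster)
    return cluster


first = run_creation({"name": "cluster-a", "target_version": "v1"})
second = run_creation(dict(first))  # re-running changes nothing
```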

Step 2: Adding Backups to the Creation Process

Keycloak uses Liquibase for database migrations, which means the application itself applies schema changes. A key realization follows from this: a Keycloak cluster created from the same backup at the same version will reach the same state.

What We Added
We enhanced our creation script to optionally restore from a backup. This means we can spin up a clone of a production Keycloak cluster—complete with its database contents—just by pointing the creation job to an existing backup.
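
A minimal sketch of this idea, assuming a backup can be represented as a plain dictionary: the creation function takes an optional backup and otherwise starts empty. `create_cluster`, the `schema` field, and the sample realm name are hypothetical; the real job restores a database rather than copying a dictionary.

```python
import copy
from typing import Optional


def create_cluster(version: str, backup: Optional[dict] = None) -> dict:
    """Create a cluster at `version`, optionally seeded from a backup."""
    if backup is not None:
        # Copy the backup so the clone never shares state with its source.
        data = copy.deepcopy(backup["data"])
    else:
        data = {"realms": []}  # brand-new, empty cluster
    # Stand-in for the application migrating the schema for this version.
    return {"version": version, "schema": f"schema-for-{version}", "data": data}


production_backup = {"data": {"realms": ["customers"]}}

clone_a = create_cluster("v2", production_backup)
clone_b = create_cluster("v2", production_backup)
fresh = create_cluster("v2")
```

Here `clone_a` and `clone_b` compare equal because they were built from the same backup at the same version, which is the property the upgrade and rollback flows rely on.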

Step 3: Versioning the Deployment

When a client creates a cluster in Cloud-IAM (what we call a “deployment”), it’s associated with a major version of Keycloak. During an upgrade, we’re effectively just changing the version and letting Terraform + Ansible apply those changes to the existing state.

Progressive Rollouts
Often, we do a rolling upgrade: spin up a new node with the new version, then remove an old node, and so on. However, some Keycloak updates don’t support rolling or progressive upgrades. In these cases, we might have to stop the entire cluster, upgrade it, and then start it up again.
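
The rolling case can be simulated in a few lines. The sketch below represents each node by its version string, adds a new node before removing an old one, and records the node list after every step so the capacity invariant can be checked. The function name and the version labels are illustrative assumptions.

```python
from typing import List


def rolling_upgrade(nodes: List[str], new_version: str) -> List[List[str]]:
    """Replace old nodes one at a time; return the node list after each step."""
    current = list(nodes)
    history = [list(current)]
    old_count = sum(1 for version in current if version != new_version)
    for _ in range(old_count):
        # Add a node on the new version first, so capacity never drops.
        current.append(new_version)
        history.append(list(current))
        # Then remove one node that is still on an old version.
        index = next(i for i, v in enumerate(current) if v != new_version)
        current.pop(index)
        history.append(list(current))
    return history


history = rolling_upgrade(["v1", "v1", "v1"], "v2")
final_nodes = history[-1]
lowest_capacity = min(len(step) for step in history)
```

In this simulation the cluster never runs with fewer nodes than it started with, but old and new versions do coexist during the rollout, which is exactly what some Keycloak updates do not tolerate.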

Step 4: Zero-Downtime Upgrades - Disaster Recovery Approach

When a Keycloak update requires a complete cluster shutdown, which would normally mean downtime, we use a Disaster Recovery (DR) approach to avoid any interruption. This method involves:

Cluster Replication: Create an identical copy of the production cluster using the existing backup and creation job.
Copy Upgrade: Apply the new version to the copied cluster, execute necessary schema migrations, and verify system stability.
DNS Switch: Redirect traffic to the newly upgraded cluster. This action is effectively instantaneous, providing zero downtime.
Old Cluster Management: Decommission the original cluster, or retain it as a readily available hot standby for immediate rollback if required.

By treating Keycloak as stateless and relying on backups for stateful data, a parallel cluster can be established. This enables a rapid DNS switch, achieving seamless, zero-downtime updates.
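
The four steps above can be sketched as a single function. This is a toy model under stated assumptions: DNS is a dictionary from hostname to cluster name, a deep copy stands in for restoring the production backup, and `Cluster`, `dr_upgrade`, the health check, and the hostname are hypothetical names introduced for the example.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Cluster:
    name: str
    version: str
    data: dict


def dr_upgrade(dns: Dict[str, str], hostname: str, production: Cluster,
               new_version: str, is_healthy: Callable[[Cluster], bool]) -> Cluster:
    """Upgrade a copy of production and switch DNS only if the copy is healthy."""
    # 1. Cluster replication (deepcopy stands in for a backup restore).
    replica = Cluster(f"{production.name}-copy", production.version,
                      copy.deepcopy(production.data))
    # 2. Copy upgrade: production keeps serving traffic in the meantime.
    replica.version = new_version
    if not is_healthy(replica):
        return production  # DNS untouched, users never saw the failed copy
    # 3. DNS switch.
    dns[hostname] = replica.name
    # 4. The old cluster is left intact as a standby for rollback.
    return replica


dns = {"auth.example.com": "prod"}
prod = Cluster("prod", "v1", {"realms": ["customers"]})
active = dr_upgrade(dns, "auth.example.com", prod, "v2", lambda c: True)
```

If the health check fails, the function returns the original cluster and leaves DNS alone, so a bad upgrade is discarded without users ever reaching it.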

Step 5: Handling Rollbacks

Rollbacks are essentially the reverse of the upgrade process.

1. Tear Down Problematic Stateful Components

Completely remove the current (faulty) stateful components. This is crucial to avoid leaving the cluster in a partially upgraded state or with corrupted data.

2. Restore from Backup

Utilize the creation job with the previous backup to rebuild a functioning cluster.

For zero-downtime rollbacks, an alternative Disaster Recovery (DR) approach can be used. This involves deploying the old version alongside the new, switching the DNS back to the old version, and then decommissioning the new, problematic version.
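
Both rollback paths can be sketched together. The example below assumes clusters and backups are plain dictionaries, that a backup records the version it was taken at, and that DNS is a hostname-to-cluster mapping; all names and fields are hypothetical and only illustrate the control flow.

```python
import copy
from typing import Dict, Optional


def rollback(dns: Dict[str, str], hostname: str, faulty: dict,
             backup: dict, standby: Optional[dict] = None) -> dict:
    """Return the cluster that serves traffic after the rollback."""
    if standby is not None:
        # Zero-downtime path: the old cluster is still running, so
        # switching DNS back to it is enough.
        dns[hostname] = standby["name"]
        return standby
    # Otherwise discard the faulty state entirely (no partial reuse)
    # and rebuild the cluster from the previous backup.
    return {
        "name": faulty["name"],
        "version": backup["version"],
        "data": copy.deepcopy(backup["data"]),
    }


dns = {"auth.example.com": "cluster-new"}
backup = {"version": "v1", "data": {"realms": ["customers"]}}
faulty = {"name": "cluster-new", "version": "v2", "data": {"realms": ["broken"]}}
old = {"name": "cluster-old", "version": "v1", "data": {"realms": ["customers"]}}

rebuilt = rollback(dns, "auth.example.com", faulty, backup)
switched = rollback(dns, "auth.example.com", faulty, backup, standby=old)
```

Note that the rebuilt cluster takes nothing from the faulty one except its name, mirroring the "tear down, then restore" order described above.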

Conclusion & Key Takeaways

Stateless vs. Stateful: Separating the application from its data and stateful elements simplifies upgrades and enables flexible rollback strategies.
Versioned Deployments: Tying Keycloak deployments to a version number centralizes upgrade logic and maintains consistency.
Backups Are Everything: Treat backups (and the ability to restore from them) as a first-class feature. They are the foundation for reliable rollbacks and zero-downtime upgrades.
Flexible Approaches: Depending on the nature of the release, you can:
- Progressively roll out updates, or
- Perform a full swap using DR copy and DNS switching.
Client Collaboration: Allowing custom extensions means customers control their code, but you must also provide a robust rollback mechanism to protect them (and you) from version incompatibilities.

By following these principles, we ensure our IAM platform remains up-to-date, consistent, and resilient—no matter how large or varied our clients’ needs become.