Azure Kubernetes Service (AKS) Baseline Cluster for Regulated Workloads
This reference implementation demonstrates the recommended starting (baseline) infrastructure architecture for an AKS cluster that is under regulatory compliance requirements (such as PCI). This implementation builds directly upon the AKS Baseline Cluster reference implementation and adds implementation points that are more commonly seen in regulated environments than in typical "public cloud" consumption patterns.
🎓 Foundational Understanding |
---|
If you haven't familiarized yourself with the general-purpose AKS baseline cluster architecture, you should start there before continuing here. This architecture is constructed from the AKS baseline, which is the foundation for this body of work. This reference implementation avoids rearticulating points that are already addressed in the AKS baseline cluster. |
Compliance
:warning: | These artifacts have not been certified in any official capacity; regulatory compliance is a shared responsibility between you and your hosting provider. This implementation is designed to aid you on your journey to achieving compliance, but by itself does not ensure any level of compliance. To understand Azure compliance and shared responsibility models, visit the Microsoft Trust Center. |
---|
Azure and AKS are well positioned to give you the tools and allow you to build processes necessary to help you achieve a compliant hosting infrastructure. The implementation details can be complex, as is the overall process of compliance. We walk through the deployment here in a rather verbose method to help you understand each component of this architecture, teaching you about each layer and providing you with the knowledge necessary to apply it to your unique compliance scoped workload.
Even if you are not in a regulated environment, this infrastructure demonstrates an AKS cluster with a heightened security posture compared with the general-purpose cluster presented in the AKS baseline. You might find it useful to take select concepts from here and apply them to your non-regulated workloads (at the tradeoff of added complexity and hosting costs).
Azure Architecture Center guidance
This project has a companion set of articles that describe challenges, design patterns, and best practices for an AKS cluster designed to host workloads that fall in PCI-DSS 3.2.1 scope. You can find these articles on the Azure Architecture Center at Azure Kubernetes Service (AKS) regulated cluster for PCI-DSS 3.2.1. If you haven't reviewed them, we suggest you read them, as they will give added context to the considerations applied in this implementation. This repo primarily focuses on deployment concerns, while compliance concerns are mostly addressed in the linked article series above.
Architecture
This reference implementation is infrastructure focused, more so than workload. It concentrates on dealing with the AKS cluster itself. This implementation will touch on workload concerns, but does not contain end-to-end guidance on in-scope workload architecture, container security, or isolation. There are some good practices demonstrated and others talked about, but it is not exhaustive.
The implementation presented here is the minimum starting point for most AKS clusters falling into a compliance scope. This implementation integrates with Azure services that will deliver observability, provide a network topology that will support public traffic isolation, and keep the in-cluster traffic secure as well. This architecture should be considered your architectural starting point for preproduction and production stages of clusters hosting regulated workloads.
The material here is relatively dense. We strongly encourage you to start by reading the Azure Architecture Center guidance linked above and then dedicate at least four hours to walk through these instructions, with a mind to learning. You will not find any "one click" deployment here. However, once you've understood the components involved and identified the shared responsibilities between your team and your greater IT organization, we encourage you to build auditable deployment processes around your final infrastructure.
Finally, this implementation uses a small, custom application as an example workload. This workload is minimally interesting, as it is here exclusively to help you experience the infrastructure and illustrate network and security controls in place. The workload, and its deployment, does not represent any sort of "best practices" for regulated workloads.
Core architecture components
Azure platform
- AKS v1.30
- System and User node pool separation
- AKS-managed Microsoft Entra ID
- Managed Identities for kubelet and control plane
- Azure CNI
- Azure Monitor for containers
- Private Cluster (Kubernetes API Server)
- Azure Workload Identity
- Azure Virtual Networks (hub-spoke)
- Azure Firewall managed egress
- Hub-proxied DNS
- BYO Private DNS Zone for AKS with no public DNS representation
- Azure Application Gateway (WAF - OWASP 3.2)
- AKS-managed Internal Load Balancers
- Azure Bastion for maintenance access
- Private Link enabled Key Vault and Azure Container Registry
- Private Azure Container Registry Task Runners
In-cluster open-source software components
- Azure Workload Identity [AKS-managed add-on]
- Flux GitOps Operator [AKS-managed extension]
- Falco
- Kubernetes Reboot Daemon
- Secrets Store CSI Driver for Kubernetes [AKS-managed add-on]
- NGINX Ingress Controller
- Open Service Mesh
Network topology
Workload HTTPS request flow
Deploy the reference implementation
A deployment of AKS-hosted workloads typically experiences a separation of duties and lifecycle management in the area of identity & security group management, the host network, the cluster infrastructure, and finally the workload itself. This reference implementation will have you working across these various roles. Regulated environments require strong, documented separation of concerns, but ultimately you'll decide where each boundary should be.
Also, remember the primary purpose of this body of work is to illustrate the topology and decisions made in this cluster. A guided, "step-by-step" flow will help you learn the pieces of the solution and give you insight into the relationship between them. A bedrock understanding of your infrastructure, its supply chain, and its "Day-2" workflows is critical for compliance concerns. If you cannot explain each decision point and its rationale, audit conversations can quickly turn uncomfortable.
Ultimately, lifecycle/SDLC management of your cluster, its dependencies, and your workloads will depend on your specific situation. You'll need to account for team roles, centralized and decentralized IT roles, organizational standards, industry expectations, and specific mandates by your compliance auditor.
Start this learning journey in the Prepare the subscription section. If you follow this through the end, you'll have our recommended baseline cluster for regulated industries installed, with a sample workload running for you to reference in your own Azure subscription.
1. :rocket: Prepare the subscription
There are considerations that must be addressed before you start deploying your cluster. Do I have enough permissions in my subscription and Microsoft Entra tenant to do a deployment of this size? How much of this will be handled by my team directly vs having another team be responsible?
- Begin by ensuring you install and meet the prerequisites.
- Procure required TLS certificates.
- Plan your Microsoft Entra integration.
- Apply Azure Policy and Microsoft Defender for Cloud configuration to your target subscription.
2. Build regional networking hub
This reference implementation is built on a traditional hub-spoke model, typically found in your organization's Connectivity subscription. The hub will contain Azure Firewall, a DNS forwarder, and Azure Bastion services.
- Build the regional hub to control and monitor spoke traffic.
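Peered virtual networks in a hub-spoke topology need non-overlapping address spaces, so it can help to sanity-check your address plan before deploying anything. The following is a minimal sketch using only the Python standard library; the CIDR ranges and network names are hypothetical placeholders, not the values used by this walkthrough.

```python
import ipaddress

# Hypothetical address plan (assumption): substitute the ranges your
# networking team has actually allocated for the hub and each spoke.
hub = ipaddress.ip_network("10.200.0.0/24")
spokes = {
    "cluster-spoke": ipaddress.ip_network("10.240.0.0/16"),
    "ops-spoke": ipaddress.ip_network("10.241.0.0/28"),
}


def find_overlaps(hub_net, spoke_nets):
    """Return (name_a, name_b) pairs whose address ranges overlap."""
    named = [("hub", hub_net)] + list(spoke_nets.items())
    overlaps = []
    for i, (name_a, net_a) in enumerate(named):
        for name_b, net_b in named[i + 1:]:
            if net_a.overlaps(net_b):
                overlaps.append((name_a, name_b))
    return overlaps


# An empty list means the plan has no conflicting ranges.
print(find_overlaps(hub, spokes))
```

A check like this could run in a pipeline before network deployment, so that address conflicts surface as a failed validation rather than a failed peering.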
3. Plan Kubernetes API server access
Because the AKS cluster is deployed as a "private cluster," the control plane is not exposed to the internet. Management can now only be performed with network line of sight to the private endpoint exposed by AKS. In this case, you'll build an Azure Bastion-fronted jump box.
- Build cluster operations VM image in an isolated network spoke.
- Build cloud-init configuration for the operations VM image.
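One way to reason about "network line of sight" is to check what the API server name resolves to from a given machine. The sketch below is an illustration only: `resolves_privately` and `lookup` are hypothetical helper names, and the sample address is made up. It relies on `ipaddress.is_private`, which is a coarse check because it also treats loopback and link-local addresses as private.

```python
import ipaddress
import socket


def resolves_privately(addresses):
    """True only if at least one address is given and all are private.

    Note: is_private also covers loopback and link-local ranges, so treat
    this as a coarse signal rather than proof of a private endpoint.
    """
    return bool(addresses) and all(
        ipaddress.ip_address(a).is_private for a in addresses
    )


def lookup(fqdn, port=443):
    """Resolve a host name to the set of IP addresses DNS returns for it."""
    infos = socket.getaddrinfo(fqdn, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


# Offline example with a made-up private address (assumption).
print(resolves_privately(["10.240.4.4"]))  # True
```

From the jump box, you could pass the result of `lookup("<your API server FQDN>")` to `resolves_privately`; from a machine without line of sight, the lookup would be expected to fail or return nothing usable.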
4. Deploy the cluster
This is the heart of the guidance in this reference implementation, paired with the prior network topology guidance. Here you will deploy the Azure resources for your cluster and the adjacent services such as Azure Application Gateway WAF, Azure Monitor, Azure Container Registry, and Azure Key Vault. This is also where you will validate the cluster is bootstrapped.
- Deploy the target network spoke that the cluster will be homed to.
- Prep for cluster bootstrapping by deploying and populating Azure Container Registry and Azure Key Vault. This includes passing cluster images through quarantine.
- Deploy the AKS cluster and supporting services.
- Validate cluster access and bootstrapping.
5. Deploy your workload
A simple workload made up of four interconnected services is manually deployed across two namespaces to illustrate concepts such as nodepool placement, zero-trust network policies, and external infrastructure protections offered by the applied NSGs and Azure Firewall rules.
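Zero-trust network policies usually start from a "default deny" posture, with specific allow rules layered on top. As a rough sketch (not the manifests shipped with this implementation), the following generates a standard Kubernetes default-deny `NetworkPolicy` as JSON. The namespace name `example-in-scope` and the policy name `default-deny` are hypothetical.

```python
import json


def default_deny_policy(namespace):
    """Build a NetworkPolicy that blocks all ingress and egress in a namespace.

    An empty podSelector selects every pod in the namespace; listing both
    policy types with no rules means no traffic is allowed in or out until
    additional policies explicitly permit it.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress", "Egress"],
        },
    }


# Hypothetical namespace name (assumption); use your workload's namespaces.
print(json.dumps(default_deny_policy("example-in-scope"), indent=2))
```

The JSON output can be saved to a file and applied like any other manifest. Each service then needs its own narrowly scoped allow policy, including one for DNS egress, before it can communicate.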
6. :checkered_flag: Validation
Now that the cluster and the sample workload are deployed, it's time to look at how the cluster is functioning.
7. :broom: Clean up resources
Most of the Azure resources deployed in the prior steps will have ongoing billing impact unless removed.
Separation of duties
All workloads that find themselves in compliance scope usually require a documented separation of duties/concern implementation plan. Kubernetes poses an interesting challenge in that it involves a significant number of roles typically found across an IT organization: networking, identity, SecOps, governance, workload teams, cluster operations, deployment pipelines, and many more. If you're looking for a starting point on how you might consider breaking up the roles that are adjacent to the AKS cluster, consider reviewing our Microsoft Entra role guide shipped as part of this reference implementation.
:notebook: For more information, see Azure Architecture Center guidance for PCI-DSS 3.2.1 Requirement 7, 8, and 9 in AKS.
Is that all, what about ... !?
Yes, there are concerns that extend beyond what this implementation could reasonably demonstrate for a general audience. This reference implementation strives to be accessible for most people without putting undue burdens on the subscription brought to this walkthrough. This means SKU choices with relatively large default quotas, not using features that have very limited regional availability, not asking learners to be overwhelmed with "Bring your own encryption key" options for services, and similar. All in hopes that more people can complete this walkthrough without disruption or excessive coordination with subscription or management group owners.
For your implementation, take this starting point and add on additional security measures talked about throughout the walkthrough and the Azure Architecture Center guidance that were not directly implemented. For example, enable JIT and Conditional Access Policies, use Encryption-at-Host features if applicable to your workload, and so on.
For a list of additional considerations for your architecture, please see our Additional Considerations document.
Cost
This reference implementation runs idle at around $95 (US Dollars) per day within the first 30 days, and you can expect it to increase over time as some Microsoft Defender for Cloud tooling has a free-trial period and logs will continue to accrue. The largest contributors to the starting cost are Azure Firewall, the AKS nodepools (virtual machine scale sets), and Log Analytics. While some costs are usually cluster operator costs, such as nodepool VMSS, Log Analytics, and incremental Microsoft Defender for Cloud costs, others will likely be amortized across multiple business units or applications, such as Azure Firewall.
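To put the daily figure in perspective, here is a small back-of-the-envelope calculation. It assumes the idle rate stays flat at the quoted $95 per day, which the text above only claims for the first 30 days; actual billing will vary by region and usage.

```python
# Approximate idle rate quoted above, in US dollars per day.
DAILY_IDLE_USD = 95


def idle_cost(days, daily_rate=DAILY_IDLE_USD):
    """Estimate idle spend for a number of days at a flat daily rate."""
    return days * daily_rate


# Leaving the deployment idle for the full first 30 days.
print(idle_cost(30))  # 2850
```

Since the rate is expected to rise after the first 30 days, treat any longer projection from this function as a lower bound and clean up resources once you finish the walkthrough.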
Some customers will opt to amortize cluster costs across workloads by hosting a multitenant cluster within their organization, maximizing density with workload diversity. Doing so with regulated workloads is not advised. Regulated environments will generally prioritize compliance and security (isolation) over cost (diverse density).
Final thoughts
Kubernetes is a very flexible platform, giving infrastructure and application operators many choices to achieve their business and technology objectives. At points along your journey, you will need to consider when to take dependencies on Azure platform features, CNCF OSS solutions, ISV solutions, support channels, and what operational processes need to be in place. We encourage you to use this reference implementation as the place to start architectural conversations within your own team, adapting it to your specific requirements and ultimately delivering a solution that delights your customers and your auditors.
Related documentation
- Azure Kubernetes Service Baseline Architecture
- Azure Kubernetes Service Documentation
- Microsoft Azure Well-Architected Framework
- Microservices architecture on AKS
Contributions
Please see our contributor guide.
This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
With :heart: from Microsoft Patterns & Practices, Azure Architecture Center.