Kubernetes state management with Pulumi and Python
I like the Kubernetes way of declarative workload configuration, but handling cluster state with dozens or hundreds of YAML files is impractical.
Of course, one can just combine them all into a single uber-YAML. But the harsh reality is that, despite the fact that Kubernetes by design can and will apply this configuration asynchronously and the cluster will eventually converge to the desired state, this “eventually” might be equal to infinity.
There are certain cases when order matters, for instance when new CRD definitions are added and then new objects of that kind are declared.
Another aspect is complexity, which can be encapsulated by tools such as Helm. While Helm is a good solution for the problem of installing third-party apps, it’s not necessarily the right choice for your own services, or for lightweight overall cluster configuration.
And one more thing. I enjoy the Kubernetes architecture, even (and especially!) the fact that numerous abstractions are needed to “canonically” expose a single container to the rest of the world. But that doesn’t mean I enjoy breaking the DRY principle and copy-paste-modifying the same YAMLs over and over.
So… Pulumi to the rescue!
Disclaimer: I like declarative configuration. I really do. There’s no need to think about what can go wrong, or what needs to be done to the infrastructure to change it from the current to the desired state. You just declare the target state, and the system decides on its own the effective diff between the two. That’s not the case with Ansible, for instance: you have to explicitly describe all the changes. But Ansible is so simple, so good and so predictable that I keep using it for simple infrastructure configuration.
Bearing that in mind, I was looking for a tool that could be used to automate cluster configuration after provisioning:
- Apply numerous YAMLs or pseudo-YAMLs (good old Kubernetes configs), and ideally refactor them by extracting common parts.
- Apply Helm charts.
- Extract configuration parameters (DNS names, IP addresses, secrets) and store them separately.
Alternatives
I’ve looked at different tools, and here’s what I disliked about them:
- Plain YAMLs. These are just unmanageable. And not DRY enough for me (plus all the issues mentioned above). Even with kustomize. Even with kpt.
- Ansible. Uh… Thanks, but no, thanks. Not declarative enough for me :)
- Terraform. Now, here’s the thing about Terraform. I tried to like it, but I simply don’t. At this point in my life I don’t understand why yet another DSL should be used in a situation where a general-purpose programming language could be used instead. Especially considering that some of these languages already have features that enable programmers to build DSLs on top of them (think of Groovy or Kotlin). Real languages, with loops and conditions. With debuggers, code styles, test frameworks and all the other features that mature development technologies provide. HCL is what I don’t like in Terraform. Sorry, it’s just my opinion. And yes, I super-enjoyed it when the HashiCorp folks made breaking changes to the syntax between 0.11 and 0.12. It’s possible to migrate Kubernetes descriptors from YAML to HCL. I don’t think it’ll make them more DRY though. Probably more vendor-locked.
- Pulumi. I heard about it on an episode of the Kubernetes Podcast, and was happy to find like-minded developers. Pulumi supports Python, JavaScript and TypeScript, and as of version 2 (released just days ago) also Go and .NET. Pulumi supports numerous cloud providers, and even has a Terraform bridge. In regards to Kubernetes, Pulumi claims to support declarative configuration via these programming languages, but also existing YAML descriptors and Helm charts.
Pulumi 101
A detailed explanation of how Pulumi works can be found on their website, but in a nutshell it’s simple:
- Run a program written in a general-purpose language using their SDK that declares the desired state (a minimal sketch follows this list).
- Compare it to the actual state stored by Pulumi (that state might actually differ from the real state of the cluster, in which case you might need to pulumi refresh it).
- Perform the necessary operations (using the K8s API Server) to ensure that the actual state matches the desired state.
- Store the new actual state.
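To make that flow concrete, here’s a minimal sketch of what such a program can look like in Python; the resource names and the nginx image are placeholders, not something from my actual setup:

import pulumi
from pulumi_kubernetes.apps.v1 import Deployment
from pulumi_kubernetes.core.v1 import Namespace

# Desired state: a namespace and a two-replica nginx Deployment inside it.
ns = Namespace("demo", metadata={"name": "demo"})

Deployment(
    "nginx",
    metadata={"name": "nginx", "namespace": "demo"},
    spec={
        "replicas": 2,
        "selector": {"match_labels": {"app": "nginx"}},
        "template": {
            "metadata": {"labels": {"app": "nginx"}},
            "spec": {"containers": [{"name": "nginx", "image": "nginx:1.17"}]},
        },
    },
    # Make the ordering explicit, since the namespace is referenced only by name.
    opts=pulumi.ResourceOptions(depends_on=[ns]),
)

Running pulumi up then computes the diff between this declaration and the stored state and applies only the difference.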
Unfortunately, I am not coding day-to-day anymore, so I don’t have hands-on experience with any of the languages supported by Pulumi, but Python and Go are on my todo list. When I started looking at Pulumi, Go support was not yet GA, hence the choice was obvious: Python.
In the end I was able to cover the majority of my use cases (all except Helm), and now I have 95% confidence (and have already verified it) that I can configure a new cluster after spinning it up with a simple plan:
- Run Pulumi (configuration: skip CRDs; a sketch of this flag follows the list)
- Deploy Helm charts (containing CRD definitions) with Helmsman
- Run Pulumi (configuration: enable CRDs)
- Do one small kubectl create/kubectl replace - more on that below
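To illustrate the “skip CRDs / enable CRDs” switch from steps 1 and 3, here’s a hedged sketch of how such a flag can be read from the stack config; the key name provisionCrdObjects and the Certificate object are made up for illustration:

import pulumi
from pulumi_kubernetes.apiextensions import CustomResource

config = pulumi.Config()

# First run: the flag is unset/false, so CRD-backed objects are not declared at all.
# After Helmsman has installed the charts (and their CRDs), flip the flag and re-run.
if config.get_bool("provisionCrdObjects"):
    CustomResource(
        "demo-cert",
        api_version="cert-manager.io/v1alpha2",
        kind="Certificate",
        metadata={"name": "demo-cert", "namespace": "default"},
        spec={
            "secretName": "demo-cert-tls",
            "dnsNames": ["demo.example.org"],
            "issuerRef": {"name": "letsencrypt", "kind": "ClusterIssuer"},
        },
    )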
Pros & cons
Pulumi has the following pros for me (in regards to K8s handling):
- Real programming language support. Possibility to extract common parts into functions and then reuse them (a short sketch follows this list).
- Open source. Free. Not vendor-locked to their backend. It is possible to use their SaaS backend, but it’s also possible to store the state locally or on self-managed backends (AWS S3, GCP GCS, Azure Blob). The same applies to secrets management: Pulumi SaaS, encrypted locally with AES-256-GCM, using a cloud KMS (AWS, GCP, Azure), or HashiCorp Vault.
- YAML generation instead of applying changes to the server. That’s a continuation of their awesomeness in the fight against vendor lock-in! Not happy with Pulumi? Generate YAMLs and move on. Kudos to the Pulumi team for a wise decision!
- YAML provisioning support (and even customization via transformations) - super-useful to provision existing external apps that don’t have Helm charts, or where Helm charts add unnecessary complexity (Fluent Bit, for instance).
- Automatic SDK generation from the k8s OpenAPI spec. A very wise decision, allowing Pulumi to quickly stay up to date with the latest changes in the spec.
- They provide a toolkit for automated infrastructure testing, but I haven’t tried it.
- Detailed documentation. Lots of examples for all supported languages, as well as API docs. Amazing.
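To give a hedged example of the first pro (extracting common parts into functions), a small helper can capture a convention once and be reused per service; the names and ports below are made up:

from pulumi_kubernetes.core.v1 import Service


def internal_service(name, namespace, port, target_port):
    # ClusterIP Service following one shared label/selector convention.
    return Service(
        name,
        metadata={"name": name, "namespace": namespace, "labels": {"app": name}},
        spec={
            "type": "ClusterIP",
            "selector": {"app": name},
            "ports": [{"port": port, "target_port": target_port}],
        },
    )


# Reuse instead of copy-paste-modifying the same YAML over and over.
internal_service("billing", "backend", 80, 8080)
internal_service("reports", "backend", 80, 8080)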
And in my opinion these are the drawbacks (again, for K8s only):
- Auto-naming approach. That’s a design decision. I don’t like it; IMHO it’s a leaky abstraction (resources from existing YAMLs or Helm charts are not auto-named), but it’s an easy fix: explicitly provide a name in metadata.name (see the short sketch after this list) and take responsibility for potential clashes. But you should be doing that already.
- Helm support via local template rendering. Helm v2 was a painful experience (Tiller and friends), but v3 made it a lot easier. And while Kubernetes itself is moving apply to the server, Helm moved it to the client. A wise decision for Helm IMHO, and now they can leverage native server-side k8s apply. In my opinion, Pulumi should’ve delegated the underlying resource management to Helm, but they made a decision to manage granular resources within Pulumi. In fact, Helm v3 support is identical to v2 (at least in Python), implemented with a single line:
from ..v2.helm import * # v3 support is identical to v2
I don’t like the decision that Pulumi made; it doesn’t work for me:
- There are problems with charts rendered via helm template: for instance, lifecycle hooks are not supported (#555, #666), CRDs are not installed (#1023), etc.
- It’s impossible to leverage the existing Helm tooling for the installed releases, as Helm won’t see them. What’s crucial to me is finding out about updates to the installed charts, and helm whatup won’t work for that.
- Lack of first-class Namespace support. That one really makes me sad, but it’s also an easy fix: specify the namespace name explicitly in metadata.namespace. However, in my opinion that should be handled differently by Pulumi, providing a better abstraction, as that’s a crucial object within Kubernetes.
- No type hinting for Kubernetes objects in Python. It looks like they’ve implemented it for Go, but in Python it’s all lists and dicts in metadata, spec and everywhere else. Pulumi converts canonical Python snake_case to Kubernetes camelCase and vice versa, but that’s about it. Other than that, you’re on your own. And as far as I can tell, Pulumi does not (or at least did not, as of v1) perform validation against the OpenAPI spec. I remember errors surfacing only when objects with misspelled properties were actually applied.
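To make the auto-naming and Namespace fixes concrete, here’s a short hedged sketch of what “specify it explicitly” means (all names below are placeholders):

from pulumi_kubernetes.core.v1 import ConfigMap, Namespace

Namespace("monitoring", metadata={"name": "monitoring"})

# An explicit metadata.name turns Pulumi's auto-naming off (no random suffix),
# and metadata.namespace has to be spelled out as well.
ConfigMap(
    "grafana-settings",
    metadata={"name": "grafana-settings", "namespace": "monitoring"},
    data={"log_level": "info"},
)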
My Pulumi Infrastructure as Code
- Separate stacks for k8s clusters
- Python
- GCP GCS for state storage (recently replaced with local due to #4258)
- Locally encrypted secrets with the passphrase provided via PULUMI_CONFIG_PASSPHRASE
- Helm managed outside of Pulumi. A flag in the Pulumi config enables provisioning of CRDs that depend on definitions from Helm charts.
- YAML files for third-party services (Fluent Bit, K8s dashboard) stored as resources and applied via pulumi_kubernetes.ConfigFile with transformations (a short sketch follows this list)
- Grafana dashboard ConfigMaps generated as YAML files by a separate provider and applied manually (workaround for #1048)
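A hedged sketch of the ConfigFile-plus-transformations approach; the file path and the namespace-pinning transformation are illustrative, not my exact code:

from pulumi_kubernetes.yaml import ConfigFile


def pin_namespace(obj, opts):
    # Transformation: force every object from the manifest into the logging namespace.
    if "metadata" in obj:
        obj["metadata"]["namespace"] = "logging"


ConfigFile(
    "fluent-bit",
    "resources/fluent-bit.yaml",  # local copy of the upstream manifest
    transformations=[pin_namespace],
)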
Issues
Pulumi is a relatively new tool, and probably not as popular as Terraform. It has certain issues, but none are critical for me. Trying to be a good citizen, I’ve at least documented them on GitHub.
- Lack of native Helm 3 support. As I’ve mentioned before, that’s a design decision. As a result, I am simply using another tool for Helm chart management.
- Resources with long manifests cannot be created due to a too-long last-applied-configuration annotation. That’s not an issue of Pulumi per se, but rather one of kubectl apply. However, due to that issue I am currently unable to provision ConfigMaps with Grafana dashboards. Workaround (also described in the GitHub issue; a sketch of the annotation-stripping step follows this list):
  - separate provider generating YAML
  - Python script removing the kubectl.kubernetes.io/last-applied-configuration annotation
  - kubectl create or kubectl replace
- “Released” PersistentVolumes with Reclaim Policy “Retain” are not recreated. They have to be manually removed and re-created by Pulumi (after pulumi refresh).
- The resource getter function does not work for CustomResourceDefinition. Again, workaround in the issue.
- Allow conditional resource creation depending on the existence of external resources. Not Kubernetes-specific, hence marked as a duplicate of #3364.
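For completeness, the annotation-stripping step from the Grafana-dashboards workaround above can be a tiny script along these lines; the file names are made up and PyYAML is assumed to be available:

import yaml  # PyYAML

# The rendered manifests carry a huge kubectl.kubernetes.io/last-applied-configuration
# annotation; drop it so kubectl create / kubectl replace can ingest the ConfigMaps.
with open("rendered/grafana-dashboards.yaml") as src:
    docs = [d for d in yaml.safe_load_all(src) if d]

for doc in docs:
    doc.get("metadata", {}).get("annotations", {}).pop(
        "kubectl.kubernetes.io/last-applied-configuration", None
    )

with open("rendered/grafana-dashboards-clean.yaml", "w") as dst:
    yaml.safe_dump_all(docs, dst)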
GCP GCS state storage issue
One more issue I’ve suffered from recently: rate limit errors for the GCS bucket (#4258).
Symptoms in logs:
kubernetes:core:Secret (kubernetes-dashboard/kubernetes-dashboard-csrf):
error: pre-step event returned an error: failed to save snapshot: An IO error occurred during the current operation: blob (key ".pulumi/stacks/dev.json") (code=Unknown): googleapi: Error 429: The rate of change requests to the object /path/to/code/.pulumi/stacks/dev.json exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded
kubernetes:core:Endpoints (XXX):
error: pre-step event returned an error: failed to save snapshot: An IO error occurred during the current operation: blob (key ".pulumi/stacks/dev.json") (code=Unknown): googleapi: Error 429: The rate of change requests to the object /path/to/code/.pulumi/stacks/dev.json exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded
pulumi:pulumi:Stack (YYY-dev):
error: update failed
Current workaround: migrated to local storage.
Conclusion
All in all, I am extremely happy with Pulumi. I use it for all provisioning except Helm charts.
And I know for sure that if for some reason I want to move off of it, it would be as simple as configuring the K8s provider to generate YAML descriptors. And that’s one of the best design decisions for such a tool.