Leaving Olark after nine years

In June of 2016 I thought I was interviewing for a python gig at a small chat company. Instead, Olark’s engineering director noticed “kubernetes” on my resume and I ended up on a zoom with two engineers from the company’s Engineering Operations team. They were building an SRE culture to drive improvements in reliability via a migration to kubernetes for container orchestration. Was I interested in a role with them instead? I was. A couple of weeks later, during onboarding in Ann Arbor, we started hacking out the first version of our tooling for deploying containers to a k8s cluster from a CI/CD pipeline. We would build at least two more versions of that tooling over the next nine years, a period which has been one of the most rewarding and productive of my career. I was privileged to work with a group of very smart engineers, at a company small enough to be friendly and comfortable, and yet large enough to have systems you could get your teeth into.


Alas, nothing lasts forever and August 29th will be my final day on the job with Olark. I’m not sure yet what my next thing is, but I am sure I will miss the friends I made there and the work we did together. I wish them all much success and endlessly incrementing karma. To help me put a sort of mental wrap on it all I am going to give a quick list of what I feel are the highlights of my time with a company whose name was never, as best I can recall, fully explained to me. I did get enough to know it’s not about the bird.

Migrating to Google Cloud

When I first joined Olark we ran on a bunch of VMs on the Rackspace public cloud. Olark was a Y-Combinator alum, and I think one of their graduation gifts was credits or some similar enticement in that direction. That was my first experience with pager rotation, and in my first week I was awakened at least once because one of those VMs had inexplicably lost its network interface or boot disk and sent nagios screaming into the night. Incidents were depressingly common. Our initial plan had been to move services directly into GKE and leave all that VM + puppet + jenkins stuff behind, but the situation got so bad that we decided to pick it up lock, stock and barrel and move to Google. We all worked on everything, but my daily focus was building out the terraform scaffolding to create service instances and networking, and creating the base images that allowed the systems to join our puppet show. I cannot fail to mention the time I ran a terraform command locally in the wrong folder and deleted 180 VMs. Anyway, the Saturday in March of 2017 when we flipped the DNS switch and lit it all up remains one of the high points of my career.

Kubernetes: it’s Greek for something

Kubernetes on GKE was always where we were heading. Some of our services were already containerized, but most were python processes deployed by jenkins and configured by shell scripts dumped onto the instances by a custom daemon. When we updated config we had to ssh to all the boxes and restart everything. Moving it all to containers and into kubernetes took probably two years, and we had a lot to learn along the way, from cluster sizing to resource management, networking and autoscaling. As this migration evolved we replaced our initial custom CI/CD tooling with helm charts, and later mostly moved from those to a toolchain using kustomize to patch yaml. Before joining Olark I had played with running http servers and whatnot in early beta clusters. Here at the end of my time with the company we run five clusters with nearly 100 nodes. The curve of our adoption of k8s closely paralleled the curve of Google’s rollout and evolution of GKE. In those years from 2017 through 2022 we worked through a constant stream of changes, everything from statefulsets to ephemeral storage and container-native load balancing. Throughout it all GKE has been an extremely performant and reliable platform for us, and I think it’s fair to say we met all of our initial goals for the migration and then some.
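
For readers who haven’t used kustomize, here’s a toy sketch in Python (with PyYAML) of the base-plus-overlay patching idea it automates; the manifest and image names are made up, and real kustomize does a much smarter strategic merge than this simple dict merge:

```python
# Toy illustration of base + overlay patching (not our actual tooling, not
# kustomize itself). The manifest and names below are hypothetical.
import copy
import yaml

BASE = """
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chat-api
spec:
  replicas: 2
  template:
    spec:
      containers:
      - name: chat-api
        image: registry.example.com/chat-api:latest
"""

# A "production overlay": bump replicas and pin the image tag.
PATCH = {
    "spec": {
        "replicas": 6,
        "template": {
            "spec": {
                "containers": [
                    {"name": "chat-api",
                     "image": "registry.example.com/chat-api:v1.2.3"}
                ]
            }
        },
    }
}

def deep_merge(base: dict, patch: dict) -> dict:
    """Recursively merge patch into base; dicts merge, lists are replaced."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

if __name__ == "__main__":
    manifest = deep_merge(yaml.safe_load(BASE), PATCH)
    print(yaml.safe_dump(manifest, sort_keys=False))
```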

All you need is logs (and then metrics)

We hadn’t been on GCP for long when our monthly bills convinced us that we were going to need our own aggregation, storage and visualization pipeline for logs and metrics. Google had just bought Stackdriver at the time, and while I don’t remember the pricing model or numbers I do remember it wasn’t working for us. I think the value prop has since flipped, but at the time building our own was the only option. We ended up with fluent-bit on our remaining VMs, fluentd in the cluster (a daemonset tailing container logs on every node and a dedicated indexing deployment), elasticsearch for storage and kibana for querying. At peak utilization we were handling 15–20k log lines per second. Figuring out the sizing of the elasticsearch cluster was a bit of black magic, and we ended up replacing the whole thing a year in, but it’s been doing its job now for seven years. Later on I got to implement prometheus and grafana for monitoring, and probably the best part of that was working with all the other engineers to instrument the services they were responsible for.
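
To give a flavor of what that instrumentation looked like, here’s a minimal sketch using the prometheus_client library; the metric names, labels and port are hypothetical, not what we actually shipped:

```python
# Minimal sketch of service instrumentation with prometheus_client.
# Metric names, labels and the port are illustrative only.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "chat_requests_total", "Total requests handled", ["endpoint", "status"]
)
LATENCY = Histogram(
    "chat_request_duration_seconds", "Request latency in seconds", ["endpoint"]
)

def handle_request(endpoint: str) -> None:
    # Time the work and count the result for this endpoint.
    with LATENCY.labels(endpoint=endpoint).time():
        time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work
    REQUESTS.labels(endpoint=endpoint, status="200").inc()

if __name__ == "__main__":
    start_http_server(9100)   # exposes /metrics for prometheus to scrape
    while True:
        handle_request("/send")
```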

Out of state

State is always the hard bit. We migrated into GCP with a bunch of databases, elasticsearch and redis clusters, etcd clusters and memcached instances. We messed around with moving some of it into k8s, but it was early days for statefulsets and we weren’t happy with the storage performance at the time. It quickly became a policy, and a medium-term goal, to get as much of that stuff onto managed services as we could. We moved the dbs to Cloud SQL first, then moved all the redis instances to Memorystore. We retained the elasticsearch clusters for lack of a fully-managed GCP-hosted alternative, but if one had been available we would certainly have been tempted. In all of these cases the hosted alternative has been more performant and cost-effective, and the migrations took a big chunk of responsibility off the backs of our small engineering team.

Migrating our DNS

After the migration to GCP our public DNS zones remained hosted on Rackspace, while our private zones were implemented on AWS Route 53. In mid-2023 we decided to move it all to Google Cloud DNS. The motivation was to finally close out our Rackspace account, clean up a lot of ancient cruft in the zones, and bring it all under terraform management so that it could be more accessible to the engineering team. We extracted all of our existing records as bind-formatted zone files, a process that was simple at AWS but required a support request at Rackspace. We then cleaned out all the obsolete records, wrote code to generate terraform markup from the zone files, created the new zones, wrote some more code to compare the records in the new and old zones and validate that they were identical, and finally updated the name server records at our registrar. The whole thing went very smoothly and without any disruption for our customers. We’ve since been quite happy with the performance of Cloud DNS and with managing our zones through terraform. The terraform markup for DNS records is pretty cumbersome, since each record is a named terraform resource, but it is still an improvement over the previous imperative approach to managing our DNS.
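
As a rough sketch of what that conversion step can look like, here’s some Python using dnspython to parse a bind zone file and emit google_dns_record_set resources. The zone file name, managed zone and resource naming scheme are hypothetical; this illustrates the approach rather than reproducing our actual script, which handled more record types and naming edge cases:

```python
# Sketch: turn a bind-formatted zone file into terraform markup for
# google_dns_record_set resources. File name, managed zone and resource
# naming are made up for illustration. Requires dnspython.
import dns.rdatatype
import dns.zone

def zone_to_terraform(path: str, origin: str) -> str:
    zone = dns.zone.from_file(path, origin=origin, relativize=False)
    blocks = []
    for name, ttl, rdata in zone.iterate_rdatas():
        if rdata.rdtype == dns.rdatatype.SOA:
            continue  # Cloud DNS manages the SOA record itself
        rtype = dns.rdatatype.to_text(rdata.rdtype)
        resource = f"{str(name).rstrip('.').replace('.', '_')}_{rtype.lower()}"
        blocks.append(
            f'resource "google_dns_record_set" "{resource}" {{\n'
            f'  managed_zone = "example-zone"\n'
            f'  name         = "{name}"\n'
            f'  type         = "{rtype}"\n'
            f'  ttl          = {ttl}\n'
            f'  rrdatas      = ["{rdata.to_text()}"]\n'
            f'}}\n'
        )
    return "\n".join(blocks)

if __name__ == "__main__":
    print(zone_to_terraform("example.com.zone", "example.com"))
```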

Thanks a bunch Edge.io

The looming holiday season of 2024 brought a fun email from Edge.io, the CDN platform we had been caching our content on since way back when they were Verizon Digital Media Services. They were shutting down the network in mid-January of 2025, and we would have to find a new place to cache our stuff. A notification that you have to switch to a new CDN and have only a couple of weeks to do it is not the kind of holiday card I like to find in the mailbox. After a crash program to investigate the capabilities and pricing of a number of leading CDN providers we settled on Cloudflare, and were able to migrate our objects and switch over seamlessly while saving a substantial amount of money. Cloudflare, of course, is much more than a CDN, and once we had the relationship in place we began finding uses for some of their other tools, like edge workers. Maybe that was the point. Well played, Cloudflare.

Shout out to GitLab

It’s not easy being a small engineering team running a complex distributed system built from dozens of critical open source components. One of the ways in which it is not easy is keeping up with the pace of change while ensuring that new versions of things play nice with older versions of other things, and that nothing breaks horribly because you had the temerity to move from happyfreestuff 2.02 to 2.03. The perfect storm is when something is critical, complex, and scary to upgrade. Such things tend to languish. We had been running our own GitLab server since not long after our migration to GCP. I’m not proud of it, but by mid-2024 the current version was 17.4, we were still chugging along on 13.whatever, and we had a problem: GitLab’s license file format had changed and we needed to upgrade in order to renew ours. Fortunately GitLab has an awesome upgrade path planning tool: you give it the current running version and the target version and it tells you which versions you need to install to get from A to B. We cloned the server, ran through the upgrade on the clone successfully, then did it again for reals. Building some new runners was easier than upgrading the ones we had, and we took the opportunity to give them a little more muscle too.

What it was that made Olark fun

I’ve been with Olark longer than any other company I’ve worked for, including the one I co-founded back in the late ’90s. When I think about what kept me there I end up with a short list of things that made it a great place to work. First, a sense of mission. The company had serious platform problems when I arrived. A lot of things had to be fixed. Big changes had to be made and there was support for making them. That’s always exciting. Second, a really sharp team of people I identified with (though I am older than most of them). I remember a moment during my work-along day when one of the other guys had broken my demo and we were laughing at a joke I cracked about it. I mentioned something about my computer and one of them asked if I built it. Yeah, of course I built it. Asus motherboard, Fractal Design case, EVGA 1070 Ti. We’re all geeks on this call. We all enjoyed building stuff, and the virtual high fives when you got a thing running were as real as virtual high fives get. That is the thing that still gets me up and to work every day, and the reason why I am still doing engineering after 30 years. So thanks for that, Olark. I wish you all luck… and I’ll just leave this here…

Exhausting conntrack table space crippled our k8s cluster

It’s been a couple of years since I’ve written on software or systems topics. No specific reason for that other than that I wrote a bunch back when kubernetes adoption was ramping up and I just got tired of the topic (even though there have been plenty of new things worth writing about!). Also pandemic, family responsibilities, work stuff, etc., etc. Before that I wrote mostly for publication on Medium. Over time I became less and less thrilled with Medium, and so I’ve decided that for any future work I’m going to publish here and syndicate there, and we’ll see how that goes. For this post I want to talk about something that snuck up and hobbled our RPC services in our production cluster a couple of weeks ago: conntrack table exhaustion.

It started, as many such episodes do, with a ping in our internal support channel. Some things were slow, and people were seeing errors. On our response video chat we combed through logs until someone noticed a bunch of “temporary failure in name resolution” messages, an error that cluster nodes and other VMs will log when they can’t, you know, resolve a name. We run on Google Cloud + GKE, and all of our DNS zones are on Google Cloud DNS. If DNS lookups were failing it wasn’t our doing. We’ve been on Google for almost seven years and the platform has been amazingly durable and performant… but the odd, brief episode of network shenanigans was not, up to that point, unknown to us.
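
Spoiler for the impatient: the telltale sign on an affected node is the conntrack entry count pressing up against the table maximum. Here’s a minimal sketch of that check, reading the standard netfilter proc files; the 90% warning threshold is an arbitrary choice for illustration:

```python
# Compare the live conntrack entry count to the table maximum on a node.
# The proc paths are the standard netfilter files; the 90% threshold is
# just a convenient number for this sketch.
def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
limit = read_int("/proc/sys/net/netfilter/nf_conntrack_max")
used = count / limit

print(f"conntrack entries: {count}/{limit} ({used:.1%})")
if used > 0.9:
    print("warning: conntrack table nearly full, new connections may be dropped")
```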

Continue reading

So you want Windows to show 24-hour time?

Originally published at https://medium.com/@betz.mark/so-you-want-windows-to-show-24-hour-time-eeac41062b73

I spend a large part of every day shelled into cloud servers, viewing logs, checking alerts in slack channels, looking at pages on my phone, glancing at the kitchen clock as I walk by to get coffee, and otherwise behaving like a typical engineer. These activities have something in common: they all involve timestamps of one form or another, and most of those timestamps are in different formats or time zones.

Yeah, I hate time zones, and you probably do too. Our servers are on UTC military time. Our slack channel shows 12-hour local time, as does the kitchen clock and my phone. My colleagues often report timestamps in their own local time, and since we’ve been a remote team for something like forever those might be EST, EDT, CDT, CST, PDT, PST… you get the point… and you’ve probably lived it just like the rest of us. I’ve considered just changing everything in my life to UTC military time, but I would irritate my wife and you can’t avoid hitting a disconnect somewhere. Still, I do want to make all the on-the-fly converting I have to do as easy as possible.
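
For reference, here’s a small sketch of that kind of conversion using Python’s standard library; the timestamp and zone names are just examples:

```python
# Render one UTC timestamp (say, from a server log) in a few local zones.
# Timestamp and zone names are examples only. Requires Python 3.9+ (zoneinfo).
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

stamp = datetime(2023, 3, 14, 18, 30, tzinfo=timezone.utc)

for zone in ("America/New_York", "America/Chicago", "America/Los_Angeles"):
    local = stamp.astimezone(ZoneInfo(zone))
    print(f"{zone:25} {local:%Y-%m-%d %H:%M %Z}")
```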

Continue reading

The day a chat message blew up prod

Originally published at https://medium.com/@betz.mark/the-day-a-chat-message-blew-up-prod-2c30941db07a

I don’t often write “in the trenches” stories about production issues, although I enjoy reading them. One of my favorite sources for that kind of tale is rachelbythebay. I’ve been inspired by her writing in the past, and I’m always on the lookout for opportunities to write more posts like hers. As it happens we experienced an unusual incident of our own recently. The symptoms and sequence of events involved make an interesting story and, as is often the case, contain a couple of valuable lessons. Not fun to work through at the time, but perhaps fun to replay here.

Some background: where I work we run a real-time chat service that provides an important communications tool to tens of thousands of businesses worldwide. Reliability is critical, and to ensure reliability we have invested a lot of engineering time into monitoring and logging. For alerting our general philosophy is to notify on root causes and to page the on-call engineer for customer-facing symptoms. Often the notifications we see in email or slack channels allow us to get ahead of a developing problem before it escalates to a page, and since pages mean our users are affected this is a good thing.

Continue reading

The cost of tailing logs in kubernetes

Originally published at https://medium.com/@betz.mark/the-cost-of-tailing-logs-in-kubernetes-aca2bfc6fe43

Logging is one of those plumbing things that often gets attention only when it’s broken. That’s not necessarily a criticism. Nobody makes money off their own logs. Rather we use logs to gain insight into what our programs are doing… or have done, so we can keep the things we do make money from running. At small scale, or in development, you can get the necessary insights from printing messages to stdout. Scale up to a distributed system and you quickly develop a need to aggregate those messages to some central place where they can be useful. This need is even more urgent if you’re running containers on an orchestration platform like kubernetes, where processes and local storage are ephemeral.

Since the early days of containers and the publication of the Twelve-Factor manifesto, a common pattern has emerged for handling logs generated by container fleets: processes write messages to stdout or stderr, containerd (docker) redirects the standard streams to disk files outside the containers, and a log forwarder tails the files and forwards them to a database. The log forwarder fluentd is a CNCF project, like containerd itself, and has become more or less a de facto standard tool for reading, transforming, transporting and indexing log lines. If you create a GKE kubernetes cluster with cloud logging enabled (formerly Stackdriver), this is pretty much the exact pattern you get, albeit using Google’s own flavor of fluentd.
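
To make the pattern concrete, here’s a toy version of the tail-and-forward loop in Python; it illustrates the idea rather than anything fluentd actually does internally, and the log path and collector endpoint are made up:

```python
# Toy tail-and-forward: follow a container log file and ship each new line
# to a central collector. Path and endpoint are hypothetical; no rotation
# handling, retries or buffering, unlike a real forwarder.
import json
import time
import urllib.request

LOG_PATH = "/var/log/containers/example.log"
COLLECTOR = "http://log-collector.example.internal:9880/ingest"

def forward(line: str) -> None:
    body = json.dumps({"log": line}).encode()
    req = urllib.request.Request(
        COLLECTOR, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=5)

def tail(path: str) -> None:
    with open(path) as f:
        f.seek(0, 2)                 # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)      # nothing new yet; wait and retry
                continue
            forward(line.rstrip("\n"))

if __name__ == "__main__":
    tail(LOG_PATH)
```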

Continue reading

Behind the front lines of the pandemic

Originally published at https://medium.com/@betz.mark/behind-the-front-lines-of-the-pandemic-dedaef9fcffd

I’m a software engineer and so I usually fill this space with software and systems engineering topics. It’s what I do and love, and I enjoy writing about it, but not today. Instead I’m going to talk about what my wife does, and loves doing, and how the times we are living through have affected her job and our lives together. In many ways we’re among the lucky ones: we both have incomes and health insurance, and I already worked from home. In other ways we’re not so fortunate. The current crisis facing the world is like nothing any of us have seen in a generation or more. It’s impacting every single segment of our population and economy, and everyone has a story. This is what ours looks like, almost four weeks into lock-down.

My wife is a registered nurse. She works at a regional hospital in northern New Jersey, about 30 miles from our home. She has been there more than a decade. Her current role is clinical coordinator on a cardiac critical care unit; you can think of it as sort of the captain of the care team. Some weeks ago, in preparation for what was obviously coming, her unit was converted into a negative pressure floor for the care of Covid-19 cases. This means a lot of work was done to seal the floor off and to ventilate it in a way that lowers the air pressure inside, preventing the escape of infectious material. The same was done to one other unit in the hospital, and a lot of work was also done to prepare to provide intensive respiratory care for patients in those units.

Continue reading

Pulling shared docker tags is bad

Originally published at https://medium.com/@betz.mark/pulling-shared-docker-tags-is-bad-5aea48e079c6

Last night we migrated a key service to a new environment. Everything went smoothly and we concluded the maintenance window early, exchanged a round of congratulations and killed the zoom call. This morning I settled in at my desk and realized that this key service’s builds were breaking on master. My initial, and I think understandable, impulse was that somehow I had broken the build when I merged my work branch for the migration into master the night before. Nothing pours sand on your pancakes like waking up to find out the thing you thought went so well last evening is now a smoking pile of ruin.

Except that wasn’t the problem. There was no difference between the commit that triggered the last good build and the merge commit to master that was now failing. I’m fine with magic when it fixes problems. We even have an emoji for it. “Hey, that thingamajig is working now!” Magic. I do not like it when it breaks things, although it is possible to use the same emoji for those cases as well. The first clue as to what was really happening was that the broken thing was a strict requirements check we run on a newly built image before unit tests. It has a list of packages it expects to find, and fails if it finds any discrepancy between that and the image contents.
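
To give a sense of what that check does, here’s a stripped-down sketch: compare the packages actually installed in the freshly built image against a pinned list and fail the build on any drift. The expected package list is made up, and this is an illustration of the idea rather than our actual check:

```python
# Fail the build if the image's installed packages drift from a pinned list.
# The EXPECTED dict is hypothetical; a real check would load it from a
# requirements file.
import sys
from importlib.metadata import distributions

EXPECTED = {
    "flask": "2.0.3",
    "gunicorn": "20.1.0",
    "redis": "4.3.4",
}

installed = {
    dist.metadata["Name"].lower(): dist.version for dist in distributions()
}

problems = []
for name, version in EXPECTED.items():
    found = installed.get(name)
    if found is None:
        problems.append(f"missing package: {name}")
    elif found != version:
        problems.append(f"version drift: {name} expected {version}, got {found}")

if problems:
    print("\n".join(problems))
    sys.exit(1)        # fail the build before unit tests run
print("image contents match expected requirements")
```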

Continue reading

After my disk crashed

Originally published at https://medium.com/@betz.mark/after-my-disk-crashed-e4f6d1c29d93. Reposted here with minor edits.

A couple of years ago I lost all of what I would have considered, up to that point, my intellectual life, not to mention a lot of irreplaceable photos, in a hard drive failure. And while this post is not about the technical and behavioral missteps that allowed the loss to occur those things nonetheless make up a part of the story. How does it happen that an experienced software engineer, someone who is often responsible for corporate data and has managed to not get fired for losing any of it, suffers a hard drive failure and finds himself in possession of zero backups? Almost effortlessly, as it turned out.

Since the early 1980s I’ve kept all of my digital self in a single directory tree off the root of my system’s boot disk. Over the years this directory structure was faithfully copied every time I upgraded, travelling on floppies, zip drives, CD-Rs, DVD-Rs, USB thumb drives, flash drives, from my first 8088 to my second and ridiculously expensive 80286, and so on through all of the machines I’ve bought or built in three decades. Along the way it grew, becoming the repository for all my software and writing work. The first VGA code I wrote was in there. The complete source code for my shareware backgammon game was in there. All the articles I wrote for Dr. Dobb’s, Software Development and other journals were in there.

Continue reading

Upgrading a large cluster on GKE

Originally published on the Google Cloud Community blog at https://medium.com/google-cloud/upgrading-a-large-cluster-on-gke-499a7256e7e1

At Olark we’ve been running production workloads on kubernetes in GKE since early 2017. In the beginning our clusters were small and easily managed. When we upgraded kubernetes on the nodes, our most common cluster-wide management task, we could just run the process in the GKE console and keep an eye on things for a while. Upgrading involves tearing down and replacing nodes one at a time, and consumes about 4–5 minutes per node in the best case. When we were at 20 nodes it might take 90–120 minutes, which is in a tolerable range. It was disruptive, but all our k8s services at the time could deal with that. It was irreversible too, but we mitigated that risk by testing in staging, and by staying current enough that the previous version was still available for a replacement nodepool if needed. This approach seemed to work fine for over a year.

As our clusters grew and we created additional nodepools for specific purposes, a funny thing began to happen: upgrading started to become a hassle. Specifically, it began to take a long time. Not only did we have more nodes, but we also had a greater diversity of services running on them. Some of those implemented things like pod disruption budgets and termination grace periods that slow an upgrade down. Others could not be restarted without a downtime due to legacy connection management issues. As the upgrade times got longer the duration of these scheduled downtimes also grew, impacting our customers and our team. Not surprisingly, we began to fall behind the current GKE release version. Recently we received an email from Google Support letting us know that an upcoming required master update would be incompatible with our node version. We had to upgrade them, or they would.

Continue reading

Ingress load balancing issues on Google’s GKE

Originally published on the Google Cloud Community blog at https://medium.com/google-cloud/ingress-load-balancing-issues-on-googles-gke-f54c7e194dd5

Usually my posts here are about a thing I think I might have figured out and want to share. Today’s post is about a thing I’m pretty sure I haven’t figured out and want to share. I want to talk about a problem we’ve been wrestling with over the last couple of weeks; one we can suggest a potential fix for but do not yet know the root cause of. In short, if you are running certain types of services behind a GCE-class ingress on GKE you might be getting traffic even when your pods are unready, as during a deployment for example. Before I get into the details, here is the discovery story. If you just want the tl;dr and recommendations, jump to the end.

[Update 4/17/2019 — Google’s internal testing confirmed that this is a problem with the front end holding open connections to kube-proxy and reusing them. Because the netfilter NAT table rules only apply to new connections this effectively short-circuited kubernetes’ internal service load balancing and directed all traffic to a given node/nodeport to the same pod. Google also confirmed that removing the keep-alive header from the server response is a work-around, and we’ve confirmed this internally. If you need the keep-alive header then the next best choice is to move to container native load balancing with a VPC-native cluster, since this takes the nodeport hop right out of the equation. Unfortunately that means building a new cluster if yours is not already VPC-native. So that is the solution… if you’re still interested in the story read on!]

Over the last couple of months we’ve been prepping one of our most critical services for migration to GKE. This service consists partly of an http daemon that handles long poll requests from our javascript client, and runs on 90 GCE instances. These instances handle approximately 15k requests per second at peak load. Because many of these requests are long polls with a timeout of 30 seconds we need the ability to gracefully shut down instances of this service. To accomplish this we have a command we can send the service that causes it to take itself out of rotation, wait 60 seconds for all existing long polls to complete, and then exit.
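
To make that drain behavior concrete, here’s a sketch of the idea as a small asyncio program, using SIGTERM as a stand-in for the drain command; the handler, readiness flag and timings are simplified stand-ins for the real service:

```python
# Sketch of graceful drain for a long-poll service: on a shutdown signal,
# mark the instance unready so it drops out of rotation, give in-flight
# long polls up to 60 seconds to finish, then exit. Details are hypothetical.
import asyncio
import signal

DRAIN_SECONDS = 60

class Server:
    def __init__(self):
        self.ready = True        # what a readiness endpoint would report
        self.in_flight = set()   # tasks for active long poll requests

    async def long_poll_handler(self):
        task = asyncio.current_task()
        self.in_flight.add(task)
        try:
            await asyncio.sleep(30)   # stand-in for waiting on new messages
        finally:
            self.in_flight.discard(task)

    async def drain(self):
        self.ready = False            # readiness checks now fail; out of rotation
        print("draining: waiting for in-flight long polls to finish")
        try:
            await asyncio.wait_for(
                asyncio.gather(*self.in_flight, return_exceptions=True),
                timeout=DRAIN_SECONDS,
            )
        except asyncio.TimeoutError:
            print("drain timeout reached, exiting anyway")

async def main():
    server = Server()
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    loop.add_signal_handler(signal.SIGTERM, stop.set)

    # Simulate a couple of active long polls.
    for _ in range(2):
        asyncio.create_task(server.long_poll_handler())

    await stop.wait()     # run until we're told to shut down
    await server.drain()  # then drain gracefully and exit

if __name__ == "__main__":
    asyncio.run(main())
```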

Continue reading