 
    
    Deploying our Kubernetes-based solution in a new environment uncovered a tricky difference in underlying software configurations. Be ready to dive into solving an exciting SRE mystery involving Rook, Ceph, containerd, and Linux systemd!
 
    
 
    
    How can API Priority and Fairness help your Kubernetes workloads? Here's a real-life case where its flow control features helped us bring a production application back to life.
 
    
    A fascinating story of a recent incident that resulted in a couple of pull requests to Kubernetes-related projects. Be ready to dive into some intricacies of the Kubernetes API as well as its interaction with etcd.
 
    
    Our recent experience with Chaos Mesh as a way to test an application running in Kubernetes against various disruptive scenarios.
 
    
    The disastrous fire at OVHcloud data centers this March badly affected our monitoring system. Here is how it challenged us and what we did to keep everything working smoothly.
 
    
    Here is another failure story from our SREs that is worth sharing. It involves migrating an Elasticsearch cluster from one storage to another inside a Kubernetes cluster.
 
    
    We're starting a special series of articles dedicated to our… failures and the lessons we've learned from them. This story happened with a ClickHouse + ZooKeeper setup and was caused by miscommunication.
 
    
    This article reviews existing tools for implementing chaos engineering in K8s, including kube-monkey, chaoskube, Chaos Mesh, Litmus Chaos, Chaos Toolkit, some games, and more.
 
    
    Why you should be careful using Kubernetes operators for critical infrastructure and which tools might be useful for analyzing your Redis databases.
 
    
    Troubleshooting and manually recovering a failed Rook cluster. Tips for preventing the disaster and restoring from backups.
 
    
    When something prevents web applications from functioning properly, you have to investigate at all levels: your infrastructure (K8s-based in our case), third-party services, or the code itself.