Vol. 2026 Issue 15 Updated 11 Apr 2026 Entries 759

This category collects tools and resources for Site Reliability Engineering, with a focus on building, deploying, and maintaining resilient distributed systems. It covers cloud infrastructure management, AI/ML inference, observability, and security, and emphasizes practical, often open-source, solutions. I find it most useful for streamlining operational workflows and keeping systems stable.

SRE entries

Questions & Answers

What kind of links live here?
This SRE category features a curated list of tools and resources for building and maintaining robust distributed systems, covering infrastructure, deployment, observability, and security. It includes open-source solutions for cloud management, AI/ML infrastructure, database optimization, and performance testing.
Who is this category for?
This category is primarily for Site Reliability Engineers, DevOps practitioners, platform engineers, and machine learning engineers. It caters to anyone responsible for designing, deploying, monitoring, and scaling critical infrastructure and applications, particularly those leveraging cloud and AI technologies.
What are the recurring themes or types of tools found here?
Recurring themes include cloud infrastructure and deployment automation with tools like Kamal, monitoring and observability with tools such as OpenInference for AI applications, and specialized infrastructure for AI/ML workloads, such as vLLM and llama.cpp. The category also covers security, performance testing, and database optimization.
Can you name one or two standout entries that represent this category well?
Standout entries include Kamal, a Docker-based deployment automation tool from 37signals that simplifies production deployments, and LiteLLM, a Python SDK and proxy server that unifies over 100 LLM APIs behind a consistent interface for AI application development. OpenInference, for AI observability, is another key resource.
When should I browse this SRE category versus others, for example, "Cloud" or "AI/ML"?
Browse this SRE category when your focus is on the operational side of systems: reliability, performance, observability, and secure deployment across environments. While "Cloud" might focus on providers and services, and "AI/ML" on models and frameworks, this category addresses the engineering practices and tools needed to run those systems reliably in production.