What kind of links live here?

This SRE category features a curated list of tools and resources for building and maintaining robust distributed systems, covering infrastructure, deployment, observability, and security. It includes open-source solutions for cloud management, AI/ML infrastructure, database optimization, and performance testing.

Who is this category for?

This category is primarily for Site Reliability Engineers, DevOps practitioners, platform engineers, and machine learning engineers. It caters to anyone responsible for designing, deploying, monitoring, and scaling critical infrastructure and applications, particularly those leveraging cloud and AI technologies.

What are the recurring themes or types of tools found here?

Recurring themes in this category include cloud infrastructure and deployment automation with tools like Kamal, comprehensive monitoring and observability solutions such as OpenInference for AI, and specialized infrastructure for AI/ML workloads with vLLM and llama.cpp. It also covers security, performance testing, and database optimization.

Can you name one or two standout entries that represent this category well?

Standout entries include Kamal, a Docker-based deployment automation tool from 37signals that simplifies production deployments, and LiteLLM, a Python SDK and proxy server that unifies over 100 LLM APIs for consistent AI application development. OpenInference for AI observability is another key resource.

When should I browse this SRE category versus others, for example, "Cloud" or "AI/ML"?

Browse this SRE category when your focus is on the operational aspects of systems, including reliability, performance, observability, and secure deployment across various environments. While "Cloud" might focus on providers and services, and "AI/ML" on models and frameworks, this category specifically addresses the engineering practices and tools to run them reliably in production.

SRE links | Simon Frey Open Link List

bunny.net · 01 JAN '26

bunny.net

bunny.net positions itself as a robust European alternative to Cloudflare, offering a global edge platform that integrates CDN, security, and compute services. I find their focus on performance and comprehensive edge solutions appealing.

anubis.techaro.lol · 31 DEC '25

Anubis

Anubis is a self-hostable proof-of-work tool designed to defend websites against scraper bots and automated traffic.
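To make the proof-of-work idea concrete, here is a minimal hashcash-style sketch. It is not Anubis's actual protocol: the function names, the challenge string, and the "leading zero hex digits" difficulty rule are assumptions chosen for illustration. The point is the asymmetry: the client has to try many hashes, while the server checks the answer with one.

```python
import hashlib
from itertools import count


def solve(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce whose SHA-256 digest starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Check a submitted nonce with a single hash computation."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    # Hypothetical challenge value; a real server would issue a random one per visitor.
    nonce = solve("example-challenge", 3)
    print(nonce, verify("example-challenge", nonce, 3))
```

Each extra zero digit multiplies the expected client work by 16, which is what makes bulk scraping expensive while a single visitor barely notices.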

github.com · 30 DEC '25

3FS

DeepSeek's 3FS (Fire-Flyer File System) is a high-performance distributed file system specifically designed for AI training and inference workloads, leveraging SSDs and RDMA.

github.com · 30 DEC '25

litellm

LiteLLM is a Python SDK and proxy server to unify 100+ LLM APIs, enabling calls in OpenAI format. This is a critical tool for abstracting LLM providers, offering features like cost tracking, load balancing, and guardrails.

github.com · 30 DEC '25

openinference

OpenInference delivers OpenTelemetry instrumentation for AI observability. This allows tracing AI applications, including LLMs and their ecosystem components, providing critical insights into their runtime behavior across any OpenTelemetry-compatible backend.

github.com · 30 DEC '25

llama.cpp

This is my go-to C/C++ library for efficiently running LLaMA and other LLMs locally on consumer hardware. It is especially useful for Mac users, since it shines on Apple Silicon. It really democratizes local AI inference.

docs.vllm.ai · 30 DEC '25

vLLM

vLLM is an LLM hosting framework designed for fast and efficient LLM inference and serving. Its features like continuous batching and automatic prefix caching significantly improve throughput for online and offline workloads.

kamal-deploy.org · 14 MAY '25

Kamal — Deploy web apps anywhere

Kamal is a Docker-based deploy automation tool from 37signals. I use it to deploy web apps anywhere, from bare metal to cloud VMs, to avoid commercial platform lock-in and simplify production with open-source tooling.

github.com · 19 JUL '24

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM

This is NVIDIA's DCGM-based Prometheus exporter for GPU metrics, which is also commonly used to expose GPU metrics in Kubernetes clusters. It's a foundational tool for monitoring NVIDIA GPUs, often deployed as part of the NVIDIA GPU Operator.

gpudeploy.com · 01 JUL '24

GPU Deploy

GPU Deploy was an Airbnb-like marketplace enabling users to rent idle GPU capacity from others. It focused on providing low-cost, on-demand GPU instances, including fractional rentals, specifically for machine learning and AI tasks.

oauth2-proxy.github.io · 21 MAY '24

OAuth2 Proxy

I use OAuth2 Proxy as a reverse proxy to secure internal services, authenticating users via common providers like Google or GitHub to restrict access by email, domain, or group.

github.com · 19 MAY '24

Anakin

Anakin is a Linux tool that ensures child processes are properly reaped, preventing lingering orphans after a parent terminates, which is essential for robust process management. I appreciate the Star Wars pun for a tool that 'kills' orphans.

github.com · 19 MAY '24

disposable-email-domains

This is a useful list of disposable and temporary email domains. I use it to block registrations on my email forms and logins, preventing spam and abuse.
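A minimal sketch of how such a list can be used at registration time, assuming the blocklist is available as plain text with one domain per line. The helper names and the sample domains are hypothetical and only there for illustration.

```python
def load_blocklist(lines) -> set[str]:
    """Build a lowercase domain set, skipping blank lines and '#' comments."""
    domains = set()
    for line in lines:
        entry = line.strip().lower()
        if entry and not entry.startswith("#"):
            domains.add(entry)
    return domains


def is_disposable(email: str, blocklist: set[str]) -> bool:
    """Return True if the part after the last '@' is on the blocklist."""
    domain = email.rsplit("@", 1)[-1].strip().lower()
    return domain in blocklist


if __name__ == "__main__":
    # Sample entries for illustration; in practice, read them from the list file.
    blocklist = load_blocklist(["mailinator.com", "# comment", ""])
    print(is_disposable("someone@Mailinator.com", blocklist))
```

An exact-match set lookup keeps the check cheap. Whether to also match subdomains is a policy decision this sketch leaves out.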

domains-monitor.com · 19 MAY '24

Domain Monitor

Domain Monitor provides a comprehensive, daily-updated list of all globally registered domains, including historical data, zone information, and email contacts. I see this as a critical resource for anyone needing to analyze domain trends or perform large-scale data operations.

clickhouse.com · 19 MAY '24

ClickHouse

ClickHouse is an open-source, column-oriented data warehouse designed for real-time analytical processing. It excels at delivering millisecond queries on large datasets, making it highly efficient for OLAP workloads.

gcloud-compute.com · 19 MAY '24

Google Cloud Compute Engine Machine Type Comparison

This site offers a clear overview table of Google Cloud Compute Engine machine type pricing across regions. I find it useful for quickly comparing instance costs and specifications without navigating multiple Google docs.

raft.github.io · 27 NOV '23

The Raft Consensus Algorithm

This is a great visual explainer for the Raft consensus algorithm, making it easier to grasp for distributed systems developers. Raft is designed to be understandable while matching Paxos in fault tolerance and performance.

infracost.io · 08 AUG '23

InfraCost

Infracost is a great tool for showing infrastructure cost changes directly in pull and merge requests. I find it invaluable for detecting and preventing potential cloud cost increases before deployment, shifting FinOps left effectively.

k6.io · 08 AUG '23

K6

k6 is an open-source load testing tool from Grafana. It enables engineering teams to script performance tests in JavaScript, supporting various test types for continuous application reliability.

tech.deliveryhero.com · 08 AUG '23

Delivery Hero Reliability Manifesto

This is the SRE team's reliability manifesto from Delivery Hero, outlining their operational principles. It's a solid blueprint that can be adapted to build your own team's reliability framework.

github.com · 22 JAN '23

Innernet

Innernet is a FOSS, WireGuard-based private network system. It's similar to Tailscale but leverages CIDRs for powerful ACL primitives rather than a custom approach.
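To illustrate what CIDR-based access rules can look like in general, here is a small sketch using Python's standard `ipaddress` module. It is not Innernet's data model or CLI: the rule list, the address ranges, and the `allowed` helper are assumptions made up for this example.

```python
import ipaddress

# Hypothetical rules: traffic is allowed from the first CIDR to the second.
RULES = [
    (ipaddress.ip_network("10.42.1.0/24"), ipaddress.ip_network("10.42.2.0/24")),
]


def allowed(src: str, dst: str, rules=RULES) -> bool:
    """Return True if some rule covers both the source and the destination address."""
    s = ipaddress.ip_address(src)
    d = ipaddress.ip_address(dst)
    return any(s in src_net and d in dst_net for src_net, dst_net in rules)


if __name__ == "__main__":
    print(allowed("10.42.1.7", "10.42.2.9"))
    print(allowed("10.42.3.7", "10.42.2.9"))
```

The appeal of this style is that group membership falls out of addressing: putting a peer into a subnet is enough to grant it that subnet's access.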

aquasecurity.github.io · 22 JAN '23

tfsec

tfsec is a static security linter for Terraform. I find it useful for early detection of security misconfigurations in my infrastructure-as-code.

github.com · 11 JAN '23

Google Container Tools

GoogleContainerTools is a GitHub organization offering a significant collection of tools for container development and management. I find it an essential resource for anyone working with Docker and Kubernetes.

pganalyze.com · 10 JAN '23

Index Advisor for Postgres

This Postgres Index Advisor helps you identify and configure optimal indexes for more efficient query execution. It's a valuable tool for boosting database performance.

github.com · 16 JAN '22

Bomb squad

Bomb Squad is a Kubernetes sidecar for Prometheus, automatically detecting and silencing high cardinality series to maintain operational stability. I find it a crucial tool for preventing cardinality explosions.
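As a rough illustration of what a "cardinality explosion" means, the sketch below counts distinct values per label and flags labels above a threshold. This is not Bomb Squad's detection logic; the function name, the input shape, and the threshold are assumptions for the example.

```python
from collections import defaultdict


def high_cardinality_labels(series, threshold: int) -> dict:
    """Map (metric, label) to its distinct value count where that count exceeds threshold.

    `series` is an iterable of (metric_name, labels_dict) pairs.
    """
    values = defaultdict(set)
    for metric, labels in series:
        for name, value in labels.items():
            values[(metric, name)].add(value)
    return {key: len(seen) for key, seen in values.items() if len(seen) > threshold}


if __name__ == "__main__":
    # Hypothetical series: a per-user label is the classic culprit.
    sample = [
        ("http_requests_total", {"method": "GET", "user_id": str(i)})
        for i in range(1000)
    ]
    print(high_cardinality_labels(sample, threshold=100))
```

Every distinct label value creates a separate time series, so a single unbounded label such as a user ID can multiply storage and query cost.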

victoriametrics.com · 04 JAN '22

Victoria Metrics

Victoria Metrics is a high-performance, open-source time series database and monitoring solution, designed as a Prometheus-compatible stack. It's a solid choice for those needing scalable observability across metrics, logs, and traces.

github.com · 31 DEC '21

Grafonnet

This is `grafonnet-lib`, a Jsonnet library for defining Grafana dashboards as code. While powerful for programmatic dashboard management, this specific repository is deprecated; a new, generated version exists at `grafana/grafonnet`.

til.codes · 18 JUL '21

Find long running queries in postgres

This article details a straightforward SQL query to identify active, long-running queries in PostgreSQL by inspecting `pg_stat_activity`. It's a quick way to pinpoint performance bottlenecks and offers basic steps to further diagnose and terminate rogue processes.
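A query of this kind might look like the sketch below. It is not necessarily the article's exact query; the helper name and the five-minute default are assumptions. The columns `pid`, `state`, `query`, and `query_start` are standard `pg_stat_activity` columns.

```python
def long_running_queries_sql(min_minutes: int = 5) -> str:
    """Build a query listing active statements that have run longer than `min_minutes`."""
    if min_minutes < 0:
        raise ValueError("min_minutes must be non-negative")
    return (
        "SELECT pid, now() - query_start AS duration, state, query\n"
        "FROM pg_stat_activity\n"
        "WHERE state = 'active'\n"
        f"  AND now() - query_start > interval '{int(min_minutes)} minutes'\n"
        "ORDER BY duration DESC;"
    )


if __name__ == "__main__":
    # Paste the output into psql, or run it through your usual database driver.
    print(long_running_queries_sql(5))
```

Once you have a `pid`, PostgreSQL's `pg_cancel_backend(pid)` cancels the running query, and `pg_terminate_backend(pid)` ends the whole session if cancelling is not enough.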

github.com · 11 JUL '21

Sloth

Sloth simplifies creating SLOs for Prometheus, generating consistent metrics and multi-window, multi-burn-rate alerts from a simple spec. It's a pragmatic approach to standardizing service-level objectives.
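The arithmetic behind burn-rate alerting can be sketched in a few lines. This is not Sloth's generated rule set: the function names are made up for illustration, and the 14.4 threshold is the commonly cited fast-burn value for a 30-day window, used here as an assumption. A burn rate of 1 means the error budget lasts exactly the SLO period.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed relative to the allowed pace."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return error_ratio / (1.0 - slo_target)


def should_alert(long_window_errors: float, short_window_errors: float,
                 slo_target: float, threshold: float) -> bool:
    """Fire only if both the long and the short window exceed the threshold."""
    return (burn_rate(long_window_errors, slo_target) > threshold
            and burn_rate(short_window_errors, slo_target) > threshold)


if __name__ == "__main__":
    # 99.9% SLO; 2% errors over the long window, 3% over the short one.
    print(should_alert(0.02, 0.03, slo_target=0.999, threshold=14.4))
```

Requiring the short window as well means the alert stops firing soon after the problem is fixed, instead of lingering until the long window drains.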

boringtechnology.club · 21 APR '21

Boring Technology Club

This is Dan McKinley's essay on why prioritizing stable, proven technologies is crucial. I find it a pragmatic take on conserving engineering focus for core business innovation rather than experimenting with unproven tech stacks.

litestream.io · 21 APR '21

Litestream

Litestream provides continuous streaming backup for SQLite databases, enabling recovery to the point of failure with minimal cost. It's an essential tool for robust single-server applications.

github.com · 02 APR '21

Awesome SRE

This GitHub repo provides a curated list of excellent resources for Site Reliability and Production Engineering. I use it as a solid starting point when I need to dig into SRE topics.

github.com · 16 FEB '21

How they SRE

This is a curated collection of publicly available resources detailing how various tech organizations implement Site Reliability Engineering. I find it valuable for understanding diverse SRE practices, tools, and culture.

linkedin.github.io · 16 FEB '21

School of SRE

This provides a solid introduction to foundational SRE knowledge, covering core systems engineering and software concepts. It's a useful resource for anyone looking to build a career in site reliability engineering.

principlesofchaos.org · 11 JAN '21

Principles of chaos engineering

This site defines Chaos Engineering, an empirical approach to building confidence in distributed systems by actively experimenting on them in production. It outlines core principles for proactively identifying systemic weaknesses.
