bunny.net
bunny.net positions itself as a robust European alternative to Cloudflare, offering a global edge platform that integrates CDN, security, and compute services. I find their focus on performance and comprehensive edge solutions appealing.
This category collects essential tools and resources for Site Reliability Engineering, focusing on building, deploying, and maintaining resilient distributed systems. It covers critical areas like cloud infrastructure management, AI/ML inference, observability, and security, with a strong emphasis on practical, often open-source, solutions. I find it invaluable for those looking to optimize their operational workflows and ensure system stability.
Anubis is a self-hostable proof-of-work tool designed to defend websites against scraper bots and automated traffic.
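To make the proof-of-work idea concrete, here is a minimal sketch of a generic hash-based challenge. It is not Anubis's actual code or protocol: the `solve` and `verify` names, the challenge format, and the "leading zero hex digits" difficulty rule are my own assumptions for illustration.

```python
import hashlib
import secrets


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force the first nonce whose SHA-256 digest
    starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


# Hypothetical per-visitor challenge; a real deployment would bind it to the request.
challenge = secrets.token_hex(16)
nonce = solve(challenge, difficulty=3)  # roughly 16**3 hashes on average
```

The asymmetry is the point: the visitor pays for thousands of hashes while the server pays for one, which is negligible for a single human but adds up for a scraper fetching many pages.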
DeepSeek's 3FS (Fire-Flyer File System) is a high-performance distributed file system specifically designed for AI training and inference workloads, leveraging SSDs and RDMA.
LiteLLM is a Python SDK and proxy server to unify 100+ LLM APIs, enabling calls in OpenAI format. This is a critical tool for abstracting LLM providers, offering features like cost tracking, load balancing, and guardrails.
OpenInference delivers OpenTelemetry instrumentation for AI observability. This allows tracing AI applications, including LLMs and their ecosystem components, providing critical insights into their runtime behavior across any OpenTelemetry-compatible backend.
This is my go-to C/C++ library for running LLaMA and other LLMs locally and efficiently on consumer hardware. It shines on Apple Silicon, which makes it especially useful for Mac users, and it really democratizes local AI inference.
vLLM is a framework for fast and efficient LLM inference and serving. Features like continuous batching and automatic prefix caching significantly improve throughput for both online and offline workloads.
Kamal is a Docker-based deploy automation tool from 37signals. I use it to deploy web apps anywhere, from bare metal to cloud VMs, to avoid commercial platform lock-in and simplify production with open-source tooling.
This is NVIDIA's DCGM-based Prometheus exporter for GPU metrics, which the Kubernetes metrics exporter leverages. It's a foundational tool for monitoring NVIDIA GPUs, often deployed as part of the NVIDIA GPU Operator.
GPU Deploy was an Airbnb-like marketplace enabling users to rent idle GPU capacity from others. It focused on providing low-cost, on-demand GPU instances, including fractional rentals, specifically for machine learning and AI tasks.
I use OAuth2 Proxy as a reverse proxy to secure internal services, authenticating users via common providers like Google or GitHub to restrict access by email, domain, or group.
Anakin is a Linux tool that ensures child processes are properly reaped, preventing lingering orphans after a parent terminates, which is essential for robust process management. I appreciate the Star Wars pun for a tool that 'kills' orphans.
This is a useful list of disposable and temporary email domains. I use it to block registrations on my email forms and logins, preventing spam and abuse.
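A minimal sketch of how I would wire such a list into a signup form. I am assuming the list is available as plain text with one domain per line; the function names and the sample entries below are placeholders, not part of the list itself.

```python
from __future__ import annotations


def load_blocklist(text: str) -> set[str]:
    """Parse a one-domain-per-line list, skipping blanks and '#' comments."""
    return {
        line.strip().lower()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }


def is_disposable(email: str, blocklist: set[str]) -> bool:
    """True if the address's domain, or any parent domain, is on the blocklist."""
    domain = email.rsplit("@", 1)[-1].strip().lower()
    labels = domain.split(".")
    # Check "a.b.c", then "b.c", so subdomains of a listed domain also match.
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels) - 1))


# Sample entries only; in practice the text would come from the downloaded list.
blocklist = load_blocklist("# sample entries only\nmailinator.com\nthrowaway.example\n")
```

Matching parent domains is a design choice on my part: it catches `user@mail.mailinator.com` at the cost of a few extra set lookups per signup.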
Domain Monitor provides a comprehensive, daily-updated list of all globally registered domains, including historical data, zone information, and email contacts. I see this as a critical resource for anyone needing to analyze domain trends or perform large-scale data operations.
ClickHouse is an open-source, column-oriented data warehouse designed for real-time analytical processing. It excels at delivering millisecond queries on large datasets, making it highly efficient for OLAP workloads.
This site offers a clear overview table of Google Cloud Compute Engine machine type pricing across regions. I find it useful for quickly comparing instance costs and specifications without navigating multiple Google docs.
This is a great visual explainer for the Raft consensus algorithm, making it easier to grasp for distributed systems developers. Raft itself is designed to be understandable while being equivalent to Paxos in fault tolerance and performance.
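The fault-tolerance side of Raft comes down to majority quorums, which is easy to sketch. The helper names below are mine, and this covers only the vote-counting arithmetic, not terms, logs, or the rest of the protocol.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be down while a majority is still reachable."""
    return cluster_size - quorum(cluster_size)


def wins_election(votes_received: int, cluster_size: int) -> bool:
    """A candidate becomes leader once a majority has voted for it
    (its own vote counts)."""
    return votes_received >= quorum(cluster_size)
```

This is why clusters are usually sized at 3 or 5 nodes: going from 3 to 4 raises the quorum from 2 to 3 without tolerating any additional failure.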
Infracost is a great tool for showing infrastructure cost changes directly in pull requests and merge requests. I find it invaluable for detecting and preventing potential cloud cost increases before deployment, shifting FinOps left effectively.
k6 is an open-source load testing tool from Grafana. It enables engineering teams to script performance tests in JavaScript, supporting various test types for continuous application reliability.
This is the SRE team's reliability manifesto from Delivery Hero, outlining their operational principles. It's a solid blueprint that can be adapted to build your own team's reliability framework.
Innernet is a FOSS, WireGuard-based private network system. It's similar to Tailscale but leverages CIDRs for powerful ACL primitives rather than a custom approach.
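To illustrate what CIDR-based ACLs can look like, here is a rough sketch using Python's `ipaddress` module. It is not innernet's implementation: the `may_connect` helper, the address ranges, and the idea of modelling rules as pairs of CIDRs allowed to talk to each other are assumptions made for the example.

```python
from ipaddress import ip_address, ip_network

# Hypothetical rule set: each pair of CIDRs is allowed to talk in both directions.
ASSOCIATIONS = [
    (ip_network("10.42.1.0/24"), ip_network("10.42.2.0/24")),  # e.g. humans <-> servers
]


def may_connect(src: str, dst: str, associations=ASSOCIATIONS) -> bool:
    """True if the two peer addresses fall into an associated pair of CIDRs."""
    a, b = ip_address(src), ip_address(dst)
    return any(
        (a in left and b in right) or (a in right and b in left)
        for left, right in associations
    )
```

Because the rules are plain network ranges, they can be enforced with ordinary firewall tooling and read by anyone who knows subnetting.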
tfsec is a static security linter for Terraform. I find it useful for early detection of security misconfigurations in my infrastructure-as-code.
GoogleContainerTools is a GitHub organization offering a significant collection of tools for container development and management. I find it an essential resource for anyone working with Docker and Kubernetes.
This Postgres Index Advisor helps you identify and configure optimal indexes for more efficient query execution. It's a valuable tool for boosting database performance.
Bomb Squad is a Kubernetes sidecar for Prometheus, automatically detecting and silencing high cardinality series to maintain operational stability. I find it a crucial tool for preventing cardinality explosions.
VictoriaMetrics is a high-performance, open-source time series database and monitoring solution, designed as a Prometheus-compatible stack. It's a solid choice for those needing scalable observability across metrics, logs, and traces.
This is `grafonnet-lib`, a Jsonnet library for defining Grafana dashboards as code. While powerful for programmatic dashboard management, this specific repository is deprecated; a new, generated version exists at `grafana/grafonnet`.
This article details a straightforward SQL query to identify active, long-running queries in PostgreSQL by inspecting `pg_stat_activity`. It's a quick way to pinpoint performance bottlenecks and offers basic steps to further diagnose and terminate rogue processes.
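For reference, a query of this kind can be sketched as follows. This is my own version built on standard `pg_stat_activity` columns, not necessarily the article's exact query; the function name, the threshold parameter, and the `%s` placeholder style (as used by psycopg-like drivers) are assumptions.

```python
def long_running_queries_sql(min_minutes: int) -> str:
    """Build a query listing active statements older than `min_minutes`."""
    minutes = int(min_minutes)  # coerce to int so nothing else can be interpolated
    if minutes < 0:
        raise ValueError("min_minutes must be non-negative")
    return f"""
SELECT pid,
       now() - query_start AS duration,
       state,
       query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()              -- skip this monitoring session
  AND now() - query_start > interval '{minutes} minutes'
ORDER BY duration DESC;
"""


# Follow-up statements, gentlest first; pass the pid as a bound parameter.
CANCEL_SQL = "SELECT pg_cancel_backend(%s);"        # cancel the running query only
TERMINATE_SQL = "SELECT pg_terminate_backend(%s);"  # close the whole connection
```

I would try `pg_cancel_backend` before `pg_terminate_backend`, since the former leaves the client's connection intact.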
Sloth simplifies creating SLOs for Prometheus, generating consistent metrics and multi-window, multi-burn-rate alerts from a simple spec. It's a pragmatic approach to standardize service-level objectives.
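The arithmetic behind multi-window, multi-burn-rate alerting is worth spelling out. The sketch below follows the general technique described in the Google SRE Workbook and is not Sloth's generated rules; the function names and the 30-day (720-hour) SLO period are assumptions.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = 720.0) -> float:
    """Burn rate that would consume `budget_fraction` of the budget within the window."""
    return budget_fraction * period_hours / window_hours


def should_alert(long_window_ratio: float, short_window_ratio: float,
                 slo_target: float, threshold: float) -> bool:
    """Fire only if both windows exceed the threshold: the long window shows the
    burn is significant, the short window shows it is still happening."""
    return (burn_rate(long_window_ratio, slo_target) > threshold
            and burn_rate(short_window_ratio, slo_target) > threshold)
```

For example, spending 2% of a 30-day budget within one hour corresponds to a burn rate of 0.02 × 720 / 1 = 14.4, and 5% within six hours corresponds to 6.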
This is Dan McKinley's essay on why prioritizing stable, proven technologies is crucial. I find it a pragmatic take on conserving engineering focus for core business innovation rather than experimenting with unproven tech stacks.
Litestream provides continuous streaming backup for SQLite databases, enabling recovery to the point of failure with minimal cost. It's an essential tool for robust single-server applications.
This GitHub repo provides a curated list of excellent resources for Site Reliability and Production Engineering. I use it as a solid starting point when I need to dig into SRE topics.
This is a curated collection of publicly available resources detailing how various tech organizations implement Site Reliability Engineering. I find it valuable for understanding diverse SRE practices, tools, and culture.
This provides a solid introduction to foundational SRE knowledge, covering core systems engineering and software concepts. It's a useful resource for anyone looking to build a career in site reliability engineering.
This site defines Chaos Engineering, an empirical approach to building confidence in distributed systems by actively experimenting on them in production. It outlines core principles for proactively identifying systemic weaknesses.