Monitor
All things monitoring related.
Cloud Native
My preferred stack: Prometheus, Grafana, Loki
- Node Exporter: Prometheus exporter for server/OS statistics
- ELK Stack for Log Monitoring: ELK tends to be a bit heavy, but keeping this around just in case
- Changd: Notifies when a web UI changes.
- Performance related articles at https://www.brendangregg.com
- Internet Monitoring (globally)
- AWS CloudWatch Internet Weather Map
- Conntrack tales - one thousand and one flows - Interesting article on monitoring the maximum number of entries in the Linux conntrack table, which is used for stateful firewall setups
- Pingdom’s State of the Internet
- Down Detector
- Oracle Internet Intelligence
- The Outage Mailing List - Network admins chatting about global issues
- Internet Monitoring (locally)
- Open Speed Test: Browser based, no client login required.
- Trippy: More advanced traceroute
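As a companion to the conntrack article listed above, here is a minimal sketch of checking how full the Linux conntrack table is. It assumes the `nf_conntrack` module is loaded, so that the two procfs files below exist. The function names and the 80% warning threshold are hypothetical choices for illustration and do not come from the article.

```python
from pathlib import Path

# Standard procfs locations, present when the nf_conntrack module is loaded.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_utilization(count: int, maximum: int) -> float:
    """Return the fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def read_conntrack_utilization(count_path=COUNT_PATH, max_path=MAX_PATH) -> float:
    """Read the current and maximum entry counts and return the usage fraction."""
    count = int(Path(count_path).read_text().strip())
    maximum = int(Path(max_path).read_text().strip())
    return conntrack_utilization(count, maximum)


if __name__ == "__main__":
    WARN_THRESHOLD = 0.8  # hypothetical alert threshold, tune to taste
    try:
        usage = read_conntrack_utilization()
    except FileNotFoundError:
        print("conntrack files not found; is nf_conntrack loaded?")
    else:
        status = "WARN" if usage >= WARN_THRESHOLD else "OK"
        print(f"{status}: conntrack table {usage:.1%} full")
```

For ongoing monitoring, the same ratio can likely be built in Prometheus from Node Exporter's conntrack metrics rather than from a script like this.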
Alternative Software I want to Look At Someday
- One Uptime: Open source observability platform - uptime monitoring, incident management, on-call alerts, logs, traces, etc. (and maybe metrics, but not widely advertised)
- OpenObserve: Open source, lightweight, single binary, drop-in replacement for Elasticsearch, supports OpenTelemetry (OTel)
- SigNoz: Open source, lightweight; logs, metrics, and traces, all working with OpenTelemetry
- BindplaneOP: Manages OpenTelemetry-specific sources
Advanced Debugging
- strace - Almost always available. Potentially A LOT of performance impact
- Sysdig: Combo of strace and tcpdump - and with less performance impact
- Sysdig Inspect: Potential GUI for sysdig output
- eBPF.io: Resources for eBPF
- KubeCTL Trace: Easily run eBPF from kubectl
- Pixie Labs: Troubleshoot K8S apps relatively easily, leveraging eBPF
Data Visualization
- Observable Framework: Code-based data visualization framework.
Performance/Tuning
This section sits in the “monitor” phase of the DevOps cycle on purpose: you do not want to optimize an architecture prematurely, before monitoring shows where the bottlenecks are.
- How Cloudflare Was Able to Support 55 Million Requests per Second With Only 15 Postgres Clusters
- Scaling Mastodon - Also some great general tips for Rails, Sidekiq, and Redis.
- Scaling Mastodon to 128K Users
Post-Mortem
Also see plan for actual retrospective stories, as those are the basis for planning improvements.
Articles
- Everyone Should Be On-Call - with appropriate life balance and compensation
Sites
- End Of Life: Quick End of Life Reference