Plan
Planning for a DevOps Cycle
Architecture
- Platform Engineering: Home for Platform Engineers. Includes a comprehensive tech library of stacks/solutions.
- Keeping Code Simple
- Questions for a new technology
- Redhat Demo Central - Architectures for a wide range of cloud infrastructures and problems.
- Who Cares If It Scales - Avoiding premature optimization.
Statistics
- GitHub Release Download Stats: For public projects, useful for gauging how popular certain packages are, especially if one is NOT collecting telemetry data directly from users.
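As a rough illustration of the idea, the GitHub REST API's "list releases" endpoint (`GET /repos/{owner}/{repo}/releases`) returns each release with an `assets` array, and each asset carries a `download_count`. The sketch below assumes that JSON has already been fetched and parsed; the helper names and the sample data are made up for illustration, and this is not necessarily how the linked tool works.

```python
# Minimal sketch: aggregate download counts from already-parsed release JSON.
# SAMPLE_RELEASES is invented data shaped like the GitHub "list releases" response.

def downloads_per_release(releases):
    """Map each release tag to the summed download_count of its assets."""
    totals = {}
    for release in releases:
        tag = release.get("tag_name", "<untagged>")
        assets = release.get("assets", [])
        totals[tag] = sum(asset.get("download_count", 0) for asset in assets)
    return totals


def total_downloads(releases):
    """Total downloads across every asset of every release."""
    return sum(downloads_per_release(releases).values())


SAMPLE_RELEASES = [
    {
        "tag_name": "v1.1.0",
        "assets": [
            {"name": "tool-linux-amd64.tar.gz", "download_count": 120},
            {"name": "tool-darwin-arm64.tar.gz", "download_count": 30},
        ],
    },
    {
        "tag_name": "v1.0.0",
        "assets": [{"name": "tool-linux-amd64.tar.gz", "download_count": 40}],
    },
]

if __name__ == "__main__":
    for tag, count in downloads_per_release(SAMPLE_RELEASES).items():
        print(f"{tag}: {count}")
    print(f"total: {total_downloads(SAMPLE_RELEASES)}")
```

Download counts only cover release assets, so installs through package managers or source checkouts would not show up in these numbers.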
Questions For A New Technology
- What problem are we trying to solve? (Tech should never be introduced as an end in itself)
- How could we solve the problem with our current tech stack? (If the answer is we can’t, then we probably haven’t thought about the problem deeply enough)
- Are we clear on what new costs we are taking on with the new technology? (monitoring, training, cognitive load, etc.)
- What about our current stack makes solving this problem in a cost-effective manner (in terms of money, people or time) difficult?
- If this new tech is a replacement for something we currently do, are we committed to moving everything to this new technology in the future? Or are we proliferating multiple solutions to the same problem? (aka “Will this solution kill and eat the solution that it replaces?”)
- Who do we know and trust who uses this tech? Have we talked to them about it? What did they say about it? What don’t they like about it? (if they don’t hate it, they haven’t used it in depth yet)
- What’s a low risk way to get started?
- Have you gotten a mixed-discipline group of senior folks together and thrashed out each of the above points? Where is that documented?
Tech Debt
Post-Mortem
Post-mortems are the basis of planning for future improvements.
General
- Kubernetes Fail Stories and Source
- Danluu’s Post Mortem Repo
- Scaling up Prime Video By Migration to a Monolith: A good case study showing that some workloads are better served by a monolithic app. Interestingly, the “monolith” in this case is an ECS container that processes an entire stream, so it is perhaps better described as a case study in the downsides of extreme microservice engineering for certain workloads (large batch video processing, for instance).
2024
- Resend Outage: Developer performed a local migration while pointed at the production database, taking the site down for about 12 hours.
- Crowdstrike Falcon Content Update - The outage of the year (so far). Untested content pack update overflowed the number of parameters expected in an IPC check. Caused Windows systems to blue screen.
- Stoli Ransomware Attack And Bankruptcy and raw bankruptcy filing. Took out the ERP system until Q1 of 2025. A strong case for proactive security measures and DR practices.
2023
- CircleCI Security Breach
- NameCheap Spam Email
- Details based on spam I received - third party provider most likely Sendgrid
- Okta Security Compromise: Believed to be a combination of session compromise via saved HAR files and an employee saving their work credentials to a personal Google account that was compromised.
- Reddit Pi Day Outage - Great overview of a Kubernetes outage centered around an upgrade from 1.23 to 1.24
- Reddit Thread: Weirdest Outages
- Square Incident Summary: 2023-09-07
2022
- Incident.IO - Intermittent Downtime: Discussion of analysis and resolution in a Heroku app
- Gitpod - Sustained Workspace Performance Degradation
- How A Single Developer Dropped AWS Costs by 90% Then Disappeared - Good story/retrospective about why you should always have software peer reviewed.
- Lastpass Security Breach
2020
2019
2017
Small Environment/Home Lab
Small Environment/Home Lab Planning resources
- k8s-at-home
- k8s-at-home Search - Search FluxCD based HelmReleases of software