(Almost) Every infrastructure decision I endorse or regret after 4 years
Summary
Startup infrastructure lead shares 4 years of scaling lessons. Recommends AWS over GCP for support, managed services (EKS, RDS) over self-hosting, and tools like Terraform, GitHub Actions, and PagerDuty. Advises prioritizing automation, GitOps, and cost reviews. Warns against shared databases and over-engineering.
AWS over Google Cloud for support and stability
The author's startup used both Google Cloud Platform (GCP) and AWS early on. They ultimately favored AWS due to superior, proactive account management and a perception of greater stability with fewer backward-incompatible API changes.
While GCP was once the clear choice for Kubernetes, the author notes AWS's EKS and its ecosystem of integrations have largely closed that gap. For startups, the advice is to use managed services like EKS rather than running your own control plane.
Managed databases are non-negotiable
The author considers data loss a company-ending event, making the markup for managed services like Amazon RDS an essential cost. They endorse Amazon ElastiCache for Redis as a versatile, battle-tested tool for caching and other fast data operations.
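To make the caching use case concrete, here is a minimal cache-aside sketch in Go using the go-redis client against a Redis endpoint. The endpoint address, key names, TTL, and the `loadUserFromDB` helper are assumptions for illustration, not details from the post.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

type User struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

// loadUserFromDB is a hypothetical stand-in for the primary datastore (e.g. RDS).
func loadUserFromDB(ctx context.Context, id string) (User, error) {
	return User{ID: id, Name: "example"}, nil
}

// getUser implements cache-aside: read Redis first, fall back to the database,
// then populate the cache with a TTL so stale entries expire on their own.
func getUser(ctx context.Context, rdb *redis.Client, id string) (User, error) {
	key := "user:" + id

	if raw, err := rdb.Get(ctx, key).Result(); err == nil {
		var u User
		if jsonErr := json.Unmarshal([]byte(raw), &u); jsonErr == nil {
			return u, nil // cache hit
		}
	} else if !errors.Is(err, redis.Nil) {
		return User{}, err // a real Redis error, not just a missing key
	}

	u, err := loadUserFromDB(ctx, id) // cache miss: go to the source of truth
	if err != nil {
		return User{}, err
	}
	buf, _ := json.Marshal(u)
	// Best-effort write; a failed SET only costs a future cache miss.
	_ = rdb.Set(ctx, key, buf, 10*time.Minute).Err()
	return u, nil
}

func main() {
	// Address of the ElastiCache (or any Redis) endpoint; placeholder value.
	rdb := redis.NewClient(&redis.Options{Addr: "my-cache.example.internal:6379"})
	u, err := getUser(context.Background(), rdb, "42")
	fmt.Println(u, err)
}
```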
For container registries, they moved from the unstable quay.io to Amazon ECR, citing better stability and deeper permission integrations with AWS services.
Account management and cost control
The startup uses AWS Control Tower with the AWS Account Factory for Terraform (AFT) to automate account creation. This solved previous automation pains and made it easier to standardize account tags for governance.
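The post doesn't include code for this, but a hedged sketch of the kind of governance check that standardized account tags enable could look like the following, using the AWS Organizations API from the Go SDK v2. The required tag keys are assumptions; the post only says tags were standardized.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/organizations"
)

// requiredTags is a hypothetical tag standard for every member account.
var requiredTags = []string{"team", "environment", "cost-center"}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := organizations.NewFromConfig(cfg)

	// Page through every account in the organization and flag missing tags.
	p := organizations.NewListAccountsPaginator(client, &organizations.ListAccountsInput{})
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		for _, acct := range page.Accounts {
			tags, err := client.ListTagsForResource(ctx, &organizations.ListTagsForResourceInput{
				ResourceId: acct.Id,
			})
			if err != nil {
				log.Fatal(err)
			}
			have := map[string]bool{}
			for _, t := range tags.Tags {
				have[*t.Key] = true
			}
			for _, want := range requiredTags {
				if !have[want] {
					fmt.Printf("account %s (%s) is missing tag %q\n", *acct.Id, *acct.Name, want)
				}
			}
		}
	}
}
```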
To control spending, they instituted a mandatory monthly review of all SaaS costs, including AWS, attended by both finance and engineering. The process involves:
- Grouping AWS costs by tag and account (a sketch of this query follows the list).
- Analyzing major cost drivers like EC2, RDS, and networking.
- Extending the review to all major software bills, not just AWS.
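The post describes the review process rather than any tooling, but a minimal sketch of pulling one month of spend grouped by linked account and a cost-allocation tag via the Cost Explorer API might look like this (aws-sdk-go-v2; the `team` tag key and the date range are assumptions, and the tag must be activated as a cost-allocation tag).

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/costexplorer"
	"github.com/aws/aws-sdk-go-v2/service/costexplorer/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	ce := costexplorer.NewFromConfig(cfg)

	// One month of unblended cost, grouped by linked account and a "team" tag.
	out, err := ce.GetCostAndUsage(ctx, &costexplorer.GetCostAndUsageInput{
		TimePeriod: &types.DateInterval{
			Start: aws.String("2024-01-01"),
			End:   aws.String("2024-02-01"),
		},
		Granularity: types.GranularityMonthly,
		Metrics:     []string{"UnblendedCost"},
		GroupBy: []types.GroupDefinition{
			{Type: types.GroupDefinitionTypeDimension, Key: aws.String("LINKED_ACCOUNT")},
			{Type: types.GroupDefinitionTypeTag, Key: aws.String("team")},
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Print each (account, team) group with its monthly cost for the review doc.
	for _, rbt := range out.ResultsByTime {
		for _, g := range rbt.Groups {
			cost := g.Metrics["UnblendedCost"]
			fmt.Printf("%v: %s %s\n", g.Keys, aws.ToString(cost.Amount), aws.ToString(cost.Unit))
		}
	}
}
```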
Incident and alert management
The company uses bots to automate the incident response process, nudging teams to provide updates and schedule post-mortems. They adopted and customized PagerDuty's incident response template in Notion.
They operate a two-tiered alerting system in PagerDuty: critical (wakes people up) and non-critical (handled asynchronously); a sketch of routing events by severity follows the list below. To prevent alert fatigue and ignored non-critical alerts, they hold bi-weekly reviews to:
- Re-evaluate if critical alerts should remain critical.
- Iterate through non-critical alerts to adjust thresholds or add automation.
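The post doesn't show how alerts reach PagerDuty, but one common way to express the two tiers is to send events to separate services with an appropriate severity via the Events API v2. The routing keys, source, and summary below are placeholders, not the author's setup.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// event models the minimal Events API v2 "trigger" payload.
type event struct {
	RoutingKey  string  `json:"routing_key"`
	EventAction string  `json:"event_action"`
	Payload     payload `json:"payload"`
}

type payload struct {
	Summary  string `json:"summary"`
	Source   string `json:"source"`
	Severity string `json:"severity"` // "critical" pages; "warning"/"info" can stay async
}

// sendAlert posts a trigger event. Critical alerts go to a service whose escalation
// policy pages on-call; non-critical alerts go to a service reviewed asynchronously.
func sendAlert(ctx context.Context, critical bool, summary string) error {
	routingKey := "NONCRITICAL_SERVICE_ROUTING_KEY" // placeholder
	severity := "warning"
	if critical {
		routingKey = "CRITICAL_SERVICE_ROUTING_KEY" // placeholder
		severity = "critical"
	}

	body, err := json.Marshal(event{
		RoutingKey:  routingKey,
		EventAction: "trigger",
		Payload:     payload{Summary: summary, Source: "checkout-service", Severity: severity},
	})
	if err != nil {
		return err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://events.pagerduty.com/v2/enqueue", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		return fmt.Errorf("pagerduty returned %s", resp.Status)
	}
	return nil
}

func main() {
	_ = sendAlert(context.Background(), false, "p99 latency above threshold")
}
```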
For post-mortems, they found dedicated tools from PagerDuty and Datadog too inflexible and instead use Notion for greater customization.
The case for Lambda and GitOps trade-offs
The author argues that cost comparisons between AWS Lambda and EC2 are often flawed, as EC2 instances rarely run at 100% utilization and scaling logic is imperfect. A hidden benefit of Lambda is precise, easily attributable cost tracking.
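To make that argument concrete, here is a small back-of-the-envelope sketch comparing an always-on instance with per-invocation Lambda pricing. All prices, request volumes, and utilization figures are made-up round numbers for illustration, not figures from the post.

```go
package main

import "fmt"

func main() {
	// Hypothetical round numbers, not real price quotes.
	const (
		ec2HourlyUSD  = 0.10 // always-on instance, billed whether busy or idle
		hoursPerMonth = 730.0
		utilization   = 0.25 // instances rarely run anywhere near 100% busy

		requestsPerMonth = 5_000_000.0
		lambdaPerRequest = 0.0000002 // per-invocation charge
		lambdaPerGBSec   = 0.0000167 // compute charge per GB-second
		memoryGB         = 0.5
		avgDurationSec   = 0.1
	)

	ec2Monthly := ec2HourlyUSD * hoursPerMonth
	lambdaMonthly := requestsPerMonth * (lambdaPerRequest + lambdaPerGBSec*memoryGB*avgDurationSec)

	fmt.Printf("EC2, always on: $%.2f/month\n", ec2Monthly)
	fmt.Printf("Lambda:         $%.2f/month\n", lambdaMonthly)
	// The naive per-hour comparison ignores idle time: at 25% utilization the
	// instance effectively costs 4x its sticker price per hour of useful work.
	fmt.Printf("EC2 per busy hour at %.0f%% utilization: $%.2f\n",
		utilization*100, ec2HourlyUSD/utilization)
}
```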
They use GitOps (with Flux) to manage infrastructure and say it has scaled well. The main downside is the loss of a clear, linear pipeline visualization, requiring investment in custom tooling to answer "why isn't my commit deployed yet?"
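As an example of the kind of custom tooling this implies, here is a hedged sketch that answers the question by comparing the local HEAD with the revision Flux last applied. It shells out to `git` and `kubectl`; the Kustomization name and namespace are assumptions about a typical Flux v2 layout, not the author's actual setup.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

// run executes a command and returns its trimmed stdout, aborting on error.
func run(name string, args ...string) string {
	out, err := exec.Command(name, args...).Output()
	if err != nil {
		log.Fatalf("%s %v: %v", name, args, err)
	}
	return strings.TrimSpace(string(out))
}

func main() {
	// Commit the developer expects to be live.
	head := run("git", "rev-parse", "HEAD")

	// Revision Flux last applied, e.g. "main@sha1:3f2a...". The Kustomization
	// name ("apps") and namespace are placeholders for whatever the cluster uses.
	applied := run("kubectl", "get", "kustomization", "apps",
		"-n", "flux-system",
		"-o", "jsonpath={.status.lastAppliedRevision}")

	deployedSHA := applied
	if i := strings.LastIndex(applied, ":"); i >= 0 {
		deployedSHA = applied[i+1:]
	}

	if deployedSHA == head {
		fmt.Println("your commit is deployed:", head)
	} else {
		fmt.Printf("not deployed yet: cluster is at %s, HEAD is %s\n", deployedSHA, head)
	}
}
```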
Avoiding infrastructure pitfalls
The author strongly advises against a monolithic, shared database for multiple services. It becomes "cared for by no one," creates painful, coordinated migrations, and turns the database into a single point of failure for deployments.
They regret sticking with Google Workspace for identity and permissions, wishing they had adopted Okta sooner for its flexibility and broad SaaS integrations.
For internal tools, they self-host AppSmith to provide UIs for internal scripts, finding it sufficient versus paid alternatives like Retool.
Tooling and platform choices
The startup's core toolchain has evolved to a specific set of platforms:
- Communication & Docs: Slack (with strict channel hygiene) and Notion.
- Project Management: Linear over Jira ("so bloated I'm worried... it would just turn fully sentient").
- CI/CD & Code: GitHub with GitHub Actions, though self-hosted runners on EKS are buggy.
- Infrastructure as Code: Terraform over CloudFormation or Pulumi, run via Atlantis.
- Observability: Datadog, but its per-instance pricing is criticized as punitive for dynamic Kubernetes and GPU-heavy AI workloads.
- Schema Management: Schema-as-code checked into Git, with a tool to generate sync SQL.
Kubernetes ecosystem recommendations
For their EKS clusters, the author provides several specific endorsements:
- Autoscaling: Use Karpenter over the cluster autoscaler or SpotInst.
- Secrets: Use ExternalSecrets to sync from AWS Secrets Manager, not Sealed Secrets.
- DNS & Certificates: Use ExternalDNS and cert-manager.
- Helm & Packaging: Helm v3 works well; they store charts as OCI artifacts.
- Node OS: Standard EKS Optimized AMIs over Bottlerocket for easier debugging.
They advise against over-engineering, skipping service meshes in favor of stable, simple tools like Nginx. For distributing internal tools, they use Homebrew taps.
Final regrets and strong endorsements
The author's biggest technical regrets are not using OpenTelemetry from the start and not implementing automated dependency updates with RenovateBot sooner. They strongly endorse prioritizing internal automation and documentation, even under feature pressure, and choosing a primary programming language like Go that is easy for new engineers to learn.
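On the OpenTelemetry point, the kind of baseline the author wishes had existed from day one looks roughly like the sketch below: a tracer provider exporting OTLP over gRPC, wired in once so every service starts out instrumented. The collector endpoint and service name are placeholders, and in practice the endpoint is usually set via environment variables rather than hard-coded.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func main() {
	ctx := context.Background()

	// OTLP/gRPC exporter pointed at a collector; the address is a placeholder.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector.internal:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Tag all spans from this process with a service name.
	res, err := resource.New(ctx,
		resource.WithAttributes(semconv.ServiceNameKey.String("example-service")),
	)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(res),
	)
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// From here on, any code can start spans through the global tracer.
	_, span := otel.Tracer("example").Start(ctx, "startup-check")
	span.End()
}
```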