(Almost) Every infrastructure decision I endorse or regret after 4 years
Summary
Startup infrastructure lead shares 4 years of scaling lessons. Recommends AWS over GCP for support, managed services (EKS, RDS) over self-hosting, and tools like Terraform, GitHub Actions, and PagerDuty. Advises prioritizing automation, GitOps, and cost reviews. Warns against shared databases and over-engineering.
AWS over Google Cloud for support and stability
The author's startup used both Google Cloud Platform (GCP) and AWS early on. They ultimately favored AWS due to superior, proactive account management and a perception of greater stability with fewer backward-incompatible API changes.
While GCP was once the clear choice for Kubernetes, the author notes AWS's EKS and its ecosystem of integrations have largely closed that gap. For startups, the advice is to use managed services like EKS rather than running your own control plane.
Managed databases are non-negotiable
The author considers data loss a company-ending event, making the markup for managed services like Amazon RDS an essential cost. They endorse Amazon ElastiCache for Redis as a versatile, battle-tested tool for caching and other fast data operations.
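To make the caching use case concrete, here is a minimal cache-aside sketch in Go using the go-redis client against a Redis endpoint. The endpoint address, key names, TTL, and the `loadUserFromDB` helper are assumptions for illustration, not details from the post.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

type User struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

// loadUserFromDB is a hypothetical stand-in for the primary datastore (e.g. RDS).
func loadUserFromDB(ctx context.Context, id string) (User, error) {
	return User{ID: id, Name: "example"}, nil
}

// getUser implements cache-aside: read Redis first, fall back to the database,
// then populate the cache with a TTL so stale entries expire on their own.
func getUser(ctx context.Context, rdb *redis.Client, id string) (User, error) {
	key := "user:" + id

	if raw, err := rdb.Get(ctx, key).Result(); err == nil {
		var u User
		if jsonErr := json.Unmarshal([]byte(raw), &u); jsonErr == nil {
			return u, nil // cache hit
		}
	} else if !errors.Is(err, redis.Nil) {
		return User{}, err // a real Redis error, not just a missing key
	}

	u, err := loadUserFromDB(ctx, id) // cache miss: go to the source of truth
	if err != nil {
		return User{}, err
	}
	buf, _ := json.Marshal(u)
	// Best-effort write; a failed SET only costs a future cache miss.
	_ = rdb.Set(ctx, key, buf, 10*time.Minute).Err()
	return u, nil
}

func main() {
	// Address of the ElastiCache (or any Redis) endpoint; placeholder value.
	rdb := redis.NewClient(&redis.Options{Addr: "my-cache.example.internal:6379"})
	u, err := getUser(context.Background(), rdb, "42")
	fmt.Println(u, err)
}
```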
For container registries, they moved from the unstable quay.io to Amazon ECR, citing better stability and deeper permission integrations with AWS services.
Account management and cost control
The startup uses AWS Control Tower with the AWS Account Factory for Terraform (AFT) to automate account creation. This solved previous automation pains and made it easier to standardize account tags for governance.
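The post doesn't include code for this, but a hedged sketch of the kind of governance check that standardized account tags enable could look like the following, using the AWS Organizations API from the Go SDK v2. The required tag keys are assumptions; the post only says tags were standardized.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/organizations"
)

// requiredTags is a hypothetical tag standard for every member account.
var requiredTags = []string{"team", "environment", "cost-center"}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := organizations.NewFromConfig(cfg)

	// Page through every account in the organization and flag missing tags.
	p := organizations.NewListAccountsPaginator(client, &organizations.ListAccountsInput{})
	for p.HasMorePages() {
		page, err := p.NextPage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		for _, acct := range page.Accounts {
			tags, err := client.ListTagsForResource(ctx, &organizations.ListTagsForResourceInput{
				ResourceId: acct.Id,
			})
			if err != nil {
				log.Fatal(err)
			}
			have := map[string]bool{}
			for _, t := range tags.Tags {
				have[*t.Key] = true
			}
			for _, want := range requiredTags {
				if !have[want] {
					fmt.Printf("account %s (%s) is missing tag %q\n", *acct.Id, *acct.Name, want)
				}
			}
		}
	}
}
```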
To control spending, they instituted a mandatory monthly review of all SaaS costs, including AWS, attended by both finance and engineering. The process involves:
- Grouping AWS costs by tag and account (a sketch of this query follows the list).
- Analyzing major cost drivers like EC2, RDS, and networking.
- Extending the review to all major software bills, not just AWS.
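The post describes the review process rather than any tooling, but a minimal sketch of pulling one month of spend grouped by linked account and a cost-allocation tag via the Cost Explorer API might look like this (aws-sdk-go-v2; the `team` tag key and the date range are assumptions, and the tag must be activated as a cost-allocation tag).

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/costexplorer"
	"github.com/aws/aws-sdk-go-v2/service/costexplorer/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	ce := costexplorer.NewFromConfig(cfg)

	// One month of unblended cost, grouped by linked account and a "team" tag.
	out, err := ce.GetCostAndUsage(ctx, &costexplorer.GetCostAndUsageInput{
		TimePeriod: &types.DateInterval{
			Start: aws.String("2024-01-01"),
			End:   aws.String("2024-02-01"),
		},
		Granularity: types.GranularityMonthly,
		Metrics:     []string{"UnblendedCost"},
		GroupBy: []types.GroupDefinition{
			{Type: types.GroupDefinitionTypeDimension, Key: aws.String("LINKED_ACCOUNT")},
			{Type: types.GroupDefinitionTypeTag, Key: aws.String("team")},
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Print each (account, team) group with its monthly cost for the review doc.
	for _, rbt := range out.ResultsByTime {
		for _, g := range rbt.Groups {
			cost := g.Metrics["UnblendedCost"]
			fmt.Printf("%v: %s %s\n", g.Keys, aws.ToString(cost.Amount), aws.ToString(cost.Unit))
		}
	}
}
```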
Incident and alert management
The company uses bots to automate the incident response process, nudging teams to provide updates and schedule post-mortems. They adopted and customized PagerDuty's incident response template in Notion.
They operate a two-tiered alerting system in PagerDuty: critical (wakes people up) and non-critical (handled asynchronously); a sketch of routing events by severity follows the list below. To prevent alert fatigue and ignored non-critical alerts, they hold bi-weekly reviews to:
- Re-evaluate if critical alerts should remain critical.
- Iterate through non-critical alerts to adjust thresholds or add automation.
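The post doesn't show how alerts reach PagerDuty, but one common way to express the two tiers is to send events to separate services with an appropriate severity via the Events API v2. The routing keys, source, and summary below are placeholders, not the author's setup.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// event models the minimal Events API v2 "trigger" payload.
type event struct {
	RoutingKey  string  `json:"routing_key"`
	EventAction string  `json:"event_action"`
	Payload     payload `json:"payload"`
}

type payload struct {
	Summary  string `json:"summary"`
	Source   string `json:"source"`
	Severity string `json:"severity"` // "critical" pages; "warning"/"info" can stay async
}

// sendAlert posts a trigger event. Critical alerts go to a service whose escalation
// policy pages on-call; non-critical alerts go to a service reviewed asynchronously.
func sendAlert(ctx context.Context, critical bool, summary string) error {
	routingKey := "NONCRITICAL_SERVICE_ROUTING_KEY" // placeholder
	severity := "warning"
	if critical {
		routingKey = "CRITICAL_SERVICE_ROUTING_KEY" // placeholder
		severity = "critical"
	}

	body, err := json.Marshal(event{
		RoutingKey:  routingKey,
		EventAction: "trigger",
		Payload:     payload{Summary: summary, Source: "checkout-service", Severity: severity},
	})
	if err != nil {
		return err
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://events.pagerduty.com/v2/enqueue", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted {
		return fmt.Errorf("pagerduty returned %s", resp.Status)
	}
	return nil
}

func main() {
	_ = sendAlert(context.Background(), false, "p99 latency above threshold")
}
```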
For post-mortems, they found dedicated tools from PagerDuty and Datadog too inflexible and instead use Notion for greater customization.
The case for Lambda and GitOps trade-offs
The author argues that cost comparisons between AWS Lambda and EC2 are often flawed, as EC2 instances rarely run at 100% utilization and scaling logic is imperfect. A hidden benefit of Lambda is precise, easily attributable cost tracking.
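To make that argument concrete, here is a small back-of-the-envelope sketch comparing an always-on instance with per-invocation Lambda pricing. All prices, request volumes, and utilization figures are made-up round numbers for illustration, not figures from the post.

```go
package main

import "fmt"

func main() {
	// Hypothetical round numbers, not real price quotes.
	const (
		ec2HourlyUSD  = 0.10 // always-on instance, billed whether busy or idle
		hoursPerMonth = 730.0
		utilization   = 0.25 // instances rarely run anywhere near 100% busy

		requestsPerMonth = 5_000_000.0
		lambdaPerRequest = 0.0000002 // per-invocation charge
		lambdaPerGBSec   = 0.0000167 // compute charge per GB-second
		memoryGB         = 0.5
		avgDurationSec   = 0.1
	)

	ec2Monthly := ec2HourlyUSD * hoursPerMonth
	lambdaMonthly := requestsPerMonth * (lambdaPerRequest + lambdaPerGBSec*memoryGB*avgDurationSec)

	fmt.Printf("EC2, always on: $%.2f/month\n", ec2Monthly)
	fmt.Printf("Lambda:         $%.2f/month\n", lambdaMonthly)
	// The naive per-hour comparison ignores idle time: at 25% utilization the
	// instance effectively costs 4x its sticker price per hour of useful work.
	fmt.Printf("EC2 per busy hour at %.0f%% utilization: $%.2f\n",
		utilization*100, ec2HourlyUSD/utilization)
}
```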
They use GitOps (with Flux) to manage infrastructure and say it has scaled well. The main downside is the loss of a clear, linear pipeline visualization, requiring investment in custom tooling to answer "why isn't my commit deployed yet?"
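As an example of the kind of custom tooling this implies, here is a hedged sketch that answers the question by comparing the local HEAD with the revision Flux last applied. It shells out to `git` and `kubectl`; the Kustomization name and namespace are assumptions about a typical Flux v2 layout, not the author's actual setup.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"strings"
)

// run executes a command and returns its trimmed stdout, aborting on error.
func run(name string, args ...string) string {
	out, err := exec.Command(name, args...).Output()
	if err != nil {
		log.Fatalf("%s %v: %v", name, args, err)
	}
	return strings.TrimSpace(string(out))
}

func main() {
	// Commit the developer expects to be live.
	head := run("git", "rev-parse", "HEAD")

	// Revision Flux last applied, e.g. "main@sha1:3f2a...". The Kustomization
	// name ("apps") and namespace are placeholders for whatever the cluster uses.
	applied := run("kubectl", "get", "kustomization", "apps",
		"-n", "flux-system",
		"-o", "jsonpath={.status.lastAppliedRevision}")

	deployedSHA := applied
	if i := strings.LastIndex(applied, ":"); i >= 0 {
		deployedSHA = applied[i+1:]
	}

	if deployedSHA == head {
		fmt.Println("your commit is deployed:", head)
	} else {
		fmt.Printf("not deployed yet: cluster is at %s, HEAD is %s\n", deployedSHA, head)
	}
}
```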
Avoiding infrastructure pitfalls
The author strongly advises against a monolithic, shared database for multiple services. It becomes "cared for by no one," creates painful, coordinated migrations, and turns the database into a single point of failure for deployments.
They regret sticking with Google Workspace for identity and permissions, wishing they had adopted Okta sooner for its flexibility and broad SaaS integrations.
For internal tools, they self-host AppSmith to provide UIs for internal scripts, finding it sufficient versus paid alternatives like Retool.
Tooling and platform choices
The startup's core toolchain has evolved to a specific set of platforms:
- Communication & Docs: Slack (with strict channel hygiene) and Notion.
- Project Management: Linear over Jira ("so bloated I'm worried... it would just turn fully sentient").
- CI/CD & Code: GitHub with GitHub Actions, though self-hosted runners on EKS are buggy.
- Infrastructure as Code: Terraform over CloudFormation or Pulumi, run via Atlantis.
- Observability: Datadog, but its per-instance pricing is criticized as punitive for dynamic Kubernetes and GPU-heavy AI workloads.
- Schema Management: Schema-as-code checked into Git, with a tool to generate sync SQL.
Kubernetes ecosystem recommendations
For their EKS clusters, the author provides several specific endorsements:
- Autoscaling: Use Karpenter over the cluster autoscaler or SpotInst.
- Secrets: Use ExternalSecrets to sync from AWS Secrets Manager, not Sealed Secrets.
- DNS & Certificates: Use ExternalDNS and cert-manager.
- Helm & Packaging: Helm v3 works well; they store charts as OCI artifacts.
- Node OS: Standard EKS Optimized AMIs over Bottlerocket for easier debugging.
They advise against over-engineering, skipping service meshes in favor of stable, simple tools like Nginx. For distributing internal tools, they use Homebrew taps.
Final regrets and strong endorsements
The author's biggest technical regrets are not using OpenTelemetry from the start and not implementing automated dependency updates with RenovateBot sooner. They strongly endorse prioritizing internal automation and documentation, even under feature pressure, and choosing a primary programming language like Go that is easy for new engineers to learn.
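On the OpenTelemetry point, the kind of baseline the author wishes had existed from day one looks roughly like the sketch below: a tracer provider exporting OTLP over gRPC, wired in once so every service starts out instrumented. The collector endpoint and service name are placeholders, and in practice the endpoint is usually set via environment variables rather than hard-coded.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func main() {
	ctx := context.Background()

	// OTLP/gRPC exporter pointed at a collector; the address is a placeholder.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector.internal:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Tag all spans from this process with a service name.
	res, err := resource.New(ctx,
		resource.WithAttributes(semconv.ServiceNameKey.String("example-service")),
	)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(res),
	)
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// From here on, any code can start spans through the global tracer.
	_, span := otel.Tracer("example").Start(ctx, "startup-check")
	span.End()
}
```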