About the Role

We are looking for a Site Reliability Engineer (SRE) to join our platform team and ensure the reliability, scalability, and performance of our production infrastructure.

What You Will Do

Design and implement monitoring, alerting, and incident response systems
Build and maintain CI/CD pipelines and deployment automation
Optimize system performance and reduce latency across services
Conduct capacity planning and manage cloud infrastructure (AWS/GCP)
Drive post-incident reviews and implement reliability improvements
Develop internal tooling to improve developer productivity

What We Are Looking For

3-5 years of experience in SRE, DevOps, or Infrastructure Engineering
Strong experience with Docker and with container orchestration on Kubernetes
Proficiency in at least one programming language (Python, Go, or Bash)
Experience with monitoring tools (Prometheus, Grafana, Datadog)
Deep understanding of Linux systems, networking, and distributed systems
Experience with infrastructure-as-code (IaC) tools (Terraform, Pulumi, or CloudFormation)

Nice to Have

Experience with a service mesh (Istio, Linkerd)
Familiarity with chaos engineering practices
Contributions to open-source infrastructure tooling