About the Role
We are looking for a Site Reliability Engineer (SRE) to join our platform team and ensure the reliability, scalability, and performance of our production infrastructure.
What You Will Do
- Design and implement monitoring, alerting, and incident response systems
- Build and maintain CI/CD pipelines and deployment automation
- Optimize system performance and reduce latency across services
- Conduct capacity planning and manage cloud infrastructure (AWS/GCP)
- Drive post-incident reviews and implement reliability improvements
- Develop internal tooling to improve developer productivity
What We Are Looking For
- 3 to 5 years of experience in SRE, DevOps, or Infrastructure Engineering
- Strong experience with Docker and with container orchestration on Kubernetes
- Proficiency in at least one programming language (Python, Go, or Bash)
- Experience with monitoring tools (Prometheus, Grafana, Datadog)
- Deep understanding of Linux systems, networking, and distributed systems
- Experience with infrastructure-as-code (IaC) tools (Terraform, Pulumi, or CloudFormation)
Nice to Have
- Experience with a service mesh (Istio, Linkerd)
- Familiarity with chaos engineering practices
- Contributions to open-source infrastructure tooling