As a Senior Site Reliability Engineer (SRE) at Pankh.AI, you will be responsible for the reliability and performance of our critical services.
What you'll do
- Define and maintain SLOs/SLIs for all critical services
- Lead incident response, post-mortems, and reliability improvements
- Design and implement auto-scaling, self-healing infrastructure
- Build internal tools for deployment automation and observability
- Drive chaos engineering practices to proactively identify weaknesses
What we're looking for
- 5 to 8 years of SRE or DevOps experience running large-scale production systems
- Deep expertise in Kubernetes, service mesh, and cloud-native architecture
- Strong programming skills in Python or Go for tooling and automation
- Experience operating systems with uptime requirements of 99.9% or higher
Nice to have
- Google Cloud Professional Cloud DevOps Engineer certification or equivalent
- Experience with chaos engineering tools such as Gremlin or LitmusChaos