San Francisco, CA · Noida · Full-time · Senior level
Join our SRE team to ensure the reliability and performance of systems serving millions of users across the globe.

Key Responsibilities:
- Design and implement SLO/SLI frameworks for critical services
- Build automated incident response and recovery systems
- Optimize system performance and reduce operational toil
- Design disaster recovery and business continuity solutions
- Lead capacity planning and infrastructure scaling initiatives

Requirements:
- 5+ years of SRE or DevOps experience
- Strong background in Linux systems and networking
- Experience with Infrastructure as Code (Terraform, Pulumi)
- Proficiency in Python, Go, or Bash scripting
- Experience with monitoring systems (Prometheus, Grafana, Datadog)
Skills: Kubernetes, Terraform, Prometheus, Python, Linux, AWS, Incident Management