Netflix is the world's leading streaming video service, and our growth is accelerating. At Netflix, we are building new cloud management tools, pushing the limits of cloud-based technologies, and powering our explosive growth while at the same time improving the availability and reliability of our services.
In this role, your mission is to improve the availability of our distributed and cloud-based service. You can accomplish this by:
• Building automated alerting and visibility tools for Netflix Engineering teams
• Being the call leader for a service with millions of customers
• Working with individual service teams to adopt best practices for improving availability
• Extending the Simian Army (http://techblog.netflix.com/2011/07/netflix-simian-army.html)
• Inventing new best practices within our environment
You have been part of an operations or software engineering group that cared about getting that extra 9 of availability. You are able to take charge of an outage, see it through to resolution, and then ask the right questions to prevent the problem from recurring. You believe that automation is the only way to scale out a service and that any manual effort needs to be scrutinized, even if it is a 'one-off'.
While we proactively seek out candidates who are familiar with our current stack, we care more about hiring people who can learn new technologies and adapt quickly.
Technologies we use:
• Linux on EC2
• Python for tool building
• Cassandra for persistence
• Many other AWS services (ELB, S3, SDB, SNS, SQS, etc.) for infrastructure