
The Walt Disney Company
Job Description:
As a Senior Engineer, you are seen by your fellow team members as a "go-to" individual: someone who clearly understands SRE principles and best practices and can explain them thoroughly to any audience. To be successful in this role, you will continuously uphold and improve every relevant reliability aspect of our services, with an increased focus on SLIs and SLOs, while raising the reliability of a variety of large-scale user-facing and internal services.
Job Responsibilities:
- Continuously refine monitoring processes, configurations, and thresholds;
- Practice and promote sustainable incident response and blameless postmortems;
- Develop runbooks and tools to streamline processes and shorten problem resolution time;
- Add, tune and maintain alert configurations and documentation as needed;
- Identify areas of improvement in reliability, efficiency, and operations;
- Consult on best practices and develop tools to enable smooth adoption of good service reliability practices and methods;
- Collaborate and provide technical excellence within and across teams;
- Build tools to help your SRE team quickly pinpoint, isolate and resolve issues related to infrastructure, platform services and applications;
- Write code that improves scalability, performance, maintainability, and security;
- Develop useful telemetry, alerts, and response procedures to reduce Mean Time To Repair (MTTR);
- Deploy and manage modern cloud technologies using infrastructure-as-code, self-healing, and security automation patterns.
Job Requirements:
- Creative, innovative, outside-the-box thinking
- Experience in designing, building, and operating large-scale production systems
- Excellent communication skills, both verbal and written
- 5-7 years of experience in SRE, DevOps, technical operations, systems engineering, software engineering, or a related discipline
- Proficient, collaborative, and experienced in building reliable, scalable enterprise systems
- Passionate and curious about ways to leverage technology while continually learning
- Ability to identify root causes of instability in high-traffic, large-scale distributed systems
Qualification & Experience:
- Comfortable in one or more of the following languages (Python, Java, Scala, Go, Rust, Ruby, or similar)
- Experience with scripting languages such as Ruby, Bash, PowerShell, or Python;
- Expertise in scalable testing, automation, and continuous integration frameworks and best practices;
- Knowledge of best practices and IT operations in an always-up, always-available service;
- Skilled in using containers in enterprise production environments (e.g. Docker, Kubernetes, LXC, AWS ECS and EKS)
- Experience with continuous integration tools (e.g. Jenkins, GitLab CI/CD, AWS CodeBuild, CodeDeploy, CodePipeline, Azure DevOps, Spinnaker)
- Skilled in Cloud/PaaS/SaaS Environments (e.g. AWS, Azure, Google Cloud Compute)
- Experience with configuration management and orchestration tools (e.g. Terraform, CloudFormation, Ansible)
- Hands-on experience using source control (Git, GitHub) and feature branching strategies
Job Details:
Company: The Walt Disney Company
Vacancy Type: Full Time
Job Location: Manchester, England, United Kingdom
Application Deadline: N/A