This position focuses on providing strategic direction for, and executing, infrastructure, security, continuous integration, deployment, and IT operations practices, as well as scaling and metrics, and on running day-to-day production and development infrastructure operations for cloud-based hosted platforms.
The Site Reliability Engineer (SRE) will work with other Software Engineers, Database Engineers, and Product Managers to analyze system and network loads, address stability and performance challenges, and collaborate with others to operate various systems. The SRE performs ongoing application and platform support by diagnosing and resolving issues, maintaining applications, and evaluating and recommending options for improving performance, maintainability, and operability. This also includes streamlining processes to increase system scalability and reliability, improve efficiency, and minimize errors. The SRE will also contribute to core cloud technologies that span the enterprise to ensure cost and security objectives are met by implementing standards and policies via code, enhancing reporting, and developing automated remediations.
Essential Duties
Essential duties include, but are not limited to, the following:
- Maintain and manage the SRE backlog of technical debt to deliver automation workflows for deploying infrastructure and applications.
- Engage with cross-functional projects and program management teams on architectural design and review sessions. Implement effective, value-driven processes to deliver innovative solutions that meet business requirements. Develop, collect, and build dashboards for key performance indicators (KPIs) to measure and report on service level objectives and availability.
- Mentor engineering team members by providing regular, in-depth feedback during the peer review process. Ensure the SRE peer review process is followed to deliver high-quality code.
- Ability to work with and use Amazon Web Services (AWS) and/or other cloud technology platforms, Ansible, Python, Bash, Java, JavaScript, and Linux.
- Understanding of security and encryption best practices.
- Design, build, maintain, and scale production services and server farms across multiple data centers for complex and data-intensive cloud services.
- Design and enhance software architecture to improve scalability, service reliability, capacity, and performance.
- Write automation code for provisioning and operating infrastructure at massive scale. You are not an operator; you’re an experienced software engineer focused on operations.
- Work with development teams to ensure scalability and reliability are designed and implemented from the ground up.
- Roll up your sleeves to troubleshoot incidents, formulate theories, test your hypotheses, and narrow down possibilities to find the root cause.
- Write postmortem reviews and remediation recommendations.
- Identify bad trends before they become problems; respond to automated system alerts, effectively troubleshoot system errors, and work incidents to return systems to normal operating conditions.
- Author and update high-quality documentation of all relevant specifications, systems, and procedures.
- Other duties as assigned.
- Uphold company mission and values through accountability, innovation, integrity, quality, and teamwork.
- Support and comply with the company’s Quality Management System policies and procedures.
- Regular and reliable attendance.
- Ability to work a designated schedule.
- Ability to work nights and/or weekends.
- Participate in the on-call rotation supporting the infrastructure.
- Ability to work on a mobile device, tablet, or in front of a computer screen and/or perform typing for approximately 90% of a typical working day.
Minimum Qualifications
- Bachelor’s degree in computer science or a related field, or a high school diploma/general education diploma and 5 years of relevant experience in lieu of a degree.
- 5+ years of experience with modern application design, cloud environments, or platform engineering:
  - CI/CD pipeline design (e.g., YAML scripting in GitLab pipelines)
  - Infrastructure as Code (Terraform, AWS CDK, CloudFormation)
- Demonstrated proficiency in writing platform services using Python or Golang.
- Demonstrated experience with container orchestration (very familiar with Docker; experience with AWS ECS, AWS EKS, or Kubernetes).
- Security configuration
- Maintain a library
- More complex work: managing several CloudFormation stacks
- Leverage the AWS Cloud Development Kit (CDK)
- Resource management
- Nice to have: a specialization in security
- Ability to understand complex automation mechanisms for CI/CD working with a cloud identity provider:
  - AWS Config, AWS Guard Manager
  - Cloud identity provider: Okta (moving to Microsoft Entra ID), HashiCorp Vault