SRE (Site Reliability Engineer) - Jobs

SRE (Site Reliability Engineer)
Senior ● Remote ● Full-time ● 2 years ago

About the job

You will be responsible for designing, implementing, and maintaining the practices and processes that keep our products and services running smoothly. As an SRE, you will work closely with the development, Cloud Engineering, and Security teams to identify and address performance issues, optimize system performance, and ensure the high availability and reliability of our systems.

What will you do?

Exercise best practices to ensure and improve high availability, reliability, and recoverability of our platforms.
Work with proprietary tools that mitigate weaknesses in incident management or software delivery.
Maintain disaster recovery and business continuity automation and perform routine DR trials.
Participate in platform management and capacity management practices.
Maintain SLIs, adjusting them as necessary to accurately represent service reliability as services evolve and grow.
Develop, maintain, and configure cloud observability systems (e.g., Datadog, Splunk, OpenTelemetry, APM).
Build flexible monitoring and alerting to proactively address issues before they become incidents.
Identify and address performance issues and optimize system performance.
Partner with development teams to establish application production readiness through rigorous testing and release procedures.
Participate in on-call rotations for incident response and postmortem investigation.
Participate in rigorous training both within and across engineering teams.

Qualifications

Bachelor's degree in Computer Science or related field
3-5 years of experience in Site Reliability Engineering or related field
Proficiency in one or more programming languages (Go, Java, Python, TypeScript, or shell scripting)
Experience with monitoring and logging tools such as Datadog, Splunk, ELK
Experience with cloud computing platforms such as AWS or Azure, with a preference for GCP.
Understanding of Linux and Windows operating systems and networking fundamentals
Understanding of distributed data streaming technologies such as Kafka.
Experience with containerization and orchestration tooling such as Docker Swarm, Kubernetes, or Helm
Strong troubleshooting and problem-solving skills
Strong understanding of cloud networking.
Strong understanding of DevOps principles and practices
Strong experience with distributed systems and microservices architecture
Strong experience working with Git and version control platforms (GitHub and GitLab).
Experience with issue tracking tools such as Jira.

Skills

Hard Skills

Agile
Ability to Close
Cloud Infrastructure

Soft Skills

Analysis and Problem Solving
Drive High Performance
Problem Solving
Results Oriented
Detail Oriented / Quality of Work

Technical Expertise

GCP
Java
Python