Cloud Infrastructure Engineer
Sector: IT / Computers / Software
Posted: Friday, 8 April 2022
A small to medium-sized company, based just outside the CBD in Cape Town and working with a US clientele, is looking for a Cloud Infrastructure / Site Reliability Engineer to join a team responsible for monitoring the production systems 24/7. We are looking to hire for our SA day shifts with weekend flexibility. Your primary responsibility will be to drive very high levels of uptime and reliability of our service and supporting cloud infrastructure. You will achieve this through a very good understanding of the supporting cloud infrastructure services, of how to monitor the health of these services, and of how the system uses them.
Cloud Infrastructure / Site Reliability Engineering is an engineering discipline devoted to helping an organization sustainably achieve the appropriate level of reliability in their infrastructure, services, and products. The Cloud Infrastructure / Site Reliability team plays a crucial role in our mission to reduce emergency response times and improve public safety.
You are responsible for providing support when there is an incident and managing communications and escalations around the incidents.
You will manage the monitoring of our entire platform and should feel comfortable continuously adding to and adjusting it to improve the coverage and accuracy of the relevant monitoring components. You welcome and see the benefit in automation. You are driven and determined to identify the root cause of problems and can accurately capture your findings to communicate with other teams. You must be comfortable performing well under pressure with tight deadlines and communicating with larger audiences.
DUTIES WILL INCLUDE, BUT ARE NOT LIMITED TO:
- Work with DevOps and DBA teams to support Cloud infrastructure.
- Work with the Analytics team to support Eclipse Analytics.
- Work with the Platform and other Development teams to support Nimbus/Radius front-end applications and back-end services.
- Work with the IoT team to support IoT devices.
- Work with the Customer Support team to provide technical support for customer-reported issues.
- Work with QA and Implementation teams to provide insight on application and infrastructure performance with future releases.
- Be in a scheduled rotation for on-call duties, which include receiving alerts from monitoring systems as well as internal escalations.
- Build and improve monitors and alerts to increase visibility of system health.
- Build tools or automation that can improve the efficiency of the SRE role or increase monitoring capabilities.
- Troubleshoot technical issues with infrastructure and applications.
- Act as Incident Commander when Incidents are created. Escalate to other teams, be a central communication channel across teams, and make detailed timeline entries of actions taken during the Incident.
- Produce Root Cause Analysis reports for customers.
- Write post-mortems for Incidents and review with internal teams.
REQUIREMENTS & SKILLS:
- Bachelor's degree in Computer Science, Management Information Systems, or equivalent field with 1-2 years’ experience as a Cloud Infrastructure or Site Reliability Engineer
- Experience with Cloud infrastructure and services (AWS, GCP or Azure), with a preference for Azure
- Infrastructure Engineer, Reliability engineer, DevOps engineer, or Software engineer background will be beneficial
- Familiarity with distributed systems and microservices
- Understanding of front-end and back-end architecture
- Experience with SQL databases
- Experience with Datadog or other monitoring and logging tools
- Programming/scripting skills in a major language or platform such as .NET, PowerShell, or Bash
- Experience with deployment tools such as Terraform, Ansible, Puppet
- Experience in Kubernetes
- Strong communication skills (written and verbal)
- Work well under pressure
- A good problem solver
- Have an inquisitive nature
- Like to keep things simple
- Can organise and plan well