Site Reliability Engineering Manager

Engineering/Technical
Cape Town – Western Cape

ENVIRONMENT:
A radio astronomy company is seeking a Site Reliability Engineering (SRE) Manager to build and lead the SRE team for a major telescope project in South Africa. The role uses SRE principles to support the planning, monitoring, and control of day-to-day operations and delivery for the Observatory's global IT and network systems, with a focus on the systems in South Africa. The software and computing systems follow large-scale agile principles, using a tailored version of the Scaled Agile Framework (SAFe), and the SRE Manager will be a key stakeholder within this framework as it evolves from construction to operations. The SRE Manager will also take an active part in implementing all aspects of Site Reliability Engineering across the global Observatory, including technical vision, observability, automation strategy, solution delivery, and platform incident and problem management. This leadership position carries both technical and people management responsibilities and requires participation in short- and long-term system and capability planning, as well as team and organizational planning. The position reports directly to the Head of Computing and Software.
 
Key Requirements:
Qualification:
  • BTech, Degree, Masters, or PhD in Computer Science, Information Technology, Information Systems, Computer Engineering, or related fields.
Experience:
  • BTech with 13 years of relevant experience; or
  • Degree with 9 years of relevant experience; or
  • Masters with 7 years of relevant experience; or
  • PhD with 5 years of relevant experience in fields such as Digital Signal Processing, FPGA design, development and verification, combined with software engineering, preferably in an engineering development project environment.
  • Experience in computer and network infrastructure implementation.
  • Significant experience in IT service, operations, and management, including responsibility over Service Level Agreements.
  • Leadership experience in IT Infrastructure or software teams.
  • Project management expertise.
  • Experience in IT systems engineering, application support, and user management.
  • Knowledge of IT governance and security, data governance and security, IT availability, resilience, and redundancy.
  • Experience supporting distributed software systems in production environments such as Cloud and/or Data Centres.
  • Procurement and IT asset management experience.
Knowledge:
  • Proven track record of building and managing high-performance teams in a software, IT, or technology-related industry.
  • Experience in asset lifecycle management and software asset management.
  • Resource management and prioritization skills.
  • Knowledge of IT Service Management disciplines and frameworks such as ITIL and Change Management.
  • Experience with Lean Agile project management.
  • Experience working in globally diverse teams.
  • Programming/scripting experience and capability across multiple platforms.
Skills/Abilities/Competencies:
Essential:
  • Experience working with Linux and within the Open-Source Software ecosystem.
  • Experience with DevOps tools, processes, and culture.
  • Knowledge of and/or certification in SRE, ITIL, or related IT Management processes.
  • Experience supporting and maintaining large-scale High-Performance Computing (HPC) and storage systems.
  • Advanced programming and/or scripting experience with languages such as Python.
  
ATTRIBUTES:
  • Passion for Excellence
  • World-class service
  • People-centered
  • Respect
  • Integrity and Ethics
  • Accountability