Senior Site Reliability Engineer (SRE) with deep expertise in Red Hat OpenShift and infrastructure automation for our banking client
S.i. Systèmes
Toronto, ON
Number of positions to fill: 1
Salary: To be discussed
Employment type: Contract
Posted on October 23, 2025
Start date: As soon as possible
Description
We are seeking an experienced Site Reliability Engineer (SRE) with deep expertise in Red Hat OpenShift and infrastructure automation. The ideal candidate will have hands-on experience deploying, maintaining, and optimizing OpenShift clusters in both on-premise and cloud environments.
This role requires a strong understanding of platform reliability, networking, GitOps practices, and enterprise security standards. The successful candidate will work closely with development and infrastructure teams to ensure seamless CI/CD processes, high availability, and efficient incident response.
Location - Downtown Toronto
Work Mode - Mostly remote, some onsite work
Duration - ASAP to Feb 27, 2026 with possibility of extension
Must-Have
- 8+ years of experience in infrastructure, DevOps, or SRE roles, including 3+ years focused on OpenShift administration.
- Proven experience with OpenShift installation, configuration, and lifecycle management in both on-prem and cloud environments.
- Expertise with Terraform and Ansible for automation and configuration management.
- Strong hands-on experience with ArgoCD and GitOps workflows.
- Working knowledge of Red Hat Advanced Cluster Management (ACM) for managing multiple OpenShift Container Platform (OCP) clusters.
- Proficiency in F5 load balancer configuration and networking fundamentals (DNS, routing, firewalls, subnets).
- Experience building observability stacks (Prometheus, Grafana, ELK, Alertmanager).
- Solid understanding of TLS/mTLS, certificate management, and security hardening.
- Proven track record in incident response, RCA, and postmortem analysis.
- Experience defining and managing SLIs/SLOs for production services.
- Familiarity with CI/CD pipelines, Kubernetes-native tools, and container orchestration principles.
- Strong scripting skills (e.g., Bash, Python, or Go).
Responsibilities
- Install, configure, upgrade, and administer OpenShift Container Platform (OCP) clusters in on-premises and cloud environments.
- Manage OCP internal networking, ingress, egress, and cluster services.
- Configure and integrate LDAP authentication and access management.
- Implement TLS and mTLS encryption, and manage the certificate lifecycle for secure communications.
- Implement GitOps workflows using ArgoCD for continuous delivery and environment consistency.
- Manage multi-cluster orchestration using Red Hat Advanced Cluster Management (ACM).
- Automate platform and application provisioning using Terraform and Ansible.
- Configure and maintain F5 LTM load balancers.
- Configure and manage DNS, networking, and subnets.
- Build and manage monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, ELK).
- Define and enforce SLIs/SLOs and error budgets for services running on OCP.
- Lead incident response, root cause analysis (RCA), and postmortems.
- Build automation for self-healing, scaling, and zero-touch operations.
- Ensure high availability, disaster recovery, and failover strategies are implemented.
- Secure platform and workloads following enterprise security standards.
- Support application deployments and CI/CD pipelines on OpenShift.
- Troubleshoot networking, cluster, and deployment issues end-to-end.
- Apply SRE best practices to improve reliability, scalability, and performance.
- Collaborate with development and platform teams to optimize system operations.