Senior Site Reliability Engineer (SRE) with deep expertise in Red Hat OpenShift and infrastructure automation for our banking client

Toronto, ON
  • Number of positions available: 1
  • To be discussed
  • Contract job
  • Starting date: As soon as possible

We are seeking an experienced Site Reliability Engineer (SRE) with deep expertise in Red Hat OpenShift and infrastructure automation. The ideal candidate will have hands-on experience deploying, maintaining, and optimizing OpenShift clusters in both on-premises and cloud environments.

This role requires a strong understanding of platform reliability, networking, GitOps practices, and enterprise security standards. The successful candidate will work closely with development and infrastructure teams to ensure seamless CI/CD processes, high availability, and efficient incident response.


Location - Downtown Toronto

Work Mode - Mostly remote, with some on-site work

Duration - ASAP to Feb 27, 2026, with possibility of extension


Must-Have

  • 8+ years of experience in infrastructure, DevOps, or SRE roles, including 3+ years focused on OpenShift administration.
  • Proven experience with OpenShift installation, configuration, and lifecycle management in both on-prem and cloud environments.
  • Expertise with Terraform and Ansible for automation and configuration management.
  • Strong hands-on experience with ArgoCD and GitOps workflows.
  • Working knowledge of Red Hat Advanced Cluster Management (ACM) for managing multiple OpenShift (OCP) clusters.
  • Proficiency in F5 load balancer configuration and networking fundamentals (DNS, routing, firewalls, subnets).
  • Experience building observability stacks (Prometheus, Grafana, ELK, Alertmanager).
  • Solid understanding of TLS/mTLS, certificate management, and security hardening.
  • Proven track record in incident response, RCA, and postmortem analysis.
  • Experience defining and managing SLIs/SLOs for production services.
  • Familiarity with CI/CD pipelines, Kubernetes-native tools, and container orchestration principles.
  • Strong scripting skills (e.g., Bash, Python, or Go).


Responsibilities

  • Install, configure, upgrade, and administer OpenShift Container Platform (OCP) clusters in on-premises and cloud environments.
  • Manage OCP internal networking, ingress, egress, and cluster services.
  • Configure and integrate LDAP authentication and access management.
  • Implement TLS and mTLS encryption, and manage the certificate lifecycle for secure communications.
  • Implement GitOps workflows using ArgoCD for continuous delivery and environment consistency.
  • Manage multi-cluster orchestration using Red Hat Advanced Cluster Management (ACM).
  • Automate platform and application provisioning using Terraform and Ansible.
  • Configure and maintain F5 LTM load balancers.
  • Configure and manage DNS, networking, and subnets.
  • Build and manage monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, ELK).
  • Define and enforce SLIs/SLOs and error budgets for services running on OCP.
  • Lead incident response, root cause analysis (RCA), and postmortems.
  • Build automation for self-healing, scaling, and zero-touch operations.
  • Ensure that high-availability, disaster-recovery, and failover strategies are in place.
  • Secure platform and workloads following enterprise security standards.
  • Support application deployments and CI/CD pipelines on OpenShift.
  • Troubleshoot networking, cluster, and deployment issues end-to-end.
  • Apply SRE best practices to improve reliability, scalability, and performance.
  • Collaborate with development and platform teams to optimize system operations.