*CONTRACT TO PERM* - Senior Site Reliability Engineer with Linux and Python experience to improve and optimize application batch jobs - 39027

Toronto, ON
  • Number of positions to fill: 1

  • Salary: to be discussed
  • Employment type: Contract

  • Start date: 1 position to fill as soon as possible


Location Address: Hybrid - 44 King - 3 days/week onsite (days will vary depending on team)

Subject to change: 3-4 days onsite may be required based on business needs

Contract Duration: 6 months (Must convert to perm after 6 months)

Schedule Hours: 9am-5pm Monday-Friday; standard 37.5 hrs/week


Story Behind the Need

  • Business group: Global Banking and Markets Engineering (GBME) is the fast-moving, award-winning technology engine that powers Scotiabank’s Corporate, Investment Banking and Capital Markets businesses. The team works with all GBME applications to ensure they are reliable.
  • Project: GBME is searching for SREs who are continuous learners and are eager to boost the capabilities of capital markets products and analytics platforms. The work focuses on improving and optimizing application batch jobs.
  • The resource will be aligned to an application portfolio in GBME and will ensure its batches are optimized and running resiliently, measured by SLA adherence for batch jobs.


Typical Day in Role:

  • Reliability & Performance: Ensure stability and optimize batch processing pipelines; reduce runtime and failure rates, engineering for resiliency.
  • Observability: Implement and maintain monitoring with Dynatrace; create dashboards, alerts, and runbooks.
  • Systems Engineering: Manage and tune Linux and Windows systems for performance and resilience.
  • Automation & Orchestration: Create, modify, and optimize Airflow DAGs; build CI/CD pipelines for automation.
  • Incident Management: Lead incident response, root cause analysis, and postmortems; enforce SLOs and reliability practices.
  • Security & Compliance: Apply security best practices and ensure regulatory compliance in systems and automation.


Must Have Skills:

1) 10+ years of relevant working experience

2) 7+ years’ Linux Systems Expertise: Kernel/OS tuning, networking, filesystem optimization, process management, and troubleshooting.

3) 5+ years’ experience with application performance monitoring

4) 7+ years’ experience with modern development languages (Python required; Java and others an asset)

5) 3+ years’ Airflow Expertise: DAG design best practices, SLA management, scheduler/executor tuning, and scaling strategies.

6) Proven experience optimizing batch workloads for performance, reliability, and cost. Strong understanding of distributed systems concepts such as retries, idempotency, backpressure, and data integrity. Strong understanding of backend systems and batch optimization.

7) Proven experience with containers and orchestration (Docker, Kubernetes).

8) Excellent incident management and root cause analysis skills.


Nice-To-Have Skills:

1) Dynatrace Mastery: Custom dashboards, KPIs, anomaly detection, tagging strategy, and alerting configuration.

2) Proficiency with CI/CD pipelines (GitHub Actions, Azure DevOps, Jenkins) and Infrastructure as Code (Terraform, Ansible).

3) Some experience with automated deployment.

4) Understanding of networking protocols and security principles

5) Capital Markets product knowledge

6) GCP Cloud experience

7) Experience working with real-time, high availability and low latency systems


Education:

Bachelor’s degree in Computer Science, Engineering, or a related field.

Cloud certifications an asset

IaC automation certifications an asset


Best vs. Average Candidate:

The ideal candidate is passionate about Site Reliability Engineering (SRE), with a strong focus on building reusable, efficient, and scalable environments. They thrive in an innovative, cross-functional team setting and bring a strong technical and engineering mindset to the role.

Key attributes of the successful candidate include:


  • Extensive batch processing experience and a hands-on approach to problem-solving.
  • Proficiency in programming, deep Linux system expertise, and solid application monitoring experience.
  • Ideally, a developer who has transitioned into an SRE role, combining development skills with reliability engineering practices.
  • Familiarity with typical SRE/DevOps tools is helpful but less critical for this position.


Candidate Review & Selection - Interview Process

2 rounds, 1 hour each, in person at 44 King

1st with the hiring manager (HM)

2nd with GBME

Requirements

Level of education

not specified

Years of experience

not specified

Written languages

not specified

Spoken languages

not specified