Site Reliability Engineer (2+ years) to enhance production observability and incident response using Dynatrace, Splunk, Power BI, and Google Cloud (GCP) Mo

S.i. Systèmes

Toronto, ON

Salary: To be discussed
Job type: Contract
Posted 4 days ago
1 position to be filled as soon as possible

Description

Our financial services client is seeking a Site Reliability Engineer (2+ years) to enhance production observability and incident response using Dynatrace, Splunk, Power BI, and Google Cloud (GCP) Monitoring - 40822

Location: Toronto/Hybrid - 3 days in-office

Anchor days: Flexible

Contract Duration: 07/01/2026 to 06/30/2027 (Possibility of extension)

Schedule Hours: 9am-5pm Monday-Friday; standard 37.5 hrs/week

Story Behind the Need

• Business group: Client Engineering - Mobile and Web

• Project: Onboarding and Mobile Applications

We’re looking for an SRE with deep experience in production observability and incident response to raise the reliability and transparency of our customer-facing services. You will own the end-to-end observability stack across Dynatrace, Splunk, Power BI, and Google Cloud (GCP) Monitoring, drive proactive detection and reduction of toil, and lead major incident response. This role focuses on operational excellence and service health, NOT on platform engineering or DevOps provisioning.

Reason: This is a 1-year temporary contract position covering an FTE's maternity leave.

Candidate Value Proposition

The successful candidate will contribute to a high availability and application stability project.

RESPONSIBILITIES:

• Design and maintain end-to-end monitoring for critical services using Dynatrace (APM, Real User Monitoring, Synthetic, Davis AI, Smartscape) and GCP Cloud Monitoring (metrics, alerting policies, SLOs/SLIs, uptime checks, dashboards).

• Build service maps, dependency models, and problem detection in Dynatrace; tune Davis AI problem rules and reduce alert noise through thresholds, baselining, and tagging.

• Implement SLOs/SLIs with error budgets; continuously review burn rates and align alerting to customer impact.

• Partner with application teams to instrument code paths (e.g., Dynatrace OneAgent), trace distributed transactions, and validate golden signals (latency, traffic, errors, saturation).
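As a rough illustration of the error-budget and burn-rate review mentioned above, here is a minimal Python sketch. The 99.9% target, the request counts, and the function names are hypothetical and are not taken from the client's environment or from any Dynatrace or GCP API.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail under the SLO (about 0.001 for 99.9%)."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being consumed in the observed window.

    1.0 means the budget would be exactly used up over the SLO period;
    values above 1.0 mean it would run out early.
    """
    if total == 0:
        return 0.0  # no traffic observed, so nothing was burned
    observed_error_ratio = failed / total
    return observed_error_ratio / error_budget(slo_target)


if __name__ == "__main__":
    # Hypothetical window: 42 failed requests out of 10,000 against a 99.9% SLO.
    rate = burn_rate(failed=42, total=10_000, slo_target=0.999)
    print(f"burn rate: {rate:.1f}x")
```

In practice an alert would typically fire only when the burn rate stays above a chosen threshold over both a short and a long window, which keeps alerting tied to customer impact rather than to individual error spikes.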

Logging, Analytics & Insights (Splunk, Power BI)

• Create and optimize Splunk data models, indexes, sourcetypes, ingestion pipelines, and SPL searches; build actionable dashboards for NOC/SRE/Engineering.

• Develop operational analytics and executive reporting in Power BI (data modeling, DAX/Measures, scheduled refresh) to track reliability KPIs, incident trends, MTTR/MTTD, SLO compliance, and capacity signals.

• Establish governance for data quality, field extractions, and retention to ensure fast, accurate investigations.
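The MTTD and MTTR KPIs named above can be derived from incident timestamps before they are charted. The sketch below is only illustrative: the `Incident` fields and sample times are hypothetical, and it assumes MTTR is measured from incident start to resolution (some teams measure it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when customer impact began
    detected: datetime  # when monitoring or a human noticed it
    resolved: datetime  # when service was restored


def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean time to detect: average of (detected - started), in minutes."""
    return mean((i.detected - i.started).total_seconds() for i in incidents) / 60


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to resolve: average of (resolved - started), in minutes."""
    return mean((i.resolved - i.started).total_seconds() for i in incidents) / 60


# Hypothetical sample data for illustration only.
SAMPLE = [
    Incident(datetime(2026, 7, 1, 10, 0), datetime(2026, 7, 1, 10, 5), datetime(2026, 7, 1, 10, 45)),
    Incident(datetime(2026, 7, 2, 14, 0), datetime(2026, 7, 2, 14, 15), datetime(2026, 7, 2, 15, 15)),
]

if __name__ == "__main__":
    print(f"MTTD: {mttd_minutes(SAMPLE):.0f} min, MTTR: {mttr_minutes(SAMPLE):.0f} min")
```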

Incident Management & Problem Management

• Lead incident response (Sev1/Sev2): run bridges, coordinate SMEs, communicate status/timelines, drive mitigation and customer updates.

• Maintain runbooks, decision trees, and standard operating procedures; ensure blameless post-incident reviews (PIRs) with clear RCA, corrective actions, and preventative measures.

• Track and close problem tickets tied to recurring failure modes; verify effectiveness of fixes via metrics and error budgets.

Reliability Engineering & Automation (Light Coding)

• Use light coding/scripting to automate recurring tasks: alert tuning, data enrichment, log parsing, playbook triggers, service health checks.

• Build small utilities or bots for on-call workflows (e.g., auto-triage, context gathering, incident timelines).

• Contribute to observability standards and best practices (naming, tags, SLIs, alert policies), and mentor teams on instrumenting for reliability.

Candidate Requirements/Must Have Skills:

• 2+ years of experience as Site Reliability Engineer (SRE)

• 2+ years of experience with Production Operations/Observability with Dynatrace and Splunk in high-availability environments.

• Hands-on recent experience with GCP operations: Cloud Monitoring, Cloud Logging, Alerting Policies, Uptime Checks, SLOs/SLIs; familiarity with Error Reporting/Trace is a plus.

• Strong SPL (Splunk) and Dynatrace (APM/RUM/Synthetic) expertise, including alert design, dashboards, and noise reduction.

• Proven incident commander experience for Sev1/Sev2 with clear comms, stakeholder management, and PIR facilitation.

• Solid understanding of service reliability concepts: golden signals, SLOs/error budgets, capacity and saturation, graceful degradation.

• Strong analytical mindset with a bias to measurable outcomes (MTTD/MTTR, alert volume, SLO compliance).

Nice-To-Have Skills:

• Coding/scripting for automation and data manipulation (e.g., Python or PowerShell; Go/Bash a plus).

• Power BI (reports, DAX, dataflows)

• Previous experience with top banks or financial institutions is nice to have.

Best vs. Average Candidate:

The ideal candidate would be an experienced SRE with hands-on Production Operations/Observability experience using Dynatrace and Splunk in high-availability environments, recent hands-on experience with GCP operations, and strong Splunk and Dynatrace expertise. Previous experience with banks is nice to have.

Note: This role does NOT manage CI/CD, infrastructure provisioning, or platform build (Terraform/Kubernetes cluster ops). Collaboration with those teams is expected, but ownership remains on monitoring, analytics, incident response, and reliability outcomes.

Degrees or certifications:

• Bachelor's degree in a related field required

Candidate Review & Selection

1 round of interviews

Structure and Format: MS Teams Interview

SRE and Production support related questions. Behavioral questions based on previous experience.

Disclaimer:
AI may be used in evaluating candidates.
This posting is for an existing vacancy.

Requirements

Level of education

not determined

Diploma

not determined

Years of experience

not determined

Written languages

not determined

Spoken languages

not determined

Internal reference no.

153146
