# Site Reliability Engineer

**Company:** [Tecsys Inc.](http://jobs.workable.com/companies/evwr4TLTPYjVHNokPuh6yy.md)
**Location:** Montreal, Canada
**Workplace:** hybrid
**Employment type:** Full-time
**Department:** Product & Technology

[Apply for this job](http://jobs.workable.com/view/4a29f362-4b02-493a-8031-e59b549aa2f4)

## Description

Having recognized the advantages of remote work, including improved employee morale and productivity and the benefits of reduced commuting for employee wellbeing and the environment, we are proud to be a digital-first company. The technologies and programs in which we invested have provided a fantastic foundation to this end. Our digital-first work environment, together with our conveniently located offices and collaborative workspaces, provides our team with the freedom and flexibility to work in the way that makes them most productive.

### About us

Tecsys is a fast-growing innovator offering supply chain solutions to organizations ranging from industry-leading healthcare systems, hospitals, and pharmacy businesses to distributors, retailers, and 3PLs. We work with industry leaders to transform their supply chains through technology. If you thrive on tackling interesting challenges with continuous learning opportunities, then Tecsys could be a good fit for you!

### About the Role

We are looking for a Site Reliability Engineer to join our Network and Security Operations Center (NOC), a team at the heart of platform reliability for mission-critical SaaS environments. You will help **maintain, optimize, and ensure the reliability and performance** of the systems that power our cloud infrastructure across **AWS** and **Kubernetes**, with a strong focus on **automation**, **observability**, and **continuous improvement**. This role blends reliability engineering with incident command, giving you real ownership over uptime, performance, and innovation. You will be part of a highly skilled team that values creative problem-solving, operational excellence, and continuous improvement through automation and resilience engineering.

### **Responsibilities**

-   **Collaborate** with engineering teams to support services from design through launch, including system design consulting, capacity planning, and launch reviews
-   **Maintain service reliability** post-deployment by monitoring availability, latency, and overall system health
-   **Identify pain points** and **drive continuous improvements** to enhance scalability, simplicity, and platform resilience
-   **Own observability** by developing and improving monitoring, alerting, and dashboards, and by defining SLOs/SLIs (Datadog)
-   **Build and enhance** automation, internal tooling, and IaC frameworks (Terraform, CI/CD) to enable scalable and self-healing systems
-   **Scale systems sustainably** through automation and reliability-focused improvements
-   **Leverage AI tools** (e.g., Amazon Kiro) to accelerate execution while validating outputs
-   **Lead incident response**, act as Incident Commander when required, and drive blameless postmortems with long-term fixes
-   **Implement** and **maintain logging**, **monitoring**, **alerting**, and SLA reporting practices
-   **Create** and **maintain** technical documentation and contribute to SRE best practices
-   **Partner with** platform engineering, deployment teams, and cross-functional stakeholders to support growth and system stability
-   Collaborate with internal teams and vendors globally to ensure high performance, availability, and reliability across environments

## Requirements

### **Qualifications** 

-   Strong relevant experience in Site Reliability, Cloud, or DevOps Engineering in SaaS or large-scale production environments
-   Strong experience with **AWS** (multi-account, VPC, EC2, EKS) and **Kubernetes** at scale
-   Hands-on expertise with **Infrastructure as Code** and **automation tools** (Terraform, Ansible, or similar)
-   Experience with **CI/CD pipelines** and **release automation** (GitLab preferred, Jenkins acceptable)
-   Proficiency in **monitoring and observability tools** (Datadog or equivalent), including metrics, logging, alerting, and dashboards
-   Experience designing, deploying, and operating large-scale, distributed systems and multi-vendor platforms
-   Solid **incident management** experience, including **on-call rotations**, **escalations**, and **postmortems**
-   Strong **scripting skills** in Python, Bash, Java, or similar for automation and diagnostics
-   Familiarity with **AI-assisted engineering tools** (e.g., Amazon Kiro) and ability to validate outputs effectively
-   Basic knowledge of Java or .NET-based development environments
-   Proactive mindset with strong ownership, problem-solving, and knowledge-sharing habits
-   Willingness to participate in **on-call rotations** and **occasional travel** to the office (less than 10%)

We understand that experience comes in many forms and that careers are not always linear. If you don't meet every requirement in this posting, we still encourage you to apply.

At Tecsys, we are committed to fostering a diverse and inclusive workplace where all employees feel valued, respected, and empowered. We believe that diversity drives innovation and strengthens our ability to deliver exceptional solutions. We welcome and encourage applicants from all backgrounds, experiences, and perspectives to join our team.

_Tecsys is an equal opportunity employer. Accommodation is available for applicants selected for an interview._

NB: To apply for this position, you must be a Canadian citizen or a permanent resident of Canada, **OR** hold a valid Canadian work permit.

\*\*\*

**A Note on Our Hiring Process:** We do not use AI to automatically screen or reject candidates. However, we do use specific screening questions to prioritize the most relevant applications for human review.  
  
At Tecsys, we welcome the thoughtful use of AI tools to help you prepare your application, for example, to improve clarity, organize your resume, or practice interview responses. However, we ask that all information you provide reflects your real experience, and that any assessments or written submissions represent your own work and thinking.

During interviews, we expect candidates to engage without the use of AI tools, scripts, or real-time assistance. Authentic, direct conversation helps us get to know how you think, collaborate, and communicate. AI can support your preparation, but it shouldn’t speak or act on your behalf. We genuinely want to meet _you_.
