# Senior Site Reliability Engineer - Palo Alto, US

**Company:** [Kody](http://jobs.workable.com/companies/aDWSRg9smKoHu3GYFnXU9H.md)
**Location:** Palo Alto, United States
**Workplace:** Hybrid
**Employment type:** Full-time
**Department:** Technology

[Apply for this job](http://jobs.workable.com/view/16e1bab4-0f5b-45e0-907e-e857078da6ec)

## Description

**Senior Site Reliability Engineer (Payments Infrastructure)**  
Kody is seeking a Senior Site Reliability Engineer to ensure the reliability, availability, scalability, and operational excellence of our global payment platform. You will own production observability, incident response, service-level management, and cloud infrastructure reliability across mission-critical payment processing systems operating in Europe, Asia, and North America.  
  
**Responsibilities**  

-   Participate in a follow-the-sun production on-call rotation as a primary incident responder.
-   Diagnose, triage, mitigate, and coordinate resolution of production incidents across payment services, Kubernetes platforms, databases, messaging systems, and cloud infrastructure.
-   Define and maintain SLOs, SLIs, error budgets, alerting standards, and operational readiness processes.
-   Drive reliability improvements through automation, observability, capacity planning, performance optimization, and post-incident reviews.
-   Partner with engineering teams to improve resilience, security, and operational maturity in PCI-DSS-regulated environments.
-   Lead incident management during SEV1/SEV2 events and improve response effectiveness and MTTR.
-   **Cross-Border Collaboration:** Serve as a key technical bridge between our US operations and international engineering hubs, using bilingual communication to keep teams aligned on complex technical decisions.

## Requirements

-   5+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Cloud Infrastructure roles supporting mission-critical production systems.
-   Strong hands-on experience with AWS, Kubernetes (EKS), Terraform, PostgreSQL, Redis, Kafka, Linux, networking, and modern observability platforms.
-   Deep understanding of distributed systems, cloud-native architectures, high availability, disaster recovery, capacity planning, and performance optimization.
-   Proven experience operating payment, banking, fintech, or other highly regulated systems with stringent security, compliance, and uptime requirements.
-   Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, incident management, alert governance, and operational excellence.

**Leadership & Operational Excellence**  

-   Demonstrates strong ownership and accountability, taking end-to-end responsibility for service reliability and customer impact.
-   Possesses a strong sense of urgency during production incidents while maintaining sound judgment and structured decision-making under pressure.
-   Applies a systematic and methodical approach to troubleshooting, root-cause analysis, and incident resolution in complex distributed environments.
-   Brings a data-driven mindset, using metrics, telemetry, trends, and service-level indicators to prioritize reliability investments and operational improvements.
-   Continuously drives engineering excellence through iterative improvement, automation, standardization, and elimination of operational toil.
-   Proven ability to lead cross-functional incident response efforts, coordinate stakeholders, and communicate effectively during high-severity production events.
-   Champions a culture of operational readiness, continuous learning, post-incident improvement, and blameless accountability.
-   Demonstrates strong mentoring and technical leadership skills, influencing engineering teams to build reliable, scalable, and resilient systems by design.

## Benefits

-   Competitive packages aligned with California market standards
-   Lead a dynamic and innovative team in a rapidly growing company
-   Collaborative, inclusive environment where your contributions are recognized and valued
