Site Reliability Engineer (SRE)

Category: coding

Compensation: $90 – $140/hr

Employment Type: Remote

Locations: USA, UK, Canada, Germany, Australia

Skills: SRE, Observability, Reliability

Job Description

About this role

SRE is software engineering applied to operations — and most AI assistants treat ops as an afterthought, not a discipline. As a Site Reliability Engineer for AI training, you will help AI generate code and documentation that takes SLOs, error budgets, and incident response seriously, the way real on-call engineers do.

Key Responsibilities

• Generate and evaluate instruction-response pairs covering SLOs, error budgets, and reliability math.

• Review AI-generated code for observability (Prometheus, OpenTelemetry, Grafana, Datadog).

• Provide feedback on alerting design, runbooks, and post-incident reviews.

• Validate AI handling of chaos engineering, load testing, and capacity planning.

• Evaluate AI-generated incident response procedures and on-call handoffs.

• Identify subtle issues such as alert fatigue, metric cardinality explosions, and silent failures.

Ideal Qualifications

• 5+ years in SRE, production engineering, or platform reliability.

• Deep familiarity with Prometheus/OpenTelemetry-based observability stacks.

• Strong grasp of distributed-systems failure modes and incident management.

• Experience writing runbooks and leading post-incident reviews.

• Comfort with Go, Python, or another systems language.

• Required: experience operating at scale on a major cloud provider (AWS, GCP, Azure).

Project Timeline

• Start Date: Immediate

• Duration: Ongoing

• Commitment: Flexible, 10 – 25 hours/week

Contract & Payment Terms

• Independent contractor agreement

• Remote work from any of the eligible locations

• Weekly payment via Stripe or bank transfer

• Flexible hours

Bring real on-call discipline to AI's view of reliability — apply now!

Apply for Site Reliability Engineer (SRE) at IXO