Site Reliability Engineer (SRE)
Job Description
About this role
SRE is software engineering applied to operations, yet most AI assistants treat ops as an afterthought rather than a discipline. As a Site Reliability Engineer for AI training, you will help AI generate code and documentation that take SLOs, error budgets, and incident response as seriously as real on-call engineers do.
Key Responsibilities
• Generate and evaluate instruction-response pairs covering SLOs, error budgets, and reliability math.
• Review AI-generated code for observability (Prometheus, OpenTelemetry, Grafana, Datadog).
• Provide feedback on alerting design, runbooks, and post-incident reviews.
• Validate AI handling of chaos engineering, load testing, and capacity planning.
• Evaluate AI-generated incident response procedures and on-call handoffs.
• Identify subtle issues such as alert fatigue, cardinality explosions, and silent failures.
Ideal Qualifications
• 5+ years in SRE, production engineering, or platform reliability.
• Deep familiarity with Prometheus/OpenTelemetry-based observability stacks.
• Strong grasp of distributed-systems failure modes and incident management.
• Experience writing runbooks and leading post-incident reviews.
• Comfort with Go, Python, or another systems language.
• Required: familiarity with major cloud providers (AWS, GCP, Azure) at scale.
Project Timeline
• Start Date: Immediate
• Duration: Ongoing
• Commitment: Flexible, 10-25 hours/week
Contract & Payment Terms
• Independent contractor agreement
• Remote work from any eligible location
• Weekly payment via Stripe or bank transfer
• Flexible hours
Bring real on-call discipline to AI's view of reliability — apply now!