Data Reliability Engineer
Job Description
This Data Reliability Engineer role is pivotal in maintaining the reliability and stability of production data pipelines and data platform services. Key responsibilities include:
1. Diagnosing and resolving data pipeline failures, delays, and data quality issues.
2. Investigating issues across distributed data systems such as Spark/EMR workloads and data warehouse performance.
3. Leading or supporting incident response, including triage, mitigation, and long-term resolution.
4. Performing root cause analysis (RCA) and implementing durable fixes to prevent recurrence.
5. Defining and improving data SLAs (freshness, latency, completeness) and ensuring adherence.
6. Designing and enhancing monitoring, alerting, and observability for data systems.
7. Developing automation and tooling to reduce operational toil and improve system resilience.
8. Contributing to disaster recovery (DR) and resiliency planning, including backup validation and recovery workflows.
9. Partnering with engineering teams to improve pipeline design, reliability, and operational readiness.
10. Creating and maintaining runbooks, SOPs, and operational documentation.
11. Participating in occasional off-hours support for production data systems when required.
Qualifications
To excel in this Data Reliability Engineer position, the following qualifications are essential:
1. A minimum of 5 years of experience working with production data platforms in AWS environments.
2. Prior experience building data pipelines and seeing them through production, including exposure to real-world failures and operational challenges.
3. Strong experience with Python and SQL in real data systems.
4. Hands-on experience troubleshooting distributed data processing systems (e.g., Spark/EMR, Redshift, streaming systems).
5. Proven ability to debug and resolve production issues in data pipelines and data platforms.
6. Experience with AWS data services (such as EMR, Redshift, DynamoDB, S3, or similar).
7. Experience handling production incidents and performing root cause analysis.
8. A strong problem-solving mindset and ability to work through ambiguous production issues.
Benefits
Employees in this role enjoy a comprehensive benefits package:
- Medical, dental, vision, and life insurance
- Retirement savings: 401(k) plan with generous company matching contributions (up to 6%)
- Tuition reimbursement up to $5,250/year
- Business-casual environment that includes the option to wear jeans
- Generous paid time off upon hire, including a paid time off program plus ten paid company holidays and three floating holidays each calendar year
- Paid volunteer time: 16 hours per calendar year
- Leave of absence programs, including paid parental leave, paid short- and long-term disability, and Family and Medical Leave (FMLA)
- Business Resource Groups (BRGs), facilitating inclusion and collaboration across our business internally and throughout the communities where we live, work and play
