Job Type
Full-Time Regular


On the SRE team you will have the opportunity to apply your experience to systems at scale – where a single week can involve shifting terabits of traffic between sites, deploying configuration changes to shave milliseconds off billions of requests, or enabling a new software feature on thousands of systems using automated tooling you designed and built.

Essential Duties and Responsibilities

  • Respond to incidents during on-call duty
  • Respond to complex customer escalations, which often cross system, network and software boundaries
  • Design, develop and maintain internal service metrics (SLA, SLO, SLI) in cross-team collaborations
  • Design, develop and maintain dashboards, tooling, alarms and playbooks in collaboration with operations teams to support service-level objectives
  • Design, develop and maintain reusable monitoring and canary infrastructure
  • Design, execute and evaluate performance experiments
  • Collaborate with development teams to complete production readiness checklists prior to major feature launches
  • Collaborate with operations and engineering teams in determining root cause of major incidents, performance anomalies, or other customer-impacting issues

Desired Skills and Experience

  • Experience with monitoring and alerting platforms (Prometheus and Alertmanager, Grafana, Zabbix, Nagios)
  • Experience with a Linux server environment
  • Experience with scripting languages (Python, Ruby, Perl)
  • Experience with systems programming languages (Go, C)
  • Experience with configuration management systems (Puppet, Ansible, Chef)
  • Expert-level proficiency in systems, network or software engineering
  • Excited about working on a remote-first engineering team
  • Proficient at troubleshooting complex systems
  • Production experience in a service provider environment
  • Comfortable with a software engineering workflow for collaboration and configuration management (branches, pull requests, merges, conflicts)

Projects you might work on

  • Product launches
  • Software and platform feature releases
  • Live streaming event planning and execution
  • Network reach and capability expansion
  • Network and system automation tooling development
  • Telemetry and monitoring system development
  • Defining service metrics (SLA, SLO, SLI) during new product development

Job ID EB-4646527289 / Posted 3 Weeks ago