Senior Site Reliability Engineer

J7OU3GZG3
Experience: 5-8 Years
Location: Mumbai, Pune and Bangalore
Department: Global Delivery

Zycus is a global leader in Source-to-Pay (S2P) procurement software, helping large enterprises drive efficiency, compliance, and measurable value across their procurement and finance operations. Trusted by leading Fortune 1000 organizations worldwide, Zycus enables procurement teams to move from cost control to strategic value creation.

At the core of Zycus’ platform is Merlin AI, an advanced AI-powered engine that brings intelligence, automation, and predictive insights across the entire procurement lifecycle—from sourcing and contract management to procurement, invoicing, and supplier management. Merlin AI empowers Chief Procurement Officers and finance leaders to make faster, smarter decisions with real-time visibility and actionable insights.

Zycus is consistently recognized by top industry analysts such as Gartner, Forrester, and IDC for its innovation, depth of functionality, and strong customer outcomes. Known for its enterprise-grade solutions, global delivery model, and customer-first mindset, Zycus partners closely with organizations to modernize procurement and unlock long-term business value.

With a strong global presence across North America, EMEA, and APAC, Zycus continues to invest aggressively in product innovation, AI-led capabilities, and brand leadership—shaping the future of intelligent procurement.

We Are An Equal Opportunity Employer:
Zycus is committed to providing equal opportunities in employment and creating an inclusive work environment. We do not discriminate against applicants on the basis of race, color, religion, gender, sexual orientation, national origin, age, disability, or any other legally protected characteristic. All hiring decisions will be based solely on qualifications, skills, and experience relevant to the job requirements.

Job Description

Zycus is looking for a Site Reliability Engineer (SRE) with deep expertise in Kubernetes, automation, and Linux systems. The ideal candidate will have hands-on experience in deploying, administering, and optimizing large-scale production systems, with a strong focus on microservices architecture and on ensuring automation, performance, and reliability across our SaaS platform.

Roles and Responsibilities:

  • System Reliability & Uptime: Ensure high availability, performance, and reliability of applications and infrastructure.
  • Kubernetes & Cluster Management: Deploy, administer, and maintain Kubernetes clusters, managing scaling, upgrades, and troubleshooting.
  • Microservices Management: Handle the deployment, monitoring, and scaling of microservices in distributed environments.
  • Incident Management: Respond to production incidents, perform root cause analysis, and implement long-term fixes to prevent recurrence.
  • Automation & Infrastructure as Code (IaC): Automate repetitive tasks, infrastructure provisioning, and deployment workflows using tools like Ansible and Terraform.
  • Monitoring & Observability: Implement and maintain monitoring tools (e.g., Prometheus, Grafana, Datadog) to track system health and application performance.
  • Performance Optimization: Analyze system performance, identify bottlenecks, and optimize resources for better efficiency.
  • Disaster Recovery & Backup: Design and implement backup and disaster recovery (DR) strategies for business continuity.
  • Capacity Planning: Forecast infrastructure needs based on performance trends and business growth to ensure scalability.
  • Security & Compliance: Ensure infrastructure and applications meet security standards and compliance requirements.
  • Collaboration with Dev & Ops Teams: Work closely with development and operations teams to improve deployment pipelines, release processes, and system reliability.
  • Documentation: Maintain clear and detailed documentation of systems, processes, and incident reports for knowledge sharing and compliance.
  • Continuous Improvement: Identify opportunities for improving system architecture, deployment strategies, and automation workflows.
  • Cloud Infrastructure Management: Manage cloud services (AWS, GCP, Azure) for resource optimization, cost management, and automation.
  • On-Call Support: Participate in on-call rotations to handle urgent production issues and ensure rapid recovery. 

Required Skills

Experience: 3 to 8 years

Technical skills as listed below:

Must Have:

1. Kubernetes Expertise:

    Hands-on experience with installing and provisioning Kubernetes clusters.

    Deep understanding of core Kubernetes components such as CRI, CNI, etcd, CoreDNS, and kube-proxy. Strong knowledge of Kubernetes internal networking, service discovery, and ingress management.

2. Kubernetes Distributions:

    Hands-on experience with different Kubernetes provisioners and distributions.

3. Kubernetes Cluster Administration:

    Experience in administering production Kubernetes clusters, including backup and disaster recovery (DR) strategies. Familiarity with cluster health monitoring and troubleshooting issues.

4. Monitoring Tools: Exposure to monitoring tools such as Prometheus, Grafana, Datadog, or AppDynamics.

5. Automation & Scripting:

    Strong programming skills in Python, Shell, or similar languages.

    Hands-on experience with Infrastructure-as-Code (IaC) tools such as Terraform or Ansible.

    Cloud automation experience, ideally with AWS or other major cloud platforms.

6. Operating Systems: Hands-on experience with Linux system administration.

7. Microservices: Experience with microservices architecture and managing more than 50 microservices simultaneously.

Good to Have Skills:

-Experience with OpenShift virtualization in production environments.

-Knowledge of AWS EKS, Rancher, or other Kubernetes distributions.

-CKA (Certified Kubernetes Administrator) certification or equivalent.

-Experience in fine-tuning RHEL, CentOS, and Ubuntu.

-Familiarity with DevSecOps practices, container security, and compliance frameworks.
