DevOps, Distributed GPU Cloud Engineer
Job Description
What You’ll Do
- Design, implement, and manage deployment of highly available services
- Lead the implementation of a CI/CD pipeline for existing repositories
- Contribute to design, implementation, testing, and documentation of orchestration and other services supporting GPU Cloud
- Contribute to major cloud features that enable our customers to easily run large-scale AI research projects
- Develop and maintain tools for automation, deployment, monitoring, and operations of our cloud infrastructure
- Develop and maintain infrastructure as code, including the design and implementation of automated infrastructure deployment and configuration management
- Provide guidance and support to development teams in the implementation of DevOps best practices
- Participate in building orchestration software within our data centers for launching/terminating VMs, networking, and attaching storage
- Ensure correctness of software through high-quality automated testing and by setting standards for other developers
- Stay current with emerging trends and technologies related to DevOps, cloud computing, and infrastructure automation
- Linux distributions and package managers
- Linux user management
- VM management
- GitLab administration
- Network building, configuration and maintenance
- Monitoring tool setup and maintenance (e.g., Grafana)
- Automatic backup setup and maintenance
Skills
- Strong engineering background (EECS preferred; Mathematics, Software Engineering, or Physics)
- 5+ years of experience in DevOps or related field
- Experience implementing CI/CD pipelines for existing repositories using tools like GitHub Actions or Buildkite
- Strong experience with infrastructure as code (IaC) and related tools such as Ansible, CloudFormation, or Terraform
- Proficient in one or more programming languages such as Python, Node.js, or Go
- Familiarity with cloud computing platforms such as AWS or GCP, or experience deploying a web application on a private cloud platform
- Experience with containerization technologies such as Docker or Kubernetes
Nice to Have
- Experience in a leadership role, with the ability to lead and mentor junior team members
- Experience developing major features from conception to production in a Python web application
- Experience integrating with storage systems, data center networking, virtualization
- Experience in the machine learning or computer hardware industry
- Cloud platform management
- Shell scripting
- Container orchestration tool familiarity
- Knowledge of cluster schedulers
- Designing cloud deployment strategies for Riva, NeMo LLM, and other LLM services: Helm charts, Kubernetes operators, etc.
- Slurm, Zabbix, Rocks Cluster, and Proxmox for GPU orchestration and distributed computing solutions
- Experience with GPU computing systems
- Experience with NVIDIA (CUDA) drivers
- TensorFlow or PyTorch experience is advantageous
Job Details
- Job Location: Dubai, United Arab Emirates
- Company Industry: Telecommunications and Networking; Data Storage; Video Games
- Company Type: Employer (Private Sector)
- Job Role: Information Technology
- Employment Type: Full-time
- Monthly Salary: Unspecified
- Number of Vacancies: 1