Deep Learning Infrastructure Engineer at Beebom

Palo Alto, CA · Noida · Full-time · Mid-level

Compensation: USD 160,000–220,000 per year

About the role

Optimize deep learning training at scale. Work on distributed training, GPU cluster management, and performance optimization for massive neural networks.

Responsibilities

Optimize distributed training across GPU clusters
Implement mixed-precision training and gradient compression
Manage and optimize GPU/TPU infrastructure
Debug performance bottlenecks in training pipelines

Requirements

BS/MS in Computer Science
3+ years of experience in distributed systems
Expertise in CUDA, cuDNN, and NCCL
Experience with distributed training frameworks (DeepSpeed, Horovod)
Strong proficiency in C++ and Python

Benefits

Top-tier compensation
Access to the latest hardware
Stock options
Unlimited PTO

Skills: CUDA, C++, Python, Distributed Systems, DeepSpeed, GPU Optimization
