Deep Learning Infrastructure Engineer at Beebom
Palo Alto, CA · Noida · Full-time · Mid-level
Compensation: USD 160,000–220,000 per year
About the role
Optimize deep learning training at scale. Work on distributed training, GPU cluster management, and performance optimization for massive neural networks.
Responsibilities
- Optimize distributed training across GPU clusters
- Implement mixed-precision training and gradient compression
- Manage and optimize GPU/TPU infrastructure
- Debug performance bottlenecks in training pipelines
Requirements
- BS/MS in Computer Science
- 3+ years of experience in distributed systems
- Expert-level knowledge of CUDA, cuDNN, and NCCL
- Experience with distributed training frameworks (DeepSpeed, Horovod)
- Strong proficiency in C++ and Python
Benefits
- Top-tier compensation
- Latest hardware access
- Stock options
- Unlimited PTO
Skills: CUDA, C++, Python, Distributed Systems, DeepSpeed, GPU Optimization