Optiver
ML Platform Engineer — Shanghai
- Build the compute platform and machine learning libraries for large-scale machine learning and simulation workloads.
- Focus on the stability and efficiency of the compute platform across both CPU and GPU clusters, and make the platform observable and scalable.
- Use cluster monitoring and profiling tools to identify bottlenecks and optimize both infrastructure and software systems.
- Troubleshoot and resolve issues related to OS, storage, network, and GPUs.
Challenges You Will Tackle: Design, build, and improve our compute platform for model training and simulation on PB-scale data, across a wide range of machine learning models, by leveraging our existing research infrastructure.