Optiver

ML Platform Engineer — Shanghai

  • Build the compute platform and machine learning libraries for large-scale machine learning and simulation workloads.
  • Keep the compute platform stable and efficient on both CPU and GPU clusters, and make it observable and scalable.
  • Use cluster monitoring and profiling tools to identify bottlenecks and optimize both infrastructure and software systems.
  • Troubleshoot and resolve issues related to OS, storage, network, and GPUs.

Challenges You Will Tackle: design, build, and improve our compute platform for model training and simulation on petabyte-scale data, supporting a wide range of machine learning models and building on our existing research infrastructure.
