Education

University of Chinese Academy of Sciences

Beijing, BJ

Doctor of Philosophy (High Performance Computing)

Sep 2013 - Jun 2018

    Ocean University of China

    Qingdao, SD

    Bachelor of Engineering

    Sep 2009 - Jun 2013

      Work Experience

      NIO Autonomous Driving Technology Co., Ltd.

      Beijing, BJ

      AI Compiler and System Software (Head of Heterogeneous Computing Team & Expert)

      Aug 2021 - Present
      • Establish a CUDA operator optimization methodology for computer vision inference tasks (e.g., Transformer, Self Attention, Cross Attention, Grid Sample, RoiAlign, BEV2BEV, NMS, CropAndResize)
      • Establish a CUDA operator optimization methodology for LiDAR inference tasks (e.g., 3D Sparse Convolution, Voxel Generator, Mapping, Freespace, RangeView)
      • Optimize deep learning operators with CUDA to accelerate training
      • Optimize CPU algorithms with SIMD instructions to improve their performance
      • Implement VLIW optimizations to accelerate image processing tasks
      • Design and implement profiling tools for PyTorch systems with CUPTI and NVTX to guide layer-by-layer optimization
      • Design and implement high performance communication middleware
      • Design a decentralized discovery mechanism for communication middleware

      TuSimple Inc.

      Beijing, BJ / San Diego, CA

      High Performance Computing Engineer

      Aug 2017 - Aug 2021
      • Optimize deep learning operators to accelerate the deep learning inference system on GPU-equipped devices
      • Design and implement Publisher/Subscriber communication semantics on GPU-equipped systems to improve transmission performance for large in-memory objects
      • Introduce MPS (Multi-Process Service) to the autonomous driving environment to improve GPU utilization in a loosely coupled multi-process architecture
      • Hook the CUDA runtime API to implement software GPU virtualization and improve deep learning performance
      • Accelerate Binary Neural Networks with SIMD (SSE/AVX) instructions on CPU
      • Design and implement high-performance, robust, and reliable communication middleware
      • Design and implement the transport layer for communication middleware using shared-pointer, shared-memory, and socket mechanisms
      • Design a publisher-centric lifecycle management method for shared memory and the notification mechanism
      • Design and implement a weakly centralized discovery layer for communication middleware with etcd as the discovery mechanism
      • Optimize and re-implement the serialization method for the message type system in the communication middleware
      • Integrate performance metrics and monitoring tools into a single Docker container to make them easier to use

      Tech Blogs

      CUTE Series (in Chinese)

      • Layout for CUTE
      • Algebra and Geometry Explanation for CUTE
      • Tensor for CUTE
      • MMA Abstraction for CUTE
      • COPY Abstraction for CUTE
      • Simple GEMM Implementation for CUTE
      • GEMM Pipeline for CUTE
      • Swizzle for CUTE
      • High Performance GEMM Implementation for CUTE

      CUDA-Related Topics (in Chinese)

      • Registers Are Shared between CUDA Core and Tensor Core
      • The Advantage of ldmatrix Instruction in CUDA
      • The Difference between CUDA Core and Tensor Core in GPU
      • How to Optimize Transformer Attention

      Open-Source Contributions

      Publications