Work Experience
|
NIO Autonomous Driving Technology Co., Ltd.
Beijing, BJ
AI Compiler and System Software (Head of Heterogeneous Computing Team & Expert)
Aug 2021 - Present
- Establish a CUDA operator optimization methodology for computer vision
inference tasks (e.g., Transformer, Self Attention, Cross Attention, Grid Sample, RoiAlign,
BEV2BEV, NMS, CropAndResize)
- Establish a CUDA operator optimization methodology for LiDAR inference
tasks (e.g., 3D Sparse Convolution, Voxel Generator, Mapping, Freespace, RangeView)
- Optimize deep learning operators with CUDA to accelerate training
- Optimize CPU algorithms with SIMD instructions to improve their performance
- Implement VLIW optimizations to accelerate image processing tasks
- Design and implement profiling tools for PyTorch with CUPTI and NVTX
to guide layer-by-layer optimization
- Design and implement high-performance communication middleware
- Design a decentralized discovery mechanism for communication middleware
TuSimple Inc.
Beijing, BJ / San Diego, CA
High Performance Computing Engineer
Aug 2017 - Aug 2021
- Optimize deep learning operators to accelerate the deep learning inference
system on GPU-equipped devices
- Design and implement publisher/subscriber communication semantics on GPU-
equipped systems to improve transmission performance for large memory objects
- Introduce MPS (Multi-Process Service) to the autonomous driving environment to
improve GPU utilization in a loosely coupled multi-process architecture
- Hook the CUDA runtime API to implement software GPU virtualization and
improve deep learning performance
- Accelerate binary neural networks with SIMD (SSE/AVX) instructions on CPU
- Design and implement high-performance, robust, and reliable
communication middleware
- Design and implement the transport layer for communication middleware using
shared-pointer, shared-memory, and socket mechanisms
- Design a publisher-centric lifecycle management method for the shared memory and
notification mechanisms
- Design and implement a weakly centralized discovery layer for communication
middleware with etcd as the discovery mechanism
- Optimize and re-implement the serialization method for the message type system in
communication middleware
- Integrate performance metrics and monitoring tools into a single Docker container
to make them easier to use
|