Summary
|
Ph.D. in High-Performance Computing with 8+ years in AI Infrastructure and
systems software. Currently leading LLM inference optimization at Tencent — improving
end-to-end efficiency through kernel, communication, quantization and scheduling optimization
— and creator of the open-source hpc-ops operator library that powers
large-scale production inference. Deep expertise in CUDA / Tensor Core kernel engineering
(CuTe / CUTLASS / Triton), distributed GPU communication (NVLS / RDMA), LLM serving (vLLM /
SGLang / TensorRT-LLM) and quantization across Hopper and more advanced GPUs. 10 peer-reviewed papers
(1 best-paper award, 2 nominations) and 4 international patents (US & AU).
|
Skills
|
LLM Inference & Serving:
vLLM, SGLang, TensorRT-LLM, FlashAttention / FlashInfer-class kernels, Prefill/Decode
(PD) disaggregation, speculative decoding, prefix caching, KVCache (DDR/SSD), sparse
Attention, Mamba
GPU Kernel Engineering:
CUDA, CuTe / CUTLASS, Triton, Tensor Core / SIMT, PTX / SASS, cp.async / TMA, Hopper
(Cluster), more advanced GPUs
Distributed Communication:
NVLS, NCCL-class collectives (AllReduce / AllGather / ReduceScatter), one-shot /
multimem, RDMA with GPU-NIC-aware optimization, MPI
Quantization:
SmoothQuant, MixQ, Rotation / Hadamard, INT8 / FP8, per-token / per-head schemes,
bf16×fp32 Tensor-Core GEMM
Parallelism & Scheduling:
TP / EP / PD / TPEP hybrid, MoE expert parallelism, data-parallel sampling, MoE weight
offloading
Languages & Frameworks:
C++, Python, CUDA C++, MPI, PyTorch, TVM
Domains:
AI Infra, LLM inference, Autonomous Driving (AD), High-Performance Computing
|
Work Experience
|
Tencent
(as Reed)
Beijing, BJ
Tech Lead, LLM Inference Optimization (Expert)
Jun 2025 - Present
Lead a team driving end-to-end LLM inference optimization
— kernel, communication, scheduling and algorithm — that powers large-scale
production serving; creator of the open-source hpc-ops operator library (SOTA on
NVIDIA H20).
Kernel Optimization
- H20: CUDA-Core ops (RMSNorm, RoPE, fused
quantization) run >40% faster than TensorRT-LLM / FlashInfer; Attention and Group
GEMM at industry-leading level — Attention Prefill surpasses FlashAttention /
FlashInfer, Group GEMM beats vLLM by >30%, and Attention Decode beats the best-known
implementation by 30%.
- More advanced GPUs: Group GEMM / MoE 10–20% faster
than FlashInfer; a bf16×fp32 Tensor-Core GEMM for high-precision MoE routing, and
a "schedule-then-compute" strategy for mixed long/short Decode requests (+30% in
unbalanced scenarios).
- Techniques: cp.async-based gather+GEMM fusion for
Group GEMM (reduced reorder / TMA overhead); long-to-short Attention scheduling to
balance SM load; Hadamard rotation to suppress QK outliers (fast transform fused into
the RoPE kernel) plus qk-per-token-dynamic + v-per-head-static quantization for
long-sequence precision; custom Mamba kernels >60% faster than the best open-source
implementation.
Distributed Communication
- NVLS-based intra-node multi-GPU AllReduce + Norm fusion,
>4× over Ring / Tree (one-shot for small batch, multimem for large batch);
GEMM + ReduceScatter fusion; GPU-NIC-aware optimization leveraging full RDMA bandwidth
to cut cross-node latency.
- PD (Prefill/Decode) disaggregation on TensorRT-LLM with Mamba-State
PD transfer and HND-format Attention; MoE weight offloading to enable single-GPU
deployment and reduce cross-card traffic.
Scheduling & Algorithms
- Data-parallel Sampling using the Hopper Cluster feature for
softmax / topk locality; overlapped MoE expert and shared-expert parallelism;
speculative decoding; TPEP hybrid parallelism to balance EP and raise Tensor Core
utilization; process-independent KVCache prototype using DDR / SSD for prefix-cache
hits.
- Communication quantization + overlap: decompose
AllReduce into ReduceScatter + AllGather and fuse ReduceScatter + Norm + quantization
to emit FP8 (halving the AllGather payload), then overlap the AllGather with the
following QKV projection (qkv_proj) GEMM.
NIO
(as Xiaoze Lu)
Beijing, BJ
Head of High-Performance Computing Team (Expert)
Aug 2021 - Jun 2025
Led the high-performance computing team across on-device LLM
inference, model quantization and CV/Lidar perception kernels, with PTX/SASS-level NVIDIA
GPU tuning for extreme performance.
On-Device LLM Inference & Quantization
- Runtime: built a C++ runtime for Hugging Face
models (safetensors conversion, device memory pool, KVCache, tokenization, RoPE,
Prefill+Decode) running Llama2 / Llama3 / Qwen2 on automotive chips.
- Quantization: SmoothQuant / MixQ / Rotation
framework for ViT / LLM — 2× storage saving, <1pt accuracy loss vs FP16,
+25% Prefill and 2× Decode throughput.
- CuTe kernels: int8 Attention / FeedForward blocks
(RMSNorm, UpProj, GateProj, Self-Attention); Attention +10–40% over
FlashAttention2; full network 1.6× over TVM-FlashInfer and 3× over TensorRT-LLM on
Orin.
Perception Kernels (Camera & Lidar)
- Camera: plugin kernels shared across TensorFlow /
PyTorch / TVM; LayerNorm 2× over TVM-tuned and 20× over PyTorch; Cross-Attention
16× over MMCV; Camera ops (ROIAlign / NMS / CropAndResize / BEV2BEV / Lane /
Marker) up to tens of times faster than TensorRT / TVM.
- Lidar: 3D Sparse Convolution 3× over
Spconv2.2 / MIT; specialized GEMM +40% over cuBLAS; Voxel Generator, Range View, BEV
and World-Map projection kernels for efficient Lidar networks.
Platform & Tooling
- Custom silicon: co-designed ISA / runtime /
programming model with CUDA-compatible abstractions for smooth migration of existing
code.
- Tooling: training-system profiling (CUPTI / NVTX +
PyTorch hooks); high-performance CV library (ARM-SIMD + CUDA-SIMT, 40+ APIs, ~2×
over OpenCV); VPU VLIW offloading; decentralized distributed communication system.
TuSimple
(as Wei Liu)
Beijing, BJ / San Diego, CA
High-Performance Computing Engineer
Jul 2018 - Aug 2021
Worked on GPU and deep-learning inference performance, GPU virtualization,
and high-performance robotic communication middleware.
GPU Compute & Virtualization
- Built the first GPU-native Publisher/Subscriber
for zero-copy inter-process messaging: a publisher-centric GPU memory pool with
on-the-fly offset conversion that removes redundant data movement, extended from
single- to multi-GPU IPC. Cut message latency by 53.7% (PointCloud / Image) vs
the prior SOTA and end-to-end latency by 29.2%, with up to 58.9% less resource usage.
- Introduced MPS (Multi-Process Service) to the autonomous-driving
environment to raise GPU utilization and reduce GPUs per vehicle, while solving
MPS-Server centralization issues.
- Built GPU virtualization by hooking the CUDA runtime API to
partition and manage device memory.
- Accelerated Binary Neural Networks with SIMD (SSE / AVX) on
CPU.
Communication Middleware
- Robust-Z: a robotic communication middleware
combining high performance and high reliability, with shared-memory transport plus a
novel socket-based control algorithm for crash-safe, lossless delivery; up to 41%
faster than ROS2 and 5% faster than Apollo CyberRT, with a 5.2% lower data-miss rate
than CyberRT.
- Production system: etcd-based decentralized
service discovery (removing the ROS master), protobuf ROS2-compatible serialization,
KubeEdge containerization, and async networking + memory pool for cross-machine
transport.
|
Education
|
University of Chinese Academy of Sciences
Beijing, BJ
Doctor of Philosophy (High-Performance Computing)
Sep 2013 - Jun 2018
- Deep learning for high-precision numerical computing:
trained models to map low-precision to high-precision solutions from few samples and
generalize across datasets, reaching ~100× the efficiency of traditional methods.
- GPU-accelerated numerical analysis: staggered-grid
finite-difference solvers for 1st/2nd-order PDE systems modeling 3D seismic-wave
propagation, with Taylor-expansion high-order difference coefficients; interpolation /
curve fitting and Fourier spectral-envelope analysis for low-SNR weak-signal
extraction.
- GPU-accelerated integral equations: register /
occupancy tuning, double-buffered async IO, multi-stream overlap, shared memory and SFU
intrinsics — ~20× over a baseline CUDA implementation; MPI TB-scale parallel IO
with async send/recv hiding transfer latency behind computation.
Ocean University of China
Qingdao, SD
Bachelor of Engineering
Sep 2009 - Jun 2013
|
Publications & Patents
|
- Accelerating GPU Message Communication for Autonomous Navigation
Systems. IEEE CLUSTER 2021 (best paper award).
- A Robotic Communication Middleware Combining High Performance and
High Reliability. SBAC-PAD 2020 (best paper nomination) / JPDC 2022.
- Memory-Centric Communication Mechanism for Real-Time Autonomous
Navigation Applications. ICPP 2020 (best paper nomination).
- Safe Process Quitting for GPU Multi-Process Service (MPS).
ICDCS 2020.
- BitFlow: Exploiting Vector Parallelism for Binary Neural Networks on
CPU. IPDPS 2018.
- Data Optimization Method and Integral Prestack Depth Migration
Method. US Patent US11209563B2 (granted 2021-12-28).
- Method, Apparatus and System for Multi-Module Scheduling. US
Patents US10942771B2 and US11055144B2 (granted 2021).
- Data Communication Method, Communication System and
Computer-Readable Storage Medium. AU Patent AU2022204335A1 (2023-01-19).
- Several domestic (CN) patents.
|
Open-Source Contributions
|
- Creator and lead of hpc-ops, Tencent's open-source LLM inference operator
library (Tencent Hunyuan AI Infra). Built from scratch on CUDA / CuTe and tuned for
NVIDIA H20 (sustaining ≥80% of peak memory bandwidth), it ships SOTA kernels —
Attention, quantized Grouped GEMM and Fused MoE in BF16 / FP8 — behind a clean API
for vLLM and SGLang integration, while doubling as an industrial-grade CuTe / CUTLASS
reference.
- Published benchmarks (H20): Attention up to 2.22× over
FlashInfer / FlashAttention, Grouped GEMM up to 1.88× over DeepGEMM, and Fused MoE
up to 1.49× over TensorRT-LLM; end-to-end +30% QPM on Hunyuan and +17% on DeepSeek.
Validated in Tencent large-scale production; 800+ GitHub stars.
- Presented at NVIDIA CUDA Meetup (Beijing):
“HPC-Ops: CuTe-based LLM Inference Operator Optimization and Production
Deployment.”
|
Tech Blogs
|
A widely read tutorial series on NVIDIA CuTe / CUTLASS, referenced
by the CUTLASS team.
- Layout; Algebra and Geometry; Tensor
- MMA Abstraction; COPY Abstraction; Swizzle
- Simple GEMM; GEMM Pipeline; High-Performance GEMM
CuTe on Hopper (in Chinese)
- Hopper: Introduction
- Hopper MBarrier
- Hopper TMA
- TMA Descriptor Encoding and the Hidden 21st Bit
LLM Kernels & Communication (in Chinese)
- FP8 Attention Precision Optimization: Reverse-order Compute and
Scaling-Factor Selection
- How to Optimize Transformer Attention
- High-Performance Communication (series)
GPU Architecture & Internals (in Chinese)
- NVIDIA GPU ISA: Bit and Logic Operations; Warp-level and Uniform
Operations; Program Control and Atomic Operations
- Registers Shared between CUDA Core and Tensor Core
- The ldmatrix Instruction; CUDA Core vs Tensor Core
|