Summary

Ph.D. in High-Performance Computing with 8+ years in AI Infrastructure and systems software. Currently leading LLM inference optimization at Tencent — improving end-to-end efficiency through kernel, communication, quantization and scheduling optimization — and creator of the open-source hpc-ops operator library that powers large-scale production inference. Deep expertise in CUDA / Tensor Core kernel engineering (CuTe / CUTLASS / Triton), distributed GPU communication (NVLS / RDMA), LLM serving (vLLM / SGLang / TensorRT-LLM) and quantization across Hopper and more advanced GPUs. 10 peer-reviewed papers (1 best-paper award, 2 nominations) and 4 international patents (US & AU).

Skills

LLM Inference & Serving: vLLM, SGLang, TensorRT-LLM, FlashAttention / FlashInfer-class kernels, Prefill/Decode (PD) disaggregation, speculative decoding, prefix caching, KVCache (DDR/SSD), sparse Attention, Mamba
GPU Kernel Engineering: CUDA, CuTe / CUTLASS, Triton, Tensor Core / SIMT, PTX / SASS, cp.async / TMA, Hopper (Cluster), more advanced GPUs
Distributed Communication: NVLS, NCCL-class collectives (AllReduce / AllGather / ReduceScatter), one-shot / multimem, RDMA / GPU-NIC aware, MPI
Quantization: SmoothQuant, MixQ, Rotation / Hadamard, INT8 / FP8, per-token / per-head schemes, bf16×fp32 Tensor-Core GEMM
Parallelism & Scheduling: TP / EP / PD / TPEP hybrid, MoE expert parallelism, data-parallel sampling, MoE weight offloading
Languages & Frameworks: C++, Python, CUDA C++, MPI, PyTorch, TVM
Domains: AI Infra, LLM inference, autonomous driving (AD), High-Performance Computing

Work Experience

Tencent

(as Reed)
Beijing, BJ

Tech Lead, LLM Inference Optimization (Expert)

Jun 2025 - Present
Lead a team driving end-to-end LLM inference optimization — kernel, communication, scheduling and algorithm — that powers large-scale production serving; creator of the open-source hpc-ops operator library (SOTA on NVIDIA H20).
Kernel Optimization
  • H20: CUDA-Core ops (RMSNorm, RoPE, fused quantization) run >40% faster than TensorRT-LLM / FlashInfer; Attention Prefill surpasses FlashAttention / FlashInfer, Group GEMM beats vLLM by >30%, and Attention Decode beats the best-known implementation by 30%.
  • More advanced GPUs: GroupGEMM / MoE +10–20% over FlashInfer; a bf16×fp32 Tensor-Core GEMM for high-precision MoE routing, and a "schedule-then-compute" strategy for mixed long/short Decode requests (+30% in unbalanced scenarios).
  • Techniques: cp.async-based gather+GEMM fusion for Group GEMM (reduced reorder / TMA overhead); long-to-short Attention scheduling to balance SM load; Hadamard rotation to suppress QK outliers (fast transform fused into the RoPE kernel) plus qk-per-token-dynamic + v-per-head-static quantization for long-sequence precision; custom Mamba kernels >60% faster than the best open-source implementation.
Distributed Communication
  • NVLS-based intra-node multi-GPU AllReduce + Norm fusion, >4× over Ring / Tree (one-shot for small batch, multimem for large batch); GEMM + ReduceScatter fusion; GPU-NIC aware optimization leveraging full RDMA bandwidth to cut cross-node latency.
  • PD (Prefill/Decode) disaggregation on TensorRT-LLM with Mamba-State PD transfer and HND-format Attention; MoE weight offloading to enable single-GPU deployment and reduce cross-card traffic.
Scheduling & Algorithms
  • Data-parallel Sampling using the Hopper Cluster feature for softmax / top-k locality; overlapped MoE expert and shared-expert parallelism; speculative decoding; TPEP hybrid parallelism to balance EP and raise Tensor Core utilization; process-independent KVCache prototype using DDR / SSD for prefix-cache hits.
  • Communication quantization + overlap: decompose AllReduce into ReduceScatter + AllGather and fuse ReduceScatter + Norm + quantization to emit FP8 (halving the AllGather payload), then overlap the AllGather with the following QKV projection (qkv_proj) GEMM.

NIO

(as Xiaoze Lu)
Beijing, BJ

Head of High-Performance Computing Team (Expert)

Aug 2021 - Jun 2025
Led the high-performance computing team across on-device LLM inference, model quantization and CV/Lidar perception kernels, with PTX/SASS-level NVIDIA GPU tuning for extreme performance.
On-Device LLM Inference & Quantization
  • Runtime: built a C++ runtime for Hugging Face models (SafeTensors conversion, device memory pool, KVCache, tokenization, RoPE, Prefill+Decode) running Llama2 / Llama3 / Qwen2 on automotive chips.
  • Quantization: SmoothQuant / MixQ / Rotation framework for ViT / LLM — 2× storage saving, <1pt accuracy loss vs FP16, +25% Prefill and 2× Decode throughput.
  • CuTe kernels: int8 Attention / FeedForward blocks (RMSNorm, UpProj, GateProj, Self-Attention) — Attention +10–40% over FlashAttention2; full network 1.6× over TVM-FlashInfer and 3× over TRT-LLM on Orin.
Perception Kernels (Camera & Lidar)
  • Camera: plugin kernels shared across TensorFlow / PyTorch / TVM; LayerNorm 2× over TVM-tuned and 20× over PyTorch; Cross-Attention 16× over MMCV; Camera ops (ROIAlign / NMS / CropAndResize / BEV2BEV / Lane / Marker) up to tens of times faster than TensorRT / TVM.
  • Lidar: 3D Sparse Convolution 3× over Spconv2.2 / MIT; specialized GEMM +40% over cuBLAS; Voxel Generator, Range View, BEV and World-Map projection kernels for efficient Lidar networks.
Platform & Tooling
  • Custom silicon: co-designed ISA / runtime / programming model with CUDA-compatible abstractions for smooth migration of existing code.
  • Tooling: training-system profiling (CUPTI / NVTX + PyTorch hooks); high-performance CV library (ARM-SIMD + CUDA-SIMT, 40+ APIs, ~2× over OpenCV); VPU VLIW offloading; decentralized distributed communication system.

TuSimple

(as Wei Liu)
Beijing, BJ / San Diego, CA

High-Performance Computing Engineer

Jul 2018 - Aug 2021
GPU and deep-learning inference performance, GPU virtualization, and high-performance robotic communication middleware.
GPU Compute & Virtualization
  • Built the first GPU-native Publisher/Subscriber for zero-copy inter-process messaging — a pub-centric GPU memory pool with on-the-fly offset conversion that removes redundant data movement, extended from single- to multi-GPU IPC — cutting message latency 53.7% (PointCloud / Image) vs the prior SOTA, and end-to-end latency 29.2% with up to 58.9% less resource usage.
  • Introduced MPS (Multi-Process Service) to the autonomous-driving environment to raise GPU utilization and reduce GPUs per vehicle, while solving MPS-Server centralization issues.
  • Built GPU virtualization by hooking the CUDA runtime API to partition and manage device memory.
  • Accelerated Binary Neural Networks with SIMD (SSE / AVX) on CPU.
Communication Middleware
  • Robust-Z: a robotic communication middleware combining high performance and high reliability — shared-memory transport plus a novel socket-based control algorithm for crash-safe, lossless delivery; up to 41% faster than ROS2 and 5% over Apollo CyberRT, with a 5.2% lower data-miss rate than CyberRT.
  • Production system: etcd-based decentralized service discovery (removing the ROS master), protobuf ROS2-compatible serialization, KubeEdge containerization, and async networking + memory pool for cross-machine transport.

Education

University of Chinese Academy of Sciences

Beijing, BJ

Doctor of Philosophy (High-Performance Computing)

Sep 2013 - Jun 2018
  • Deep learning for high-precision numerical computing: trained models to map low-precision to high-precision solutions from few samples and generalize across datasets, reaching ~100× the efficiency of traditional methods.
  • GPU-accelerated numerical analysis: staggered-grid finite-difference solvers for 1st/2nd-order PDE systems modeling 3D seismic-wave propagation, with Taylor-expansion high-order difference coefficients; interpolation / curve fitting and Fourier spectral-envelope analysis for low-SNR weak-signal extraction.
  • GPU-accelerated integral equations: register / occupancy tuning, double-buffered async IO, multi-stream overlap, shared memory and SFU intrinsics — ~20× over a baseline CUDA implementation; MPI TB-scale parallel IO with async send/recv hiding transfer latency behind computation.

Ocean University of China

Qingdao, SD

Bachelor of Engineering

Sep 2009 - Jun 2013

    Publications & Patents

    Selected Publications

    • Accelerating GPU Message Communication for Autonomous Navigation Systems. IEEE CLUSTER 2021 (best paper award).
    • A Robotic Communication Middleware Combining High Performance and High Reliability. SBAC-PAD 2020 (best paper nomination) / JPDC 2022.
    • Memory-Centric Communication Mechanism for Real-Time Autonomous Navigation Applications. ICPP 2020 (best paper nomination).
    • Safe Process Quitting for GPU Multi-Process Service (MPS). ICDCS 2020.
    • BitFlow: Exploiting Vector Parallelism for Binary Neural Networks on CPU. IPDPS 2018.

    Patents

    • Data Optimization Method and Integral Prestack Depth Migration Method. US Patent US11209563B2 (granted 2021-12-28).
    • Method, Apparatus and System for Multi-Module Scheduling. US Patents US10942771B2 and US11055144B2 (granted 2021).
    • Data Communication Method, Communication System and Computer-Readable Storage Medium. AU Patent AU2022204335A1 (2023-01-19).
    • Several domestic (CN) patents.

    Open-Source Contributions

    • Creator and lead of hpc-ops, Tencent's open-source LLM inference operator library (Tencent Hunyuan AI Infra). Built from scratch on CUDA / CuTe and tuned for NVIDIA H20 (sustaining ≥80% of peak memory bandwidth), it ships SOTA kernels (Attention, quantized Grouped GEMM and Fused MoE in BF16 / FP8) behind a clean API for vLLM and SGLang integration, while doubling as an industrial-grade CuTe / CUTLASS reference.
    • Published benchmarks (H20): Attention up to 2.22× over FlashInfer / FlashAttention, Grouped GEMM up to 1.88× over DeepGEMM, and Fused MoE up to 1.49× over TensorRT-LLM; end-to-end +30% QPM on Hunyuan and +17% on DeepSeek. Validated in Tencent large-scale production; 800+ GitHub stars.
    • Presented at NVIDIA CUDA Meetup (Beijing): “HPC-Ops: CuTe-based LLM Inference Operator Optimization and Production Deployment.”

    Upstream Contributions

    Tech Blogs

    CuTe Series (in Chinese)

    A widely read tutorial series on NVIDIA CuTe / CUTLASS, referenced by the CUTLASS team.
    • Layout; Algebra and Geometry; Tensor
    • MMA Abstraction; COPY Abstraction; Swizzle
    • Simple GEMM; GEMM Pipeline; High-Performance GEMM

    CuTe on Hopper (in Chinese)

    • Hopper: Introduction
    • Hopper MBarrier
    • Hopper TMA
    • TMA Descriptor Encoding and the Hidden 21st Bit

    LLM Kernels & Communication (in Chinese)

    • FP8 Attention Precision Optimization: Reverse-order Compute and Scaling-Factor Selection
    • How to Optimize Transformer Attention
    • High-Performance Communication (series)

    GPU Architecture & Internals (in Chinese)

    • NVIDIA GPU ISA: Bit and Logic Operations; Warp-level and Uniform Operations; Program Control and Atomic Operations
    • Registers Shared between CUDA Core and Tensor Core
    • The ldmatrix Instruction; CUDA Core vs Tensor Core