Reed Lau (Wei LIU · Xiaoze LU)

AI Infrastructure & High-Performance Computing

reed.lau@foxmail.com

github.com/reed-lau

www.zhihu.com/people/reed-84-49/

Summary	Ph.D. in High-Performance Computing with 8+ years in AI Infrastructure and systems software. Currently leading LLM inference optimization at Tencent — improving end-to-end efficiency through kernel, communication, quantization and scheduling optimization — and creator of the open-source hpc-ops operator library that powers large-scale production inference. Deep expertise in CUDA / Tensor Core kernel engineering (CuTe / CUTLASS / Triton), distributed GPU communication (NVLS / RDMA), LLM serving (vLLM / SGLang / TensorRT-LLM) and quantization across Hopper and more advanced GPUs. 10 peer-reviewed papers (1 best-paper award, 2 nominations) and 4 international patents (US & AU).
Skills	LLM Inference & Serving: vLLM, SGLang, TensorRT-LLM, FlashAttention / FlashInfer-class kernels, Prefill/Decode (PD) disaggregation, speculative decoding, prefix caching, KVCache (DDR/SSD), sparse Attention, Mamba
GPU Kernel Engineering: CUDA, CuTe / CUTLASS, Triton, Tensor Core / SIMT, PTX / SASS, cp.async / TMA, Hopper (Cluster), more advanced GPUs
Distributed Communication: NVLS, NCCL-class collectives (AllReduce / AllGather / ReduceScatter), one-shot / multimem, RDMA / GPU-NIC aware, MPI
Quantization: SmoothQuant, MixQ, Rotation / Hadamard, INT8 / FP8, per-token / per-head schemes, bf16×fp32 Tensor-Core GEMM
Parallelism & Scheduling: TP / EP / PD / TPEP hybrid, MoE expert parallelism, data-parallel sampling, MoE weight offloading
Languages & Frameworks: C++, Python, CUDA C++, MPI, PyTorch, TVM
Domains: AI Infra, LLM inference, AD, High-Performance Computing
Work Experience	Tencent (as Reed), Beijing, BJ
Tech Lead, LLM Inference Optimization (Expert), Jun 2025 - Present
Lead a team driving end-to-end LLM inference optimization (kernel, communication, scheduling and algorithm) that powers large-scale production serving; creator of the open-source hpc-ops operator library (SOTA on NVIDIA H20).
Kernel Optimization
H20: CUDA-Core ops (RMSNorm, RoPE, fused quantization) run >40% faster than TensorRT-LLM / FlashInfer; Attention and Group GEMM are at an industry-leading level: Attention Prefill surpasses FlashAttention / FlashInfer, Group GEMM beats vLLM by >30%, and Attention Decode beats the best-known implementation by 30%.
More advanced GPUs: Group GEMM / MoE +10–20% over FlashInfer; a bf16×fp32 Tensor-Core GEMM for high-precision MoE routing; and a "schedule-then-compute" strategy for mixed long/short Decode requests (+30% in unbalanced scenarios).
Techniques: cp.async-based gather+GEMM fusion for Group GEMM (reduced reorder / TMA overhead); long-to-short Attention scheduling to balance SM load; Hadamard rotation to suppress QK outliers (fast transform fused into the RoPE kernel) plus qk-per-token-dynamic + v-per-head-static quantization for long-sequence precision; custom Mamba kernels >60% faster than the best open-source implementation.
Distributed Communication
NVLS-based intra-node multi-GPU AllReduce + Norm fusion, >4× over Ring / Tree (one-shot for small batch, multimem for large batch); GEMM + ReduceScatter fusion; GPU-NIC aware optimization leveraging full RDMA bandwidth to cut cross-node latency.
PD (Prefill/Decode) disaggregation on TensorRT-LLM with Mamba-State PD transfer and HND-format Attention; MoE weight offloading to enable single-GPU deployment and reduce cross-card traffic.
Scheduling & Algorithms
Data-parallel Sampling using the Hopper Cluster feature for softmax / topk locality; overlapped MoE expert and shared-expert parallelism; speculative decoding; TPEP hybrid parallelism to balance EP and raise Tensor Core utilization; process-independent KVCache prototype using DDR / SSD for prefix-cache hits.
Communication quantization + overlap: decompose AllReduce into ReduceScatter + AllGather and fuse ReduceScatter + Norm + quantization to emit FP8 (halving the AllGather payload), then overlap the AllGather with the following QKV projection (qkv_proj) GEMM.
NIO (as Xiaoze Lu), Beijing, BJ
Head of High-Performance Computing Team (Expert), Aug 2021 - Jun 2025
Led the high-performance computing team across on-device LLM inference, model quantization and CV/Lidar perception kernels, with PTX/SASS-level NVIDIA GPU tuning for extreme performance.
On-Device LLM Inference & Quantization
Runtime: built a C++ runtime for Hugging Face models (SafeTensor conversion, device memory pool, KVCache, tokenization, RoPE, Prefill+Decode) running Llama2 / Llama3 / Qwen2 on automotive chips.
Quantization: SmoothQuant / MixQ / Rotation framework for ViT / LLM, with 2× storage saving, <1 pt accuracy loss vs FP16, +25% Prefill and 2× Decode throughput.
CuTe kernels: int8 Attention / FeedForward blocks (RMSNorm, UpProj, GateProj, Self-Attention), with Attention +10–40% over FlashAttention2; full network 1.6× over TVM-FlashInfer and 3× over TRT-LLM on Orin.
Perception Kernels (Camera & Lidar)
Camera: plugin kernels shared across TensorFlow / PyTorch / TVM; LayerNorm 2× over a TVM-tuned baseline and 20× over PyTorch; Cross-Attention 16× over MMCV; Camera ops (ROIAlign / NMS / CropAndResize / BEV2BEV / Lane / Marker) up to tens of times faster than TensorRT / TVM.
Lidar: 3D Sparse Convolution 3× over Spconv2.2 / MIT; specialized GEMM +40% over cuBLAS; Voxel Generator, Range View, BEV and World-Map projection kernels for efficient Lidar networks.
Platform & Tooling
Custom silicon: co-designed ISA / runtime / programming model with CUDA-compatible abstractions for smooth migration of existing code.
Tooling: training-system profiling (CUPTI / NVTX + PyTorch hooks); high-performance CV library (ARM-SIMD + CUDA-SIMT, 40+ APIs, ~2× over OpenCV); VPU VLIW offloading; decentralized distributed communication system.
TuSimple (as Wei Liu), Beijing, BJ / San Diego, CA
High-Performance Computing Engineer, Jul 2018 - Aug 2021
GPU and deep-learning inference performance, GPU virtualization, and high-performance robotic communication middleware.
GPU Compute & Virtualization
Built the first GPU-native Publisher/Subscriber for zero-copy inter-process messaging: a publisher-centric GPU memory pool with on-the-fly offset conversion that removes redundant data movement, extended from single- to multi-GPU IPC. It cut message latency by 53.7% (PointCloud / Image) vs the prior SOTA and end-to-end latency by 29.2%, with up to 58.9% less resource usage.
Introduced MPS (Multi-Process Service) to the autonomous-driving environment to raise GPU utilization and reduce GPUs per vehicle, while solving MPS-Server centralization issues.
Built GPU virtualization by hooking the CUDA runtime API to partition and manage device memory.
Accelerated Binary Neural Networks with SIMD (SSE / AVX) on CPU.
Communication Middleware
Robust-Z: a robotic communication middleware combining high performance and high reliability, using shared-memory transport plus a novel socket-based control algorithm for crash-safe, lossless delivery; up to 41% faster than ROS2 and 5% faster than Apollo CyberRT, with a 5.2% lower data-miss rate than CyberRT.
Production system: etcd-based decentralized service discovery (removing the ROS master), protobuf ROS2-compatible serialization, KubeEdge containerization, and async networking + memory pool for cross-machine transport.
Education	University of Chinese Academy of Sciences, Beijing, BJ
Doctor of Philosophy (High-Performance Computing), Sep 2013 - Jun 2018
Deep learning for high-precision numerical computing: trained models to map low-precision to high-precision solutions from few samples and generalize across datasets, reaching ~100× the efficiency of traditional methods.
GPU-accelerated numerical analysis: staggered-grid finite-difference solvers for 1st/2nd-order PDE systems modeling 3D seismic-wave propagation, with Taylor-expansion high-order difference coefficients; interpolation / curve fitting and Fourier spectral-envelope analysis for low-SNR weak-signal extraction.
GPU-accelerated integral equations: register / occupancy tuning, double-buffered async IO, multi-stream overlap, shared memory and SFU intrinsics (~20× over a baseline CUDA implementation); MPI TB-scale parallel IO with async send/recv hiding transfer latency behind computation.
Ocean University of China, Qingdao, SD
Bachelor of Engineering, Sep 2009 - Jun 2013
Publications & Patents	Selected Publications (Google Scholar)
Accelerating GPU Message Communication for Autonomous Navigation Systems. IEEE CLUSTER 2021 (best paper award).
A Robotic Communication Middleware Combining High Performance and High Reliability. SBAC-PAD 2020 (best paper nomination) / JPDC 2022.
Memory-Centric Communication Mechanism for Real-Time Autonomous Navigation Applications. ICPP 2020 (best paper nomination).
Safe Process Quitting for GPU Multi-Process Service (MPS). ICDCS 2020.
BitFlow: Exploiting Vector Parallelism for Binary Neural Networks on CPU. IPDPS 2018.
Patents
Data Optimization Method and Integral Prestack Depth Migration Method. US Patent US11209563B2 (granted 2021-12-28).
Method, Apparatus and System for Multi-Module Scheduling. US Patents US10942771B2 and US11055144B2 (granted 2021).
Data Communication Method, Communication System and Computer-Readable Storage Medium. AU Patent AU2022204335A1 (2023-01-19).
Several domestic (CN) patents.
Open-Source Contributions	hpc-ops (github.com/Tencent/hpc-ops)
Creator and lead of Tencent's open-source LLM inference operator library (Tencent Hunyuan AI Infra). Built from scratch on CUDA / CuTe and tuned for NVIDIA H20 (sustaining ≥80% of peak memory bandwidth), it ships SOTA kernels (Attention, quantized Grouped GEMM and Fused MoE in BF16 / FP8) behind a clean API for vLLM and SGLang integration, while doubling as an industrial-grade CuTe / CUTLASS reference.
Published benchmarks (H20): Attention up to 2.22× over FlashInfer / FlashAttention, Grouped GEMM up to 1.88× over DeepGEMM, and Fused MoE up to 1.49× over TensorRT-LLM; end-to-end +30% QPM on Hunyuan and +17% on DeepSeek.
Validated in Tencent large-scale production; 800+ GitHub stars.
Presented at NVIDIA CUDA Meetup (Beijing): “HPC-Ops: CuTe-based LLM Inference Operator Optimization and Production Deployment.”
Upstream Contributions
CUTLASS: improved instruction hints. github.com/NVIDIA/cutlass
Protocol Buffers: fixed inefficient memory usage. github.com/protocolbuffers/protobuf
ROS2: multiple contributions across ROS2 projects. github.com/ros2
Tech Blogs	CuTe Series (in Chinese, Zhihu)
A widely read tutorial series on NVIDIA CuTe / CUTLASS, referenced by the CUTLASS team.
Layout; Algebra and Geometry; Tensor; MMA Abstraction; COPY Abstraction; Swizzle; Simple GEMM; GEMM Pipeline; High-Performance GEMM
CuTe on Hopper (in Chinese, Zhihu Column)
Hopper: Introduction; Hopper MBarrier; Hopper TMA; TMA Descriptor Encoding and the Hidden 21st Bit
LLM Kernels & Communication (in Chinese, Zhihu Column)
FP8 Attention Precision Optimization: Reverse-order Compute and Scaling-Factor Selection; How to Optimize Transformer Attention; High-Performance Communication (series)
GPU Architecture & Internals (in Chinese, Zhihu Column)
NVIDIA GPU ISA: Bit and Logic Operations; Warp-level and Uniform Operations; Program Control and Atomic Operations
Registers Shared between CUDA Core and Tensor Core; The ldmatrix Instruction; CUDA Core vs Tensor Core