I am a fourth-year Ph.D. student in the Department of Automation at Tsinghua University, advised by
Prof. Jianjiang Feng and
Prof. Jiwen Lu.
In 2022, I obtained my B.Eng. in the XUTELI School, Beijing Institute of Technology.
My current research focuses on 3D vision and embodied intelligence.
We propose GesVLA, a gesture-aware VLA that embeds hand cues to disambiguate targets
in cluttered manipulation scenes. It improves human-robot interaction efficiency and
real-robot performance on multiple practical tasks.
We propose AwareVLN, a framework with sparse self-aware reasoning at key navigation nodes
via a unified reason-act VLM. An automatic data engine scales
structured supervision without manual annotation. It outperforms prior methods on R2R-CE/RxR-CE and deploys on a real quadruped.
We propose IGL-Nav, an incremental 3D Gaussian localization framework for image-goal navigation.
It supports challenging scenarios where the camera for goal capturing and the agent's
camera have very different intrinsics and poses, e.g., a cellphone and a RGB-D camera.
We propose TSP3D, an efficient multi-level convolution architecture for
3D visual grounding. TSP3D achieves superior performance compared
to previous approaches in both accuracy and inference speed.
CL-Gait: Camera-LiDAR Cross-modality Gait Recognition Wenxuan Guo*, Yingping Liang*, Zhiyu Pan, Ziheng Xi,
Jianjiang Feng,
Jie Zhou European Conference on Computer Vision (ECCV), 2024
[arXiv][Code]
We propose CL-Gait, the first cross-modality gait recognition framework between camera
and LiDAR. We propose a contrastive pre-training method to align the feature spaces
of the two modalities, along with a large-scale data generation strategy.
ReID3D: LiDAR-based person re-identification Wenxuan Guo, Zhiyu Pan, Yingping Liang, Ziheng Xi, Zhicheng Zhong,
Jianjiang Feng,
Jie Zhou Computer Vision and Pattern Recognition (CVPR), 2024
[arXiv][Code][中文解读]
We propose ReID3D, a LiDAR-based ReID framework that utilizes a pre-training strategy to
retrieve features of 3D body shape. Additionally, we build LReID — the first LiDAR-based
person ReID dataset, which is collected in several outdoor scenes with natural variations.
Selected Projects
GesVLA
Gesture-Aware Robot Manipulation
GesVLA introduces gesture as a parallel instruction modality
to disambiguate targets in cluttered scenes and improve human-robot interaction efficiency. Hand cues are encoded
in a dual-VLM for joint intent reasoning and action generation, with semi-synthetic data and two-stage training
for strong real-robot performance on block, jelly, and produce selection.
AwareVLN
Self-aware Vision-Language Navigation
In AwareVLN, we equip VLN with sparse self-aware reasoning
at key nodes via a unified reason–act VLM, together with an automatic data engine for scalable structured
supervision. AwareVLN achieves strong results on R2R-CE and RxR-CE with monocular RGB and deploys on a real quadruped.
IGL-Nav
3D Representation for Visual Navigation
We propose incremental 3D Gaussian localization for free-view image-goal navigation in
IGL-Nav. We support a challenging application scenario
where the camera for goal capturing and the agent's camera have very different intrinsics and poses,
e.g., a cellphone and a RGB-D camera.