VitaTouch: Property-Aware Vision–Tactile–Language Model for Robotic Quality Inspection in Manufacturing

Junyi Zonga,e, Qingxuan Jiaa, Meixian Shia, Tong Lia*, Jiayuan Lib,e, Zihang Lva, Gang Chena, Fang Dengc,d**

a School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China
b School of Automation, Beijing Institute of Technology, Beijing 100081, China
c School of AI, Beijing Institute of Technology, Beijing 100081, China
d State Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology, Beijing 100081, China
e Zhongguancun Academy, Beijing 100094, China

Abstract

Quality inspection in smart manufacturing requires recognizing material and surface properties beyond visible geometry, while vision-only methods often struggle under imaging nuisances. We propose VitaTouch, a property-aware vision–tactile–language (VTL) model that integrates visual and tactile sensing with language prompts in a unified semantic space. With modality-specific encoders and dual Q-Formers, VitaTouch distills vision and touch into compact prefix tokens for a frozen large language model, enabling property reasoning and natural-language attribute description. We also build VitaSet, a VTL dataset with 186 objects, 52k multimodal images, and 5.1k instruction–answer pairs. VitaTouch achieves state-of-the-art performance on the public TVL benchmark, reaches 88.89% hardness accuracy, 75.13% roughness accuracy, and 54.81% descriptor recall on VitaSet, and maintains strong few-shot defect recognition with LoRA adaptation.

Overview of the dual-branch VitaTouch architecture and the progressive three-stage training pipeline.


VitaSet

VitaSet is a vision–tactile–language dataset for industrial quality inspection, integrating a self-collected GelSight robotic manipulation set and the GelSight-only subset of AnyTouch. It contains 186 objects, 30,553 RGB images, 21,510 GelSight tactile images (52,063 total), and 5,145 instruction–answer QA pairs under a unified annotation schema.
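Under the unified annotation schema, each sample pairs an RGB view and a GelSight tactile image with an instruction–answer pair. A minimal sketch of such a record is shown below; the field names and example values are illustrative assumptions, not VitaSet's actual on-disk format.

```python
from dataclasses import dataclass

@dataclass
class VTLSample:
    # One paired vision-tactile-language record (field names are illustrative)
    object_id: str
    rgb_path: str       # RGB camera image of the object
    tactile_path: str   # GelSight tactile image from the grasp
    instruction: str    # language prompt, e.g. a property question
    answer: str         # annotated response

sample = VTLSample(
    object_id="obj_001",
    rgb_path="rgb/obj_001_view0.png",
    tactile_path="gelsight/obj_001_press0.png",
    instruction="Describe the material of the grasped object.",
    answer="smooth rigid plastic",
)
print(sample.object_id)  # obj_001
```

A schema like this makes the instruction–answer pairs directly consumable by a multimodal data loader, with the two image paths resolved per sample.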

VitaSet overview

Method

Three-stage training

  • Stage 1: Cross-modal alignment to establish a shared semantic interface across vision, touch, and language.
  • Stage 2: Property-reasoning foundation learning with dual Q-Formers that distill vision–tactile prefix tokens and prepend them to a frozen Vicuna-7B decoder.
  • Stage 3: Parameter-efficient few-shot defect recognition using LoRA while freezing the backbone.
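The token-distillation step in Stage 2 can be pictured as cross-attention in which a small set of learnable queries summarizes encoder patch features into fixed-length prefix tokens, one set per branch. The NumPy sketch below shows the idea with a single attention head; the dimensions, query count, and shared weights are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

def qformer_prefix(features, queries, Wq, Wk, Wv):
    # Single-head cross-attention: learnable queries attend to encoder features
    # and compress them into a fixed number of prefix tokens.
    Q = queries @ Wq                     # (n_query, d)
    K = features @ Wk                    # (n_patch, d)
    V = features @ Wv                    # (n_patch, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)
    return attn @ V                      # (n_query, d) compact prefix tokens

rng = np.random.default_rng(0)
d = 8
vision_feats = rng.standard_normal((49, d))   # e.g. ViT patch features
tactile_feats = rng.standard_normal((49, d))  # e.g. GelSight encoder features
queries = rng.standard_normal((4, d))         # 4 learnable queries per branch
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

vis_prefix = qformer_prefix(vision_feats, queries, Wq, Wk, Wv)
tac_prefix = qformer_prefix(tactile_feats, queries, Wq, Wk, Wv)
prefix = np.concatenate([vis_prefix, tac_prefix], axis=0)
print(prefix.shape)  # (8, 8) -- tokens prepended to the frozen LLM's input
```

Because the decoder stays frozen, only the queries, attention weights, and (in Stage 3) the LoRA adapters are trained, which is what keeps the few-shot adaptation parameter-efficient.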
Model architecture

Results

TVL Benchmark

Table 1: Comparison on the TVL benchmark. VitaTouch achieves the best performance on HCT and TVL, and remains competitive on SSVTP.


VitaSet Validation Performance

Figure 4: VitaSet validation performance of VitaTouch across training epochs. (a) Multi-task validation trends for hardness accuracy, roughness accuracy, descriptor recall, and mean task score. (b) Comparison of strict exact-match descriptor recall and semantic similarity for the material descriptor task. Stars mark the best epoch for each metric.


Ablation on VitaSet

Table 2: Comparison of the full VitaTouch model, a no-Stage-1 variant, and unimodal settings on VitaSet across tasks.


Figure 5: Ablation results on the VitaSet dataset across tasks. Each variant removes one key stage from the full model, demonstrating the necessity of explicit alignment and multimodal fusion for robust multi-task property learning.


Few-shot Defect Adaptation

Table 3: LoRA-based defect adaptation results under different numbers of defect categories and labeled training samples per category. Three settings are compared: vision-only, tactile-only, and the full VitaTouch model with fused vision–tactile inputs.


Qualitative Outputs

Figure 6: Qualitative inspection-style outputs of VitaTouch. Representative examples are shown for material descriptor prediction, hardness classification, roughness classification, and defect decision with brief descriptions.


Robotic Sorting Demonstration

Figure 7: Proof-of-concept robotic sorting demonstration. The robot performs a pre-grasp, acquires aligned visual and tactile observations during the grasp, predicts defect status with the Stage-3 model, and sorts the object into the defect (left) or normal (right) bin accordingly.


Robotic Sorting Demo

We provide videos of the proof-of-concept closed-loop sorting system. The robot acquires aligned vision and tactile observations during grasp, predicts defect status using the Stage-3 model, and places objects into the left (defect) or right (normal) bin.
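The routing step in this loop reduces to thresholding the defect prediction. A minimal sketch follows; the function name, probability interface, and threshold value are illustrative assumptions about how the Stage-3 output could be consumed, not the demo's actual control code.

```python
def route_to_bin(defect_prob: float, threshold: float = 0.5) -> str:
    # Route a grasped object given the Stage-3 defect probability.
    # The 0.5 threshold is an illustrative default, not a tuned value.
    return "left (defect)" if defect_prob >= threshold else "right (normal)"

for p in (0.91, 0.12):
    print(f"p={p} -> {route_to_bin(p)}")
```

In a deployed cell the threshold would be chosen against the cost of false accepts versus false rejects rather than fixed at 0.5.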

Defect → Left Bin

Defect sample 1 — predicted as defect and placed into the left bin.
Defect sample 2 — predicted as defect and placed into the left bin.

Normal → Right Bin

Normal sample 1 — predicted as normal and placed into the right bin.
Normal sample 2 — predicted as normal and placed into the right bin.

Citation

If you find our work useful, please consider citing:

@article{zong_vitatouch_2025,
  title   = {VitaTouch: Property-Aware Vision–Tactile–Language Model for Robotic Quality Inspection in Manufacturing},
  author  = {Zong, Junyi and Jia, Qingxuan and Shi, Meixian and Li, Tong and Li, Jiayuan and Lv, Zihang and Chen, Gang and Deng, Fang},
  year    = {2025}
}