VitaTouch: Property-Aware Vision–Tactile–Language Model for Robotic Quality Inspection in Manufacturing

Junyi Zonga,e, Qingxuan Jiaa, Meixian Shia, Tong Lia*, Jiayuan Lib,e, Zihang Lva, Gang Chena, Fang Dengc,d**

a School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China
b School of Automation, Beijing Institute of Technology, Beijing 100081, China
c School of AI, Beijing Institute of Technology, Beijing 100081, China
d State Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology, Beijing 100081, China
e Zhongguancun Academy, Beijing 100094, China

Abstract

Quality inspection in smart manufacturing requires recognizing material and surface properties beyond visible geometry, while vision-only methods often struggle under imaging nuisances. We propose VitaTouch, a property-aware vision–tactile–language (VTL) model that integrates visual and tactile sensing with language prompts in a unified semantic space. With modality-specific encoders and dual Q-Formers, VitaTouch distills vision and touch into compact prefix tokens for a frozen large language model, enabling property reasoning and natural-language attribute description. We also build VitaSet, a VTL dataset with 186 objects, 52k multimodal images, and 5.1k instruction–answer pairs. VitaTouch achieves state-of-the-art performance on the public TVL benchmark, reaches 88.89% hardness accuracy, 75.13% roughness accuracy, and 54.81% descriptor recall on VitaSet, and maintains strong few-shot defect recognition with LoRA adaptation.

Overview of the dual-branch VitaTouch architecture and the progressive three-stage training pipeline.


VitaSet

VitaSet is a vision–tactile–language dataset for industrial quality inspection, integrating a self-collected GelSight robotic manipulation set and the GelSight-only subset of AnyTouch. It contains 186 objects, 30,553 RGB images, 21,510 GelSight tactile images (52,063 total), and 5,145 instruction–answer QA pairs under a unified annotation schema.
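Under the unified annotation schema, each sample pairs an RGB view and a GelSight tactile image with an instruction–answer pair. A minimal sketch of such a record is shown below; the field names and example values are illustrative assumptions, not VitaSet's actual on-disk format.

```python
from dataclasses import dataclass

@dataclass
class VTLSample:
    # One paired vision-tactile-language record (field names are illustrative)
    object_id: str
    rgb_path: str       # RGB camera image of the object
    tactile_path: str   # GelSight tactile image from the grasp
    instruction: str    # language prompt, e.g. a property question
    answer: str         # annotated response

sample = VTLSample(
    object_id="obj_001",
    rgb_path="rgb/obj_001_view0.png",
    tactile_path="gelsight/obj_001_press0.png",
    instruction="Describe the material of the grasped object.",
    answer="smooth rigid plastic",
)
print(sample.object_id)  # obj_001
```

A schema like this makes the instruction–answer pairs directly consumable by a multimodal data loader, with the two image paths resolved per sample.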

VitaSet overview

Method

Three-stage training

  • Stage 1: Cross-modal alignment to establish a shared semantic interface across vision, touch, and language.
  • Stage 2: Property-reasoning foundation learning with dual Q-Formers that distill vision–tactile prefix tokens and prepend them to a frozen Vicuna-7B decoder.
  • Stage 3: Parameter-efficient few-shot defect recognition using LoRA while freezing the backbone.
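The token-distillation step in Stage 2 can be pictured as cross-attention in which a small set of learnable queries summarizes encoder patch features into fixed-length prefix tokens, one set per branch. The NumPy sketch below shows the idea with a single attention head; the dimensions, query count, and shared weights are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

def qformer_prefix(features, queries, Wq, Wk, Wv):
    # Single-head cross-attention: learnable queries attend to encoder features
    # and compress them into a fixed number of prefix tokens.
    Q = queries @ Wq                     # (n_query, d)
    K = features @ Wk                    # (n_patch, d)
    V = features @ Wv                    # (n_patch, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)
    return attn @ V                      # (n_query, d) compact prefix tokens

rng = np.random.default_rng(0)
d = 8
vision_feats = rng.standard_normal((49, d))   # e.g. ViT patch features
tactile_feats = rng.standard_normal((49, d))  # e.g. GelSight encoder features
queries = rng.standard_normal((4, d))         # 4 learnable queries per branch
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

vis_prefix = qformer_prefix(vision_feats, queries, Wq, Wk, Wv)
tac_prefix = qformer_prefix(tactile_feats, queries, Wq, Wk, Wv)
prefix = np.concatenate([vis_prefix, tac_prefix], axis=0)
print(prefix.shape)  # (8, 8) -- tokens prepended to the frozen LLM's input
```

Because the decoder stays frozen, only the queries, attention weights, and (in Stage 3) the LoRA adapters are trained, which is what keeps the few-shot adaptation parameter-efficient.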
Model architecture

Results

TVL Benchmark

Table 1: Comparison on the TVL benchmark. VitaTouch achieves the best performance on HCT and TVL, and remains competitive on SSVTP.


VitaSet Validation Performance

Figure 4: VitaSet validation performance of VitaTouch across training epochs. (a) Multi-task validation trends for hardness accuracy, roughness accuracy, descriptor recall, and mean task score. (b) Comparison of strict exact-match descriptor recall and semantic similarity for the material descriptor task. Stars mark the best epoch for each metric.


Ablation on VitaSet

Table 2: Comparison of the full VitaTouch model, a no-Stage-1 variant, and unimodal settings on VitaSet across tasks.


Figure 5: Ablation results on the VitaSet dataset across tasks. Each variant removes one key stage from the full model, demonstrating the necessity of explicit alignment and multimodal fusion for robust multi-task property learning.


Few-shot Defect Adaptation

Table 3: LoRA-based defect adaptation results under different numbers of defect categories and labeled training samples per category. Three settings are compared: vision-only, tactile-only, and the full VitaTouch model with fused vision–tactile inputs.


Qualitative Outputs

Figure 6: Qualitative inspection-style outputs of VitaTouch. Representative examples are shown for material descriptor prediction, hardness classification, roughness classification, and defect decision with brief descriptions.


Robotic Sorting Demonstration

Figure 7: Proof-of-concept robotic sorting demonstration. The robot performs a pre-grasp, acquires aligned visual and tactile observations during the grasp, predicts defect status with the Stage-3 model, and sorts the object into the defect (left) or normal (right) bin accordingly.


Robotic Sorting Demo

We provide videos of the proof-of-concept closed-loop sorting system. The robot acquires aligned vision and tactile observations during grasp, predicts defect status using the Stage-3 model, and places objects into the left (defect) or right (normal) bin.
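The routing step in this loop reduces to thresholding the defect prediction. A minimal sketch follows; the function name, probability interface, and threshold value are illustrative assumptions about how the Stage-3 output could be consumed, not the demo's actual control code.

```python
def route_to_bin(defect_prob: float, threshold: float = 0.5) -> str:
    # Route a grasped object given the Stage-3 defect probability.
    # The 0.5 threshold is an illustrative default, not a tuned value.
    return "left (defect)" if defect_prob >= threshold else "right (normal)"

for p in (0.91, 0.12):
    print(f"p={p} -> {route_to_bin(p)}")
```

In a deployed cell the threshold would be chosen against the cost of false accepts versus false rejects rather than fixed at 0.5.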

Defect → Left Bin

Defect sample 1 — predicted as defect and placed into the left bin.
Defect sample 2 — predicted as defect and placed into the left bin.

Normal → Right Bin

Normal sample 1 — predicted as normal and placed into the right bin.
Normal sample 2 — predicted as normal and placed into the right bin.

Citation

If you find our work useful, please consider citing:

@article{zong_vitatouch_2025,
  title   = {VitaTouch: Property-Aware Vision–Tactile–Language Model for Robotic Quality Inspection in Manufacturing},
  author  = {Zong, Junyi and Jia, Qingxuan and Shi, Meixian and Li, Tong and Li, Jiayuan and Lv, Zihang and Chen, Gang and Deng, Fang},
  year    = {2025}
}