FU Xiang,WANG Zhufeng,QIN Yifan,et al. Underground video semantic extraction and description generation technology based on multimodal large model[J]. Coal Science and Technology,2025,53(11):216−228. DOI: 10.12438/cst.2025-0940

Underground video semantic extraction and description generation technology based on multimodal large model

Abstract: With the rapid advancement of intelligent coal mine construction, the volume of underground operational video data has surged dramatically. Current video processing and storage methods rely predominantly on single-scene video analysis and raw-format storage, which face two critical limitations: monolithic scene models produce incomplete information descriptions, and constrained storage capacity results in short data retention periods. To address the practical need for comprehensive yet low-cost semantic analysis of underground videos, this paper proposes a coal mine video captioning method that integrates adaptive keyframe extraction based on a working-condition complexity metric with multimodal semantic modeling, enabling efficient computational parsing and natural language description of underground video content. First, a complexity metric assignment method is designed around the distinctive features of underground working conditions. Building on this, a dynamic frame-sampling frequency algorithm is proposed to minimize computational overhead while ensuring robust capture of key information. A Multimodal Large Language Model (MLLM)-based technical framework is then developed, incorporating four core modules: adaptive keyframe extraction, large-model-driven visual-semantic feature extraction, prompt engineering and text encoding, and multimodal fusion and text decoding. This framework enables efficient, low-cost generation of natural language descriptions covering full-scene underground video information. Comparative experiments demonstrate that the proposed method achieves a key information capture rate of 95.4% while reducing computational resource consumption to 1.5% of that of traditional dense-sampling approaches. These results validate the method as a viable technical solution for high-fidelity, cost-effective semantic analysis of underground videos.
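The dynamic frame-sampling idea described in the abstract — sampling densely when working-condition complexity is high and sparsely when it is low — can be sketched as follows. This is a minimal illustration, not the authors' actual algorithm: the function name `adaptive_sample_indices`, the per-frame complexity scores in `[0, 1]`, and the interval bounds are all assumptions introduced for illustration.

```python
def adaptive_sample_indices(complexity, base_interval=30, min_interval=2):
    """Select frame indices from a video given per-frame complexity scores.

    High complexity (near 1.0) shrinks the sampling interval toward
    min_interval (dense sampling); low complexity (near 0.0) stretches it
    toward base_interval (sparse sampling). The metric itself and these
    bounds are hypothetical placeholders for the paper's complexity model.
    """
    indices = []
    i, n = 0, len(complexity)
    while i < n:
        indices.append(i)
        # Clamp the score to [0, 1], then make the interval inversely
        # proportional to complexity, bounded to [min_interval, base_interval].
        c = max(0.0, min(1.0, complexity[i]))
        interval = max(min_interval, round(base_interval * (1.0 - c)))
        i += interval
    return indices
```

With uniformly low complexity the sampler keeps only one frame in every `base_interval`, while uniformly high complexity yields near-dense sampling — matching the cost-versus-capture trade-off the abstract reports.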
