FU Xiang,WANG Zhufeng,QIN Yifan,et al. Underground video semantic extraction and description generation technology based on multimodal large model[J]. Coal Science and Technology,2025,53(11):216−228. DOI: 10.12438/cst.2025-0940

Underground video semantic extraction and description generation technology based on multimodal large model

Abstract: With the rapid advancement of intelligent coal mine construction, the volume of underground operational video data has surged dramatically. Current video processing and storage methods rely predominantly on single-scene video analysis and raw-format storage, which face two critical limitations: monolithic scene models produce incomplete information descriptions, and constrained storage capacity results in short data retention periods. To address the practical need for comprehensive yet low-cost semantic analysis of underground videos, this paper proposes a coal mine video captioning method that integrates adaptive keyframe extraction based on a working-condition complexity metric with multimodal semantic modeling, enabling efficient computational parsing and natural language description of underground video content. First, a complexity metric assignment method is designed around the distinctive features of underground working conditions. Building on this, a dynamic frame-sampling frequency algorithm is proposed to minimize computational overhead while ensuring robust capture of key information. A Multimodal Large Language Model (MLLM)-based technical framework is then developed, incorporating four core modules: adaptive keyframe extraction, large-model-driven visual-semantic feature extraction, prompt engineering and text encoding, and multimodal fusion and text decoding. This framework enables efficient, low-cost generation of natural language descriptions covering full-scene underground video information. Comparative experiments demonstrate that the proposed method achieves a key information capture rate of 95.4% while reducing computational resource consumption to 1.5% of that of traditional dense-sampling approaches. These results validate the method as a viable technical solution for high-fidelity, cost-effective semantic analysis of underground videos.
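The dynamic frame-sampling idea described in the abstract — sampling densely when working-condition complexity is high and sparsely when it is low — can be sketched as follows. This is a minimal illustration, not the authors' actual algorithm: the function name `adaptive_sample_indices`, the per-frame complexity scores in `[0, 1]`, and the interval bounds are all assumptions introduced for illustration.

```python
def adaptive_sample_indices(complexity, base_interval=30, min_interval=2):
    """Select frame indices from a video given per-frame complexity scores.

    High complexity (near 1.0) shrinks the sampling interval toward
    min_interval (dense sampling); low complexity (near 0.0) stretches it
    toward base_interval (sparse sampling). The metric itself and these
    bounds are hypothetical placeholders for the paper's complexity model.
    """
    indices = []
    i, n = 0, len(complexity)
    while i < n:
        indices.append(i)
        # Clamp the score to [0, 1], then make the interval inversely
        # proportional to complexity, bounded to [min_interval, base_interval].
        c = max(0.0, min(1.0, complexity[i]))
        interval = max(min_interval, round(base_interval * (1.0 - c)))
        i += interval
    return indices
```

With uniformly low complexity the sampler keeps only one frame in every `base_interval`, while uniformly high complexity yields near-dense sampling — matching the cost-versus-capture trade-off the abstract reports.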
