
Underground video semantic extraction and description generation technology based on a multimodal large model

  • Abstract: As intelligent coal mine construction advances, the volume of underground operational video data has surged. Current video processing and storage methods mostly rely on single-scene video analysis and raw-format storage. These have two practical limitations: a single scene model yields incomplete descriptions of the video content, and limited storage capacity shortens the period for which information can be retained. To meet the practical need for full-information, low-cost semantic analysis of underground video, this paper proposes a coal mine underground video description generation method that combines adaptive keyframe extraction, driven by assigned working-condition complexity metrics, with multimodal semantic modeling, achieving optimal computational parsing and natural language description of underground video. First, a complexity metric assignment method is designed around the characteristics of underground working conditions, and a method for computing a dynamic frame-sampling frequency from working-condition complexity is proposed, so that key video information is captured at the lowest computational cost. Next, a technical framework for underground video description generation based on multimodal large language models (MLLMs) is designed, with four key modules: adaptive keyframe extraction, large-model visual-semantic feature extraction, prompt design and text encoding, and multimodal fusion and text decoding. The framework generates natural language descriptions of full-scene underground video information efficiently and at low cost. Finally, the proposed video description method and frame-sampling strategy are compared with traditional methods in experiments. The results show that the proposed method achieves a key information capture rate of 95.4% while reducing computational resource consumption to 1.5% of that of the traditional dense frame-sampling method, providing a feasible technical path for full-information, low-cost semantic analysis of underground video.
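The abstract does not give the complexity metric or the formula for the frame-sampling frequency, so the following is only a minimal illustrative sketch of the general idea of complexity-driven sampling, not the authors' method. It assumes each video segment already carries a complexity score in [0, 1] and that the sampling rate is a linear interpolation between a minimum and a maximum rate. The names `Segment`, `sampling_rate`, and `keyframe_times`, and the default rates, are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """A video segment with an assigned complexity score (hypothetical structure)."""
    start_s: float
    end_s: float
    complexity: float  # assumed to lie in [0, 1]; the paper's real metric is not given


def sampling_rate(complexity: float, min_fps: float = 0.5, max_fps: float = 4.0) -> float:
    """Map a complexity score to a sampling rate by linear interpolation (an assumption)."""
    c = min(max(complexity, 0.0), 1.0)  # clamp out-of-range scores
    return min_fps + c * (max_fps - min_fps)


def keyframe_times(segments: List[Segment],
                   min_fps: float = 0.5,
                   max_fps: float = 4.0) -> List[float]:
    """Return the timestamps (in seconds) at which frames would be sampled."""
    times: List[float] = []
    for seg in segments:
        fps = sampling_rate(seg.complexity, min_fps, max_fps)
        duration = seg.end_s - seg.start_s
        # At least one frame per segment; the small epsilon guards against rounding down.
        count = max(1, int(duration * fps + 1e-9))
        times.extend(seg.start_s + i / fps for i in range(count))
    return times


if __name__ == "__main__":
    # A quiet 10 s segment followed by a busy 2 s segment.
    demo = [Segment(0.0, 10.0, 0.0), Segment(10.0, 12.0, 1.0)]
    print(keyframe_times(demo))
```

In this sketch a low-complexity segment is sampled sparsely and a high-complexity segment densely, which is the trade-off the abstract describes between computational cost and key information capture.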

     
