Recognition method for transient falling images of coal and gangue separation in underground solid backfill mining using multimodal large language models
Abstract
Solid backfill coal mining, as a green mining method that balances resource recovery and ecological protection, relies on coal and gangue separation as a core process for the efficient operation of integrated underground mining, separation, and backfilling technology. However, coal and gangue identification, the key technology for precise coal and gangue separation, faces challenges such as difficult feature extraction and ambiguous boundary localization under complex underground working conditions. To address this, a method for recognizing transient falling images of coal and gangue separation in underground solid backfill coal mining using a multimodal large language model (MLLM) was proposed. First, an experimental platform for capturing transient falling images of coal and gangue separation was independently designed and built to simulate the complex underground conditions of low illumination and high dust, and high-speed cameras were used to capture transient falling images of coal and gangue under different conditions. The collected images were preprocessed with optimized algorithms to enhance the brightness of low-illumination images and improve the quality of images captured in dusty environments; the images were then annotated and augmented to construct a dataset for training and testing coal and gangue identification models. Subsequently, to address the shortcomings of the traditional SegFormer model in boundary recognition of coal and gangue images, an efficient channel attention (ECA) module was introduced and the loss function was optimized, yielding the ECSegFormer model. Furthermore, the MLLM was integrated with the ECSegFormer model to form the MLLM-ECSegFormer architecture: Qwen-VL (7B) was used to extract the center coordinates of coal and gangue targets, a spatial attention mask was generated from these coordinates through a Gaussian heatmap, and the mask was injected into the ECSegFormer encoder stage by stage to achieve dynamic interaction between multimodal prior knowledge and image features. The experimental results showed that integrating the multimodal large language model significantly improved the performance of all classical image recognition models. Specifically, MLLM-ECSegFormer achieved a mean intersection over union (MIoU) of 95.50%, a mean pixel accuracy (MPA) of 98.92%, and an accuracy of 98.87%, significantly outperforming classical image recognition models in recognition accuracy, model complexity, and recognition efficiency. Compared with classical image recognition models, MLLM-ECSegFormer demonstrated stronger continuity in edge recognition under complex conditions; in particular, in scenarios with dust interference and irregularly shaped coal and gangue, its segmentation accuracy in the target area was significantly better than that of traditional models. The research findings provide a new method for the precise identification of coal and gangue, enhance the intelligence level of solid backfill coal mining technology, and are of great significance for the green and intelligent mining of coal resources.
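As a concrete illustration of the prior-injection step summarized above, the following minimal Python/NumPy sketch shows how target center coordinates returned by an MLLM could be turned into a Gaussian-heatmap attention mask and used to modulate an encoder feature map. The function names (gaussian_heatmap_mask, apply_mask_to_features), the sigma and alpha parameters, and the simple strided down-sampling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_heatmap_mask(height, width, centers, sigma=20.0):
    """Build a [0, 1] spatial attention mask from target center coordinates.

    centers: iterable of (x, y) pixel coordinates, e.g. as parsed from the
    MLLM's answer locating coal/gangue targets. Each center contributes an
    isotropic Gaussian; overlapping targets are merged by element-wise maximum.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=np.float32)
    for cx, cy in centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        mask = np.maximum(mask, g.astype(np.float32))
    return mask

def apply_mask_to_features(features, mask, alpha=1.0):
    """Modulate an encoder feature map (C, h, w) with the resized mask.

    The mask is down-sampled to the feature resolution by strided sampling
    (a stand-in for bilinear resizing) and blended as
    features * (1 + alpha * mask), so regions flagged by the prior are
    emphasized rather than the remainder being suppressed to zero.
    """
    c, h, w = features.shape
    step_y = max(mask.shape[0] // h, 1)
    step_x = max(mask.shape[1] // w, 1)
    small = mask[::step_y, ::step_x][:h, :w]
    return features * (1.0 + alpha * small[None, :, :])

if __name__ == "__main__":
    # Hypothetical centers for one 512x512 frame, as might be parsed from
    # an MLLM response; values are placeholders.
    centers = [(128, 200), (360, 310)]
    mask = gaussian_heatmap_mask(512, 512, centers, sigma=25.0)
    feats = np.random.rand(64, 128, 128).astype(np.float32)  # one encoder stage
    fused = apply_mask_to_features(feats, mask, alpha=0.5)
    print(mask.shape, fused.shape)  # (512, 512) (64, 128, 128)
```

In the staged-injection scheme described in the abstract, a step like apply_mask_to_features would be repeated at each encoder stage, with the mask resized to that stage's resolution.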