Abstract:
Existing tracking methods suffer from poor feature stability, inadequate appearance discrimination, and poorly adaptive matching association mechanisms in the complex underground coal mine environment, which is characterised by lighting variations, dust, and frequent occlusions among visually similar personnel. To address these limitations, a personnel tracking method for underground coal mines based on self-attention and matching optimisation is proposed. Built on the Transformer encoder-decoder structure, the method performs end-to-end online prediction of target bounding boxes, categories, and identity IDs in four steps: frame-level feature extraction, self-attention encoding, query entity decoding, and prediction mapping. First, an adaptive dual-domain synergy module is designed for the feature extraction stage. It dynamically adjusts the weight distribution across channels and spatial locations through a channel-adaptive weighting mechanism and a spatial-aware weighting mechanism, which enhances the feature map’s ability to discriminate subtle differences between targets. Second, a progressive downsampling attention fusion module is placed between feature extraction and the encoder. It adopts a multi-level feature fusion strategy: element-wise summation of multi-level features strengthens high-level feature expression while retaining low-level detail, so that the edge details and positional changes of targets can be captured accurately in dim coal mine scenes. Finally, for matching tracked targets to predicted boxes, the traditional Hungarian matching algorithm is improved by adopting a locally optimal matching strategy that is optimised by dynamically adjusting the cost matrix. This effectively reduces the mismatching rate, alleviates tracking drift, and improves the accuracy of tracking coal miners.
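The matching step described above can be illustrated with a minimal sketch. The abstract does not specify how the cost matrix is adjusted or how the locally optimal assignment is computed, so everything below is an assumption made for illustration: the cost is taken as 1 - IoU, the dynamic adjustment is a hypothetical penalty that grows with the number of frames a track has gone unmatched, and the locally optimal strategy is approximated by a greedy lowest-cost-first assignment. The names `build_cost`, `greedy_match`, `penalty`, and `max_cost` are hypothetical and do not come from the proposed method.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_cost(tracks, detections, missed_frames, penalty=0.05):
    """Build a track-by-detection cost matrix.

    Base cost is 1 - IoU. The added term is an ASSUMED dynamic adjustment:
    a track that has gone unmatched for more frames is made more expensive.
    """
    return [
        [(1.0 - iou(t, d)) + penalty * missed_frames[i] for d in detections]
        for i, t in enumerate(tracks)
    ]


def greedy_match(cost, max_cost=0.7):
    """Assign pairs in order of increasing cost (a locally optimal choice
    at each step), skipping rows and columns that are already used."""
    pairs = sorted(
        (cost[i][j], i, j)
        for i in range(len(cost))
        for j in range(len(cost[i]))
    )
    used_tracks, used_dets, matches = set(), set(), []
    for c, i, j in pairs:
        if c > max_cost:
            break  # all remaining pairs are at least this expensive
        if i in used_tracks or j in used_dets:
            continue
        matches.append((i, j))
        used_tracks.add(i)
        used_dets.add(j)
    return matches
```

Calling `greedy_match(build_cost(tracks, detections, missed_frames))` returns a list of `(track_index, detection_index)` pairs. Unlike the Hungarian algorithm, this greedy variant does not minimise the total cost globally; it is shown only to make the idea of a local, cost-gated assignment concrete.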
Experimental results show that the proposed method outperforms the TrackFormer network on both the self-constructed coal mine dataset and the public dataset. It achieves a multiple object tracking accuracy (MOTA) of 86.4% and an identification F1 score (IDF1) of 77%, improvements of 4.2 and 5.4 percentage points, respectively, over TrackFormer.
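For reference, the two reported metrics are conventionally defined as follows. These are the standard definitions from the multiple object tracking literature, not formulas stated in this abstract. Here $\mathrm{FN}_t$, $\mathrm{FP}_t$, $\mathrm{IDSW}_t$, and $\mathrm{GT}_t$ denote the numbers of false negatives, false positives, identity switches, and ground-truth objects in frame $t$, and $\mathrm{IDTP}$, $\mathrm{IDFP}$, and $\mathrm{IDFN}$ denote identity-level true positives, false positives, and false negatives.

$$
\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}
$$

MOTA penalises detection errors and identity switches together, whereas IDF1 measures how consistently each identity is preserved over time. The two metrics are therefore sensitive to different aspects of the mismatching and tracking drift that the method targets.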