L. Fan, W. Chen, X. Jiang, "Cross-Correlation Fusion Graph Convolution-Based Object Tracking," *Symmetry*, 2023

The paper "Cross-Correlation Fusion Graph Convolution-Based Object Tracking" by Liuyi Fan, a graduate student of the 2019 cohort, has been accepted by the MDPI journal *Symmetry*. Congratulations!

Abstract:
Most popular graph attention networks treat the pixels of a feature map as individual nodes, so the feature embedding extracted by graph convolution lacks object-level integrity. Moreover, matching a template graph against a search graph using only part-level information often causes tracking errors, especially under occlusion and in the presence of similar distractors. To address these problems, we propose a novel end-to-end graph attention tracking framework with high symmetry that directly incorporates traditional cross-correlation operations. The cross-correlation operations effectively compensate for the dispersion of graph nodes and enhance the feature representation. Additionally, our graph attention fusion model performs both part-to-part matching and global matching, allowing more accurate information embedding between the template and search regions. Furthermore, we optimize the information embedding between the template and search branches to achieve better single-object tracking results, particularly in occlusion and similarity scenarios. The flexibility of the graph nodes and the comprehensiveness of the information embedding bring significant performance improvements to our framework. Extensive experiments on three challenging public datasets (LaSOT, GOT-10k, and VOT2016) show that our tracker outperforms other state-of-the-art trackers.
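As an illustrative aside (not the paper's code), the classical cross-correlation operation that Siamese-style trackers build on can be sketched in plain NumPy: a template feature map slides over the search-region feature map, and the channel-wise products are summed at each offset. The function name and shapes below are our own choices for the sketch.

```python
import numpy as np

def cross_correlation(template, search):
    """Siamese-style cross-correlation in 'valid' mode (no padding):
    slide the template over the search feature map and sum the
    channel-wise products at each offset.

    template: (C, th, tw); search: (C, sh, sw) with sh >= th, sw >= tw.
    Returns a response map of shape (sh - th + 1, sw - tw + 1);
    the peak indicates the most likely target location.
    """
    c, th, tw = template.shape
    _, sh, sw = search.shape
    out = np.zeros((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = search[:, i:i + th, j:j + tw]
            out[i, j] = np.sum(window * template)
    return out
```

In a real tracker both feature maps come from a shared backbone; here the loops stand in for the batched convolution actually used in practice.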

Download: [official link]

Keywords: symmetry; single-object tracking; graph attention network; Siamese networks; cross-correlation; feature fusion

Photos:

Xiaoyan Jiang, J.-N. Hwang, and Z. Fang, "A Multiscale Coarse-to-Fine Human Pose Estimation Network With Hard Keypoint Mining," *IEEE Transactions on Systems, Man, and Cybernetics: Systems*, March 2024

The paper "A Multiscale Coarse-to-Fine Human Pose Estimation Network With Hard Keypoint Mining" by team leader Prof. Xiaoyan Jiang has been accepted by the SCI journal *IEEE Transactions on Systems, Man, and Cybernetics: Systems*. Congratulations!

Abstract:
Current convolutional neural network (CNN)-based multiperson pose estimators have achieved great progress; however, they pay little or no attention to "hard" samples, such as occluded keypoints, small and nearly invisible keypoints, and ambiguous keypoints. In this article, we explicitly deal with these "hard" samples by proposing a novel multiscale coarse-to-fine human pose estimation network (HM2PN), which consists of two sequential subnetworks: CoarseNet and FineNet. CoarseNet produces a coarse prediction to locate "simple" keypoints such as hands and ankles with a multiscale fusion module, which is integrated with the bottleneck to form a novel module called the multiscale bottleneck. The new module improves the multiscale representation ability of the network at a fine-grained level while marginally reducing the computation cost thanks to group convolution. FineNet then infers "hard" keypoints and refines "simple" keypoints simultaneously with a hard keypoint mining loss. Distinct from previous works, the proposed loss treats "hard" keypoints differently and prevents "simple" keypoints from dominating the computed gradients during training. Experiments on the COCO keypoint benchmark show that our approach achieves superior pose estimation performance compared with other state-of-the-art methods.
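The hard-keypoint-mining idea can be illustrated with a minimal sketch (our own simplification, not the paper's exact loss): compute a per-keypoint heatmap loss, then average only the hardest keypoints so that easy keypoints cannot dominate the gradient. The function name and the MSE choice are assumptions for illustration.

```python
import numpy as np

def hard_keypoint_mining_loss(pred, target, top_k=4):
    """Online hard keypoint mining, sketched: per-keypoint MSE on
    heatmaps, then average only the top_k largest (hardest) keypoint
    losses, so "simple" keypoints contribute no gradient.

    pred, target: (K, H, W) heatmaps for K keypoints.
    """
    per_kpt = ((pred - target) ** 2).mean(axis=(1, 2))  # (K,) loss per keypoint
    hardest = np.sort(per_kpt)[::-1][:top_k]            # keep the top_k hardest
    return hardest.mean()
```

With `top_k` equal to the number of keypoints this reduces to the ordinary mean heatmap loss, which makes the mining variant easy to A/B test.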

Download: [preprint version]

Keywords: Hard sample mining, human pose estimation, multiscale

Photos:

Kunlun Xue, Xiaoyan Jiang, Zhichao Chen, "A SLAM Method Based on ORB-SLAM3 Which Mixed GNSS Data," 6th International Conference on Information Technologies and Electrical Engineering

The paper "A SLAM Method Based on ORB-SLAM3 Which Mixed GNSS Data" by Kunlun Xue, a graduate student of the 2021 cohort, has been accepted by the 6th International Conference on Information Technologies and Electrical Engineering. Congratulations!

Abstract:
Traditional single-sensor SLAM methods suffer from cumulative drift errors in large-scale outdoor environments, which makes it difficult to achieve good localization accuracy in practical application scenarios. To solve this problem, we propose a method that fuses a visual-inertial system with the global navigation satellite system (GNSS). GNSS measurements are first transformed into values in a Cartesian coordinate system; odometry pose information and GNSS information are then jointly refined by nonlinear optimization to eliminate the cumulative drift error within the system. Experiments on the KITTI raw data show that the proposed method effectively improves localization accuracy in large-scale outdoor environments: on the KITTI dataset, localization accuracy is 54% higher than that of ORB-SLAM3 on average.
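The GNSS-to-Cartesian step can be sketched as follows. This is an illustrative flat-earth (equirectangular) projection around a reference fix, not necessarily the conversion used in the paper; it is adequate over the few-kilometer scale of a typical KITTI sequence.

```python
import math

EARTH_RADIUS = 6378137.0  # WGS-84 equatorial radius, meters

def gnss_to_local_xy(lat, lon, ref_lat, ref_lon):
    """Project GNSS latitude/longitude (degrees) into a local
    east/north Cartesian frame (meters) centered on a reference fix,
    using a flat-earth approximation: longitude differences are
    scaled by cos(latitude) to account for converging meridians.
    """
    d_lat = math.radians(lat - ref_lat)
    d_lon = math.radians(lon - ref_lon)
    east = EARTH_RADIUS * d_lon * math.cos(math.radians(ref_lat))
    north = EARTH_RADIUS * d_lat
    return east, north
```

In a fusion pipeline these east/north values would enter the nonlinear optimization as global position constraints on the odometry poses.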

Download: [official link]

Keywords: Simultaneous localization and mapping, Multi-source mixed, Automatic driving, Nonlinear optimization

Photos:

Wenwen Zheng, Xiaoyan Jiang, Zhijun Fang et al., "TV-Net: A Structure-Level Feature Fusion Network Based on Tensor Voting for Road Crack Segmentation," *IEEE Transactions on Intelligent Transportation Systems*, June 2024

The paper "TV-Net: A Structure-Level Feature Fusion Network Based on Tensor Voting for Road Crack Segmentation" by Wenwen Zheng, a graduate student of the 2021 cohort, has been accepted by the top SCI journal *IEEE Transactions on Intelligent Transportation Systems*. Congratulations!

Abstract:
Pavement cracks are a common and significant problem for intelligent pavement maintenance. However, the features extracted from pavement images are often texture-less, and noise interference can be high. Segmentation with traditional convolutional neural networks can lose feature information as the network depth increases, which makes accurate prediction challenging. To address these issues, we propose a new approach featuring an enhanced tensor voting module and a customized pixel-level pavement crack segmentation network, called TV-Net. We optimize the tensor voting framework and establish the relationship between tensor scale factors and crack distributions. A tensor voting fusion module is introduced to enhance feature maps by incorporating salient region maps generated by tensor voting. Additionally, we propose a structural consistency loss function to improve segmentation accuracy and ensure consistency with the structural characteristics of the cracks obtained through tensor voting. Extensive experimental analysis demonstrates that our method outperforms existing mainstream pixel-level segmentation networks on the same road crack dataset. The proposed TV-Net excels at suppressing noise interference and strengthening the structure of crack regions.
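A minimal sketch of what a structural consistency term might look like (our own illustration; the paper's exact formulation may differ): pixel-wise binary cross-entropy plus a penalty for disagreement between the prediction and a structure map, with the structure map standing in for the tensor-voting output.

```python
import numpy as np

def structural_consistency_loss(pred, target, structure_map, alpha=0.5):
    """Illustrative segmentation loss: binary cross-entropy against the
    ground-truth mask, plus an alpha-weighted L1 penalty for deviating
    from a crack-structure map (here a stand-in for the tensor-voting
    saliency output). pred, target, structure_map: arrays in [0, 1].
    """
    eps = 1e-7
    p = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()
    consistency = np.abs(pred - structure_map).mean()
    return bce + alpha * consistency
```

Predictions that match the ground truth but contradict the structure map are still penalized, which is the intended regularizing effect.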

Download: [official link] [preprint version]

Keywords: Crack detection; convolutional neural network; tensor voting; U-Net

Photos:

Baihong Han, Xiaoyan Jiang, Zhijun Fang, Hamido Fujita, Yongbin Gao, "F-SCP: An automatic prompt generation method for specific classes based on visual language pre-training models," *Pattern Recognition*, 2024

The paper "F-SCP: An automatic prompt generation method for specific classes based on visual language pre-training models" by Baihong Han, a graduate student of the 2021 cohort, has been accepted by the top SCI journal *Pattern Recognition*. Congratulations!

Abstract:
The zero-shot classification performance of large-scale vision-language pre-training models (e.g., CLIP, BLIP, and ALIGN) can be enhanced by placing a prompt (e.g., "a photo of a [CLASS]") before the class words. Slightly modifying the prompt can have a significant effect on the classification outcomes of these models, so it is crucial to use an appropriate prompt tailored to the classes. However, manual prompt design is labor-intensive and requires domain-specific expertise. CoOp (Context Optimization) converts hand-crafted prompt templates into learnable word vectors to generate prompts automatically, resulting in substantial improvements for CLIP. However, CoOp exhibits significant variation in classification performance across different classes. Although CoOp-CSC (Class-Specific Context) uses a separate prompt for each class, it only shows advantages on fine-grained datasets. In this paper, we propose a novel automatic prompt generation method called F-SCP (Filter-based Specific Class Prompt), which distinguishes itself from the CoOp-UC (Unified Context) and CoOp-CSC models. Our approach focuses on prompt generation for low-accuracy classes and similar classes. We add Filter and SCP modules to the prompt generation architecture: the Filter module selects the poorly classified classes, and the SCP (Specific Class Prompt) module then regenerates prompts to replace those of the selected classes. Experimental results on six multi-domain datasets show the superiority of our approach over state-of-the-art methods. In particular, the improvement in accuracy for the specific classes mentioned above is significant; for instance, compared with CoOp-UC on the OxfordPets dataset, low-accuracy classes such as Class21 and Class26 are improved by 18% and 12%, respectively.
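The CoOp-style learnable-prompt idea that F-SCP builds on can be sketched as follows. This is a toy illustration: the "text encoder" is replaced by a simple mean pool, and all function names and shapes are our own assumptions, not the paper's code.

```python
import numpy as np

def build_prompt(context, class_embedding):
    """CoOp-style prompt: M learnable context vectors concatenated
    with the class-name embedding (all of dimension D), then encoded.
    Here the text encoder is stood in for by a mean pool."""
    tokens = np.vstack([context, class_embedding[None, :]])  # (M+1, D)
    return tokens.mean(axis=0)

def classify(image_feature, context, class_embeddings):
    """Score an image feature against each class prompt by cosine
    similarity and return the index of the best-matching class."""
    scores = []
    for cls_emb in class_embeddings:
        prompt = build_prompt(context, cls_emb)
        cos = prompt @ image_feature / (
            np.linalg.norm(prompt) * np.linalg.norm(image_feature) + 1e-8)
        scores.append(cos)
    return int(np.argmax(scores))
```

In CoOp the `context` vectors are trained by gradient descent while the encoders stay frozen; F-SCP's Filter/SCP modules would then regenerate these vectors for the poorly classified classes.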

Download: [official link] [preprint version]

Keywords: Multi-modal; vision-language model; prompt tuning; large-scale pre-training model

Photos: