Baihong Han, Xiaoyan Jiang, Zhijun Fang, Hamido Fujita, Yongbin Gao, F-SCP: An automatic prompt generation method for specific classes based on visual language pre-training models, *Pattern Recognition*, 2024

The paper “F-SCP: An automatic prompt generation method for specific classes based on visual language pre-training models” by Baihong Han, a graduate student in our team (class of 2021), has been accepted by the top SCI journal *Pattern Recognition*. Congratulations!

Abstract:
The zero-shot classification performance of large-scale vision-language pre-training models (e.g., CLIP, BLIP, and ALIGN) can be enhanced by incorporating a prompt (e.g., “a photo of a [CLASS]”) before the class words. Modifying the prompt even slightly can have a significant effect on the classification outcomes of these models. Thus, it is crucial to include an appropriate prompt tailored to the classes. However, manual prompt design is labor-intensive and necessitates domain-specific expertise. CoOp (Context Optimization) converts hand-crafted prompt templates into learnable word vectors to automatically generate prompts, resulting in substantial improvements for CLIP. However, CoOp exhibits significant variation in classification performance across different classes. Although CoOp-CSC (Class-Specific Context) learns a separate prompt for each class, it only shows advantages on fine-grained datasets. In this paper, we propose a novel automatic prompt generation method called F-SCP (Filter-based Specific Class Prompt), which distinguishes itself from the CoOp-UC (Unified Context) model and the CoOp-CSC model. Our approach focuses on prompt generation for low-accuracy classes and similar classes. We add the Filter and SCP modules to the prompt generation architecture. The Filter module selects the poorly classified classes, and the SCP (Specific Class Prompt) module then regenerates their prompts, replacing the prompts of these specific classes. Experimental results on six multi-domain datasets show the superiority of our approach over state-of-the-art methods. In particular, the improvement in accuracy for the specific classes mentioned above is significant. For instance, compared with CoOp-UC on the OxfordPets dataset, low-accuracy classes such as Class21 and Class26 are improved by 18% and 12%, respectively.
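For illustration, below is a minimal PyTorch sketch of the prompt-learning mechanism the abstract builds on: CoOp-style learnable context vectors prepended to frozen class-name embeddings, plus a filter step that flags low-accuracy classes as candidates for class-specific prompt regeneration (the SCP idea). This is not the authors' implementation; the stand-in text-encoder embeddings, dimensions, and accuracy threshold are illustrative assumptions.

```python
# Sketch only: CoOp-style learnable prompts plus an accuracy-based filter step.
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    def __init__(self, n_classes, n_ctx=16, dim=512):
        super().__init__()
        # shared ("unified") context vectors, one set for all classes
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # frozen stand-in for embedded class-name tokens (1 token per class here)
        self.register_buffer("class_emb", torch.randn(n_classes, 1, dim))

    def forward(self):
        n_classes = self.class_emb.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)   # [C, n_ctx, dim]
        return torch.cat([ctx, self.class_emb], dim=1)          # [C, n_ctx + 1, dim]

def filter_low_accuracy(per_class_acc, threshold=0.5):
    """Filter step: return the indices of classes whose validation accuracy
    falls below a threshold; these would receive their own regenerated
    prompts in a class-specific (SCP-like) step."""
    return [i for i, acc in enumerate(per_class_acc) if acc < threshold]

if __name__ == "__main__":
    prompts = LearnablePrompt(n_classes=37)              # e.g. OxfordPets has 37 classes
    print(prompts().shape)                               # torch.Size([37, 17, 512])
    print(filter_low_accuracy([0.9, 0.4, 0.95, 0.3]))    # -> [1, 3]
```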

Download: [Official link] [Preprint version]

Keywords: Multi-modal, Vision language model, Prompt tuning, Large-scale pre-training model.

Yi Wu, Xiaoyan Jiang, Zhijun Fang, Yongbin Gao, Hamido Fujita, Multi-modal 3D object detection by 2D-guided precision anchor proposal and multi-layer fusion, *Applied Soft Computing*, 2021

The paper “Multi-modal 3D object detection by 2D-guided precision anchor proposal and multi-layer fusion” by Yi Wu, a graduate student in our team (class of 2018), has been accepted by the SCI journal *Applied Soft Computing*. Congratulations!

Abstract:
3D object detection, whose goal is to obtain the 3D spatial structure of objects, is a challenging topic in many visual perception systems, e.g., autonomous driving, augmented reality, and robot navigation. Most existing region proposal network (RPN)-based 3D object detection methods generate anchors in the whole 3D search space without using semantic information, which leads to inappropriate anchor sizes. To tackle this issue, we propose a 2D-guided precision anchor generation network (PAG-Net). Specifically, we utilize a mature 2D detector to obtain 2D bounding boxes and category labels of objects as prior information. The 2D bounding boxes are then projected into 3D frustum space to generate more precise and category-adaptive 3D anchors. Furthermore, existing feature combination methods, i.e., early fusion, late fusion, and deep fusion, only fuse features from high convolutional layers and ignore the data-missing problem of point clouds. To fuse RGB image and point cloud features more effectively, we propose a multi-layer fusion model, which conducts nonlinear and iterative combinations of features from multiple convolutional layers and merges global and local features effectively. We encode the point cloud with the bird’s-eye-view (BEV) representation to handle its irregularity. Experimental results show that our proposed approach improves the baseline by a large margin and outperforms most state-of-the-art methods on the KITTI object detection benchmark.
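For illustration, below is a minimal NumPy sketch of the 2D-guided anchor idea: the center of a 2D detection is back-projected through the camera intrinsics at a few sampled depths, and a category-specific size prior is attached to each resulting 3D anchor center. This is not the paper's implementation; the intrinsics, size priors, and depth samples are placeholder assumptions.

```python
# Sketch only: back-project a 2D box into frustum space and attach a size prior.
import numpy as np

# hypothetical per-category 3D size priors (length, width, height) in meters
SIZE_PRIORS = {
    "Car": (3.9, 1.6, 1.56),
    "Pedestrian": (0.8, 0.6, 1.73),
    "Cyclist": (1.76, 0.6, 1.73),
}

def frustum_anchors(box2d, label, K, depths=(10.0, 20.0, 30.0)):
    """Back-project the center of a 2D box (u1, v1, u2, v2) through the
    intrinsics K at sampled depths, and attach the category's size prior
    to each 3D center, yielding category-adaptive anchors."""
    u = 0.5 * (box2d[0] + box2d[2])
    v = 0.5 * (box2d[1] + box2d[3])
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    size = SIZE_PRIORS[label]
    anchors = []
    for z in depths:
        x = (u - cx) * z / fx        # pinhole back-projection
        y = (v - cy) * z / fy
        anchors.append((x, y, z) + size)
    return np.array(anchors)         # [n_depths, 6] = (x, y, z, l, w, h)

if __name__ == "__main__":
    K = np.array([[721.5, 0.0, 609.6],     # placeholder camera intrinsics
                  [0.0, 721.5, 172.9],
                  [0.0, 0.0, 1.0]])
    print(frustum_anchors((710, 144, 810, 307), "Car", K))
```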

Download: [Official link] [Preprint version]

Keywords: 3D object detection, Multi-modal, Autonomous driving, Feature fusion, Point cloud.


Guanghao Jiang, Xiaoyan Jiang, Zhijun Fang, et al., An efficient attention module for 3d convolutional neural networks in action recognition, *Applied Intelligence*

The paper “An efficient attention module for 3d convolutional neural networks in action recognition” by Guanghao Jiang, a graduate student in our team (class of 2018), has been accepted by the SCI journal *Applied Intelligence*. Congratulations!

Abstract:
Due to illumination changes, varying postures, and occlusion, accurately recognizing actions in videos is still a challenging task. A three-dimensional convolutional neural network (3D CNN), which can simultaneously extract spatio-temporal features from sequences, is one of the mainstream models for action recognition. However, most existing 3D CNN models ignore the importance of individual frames and spatial regions when recognizing actions. To address this problem, we propose an efficient attention module (EAM) that contains two sub-modules: a spatial efficient attention module (EAM-S) and a temporal efficient attention module (EAM-T). Specifically, without dimensionality reduction, EAM-S concentrates on mining category-based correlation through local cross-channel interaction and assigns high weights to important image regions, while EAM-T estimates the importance score of each frame through cross-frame interaction between the frame and its neighbors. The proposed EAM module is lightweight yet effective, and it can be easily embedded into 3D CNN-based action recognition models. Extensive experiments on the challenging HMDB-51 and UCF-101 datasets show that our module achieves state-of-the-art performance and significantly improves the recognition accuracy of 3D CNN-based action recognition methods.
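For illustration, below is a minimal PyTorch sketch of the kind of attention described above: channel attention via local cross-channel interaction with a 1D convolution (no dimensionality reduction) and a simple per-frame temporal weighting, both applied to 3D-CNN features of shape [N, C, T, H, W]. This is not the authors' released code; the kernel sizes and pooling choices are illustrative assumptions.

```python
# Sketch only: EAM-S-like channel attention and EAM-T-like temporal attention.
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """EAM-S-style block: squeeze the spatio-temporal dimensions, then model
    local cross-channel interaction with a small 1D conv instead of an MLP."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                          # x: [N, C, T, H, W]
        w = x.mean(dim=(2, 3, 4))                  # [N, C] global average pool
        w = self.conv(w.unsqueeze(1)).squeeze(1)   # local cross-channel interaction
        w = torch.sigmoid(w)
        return x * w.view(x.size(0), -1, 1, 1, 1)

class TemporalAttention(nn.Module):
    """EAM-T-style block: score each frame by interaction with its neighbors
    via a 1D conv over the temporal axis."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                          # x: [N, C, T, H, W]
        s = x.mean(dim=(1, 3, 4))                  # [N, T] per-frame descriptor
        s = torch.sigmoid(self.conv(s.unsqueeze(1)).squeeze(1))
        return x * s.view(x.size(0), 1, -1, 1, 1)

if __name__ == "__main__":
    feat = torch.randn(2, 64, 16, 28, 28)          # e.g. a mid-level 3D-CNN feature map
    out = TemporalAttention()(SpatialChannelAttention()(feat))
    print(out.shape)                               # torch.Size([2, 64, 16, 28, 28])
```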

Download: [Official link] [Preprint version]

Keywords: Efficient attention module, 3D CNN, Deep learning, Action recognition.
