Latest Chinese Computer Vision Research Offers Point-Track-Transformer (PTT) Module for Point Cloud-Based 3D Single Object Tracking Task
Single Object Tracking (SOT) with LiDAR points has many applications in robotics and autonomous driving. For example, an autonomous pedestrian-following robot must reliably monitor and locate the person it follows in order to maintain control in a crowd. Another example is the autonomous landing of unmanned aerial vehicles, in which the drone must follow the target and determine its precise distance and posture in order to land safely. However, most current 3D SOT approaches rely on visual or RGB-D cameras, which can fail in low-visibility or changing lighting conditions because they depend heavily on dense images for target tracking.
3D LiDAR sensors are therefore attractive for object tracking: they are less sensitive to changes in light than visual or RGB-D cameras and can collect geometry and distance information more accurately. Unlike previous LiDAR-based multi-object tracking (MOT) approaches, LiDAR-based 3D SOT methods must model the similarity function between the target and the search region in order to locate the target. Although both require a similarity computation, MOT approaches compute inter-object similarity to associate detection results with tracklets, whereas SOT methods compute intra-object similarity to identify the target object.
Therefore, 3D SOT presents unique problems compared to 3D MOT. SC3D was the first LiDAR-based 3D Siamese tracker, built on a shape-completion network. However, it processes the input point cloud only with an encoder consisting of three layers of one-dimensional convolutions, which cannot extract a robust feature representation from the point cloud. Moreover, SC3D could neither run in real time nor be trained end-to-end. Researchers later proposed a point-to-box (P2B) network to estimate the target bounding box directly from the raw point cloud. However, that technique often loses the target when the point cloud is sparse.
Meanwhile, P2B does not exploit point-level cues that would further help locate the target center. Fan et al. recently merged a Siamese network with a LiDAR-based RPN to handle 3D object tracking. Nevertheless, they use the classification scores directly to rank the regression results, ignoring the misalignment between localization and classification. It should be noted that points at different geometric locations contribute unequally when representing targets, yet these techniques do not weight point cloud features accordingly. Additionally, due to the sparseness and occlusion of point clouds, the features extracted from the template and search region carry less information about the potential object and more background noise.
The architecture of the PTT module is divided into three blocks: feature embedding, position encoding, and self-attention. The input consists of point coordinates and their associated features. The feature embedding block maps the given features into the embedding space. The position encoding block uses the k nearest neighbors of each point to gather local location information, after which an MLP layer learns the encoded position features. The self-attention block then learns refined attention features from the local context. The output features of the PTT module are the sum of the input features and the residual attention features.
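The kNN-based position encoding described above can be illustrated with a minimal NumPy sketch. This is not the authors' PyTorch implementation; the function names, the hidden size, and the use of random MLP weights are assumptions made purely for illustration. It shows the core idea: for each point, gather the relative coordinates of its k nearest neighbors and pass them through a small MLP to obtain per-neighbor position features.

```python
import numpy as np

def knn_indices(coords, k):
    # Pairwise squared distances (N, N), then each point's k nearest
    # neighbors (including itself, at distance 0).
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

def relative_position_encoding(coords, k, W1, b1, W2, b2):
    # Hypothetical two-layer MLP over relative neighbor offsets.
    idx = knn_indices(coords, k)              # (N, k) neighbor indices
    rel = coords[idx] - coords[:, None, :]    # (N, k, 3) relative offsets
    h = np.maximum(rel @ W1 + b1, 0.0)        # (N, k, H) hidden layer + ReLU
    return h @ W2 + b2                        # (N, k, C) encoded positions

# Toy example with random weights (illustrative only).
rng = np.random.default_rng(0)
N, k, H, C = 64, 8, 16, 32
coords = rng.standard_normal((N, 3))
W1, b1 = rng.standard_normal((3, H)) * 0.1, np.zeros(H)
W2, b2 = rng.standard_normal((H, C)) * 0.1, np.zeros(C)
pos = relative_position_encoding(coords, k, W1, b1, W2, b2)
print(pos.shape)  # (64, 8, 32)
```

In a real tracker these MLP weights would be learned jointly with the rest of the network, and the resulting position features would be added to the embedded point features before the self-attention block.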
Therefore, understanding how to attend to spatial cues is essential for improving 3D object tracking. The transformer has recently demonstrated outstanding performance in feature encoding thanks to its robust self-attention module. Transformers typically have three main modules: input (word) embedding, position encoding, and an attention module. Position encoding maps the coordinates of the point cloud into distinguishable high-dimensional features, and self-attention produces refined attention features by computing attention weights. To test the usefulness of their PTT module, the researchers integrated it with the mainstream P2B tracker to create a novel 3D SOT tracker called PTT-Net, embedding PTT in both the voting and proposal-generation stages.
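The attention-weight computation and the residual output mentioned above can be sketched as follows. This is a generic scaled dot-product self-attention over per-point features, not the paper's exact formulation; the projection matrices and dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_residual(feats, Wq, Wk, Wv):
    # feats: (N, C) per-point features; Wq/Wk/Wv: (C, C) projections.
    q, k, v = feats @ Wq, feats @ Wk, feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N) attention weights
    refined = attn @ v                              # attention-weighted features
    return feats + refined                          # output = input + residual

# Toy example with random weights (illustrative only).
rng = np.random.default_rng(1)
N, C = 64, 32
feats = rng.standard_normal((N, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
out = self_attention_residual(feats, Wq, Wk, Wv)
print(out.shape)  # (64, 32)
```

The residual sum at the end mirrors the description of the PTT module's output being the sum of the input features and the attention features.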
The PTT module embedded in the voting stage can model interactions between point patches at different geometric locations, learning context-dependent features and helping the network focus on more representative object attributes. Meanwhile, the PTT module embedded in the proposal-generation stage can capture contextual information between the object and the background, helping the network suppress background noise effectively. These changes significantly improve 3D object tracking performance. Experimental results on the KITTI tracking dataset show that PTT-Net outperforms its baseline by roughly 10%.
They further evaluated PTT-Net on the NuScenes dataset, where the results demonstrate that the technique can achieve new state-of-the-art performance. PTT-Net also runs in real time at 40 FPS on a single NVIDIA 1080Ti GPU.
The contributions can be summarized as follows:
• PTT Module: A Point-Track-Transformer (PTT) module for 3D single object tracking on raw point clouds, which can weight point cloud features so the tracker focuses on deeper-level object cues during tracking.
• PTT-Net: A novel 3D object tracking network integrated with PTT modules that can be trained end-to-end. To their knowledge, this is the first work that uses a transformer for 3D object tracking on point clouds.
• Open source: Extensive tests on the KITTI and NuScenes datasets reveal that their technique outperforms state-of-the-art solutions by significant margins while running at 40 FPS. Moreover, they have released their code as open source to the scientific community.
The PyTorch implementation of the PTT module can be found freely on GitHub.
This article is written as a research summary by Marktechpost staff based on the research paper 'Real-time 3D Single Object Tracking with Transformer'. All credit for this research goes to the researchers on this project. Check out the paper and the GitHub link.
Asif Razzaq is an AI journalist and co-founder of Marktechpost, LLC. He is a visionary, entrepreneur and engineer who aspires to use the power of artificial intelligence for good.
Asif’s latest venture is the development of an artificial intelligence media platform (Marktechpost) that will revolutionize the way people find relevant news related to artificial intelligence, machine learning, and data science.
Asif was featured by Onalytica in its ‘Who’s Who in AI? (Influential Voices & Brands)’ as one of the ‘Influential Journalists in AI’ (https://onalytica.com/wp-content/uploads/2021/09/Whos-Who-In-AI.pdf). His interview was also featured by Onalytica (https://onalytica.com/blog/posts/interview-with-asif-razzaq/).