Applications of Transformers in Autonomous Driving (High-Level, Scene-Level Tasks)
Transformers in autonomous driving
Transformers use a global attention mechanism to capture long-range interaction information. The two core modules in Transformers are self-attention and cross-attention, defined as
$$\mathrm{Attn}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V}$$
$$\mathrm{SelfAttn}(\mathbf{X})=\mathrm{Attn}(\mathbf{W_Q X},\,\mathbf{W_K X},\,\mathbf{W_V X})$$
$$\mathrm{CrossAttn}(\mathbf{X},\mathbf{Y})=\mathrm{Attn}(\mathbf{W_Q X},\,\mathbf{W_K Y},\,\mathbf{W_V Y})$$
where another modality $\mathbf{Y}$ is usually fed in as the Key $\mathbf{K}$ and Value $\mathbf{V}$ for cross-domain information interaction.
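As a reference, here is a minimal single-head PyTorch sketch of the three operators above; multi-head splitting, dropout, and layer normalization are omitted for brevity.

```python
import math
import torch
import torch.nn as nn


def attn(Q, K, V):
    """Attn(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V; inputs have shape (B, N, d)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (B, N_q, N_k)
    return torch.softmax(scores, dim=-1) @ V               # (B, N_q, d)


class SelfAttn(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)

    def forward(self, X):                                  # X: (B, N, d)
        return attn(self.W_Q(X), self.W_K(X), self.W_V(X))


class CrossAttn(nn.Module):
    """Queries come from X; keys and values come from another modality Y."""
    def __init__(self, d_model):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)

    def forward(self, X, Y):                               # X: (B, N_x, d), Y: (B, N_y, d)
        return attn(self.W_Q(X), self.W_K(Y), self.W_V(Y))
```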
Many works use Transformers for scene-level tasks such as trajectory prediction, multi-sensor fusion, and BEV (Bird's Eye View) prediction from front cameras.
Trajectory prediction
Given agents' past trajectories, observations (if any), and map information (if any), the task is to predict driving actions or waypoints. The Transformer is usually viewed as an encoder. Moreover, researchers also focus on variations of the self-attention and cross-attention mechanisms to model interactions between different modalities or within a single modality. Furthermore, some papers treat the Transformer as an auto-regressive or single-forward function to generate trajectory predictions iteratively.
Scene Encoder
GSA [ITSC’21]
GSA (Image transformer for explainable autonomous driving system) maps visual features from images collected by onboard cameras to guide potential driving actions with corresponding explanations, using a Transformer block to capture image information.
Trajformer [NeurIPS’20]
Trajformer (Trajformer: Trajectory Prediction with Local Self-Attentive Contexts for Autonomous Driving) uses self-attention to enable better control over representing the agent's social context. Flattened scene patches are combined with positional encodings and fed into a Transformer encoder. The output embeddings are then used to predict trajectories via a normalizing-flow module. The framework is simple and may not be sufficient for complex scenarios.
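A minimal sketch of this kind of patch-based scene encoding is given below; the patch size, model width, and 224x224 input assumption are illustrative, and the normalizing-flow trajectory head is omitted.

```python
import torch
import torch.nn as nn


class PatchSceneEncoder(nn.Module):
    """Flatten scene patches, add positional encodings, and run a Transformer encoder."""
    def __init__(self, patch_size=16, in_ch=3, d_model=128, n_layers=4, n_heads=8):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(in_ch * patch_size * patch_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, 196, d_model))  # up to 14x14 patches (224x224 input)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, img):                                     # img: (B, C, H, W)
        p = self.patch_size
        patches = img.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/p, W/p, p, p)
        patches = patches.flatten(2, 3).flatten(3)               # (B, C, N, p*p)
        patches = patches.permute(0, 2, 1, 3).flatten(2)         # (B, N, C*p*p)
        tokens = self.proj(patches) + self.pos_emb[:, :patches.size(1)]
        return self.encoder(tokens)                              # (B, N, d_model) scene embeddings
```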
NEAT [ICCV’21]
NEAT (NEAT- Neural Attention Fields for End-to-End Autonomous Driving) uses a Transformer to encode the image scene. Moreover, it employs intermediate attention maps to iteratively compress high-dimensional 2D image features into a compact representation.
Interaction Model
InteractionTransformer [IEEE IROS’21]
InteractionTransformer (End-to-end Contextual Perception and Prediction with Interaction Transformer) feeds the ego-agent as the query $\mathbf{Q}$ and all other agents as the key and value $\mathbf{K}, \mathbf{V}$ into cross-attention, where a residual block is applied to obtain the updated contextual feature.
Moreover, the Transformer block also serves as an auto-regressor for recurrent trajectory prediction. Nevertheless, it considers only the ego-agent when computing agent interactions, which may ignore potential collisions caused by the diverse motivations of other agents.
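A hedged sketch of this ego-as-query interaction step could look as follows; the layer sizes and the exact residual design are assumptions.

```python
import torch
import torch.nn as nn


class EgoInteraction(nn.Module):
    """Ego feature attends over all other agents; a residual block updates the ego context."""
    def __init__(self, d_model=128, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.residual = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, ego_feat, other_feats):
        # ego_feat: (B, 1, d) query; other_feats: (B, N, d) keys/values
        ctx, _ = self.cross_attn(ego_feat, other_feats, other_feats)
        return self.norm(ego_feat + self.residual(ctx))          # updated contextual feature
```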
Multi-modal Transformer
Multi-modal Transformer (Multi-modal Motion Prediction with Transformer-based Neural Network for Autonomous Driving) encompasses two cross-attention Transformer layers for modeling agent interactions and map attention respectively. The agent-agent encoder models the relationship among interacting agents, and the map-agent encoder models the relationship between the target agent (carrying the interaction feature) and the waypoints on the map.
During training, it uses a Minimum over N (MoN) loss, which only computes the loss between the ground truth and the closest prediction among the different attention heads, to guarantee diversity of the predicted trajectories. However, mode collapse can still occur because of the shared intermediate representation from the target-agent feature. There is no public code archive.
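A minimal sketch of a Minimum-over-N loss under these assumptions (N candidate trajectories per sample, mean squared error as the distance) is shown below.

```python
import torch


def mon_loss(preds, gt):
    """preds: (B, N_modes, T, 2) candidate trajectories; gt: (B, T, 2) ground truth."""
    dist = ((preds - gt.unsqueeze(1)) ** 2).sum(-1).mean(-1)    # (B, N_modes) per-mode error
    best = dist.min(dim=1).values                               # only the closest mode is penalized
    return best.mean()
```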
S2TNet [ACML’21]
S2TNet (S2TNet- Spatio-Temporal Transformer Networks for Trajectory Prediction in Autonomous Driving) leverages the encoder's representation of history features and the decoder to obtain refined output spatio-temporal features, and further generates future trajectories by passing them to a trajectory generator. It interleaves spatial self-attention and TCN in a single spatio-temporal Transformer.
SceneTransformer [ICLR’22]
SceneTransformer (Scene Transformer - A Unified Architecture for Predicting Multiple Agent Trajectories) unifies (conditional) motion prediction and goal-conditioned motion prediction via different agent/time masking strategies. Experiments show that the model benefits from jointly learning different tasks, similar to the auxiliary tasks in TransFuser.
Moreover, various kinds of attention, such as self-attention along the time and agent axes and cross-attention from agents to the roadgraph, are applied alternately to capture interactions between different modalities.
However, it treats the time and agent axes separately, so it may fail to capture joint spatio-temporal interactions.
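The masking idea can be illustrated with a small sketch: a visibility mask over the agent and time axes decides which entries are observed and which must be predicted (the shapes and mask convention are assumptions).

```python
import torch


def build_mask(n_agents, n_past, n_future, conditioned_agents=()):
    """Return an (n_agents, n_past + n_future) visibility mask: 1 = observed, 0 = to predict."""
    mask = torch.zeros(n_agents, n_past + n_future)
    mask[:, :n_past] = 1                       # all past steps are observed
    for a in conditioned_agents:               # goal-/trajectory-conditioned agents
        mask[a, n_past:] = 1                   # their future is also revealed
    return mask


# Joint motion prediction: no agent's future is revealed.
mp_mask = build_mask(n_agents=4, n_past=10, n_future=30)
# Conditional prediction: agent 0's future (e.g. the AV's planned trajectory) is revealed.
cond_mask = build_mask(n_agents=4, n_past=10, n_future=30, conditioned_agents=(0,))
```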
LTP [CVPR’22]
LTP (Lane-based Trajectory Prediction for Autonomous Driving) introduces a multi-layer attention network to capture the global interaction between agents and lane segments. LTP exploits sliced lane segments as fine-grained, shareable, and interpretable proposals. Moreover, a variance-based non-maximum suppression strategy is proposed to select representative trajectories and ensure the diversity of the final output.
AgentFormer [ICCV’21]
AgentFormer (AgentFormer- Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting) simultaneously models the time and social dimensions. The model leverages a sequence representation of multi-agent trajectories by flattening trajectory features across time and agents.
To guarantee multi-modality in trajectory prediction, AgentFormer views the task as
$$p(\mathbf{Y}\mid\mathbf{X})=\int p(\mathbf{Y}\mid\mathbf{Z},\mathbf{X})\,p(\mathbf{Z}\mid\mathbf{X})\,d\mathbf{Z}$$
where $\mathbf{Z}$ is a latent variable representing the different modes. Based on a CVAE (conditional variational auto-encoder), it designs a multi-agent trajectory prediction framework. However, the CVAE may be unstable during training.
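A minimal CVAE-style sampling head, with illustrative names and dimensions rather than AgentFormer's actual implementation, might look like this.

```python
import torch
import torch.nn as nn


class CVAETrajectoryHead(nn.Module):
    """Sample a latent code z from a context-conditioned prior and decode one trajectory per sample."""
    def __init__(self, d_ctx=128, d_z=32, horizon=12):
        super().__init__()
        self.prior = nn.Linear(d_ctx, 2 * d_z)           # mean and log-variance of p(Z | X)
        self.decoder = nn.Sequential(
            nn.Linear(d_ctx + d_z, 256), nn.ReLU(), nn.Linear(256, horizon * 2)
        )
        self.horizon = horizon

    def forward(self, ctx, n_samples=6):                 # ctx: (B, d_ctx) encoded context
        mu, logvar = self.prior(ctx).chunk(2, dim=-1)
        trajs = []
        for _ in range(n_samples):
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized sample
            out = self.decoder(torch.cat([ctx, z], dim=-1))
            trajs.append(out.view(-1, self.horizon, 2))
        return torch.stack(trajs, dim=1)                 # (B, n_samples, horizon, 2)
```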
Multi-future Transformer [IEEE TITS’22]
Multi-future Transformer (Multi-future Transformer: Learning Diverse Interaction Modes for Behaviour Prediction in Autonomous Driving) decodes agent features from graph context features like other cross-domain methods. Moreover, a local attention mechanism is incorporated for computational efficiency.
In addition, it trains $K$ parallel interaction modules with different parameter sets. The Huber loss is also used to smooth training. No public code archive.
Trajectory Decoder
Gatformer [IEEE TITS’22]
Gatformer (Trajectory Prediction for Autonomous Driving Using Spatial-Temporal Graph Attention Transformer) proposes a Graph Attention Transformer (Gatformer) in which a traffic scene is represented as a sparse graph. The GAT passes messages (features) between an agent and its surrounding agents, using the attention mechanism to guide information propagation.
Moreover, like mmTransformer, Gatformer takes trajectory observations as input to produce future waypoints. However, it relies on BEV to make decisions, which may be an obstacle for offline inference. No public code archive.
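For illustration, a single-head graph-attention layer of the kind described above can be sketched as follows (self-loops in the adjacency matrix are assumed so that every agent has at least one neighbor).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """Each agent aggregates neighbor features with learned attention over a sparse graph."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.a = nn.Linear(2 * d_out, 1, bias=False)

    def forward(self, x, adj):
        # x: (N, d_in) agent features; adj: (N, N) 0/1 adjacency with self-loops
        h = self.W(x)                                              # (N, d_out)
        N = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(-1, N, -1),
                           h.unsqueeze(0).expand(N, -1, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))                # (N, N) raw attention scores
        e = e.masked_fill(adj == 0, float('-inf'))                 # attend only to graph neighbors
        alpha = torch.softmax(e, dim=-1)
        return alpha @ h                                           # aggregated messages, (N, d_out)
```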
mmTransformer [CVPR’21]
mmTransformer (Multimodal Motion Prediction with Stacked Transformers) uses stacked Transformers as the backbone, which aggregate contextual information onto fixed trajectory proposals.
The mmTransformer block can be viewed as a translator that translates surrounding information into candidate trajectories through the cross-attention mechanism. However, since the trajectory proposals are generated by k-means over the training data, this may lead to poor generalization under distribution shift.
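Below is a sketch of how such fixed proposals could be built offline with k-means over training trajectories; the data and the number of clusters are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical training set: (num_samples, T, 2) future trajectories.
train_trajs = np.random.randn(10000, 30, 2).cumsum(axis=1)

K = 36                                                   # number of fixed trajectory proposals
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0)
kmeans.fit(train_trajs.reshape(len(train_trajs), -1))    # cluster flattened waypoints
proposals = kmeans.cluster_centers_.reshape(K, 30, 2)    # (K, T, 2) fixed proposals

# The proposals stay fixed at inference time, so a distribution shift in the
# deployment domain is not reflected in them, which is the concern noted above.
```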
InterFuser [CoRL’22]
InterFuser (Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer) outputs intermediate interpretable features along with the generated actions. Tokens from different sensors are fused in the Transformer encoder. Three types of queries are then fed into the Transformer decoder to predict waypoints, object density maps, and traffic rules respectively.
SceneRepTransformer
SceneRepTransformer (Augmenting Reinforcement Learning with Transformer-based Scene Representation Learning for Decision-making of Autonomous Driving) implicitly models agent history interactions through a dynamics-level Transformer, and the following cross-modality level models the interactions of neighboring agents. Moreover, the ego-car representation is concatenated with those of the other agents and the environment for prediction.
At the prediction head, a Transformer decoder outputs the next step in an auto-regressive way for trajectory prediction.
Moreover, SceneRepTransformer designs a SimSiam-style representation learning method to combat error accumulation during the iterative process.
Others
TransFollower
TransFollower (TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer) learns to build a connection between historical driving and the future lead-vehicle speed through cross-attention between the encoder and decoder. However, a simple PID controller may yield more effective and more explainable results.
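For reference, a minimal longitudinal PID speed controller of the kind alluded to above might look like this; the gains are illustrative.

```python
class PIDController:
    """Track a target speed with a proportional-integral-derivative control law."""
    def __init__(self, kp=0.5, ki=0.05, kd=0.1, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target_speed, current_speed):
        error = target_speed - current_speed
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        # Positive output -> throttle, negative -> brake (sign convention is an assumption).
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```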
ADAPT [ICLR’23]
ADAPT (ADAPT: Action-aware Driving Caption Transformer) provides user-friendly natural-language narrations and reasoning for each decision-making step of autonomous vehicular control and action. It uses a vision-language cross-modal Transformer with special tokens to generate narration and reasoning for driving actions.
ACT‐Net [The Journal of Supercomputing’22]
ACT‐Net (Driver attention prediction based on convolution and transformers) proposes a novel attention prediction method based on CNNs and Transformers, termed ACT-Net. In particular, a CNN and a Transformer are combined into a block that is further stacked to form the deep model. Through this design, both local and long-range dependencies are captured, both of which are crucial for driver attention prediction.
RGB→BEV
Encoder
TransImage2Map [ICRA’22]
TransImage2Map (Translating Images into Maps) notices that, regardless of the depth of an image pixel, there is a one-to-one correspondence between a vertical scanline in the image (an image column) and a polar ray passing through the camera location in a BEV map.
Using axial cross-attention Transformers along the column direction and convolution along the row direction saves computation significantly.
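A hedged sketch of column-wise cross-attention is given below: learnable polar-ray queries attend only within their corresponding image column, so the attention cost scales with the column height rather than the full image (shapes and the query design are assumptions).

```python
import torch
import torch.nn as nn


class ColumnCrossAttention(nn.Module):
    """Map each image column to a polar BEV ray via per-column cross-attention."""
    def __init__(self, d_model=128, n_heads=4, ray_len=64):
        super().__init__()
        self.ray_queries = nn.Parameter(torch.randn(ray_len, d_model))  # one query per BEV ray cell
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, img_feat):
        # img_feat: (B, d, H, W); each of the W columns corresponds to one polar ray.
        B, d, H, W = img_feat.shape
        cols = img_feat.permute(0, 3, 2, 1).reshape(B * W, H, d)        # per-column key/value tokens
        q = self.ray_queries.unsqueeze(0).expand(B * W, -1, -1)         # (B*W, R, d)
        rays, _ = self.attn(q, cols, cols)                              # (B*W, R, d)
        return rays.view(B, W, -1, d)                                   # (B, W, R, d) polar BEV features
```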
TransFuser [PAMI'22]
TransFuser (TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving) uses Transformer blocks to fuse features at different scales between the RGB image and the LiDAR BEV, which are combined via element-wise summation.
Moreover, four auxiliary loss functions, covering depth prediction, semantic segmentation, HD map prediction, and vehicle detection, are incorporated. Different learning tasks can reinforce the original task by incorporating prior information.
However, image scalability is limited due to the high computational complexity of standard global attention, and the features from multiple modalities are simply stacked as input sequences without any prior conditions. Furthermore, the framework is hard to extend to a coupled network design between different sensors/modalities. It also does not consider temporal information when predicting future trajectories.
NMP
NMP (Neural Map Prior for Autonomous Driving) uses a neural representation of global maps that facilitates automatic global map updates and improves local map inference performance. It is an online way to update the local semantic map and to reinforce autonomous driving under poor conditions via prior information.
Moreover, it dynamically integrates prior information through a GRU network. Like other cross-attention designs, C2P attention is used to generate new BEV features according to past records.
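A simplified sketch of fusing the current BEV feature with a stored prior through a per-cell GRU update is shown below; this is an assumption about the mechanism, not NMP's exact implementation.

```python
import torch
import torch.nn as nn


class PriorFusion(nn.Module):
    """Treat the stored map prior as the GRU hidden state and the current BEV feature as input."""
    def __init__(self, d_model=128):
        super().__init__()
        self.gru = nn.GRUCell(d_model, d_model)

    def forward(self, current_bev, prior_bev):
        # current_bev, prior_bev: (B, d, H, W) BEV feature maps.
        B, d, H, W = current_bev.shape
        x = current_bev.permute(0, 2, 3, 1).reshape(-1, d)      # per-cell observations
        h = prior_bev.permute(0, 2, 3, 1).reshape(-1, d)        # per-cell prior state
        updated = self.gru(x, h)                                # fuse observation with prior
        return updated.view(B, H, W, d).permute(0, 3, 1, 2)     # updated prior, (B, d, H, W)
```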
PYVA [CVPR’21]
PYVA (Projecting Your View Attentively- Monocular Road Scene Layout Estimation via Cross-view Transformation) is among the first to explicitly mention that a cross-attention decoder can be used for view transformation to lift image features to BEV space. Similar to earlier monocular BEV perception work, PYVA performs road layout and vehicle segmentation on the transformed BEV features.
Moreover, its cross-view transformation (CVT) module uses a cross-view correlation scheme that explicitly correlates the features of the two views to produce an attention map W that strengthens X′, as well as a feature selection scheme that extracts the most relevant information from X′′.
Decoder
STSU [ICCV’21]
STSU (Structured Bird’s-Eye-View Traffic Scene Understanding from Onboard Images) uses sparse queries for object detection, following the practice of DETR. STSU detects not only dynamic objects but also static road layouts.
STSU uses two sets of query vectors, one set for centerlines and one for objects. What is most interesting is its prediction of the structured road layout. The lane branch includes several prediction heads.
BEVFormer [ECCV’22]
BEVFormer has six encoder layers, each of which follows the conventional structure of Transformers except for three tailored designs: BEV queries, spatial cross-attention, and temporal self-attention. Specifically, BEV queries are grid-shaped learnable parameters designed to query features in BEV space from multi-camera views via attention mechanisms.
3D detection and segmentation tasks can then be performed on top of the trained BEV intermediate features.
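A much-simplified, non-deformable sketch of BEV queries cross-attending to multi-camera tokens is given below; BEVFormer itself uses deformable attention, which is replaced here with standard attention for clarity.

```python
import torch
import torch.nn as nn


class SimpleBEVQueryLayer(nn.Module):
    """Grid-shaped learnable BEV queries attend to the previous BEV (temporal) and camera tokens (spatial)."""
    def __init__(self, bev_h=50, bev_w=50, d_model=256, n_heads=8):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, d_model))
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, cam_tokens, prev_bev=None):
        # cam_tokens: (B, N_cams * H * W, d) flattened multi-camera image features.
        B = cam_tokens.size(0)
        q = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        if prev_bev is not None:                               # temporal self-attention (simplified):
            q, _ = self.temporal_attn(q, prev_bev, prev_bev)   # queries attend to the previous BEV
        else:
            q, _ = self.temporal_attn(q, q, q)
        bev, _ = self.spatial_attn(q, cam_tokens, cam_tokens)  # spatial cross-attention
        return bev                                             # (B, bev_h * bev_w, d) BEV features
```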