
LLM Note

十里清风 2025-02-21 12:01:02

PreNorm vs PostNorm

A Transformer layer contains two residual connections: the sublayer input x is added to the output of the Self-Attention sublayer, and again to the output of the MLP/FFN sublayer.

Pre-Norm: normalization is applied before the residual add, i.e. the input to the Self-Attention/MLP sublayer is normalized, and the sublayer output is then added to the original (unnormalized) input.
Post-Norm: normalization is applied after the residual add, i.e. the sublayer input and the Self-Attention/MLP output are summed first, and the sum is then normalized.
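
A minimal sketch of the two orderings for a single sublayer (illustrative PyTorch, not taken from any particular model):

import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-Norm: normalize the sublayer input; the residual path stays untouched."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))

class PostNormBlock(nn.Module):
    """Post-Norm: add the residual first, then normalize the sum (original Transformer)."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))

x = torch.randn(2, 8, 64)
ffn = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
print(PreNormBlock(64, ffn)(x).shape, PostNormBlock(64, ffn)(x).shape)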

Qwen2's DecoderLayer implementation (a Pre-Norm layer):

class Qwen2DecoderLayer(nn.Module):
    def __init__(self, config: Qwen2Config, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.self_attn = Qwen2Attention(config=config, layer_idx=layer_idx)
        self.mlp = Qwen2MLP(config)
        self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        if config.sliding_window and config._attn_implementation != "flash_attention_2":
            logger.warning_once(
                f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; "
                "unexpected results may be encountered."
            )

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        cache_position: Optional[torch.LongTensor] = None,
        position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,  # necessary, but kept here for BC
        **kwargs: Unpack[FlashAttentionKwargs],
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
            cache_position=cache_position,
            position_embeddings=position_embeddings,
            **kwargs,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)
        if output_attentions:
            outputs += (self_attn_weights,)

        return outputs

Grouped-Query Attention (GQA)

GQA uses fewer key/value heads than query heads, which removes parameters from the K/V projections. Two ways to compensate for the lost parameters (see the sketch after this list):

  • widen the FFN layer
  • deepen the network
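
As a rough back-of-the-envelope illustration (my own sketch, using the dimensions of the example below: hidden size 5120, 40 query heads, 8 KV heads, and assuming a gated three-matrix FFN), the per-layer K/V projection parameters that GQA saves and the extra FFN width they would roughly pay for:

hsz, qhead, khead = 5120, 40, 8
head_dim = hsz // qhead

# K and V projections, ignoring biases
mha_kv = 2 * hsz * (qhead * head_dim)   # MHA: one K and one V head per query head
gqa_kv = 2 * hsz * (khead * head_dim)   # GQA: only 8 shared K/V heads
saved = mha_kv - gqa_kv
print(f'KV params saved per layer: {saved:,}')        # 41,943,040

# a gated FFN (gate/up/down projections) costs about 3 * hsz parameters per unit
# of intermediate width, so the savings buy roughly this much extra FFN width:
print(f'extra FFN width it buys:   {saved // (3 * hsz):,}')  # 2,730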

Source: https://github.com/fkodom/grouped-query-attention-pytorch/blob/main/grouped_query_attention_pytorch/attention.py

import torch
from einops import einsum, rearrange

# Initialization
bsz = 2
qlen = 128
kv_cache_len = 32
hsz = 5120
qhead = 40
khead = 8

head_dim = hsz // qhead
num_head_groups = qhead // khead

q_proj = torch.nn.Linear(hsz, qhead * head_dim)
k_proj = torch.nn.Linear(hsz, khead * head_dim)
v_proj = torch.nn.Linear(hsz, khead * head_dim)
o_proj = torch.nn.Linear(qhead * head_dim, hsz)

x = torch.randn((bsz, qlen, hsz))

# position from kv_cache_len to kv_cache_len + qlen - 1
q = q_proj(x)
k = k_proj(x)
v = v_proj(x)

# position from 0 to kv_cache_len - 1
k_cache = torch.randn(bsz, kv_cache_len, khead * head_dim)
v_cache = torch.randn(bsz, kv_cache_len, khead * head_dim)

# expand kv cache
k = torch.concat((k_cache, k), 1)
v = torch.concat((v_cache, v), 1)
print('shape after concat kv cache:', q.size(), k.size(), v.size())
# torch.Size([2, 128, 5120]) torch.Size([2, 160, 1024]) torch.Size([2, 160, 1024])

# NOTE: this differs from the linked source, which uses 'b n (h g d) -> b g h n d';
# the extra transpose there does not seem necessary(?)
q = rearrange(q, 'b n (g h d) -> b g h n d', g=num_head_groups, h=khead, d=head_dim)
k = rearrange(k, 'b s (h d) -> b h s d', h=khead, d=head_dim)
v = rearrange(v, 'b s (h d) -> b h s d', h=khead, d=head_dim)
print('shape after reshape:', q.size(), k.size(), v.size())
# torch.Size([2, 5, 8, 128, 128]) torch.Size([2, 8, 160, 128]) torch.Size([2, 8, 160, 128])

scores = einsum(q, k, 'b g h n d, b h s d -> b g h n s') / head_dim ** 0.5  # scale by sqrt(head_dim) as in standard attention
attention = scores.softmax(-1)
print('attention shape:', attention.size())
# torch.Size([2, 5, 8, 128, 160])

out = einsum(attention, v, 'b g h n s, b h s d -> b g h n d')
out = rearrange(out, 'b g h n d -> b n (g h d)')
print('out shape:', out.size())
# torch.Size([2, 128, 5120])

out = o_proj(out)  # output projection back to the hidden size (defined above, otherwise unused)
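
Regarding the note above about 'b n (g h d)' versus 'b n (h g d)': both orderings implement a valid GQA and differ only in which query heads share which KV head, which matters only when loading pretrained weights, whose convention has to be matched. A minimal self-contained sanity check (my own sketch, smaller dimensions, using the '(g h)' split from the code above) that the grouped einsum matches ordinary multi-head attention with the KV heads explicitly tiled:

import torch
from einops import einsum, rearrange

torch.manual_seed(0)
bsz, qlen, klen, qhead, khead, head_dim = 2, 4, 6, 8, 2, 16
g = qhead // khead

q = torch.randn(bsz, qhead, qlen, head_dim)
k = torch.randn(bsz, khead, klen, head_dim)
v = torch.randn(bsz, khead, klen, head_dim)

# grouped computation, '(g h)' convention: query head i uses KV head i % khead
qg = rearrange(q, 'b (g h) n d -> b g h n d', g=g)
scores = einsum(qg, k, 'b g h n d, b h s d -> b g h n s') / head_dim ** 0.5
out = einsum(scores.softmax(-1), v, 'b g h n s, b h s d -> b g h n d')
out = rearrange(out, 'b g h n d -> b (g h) n d')

# reference: tile each KV head g times (same i % khead mapping), then plain MHA
k_rep, v_rep = k.repeat(1, g, 1, 1), v.repeat(1, g, 1, 1)
ref = (q @ k_rep.transpose(-1, -2) / head_dim ** 0.5).softmax(-1) @ v_rep

print(torch.allclose(out, ref, atol=1e-6))  # True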

Scaling Law

Compute vs. batch size and learning rate: as compute grows, the optimal batch size increases while the optimal learning rate decreases.

If the batch size increases, shouldn't the optimal learning rate increase as well? Why does it decrease?

From the noise perspective: with small batches the gradient noise is large, so a larger learning rate helps escape local optima and improves generalization. With large batches the noise is small and the gradient direction is more accurate, so a smaller learning rate helps keep training stable.

From the linear-scaling perspective: once the batch size is large enough, the gradient direction is already close to the true gradient (approaching full-batch gradient descent), and the learning rate should then be adjusted more conservatively (e.g. scaled with the square root of the batch size) rather than linearly.
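
A small sketch contrasting the two heuristics mentioned above, linear versus square-root learning-rate scaling with batch size (the base values are made up for illustration, not from this post):

base_lr = 3e-4    # hypothetical learning rate tuned at the reference batch size
base_bsz = 256    # hypothetical reference batch size

def linear_scaled_lr(bsz: int) -> float:
    """Linear scaling rule: lr grows proportionally with batch size."""
    return base_lr * bsz / base_bsz

def sqrt_scaled_lr(bsz: int) -> float:
    """Square-root scaling rule: more conservative once gradient noise is small."""
    return base_lr * (bsz / base_bsz) ** 0.5

for bsz in (256, 1024, 4096, 16384):
    print(f'bsz={bsz:6d}  linear={linear_scaled_lr(bsz):.1e}  sqrt={sqrt_scaled_lr(bsz):.1e}')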

A few things to know about the Qwen model family

  • SiLU/SwiGLU activation function
  • RMSNorm (root-mean-square normalization)
  • Pre-Norm (normalization applied before each sublayer)
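
For reference, a minimal PyTorch sketch of RMSNorm and a SwiGLU-style gated FFN (simplified, not the exact Qwen implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS of the features, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFFN(nn.Module):
    """Gated FFN: SiLU(gate_proj(x)) * up_proj(x), projected back down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

x = torch.randn(2, 16, 64)
print(SwiGLUFFN(64, 172)(RMSNorm(64)(x)).shape)  # torch.Size([2, 16, 64])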