NLP BERT Model Code Study Review


Blaze_블즈 2023. 9. 16. 11:22

Hello, this is Blaze from Blaze Technote.

In the previous post, we looked at the BERT model.

BERT is a model released by Google that drew attention for its strong performance on natural language processing tasks.

https://blazetechnote.tistory.com/35

NLP: Understanding the BERT Model (1), Starting from the Transformer


Today, let's study the BERT model code on GitHub.

https://github.com/google-research/bert

This is modeling.py, the core of the repository.

First, I drew a diagram of the overall structure of modeling.py.

This is the part that defines the BertConfig class.

BertConfig holds the configuration and hyperparameters of the BERT model.

class BertConfig(object):
  """Configuration for `BertModel`."""

  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    """Constructs BertConfig.

    Args:
      vocab_size: Vocabulary size of `inputs_ids` in `BertModel`.
      hidden_size: Size of the encoder layers and the pooler layer.
      num_hidden_layers: Number of hidden layers in the Transformer encoder.
      num_attention_heads: Number of attention heads for each attention layer in
        the Transformer encoder.
      intermediate_size: The size of the "intermediate" (i.e., feed-forward)
        layer in the Transformer encoder.
      hidden_act: The non-linear activation function (function or string) in the
        encoder and pooler.
      hidden_dropout_prob: The dropout probability for all fully connected
        layers in the embeddings, encoder, and pooler.
      attention_probs_dropout_prob: The dropout ratio for the attention
        probabilities.
      max_position_embeddings: The maximum sequence length that this model might
        ever be used with. Typically set this to something large just in case
        (e.g., 512 or 1024 or 2048).
      type_vocab_size: The vocabulary size of the `token_type_ids` passed into
        `BertModel`.
      initializer_range: The stdev of the truncated_normal_initializer for
        initializing all weight matrices.
    """
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

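vocab_size: The vocabulary size of the input_ids passed to the model, that is, the number of distinct token ids the model can embed. It has no default and must be specified.
hidden_size: The size of the encoder layers and the pooler layer. The default is 768.
num_hidden_layers: The number of hidden layers in the Transformer encoder. The default is 12.
num_attention_heads: The number of attention heads in each attention layer of the Transformer encoder. The default is 12.
intermediate_size: The size of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder. The default is 3072.
hidden_act: The non-linear activation function used in the encoder and pooler. The default is "gelu".
hidden_dropout_prob: The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. The default is 0.1.
attention_probs_dropout_prob: The dropout ratio for the attention probabilities. The default is 0.1.
max_position_embeddings: The maximum sequence length the model can be used with. It is usually set to a large value (e.g., 512, 1024, or 2048) just in case longer sequences are needed later. The default is 512.
type_vocab_size: The vocabulary size of the token_type_ids passed into BertModel. The default is 16.
initializer_range: The standard deviation of the truncated_normal_initializer used to initialize all weight matrices. The default is 0.02.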
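
As a rough sketch of how the size parameters fit together, the snippet below computes the shapes of the three embedding tables that modeling.py builds from a config. It is plain Python, not the TensorFlow code. `SizeConfigSketch` and `embedding_table_shapes` are hypothetical names I made up for illustration, and vocab_size=32000 and hidden_size=512 are borrowed from the example in the BertModel docstring, not from a released checkpoint.

```python
from dataclasses import dataclass


@dataclass
class SizeConfigSketch:
    """Hypothetical stand-in for BertConfig; holds only the size fields."""
    vocab_size: int
    hidden_size: int = 768
    max_position_embeddings: int = 512
    type_vocab_size: int = 16


def embedding_table_shapes(config):
    """Return the [rows, columns] shape of each embedding table."""
    return {
        "word_embeddings": [config.vocab_size, config.hidden_size],
        "position_embeddings": [config.max_position_embeddings,
                                config.hidden_size],
        "token_type_embeddings": [config.type_vocab_size, config.hidden_size],
    }


# Values borrowed from the example usage in the BertModel docstring.
config = SizeConfigSketch(vocab_size=32000, hidden_size=512)
shapes = embedding_table_shapes(config)

# Each table holds rows * columns trainable values.
total_params = sum(rows * cols for rows, cols in shapes.values())

for name, shape in shapes.items():
    print(name, shape)
print("embedding parameters:", total_params)
```

The word embedding table dominates the count, which is why vocab_size is the one parameter that has no default.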

To understand BertModel well, it helps to go through it in order, starting with the embeddings.

BERT applies three embeddings,

and embedding_lookup() is the function that implements one of them, the word embedding.

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.gather()`.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  #
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  flat_input_ids = tf.reshape(input_ids, [-1])
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.gather(embedding_table, flat_input_ids)

  input_shape = get_shape_list(input_ids)

  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)

First, it creates the embedding table.

The shape of the embedding table is [vocab_size, embedding_size].

Here is an example.

I used four embedding vectors of size 4.

If the words I, teacher, love, and some are each represented by a vector of size 4,

the shape of embedding_table is [vocab_size, embedding_size] = [4, 4].

Next, it looks up the embedding vector for each id in input_ids

and returns both output and embedding_table.
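
To make the lookup concrete, here is a minimal pure-Python sketch of the two paths in embedding_lookup(): picking rows by index (the tf.gather path) and multiplying one-hot vectors by the table (the use_one_hot_embeddings path). The table values and the helper names `lookup_by_gather` and `lookup_by_one_hot` are made up for illustration; the point is only that both paths return the same rows.

```python
# Toy embedding table: vocab_size=4, embedding_size=4 (values are made up).
vocab = ["I", "teacher", "love", "some"]
embedding_table = [
    [0.1, 0.2, 0.3, 0.4],  # id 0: "I"
    [0.5, 0.6, 0.7, 0.8],  # id 1: "teacher"
    [0.9, 1.0, 1.1, 1.2],  # id 2: "love"
    [1.3, 1.4, 1.5, 1.6],  # id 3: "some"
]


def lookup_by_gather(table, ids):
    """Pick one row per id, as tf.gather does for a flat list of ids."""
    return [list(table[i]) for i in ids]


def lookup_by_one_hot(table, ids):
    """Build one-hot rows and multiply them by the table."""
    vocab_size = len(table)
    width = len(table[0])
    one_hot = [[1.0 if k == i else 0.0 for k in range(vocab_size)]
               for i in ids]
    return [
        [sum(row[k] * table[k][d] for k in range(vocab_size))
         for d in range(width)]
        for row in one_hot
    ]


input_ids = [0, 2, 3]  # "I", "love", "some"
gathered = lookup_by_gather(embedding_table, input_ids)
multiplied = lookup_by_one_hot(embedding_table, input_ids)

for token_id, vector in zip(input_ids, gathered):
    print(vocab[token_id], vector)
```

Each one-hot row has a single 1.0, so the matrix product just copies the matching row of the table.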

Next comes embedding_postprocessor().

Of BERT's embeddings, this function implements the position embedding and the token type embedding, and then applies layer normalization and dropout to the result.


def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,
      embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      #
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output

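The position embedding plays the same role as the positional encoding in the Transformer, so I won't explain the concept again. One difference is visible in the code: the table here is a learned variable of shape [max_position_embeddings, width] that is sliced to the current seq_length, rather than a fixed sinusoidal function.

For the concept itself, please see the post below.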

https://blazetechnote.tistory.com/21

NLP Transformer, Part 2: Understanding Positional Encoding


For now, it is enough to understand it as adding information about each token's position within the sequence.
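
The slice-and-broadcast step in embedding_postprocessor() can be sketched in plain Python as follows. This is only an illustration with nested lists and made-up values; `add_position_embeddings` is a hypothetical helper, and it raises ValueError where the original code uses tf.assert_less_equal.

```python
def add_position_embeddings(batch, full_position_table):
    """Add row p of the position table to the token at position p.

    batch: nested list of shape [batch_size, seq_length, width].
    full_position_table: nested list of shape
      [max_position_embeddings, width].
    """
    seq_length = len(batch[0])
    if seq_length > len(full_position_table):
        # The original code checks this with tf.assert_less_equal.
        raise ValueError("seq_length exceeds max_position_embeddings")
    # Keep only the first seq_length rows, like tf.slice in the original.
    position_embeddings = full_position_table[:seq_length]
    # The same slice is added to every sequence in the batch (broadcast).
    return [
        [
            [x + p for x, p in zip(token, position_embeddings[pos])]
            for pos, token in enumerate(sequence)
        ]
        for sequence in batch
    ]


# Toy values: max_position_embeddings=4, width=2, batch_size=2, seq_length=3.
full_position_table = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
batch = [
    [[10.0, 10.0], [10.0, 10.0], [10.0, 10.0]],
    [[20.0, 20.0], [20.0, 20.0], [20.0, 20.0]],
]
output = add_position_embeddings(batch, full_position_table)
for sequence in output:
    print(sequence)
```

The fourth row of the table is never touched here because the sequences are only three tokens long, which is the situation the comment in the original code describes.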

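Next, the token type embedding carries the information needed to tell two sentences apart.

In the code, token_type_ids and token_type_table are what handle these token types.

token_type_ids:
- A tensor of shape [batch_size, seq_length].
- It gives the type of each token.
- When two sentences are combined into one input, as in BERT, every token of the first sentence is marked 0 and every token of the second sentence is marked 1.
- Example: for the combined sequence ["I", "love", "you", ".", "Do", "you", "love", "me", "?"], token_type_ids would be [0, 0, 0, 0, 1, 1, 1, 1, 1]. This lets BertModel learn that the two sentences are distinct.
token_type_table:
- Stores an embedding vector for every possible token type.
- For a two-sentence BERT input there are only two types (0 and 1), so only two rows of the table are actually used.
- The shape of the tensor is [token_type_vocab_size, width].
- Here token_type_vocab_size is the number of possible token types (the code above defaults to 16, while a two-sentence input needs only 2), and width is the dimension of the embedding vector.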

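The token type embedding works as follows.

token_type_ids is converted to a one-hot encoding.
The one-hot vectors are multiplied by token_type_table to obtain the embedding for each token's type.
The resulting embeddings are added to the original word embeddings.

This way, the model can tell which sentence or type each token in the input sequence belongs to.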
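
The three steps above can be sketched in plain Python using the two-sentence example. The table values and word embeddings are made up, the table has only two rows for brevity, and `token_type_embeddings` is a hypothetical helper rather than a function from modeling.py.

```python
def token_type_embeddings(token_type_ids, token_type_table):
    """Return one embedding row per token via one-hot times table."""
    vocab_size = len(token_type_table)
    width = len(token_type_table[0])
    result = []
    for type_id in token_type_ids:
        # Step 1: one-hot encode the token type id.
        one_hot = [1.0 if k == type_id else 0.0 for k in range(vocab_size)]
        # Step 2: multiply the one-hot vector by the table.
        result.append(
            [sum(one_hot[k] * token_type_table[k][d]
                 for k in range(vocab_size))
             for d in range(width)])
    return result


tokens = ["I", "love", "you", ".", "Do", "you", "love", "me", "?"]
token_type_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1]

# Made-up values: two token types, width=2.
token_type_table = [[0.5, -0.5], [2.0, 3.0]]
word_embeddings = [[1.0, 1.0] for _ in tokens]

type_embeddings = token_type_embeddings(token_type_ids, token_type_table)

# Step 3: add the token type embedding to each word embedding.
output = [
    [w + t for w, t in zip(word, type_vec)]
    for word, type_vec in zip(word_embeddings, type_embeddings)
]
for token, vector in zip(tokens, output):
    print(token, vector)
```

Every token of the first sentence receives the same offset, and every token of the second sentence receives a different one, which is all the model needs to separate the two segments.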

Next, let's look at the BertModel code.

class BertModel(object):
  """BERT model ("Bidirectional Encoder Representations from Transformers").

  Example usage:

  ```python
  # Already been converted into WordPiece token ids
  input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
  input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
  token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

  config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
    num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)

  model = modeling.BertModel(config=config, is_training=True,
    input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)

  label_embeddings = tf.get_variable(...)
  pooled_output = model.get_pooled_output()
  logits = tf.matmul(pooled_output, label_embeddings)
  ...
  ```
  """

  def __init__(self,
               config,
               is_training,
               input_ids,
               input_mask=None,
               token_type_ids=None,
               use_one_hot_embeddings=False,
               scope=None):
    """Constructor for BertModel.

    Args:
      config: `BertConfig` instance.
      is_training: bool. true for training model, false for eval model. Controls
        whether dropout will be applied.
      input_ids: int32 Tensor of shape [batch_size, seq_length].
      input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
      token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
        embeddings or tf.embedding_lookup() for the word embeddings.
      scope: (optional) variable scope. Defaults to "bert".

    Raises:
      ValueError: The config is invalid or one of the input tensor shapes
        is invalid.
    """
    config = copy.deepcopy(config)
    if not is_training:
      config.hidden_dropout_prob = 0.0
      config.attention_probs_dropout_prob = 0.0

    input_shape = get_shape_list(input_ids, expected_rank=2)
    batch_size = input_shape[0]
    seq_length = input_shape[1]

    if input_mask is None:
      input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

    if token_type_ids is None:
      token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

    with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings"):
        # Perform embedding lookup on the word ids.
        (self.embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)

        # Add positional embeddings and token type embeddings, then layer
        # normalize and perform dropout.
        self.embedding_output = embedding_postprocessor(
            input_tensor=self.embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

      with tf.variable_scope("encoder"):
        # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
        # mask of shape [batch_size, seq_length, seq_length] which is used
        # for the attention scores.
        attention_mask = create_attention_mask_from_input_mask(
            input_ids, input_mask)

        # Run the stacked transformer.
        # `sequence_output` shape = [batch_size, seq_length, hidden_size].
        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)

      self.sequence_output = self.all_encoder_layers[-1]
      # The "pooler" converts the encoded sequence tensor of shape
      # [batch_size, seq_length, hidden_size] to a tensor of shape
      # [batch_size, hidden_size]. This is necessary for segment-level
      # (or segment-pair-level) classification tasks where we need a fixed
      # dimensional representation of the segment.
      with tf.variable_scope("pooler"):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token. We assume that this has been pre-trained
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))

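The BertModel class takes a BertConfig instance as an argument.

With config = copy.deepcopy(config), the constructor deep-copies the config it was given, so the changes it makes to the copy afterwards do not affect the caller's original object.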

if not is_training:
  config.hidden_dropout_prob = 0.0
  config.attention_probs_dropout_prob = 0.0

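Next, when the model is not being trained, dropout is turned off by setting both dropout probabilities to 0.0.

This is because dropout is a technique used during training to prevent overfitting.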
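
Here is a small sketch of why the deep copy matters when dropout is switched off. `DropoutConfigSketch` and `prepare_config` are hypothetical stand-ins that I wrote for illustration; only copy.deepcopy and the two attribute names come from the original code.

```python
import copy


class DropoutConfigSketch:
    """Hypothetical stand-in for BertConfig with only the dropout fields."""

    def __init__(self, hidden_dropout_prob=0.1,
                 attention_probs_dropout_prob=0.1):
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob


def prepare_config(config, is_training):
    """Copy the config, then zero out dropout when not training."""
    config = copy.deepcopy(config)
    if not is_training:
        config.hidden_dropout_prob = 0.0
        config.attention_probs_dropout_prob = 0.0
    return config


# One config object shared by a training model and an eval model.
shared_config = DropoutConfigSketch()
eval_config = prepare_config(shared_config, is_training=False)
train_config = prepare_config(shared_config, is_training=True)

print("eval:", eval_config.hidden_dropout_prob)
print("train:", train_config.hidden_dropout_prob)
print("shared:", shared_config.hidden_dropout_prob)
```

Without the copy, building the eval model first would silently set dropout to 0.0 for any training model created from the same config afterwards.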

input_shape = get_shape_list(input_ids, expected_rank=2)
batch_size = input_shape[0]
seq_length = input_shape[1]

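input_ids is a 2D tensor of shape [batch_size, seq_length]; expected_rank=2 makes get_shape_list check this.

The shape is read with get_shape_list,

and the batch size and sequence length are assigned from it.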
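
As a rough illustration of this step, the sketch below reads the shape of a nested list and then builds the default input_mask (all ones) and token_type_ids (all zeros) that the constructor creates when they are not given. `shape_of` is a hypothetical, simplified stand-in for get_shape_list that only handles rectangular nested lists; the input_ids values are taken from the BertModel docstring.

```python
def shape_of(nested, expected_rank):
    """Return the shape of a rectangular nested list, checking its rank."""
    shape = []
    level = nested
    while isinstance(level, list):
        shape.append(len(level))
        level = level[0]
    if len(shape) != expected_rank:
        raise ValueError(
            "expected rank %d, got rank %d" % (expected_rank, len(shape)))
    return shape


# Same ids as the example in the BertModel docstring.
input_ids = [[31, 51, 99], [15, 5, 0]]

input_shape = shape_of(input_ids, expected_rank=2)
batch_size = input_shape[0]
seq_length = input_shape[1]

# Defaults used when the caller passes None for these arguments.
input_mask = [[1] * seq_length for _ in range(batch_size)]
token_type_ids = [[0] * seq_length for _ in range(batch_size)]

print(batch_size, seq_length)
print(input_mask)
print(token_type_ids)
```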

Next, let's look at the embedding part.

embedding_lookup() and embedding_postprocessor() handle BERT's three embeddings. Both functions were explained above, so I'll move on.