Let me explain how the attention algorithm works with a simple example.
1. Scenario for Attention Application
Let’s say we’re translating the following sentence:
“The cat sat on the mat.”
→ “그 고양이는 매트 위에 앉아 있었다.” (The cat sat on the mat.)
Traditional sequential models (RNN, LSTM) would process the sentence from left to right, but the attention mechanism allows the model to look at all the words at once and focus on the important ones.
2. Attention Logic (Self-Attention)
The key to how attention works lies in three elements: Query (Q), Key (K), and Value (V).
Step 1: Convert the sentence into vectors
First, we need to convert the words into numbers (vectors).
For example:
• “The” → [0.2, 0.8, 0.5]
• “cat” → [0.3, 0.9, 0.7]
• “sat” → [0.6, 0.1, 0.3]
• “on” → [0.4, 0.2, 0.9]
• “the” → [0.2, 0.8, 0.5]
• “mat” → [0.7, 0.3, 0.4]
These vectors give the model a numerical representation of each word that it can compute with.
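If it helps to see this concretely, here is how the toy vectors above might be stored with NumPy. This is only a sketch: the numbers are the made-up example values, not real learned embeddings, which would have hundreds of dimensions.

```python
import numpy as np

# Toy 3-dimensional "embeddings" for each token, copied from the example above.
tokens = ["The", "cat", "sat", "on", "the", "mat"]
X = np.array([
    [0.2, 0.8, 0.5],  # The
    [0.3, 0.9, 0.7],  # cat
    [0.6, 0.1, 0.3],  # sat
    [0.4, 0.2, 0.9],  # on
    [0.2, 0.8, 0.5],  # the
    [0.7, 0.3, 0.4],  # mat
])
```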
Step 2: Create Query, Key, and Value matrices
Now, for each word, we create a Query (Q), a Key (K), and a Value (V) vector; stacked together across all words, these form the Q, K, and V matrices.
• Query (Q): Represents what the current word is looking for in the other words.
• Key (K): Represents what each word offers to be matched against the queries (a kind of label the queries are compared to).
• Value (V): Contains the actual meaning or content the word carries.
These matrices allow the model to assess relationships between words.
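As a minimal sketch, continuing the NumPy snippet above: each of Q, K, and V is just the embedding matrix multiplied by a weight matrix. In a real model these weights are learned; here random numbers stand in for them, purely for illustration.

```python
d_model, d_k = 3, 3  # toy sizes; real models use much larger dimensions

rng = np.random.default_rng(0)
W_q = rng.normal(size=(d_model, d_k))  # learned in practice, random stand-in here
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q  # what each word is "asking for"
K = X @ W_k  # what each word offers to be matched against
V = X @ W_v  # the content each word carries
```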
Step 3: Calculate the attention scores between words
To determine how related one word is to another, we calculate the attention score by taking the dot product of one word’s Query with another word’s Key (in Transformers this score is also divided by the square root of the Key dimension to keep it in a reasonable range).
For example, when translating “sat”:
• The relationship between “sat” and “cat” (Q·K calculation) → high score
• The relationship between “sat” and “mat” (Q·K calculation) → low score
This allows the model to focus more on words that are more relevant (like “cat” in this case).
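Continuing the sketch, the scores for every pair of words come out of a single matrix product. The division by √d_k is the scaling used in standard scaled dot-product attention.

```python
# Raw attention scores: one row per query word, one column per key word.
scores = Q @ K.T / np.sqrt(d_k)

# Row 2 corresponds to "sat": how strongly "sat" attends to every word.
print(scores[2])
```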
Step 4: Apply Softmax to calculate attention weights
Next, we convert the attention scores into probabilities by applying the Softmax function. This step normalizes the attention scores so that they sum to 1 and give weights to each word based on importance.
For example:
Word | Attention Score (Q·K) | Softmax Result (Weight)
cat | 5.2 | 0.45
sat | 6.8 | 0.50
mat | 2.1 | 0.05
(The numbers are illustrative; actual softmax weights are computed directly from the scores.)
Now, when translating “sat,” the model focuses more on “cat” and “sat” while largely ignoring “mat.”
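The softmax step itself is short in code. Continuing the sketch (and keeping in mind that the 0.45 / 0.50 / 0.05 in the table are illustrative rather than an exact softmax output):

```python
def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

weights = softmax(scores)  # each row now sums to 1
print(weights[2])          # attention weights for "sat"
```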
Step 5: Compute the final output using Value (V)
Now, the attention weights (from the Softmax) are used to take a weighted sum of the Value (V) vectors, and that weighted sum is the final output.
In other words,
• “cat” vector * 0.45
• “sat” vector * 0.50
• “mat” vector * 0.05
This process allows the model to combine the important information from the words “cat” and “sat” to generate a more accurate output.
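In the sketch, this final weighted sum is again a single matrix product:

```python
# Each output row is a weighted average of all Value vectors,
# using that word's attention weights.
output = weights @ V
print(output[2])  # context-aware representation of "sat"
```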
3. Translation with Attention Applied
Whereas traditional models would translate word by word, attention looks at the entire sentence and focuses on the most relevant words.
For example:
• “The cat sat on the mat.”
• “그 고양이는 매트 위에 앉아 있었다.” (The cat sat on the mat.)
Here, thanks to the attention mechanism, the model can associate the first “the” with “cat” and the second “the” with “mat.”
4. Summary of Attention
✅ The model looks at all words at once and focuses on the important ones.
✅ It calculates the relationships between words using Query, Key, and Value.
✅ Softmax is used to give higher weights to more important words.
✅ The final output is generated by combining the relevant information.
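Putting all the steps together, here is a self-contained toy version of single-head scaled dot-product self-attention. It is only a sketch: real Transformers use learned projection matrices, multiple heads, masking, and much larger dimensions.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (toy sketch)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v             # Step 2: project embeddings
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # Step 3: relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # Step 4: softmax per row
    return weights @ V                              # Step 5: weighted sum of Values

# Usage with the toy embeddings from the example (random stand-in weights):
X = np.array([[0.2, 0.8, 0.5], [0.3, 0.9, 0.7], [0.6, 0.1, 0.3],
              [0.4, 0.2, 0.9], [0.2, 0.8, 0.5], [0.7, 0.3, 0.4]])
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(3, 3)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v))
```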
Now, is the attention mechanism a bit clearer for you? Let me know if you have any further questions!