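1. Word Embedding

A technique for vectorizing words so that a computer can understand and process them efficiently
A way of representing a word as a dense vector
The result of the word embedding process is called an embedding vector
Well-formed word vectors produced by word embedding support numeric computation and can be used as model input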

1-1. Sparse Representation

In one-hot encoding, only the index of the word being represented holds 1 and every other index holds 0. A vector produced this way, in which most values are 0, is called a sparse vector
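As a minimal sketch of the idea, the snippet below builds a one-hot vector over a toy vocabulary. The vocabulary and the helper name `one_hot` are made up for illustration and are not part of the notebook code later in this post.

```python
def one_hot(word, vocab):
    """Return a one-hot list for `word` over the ordered list `vocab`."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["cat", "dog", "puppy", "door"]
puppy_vector = one_hot("puppy", vocab)
print(puppy_vector)  # [0, 0, 1, 0]
```

The vector length equals the vocabulary size, so every new word adds one more dimension.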

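1-2. Problems with Sparse Vectors

The dimension of a sparse vector grows without bound as the number of words increases, since it equals the vocabulary size
A one-hot vector is so simple that it can only record whether a word is present
When sparse vectors are used to compute similarity between sentences or texts, the result rarely reflects the similarity we want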
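The similarity problem can be seen in a small sketch: any two distinct one-hot vectors are orthogonal, so their cosine similarity is 0 regardless of how related the words are. The three-word vocabulary here is an assumption made for illustration.

```python
import math

def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical vocabulary: "dog" and "puppy" are related, "door" is not.
vocab = ["dog", "puppy", "door"]
dog, puppy, door = (one_hot(i, len(vocab)) for i in range(len(vocab)))

sim_dog_puppy = cosine(dog, puppy)
sim_dog_door = cosine(dog, door)
print(sim_dog_puppy, sim_dog_door)  # 0.0 0.0
```

Both pairs score exactly the same, which is why one-hot vectors cannot express that "dog" is closer to "puppy" than to "door".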

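1-3. Dense Representation

Values are packed densely into the vector's dimensions, rather than being mostly 0
The dimension of every word vector is set to a value chosen by the user
When natural language is encoded into a dense representation, the values are continuous real numbers rather than binary 0s and 1s
Fewer dimensions are needed to represent an item
Generalizes better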

1-4. One-Hot Vectors vs. Word Embeddings

One-hot vector: high-dimensional, sparse vector, values are 0 or 1
Word embedding: low-dimensional, dense vector, values are real numbers
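One way to connect the two representations: multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix, which is why embedding layers are usually implemented as a table lookup. The sketch below uses a made-up 4 x 2 matrix; the values and helper names are illustrative assumptions only.

```python
# Hypothetical 4-word vocabulary embedded in 2 dimensions (values are made up).
embedding_matrix = [
    [0.2, -0.1],   # row 0
    [0.7, 0.3],    # row 1
    [0.6, 0.4],    # row 2
    [-0.5, 0.9],   # row 3
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(one_hot_vector, matrix):
    """Multiply a 1 x V row vector by a V x D matrix."""
    dims = len(matrix[0])
    return [sum(one_hot_vector[i] * matrix[i][d] for i in range(len(matrix)))
            for d in range(dims)]

word_index = 2
via_product = matvec(one_hot(word_index, 4), embedding_matrix)
via_lookup = embedding_matrix[word_index]
print(via_product, via_lookup)
```

Both routes give the same dense vector, but the lookup avoids building the high-dimensional one-hot vector at all.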

1-5. Dimensionality Reduction

A way of converting sparse vectors into dense vectors
Used in machine learning to extract only the important features from high-dimensional data made up of many features and turn it into lower-dimensional data (a matrix)
PCA (Principal Component Analysis), Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), SVD (Singular Value Decomposition)
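To make the idea concrete, here is a rough pure-Python sketch of PCA reduced to its simplest case: finding the first principal component by power iteration and projecting 3-D points onto it. The data points are made up, the function names are hypothetical, and real projects would normally rely on a library implementation rather than this loop. Power iteration also assumes the starting vector is not orthogonal to the top component, which holds for this data.

```python
import math

def first_principal_component(points, iterations=200):
    """Estimate the top principal direction of `points` by power iteration."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Covariance matrix of the centered data (dims x dims).
    cov = [[sum(row[i] * row[j] for row in centered) / n for j in range(dims)]
           for i in range(dims)]
    direction = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(dims))
                   for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in product))
        direction = [x / norm for x in product]
    return means, direction

def project(points, means, direction):
    """Coordinate of each point along `direction` after centering."""
    return [sum((p[d] - means[d]) * direction[d] for d in range(len(direction)))
            for p in points]

# Made-up 3-D points that vary mostly along one diagonal direction.
points = [[1.0, 1.1, 0.9], [2.0, 2.1, 1.9], [3.0, 2.9, 3.1], [4.0, 4.0, 4.0]]
means, direction = first_principal_component(points)
reduced = project(points, means, direction)  # four 3-D points -> four 1-D values
print(direction)
print(reduced)
```

Each point is now described by a single number while keeping most of the variation in the data.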

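2. Major Word Embedding Algorithms

Word embedding maps words from a high-dimensional word space into a low-dimensional vector space
The resulting vectors reflect semantic similarity between words: words with similar meanings lie close together in the vector space
This lets a model understand and learn the meaning of text data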

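2-1. Word2Vec

A word embedding model based on distributed representations, which rest on the distributional hypothesis
- Distributed Representation
  - Distributional hypothesis: words that appear in similar contexts have similar meanings
  - Under the distributional hypothesis, the goal is to represent a word as a vector using the meaning of its context, that is, the surrounding words defined by the window size
  - Vectors in a distributed representation do not need as many dimensions as the vocabulary size, unlike one-hot vectors, so their dimension is comparatively low
  - In a sparse representation each dimension carries independent information, whereas in a dense representation a single dimension carries a blend of several attributes
Builds embeddings by predicting words from a center word and its surrounding words
First released by Google in 2013
Word2Vec has two training schemes: CBOW (Continuous Bag of Words) and Skip-Gram
- CBOW
  - Predicts the word in the middle from the words around it
  - The context words are the n words immediately before and the n words immediately after the target word; this range is called the window, and n is the window size
  - Training only once per sentence would waste data, so a sliding window is used to generate multiple training examples from a single sentence
- Skip-Gram
  - Predicts the surrounding words from the center word
  - Slides the window across center words to generate more training data
  - Because each center word predicts its context words, a window size of n yields up to 2n training pairs per center word
- CBOW vs Skip-Gram
  - Skip-Gram is generally reported to perform better than CBOW because it considers more contexts
  - Skip-Gram updates each word more often, giving the embedding more chances to be adjusted, so it can learn more precise embeddings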
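Example sentence (tokens glossed from Korean in the original word order; roughly "A small, cute puppy is sitting in front of the door"):
```
small cute puppy door in-front sitting
```
```
  * CBOW (window size=2)
      * cute, puppy -> small
      * small, puppy, door -> cute
      * small, cute, door, in-front -> puppy
      * cute, puppy, in-front, sitting -> door
      * puppy, door, sitting -> in-front
      * door, in-front -> sitting
  * Skip-Gram (window size=2)
      * small -> cute, puppy
      * cute -> small, puppy, door
      * puppy -> small, cute, door, in-front
      * door -> cute, puppy, in-front, sitting
      * in-front -> puppy, door, sitting
      * sitting -> door, in-front
```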
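The sliding-window procedure behind the listing above can be sketched in a few lines. This is an illustrative sketch, not the original Word2Vec code: the function names are hypothetical, and the tokens are English glosses of the six words in the example sentence.

```python
def context_window(tokens, position, window_size):
    """Tokens within `window_size` positions of `position`, excluding itself."""
    start = max(0, position - window_size)
    end = min(len(tokens), position + window_size + 1)
    return [tokens[i] for i in range(start, end) if i != position]

def cbow_examples(tokens, window_size):
    """(context words, target word) pairs: the context predicts the center."""
    return [(context_window(tokens, i, window_size), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window_size):
    """(center word, context word) pairs: the center predicts each neighbor."""
    return [(tokens[i], neighbor)
            for i in range(len(tokens))
            for neighbor in context_window(tokens, i, window_size)]

tokens = ["small", "cute", "puppy", "door", "in-front", "sitting"]
cbow = cbow_examples(tokens, window_size=2)
skipgram = skipgram_examples(tokens, window_size=2)

for context, target in cbow:
    print(", ".join(context), "->", target)
print(len(cbow), "CBOW examples,", len(skipgram), "Skip-Gram pairs")
```

For the same sentence and window, CBOW yields one example per position while Skip-Gram yields one pair per (center, neighbor) combination, which is where its larger number of updates per word comes from.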
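Limitations of Word2Vec
- Does not reflect the morphological properties of words (e.g., teach, teacher, and teachers are semantically related, but each is treated as a separate word)
- Heavily affected by word frequency, which makes it hard to embed rare words
- Handling OOV (Out of Vocabulary) words is difficult
- When a new word appears, the model must be retrained on the entire dataset
- The larger the vocabulary, the longer training takes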

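2-2. FastText

A text classification and word vector representation tool developed by Facebook's AI Research team
Designed to run fast on large datasets; usable for both word embedding and text classification
How it works
- < and > are special symbols that mark word boundaries
- A word is first wrapped in < and >, then split from the front into character n-grams of the configured length
  - e.g., "apple" with n=3: ["<ap", "app", "ppl", "ple", "le>"]
- Finally, the whole word wrapped in < and > ("<apple>") is added as one more token to represent the word itself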
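The splitting steps above can be sketched as follows. This is a simplified illustration with a hypothetical helper name; it uses a single n for brevity, whereas the fastText library extracts n-grams over a configurable range of lengths.

```python
def char_ngrams(word, n=3):
    """Character n-grams of `word` wrapped in boundary markers, plus the whole word."""
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    return grams + [wrapped]

apple_grams = char_ngrams("apple", n=3)
print(apple_grams)
# ['<ap', 'app', 'ppl', 'ple', 'le>', '<apple>']
```

The boundary markers matter: "<ap" (start of a word) and "app" (anywhere in a word) are different n-grams and get different vectors.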
Advantages of FastText
- Copes with typos and unknown words
- Copes with words that were infrequent in the vocabulary
- Copes with noise in natural-language corpora
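A rough way to see why subword units help with typos: a misspelled word still shares most of its character n-grams with the correct spelling, so a vector built from n-gram vectors lands near the original. The sketch below measures only that n-gram overlap (Jaccard similarity). It is a proxy for the intuition, not FastText's actual vector computation, and the function names are hypothetical.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of `word` wrapped in < and >."""
    wrapped = f"<{word}>"
    return {wrapped[i:i + n] for i in range(len(wrapped) - n + 1)}

def ngram_overlap(word_a, word_b, n=3):
    """Jaccard overlap between the character n-gram sets of two words."""
    a, b = char_ngrams(word_a, n), char_ngrams(word_b, n)
    return len(a & b) / len(a | b)

typo_overlap = ngram_overlap("teacher", "teachr")       # misspelling
unrelated_overlap = ngram_overlap("teacher", "window")  # unrelated word
print(typo_overlap, unrelated_overlap)
```

A word-level model such as Word2Vec would treat "teachr" as an unknown token with no vector at all.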

3. Building Word Embeddings

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(shuffle=True, random_state=2024, remove=('headers', 'footers', 'quotes'))
dataset = dataset.data

dataset[0]

'Hell, just set up a spark jammer, or some other _very_ electrically-noisy\ndevice. Or build an active Farrady cage around the room, with a "noise"\nsignal piped into it. While these measures will not totally mask the\nemissions of your equipment, they will provide sufficient interference to\nmake remote monitoring a chancy proposition, at best. There is, of course,\nthe consideration that these measures may (and almost cretainly will)\ncause a certain amount of interference in your own systems. It\'s a matter\nof balancing security versus convenience.\n\nBTW, I\'m an ex-Air Force Telecommunications Systems Control Supervisor and\nTelecommunications/Cryptographic Equipment Technician.\n'

len(dataset)

# Build a DataFrame with a single column named 'document'
news_df = pd.DataFrame({'document':dataset})
news_df

	document
0	Hell, just set up a spark jammer, or some othe...
1	\nThank you very much. After reading the text ...
2	Anyone out there have a Sony 1304S?\n\nI have ...
3	\n(deletion)\n \nStraw man. And you brought u...
4	\n: >Hi Netters,\n: >\n: >I'm building a CAD p...
...	...
11309	The DEA and other organizations would have the...
11310	\nThat is not necessarily unorthodox. When Ch...
11311	Melido came off the DL today and will start to...
11312	Archive-name: rec-autos/part1\n\n[most recent ...
11313	Why crawl under the car at all? I have a machi...

11314 rows × 1 columns

# Drop missing values, if any, and print the number of remaining rows
news_df = news_df.dropna().reset_index(drop=True)
len(news_df)

# Remove duplicate documents
processed_news_df = news_df.drop_duplicates(['document']).reset_index(drop=True)
len(processed_news_df)

# Replace special characters with spaces
# Note: the range A-z also matches [ \ ] ^ _ `, so those characters are kept (A-Z would strip them too)
processed_news_df['document'] = processed_news_df['document'].str.replace('[^a-zA-z0-9]', ' ', regex=True)
processed_news_df

	document
0	Hell just set up a spark jammer or some othe...
1	Thank you very much After reading the text s...
2	Anyone out there have a Sony 1304S I have on...
3	deletion Straw man And you brought up l...
4	Hi Netters I m building a CAD pack...
...	...
10989	The DEA and other organizations would have the...
10990	That is not necessarily unorthodox When Chr...
10991	Melido came off the DL today and will start to...
10992	Archive name rec autos part1 [most recent ch...
10993	Why crawl under the car at all I have a machi...

10994 rows × 1 columns
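The character class in the cleaning step uses `A-z`, which spans the ASCII characters between `Z` and `a` as well. A small sketch with Python's `re` module shows the difference from `A-Z`; the sample string is made up for illustration.

```python
import re

sample = "some other _very_ noisy [most recent] device!"

# A-z covers ASCII 65..122, which includes [ \ ] ^ _ ` between 'Z' and 'a'.
kept_with_A_z = re.sub(r"[^a-zA-z0-9]", " ", sample)
# A-Z covers only the uppercase letters.
kept_with_A_Z = re.sub(r"[^a-zA-Z0-9]", " ", sample)

print(kept_with_A_z)
print(kept_with_A_Z)
```

This is why tokens such as `_very_` and `[most` still appear in the outputs of this post.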

# Remove words that are too short (length 2 or less)
processed_news_df['document'] = processed_news_df['document'].apply(lambda x: ' '.join([token for token in x.split() if len(token) > 2]))
processed_news_df

	document
0	Hell just set spark jammer some other _very_ e...
1	Thank you very much After reading the text som...
2	Anyone out there have Sony 1304S have one and ...
3	deletion Straw man And you brought leniency As...
4	Netters building CAD package and need graphics...
...	...
10989	The DEA and other organizations would have the...
10990	That not necessarily unorthodox When Christian...
10991	Melido came off the today and will start tonig...
10992	Archive name rec autos part1 [most recent chan...
10993	Why crawl under the car all have machine got f...

10994 rows × 1 columns

# Keep only documents with at least 100 characters and at least 3 words
processed_news_df = processed_news_df[processed_news_df.document.apply(lambda x: len(str(x)) >= 100 and len(str(x).split()) >= 3)].reset_index(drop=True)
processed_news_df

# Convert to lowercase
processed_news_df['document'] = processed_news_df['document'].apply(lambda x: x.lower())
processed_news_df

	document
0	hell just set spark jammer some other _very_ e...
1	thank you very much after reading the text som...
2	anyone out there have sony 1304s have one and ...
3	deletion straw man and you brought leniency as...
4	netters building cad package and need graphics...
...	...
9984	the dea and other organizations would have the...
9985	that not necessarily unorthodox when christian...
9986	melido came off the today and will start tonig...
9987	archive name rec autos part1 [most recent chan...
9988	why crawl under the car all have machine got f...

9989 rows × 1 columns

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

True

stop_words = stopwords.words('english')
print(len(stop_words))

# Split each document on whitespace, then remove stopwords
tokenized_doc = processed_news_df['document'].apply(lambda x: x.split())
tokenized_doc = tokenized_doc.apply(lambda x: [s_word for s_word in x if s_word not in stop_words])
tokenized_doc

0       [hell, set, spark, jammer, _very_, electricall...
1       [thank, much, reading, text, distinct, questio...
2       [anyone, sony, 1304s, one, nice, however, run,...
3       [deletion, straw, man, brought, leniency, assu...
4       [netters, building, cad, package, need, graphi...
                              ...                        
9984    [dea, organizations, would, american, people, ...
9985    [necessarily, unorthodox, christians, call, go...
9986    [melido, came, today, start, tonight, rangers,...
9987    [archive, name, rec, autos, part1, [most, rece...
9988    [crawl, car, machine, got, boat, pulls, oil, s...
Name: document, Length: 9989, dtype: object
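For reference, the preprocessing steps so far (strip special characters, drop short tokens, lowercase, remove stopwords) can be condensed into one function applied to a single string. This is only a sketch: the stopword set is a tiny hand-picked stand-in for the NLTK list, the function name is hypothetical, and the regex uses `A-Z` rather than the `A-z` range used above.

```python
import re

# Hypothetical stand-in for nltk's English stopword list.
STOP_WORDS = {"just", "a", "or", "some", "other", "up"}

def preprocess(text, stop_words, min_token_length=3):
    """Clean one document and return its list of tokens."""
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text)
    tokens = [t.lower() for t in cleaned.split() if len(t) >= min_token_length]
    return [t for t in tokens if t not in stop_words]

tokens = preprocess("Hell, just set up a spark jammer, or some other device.", STOP_WORDS)
print(tokens)  # ['hell', 'set', 'spark', 'jammer', 'device']
```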

len(tokenized_doc)

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(tokenized_doc)

word2idx = tokenizer.word_index
print(word2idx)

{'one': 1, 'would': 2, 'max': 3, 'people': 4, 'like': 5, 'get': 6, 'know': 7, 'also': 8, 'use': 9, 'think': 10, 'time': 11, 'new': 12, 'could': 13, 'well': 14, 'good': 15, 'edu': 16, 'may': 17, 'even': 18, 'two': 19, 'first': 20, 'see': 21, 'many': 22, 'much': 23, 'way': 24, 'make': 25, 'system': 26, 'god': 27, 'used': 28, 'say': 29, 'right': 30, ...

idx2word = {value: key for key, value in word2idx.items()}
print(idx2word)

{1: 'one', 2: 'would', 3: 'max', 4: 'people', 5: 'like', 6: 'get', 7: 'know', 8: 'also', 9: 'use', 10: 'think', 11: 'time', 12: 'new', 13: 'could', 14: 'well', 15: 'good', 16: 'edu', 17: 'may', 18: 'even', 19: 'two', 20: 'first', 21: 'see', 22: 'many', 23: 'much', 24: 'way', 25: 'make', 26: 'system', 27: 'god', 28: 'used', 29:...

encoded =  tokenizer.texts_to_sequences(tokenized_doc)
print(encoded[0])

[527, 95, 8746, 21210, 16483, 21211, 7462, 633, 468, 1738, 45060, 4542, 93, 774, 1820, 1240, 31107, 2784, 1285, 3079, 13712, 1030, 339, 1703, 3243, 25, 1241, 4646, 25104, 4751, 108, 110, 2157, 2784, 17, 336, 45061, 330, 370, 842, 3243, 212, 313, 16484, 344, 3789, 5804, 948, 430, 421, 3437, 212, 170, 11153, 3437, 3335, 1030, 6117]

vocab_size = len(word2idx)
print(f'Vocabulary size: {vocab_size}')

Vocabulary size: 109589
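To show what kind of mapping `word_index` holds, the sketch below builds a frequency-ranked index by hand, with the most frequent word at index 1. It is an approximation for illustration, not the Keras implementation, and the toy documents and names are made up.

```python
from collections import Counter

def build_word_index(tokenized_docs):
    """Map each word to its frequency rank, starting at 1 (0 is left unused)."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_docs = [["spark", "jammer", "device"], ["device", "noise", "device"], ["noise", "room"]]
toy_word2idx = build_word_index(toy_docs)
toy_encoded = [[toy_word2idx[token] for token in doc] for doc in toy_docs]

# A table indexed directly by these ids needs one extra row for the unused index 0.
rows_needed = max(toy_word2idx.values()) + 1

print(toy_word2idx)
print(toy_encoded)
print(rows_needed)
```

Because the ids run from 1 to the number of words, a lookup table addressed by id needs `len(word_index) + 1` rows for the largest id to be valid.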

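# skipgrams generates word pairs from encoded text
# It builds the pairs Skip-gram style and is one component of the Word2Vec algorithm
# Word vectors are learned by predicting the surrounding words from a given word
from tensorflow.keras.preprocessing.sequence import skipgrams

# skipgrams(sequence, vocabulary_size, window_size)
# sequence: an encoded list of word indices
# vocabulary_size: size of the vocabulary (the total number of distinct words)
# window_size: maximum distance between the center word and a context word
skip_grams = [skipgrams(sample, vocabulary_size=vocab_size, window_size=10) for sample in encoded[:5]]
print(f'Total number of samples: {len(skip_grams)}')

Total number of samples: 5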

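# Negative example (0): the second word did not actually appear near the center word in the text
# Positive example (1): the center word and the context word actually appeared together in the text
# pairs: list of word pairs of the form (center word, context word)
# labels: list of labels, one per pair, indicating whether the pair actually occurs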
pairs, labels = skip_grams[0][0], skip_grams[0][1]
print(pairs)
print(labels)

[[1285, 468], [5804, 3243], [4646, 2784], [1738, 96783], [842, 27621], [468, 93], [774, 88304], [1240, 1030], [212, 421], [16483, 26264], [170, 3437], [95, 87130], [16483, 633], [170, 62347], [4751, 25], [7462, 1820], [430, 212], [344, 103671], [1703, 83361], [330, 5804], [93, 2784], [339, 1240], [13712, 1703], [5804, 370], [45061, 15087], [344, 9391], [336, 22663], [7462, 8334], [330, 336], [4751, 7856], [1703, 25], [25104, 110], [1820, 36542], [17, 92742], [3789, 948], [95, 72978], [4751, 45061], [8746, 21494], [3437, 7748], [45061, 80405], [8746, 4542], [421, 53617], [5804, 50130], [170, 212], [21210, 8744], [108, 370], [31107, 94530], [16484, 45061], [3243, 313], [4542, 84355], [16484, 336], [370, 65694], [110, 9897], [1738, 1285], [25104, 4751], [212, 45061], [93, 83585], ...

print(len(pairs))
print(len(labels))

2100
2100

for i in range(5):
    print('({:s}({:d}), {:s}({:d})) -> {:d}'.format(
        idx2word[pairs[i][0]], pairs[i][0],
        idx2word[pairs[i][1]], pairs[i][1],
        labels[i]
    ))

(totally(1285), build(468)) -> 1
(convenience(5804), interference(3243)) -> 1
(monitoring(4646), measures(2784)) -> 1
(active(1738), yfj2(96783)) -> 0
(amount(842), kulkhandanian(27621)) -> 0

training_dataset = [skipgrams(sample, vocabulary_size=vocab_size, window_size=10) for sample in encoded[:9988]]
len(training_dataset)

from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, Reshape, Activation, Input, Dot
from tensorflow.keras.utils import plot_model

embedding_dim = 100

w_inputs = Input(shape=(1,), dtype='int32')
word_embedding = Embedding(vocab_size, embedding_dim)(w_inputs)

c_inputs = Input(shape=(1,), dtype='int32')
context_embedding = Embedding(vocab_size, embedding_dim)(c_inputs)

dot_product = Dot(axes=2)([word_embedding, context_embedding])
dot_product = Reshape((1,), input_shape=(1, 1))(dot_product)
output = Activation('sigmoid')(dot_product)

model = Model(inputs=[w_inputs, c_inputs], outputs=output)
model.summary()

Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
==================================================================================================
 input_3 (InputLayer)        [(None, 1)]                  0         []                            
                                                                                                  
 input_4 (InputLayer)        [(None, 1)]                  0         []                            
                                                                                                  
 embedding_2 (Embedding)     (None, 1, 100)               1095890   ['input_3[0][0]']             
                                                          0                                       
                                                                                                  
 embedding_3 (Embedding)     (None, 1, 100)               1095890   ['input_4[0][0]']             
                                                          0                                       
                                                                                                  
 dot_1 (Dot)                 (None, 1, 1)                 0         ['embedding_2[0][0]',         
                                                                     'embedding_3[0][0]']         
                                                                                                  
 reshape_1 (Reshape)         (None, 1)                    0         ['dot_1[0][0]']               
                                                                                                  
 activation_1 (Activation)   (None, 1)                    0         ['reshape_1[0][0]']           
                                                                                                  
==================================================================================================
Total params: 21917800 (83.61 MB)
Trainable params: 21917800 (83.61 MB)
Non-trainable params: 0 (0.00 Byte)
__________________________________________________________________________________________________
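Stripped of the Keras layers, the model above scores a (center, context) pair as the sigmoid of the dot product of two embedding vectors and trains it with binary cross-entropy. The sketch below reproduces that computation for one pair in plain Python, with one plain gradient-descent step. The notebook itself uses the Adam optimizer, so the update rule, learning rate, and vector values here are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(word_vec, context_vec):
    """Probability that the pair is a real (center, context) pair."""
    return sigmoid(sum(w * c for w, c in zip(word_vec, context_vec)))

def bce(prob, label):
    """Binary cross-entropy for a single prediction."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def sgd_step(word_vec, context_vec, label, lr=0.5):
    """One gradient step on both vectors for a single labeled pair."""
    error = predict(word_vec, context_vec) - label  # d(loss)/d(score)
    new_word = [w - lr * error * c for w, c in zip(word_vec, context_vec)]
    new_context = [c - lr * error * w for w, c in zip(word_vec, context_vec)]
    return new_word, new_context

# Made-up 3-D vectors for one positive pair (label 1).
word_vec, context_vec = [0.1, -0.2, 0.3], [0.2, 0.1, -0.1]
loss_before = bce(predict(word_vec, context_vec), label=1)
word_vec, context_vec = sgd_step(word_vec, context_vec, label=1)
loss_after = bce(predict(word_vec, context_vec), label=1)
print(loss_before, loss_after)
```

For a positive pair the step pulls the two vectors toward each other, and for a negative pair it pushes them apart.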

model.compile(loss='binary_crossentropy', optimizer='adam')

plot_model(model, to_file='model.png', show_shapes=True, show_layer_names=True)

# Train for 100 epochs over every document; each document's skip-gram pairs form one batch
for epoch in range(100):
    loss = 0
    for _, elem in enumerate(training_dataset):
        first_elem = np.array(list(zip(*elem[0]))[0], dtype='int32')
        second_elem = np.array(list(zip(*elem[0]))[1], dtype='int32')
        labels = np.array(elem[1], dtype='int32')
        X = [first_elem, second_elem]
        y = labels
        loss += model.train_on_batch(X, y)
    print('Epoch: ', epoch+1, 'Loss: ', loss)

# The same loop restricted to the five sampled documents in skip_grams
for epoch in range(100):
    loss = 0
    for _, elem in enumerate(skip_grams):
        first_elem = np.array(list(zip(*elem[0]))[0], dtype='int32')
        second_elem = np.array(list(zip(*elem[0]))[1], dtype='int32')
        labels = np.array(elem[1], dtype='int32')
        X = [first_elem, second_elem]
        y = labels
        loss += model.train_on_batch(X, y)
    print('Epoch: ', epoch+1, 'Loss: ', loss)

Epoch:  1 Loss:  1.7103365063667297
Epoch:  2 Loss:  1.590340107679367
Epoch:  3 Loss:  1.4495297372341156
Epoch:  4 Loss:  1.311815470457077
Epoch:  5 Loss:  1.185643509030342
Epoch:  6 Loss:  1.073489561676979
Epoch:  7 Loss:  0.9753282219171524
Epoch:  8 Loss:  0.8900876939296722
Epoch:  9 Loss:  0.8163090497255325
Epoch:  10 Loss:  0.7524631395936012
Epoch:  11 Loss:  0.6971024721860886
Epoch:  12 Loss:  0.6489284932613373
Epoch:  13 Loss:  0.6068132743239403
Epoch:  14 Loss:  0.5697973072528839
Epoch:  15 Loss:  0.5370749905705452
Epoch:  16 Loss:  0.5079759880900383
Epoch:  17 Loss:  0.4819449707865715
Epoch:  18 Loss:  0.45852308720350266
Epoch:  19 Loss:  0.43733125180006027
Epoch:  20 Loss:  0.4180557504296303
Epoch:  21 Loss:  0.4004364088177681
Epoch:  22 Loss:  0.38425664231181145
Epoch:  23 Loss:  0.3693353198468685
Epoch:  24 Loss:  0.35552045330405235
Epoch:  25 Loss:  0.34268372505903244
Epoch:  26 Loss:  0.3307163007557392
Epoch:  27 Loss:  0.3195254020392895
Epoch:  28 Loss:  0.30903152748942375
Epoch:  29 Loss:  0.29916612431406975
Epoch:  30 Loss:  0.28986988216638565
Epoch:  31 Loss:  0.2810911722481251
Epoch:  32 Loss:  0.2727847956120968
Epoch:  33 Loss:  0.2649110369384289
Epoch:  34 Loss:  0.2574349008500576
Epoch:  35 Loss:  0.25032520294189453
Epoch:  36 Loss:  0.24355429038405418
Epoch:  37 Loss:  0.2370973452925682
Epoch:  38 Loss:  0.2309320718050003
Epoch:  39 Loss:  0.2250384483486414
Epoch:  40 Loss:  0.21939825639128685
Epoch:  41 Loss:  0.21399503387510777
Epoch:  42 Loss:  0.20881374925374985
Epoch:  43 Loss:  0.2038406953215599
Epoch:  44 Loss:  0.19906333275139332
Epoch:  45 Loss:  0.19447012431919575
Epoch:  46 Loss:  0.19005048274993896
Epoch:  47 Loss:  0.1857946552336216
Epoch:  48 Loss:  0.18169361166656017
Epoch:  49 Loss:  0.17773902788758278
Epoch:  50 Loss:  0.17392317578196526
Epoch:  51 Loss:  0.1702388860285282
Epoch:  52 Loss:  0.1666794866323471
Epoch:  53 Loss:  0.1632387824356556
Epoch:  54 Loss:  0.15991096571087837
Epoch:  55 Loss:  0.15669065713882446
Epoch:  56 Loss:  0.1535727959126234
Epoch:  57 Loss:  0.1505526416003704
Epoch:  58 Loss:  0.1476257871836424
Epoch:  59 Loss:  0.14478806033730507
Epoch:  60 Loss:  0.14203554950654507
Epoch:  61 Loss:  0.13936457969248295
Epoch:  62 Loss:  0.136771684512496
Epoch:  63 Loss:  0.1342535950243473
Epoch:  64 Loss:  0.13180723693221807
Epoch:  65 Loss:  0.12942969053983688
Epoch:  66 Loss:  0.1271182168275118
Epoch:  67 Loss:  0.1248701885342598
Epoch:  68 Loss:  0.1226831367239356
Epoch:  69 Loss:  0.1205547209829092
Epoch:  70 Loss:  0.11848272942006588
Epoch:  71 Loss:  0.11646503210067749
Epoch:  72 Loss:  0.11449964717030525
Epoch:  73 Loss:  0.11258462630212307
Epoch:  74 Loss:  0.11071817856281996
Epoch:  75 Loss:  0.10889856051653624
Epoch:  76 Loss:  0.10712412465363741
Epoch:  77 Loss:  0.10539329890161753
Epoch:  78 Loss:  0.10370456892997026
Epoch:  79 Loss:  0.10205649863928556
Epoch:  80 Loss:  0.10044772643595934
Epoch:  81 Loss:  0.09887693636119366
Epoch:  82 Loss:  0.09734286647289991
Epoch:  83 Loss:  0.09584432374686003
Epoch:  84 Loss:  0.09438015799969435
Epoch:  85 Loss:  0.09294925257563591
Epoch:  86 Loss:  0.09155056439340115
Epoch:  87 Loss:  0.09018306992948055
Epoch:  88 Loss:  0.08884578663855791
Epoch:  89 Loss:  0.08753781672567129
Epoch:  90 Loss:  0.08625821676105261
Epoch:  91 Loss:  0.08500614948570728
Epoch:  92 Loss:  0.08378078229725361
Epoch:  93 Loss:  0.08258131612092257
Epoch:  94 Loss:  0.0814069788902998
Epoch:  95 Loss:  0.08025704231113195
Epoch:  96 Loss:  0.07913079299032688
Epoch:  97 Loss:  0.07802754826843739
Epoch:  98 Loss:  0.07694664411246777
Epoch:  99 Loss:  0.07588745094835758
Epoch:  100 Loss:  0.07484935130923986

# Gensim
# An open-source library widely used for natural language processing
# Topic modeling, document similarity, word embeddings (Word2Vec, FastText, etc.)
import gensim

vectors = model.get_weights()[0]
vectors

array([[-0.02982065,  0.0215173 , -0.03577896, ..., -0.03639679,
        -0.00350826,  0.00749798],
       [-0.28699452,  0.30848753,  0.41423577, ...,  0.03836289,
         0.44764927,  0.4010425 ],
       [-0.15411642,  0.05482709,  0.23381615, ...,  0.4452949 ,
         0.47483867,  0.3769739 ],
       ...,
       [-0.03987028,  0.02529777,  0.03442145, ..., -0.01099628,
        -0.01817884,  0.00867484],
       [-0.03688021,  0.04699555,  0.0014374 , ..., -0.01721019,
         0.04268595, -0.02808468],
       [ 0.03915599, -0.0125185 , -0.03287099, ...,  0.01273588,
        -0.04056833, -0.0363853 ]], dtype=float32)

len(vectors)

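# Save the embeddings in word2vec text format: a 'vocab_size dim' header, then one 'word v1 v2 ...' line per word
# Caveat: Keras word indices start at 1 and embedding row i belongs to index i, so vectors[i-1] pairs each word
# with the row of the previous index; an aligned lookup needs vocab_size = len(word2idx) + 1 and vectors[i]
with open('vectors.txt', 'w', encoding='utf-8') as f:
    f.write('{} {}\n'.format(vocab_size, embedding_dim))

    for word, i in tokenizer.word_index.items():
        f.write('{} {}\n'.format(word, ' '.join(map(str, list(vectors[i-1, :])))))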

w2v = gensim.models.KeyedVectors.load_word2vec_format('./vectors.txt', binary=False)

w2v.most_similar(positive=['flower'])

[('individualism', 0.41862842440605164),
 ('nonproift', 0.3935631215572357),
 ('se4', 0.3841329514980316),
 ('ohsu', 0.3802776634693146),
 ('b\\orh8zd4', 0.3760339021682739),
 ('knot', 0.37595334649086),
 ('vus144\\', 0.36287355422973633),
 ('vestiges', 0.3622933328151703),
 ('bzw', 0.36037641763687134),
 ('pyromaniac', 0.3556657135486603)]
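The scores returned by `most_similar` are cosine similarities between the query vector and every other word vector. A minimal sketch of that ranking in plain Python follows. The 2-D vectors are made up and are not taken from the trained model, and the helper mirrors only the basic behavior, not gensim's implementation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]

# Made-up 2-D vectors, not taken from the trained model.
toy_vectors = {
    "flower": [0.9, 0.1],
    "rose": [0.8, 0.2],
    "tulip": [0.7, 0.1],
    "engine": [-0.1, 0.9],
}
neighbors = most_similar("flower", toy_vectors, topn=2)
print(neighbors)
```

When the embeddings are well trained, the top neighbors are semantically related words. The result above for 'flower' shows no such pattern, which is plausible given that the logged training run covered only five documents.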

Assignment

Build word embeddings using a Korean dataset published on AI Hub
