1. Naver Movie Review Dataset

A dataset of 200,000 reviews in total, built for classifying movie reviews as positive or negative.
Each review carries a label: 1 if the review is positive, 0 if it is negative.
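
As a rough illustration of this layout, the sketch below parses two tab-separated rows with the `id`, `document`, and `label` columns and counts the labels. The two rows are copied from the training table shown further down; `count_labels` is a hypothetical helper name, not part of the original notebook.

```python
import csv
import io
from collections import Counter

# Two rows taken from the training table in this post, laid out as
# tab-separated text with a header line.
SAMPLE = (
    "id\tdocument\tlabel\n"
    "9976970\t아 더빙.. 진짜 짜증나네요 목소리\t0\n"
    "3819312\t흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나\t1\n"
)

def count_labels(text):
    """Count how many reviews carry each label (1 = positive, 0 = negative)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(int(row["label"]) for row in reader)

label_counts = count_labels(SAMPLE)
print(label_counts)
```

On the full training file, the same kind of count is a quick way to check whether the two classes are roughly balanced.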

!sudo apt-get install -y fonts-nanum
!sudo fc-cache -fv
!rm ~/.cache/matplotlib -rf

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
fonts-nanum is already the newest version (20200506-1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
/usr/share/fonts: caching, new cache contents: 0 fonts, 1 dirs
/usr/share/fonts/truetype: caching, new cache contents: 0 fonts, 3 dirs
/usr/share/fonts/truetype/humor-sans: caching, new cache contents: 1 fonts, 0 dirs
/usr/share/fonts/truetype/liberation: caching, new cache contents: 16 fonts, 0 dirs
/usr/share/fonts/truetype/nanum: caching, new cache contents: 12 fonts, 0 dirs
/usr/local/share/fonts: caching, new cache contents: 0 fonts, 0 dirs
/root/.local/share/fonts: skipping, no such directory
/root/.fonts: skipping, no such directory
/usr/share/fonts/truetype: skipping, looped directory detected
/usr/share/fonts/truetype/humor-sans: skipping, looped directory detected
/usr/share/fonts/truetype/liberation: skipping, looped directory detected
/usr/share/fonts/truetype/nanum: skipping, looped directory detected
/var/cache/fontconfig: cleaning cache directory
/root/.cache/fontconfig: not cleaning non-existent cache directory
/root/.fontconfig: not cleaning non-existent cache directory
fc-cache: succeeded

import urllib.request
import pandas as pd

urllib.request.urlretrieve('https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt', filename='ratings_train.txt' )
urllib.request.urlretrieve('https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt', filename='ratings_test.txt' )

('ratings_test.txt', <http.client.HTTPMessage at 0x7f810c6be170>)

train_dataset = pd.read_table('ratings_train.txt')
train_dataset

	id	document	label
0	9976970	아 더빙.. 진짜 짜증나네요 목소리	0
1	3819312	흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나	1
2	10265843	너무재밓었다그래서보는것을추천한다	0
3	9045019	교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정	0
4	6483659	사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...	1
...	...	...	...
149995	6222902	인간이 문제지.. 소는 뭔죄인가..	0
149996	8549745	평점이 너무 낮아서...	1
149997	9311800	이게 뭐요? 한국인은 거들먹거리고 필리핀 혼혈은 착하다?	0
149998	2376369	청춘 영화의 최고봉.방황과 우울했던 날들의 자화상	1
149999	9619869	한국 영화 최초로 수간하는 내용이 담긴 영화	0

150000 rows × 3 columns

len(train_dataset)

2. Data Preprocessing

# Check for missing values so they can be removed (empty strings are treated as missing)
train_dataset.replace('', float('NaN'), inplace=True)
# any(): checks whether at least one True exists in the array
train_dataset.isnull().values.any()

True

train_dataset = train_dataset.dropna().reset_index(drop=True)
len(train_dataset)

# Remove duplicate rows based on the document column
train_dataset = train_dataset.drop_duplicates(['document']).reset_index(drop=True)
len(train_dataset)
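
The two cleaning steps above can be sketched in plain Python to make the logic explicit. This is a simplified illustration, not what pandas does internally: it only checks the `document` field for missing values (whereas `dropna()` drops a row if any column is missing), and the sample rows and the name `drop_missing_and_duplicates` are made up for the example.

```python
def drop_missing_and_duplicates(rows):
    """Keep rows with a non-empty document, and only the first row per document."""
    seen = set()
    kept = []
    for row in rows:
        doc = row.get("document")
        if doc is None or doc == "":
            continue  # missing value: skip the row
        if doc in seen:
            continue  # duplicate of an earlier review: skip the row
        seen.add(doc)
        kept.append(row)
    return kept

# Hypothetical rows, only for illustration.
rows = [
    {"id": 1, "document": "재미있다", "label": 1},
    {"id": 2, "document": "", "label": 0},
    {"id": 3, "document": "재미있다", "label": 1},
    {"id": 4, "document": None, "label": 0},
    {"id": 5, "document": "지루하다", "label": 0},
]
cleaned = drop_missing_and_duplicates(rows)
print([row["id"] for row in cleaned])
```

Like `drop_duplicates`, the sketch keeps the first occurrence of each review text and discards later copies.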

# Replace every non-Hangul character with a space (standalone jamo such as ㅋㅋㅋ are kept)
train_dataset['document'] = train_dataset['document'].str.replace(r'[^ㄱ-ㅎㅏ-ㅣ가-힣]', ' ', regex=True)
train_dataset

	id	document	label
0	9976970	아 더빙 진짜 짜증나네요 목소리	0
1	3819312	흠 포스터보고 초딩영화줄 오버연기조차 가볍지 않구나	1
2	10265843	너무재밓었다그래서보는것을추천한다	0
3	9045019	교도소 이야기구먼 솔직히 재미는 없다 평점 조정	0
4	6483659	사이몬페그의 익살스런 연기가 돋보였던 영화 스파이더맨에서 늙어보이기만 했던 커스틴 ...	1
...	...	...	...
146177	6222902	인간이 문제지 소는 뭔죄인가	0
146178	8549745	평점이 너무 낮아서	1
146179	9311800	이게 뭐요 한국인은 거들먹거리고 필리핀 혼혈은 착하다	0
146180	2376369	청춘 영화의 최고봉 방황과 우울했던 날들의 자화상	1
146181	9619869	한국 영화 최초로 수간하는 내용이 담긴 영화	0

146182 rows × 3 columns
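
To see what the character class does to a single string, here is a minimal sketch using the standard `re` module with the same pattern. The whitespace-collapsing step and the helper name `keep_hangul` are additions for readability and are not in the original cell; the second sample string is made up.

```python
import re

# Same character class as the cell above: anything that is not a consonant
# jamo (ㄱ-ㅎ), a vowel jamo (ㅏ-ㅣ), or a complete Hangul syllable (가-힣).
NON_HANGUL = re.compile(r"[^ㄱ-ㅎㅏ-ㅣ가-힣]")

def keep_hangul(text):
    """Replace non-Hangul characters with spaces, then collapse repeated spaces."""
    spaced = NON_HANGUL.sub(" ", text)
    return " ".join(spaced.split())

print(keep_hangul("아 더빙.. 진짜 짜증나네요 목소리"))
print(keep_hangul("음악 완전 흥겹다ㅋ 10점!"))
```

Because spaces are themselves outside the character class, they are "replaced" by spaces too, so word boundaries survive while digits, Latin letters, and punctuation disappear.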

# Remove words that are too short (one character or fewer)
train_dataset['document'] = train_dataset['document'].apply(lambda x: ' '.join([token for token in x.split() if len(token) > 1]))
train_dataset

	id	document	label
0	9976970	더빙 진짜 짜증나네요 목소리	0
1	3819312	포스터보고 초딩영화줄 오버연기조차 가볍지 않구나	1
2	10265843	너무재밓었다그래서보는것을추천한다	0
3	9045019	교도소 이야기구먼 솔직히 재미는 없다 평점 조정	0
4	6483659	사이몬페그의 익살스런 연기가 돋보였던 영화 스파이더맨에서 늙어보이기만 했던 커스틴 ...	1
...	...	...	...
146177	6222902	인간이 문제지 소는 뭔죄인가	0
146178	8549745	평점이 너무 낮아서	1
146179	9311800	이게 뭐요 한국인은 거들먹거리고 필리핀 혼혈은 착하다	0
146180	2376369	청춘 영화의 최고봉 방황과 우울했던 날들의 자화상	1
146181	9619869	한국 영화 최초로 수간하는 내용이 담긴 영화	0

146182 rows × 3 columns
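
The short-token filter can be tried on a single review as follows. This is the same split/filter/join logic as the lambda above, wrapped in a hypothetical helper `drop_short_tokens` with an assumed `min_len` parameter.

```python
def drop_short_tokens(text, min_len=2):
    """Drop whitespace-separated tokens shorter than min_len characters."""
    return " ".join(token for token in text.split() if len(token) >= min_len)

# First two reviews from the table above, before the filter is applied.
print(drop_short_tokens("아 더빙 진짜 짜증나네요 목소리"))
print(drop_short_tokens("흠 포스터보고 초딩영화줄 오버연기조차 가볍지 않구나"))
```

Note that this works on whitespace-separated chunks, not morphemes, so a review written without spaces passes through unchanged.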

# Remove reviews whose total length is 50 characters or fewer, or that contain 3 or fewer words
train_dataset = train_dataset[train_dataset.document.apply(lambda x: len(str(x)) > 50 and len(str(x).split()) > 3)].reset_index(drop=True)
train_dataset

	id	document	label
0	6483659	사이몬페그의 익살스런 연기가 돋보였던 영화 스파이더맨에서 늙어보이기만 했던 커스틴 ...	1
1	9443947	반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지 정말 발로해도 그것보단 낫겟...	0
2	9864035	취향은 존중한다지만 진짜 내생에 극장에서 영화중 가장 노잼 노감동임 스토리도 어거지...	0
3	9143163	사람들 웃긴게 바스코가 이기면 락스코라고 까고바비가 이기면 아이돌이라고 깐다 그냥 ...	1
4	9705777	재미없다 지루하고 같은 음식 영화인데도 바베트의 만찬하고 차이남 바베트의 만찬은 이...	0
...	...	...	...
22549	9811006	일본은 한국전쟁에 참전한적도 없고 차세계대전이후 완전 패망했다가 한국전쟁 군수업으로...	0
22550	6798178	그리 만족스럽진못했어도 점은 나와야되는것같아 점줌 주인공들연기도 훌륭했고 내용도 이...	1
22551	9633559	시간이 아깝다 어린 여주의 연기는 인상적이었고 나중이 기대되어서 좋았고 남주 여주 ...	0
22552	9492905	나쁜 인상은 아니지만 오랫동안 기억에 남아 종종 떠올라서 조금은 사람을 피곤하게 만...	1
22553	9335962	공포나 재난영화가 아니라 아예 대놓고 비급 크리쳐개그물임ㅋㅋ 음악 완전 흥겹다ㅋ 점...	0

22554 rows × 3 columns
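
The row filter above is a plain boolean predicate, which can be checked on individual strings. The sketch below uses a hypothetical helper `is_long_enough` whose default thresholds match the cell above.

```python
def is_long_enough(text, min_chars=50, min_words=3):
    """True when the review has more than min_chars characters
    and more than min_words whitespace-separated words."""
    text = str(text)
    return len(text) > min_chars and len(text.split()) > min_words

short_review = "평점이 너무 낮아서"
long_review = ("사이몬페그의 익살스런 연기가 돋보였던 영화 스파이더맨에서 "
               "늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다")

print(is_long_enough(short_review))
print(is_long_enough(long_review))
```

Both conditions must hold, so a long review with no spaces is also dropped. In the run above, this filter shrinks the table from 146,182 rows to 22,554, which is worth keeping in mind when judging the quality of the embeddings trained later.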

!pip install konlpy

Requirement already satisfied: konlpy in /usr/local/lib/python3.10/dist-packages (0.6.0)
Requirement already satisfied: JPype1>=0.7.0 in /usr/local/lib/python3.10/dist-packages (from konlpy) (1.5.0)
Requirement already satisfied: lxml>=4.1.0 in /usr/local/lib/python3.10/dist-packages (from konlpy) (4.9.4)
Requirement already satisfied: numpy>=1.6 in /usr/local/lib/python3.10/dist-packages (from konlpy) (1.25.2)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from JPype1>=0.7.0->konlpy) (24.1)

from konlpy.tag import Okt
okt = Okt()

# Define the stopword list (stopwords are removed after tokenization below)
stopwords = ['아', '휴', '아이구', '아이쿠', '아이고', '어', '나', '우리', '저희', '따라', '의해', '을', '를', '에', '의', '가', '으로', '로', '에게', '뿐이다', '의거하여', '근거하여', '입각하여', '기준으로', '예하면', '예를', '들면', '예를', '들자면', '저', '소인', '소생', '저희', '지말고', '하지마', '하지마라', '다른', '물론', '또한', '그리고', '비길수', '없다', '해서는', '안된다', '뿐만', '아니라', '만이', '아니다', '만은', '아니다', '막론하고', '관계없이', '그치지', '않다', '그러나', '그런데', '하지만', '든간에', '논하지', '않다', '따지지', '않다', '설사', '비록', '더라도', '아니면', '만', '못하다', '하는', '편이', '낫다', '불문하고', '향하여', '향해서', '향하다', '쪽으로', '틈타', '이용하여', '타다', '오르다', '제외하고', '이', '외에', '이', '밖에', '하여야', '비로소', '한다면', '몰라도', '외에도', '이곳', '여기', '부터', '기점으로', '따라서', '할', '생각이다', '하려고하다', '이리하여', '그리하여', '그렇게', '함으로써', '하지만', '일때', '할때', '앞에서', '중에서', '보는데서', '으로써', '로써', '까지', '해야한다', '일것이다', '반드시', '할줄알다', '할수있다', '할수있어', '임에', '틀림없다', '한다면', '등', '등등', '제', '겨우', '단지', '다만', '할뿐', '딩동', '댕그', '대해서', '대하여', '대하면', '훨씬', '얼마나', '얼마만큼', '얼마큼']

train_dataset = list(train_dataset['document'])
print(train_dataset)

['사이몬페그의 익살스런 연기가 돋보였던 영화 스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다', '반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지 정말 발로해도 그것보단 낫겟다 납치 감금만반복반복 이드라마는 가족도없다 연기못하는사람만모엿네', '취향은 존중한다지만 진짜 내생에 극장에서 영화중 가장 노잼 노감동임 스토리도 어거지고 감동도 어거지', ...

tokenized_data = []
for sentence in train_dataset:
    tokenized_sentence = okt.morphs(sentence, stem=True)
    stopwords_removed_sentence = [word for word in tokenized_sentence if word not in stopwords]
    tokenized_data.append(stopwords_removed_sentence)

print(tokenized_data[0])

['사이', '몬페', '그', '익살스럽다', '연기', '돋보이다', '영화', '스파이더맨', '에서', '늙다', '보이다', '하다', '커스틴', '던스트', '너무나도', '이쁘다', '보이다']
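
The stopword step can be sketched without KoNLPy on an already tokenized sentence. Converting the list to a `set` is an optional tweak, not something the original loop does: it makes each membership test constant time instead of scanning the whole list. The token list below is hand-written for illustration and is not actual Okt output, and `remove_stopwords` is a hypothetical helper.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword collection."""
    stopword_set = set(stopwords)  # set lookup avoids scanning the list per token
    return [token for token in tokens if token not in stopword_set]

# A small subset of the stopword list above.
sample_stopwords = ['아', '휴', '의', '가', '을', '를']
# Hand-written tokens, assumed for the example.
sample_tokens = ['아', '더빙', '진짜', '짜증나다', '목소리']

print(remove_stopwords(sample_tokens, sample_stopwords))
```

The stopword list above also contains repeated entries (for example '저희' and '하지만'), which a set removes as a side effect.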

# Print the maximum and average review length (in tokens)
print('Maximum review length :', max(len(review) for review in tokenized_data))
print('Average review length :', sum(map(len, tokenized_data)) / len(tokenized_data))

Maximum review length : 73
Average review length : 27.925290414117228

3. Building Word Embeddings

from gensim.models import Word2Vec

embedding_dim = 100

# sg: 0(CBOW), 1(Skip-gram)
model = Word2Vec(
    sentences = tokenized_data,
    vector_size = embedding_dim,
    window = 5,
    min_count = 5,
    workers = 4,
    sg = 0
)

# Shape of the embedding matrix (vocabulary size, embedding dimension)
model.wv.vectors.shape

(8518, 100)
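
The first number in that shape is the vocabulary size, which depends on `min_count`: words that appear fewer than `min_count` times in the corpus are left out. The sketch below imitates that effect with a `Counter`. It is only an illustration of the idea, not gensim's implementation, and the tiny corpus and the name `build_vocab` are made up.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_count=5):
    """Return the words that occur at least min_count times, most frequent first."""
    counts = Counter(token for sentence in tokenized_sentences for token in sentence)
    return [word for word, count in counts.most_common() if count >= min_count]

# Hypothetical corpus: '영화' appears 7 times, '좋다' 5 times, '연기' twice.
toy_corpus = [['영화', '좋다']] * 5 + [['영화', '연기']] * 2

print(build_vocab(toy_corpus, min_count=5))
print(build_vocab(toy_corpus, min_count=1))
```

Lowering `min_count` keeps more rare words at the cost of noisier vectors for them, so on a corpus of about 22,000 reviews the value is a trade-off worth experimenting with.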

model.wv

<gensim.models.keyedvectors.KeyedVectors at 0x7f80c7b910f0>

word_vectors = model.wv
vocabs = list(word_vectors.index_to_key)
print(vocabs[:20])

['하다', '영화', '보다', '도', '들', '는', '은', '있다', '이다', '한', '같다', '좋다', '너무', '되다', '적', '에서', '정말', '과', '진짜', '연기']

for sim_word in model.wv.most_similar('영화'):
    print(sim_word)

('영화로', 0.7565528750419617)
('애니', 0.7400009036064148)
('애니메이션', 0.7056881785392761)
('다큐', 0.703105628490448)
('류', 0.6871150732040405)
('공포영화', 0.6824593544006348)
('장르', 0.6612555384635925)
('이보', 0.6583052277565002)
('명작', 0.6564598083496094)
('수작', 0.6558930277824402)

for sim_word in model.wv.most_similar('좋다'):
    print(sim_word)

('멋지다', 0.8200041055679321)
('예쁘다', 0.8033966422080994)
('유아인', 0.7752985954284668)
('훌륭하다', 0.7656264305114746)
('문소리', 0.7520274519920349)
('괜찮다', 0.7497999668121338)
('사랑스럽다', 0.7488312721252441)
('멋있다', 0.7483131885528564)
('역시', 0.7446177005767822)
('신선하다', 0.7429497838020325)

model.wv.similarity('좋다', '훌륭하다')

0.76562643
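
The scores returned by `most_similar` and `similarity` are cosine similarities between word vectors, which is why the value here matches the score listed for '훌륭하다' above. A minimal pure-Python sketch of the formula, using short made-up vectors instead of the 100-dimensional ones:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

The value depends only on direction, not on vector length: vectors pointing the same way score close to 1, unrelated (orthogonal) ones score 0, and opposite ones score -1.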

4. Visualizing Word Embeddings

import matplotlib.font_manager
import matplotlib.pyplot as plt

font_list = matplotlib.font_manager.findSystemFonts(fontpaths=None, fontext='ttf')
[matplotlib.font_manager.FontProperties(fname=font).get_name() for font in font_list if 'Nanum' in font]

['NanumGothic',
 'NanumGothicCoding',
 'NanumMyeongjo',
 'NanumSquareRound',
 'NanumSquare',
 'NanumGothicCoding',
 'NanumBarunGothic',
 'NanumSquare',
 'NanumGothic',
 'NanumMyeongjo',
 'NanumBarunGothic',
 'NanumSquareRound']

plt.rc('font', family='NanumBarunGothic')

print(vocabs)

['하다', '영화', '보다', '도', '들', '는', '은', '있다', '이다', '한', '같다', '좋다', '너무', '되다', '적', '에서', '정말', '과', '진짜', '연기', '나오다', '다', '생각', '사람', '만들다', '인', '스토리', '것', '고', '못', '드라마', '평점', '와', '배우', '안', '하고'...]

word_vector_list = [word_vectors[word] for word in vocabs]
word_vector_list[0]

array([-0.34784016, -0.16849771,  0.88728464,  1.075393  , -0.7368959 ,
       -1.187426  ,  0.6145811 ,  1.1576205 , -0.70221376, -0.8731941 ,
        0.05149937,  0.64524204, -0.81566083,  0.8594128 ,  0.22408877,
        0.09455317,  0.45184103, -0.17805634, -0.7752705 , -0.8701237 ,
        1.4452381 ,  0.8390722 ,  1.2022033 ,  0.42076182,  0.39765248,
       -0.24217506,  0.5084747 ,  0.24555375, -1.2640083 , -0.48160413,
        0.14733747,  0.35805106,  0.46542016, -0.55658674, -0.05962019,
       -0.44142222,  0.14535375, -0.6663644 , -0.7050491 ,  0.18734103,
       -0.84879607,  0.09343452, -1.142795  , -0.40024695,  0.36533567,
       -0.00224682, -0.73007804,  0.5704995 ,  0.854103  ,  0.70108676,
       -0.27891883,  0.62712836, -1.2488798 ,  0.31276733,  1.2004278 ,
       -0.42258662,  0.8661732 , -0.3739732 , -0.7323736 ,  0.33951512,
       -0.17922783,  0.6820473 ,  0.09505165, -0.4637062 ,  0.10069248,
        0.3163553 , -0.42854822, -0.52026105, -0.00578056,  0.0900441 ,
       -0.04557418,  0.8059918 , -1.0058637 , -0.72027427,  0.5382828 ,
        0.775967  ,  1.061688  , -0.51433045, -0.29729733, -0.41708854,
        0.57213706, -0.14759123,  0.14340214,  0.6998346 , -0.23525225,
       -0.25330645,  0.3070325 ,  0.575081  , -0.12902758,  0.5766181 ,
        0.9325258 ,  0.5811082 ,  0.24234109,  0.38263977,  0.6903438 ,
        0.33353162,  0.08093763, -0.43025896,  0.6275524 , -0.42603862],
      dtype=float32)

import numpy as np
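# PCA: a widely used linear dimensionality-reduction method, but it tends to blur the separation between clusters
# t-SNE is a nonlinear method that addresses this by preserving local neighborhoods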
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, learning_rate='auto', init='random')
transformed = tsne.fit_transform(np.array(word_vector_list))

x_axis_tsne = transformed[:, 0]
y_axis_tsne = transformed[:, 1]
print(x_axis_tsne)
print(y_axis_tsne)

[ 69.76023   51.441284  60.1358   ... -14.145403 -73.18895  -33.421547]
[ 8.384243   48.06204    40.67984    ...  0.88280594 -3.269093
 -5.8950243 ]

def plot_tsne_graph(vocabs, x_axis, y_axis):
    plt.figure(figsize=(30, 30))
    plt.scatter(x_axis, y_axis, marker='o')
    for i, v in enumerate(vocabs):
        plt.annotate(v, xy=(x_axis[i], y_axis[i]))
    plt.show()

plot_tsne_graph(vocabs, x_axis_tsne, y_axis_tsne)
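
With all 8,518 words annotated, the labels overlap heavily. One option is to plot only the first `n` entries. Since `index_to_key` lists words from most to least frequent by default in gensim, slicing keeps the most common words. The helper name `top_n_points` and the default `n=200` are assumptions for this sketch; the three sample points are the first three values printed above.

```python
def top_n_points(vocabs, x_axis, y_axis, n=200):
    """Keep only the first n words and their matching coordinates."""
    return vocabs[:n], x_axis[:n], y_axis[:n]

# First three words and coordinates from the outputs above.
sample_vocabs = ['하다', '영화', '보다']
sample_x = [69.76023, 51.441284, 60.1358]
sample_y = [8.384243, 48.06204, 40.67984]

words, xs, ys = top_n_points(sample_vocabs, sample_x, sample_y, n=2)
print(words, xs, ys)
```

In the notebook, the three returned sequences could be passed straight to `plot_tsne_graph`; slicing works the same way on the NumPy arrays `x_axis_tsne` and `y_axis_tsne`.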


5. Enhancing the t-SNE Visualization

Bokeh, an interactive visualization library for Python, can be used to turn the static plot into an interactive one.

tsne_df = pd.DataFrame(transformed, columns=['x_coord', 'y_coord'])
tsne_df

	x_coord	y_coord
0	69.760231	8.384243
1	51.441284	48.062038
2	60.135799	40.679840
3	83.284004	18.981834
4	69.300697	22.976822
...	...	...
8513	-65.761559	-27.534193
8514	-61.397038	-4.526075
8515	-14.145403	0.882806
8516	-73.188950	-3.269093
8517	-33.421547	-5.895024

8518 rows × 2 columns

tsne_df['vocabs'] = vocabs
tsne_df

	x_coord	y_coord	vocabs
0	69.760231	8.384243	하다
1	51.441284	48.062038	영화
2	60.135799	40.679840	보다
3	83.284004	18.981834	도
4	69.300697	22.976822	들
...	...	...	...
8513	-65.761559	-27.534193	전투력
8514	-61.397038	-4.526075	연가시
8515	-14.145403	0.882806	상쾌
8516	-73.188950	-3.269093	나혼자산다
8517	-33.421547	-5.895024	이대로

8518 rows × 3 columns

from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.resources import INLINE
# Korean font setting (applies to matplotlib figures only; Bokeh renders text in the browser)
import matplotlib.pyplot as plt
plt.rc('font', family='NanumGothic')
# Bokeh output setting: render plots inline in the notebook
output_notebook(resources=INLINE)
# prepare the data in a form suitable for bokeh.
plot_data = ColumnDataSource(tsne_df)
# create the plot and configure it
tsne_plot = figure(title='t-SNE Word Embeddings',
  width = 800,
  height = 800,
  active_scroll='wheel_zoom'
)
# add a hover tool to display words on roll-over
tsne_plot.add_tools( HoverTool(tooltips = '@vocabs') )
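# scatter() draws circle markers by default; circle() with size is deprecated in newer Bokeh releases
tsne_plot.scatter(
    'x_coord', 'y_coord', source=plot_data,
    color='red', line_alpha=0.2, fill_alpha=0.1,
    size=10, hover_line_color='orange'
)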
# adjust visual elements of the plot
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None
# show time!
show(tsne_plot);
