8장_텍스트_분석

50 minute read

본 포스팅은 [파이썬 머신러닝 완벽 가이드 _ 권철민 저] 도서를 기반으로 하고 있으며, 본인이 직접 요약, 정리한 내용입니다.

텍스트 분석과 NLP를 구분하는 것은 크게 의미가 없지만, 굳이 구분한다면 NLP는 기계가 인간의 언어를 이해하고 해석하는데 중점을 두고 있다면,텍스트 분석 혹은 텍스트 마이닝은 비정형 텍스트에서 의미있는 정보를 추출하는 것에 조금 더 중점을 두고 있다.

텍스트 분석은 머신러닝, 언어 이해, 통계 등을 활용하여 모델을 수립하고 정보를 추출해 비즈니스 인텔리전스나 예측 분석 등의 분석 작업을 주로 수행하며, 주로 다음과 같은 기술 영역에 집중해있다.

텍스트 분류(Text Classification) : Text Categorization이라고도 한다. 문서가 특정 분류 또는 카테고리에 속하는 것을 예측하는 기법을 통힝한다. ex) 뉴스의 카테고리 예측, 스팸 메일 검출 등
감성 분석(Sentiment Analsis) : 텍스트에서 나타나는 감정/판단/믿음/의견/기분 등의 주관적인 요소를 분석하는 기법을 총칭한다.

Text Analytics에서 가장 활발하게 사용되는 분야이다. 지도학습 방법 뿐만 아니라 비지도 학습을 이용해 적용할 수 있다.
텍스트 요약(Summarization) : 텍스트 내에서 중요한 주제나 중심 사상을 추출하는 기법을 말한다. ex) 토픽 모델링 등
텍스트 군집화(Clustring)와 유사도 측정 : 비슷한 유형의 문서에 대해 군집화를 수행하는 기법을 말한다.

텍스트 분류를 비지도학습으로 수행하는 방법의 일환으로 사용될 수 있다. 유사도 측정 역시 문서들 간의 유사도를 측정하여 비슷한 문서끼리 모을 수 있는 방법이다.

1 : 텍스트 분석 이해

텍스트 분석은 비정형 데이터인 텍스트를 분석하는 것이다.

머신러닝 알고리즘은 숫자형의 feature 기반 데이터만 입력받을 수 있기 때문에 텍스트를 머신러닝에 적용하기 위해서는 비정형 텍스트 데이터를 어떻게 feature 형태로 추출하고 추출된 feature에 의미 있는 값을 부여하는가 하는 것이 매우 중요한 요소이다.

텍스트를 word(또는 word의 일부분) 기반의 다수의 feature로 추출하고 이 feature에 단어 빈도수와 같은 숫자 값을 부여하면 텍스트는 단어의 조합인 벡터 값으로 표현될 수 있는데, 이렇게 텍스트를 변환하는 것을 Feature Vectorization 또는 Feature Extraction이라고 한다.

대표적으로 텍스트를 Feature Vectorization해서 변환하는 방법에는 BOW(Bag of Words)와 Word2Vec 방법이 있다.

이번 장에서는 BOW만 다루도록 하겠다.

텍스트 분석 수행 프로세스

텍스트 사전 준비 작업(텍스트 전처리) : 텍스트를 feature로 만들기 전에 미리 대/소문자 변경, 특수 문자 삭제 등의 클렌징 작업, 단어(Word) 등의 토큰화 작업, 의미 없는 단어(Stop word) 제거 작업, 어근 추출(Stemming/Lemmatization) 등의 텍스트 정규화 작업을 수행하는 것을 통칭한다.
Feature Vectorization : 사전 준비 작업으로 가공된 텍스트에서 Feature을 추출하고 여기에 벡터 값을 할당한다. 대표적인 방법은 BOW와 Word2Vec이 있으며, BOW는 대표적으로 Count 기반과 TF-IDF 기반 벡터화가 있다.
ML 모델 수립 및 학습/예측/평가 : Feature Vectorized된 데이터 세트에는 ML 모델을 적용하여 학습/예측 및 평가를 수행한다.

파이썬 기반의 NLP, 텍스트 분석의 패키지

NLTK(Natural Language Toolkit for Python) : 파이썬의 가장 대표적인 NLP 패키지이다. 수행 속도 측면에서 아쉬운 부분이 있어서 실제 대량의 데이터 기반에서는 제대로 활용되지 못하고 있다.
Gensim : 토픽 모델링 분야에서 가장 두각을 나타내는 패키지이다. 오래전부터 토픽 모델링을 쉽게 구현할 수 있는 기능을 제공해왔으며, Word2Vec 구현 등의 다양한 신기능도 제공한다. SpaCy와 함께 가장 많이 사용되는 NLP 패키지이다.
SpaCy : 뛰어난 수행 성능으로 최근 가장 주목을 받는 NLP 패키지이다. 많은 NLP 앱에서 SpaCy를 사용하는 사례가 늘고 있다.

2 : 텍스트 사전 준비 작업(텍스트 전처리) - 텍스트 정규화

클렌징(Cleansing)
토근화(Tokenization)
필터링/ Stop word 제거/ 철자 수정
Stemming
Lemmatization

클렌징(Cleansing)

텍스트에서 분석에 오히려 방해가 되는 불필요한 문자, 기호 등을 사전에 제거하는 작업이다. 예를 들어 HTML, XML 태그나 특정 기호 등을 사전에 제거한다.

텍스트 토큰화(Tokenization)

토큰화의 유형은 문서에서 문장을 분리하는 문장 토큰화와 문장에서 단어를 토큰으로 분리하는 단어 토큰화로 나눌 수 있다. NLTK는 이를 위해 다양한 API를 제공한다.

문장 토큰화

문장 토큰화는 문장의 마침표(.), 개행문자(\n) 등 문장의 마지막을 뜻하는 기호에 따라 분리하는 것이 일반적이다. 또한 정규 표현식에 따른 문장 토큰화도 가능하다.

from nltk import sent_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Oh Won
[nltk_data]     Jin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!

text_sample = 'The Matrix is everywhere its all around us, here even in this room. \
You can see it out your window or on you television. \
You feel it when you go to work, or go to church or pay your taxes.'

sentences = sent_tokenize(text = text_sample)

print(sentences)
print(type(sentences),len(sentences))

['The Matirx is everywhere its all around us, here even in this room.', 'You can see it out your window or on you television.', 'You feel it when you go to wrk, or go to church or pay your taxes.']
<class 'list'> 3

단어 토큰화

단어 토큰화(Word Tokenization)는 문장을 단어로 토큰화하는 것이다. 기본적으로 공백, 콤마(,), 마침표(.), 개행 문자 등으로 단어를 분리하지만, 정규 표현식을 이용해 다양한 유형으로 토큰화를 수행할 수 있다.
마침표(.)나 개행 문자와 같이 문장을 분리하는 구분자를 이용하여 단어를 토큰화할 수 있으므로 Bag of Word와 같이 단어의 순서가 중요하지 않은 경우 문장 토큰화를 사용하지 않고 단어 토큰화만 사용해도 충분하다.
일반적으로 문장 토큰화는 각 문장이 가지는 문맥적(Semantic) 의미가 중요한 요소로 사용될 때 사용한다.

from nltk import word_tokenize

sentence = "The Matirx is everywhere its all around us, here even in this room."
words = word_tokenize(sentence)
print(type(words), len(words))
print(words)

<class 'list'> 15
['The', 'Matirx', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.']

이번에는 sent_tokenize와 word_tokenize를 조합해 문서에 대하여 모든 단어를 토큰화해 보겠다.

from nltk import sent_tokenize, word_tokenize

# 여러 개의 문장으로 된 입력 데이터를 문장 별로 단어 토큰화하게 만드는 함수 생성

def tokenize_text(text):
    
    # 문장 별로 분리 토큰
    sentences = sent_tokenize(text)
    
    # 분리된 문장별 단어 토큰화
    
    word_tokens = [word_tokenize(sentence) for sentence in sentences]
    return word_tokens

# 여러 문장에 대해 문장별 단어 토큰화 수행.

word_tokens = tokenize_text(text_sample)
print(type(word_tokens), len(word_tokens))
print(word_tokens)

<class 'list'> 3
[['The', 'Matirx', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.'], ['You', 'can', 'see', 'it', 'out', 'your', 'window', 'or', 'on', 'you', 'television', '.'], ['You', 'feel', 'it', 'when', 'you', 'go', 'to', 'wrk', ',', 'or', 'go', 'to', 'church', 'or', 'pay', 'your', 'taxes', '.']]

문장을 단어 별로 하나씩 토큰화 할 경우 문맥적인 의미는 무시될 수 밖에 없다. 이러한 문제를 조금이라도 해결해 보고자 도입된 것이 n-gram이다.

n-gram은 연속된 n개의 단어를 하나의 토큰화 단위로 분리해 내는 것이다. n개 단어 크기 윈도우를 만들어 문장의 처음부터 오른쪽으로 움직이면서 토큰화를 수행한다. 예를 들어 “Agent Smith knocks the door”를 2-gram(bigram)으로 만들면 (Agent,Smith), (Smith,knocks), (knocks, the), (the, door)와 같이 연속적으로 2개의 단어들을 순차적으로 이동하면서 단어들을 토큰화 한다.

스톱 워드 제거

스톱 워드는 분석에 큰 의미가 없는 단어를 지칭한다. 예를 들어 영어에서 is, the, a, will 등 무장을 구성하는 필수 문법 요소이지만 문맥적으로 큰 의미가 없는 단어가 이에 해당한다. 이러한 단어들의 경우, 문법적인 특성으로 인해 특히 빈번하게 텍스트에 나타나므로 이것들을 사전에 제거하지 않으면 그 빈번함으로 인해 오히려 중요한 단어로 인지할 수 있기 있다. 따라서 이 의미없는 단어를 제거하는 것이 중요한 전처리 작업이다.

import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Oh Won
[nltk_data]     Jin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

True

print('영어 stop words 개수 :', len(nltk.corpus.stopwords.words('english')))
print(nltk.corpus.stopwords.words('english')[:20])

영어 stop words 개수 : 179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']

stopwords = nltk.corpus.stopwords.words('english')
all_tokens = []

# 위 예제엇 3개의 문장 별로 얻은 word_tokens list에 대해 스톱 워드를 제거하는 반복문
text_sample = 'The Matrix is everywhere its all around us, here even in this room. \
You can see it out your window or on you television. \
You feel it when you go to work, or go to church or pay your taxes.'

sentences = sent_tokenize(text = text_sample)
word_tokens = [word_tokenize(sentence) for sentence in sentences]

for sentence in word_tokens :
    filtered_words = []
    
    for word in sentence :
        # 소문자로 모두 변환합니다.
        word = word.lower()
        # 토큰화된 개별 단어가 스톱 워드의 단어에 포함되지 않으면 word_tokens에 추가
        
        if word not in stopwords :
            filtered_words.append(word)
    all_tokens.append(filtered_words)
    
print(all_tokens)

[['matrix', 'everywhere', 'around', 'us', ',', 'even', 'room', '.'], ['see', 'window', 'television', '.'], ['feel', 'go', 'work', ',', 'go', 'church', 'pay', 'taxes', '.']]

Stemming과 Lemmatization

많은 언어에서 문법적인 요소에 따라 단어가 다양하게 변하는데, Stemming과 Lemmatization은 문법적 또는 의미적으로 변화하는 단어의 원형을 찾는 것이다.
두 기능 모두 원형 단어를 찾는다는 목적은 유사하지만, Lemmatization이 Stemming보다 정교하며, 의미론적인 기반에서 단어의 원형을 찾는다.

Stemming은 원형 단어로 변환 시 일반적인 방법을 적용하거나 더 단순화된 방법을 적용해 원래 단어에서 일부 철자가 훼손된 어근 단어를 추출하는 경향이 있다. 이에 반하여, Lemmatization은 품사와 같은 문법적인 요소와 더 의미적인 부분을 감안해 정확한 철자로 된 어근 단어를 찾아준다. 따라서 Lemmatization이 Stemming보다 변환에 더 오랜 시간을 필요로 한다.

#  NLTK는 다양한 Stemmer를 제공하는데, 대표적으로 Porter, Lancaster, Snowball Stemmer가 있다.

from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
print(stemmer.stem('working'), stemmer.stem('works'), stemmer.stem('worked'))
print(stemmer.stem('amusing'), stemmer.stem('amuses'), stemmer.stem('amused'))
print(stemmer.stem('happier'), stemmer.stem('happiest'))
print(stemmer.stem('fancier'), stemmer.stem('fanciest'))

work work work
amus amus amus
happy happiest
fant fanciest

#  NLTK는 Lemmatization을 위해서 WordNetLemmatizer을 제공한다.

from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lemma = WordNetLemmatizer()

# 일반적으로 Lemmatization은 보다 정확한 원형 단어 추출을 위해 단어의 '품사'를 입력해줘야 한다. 다음 예제에서 볼 수 있듯이
# lemmatize()의 파라미터로 동사의 경우 'v', 형용사의 경우 'a'를 입력한다.
print(lemma.lemmatize('amusing','v'),lemma.lemmatize('amuses','v'),lemma.lemmatize('amused','v'))
print(lemma.lemmatize('happier','a'),lemma.lemmatize('happiest','a'))
print(lemma.lemmatize('fancier','a'),lemma.lemmatize('fanciest','a'))

[nltk_data] Downloading package wordnet to C:\Users\Oh Won
[nltk_data]     Jin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


amuse amuse amuse
happy happy
fancy fancy

3 : Bag of Words - BOW

Bag of Words 모델은 문서가 가지는 모든 단어(Words)를 문맥이나 순서를 무시하고 일괄적으로 단어에 대해 빈도 값을 부여해 feature 값을 추출하는 모델이다.
문서 내 모든 단어를 한꺼번에 봉투 안에 넣은 뒤 흔들어서 섞는다는 의미로 Bag of Words 모델이라고 한다.

Bag of Words 모델의 장점은 쉽고 빠른 구축에 있다. 그러나 단점은 다음과 같다.

문맥 의미(Semantic Context) 반영 부족 : BOW는 단어의 순서를 고려하지 않기 때문에 문장 내에서 단어의 문맥적인 의미가 무시된다. 물론 이를 보완하기 위해 n_gram 기법을 활용할 수 있지만, 제한적인 부분에 그치므로 언어의 많은 부분을 차지하는 문맥적인 해석을 처리할 수 없다.
희소 행렬 문제(희소성, 희소 행렬) : BOW로 feature vectorization을 수행하면, 희소 행렬 형태의 데이터 세트가 만들어지기 쉽다. 많은 문서에서 단어를 추출하면 매우 많은 단어가 column으로 만들어지는데, 문서마다 서로 다른 단어로 구성되기에 ‘특정’ 문서에서의 단어는 생성한 column 중 극히 일부분이므로 ‘특정’ 문서내 행렬의 대부분의 값이 0으로 채워지게 된다.
이처럼 대규모의 column으로 구성된 행렬에서 대부분의 값이 0으로 채워지는 행렬을 Sparse Matrix라고 한다. 희소 행렬은 일반적으로 ML 알고리즘의 수행 시간과 예측 성능을 떨어뜨리기 때문에 희소 행렬을 위한 특별한 기법이 마련돼 있다.

BOW Feature Vectorization

Feature Vectorization은 각 문서의 텍스트를 단어로 추출해 feature로 할당하고, 각 단어의 발생 빈도와 같은 값을 이 feature에 값으로 부여해 각 문서를 이 단어 feature의 발생 빈도 값으로 구성된 벡터를 만드는 기법이다.

BOW 모델에서 Feature Vectorization을 수행한다는 것은 모든 문서에서 모든 단어를 column 형태로 나열하고 각 문서에서 해당 단어의 횟수나 정규화된 빈도를 값으로 부여하는 데이터 세트 모델을 변경하는 것이다.

예를 들어 M개의 텍스트 문서가 있고, 이 문서에서 모든 단어를 추출해 나열했을 때 N개의 단어가 있다고 가정하면 문서의 Feature Vectorization을 수행하면 M개의 문서는 각각 N개의 값이 할당된 feature의 벡터 세트가 된다. 결과적으로는 M X N개의 단어 feature로 이루어진 행렬을 구성하게 된다.

일반적으로 BOW의 Feature Vectorization은 두 가지 방식이 있다.

카운트 기반의 Vectorization
TF - IDF(Term Frequency - Inverse Document Frequency) 기반의 벡터화

단어 feature에 값을 부여할 때 각 문서에서 해당 단어가 나타나는 횟수,즉 Count를 부여하는 경우를 카운트 벡터화라고 한다.

카운트 벡터화에서는 카운트 값이 높을 수록 중요한 단어로 인식된다. 그러나 카운트만 부여하게 될 경우 그 문서의 특징을 나타내기보다는 언어의 특성상 문장에서 자주 사용될 수 밖에 없는 단어까지 높은 값을 부여하게 된다.

이러한 문제를 보완하기 위해 TF-IDF 벡터화를 사용한다. TF-IDF는 개별 문서에서 자주 나타나는 단어에 높은 가중치를 주되, 모든 문서에서 전반적으로 자주 나타나는 단어에 대해서는 페널티를 주는 방식으로 값을 부여한다.

문서마다 텍스트가 길고 문서의 개수가 많은 경우 카운트 방식보다는 TF-IDF 방식을 사용하는 것이 더 좋은 예측 성능을 보장할 수 있다.

사이킷런의 Count 및 TF-IDF 벡터화 구현 :

CountVectorizer, TfidfVectorizer

사이킷런의 CountVectorizer 클래스는 카운트 기반의 벡터화를 구현한 클래스이다.
CountVectorizer 클래스는 단지 feature vectorization만 수행하지는 않으며 소문자 일괄 변환, 토큰화, 스톱 워드 필터링 등의 텍스트 전처리도 함께 수행한다.

보통 CountVectorizer 클래스를 이용해 카운트 기반의 feature 여러 개의 문서로 구성된 텍스트의 feature vectorization 방법은 다음과 같다.

영어의 경우 모든 문자를 소문자로 변경하는 등의 전처리 작업을 수행한다.
default로 단어 기준으로 n_gram_range를 반영해 각 단어를 토큰화한다.
텍스트 정규화를 수행한다.

단, stop_words = ‘english’와 같이 stop_words 파라미터가 주어진 경우 스톱 워드 필터링만 가능하다.

Stemming과 Lemmatization같은 어근 변환은 CountVectorizer에서 직접 지원하진 않으나 tokenizer 파라미터에 커스텀 어근 변환 함수를 적용하여 어근 변환을 수행할 수 있다.

마지막으로 max_df, min_df, max_features 등의 파라미터를 이용해 토큰화된 단어를 feature로 추출하고 단어 빈도수 벡터 값을 적용한다.

사이킷런 CountVectorizer 테스트

text_sample_01 = 'The Matrix is everywhere its all around us, here even in this room. \
                  You can see it out your window or on your television. \
              You feel it when you go to work, or go to church or pay your taxes.'
text_sample_02 = 'You take the blue pill and the story ends.  You wake in your bed and you believe whatever you want to believe\
                  You take the red pill and you stay in Wonderland and I show you how deep the rabbit-hole goes.'
text=[]
text.append(text_sample_01); text.append(text_sample_02)
print(text,"\n", len(text))

['The Matrix is everywhere its all around us, here even in this room.                   You can see it out your window or on your television.                   You feel it when you go to work, or go to church or pay your taxes.', 'You take the blue pill and the story ends.  You wake in your bed and you believe whatever you want to believe                  You take the red pill and you stay in Wonderland and I show you how deep the rabbit-hole goes.'] 
 2

CountVectorizer객체 생성 후 fit(), transform()으로 텍스트에 대한 feature vectorization 수행

from sklearn.feature_extraction.text import CountVectorizer

# Count Vectorization으로 feature extraction 변환 수행. 
cnt_vect = CountVectorizer()
cnt_vect.fit(text)
ftr_vect = cnt_vect.transform(text)
print(type(ftr_vect), ftr_vect.shape)
print(ftr_vect)

<class 'scipy.sparse.csr.csr_matrix'> (2, 51)
  (0, 0)	1
  (0, 2)	1
  (0, 6)	1
  (0, 7)	1
  (0, 10)	1
  (0, 11)	1
  (0, 12)	1
  (0, 13)	2
  (0, 15)	1
  (0, 18)	1
  (0, 19)	1
  (0, 20)	2
  (0, 21)	1
  (0, 22)	1
  (0, 23)	1
  (0, 24)	3
  (0, 25)	1
  (0, 26)	1
  (0, 30)	1
  (0, 31)	1
  (0, 36)	1
  (0, 37)	1
  (0, 38)	1
  (0, 39)	1
  (0, 40)	2
  :	:
  (1, 1)	4
  (1, 3)	1
  (1, 4)	2
  (1, 5)	1
  (1, 8)	1
  (1, 9)	1
  (1, 14)	1
  (1, 16)	1
  (1, 17)	1
  (1, 18)	2
  (1, 27)	2
  (1, 28)	1
  (1, 29)	1
  (1, 32)	1
  (1, 33)	1
  (1, 34)	1
  (1, 35)	2
  (1, 38)	4
  (1, 40)	1
  (1, 42)	1
  (1, 43)	1
  (1, 44)	1
  (1, 47)	1
  (1, 49)	7
  (1, 50)	1

print(cnt_vect.vocabulary_)

{'the': 38, 'matrix': 22, 'is': 19, 'everywhere': 11, 'its': 21, 'all': 0, 'around': 2, 'us': 41, 'here': 15, 'even': 10, 'in': 18, 'this': 39, 'room': 30, 'you': 49, 'can': 6, 'see': 31, 'it': 20, 'out': 25, 'your': 50, 'window': 46, 'or': 24, 'on': 23, 'television': 37, 'feel': 12, 'when': 45, 'go': 13, 'to': 40, 'work': 48, 'church': 7, 'pay': 26, 'taxes': 36, 'take': 35, 'blue': 5, 'pill': 27, 'and': 1, 'story': 34, 'ends': 9, 'wake': 42, 'bed': 3, 'believe': 4, 'whatever': 44, 'want': 43, 'red': 29, 'stay': 33, 'wonderland': 47, 'show': 32, 'how': 17, 'deep': 8, 'rabbit': 28, 'hole': 16, 'goes': 14}

피처 벡터화 후 데이터 유형 및 여러 속성 확인

cnt_vect = CountVectorizer(max_features=5, stop_words='english')
cnt_vect.fit(text)
ftr_vect = cnt_vect.transform(text)
print(type(ftr_vect), ftr_vect.shape)
print(cnt_vect.vocabulary_)

<class 'scipy.sparse.csr.csr_matrix'> (2, 5)
{'window': 4, 'pill': 1, 'wake': 2, 'believe': 0, 'want': 3}

ngram_range 확인

cnt_vect = CountVectorizer(ngram_range=(1,3))
cnt_vect.fit(text)
ftr_vect = cnt_vect.transform(text)
print(type(ftr_vect), ftr_vect.shape)
print(cnt_vect.vocabulary_)

<class 'scipy.sparse.csr.csr_matrix'> (2, 201)
{'the': 129, 'matrix': 77, 'is': 66, 'everywhere': 40, 'its': 74, 'all': 0, 'around': 11, 'us': 150, 'here': 51, 'even': 37, 'in': 59, 'this': 140, 'room': 106, 'you': 174, 'can': 25, 'see': 109, 'it': 69, 'out': 90, 'your': 193, 'window': 165, 'or': 83, 'on': 80, 'television': 126, 'feel': 43, 'when': 162, 'go': 46, 'to': 143, 'work': 171, 'church': 28, 'pay': 93, 'taxes': 125, 'the matrix': 132, 'matrix is': 78, 'is everywhere': 67, 'everywhere its': 41, 'its all': 75, 'all around': 1, 'around us': 12, 'us here': 151, 'here even': 52, 'even in': 38, 'in this': 60, 'this room': 141, 'room you': 107, 'you can': 177, 'can see': 26, 'see it': 110, 'it out': 70, 'out your': 91, 'your window': 199, 'window or': 166, 'or on': 86, 'on your': 81, 'your television': 197, 'television you': 127, 'you feel': 179, 'feel it': 44, 'it when': 72, 'when you': 163, 'you go': 181, 'go to': 47, 'to work': 148, 'work or': 172, 'or go': 84, 'to church': 146, 'church or': 29, 'or pay': 88, 'pay your': 94, 'your taxes': 196, 'the matrix is': 133, 'matrix is everywhere': 79, 'is everywhere its': 68, 'everywhere its all': 42, 'its all around': 76, 'all around us': 2, 'around us here': 13, 'us here even': 152, 'here even in': 53, 'even in this': 39, 'in this room': 61, 'this room you': 142, 'room you can': 108, 'you can see': 178, 'can see it': 27, 'see it out': 111, 'it out your': 71, 'out your window': 92, 'your window or': 200, 'window or on': 167, 'or on your': 87, 'on your television': 82, 'your television you': 198, 'television you feel': 128, 'you feel it': 180, 'feel it when': 45, 'it when you': 73, 'when you go': 164, 'you go to': 182, 'go to work': 49, 'to work or': 149, 'work or go': 173, 'or go to': 85, 'go to church': 48, 'to church or': 147, 'church or pay': 30, 'or pay your': 89, 'pay your taxes': 95, 'take': 121, 'blue': 22, 'pill': 96, 'and': 3, 'story': 118, 'ends': 34, 'wake': 153, 'bed': 14, 'believe': 17, 'whatever': 159, 'want': 156, 'red': 103, 'stay': 115, 'wonderland': 168, 'show': 112, 'how': 56, 'deep': 31, 'rabbit': 100, 'hole': 54, 'goes': 50, 'you take': 187, 'take the': 122, 'the blue': 130, 'blue pill': 23, 'pill and': 97, 'and the': 6, 'the story': 138, 'story ends': 119, 'ends you': 35, 'you wake': 189, 'wake in': 154, 'in your': 64, 'your bed': 194, 'bed and': 15, 'and you': 8, 'you believe': 175, 'believe whatever': 18, 'whatever you': 160, 'you want': 191, 'want to': 157, 'to believe': 144, 'believe you': 20, 'the red': 136, 'red pill': 104, 'you stay': 185, 'stay in': 116, 'in wonderland': 62, 'wonderland and': 169, 'and show': 4, 'show you': 113, 'you how': 183, 'how deep': 57, 'deep the': 32, 'the rabbit': 134, 'rabbit hole': 101, 'hole goes': 55, 'you take the': 188, 'take the blue': 123, 'the blue pill': 131, 'blue pill and': 24, 'pill and the': 98, 'and the story': 7, 'the story ends': 139, 'story ends you': 120, 'ends you wake': 36, 'you wake in': 190, 'wake in your': 155, 'in your bed': 65, 'your bed and': 195, 'bed and you': 16, 'and you believe': 9, 'you believe whatever': 176, 'believe whatever you': 19, 'whatever you want': 161, 'you want to': 192, 'want to believe': 158, 'to believe you': 145, 'believe you take': 21, 'take the red': 124, 'the red pill': 137, 'red pill and': 105, 'pill and you': 99, 'and you stay': 10, 'you stay in': 186, 'stay in wonderland': 117, 'in wonderland and': 63, 'wonderland and show': 170, 'and show you': 5, 'show you how': 114, 'you how deep': 184, 'how deep the': 58, 'deep the rabbit': 33, 'the rabbit hole': 135, 'rabbit hole goes': 102}

BOW 벡터화를 위한 희소 행렬

사이킷런의 CountVectorizer/TfidfVectorizer 클래스를 이용하여 텍스트를 feature 단위로 vectorize해 변환하고 CSR 형태의 희소 행렬을 반환한다.

모든 문서에 있는 단어를 추출해 이를 feature로 vectorize하는 방법은 필연적으로 많은 feature column을 만들 수밖에 없다. 그런데 이러한 대규모의 행렬이 생성되더라도 레코드의 각 문서가 가지는 단어의 수는 제한적이기 때문에 이 행렬의 값은 대부분 0이 차지할 수 밖에 없다. 이처럼 대규모 행렬의 대부분의 값을 0이 차지하는 행렬을 가리켜 희소 행렬이라고 한다. BOW 형태를 가진 언어 모델의 feature vectorization은 대부분 희소 행렬이다.

희소 행렬은 너무 많은 불필요한 0 값이 메모리 공간에 할당되어 메모리 공간이 많이 필요하며, 행렬의 크기가 커서 연산 시에도 데이터 액세스를 위한 시간이 많이 소모된다.

따라서 이러한 희소 행렬을 물리적으로 적은 메모리 공간을 차지할 수 있도록 변환해야 하는데, 대표적인 방법으로 COO 형식과 CSR 형식이 있다. 일반적으로 큰 희소 행렬을 저장하고 계산을 수행하는 능력이 CSR 형식이 더 뛰어나기 때문에 CSR을 많이 사용한다.

파이썬에서는 희소 행렬 변환을 위해서 주로 Scipy를 이용한다.
Scipy의 sparse 패키지는 희소 행렬 변환을 위한 다양한 모듈을 제공한다.

희소 행렬 - COO 형식

COO(Coordinate : 좌표) 형식은 0이 아닌 데이터만 별도의 데이터 배열(Array)에 저장하고, 그 데이터가 가리키는 행과 열의 위치를 별도의 배열로 저장하는 방식이다.

import numpy as np

# dense는 밀집 행렬이다.
dense = np.array( [ [ 3, 0, 1 ], 
                    [0, 2, 0 ] ] )
dense

array([[3, 0, 1],
       [0, 2, 0]])

from scipy import sparse

# 0 이 아닌 데이터 추출
data = np.array([3,1,2])

# 행 위치와 열 위치를 각각 array로 생성 
row_pos = np.array([0,0,1])
col_pos = np.array([0,2,1])

# sparse 패키지의 coo_matrix를 이용하여 COO 형식으로 희소 행렬 생성
sparse_coo = sparse.coo_matrix((data, (row_pos,col_pos)))

print(type(sparse_coo))
print(sparse_coo)
dense01=sparse_coo.toarray()
print(type(dense01),"\n", dense01)

<class 'scipy.sparse.coo.coo_matrix'>
  (0, 0)	3
  (0, 2)	1
  (1, 1)	2
<class 'numpy.ndarray'> 
 [[3 0 1]
 [0 2 0]]

희소 행렬 – CSR 형식

CSR(Compressed Sparse Row) 형식은 COO 형식이 행과 열의 위치를 나타내기 위해서 반복적인 위치 데이터를 사용해야 하는 문제점을 해결한 방식이다.

from scipy import sparse

dense2 = np.array([[0,0,1,0,0,5],
             [1,4,0,3,2,5],
             [0,6,0,3,0,0],
             [2,0,0,0,0,0],
             [0,0,0,7,0,8],
             [1,0,0,0,0,0]])

# 0 이 아닌 데이터 추출
data2 = np.array([1, 5, 1, 4, 3, 2, 5, 6, 3, 2, 7, 8, 1])

# 행 위치와 열 위치를 각각 array로 생성 
row_pos = np.array([0, 0, 1, 1, 1, 1, 1, 2, 2, 3, 4, 4, 5])
col_pos = np.array([2, 5, 0, 1, 3, 4, 5, 1, 3, 0, 3, 5, 0])

# COO 형식으로 변환 
sparse_coo = sparse.coo_matrix((data2, (row_pos,col_pos)))

print('COO 변환된 데이터가 제대로 되었는지 다시 Dense로 출력 확인')
print(sparse_coo.toarray())

COO 변환된 데이터가 제대로 되었는지 다시 Dense로 출력 확인
[[0 0 1 0 0 5]
 [1 4 0 3 2 5]
 [0 6 0 3 0 0]
 [2 0 0 0 0 0]
 [0 0 0 7 0 8]
 [1 0 0 0 0 0]]

위에서 행 위치 배열을 보면 순차적인 같은 값이 반복적으로 나타남을 알 수 있다. 행 위치 배열이 0부터 순차적으로 증가하는 값으로 이루어졌다는 특성을 고려하면 행 위치 배열의 고유한 값의 시작 위치만 표기하는 반복으로 이러한 반복을 제거할 수 있다.

CSR 방식의 변환은 사이파이의 csr_matrix 클래스를 이용해 쉽게 할 수 있다. 0이 아닌 데이터 배열과 열 위치 배열, 그리고 행 위치 배열의 고유한 값의 시작 위치 배열을 csr_matrix의 생성 파라미터로 입력하면 된다.

# 0 이 아닌 데이터 추출
data2 = np.array([1, 5, 1, 4, 3, 2, 5, 6, 3, 2, 7, 8, 1])

# 행 위치 배열의 고유한 값들의 시작 위치 인덱스를 배열로 생성
row_pos_ind = np.array([0, 2, 7, 9, 10, 12, 13])

# 열 위치 배열은 COO와 같다.
col_pos = np.array([2, 5, 0, 1, 3, 4, 5, 1, 3, 0, 3, 5, 0])
# CSR 형식으로 변환 
sparse_csr = sparse.csr_matrix((data2, col_pos, row_pos_ind))

print('CSR 변환된 데이터가 제대로 되었는지 다시 Dense로 출력 확인')
print(sparse_csr.toarray())

CSR 변환된 데이터가 제대로 되었는지 다시 Dense로 출력 확인
[[0 0 1 0 0 5]
 [1 4 0 3 2 5]
 [0 6 0 3 0 0]
 [2 0 0 0 0 0]
 [0 0 0 7 0 8]
 [1 0 0 0 0 0]]

실제 사용시에는 다음과 같이 밀집 행렬을 생성 파라미터로 입력하면 COO나 CSR 희소 행렬로 생성한다.

dense3 = np.array([[0,0,1,0,0,5],
             [1,4,0,3,2,5],
             [0,6,0,3,0,0],
             [2,0,0,0,0,0],
             [0,0,0,7,0,8],
             [1,0,0,0,0,0]])

coo = sparse.coo_matrix(dense3)
csr = sparse.csr_matrix(dense3)

print(csr)

  (0, 2)	1
  (0, 5)	5
  (1, 0)	1
  (1, 1)	4
  (1, 3)	3
  (1, 4)	2
  (1, 5)	5
  (2, 1)	6
  (2, 3)	3
  (3, 0)	2
  (4, 3)	7
  (4, 5)	8
  (5, 0)	1

5 : 감성 분석

감성 분석 소개

감성 분석은 머신러닝 관점에서 지도 학습과 비지도 학습 방식으로 나눌 수 있다.

지도 학습은 학습 데이터와 타깃 레이블 값을 기반으로 감성 분석 학습을 수행한 뒤 이를 기반으로 다른 데이터의 감성 분석을 예측하는 방법으로 일반적인 텍스트 기반의 분류와 거의 동일하다.
비지도 학습은 ‘Lexicon‘이라는 일종의 감성 어휘 사전을 이용한다. Lexicon은 감성 분석을 위한 용어와 문맥에 대한 다양한 정보를 가지고 있으며, 이를 이용해 문서의 긍정적, 부정적 감성 여부를 판단한다.

지도학습 기반 감성 분석 실습 -IMDB 영화평

import pandas as pd

review_df = pd.read_csv('C:/Users/Oh Won Jin/Python/PerfectGuide/8장/labeledTrainData.tsv', header=0, sep="\t", quoting=3)
review_df.head(3)

	id	sentiment	review
0	"5814_8"	1	"With all this stuff going down at the moment ...
1	"2381_9"	1	"\"The Classic War of the Worlds\" by Timothy ...
2	"7759_3"	0	"The film starts with a manager (Nicholas Bell...

print(review_df['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."

HTML 형식에서 추출하였기 때문에 줄 바꿈 기호인 <’br>, <’br/>이 존재한다.

데이터 사전 처리 html태그 제거 및 숫자, 문자 제거

정규 표현식 모듈 re

import re

# <br> html 태그는 replace 함수로 공백으로 변환
review_df['review'] = review_df['review'].str.replace('<br />',' ')

# 파이썬의 정규 표현식 모듈인 re를 이용하여 영어 문자열이 아닌 문자는 모두 공백으로 변환 
review_df['review'] = review_df['review'].apply( lambda x : re.sub("[^a-zA-Z]", " ", x) )

정규 표현식을 아는 것은 테스트 처리를 하는 데 매우 큰 도움이 된다. 간단한 정규 표현식은 익혀두자!

정규 표현식 [^a-zA-Z] : 영어 대/소문자가 아닌 모든 문자를 찾는 것

학습/테스트 데이터 분리

from sklearn.model_selection import train_test_split

class_df = review_df['sentiment']
# id는 의미 없으므로 drop
feature_df = review_df.drop(['id','sentiment'], axis=1, inplace=False)

X_train, X_test, y_train, y_test= train_test_split(feature_df, class_df, test_size=0.3, random_state=156)

X_train.shape, X_test.shape

((17500, 1), (7500, 1))

Pipeline을 통해 Count기반 피처 벡터화 및 머신러닝 학습/예측/평가

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# 스톱 워드는 English, filtering, ngram은 (1,2)로 설정해 CountVectorization수행. 
# LogisticRegression의 C는 10으로 설정. 
pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words='english', ngram_range=(1,2) )),
    ('lr_clf', LogisticRegression(C=10))])

# Pipeline 객체를 이용하여 fit(), predict()로 학습/예측 수행. predict_proba()는 roc_auc때문에 수행.  
pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:,1]

print('예측 정확도는 {0:.4f}, ROC-AUC는 {1:.4f}'.format(accuracy_score(y_test ,pred),
                                         roc_auc_score(y_test, pred_probs)))

C:\Users\Oh Won Jin\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:765: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)


예측 정확도는 0.8860, ROC-AUC는 0.9503

Pipeline을 통해 TF-IDF기반 피처 벡터화 및 머신러닝 학습/예측/평가

# 스톱 워드는 english, filtering, ngram은 (1,2)로 설정해 TF-IDF 벡터화 수행. 
# LogisticRegression의 C는 10으로 설정. 
pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1,2) )),
    ('lr_clf', LogisticRegression(C=10))])

pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:,1]

print('예측 정확도는 {0:.4f}, ROC-AUC는 {1:.4f}'.format(accuracy_score(y_test ,pred),
                                         roc_auc_score(y_test, pred_probs)))

예측 정확도는 0.8936, ROC-AUC는 0.9598

위 결과에서 확인할 수 있듯이 TF-IDF 기반 feature vectorize 예측 성능이 조금 더 나아졌다.

비지도학습 기반 감성 분석 소개

비지도 감성 분석은 Lexicon을 기반으로 하는 것이다. 위의 지도 감성 분석은 데이터 세트가 레이블 값을 가지고 있었다. 하지만 많은 감성 분석용 데이터는 이러한 결정된 레이블 값을 가지고 있지 않다. 이러한 경우 Lexicon을 사용한다.
Lexicon은 일반적으로 어휘집을 의미하지만 여기서는 주로 감성만을 분석하기 위해 지원하는 감성 어휘 사전이다.
Lexicon은 긍정(Positive) 감성 또는 부정(Negative) 감성의 정도를 의미하는 수치를 가지고 있으며 이를 감성 지수(Polarity score)라고 한다. 이 감성 지수는 단어의 위치나 주변 단어, 문맥, POS(Part of Speech) 등을 참고해 결정된다.
이러한 Lexicon을 구현한 대표격은 NLTK 패키지이고, NLTK 패키지 안의 Lexicon 서브 모듈이 있다.
Lexical semantics의 한계 : 시대에 따라 언어의 발전이나 변형에 따라 구조가 모순을 겪게 되므로 유연한 변화가 필요하지만, 이는 Computational model로써 Lexical semantics가 단어 의미에 대한 다양한 양상을 완전히 계산을 할 수 없다.

이에 따라 벡터 의미론(Vector semantics) 개념이 나오는데, 이 장에서는 다루지 않겠다. 벡터 의미론을 요약하자면, 예컨데 특정 단어 2개가 각 주변 단어의 등장 분포가 유사하다면, 그 단어들은 유사한 의미를 가질 가능성이 높다는 가정을 한 접근법이다.(이웃 단어들의 분포가 이 단어의 의미가 된다.)

NLP에서 제공하는 WordNet 모듈은 방대한 영어 어휘 사전이다. WordNet은 단순한 어휘 사전이 아닌 Semantic(문맥상 의미) 분석을 제공하는 어휘 사전이다.

WordNet은 다양한 상황에서 같은 어휘라도 다르게 사용되는 어휘의 Semantic 정보를 제공하며, 이를 위해 각각의 품사(명사, 동사, 형용사, 부사 등)로 구성된 개별 단어를 Synset(Sets of cognitive synonyms)이라는 개념을 이용해 표현한다. Synset은 단순한 하나의 단어가 아니라 그 단어가 가지는 문맥, 시맨틱 정보를 제공하는 WordNet의 핵심 개념이다.

그러나 NLTK Lexicon의 예측 성능이 그리 좋지 못하므로 실제 업무의 적용은 NLTK 패키지가 아닌 다른 Lexicon(감성사전)을 적용하는 것이 일반적이다. 이는 다음과 같다.

SentiWordNet : NLTK 패키지의 WordNet과 유사하게 감성 단어 전용의 WordNet을 구현한 것이다. WordNet의 Synset 개념을 감성 분석에 적용한 것이다.
VADER : 주로 소셜 미디어의 텍스트에 대한 감성 분석을 제공하기 위한 패키지이다. 뛰어난 감성 분석 결과를 제공하며, 비교적 빠른 수행 시간을 보장해 대용량 텍스트 데이터에 잘 사용되는 패키지이다.
Pattern : 예측 성능 측면에서 가장 주목받는 패키지이다. 하지만 파이썬 버전에 따라 작동 유무가 달라지므로 버전을 확인해보자.

SentiWordNet을 이용한 Sentiment Analysis

WordNet Synset과 SentiWordNet SentiSynset 클래스의 이해

SentiWordNet은 WordNet 기반의 synset을 이용하므로, 먼저 synset에 대한 개념을 이해한 후에 SentiWordNet을 살펴보겠다.

import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to C:\Users\Oh Won
[nltk_data]    |     Jin\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to C:\Users\Oh Won
[nltk_data]    |     Jin\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to C:\Users\Oh
.
.
.




True

from nltk.corpus import wordnet as wn

term = 'present'

# 'present'라는 단어로 wordnet의 synsets 생성. 
synsets = wn.synsets(term)
print('synsets() 반환 type :', type(synsets))
print('synsets() 반환 값 갯수:', len(synsets))
# 여러개의 서로 다른 semantic을 가지는 synset 객체가 반환됐다.
print('synsets() 반환 값 :', synsets)

synsets() 반환 type : <class 'list'>
synsets() 반환 값 갯수: 18
synsets() 반환 값 : [Synset('present.n.01'), Synset('present.n.02'), Synset('present.n.03'), Synset('show.v.01'), Synset('present.v.02'), Synset('stage.v.01'), Synset('present.v.04'), Synset('present.v.05'), Synset('award.v.01'), Synset('give.v.08'), Synset('deliver.v.01'), Synset('introduce.v.01'), Synset('portray.v.04'), Synset('confront.v.03'), Synset('present.v.12'), Synset('salute.v.06'), Synset('present.a.01'), Synset('present.a.02')]

위 결과에서 하나로 예를 들자면, 파라미터 ‘present.n.01’은 POS(품사) 태그를 나타낸다.’present.n.01’에서 present는 의미, n은 명사 품사, 01은 present가 명사로서 가지는 의미가 여러가지 있어서 이를 구분하는 인덱스이다.

synsets() 호출 시 반환되는 것은 여러 개의 Synset 객체를 가지는 리스트이다.

Synset은 POS, 정의, 부명제(Lemma) 등으로 시맨틱적인 요소를 표현할 수 있다.

for synset in synsets :
    print('##### Synset name : ', synset.name(),'#####')
    print('POS :',synset.lexname())
    print('Definition:',synset.definition())
    print('Lemmas:',synset.lemma_names())

##### Synset name :  present.n.01 #####
POS : noun.time
Definition: the period of time that is happening now; any continuous stretch of time including the moment of speech
Lemmas: ['present', 'nowadays']
##### Synset name :  present.n.02 #####
POS : noun.possession
Definition: something presented as a gift
Lemmas: ['present']
##### Synset name :  present.n.03 #####
POS : noun.communication
Definition: a verb tense that expresses actions or states at the time of speaking
Lemmas: ['present', 'present_tense']
##### Synset name :  show.v.01 #####
POS : verb.perception
Definition: give an exhibition of to an interested audience
Lemmas: ['show', 'demo', 'exhibit', 'present', 'demonstrate']
##### Synset name :  present.v.02 #####
POS : verb.communication
Definition: bring forward and present to the mind
Lemmas: ['present', 'represent', 'lay_out']
##### Synset name :  stage.v.01 #####
POS : verb.creation
Definition: perform (a play), especially on a stage
Lemmas: ['stage', 'present', 'represent']
##### Synset name :  present.v.04 #####
POS : verb.possession
Definition: hand over formally
Lemmas: ['present', 'submit']
##### Synset name :  present.v.05 #####
POS : verb.stative
Definition: introduce
Lemmas: ['present', 'pose']
##### Synset name :  award.v.01 #####
POS : verb.possession
Definition: give, especially as an honor or reward
Lemmas: ['award', 'present']
##### Synset name :  give.v.08 #####
POS : verb.possession
Definition: give as a present; make a gift of
Lemmas: ['give', 'gift', 'present']
##### Synset name :  deliver.v.01 #####
POS : verb.communication
Definition: deliver (a speech, oration, or idea)
Lemmas: ['deliver', 'present']
##### Synset name :  introduce.v.01 #####
POS : verb.communication
Definition: cause to come to know personally
Lemmas: ['introduce', 'present', 'acquaint']
##### Synset name :  portray.v.04 #####
POS : verb.creation
Definition: represent abstractly, for example in a painting, drawing, or sculpture
Lemmas: ['portray', 'present']
##### Synset name :  confront.v.03 #####
POS : verb.communication
Definition: present somebody with something, usually to accuse or criticize
Lemmas: ['confront', 'face', 'present']
##### Synset name :  present.v.12 #####
POS : verb.communication
Definition: formally present a debutante, a representative of a country, etc.
Lemmas: ['present']
##### Synset name :  salute.v.06 #####
POS : verb.communication
Definition: recognize with a gesture prescribed by a military regulation; assume a prescribed position
Lemmas: ['salute', 'present']
##### Synset name :  present.a.01 #####
POS : adj.all
Definition: temporal sense; intermediate between past and future; now existing or happening or in consideration
Lemmas: ['present']
##### Synset name :  present.a.02 #####
POS : adj.all
Definition: being or existing in a specified place
Lemmas: ['present']

이처럼 synset은 하나의 단어가 가질 수 있는 여러가지 시맨틱 정보를 개별 클래스로 나타낸 것이다.

WordNet은 어떤 어휘와 다른 어휘 간의 관계를 유사도로 나타낼 수 있다. synset 객체는 단어 간의 유사도를 나타내기 위해서 path_similarity() 메서드를 제공한다.

# synset 객체를 단어별로 생성한다.
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

entities = [tree , lion , tiger , cat , dog]
similarities = []
entity_names = [ entity.name().split('.')[0] for entity in entities]

# 단어별 synset 들을 iteration 하면서 다른 단어들의 synset과 유사도를 측정한다. 
for entity in entities:
    similarity = [ round(entity.path_similarity(compared_entity), 2)  for compared_entity in entities ]
    similarities.append(similarity)
    
# 개별 단어별 synset과 다른 단어의 synset과의 유사도를 DataFrame형태로 저장한다.  
similarity_df = pd.DataFrame(similarities , columns=entity_names,index=entity_names)
similarity_df

	tree	lion	tiger	cat	dog
tree	1.00	0.07	0.07	0.08	0.12
lion	0.07	1.00	0.33	0.25	0.17
tiger	0.07	0.33	1.00	0.25	0.17
cat	0.08	0.25	0.25	1.00	0.20
dog	0.12	0.17	0.17	0.20	1.00

SentiWordNet은 WordNet의 Synset과 유사한 Senti_Synset 클래스를 가지고 있다. SentiWordNet 모듈의 senti_synsets()는 WordNet 모듈이라서 synsets()와 비슷하게 Senti_Synset 클래스를 리스트 형태로 반환한다.

import nltk
from nltk.corpus import sentiwordnet as swn

senti_synsets = list(swn.senti_synsets('slow'))
print('senti_synsets() 반환 type :', type(senti_synsets))
print('senti_synsets() 반환 값 갯수:', len(senti_synsets))
print('senti_synsets() 반환 값 :', senti_synsets)

senti_synsets() 반환 type : <class 'list'>
senti_synsets() 반환 값 갯수: 11
senti_synsets() 반환 값 : [SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('dense.s.04'), SentiSynset('slow.a.04'), SentiSynset('boring.s.01'), SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]

SentiSynset 객체는 단어의 감성을 나타내는 감성 지수와 객관성을(감성과 반대) 나타내는 객관성 지수를 가지고 있다.

감성 지수는 다시 긍정 감성 지수와 부정 감성 지수로 나뉜다. 어떤 단어가 전혀 감성적이지 않으면 객관성 지수는 1이 되고, 감성 지수는 모두 0이 된다. 다음은 father(아버지)라는 단어와 fabulous(아주 멋진)라는 두 개 단어의 감성 지수와 객관성 지수를 나타낸다.

import nltk
from nltk.corpus import sentiwordnet as swn

father = swn.senti_synset('father.n.01')
print('father 긍정감성 지수: ', father.pos_score())
print('father 부정감성 지수: ', father.neg_score())
print('father 객관성 지수: ', father.obj_score())
print('\n')
fabulous = swn.senti_synset('fabulous.a.01')
print('fabulous 긍정감성 지수: ',fabulous .pos_score())
print('fabulous 부정감성 지수: ',fabulous .neg_score())

father 긍정감성 지수:  0.0
father 부정감성 지수:  0.0
father 객관성 지수:  1.0


fabulous 긍정감성 지수:  0.875
fabulous 부정감성 지수:  0.125

SentiWordNet을 이용한 영화 감상평 감성 분석

위 내용을 이용하여 IMDB 영화 감상평 감성 분석을 SentiWordNet Lexicon 기반으로 수행하겠다. 순서는 다음과 같다.

문서(Document)를 문장(Sentence) 단위로 분해
다시 문장을 단어(Word) 단위로 토큰화하고 품사 태깅
품사 태깅된 단어 기반으로 synset 객체와 senti_synset 객체를 생성
Senti_synset에서 긍정 감성/부정 감성 지수를 구하고 이를 모두 합산해 특정 임계치 값 이상일 때 긍정 감성으로, 그렇지 않을 때는 부정 감성으로 결정

from nltk.corpus import wordnet as wn

# 간단한 NTLK PennTreebank Tag를 기반으로 WordNet기반의 품사 Tag로 변환
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return 

from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

def swn_polarity(text):
    # 감성 지수 초기화 
    sentiment = 0.0
    tokens_count = 0
    
    lemmatizer = WordNetLemmatizer()
    raw_sentences = sent_tokenize(text)
    # 분해된 문장별로 단어 토큰 -> 품사 태깅 후에 SentiSynset 생성 -> 감성 지수 합산 
    for raw_sentence in raw_sentences:
        # NTLK 기반의 품사 태깅 문장 추출  
        tagged_sentence = pos_tag(word_tokenize(raw_sentence))
        for word , tag in tagged_sentence:
            
            # WordNet 기반 품사 태깅과 어근 추출(penntree 기반의 tag를 변환함)
            wn_tag = penn_to_wn(tag)
            if wn_tag not in (wn.NOUN , wn.ADJ, wn.ADV):
                continue                   
            lemma = lemmatizer.lemmatize(word, pos=wn_tag)
            if not lemma:
                continue
            # 어근을 추출한 단어와 WordNet 기반 품사 태깅을 입력해 Synset 객체를 생성. 
            synsets = wn.synsets(lemma , pos=wn_tag)
            if not synsets:
                continue
            # sentiwordnet의 감성 단어 분석으로 감성 synset 추출
            # 모든 단어에 대해 긍정 감성 지수는 +로 부정 감성 지수는 -로 합산해 감성 지수 계산. 
            synset = synsets[0]
            swn_synset = swn.senti_synset(synset.name())
            sentiment += (swn_synset.pos_score() - swn_synset.neg_score())           
            tokens_count += 1
    
    # ??? 이 코드는 뭘까
    if not tokens_count:
        return 0
    
    # 총 score가 0 이상일 경우 긍정(Positive) 1, 그렇지 않을 경우 부정(Negative) 0 반환
    if sentiment >= 0 :
        return 1
    
    return 0

review_df['preds'] = review_df['review'].apply( lambda x : swn_polarity(x) )
y_target = review_df['sentiment'].values
preds = review_df['preds'].values

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score 
from sklearn.metrics import recall_score, f1_score, roc_auc_score

print(confusion_matrix( y_target, preds))
print("정확도:", accuracy_score(y_target , preds))
print("정밀도:", precision_score(y_target , preds))
print("재현율:", recall_score(y_target, preds))

[[7639 4861]
 [3575 8925]]
정확도: 0.66256
정밀도: 0.64739590889308
재현율: 0.714

VADER을 이용한 감성 분석

VADER은 소셜 미디어의 감성 분석 용도로 만들어진 룰 기반의 Lexicon이다.VADER는 SentimentIntensityAnalyzer 클래스를 이용해 쉽게 감성 분석을 제공한다.
VADER는 NLTK 패키지의 서브 모듈로 제공될 수도 있고, 단독 패키지로 제공될 수도 있다.

먼저 SentimentIntensityAnalyzer 객체를 생성한 뒤에 문서별로 polarity_scores() 메서드를 호출해 감성 점수를 구한 뒤, 해당 문서의 감성 점수가 특정 임계값 이상이면 긍정, 그렇지 않으면 부정으로 판단한다.

SentimentIntensityAnalyzer 객체의 polarity_scores() 메서드는 딕셔너리 형태의 감성 점수를 반환한다. ‘neg’는 부정 감성 지수, ‘neu’는 중립적인 감성 지수, ‘pos’는 긍정 감성 지수, 그리고 compound는 neg, neu, pos socre를 적절히 조합해 -1에서 1 사이의 감성 지수를 표현한 값이다. compound score를 기반으로 부정 감성 또는 긍정 감성 여부를 결정한다. 보통 0.1 이상이면 긍정 감성, 그 이하이면 부정 감성으로 판단하나 상황에 따라 이 임계값을 적절히 조정해 예측 성능을 조절한다.

from nltk.sentiment.vader import SentimentIntensityAnalyzer

senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])
print(senti_scores)

{'neg': 0.13, 'neu': 0.744, 'pos': 0.126, 'compound': -0.8278}

다음으로 VADER을 이용해 IMDB의 감성 분석을 수행하겠다.

# vader_polarity() 함수는 입력 파라미터로 영화 감상평 텍스트와 긍정/부정을 결정하는 임곗값을 가지고, 
# SentimentIntensityAnalyzer 객체의 polarity_scores() 메서드를 호출해 감성 결과를 반환한다.

def vader_polarity(review,threshold=0.1):
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    
    # compound 값에 기반하여 threshold 입력값보다 크면 1, 그렇지 않으면 0을 반환 
    agg_score = scores['compound']
    final_sentiment = 1 if agg_score >= threshold else 0
    return final_sentiment

# apply lambda 식을 이용하여 레코드별로 vader_polarity( )를 수행하고 결과를 'vader_preds'에 저장
review_df['vader_preds'] = review_df['review'].apply( lambda x : vader_polarity(x, 0.1) )
y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values

print('#### VADER 예측 성능 평가 ####')
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score 
from sklearn.metrics import recall_score, f1_score, roc_auc_score

print(confusion_matrix( y_target, vader_preds))
print("정확도:", accuracy_score(y_target , vader_preds))
print("정밀도:", precision_score(y_target , vader_preds))
print("재현율:", recall_score(y_target, vader_preds))

#### VADER 예측 성능 평가 ####
[[ 6786  5714]
 [ 1937 10563]]
정확도: 0.69396
정밀도: 0.6489525096762303
재현율: 0.84504

이외에도 좋은 감성사전으로 pattern 패키지가 있으니 참고하자.

6 : 토픽 모델링(Topic Modeling) - 20 뉴스그룹

토픽 모델링이란 문서 집합에 숨어있는 주제를 찾아내는 것이다. 많은 양의 문서가 있을 때 사람이 이 문서를 다 읽고 핵심 주제를 찾는 것은 매우 많은 시간이 소모되므로, 이 경우 토픽 모델링이 효과적일 수 있다.
사람이 수행하는 토픽 모델링은 더 함축적인 의미로 문장을 요약하는 것에 반해, 머신러닝 기반의 토픽 모델은 숨겨진 주제를 효과적으로 표현할 수 있는 중심 단어를 함축적으로 추출한다.
머신러닝 기반의 토픽 모델링에 자주 사용되는 기법은 LSA(Latent Semantic Analysis)와 LDA(Latent Dirichlet Allocation)이다.
LSA와 LDA의 이론적 배경은 다음 링크를 참조하자.

위키독스 티스토리

이 절에서는 LDA만을 이용하여 토픽 모델링을 다루겠다.

※ 물론 차원 축소의 LDA와 다른 개념이니 혼동하지 말자.

사이킷런은 LDA 기반의 토픽 모델링을 LatentDirichletAllocation 클래스로 제공한다.

20개 중 8개의 주제 데이터 로드 및 Count기반 피처 벡터화.

LDA는 Count기반 Vectorizer만 적용한다.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# 모토사이클, 야구, 그래픽스, 윈도우즈, 중동, 기독교, 전자공학, 의학 등 8개 주제를 추출. 
cats = ['rec.motorcycles', 'rec.sport.baseball', 'comp.graphics', 'comp.windows.x',
        'talk.politics.mideast', 'soc.religion.christian', 'sci.electronics', 'sci.med'  ]

# 위에서 cats 변수로 기재된 category만 추출. featch_20newsgroups( )의 categories에 cats 입력
news_df= fetch_20newsgroups(subset='all',remove=('headers', 'footers', 'quotes'), 
                            categories=cats, random_state=0)

#LDA 는 Count기반의 Vectorizer만 적용한다.  
count_vect = CountVectorizer(max_df=0.95, max_features=1000, min_df=2, stop_words='english', ngram_range=(1,2))
feat_vect = count_vect.fit_transform(news_df.data)
print('CountVectorizer Shape:', feat_vect.shape)

CountVectorizer Shape: (7862, 1000)

LDA 객체 생성 후 Count 피처 벡터화 객체로 LDA수행

lda = LatentDirichletAllocation(n_components=8, random_state=0)
lda.fit(feat_vect)

LatentDirichletAllocation(n_components=8, random_state=0)

위 과정을 수행하면 LatentDirichletAllocation 객체는 components_ 속성값을 가지게 된다. components_는 개별 topic별로 각 word feature가 얼마나 많이 그 토픽에 할당됐는지에 대한 수치를 가지고 있다.

print(lda.components_.shape)
lda.components_

(8, 1000)





array([[3.60992018e+01, 1.35626798e+02, 2.15751867e+01, ...,
        3.02911688e+01, 8.66830093e+01, 6.79285199e+01],
       [1.25199920e-01, 1.44401815e+01, 1.25045596e-01, ...,
        1.81506995e+02, 1.25097844e-01, 9.39593286e+01],
       [3.34762663e+02, 1.25176265e-01, 1.46743299e+02, ...,
        1.25105772e-01, 3.63689741e+01, 1.25025218e-01],
       ...,
       [3.60204965e+01, 2.08640688e+01, 4.29606813e+00, ...,
        1.45056650e+01, 8.33854413e+00, 1.55690009e+01],
       [1.25128711e-01, 1.25247756e-01, 1.25005143e-01, ...,
        9.17278769e+01, 1.25177668e-01, 3.74575887e+01],
       [5.49258690e+01, 4.47009532e+00, 9.88524814e+00, ...,
        4.87048440e+01, 1.25034678e-01, 1.25074632e-01]])

각 토픽 모델링 주제별 단어들의 연관도 확인
위 결과에서 확인할 수 있듯이 lda객체의 components_ 속성은 주제별로 개별 단어들의 연관도 정규화 숫자가 들어있음

shape는 주제 개수 X 피처 단어 개수

components_ 에 들어 있는 숫자값은 각 주제별로 단어가 나타난 횟수를 정규화 하여 나타냄.

숫자가 클 수록 토픽에서 단어가 차지하는 비중이 높음

display_topic_words() 함수를 만들어서 각 topic별로 연관도가 높은 순으로 word를 나열하겠다.

각 토픽별 중심 단어 확인

def display_topic_words(model, feature_names, no_top_words):
    for topic_index, topic in enumerate(model.components_):
        print('\nTopic #',topic_index)

        # components_ array에서 가장 값이 큰 순으로 정렬했을 때, 그 값의 array index를 반환. 
        topic_word_indexes = topic.argsort()[::-1]
        top_indexes=topic_word_indexes[:no_top_words]
        
        # top_indexes대상인 index별로 feature_names에 해당하는 word feature 추출 후 join으로 concat
        feature_concat = ' + '.join([str(feature_names[i])+'*'+str(round(topic[i],1)) for i in top_indexes])                
        print(feature_concat)

# CountVectorizer객체내의 전체 word들의 명칭을 get_features_names( )를 통해 추출
feature_names = count_vect.get_feature_names()

# Topic별 가장 연관도가 높은 word를 15개만 추출
display_topic_words(lda, feature_names, 15)

# 모토사이클, 야구, 그래픽스, 윈도우즈, 중동, 기독교, 전자공학, 의학 등 8개 주제를 추출. 

Topic # 0
year*703.2 + 10*563.6 + game*476.3 + medical*413.2 + health*377.4 + team*346.8 + 12*343.9 + 20*340.9 + disease*332.1 + cancer*319.9 + 1993*318.3 + games*317.0 + years*306.5 + patients*299.8 + good*286.3

Topic # 1
don*1454.3 + just*1392.8 + like*1190.8 + know*1178.1 + people*836.9 + said*802.5 + think*799.7 + time*754.2 + ve*676.3 + didn*675.9 + right*636.3 + going*625.4 + say*620.7 + ll*583.9 + way*570.3

Topic # 2
image*1047.7 + file*999.1 + jpeg*799.1 + program*495.6 + gif*466.0 + images*443.7 + output*442.3 + format*442.3 + files*438.5 + color*406.3 + entry*387.6 + 00*334.8 + use*308.5 + bit*308.4 + 03*258.7

Topic # 3
like*620.7 + know*591.7 + don*543.7 + think*528.4 + use*514.3 + does*510.2 + just*509.1 + good*425.8 + time*417.4 + book*410.7 + read*402.9 + information*395.2 + people*393.5 + used*388.2 + post*368.4

Topic # 4
armenian*960.6 + israel*815.9 + armenians*699.7 + jews*690.9 + turkish*686.1 + people*653.0 + israeli*476.1 + jewish*467.0 + government*464.4 + war*417.8 + dos dos*401.1 + turkey*393.5 + arab*386.1 + armenia*346.3 + 000*345.2

Topic # 5
edu*1613.5 + com*841.4 + available*761.5 + graphics*708.0 + ftp*668.1 + data*517.9 + pub*508.2 + motif*460.4 + mail*453.3 + widget*447.4 + software*427.6 + mit*421.5 + information*417.3 + version*413.7 + sun*402.4

Topic # 6
god*2013.0 + people*721.0 + jesus*688.7 + church*663.0 + believe*563.0 + christ*553.1 + does*500.1 + christian*474.8 + say*468.6 + think*446.0 + christians*443.5 + bible*422.9 + faith*420.1 + sin*396.5 + life*371.2

Topic # 7
use*685.8 + dos*635.0 + thanks*596.0 + windows*548.7 + using*486.5 + window*483.1 + does*456.2 + display*389.1 + help*385.2 + like*382.8 + problem*375.7 + server*370.2 + need*366.3 + know*355.5 + run*315.3

개별 문서별 토픽 분포 확인

lda객체의 transform()을 수행하면 개별 문서별 토픽 분포를 반환한다.

doc_topics = lda.transform(feat_vect)
print(doc_topics.shape)
print(doc_topics[:3])

(7862, 8)
[[0.01389701 0.01394362 0.01389104 0.48221844 0.01397882 0.01389205
  0.01393501 0.43424401]
 [0.27750436 0.18151826 0.0021208  0.53037189 0.00212129 0.00212102
  0.00212113 0.00212125]
 [0.00544459 0.22166575 0.00544539 0.00544528 0.00544039 0.00544168
  0.00544182 0.74567512]]

개별 문서별 토픽 분포도를 출력

20newsgroup으로 만들어진 문서명을 출력.

fetch_20newsgroups()으로 만들어진 데이터의 filename속성은 모든 문서의 문서명을 가지고 있음.

filename속성은 절대 디렉토리를 가지는 문서명을 가지고 있으므로 ‘\‘로 분할하여 맨 마지막 두번째 부터 파일명으로 가져옴

def get_filename_list(newsdata):
    filename_list=[]

    for file in newsdata.filenames:
            #print(file)
            filename_temp = file.split('\\')[-2:]
            filename = '.'.join(filename_temp)
            filename_list.append(filename)
    
    return filename_list

filename_list = get_filename_list(news_df)
print("filename 개수:",len(filename_list), "filename list 10개만:",filename_list[:10])

filename 개수: 7862 filename list 10개만: ['soc.religion.christian.20630', 'sci.med.59422', 'comp.graphics.38765', 'comp.graphics.38810', 'sci.med.59449', 'comp.graphics.38461', 'comp.windows.x.66959', 'rec.motorcycles.104487', 'sci.electronics.53875', 'sci.electronics.53617']

DataFrame으로 생성하여 문서별 토픽 분포도 확인

import pandas as pd 

topic_names = ['Topic #'+ str(i) for i in range(0, 8)]
doc_topic_df = pd.DataFrame(data=doc_topics, columns=topic_names, index=filename_list)
doc_topic_df.head(20)

	Topic #0	Topic #1	Topic #2	Topic #3	Topic #4	Topic #5	Topic #6	Topic #7
soc.religion.christian.20630	0.013897	0.013944	0.013891	0.482218	0.013979	0.013892	0.013935	0.434244
sci.med.59422	0.277504	0.181518	0.002121	0.530372	0.002121	0.002121	0.002121	0.002121
comp.graphics.38765	0.005445	0.221666	0.005445	0.005445	0.005440	0.005442	0.005442	0.745675
comp.graphics.38810	0.005439	0.005441	0.005449	0.578959	0.005440	0.388387	0.005442	0.005442
sci.med.59449	0.006584	0.552000	0.006587	0.408485	0.006585	0.006585	0.006588	0.006585
comp.graphics.38461	0.008342	0.008352	0.182622	0.767314	0.008335	0.008341	0.008343	0.008351
comp.windows.x.66959	0.372861	0.041667	0.377020	0.041668	0.041703	0.041703	0.041667	0.041711
rec.motorcycles.104487	0.225351	0.674669	0.004814	0.075920	0.004812	0.004812	0.004812	0.004810
sci.electronics.53875	0.008944	0.836686	0.008932	0.008941	0.008935	0.109691	0.008932	0.008938
sci.electronics.53617	0.041733	0.041720	0.708081	0.041742	0.041671	0.041669	0.041699	0.041686
sci.electronics.54089	0.001647	0.512634	0.001647	0.152375	0.001645	0.001649	0.001647	0.326757
rec.sport.baseball.102713	0.982653	0.000649	0.013455	0.000649	0.000648	0.000648	0.000649	0.000649
rec.sport.baseball.104711	0.288554	0.007358	0.007364	0.596561	0.078082	0.007363	0.007360	0.007358
comp.graphics.38232	0.044939	0.138461	0.375098	0.003914	0.003909	0.003911	0.003912	0.425856
sci.electronics.52732	0.017944	0.874782	0.017869	0.017904	0.017867	0.017866	0.017884	0.017885
talk.politics.mideast.76440	0.003381	0.003385	0.003381	0.843991	0.135716	0.003380	0.003384	0.003382
sci.med.59243	0.491684	0.486865	0.003574	0.003577	0.003578	0.003574	0.003574	0.003574
talk.politics.mideast.75888	0.015639	0.499140	0.015641	0.015683	0.015640	0.406977	0.015644	0.015636
soc.religion.christian.21526	0.002455	0.164735	0.002455	0.002456	0.208655	0.002454	0.614333	0.002458
comp.windows.x.66408	0.000080	0.000080	0.809449	0.163054	0.000080	0.027097	0.000080	0.000080

7 : 문서 군집화 소개와 실습(Opinion Review data set)

문서 군집화 개념

문서 군집화(Document Clustering)는 비슷한 텍스트 구성의 문서를 군집화(Clustering)하는 것이다. 텍스트 분류 기반의 문서 분류는 사전에 결정 카테고리 값을 가진 학습 데이터 세트가 필요한 데 반해, 문서 군집화는 학습 데이터 세트가 필요없는 비지도학습 기반으로 작동한다.

Opinion Review 데이터 세트를 이용한 문서 군집화 수행하기

Opinion Review 파일은 구글링을 통해 쉽게 구할 수 있다.

데이터 로딩

import pandas as pd
import glob ,os

path = r'C:\Users\Oh Won Jin\Python\PerfectGuide\8장\OpinosisDataset1.0\topics'                     
# path로 지정한 디렉토리 밑에 있는 모든 .data 파일들의 파일명을 리스트로 취합
all_files = glob.glob(os.path.join(path, "*.data"))    
filename_list = []
opinion_text = []

# 개별 파일들의 파일명은 filename_list 리스트로 취합, 
# 개별 파일들의 파일내용은 DataFrame로딩 후 다시 string으로 변환하여 opinion_text 리스트로 취합 
for file_ in all_files:
    # 개별 파일을 읽어서 DataFrame으로 생성 
    df = pd.read_table(file_,index_col=None, header=0,encoding='latin1')
    
    # 절대경로로 주어진 file 명을 가공. 만일 Linux에서 수행시에는 아래 \\를 / 변경. 맨 마지막 .data 확장자도 제거
    filename_ = file_.split('\\')[-1]
    filename = filename_.split('.')[0]

    #파일명 리스트와 파일내용 리스트에 파일명과 파일 내용을 추가. 
    filename_list.append(filename)
    opinion_text.append(df.to_string())

# 파일명 리스트와 파일내용 리스트를  DataFrame으로 생성
document_df = pd.DataFrame({'filename':filename_list, 'opinion_text':opinion_text})
document_df.head()

	filename	opinion_text
0	accuracy_garmin_nuvi_255W_gps	...
1	bathroom_bestwestern_hotel_sfo	...
2	battery-life_amazon_kindle	...
3	battery-life_ipod_nano_8gb	...
4	battery-life_netbook_1005ha	...

Lemmatization을 위한 함수 생성

from nltk.stem import WordNetLemmatizer
import nltk
import string

# nltk는 
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
lemmar = WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmar.lemmatize(token) for token in tokens]

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

TF-IDF 피처 벡터화, TfidfVectorizer에서 피처 벡터화 수행 시 Lemmatization을 적용하여 토큰화

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english' , \
                             ngram_range=(1,2), min_df=0.05, max_df=0.85 )

#opinion_text 컬럼값으로 feature vectorization 수행
feature_vect = tfidf_vect.fit_transform(document_df['opinion_text'])

C:\Users\Oh Won Jin\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:391: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words.
  'stop_words.' % sorted(inconsistent))

5개의 군집으로 K-Means군집화

from sklearn.cluster import KMeans

# 5개 집합으로 군집화 수행. 예제를 위해 동일한 클러스터링 결과 도출용 random_state=0 
km_cluster = KMeans(n_clusters=5, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_
cluster_centers = km_cluster.cluster_centers_

군집화된 그룹별로 데이터 확인

document_df['cluster_label'] = cluster_label
document_df.head()

	filename	opinion_text	cluster_label
0	accuracy_garmin_nuvi_255W_gps	...	2
1	bathroom_bestwestern_hotel_sfo	...	0
2	battery-life_amazon_kindle	...	1
3	battery-life_ipod_nano_8gb	...	1
4	battery-life_netbook_1005ha	...	1

document_df[document_df['cluster_label']==0].sort_values(by='filename')

	filename	opinion_text
1	bathroom_bestwestern_hotel_sfo	...
32	room_holiday_inn_london	...
30	rooms_bestwestern_hotel_sfo	...
31	rooms_swissotel_chicago	...

document_df[document_df['cluster_label']==1].sort_values(by='filename')

	filename	opinion_text	cluster_label
2	battery-life_amazon_kindle	...	1
3	battery-life_ipod_nano_8gb	...	1
4	battery-life_netbook_1005ha	...	1
19	keyboard_netbook_1005ha	...	1
26	performance_netbook_1005ha	...	1
41	size_asus_netbook_1005ha	...	1
42	sound_ipod_nano_8gb	headphone jack i got a clear case for it a...	1
44	speed_windows7	...	1

document_df[document_df['cluster_label']==2].sort_values(by='filename')

	filename	opinion_text	cluster_label
0	accuracy_garmin_nuvi_255W_gps	...	2
5	buttons_amazon_kindle	...	2
8	directions_garmin_nuvi_255W_gps	...	2
9	display_garmin_nuvi_255W_gps	...	2
10	eyesight-issues_amazon_kindle	...	2
11	features_windows7	...	2
12	fonts_amazon_kindle	...	2
23	navigation_amazon_kindle	...	2
33	satellite_garmin_nuvi_255W_gps	...	2
34	screen_garmin_nuvi_255W_gps	...	2
35	screen_ipod_nano_8gb	...	2
36	screen_netbook_1005ha	...	2
43	speed_garmin_nuvi_255W_gps	...	2
48	updates_garmin_nuvi_255W_gps	...	2
49	video_ipod_nano_8gb	...	2
50	voice_garmin_nuvi_255W_gps	...	2

document_df[document_df['cluster_label']==3].sort_values(by='filename')

	filename	opinion_text	cluster_label
13	food_holiday_inn_london	...	3
14	food_swissotel_chicago	...	3
15	free_bestwestern_hotel_sfo	...	3
20	location_bestwestern_hotel_sfo	...	3
21	location_holiday_inn_london	...	3
24	parking_bestwestern_hotel_sfo	...	3
27	price_amazon_kindle	...	3
28	price_holiday_inn_london	...	3
38	service_bestwestern_hotel_sfo	...	3
39	service_holiday_inn_london	...	3
40	service_swissotel_hotel_chicago	...	3
45	staff_bestwestern_hotel_sfo	...	3
46	staff_swissotel_chicago	...	3

document_df[document_df['cluster_label']==4].sort_values(by='filename')

	filename	opinion_text	cluster_label
6	comfort_honda_accord_2008	...	4
7	comfort_toyota_camry_2007	...	4
16	gas_mileage_toyota_camry_2007	...	4
17	interior_honda_accord_2008	...	4
18	interior_toyota_camry_2007	...	4
22	mileage_honda_accord_2008	...	4
25	performance_honda_accord_2008	...	4
29	quality_toyota_camry_2007	...	4
37	seats_honda_accord_2008	...	4
47	transmission_toyota_camry_2007	...	4

from sklearn.cluster import KMeans

# 3개의 집합으로 군집화 
km_cluster = KMeans(n_clusters=3, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_


# 소속 클러스터를 cluster_label 컬럼으로 할당하고 cluster_label 값으로 정렬
document_df['cluster_label'] = cluster_label
document_df.sort_values(by='cluster_label')

	filename	opinion_text	cluster_label
0	accuracy_garmin_nuvi_255W_gps	...	0
48	updates_garmin_nuvi_255W_gps	...	0
44	speed_windows7	...	0
43	speed_garmin_nuvi_255W_gps	...	0
42	sound_ipod_nano_8gb	headphone jack i got a clear case for it a...	0
41	size_asus_netbook_1005ha	...	0
36	screen_netbook_1005ha	...	0
35	screen_ipod_nano_8gb	...	0
34	screen_garmin_nuvi_255W_gps	...	0
33	satellite_garmin_nuvi_255W_gps	...	0
27	price_amazon_kindle	...	0
26	performance_netbook_1005ha	...	0
49	video_ipod_nano_8gb	...	0
23	navigation_amazon_kindle	...	0
19	keyboard_netbook_1005ha	...	0
50	voice_garmin_nuvi_255W_gps	...	0
9	display_garmin_nuvi_255W_gps	...	0
2	battery-life_amazon_kindle	...	0
3	battery-life_ipod_nano_8gb	...	0
4	battery-life_netbook_1005ha	...	0
5	buttons_amazon_kindle	...	0
12	fonts_amazon_kindle	...	0
11	features_windows7	...	0
10	eyesight-issues_amazon_kindle	...	0
8	directions_garmin_nuvi_255W_gps	...	0
47	transmission_toyota_camry_2007	...	1
37	seats_honda_accord_2008	...	1
6	comfort_honda_accord_2008	...	1
7	comfort_toyota_camry_2007	...	1
16	gas_mileage_toyota_camry_2007	...	1
25	performance_honda_accord_2008	...	1
17	interior_honda_accord_2008	...	1
18	interior_toyota_camry_2007	...	1
22	mileage_honda_accord_2008	...	1
29	quality_toyota_camry_2007	...	1
1	bathroom_bestwestern_hotel_sfo	...	2
46	staff_swissotel_chicago	...	2
45	staff_bestwestern_hotel_sfo	...	2
14	food_swissotel_chicago	...	2
20	location_bestwestern_hotel_sfo	...	2
21	location_holiday_inn_london	...	2
30	rooms_bestwestern_hotel_sfo	...	2
38	service_bestwestern_hotel_sfo	...	2
13	food_holiday_inn_london	...	2
24	parking_bestwestern_hotel_sfo	...	2
28	price_holiday_inn_london	...	2
15	free_bestwestern_hotel_sfo	...	2
32	room_holiday_inn_london	...	2
31	rooms_swissotel_chicago	...	2
39	service_holiday_inn_london	...	2
40	service_swissotel_hotel_chicago	...	2

군집(Cluster)별 핵심 단어 추출하기

각 군집에 속한 문서는 핵심 단어를 주축으로 군집화 되어있을 것이다. 이번에는 각 군집을 구성하는 핵심 단어가 어떤 것이 있는지 확인해보겠다.

KMeans 객체는 각 군집을 구성하는 단어 feature가 군집의 중심(Centroid)을 기준으로 얼마나 가깝게 위치해 있는지 clusters_centers_라는 속성을 제공한다.clusters_centers_는 배열 값으로 제공되며, 행은 개별 군집을, 열은 개별 feature을 의미한다. 각 배열 내의 값은 개별 군집 내의 상대 위치를 숫자 값으로 표현한 일종의 좌표 값이다.예를 들어 cluster_centers[0,1]은 0번 군집에서 두 번째 feature의 위치 값이다.

feature_vect.shape

(51, 4611)

cluster_centers = km_cluster.cluster_centers_
print('cluster_centers shape :',cluster_centers.shape)
print(cluster_centers)

cluster_centers shape : (3, 4611)
[[0.01005322 0.         0.         ... 0.00706287 0.         0.        ]
 [0.         0.00092551 0.         ... 0.         0.         0.        ]
 [0.         0.00099499 0.00174637 ... 0.         0.00183397 0.00144581]]

cluster_centers_는 (3,2409) 배열이다. 이는 군집이 3개, word feature가 2409개로 구성되었음을 의미한다.각 행의 배열 값은 각 군집 내의 2409 feature의 위치가 개별 중심과 얼마나 가까운가를 상대 값으로 나타낸 것이다. 이 값은 0~1까지의 값으로 표현되며 1에 가까울 수록 중심에 더 가깝다는 의미이다.

군집별 top n 핵심단어, 그 단어의 중심 위치 상대값, 대상 파일명들을 반환하는 함수 생성

# 군집별 top n 핵심단어, 그 단어의 중심 위치 상대값, 대상 파일명들을 반환함. 
def get_cluster_details(cluster_model, cluster_data, feature_names, clusters_num, top_n_features=10):
    cluster_details = {}
    
    # cluster_centers array 의 값이 큰 순으로 정렬된 index 값을 반환
    # 군집 중심점(centroid)별 할당된 word 피처들의 거리값이 큰 순으로 값을 구하기 위함.  
    centroid_feature_ordered_ind = cluster_model.cluster_centers_.argsort()[:,::-1]
    
    #개별 군집별로 iteration하면서 핵심단어, 그 단어의 중심 위치 상대값, 대상 파일명 입력
    for cluster_num in range(clusters_num):
        # 개별 군집별 정보를 담을 데이터 초기화. 
        cluster_details[cluster_num] = {}
        cluster_details[cluster_num]['cluster'] = cluster_num
        
        # cluster_centers_.argsort()[:,::-1] 로 구한 index 를 이용하여 top n 피처 단어를 구함. 
        top_feature_indexes = centroid_feature_ordered_ind[cluster_num, :top_n_features]
        top_features = [ feature_names[ind] for ind in top_feature_indexes ]
        
        # top_feature_indexes를 이용해 해당 피처 단어의 중심 위치 상댓값 구함 
        top_feature_values = cluster_model.cluster_centers_[cluster_num, top_feature_indexes].tolist()
        
        # cluster_details 딕셔너리 객체에 개별 군집별 핵심 단어와 중심위치 상대값, 그리고 해당 파일명 입력
        cluster_details[cluster_num]['top_features'] = top_features
        cluster_details[cluster_num]['top_features_value'] = top_feature_values
        filenames = cluster_data[cluster_data['cluster_label'] == cluster_num]['filename']
        filenames = filenames.values.tolist()
        cluster_details[cluster_num]['filenames'] = filenames
        
    return cluster_details

위 함수를 호출하여 dictionary를 원소로 가지는 리스트인 cluster_details를 반환한다. 이를 좀 더 보기 좋게 표현하기 위해 별도의 print_cluster_detailes() 함수를 만들겠다.

클러스터별 top feature들의 단어와 파일명 출력

def print_cluster_details(cluster_details):
    for cluster_num, cluster_detail in cluster_details.items():
        print('####### Cluster {0}'.format(cluster_num))
        print('Top features:', cluster_detail['top_features'])
        print('Reviews 파일명 :',cluster_detail['filenames'][:7])
        print('==================================================')

feature_names = tfidf_vect.get_feature_names()

cluster_details = get_cluster_details(cluster_model=km_cluster, cluster_data=document_df,\
                                  feature_names=feature_names, clusters_num=3, top_n_features=10 )
print_cluster_details(cluster_details)

####### Cluster 0
Top features: ['screen', 'battery', 'keyboard', 'battery life', 'life', 'kindle', 'direction', 'video', 'size', 'voice']
Reviews 파일명 : ['accuracy_garmin_nuvi_255W_gps', 'battery-life_amazon_kindle', 'battery-life_ipod_nano_8gb', 'battery-life_netbook_1005ha', 'buttons_amazon_kindle', 'directions_garmin_nuvi_255W_gps', 'display_garmin_nuvi_255W_gps']
==================================================
####### Cluster 1
Top features: ['interior', 'seat', 'mileage', 'comfortable', 'gas', 'gas mileage', 'transmission', 'car', 'performance', 'quality']
Reviews 파일명 : ['comfort_honda_accord_2008', 'comfort_toyota_camry_2007', 'gas_mileage_toyota_camry_2007', 'interior_honda_accord_2008', 'interior_toyota_camry_2007', 'mileage_honda_accord_2008', 'performance_honda_accord_2008']
==================================================
####### Cluster 2
Top features: ['room', 'hotel', 'service', 'staff', 'food', 'location', 'bathroom', 'clean', 'price', 'parking']
Reviews 파일명 : ['bathroom_bestwestern_hotel_sfo', 'food_holiday_inn_london', 'food_swissotel_chicago', 'free_bestwestern_hotel_sfo', 'location_bestwestern_hotel_sfo', 'location_holiday_inn_london', 'parking_bestwestern_hotel_sfo']
==================================================

포터블 전자제품 리뷰 군집인 Cluster #0에서는 ‘screen’,’battery’,’life’ 등과 같은 화면과 배터리 수명 등이 핵심 단어로 군집화되었다.

8 : 문서 유사도

문서와 문서간의 유사도 비교는 일반적으로 코사인 유사도(Cosine Similarity)를 사용한다. 코사인 유사도가 문서의 유사도 비교에 가장 많이 사용되는 이유는, 희소 행렬 기반에서 문서와 문서 벡터 간의 크기에 기반한 유사도 지표(ex : 유클리드 거리 기반 지표)는 정확도가 떨이지기 쉽고, 또한 문서가 매우 긴 경우 단어의 빈도수도 더 많을 것이기 때문에 이러한 빈도수에만 기반해서는 공정한 비교를 할 수 없기 때문이다.

유사도 $cos\theta$ = similarity = $ {A·B}\over{||A||·||B||}$

유사도는 위와 같이 두 벡터의 내적을 총 벡터 크기의 합으로 나눈 것이다.(즉, 내적 결과를 총 벡터 크기로 정규화(L2Norm)한 것이다.)
넘파이 배열에 대한 코사인 유사도를 구하는 cos_similarity() 함수를 작성하겠다. 물론 사이킷런에서 패키지를 제공하기도 한다.

코사인 유사도 반환 함수 생성

import numpy as np

def cos_similarity(v1, v2):
    dot_product = np.dot(v1, v2)
    l2_norm = (np.sqrt(sum(np.square(v1))) * np.sqrt(sum(np.square(v2))))
    similarity = dot_product / l2_norm     
    
    return similarity

TF-IDF 벡터화 후 코사인 유사도 비교

from sklearn.feature_extraction.text import TfidfVectorizer

doc_list = ['if you take the blue pill, the story ends' ,
            'if you take the red pill, you stay in Wonderland',
            'if you take the red pill, I show you how deep the rabbit hole goes']

tfidf_vect_simple = TfidfVectorizer()
feature_vect_simple = tfidf_vect_simple.fit_transform(doc_list)
print(feature_vect_simple.shape)

(3, 18)

print(type(feature_vect_simple))

<class 'scipy.sparse.csr.csr_matrix'>

# TFidfVectorizer로 transform()한 결과는 Sparse Matrix이므로 Dense Matrix로 변환. 
feature_vect_dense = feature_vect_simple.todense()

#첫번째 문장과 두번째 문장의 feature vector  추출
vect1 = np.array(feature_vect_dense[0]).reshape(-1,)
vect2 = np.array(feature_vect_dense[1]).reshape(-1,)

#첫번째 문장과 두번째 문장의 feature vector로 두개 문장의 Cosine 유사도 추출
similarity_simple = cos_similarity(vect1, vect2 )
print('문장 1, 문장 2 Cosine 유사도: {0:.3f}'.format(similarity_simple))

문장 1, 문장 2 Cosine 유사도: 0.402

vect1 = np.array(feature_vect_dense[0]).reshape(-1,)
vect3 = np.array(feature_vect_dense[2]).reshape(-1,)
similarity_simple = cos_similarity(vect1, vect3 )
print('문장 1, 문장 3 Cosine 유사도: {0:.3f}'.format(similarity_simple))

vect2 = np.array(feature_vect_dense[1]).reshape(-1,)
vect3 = np.array(feature_vect_dense[2]).reshape(-1,)
similarity_simple = cos_similarity(vect2, vect3 )
print('문장 2, 문장 3 Cosine 유사도: {0:.3f}'.format(similarity_simple))

문장 1, 문장 3 Cosine 유사도: 0.404
문장 2, 문장 3 Cosine 유사도: 0.456

사이킷런의 cosine_similarity()함수를 이용하여 비교

from sklearn.metrics.pairwise import cosine_similarity

similarity_simple_pair = cosine_similarity(feature_vect_simple[0] , feature_vect_simple)
print(similarity_simple_pair)

[[1.         0.40207758 0.40425045]]

위 결과에서, 자기 자신의 코사인 유사도가 1임을 알 수 있다.

similarity_simple_pair = cosine_similarity(feature_vect_simple , feature_vect_simple)
print(similarity_simple_pair)
print('shape:',similarity_simple_pair.shape)

[[1.         0.40207758 0.40425045]
 [0.40207758 1.         0.45647296]
 [0.40425045 0.45647296 1.        ]]
shape: (3, 3)

Opinion Review 데이터 셋을 이용한 문서 유사도 측정

from nltk.stem import WordNetLemmatizer
import nltk
import string

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
lemmar = WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmar.lemmatize(token) for token in tokens]

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

import pandas as pd
import glob ,os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

path = r'C:\Users\Oh Won Jin\Python\PerfectGuide\8장\OpinosisDataset1.0\topics' 
all_files = glob.glob(os.path.join(path, "*.data"))     
filename_list = []
opinion_text = []

for file_ in all_files:
    df = pd.read_table(file_,index_col=None, header=0,encoding='latin1')
    filename_ = file_.split('\\')[-1]
    filename = filename_.split('.')[0]
    filename_list.append(filename)
    opinion_text.append(df.to_string())

document_df = pd.DataFrame({'filename':filename_list, 'opinion_text':opinion_text})

tfidf_vect = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english' , \
                             ngram_range=(1,2), min_df=0.05, max_df=0.85 )
feature_vect = tfidf_vect.fit_transform(document_df['opinion_text'])

km_cluster = KMeans(n_clusters=3, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_
cluster_centers = km_cluster.cluster_centers_
document_df['cluster_label'] = cluster_label

C:\Users\Oh Won Jin\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:391: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words.
  'stop_words.' % sorted(inconsistent))

호텔로 클러스터링 된 문서중에서 비슷한 문서를 추출

from sklearn.metrics.pairwise import cosine_similarity

# cluster_label=1인 데이터는 호텔로 클러스터링된 데이터임. DataFrame에서 해당 Index를 추출
hotel_indexes = document_df[document_df['cluster_label']==1].index
print('호텔로 클러스터링 된 문서들의 DataFrame Index:', hotel_indexes)

# 호텔로 클러스터링된 데이터 중 첫번째 문서를 추출하여 파일명 표시.  
comparison_docname = document_df.iloc[hotel_indexes[0]]['filename']
print('##### 비교 기준 문서명 ',comparison_docname,' 와 타 문서 유사도######')

''' document_df에서 추출한 Index 객체를 feature_vect로 입력하여 호텔 클러스터링된 feature_vect 추출 
이를 이용하여 호텔로 클러스터링된 문서 중 첫번째 문서와 다른 문서간의 코사인 유사도 측정.'''
similarity_pair = cosine_similarity(feature_vect[hotel_indexes[0]] , feature_vect[hotel_indexes])
print(similarity_pair)

호텔로 클러스터링 된 문서들의 DataFrame Index: Int64Index([6, 7, 16, 17, 18, 22, 25, 29, 37, 47], dtype='int64')
##### 비교 기준 문서명  comfort_honda_accord_2008  와 타 문서 유사도######
[[1.         0.83969704 0.15655631 0.33044002 0.25981841 0.16544257
  0.27569738 0.18050974 0.65502034 0.06229873]]

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# argsort()를 이용하여 앞예제의 첫번째 문서와 타 문서간 유사도가 큰 순으로 정렬한 인덱스 반환하되 자기 자신은 제외. 
sorted_index = similarity_pair.argsort()[:,::-1]
sorted_index = sorted_index[:, 1:]
print(sorted_index)

# 유사도가 큰 순으로 hotel_indexes를 추출하여 재 정렬. 
print(hotel_indexes)
hotel_sorted_indexes = hotel_indexes[sorted_index.reshape(-1,)]

# 유사도가 큰 순으로 유사도 값을 재정렬하되 자기 자신은 제외
hotel_1_sim_value = np.sort(similarity_pair.reshape(-1,))[::-1]
hotel_1_sim_value = hotel_1_sim_value[1:]

# 유사도가 큰 순으로 정렬된 Index와 유사도값을 이용하여 파일명과 유사도값을 Seaborn 막대 그래프로 시각화
hotel_1_sim_df = pd.DataFrame()
hotel_1_sim_df['filename'] = document_df.iloc[hotel_sorted_indexes]['filename']
hotel_1_sim_df['similarity'] = hotel_1_sim_value

sns.barplot(x='similarity', y='filename',data=hotel_1_sim_df)
plt.title(comparison_docname)

[[1 8 3 6 4 7 5 2 9]]
Int64Index([6, 7, 16, 17, 18, 22, 25, 29, 37, 47], dtype='int64')

Text(0.5, 1.0, 'comfort_honda_accord_2008')

output_178_2

9 : 한글 텍스트 처리 - 네이버 영화 평점 감성 분석

파이썬은 KoNLPy 패키지를 활용한다.

konlpy 패키지의 설치 방법은 까다로우니 java 를 참고하자.

!pip install konlpy

import pandas as pd

train_df = pd.read_csv('C:/Users/Oh Won Jin/Python/PerfectGuide/8장/ratings_train.txt', sep='\t')
train_df.head(3)

	id	document	label
0	9976970	아 더빙.. 진짜 짜증나네요 목소리	0
1	3819312	흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나	1
2	10265843	너무재밓었다그래서보는것을추천한다	0

train_df['label'].value_counts( )

0    75173
1    74827
Name: label, dtype: int64

train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   id        150000 non-null  int64 
 1   document  149995 non-null  object
 2   label     150000 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.4+ MB

import re

train_df = train_df.fillna(' ')
# 정규 표현식을 이용하여 숫자를 공백으로 변경(정규 표현식으로 \d 는 숫자를 의미함.) 
train_df['document'] = train_df['document'].apply( lambda x : re.sub(r"\d+", " ", x) )
train_df.drop('id', axis=1, inplace=True)

# 테스트 데이터 셋을 로딩하고 동일하게 Null 및 숫자를 공백으로 변환
test_df = pd.read_csv('C:/Users/Oh Won Jin/Python/PerfectGuide/8장/ratings_test.txt', sep='\t')
test_df = test_df.fillna(' ')
test_df['document'] = test_df['document'].apply( lambda x : re.sub(r"\d+", " ", x) )
test_df.drop('id', axis=1, inplace=True)

from konlpy.tag import Twitter

twitter = Twitter()

# 이 tw_tokenizer() 함수는 뒤의 사이킷런의 TfidVectorizer 클래스의 tokenizer로 사용된다.
def tw_tokenizer(text):
    # 입력 인자로 들어온 text 를 형태소 단어로 토큰화 하여 list 객체 반환
    tokens_ko = twitter.morphs(text)
    return tokens_ko

tw_tokenizer('첫째')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Twitter 객체의 morphs( ) 객체를 이용한 tokenizer를 사용. ngram_range는 (1,2) 
tfidf_vect = TfidfVectorizer(tokenizer=tw_tokenizer, ngram_range=(1,2), min_df=3, max_df=0.9)
tfidf_vect.fit(train_df['document'])
tfidf_matrix_train = tfidf_vect.transform(train_df['document'])

C:\Users\Oh Won Jin\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:489: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn("The parameter 'token_pattern' will not be used"

# Logistic Regression 을 이용하여 감성 분석 Classification 수행. 
lg_clf = LogisticRegression(random_state=0)

# Parameter C 최적화를 위해 GridSearchCV 를 이용. 
params = { 'C': [1 ,3.5, 4.5, 5.5, 10 ] }
grid_cv = GridSearchCV(lg_clf , param_grid=params , cv=3 ,scoring='accuracy', verbose=1 )
grid_cv.fit(tfidf_matrix_train , train_df['label'] )
print(grid_cv.best_params_ , round(grid_cv.best_score_,4))

Fitting 3 folds for each of 5 candidates, totalling 15 fits


C:\Users\Oh Won Jin\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:765: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
C:\Users\Oh Won Jin\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:765: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

{'C': 3.5} 0.8592

이제 테스트 세트를 이용하여 최종 감성 분석 예측을 수행하겠다. 테스트 세트를 이용해 예측할 때는 학습할 때 적용한 TfidfVectorizer를 그대로 사용해야 한다. 그래야만 학습 시 설정된 TfidfVectorizer의 feature 개수와 테스트 데이터를 TfidfVectorizer로 변환할 feature 개수가 같아진다.

from sklearn.metrics import accuracy_score

# 학습 데이터를 적용한 TfidfVectorizer를 이용하여 테스트 데이터를 TF-IDF 값으로 Feature 변환함. 
tfidf_matrix_test = tfidf_vect.transform(test_df['document'])

# classifier 는 GridSearchCV에서 최적 파라미터로 학습된 classifier를 그대로 이용
best_estimator = grid_cv.best_estimator_
preds = best_estimator.predict(tfidf_matrix_test)

print('Logistic Regression 정확도: ',accuracy_score(test_df['label'],preds))

Logistic Regression 정확도:  0.86186

요약. 정리

사실 텍스트 분석 및 NLP 영역은 이 포스트 내용보다 더 확장된 개념과 딥러닝 개념을 요구하므로, 이 포스트는 텍스트 분석에 대한 맛보기 정도로 이해하도록 하자.

Share on

Twitter Facebook LinkedIn

Oh Won Jin