728x90

1. LDA_Implementation

In [277]:

import os 

corpus = list()
for filename in os.listdir('./News'):
    if filename.startswith('정치'):
        with open('./News/'+filename,encoding='utf-8') as f:
            corpus.append(f.read())

In [278]:

corpus[0]

Out[278]:

"\n\n\n\n\n// flash 오류를 우회하기 위한 함수 추가\nfunction _flash_removeCallback() {}\n\n  김부겸 행정안전부장관이 14일 서울 여의도 국회 행정안전위원회에 출석해 얼굴을 어루만지고 있다. [뉴스1]           김부겸 행정안전부 장관이 정부의 개각 인사 발표 방식에 대해 “늘 하던 방식이 아닌 출신고별로 발표하는 발상은 누가 했는지 모르지만, 상당히 치졸하다고 생각한다”며 비판적 태도를 보였다.        김 장관은 14일 국회에서 열린 행정안전위원회 업무보고 오후 질의에서 윤재옥 자유한국당 의원의 질문에 이같이 답했다. 이날 질의는 사실상 자신의 마지막 국회 업무보고다.      윤 의원은 “장관 일곱 분 개각이 됐는데 TK(대구ㆍ경북) 출신은 한 명도 없다”며 “정략적으로 고립화한다는 지역 여론이 있다”고 했다. 또 “출신 지역을 숨기고 출신고를 발표했는데 그 결과 호남 출신은 한 명도 없는 것으로 나왔으나 실제로는 4명이었다”며 “특정 지역이 소외감을 느끼는 불균형 인사는 빨리 시정돼야 한다. 국회로 돌아오면 목소리를 같이 내 달라”고 질의했다.      이에 김 장관은 “대한민국에서 인사를 하면 늘 그런 식으로 평가가 엇갈리게 마련이지만, 그런 측면이 있더라도 한 국가의 인사에 그런 잣대를 들이대는 것은 지나치다”고 답했다. 이에 김 장관은 ‘출신고 기준’ 발표 방식이 치졸하다면서 “앞으로는 제가 국회로 돌아가서 그런 문제에 앞장서겠다”고 말했다.      앞서 지난 8일 문재인 대통령은 진영 의원을 새 행안부 장관에 내정했다. 당시 청와대는 개각 명단을 발표하면서 이번에 처음으로 출신지를 제외하고 출생연도와 출신 고교ㆍ대학 등 주요 학력과 경력만을 공개했다.      문재인 대통령이 지난 8일 7개 부처에 대한 중폭 개각을 단행했다. 왼쪽 위부터 시계방향으로 중소벤처기업부장관에 내정된 박영선 더불어민주당 의원, 행안부장관에 내정된 진영 더불어민주당 의원, 통일부장관에 내정된 김연철 통일연구원장, 국토부장관에 내정된 최정호 전 국토부 2차관, 과기부장관에 내정된 조동호 카이스트 교수, 해수부장관에 내정된 문성혁 세계해사대교수, 문체부장관에 내정된 박양우 전 문화관광부 차관. [사진 청와대]           장관 후보자 중 서울 지역 고등학교 졸업자는 조동호 과학기술정보통신부 장관 후보자(서울 배문고), 진영 행정안전부 장관 후보자(서울 경기고), 문성혁 해양수산부 장관 후보자(서울 대신고), 박영선 중소벤처기업부 장관 후보자(서울 수도여고) 등 4명이다. 김연철 통일부 장관 후보자는 강원 북평고, 박양우 문화체육관광부 장관 후보자는 인천 제물포고, 최정호 국토교통부 장관 후보자는 경북 금오공고를 나왔다. 고등학교 기준으로 하면 서울 4명, 인천 1명, 경북 1명, 강원 1명의 분포다.        그러나 종전의 출생지 기준으로 재분류를 하면 전북이 3명(진영ㆍ조동호ㆍ최정호)이고 광주 1명(박양우), 부산 1명(문성혁), 경남 1명(박영선), 강원 1명(김연철)의 분포가 된다. 청와대 발표에는 안 보이던 호남 출신이 4명이다.        당시 청와대는 “지연 중심 문화를 탈피해야 한다는 데 사회의 공감대가 있다”면서 “출신지라는 게 객관적이지도 않아서 그곳에서 태어나 오랫동안 성장한 사람이 있는가 하면 출생만 하고 성장은 다른 곳에서 해온 분들도 있다. 불필요한 논란을 끌지 않기 위해 이번에 고등학교 중심으로 발표했다”고 설명했다.        한영혜 기자 han.younghye@joongang.co.kr  ▶ 중앙일보 '홈페이지' / '페이스북' 친구추가▶ 네이버 구독 1위 신문, 중앙일보ⓒ중앙일보(https://joongang.co.kr), 무단 전재 및 재배포 금지\n\t\n"

In [279]:

from string import punctuation

corpus2 = list()
for i in range(len(corpus)):
    corpus[i] = corpus[i].translate(str.maketrans('', '', punctuation))
    corpus[i] = corpus[i].replace('flash 오류를 우회하기 위한 함수 추가\nfunction flashremoveCallback', '')
    corpus2.append((corpus[i]))

In [280]:

len(corpus2)

Out[280]:

In [285]:

## 진짜 Noun으로만 DTM 만드는거 
DTM = defaultdict(lambda: defaultdict(int))
dictNoun = list()

for i in range(len(corpus2)):
    for t in Kkma().pos(corpus2[i]):
        if len(t[0]) >1 and t[1].startswith('N'):
            DTM[i][t[0]] += 1

In [313]:

# DTM

In [287]:

termList = list(list([] for _ in range(len(corpus2))))

In [288]:

for i in range(len(corpus2)):
    for j,d in DTM[i].items():
        termList[i].append(j)

In [294]:

len(termList), type(termList)

Out[294]:

(40, list)

In [295]:

from collections import defaultdict

documents = defaultdict(lambda: defaultdict(int)) # DTM
vocabulary = list()
# i : 문서제목 / d : i번째 문서 내 단어목록 
for i, d in enumerate(termList):  
    for term in d:
        documents[i][term.lower()] += 1
        vocabulary.append(term.lower())
        
vocabulary = list(set(vocabulary))

In [297]:

# D : docu, a,b
# alpha, beta 만들기 
a = 0.1
b = 0.1

K = 3 # 전체 토픽 수
M = len(documents) # 전체 문서의 수 
V = len(vocabulary) # 전체 단어의 수 
# N은 특정 문서마다 항상 다르다. 

# 특정 토픽에 몇 개의 단어가 있는지 -> 분모
# 특정 토픽 : sum(단어)
topicTermCount = defaultdict(int)

# 특정 문서의 단어에 상관없이 토픽 할당 횟수 
docTopicDistribution = defaultdict(lambda: defaultdict(int))
# [document][0번째토픽: 몇 개의 단어, 1번째 토픽:몇 개의 단어]

# 문서에 상관없이 특정 단어의 토픽 할당 횟수 
topicTermDistribution = defaultdict(lambda: defaultdict(int))
# [topic][vocab 0 : 몇 번, ... , n]

# Z_ml = m번째 문서 1번째 단어의 Topic
# M개의 문서만큼 -> N개의 단어 -> Topic
termTopicAssignmentMatrix = defaultdict(lambda:defaultdict(int))
# Z[documents][term] = topic
# n(i,(j,r)) = i번째 토픽의 횟수, j번째 문서의 r번째 단어

In [299]:

from random import randrange,seed

seed(0)

for i,t in enumerate(termList):
    for j, term in enumerate(t):
        token = term.lower()
        topic = randrange(K)
        
        topicTermCount[topic] += 1
        docTopicDistribution[i][topic] += 1
        topicTermDistribution[topic][term] += 1
        termTopicAssignmentMatrix[i][j] = topic

In [300]:

from random import random

def collapsedGibbsSampling(i,term):
    sampling = list()
    # k번째 토픽에 대한 확률     
    for k in range(K):
        sampling.append(likelighoodAlpha(i,k) * likelighoodBeta(k,term))
    # 0~1의 실수값을 가짐 
    threshold =  sum(sampling) * random()   
    
    for topicNo, topicProbability in enumerate(sampling):
        threshold -= topicProbability
        
        if threshold <= 0.0:
            return topicNo
    
#     print(sampling)
#     return termTopicAssignmentMatrix[i][term]

In [301]:

def likelighoodAlpha(i,k):
    return docTopicDistribution[i][k] + a

In [302]:

def likelighoodBeta(k,term):
    return  (topicTermDistribution[k][term] +b) / (topicTermCount[k] + b * V)

In [304]:

iterationNumber = 1000

for _ in range(iterationNumber):
    # m을 고정, l을 고정해야함 -> topicTermAssingnmentMatrix
    # m,l => i, j
    for i,t in enumerate(termList):
        for j,term in enumerate(t):
            topic = termTopicAssignmentMatrix[i][j]
            
            topicTermCount[topic] -= 1
            docTopicDistribution[i][topic] -= 1
            topicTermDistribution[topic][term] -= 1
            
            topic = collapsedGibbsSampling(i,term)
            
            topicTermCount[topic] += 1
            docTopicDistribution[i][topic] += 1
            topicTermDistribution[topic][term] += 1
            
            termTopicAssignmentMatrix[i][j] = topic

In [306]:

topicTermCount

Out[306]:

defaultdict(int, {1: 1541, 0: 2892, 2: 2704})

In [314]:

# topicTermDistribution
# 결과가 기니 직접 쳐서 확인해보자

In [308]:

docTopicDistribution

Out[308]:

defaultdict(<function __main__.<lambda>()>,
            {0: defaultdict(int, {1: 67, 0: 75, 2: 25}),
             1: defaultdict(int, {2: 69, 0: 45, 1: 1}),
             2: defaultdict(int, {1: 2, 2: 48, 0: 66}),
             3: defaultdict(int, {2: 62, 0: 51, 1: 124}),
             4: defaultdict(int, {0: 182, 2: 0, 1: 0}),
             5: defaultdict(int, {2: 104, 1: 232, 0: 34}),
             6: defaultdict(int, {2: 157, 1: 0, 0: 40}),
             7: defaultdict(int, {2: 107, 0: 24, 1: 0}),
             8: defaultdict(int, {2: 87, 1: 117, 0: 25}),
             9: defaultdict(int, {1: 0, 2: 110, 0: 33}),
             10: defaultdict(int, {0: 25, 2: 54, 1: 0}),
             11: defaultdict(int, {1: 22, 0: 36, 2: 90}),
             12: defaultdict(int, {2: 78, 0: 83, 1: 39}),
             13: defaultdict(int, {1: 23, 2: 142, 0: 43}),
             14: defaultdict(int, {1: 91, 2: 79, 0: 46}),
             15: defaultdict(int, {0: 46, 2: 58, 1: 104}),
             16: defaultdict(int, {1: 0, 2: 97, 0: 13}),
             17: defaultdict(int, {2: 0, 0: 129, 1: 0}),
             18: defaultdict(int, {0: 103, 1: 0, 2: 0}),
             19: defaultdict(int, {0: 74, 1: 3, 2: 73}),
             20: defaultdict(int, {0: 101, 2: 81, 1: 82}),
             21: defaultdict(int, {0: 89, 2: 29, 1: 0}),
             22: defaultdict(int, {2: 60, 0: 47, 1: 66}),
             23: defaultdict(int, {2: 33, 1: 41, 0: 43}),
             24: defaultdict(int, {2: 89, 0: 45, 1: 0}),
             25: defaultdict(int, {1: 0, 2: 99, 0: 53}),
             26: defaultdict(int, {2: 145, 1: 32, 0: 51}),
             27: defaultdict(int, {2: 1, 0: 149, 1: 5}),
             28: defaultdict(int, {1: 24, 0: 105, 2: 61}),
             29: defaultdict(int, {2: 113, 1: 0, 0: 71}),
             30: defaultdict(int, {0: 77, 1: 6, 2: 73}),
             31: defaultdict(int, {0: 91, 2: 46, 1: 43}),
             32: defaultdict(int, {1: 0, 2: 0, 0: 108}),
             33: defaultdict(int, {0: 86, 2: 81, 1: 34}),
             34: defaultdict(int, {1: 0, 0: 121, 2: 0}),
             35: defaultdict(int, {1: 75, 0: 4, 2: 60}),
             36: defaultdict(int, {0: 89, 1: 189, 2: 124}),
             37: defaultdict(int, {0: 243, 2: 0, 1: 22}),
             38: defaultdict(int, {1: 64, 2: 113, 0: 72}),
             39: defaultdict(int, {0: 74, 1: 33, 2: 56})})

In [315]:

# termTopicAssignmentMatrix
# 결과가 기니 직접 쳐서 확인해보자

2. Sentiment_Analysis

Sentiment : A view or opinion that is held or expressed, general feeling or opinion

Lexicon Based Learning의 방법으로 접할 것이다.

PMI,SO(극성이 양수인지 음수인지)를 이용

PMI : MI(Mutual Information) 두 변수에 대해 의존성을 보기 위함

$\sum_{x \in X} \sum_{y \in Y} P_{X, Y} (x, y) \log \frac{P_{X, Y} (x, y)}{P_{X} (x) P_{Y} (y)}$

$P M I (x; y) = \log \frac{P_{X, Y} (x, y)}{P_{X} (x) P_{Y} (y)}$

PMI가 높으면 두 단어가 가깝다는 의미이다.

내일 영화 수집을 할 것 이다.

https://movie.naver.com/movie/point/af/list.nhn

csv 형태로 저장 ( 평점과 리뷰만을 가져옴 + 영화명 가져와서 영화별 평가하기 )

728x90

저작자표시 (새창열림)

[NLP] Day 37 - LDA & Sentiment_Analysis

1. LDA_Implementation

2. Sentiment_Analysis

티스토리툴바