In [173]:
collection = [
    ("Document1", "This is a sample"),       # "a" is the distinctive term
    ("Document2", "This is another sample"), # "another" is the distinctive term
    ("Document3", "This is not sample")      # "not" is the distinctive term
]
query = "this is a sample"
In [161]:
# Build the collection-wide (global) structures
# in-memory (hash keyed by term)
# {term1: posting position, term2: posting position, ...}
globalLexicon = dict()
# [0: document name, 1: document name, ...]
globalDocument = list()
# on disk (out of memory)
# [0: (term idx, doc idx, frequency, next position), ...]
# this can live only in a file rather than in memory, because we track its positions
globalPosting = list()
In [162]:
for (docName, docContent) in collection:
    # docIdx stands in for a pointer/key; we assume document names never collide
    docIdx = len(globalDocument)
    globalDocument.append(docName)
    # {term: frequency, term: frequency, ...}
    localPosting = dict()
    # local pass: tokenize by whitespace
    for term in docContent.lower().split():
        if term not in localPosting.keys():
            localPosting[term] = 1
        else:
            localPosting[term] += 1
    # fp -> struct(term, freq) (localPosting)
    # this step merges and sorts at the same time
    for indexTerm, termFreq in localPosting.items():
        if indexTerm not in globalLexicon.keys():
            lexiconIdx = len(globalLexicon)
            postingIdx = len(globalPosting)  # fseek
            postingData = (lexiconIdx, docIdx, termFreq, -1)
            globalPosting.append(postingData)
            globalLexicon[indexTerm] = postingIdx  # position in globalPosting (ptr:idx)
        else:
            lexiconIdx = list(globalLexicon.keys()).index(indexTerm)
            postingIdx = len(globalPosting)
            beforeIdx = globalLexicon[indexTerm]
            postingData = (lexiconIdx, docIdx, termFreq, beforeIdx)
            globalPosting.append(postingData)
            globalLexicon[indexTerm] = postingIdx  # position in globalPosting (ptr:idx)
    # print(localPosting)
    # print(globalDocument)
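As a rough illustration of the layout (assuming the three-document collection above and Python 3.7+, where dicts preserve insertion order), the posting chain for "sample" can be read off by following the next-position field from the head stored in the lexicon:

# expected shape (illustrative) after the loop above has run:
# globalLexicon["sample"] == 11             head of the chain = most recent posting
# globalPosting[11]       == (3, 2, 1, 7)   (term 3, Document3, freq 1, next at 7)
# globalPosting[7]        == (3, 1, 1, 3)   (term 3, Document2, freq 1, next at 3)
# globalPosting[3]        == (3, 0, 1, -1)  (term 3, Document1, freq 1, end of chain)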
In [163]:
len(globalDocument)
Out[163]:
In [164]:
globalDocument.index("Document1")
Out[164]:
In [165]:
globalLexicon
Out[165]:
In [166]:
globalPosting
# a next pointer of -1 marks the end of a chain
# the third field is the frequency
Out[166]:
In [167]:
for indexTerm, postingIdx in globalLexicon.items():
    # indexTerm: the term, postingIdx: its position in globalPosting
    print(indexTerm)
    while True:  # follow the posting chain until the next pointer is -1
        if postingIdx == -1:
            break
        postingData = globalPosting[postingIdx]
        print("  {0} / {1} / {2}".format(globalDocument[postingData[1]], postingData[2], postingData[3]))
        postingIdx = postingData[3]
In [168]:
from math import log10

def rawTF(freq):
    return freq

def normTF(freq, totalCount):
    return (freq / totalCount)

def logTF(freq):
    if freq > 0:
        return 1 + log10(freq)
    else:
        return 0

def maxTF(a, freq, maxFreq):  # double normalization K -- typically a=0 for documents, a=0.5 for queries
    return a + ((1 - a) * (freq / maxFreq))
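A quick sanity check of the four weighting variants, with made-up numbers (a term that occurs twice in a 10-token document whose most frequent term occurs 4 times):

print(rawTF(2))          # 2
print(normTF(2, 10))     # 0.2
print(logTF(2))          # 1 + log10(2) ≈ 1.301
print(maxTF(0.5, 2, 4))  # 0.5 + 0.5 * (2 / 4) = 0.75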
In [169]:
globalTF = list()
for (docName, docContent) in collection:
    docIdx = globalDocument.index(docName)  # reuse the index assigned during indexing
    localPosting = dict()
    maxCount = 0
    # local pass: tokenize by whitespace
    for term in docContent.lower().split():
        maxCount += 1
        if term not in localPosting.keys():
            localPosting[term] = 1
        else:
            localPosting[term] += 1
    print(docName)
    a = 0.5
    maxFreq = max(localPosting.values())
    for term, freq in localPosting.items():
        print("1. {0} rawTF : {1}".format(term, rawTF(freq)))
        print("2. {0} normTF : {1}".format(term, normTF(freq, maxCount)))
        print("3. {0} logTF : {1}".format(term, logTF(freq)))
        print("4. {0} maxTF : {1}".format(term, maxTF(a, freq, maxFreq)))
        print()
        localPosting[term] = maxTF(a, freq, maxFreq)
    for indexTerm, termTF in localPosting.items():
        if indexTerm not in globalLexicon.keys():
            lexiconIdx = len(globalLexicon)
            postingIdx = len(globalTF)  # fseek
            postingData = (lexiconIdx, docIdx, termTF, -1)
            globalTF.append(postingData)
            globalLexicon[indexTerm] = postingIdx  # position in globalPosting (ptr:idx)
        else:
            lexiconIdx = list(globalLexicon.keys()).index(indexTerm)
            postingIdx = len(globalTF)
            beforeIdx = globalLexicon[indexTerm]
            postingData = (lexiconIdx, docIdx, termTF, beforeIdx)
            globalTF.append(postingData)
            globalLexicon[indexTerm] = postingIdx
In [170]:
print(globalPosting), print(globalTF)
Out[170]:
In [317]:
# standard IDF
def rawIdf(df, N):
    return log10(N / df)

# keeps stopwords like "the"/"a" from vanishing entirely => "to be or not to be"
def smoothingIdf(df, N):
    return log10((N + 1) / df)

def probabilityIdf(df, N):
    return log10((N - df + 1) / df)
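A quick comparison of the three variants for an illustrative collection of three documents (df = 3: the term occurs in every document; df = 1: it occurs in a single document):

nDocs = 3
print(rawIdf(3, nDocs), smoothingIdf(3, nDocs), probabilityIdf(3, nDocs))  # 0.0, ≈0.125, ≈-0.477
print(rawIdf(1, nDocs), smoothingIdf(1, nDocs), probabilityIdf(1, nDocs))  # ≈0.477, ≈0.602, ≈0.477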
In [318]:
collection = [
    ("Document1", "This is a sample"),       # "a" is the distinctive term
    ("Document2", "This is another sample"), # "another" is the distinctive term
    ("Document3", "This is not sample"),     # "not" is the distinctive term
]
query = "this is sample"
In [319]:
globalLexicon = dict()
globalDocument = list()
globalPosting = list()
for (docName, docContent) in collection:
    # docIdx stands in for a pointer/key; we assume document names never collide
    docIdx = len(globalDocument)
    globalDocument.append(docName)
    # {term: frequency, term: frequency, ...}
    localPosting = dict()
    # local pass: tokenize by whitespace
    for term in docContent.lower().split():
        if term not in localPosting.keys():
            localPosting[term] = 1
        else:
            localPosting[term] += 1
    maxFreq = max(localPosting.values())
    # fp -> struct(term, freq) (localPosting)
    # this step merges and sorts at the same time
    for indexTerm, termFreq in localPosting.items():
        if indexTerm not in globalLexicon.keys():
            lexiconIdx = len(globalLexicon)
            postingIdx = len(globalPosting)  # fseek
            postingData = [lexiconIdx, docIdx, maxTF(0, termFreq, maxFreq), -1]
            globalPosting.append(postingData)
            globalLexicon[indexTerm] = postingIdx  # position in globalPosting (ptr:idx)
        else:
            lexiconIdx = list(globalLexicon.keys()).index(indexTerm)
            postingIdx = len(globalPosting)
            beforeIdx = globalLexicon[indexTerm]
            postingData = [lexiconIdx, docIdx, maxTF(0, termFreq, maxFreq), beforeIdx]
            globalPosting.append(postingData)
            globalLexicon[indexTerm] = postingIdx  # position in globalPosting (ptr:idx)
In [320]:
globalPosting
# every weight is 1.0 here, since each term occurs exactly once per document
Out[320]:
In [321]:
# final pass (TF & IDF)
N = len(globalDocument)
globalLexiconIDF = dict()
for indexTerm, postingIdx in globalLexicon.items():
    df = 0
    oldPostingIdx = postingIdx
    # first walk: count the document frequency (df) of the term
    while True:
        if postingIdx == -1:
            break
        df += 1
        postingData = globalPosting[postingIdx]
        postingIdx = postingData[3]
    postingIdx = oldPostingIdx
    idf = rawIdf(df, N)
    globalLexiconIDF[indexTerm] = idf
    print("{0} / IDF-{1}".format(indexTerm, idf))
    # second walk: turn each stored TF into a TF-IDF weight
    while True:
        if postingIdx == -1:
            break
        postingData = globalPosting[postingIdx]
        TF = postingData[2]
        postingData[2] = postingData[2] * idf
        print(" Document:{0} / TF:{1} / TF-IDF:{2}".format(globalDocument[postingData[1]],
                                                           TF,
                                                           globalPosting[postingIdx][2]))
        postingIdx = postingData[3]
In [322]:
def euclidean(x, y):
    # squared difference in one dimension; summed per document this gives a squared Euclidean distance
    return (x - y) ** 2
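Summed over all terms, these per-term values give the usual squared Euclidean distance; for example, with two made-up weight vectors:

q = [1.0, 0.5, 0.0]
d = [0.5, 0.5, 1.0]
print(sum(euclidean(x, y) for x, y in zip(q, d)))  # 0.5**2 + 0**2 + 1**2 = 1.25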
In [323]:
query  # treat the query like a tiny document
queryPosting = dict()
for term in query.lower().split():
    if term not in queryPosting.keys():
        queryPosting[term] = 1
    else:
        queryPosting[term] += 1
maxFreq = max(queryPosting.values())
# fp -> struct(term, freq) (localPosting)
# merge and weight, as above
for indexTerm, termFreq in queryPosting.items():
    queryPosting[indexTerm] = maxTF(0.5, termFreq, maxFreq)
In [324]:
queryPosting
Out[324]:
In [325]:
globalLexicon.items()
Out[325]:
In [326]:
candidateList = dict()  # still just the candidate set at this point
for indexTerm, postingIdx in globalLexicon.items():
    queryTFIDF = 0
    if indexTerm in queryPosting.keys():
        queryTFIDF = queryPosting[indexTerm] * globalLexiconIDF[indexTerm]
    while True:
        if postingIdx == -1:
            break
        postingData = globalPosting[postingIdx]
        postingIdx = postingData[3]
        documentWeight = postingData[2]
        if postingData[1] not in candidateList.keys():
            candidateList[postingData[1]] = euclidean(queryTFIDF, documentWeight)   # accumulate per document
        else:
            candidateList[postingData[1]] += euclidean(queryTFIDF, documentWeight)
In [327]:
resultList = sorted(candidateList.items(), key=lambda x: x[1])
for i, (documentIdx, distance) in enumerate(resultList):
    print("Rank: {0}. Document: {1} / Distance: {2}".format((i + 1), globalDocument[documentIdx], distance))
In [328]:
query = "this is a sample"
queryPosting = dict()
for term in query.lower().split():
if term not in queryPosting.keys():
queryPosting[term] = 1
else:
queryPosting[term] += 1
maxFreq = max(queryPosting.values())
# fp -> struct(단어,빈도) ( localPosting)
# Merge와 sorting이 같이 있는 것
for indexTerm, termFreq in queryPosting.items():
queryPosting[indexTerm] = maxTF(0.5,termFreq,maxFreq)
candidateList = dict() # 아직까진 검색후보군이어서
for indexTerm, postingIdx in globalLexicon.items():
queryTFIDF = 0
if indexTerm in queryPosting.keys():
queryTFIDF = queryPosting[indexTerm] * globalLexiconIDF[indexTerm]
while True:
if postingIdx == -1:
break
postingData = globalPosting[postingIdx]
postingIdx = postingData[3]
documentWeight = postingData[2]
if postingData[1] not in candidateList.keys():
candidateList[postingData[1]] = euclidean(queryTFIDF,documentWeight) # 각 다큐먼트마다 누적시켜야함
else:
candidateList[postingData[1]] += euclidean(queryTFIDF,documentWeight)
resultList = sorted(candidateList.items(), key=lambda x:x[1])
for i, (documentIdx, distance) in enumerate(resultList):
print("순위: {0}. 문서 : {1} / Distance:{2}".format((i+1),globalDocument[documentIdx], distance))
print(" {0}".format(collection[documentIdx][1]))
# 거리가 0인것은 아예 같다는 것.
In [348]:
collection = [
    ("Document1", "This is a sample"),       # "a" is the distinctive term
    ("Document2", "This is another sample"), # "another" is the distinctive term
    ("Document3", "This is not sample"),     # "not" is the distinctive term
    ("Document4", "a not sample"),
    ("Document5", "not"),
    ("Document6", "not sample"),
]
query = "this is sample"
In [349]:
globalLexicon = dict()
globalDocument = list()
globalPosting = list()
for (docName, docContent) in collection:
    # docIdx stands in for a pointer/key; we assume document names never collide
    docIdx = len(globalDocument)
    globalDocument.append(docName)
    # {term: frequency, term: frequency, ...}
    localPosting = dict()
    # local pass: tokenize by whitespace
    for term in docContent.lower().split():
        if term not in localPosting.keys():
            localPosting[term] = 1
        else:
            localPosting[term] += 1
    maxFreq = max(localPosting.values())
    # fp -> struct(term, freq) (localPosting)
    # this step merges and sorts at the same time
    for indexTerm, termFreq in localPosting.items():
        if indexTerm not in globalLexicon.keys():
            lexiconIdx = len(globalLexicon)
            postingIdx = len(globalPosting)  # fseek
            postingData = [lexiconIdx, docIdx, maxTF(0, termFreq, maxFreq), -1]
            globalPosting.append(postingData)
            globalLexicon[indexTerm] = postingIdx  # position in globalPosting (ptr:idx)
        else:
            lexiconIdx = list(globalLexicon.keys()).index(indexTerm)
            postingIdx = len(globalPosting)
            beforeIdx = globalLexicon[indexTerm]
            postingData = [lexiconIdx, docIdx, maxTF(0, termFreq, maxFreq), beforeIdx]
            globalPosting.append(postingData)
            globalLexicon[indexTerm] = postingIdx  # position in globalPosting (ptr:idx)
In [350]:
# final pass (TF & IDF)
N = len(globalDocument)
globalLexiconIDF = dict()
for indexTerm, postingIdx in globalLexicon.items():
    df = 0
    oldPostingIdx = postingIdx
    # first walk: count the document frequency (df) of the term
    while True:
        if postingIdx == -1:
            break
        df += 1
        postingData = globalPosting[postingIdx]
        postingIdx = postingData[3]
    postingIdx = oldPostingIdx
    idf = smoothingIdf(df, N)
    globalLexiconIDF[indexTerm] = idf
    print("{0} / IDF-{1}".format(indexTerm, idf))
    # second walk: turn each stored TF into a TF-IDF weight
    while True:
        if postingIdx == -1:
            break
        postingData = globalPosting[postingIdx]
        TF = postingData[2]
        postingData[2] = postingData[2] * idf
        print(" Document:{0} / TF:{1} / TF-IDF:{2}".format(globalDocument[postingData[1]],
                                                           TF,
                                                           globalPosting[postingIdx][2]))
        postingIdx = postingData[3]
In [362]:
query = "not"
queryPosting = dict()
for term in query.lower().split():
if term not in queryPosting.keys():
queryPosting[term] = 1
else:
queryPosting[term] += 1
maxFreq = max(queryPosting.values())
# fp -> struct(단어,빈도) ( localPosting)
# Merge와 sorting이 같이 있는 것
for indexTerm, termFreq in queryPosting.items():
queryPosting[indexTerm] = maxTF(0.5,termFreq,maxFreq)
candidateList = dict() # 아직까진 검색후보군이어서
for indexTerm, postingIdx in globalLexicon.items():
queryTFIDF = 0
if indexTerm in queryPosting.keys():
queryTFIDF = queryPosting[indexTerm] * globalLexiconIDF[indexTerm]
while True:
if postingIdx == -1:
break
postingData = globalPosting[postingIdx]
postingIdx = postingData[3]
documentWeight = postingData[2]
if postingData[1] not in candidateList.keys():
candidateList[postingData[1]] = euclidean(queryTFIDF,documentWeight) # 각 다큐먼트마다 누적시켜야함
else:
candidateList[postingData[1]] += euclidean(queryTFIDF,documentWeight)
resultList = sorted(candidateList.items(), key=lambda x:x[1])
print(query)
for i, (documentIdx, distance) in enumerate(resultList):
print("순위: {0}. 문서 : {1} / Distance:{2}".format((i+1),globalDocument[documentIdx], distance))
print(" {0}".format(collection[documentIdx][1]))
# 거리가 0인것은 아예 같다는 것.
In [363]:
def innerProduct(x, y):
    # one term's contribution to the dot product between query and document vectors
    return x * y
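Accumulating these products per document and dividing by the document vector's length gives a cosine-style score (the query norm is the same for every document, so omitting it does not change the ranking). A minimal sketch with made-up vectors:

from math import sqrt

q = [1.0, 0.5]
d = [0.8, 0.4]
dot = sum(innerProduct(x, y) for x, y in zip(q, d))
print(dot / sqrt(sum(w ** 2 for w in d)))  # ≈ 1.118 = |q|, since q and d are parallel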
In [364]:
candidateList = dict()
for indexTerm, queryWeight in queryPosting.items():
    if indexTerm in globalLexicon.keys():
        postingIdx = globalLexicon[indexTerm]
        while True:
            if postingIdx == -1:
                break
            postingData = globalPosting[postingIdx]
            postingIdx = postingData[3]
            documentWeight = postingData[2]
            if postingData[1] not in candidateList.keys():
                candidateList[postingData[1]] = innerProduct(queryWeight, documentWeight)   # accumulate per document
            else:
                candidateList[postingData[1]] += innerProduct(queryWeight, documentWeight)
# normalise by document vector length (globalDocumentLength is built in the next cell)
for documentIdx, sumProduct in candidateList.items():
    candidateList[documentIdx] /= globalDocumentLength[documentIdx] ** 0.5   # sqrt of the summed squares
In [365]:
# final pass (TF & IDF), this time also accumulating document vector lengths
N = len(globalDocument)
globalLexiconIDF = dict()
globalDocumentLength = dict()
for indexTerm, postingIdx in globalLexicon.items():
    df = 0
    oldPostingIdx = postingIdx
    # first walk: count the document frequency (df) of the term
    while True:
        if postingIdx == -1:
            break
        df += 1
        postingData = globalPosting[postingIdx]
        postingIdx = postingData[3]
    postingIdx = oldPostingIdx
    idf = smoothingIdf(df, N)
    globalLexiconIDF[indexTerm] = idf
    print("{0} / IDF-{1}".format(indexTerm, idf))
    # second walk: convert TF to TF-IDF and accumulate the squared weight per document
    while True:
        if postingIdx == -1:
            break
        postingData = globalPosting[postingIdx]
        TF = postingData[2]
        postingData[2] = postingData[2] * idf
        print(" Document:{0} / TF:{1} / TF-IDF:{2}".format(globalDocument[postingData[1]],
                                                           TF,
                                                           globalPosting[postingIdx][2]))
        if postingData[1] not in globalDocumentLength.keys():
            globalDocumentLength[postingData[1]] = postingData[2] ** 2    # accumulate per document
        else:
            globalDocumentLength[postingData[1]] += postingData[2] ** 2
        postingIdx = postingData[3]   # advance only after using this posting's weight
In [367]:
resultList = sorted(candidateList.items(), key=lambda x: x[1], reverse=True)
print(query)
for i, (documentIdx, similarity) in enumerate(resultList):
    print("Rank: {0}. Document: {1} / Similarity: {2}".format((i + 1), globalDocument[documentIdx], similarity))
    print("  {0}".format(collection[documentIdx][1]))
# here higher is better: the larger the similarity, the closer the document is to the query
In [401]:
from konlpy.corpus import kobill

def getLexiconBySet():
    lexicon = list()
    for docName in kobill.fileids():
        document = kobill.open(docName).read()
        for token in document.split():
            lexicon.append(token)   # collect whole tokens (extend() would add individual characters)
    return list(set(lexicon))
In [402]:
from collections import defaultdict
from konlpy.tag import Kkma

ma = Kkma().morphs

def getDocReprByDefaultDict(lexicon):
    docRepr = defaultdict(lambda: defaultdict(int))
    for docName in kobill.fileids():
        document = kobill.open(docName).read()
        for token in document.split():
            for morpheme in ma(token):
                docRepr[docName][morpheme] += 1
    return docRepr
In [408]:
txt = getLexiconBySet()
DTM = getDocReprByDefaultDict(txt)
In [412]:
# invertedDocument (inverted file structure, keyed by term)
def convertInvertedDocument(DTM):
    TDM = defaultdict(lambda: defaultdict(int))
    for fileName, termList in DTM.items():
        maxFreq = max(termList.values())
        for term, freq in termList.items():
            TDM[term][fileName] = maxTF(0, freq, maxFreq)
    return TDM
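On a tiny made-up DTM (no konlpy required) the inversion works like this:

toyDTM = {"d1": {"a": 2, "b": 1}, "d2": {"a": 1}}
toyTDM = convertInvertedDocument(toyDTM)
print(dict(toyTDM["a"]))  # {'d1': 1.0, 'd2': 1.0}
print(dict(toyTDM["b"]))  # {'d1': 0.5}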
In [434]:
# to record frequencies, getDocReprByDefaultDict above uses += rather than =
TDM = convertInvertedDocument(DTM)
# TDM
In [414]:
# term-document matrix -> term-weight matrix
# with defaultdict there is no need to worry about missing keys
N = len(DTM)

def TDM2TWM(TDM):
    TWM = defaultdict(lambda: defaultdict(float))
    DVL = defaultdict(float)   # document vector lengths (sums of squared weights)
    for term, tfList in TDM.items():
        df = len(tfList)
        idf = rawIdf(df, N)
        for fileName, tf in tfList.items():
            TWM[term][fileName] = tf * idf
            DVL[fileName] += TWM[term][fileName] ** 2
    return TWM, DVL
In [415]:
TWM,DVL = TDM2TWM(TDM)
In [433]:
#TWM
In [417]:
DVL
Out[417]:
In [423]:
query = "국방의 의무와 보편적 교육에 대한 법안"
In [425]:
queryRepr = defaultdict(int)   # raw term frequencies of the query
for token in query.split():
    for morpheme in ma(token):
        queryRepr[morpheme] += 1
queryWeight = defaultdict(float)
maxFreq = max(queryRepr.values())
for token, freq in queryRepr.items():
    if token in TWM.keys():
        tf = maxTF(0.5, freq, maxFreq)
        df = len(TWM[token])
        idf = rawIdf(df, N)
        queryWeight[token] = tf * idf
In [426]:
queryWeight
Out[426]:
In [435]:
from math import sqrt

candidateList = defaultdict(float)
for token, weight in queryWeight.items():
    for fileName, tfidf in TWM[token].items():
        print(" {0} : {1} = {2} * {3}".format(token, fileName, weight, tfidf))
        candidateList[fileName] += innerProduct(weight, tfidf)
for fileName, sumProduct in candidateList.items():
    candidateList[fileName] /= sqrt(DVL[fileName])
In [439]:
from nltk.tokenize import sent_tokenize

K = 5
resultList = sorted(candidateList.items(), key=lambda x: x[1], reverse=True)
for i, (fileName, similarity) in enumerate(resultList):
    if i < K:
        print(" Rank:{0} / Document:{1} / Similarity:{2:.4f}".format((i + 1), fileName, similarity))
        content = kobill.open(fileName).read()
        print(sent_tokenize(content)[:5])