[Text Analysis] 3-4 TF-IDF Text Concepts

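Overview

TF-IDF is a formula for measuring how important a word is to a document.
The TF-IDF value is proportional to the number of times the word appears in the document and inversely proportional to how often the word appears across the corpus.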

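TF

    Term frequency: how often a term appears in a document.

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

$n_{i,j}$: number of times term $t_i$ appears in document $d_j$
$\sum_k n_{k,j}$: total number of terms in document $d_j$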
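The definition above can be sketched in a few lines of Python. This is only a minimal illustration, not the article's own code: the function name `term_frequency` and the sample sentence are assumptions made for the example, and the document is assumed to be already split into tokens.

```python
from collections import Counter


def term_frequency(term, tokens):
    """Occurrences of `term` divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        # An empty document has no terms, so every frequency is 0.
        return 0.0
    return counts[term] / total


# Hypothetical sample document, split on whitespace.
tokens = "the cat sat on the mat".split()
tf_the = term_frequency("the", tokens)  # "the" occurs 2 times among 6 tokens
tf_dog = term_frequency("dog", tokens)  # "dog" does not occur at all
```

Dividing by the document length keeps long documents from scoring higher just because they contain more words.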

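IDF

    Inverse document frequency: how informative a term is across the corpus. The fewer documents contain the term, the higher its IDF.

$$\mathrm{idf}_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$

$|D|$: total number of documents in the corpus
$|\{j : t_i \in d_j\}|$: number of documents that contain term $t_i$
$1$: added to the denominator so that it is never 0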
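As a rough sketch of this definition, the snippet below counts the documents that contain a term and applies the logarithm with the extra 1 in the denominator. The function name `inverse_document_frequency`, the natural logarithm, and the four sample documents are assumptions for illustration; the corpus is assumed to be non-empty and already tokenized.

```python
import math


def inverse_document_frequency(term, tokenized_docs):
    """log(|D| / (1 + number of documents containing `term`))."""
    containing = sum(1 for doc in tokenized_docs if term in doc)
    return math.log(len(tokenized_docs) / (1 + containing))


# Hypothetical corpus of four short documents.
docs = [
    "green plant in the study".split(),
    "a very nice fellow".split(),
    "away from the office".split(),
    "the candlestick".split(),
]
idf_green = inverse_document_frequency("green", docs)  # log(4 / (1 + 1))
idf_the = inverse_document_frequency("the", docs)      # log(4 / (1 + 3))
```

Here "green" appears in one document and gets a positive weight, while "the" appears in three of the four documents and its weight drops to zero. With this form of smoothing, a term found in every document would get a slightly negative value.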

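TF-IDF

The TF-IDF weight of a term in a document is the product of the two values:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$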

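Example:

Suppose a document contains 100 words in total and the word 「大角怪」 appears in it 5 times,
and 「大角怪」 appears in 1,000 documents out of a corpus of 10,000,000 documents.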
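The numbers in this example can be worked through as follows. The text does not state the base of the logarithm, so this sketch assumes base 10, and it shows the IDF both without and with the extra 1 in the denominator. All variable names are chosen here for illustration.

```python
import math

words_in_document = 100
occurrences = 5
documents_with_term = 1_000
total_documents = 10_000_000

# TF: 5 occurrences out of 100 words.
tf_value = occurrences / words_in_document

# IDF without the +1 in the denominator: log10(10,000,000 / 1,000).
idf_plain = math.log10(total_documents / documents_with_term)

# IDF with the +1 in the denominator: log10(10,000,000 / 1,001).
idf_smoothed = math.log10(total_documents / (1 + documents_with_term))

tfidf_plain = tf_value * idf_plain
tfidf_smoothed = tf_value * idf_smoothed
```

Under these assumptions the TF is 0.05, the unsmoothed IDF is 4, and the TF-IDF is 0.2. The smoothed variant comes out only marginally lower, because adding 1 to a count of 1,000 barely changes the ratio.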

Term weighting

Code example

Formula functions

tf

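from math import log

def tf(term, doc, normalize=True):
    # Lowercase the document and split it into whitespace-separated tokens.
    doc = doc.lower().split()
    if normalize:
        # Relative frequency: occurrences divided by the document length.
        result = doc.count(term.lower()) / float(len(doc))
    else:
        # Raw count.
        result = doc.count(term.lower()) / 1
    return result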

idf

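def idf(term, docs):
    # Count the documents that contain the term (case-insensitive).
    num_text_with_term = len(
        [True for doc in docs if term.lower() in doc.lower().split()])
    # This variant adds 1 outside the logarithm instead of adding 1 to the
    # denominator, and falls back to 1.0 when no document contains the term.
    try:
        return 1.0 + log(len(docs) / num_text_with_term)
    except ZeroDivisionError:
        return 1.0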

tf-idf

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

Using the functions

Declaring the data

corpus = {
    'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.',
    'b': 'Professor Plumb has a green plant in his study ',
    'c': "Miss Scarlett watered Professor Plumb's green plant while he was away from his office last week.",
}
# str.lower() converts text to lowercase
# str.split() splits text into tokens
QUERY_TERMS = ['green']

Plugging in the values

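for term in [t.lower() for t in QUERY_TERMS]:
    # TF of the term in each document.
    for doc in sorted(corpus):
        print('TF(%s): %s' % (doc, term), tf(term, corpus[doc]))
    # IDF of the term across the whole corpus.
    print('IDF: %s' % (term, ), idf(term, corpus.values()), "\n")
    # TF-IDF of the term in each document.
    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print('TF-IDF(%s): %s' % (doc, term), score, "\n")
        # With several query terms, the tf*idf scores can be summed here.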
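When a query has more than one term, one common way to rank documents is to add up the TF-IDF score of each query term per document. The sketch below illustrates that idea under stated assumptions: `rank`, the compact `tf` and `idf` helpers, and the three-document sample corpus are written for this example and only mirror the shape of the functions above. Every document is assumed to be non-empty.

```python
from math import log


def tf(term, doc):
    # Relative frequency of the term among whitespace-separated tokens.
    words = doc.lower().split()
    return words.count(term.lower()) / len(words)


def idf(term, docs):
    # 1 + log(N / df), falling back to 1.0 when no document has the term.
    containing = sum(1 for d in docs if term.lower() in d.lower().split())
    return 1.0 + log(len(docs) / containing) if containing else 1.0


def rank(query_terms, corpus):
    # Sum tf * idf over all query terms for each document.
    docs = list(corpus.values())
    scores = {name: 0.0 for name in corpus}
    for term in query_terms:
        weight = idf(term, docs)
        for name, doc in corpus.items():
            scores[name] += tf(term, doc) * weight
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample corpus.
sample = {
    'x': 'green plant green study',
    'y': 'green office',
    'z': 'red office chair',
}
ranking = rank(['green', 'study'], sample)
```

In this sample, document `x` ranks first because it matches both query terms, `y` follows with a match on "green" only, and `z` scores zero.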

Using a package

Declaring the data

import nltk

terms = "Develop daily routines before and after school—for example, things to pack for school in the morning (like hand sanitizer and a backup mask) and things to do when you return home (like washing hands immediately and washing worn cloth masks). Wash your hands immediately after taking off a mask.People who live in multi-generational households may find it difficult to take precautions to protect themselves from COVID-19 or isolate those who are sick, especially if space in the household is limited and many people live in the same household. CDC recently created guidance for multi-generational households. Although the guidance was developed as part of CDC’s outreach to tribal communities, the information could be useful for all families, including those with both children and older adults in the same home."
# Tokenize: TextCollection expects a list of texts, so each sentence
# becomes one document stored as a list of words.
text = [sentence.split() for sentence in terms.split('.') if sentence.strip()]
# Hand the tokenized documents to NLTK.
tc = nltk.TextCollection(text)
# Term to look up.
term = 'a'
# Index of the document to score.
idx = 0

Computing the values

print('TF(%s): %s' % ('a', term), tc.tf(term, text[idx]))
# If a term does not appear in the corpus, 0.0 is returned.
print('IDF(%s): %s' % ('a', term), tc.idf(term))
print('TF-IDF(%s): %s' % ('a', term), tc.tf_idf(term, text[idx]))

Output

