[Text Analysis] 3-4 TF-IDF Concepts

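Overview

TF-IDF is a formula for measuring how important a word is to a document.
The TF-IDF value increases with the number of times the word occurs in the document and decreases with how frequently the word appears across the corpus.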

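TF

    Term frequency: how often a term occurs in a document.

    $$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

$n_{i,j}$: number of times the term $t_i$ occurs in document $d_j$
$\sum_k n_{k,j}$: total number of terms in document $d_j$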

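IDF

    Inverse document frequency: how much weight a term carries across the corpus. The fewer documents contain the term, the higher the value.

    $$\mathrm{idf}_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$

$|D|$: total number of documents in the corpus
$|\{j : t_i \in d_j\}|$: number of documents that contain the term $t_i$
$1$: added to the denominator so that it is never 0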
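
The IDF definition above, with 1 added to the denominator, can be sketched directly in Python. This is a minimal sketch under assumptions: the text does not state the logarithm base, so the natural logarithm is used here, documents are tokenized on whitespace, and the function name `smoothed_idf` and the sample `docs` list are hypothetical.

```python
from math import log

def smoothed_idf(term, docs):
    """IDF with 1 added to the denominator: log(|D| / (1 + df))."""
    term = term.lower()
    # df: number of documents that contain the term at least once
    df = sum(1 for doc in docs if term in doc.lower().split())
    # The +1 keeps the denominator non-zero even when df is 0
    return log(len(docs) / (1 + df))

# Hypothetical toy corpus, for illustration only
docs = ["the cat sat", "the dog barked", "a bird sang", "rain fell"]
print(smoothed_idf("the", docs))    # term found in 2 of 4 documents
print(smoothed_idf("zebra", docs))  # term found in no document
```

One consequence of this form is that a term found in every document gets a slightly negative value, because the denominator then exceeds the number of documents.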

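TF-IDF

    $$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$

Example:

Suppose a document contains 100 words in total and the word "大角怪" occurs 5 times in it.
Suppose also that "大角怪" appears in 1,000 documents, out of a corpus of 10,000,000 documents.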
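
The arithmetic for this example can be checked with a few lines of Python. This is a sketch that assumes a base-10 logarithm, which the text does not specify, and that follows the IDF definition with 1 added to the denominator. The variable names are illustrative.

```python
from math import log10

total_words = 100          # words in the document
term_count = 5             # occurrences of the term in the document
docs_with_term = 1_000     # documents that contain the term
total_docs = 10_000_000    # documents in the corpus

# Term frequency: 5 / 100
tf_value = term_count / total_words
# Inverse document frequency: just under 4 because of the +1
idf_value = log10(total_docs / (1 + docs_with_term))
# TF-IDF: just under 0.2
tfidf_value = tf_value * idf_value

print(tf_value, idf_value, tfidf_value)
```

Without the 1 in the denominator the IDF would be exactly 4 and the TF-IDF exactly 0.2, so the smoothing term barely changes the result for a corpus of this size.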

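Term weighting

Code examples

Formula functions

tf

from math import log

def tf(term, doc, normalize=True):
    # Lowercase the document and split it into words on whitespace
    words = doc.lower().split()
    count = words.count(term.lower())
    if normalize:
        # Relative frequency: occurrences divided by document length
        return count / float(len(words))
    # Raw count when normalization is turned off
    return float(count)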

idf

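def idf(term, docs):
    # Number of documents that contain the term at least once
    num_text_with_term = len(
        [True for doc in docs if term.lower() in doc.lower().split()])
    try:
        # Note: this variant adds 1.0 to the natural logarithm
        # instead of adding 1 to the denominator
        return 1.0 + log(float(len(docs)) / num_text_with_term)
    except ZeroDivisionError:
        # The term occurs in no document
        return 1.0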

tf-idf

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

Using the functions

Declaring the data

corpus = {
    'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.',
    'b': 'Professor Plumb has a green plant in his study ',
    'c': "Miss Scarlett watered Professor Plumb's green plant while he was away from his office last week.",
}
# lower() converts text to lowercase
# split() splits text into words on whitespace
QUERY_TERMS = ['green']

Applying the formulas

for term in [t.lower() for t in QUERY_TERMS]:
    # TF of the term in each document
    for doc in sorted(corpus):
        print('TF(%s): %s' % (doc, term), tf(term, corpus[doc]))
    # IDF of the term over the whole corpus
    print('IDF: %s' % (term, ), idf(term, corpus.values()), "\n")
    # TF-IDF of the term in each document (tf multiplied by idf)
    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print('TF-IDF(%s): %s' % (doc, term), score, "\n")
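
When a query has more than one term, a common way to use these scores is to add up each document's TF-IDF values over all query terms and rank the documents by the total. The following is a minimal sketch of that idea, not the article's code: the helper `rank_documents` and the two-term query are hypothetical, and `tf` and `idf` are restated compactly so the block runs on its own.

```python
from math import log

def tf(term, doc):
    # Relative frequency of the term among the whitespace-separated words
    words = doc.lower().split()
    return words.count(term.lower()) / len(words)

def idf(term, docs):
    # 1.0 + ln(N / df), or 1.0 when no document contains the term
    docs = list(docs)
    n = sum(1 for d in docs if term.lower() in d.lower().split())
    return 1.0 + log(len(docs) / n) if n else 1.0

def rank_documents(query_terms, corpus):
    """Sum tf * idf over the query terms per document, highest score first."""
    scores = {name: 0.0 for name in corpus}
    for term in query_terms:
        term_idf = idf(term, corpus.values())
        for name, doc in corpus.items():
            scores[name] += tf(term, doc) * term_idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

corpus = {
    'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.',
    'b': 'Professor Plumb has a green plant in his study ',
    'c': "Miss Scarlett watered Professor Plumb's green plant while he was away from his office last week.",
}
for name, score in rank_documents(['green', 'plant'], corpus):
    print(name, round(score, 4))
```

Because "green" occurs in all three documents its IDF adds nothing beyond the constant 1.0, so the ranking for this query is driven mostly by "plant" and by document length.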

Using a package

Declaring the data

import nltk

terms = "Develop daily routines before and after school—for example, things to pack for school in the morning (like hand sanitizer and a backup mask) and things to do when you return home (like washing hands immediately and washing worn cloth masks). Wash your hands immediately after taking off a mask.People who live in multi-generational households may find it difficult to take precautions to protect themselves from COVID-19 or isolate those who are sick, especially if space in the household is limited and many people live in the same household. CDC recently created guidance for multi-generational households. Although the guidance was developed as part of CDC’s outreach to tribal communities, the information could be useful for all families, including those with both children and older adults in the same home."
# Split on whitespace and store the words in a list
text = terms.split()
# Hand the list to nltk's TextCollection
# (note: each word is passed as a separate text here)
tc = nltk.TextCollection(text)
# Term to look up
term = 'a'
# Index of the entry in text to score
idx = 0

Computing the scores

print('TF(%s): %s' % ('a', term), tc.tf(term, text[idx]))
# If a term does not appear in the corpus, 0.0 is returned.
print('IDF(%s): %s' % ('a', term), tc.idf(term))
print('TF-IDF(%s): %s' % ('a', term), tc.tf_idf(term, text[idx]))

About the author: site editor
