Semantic Search NLP (BM25 + wordcloud + LSA summary)

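This article covers:

Semantic search over an IMDB plot file (100 entries) --> find the entry with the highest BM25 score in the text column 'imdb_plot' --> show its top 10 words as a wordcloud --> summarize it in three sentences with the LSA method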

Dataset source: Kaggle movies.csv
Code reference:
PyPI page for rank-bm25 (installation + examples)

The BM25 algorithm is a refined TF-IDF retrieval method. For the scoring formula, please see the Wikipedia article; today we only do the implementation. [Code is on GitHub]
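For readers who would like to see the scoring idea in code before handing it to the library, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. The function name bm25_scores, the toy corpus, and the default values k1=1.5 and b=0.75 are assumptions made for illustration. rank_bm25 computes its IDF term in its own way, so its numbers will not match this sketch exactly.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against the query with textbook Okapi BM25."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs

    # document frequency: in how many documents each term occurs
    doc_freq = Counter()
    for doc in tokenized_corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            n_t = doc_freq[term]
            # the "+ 1" inside the log keeps the IDF non-negative
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            # length normalisation: longer documents need more hits
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# toy corpus (made up, not from movies.csv)
toy_corpus = [
    "a composer writes music in Vienna".split(" "),
    "a soldier crosses the desert".split(" "),
    "a pianist survives the war".split(" "),
]
toy_scores = bm25_scores(["music"], toy_corpus)
print(toy_scores)
```

Only the first toy document contains the query word, so it is the only one with a non-zero score. That is the same pattern as the mostly-zero score array later in this article.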
Import the packages

from rank_bm25 import BM25Okapi
import pandas as pd
import os
#--- NLP summarize lib
import sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer as sumyToken
from sumy.summarizers.lsa import LsaSummarizer
#--- wordcloud
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from PIL import Image

Load the CSV and take the columns "title" and "imdb_plot"
http://img2.58codes.com/2024/201113732CljFdugci.jpg

# load from csv
df = pd.read_csv('movies.csv',dtype=object)
movies = df[['title','imdb_plot']]
mtitle = movies['title'].astype(str)
mimdb  = movies['imdb_plot'].astype(str)

Start using BM25

#--- tokenize
tokenized_corpus = [doc.split(" ") for doc in mimdb]
#--- initiate
bm = BM25Okapi(tokenized_corpus)
# query --> the word(s) to search for
# note: the trailing space adds an empty token '' to the query
query = "music "
tokenized_query = query.split(" ")
# compute the BM25 scores (log)
scores = bm.get_scores(tokenized_query)
idx = scores.argmax()

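scores.argmax() returns the index of the element with the highest score. We can use this index to look up the text in mtitle[idx] and mimdb[idx]. Let's first try a query with the keyword "music":
The best match (BM) is the entry at index 30, with a score of 3.11332...; the movie title is Amadeus, and the first 60 characters of its plot are "The story begins..."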

idx: 30
3.1133320601273993
Amadeus
The story begins in 1823 as the elderly Salieri attempts sui
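The code that printed these four lines is not shown in the article. One possible way to produce them is sketched below in plain Python; the helper name describe_best and the toy lists are assumptions that stand in for scores, mtitle and mimdb.

```python
def describe_best(scores, titles, plots, n_chars=60):
    """Return index, score, title and plot preview of the top-scoring entry."""
    idx = max(range(len(scores)), key=lambda i: scores[i])
    return idx, scores[idx], titles[idx], plots[idx][:n_chars]


# toy stand-ins (made up) for scores, mtitle and mimdb
toy_scores = [0.0, 1.4, 3.1]
toy_titles = ["Film A", "Film B", "Film C"]
toy_plots = ["Plot of film A.", "Plot of film B.", "Plot of film C."]

idx, score, title, preview = describe_best(toy_scores, toy_titles, toy_plots)
print("idx:", idx)
print(score)
print(title)
print(preview)
```

With the real data, idx would come from scores.argmax() as in the code above, and the preview would be mimdb[idx][:60].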

If we print the score of every entry...

[0.         1.45850928 0.         0.         0.         0.
 0.         1.9614662  0.         0.         0.         0.
 0.         0.         0.         0.         0.         2.7574558
 1.81637605 0.         0.         1.6851387  0.         0.
 0.         0.         0.         0.         0.         1.94504518
 3.11333206 0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         1.9093863
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.        .... (rest omitted)
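Most entries score 0 because they do not contain the query word at all. Instead of reading the raw array, it can be handier to list only the non-zero hits, best first. The sketch below shows one way to do that; rank_hits and the toy values are hypothetical and not part of the original code.

```python
def rank_hits(scores, titles, top_n=5):
    """Return (index, title, score) for non-zero scores, best first."""
    hits = [(i, titles[i], s) for i, s in enumerate(scores) if s > 0]
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return hits[:top_n]


# toy values (made up) standing in for scores and mtitle
toy_scores = [0.0, 1.46, 0.0, 3.11, 2.76]
toy_titles = ["Film A", "Film B", "Film C", "Film D", "Film E"]

top_hits = rank_hits(toy_scores, toy_titles, top_n=3)
for i, title, s in top_hits:
    print(i, title, s)
```

The first tuple returned is the same entry that argmax() would pick; the rest are the runners-up.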

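Wordcloud
Seeing is believing: make a wordcloud that shows the keywords of that plot text (using the mask alice_mask.png).
cloud.words_ is a dict that is already sorted, so we simply list the first 10...
--> save the image to a file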

#--- make wordcloud
def mkCloud(txt):
    mask = np.array(Image.open('alice_mask.png'))
    font = 'SourceHanSansTW-Regular.otf'
    cloud = WordCloud(background_color='white',mask=mask,font_path=font,
                      contour_width=3, contour_color='steelblue').generate(txt)
    plt.imshow(cloud)
    plt.axis("off")
    plt.show()
    # keywords: a dict that is already sorted
    keywords = cloud.words_
    mostly = list(keywords.keys())
    print('Top10 keywords: ',mostly[:10])
    mostkeys = str(mostly[:10])
    pmt = f'Top10 keywords in the text\n{mostkeys}'
    print(pmt)
    # save the wordcloud to a file
    destFile = 'bmFig.jpg'
    cloud.to_file(destFile)
    # show image on screen
    if os.path.exists(destFile):
        img = Image.open(destFile, 'r')
        img.show()

Top 10 words

Top10 keywords:  ['Salieri', 'Mozart', 'God', 'Requiem', 'music', 'priest', 'Vienna', 'Constanze', 'mass', 'begins']

http://img2.58codes.com/2024/201113737S6CBZ59st.jpg
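mkCloud relies on cloud.words_ already being in descending order. If you would rather not depend on dict order, the words can be sorted by weight explicitly. This is a small sketch with a hypothetical helper top_keywords and made-up weights, not values taken from the real wordcloud.

```python
def top_keywords(word_weights, n=10):
    """Return the n highest-weighted words from a word -> weight dict."""
    ranked = sorted(word_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# made-up weights, deliberately NOT in sorted order
toy_weights = {"desert": 0.4, "piano": 1.0, "letter": 0.6, "river": 0.9}
top3 = top_keywords(toy_weights, n=3)
print(top3)  # ['piano', 'river', 'letter']
```

Inside mkCloud this would be called as top_keywords(cloud.words_), in place of list(keywords.keys())[:10].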
Summarize: three sentences that sum up the content

#--- make summary ---
def mkSummText(content):
    # Initializing the parser
    my_parser = PlaintextParser.from_string(content, sumyToken('english'))
    # Creating a summary of 3 sentences
    lsa_summarizer = LsaSummarizer()
    Extract = lsa_summarizer(my_parser.document,sentences_count=3)
    conclusion = []
    for sentence in Extract:
        #print(sentence)
        conclusion.append(str(sentence))
    return conclusion

Result, three sentences:

>>  He believes that God, through Mozart's genius, is cruelly laughing at Salieri's own musical mediocrity.
>>  When Salieri learns of Mozart's financial straits, he sees his chance to avenge ... (truncated)
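To give a feel for what an LSA summarizer does under the hood, here is a simplified sketch with NumPy. It builds a term-by-sentence count matrix, takes its SVD, keeps a few "topics", and picks the sentences with the largest weighted topic vectors. This illustrates the general idea only and is not sumy's actual implementation; the function name lsa_rank_sentences, the topics parameter, the regex tokenizer and the toy sentences are all assumptions.

```python
import re

import numpy as np


def lsa_rank_sentences(sentences, count=3, topics=2):
    """Pick `count` sentences by an LSA-style salience score."""
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    row = {w: i for i, w in enumerate(vocab)}

    # term-by-sentence count matrix: rows are words, columns are sentences
    matrix = np.zeros((len(vocab), len(sentences)))
    for col, toks in enumerate(tokenized):
        for w in toks:
            matrix[row[w], col] += 1

    # SVD: each column of vt describes one sentence in "topic" space
    _, sigma, vt = np.linalg.svd(matrix, full_matrices=False)

    # keep only the strongest topics, weight them by their singular values
    weighted = sigma[:topics, None] * vt[:topics]
    salience = np.sqrt((weighted ** 2).sum(axis=0))

    # take the top `count` sentences, then restore document order
    best = sorted(int(i) for i in np.argsort(-salience)[:count])
    return [sentences[i] for i in best]


# toy sentences (made up for this sketch)
toy_sentences = [
    "Salieri envies Mozart.",
    "Mozart writes music in Vienna.",
    "The weather is mild.",
    "Salieri hears the music of Mozart in Vienna.",
]
summary = lsa_rank_sentences(toy_sentences, count=2)
for line in summary:
    print(">> ", line)
```

The selected sentences are returned in their original order, which is also how the summary above reads.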

Searching again with the keyword 'musician' gives:

idx: 58
4.300402699706742
The Pianist
"The Pianist" begins in Warsaw, Poland in September, 1939, ...

http://img2.58codes.com/2024/201113730cvsEz4GgF.jpg
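The two keywords return different films because BM25 matches exact tokens: 'music' and 'musician' are different tokens. For the same reason, the plain doc.split(" ") tokenizer treats 'Music,' and 'music' as different, since it is case-sensitive and leaves punctuation attached. The sketch below shows a hypothetical normalizing tokenizer, normalize_tokens, which is not part of the original code. Using it for both the corpus and the query would change the scores reported above.

```python
import re


def normalize_tokens(text):
    """Lowercase the text and keep runs of letters, digits or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sample = "Music, he said."
plain = sample.split(" ")
normalized = normalize_tokens(sample)
print(plain)       # ['Music,', 'he', 'said.']
print(normalized)  # ['music', 'he', 'said']
```

To try it, build the corpus with [normalize_tokens(doc) for doc in mimdb] and tokenize the query with the same function.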

The code, the CSV and alice_mask.png are on GitHub

