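This article covers:
Semantic retrieval: from an IMDB movie-review file (100 entries) --> in the text column 'imdb_plot', find the entry with the highest BM25 score --> show its Top 10 words as a word cloud --> summarize it in three sentences with the LSA method.
Dataset source: Kaggle movies.csv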
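Code references
PyPI page for rank-bm25: installation + examples
BM25 is a ranking function that refines TF-IDF retrieval. For the formula, please see the Wikipedia article; today we only implement it. [Code is on GitHub]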
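For readers who want to see what the score is made of, here is a minimal pure-Python sketch of a textbook Okapi BM25. It is not the rank-bm25 implementation: the function name `bm25_scores`, the toy corpus, and the non-negative IDF variant `log(1 + (N - n + 0.5) / (n + 0.5))` are assumptions made for illustration, and `k1=1.5`, `b=0.75` are simply commonly used values. The numbers it produces will therefore not necessarily match the library's output.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against the query with a textbook Okapi BM25."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in tokenized_corpus:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            f = term_freq[term]
            if f == 0:
                continue  # a term absent from the document contributes nothing
            n_t = doc_freq[term]
            # Non-negative IDF variant (an assumption, see the text above)
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Term-frequency saturation with document-length normalization
            norm = f + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * f * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, not taken from movies.csv
toy_corpus = [
    "the composer wrote music for the court".split(),
    "a soldier returns home from the war".split(),
    "music and more music fills the hall".split(),
]
toy_scores = bm25_scores(["music"], toy_corpus)
best = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(toy_scores, best)
```

The three toy documents have the same length, so the one that mentions "music" twice ranks first, and the one that never mentions it scores zero.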
Import packages
from rank_bm25 import BM25Okapi
import pandas as pd
import os
#--- NLP summarize lib
import sumy
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer as sumyToken
from sumy.summarizers.lsa import LsaSummarizer
#--- wordcloud
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from PIL import Image
Load the csv and take the columns "title" and "imdb_plot"
# load from csv
df = pd.read_csv('movies.csv', dtype=object)
movies = df[['title', 'imdb_plot']]
mtitle = movies['title'].astype(str)
mimdb = movies['imdb_plot'].astype(str)
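Start using BM25
#--- tokenize: split every review on single spaces
tokenized_corpus = [doc.split(" ") for doc in mimdb]
#--- initiate
bm = BM25Okapi(tokenized_corpus)
# query --> the word(s) to search for
# note: the trailing space makes split(" ") also yield an empty token ''
query = "music "
tokenized_query = query.split(" ")
# compute the BM25 score of every document for this query
scores = bm.get_scores(tokenized_query)
idx = scores.argmax()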
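Splitting on a single space is the simplest possible tokenizer: it is case-sensitive, keeps punctuation attached to words, and produces empty tokens wherever two spaces meet. The sketch below shows one possible alternative; `simple_tokenize` is a hypothetical helper, not part of rank-bm25. Using it on both the corpus and the query would change the scores reported in this article.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


# What the plain split does with a trailing space
print("music ".split(" "))                 # ['music', '']
# What the regex-based helper does with case, punctuation and extra spaces
print(simple_tokenize("Music, music!  "))  # ['music', 'music']
```

The same function would have to be applied to every document and to the query, for example `[simple_tokenize(doc) for doc in mimdb]`, so that both sides are normalized the same way.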
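scores.argmax() returns the index of the highest-scoring element. We can use this index to look up the text in mtitle[idx] and mimdb[idx]. Let's first query with the keyword "music":
The best match (BM) is entry 30 with a score of 3.11332...; the movie title is Amadeus, and the first 60 characters of the review are "The story begins...".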
idx: 30
3.1133320601273993
Amadeus
The story begins in 1823 as the elderly Salieri attempts sui
If we print the score of every entry...
[0. 1.45850928 0. 0. 0. 0. 0. 1.9614662 0. 0. 0. 0. 0. 0. 0. 0. 0. 2.7574558 1.81637605 0. 0. 1.6851387 0. 0. 0. 0. 0. 0. 0. 1.94504518 3.11333206 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.9093863 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ... (truncated)
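Most entries score zero because they never contain the query word, and argmax only reports the single best one. If the runners-up are also of interest, the indices can be ranked. The following is a small sketch using only the standard library; `top_n_indices` is a hypothetical helper, and the sample list is just the first eight values of the array printed above.

```python
import heapq


def top_n_indices(scores, n=5):
    """Return the indices of the n highest scores, best first."""
    return heapq.nlargest(n, range(len(scores)), key=lambda i: scores[i])


# First eight values of the score array shown above
sample_scores = [0.0, 1.45850928, 0.0, 0.0, 0.0, 0.0, 0.0, 1.9614662]
print(top_n_indices(sample_scores, n=2))  # [7, 1]
```

Because the helper only indexes into `scores`, it should work the same way on the NumPy array returned by get_scores.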
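Word cloud
"A picture is worth a thousand words": make a word cloud to show the keywords of that review (using the mask alice_mask.png).
cloud.words_ is a dict that is already sorted, so we simply list its first 10 entries...
--> save the image to a file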
#--- make wordcloud
def mkCloud(txt):
    mask = np.array(Image.open('alice_mask.png'))
    font = 'SourceHanSansTW-Regular.otf'
    cloud = WordCloud(background_color='white', mask=mask, font_path=font,
                      contour_width=3, contour_color='steelblue').generate(txt)
    plt.imshow(cloud)
    plt.axis("off")
    plt.show()
    # keywords: a dict that is already sorted
    keywords = cloud.words_
    mostly = list(keywords.keys())
    print('Top10 keywords: ', mostly[:10])
    mostkeys = str(mostly[:10])
    pmt = f'Top10 keywords in the text\n{mostkeys}'
    print(pmt)
    # save the wordcloud to a file
    destFile = 'bmFig.jpg'
    cloud.to_file(destFile)
    # show image on screen
    if os.path.exists(destFile):
        img = Image.open(destFile, 'r')
        img.show()
top 10 words
Top10 keywords: ['Salieri', 'Mozart', 'God', 'Requiem', 'music', 'priest', 'Vienna', 'Constanze', 'mass', 'begins']
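To sanity-check a keyword list like this without drawing anything, the words can also be counted directly. The sketch below uses only the standard library. The stopword set, the helper name `top_words`, and the sample sentence are all made up for illustration, and WordCloud applies its own stopword list and word normalization, so its ranking may differ from a plain count.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list for illustration only
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "as", "his", "he"}


def top_words(text, n=10):
    """Count the words that are not stopwords and return the n most frequent."""
    words = re.findall(r"[A-Za-z']+", text)
    counts = Counter(w for w in words if w.lower() not in STOPWORDS)
    return counts.most_common(n)


# Made-up sample sentence, not taken from the dataset
sample = ("Salieri envies Mozart. Mozart writes music, "
          "and Salieri hears the music of Mozart.")
print(top_words(sample, n=3))
```

In the sample, "Mozart" appears three times and comes out on top, followed by the two words that appear twice.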
Summarize: three sentences that recap the content
#--- make summary ---
def mkSummText(content):
    # Initializing the parser
    my_parser = PlaintextParser.from_string(content, sumyToken('english'))
    # Creating a summary of 3 sentences
    lsa_summarizer = LsaSummarizer()
    Extract = lsa_summarizer(my_parser.document, sentences_count=3)
    conclusion = []
    for sentence in Extract:
        #print(sentence)
        conclusion.append(str(sentence))
    return conclusion
Result, three sentences:
>> He believes that God, through Mozart's genius, is cruelly laughing at Salieri's own musical mediocrity.
>> When Salieri learns of Mozart's financial straits, he sees his chance to avenge ... (truncated)
Searching again with the keyword 'musician' gives:
idx: 58
4.300402699706742
The Pianist
"The Pianist" begins in Warsaw, Poland in September, 1939, ...
The code, the csv, and alice_mask.png are on GitHub.