COVID-19 literature searching with the BM25 method

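Today let's pick a large dataset and try BM25 for literature search. (I had heard that BM25 does not need tokenization first. That was wrong: what it does not need is stopword handling, because its IDF term already gives very common words little weight.)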
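To illustrate the stopword remark, here is a minimal sketch of the IDF weight BM25 assigns to a term. It uses the non-negative variant log(1 + (N - n + 0.5) / (n + 0.5)), which is an assumption on my part and may differ in detail from what rank_bm25 computes internally. The toy corpus and the helper names `bm25_idf` and `doc_freq` are made up for illustration.

```python
import math

def bm25_idf(n_docs_with_term: int, n_docs: int) -> float:
    """Non-negative BM25 IDF variant: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (n_docs - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5))

def doc_freq(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

# Toy corpus of already tokenized titles (made up for illustration).
corpus = [
    ["the", "impact", "of", "covid-19", "in", "taiwan"],
    ["the", "vaccine", "trial", "of", "2021"],
    ["mortality", "in", "the", "elderly"],
    ["the", "role", "of", "masks"],
]

# "the" occurs in every title, "taiwan" in only one.
idf_the = bm25_idf(doc_freq("the", corpus), len(corpus))
idf_taiwan = bm25_idf(doc_freq("taiwan", corpus), len(corpus))

print(f"idf('the')    = {idf_the:.3f}")
print(f"idf('taiwan') = {idf_taiwan:.3f}")
```

A word that appears in every title contributes almost nothing to the score, while a rare word dominates, which is why removing stopwords beforehand usually changes the ranking very little.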
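About 590,000 COVID-19 related article records, 860 MB.
Dataset source: Kaggle COVID-19 metadata.csv, 568,230 records after removing duplicates (about 480,000 of them have a non-empty title).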
http://img2.58codes.com/2024/201113739vdpEXCrbF.jpg
For the previous two posts, see < 语义检索 Semantic Search NLP >
and < Semantic search BM25 COVID-19 dataset 自然语言BM25搜寻新冠文献资料 >.
Search keywords: Taiwan vaccine mortality
Search target: article title
http://img2.58codes.com/2024/20111373dIVeLT5JoZ.jpg
The program is kept simple, with nothing fancy.

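1. Read the CSV file (this takes a while, please be patient)
2. Select the columns we need
3. Enter the keywords (several words are allowed, separated by spaces)
4. Tokenize the article titles
5. Compute the BM25 scores
6. List the 10 highest-scoring articles
7. Save the results to a file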
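To make the scoring and top-10 steps concrete, here is a library-free sketch of Okapi BM25 on four made-up titles. The function name `bm25_scores`, the parameter values k1=1.5 and b=0.75, and the non-negative IDF variant are assumptions for illustration. The real program relies on rank_bm25, whose internals may differ in detail.

```python
import heapq
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs  # average document length
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency inside this document
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized through b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up titles, tokenized on single spaces like the program in this post.
titles = [
    "COVID-19 vaccine rollout and mortality in Taiwan",
    "Hesitancy toward the vaccine among health workers",
    "Trends in mortality during the pandemic",
    "Mask policy in Taiwan",
]
docs = [t.split(" ") for t in titles]
query = "Taiwan vaccine mortality".split(" ")

scores = bm25_scores(query, docs)
# Indices of the 2 best titles, highest score first.
top = heapq.nlargest(2, range(len(titles)), key=lambda i: scores[i])
for i in top:
    print(f"{scores[i]:.3f}  {titles[i]}")
```

The title that contains all three keywords ranks first. On the full dataset the same idea applies, only with about half a million titles and a top 10 instead of a top 2.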
Source Code
'''
article_search01.py
    searching article title
    BM25 method
'''
from rank_bm25 import BM25Okapi
import pandas as pd
import numpy as np

''' main flow '''
# load csv file
# https://www.kaggle.com/maksimeren/covid-19-literature-clustering/data?select=metadata.csv
print('Loading the CSV file, please wait...')
df_raw = pd.read_csv('ArticleCOVID.csv', dtype=object)
# for testing: sample 10,000 records
#df = df_raw.sample(n = 10000, random_state=20)
df = df_raw
#print(df.head())

# select the columns we need
mtitle   = df['title'].astype(str)
mabs     = df['abstract'].astype(str)
murl     = df['url'].astype(str)
mpubtime = df['publish_time'].astype(str)
mpmcid   = df['pmcid'].astype(str)
mauthor  = df['authors'].astype(str)
mjournal = df['journal'].astype(str)
mdoi     = df['doi'].astype(str)
#print(len(mtitle),mtitle.shape)
#print(mtitle.iloc[5])

#--- tokenize the article titles
tokenized_corpus = [doc.split(" ") for doc in mtitle]
print(f'Number of articles: {len(tokenized_corpus)}')
print(f'Tokens of the first five titles\n{tokenized_corpus[:5]}')

#--- initiate BM25
bm = BM25Okapi(tokenized_corpus)

# query --> keywords to search for; several words allowed, separated by spaces
query = input('Keywords to search for in article titles:>> ')
tokenized_query = query.split(" ")
print(f'Number of keywords: {len(tokenized_query)}\n tokenized: {tokenized_query}')

# compute BM25 score (log)
scores = bm.get_scores(tokenized_query)

# sort scores (take index)
s1 = np.argsort(scores)
sidx = s1[::-1]   # reverse s1
print(sidx[:10])   # top 10 highest score papers

fw = open('article_result.txt', 'w', encoding='utf-8')
print(f'Searching keywords: {query}')
print('Top 10 articles listed below:')
print(f'Searching keywords: {query}', file=fw)
print('Top 10 articles listed below:', file=fw)
for i in range(10):
    no = sidx[i]
    tmp =       f'Location: {no}\n'
    tmp = tmp + f'BM25 score: {scores[no]}\n'
    tmp = tmp + f'Title: {mtitle.iloc[no]}\n'
    tmp = tmp + f'Authors: {mauthor.iloc[no]}\n'
    tmp = tmp + f'PubTime: {mpubtime.iloc[no]}\n'
    tmp = tmp + f'Abstract: {mabs.iloc[no][:500]}\n'
    tmp = tmp + f'Journal: {mjournal.iloc[no]}\n'
    tmp = tmp + f'URL: {murl.iloc[no]}\n'
    tmp = tmp + f'pmcid: {mpmcid.iloc[no]}\n'
    tmp = tmp + f'doi: {mdoi.iloc[no]}\n\n'
    print(tmp)
    print(tmp, file=fw)

# close the file only after all 10 results have been written
fw.close()
print('Search results saved to: article_result.txt')
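One possible refinement, offered as a sketch rather than as part of the original program: splitting on a single space is case-sensitive and keeps punctuation attached, so a query for "taiwan" would not match "Taiwan:" in a title. In addition, `astype(str)` turns a missing title into the literal string 'nan', which then becomes a token. The helpers `tokenize` and `clean_titles` below are hypothetical names and use only the standard library.

```python
import re

# Alphanumeric runs, with inner hyphens kept so that "covid-19" stays one token.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())

def clean_titles(raw_titles):
    """Skip missing or empty titles; keep (original position, tokens) pairs."""
    kept = []
    for pos, title in enumerate(raw_titles):
        # A missing CSV cell is read as a float NaN, not as a string.
        if not isinstance(title, str) or not title.strip():
            continue
        kept.append((pos, tokenize(title)))
    return kept

# Made-up raw values, including a missing title and an empty one.
raw = ["Mortality in Taiwan: COVID-19 vaccine.", float("nan"), "", "Mask policy"]
for pos, tokens in clean_titles(raw):
    print(pos, tokens)
```

If the titles are tokenized this way, the query has to go through the same `tokenize` function, otherwise the keywords and the corpus tokens no longer line up. The saved positions make it possible to map a score back to the original row.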

http://img2.58codes.com/2024/20111373f0xF3kfced.jpg

