Python TF-IDF Keyword Extraction: Extracting Text Content with Python

激活谷笔记 • 2025-01-02 12:39 • 235 views

Keywords can be extracted in Python in several ways. Some commonly used methods are listed below:

1. Using the jieba library for Chinese word segmentation and keyword extraction:

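    import jieba
    import jieba.analyse

    text = "新闻，也叫消息，是指报纸、电台、电视台、互联网经常使用的记录社会、传播信息、反映时代的一种文体，具有真实性、时效性、简洁性、可读性、准确性的特点。"

    # Segment the text, then drop stop words (one per line in stopwords.txt)
    with open('stopwords.txt', encoding='utf-8') as f:
        stopwords = {line.strip() for line in f}
    words = [w for w in jieba.cut(text) if w.strip() and w not in stopwords]
    print(' '.join(words))

    # jieba's built-in TF-IDF keyword extraction: top 5 keywords with weights
    print(jieba.analyse.extract_tags(text, topK=5, withWeight=True))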
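Once the text has been segmented and the stop words removed, a plain frequency count already gives a rough keyword list. The sketch below assumes the tokens were produced earlier by a segmenter such as `jieba.cut`; the token list, the stop-word set, and the helper name `top_terms` are made up for illustration and are not part of jieba.

```python
from collections import Counter

def top_terms(tokens, stopwords, k=3):
    """Return the k most frequent tokens that are neither stop words nor blank."""
    kept = [t for t in tokens if t.strip() and t not in stopwords]
    return Counter(kept).most_common(k)

# Hypothetical segmenter output and stop-word set, for illustration only.
tokens = ["新闻", "也", "叫", "消息", "，", "新闻", "具有", "真实性", "，", "消息", "新闻"]
stopwords = {"也", "叫", "，", "具有"}

print(top_terms(tokens, stopwords))
```

Raw frequency favors words that are common everywhere, which is the weakness that TF-IDF weighting is meant to address.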

2. Using the TF-IDF algorithm to extract keywords:

    import jieba
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [
        "这是第一篇文章的内容",
        "这是第二篇文章的内容",
        "这是第三篇文章的内容"
    ]

    # Chinese text has no spaces between words, so the default token pattern
    # would treat each sentence as a single token; segment with jieba instead
    vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()
    tfidf_scores = tfidf_matrix.toarray()

    # Print every term whose TF-IDF weight in a document exceeds 0.1
    for doc_index, document in enumerate(documents):
        print(f"Document {doc_index + 1} keywords:")
        for term_index, term in enumerate(feature_names):
            if tfidf_scores[doc_index, term_index] > 0.1:
                print(f" - {term} ({tfidf_scores[doc_index, term_index]:.2f})")
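To make the weighting itself visible, here is a minimal pure-Python sketch of the textbook formula, with tf as the term count divided by document length and idf as log(N / df). It assumes the documents are already tokenized; the function name `tfidf` and the token lists are illustrative. scikit-learn's `TfidfVectorizer` by default smooths the idf and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Plain TF-IDF: tf = count / doc length, idf = log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Pre-tokenized toy documents (hypothetical segmenter output).
docs = [
    ["这", "是", "第一篇", "文章", "的", "内容"],
    ["这", "是", "第二篇", "文章", "的", "内容"],
    ["这", "是", "第三篇", "文章", "的", "内容"],
]

for i, doc_scores in enumerate(tfidf(docs), start=1):
    best = max(doc_scores, key=doc_scores.get)
    print(f"Document {i}: {best} ({doc_scores[best]:.3f})")
```

Terms that occur in every document get an idf of log(1) = 0, so only the term unique to each document receives a positive score.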

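3. Using BM25 to score documents against a query (the rank_bm25 library implements BM25 ranking, not TextRank):

    import jieba
    from rank_bm25 import BM25Okapi

    corpus = [
        "这是第一篇文章的内容",
        "这是第二篇文章的内容",
        "这是第三篇文章的内容"
    ]

    # BM25Okapi expects a tokenized corpus and a tokenized query
    tokenized_corpus = [jieba.lcut(doc) for doc in corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    query = jieba.lcut("第二篇文章")  # example query
    scores = bm25.get_scores(query)

    for idx, score in enumerate(scores):
        print(f"Document {idx + 1} has a score of {score:.2f}")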
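For TextRank itself, the idea is to link words that occur close together and then run PageRank over that graph, so words connected to many other well-connected words rank highest. The following is a simplified sketch under stated assumptions: the input is already segmented and stop-word filtered, edges are unweighted, and the function name `textrank`, its parameters, and the sample tokens are illustrative. For real Chinese text, jieba ships its own implementation as `jieba.analyse.textrank`.

```python
from collections import defaultdict

def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    # Link every pair of distinct words that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the PageRank update a fixed number of times.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical pre-segmented, stop-word-filtered tokens.
tokens = ["新闻", "记录", "社会", "新闻", "传播", "信息", "新闻", "反映", "时代"]

for word, score in textrank(tokens)[:3]:
    print(f"{word}: {score:.3f}")
```

A fixed iteration count keeps the sketch short; a fuller implementation would stop once the scores change by less than a tolerance.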

4. Using the RAKE algorithm to extract keywords:

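    import nltk
    from rake_nltk import Rake

    # rake_nltk relies on NLTK's stop-word list and tokenizer data
    nltk.download('stopwords')
    nltk.download('punkt')

    # Rake() defaults to English stop words and does not segment Chinese,
    # so this example uses an English rendering of the sample sentence
    text = "News, also called a report, is a genre used by newspapers, radio, television, and the internet to record society, spread information, and reflect the times; it is truthful, timely, concise, readable, and accurate."
    r = Rake()
    r.extract_keywords_from_text(text)
    print(r.get_ranked_phrases())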
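RAKE's scoring is simple enough to sketch by hand: split the text into candidate phrases at punctuation and stop words, score each word as degree divided by frequency, and sum the word scores per phrase. The version below is a minimal sketch, not the rake_nltk implementation; the function name `rake`, the tiny stop-word set, and the sample text are assumptions made for illustration.

```python
import re
from collections import defaultdict

def rake(text, stopwords):
    """Score candidate phrases with RAKE's degree/frequency word scores."""
    # Split into candidate phrases at punctuation and at stop words.
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself

    # A phrase's score is the sum of its words' degree/frequency ratios.
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Illustrative stop-word set and text, not taken from any library.
stopwords = {"is", "a", "to", "from", "the", "of", "and", "for"}
text = ("RAKE is a method for rapid automatic keyword extraction. "
        "Keyword extraction finds the key phrases of a text.")

for phrase, score in rake(text, stopwords)[:3]:
    print(f"{phrase}: {score:.1f}")
```

Because a word's degree grows with the length of the phrases it appears in, this scoring favors longer phrases, which is why RAKE tends to return multi-word keywords.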

5. Using KeyBERT to extract keywords:

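    import jieba
    from keybert import KeyBERT
    from sklearn.feature_extraction.text import CountVectorizer

    text = "新闻，也叫消息，是指报纸、电台、电视台、互联网经常使用的记录社会、传播信息、反映时代的一种文体，具有真实性、时效性、简洁性、可读性、准确性的特点。"

    # Use a multilingual embedding model and a jieba-based vectorizer, because
    # the defaults target space-delimited English text
    model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
    vectorizer = CountVectorizer(tokenizer=jieba.lcut, token_pattern=None)
    # extract_keywords has no `ratio` parameter, so only top_n is passed
    keywords = model.extract_keywords(text, vectorizer=vectorizer, top_n=5)
    print(keywords)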
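Conceptually, KeyBERT embeds the document and each candidate term, then keeps the candidates whose embeddings are most similar to the document's by cosine similarity. The sketch below shows only that ranking step. The three-dimensional vectors are invented toy values standing in for real sentence embeddings, and `cosine` and `rank_candidates` are hypothetical helper names rather than KeyBERT functions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_candidates(doc_vec, candidate_vecs, top_n=2):
    """Return the top_n candidates most similar to the document vector."""
    scored = [(term, cosine(doc_vec, vec)) for term, vec in candidate_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]

# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
doc_vec = [0.9, 0.1, 0.3]
candidate_vecs = {
    "新闻": [0.8, 0.2, 0.3],
    "互联网": [0.1, 0.9, 0.2],
    "时效性": [0.5, 0.1, 0.6],
}

for term, score in rank_candidates(doc_vec, candidate_vecs):
    print(f"{term}: {score:.3f}")
```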

6. Using the nltk library to extract keywords from English text:

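    import nltk

    # Tokenizer and POS-tagger data (newer NLTK releases name these
    # 'punkt_tab' and 'averaged_perceptron_tagger_eng')
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')

    text = "This is a sample sentence to extract keywords from."
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)

    # Chunk noun phrases: optional determiner, any adjectives, one or more nouns
    # (ne_chunk finds named entities, not noun phrases, so a grammar is used)
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
    tree = chunker.parse(tagged)

    matches = []
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            matches.append(' '.join(word for word, tag in subtree.leaves()))
    print(matches)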
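The chunking step can also be sketched without NLTK, which makes it clearer what the tagger output is used for: consecutive adjective and noun tags are grouped into one candidate phrase. This is a simplified sketch assuming Penn Treebank style tags; the helper name `noun_phrases` is hypothetical and the tagged sentence is written by hand rather than produced by `nltk.pos_tag`.

```python
def noun_phrases(tagged):
    """Group maximal runs of adjective/noun tags that end in a noun."""
    phrases, current = [], []
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the last run
        if tag.startswith("NN") or tag.startswith("JJ"):
            current.append((word, tag))
            continue
        # Drop trailing adjectives so every phrase ends in a noun.
        while current and not current[-1][1].startswith("NN"):
            current.pop()
        if current:
            phrases.append(" ".join(w for w, _ in current))
        current = []
    return phrases

# Hand-written (word, tag) pairs, for illustration only.
tagged = [
    ("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("sample", "JJ"),
    ("sentence", "NN"), ("to", "TO"), ("extract", "VB"),
    ("keywords", "NNS"), ("from", "IN"), (".", "."),
]

print(noun_phrases(tagged))
```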

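These methods range from simple Chinese word segmentation to more sophisticated extraction algorithms; choose the one that best fits your requirements.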
