Word Frequency Counting in Python, Including Chinese Text

激活谷笔记 • 2025-05-13 17:02 • 92 views

The basic steps for word frequency analysis in Python are as follows:

Import the required libraries

 import string
 from collections import Counter

Preprocess the text

Convert to lowercase

Remove punctuation and digits

Split the text into words

 # text holds the input string
 text = text.lower()
 # Delete ASCII punctuation and digits in one pass
 text = text.translate(str.maketrans('', '', string.punctuation + string.digits))
 words = text.split()
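The preprocessing steps can be wrapped in one small function. This is a minimal sketch: the function name `preprocess` and the sample sentence are illustrative assumptions, not part of the original notes. It also deletes digits, as the step list describes. Note that `string.punctuation` covers ASCII punctuation only, so full-width Chinese punctuation would pass through unchanged.

```python
import string

# Translation table that deletes ASCII punctuation and digits.
_DELETE_TABLE = str.maketrans("", "", string.punctuation + string.digits)


def preprocess(text):
    """Lowercase the text, strip ASCII punctuation and digits, split on whitespace."""
    return text.lower().translate(_DELETE_TABLE).split()


sample = "Hello, World! 123 hello"
words = preprocess(sample)
print(words)
```

Building the translation table once at module level avoids recreating it on every call, which matters when many documents are processed.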

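Build the word frequency dictionary

Use the `Counter` class to build a word frequency dictionary in which each key is a word and each value is the number of times that word occurs.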

 word_counts = Counter(words)
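A short illustration of how `Counter` behaves as a dictionary may help here. The word list below is made up for the example.

```python
from collections import Counter

words = ["apple", "pear", "apple", "fig", "pear", "apple"]
word_counts = Counter(words)

# Counter is a dict subclass: keys are words, values are occurrence counts.
apple_count = word_counts["apple"]
# A word that never appeared yields 0 instead of raising KeyError.
missing_count = word_counts["kiwi"]
# update() adds the counts of more words to the existing tallies.
word_counts.update(["fig", "fig"])

print(apple_count, missing_count, word_counts["fig"])
```

Because looking up a missing word does not insert it, a `Counter` can be queried freely without growing.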

Sort by frequency

Sort the dictionary entries by word frequency, starting with the most frequent word.

 # Sort on the count (the second element of each pair), not on the whole pair
 sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
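The sort key deserves a closer look, because it decides both the ranking and how ties are ordered. The sketch below compares three options on a made-up word list; the variable names are illustrative only.

```python
from collections import Counter

word_counts = Counter(["b", "a", "d", "a", "b", "a", "c"])

# Sort by count, highest first. sorted() is stable, so words with equal
# counts keep the order in which they were first seen.
by_count = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)

# Sort by count descending, then alphabetically, so ties are deterministic.
by_count_then_word = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))

# most_common(n) returns the n highest-count (word, count) pairs directly.
top_two = word_counts.most_common(2)

print(by_count)
print(by_count_then_word)
print(top_two)
```

When only the top few words are needed, `most_common(n)` is usually the simplest choice.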

Print the results

Print the sorted word frequency list.

 for word, count in sorted_word_counts:
     print(f"{word}: {count}")
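Put together, the steps so far fit in a few lines. The following is one possible end-to-end sketch, assuming the text is already in memory as a string; the helper name `top_words` and the sample sentence are hypothetical.

```python
import string
from collections import Counter


def top_words(text, n=3):
    """Return the n most frequent words in text as (word, count) pairs."""
    table = str.maketrans("", "", string.punctuation + string.digits)
    words = text.lower().translate(table).split()
    return Counter(words).most_common(n)


sample = "The cat saw the dog. The dog saw the bird!"
for word, count in top_words(sample):
    # Left-align the word in a 6-character column so the counts line up.
    print(f"{word:<6} {count}")
```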

To segment Chinese text with the third-party library `jieba`, follow these steps:

Install the library

 pip install jieba

Segment the text

 import jieba
 # cut_for_search returns a generator of tokens in search-engine mode
 seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所，后在日本京都大学深造")
 print(" ".join(seg_list))

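Filter stop words (optional):

 # cut_word is assumed to hold the token list from the segmentation step, e.g. jieba.lcut(text)
 with open("stoplist.txt", "r", encoding="utf-8-sig") as f:
     stop_words = f.read().split()
 stop_words.extend(["天龙八部", "\n", "\u3000", "目录", "一声", "之中", "只见"])
 stop_words = set(stop_words)
 # Keep tokens longer than one character that are not stop words
 all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
 print(len(all_words), all_words[:20])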
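To connect the filtering step back to word frequencies, the filtered tokens can be passed to `Counter`. The sketch below uses only the standard library so it runs without `jieba` installed: the token list is hypothetical and stands in for segmenter output, and the function name `count_filtered` is an assumption made for illustration.

```python
from collections import Counter


def count_filtered(tokens, stop_words):
    """Count tokens longer than one character that are not stop words."""
    kept = [tok for tok in tokens if len(tok) > 1 and tok not in stop_words]
    return Counter(kept)


# Hypothetical tokens, standing in for the output of a segmenter.
tokens = ["小明", "毕业", "于", "中国科学院", "，", "小明", "之中", "深造"]
stop_words = {"之中", "只见"}

counts = count_filtered(tokens, stop_words)
print(counts.most_common(2))
```

With `jieba` installed, the token list would typically come from `jieba.lcut(text)`, which returns the segments as a list. Keeping the stop words in a `set` makes each membership check fast even for long stop lists.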

Adjust the code to suit your specific needs.