Crawler Deduplication Schemes: Deduplicating in Python and Sorting in Ascending Order

激活谷笔记 • 2025-04-18 10:53

In Python 3, a crawler queue can be deduplicated in several ways:

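Using a set

Advantage: the elements of a set are unique, so adding a duplicate element is silently ignored.

Disadvantage: a set is unordered, so it is not suitable when the order of elements must be preserved.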

    data = set()
    new_data = "new data"
    # Only add the item if it has not been seen yet
    if new_data not in data:
        data.add(new_data)
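As a minimal sketch of how a set might sit inside a crawler queue, the snippet below pairs a set of seen URLs with a `collections.deque` that keeps the crawl order, which works around the set's lack of ordering. The names `Frontier`, `push`, and `pop` are hypothetical and chosen only for illustration.

```python
from collections import deque


class Frontier:
    """Hypothetical crawl queue: a deque keeps order, a set rejects repeats."""

    def __init__(self):
        self._seen = set()      # every URL ever enqueued
        self._queue = deque()   # URLs still waiting to be fetched, in FIFO order

    def push(self, url):
        """Enqueue url unless it was seen before; return True if it was added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        """Return the oldest pending URL, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


frontier = Frontier()
for link in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
    frontier.push(link)
```

Because the set remembers URLs even after they are popped, a page that has already been fetched is not queued again when another page links to it.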

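Using a dictionary

Advantage: dictionary keys are unique, so the same item cannot be added twice.

Disadvantage: every new item needs a value, even when that value carries no meaning.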

    data = dict()
    new_data = "new data"
    if new_data not in data:
        data[new_data] = None  # placeholder; any value will do
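The dictionary's value does not have to be wasted: it can carry per-URL metadata. The sketch below stores a crawl depth for each URL (a hypothetical field, not something the text above prescribes) and relies on the fact that dictionaries preserve insertion order since Python 3.7. `dict.fromkeys` is shown as a one-line way to deduplicate a list while keeping first-seen order.

```python
visited = {}  # url -> depth at which it was first discovered (hypothetical metadata)

discovered = [
    ("http://example.com/", 0),
    ("http://example.com/a", 1),
    ("http://example.com/", 2),   # duplicate: the first-seen depth is kept
]
for url, depth in discovered:
    # setdefault only writes when the key is absent, so repeats are ignored
    visited.setdefault(url, depth)

# Order-preserving deduplication of a plain list
links = ["b", "a", "b", "c", "a"]
unique_links = list(dict.fromkeys(links))
```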

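Using a Bloom filter

Advantage: membership checks are fast and use little memory, and the false-positive rate is low.

Disadvantage: false positives are still possible, meaning the filter may report an item as present when it was never added. For a crawler, this means a small fraction of new URLs may be skipped by mistake.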

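    import pybloom  # third-party; on Python 3 the pybloom_live fork offers the same BloomFilter class

    # capacity is a required argument; the values here are illustrative
    bloom = pybloom.BloomFilter(capacity=1000000, error_rate=0.001)
    new_data = "new data"
    if new_data not in bloom:
        bloom.add(new_data)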
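To show where the false positives come from, here is a rough pure-Python sketch of the idea behind a Bloom filter. It is not how `pybloom` is implemented; the class name `TinyBloom`, the bit-array size, and the use of salted SHA-256 as the hash family are all assumptions made for illustration.

```python
import hashlib


class TinyBloom:
    """Illustrative Bloom filter; not a replacement for a tested library."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit is set. Other items may have set the same
        # bits, which is where false positives come from.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bloom = TinyBloom()
bloom.add("http://example.com/a")
```

An item that was added always tests as present, so the filter never produces false negatives. Only the "already seen" answer can be wrong.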

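Using an external database

Advantage: it can store large volumes of data and enforce uniqueness through a primary key or unique index. Order can be retained with an extra column such as an auto-increment ID or a timestamp.

Disadvantage: the database needs additional setup and maintenance.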

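    # Example using MySQL as the database
    import pymysql

    # Connect to the database
    connection = pymysql.connect(
        host='localhost',
        user='username',
        password='password',
        database='database_name',
        charset='utf8mb4',
        cursorclass=pymysql.cursors.DictCursor,
    )

    # Create the table. The column is declared ASCII (assumes percent-encoded URLs)
    # so the 2048-character key stays under InnoDB's 3072-byte index limit.
    with connection.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS visited_urls (
                url VARCHAR(2048) CHARACTER SET ascii PRIMARY KEY
            )
        """)

    # Insert a URL; the primary key rejects duplicates
    new_data = "http://example.com"
    with connection.cursor() as cursor:
        try:
            cursor.execute("INSERT INTO visited_urls (url) VALUES (%s)", (new_data,))
            connection.commit()
        except pymysql.IntegrityError:
            print(f"URL {new_data} already exists, skipping insert.")

    # Close the connection
    connection.close()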
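For trying out the same pattern without a running MySQL server, the sketch below swaps in SQLite through the standard-library `sqlite3` module. This is a substitution for illustration, not the MySQL setup above, and the helper name `mark_visited` is hypothetical. Instead of catching an integrity error, it uses SQLite's `INSERT OR IGNORE` and checks how many rows were written.

```python
import sqlite3

# ":memory:" keeps the demo self-contained; pass a file path to persist across runs
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS visited_urls (url TEXT PRIMARY KEY)")


def mark_visited(conn, url):
    """Insert url; return True if it was new, False if it was already stored."""
    # INSERT OR IGNORE skips rows that would violate the primary key
    cur = conn.execute(
        "INSERT OR IGNORE INTO visited_urls (url) VALUES (?)", (url,)
    )
    conn.commit()
    return cur.rowcount == 1


first = mark_visited(conn, "http://example.com")
second = mark_visited(conn, "http://example.com")
```

Here `first` reports that the URL was new and `second` reports that it was already stored, so the crawler can decide whether to fetch the page from the return value alone.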

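Which method to choose depends on your specific needs: the volume of data, whether order must be preserved, and whether the data must be stored persistently. If the data set is small and persistence is not required, a set or a dictionary is probably enough. If the data set is large or must survive restarts, a database is likely the better fit.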
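For the second half of the title, deduplicating and then sorting from smallest to largest, a minimal sketch with made-up sample values could look like this:

```python
ids = [42, 7, 19, 7, 3, 42]

# set() drops duplicates; sorted() returns a new list in ascending order
unique_sorted = sorted(set(ids))
print(unique_sorted)  # [3, 7, 19, 42]
```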
