Crawling Multiple Pages with a Python Crawler: Python Crawler Tutorial

激活谷笔记 • 2025-05-18 08:20 • 95 views

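Crawling pages that are linked across several levels usually calls for a recursive or iterative approach. The following is a basic workflow implemented in Python with the `requests` and `BeautifulSoup` libraries: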

1. Import the required libraries:

    import requests
    from bs4 import BeautifulSoup

2. Define a function that fetches the page content:

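    def get_page_content(url):
        # Return the page HTML, or None if the request fails.
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            return None
        if response.status_code == 200:
            return response.text
        return None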

3. Define a function that parses the page and extracts its links:

    def extract_links(html_content):
        # Collect the href value of every <a> tag that has one.
        soup = BeautifulSoup(html_content, 'html.parser')
        links = [a['href'] for a in soup.find_all('a', href=True)]
        return links
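The `href` values returned by `extract_links` are raw attribute strings, so they can be relative paths, in-page anchors, or non-web schemes such as `mailto:`. The sketch below shows one way to clean them up before crawling, using only the standard library. The helper name `normalize_links` and the sample URLs are assumptions made for illustration; they are not part of the original tutorial.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Turn raw href values into unique absolute http(s) URLs."""
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative paths such as "/about" against the page URL.
        absolute = urljoin(base_url, href.strip())
        # Drop the "#fragment" part, which points inside the same page.
        absolute, _fragment = urldefrag(absolute)
        # Skip mailto:, javascript:, tel: and other non-web schemes.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


sample_hrefs = ["/about", "news.html#top", "mailto:admin@example.com",
                "https://example.org/", "/about"]
print(normalize_links("http://example.com/blog/", sample_hrefs))
```

In a crawler, the page URL that was just fetched would be passed as `base_url`, and the cleaned list would replace the raw one before any further requests are made.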

4. Define a function that crawls the links recursively:

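    from urllib.parse import urljoin

    def crawl_links(start_url, max_depth, current_depth=0):
        if current_depth > max_depth:
            return
        html_content = get_page_content(start_url)
        if html_content:
            for link in extract_links(html_content):
                # Resolve relative hrefs against the current page URL.
                absolute_link = urljoin(start_url, link)
                # Skip mailto:, javascript: and other non-HTTP links.
                if not absolute_link.startswith(('http://', 'https://')):
                    continue
                print(f"Found link: {absolute_link}")
                crawl_links(absolute_link, max_depth, current_depth + 1)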

5. Call the function to start crawling:

    start_url = 'http://example.com'  # starting URL
    max_depth = 2  # maximum crawl depth
    crawl_links(start_url, max_depth)

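In this example, the `crawl_links` function recursively visits each link until it reaches the specified maximum depth. Each recursive call increments `current_depth`, and the recursion stops once `current_depth` exceeds `max_depth`.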
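The same depth-limited crawl can also be written iteratively, which avoids deep recursion and makes it easy to skip pages that were already seen. The following is a minimal sketch under stated assumptions, not the tutorial's own method: the name `crawl_iteratively`, the `fetch` and `extract` parameters, and the in-memory `FAKE_SITE` are all hypothetical and exist only so the example runs without network access.

```python
from collections import deque
from urllib.parse import urljoin


def crawl_iteratively(start_url, max_depth, fetch, extract):
    """Breadth-first crawl; returns the fetched URLs in visiting order."""
    visited = {start_url}
    order = []
    queue = deque([(start_url, 0)])  # (url, depth) pairs
    while queue:
        url, depth = queue.popleft()
        content = fetch(url)
        order.append(url)
        # Do not follow links from failed pages or from the deepest level.
        if not content or depth >= max_depth:
            continue
        for href in extract(content):
            link = urljoin(url, href)
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# A tiny in-memory "site": each URL maps directly to the hrefs on that page,
# standing in for real HTML so no HTTP request is needed.
FAKE_SITE = {
    "http://example.com/": ["/a", "/b"],
    "http://example.com/a": ["/", "/c"],
    "http://example.com/b": ["/c"],
    "http://example.com/c": ["/d"],
    "http://example.com/d": [],
}

pages = crawl_iteratively(
    "http://example.com/", 2,
    fetch=lambda url: FAKE_SITE.get(url),
    extract=lambda hrefs: hrefs,
)
print(pages)
```

With real pages, functions such as `get_page_content` and `extract_links` could be passed as `fetch` and `extract`. The `visited` set is what keeps a page that links back to the start URL from being fetched again.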

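Note that when crawling a website you should follow the rules in its `robots.txt` file and respect the site's copyright and terms of use. Frequent requests can also put load on the site's server, so keep your crawl rate reasonable.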
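One way to act on this advice is to check each URL against `robots.txt` before fetching it, using the standard library's `urllib.robotparser`. The sketch below parses a hypothetical `robots.txt` from a string so it runs offline; the rules, the `USER_AGENT` name, and the helper functions `allowed` and `polite_delay` are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would instead call
# set_url(".../robots.txt") followed by read() to download the file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-crawler"  # hypothetical user agent name

robots = RobotFileParser()
robots.parse(SAMPLE_ROBOTS_TXT.splitlines())


def allowed(url):
    """Return True if the parsed rules permit USER_AGENT to fetch url."""
    return robots.can_fetch(USER_AGENT, url)


def polite_delay(default_seconds=1.0):
    """Seconds to wait between requests, using Crawl-delay when present."""
    delay = robots.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


for url in ["http://example.com/index.html",
            "http://example.com/private/data.html"]:
    print(url, "->", "allowed" if allowed(url) else "blocked")
print("wait between requests:", polite_delay(), "seconds")
```

In a crawl loop, a URL would be fetched only when `allowed(url)` is true, with `time.sleep(polite_delay())` between consecutive requests to the same site.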
