There are many ways to fetch multiple URLs in Python. Here are some common approaches:
1. Using the `requests` and `BeautifulSoup` libraries:
```python
import requests
from bs4 import BeautifulSoup

urls = [
    'http://www.example.com/page1',
    'http://www.example.com/page2',
    'http://www.example.com/page3',
]

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract the page title and the body text
    title = soup.title.string
    content = soup.find('body').get_text()
    print('Title:', title)
    print('Body:', content)
```
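If the URL list is long, the sequential loop above can be slow because each request waits for the previous one to finish. Below is a minimal sketch of fetching the same pages concurrently with the standard-library `concurrent.futures` module; the `fetch_title` helper and the worker count of 5 are illustrative choices, not part of the original example.

```python
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

urls = [
    'http://www.example.com/page1',
    'http://www.example.com/page2',
    'http://www.example.com/page3',
]

def fetch_title(url):
    # Fetch one page and return its title, or None if the request fails
    try:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup.title.string if soup.title else None
    except requests.RequestException:
        return None

# Run up to 5 requests in parallel; executor.map preserves input order
with ThreadPoolExecutor(max_workers=5) as executor:
    for url, title in zip(urls, executor.map(fetch_title, urls)):
        print(url, '->', title)
```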
2. Using the `Scrapy` framework with a recursive `parse` method:
```python
from scrapy.spiders import Spider

class QiubaiSpider(Spider):
    name = 'qiubai'
    # allowed_domains must contain domains only, not URL paths
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # Extract all URLs on the page and follow each one recursively
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, self.parse)
```
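A spider like this is usually started from a Scrapy project with `scrapy crawl qiubai`, but it can also be driven from a plain script. A short sketch using Scrapy's `CrawlerProcess`, assuming the `QiubaiSpider` class above is in scope:

```python
from scrapy.crawler import CrawlerProcess

# Start the spider in-process; start() blocks until the crawl finishes
process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
process.crawl(QiubaiSpider)
process.start()
```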
3. Using the `lxml` library with XPath expressions:
```python
import requests
from lxml import html

# Download a page first so that html_content is defined,
# then extract every link with an XPath expression
html_content = requests.get('http://www.example.com').text
tree = html.fromstring(html_content)
links = tree.xpath('//a/@href')
for link in links:
    print(link)
```
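The `href` values returned by the XPath query are often relative, so they cannot be fetched directly. A small sketch of normalizing them with the standard-library `urllib.parse.urljoin` (the sample links here are made up for illustration):

```python
from urllib.parse import urljoin

base_url = 'http://www.example.com'
relative_links = ['/page1', 'page2', 'http://other.example.com/page3']

# urljoin resolves relative links against the base URL and leaves
# absolute ones untouched, so mixed lists are safe to normalize
absolute_links = [urljoin(base_url, link) for link in relative_links]
print(absolute_links)
```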
4. Using the `urllib` and `BeautifulSoup` libraries:
```python
from bs4 import BeautifulSoup
import urllib.request

Upageurls = {}    # on-site links found so far, mapped to their HTTP status code
websiteurls = {}  # links that have already been processed elsewhere

def scanpage(url):
    # Collect all on-site links from the page, then check each one's status
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    pageurls = soup.find_all('a', href=True)
    for link in pageurls:
        href = link.get('href')
        if url in href and href not in Upageurls and href not in websiteurls:
            Upageurls[href] = 0
    for link in Upageurls:
        try:
            # Fetch each link once and record its status code
            Upageurls[link] = urllib.request.urlopen(link).getcode()
        except Exception:
            print('connect failed')
```
5. Fetching the URLs of Baidu search results in bulk:
```python
import requests

DOMAIN = 'https://www.baidu.com/s?wd='
keyword = input('Enter a search keyword: ')
pages = int(input('Enter the number of pages to crawl: '))

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/83.0.4103.61 Safari/537.36',
    'Cookie': 'PSTM=; BIDUPSID=C6D409FA9EC7DBCD64A2D7581; BD_UPN=;',
}

# Baidu paginates with the `pn` parameter in steps of 10 results
for offset in range(0, (pages - 1) * 10 + 1, 10):
    url = DOMAIN + keyword + '&pn=' + str(offset)
    response = requests.get(url, headers=headers)
    # Process the response content here
```
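One way to fill in the processing step is to pull the result links out of each page with `BeautifulSoup`. This sketch assumes each organic result title is an `<a>` tag inside an `<h3>` heading; Baidu's markup changes over time, so the selector may need adjusting:

```python
from bs4 import BeautifulSoup

def extract_result_urls(html_text):
    # Assumption: result titles are <a> tags nested inside <h3> headings
    soup = BeautifulSoup(html_text, 'html.parser')
    return [a['href'] for a in soup.select('h3 a[href]')]

# Usage inside the loop above:
# for link in extract_result_urls(response.text):
#     print(link)
```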
The examples above show how to fetch multiple URLs with different Python libraries and tools. Choose the approach that best fits your specific needs.