Removing tags in Python: stripping HTML tags from scraped content

激活谷笔记 • 2026-03-28 12:23 • 26 views

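To remove HTML tags in Python while keeping the text, you can use a regular expression or an HTML parsing library such as BeautifulSoup or lxml. An example of each approach follows: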

Method 1: Using a regular expression

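python

import re

def remove_html_tags(text):
    # Match every tag, i.e. anything between '<' and '>'
    pattern = re.compile(r'<[^>]+>', re.S)
    # Replace each matched tag with an empty string
    result = pattern.sub('', text)
    return result

# Sample input; the specific tags used here are illustrative
html_content = '<p>This is a text that contains <b>HTML</b> tags.</p>'
text_without_tags = remove_html_tags(html_content)
print(text_without_tags)  # Output: This is a text that contains HTML tags.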
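
The pattern above removes tags only. As a rough sketch of how the regex approach could be extended, the example below also drops `<script>` and `<style>` blocks, decodes entities, and collapses whitespace. The helper name `strip_tags_regex` and both pattern names are hypothetical, and the patterns are illustrative rather than a complete HTML grammar.

```python
import html
import re

# Hypothetical names; these patterns are a sketch, not a full HTML parser.
SCRIPT_STYLE_RE = re.compile(r'<(script|style)\b[^>]*>.*?</\1\s*>', re.S | re.I)
TAG_RE = re.compile(r'<[^>]+>')

def strip_tags_regex(text):
    # Drop <script>/<style> blocks together with their contents.
    text = SCRIPT_STYLE_RE.sub('', text)
    # Remove the remaining tags, keeping the text between them.
    text = TAG_RE.sub('', text)
    # Decode entities such as &amp; into plain characters.
    text = html.unescape(text)
    # Collapse runs of whitespace left behind by the removed markup.
    return re.sub(r'\s+', ' ', text).strip()

sample = '<style>p { color: red; }</style><p>Tom &amp; Jerry</p>\n<script>var x = 1;</script>'
print(strip_tags_regex(sample))  # Output: Tom & Jerry
```

Entities are decoded after the tags are removed, so an escaped `&lt;b&gt;` in the text is kept as literal characters instead of being stripped as a tag.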

Method 2: Using BeautifulSoup

python

from bs4 import BeautifulSoup

def remove_html_tags_bs(html_content):
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')
    # Get the text content without the tags
    text_without_tags = soup.get_text()
    return text_without_tags

# Sample input; the specific tags used here are illustrative
html_content = '<p>This is a text that contains <b>HTML</b> tags.</p>'
text_without_tags = remove_html_tags_bs(html_content)
print(text_without_tags)  # Output: This is a text that contains HTML tags.
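
On real pages, block elements often run together in the output of a bare `get_text()`. A minimal sketch of one way to handle that, assuming you want fragments separated by single spaces: the helper name `extract_visible_text` is hypothetical. Depending on the Beautiful Soup version, `get_text()` may or may not include `<script>` and `<style>` contents, so the sketch removes those elements explicitly.

```python
from bs4 import BeautifulSoup

def extract_visible_text(html_content):
    # Hypothetical helper: returns text with fragments separated by single spaces.
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove <script> and <style> elements so their contents never reach the output.
    for element in soup(['script', 'style']):
        element.decompose()
    # Strip each text fragment, skip empty ones, and join the rest with a space.
    return soup.get_text(separator=' ', strip=True)

sample = '<div><h1>Title</h1><p>First paragraph.</p><script>var x = 1;</script></div>'
print(extract_visible_text(sample))  # Output: Title First paragraph.
```

Joining with a space keeps headings and paragraphs apart, at the cost of sometimes adding a space around inline tags such as `<b>`.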

Method 3: Using lxml

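python

from lxml import etree

def remove_html_tags_lxml(html_content):
    # Parse the HTML with lxml
    tree = etree.HTML(html_content)
    # xpath('//text()') returns a list of text nodes, not a single string
    text_nodes = tree.xpath('//text()')
    # Join the text nodes into one string
    return ''.join(text_nodes)

# Sample input; the specific tags used here are illustrative
html_content = '<p>This is a text that contains <b>HTML</b> tags.</p>'
text_without_tags = remove_html_tags_lxml(html_content)
print(text_without_tags)  # Output: This is a text that contains HTML tags.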
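
Because `//text()` selects every text node, it also picks up the contents of `<script>` and `<style>` elements. The sketch below shows one possible way to filter those out in the XPath expression itself and tidy the whitespace. The helper name `extract_text_lxml` is hypothetical, and the space-joining policy is an assumption about the desired output.

```python
from lxml import etree

def extract_text_lxml(html_content):
    # Hypothetical helper: visible text only, fragments separated by single spaces.
    tree = etree.HTML(html_content)
    # Select text nodes whose parent is not a <script> or <style> element.
    text_nodes = tree.xpath('//text()[not(parent::script or parent::style)]')
    # Trim each fragment, drop empty ones, and join the rest with single spaces.
    fragments = (node.strip() for node in text_nodes)
    return ' '.join(fragment for fragment in fragments if fragment)

sample = '<div><h1>Title</h1><p>First paragraph.</p><script>var x = 1;</script></div>'
print(extract_text_lxml(sample))  # Output: Title First paragraph.
```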

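All three methods strip the tags from HTML or XML and keep the text; which one to choose depends on your needs and preferences. For more complex HTML structures, BeautifulSoup and lxml are usually more powerful and flexible, because they parse the document instead of pattern-matching on it. A regular expression leaves the contents of script and style elements in the output, does not decode entities such as &amp;, and can break when an attribute value contains '>'. If you only need simple text extraction from markup you control, a regular expression may be enough.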
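
To make the trade-off concrete, here is a small comparison on an input chosen to trip up the regex: an attribute value that contains a literal `>`. The input string and variable names are illustrative assumptions, not taken from the examples above.

```python
import re
from bs4 import BeautifulSoup

TAG_RE = re.compile(r'<[^>]+>')

# Illustrative input: the title attribute contains a literal '>'.
tricky = '<a title="a > b">link</a>'

# The regex ends the "tag" at the first '>', which is inside the attribute value.
regex_result = TAG_RE.sub('', tricky)
# The parser reads the quoted value as a single attribute and returns only the text.
parser_result = BeautifulSoup(tricky, 'html.parser').get_text()

print(repr(regex_result))   # Output: ' b">link'
print(repr(parser_result))  # Output: 'link'
```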