The Beautiful Soup 4 library

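Like lxml, Beautiful Soup is an HTML/XML parser whose main job is parsing HTML/XML documents and extracting data from them.
lxml only traverses the parts of a document it needs, whereas Beautiful Soup is based on the HTML DOM (Document Object Model): it loads the entire document and parses the whole DOM tree, so its time and memory overhead is considerably higher and its performance is lower than lxml's.
Beautiful Soup makes parsing HTML fairly simple. Its API is very user-friendly, and it supports CSS selectors and the HTML parser in the Python standard library, as well as lxml's HTML and XML parsers.
Beautiful Soup 3 is no longer being developed; Beautiful Soup 4 is recommended for current projects.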

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

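# Create a Beautiful Soup object, using lxml as the parser
# (html is assumed to hold the document source as a string)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())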
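As a self-contained illustration of the step above, the sketch below parses a small made-up document (the html string is hypothetical, not the article's sample). It assumes the standard-library "html.parser" so that it runs without lxml installed; passing "lxml" as in the text should work the same way when lxml is available.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# "html.parser" ships with Python; pass "lxml" instead if lxml is installed.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the parsed tree as an indented string.
pretty = soup.prettify()
print(pretty)
```

prettify() is mainly useful for inspecting a document by eye; the extraction methods described below work on the soup object itself.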

4. Four commonly used objects:

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types:

Tag
NavigableString
BeautifulSoup
Comment
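To make the four types concrete, here is a minimal sketch that produces one object of each kind from a tiny, hypothetical snippet. The snippet and variable names are assumptions for illustration, and "html.parser" is used so the example has no lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one tag holding a text node and an HTML comment.
html = "<p class='title'>Hello<!-- a comment --></p>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                # the <p> element
text = tag.contents[0]      # the text node "Hello"
comment = tag.contents[1]   # the comment node

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
```

Comment is a subclass of NavigableString, so code that only wants visible text may need to check for it explicitly.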

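1. Tag:

# Create the Beautiful Soup object
soup = BeautifulSoup(html, 'lxml')

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.head)
# <head><title>The Dormouse's story</title></head>

print(soup.a)
# (the first <a> tag in the document)

print(soup.p)
# (the first <p> tag, whose text is The Dormouse's story)

print(type(soup.p))
# <class 'bs4.element.Tag'>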
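Accessing a tag name as an attribute of soup is an easy way to get that tag; the resulting objects are of type bs4.element.Tag. Note, however, that this returns only the first matching tag in the document. Finding all matching tags is covered later.
A Tag has two important attributes, name and attrs. For example: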
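print(soup.name)
# [document]
# The soup object itself is special: its name is [document]

print(soup.head.name)
# head
# For any other tag, the value is the tag's own name

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# This prints all attributes of the p tag; the result is a dict

print(soup.p['class'])  # or soup.p.get('class')
# ['title']
# get() takes the attribute name and is equivalent when the attribute exists;
# for a missing attribute, get() returns None while indexing raises KeyError

soup.p['class'] = "newClass"
print(soup.p)
# Attributes and content can be modified like this
# The p tag is now printed with class="newClass"; its text is still The Dormouse's story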
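The same operations can be tried on a standalone snippet. The sketch below is an illustration only: the one-line html string is a hypothetical stand-in for the article's sample document, and "html.parser" is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document modelled on the attributes shown above.
html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(soup.name)    # [document]
print(p.name)       # p
print(p.attrs)      # {'class': ['title'], 'name': 'dromouse'}
print(p["class"])   # ['title']
print(p.get("id"))  # None, because the tag has no id attribute

# Assigning to an attribute changes the tree in place.
p["class"] = "newClass"
print(p)
```

class comes back as a list because it is a multi-valued attribute in HTML, while name is a plain string.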

2. NavigableString:

Once you have a tag, you can get the text inside it with tag.string. For example:

print(soup.p.string)
# The Dormouse's story

print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>
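One detail worth seeing in code is that .string only works when a tag has a single string to offer. The following sketch uses a hypothetical two-paragraph snippet and "html.parser" (both assumptions, not from the article) to show the difference.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical snippet: one paragraph with only text, one with mixed content.
html = "<div><p>Only text</p><p>Text <b>and</b> a child tag</p></div>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

print(first.string)                               # Only text
print(isinstance(first.string, NavigableString))  # True

# With several children, .string is ambiguous and evaluates to None.
print(second.string)      # None
print(second.get_text())  # Text and a child tag
```

When a tag may contain nested markup, get_text() is usually the safer way to collect its text.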

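1. contents and children

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')
head_tag = soup.head

# .contents returns a list of all direct child nodes
print(head_tag.contents)

# .children returns an iterator over the same child nodes
for child in head_tag.children:
    print(child)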
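A runnable variant of the code above, as a sketch: html_doc here is a short hypothetical document rather than the article's sample, and "html.parser" is assumed so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical document whose <head> has two direct children.
html_doc = ("<html><head><title>Demo</title><meta charset='utf-8'/></head>"
            "<body></body></html>")
soup = BeautifulSoup(html_doc, "html.parser")
head_tag = soup.head

# .contents is a real list, so it supports len() and indexing.
children_list = head_tag.contents
print(len(children_list))     # 2
print(children_list[0].name)  # title

# .children yields the same nodes one at a time.
names = [child.name for child in head_tag.children]
print(names)                  # ['title', 'meta']
```

Both attributes cover direct children only; nodes nested deeper are not included.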

2. strings and stripped_strings

If a tag contains more than one string, you can iterate over them with .strings:

for string in soup.strings:
    print(repr(string))
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'

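The output may contain a lot of extra spaces and blank lines. Use .stripped_strings to remove them: it skips whitespace-only strings and strips leading and trailing whitespace from the rest: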

for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'


(6) Getting content

The select method always returns its results as a list, so you can iterate over them and call get_text() on each element to get its content.

soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())
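For a version that can be run on its own, the sketch below applies select and get_text() to a hypothetical document. The p.item selector and the html string are assumptions for illustration, and "html.parser" stands in for lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a title and two paragraphs sharing a class.
html = ("<html><head><title>Demo</title></head><body>"
        "<p class='item'>one</p><p class='item'>two</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns all matches as a list.
items = soup.select("p.item")
texts = [tag.get_text() for tag in items]
print(texts)  # ['one', 'two']

# Index into the result when only the first match is wanted.
print(soup.select("title")[0].get_text())  # Demo

# select_one() returns the first match directly, or None if there is none.
first = soup.select_one("p.item")
print(first.get_text())  # one
```

Indexing with [0] raises IndexError on an empty result, so select_one is the more convenient choice when a match is not guaranteed.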
