[Python Crawler] The BeautifulSoup Parsing Library
BeautifulSoup parses HTML and XML documents.
Contents
- First look at Beautiful Soup
- The four parsers supported by Beautiful Soup
- Basic elements of the BeautifulSoup class
- Basic usage
- Tag selectors
- Node operations
- Standard selectors
- find_all(name, attrs, recursive, text, **kwargs)
- find(name, attrs, recursive, text, **kwargs)
- CSS selectors
- Example: a crawler for Chinese university rankings
First look at Beautiful Soup
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Beautiful Soup is a Python library for extracting data from HTML and XML documents: it parses a document into a tree structure and provides ways to navigate the tree and pull out the information you need.
It is a flexible, convenient, and efficient parsing library that supports several underlying parsers (introduced below). With it you can extract information from web pages without writing regular expressions.
Installation
Beautiful Soup 3 is no longer developed; use Beautiful Soup 4 in current projects. To install:

```shell
pip install beautifulsoup4
```
The four parsers supported by Beautiful Soup

Parser | Usage | Advantages | Disadvantages
---|---|---|---
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; decent speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 / 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; tolerant of malformed documents | Requires the C-based lxml library
lxml XML parser | BeautifulSoup(markup, "xml") | Very fast; the only parser that supports XML | Requires the C-based lxml library
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses pages the same way a browser does; produces valid HTML5 | Very slow; an external pure-Python dependency
If you just want to parse an HTML document, creating a BeautifulSoup object from it is enough: Beautiful Soup will pick a parser automatically. You can also name a parser explicitly. The first argument to BeautifulSoup is the document string (or file handle) to be parsed; the second identifies how to parse it. If the second argument is omitted, Beautiful Soup chooses among the libraries installed on the system, in this order of preference: lxml, html5lib, then Python's built-in html.parser.
Installing the parser libraries:

```shell
pip install html5lib
pip install lxml
```
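As a quick check of the parser behavior described above, the same broken fragment can be fed to each parser and the repaired trees compared. This is a minimal sketch (the fragment is invented for illustration), and it degrades gracefully when lxml or html5lib is not installed:

```python
from bs4 import BeautifulSoup

# A deliberately unclosed fragment; each parser repairs it in its own way.
broken = "<ul><li>one<li>two"

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(broken, parser)
        print(parser, "->", soup.find_all("li"))
    except Exception as exc:  # bs4 raises FeatureNotFound if the library is missing
        print(parser, "unavailable:", exc)
```

All three parsers recover both `<li>` elements, but the exact tree (implied closing tags, added `<html>`/`<body>` wrappers) differs between them, which is why results can change when you switch parsers.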
Basic elements of the BeautifulSoup class

Element | Description
---|---
Tag | A tag, the basic information unit, delimited by <> and </>
Name | The tag's name; accessed as <tag>.name
Attributes | The tag's attributes, a dictionary; accessed as <tag>.attrs
NavigableString | The non-attribute string inside a tag; accessed as <tag>.string
Comment | A comment inside a tag, a special kind of NavigableString
Basic usage
Fault tolerance: a fault-tolerant parser can cope with incomplete HTML. The snippet below is missing its closing tags, yet parsing it with BeautifulSoup yields a BeautifulSoup object that can be printed in a cleanly indented structure. (The HTML string is reconstructed from the tag attributes visible in the output; it is the standard "Dormouse" example from the official documentation.)

```python
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())       # fix the indentation and print the document as a structured tree
print(soup.title.string)
```

Output: the prettified document, with the missing closing tags filled in by the parser, followed by:

```
The Dormouse's story
```
Tag selectors
Selecting a tag by name returns the first matching element when several exist.
Getting the tag name, the tag itself, its text content, and its attributes:

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The is pppp</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.title)              # the whole <title> tag
print(soup.title.name)         # the tag's name
print(soup.title.text)         # the tag's text content
print(soup.p.text)
print(soup.p.string)
dic = soup.p.attrs             # all attributes of the first <p>, as a dict
print(dic)
print(dic["name"])
print(soup.p.attrs["class"])   # one attribute value; class is returned as a list
print(soup.p["class"])
```

Output:

```
<title>The Dormouse's story</title>
title
The Dormouse's story
The is pppp
The is pppp
{'class': ['title'], 'name': 'dromouse'}
dromouse
['title']
['title']
```
Nested tag selection

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<div><b class="bb bcls xiong">The Dormouse's story</b></div>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.div.b['class'])            # nested selection: the <b> inside the first <div>
print(list(soup.p.stripped_strings))  # the tag's strings, surrounding whitespace stripped
print(soup.p.text)
```

Output:

```
['bb', 'bcls', 'xiong']
['Once upon a time there were three little sisters; and their names were', ',', 'Lacie', 'and', 'Tillie', ';\nand they lived at the bottom of a well.']
```

followed by the paragraph text printed with its original line breaks.
Node operations
Children and descendants
A tag's children include not only tag nodes but also string nodes; the whitespace between tags shows up as '\n' strings.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)       # direct children, stored in a list
print("=" * 70 + ">")
print(soup.p.children)       # direct children, as an iterator
for i, child in enumerate(soup.p.children):
    print(i, child)          # each child is a bs4.element object
print("=" * 70 + ">")
print(soup.p.descendants)    # all descendants, as a generator
for i, child in enumerate(soup.p.descendants):
    print(i, child)
```

.contents returns the seven direct children of the first <p> as a list: string fragments interleaved with the three <a> tags. Iterating .children yields the same seven nodes one by one. Iterating .descendants goes deeper and yields thirteen nodes in all: it also includes the <span> nested inside the first <a> and the bare strings inside each tag.
Parents and ancestors

```python
from bs4 import BeautifulSoup

# html is the same document as in the previous example
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)    # the direct parent of the first <a>
print("=" * 72 + ">")
print(soup.a.parents)   # all ancestors, as a generator
for item in soup.a.parents:
    print(item)
```

.parent is the enclosing <p class="story"> tag. .parents walks upward through every ancestor and yields, in turn, the <p>, the <body>, the <html> element, and finally the whole document object.
Siblings

```python
from bs4 import BeautifulSoup

# html is the same document as in the previous example
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_sibling)))       # the next sibling node
print(list(enumerate(soup.a.next_siblings)))      # all following siblings
print(list(enumerate(soup.a.previous_sibling)))   # the previous sibling node
print(list(enumerate(soup.a.previous_siblings)))  # all preceding siblings
```

The first <a>'s next sibling is the '\n' string after it; its following siblings are that string, the other two <a> tags, and the string fragments between them; its previous sibling is the leading "Once upon a time..." text. Note that next_sibling / previous_sibling each return a single node, and when that node is a string, enumerating it iterates over its characters; that is why the third print spells the preceding text out character by character. next_siblings / previous_siblings return generators over the siblings themselves, which is what you normally want to enumerate.
Standard selectors: find() / find_all() (★★★★★)
Content-searching methods of the bs4 library:

<>.find_all(name, attrs, recursive, text, **kwargs)  # returns a list of matches

- name: a string matched against tag names
- attrs: a string matched against tag attribute values; a specific attribute can be named
- recursive: whether to search all descendants; defaults to True
- text: a string matched against text content

Several other find_* methods share this signature; they are listed at the end of this section.
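To make the four parameters concrete, here is a small sketch against a made-up fragment (the tag names and attribute values are invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div id="content">
  <a class="link" href="/a">first</a>
  <p><a class="link" href="/b">second</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("a"))                      # name: every <a> tag
print(soup.find_all(attrs={"class": "link"}))  # attrs: match by attribute value
print(soup.find_all("a", recursive=False))     # only direct children of soup -> []
print(soup.div.find_all("a", recursive=False)) # only the <a> directly under <div>
print(soup.find_all(text="first"))             # text: matches string nodes, not tags
```

With recursive=False the search stops at direct children, so the second <a> (nested inside the <p>) is skipped; the text search returns the matching strings themselves rather than their enclosing tags.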
find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or content.
name

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
```

The first print returns a list containing both <ul> tags with their <li> children; the second shows that each result element is a <class 'bs4.element.Tag'>.
Searches can be nested — call find_all again on each result:

```python
from bs4 import BeautifulSoup

# html is the same panel/list document as above
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
```

Output:

```
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
```
Attributes: attrs

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list2 list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))        # recommended form
print(soup.find_all(id="list-1"))                   # keyword-argument form; same result
print(soup.find_all(attrs={'class': 'list-small'}))
print(soup.find_all(class_="list2"))                # class is a Python keyword, hence class_
```

The first two prints both return the <ul id="list-1" name="elements"> tag; the last two both return the <ul id="list-2" class="list2 list-small"> tag — an element matches a class query when any one of its classes matches.
text

```python
from bs4 import BeautifulSoup

# html is the same panel/list document as above
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))   # matches string nodes, not tags
```

Output:

```
['Foo', 'Foo']
```

(In newer versions of Beautiful Soup this parameter is named string; text is kept as an alias.)
find(name, attrs, recursive, text, **kwargs)
find() returns the first matching element; find_all() returns all of them. When nothing matches, find() returns None.

```python
from bs4 import BeautifulSoup

# html is the same panel/list document as above
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))
```

The first print shows the <ul class="list" id="list-1"> tag with its three <li> children; the type is <class 'bs4.element.Tag'>; the last print outputs None, because the document has no <page> tag.
find_parents() / find_parent()
find_parents() returns all matching ancestors; find_parent() returns the nearest one.
find_next_siblings() / find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() / find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() / find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first one.
find_all_previous() / find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first one.
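A brief sketch of a few of these variants on a made-up fragment (element names and ids invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<body>
  <p class="story">
    <a id="link1" href="/elsie">Elsie</a>
    <a id="link2" href="/lacie">Lacie</a>
  </p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="link1")

print(first.find_parent("p"))        # nearest enclosing <p>
print(first.find_next_sibling("a"))  # the <a id="link2"> tag
print(soup.find(id="link2").find_previous_sibling("a"))  # back to link1
print(first.find_all_next("a"))      # every <a> after this node in the document
```

Because these methods take the same filters as find()/find_all(), passing "a" skips over the whitespace string nodes that sit between the tags.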
CSS selectors (★★★★★)
Pass a CSS selector string to select() to make a selection directly. (The second panel-heading, containing "World", is reconstructed from the printed output.)

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-heading">
        <h4>World</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))   # class selectors
print(soup.select('ul li'))                   # tag selectors
print(soup.select('#list-2 .element'))        # id selector combined with a class selector
print(type(soup.select('ul')[0]))
```

The first print returns both panel-heading divs; the second, all five <li> elements; the third, the two <li> elements inside #list-2; and the type is <class 'bs4.element.Tag'>.
select() can also be nested:

```python
from bs4 import BeautifulSoup

# html is the same document as above
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
```

This prints one list of <li> tags per <ul>: three for #list-1 and two for #list-2.
Getting attributes
Either ul.attrs['id'] or the shorthand ul['id'] works:

```python
from bs4 import BeautifulSoup

# html is the same document as above
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
```

Output:

```
list-1
list-1
list-2
list-2
```
Getting text content
Use li.get_text():

```python
from bs4 import BeautifulSoup

# html is the same document as above
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
```

Output:

```
Foo
Bar
Jay
Foo
Bar
```
Summary:
- Prefer the lxml parser; fall back to html.parser when necessary
- Tag selectors offer only weak filtering but are fast
- Use find() / find_all() to match a single result or multiple results
- If you are comfortable with CSS selectors, use select()
Example: a crawler for Chinese university rankings
Step 1: fetch the ranking page from the web — getHTMLText()
Step 2: extract the information into a suitable data structure — fillUnivList()
Step 3: present the results from that data structure — printUnivList()

```python
import requests
from bs4 import BeautifulSoup
import bs4


def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "error"


def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # skip non-tag children (e.g. '\n' strings)
            tds = tr('td')                   # tr('td') is shorthand for tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])


def printUnivList(ulist, num):
    # Align the Chinese columns by padding with the full-width space chr(12288)
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))


def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)


if __name__ == '__main__':
    main()
```
Visualizing the scraped data with pyecharts (this code uses the pyecharts v0.x API):

```python
import requests
import bs4
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                        '(KHTML, like Gecko) Chrome/69.0.3472.3 Safari/537.36'}


def getHtmlText(url):
    try:
        ret = requests.get(url, headers=header, timeout=30)
        ret.encoding = "utf8"
        ret.raise_for_status()
        return ret.text
    except:
        return None


def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "lxml")
    for tr in soup.tbody.children:
        if isinstance(tr, bs4.element.Tag):  # skip non-tag children
            tds = tr("td")
            ulist.append([tds[0].string, tds[1].string, tds[2].string, tds[3].string])


def printUnivList(ulist, num):
    # Align the Chinese columns by padding with the full-width space chr(12288)
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[3], chr(12288)))


def showData(ulist, num):
    # Bar chart with a data-zoom slider
    from pyecharts import Bar
    attrs = []
    vals = []
    for i in range(num):
        attrs.append(ulist[i][1])
        vals.append(ulist[i][3])
    bar = Bar("2019中国大学排行榜")
    bar.add(
        "中国大学排行榜",
        attrs,
        vals,
        is_datazoom_show=True,
        datazoom_type="both",
        datazoom_range=[0, 10],
        xaxis_rotate=30,
        xaxis_label_textsize=8,
        is_label_show=True,
    )
    bar.render("2019中国大学排行榜4.html")


def showData_funnel(ulist, num):
    # Funnel chart variant
    from pyecharts import Funnel
    attrs = []
    vals = []
    for i in range(num):
        attrs.append(ulist[i][1])
        vals.append(ulist[i][3])
    funnel = Funnel(width=1000, height=800)
    funnel.add(
        "大学排行榜",
        attrs,
        vals,
        is_label_show=True,
        label_pos="inside",
        label_text_color="#fff",
    )
    funnel.render("2019中国大学排行榜4.html")


def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    html = getHtmlText(url)
    fillUnivList(uinfo, html)
    print(uinfo)
    # showData(uinfo, 100)
    showData_funnel(uinfo, 20)
    # printUnivList(uinfo, 30)


if __name__ == '__main__':
    main()
```
Supplement 1:
Python's built-in isinstance function
Syntax: isinstance(object, classinfo)
Purpose: tests whether an object is of a given type.
The first argument is the object to test; the second is a type name (such as int) or a tuple of type names (such as (int, list, float)). The return value is a bool (True or False).
If the object's type matches the second argument, the result is True. If the second argument is a tuple, the result is True when the object's type matches any member of the tuple.
Two examples:
Example 1

```python
>>> a = 4
>>> isinstance(a, int)
True
>>> isinstance(a, str)
False
>>> isinstance(a, (str, int, list))
True
```

Example 2

```python
>>> a = "b"
>>> isinstance(a, str)
True
>>> isinstance(a, int)
False
>>> isinstance(a, (int, list, float))
False
>>> isinstance(a, (int, list, float, str))
True
```
Supplement 2:
Response.raise_for_status()
If the request failed (a 4XX client error or a 5XX server error response), Response.raise_for_status() raises an exception:

```python
>>> bad_r = requests.get('http://httpbin.org/status/404')
>>> bad_r.status_code
404
>>> bad_r.raise_for_status()
Traceback (most recent call last):
  File "requests/models.py", line 832, in raise_for_status
    raise http_error
requests.exceptions.HTTPError: 404 Client Error
```

But when the response's status_code is 200, calling raise_for_status() raises nothing and simply returns:

```python
>>> r.raise_for_status()
None
```
References:
http://www.cnblogs.com/0bug/p/8260834.html
http://pyecharts.org/#/
https://www.cnblogs.com/kongzhagen/p/6472746.html
https://www.cnblogs.com/haiyan123/p/8289560.html
https://www.cnblogs.com/haiyan123/p/8317398.html