Python Web Scraping Basics: BeautifulSoup

I. Basic Usage of BeautifulSoup

from bs4 import BeautifulSoup
from bs4 import SoupStrainer
import re


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")
# print(soup.prettify())  # print the whole document as normalized, indented HTML
print('-----------------------------')
print(soup.title)  # the <title> tag
print('----------------------------')
print(soup.title.name)  # the tag name: title
print('----------------------------')
print(soup.title.string)  # the text inside <title>
print('----------------------------')
print(soup.title.parent.name)  # the name of the parent tag: head
print('----------------------------')
print(soup.p)  # the first <p> tag
print('----------------------------')
print(soup.p['class'])  # class is multi-valued, so this is a list: ['title']
print('----------------------------')
print(soup.find_all('a'))  # a list of every <a> tag
print('----------------------------')
print(soup.find(id='link3'))  # the first tag whose id is link3
print(soup.find(id='link3')['class'])
print(soup.find(id='link3')['href'])  # print the value of a given attribute
print(soup.find(id='link3')['id'])
print(soup.find(id='link3').get_text())  # print the text content

# find_all(name, attrs, recursive, string, limit, **kwargs)
# (the `string` argument was called `text` before Beautiful Soup 4.4.0)
# name argument
soup.find_all('title')

# keyword arguments
soup.find_all(id='link2')
soup.find_all(href=re.compile("elsie"))
soup.find_all(id=True)  # every tag that has an id attribute, whatever its value
soup.find_all(href=re.compile("elsie"), id='link1')  # several keyword arguments filter on several attributes at once
soup.find_all(attrs={"data-foo": "value"})  # use the attrs dict for attribute names that cannot be keyword arguments, such as data-*
soup.find_all('a', limit=2)  # stop searching once `limit` results have been found

# searching by CSS class
soup.find_all("a", class_="sister")
soup.find_all(class_=re.compile("itl"))  # class_ accepts the same filter types: a string, a regular expression, and so on

# CSS selectors
title_list = soup.select('head > title')  # all elements that match
title_list_one = soup.select_one('head > title')  # only the first element that matches
print(title_list)  # prints [<title>The Dormouse's story</title>]
print(title_list[0].string)  # prints The Dormouse's story

# find the links of all <a> tags in the document:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

# find() returns the first <p> tag whose class is "story"
p_story = soup.find('p', class_='story')
# print(p_story.a)

# using a regular expression as the tag-name filter
p_re_all = soup.find_all(re.compile('p'))
print(p_re_all)

# class_=True matches <p> tags that have a class attribute, whatever its value
p_all = soup.find_all('p', class_=True)
# print(p_all)  # prints a list of the three <p> tags: the "title" paragraph and the two "story" paragraphs
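
The listing above imports SoupStrainer but never uses it. As a minimal sketch of what it is for: passing a SoupStrainer as parse_only tells BeautifulSoup to keep only the matching tags while parsing, which can be handy when only the links of a page matter. The shortened html_doc and the names only_links and hrefs below are illustrative and not part of the original script.

```python
from bs4 import BeautifulSoup, SoupStrainer

# A shortened version of the sample document, for illustration only.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Three sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

# Keep only <a> tags while parsing; everything else is discarded.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_links)

# The resulting tree contains the three links and nothing else.
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)
print(soup.find("p"))  # the <p> tag was never added to the tree
```

Because the filtering happens at parse time, the rest of the document never enters the tree, so later searches only see the kept tags.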

II. BeautifulSoup in Practice

1. Parsing the HTML source of NetEase Cloud Music

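This is the NetEase Cloud Music link for the Chinese-language (华语) playlist category: http://music.163.com/#/discover/playlist/?order=hot&cat=华语&limit=35&offset=0. Opening Chrome DevTools (F12) and inspecting the page source in the Elements panel shows that the playlists on each page sit inside an iframe. Each playlist is one li tag, which contains the playlist's cover image, link, name, and so on.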
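
The query string in that link carries the paging parameters (order, cat, limit, offset). The sketch below builds the URLs for the first few pages with urllib.parse.urlencode. It rests on two assumptions that the article does not confirm: that the iframe content is served at the same path without the "#/" fragment, and that offset advances by limit from one page to the next. BASE_URL and playlist_page_url are hypothetical names.

```python
from urllib.parse import urlencode

# Assumption: the iframe loads the same path as the link above, minus "#/".
BASE_URL = "http://music.163.com/discover/playlist/"


def playlist_page_url(cat, page, page_size=35, order="hot"):
    """Build the URL of one result page (page numbering starts at 0)."""
    params = {
        "order": order,
        "cat": cat,
        "limit": page_size,
        # Assumption: each page skips page * page_size playlists.
        "offset": page * page_size,
    }
    # urlencode percent-encodes non-ASCII values such as the category name.
    return BASE_URL + "?" + urlencode(params)


urls = [playlist_page_url("华语", page) for page in range(3)]
for url in urls:
    print(url)
```

Nothing is downloaded here; the sketch only shows how the parameters fit together.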

First, extract a fragment of the HTML source:

<ul class="m-cvrlst f-cb" id="m-pl-container">
  <li>
    <div class="u-cover u-cover-1">
      <img class="j-flag" src="http://p1.music.126.net/FGe-rVrHlBTbnOvhMR99PQ==/109951162989189558.jpg?param=140y140" />
      <a title="【说唱】留住你一面，画在我心间" href="/playlist?id=832790627" class="msk"></a>
      <div class="bottom">
        <a class="icon-play f-fr" title="播放" href="javascript:;" data-res-type="13" data-res-id="832790627" data-res-action="play"></a>
        <span class="icon-headset"></span>
        <span class="nb">1615</span>
      </div>
    </div>
    <p class="dec"> <a title="【说唱】留住你一面，画在我心间" href="/playlist?id=832790627" class="tit f-thide s-fc0">【说唱】留住你一面，画在我心间</a> </p>
    <p><span class="s-fc4">by</span> <a title="JediMindTricks" href="/user/home?id=17647877" class="nm nm-icn f-thide s-fc3">JediMindTricks</a> <sup class="u-icn u-icn-84 "></sup> </p>
  </li>
  <li>
    <div class="u-cover u-cover-1">
      <img class="j-flag" src="http://p1.music.126.net/If644P7ZrfPm_qcvtYyfzg==/18936888765458653.jpg?param=140y140" />
      <a title="鞋子好看｜国产自赏摇滚噪音流行" href="/playlist?id=721462105" class="msk"></a>
      <div class="bottom">
        <a class="icon-play f-fr" title="播放" href="javascript:;" data-res-type="13" data-res-id="721462105" data-res-action="play"></a>
        <span class="icon-headset"></span>
        <span class="nb">77652</span>
      </div>
    </div>
    <p class="dec"> <a title="鞋子好看｜国产自赏摇滚噪音流行" href="/playlist?id=721462105" class="tit f-thide s-fc0">鞋子好看｜国产自赏摇滚噪音流行</a> </p>
    <p><span class="s-fc4">by</span> <a title="原创君" href="/user/home?id=201586" class="nm nm-icn f-thide s-fc3">原创君</a> <sup class="u-icn u-icn-1 "></sup> </p>
  </li>
</ul>

Now parse the HTML source:

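First, instantiate a BeautifulSoup object with html.parser as the parser. Then use the object's CSS selector method select_one() with an ID selector to locate the unordered list ul, and call find_all to get every li tag under it. Finally, iterate over the li tags to read each playlist's cover image URL, playlist URL, and name.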

from bs4 import BeautifulSoup

html = '''(paste the HTML fragment extracted above here)'''
soup = BeautifulSoup(html, 'html.parser')
ul = soup.select_one('#m-pl-container')  # the <ul> that holds the playlists
for li in ul.find_all('li'):
    img_url = li.img['src']  # cover image URL
    a_msk = li.find('a', class_='msk')  # the link that carries href and title
    # href is site-relative ("/playlist?id=..."), so prepend the site host
    music_list_url = 'http://music.163.com%s' % a_msk['href']
    music_list_name = a_msk['title']
    print(img_url, music_list_url, music_list_name)  # prints: http://p1.music.126.net/FGe-rVrHlBTbnOvhMR99PQ==/109951162989189558.jpg?param=140y140 http://music.163.com/playlist?id=832790627 【说唱】留住你一面，画在我心间
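
The same steps can also be sketched as a small self-contained function, in case the page structure is not exactly what the fragment suggests. This version embeds a trimmed copy of the fragment, skips any li that lacks the expected tags, and resolves the relative href with urllib.parse.urljoin. The function name parse_playlists, the BASE_URL constant, and the empty second li are illustrative assumptions, not part of the original code.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: playlist links are relative to the site root.
BASE_URL = "http://music.163.com/"

# A trimmed copy of the fragment above, plus one <li> without the expected tags.
html = '''
<ul class="m-cvrlst f-cb" id="m-pl-container">
  <li>
    <div class="u-cover u-cover-1">
      <img class="j-flag" src="http://p1.music.126.net/FGe-rVrHlBTbnOvhMR99PQ==/109951162989189558.jpg?param=140y140" />
      <a title="【说唱】留住你一面，画在我心间" href="/playlist?id=832790627" class="msk"></a>
    </div>
  </li>
  <li><div class="u-cover u-cover-1"></div></li>
</ul>
'''


def parse_playlists(html_text):
    """Return a list of dicts with the image URL, playlist URL and name."""
    soup = BeautifulSoup(html_text, "html.parser")
    ul = soup.select_one("#m-pl-container")
    if ul is None:  # the container is missing: nothing to parse
        return []
    playlists = []
    for li in ul.find_all("li"):
        img = li.find("img")
        a_msk = li.find("a", class_="msk")
        if img is None or a_msk is None:  # skip incomplete entries
            continue
        playlists.append({
            "image": img.get("src"),
            "url": urljoin(BASE_URL, a_msk.get("href", "")),
            "name": a_msk.get("title", ""),
        })
    return playlists


playlists = parse_playlists(html)
for item in playlists:
    print(item["image"], item["url"], item["name"])
```

Returning a list of dicts instead of printing inside the loop keeps the parsing separate from whatever is done with the results afterwards.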

III. Beautiful Soup 4.4.0

Beautiful Soup is a Python library for extracting data from HTML and XML files. For detailed usage, see the official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
