[Python Web Scraping] The PyQuery Parsing Library
Contents
- Initialization
- Basic CSS selectors
- Finding elements
- Iteration
- Getting information
- DOM manipulation
- Pseudo-class selectors
PyQuery is a Python library closely modeled on jQuery, and its syntax is almost identical to jQuery's.
Official documentation: http://pyquery.readthedocs.io/
Installation
pip install pyquery
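Initialization
Initializing from a string
from pyquery import PyQuery as pq

# Sample markup: a list of five items
html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)      # parse the HTML string
print(doc('li'))    # select every <li> element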
Initializing from a URL
from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')  # fetch the page and parse the response
print(doc('head'))
Output:
<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下，你就知道</title></head>
Initializing from a file
from pyquery import PyQuery as pq

doc = pq(filename='demo.html')  # parse a local HTML file
print(doc('li'))
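Basic CSS selectors
from pyquery import PyQuery as pq

html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
# Descendant selector: <li> elements inside .list inside #container
print(doc('#container .list li'))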
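Finding elements
Child elements
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')  # find() searches all descendants
print(type(lis))
print(lis)
Output of the first two print calls:
<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>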
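from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
items = doc('.list')
lis = items.children()  # children() returns direct children only
print(type(lis))
print(lis)
Output of print(type(lis)):
<class 'pyquery.pyquery.PyQuery'>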
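from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
items = doc('.list')
lis = items.children('.active')  # keep only direct children with class "active"
print(lis)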
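Parent elements
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
items = doc('.list')
container = items.parent()  # parent() returns the direct parent only
print(type(container))
print(container)
Output:
<class 'pyquery.pyquery.PyQuery'>
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>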
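from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
items = doc('.list')
parents = items.parents()  # parents() returns every ancestor
print(type(parents))
print(parents)
Output (both ancestors are printed, the outer .wrap first):
<class 'pyquery.pyquery.PyQuery'>
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>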
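# Continuing with the same items object: filter the ancestors by selector
parent = items.parents('.wrap')
print(parent)
Output:
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>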
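Sibling elements
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('.list .item-0.active')  # the element with both classes: the third item
print(li.siblings())              # all other <li> elements at the same level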
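from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings('.active'))  # only the siblings that have class "active"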
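Iteration
A single element
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')  # a single match can be used directly
print(li)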
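Multiple elements
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
lis = doc('li').items()  # items() returns a generator of PyQuery objects
print(type(lis))
for li in lis:
    print(li)
Output of print(type(lis)):
<class 'generator'>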
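Getting information
Getting attributes
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.attr('href'))  # method-call form
print(a.attr.href)     # attribute-access form, same result
Output:
<a href="link3.html"><span class="bold">third item</span></a>
link3.html
link3.html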
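Getting text
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())  # text content with all tags stripped
Output:
<a href="link3.html"><span class="bold">third item</span></a>
third item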
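Getting HTML
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())  # inner HTML of the element, without the <li> tag itself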
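DOM manipulation
addClass, removeClass
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')  # class becomes "item-0"
print(li)
li.addClass('active')     # class is "item-0 active" again
print(li)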
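attr, css
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')      # set the name attribute
print(li)
li.css('font-size', '14px')  # set an inline style
print(li)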
remove
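from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
</div>
'''
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())       # text of .wrap, including the <p>
wrap.find('p').remove()  # delete the <p> from the tree
print(wrap.text())       # only the div's own text remains
Output (newer PyQuery releases may print the paragraph text on a separate line):
Hello, World This is a paragraph.
Hello, World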
Other DOM methods: http://pyquery.readthedocs.io/en/latest/api.html
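Pseudo-class selectors
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('li:first-child')       # first <li> among its siblings
print(li)
li = doc('li:last-child')        # last <li> among its siblings
print(li)
li = doc('li:nth-child(2)')      # the second <li>
print(li)
li = doc('li:gt(2)')             # jQuery-style: index greater than 2 (0-based)
print(li)
li = doc('li:nth-child(2n)')     # every even-numbered <li>
print(li)
li = doc('li:contains(second)')  # jQuery-style: text contains "second"
print(li)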
For more CSS selectors, see http://www.w3school.com.cn/css/index.asp