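A Simple Qichacha (企查查) Crawler
After working with the Qichacha site, I came away convinced of how important packet capture is, so much so that I have decided to use packet capture for simulating requests from now on and to give up analyzing them with F12.
I'm writing this post in memory of the late F12~~~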
import requests
from lxml import etree

# Search results page for the keyword "天津滨海新区" (URL-encoded)
url = "https://www.qcc.com/search?key=%E5%A4%A9%E6%B4%A5%E6%BB%A8%E6%B5%B7%E6%96%B0%E5%8C%BA"

# Request headers taken from a captured browser request.
# The cookie is tied to one session and will expire.
hed = {
    "host": "www.qcc.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36",
    "upgrade-insecure-requests": "1",
    "cookie": "QCCSESSID=vpk1mpc45ci95eu83etg528881; zg_did=%7B%22did%22%3A%20%221732cdcac86bf-0039dd6baef69a-4353761-100200-1732cdcac8844f%22%7D; UM_distinctid=1732cdcb0a713b-01b058b949aa5a-4353761-100200-1732cdcb0ab44e; hasShow=1; _uab_collina=159418552807339394444789; acw_tc=7d27c71c15941953776602556e6b8442bc8001e4e1270e8fead4b79557; CNZZDATA1254842228=1092104090-1594185078-https%253A%252F%252Fwww.baidu.com%252F%7C1594195878; Hm_lvt_78f134d5a9ac3f92524914d0247e70cb=1594194111,1594195892,1594195918,1594196042; Hm_lpvt_78f134d5a9ac3f92524914d0247e70cb=1594196294; zg_de1d1a35bfa24ce29bbf2c7eb17e6c4f=%7B%22sid%22%3A%201594185526424%2C%22updated%22%3A%201594196294349%2C%22info%22%3A%201594185526455%2C%22superProperty%22%3A%20%22%7B%7D%22%2C%22platform%22%3A%20%22%7B%7D%22%2C%22utm%22%3A%20%22%7B%5C%22%24utm_source%5C%22%3A%20%5C%22baidu1%5C%22%2C%5C%22%24utm_medium%5C%22%3A%20%5C%22cpc%5C%22%2C%5C%22%24utm_term%5C%22%3A%20%5C%22pzsy%5C%22%7D%22%2C%22referrerDomain%22%3A%20%22www.baidu.com%22%2C%22cuid%22%3A%20%22fd05f1ac2b561244aaa6b27b3bb617a4%22%7D",
}

# Fetch the page and parse the raw bytes into an lxml element tree
resq = requests.get(url=url, headers=hed).content
response = etree.HTML(resq)

# Company names: text of the link in the third column of each result row
title_list = []
title = response.xpath('//*[@id="search-result"]//tr/td[3]/a//text()')
for tit in title:
    tit = tit.replace(',', '').strip()
    title_list.append(tit)

# Addresses: text of the fourth <p> in the same cell
addr_list = []
addrs = response.xpath('//*[@id="search-result"]//tr/td[3]/p[4]//text()')
for addr in addrs:
    addr = addr.replace(',', '').strip()
    addr_list.append(addr)

print(title_list)
print(addr_list)
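The code is simple, even crude. So why write this crawler up? Because of the request headers. When I analyzed them myself and copied them over with Ctrl+C / Ctrl+V, the header data came out inaccurate, and it became very clear to me that analyzing requests with a packet capture tool is faster and more reliable.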
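One way to avoid retyping headers by hand is to paste the raw header block from the capture tool and let a small helper build the dict. The snippet below is only a sketch of that idea, not part of the original crawler: the function name `parse_raw_headers` and the sample `RAW` text (including its placeholder cookie) are made up for illustration, and it assumes the capture tool exports headers as plain `Name: value` lines.

```python
def parse_raw_headers(raw: str) -> dict:
    """Turn a raw header block copied from a capture tool into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so values such as URLs stay intact.
        name, sep, value = line.partition(":")
        name = name.strip()
        # Skip anything that is not a plain header line: the request line
        # ("GET /path HTTP/1.1") and HTTP/2 pseudo-headers (":authority").
        if not sep or not name or " " in name:
            continue
        headers[name.lower()] = value.strip()
    return headers


# Hypothetical sample of what a capture tool might show for one request.
RAW = """GET /search?key=test HTTP/1.1
Host: www.qcc.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
Referer: https://www.baidu.com/
Upgrade-Insecure-Requests: 1
Cookie: QCCSESSID=example; hasShow=1
"""

hed = parse_raw_headers(RAW)
print(hed)
```

The resulting `hed` dict can be passed to `requests.get(url, headers=hed)` in the same way as the hand-written one.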
Keep going, keep working hard.