Java Crawler Series, Part 3: Parsing HTML with Jsoup


In the previous post 《》, I covered step one of crawling with HttpClient: fetching a page's html. Today we move on to step two: parsing the html we fetched.

Please welcome the star of step two: Jsoup. Let's hand the stage over to Jsoup and let him finish the rest of this post.

============ A fancy divider ============

I. Jsoup Introduces Itself

Hello everyone, I'm Jsoup.

I am an HTML parser for Java. I can parse a URL or a chunk of HTML text directly, and I offer a very low-effort API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. Nine out of ten people who write crawlers in Java have used me. Why? Because in this domain I am powerful and easy to use. Don't believe it? Read on; code doesn't lie.
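Before the real example below, here is a minimal sketch of that API on a toy HTML string (the snippet, class name, and values are invented purely for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTaste {
    public static void main(String[] args) {
        // A tiny, self-contained HTML snippet to parse
        String html = "<html><head><title>Demo</title></head>"
                + "<body><a href='https://example.com' class='link'>Example</a></body></html>";
        Document doc = Jsoup.parse(html);
        // DOM-style access
        System.out.println(doc.title());        // Demo
        // jQuery-style CSS selector access
        Element link = doc.select("a.link").first();
        System.out.println(link.text());        // Example
        System.out.println(link.attr("href"));  // https://example.com
    }
}
```

The same three lines of extraction logic scale unchanged from this toy snippet to a full fetched page.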

II. Parsing html with Jsoup

In the previous post, big brother HttpClient fetched the html of the cnblogs homepage, but it's a wall of markup that non-programmers can't make sense of. That's where I, the html-parsing specialist, come in.

The following case shows how to parse with Jsoup; it extracts the cnblogs homepage title and the first page of the blog article list.

Here is the code (it builds on the code from the previous post; if you don't yet know how to use httpclient, go read that one first):

  1. Add the dependency:

```xml
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>
```
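The snippet above is the Maven coordinate; if your project happens to use Gradle instead, the equivalent dependency line (same group, artifact, and version) would be:

```groovy
dependencies {
    implementation 'org.jsoup:jsoup:1.12.1'
}
```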
  2. Implement the code. Before coding, analyze the html structure first. The title is self-explanatory; what about the article list? Press F12 in your browser and inspect the page source: the list is one large div with id="post_list", and each article is a smaller div with class="post_item".

Now for the code. The core Jsoup code is below (the complete source is given at the end of this post):

```java
/**
 * Below is Jsoup's stage to shine
 */
//6. Parse the html with Jsoup
Document document = Jsoup.parse(html);
//Like in js, get the title by tag name
System.out.println(document.getElementsByTag("title").first());
//Like in js, get the article-list element by id
Element postList = document.getElementById("post_list");
//Like in js, get all blog posts under the list by class
Elements postItems = postList.getElementsByClass("post_item");
//Process each blog post in a loop
for (Element postItem : postItems) {
    //Like a jQuery selector, get the article-title element
    Elements titleEle = postItem.select(".post_item_body a[class='titlelnk']");
    System.out.println("Article title: " + titleEle.text());
    System.out.println("Article URL: " + titleEle.attr("href"));
    //Like a jQuery selector, get the article-author element
    Elements footEle = postItem.select(".post_item_foot a[class='lightblue']");
    System.out.println("Article author: " + footEle.text());
    System.out.println("Author homepage: " + footEle.attr("href"));
    System.out.println("*********************************");
}
```

As the code shows, I parse the html that httpclient fetched with Jsoup.parse(String html) to get a Document, and the Document then offers two ways to reach child elements: js-style getElementXXXX methods, and the jQuery-selector-style select() method. Either works, but I personally recommend select(). For an element's attributes, such as a link address, use element.attr(String); for an element's text content, use element.text().

  3. Run the code and check the result. (I can't help marveling at how prolific the cnblogs community is: in the time between analyzing the homepage html structure above and finishing the Jsoup run, quite a few new articles appeared on the homepage.) Because new articles are published so quickly, the screenshot above differs slightly from the output here.

III. Other Things Jsoup Can Do

I, Jsoup, can do more than build on big brother HttpClient's work; I can also work entirely on my own: fetch a page myself, then parse it myself. I have already shown off my parsing skills above, so here is fetching on my own. It is actually simple; the difference is that I get a Document back directly, so there is no need for a separate Jsoup.parse() call.

Besides accessing resources on the web directly, I can also parse local resources:

Code:

```java
public static void main(String[] args) {
    try {
        Document document = Jsoup.parse(new File("d://1.html"), "utf-8");
        System.out.println(document);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
```

IV. Another Jsoup Feature Worth Mentioning

You have surely run into this: if someone types html elements into a text box on your page, the saved content will very likely wreck the page layout when it is rendered later. If that content could be filtered, things would be perfect.

As it happens, I, Jsoup, can do exactly that.
```java
public static void main(String[] args) {
    String unsafe = "<p><a href='url' onclick='stealCookies()'>cnblogs</a></p>";
    System.out.println("unsafe: " + unsafe);
    String safe = Jsoup.clean(unsafe, Whitelist.basic());
    System.out.println("safe: " + safe);
}
```

The Jsoup.clean method filters the input against a whitelist. The result:

```
unsafe: <p><a href='url' onclick='stealCookies()'>cnblogs</a></p>
safe: <p><a rel="nofollow">cnblogs</a></p>
```

V. Closing Words

By now you must believe how capable I am: I can parse the html elements that HttpClient fetched, I can fetch a page's dom by myself, and I can load and parse html files saved locally.

On top of that, I can filter a string against a whitelist and screen out unsafe content.

Most importantly, every API call shown above is simple to make.

============ A fancy divider ============

Writing all this up takes effort; leave a like before you go~~

Finally, here is the complete source for this post's example, parsing the cnblogs homepage article list:

```java
package httpclient_learn;

import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.HttpClientUtils;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HttpClientTest {

    public static void main(String[] args) {
        //1. Create an httpclient, like opening a browser
        CloseableHttpClient httpClient = HttpClients.createDefault();
        CloseableHttpResponse response = null;
        //2. Create a GET request, like typing the URL into the address bar
        HttpGet request = new HttpGet("https://www.cnblogs.com/");
        //Set a request header to disguise the crawler as a browser
        request.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36");
        //HttpHost proxy = new HttpHost("60.13.42.232", 9999);
        //RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
        //request.setConfig(config);
        try {
            //3. Execute the GET request, like hitting Enter after typing the address
            response = httpClient.execute(request);
            //4. If the response status is 200, process the body
            if (response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                //5. Get the response content
                HttpEntity httpEntity = response.getEntity();
                String html = EntityUtils.toString(httpEntity, "utf-8");
                System.out.println(html);
                /**
                 * Below is Jsoup's stage to shine
                 */
                //6. Parse the html with Jsoup
                Document document = Jsoup.parse(html);
                //Like in js, get the title by tag name
                System.out.println(document.getElementsByTag("title").first());
                //Like in js, get the article-list element by id
                Element postList = document.getElementById("post_list");
                //Like in js, get all blog posts under the list by class
                Elements postItems = postList.getElementsByClass("post_item");
                //Process each blog post in a loop
                for (Element postItem : postItems) {
                    //Like a jQuery selector, get the article-title element
                    Elements titleEle = postItem.select(".post_item_body a[class='titlelnk']");
                    System.out.println("Article title: " + titleEle.text());
                    System.out.println("Article URL: " + titleEle.attr("href"));
                    //Like a jQuery selector, get the article-author element
                    Elements footEle = postItem.select(".post_item_foot a[class='lightblue']");
                    System.out.println("Article author: " + footEle.text());
                    System.out.println("Author homepage: " + footEle.attr("href"));
                    System.out.println("*********************************");
                }
            } else {
                //If the status is not 200, e.g. 404 (page not found), handle as appropriate; omitted here
                System.out.println("Response status is not 200");
                System.out.println(EntityUtils.toString(response.getEntity(), "utf-8"));
            }
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //7. Close resources
            HttpClientUtils.closeQuietly(response);
            HttpClientUtils.closeQuietly(httpClient);
        }
    }
}
```
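One loose end: section III says Jsoup can fetch a page on its own and hand back a Document directly, with no separate parse step. A minimal sketch of that direct-fetch usage (the URL and user-agent string are taken from the example above; the timeout value is an arbitrary choice):

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFetch {
    public static void main(String[] args) {
        try {
            // Jsoup fetches the page and parses it in one step,
            // returning a Document directly, so no Jsoup.parse() call is needed
            Document document = Jsoup.connect("https://www.cnblogs.com/")
                    .userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                            + "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36")
                    .timeout(10_000)  // fail fast instead of hanging forever
                    .get();
            System.out.println(document.title());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```

From here, the same getElementById/select extraction shown earlier applies unchanged to the returned Document.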