This repository has been archived by the owner on Sep 13, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 1
/
jsoupji-ben-ying-yong-jie-shao.html
228 lines (190 loc) · 19.8 KB
/
jsoupji-ben-ying-yong-jie-shao.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
<!DOCTYPE html>
<html lang="zh">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>jsoup基本应用介绍</title>
<link href="/feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="华科美团点评技术俱乐部 Full Atom Feed" />
<link href="/feeds/java.atom.xml" type="application/atom+xml" rel="alternate" title="华科美团点评技术俱乐部 Categories Atom Feed" />
<!-- Bootstrap Core CSS -->
<link href="/theme/css/bootstrap.min.css" rel="stylesheet">
<!-- Custom CSS -->
<link href="/theme/css/clean-blog.min.css" rel="stylesheet">
<!-- Code highlight color scheme -->
<link href="/theme/css/code_blocks/darkly.css" rel="stylesheet">
<!-- Custom Fonts -->
<link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css" rel="stylesheet" type="text/css">
<link href='https://fonts.googleapis.com/css?family=Lora:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,600italic,700italic,800italic,400,300,600,700,800' rel='stylesheet' type='text/css'>
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
<meta name="description" content="前些时间要做一个v2ex的第三方客户端,偶然发现了一个爬虫神器jsoup.运用jsoup对网站进行爬虫非常容易,仅需几行代码即可完成对整个网站的爬虫。 jsoup是的主要功能如下: 1.从一个url,文件或者字符串中解析HTML...">
<meta name="author" content="Zijie Han">
<meta name="tags" content="Jsoup">
<meta property="og:locale" content="zh_CN.UTF-8">
<meta property="og:site_name" content="华科美团点评技术俱乐部">
<meta property="og:type" content="article">
<meta property="article:author" content="/author/zijie-han.html">
<meta property="og:url" content="/jsoupji-ben-ying-yong-jie-shao.html">
<meta property="og:title" content="jsoup基本应用介绍">
<meta property="article:published_time" content="2017-04-20 08:22:00+08:00">
<meta property="og:description" content="前些时间要做一个v2ex的第三方客户端,偶然发现了一个爬虫神器jsoup.运用jsoup对网站进行爬虫非常容易,仅需几行代码即可完成对整个网站的爬虫。 jsoup是的主要功能如下: 1.从一个url,文件或者字符串中解析HTML...">
<meta property="og:image" content="//images/bg.jpg">
</head>
<body>
<!-- Navigation -->
<nav class="navbar navbar-default navbar-custom navbar-fixed-top">
<div class="container-fluid">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header page-scroll">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/">华科美团点评技术俱乐部</a>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav navbar-right">
<li><a href="/categories.html">分类</a></li>
<li><a href="/archives.html">归档</a></li>
<li><a href="/authors.html">作者</a></li>
<li><a href="/tags.html">标签</a></li>
<li><a href="/pages/about/index.html">关于</a></li>
<li><a href="/pages/friendlinks/index.html">友链</a></li>
</ul>
</div>
<!-- /.navbar-collapse -->
</div>
<!-- /.container -->
</nav>
<!-- Page Header -->
<header class="intro-header" style="background-image: url('/images/bg.jpg')">
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<div class="post-heading">
<h1>jsoup基本应用介绍</h1>
<span class="meta">Posted by
<a href="/author/zijie-han.html">Zijie Han</a>
on 2017年 4月20日 周四
</span>
</div>
</div>
</div>
</div>
</header>
<!-- Main Content -->
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<!-- Post Content -->
<article>
<p>前些时间要做一个v2ex的第三方客户端,偶然发现了一个爬虫神器jsoup.运用jsoup对网站进行爬虫非常容易,仅需几行代码即可完成对整个网站的爬虫。</p>
<p>jsoup是的主要功能如下:</p>
<p>1.从一个url,文件或者字符串中解析HTML</p>
<p>2.使用DOM或者CSS选择器来查找、取出数据;</p>
<p>3.操作HTML元素、属性和文本。</p>
<p> </p>
<p><a href="http://jsoup.org/">jsoup官网</a>进入官网下载jar.jsoup是基于开源协议MIT发布的。
如何对一个网站进行爬虫呢?我们v2ex网站首页为例。首先打开<a href="http://v2ex.com">v2ex官网</a>筛选我们需要的内容,现以v2ex的首页中的问答为例,检查该元素,发现所有的问答都在下的<tbody>中。其次,我们所需要的标题为<code><a href="xxx">标题内容</a></code></p>
<p>这条文本语句当中,检查副标题,其存在于<code>&lt;a class="node" href="xxx"&gt;程序员&lt;/a&gt;</code>以及<code>&lt;strong&gt;xxx&lt;/strong&gt;</code>等标签中。现在我们找到了我们需要的内容及其标签,如何爬取这些文件呢?我们仅仅需要几行代码即可,这里我们在子线程中爬去,因为在主线程中爬取时可能会报错:</p>
<div class="highlight"><pre><span></span> <span class="n">Runnable</span> <span class="n">runnable</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Runnable</span><span class="o">()</span> <span class="o">{</span>
<span class="nd">@Override</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">run</span><span class="o">()</span> <span class="o">{</span>
<span class="n">String</span> <span class="n">url</span> <span class="o">=</span> <span class="s">"https://www.v2ex.com/"</span><span class="o">;</span><span class="c1">//伪装成浏览器进行访问,因为有的网站可能禁止爬虫。</span>
<span class="n">Connection</span> <span class="n">connection</span> <span class="o">=</span> <span class="n">Jsoup</span><span class="o">.</span><span class="na">connect</span><span class="o">(</span><span class="n">url</span><span class="o">);</span>
<span class="n">connection</span><span class="o">.</span><span class="na">header</span><span class="o">(</span><span class="s">"User-Agent"</span><span class="o">,</span> <span class="s">"Mozilla/5.0 (X11; Linux x86_64; rv:32.0) Gecko/ 20100101 Firefox/32.0"</span><span class="o">);</span><span class="c1">//这里不用管,随便写</span>
<span class="n">Document</span> <span class="n">document</span> <span class="o">=</span> <span class="kc">null</span><span class="o">;</span>
<span class="k">try</span> <span class="o">{</span>
<span class="n">document</span> <span class="o">=</span> <span class="n">connection</span><span class="o">.</span><span class="na">get</span><span class="o">();</span><span class="c1">//下载整个页面,以doc保存</span>
<span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="n">IOException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
<span class="n">e</span><span class="o">.</span><span class="na">printStackTrace</span><span class="o">();</span>
<span class="o">}</span>
<span class="n">Elements</span> <span class="n">elements</span> <span class="o">=</span> <span class="n">document</span><span class="o">.</span><span class="na">select</span><span class="o">(</span><span class="s">"tbody tr"</span><span class="o">);</span><span class="c1">//删选tbody tr的标签</span>
<span class="k">for</span> <span class="o">(</span><span class="n">Element</span> <span class="n">element</span> <span class="o">:</span> <span class="n">elements</span><span class="o">)</span> <span class="o">{</span>
<span class="n">String</span> <span class="n">title</span> <span class="o">=</span> <span class="n">element</span><span class="o">.</span><span class="na">getElementsByClass</span><span class="o">(</span><span class="s">"item_title"</span><span class="o">).</span><span class="na">text</span><span class="o">();</span><span class="c1">//选出标题</span>
<span class="n">String</span> <span class="n">urll</span> <span class="o">=</span> <span class="n">element</span><span class="o">.</span><span class="na">select</span><span class="o">(</span><span class="s">"a"</span><span class="o">).</span><span class="na">attr</span><span class="o">(</span><span class="s">"href"</span><span class="o">);</span><span class="c1">//根据类别选取</span>
<span class="n">String</span> <span class="n">vtitle</span> <span class="o">=</span> <span class="n">element</span><span class="o">.</span><span class="na">getElementsByClass</span><span class="o">(</span><span class="s">"node"</span><span class="o">).</span><span class="na">text</span><span class="o">();</span><span class="c1">//根据class选取副标题中的“程序员”字样</span>
<span class="n">String</span> <span class="n">aserandlast</span> <span class="o">=</span> <span class="n">element</span><span class="o">.</span><span class="na">getElementsByTag</span><span class="o">(</span><span class="s">"strong"</span><span class="o">).</span><span class="na">text</span><span class="o">();</span> <span class="c1">//由于我们爬到的strong标签有两个,所我们还需要将字符串进行一次分割</span>
<span class="o"><</span><span class="n">span</span> <span class="n">class</span><span class="o">=</span><span class="s">"pl-smi"</span><span class="o">></span><span class="n">String</span><span class="o"></</span><span class="n">span</span><span class="o">></span> <span class="n">src</span> <span class="o"><</span><span class="n">span</span> <span class="n">class</span><span class="o">=</span><span class="s">"pl-k"</span><span class="o">>=</</span><span class="n">span</span><span class="o">></span> <span class="n">element</span><span class="o"><</span><span class="n">span</span> <span class="n">class</span><span class="o">=</span><span class="s">"pl-k"</span><span class="o">>.</</span><span class="n">span</span><span class="o">></span><span class="n">select</span><span class="o">(<</span><span class="n">span</span> <span class="n">class</span><span class="o">=</span><span class="s">"pl-s"</span><span class="o">><</span><span class="n">span</span> <span class="n">class</span><span class="o">=</span><span class="s">"pl-pds"</span><span class="o">></span><span class="s">"</span>img<span class="</span><span class="n">pl</span><span class="o">-</span><span class="n">pds</span><span class="s">">"</span><span class="o"></</span><span class="n">span</span><span class="o">></</span><span class="n">span</span><span class="o">>)<</span><span class="n">span</span> <span class="n">class</span><span class="o">=</span><span class="s">"pl-k"</span><span class="o">>.</</span><span class="n">span</span><span class="o">></span><span class="n">attr</span><span class="o">(<</span><span class="n">span</span> <span class="n">class</span><span class="o">=</span><span class="s">"pl-s"</span><span class="o">><</span><span class="n">span</span> <span class="n">class</span><span class="o">=</span><span class="s">"pl-pds"</span><span class="o">></span><span class="s">"</span>src<span class="</span><span class="n">pl</span><span class="o">-</span><span class="n">pds</span><span class="s">">"</span><span class="o"></</span><span class="n">span</span><span class="o">></</span><span class="n">span</span><span class="o">>);</span> <span class="c1">//抓取图片链接</span>
<span class="n">String</span><span class="o">[]</span> <span class="n">Array</span> <span class="o">=</span> <span class="n">convertStr</span><span class="o">(</span><span class="n">aserandlast</span><span class="o">);</span>
<span class="k">if</span><span class="o">(</span><span class="n">urll</span> <span class="o">!=</span> <span class="s">""</span> <span class="o">&&</span> <span class="n">title</span> <span class="o">!=</span><span class="s">""</span><span class="o">){</span>
<span class="n">Log</span><span class="o">.</span><span class="na">d</span><span class="o">(</span><span class="n">TAG</span><span class="o">,</span> <span class="s">"大标题为: "</span><span class="o">+</span><span class="n">title</span><span class="o">+</span><span class="s">"链接为:"</span><span class="o">+</span><span class="n">urll</span><span class="o">+</span><span class="s">"小标题为: "</span><span class="o">+</span><span class="n">vtitle</span><span class="o">+</span><span class="s">" /1:"</span><span class="o">+</span><span class="n">Array</span><span class="o">[</span><span class="mi">0</span><span class="o">]+</span><span class="s">" /2:"</span><span class="o">+</span><span class="n">Array</span><span class="o">[</span><span class="mi">1</span><span class="o">]);</span>
<span class="n">Title</span> <span class="n">title1</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Title</span><span class="o">(</span><span class="n">title</span><span class="o">,</span> <span class="n">urll</span><span class="o">);</span>
<span class="n">titleList</span><span class="o">.</span><span class="na">add</span><span class="o">(</span><span class="n">title1</span><span class="o">);}<</span><span class="n">a</span> <span class="n">href</span><span class="o">=</span><span class="s">"http://115.159.151.84/wp-content/uploads/2017/04/FD1WUCWPF2PG5B.png"</span><span class="o">><</span><span class="n">img</span> <span class="n">src</span><span class="o">=</span><span class="s">"http://115.159.151.84/wp-content/uploads/2017/04/FD1WUCWPF2PG5B-1024x392.png"</span> <span class="n">alt</span><span class="o">=</span><span class="s">""</span> <span class="n">width</span><span class="o">=</span><span class="s">"730"</span> <span class="n">height</span><span class="o">=</span><span class="s">"279"</span> <span class="n">class</span><span class="o">=</span><span class="s">"alignnone size-large wp-image-111"</span> <span class="o">/></</span><span class="n">a</span><span class="o">></span>
<span class="o">}</span>
<span class="n">handler</span><span class="o">.</span><span class="na">sendEmptyMessage</span><span class="o">(</span><span class="mi">0</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">};</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="n">String</span><span class="o">[](</span><span class="n">String</span> <span class="n">str</span><span class="o">){</span>
<span class="n">String</span><span class="o">[]</span> <span class="n">strArray</span> <span class="o">=</span> <span class="kc">null</span><span class="o">;</span>
<span class="n">strArray</span> <span class="o">=</span> <span class="n">str</span><span class="o">.</span><span class="na">split</span><span class="o">(</span><span class="s">" "</span><span class="o">);</span>
<span class="k">return</span> <span class="n">strArray</span><span class="o">;</span>
<span class="o">}</span>
</pre></div>
<p>运行结果附图。</p>
<p><a href="http://115.159.151.84/wp-content/uploads/2017/04/FD1WUCWPF2PG5B.png"><img class="alignnone size-large wp-image-111" src="http://115.159.151.84/wp-content/uploads/2017/04/FD1WUCWPF2PG5B-1024x392.png" alt="" width="730" height="279" /></a></p>
<p>jsoup在使用过程中还添加了多个过滤器来防止攻击:
<ol>
<li>none():只允许包含文本信息</li>
<li>basic(): 只允许基本标签:a,b,blockqoute,br,captiion,code,cite等等</li>
<li>basicWihtImages():允许基本标签、图片信息</li>
<li>relaxed(): 允许更多标签如:h1,h2,h3.h4.tbody,td,thead,tr,u,ul等等</li>
</ol>
除代码所示的方法以外,jsoup还支持很多很强大的方法:
可利用jsoup下载form表单并进行模拟登陆,详情可参见<a href="jsoup.org">jsoup官方网站</a></p>
</article>
<div class="tags">
<p>tags: <a href="/tag/jsoup.html">Jsoup</a></p>
</div>
<hr>
</div>
</div>
</div>
<hr>
<!-- Footer -->
<footer>
<div class="container">
<div class="row">
<div class="col-lg-8 col-lg-offset-2 col-md-10 col-md-offset-1">
<ul class="list-inline text-center">
<li>
<a href="https://github.com/HUSTMeituanClub">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-github fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
<li>
<a href="mailto:@hustmeituan.club">
<span class="fa-stack fa-lg">
<i class="fa fa-circle fa-stack-2x"></i>
<i class="fa fa-envelope fa-stack-1x fa-inverse"></i>
</span>
</a>
</li>
</ul>
<p class="copyright text-muted">
Blog powered by <a href="http://getpelican.com">Pelican</a>,
which takes great advantage of <a href="http://python.org">Python</a>.
</p> </div>
</div>
</div>
</footer>
<!-- jQuery -->
<script src="/theme/js/jquery.min.js"></script>
<!-- Bootstrap Core JavaScript -->
<script src="/theme/js/bootstrap.min.js"></script>
<!-- Custom Theme JavaScript -->
<script src="/theme/js/clean-blog.min.js"></script>
</body>
</html>