[Crawler] Crawling and storing data with WebMagic and Spring MVC

 

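I have been working for more than four years and have never really blogged. Last year I moved back to my hometown and joined a state-owned enterprise, where the work is a bit lighter and nowhere near as hectic as it was in Shenzhen. Lately I have had energy to spare (an easy job really is good for you), so I want to keep a record of the technologies I study or use. The main reason is to consolidate what I learn: my memory is not what it used to be, and I have come to appreciate that the palest ink beats the best memory. Enough preamble; let's start with the crawler I have been using recently. I should add that I have not studied crawlers in depth; I am purely at the stage of knowing how to use one.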

Preface

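Work recently brought me into contact with web crawling. Scraping a complete data set takes the following two steps.

Step 1: choose a crawler framework. Our boss said to just scrape with jsoup, since these sites are all easy to scrape. So I did: I downloaded the jar and tried it out. The API is simple and convenient, and I liked it. These sites are indeed easy to scrape and jsoup would have been enough, but I kept wondering whether there was a better, more convenient framework, and the answer is yes. A search online turned up WebMagic, which meets my needs, and best of all its documentation is in Chinese. So WebMagic it is.

Step 2: analyze the page elements. How you analyze the page depends on which data you need; the details follow below.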

Project setup

Because I use Spring MVC + MyBatis, the Maven pom.xml looks like this:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>Spider</groupId>
  <artifactId>Spider</artifactId>
  <packaging>war</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>Spider Maven Webapp</name>
  <url>http://maven.apache.org</url>
  <properties>
    <spring.version>4.1.1.RELEASE</spring.version>
    <jstl>1.2</jstl>
    <mybatis.version>3.3.0</mybatis.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <!-- Logging -->
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.17</version>
    </dependency>
    <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.9.2</version>
    </dependency>
    <dependency>
      <groupId>net.sf.json-lib</groupId>
      <artifactId>json-lib</artifactId>
      <version>2.4</version>
      <classifier>jdk15</classifier>
    </dependency>
    <dependency>
      <groupId>commons-collections</groupId>
      <artifactId>commons-collections</artifactId>
      <version>3.2.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.3.3</version>
    </dependency>
    <!-- Spring -->
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-core</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-web</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-context-support</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-webmvc</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-test</artifactId>
      <version>${spring.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-jdbc</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <!-- Servlet API, JSP and JSTL -->
    <dependency>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
      <version>2.5</version>
    </dependency>
    <dependency>
      <groupId>javax.servlet</groupId>
      <artifactId>jstl</artifactId>
      <version>1.2</version>
    </dependency>
    <dependency>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>jsp-api</artifactId>
      <version>2.1</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>jstl</groupId>
      <artifactId>jstl</artifactId>
      <version>${jstl}</version>
    </dependency>
    <!-- MyBatis -->
    <dependency>
      <groupId>org.mybatis</groupId>
      <artifactId>mybatis</artifactId>
      <version>${mybatis.version}</version>
    </dependency>
    <dependency>
      <groupId>org.mybatis</groupId>
      <artifactId>mybatis-spring</artifactId>
      <version>1.2.3</version>
    </dependency>
    <dependency>
      <groupId>org.mybatis.generator</groupId>
      <artifactId>mybatis-generator-core</artifactId>
      <version>1.3.2</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.32</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-collections4</artifactId>
      <version>4.0</version>
    </dependency>
    <dependency>
      <groupId>commons-dbcp</groupId>
      <artifactId>commons-dbcp</artifactId>
      <version>1.4</version>
    </dependency>
    <dependency>
      <groupId>commons-pool</groupId>
      <artifactId>commons-pool</artifactId>
      <version>1.6</version>
    </dependency>
    <!-- Druid connection pool -->
    <dependency>
      <groupId>com.alibaba</groupId>
      <artifactId>druid</artifactId>
      <version>1.0.15</version>
    </dependency>
    <!-- Jackson -->
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-core</artifactId>
      <version>2.3.0</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.3.0</version>
    </dependency>
    <!-- Gson -->
    <dependency>
      <groupId>com.google.code.gson</groupId>
      <artifactId>gson</artifactId>
      <version>2.7</version>
    </dependency>
  </dependencies>
  <build>
    <finalName>GogBuySpider</finalName>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <includes>
            <include>**/*Tests.java</include>
          </includes>
        </configuration>
      </plugin>
      <!-- MyBatis Generator (reverse-engineering) plugin -->
      <plugin>
        <groupId>org.mybatis.generator</groupId>
        <artifactId>mybatis-generator-maven-plugin</artifactId>
        <version>1.3.2</version>
        <configuration>
          <verbose>true</verbose>
          <overwrite>true</overwrite>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
```

Configuring spring-servlet.xml

```xml
<!-- Fragment: goes inside the <beans> root, with the context and mvc namespaces declared -->
<context:component-scan base-package="com.xxx.spider"/>
<mvc:annotation-driven/>
<mvc:default-servlet-handler/>
<bean id="viewResolver" class="org.springframework.web.servlet.view.InternalResourceViewResolver">
    <property name="prefix" value="/WEB-INF/view/"/>
    <property name="suffix" value=".jsp"/>
</bean>
```

Integrating Spring with MyBatis (spring-mybatis.xml)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">

    <bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
        <property name="location" value="classpath:jdbc.properties"/>
    </bean>

    <bean id="dataSource" class="com.alibaba.druid.pool.DruidDataSource" destroy-method="close">
        <property name="driverClassName" value="${jdbc_driverClassName}"/>
        <property name="url" value="${jdbc_url}"/>
        <property name="username" value="${jdbc_username}"/>
        <property name="password" value="${jdbc_password}"/>
        <!-- Filters for monitoring and statistics -->
        <property name="filters" value="stat"/>

        <property name="maxActive" value="20"/>
        <property name="initialSize" value="1"/>
        <property name="minIdle" value="1"/>
        <!-- Maximum time to wait for a connection, in milliseconds -->
        <property name="maxWait" value="60000"/>
        <!-- Interval between checks for idle connections that should be closed, in milliseconds -->
        <property name="timeBetweenEvictionRunsMillis" value="60000"/>
        <!-- Minimum time a connection stays in the pool before it may be evicted, in milliseconds -->
        <property name="minEvictableIdleTimeMillis" value="300000"/>

        <property name="validationQuery" value="SELECT 'x'"/>
        <property name="testWhileIdle" value="true"/>
        <property name="testOnBorrow" value="false"/>
        <property name="testOnReturn" value="false"/>

        <!-- With Oracle, set poolPreparedStatements to true; with MySQL it can be false. -->
        <!-- Enable the PSCache and set its size per connection -->
        <property name="poolPreparedStatements" value="true"/>
        <property name="maxPoolPreparedStatementPerConnectionSize" value="50"/>
    </bean>

    <bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean">
        <property name="dataSource" ref="dataSource"/>
        <!-- Automatically pick up the mapper XML files -->
        <property name="mapperLocations" value="classpath:mapper/*"/>
    </bean>

    <!-- Scan the DAO interfaces -->
    <bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
        <property name="basePackage" value="com.xxx.spider.dao"/>
        <property name="sqlSessionFactoryBeanName" value="sqlSessionFactory"/>
    </bean>

    <!-- Transaction manager; use JtaTransactionManager for global tx -->
    <bean id="transactionManager" class="org.springframework.jdbc.datasource.DataSourceTransactionManager">
        <property name="dataSource" ref="dataSource"/>
    </bean>
</beans>
```
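The data source above reads four placeholders from jdbc.properties, a file the post does not show. A minimal sketch of what it might contain, assuming a local MySQL instance: the key names come from the configuration above and the driver class is the one shipped with mysql-connector-java 5.1.x, while the host, database name, user, and password are placeholders to replace with your own.

```properties
# jdbc.properties, placed on the classpath (e.g. src/main/resources)
jdbc_driverClassName=com.mysql.jdbc.Driver
# Hypothetical host and database name
jdbc_url=jdbc:mysql://localhost:3306/spider?useUnicode=true&characterEncoding=UTF-8
# Hypothetical credentials
jdbc_username=spider
jdbc_password=change-me
```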

Configuring the Spring context (it also seems to work without this)

```xml
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:websocket="http://www.springframework.org/schema/websocket"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd
       http://www.springframework.org/schema/websocket
       http://www.springframework.org/schema/websocket/spring-websocket-4.1.xsd">
    <import resource="spring-servlet.xml"/>
    <import resource="spring-mybatis.xml"/>
</beans>
```

web.xml

```xml
<web-app version="3.0"
         xmlns="http://java.sun.com/xml/ns/javaee"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd">
  <display-name>Archetype Created Web Application</display-name>
  <context-param>
    <param-name>contextConfigLocation</param-name>
    <param-value>classpath:spring-context.xml</param-value>
  </context-param>
  <listener>
    <listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
  </listener>
  <servlet>
    <servlet-name>springMvc</servlet-name>
    <servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
    <!-- init-param must come before load-on-startup in the web-app schema -->
    <init-param>
      <param-name>contextConfigLocation</param-name>
      <param-value>classpath:spring-servlet.xml</param-value>
    </init-param>
    <load-on-startup>1</load-on-startup>
  </servlet>
  <servlet-mapping>
    <servlet-name>springMvc</servlet-name>
    <url-pattern>*.do</url-pattern>
  </servlet-mapping>
  <servlet-mapping>
    <servlet-name>springMvc</servlet-name>
    <url-pattern>/</url-pattern>
  </servlet-mapping>
  <welcome-file-list>
    <welcome-file>index.jsp</welcome-file>
  </welcome-file-list>
</web-app>
```

At this point the project skeleton is in place.
Because we use WebMagic, add the following to the Maven dependencies:

```xml
    <!-- WebMagic -->
    <dependency>
      <groupId>us.codecraft</groupId>
      <artifactId>webmagic-core</artifactId>
      <version>0.5.3</version>
    </dependency>
    <dependency>
      <groupId>us.codecraft</groupId>
      <artifactId>webmagic-extension</artifactId>
      <version>0.5.3</version>
    </dependency>
```

Everything is ready; what remains is to analyze the page and parse it with WebMagic.

Page analysis

[Figure: screenshot of the page being analyzed]

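We want the category names and their URLs. Inspecting the page shows that they sit inside <a> tags, so we extract them from the page with WebMagic's CSS selectors and XPath.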

 

```java
List<String> titles = page.getHtml().xpath("//div[@class='class1']/p/a/text()").all();
List<String> urls = page.getHtml().css("div.nav_style1_contentBg").links().regex(".*?c1=.*").all();
```
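To see what the XPath expression selects without running a crawl, the sketch below evaluates the same expression with the JDK's built-in javax.xml.xpath engine. This is not WebMagic's selector, so the input must be well-formed XML. The snippet, the class name XPathTitleDemo, and the category names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathTitleDemo {
    // Hypothetical, well-formed stand-in for the category block of the target page.
    static final String SNIPPET =
        "<html><body>"
        + "<div class='class1'>"
        + "<p><a href='/list?c1=1'>Books</a></p>"
        + "<p><a href='/list?c1=2'>Toys</a></p>"
        + "</div>"
        + "<div class='other'><p><a href='/about'>About</a></p></div>"
        + "</body></html>";

    // Evaluates the XPath from the article and returns the text of every matched node.
    static List<String> titles(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//div[@class='class1']/p/a/text()", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Only links under div.class1 match; the "About" link is skipped.
        System.out.println(titles(SNIPPET)); // prints [Books, Toys]
    }
}
```

The predicate `[@class='class1']` compares the whole attribute value, so a div whose class attribute lists several classes would not match this expression.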

This yields all the top-level categories and their URLs. For usage details, see http://webmagic.io/docs/
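The `.regex(".*?c1=.*")` step keeps only links that carry a `c1=` query parameter. The sketch below approximates that filter with plain java.util.regex, which is my stand-in and not WebMagic's implementation. The class name CategoryLinkFilter and the sample URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CategoryLinkFilter {
    // The same expression the article passes to regex(...).
    private static final Pattern CATEGORY_LINK = Pattern.compile(".*?c1=.*");

    // Keeps only the links in which the pattern finds a match.
    static List<String> keepCategoryLinks(List<String> links) {
        List<String> kept = new ArrayList<String>();
        for (String link : links) {
            Matcher m = CATEGORY_LINK.matcher(link);
            if (m.find()) {
                kept.add(m.group());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical links as they might be collected from the navigation block.
        List<String> links = Arrays.asList(
            "http://www.xxx.com/list?c1=12",
            "http://www.xxx.com/about",
            "http://www.xxx.com/list?c1=7&page=2");
        // The "about" link has no c1= parameter and is dropped.
        System.out.println(keepCategoryLinks(links));
    }
}
```

Because the pattern is not anchored to the query string, any URL that merely contains the text `c1=` passes; a stricter pattern such as `[?&]c1=` may be worth considering if that matters for your site.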

Saving the data

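To borrow a passage from the WebMagic documentation:

Now that the crawler is written, one question may remain: what do I do if I want to save the scraped results? The WebMagic component that saves results is called a Pipeline. For example, printing the results to the console is itself done by a built-in Pipeline called ConsolePipeline. So what if I want to save the results as JSON? I only need to swap the Pipeline implementation for JsonFilePipeline.

```java
public static void main(String[] args) {
    Spider.create(new GithubRepoPageProcessor())
            // start crawling from https://github.com/code4craft
            .addUrl("https://github.com/code4craft")
            // backslashes must be escaped inside a Java string literal
            .addPipeline(new JsonFilePipeline("D:\\webmagic\\"))
            // crawl with 5 threads
            .thread(5)
            // start the crawler
            .run();
}
```

With this, the downloaded files are saved in the webmagic directory on drive D.

By customizing the Pipeline we can also save results to files, databases, and so on. This is covered in chapter 7, "Processing extraction results".

At this point we have written a basic crawler and given it some custom behavior.

In our project, we save to the database through a custom Pipeline.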

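```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Repository;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

// CategoryMapper, ShopMapper and ItemMapper are the project's own MyBatis mappers.
@Repository
public class DataBasePipeline implements Pipeline {

    @Autowired
    private CategoryMapper categoryMapper;

    @Autowired
    private ShopMapper shopMapper;

    @Autowired
    private ItemMapper itemMapper;

    @Override
    public void process(ResultItems resultItems, Task task) {
        // TODO save the categories to the database
        // TODO save the items to the database
    }
}
```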
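The two TODOs are left open in the post. One step they would likely need is turning the two extracted lists (titles and URLs) into rows before handing them to a mapper. The sketch below shows only that pairing step, in plain Java. The Category class and the CategoryRows name are hypothetical, since the real entity behind CategoryMapper is not shown, and it assumes the two lists come back in the same order, which you should verify on the real page because they are produced by different selectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CategoryRows {
    // Hypothetical row type; the article's real entity is not shown.
    static final class Category {
        final String name;
        final String url;

        Category(String name, String url) {
            this.name = name;
            this.url = url;
        }
    }

    // Pairs the i-th title with the i-th URL. Lists of different length are
    // rejected, because a mismatch would attach names to the wrong links.
    static List<Category> pair(List<String> titles, List<String> urls) {
        if (titles.size() != urls.size()) {
            throw new IllegalArgumentException(
                "titles=" + titles.size() + " but urls=" + urls.size());
        }
        List<Category> rows = new ArrayList<Category>();
        for (int i = 0; i < titles.size(); i++) {
            rows.add(new Category(titles.get(i).trim(), urls.get(i)));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical extraction results.
        List<Category> rows = pair(
            Arrays.asList(" Books ", "Toys"),
            Arrays.asList("http://www.xxx.com/list?c1=1", "http://www.xxx.com/list?c1=2"));
        for (Category c : rows) {
            System.out.println(c.name + " -> " + c.url);
        }
    }
}
```

Inside `process`, each resulting row would then be passed to whatever insert method the project's CategoryMapper defines.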

Crawling

Define a main method, run it, and wait for the crawl to finish.
 

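```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;
import org.springframework.stereotype.Controller;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;

@Controller
public class Door {

    @Autowired
    private DataBasePipeline dataBasePipeline;

    public static void main(String[] args) {
        // Start Spring outside the servlet container so the pipeline gets its mappers injected.
        ApplicationContext applicationContext =
                new ClassPathXmlApplicationContext("classpath:spring-context.xml");
        Door door = applicationContext.getBean(Door.class);
        door.goSpider();
    }

    public void goSpider() {
        // QmiaolingPageProcessor is the project's own PageProcessor.
        Spider.create(new QmiaolingPageProcessor())
                .addUrl("http://www.xxx.com/")
                .addPipeline(new ConsolePipeline())
                .addPipeline(dataBasePipeline)
                .thread(5)
                .run();
    }
}
```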

Closing remarks

I will put the project on GitHub later.
