Integrating the WebMagic Crawler with Spring Boot

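I am not very fond of "black box" frameworks like Spring Boot, so I generally avoid them in production projects. An experimental crawler project happened to be in its early stage, so I used it to try integrating WebMagic with Spring Boot and see whether that would change my preconceptions.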

1. Creating a Spring Boot Project in Eclipse

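Following the blog post "Eclipse中spring boot的安装和创建简单的Web应用" (installing Spring Boot in Eclipse and creating a simple web application), I installed the Spring Boot plugin through the Eclipse Marketplace.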

Create a Spring Boot project with the MyBatis, MySQL, Redis and Web dependencies checked.

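Select the dependency for the database you use at this point (MySQL in my case). Otherwise a later run reports that the MySQL driver cannot be found, and the dependency has to be added by hand.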

Once the project is created, add a simple Controller to test it:

import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

/**
* @author Zln
* @version 2018-12-24 11:12
* 
* Example controller
*/

@RestController
@RequestMapping("/sample")
public class SampleController {

	@RequestMapping("/hello")
	public String hello(String name) {
		
		return "Hello " + name;
	}
}

Running SpiderApplication with Run As -> Spring Boot App produces this error:

***************************
APPLICATION FAILED TO START
***************************

Description:

Failed to configure a DataSource: 'url' attribute is not specified and no embedded datasource could be configured.

Reason: Failed to determine a suitable driver class

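This turned out to be because no database connection had been configured yet. Because the MyBatis dependency was added, Spring Boot tries to set up a database connection at startup. You can either disable the database configuration for now, or add the connection settings to src/main/resources/application.properties.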

A project database was already available locally, so I chose to add the database configuration:

spring.datasource.url=jdbc:mysql://192.168.8.10:3307/hyms?useUnicode=true&zeroDateTimeBehavior=convertToNull&autoReconnect=true
spring.datasource.username=hyms
spring.datasource.password=hyms
spring.datasource.driver-class-name=com.mysql.jdbc.Driver

With the database connection configured, run SpiderApplication again. It now starts normally. Open this address in a browser:

http://localhost:8080/sample/hello?name=test

The page shows the expected output.

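Spring Boot embeds Tomcat, with a default port of 8080 and a context path of "/". This conflicts with my regular projects, so change it first by adding the following settings to application.properties: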

server.port=8081
server.servlet.context-path=/spider

Open the new address in the browser:

http://localhost:8081/spider/sample/hello?name=test1

The page output is as expected.

2. Adding WebMagic

WebMagic official site: http://webmagic.io

Add the Maven dependencies:

<!-- WebMagic -->
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>

Following the sample code on the official site, I implemented a PageProcessor as a test. Running it logged an exception:

14:37:43.356 [pool-1-thread-1] DEBUG org.apache.http.impl.conn.PoolingHttpClientConnectionManager - Connection released: [id: 0][route: {s}->https://github.com:443][total kept alive: 0; route allocated: 0 of 100; total allocated: 0 of 1]
14:37:43.358 [pool-1-thread-1] WARN us.codecraft.webmagic.downloader.HttpClientDownloader - download page https://github.com/code4craft error
javax.net.ssl.SSLException: Received fatal alert: protocol_version
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:208)
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:154)
	at sun.security.ssl.SSLSocketImpl.recvAlert(SSLSocketImpl.java:2023)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1125)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:396)
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:394)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at us.codecraft.webmagic.downloader.HttpClientDownloader.download(HttpClientDownloader.java:85)
	at us.codecraft.webmagic.Spider.processRequest(Spider.java:404)
	at us.codecraft.webmagic.Spider.access$000(Spider.java:61)
	at us.codecraft.webmagic.Spider$1.run(Spider.java:320)
	at us.codecraft.webmagic.thread.CountableThreadPool$1.run(CountableThreadPool.java:74)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
14:37:44.363 [main] INFO us.codecraft.webmagic.Spider - Spider github.com closed! 1 pages downloaded.

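After some searching, it turns out that version 0.7.3 has a problem crawling HTTPS sites that only support TLS 1.2 (I did not try other versions, so I do not know whether they are affected). The fix is to modify HttpClientGenerator, create an HttpClientDownloader that uses the modified HttpClientGenerator, and pass that HttpClientDownloader to the Spider when creating it.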

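The author has already replied to this problem on GitHub, see the issue "Https下无法抓取只支持TLS1.2的站点" (sites that only support TLS 1.2 cannot be crawled over HTTPS), and the source code currently on GitHub already contains the fix.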

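Download HttpClientDownloader.java and HttpClientGenerator.java from https://github.com/code4craft/webmagic/tree/master/webmagic-core/src/main/java/us/codecraft/webmagic/downloader, put them into the project, and specify the new HttpClientDownloader when creating the Spider. Run it again and the content from the sample is fetched normally.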
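Before patching the downloader, it can help to confirm that the JVM itself is able to speak TLS 1.2, since the protocol_version alert only says that client and server could not agree on a version. The following is a minimal diagnostic sketch using only the JDK; the class name TlsProtocolCheck is hypothetical and it does not touch WebMagic or HttpClient at all.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

// Hypothetical helper: reports which TLS versions this JVM can use.
final class TlsProtocolCheck {

    /** Protocols the JVM's default SSLContext is able to negotiate. */
    static List<String> supportedProtocols() {
        return Arrays.asList(params(true).getProtocols());
    }

    /** Protocols that are enabled by default for new connections. */
    static List<String> defaultProtocols() {
        return Arrays.asList(params(false).getProtocols());
    }

    private static SSLParameters params(boolean supported) {
        try {
            SSLContext ctx = SSLContext.getDefault();
            return supported ? ctx.getSupportedSSLParameters()
                             : ctx.getDefaultSSLParameters();
        } catch (NoSuchAlgorithmException e) {
            // Wrapped so callers do not have to handle a checked exception.
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("supported: " + supportedProtocols());
        System.out.println("default:   " + defaultProtocols());
    }
}
```

If TLSv1.2 appears in the supported list, the JVM is not the limiting factor, which is consistent with the fix living in the HTTP client configuration as described above.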

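When copying the code from the official site, I overlooked that the new GithubRepoPageProcessor() after Spider.create is the sample's own processor. My later changes kept having no effect until I noticed this, so watch out for it: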

Spider.create(new SampleProcessor()).setDownloader(new HttpClientDownloader())
         // Start crawling from "https://github.com/code4craft"
         .addUrl("https://github.com/code4craft")
         // Crawl with 1 thread
         .thread(1)
         // Start the crawler
         .run();

Alternatively, rebuild the project from GitHub and deploy the resulting package to your local repository.

3. Crawling Ele.me Shop and Menu Data with WebMagic

By inspecting https://h5.ele.me/, I found the shop-list API URL, which serves as the crawler's entry point: https://h5.ele.me/restapi/shopping/v3/restaurants?latitude=31.032697&longitude=121.216669&offset=0&limit=8&extras[]=activities&extras[]=tags&extra_filters=home&rank_id=&terminal=h5

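Ele.me uses the AMap (Gaode Maps) coordinate system. The latitude and longitude parameters for an address can be obtained through the API that AMap provides, or picked by hand on the page https://lbs.amap.com/api/javascript-api/example/map/click-to-get-lnglat.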

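For shop details and menu data, open https://www.ele.me and click through to a shop's detail page to get a URL such as https://www.ele.me/shop/E1209616406188511795. The part after shop is the shop ID, which can be replaced with IDs taken from the shop list fetched in the first step. Shop IDs may expire, and old IDs can stop working, so it is best to fetch the shop list first and build the URLs from that fresh data. I will not go through the JSON of these two responses in detail: request the URLs in a browser, format the returned JSON, and the structure is fairly clear.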
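As a small illustration of building these URLs from fresh data, here is a sketch that assembles the shop-list URL and the menu URL used later by the processor in this article. The class name ElemeUrls and the idea of paging by advancing offset in steps of limit are assumptions for illustration; the article itself only ever requests offset=0.

```java
// Hypothetical helper for assembling the two Ele.me API URLs used in this article.
final class ElemeUrls {

    private static final String SHOP_LIST_API =
            "https://h5.ele.me/restapi/shopping/v3/restaurants";
    private static final String MENU_API =
            "https://www.ele.me/restapi/shopping/v2/menu";

    /** Shop-list URL for a coordinate pair; offset and limit select the slice. */
    static String shopListUrl(String latitude, String longitude, int offset, int limit) {
        return SHOP_LIST_API
                + "?latitude=" + latitude
                + "&longitude=" + longitude
                + "&offset=" + offset
                + "&limit=" + limit
                + "&extras[]=activities&extras[]=tags&terminal=h5";
    }

    /** Menu URL for a shop ID taken from a freshly fetched shop list. */
    static String menuUrl(String shopId) {
        return MENU_API + "?restaurant_id=" + shopId + "&terminal=web";
    }

    public static void main(String[] args) {
        // Assumption: offset is a row offset, so page n starts at n * limit.
        int limit = 8;
        for (int page = 0; page < 3; page++) {
            System.out.println(shopListUrl("31.032697", "121.216669", page * limit, limit));
        }
        System.out.println(menuUrl("E1209616406188511795"));
    }
}
```

Keeping the URL construction in one place makes it easier to adjust when the endpoints or parameters change.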

JsonPath could not parse the JsonArray when parsing the menu item data (I do not know exactly why), so I used FastJson, which the framework already pulls in.

Create the WebMagic Processor. Its constructor takes the longitude and latitude, plus a flag for whether to crawl the menu data:


import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import com.alibaba.fastjson.JSONPath;
import com.zln.spider.pojo.ElemeFood;
import com.zln.spider.pojo.ElemeShop;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * @author Zln
 * @version 2018-12-24 14:47
 * 
 * Ele.me shop crawler
 */

public class ElemeShopProcessor implements PageProcessor {

	Logger logger = LoggerFactory.getLogger(getClass());

	private Site site = Site.me()
            .setDomain("ele.me")
            .setSleepTime(100)
            .setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36");

	private String longitude;
	private String latitude;
	private Boolean crawlMenu = false; // whether to crawl menu data
	
	/*
	 * Type of data being processed
	 */
	public static enum ProcessType {  
	    SHOP, FOOD  
	}
	
	/**
	 * Creates an Ele.me shop crawler.
	 * @param longitude	 longitude
	 * @param latitude	 latitude
	 * @param crawlMenu whether to crawl the menu
	 */
	public ElemeShopProcessor(String longitude, String latitude,
			Boolean crawlMenu) {
		this.longitude = longitude;
		this.latitude = latitude;
		this.crawlMenu = crawlMenu;
	}

	@Override
	public void process(Page page) {
		// Shop-list data from the H5 API
		if (page.getUrl()
				.regex("https://h5\\.ele\\.me/restapi/shopping/v3/restaurants+")
				.match()) {
			logger.info("Parsing H5 shop-list data");
			logger.debug(page.getRawText());

			JSONObject joData = JSON.parseObject(page.getRawText());
			JSONArray jaRestaurant = (JSONArray) JSONPath.eval(joData,"$.items.restaurant");

			// Parse the restaurant data
			if (jaRestaurant != null) {
				List<ElemeShop> listShops = new ArrayList<>();
				for (Object oRest : jaRestaurant) {
					// Convert each restaurant entry into an ElemeShop object
					JSONObject joRest = (JSONObject) oRest;
					ElemeShop objShop = JSONObject.toJavaObject(joRest,ElemeShop.class);
					logger.debug(objShop.toString());
					listShops.add(objShop);
					if (crawlMenu) {
						// Menu data is wanted, so queue the menu URL as a follow-up request
						String targetUrl = "https://www.ele.me/restapi/shopping/v2/menu?restaurant_id=" + objShop.getId() + "&terminal=web";
						page.addTargetRequest(targetUrl);
					}
				}
				page.putField("type", ProcessType.SHOP);
				page.putField("listShops", listShops);
			}
		} else {
			logger.info("Parsing menu items");
			logger.debug(page.getRawText());
			Set<String> setItemIds = new HashSet<>(); // item IDs already seen, used to de-duplicate menu items
			
			JSONArray jaMenuGroup = JSON.parseArray(page.getRawText());
			if (jaMenuGroup != null) {
				List<ElemeFood> listFoods = new ArrayList<>();
				for (Object oMenuGroup : jaMenuGroup) {
					JSONArray jaFoods = (JSONArray) JSONPath.eval(oMenuGroup,"$.foods");
					if (jaFoods == null) {
						// Menu group without a foods array, nothing to parse
						continue;
					}
					for (Object oFood : jaFoods) {
						JSONObject joFood = (JSONObject)oFood;
						// getString avoids a ClassCastException if item_id is not a JSON string
						String strItemId = joFood.getString("item_id");
						// Set.add returns false when the ID was already seen: skip the duplicate
						if (!setItemIds.add(strItemId)) {
							continue;
						}
						
						ElemeFood objFood = JSONObject.toJavaObject(joFood,ElemeFood.class);
						logger.debug(objFood.toString());
						listFoods.add(objFood);
						
					}
				}
				page.putField("type", ProcessType.FOOD);
				page.putField("listFoods", listFoods);
			}
		}
	}

	@Override
	public Site getSite() {
		return site;
	}

	/**
	 * Builds the start URL.
	 * @return the shop-list API URL for the configured coordinates
	 */
	public String getUrl() {
		// Shop search page: https://www.ele.me/place/wtw0w37dxs0r?latitude=31.032709&longitude=121.217287
    	// Shop-list JSON: https://h5.ele.me/restapi/shopping/v3/restaurants?latitude=31.107641&longitude=121.252976&offset=0&limit=1&extras[]=activities&extras[]=tags&terminal=h5
    	// Shop page: https://www.ele.me/shop/E7326872827353281855
    	/*String url = "https://www.ele.me/place/";
    	String geohash = "wtw0w37dxs0r";*/
		
		String url = "https://h5.ele.me/restapi/shopping/v3/restaurants?latitude=" + latitude + "&longitude=" + longitude + "&offset=0&limit=3&extras[]=activities&extras[]=tags&terminal=h5";
		return url;
	}

}
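The de-duplication step in the menu branch above (the same item can appear in more than one menu group) can be isolated into a small generic helper. This is a sketch using only the JDK; the name DedupByKey is hypothetical, and plain strings stand in for the parsed food objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: keeps the first element seen for each key, in input order.
final class DedupByKey {

    static <T, K> List<T> distinctBy(List<T> items, Function<? super T, ? extends K> keyOf) {
        Set<K> seen = new HashSet<>();
        List<T> result = new ArrayList<>();
        for (T item : items) {
            // Set.add returns false if the key is already present.
            if (seen.add(keyOf.apply(item))) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in data in the form "itemId:name"; the key is the part before ':'.
        List<String> foods = Arrays.asList("101:noodles", "102:rice", "101:noodles (hot sale)");
        System.out.println(distinctBy(foods, s -> s.substring(0, s.indexOf(':'))));
    }
}
```

In the processor the key would be the item_id of each food entry, and the set lives for one page, so duplicates are only removed within a single shop's menu.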

The three POJO classes used for fastjson parsing:


import com.alibaba.fastjson.annotation.JSONField;

/**
* @author Zln
* @version 2018-12-28 11:14
* 
* Ele.me shop object
*/

public class ElemeShop {

	private String id;
	
	private String name;
	
	private String address;
	
	private String latitude;
	
	private String longitude;
	
	private String phone;
	
	private Double rating;
	
	@JSONField(name = "rating_count")
	private Integer ratingCount;

	public String getId() {
		return id;
	}

	public void setId(String id) {
		this.id = id;
	}

	public String getName() {
		return name;
	}

	public void setName(String name) {
		this.name = name;
	}

	public String getAddress() {
		return address;
	}

	public void setAddress(String address) {
		this.address = address;
	}

	public String getLatitude() {
		return latitude;
	}

	public void setLatitude(String latitude) {
		this.latitude = latitude;
	}

	public String getLongitude() {
		return longitude;
	}

	public void setLongitude(String longitude) {
		this.longitude = longitude;
	}

	public String getPhone() {
		return phone;
	}

	public void setPhone(String phone) {
		this.phone = phone;
	}

	public Double getRating() {
		return rating;
	}

	public void setRating(Double rating) {
		this.rating = rating;
	}

	public Integer getRatingCount() {
		return ratingCount;
	}

	public void setRatingCount(Integer ratingCount) {
		this.ratingCount = ratingCount;
	}

	@Override
	public String toString() {
		return "ElemeShop [id=" + id + ", name=" + name + ", address=" + address
				+ ", latitude=" + latitude + ", longitude=" + longitude
				+ ", phone=" + phone + ", rating=" + rating + ", ratingCount="
				+ ratingCount + "]";
	}

}

import java.util.List;

import com.alibaba.fastjson.annotation.JSONField;

/**
 * @author Zln
 * @version 2018-12-28 14:18
 * 
 * Ele.me menu item
 */

public class ElemeFood {

	@JSONField(name = "item_id")
	private String itemId;

	private String name;

	private Double rating;

	@JSONField(name = "rating_count")
	private Long ratingCount;

	private String description;

	private List<ElemeFoodSpec> specfoods;

	public String getItemId() {
		return itemId;
	}

	public void setItemId(String itemId) {
		this.itemId = itemId;
	}

	public String getName() {
		return name;
	}

	public void setName(String name) {
		this.name = name;
	}

	public Double getRating() {
		return rating;
	}

	public void setRating(Double rating) {
		this.rating = rating;
	}

	public Long getRatingCount() {
		return ratingCount;
	}

	public void setRatingCount(Long ratingCount) {
		this.ratingCount = ratingCount;
	}

	public String getDescription() {
		return description;
	}

	public void setDescription(String description) {
		this.description = description;
	}

	public List<ElemeFoodSpec> getSpecfoods() {
		return specfoods;
	}

	public void setSpecfoods(List<ElemeFoodSpec> specfoods) {
		this.specfoods = specfoods;
	}

	@Override
	public String toString() {
		return "ElemeFood [itemId=" + itemId + ", name=" + name + ", rating="
				+ rating + ", ratingCount=" + ratingCount + ", description="
				+ description + ", specfoods=" + specfoods + "]";
	}

}

import java.math.BigDecimal;
import java.util.List;

import com.alibaba.fastjson.annotation.JSONField;

/**
 * @author Zln
 * @version 2018-12-28 14:25
 * 
 * Menu item specification
 */

public class ElemeFoodSpec {
	
	private String name;

	@JSONField(name = "food_id")
	private String foodId;

	@JSONField(name = "item_id")
	private String itemId;

	@JSONField(name = "original_price")
	private BigDecimal originalPrice;

	@JSONField(name = "packing_fee")
	private BigDecimal packingFee;

	private BigDecimal price;

	@JSONField(name = "sku_id")
	private String skuId;
	
	@JSONField(name = "restaurant_id")
	private String restaurantId;

	private List<Spec> specs; // specification names

	public String getFoodId() {
		return foodId;
	}

	public void setFoodId(String foodId) {
		this.foodId = foodId;
	}

	public String getItemId() {
		return itemId;
	}

	public void setItemId(String itemId) {
		this.itemId = itemId;
	}

	public BigDecimal getOriginalPrice() {
		return originalPrice;
	}

	public void setOriginalPrice(BigDecimal originalPrice) {
		this.originalPrice = originalPrice;
	}

	public BigDecimal getPackingFee() {
		return packingFee;
	}

	public void setPackingFee(BigDecimal packingFee) {
		this.packingFee = packingFee;
	}

	public BigDecimal getPrice() {
		return price;
	}

	public void setPrice(BigDecimal price) {
		this.price = price;
	}

	public String getSkuId() {
		return skuId;
	}

	public void setSkuId(String skuId) {
		this.skuId = skuId;
	}
	
	public String getName() {
		return name;
	}

	public void setName(String name) {
		this.name = name;
	}
	
	public String getRestaurantId() {
		return restaurantId;
	}

	public void setRestaurantId(String restaurantId) {
		this.restaurantId = restaurantId;
	}

	public List<Spec> getSpecs() {
		return specs;
	}

	public void setSpecs(List<Spec> specs) {
		this.specs = specs;
	}

	public String getSpecName() {
		if(this.specs != null && this.specs.size() > 0) {
			return this.specs.get(0).getValue();
		}
		else {
			return null;
		}
	}

	@Override
	public String toString() {
		return "ElemeFoodSpec [name=" + name + ", foodId=" + foodId
				+ ", itemId=" + itemId + ", originalPrice=" + originalPrice
				+ ", packingFee=" + packingFee + ", price=" + price + ", skuId="
				+ skuId + ", restaurantId=" + restaurantId + ", specs=" + specs
				+ "]";
	}

	/**
	 * Specification
	 * 
	 * @author Zhouluning
	 *
	 */
	public static class Spec {
		private String value; // specification name

		public String getValue() {
			return value;
		}

		public void setValue(String value) {
			this.value = value;
		}

		@Override
		public String toString() {
			return "Spec [value=" + value + "]";
		}
	}
}

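Create the Pipeline. The WebMagic documentation describes a way of combining a Pipeline with Spring, but I did not fully understand it. I also found examples online that use the Pipeline directly as a Service, which does not feel quite right either, so I wrote it according to my own understanding. Each time ElemeShopProcessor.process finishes, it stores the parsed data with page.putField, and the pipeline then retrieves it with ResultItems.get. The code below only saves the shops; saving the menu items works the same way.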


import java.util.Date;
import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.BeanUtils;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import com.zln.spider.eleme.ElemeShopProcessor.ProcessType;
import com.zln.spider.entity.DcShopInfo;
import com.zln.spider.mapper.DcShopInfoMapper;
import com.zln.spider.pojo.ElemeFood;
import com.zln.spider.pojo.ElemeShop;

import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

/**
* @author Zln
* @version 2018-12-28 13:28
* 
* Pipeline for Ele.me shop data
*/

@Component("ElemeShopPipeline")
public class ElemeShopPipeline implements Pipeline{
	
	Logger logger = LoggerFactory.getLogger(getClass());
	
	@Autowired
	DcShopInfoMapper dcShopInfoMapper;
	

	@SuppressWarnings("unchecked")
	@Override
	public void process(ResultItems rs, Task task) {
		if(rs.get("type") != null) {
			
			ProcessType type = (ProcessType)rs.get("type");
			
			switch (type) {
			case SHOP:
				// Handle shop data
				List<ElemeShop> listShops = (List<ElemeShop>)rs.get("listShops");
				for (ElemeShop elemeShop : listShops) {
					logger.info("save!" + elemeShop);
					saveShopInfo(elemeShop);
				}
				break;
			case FOOD:
				// Handle menu item data (only logged here; saving is analogous to shops)
				List<ElemeFood> listFoods = (List<ElemeFood>)rs.get("listFoods");
				for (ElemeFood elemeFood : listFoods) {
					logger.info("save!" + elemeFood);
				}
				break;

			default:
				logger.error("Unknown process type!");
				break;
			}
		}
		else {
			// Nothing was crawled
			logger.error("No crawled data to process!");
		}
	}
	
	private void saveShopInfo(ElemeShop elemeShop) {
		
		DcShopInfo objDcShopInfo = new DcShopInfo();
		BeanUtils.copyProperties(elemeShop, objDcShopInfo);
		objDcShopInfo.setChannelShopId(elemeShop.getId());
		objDcShopInfo.setCreateTime(new Date());
		
		dcShopInfoMapper.insert(objDcShopInfo);
	}


}
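The hand-off between processor and pipeline is essentially a keyed map with a type tag that the consumer dispatches on. The sketch below models that pattern with a plain Map so it runs without WebMagic; ResultHandOffSketch and its methods are hypothetical stand-ins, not WebMagic API, and shop names stand in for the ElemeShop objects.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the putField / ResultItems.get hand-off using a plain Map.
final class ResultHandOffSketch {

    enum ProcessType { SHOP, FOOD }

    /** Producer side: store the payload together with a type tag. */
    static Map<String, Object> shopResult(List<String> shopNames) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("type", ProcessType.SHOP);
        fields.put("listShops", shopNames);
        return fields;
    }

    /** Consumer side: read the tag and dispatch; returns how many records were handled. */
    static int handle(Map<String, Object> fields) {
        Object type = fields.get("type");
        if (!(type instanceof ProcessType)) {
            return 0; // nothing usable was produced for this page
        }
        switch ((ProcessType) type) {
            case SHOP:
                return listOf(fields, "listShops").size();
            case FOOD:
                return listOf(fields, "listFoods").size();
            default:
                return 0;
        }
    }

    /** Returns the list stored under key, or an empty list if it is missing. */
    private static List<?> listOf(Map<String, Object> fields, String key) {
        Object value = fields.get(key);
        return value instanceof List ? (List<?>) value : Collections.emptyList();
    }

    public static void main(String[] args) {
        System.out.println(handle(shopResult(Arrays.asList("Shop A", "Shop B"))));
    }
}
```

The missing-list guard is one difference from the pipeline above, which would throw a NullPointerException if the tag were set but the list were absent.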

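Use MyBatis Generator to generate the corresponding entity and mapper. For integrating MyBatis with Spring Boot, see the post "spring boot(六):如何优雅的使用mybatis" (Spring Boot (6): using MyBatis elegantly).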


import java.util.Date;

public class DcShopInfo {
    /**
     * 
     * DC_SHOP_INFO.ID
     *
     * @mbg.generated
     */
    private Long id;

    /**
     * Task ID
     * DC_SHOP_INFO.TASK_ID
     *
     * @mbg.generated
     */
    private Long taskId;

    /**
     * Data channel
            1: Ele.me
            2: Meituan
     * DC_SHOP_INFO.CHANNEL
     *
     * @mbg.generated
     */
    private Integer channel;

    /**
     * Shop name
     * DC_SHOP_INFO.NAME
     *
     * @mbg.generated
     */
    private String name;

    /**
     * Shop address
     * DC_SHOP_INFO.ADDRESS
     *
     * @mbg.generated
     */
    private String address;

    /**
     * Shop coordinates: latitude
     * DC_SHOP_INFO.LATITUDE
     *
     * @mbg.generated
     */
    private String latitude;

    /**
     * Shop coordinates: longitude
     * DC_SHOP_INFO.LONGITUDE
     *
     * @mbg.generated
     */
    private String longitude;

    /**
     * Contact phone number
     * DC_SHOP_INFO.PHONE
     *
     * @mbg.generated
     */
    private String phone;

    /**
     * Rating
     * DC_SHOP_INFO.RATING
     *
     * @mbg.generated
     */
    private Double rating;

    /**
     * Number of ratings
     * DC_SHOP_INFO.RATING_COUNT
     *
     * @mbg.generated
     */
    private Integer ratingCount;

    /**
     * Shop ID in the channel
     * DC_SHOP_INFO.CHANNEL_SHOP_ID
     *
     * @mbg.generated
     */
    private String channelShopId;

    /**
     * Creation date
     * DC_SHOP_INFO.CREATE_TIME
     *
     * @mbg.generated
     */
    private Date createTime;

    /**
     *
     * @mbg.generated
     */
    public Long getId() {
        return id;
    }

    /**
     *
     * @mbg.generated
     */
    public void setId(Long id) {
        this.id = id;
    }

    /**
     *
     * @mbg.generated
     */
    public Long getTaskId() {
        return taskId;
    }

    /**
     *
     * @mbg.generated
     */
    public void setTaskId(Long taskId) {
        this.taskId = taskId;
    }

    /**
     *
     * @mbg.generated
     */
    public Integer getChannel() {
        return channel;
    }

    /**
     *
     * @mbg.generated
     */
    public void setChannel(Integer channel) {
        this.channel = channel;
    }

    /**
     *
     * @mbg.generated
     */
    public String getName() {
        return name;
    }

    /**
     *
     * @mbg.generated
     */
    public void setName(String name) {
        this.name = name;
    }

    /**
     *
     * @mbg.generated
     */
    public String getAddress() {
        return address;
    }

    /**
     *
     * @mbg.generated
     */
    public void setAddress(String address) {
        this.address = address;
    }

    /**
     *
     * @mbg.generated
     */
    public String getLatitude() {
        return latitude;
    }

    /**
     *
     * @mbg.generated
     */
    public void setLatitude(String latitude) {
        this.latitude = latitude;
    }

    /**
     *
     * @mbg.generated
     */
    public String getLongitude() {
        return longitude;
    }

    /**
     *
     * @mbg.generated
     */
    public void setLongitude(String longitude) {
        this.longitude = longitude;
    }

    /**
     *
     * @mbg.generated
     */
    public String getPhone() {
        return phone;
    }

    /**
     *
     * @mbg.generated
     */
    public void setPhone(String phone) {
        this.phone = phone;
    }

    /**
     *
     * @mbg.generated
     */
    public Double getRating() {
        return rating;
    }

    /**
     *
     * @mbg.generated
     */
    public void setRating(Double rating) {
        this.rating = rating;
    }

    /**
     *
     * @mbg.generated
     */
    public Integer getRatingCount() {
        return ratingCount;
    }

    /**
     *
     * @mbg.generated
     */
    public void setRatingCount(Integer ratingCount) {
        this.ratingCount = ratingCount;
    }

    /**
     *
     * @mbg.generated
     */
    public String getChannelShopId() {
        return channelShopId;
    }

    /**
     *
     * @mbg.generated
     */
    public void setChannelShopId(String channelShopId) {
        this.channelShopId = channelShopId;
    }

    /**
     *
     * @mbg.generated
     */
    public Date getCreateTime() {
        return createTime;
    }

    /**
     *
     * @mbg.generated
     */
    public void setCreateTime(Date createTime) {
        this.createTime = createTime;
    }
}

import java.util.List;

import org.apache.ibatis.annotations.Delete;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Result;
import org.apache.ibatis.annotations.Results;
import org.apache.ibatis.annotations.Select;
import org.apache.ibatis.annotations.Update;
import org.apache.ibatis.type.JdbcType;

import com.zln.spider.entity.DcShopInfo;

public interface DcShopInfoMapper {
    /**
     * This method was generated by MyBatis Generator.
     * This method corresponds to the database table dc_shop_info
     *
     * @mbg.generated
     */
    @Delete({
        "delete from dc_shop_info",
        "where ID = #{id,jdbcType=BIGINT}"
    })
    int deleteByPrimaryKey(Long id);

    /**
     * This method was generated by MyBatis Generator.
     * This method corresponds to the database table dc_shop_info
     *
     * @mbg.generated
     */
    @Insert({
        "insert into dc_shop_info (ID, TASK_ID, ",
        "CHANNEL, NAME, ADDRESS, ",
        "LATITUDE, LONGITUDE, ",
        "PHONE, RATING, RATING_COUNT, ",
        "CHANNEL_SHOP_ID, CREATE_TIME)",
        "values (#{id,jdbcType=BIGINT}, #{taskId,jdbcType=BIGINT}, ",
        "#{channel,jdbcType=INTEGER}, #{name,jdbcType=VARCHAR}, #{address,jdbcType=VARCHAR}, ",
        "#{latitude,jdbcType=VARCHAR}, #{longitude,jdbcType=VARCHAR}, ",
        "#{phone,jdbcType=VARCHAR}, #{rating,jdbcType=DOUBLE}, #{ratingCount,jdbcType=INTEGER}, ",
        "#{channelShopId,jdbcType=VARCHAR}, #{createTime,jdbcType=TIMESTAMP})"
    })
    int insert(DcShopInfo record);

    /**
     * This method was generated by MyBatis Generator.
     * This method corresponds to the database table dc_shop_info
     *
     * @mbg.generated
     */
    @Select({
        "select",
        "ID, TASK_ID, CHANNEL, NAME, ADDRESS, LATITUDE, LONGITUDE, PHONE, RATING, RATING_COUNT, ",
        "CHANNEL_SHOP_ID, CREATE_TIME",
        "from dc_shop_info",
        "where ID = #{id,jdbcType=BIGINT}"
    })
    @Results({
        @Result(column="ID", property="id", jdbcType=JdbcType.BIGINT, id=true),
        @Result(column="TASK_ID", property="taskId", jdbcType=JdbcType.BIGINT),
        @Result(column="CHANNEL", property="channel", jdbcType=JdbcType.INTEGER),
        @Result(column="NAME", property="name", jdbcType=JdbcType.VARCHAR),
        @Result(column="ADDRESS", property="address", jdbcType=JdbcType.VARCHAR),
        @Result(column="LATITUDE", property="latitude", jdbcType=JdbcType.VARCHAR),
        @Result(column="LONGITUDE", property="longitude", jdbcType=JdbcType.VARCHAR),
        @Result(column="PHONE", property="phone", jdbcType=JdbcType.VARCHAR),
        @Result(column="RATING", property="rating", jdbcType=JdbcType.DOUBLE),
        @Result(column="RATING_COUNT", property="ratingCount", jdbcType=JdbcType.INTEGER),
        @Result(column="CHANNEL_SHOP_ID", property="channelShopId", jdbcType=JdbcType.VARCHAR),
        @Result(column="CREATE_TIME", property="createTime", jdbcType=JdbcType.TIMESTAMP)
    })
    DcShopInfo selectByPrimaryKey(Long id);

    /**
     * This method was generated by MyBatis Generator.
     * This method corresponds to the database table dc_shop_info
     *
     * @mbg.generated
     */
    @Select({
        "select",
        "ID, TASK_ID, CHANNEL, NAME, ADDRESS, LATITUDE, LONGITUDE, PHONE, RATING, RATING_COUNT, ",
        "CHANNEL_SHOP_ID, CREATE_TIME",
        "from dc_shop_info"
    })
    @Results({
        @Result(column="ID", property="id", jdbcType=JdbcType.BIGINT, id=true),
        @Result(column="TASK_ID", property="taskId", jdbcType=JdbcType.BIGINT),
        @Result(column="CHANNEL", property="channel", jdbcType=JdbcType.INTEGER),
        @Result(column="NAME", property="name", jdbcType=JdbcType.VARCHAR),
        @Result(column="ADDRESS", property="address", jdbcType=JdbcType.VARCHAR),
        @Result(column="LATITUDE", property="latitude", jdbcType=JdbcType.VARCHAR),
        @Result(column="LONGITUDE", property="longitude", jdbcType=JdbcType.VARCHAR),
        @Result(column="PHONE", property="phone", jdbcType=JdbcType.VARCHAR),
        @Result(column="RATING", property="rating", jdbcType=JdbcType.DOUBLE),
        @Result(column="RATING_COUNT", property="ratingCount", jdbcType=JdbcType.INTEGER),
        @Result(column="CHANNEL_SHOP_ID", property="channelShopId", jdbcType=JdbcType.VARCHAR),
        @Result(column="CREATE_TIME", property="createTime", jdbcType=JdbcType.TIMESTAMP)
    })
    List<DcShopInfo> selectAll();

    /**
     * This method was generated by MyBatis Generator.
     * This method corresponds to the database table dc_shop_info
     *
     * @mbg.generated
     */
    @Update({
        "update dc_shop_info",
        "set TASK_ID = #{taskId,jdbcType=BIGINT},",
          "CHANNEL = #{channel,jdbcType=INTEGER},",
          "NAME = #{name,jdbcType=VARCHAR},",
          "ADDRESS = #{address,jdbcType=VARCHAR},",
          "LATITUDE = #{latitude,jdbcType=VARCHAR},",
          "LONGITUDE = #{longitude,jdbcType=VARCHAR},",
          "PHONE = #{phone,jdbcType=VARCHAR},",
          "RATING = #{rating,jdbcType=DOUBLE},",
          "RATING_COUNT = #{ratingCount,jdbcType=INTEGER},",
          "CHANNEL_SHOP_ID = #{channelShopId,jdbcType=VARCHAR},",
          "CREATE_TIME = #{createTime,jdbcType=TIMESTAMP}",
        "where ID = #{id,jdbcType=BIGINT}"
    })
    int updateByPrimaryKey(DcShopInfo record);
}

Create a test case, or add a scheduled job, to run the finished crawler:

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;

import com.zln.spider.eleme.ElemeShopPipeline;
import com.zln.spider.eleme.ElemeShopProcessor;

import us.codecraft.webmagic.Spider;

@RunWith(SpringRunner.class)
@SpringBootTest
public class SpiderApplicationTests {

	@Qualifier("ElemeShopPipeline")
	@Autowired
	ElemeShopPipeline elemeShopPipeline;

	@Test
	public void testElemeShopSpider() {
		String latitude = "31.032697";
		String longitude = "121.216669";
		ElemeShopProcessor processor = new ElemeShopProcessor(longitude,latitude, false);
		Spider.create(processor).addPipeline(elemeShopPipeline)
				.addUrl(processor.getUrl())
				// Crawl with 1 thread
				.thread(1)
				// Start the crawler
				.run();
	}

}

After it runs, the crawled shop data can be seen in the database.
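For the scheduled variant, one option that needs nothing beyond the JDK is a ScheduledExecutorService. This is only a sketch: CrawlScheduler is a hypothetical name, the Runnable stands in for the blocking Spider.create(...).run() call from the test above, and it is not the Spring scheduling mechanism.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a crawl job repeatedly with a fixed pause between runs.
final class CrawlScheduler {

    static ScheduledExecutorService start(Runnable crawlJob, long initialDelayMs, long delayMs) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleWithFixedDelay(() -> {
            try {
                crawlJob.run();
            } catch (RuntimeException e) {
                // An uncaught exception would suppress all later runs, so report and continue.
                System.err.println("crawl run failed: " + e);
            }
        }, initialDelayMs, delayMs, TimeUnit.MILLISECONDS);
        return executor;
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder job; in the project this would start the spider.
        ScheduledExecutorService executor = start(
                () -> System.out.println("crawl run at " + System.currentTimeMillis()), 0, 200);
        Thread.sleep(700);
        executor.shutdownNow();
    }
}
```

scheduleWithFixedDelay measures the pause from the end of one run to the start of the next, so a slow crawl never overlaps with the following one.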

References

WebMagic official site: http://webmagic.io
