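Scraping a Website's Data into a Database: A Detailed Walkthrough (with Illustrations)
I. Analyzing the Requirements
1.1 Requirements
- I happened to have exactly this requirement: scrape all the data from the pages of the site below and store it in a MySQL database.
- The page to scrape is the following:
- Selecting the year, month and day
- Selecting the hour of birth and the sex (male or female):
- After choosing the date, the hour and the sex, the site jumps to a result page (this result page is what we want to scrape):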
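1.2 Feasibility Analysis
- Clicking through various years, months, days, hours and both sexes shows that the result pages have the following characteristics:
- The page data is static; it is not loaded from a backend after the page opens (this determines which technologies we can consider).
- The page data has fixed key:value attributes, for example 生肖: 牛 (Chinese zodiac: Ox) or 星座: 双鱼座 (zodiac sign: Pisces). Every page has the same set of keys; only the values change with the date, hour and sex (this informs how we define the database columns).
- The page paths follow a pattern. For example, for a female born on 1 January 1950 at hour 0, the path is http://www.8gua.cn/huashengsuanming/1950/w-1950-1-1-0.html. From this we can work out the following:
- The path is composed as: base URL/year/sex-year-month-day-hour.html, where sex is m for male and w for female;
- The selectable birth hours are: 0, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23;
- There is a one-to-one correspondence between the page path and the selectable hour;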
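The path pattern above can be sketched as a small helper. This is only an illustration: `baseUrl`, `buildPageUrl` and `hourWindow` are hypothetical names that do not appear in the article's code, the sketch covers a single page per combination, and the time windows are copied from the lookup table used in section 4.3.

```kotlin
// Base path taken from the example URL in this article.
val baseUrl = "http://www.8gua.cn/huashengsuanming"

// Hypothetical helper: builds the result-page URL for one date, hour slot and sex.
fun buildPageUrl(year: Int, month: Int, day: Int, hour: Int, sex: String): String {
    require(sex == "m" || sex == "w") { "sex must be \"m\" (male) or \"w\" (female)" }
    return "$baseUrl/$year/$sex-$year-$month-$day-$hour.html"
}

// Hypothetical helper: the clock-time window ("HH:mm" to "HH:mm") behind an hour slot.
fun hourWindow(slot: Int): Pair<String, String> {
    require(slot == 0 || (slot in 1..23 && slot % 2 == 1)) { "unsupported hour slot: $slot" }
    // Slots 0 and 23 cover one hour each; every other slot covers two hours.
    val endHour = if (slot == 0 || slot == 23) slot else slot + 1
    return "%02d:00".format(slot) to "%02d:59".format(endHour)
}
```

Calling `buildPageUrl(1950, 1, 1, 0, "w")` reproduces the example path given above.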
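II. Choosing the Technology
- To parse static pages we can use Jsoup, which loads the elements of a page into a Document that we can then query for the parts we need;
- What is Jsoup? jsoup is an HTML parser for Java that can parse a URL or a piece of HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors and jQuery-like methods.
- With Jsoup we can visit a given page, grab its content and turn it into text (fetching the data text)
- To pull out the keys and values we can use regular expressions, which cut out the data that matches our rules (extracting from the data text)
- Along the way we also need:
- A method that returns the number of days in a given year and month (date handling)
- A method that returns the list of values in a given range (for example the years 1950 to 2019, or the months 1 to 12)
When implementing a feature, first work out, piece by piece, what each part needs. Going from the individual points to the whole is what gets a large feature delivered.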
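To make the regular-expression step concrete, here is a minimal sketch of extracting one value from page text. The function name `extractValue` and the sample string are made up for illustration; the article's own helpers appear in section 4.1 and work slightly differently.

```kotlin
// Returns the text between a key label and a terminator, or null when the key is absent.
fun extractValue(text: String, key: String, terminator: String): String? {
    // Regex.escape makes the label and terminator match literally;
    // (.*?) lazily captures everything in between.
    val pattern = Regex(Regex.escape(key) + "(.*?)" + Regex.escape(terminator))
    return pattern.find(text)?.groupValues?.get(1)
}

// Made-up sample in the same "key:value。" shape as the scraped pages.
val sampleText = "生肖:牛。星座:双鱼座。"
```

With this sample, `extractValue(sampleText, "星座:", "。")` yields the zodiac sign, and asking for a key that is not present yields `null` rather than an empty string.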
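III. Laying Out the Workflow
- Flowchart:
- Implementation summary: we iterate over year, month, day and hour, then over male and female. This gives us, for every day from 1950 to 2019, the data for each of the fixed hour slots and for both sexes. The actual logic goes in the innermost loop, where we:
- Build the URL from the year, month, day, hour, sex and any paging conditions;
- Use Jsoup to fetch the data at that URL, convert the main content to text, and filter out what we do not need;
- Use regular expressions to extract the data into key:value form;
- Wrap the result in an object and store it in the database;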
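The traversal described above can be sketched as a lazy sequence of page keys, which also makes it easy to see how many pages the run has to visit. `PageKey`, `hourSlots` and `allPageKeys` are hypothetical names, and the sketch uses `java.time.YearMonth` for the month lengths instead of the article's own helper.

```kotlin
import java.time.YearMonth

// One page to scrape: a birth date, an hour slot and a sex ("m" or "w").
data class PageKey(val year: Int, val month: Int, val day: Int, val hour: Int, val sex: String)

// The selectable hour slots, as listed in the article's main method.
val hourSlots = listOf(0, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23)

// Lazily enumerates every combination, from the outermost loop (year) to the innermost (sex).
fun allPageKeys(years: IntRange = 1950..2019): Sequence<PageKey> = sequence {
    for (year in years) {
        for (month in 1..12) {
            for (day in 1..YearMonth.of(year, month).lengthOfMonth()) {
                for (hour in hourSlots) {
                    for (sex in listOf("m", "w")) {
                        yield(PageKey(year, month, day, hour, sex))
                    }
                }
            }
        }
    }
}
```

For the full range, `allPageKeys().count()` comes to 664,742 combinations (25,567 days, 13 hour slots, 2 sexes), which is why the volume of data matters later on.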
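IV. The Code
4.1 Shared Helpers and Dependencies
- The code in this article is written in Kotlin, which differs little from Java.
- Add the Jsoup dependency:
    implementation group: 'org.jsoup', name: 'jsoup', version: '1.13.1'
For a Java/Maven project, search for jsoup to find the equivalent Maven dependency. On Gradle versions before 7, the older compile configuration works in place of implementation.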
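- Getting the number of days in a given year and month:
    import java.util.Calendar

    /**
     * Returns the number of days in the given month of the given year.
     */
    fun getDaysByYearMonth(year: Int, month: Int): Int {
        val a = Calendar.getInstance()
        a[Calendar.YEAR] = year
        a[Calendar.MONTH] = month - 1   // Calendar months are zero-based
        a[Calendar.DATE] = 1
        // Rolling back one day from the 1st wraps to the last day of the same month
        a.roll(Calendar.DATE, -1)
        return a[Calendar.DATE]
    }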
- Getting the text between two delimiters (delimiters included):
    import java.util.regex.Pattern

    /**
     * Returns the first substring of content that starts with pre and ends with post,
     * including pre and post themselves. Returns "" when nothing matches.
     */
    fun parseTextAll(content: String, pre: String, post: String): String {
        // Quote the delimiters so they match literally; (.*?) lazily matches the text in between.
        // Since . does not match line breaks by default, a match never spans lines.
        val r = Pattern.compile(Pattern.quote(pre) + "(.*?)" + Pattern.quote(post))
        val m = r.matcher(content)
        // group(0) is the whole match, delimiters included
        return if (m.find()) m.group(0) else ""
    }
- Getting the text between two delimiters (delimiters excluded):
    /**
     * Returns the first substring of content that lies between pre and post,
     * with pre and post stripped. Returns "" when nothing matches.
     */
    fun parseText(content: String, pre: String, post: String): String {
        // Quote the delimiters so they match literally
        val r = Pattern.compile(Pattern.quote(pre) + "(.*?)" + Pattern.quote(post))
        val m = r.matcher(content)
        // group(0) is the whole match; the delimiters are then removed from it.
        // Note that replace also removes any further occurrence of pre or post inside the match.
        return if (m.find()) m.group(0).replace(pre, "").replace(post, "") else ""
    }
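- Using Jsoup to fetch the text content of the pages at the given URLs:
    import org.jsoup.Jsoup
    import org.jsoup.nodes.Document
    import java.net.URL

    fun parseContent(urls: List<String>): String {
        val builder = StringBuilder()
        urls.forEach {
            try {
                // Fetch and parse the page with a 3-second timeout
                val document: Document = Jsoup.parse(URL(it), 3 * 1000)
                // Append the combined text of all <p> elements
                builder.append(document.getElementsByTag("p").text())
            } catch (e: Exception) {
                // Skip a failed page so one bad URL does not abort the run, but leave a trace
                System.err.println("Failed to fetch $it: ${e.message}")
            }
        }
        return builder.toString()
    }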
Here document.getElementsByTag("p").text() returns the text of every element with the given tag name, which in this case means everything inside the <p> tags.
- Getting the list of integers in a given range:
    // Returns every integer from startInt to endInt, both inclusive (empty when startInt > endInt)
    fun getListByRange(startInt: Int, endInt: Int): List<Int> = (startInt..endInt).toList()
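4.2 Database and Storage Design
- The database design is as follows:
Here content_text stores the full text of the page. Because the data volume is large and this field is unlikely to be used much, I have dropped it for now and do not assign it a value;
- Configuration on the Java side:
- The persistence framework we use is:
MyBatis-Plus
- The table name in the database is:
professional_letter
- The Mapper:
    interface ProfessionalLetterMapper : BaseMapper<ProfessionalLetterEntity>
- The Entity: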
    import io.swagger.annotations.ApiModelProperty
    import java.util.Date
    // @TableName comes from MyBatis-Plus; import it from the package that matches your version.
    // Labels left in Chinese are the field names exactly as they appear on the scraped page.

    @TableName("professional_letter")
    class ProfessionalLetterEntity {
        @ApiModelProperty("Primary key id") var id: Long? = 0L
        @ApiModelProperty("Start of the birth time window (date and time)") var startTime: Date? = null
        @ApiModelProperty("End of the birth time window (date and time)") var endTime: Date? = null
        @ApiModelProperty("Sex") var sex: Short? = null
        @ApiModelProperty("Title") var title: String? = null
        @ApiModelProperty("阳历") var solarCalendar: String? = null
        @ApiModelProperty("农历") var lunarCalendar: String? = null
        @ApiModelProperty("节气") var solarTerms: String? = null
        @ApiModelProperty("星座") var constellation: String? = null
        @ApiModelProperty("十二生肖") var chineseZodiac: String? = null
        @ApiModelProperty("二十八星宿") var twentyEightNights: String? = null
        @ApiModelProperty("命主福元") var fortune: String? = null
        @ApiModelProperty("Content as plain text") var contentText: String? = null
        @ApiModelProperty("Content as JSON") var contentJson: String? = null
        @ApiModelProperty("胎元") var foetus: String? = null
        @ApiModelProperty("命宫") var mingGong: String? = null
        @ApiModelProperty("起大运周岁") var qiDaYun: String? = null
    }
- The ContentJson class (holds every field extracted from the page):
    import com.sino.hardware.common.JsonSerializable
    import io.swagger.annotations.ApiModelProperty

    // The labels are the field names exactly as they appear on the scraped page.
    class ContentJson : JsonSerializable() {
        @ApiModelProperty("阳历") var solarCalendar: String? = null
        @ApiModelProperty("农历") var lunarCalendar: String? = null
        @ApiModelProperty("节气") var solarTerms: String? = null
        @ApiModelProperty("起大运周岁") var qiDaYun: String? = null
        @ApiModelProperty("星座") var constellation: String? = null
        @ApiModelProperty("十二生肖") var chineseZodiac: String? = null
        @ApiModelProperty("二十八星宿") var twentyEightNights: String? = null
        @ApiModelProperty("命主福元") var fortune: String? = null
        @ApiModelProperty("八字纳音") var baZiNaYin: String? = null
        @ApiModelProperty("排大运") var paiDaYun: String? = null
        @ApiModelProperty("排流年") var paiLiuNian: String? = null
        @ApiModelProperty("胎元") var foetus: String? = null
        @ApiModelProperty("命宫") var mingGong: String? = null
        @ApiModelProperty("终身卦") var zhongShenGua: String? = null
        @ApiModelProperty("吉神凶煞") var jiShenXiongSha: String? = null
        @ApiModelProperty("吉神凶煞提示") var jiShenXiongShaTiShi: String? = null
        @ApiModelProperty("命局生克制化") var mingJuShengKeZhiHua: String? = null
        @ApiModelProperty("日主综合得分") var riZhuZhongDeiFen: String? = null
        @ApiModelProperty("日主综合得分提示") var riZHuZhongDeiFenTiShi: String? = null
        @ApiModelProperty("三命通会论断") var sanMingTongHuiLunDuan: String? = null
        @ApiModelProperty("穷通宝鉴-调候用神参考") var qiongTongBaoJian: String? = null
        @ApiModelProperty("十神定位论断") var shiShenDingWeiLunDuan: String? = null
        @ApiModelProperty("八字重量") var baZiZhongLiang: String? = null
        @ApiModelProperty("八字重量提示") var baZiZhongLiangTiShi: String? = null
        @ApiModelProperty("命宫寓意") var mingGongYuYi: String? = null
        @ApiModelProperty("性格特征") var xingGeTeZheng: String? = null
        @ApiModelProperty("性格特征提示") var xingGeTeZhengTiShi: String? = null
        @ApiModelProperty("职业财运") var zhiYeCaiYun: String? = null
        @ApiModelProperty("功名官运") var gongMingGuanYun: String? = null
        @ApiModelProperty("婚姻择偶") var hunYingZeOu: String? = null
        @ApiModelProperty("配偶方向") var peiOuFangXiang: String? = null
        @ApiModelProperty("配偶方向提示") var peiOuFangXiangTiShi: String? = null
        @ApiModelProperty("祖业遗产") var zuYeYiChan: String? = null
        @ApiModelProperty("体质健康") var tiZhiJianKang: String? = null
        @ApiModelProperty("体质健康提示") var tiZhiJianKangTiShi: String? = null
        @ApiModelProperty("有利选择") var youLiXuanZe: String? = null
        @ApiModelProperty("流年") var liuNianMap: Map<String, String>? = null
        @ApiModelProperty("起大运运势") var qiDaYunMap: Map<String, String>? = null
    }
4.3 Core Code
- Inject the Mapper:
    @Autowired
    private lateinit var professionalLetterMapper: ProfessionalLetterMapper
- Create the entry-point method, which defines the ranges to iterate over and calls the methods from steps 3 and 4:
    @Async
    fun insertDB() {
        val yearList = getListByRange(1950, 2019)
        val monthList = getListByRange(1, 12)
        val hourList = listOf(0, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23)
        // Iterate over every year
        yearList.forEach { year ->
            // Iterate over every month
            monthList.forEach { month ->
                // Work out how many days this month has in this year
                val dayMaxCount = getDaysByYearMonth(year, month)
                // Turn the day count into a list of days
                val dayList = getListByRange(1, dayMaxCount)
                // Fetch and store the data for every day and hour slot
                dayList.forEach { day ->
                    hourList.forEach { hour ->
                        insertDB(year, month, day, hour)
                    }
                }
            }
        }
        // Import finished: insert a marker row so the end of the run is visible in the table
        val entity = ProfessionalLetterEntity()
        entity.title = "导出完成"
        professionalLetterMapper.insert(entity)
    }
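- The method that stores into the database:
    // Scrapes and stores the male ("m") and female ("w") pages for one date and hour slot
    private fun insertDB(year: Int, month: Int, day: Int, hour: Int) {
        val entityM = parseUrl2Content(year, month, day, hour, "m")
        professionalLetterMapper.insert(entityM)
        val entityW = parseUrl2Content(year, month, day, hour, "w")
        professionalLetterMapper.insert(entityW)
    }
- The method that fetches the data for a URL, converts it to text with Jsoup, filters out unwanted data, and wraps the result in the Entity to be stored: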
    private fun parseUrl2Content(year: Int, month: Int, day: Int, hour: Int, sex: String): ProfessionalLetterEntity {
        // setUrls (not shown in this article) builds the page URL(s) for this combination
        val urls = setUrls(year, month, day, hour, sex)
        val entity = ProfessionalLetterEntity()
        // Page text with site branding and copyright notices stripped.
        // The longest strings are removed first so the shorter ones cannot break them up.
        val content = parseContent(urls)
            .replace("Copyright © 2014-2017华盛算命 www.8gua.cn", "")
            .replace("Copyright©2014-2017华盛算命 www.8gua.cn", "")
            .replace("华盛算命", "")
            .replace("www.8gua.cn", "")
            .replace("- ", "")
            .replace("【", "\n 【")
            .replace("Copyright©2014-2017", "")
            .replace("Copyright © 2014-2017", "")
            .replace("联系QQ:139238028", "")
            .replace("Copyright?2014-2017", "") + "\n"
        // Extract each field by the labels that surround it on the page
        val birthDay = parseText(content, "出生公历:", "。")
        val birthDayNong = parseText(content, "出生农历:", "。")
        val jieqi = parseText(content, "节气:", "。")
        val qidayun = parseText(content, "起大运周岁:", "。")
        val xingzuo = parseText(content, "星座:", "。")
        val shengxiao = parseText(content, "生肖:", "。")
        val ershibaxiu = parseText(content, "二十八宿:", "。")
        val minzhufuyuan = parseText(content, "命主福元:", "。")
        val bazi = parseText(content, "。", "节气:")
        val paidayun = parseText(content, "排大运:", "排流年:")
        val pailiunian = parseText(content, "排流年:", "※胎元:")
        val taiyuan = parseText(content, "※胎元:", "命宫:")
        val minggong = parseText(content, "命宫:", "终身卦:")
        val zhongshengua = parseText(content, "终身卦:", "吉神凶煞:")
        val jishenxiongsha = parseText(content, "吉神凶煞:", "☆星座:")
        val tishijishenxiongsha = parseTextAll(content, "※提示:神煞", "性质的作用。")
        val minjushengkezhihua = parseText(content, "命局生克制化:", "※日主综合得分:")
        val rizongzhudeifen = parseText(content, "※日主综合得分:", "※提示:这一步给出了整个八字最有价值的信息,")
        val tishirizongzhudeifen = parseTextAll(content, "※提示:这一步给出了整个八字最有价值的信息,", "利用这些数据,切记!")
        val sanmingtonghuilunduan = parseText(content, "三命通会论断:", "穷通宝鉴-调侯用神参考: ")
        val qiongtongbaojian_diaotongyongshencankao = parseText(content, "穷通宝鉴-调侯用神参考:", "※提示:这里对命局生克关系有较好的论述,")
        val shishendingweilunduan = parseText(content, "十神定位论断:", "※提示:分宫论断,")
        val bazizhongliang = parseText(content, "\n", "※提示:这是一种神奇的断命法,")
        val tishi_bazizhongliang = parseTextAll(content, "※提示:这是一种神奇的断命法,", "命运轮廓,可参考。")
        val minggongyuyi = parseText(content, "※ 命宫寓意:", "※ 性格特征:")
        val xinggetezheng = parseText(content, "※ 性格特征:", "※提示:个人性格除禀受天赋外,")
        val tishi_xinggetezheng = parseTextAll(content, "※提示:个人性格除禀受天赋外,", "时代背景等。")
        val zhiyecaiyun = parseText(content, "职业财运:", "※提示:职业和财运密切像关")
        val gongmingguanyun = parseText(content, "功名官运:", "※提示:")
        val hunyingzeou = parseText(content, "※ 婚姻择偶:", "\n")
        val peioufangxiang = parseText(content, "【配偶方向】", "※提示:看婚姻除看个人八字外,")
        val tishi_hunyingzeou = parseTextAll(content, "※提示:看婚姻除看个人八字外,", "不喜克害。")
        val zuyeyichan = parseText(content, "※ 祖业遗产:", "※ 家庭子女:")
        val tizhijiankang = parseText(content, "※ 体质健康:", "※提示:八字阴阳五行平衡,")
        val tishi_tizhijiankang = parseTextAll(content, "※提示:八字阴阳五行平衡,", "※ 有利选择:")
        val youlixuanze = parseText(content, "※ 有利选择:", "※ 未交大运前的运势:")
        // Ages 1 to 82, one 流年 entry for each
        val liunianyunshiList = (1..82).toList()
        // The eight 大运 steps, numbered with Chinese numerals as on the page
        val qidayunCountList = listOf("一", "二", "三", "四", "五", "六", "七", "八")
        var contentNew = content
        val qiDaYunMap = mutableMapOf<String, String>()
        for (index in qidayunCountList) {
            val qidayunContent = parseText(contentNew, "第${index}步大运:", "\n")
            contentNew = contentNew.replace("${index}${qidayunContent}", "")
            qiDaYunMap[index] = qidayunContent
        }
        val liuNianMap = mutableMapOf<String, String>()
        liunianyunshiList.forEach {
            liuNianMap["$it"] = parseTextAll(contentNew, "【${it}岁流年:", "\n")
        }
        // Hour slot -> "start|end" of the time window it covers
        val map = mapOf(
            "0" to "00:00|00:59", "1" to "01:00|02:59", "3" to "03:00|04:59",
            "5" to "05:00|06:59", "7" to "07:00|08:59", "9" to "09:00|10:59",
            "11" to "11:00|12:59", "13" to "13:00|14:59", "15" to "15:00|16:59",
            "17" to "17:00|18:59", "19" to "19:00|20:59", "21" to "21:00|22:59",
            "23" to "23:00|23:59")
        // Put the extracted values into the JSON object
        val contentJson = ContentJson()
        contentJson.solarCalendar = birthDay
        contentJson.lunarCalendar = birthDayNong
        contentJson.solarTerms = jieqi
        contentJson.qiDaYun = qidayun
        contentJson.constellation = xingzuo
        contentJson.chineseZodiac = shengxiao
        contentJson.twentyEightNights = ershibaxiu
        contentJson.fortune = minzhufuyuan
        contentJson.baZiNaYin = bazi
        contentJson.paiDaYun = paidayun
        contentJson.paiLiuNian = pailiunian
        contentJson.foetus = taiyuan
        contentJson.mingGong = minggong
        contentJson.zhongShenGua = zhongshengua
        contentJson.jiShenXiongSha = jishenxiongsha
        contentJson.jiShenXiongShaTiShi = tishijishenxiongsha
        contentJson.mingJuShengKeZhiHua = minjushengkezhihua
        contentJson.riZhuZhongDeiFen = rizongzhudeifen
        contentJson.riZHuZhongDeiFenTiShi = tishirizongzhudeifen
        contentJson.sanMingTongHuiLunDuan = sanmingtonghuilunduan
        contentJson.qiongTongBaoJian = qiongtongbaojian_diaotongyongshencankao
        contentJson.shiShenDingWeiLunDuan = shishendingweilunduan
        contentJson.baZiZhongLiang = bazizhongliang
        contentJson.baZiZhongLiangTiShi = tishi_bazizhongliang
        contentJson.mingGongYuYi = minggongyuyi
        contentJson.xingGeTeZheng = xinggetezheng
        contentJson.xingGeTeZhengTiShi = tishi_xinggetezheng
        contentJson.zhiYeCaiYun = zhiyecaiyun
        contentJson.gongMingGuanYun = gongmingguanyun
        contentJson.hunYingZeOu = hunyingzeou
        contentJson.peiOuFangXiang = peioufangxiang
        contentJson.peiOuFangXiangTiShi = tishi_hunyingzeou
        contentJson.zuYeYiChan = zuyeyichan
        contentJson.tiZhiJianKang = tizhijiankang
        contentJson.tiZhiJianKangTiShi = tishi_tizhijiankang
        contentJson.youLiXuanZe = youlixuanze
        contentJson.liuNianMap = liuNianMap
        contentJson.qiDaYunMap = qiDaYunMap
        // Put the results into the entity
        val hourList = map.getValue("$hour").split("|")
        val months = if (month < 10) "0$month" else "$month"
        val days = if (day < 10) "0$day" else "$day"
        entity.startTime = DateUtil.parseStrToDate("$year-${months}-${days} ${hourList[0]}:00", "yyyy-MM-dd HH:mm:ss")
        entity.endTime = DateUtil.parseStrToDate("$year-${months}-${days} ${hourList[1]}:59", "yyyy-MM-dd HH:mm:ss")
        // 0 = male, 1 = female
        entity.sex = if (sex == "m") 0 else 1
        // "m" is male (男), "w" is female (女)
        entity.title = "八字详批-${if (sex == "m") "男" else "女"}命公历${year}年${months}月${days}日生于${hourList[0]}时~${hourList[1]}时的人"
        entity.solarCalendar = birthDay
        entity.lunarCalendar = birthDayNong
        entity.solarTerms = jieqi
        entity.constellation = xingzuo
        entity.chineseZodiac = shengxiao
        entity.twentyEightNights = ershibaxiu
        entity.fortune = minzhufuyuan
        entity.contentText = "" // not populated for now
        entity.contentJson = contentJson.toJSON()
        entity.mingGong = minggong
        entity.foetus = taiyuan
        entity.qiDaYun = qidayun
        return entity
    }
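V. Notes After Launch
- With everything in place, we start the application and it scrapes the pages and stores the data automatically:
- The data is being inserted, row after row:
- Closing thoughts:
- Because the data volume is large, the job is best run on a server;
- We can also optimize, for example by sharding databases and tables, later on as actual needs dictate;
- We can run the work in multiple threads at the same time, for example one thread per segment of the range doing the inserts, to make full use of the multiple cores of modern CPUs;
- When solving a requirement, choose the technical approach that fits the actual need; no single approach suits every requirement.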
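As a rough illustration of the multithreading idea, the sketch below splits the work by year and runs each year's task on a fixed-size thread pool from `java.util.concurrent`. `runPerYear` is a hypothetical helper, not part of the article's code; in the real job the task would be the per-year scrape-and-insert logic, and whether that logic is safe to run concurrently is an assumption that would need checking.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

// Runs task(year) for every year on a pool of `threads` workers and collects the results by year.
fun <T> runPerYear(years: IntRange, threads: Int, task: (Int) -> T): Map<Int, T> {
    val pool = Executors.newFixedThreadPool(threads)
    try {
        // Submit one job per year; each Future will hold that year's result.
        val futures = years.associateWith { year -> pool.submit(Callable { task(year) }) }
        // get() blocks until the corresponding job has finished (and rethrows its failure, if any).
        return futures.mapValues { (_, future) -> future.get() }
    } finally {
        // Stop accepting new jobs so the worker threads can exit.
        pool.shutdown()
    }
}
```

Splitting by year keeps each job large enough that the scheduling overhead is negligible, and a small pool size also limits how hard the target site is hit at once.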