Table of Contents
I. Introduction
Graduation season is coming around again, and for students about to graduate it is once more time to agonize over thesis and capstone topics; I am no exception. Fortunately, while browsing proposal topics I came across a novel one that caught my interest (a web crawler system), which led to this series of articles. The series combines web crawling, machine-learning price prediction, and web development, and aims to be as complete as possible. A note before we start: feel free to borrow the techniques, but do not copy the project wholesale into your own thesis; since it is published on the internet, plenty of others may have done exactly that.
II. Walkthrough
1. Site analysis
(1) I chose Lianjia as the data source: in terms of listing volume and traffic it is near the top of a market that today is essentially Beike and Deyou. Its one drawback is that not every prefecture-level city is covered, but most common ones are, roughly 170 in total. Since the idea behind this system is to help graduates understand local housing prices, and their likely trend, when job hunting, that volume of data is plenty for a prediction system. With that, let's start the analysis.
(2) The most important part of crawling is understanding the site's structure, so that parsing is straightforward. To crawl all of Lianjia we need to iterate over every region, which means first collecting each region's URL. Opening the browser dev tools, we can see that each letter heads a group of provinces, each wrapped in an li tag; the prefecture-level cities sit in a ul inside a div under that li, and each individual city is an li inside that ul which we need to iterate over. With the approach clear, let's write the code.
### Import the scraping and parsing libraries
import requests
from lxml import etree
import os

url = 'https://www.lianjia.com/city/'
headers = {'User-Agent': 'your headers info here; see my other articles for filled-in examples'}
res = requests.get(url=url, headers=headers).text
et = etree.HTML(res)
'''Locate the top-level container element as our entry point'''
ls = et.xpath("//div[@class='city_list']/div")
ct_js = {}  ## dict that will hold the final mapping
(3) Here we collect each city's site URL: crawl the province-level groupings first, then serialize the result to a JSON file so that later stages can look up each region's URL and crawl it directly.
for i in ls:
    ## province name
    province = i.xpath('div/text()')[0]
    ## its prefecture-level cities
    small_city_list = i.xpath('ul/li')
    ## directly administered municipalities have a single entry
    if len(small_city_list) == 1:
        href = i.xpath('ul/li/a/@href')[0]
        ct_js.update({province: href})
    else:
        ct_dt = {}
        for k in i.xpath('ul/li'):
            small_city = k.xpath('a/text()')[0]
            href = k.xpath('a/@href')[0]
            ct_dt.update({small_city: href})
        ct_js.update({province: ct_dt})
'''Serialize to JSON; json.dumps handles quoting and escaping correctly, unlike a naive quote replacement'''
import json  # imported here so this snippet stands alone

# always overwrite: appending a second JSON object to the same file would make it unparseable
with open('city_code.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(ct_js, ensure_ascii=False))
print('Export complete')
(4) After the crawl finishes, the file has roughly the following structure:
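The original screenshot of the result is not reproduced here; as a sketch, the file maps each municipality to a single URL and each province to a nested city-to-URL dict. The URLs below come from the city table later in this article, but the exact keys depend on the live page:

```json
{
  "北京": "https://bj.lianjia.com/",
  "浙江": {
    "杭州": "https://hz.lianjia.com/",
    "宁波": "https://nb.lianjia.com/"
  }
}
```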
2. Crawling the detail pages
(1) Taking Hangzhou as an example, there are three things to work out. First, the city URL: what we crawled earlier is the homepage URL, whose subdirectories include zufang, loupan, and so on, so we append /zufang/rs to reach the rental listings (if you want to see why, inspect the site's directory layout yourself, or ask in the comments). Second, the listings on each page: the list page only shows a summary, so we must extract each listing's URL tag and fetch the detail page for the full data. Third, pagination: I use xpath to read the last page number from the pager and then generate every page URL from the directory pattern, which I find quick, convenient, and hard to get wrong.
(2) Parsing the detail page itself splits into two parts: the basic facts about the listing, and the unit's amenities. (I originally wanted to also crawl the surrounding transport, e.g. how far the nearest metro station is and how many lines serve it, but that data is served by a third party, the Baidu Maps API, behind token- and cookie-based anti-scraping, so reversing it would be a lot of work. Transport access is an important factor in housing prices, but the reversing cost was too high, so I shelved it; I may look for another way to get it later.)
(3) Throughout I use the etree module from lxml with standard xpath syntax, which I find quick and convenient.
(4) With the structure analyzed we can start writing code; the imports are collected at the end of the article.
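The pagination idea from step (3) above, generating page URLs from the last page number, can be sketched as follows. The /zufang/pgN/ directory pattern and the page count used here are illustrative assumptions based on the layout described above, not live data:

```python
def build_page_urls(base_url, last_page):
    """Return the listing-page URLs for pages 1..last_page,
    following Lianjia's /zufang/pgN/ directory pattern."""
    return [f"{base_url}zufang/pg{n}/" for n in range(1, last_page + 1)]

urls = build_page_urls("https://hz.lianjia.com/", 3)
# urls[0] == "https://hz.lianjia.com/zufang/pg1/"
```

In practice `last_page` would come from an xpath read of the pager's final element, as described in step (3).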
# Two options: build the city->URL dict inline, or read it back from city_code.json.
# Defined at module level so start() below can use it directly.
code_city = {
"安庆": "https://aq.lianjia.com/",
"滁州": "https://cz.lianjia.com/",
"阜阳": "https://fy.lianjia.com/",
"合肥": "https://hf.lianjia.com/",
"马鞍山": "https://mas.lianjia.com/",
"芜湖": "https://wuhu.lianjia.com/",
"北京": "https://bj.lianjia.com/",
"重庆": "https://cq.lianjia.com/",
"福州": "https://fz.lianjia.com/",
"泉州": "https://quanzhou.lianjia.com/",
"厦门": "https://xm.lianjia.com/",
"漳州": "https://zhangzhou.lianjia.com/",
"东莞": "https://dg.lianjia.com/",
"佛山": "https://fs.lianjia.com/",
"广州": "https://gz.lianjia.com/",
"惠州": "https://hui.lianjia.com/",
"江门": "https://jiangmen.lianjia.com/",
"清远": "https://qy.lianjia.com/",
"深圳": "https://sz.lianjia.com/",
"珠海": "https://zh.lianjia.com/",
"湛江": "https://zhanjiang.lianjia.com/",
"中山": "https://zs.lianjia.com/",
"北海": "https://bh.lianjia.com/",
"防城港": "https://fcg.lianjia.com/",
"桂林": "https://gl.lianjia.com/",
"柳州": "https://liuzhou.lianjia.com/",
"南宁": "https://nn.lianjia.com/",
"贵阳": "https://gy.lianjia.com/",
"黔西南": "https://qxn.fang.lianjia.com/",
"兰州": "https://lz.lianjia.com/",
"天水": "https://tianshui.lianjia.com/",
"保定": "https://bd.lianjia.com/",
"承德": "https://chengde.lianjia.com/",
"邯郸": "https://hd.lianjia.com/",
"廊坊": "https://lf.lianjia.com/",
"秦皇岛": "https://qhd.fang.lianjia.com/",
"石家庄": "https://sjz.lianjia.com/",
"唐山": "https://ts.lianjia.com/",
"张家口": "https://zjk.lianjia.com/",
"鄂州": "https://ez.lianjia.com/",
"黄石": "https://huangshi.lianjia.com/",
"黄冈": "https://hg.lianjia.com/",
"武汉": "https://wh.lianjia.com/",
"襄阳": "https://xy.lianjia.com/",
"宜昌": "https://yichang.lianjia.com/",
"保亭": "https://bt.fang.lianjia.com/",
"澄迈": "https://cm.lianjia.com/",
"儋州": "https://dz.fang.lianjia.com/",
"海口": "https://hk.lianjia.com/",
"临高": "https://lg.fang.lianjia.com/",
"乐东": "https://ld.fang.lianjia.com/",
"陵水": "https://ls.lianjia.com/",
"琼海": "http://you.lianjia.com/qh",
"三亚": "https://san.lianjia.com/",
"五指山": "https://wzs.fang.lianjia.com/",
"文昌": "http://you.lianjia.com/wc",
"万宁": "https://wn.fang.lianjia.com/",
"长沙": "https://cs.lianjia.com/",
"常德": "https://changde.lianjia.com/",
"衡阳": "https://hy.lianjia.com/",
"湘西": "https://xx.lianjia.com/",
"岳阳": "https://yy.lianjia.com/",
"株洲": "https://zhuzhou.lianjia.com/",
"济源": "https://jiyuan.fang.lianjia.com/",
"开封": "https://kf.lianjia.com/",
"洛阳": "https://luoyang.lianjia.com/",
"平顶山": "https://pds.lianjia.com/",
"濮阳": "https://py.lianjia.com/",
"三门峡": "https://smx.fang.lianjia.com/",
"新乡": "https://xinxiang.lianjia.com/",
"许昌": "https://xc.lianjia.com/",
"郑州": "https://zz.lianjia.com/",
"周口": "https://zk.lianjia.com/",
"驻马店": "https://zmd.lianjia.com/",
"哈尔滨": "https://hrb.lianjia.com/",
"赣州": "https://ganzhou.lianjia.com/",
"九江": "https://jiujiang.lianjia.com/",
"吉安": "https://jian.lianjia.com/",
"南昌": "https://nc.lianjia.com/",
"上饶": "https://sr.lianjia.com/",
"常州": "https://changzhou.lianjia.com/",
"常熟": "https://changshu.lianjia.com/",
"丹阳": "https://danyang.lianjia.com/",
"海门": "https://haimen.lianjia.com/",
"淮安": "https://ha.lianjia.com/",
"江阴": "https://jy.lianjia.com/",
"句容": "https://jr.lianjia.com/",
"昆山": "https://ks.lianjia.com/",
"南京": "https://nj.lianjia.com/",
"南通": "https://nt.lianjia.com/",
"苏州": "https://su.lianjia.com/",
"太仓": "https://taicang.lianjia.com/",
"无锡": "https://wx.lianjia.com/",
"徐州": "https://xz.lianjia.com/",
"盐城": "https://yc.lianjia.com/",
"镇江": "https://zj.lianjia.com/",
"长春": "https://cc.lianjia.com/",
"吉林": "https://jl.lianjia.com/",
"大连": "https://dl.lianjia.com/",
"丹东": "https://dd.lianjia.com/",
"抚顺": "https://fushun.lianjia.com/",
"沈阳": "https://sy.lianjia.com/",
"包头": "https://baotou.lianjia.com/",
"巴彦淖尔": "https://byne.fang.lianjia.com/",
"赤峰": "https://cf.lianjia.com/",
"呼和浩特": "https://hhht.lianjia.com/",
"通辽": "https://tongliao.lianjia.com/",
"银川": "https://yinchuan.lianjia.com/",
"菏泽": "https://heze.lianjia.com/",
"济南": "https://jn.lianjia.com/",
"济宁": "https://jining.lianjia.com/",
"临沂": "https://linyi.lianjia.com/",
"青岛": "https://qd.lianjia.com/",
"泰安": "https://ta.lianjia.com/",
"潍坊": "https://wf.lianjia.com/",
"威海": "https://weihai.lianjia.com/",
"烟台": "https://yt.lianjia.com/",
"淄博": "https://zb.lianjia.com/",
"成都": "https://cd.lianjia.com/",
"德阳": "https://dy.lianjia.com/",
"达州": "https://dazhou.lianjia.com/",
"广元": "https://guangyuan.lianjia.com/",
"乐山": "https://leshan.fang.lianjia.com/",
"凉山": "https://liangshan.lianjia.com/",
"绵阳": "https://mianyang.lianjia.com/",
"眉山": "https://ms.fang.lianjia.com/",
"南充": "https://nanchong.lianjia.com/",
"攀枝花": "https://pzh.lianjia.com/",
"遂宁": "https://sn.lianjia.com/",
"宜宾": "https://yibin.lianjia.com/",
"雅安": "https://yaan.lianjia.com/",
"资阳": "https://ziyang.lianjia.com/",
"宝鸡": "https://baoji.lianjia.com/",
"汉中": "https://hanzhong.lianjia.com/",
"西安": "https://xa.lianjia.com/",
"咸阳": "https://xianyang.lianjia.com/",
"晋中": "https://jz.lianjia.com/",
"太原": "https://ty.lianjia.com/",
"运城": "https://yuncheng.lianjia.com/",
"上海": "https://sh.lianjia.com/",
"天津": "https://tj.lianjia.com/",
"乌鲁木齐": "https://wlmq.lianjia.com/",
"大理": "https://dali.lianjia.com/",
"昆明": "https://km.lianjia.com/",
"西双版纳": "https://xsbn.lianjia.com/",
"杭州": "https://hz.lianjia.com/",
"湖州": "https://huzhou.lianjia.com/",
"嘉兴": "https://jx.lianjia.com/",
"金华": "https://jh.lianjia.com/",
"宁波": "https://nb.lianjia.com/",
"衢州": "https://quzhou.lianjia.com/",
"绍兴": "https://sx.lianjia.com/",
"台州": "https://taizhou.lianjia.com/",
"温州": "https://wz.lianjia.com/",
"义乌": "https://yw.lianjia.com/"
}
def start():
    # Alternative: read the mapping back from the JSON file we saved earlier
    # with open('city_code.json', encoding='utf-8') as f:
    #     city_code = json.load(f)
    city_code = code_city  # already a dict; no eval() needed
    choice = input('Choose the crawl scope (one city: 1 | all cities: 2): ')
    if int(choice) == 2:
        for k, city in enumerate(city_code):
            try:
                pretreatment(url=city_code[city])
            except Exception:
                print('Error, moving on; the url was: ' + str(city_code[city]))
                continue
            print("City no. {} crawled".format(str(k)))
    elif int(choice) == 1:
        city_name = input("Enter the city to crawl (prefecture level): ")
        pretreatment(url=city_code[city_name])
(5) Note that I did not read the city URLs back from the file here; I inlined them as key-value pairs, because when crawling everything a dict literal is the quickest to work with.
erro_ls = []  # collects URLs that failed

def pretreatment(url):
    ## erro_ls is module-level; declare it global so this function can modify it
    global erro_ls
    url_1 = url + 'zufang/rs/'
    url_ls = []
    headers = {'Referer': 'https://lianjia.com/',
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
    res = requests.get(url=url_1, headers=headers, verify=False)
    et = etree.HTML(res.text)
    #### detail-page URLs on the first page
    parent_element = et.xpath("//div[@class='content__list--item--main']/p[1]/a/@href")
    jvti_address = et.xpath("//div[@class='content__list--item--main']/p[2]/a/text()")  # district/address text, kept for later use
    for i in parent_element:
        ### hrefs containing 'apartment' lead to blank or error pages; skip them
        if 'apartment' in i:
            continue
        url_ls.append(url + 'zufang' + i)
    ### collect the remaining page URLs in bulk
    page_ls = et.xpath("//div[@class='content w1150']/div[1]/ul[2]/li/a/@href")
    first = True
    for page in page_ls:
        url_2 = url + 'zufang' + page + '#contentList'
        res = requests.get(url=url_2, headers=headers)
        ht = etree.HTML(res.text)
        parent_element = ht.xpath("//div[@class='content__list--item--main']/p[1]/a/@href")
        url_ls1 = []
        for Subelement in parent_element:
            if 'apartment' in Subelement:
                continue
            url_ls1.append(url + Subelement)
        if first:
            ## fold the first page's listings into the first batch
            url_ls1.extend(url_ls)
            first = False
        send_data(url_ls1)  # pass the whole batch, not one URL at a time
    if not page_ls:
        ## only one page of results: send it directly
        send_data(url_ls)
(6) Parsing the inside of each detail page
def send_data(url_ls):
    cont_ls = []
    headers = {
        # paste your own Cookie here, copied from the browser dev tools (the author's expired session value is omitted)
        "Cookie": "your cookie string",
        'Host': 'gy.lianjia.com',  # note: Host must match the subdomain of the city you are crawling
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
    for url in url_ls:
        res = requests.get(url=url, headers=headers)
        time.sleep(1)  # throttle so we do not hammer the site
        data_content = etree.HTML(res.text)
        ## title
        title_untreated = data_content.xpath("//p[@class='content__title']/text()")
        print('len(title_untreated) = ' + str(len(title_untreated)))
        ## price
        price_untreated = data_content.xpath("//div[@class='content__aside--title']/span/text()")
        print('len(price_untreated) = ' + str(len(price_untreated)))
        ## price unit
        price_unit_untreated = data_content.xpath("//div[@class='content__aside--title']/text()[2]")
        print('len(price_unit_untreated) = ' + str(len(price_unit_untreated)))
        ## basic info list
        basic_info = data_content.xpath("//div[@class='content__article__info']/ul[1]/li/text()")
        if len(title_untreated) == 0 or len(price_untreated) == 0 or len(price_unit_untreated) == 0:
            ## page failed to parse (blocked or empty); log the URL and move on
            print(url)
        else:
            title = title_untreated[0].strip().split(' ')
            price = price_untreated[0].replace('\n', '').replace(' ', '')
            price_unit = price_unit_untreated[0].replace('\n', '').replace(' ', '')
            Security_deposit = data_content.xpath("//div[@class='table_wrapper ']/div/ul/li[3]/text()")
            Service_charge = data_content.xpath("//div[@class='table_wrapper ']/div/ul/li[4]/text()")
            Agency_fee = data_content.xpath("//div[@class='table_wrapper ']/div/ul/li[5]/text()")
            area = basic_info[1].replace("面积:", '')
            toward = basic_info[2].replace("朝向:", '')
            Maintenance_time = basic_info[4].replace("维护:", '')
            Check_in = basic_info[5].replace("入住:", '')
            floor = basic_info[7].replace("楼层:", '')
            lift = basic_info[8].replace("电梯:", '')
            Parking_space = basic_info[10].replace("车位:", '')
            water = basic_info[11].replace("用水:", '')
            gas = basic_info[14].replace("燃气:", '')
            electricity = basic_info[13].replace("用电:", '')
            heating_method = basic_info[16].replace("采暖:", '')
            row = ("-".join(title), price, price_unit, "".join(Security_deposit), "".join(Service_charge),
                   "".join(Agency_fee), area, toward, Maintenance_time, Check_in, floor, lift, Parking_space,
                   water, gas, electricity, heating_method)
            print(row)
            cont_ls.append(row)
    data_save(cont_ls)  # defined in the data-storage section below
3. Data storage
(1) Once extracted, the data needs to go into a database or a CSV. I store it in MySQL, because the modeling stage may run for a long time and I want the analysis data served first; you can always export a CSV from the database afterwards.
def data_save(data):
    print('Batch scraped, saving...')
    my = mysql.connector.connect(host='127.0.0.1', user='xxxx', passwd='xxxx', database='lianjia_data',
                                 auth_plugin='mysql_native_password')
    con = my.cursor()
    sql = 'insert into lianjia_house(title,price,price_unit,Security_deposit,Service_charge,Agency_fee,area,toward,Maintenance_time,Check_in,floor,lift,Parking_space,water,gas,electricity,heating_method) values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s);'
    con.executemany(sql, data)
    my.commit()
    con.close()
    my.close()
    print('Insert complete')
III. Complete code
Given the data volume, we want to run multiple threads. Before turning multithreading on, tune the thread count to your machine's capacity, and make sure the database's connection timeout, maximum connections, and thread settings can keep up; on Linux these live in /etc/my.cnf. The complete code follows.
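Before the code, here is an illustrative sketch of the my.cnf knobs mentioned above. The values are assumptions to tune for your own hardware and thread count, not recommendations:

```ini
# /etc/my.cnf — illustrative values only
[mysqld]
max_connections    = 200   # enough headroom for all crawler threads
wait_timeout       = 600   # seconds before an idle connection is dropped
interactive_timeout = 600
max_allowed_packet = 64M   # room for large batched inserts
```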
import time
import requests
import json
import mysql.connector
from lxml import etree
import threading

# number of worker threads; 30 is aggressive, lower it if the database starts timing out (see Common errors)
num_threads = 30
erro_ls = []  # collects URLs that failed

def pretreatment(url):
    global erro_ls
    url_1 = url + 'zufang/rs/'
    url_ls = []
    headers = {'Referer': 'https://lianjia.com/',
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
    res = requests.get(url=url_1, headers=headers, verify=False)
    et = etree.HTML(res.text)
    parent_element = et.xpath("//div[@class='content__list--item--main']/p[1]/a/@href")
    jvti_address = et.xpath("//div[@class='content__list--item--main']/p[2]/a/text()")  # district/address text, kept for later use
    for i in parent_element:
        if 'apartment' in i:
            continue
        url_ls.append(url + 'zufang' + i)
    page_ls = et.xpath("//div[@class='content w1150']/div[1]/ul[2]/li/a/@href")
    url_ls1 = []
    for page in page_ls:
        url_2 = url + 'zufang' + page + '#contentList'
        res = requests.get(url=url_2, headers=headers)
        ht = etree.HTML(res.text)
        parent_element = ht.xpath("//div[@class='content__list--item--main']/p[1]/a/@href")
        for Subelement in parent_element:
            if 'apartment' in Subelement:
                continue
            url_ls1.append(url + Subelement)
    url_ls1.extend(url_ls)  # include the first page's listings
    # split the work across the worker threads
    url_chunks = chunkify(url_ls1, num_threads)
    threads = []
    for i in range(num_threads):
        t = threading.Thread(target=send_data, args=(url_chunks[i],))
        threads.append(t)
        t.start()
    # wait for every thread to finish
    for t in threads:
        t.join()

def chunkify(lst, n):
    ## deal the list round-robin into n roughly equal chunks
    return [lst[i::n] for i in range(n)]
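A quick check of what chunkify produces, with toy values; the round-robin slicing means chunk sizes differ by at most one and no element is lost:

```python
def chunkify(lst, n):
    # deal the list round-robin into n roughly equal chunks
    return [lst[i::n] for i in range(n)]

chunks = chunkify([1, 2, 3, 4, 5, 6, 7], 3)
print(chunks)  # [[1, 4, 7], [2, 5], [3, 6]]
```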
# The rest is the same as before
def send_data(url_ls):
    cont_ls = []
    headers = {
        # paste your own Cookie here, copied from the browser dev tools (the author's expired session value is omitted)
        "Cookie": "your cookie string",
        'Host': 'gy.lianjia.com',  # note: Host must match the subdomain of the city you are crawling
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}
    for url in url_ls:
        res = requests.get(url=url, headers=headers)
        time.sleep(1)  # throttle so we do not hammer the site
        data_content = etree.HTML(res.text)
        ## title
        title_untreated = data_content.xpath("//p[@class='content__title']/text()")
        print('len(title_untreated) = ' + str(len(title_untreated)))
        ## price
        price_untreated = data_content.xpath("//div[@class='content__aside--title']/span/text()")
        print('len(price_untreated) = ' + str(len(price_untreated)))
        ## price unit
        price_unit_untreated = data_content.xpath("//div[@class='content__aside--title']/text()[2]")
        print('len(price_unit_untreated) = ' + str(len(price_unit_untreated)))
        ## basic info list
        basic_info = data_content.xpath("//div[@class='content__article__info']/ul[1]/li/text()")
        if len(title_untreated) == 0 or len(price_untreated) == 0 or len(price_unit_untreated) == 0:
            ## page failed to parse (blocked or empty); log the URL and move on
            print(url)
        else:
            title = title_untreated[0].strip().split(' ')
            price = price_untreated[0].replace('\n', '').replace(' ', '')
            price_unit = price_unit_untreated[0].replace('\n', '').replace(' ', '')
            Security_deposit = data_content.xpath("//div[@class='table_wrapper ']/div/ul/li[3]/text()")
            Service_charge = data_content.xpath("//div[@class='table_wrapper ']/div/ul/li[4]/text()")
            Agency_fee = data_content.xpath("//div[@class='table_wrapper ']/div/ul/li[5]/text()")
            area = basic_info[1].replace("面积:", '')
            toward = basic_info[2].replace("朝向:", '')
            Maintenance_time = basic_info[4].replace("维护:", '')
            Check_in = basic_info[5].replace("入住:", '')
            floor = basic_info[7].replace("楼层:", '')
            lift = basic_info[8].replace("电梯:", '')
            Parking_space = basic_info[10].replace("车位:", '')
            water = basic_info[11].replace("用水:", '')
            gas = basic_info[14].replace("燃气:", '')
            electricity = basic_info[13].replace("用电:", '')
            heating_method = basic_info[16].replace("采暖:", '')
            row = ("-".join(title), price, price_unit, "".join(Security_deposit), "".join(Service_charge),
                   "".join(Agency_fee), area, toward, Maintenance_time, Check_in, floor, lift, Parking_space,
                   water, gas, electricity, heating_method)
            print(row)
            cont_ls.append(row)
    data_save(cont_ls)  # insert this batch into MySQL
def data_save(data):
    print('Batch scraped, saving...')
    my = mysql.connector.connect(host='127.0.0.1', user='xxxx', passwd='xxxx', database='lianjia_data',
                                 auth_plugin='mysql_native_password')
    con = my.cursor()
    sql = 'insert into lianjia_house(title,price,price_unit,Security_deposit,Service_charge,Agency_fee,area,toward,Maintenance_time,Check_in,floor,lift,Parking_space,water,gas,electricity,heating_method) values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s);'
    con.executemany(sql, data)
    my.commit()
    con.close()
    my.close()
    print('Insert complete')
# Two options: build the city->URL dict inline, or read it back from city_code.json.
# Defined at module level so start() below can use it directly.
code_city = {
"安庆": "https://aq.lianjia.com/",
"滁州": "https://cz.lianjia.com/",
"阜阳": "https://fy.lianjia.com/",
"合肥": "https://hf.lianjia.com/",
"马鞍山": "https://mas.lianjia.com/",
"芜湖": "https://wuhu.lianjia.com/",
"北京": "https://bj.lianjia.com/",
"重庆": "https://cq.lianjia.com/",
"福州": "https://fz.lianjia.com/",
"泉州": "https://quanzhou.lianjia.com/",
"厦门": "https://xm.lianjia.com/",
"漳州": "https://zhangzhou.lianjia.com/",
"东莞": "https://dg.lianjia.com/",
"佛山": "https://fs.lianjia.com/",
"广州": "https://gz.lianjia.com/",
"惠州": "https://hui.lianjia.com/",
"江门": "https://jiangmen.lianjia.com/",
"清远": "https://qy.lianjia.com/",
"深圳": "https://sz.lianjia.com/",
"珠海": "https://zh.lianjia.com/",
"湛江": "https://zhanjiang.lianjia.com/",
"中山": "https://zs.lianjia.com/",
"北海": "https://bh.lianjia.com/",
"防城港": "https://fcg.lianjia.com/",
"桂林": "https://gl.lianjia.com/",
"柳州": "https://liuzhou.lianjia.com/",
"南宁": "https://nn.lianjia.com/",
"贵阳": "https://gy.lianjia.com/",
"黔西南": "https://qxn.fang.lianjia.com/",
"兰州": "https://lz.lianjia.com/",
"天水": "https://tianshui.lianjia.com/",
"保定": "https://bd.lianjia.com/",
"承德": "https://chengde.lianjia.com/",
"邯郸": "https://hd.lianjia.com/",
"廊坊": "https://lf.lianjia.com/",
"秦皇岛": "https://qhd.fang.lianjia.com/",
"石家庄": "https://sjz.lianjia.com/",
"唐山": "https://ts.lianjia.com/",
"张家口": "https://zjk.lianjia.com/",
"鄂州": "https://ez.lianjia.com/",
"黄石": "https://huangshi.lianjia.com/",
"黄冈": "https://hg.lianjia.com/",
"武汉": "https://wh.lianjia.com/",
"襄阳": "https://xy.lianjia.com/",
"宜昌": "https://yichang.lianjia.com/",
"保亭": "https://bt.fang.lianjia.com/",
"澄迈": "https://cm.lianjia.com/",
"儋州": "https://dz.fang.lianjia.com/",
"海口": "https://hk.lianjia.com/",
"临高": "https://lg.fang.lianjia.com/",
"乐东": "https://ld.fang.lianjia.com/",
"陵水": "https://ls.lianjia.com/",
"琼海": "http://you.lianjia.com/qh",
"三亚": "https://san.lianjia.com/",
"五指山": "https://wzs.fang.lianjia.com/",
"文昌": "http://you.lianjia.com/wc",
"万宁": "https://wn.fang.lianjia.com/",
"长沙": "https://cs.lianjia.com/",
"常德": "https://changde.lianjia.com/",
"衡阳": "https://hy.lianjia.com/",
"湘西": "https://xx.lianjia.com/",
"岳阳": "https://yy.lianjia.com/",
"株洲": "https://zhuzhou.lianjia.com/",
"济源": "https://jiyuan.fang.lianjia.com/",
"开封": "https://kf.lianjia.com/",
"洛阳": "https://luoyang.lianjia.com/",
"平顶山": "https://pds.lianjia.com/",
"濮阳": "https://py.lianjia.com/",
"三门峡": "https://smx.fang.lianjia.com/",
"新乡": "https://xinxiang.lianjia.com/",
"许昌": "https://xc.lianjia.com/",
"郑州": "https://zz.lianjia.com/",
"周口": "https://zk.lianjia.com/",
"驻马店": "https://zmd.lianjia.com/",
"哈尔滨": "https://hrb.lianjia.com/",
"赣州": "https://ganzhou.lianjia.com/",
"九江": "https://jiujiang.lianjia.com/",
"吉安": "https://jian.lianjia.com/",
"南昌": "https://nc.lianjia.com/",
"上饶": "https://sr.lianjia.com/",
"常州": "https://changzhou.lianjia.com/",
"常熟": "https://changshu.lianjia.com/",
"丹阳": "https://danyang.lianjia.com/",
"海门": "https://haimen.lianjia.com/",
"淮安": "https://ha.lianjia.com/",
"江阴": "https://jy.lianjia.com/",
"句容": "https://jr.lianjia.com/",
"昆山": "https://ks.lianjia.com/",
"南京": "https://nj.lianjia.com/",
"南通": "https://nt.lianjia.com/",
"苏州": "https://su.lianjia.com/",
"太仓": "https://taicang.lianjia.com/",
"无锡": "https://wx.lianjia.com/",
"徐州": "https://xz.lianjia.com/",
"盐城": "https://yc.lianjia.com/",
"镇江": "https://zj.lianjia.com/",
"长春": "https://cc.lianjia.com/",
"吉林": "https://jl.lianjia.com/",
"大连": "https://dl.lianjia.com/",
"丹东": "https://dd.lianjia.com/",
"抚顺": "https://fushun.lianjia.com/",
"沈阳": "https://sy.lianjia.com/",
"包头": "https://baotou.lianjia.com/",
"巴彦淖尔": "https://byne.fang.lianjia.com/",
"赤峰": "https://cf.lianjia.com/",
"呼和浩特": "https://hhht.lianjia.com/",
"通辽": "https://tongliao.lianjia.com/",
"银川": "https://yinchuan.lianjia.com/",
"菏泽": "https://heze.lianjia.com/",
"济南": "https://jn.lianjia.com/",
"济宁": "https://jining.lianjia.com/",
"临沂": "https://linyi.lianjia.com/",
"青岛": "https://qd.lianjia.com/",
"泰安": "https://ta.lianjia.com/",
"潍坊": "https://wf.lianjia.com/",
"威海": "https://weihai.lianjia.com/",
"烟台": "https://yt.lianjia.com/",
"淄博": "https://zb.lianjia.com/",
"成都": "https://cd.lianjia.com/",
"德阳": "https://dy.lianjia.com/",
"达州": "https://dazhou.lianjia.com/",
"广元": "https://guangyuan.lianjia.com/",
"乐山": "https://leshan.fang.lianjia.com/",
"凉山": "https://liangshan.lianjia.com/",
"绵阳": "https://mianyang.lianjia.com/",
"眉山": "https://ms.fang.lianjia.com/",
"南充": "https://nanchong.lianjia.com/",
"攀枝花": "https://pzh.lianjia.com/",
"遂宁": "https://sn.lianjia.com/",
"宜宾": "https://yibin.lianjia.com/",
"雅安": "https://yaan.lianjia.com/",
"资阳": "https://ziyang.lianjia.com/",
"宝鸡": "https://baoji.lianjia.com/",
"汉中": "https://hanzhong.lianjia.com/",
"西安": "https://xa.lianjia.com/",
"咸阳": "https://xianyang.lianjia.com/",
"晋中": "https://jz.lianjia.com/",
"太原": "https://ty.lianjia.com/",
"运城": "https://yuncheng.lianjia.com/",
"上海": "https://sh.lianjia.com/",
"天津": "https://tj.lianjia.com/",
"乌鲁木齐": "https://wlmq.lianjia.com/",
"大理": "https://dali.lianjia.com/",
"昆明": "https://km.lianjia.com/",
"西双版纳": "https://xsbn.lianjia.com/",
"杭州": "https://hz.lianjia.com/",
"湖州": "https://huzhou.lianjia.com/",
"嘉兴": "https://jx.lianjia.com/",
"金华": "https://jh.lianjia.com/",
"宁波": "https://nb.lianjia.com/",
"衢州": "https://quzhou.lianjia.com/",
"绍兴": "https://sx.lianjia.com/",
"台州": "https://taizhou.lianjia.com/",
"温州": "https://wz.lianjia.com/",
"义乌": "https://yw.lianjia.com/"
}
def start():
    # Alternative: read the mapping back from the JSON file we saved earlier
    # with open('city_code.json', encoding='utf-8') as f:
    #     city_code = json.load(f)
    city_code = code_city  # already a dict; no eval() needed
    choice = input('Choose the crawl scope (one city: 1 | all cities: 2): ')
    if int(choice) == 2:
        for k, city in enumerate(city_code):
            try:
                pretreatment(url=city_code[city])
            except Exception:
                print('Error, moving on; the url was: ' + str(city_code[city]))
                continue
            print("City no. {} crawled".format(str(k)))
    elif int(choice) == 1:
        city_name = input("Enter the city to crawl (prefecture level): ")
        pretreatment(url=city_code[city_name])
if __name__ == '__main__':
    start()
IV. Common errors
1. Error: 1159 (08S01): Got timeout reading communication packets — the concurrency is too high and the database cannot keep up. The most direct fix is to lower the thread count in the code.
Related article: setting the maximum number of connections
Related article: setting the maximum concurrency
Related article: setting the session timeout
2. mysql.connector.errors.ProgrammingError: Failed processing format-parameters; Python '_ElementUnicodeResult' cannot be converted to a MySQL type — a value passed to the insert is not a type the connector can convert. lxml's xpath() returns _ElementUnicodeResult objects rather than plain strings, so wrap any value that has not already been through replace()/join() in str() before inserting, then retry.
3. Errors 10060 and 10061 are Windows socket errors (connection timed out and connection refused), meaning the database did not respond in time. Don't worry too much about these: with a sensible connection wait timeout configured, the client keeps reconnecting and no data is lost.
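For transient failures like the ones above, one simple pattern (not part of the original code) is a small retry wrapper; a minimal sketch, where the `requests.get` call in the usage comment is the intended target:

```python
import time

def with_retry(fn, attempts=3, delay=1):
    """Call fn(); on exception, wait `delay` seconds and retry,
    up to `attempts` total tries, re-raising the last error."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# usage (hypothetical): with_retry(lambda: requests.get(url, headers=headers))
```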