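Overview
- A Python crawler that parses pages with bs4. It uses no crawling framework, only the requests library for fetching. It scrapes new-home listings from Lianjia and saves them to a MySQL database. Image URLs are stored in the database as well, while the image files themselves are saved locally. I also wrote a GUI that visualizes the data as a pie chart.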
Screenshots
- Console output (screenshot)
- Pie chart (screenshot)
- Database (screenshot)
- Images saved locally (screenshot)
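Implementation
- Decide on the target link. Mine is this one: Lianjia New Homes, Guangzhou
- Next, the dependencies. I installed bs4, pymysql, and matplotlib (requests is needed for fetching as well). Install whichever ones you are missing. The code follows, section by section.
- Configure the headers and initialize the project: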
import requests
from bs4 import BeautifulSoup

# Request headers sent with every page fetch, mimicking a desktop browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
    'Accept': 'image/webp,image/*,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'http://www.baidu.com/link?url=_andhfsjjjKRgEWkj7i9cFmYYGsisrnm2A-TN3XZDQXxvGsM9k9ZZSnikW2Yds4s&wd=&eqid=c3435a7d00006bd600000003582bfd1f',
    'Connection': 'keep-alive'}
page = 'pg'
hlist = []
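The `page` value above looks like a pagination prefix, so a rough sketch of how the per-page URLs might be built is shown here. Everything in it is an assumption made for illustration: `BASE_URL` is a placeholder (the real link is not reproduced in this post), `build_page_urls` is a hypothetical helper, and the `/pgN/` URL scheme is a guess based on the prefix.

```python
# Hypothetical placeholder: substitute the listing URL you are crawling.
BASE_URL = "https://example.com/loupan/"
page = "pg"  # same prefix as in the initialization code


def build_page_urls(base_url, prefix, first, last):
    """Return the URLs for result pages first..last (inclusive).

    Assumes pages are addressed as <base>/<prefix><n>/, which is a guess.
    """
    base = base_url.rstrip("/")
    return [f"{base}/{prefix}{n}/" for n in range(first, last + 1)]


urls = build_page_urls(BASE_URL, page, 1, 3)
for url in urls:
    print(url)
```

Each URL could then be fetched with something like `requests.get(url, headers=headers, timeout=10).text` and the resulting HTML passed on for parsing.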
- Parse the information on each listing page:
def listinfo(listhtml):
    # Parse one listing page and extract the fields of each development
    areasoup = BeautifulSoup(listhtml, 'html.parser')
    ljhouse = areasoup.find_all('div', attrs={
        'class': 'resblock-desc-wrapper'})
    loupanimg = areasoup.find_all("img", attrs={
        "class": "lj-lazy"})
    i = 0
    for house in ljhouse:
        loupantitle = house.find("div", attrs={
            "class": "resblock-name"})
        loupanname = loupantitle.a.get_text()
        loupantag = loupantitle.find_all("span")
        wuye = loupantag[0].get_text()  # property type
        xiaoshouzhuangtai = loupantag[1].get_text()  # sales status
        location = house.find("div", attrs={
            "class": "resblock-location"}).get_text()
        jishi = house.find("a", attrs={
            "class": "resblock-room"}).get_text()  # room layout
        area = house.find("div", attrs={
            "class": "resblock-area"})
        sarea = area.find("span").get_text()
        r_area = '暂无'  # default value, meaning "not available"
        if sarea != '':
            r_area = area.get_text().split()[1]
        tag = house.find("div", attrs={
            "class": "resblock-tag"}).get_text()
        jiage = house.find("div", attrs={
            "class": "resblock-price"})
        price = jiage.find("div", attrs={
            "class": "main-price"}).get_text().split()[0]  # keep only the number
        if '-' in price:
            # For a range such as "a-b", keep the part after the dash
            price = price.split('-')[1]
total