Preface
The World Wide Web holds countless pages and a vast amount of information, pervasive and all-encompassing. Often, whether for data analysis or for product requirements, we need to extract the content we care about from certain sites. But even twenty-first-century humans still have only two hands and one pair of eyes; we cannot visit every page, read it, and copy-paste by hand. What we need is a program that automatically fetches web pages and extracts the relevant content according to specified rules: that is a web crawler.
1. What is crawler technology?
Simply put, a crawler is a probe. Its basic job is to mimic human behavior: wander from site to site, hop from one link to the next, look up data, and send what it finds back, like a spider tirelessly crawling across the great web of the Internet.
2. Usage steps
1. Import the libraries
import requests                       # HTTP requests
from pyquery import PyQuery as pq     # CSS-selector based HTML parsing
from fake_useragent import UserAgent  # random User-Agent strings
import time                           # polite delays between requests
import random
import pandas as pd                   # tabular data / CSV output
import pymysql                        # MySQL driver (imported but unused below)
2. Scrape the data
1. Request headers
UA = UserAgent()
headers = {
}  # the User-Agent field is filled in per request (see parse_url below)
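fake_useragent downloads a pool of real browser User-Agent strings and hands out a random one on each access. If that library is unavailable, a minimal dependency-free sketch is to rotate a small hand-maintained list instead (the strings below are illustrative, not tied to any particular browser release):

```python
import random

# A small hand-maintained pool of User-Agent strings (illustrative values)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
]

def random_headers():
    """Build a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

print(random_headers())
```

Rotating the User-Agent makes consecutive requests look less like a single automated client, which is the same purpose `UA.chrome` serves in the crawler below.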
2. Making the request with requests.get()
requests.get() fetches the given URL with the HTTP GET method and returns a Response object, whose .text attribute holds the page's HTML.
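To see exactly what requests.get() would send, without touching the network, you can assemble the same request by hand with requests.Request and prepare(). A sketch (the URL is the listing URL used later in this article; the params entry is only there to show how query parameters are encoded):

```python
import requests

# Build (but do not send) the GET request that requests.get() would issue
req = requests.Request(
    "GET",
    "https://nn.lianjia.com/ershoufang/pg1/",
    headers={"User-Agent": "Mozilla/5.0"},
    params={"foo": "bar"},  # query parameters are URL-encoded into the request line
)
prepared = req.prepare()

print(prepared.method)                 # GET
print(prepared.url)                    # https://nn.lianjia.com/ershoufang/pg1/?foo=bar
print(prepared.headers["User-Agent"])  # Mozilla/5.0
```

requests.get(url, headers=..., params=...) does the prepare-and-send in one call; the crawler below only ever needs the headers argument.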
3. Full code
import requests
from pyquery import PyQuery as pq
from fake_useragent import UserAgent
import time
import random
import pandas as pd
import pymysql  # unused here: results go to CSV instead (see Summary)

UA = UserAgent()
headers = {
}
num_page = 2  # (unused below; run() crawls a fixed 100 pages)


class Lianjia_Crawer:
    def __init__(self, txt_path):
        super(Lianjia_Crawer, self).__init__()
        self.file = str(txt_path)
        self.df = pd.DataFrame(columns=['title', 'community', 'citydirct',
                                        'houseinfo', 'dateinfo', 'taglist',
                                        'totalprice', 'unitprice'])

    def run(self):
        '''Entry point: crawl the listing pages one by one.'''
        for i in range(1, 101):  # Lianjia paginates as /ershoufang/pgN/
            url = "https://nn.lianjia.com/ershoufang/pg{}/".format(str(i))
            print('Crawling url: {}'.format(url))
            self.parse_url(url)
            time.sleep(2)  # be polite: pause between page requests
            self.df.to_csv(self.file, encoding='utf-8')  # checkpoint after each page
        print('Crawl finished!')

    def parse_url(self, url):
        headers['User-Agent'] = UA.chrome  # fresh random Chrome UA per request
        res = requests.get(url, headers=headers)
        doc = pq(res.text)
        for i in doc('.clear.LOGCLICKDATA .info.clear'):
            try:
                pq_i = pq(i)
                # strip the "必看好房" promo tag from titles
                title = pq_i('.title').text().replace('必看好房', '')
                Community = pq_i('.flood .positionInfo a').text()
                HouseInfo = pq_i('.address .houseInfo').text()
                DateInfo = pq_i('.followInfo').text()
                TagList = pq_i('.tag').text()
                # strip the units: 万 (10,000 CNY) and 元/平 (CNY per m²)
                TotalPrice = pq_i('.priceInfo .totalPrice').text().replace('万', '')
                UnitPrice = pq_i('.priceInfo .unitPrice').text().replace('元/平', '')
                # positionInfo holds "community district"; split it into two fields
                CityDirct = str(Community).split(' ')[-1]
                Community = str(Community).split(' ')[0]
                data_dict = {
                    'title': title,
                    'community': Community,
                    'citydirct': CityDirct,
                    'houseinfo': HouseInfo,
                    'dateinfo': DateInfo,
                    'taglist': TagList,
                    'totalprice': TotalPrice,
                    'unitprice': UnitPrice,
                }
                print(Community, CityDirct)
                # DataFrame.append() was removed in pandas 2.0; use pd.concat
                self.df = pd.concat([self.df, pd.DataFrame([data_dict])],
                                    ignore_index=True)
                print([title, Community, CityDirct, HouseInfo, DateInfo,
                       TagList, TotalPrice, UnitPrice])
                self.df.to_csv(self.file, encoding='utf-8')  # checkpoint per row
            except Exception as e:
                print(e)
                print("Field extraction failed for this listing, please retry!")


if __name__ == "__main__":
    txt_path = "E:/pythonProject/LianJia/aaaa/ershoufang_lianjiaa.csv"
    Crawer = Lianjia_Crawer(txt_path)
    Crawer.run()  # start the crawler
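One performance note: growing a DataFrame row by row (whether via the removed DataFrame.append or via pd.concat) copies the whole frame on every insertion. A cheaper pattern is to collect the row dicts in a plain list and build the DataFrame once at the end. A sketch with made-up listing data:

```python
import pandas as pd

rows = []  # collect one dict per listing, then build the DataFrame once
for community, price in [("小区A", "120"), ("小区B", "95")]:  # made-up data
    rows.append({"community": community, "totalprice": price})

df = pd.DataFrame(rows, columns=["community", "totalprice"])
df.to_csv("ershoufang_demo.csv", encoding="utf-8", index=False)
print(len(df))  # 2
```

For a crawl of a few thousand rows either approach works, but the list-of-dicts version also makes per-page checkpointing simpler: write the list out once per page instead of once per row.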
Summary
The results are saved to a CSV file rather than written into a database.
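Since pymysql is already imported, a natural next step is to write the rows into MySQL as well. A hedged sketch, assuming a local MySQL server with a database lianjia and a table ershoufang whose columns match the DataFrame; none of these exist in the original script, and the connection details are placeholders:

```python
# Column order must match the DataFrame built by the crawler
COLUMNS = ['title', 'community', 'citydirct', 'houseinfo',
           'dateinfo', 'taglist', 'totalprice', 'unitprice']
INSERT_SQL = "INSERT INTO ershoufang ({}) VALUES ({})".format(
    ", ".join(COLUMNS), ", ".join(["%s"] * len(COLUMNS)))

def save_to_mysql(df, host="localhost", user="root", password="", db="lianjia"):
    """Write every DataFrame row into MySQL (all connection details are assumptions)."""
    import pymysql  # imported lazily so the CSV-only path works without the driver
    conn = pymysql.connect(host=host, user=user, password=password,
                           database=db, charset="utf8mb4")
    try:
        with conn.cursor() as cur:
            # executemany runs the parameterized INSERT once per row
            cur.executemany(INSERT_SQL, df[COLUMNS].values.tolist())
        conn.commit()
    finally:
        conn.close()

print(INSERT_SQL)
```

Using %s placeholders and executemany lets the driver escape the values, which matters here because listing titles can contain commas and quotes that would corrupt hand-built SQL (and already complicate the CSV output).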