import csv
import time

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    'Host': 'www.gzfdcyw.com',
    'Cookie': 'PHPSESSID=3oma2c8os1415s0356p0a2i494; UM_distinctid=1732eb7b5ba606-0bc64b4e5a5fec-4353760-1fa400-1732eb7b5bb600; CNZZDATA1262416730=975856371-1594216659-%7C1594216659',
    'Connection': 'keep-alive'
}


def save_data(row):
    # Append one row to the CSV. The keyword is `errors`, not `error`;
    # characters that GBK cannot encode are silently dropped.
    with open('赣州成交二手房.csv', 'a', encoding='GBK', newline='', errors='ignore') as f:
        csv.writer(f).writerow(row)


def parse_url(url):
    response = requests.get(url, headers=headers).text
    html = etree.HTML(response)
    # Rows 2 to 31 are read; row 1 is presumably the table header.
    for i in range(2, 32):
        # Collect every text node under the row once; the cell values
        # sit at the odd indices, with whitespace nodes in between.
        texts = html.xpath('//tr[{}]//text()'.format(i))
        if len(texts) < 8:
            # Fewer rows than expected (e.g. a short last page): stop here.
            break
        date, deal_num, deal_area, deal_type = texts[1], texts[3], texts[5], texts[7]
        row = [date, deal_num, deal_area, deal_type]
        print(row)
        save_data(row)


def main():
    for i in range(1, 42):
        url = 'http://www.gzfdcyw.com/Deal/index/p/{}.html'.format(i)
        parse_url(url)
        print('Finished page {}'.format(i))
        time.sleep(1)  # Pause between pages to go easy on the server.


if __name__ == '__main__':
    main()
```
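Unlike the Dongguan housing authority site, the Ganzhou housing authority page wraps its data in a `tbody` element. I could not read that element directly with XPath, which was a bit of a hassle, so I fell back on `//text()` instead. I still don't know what the difference between `//text()` and `/text()` is, though. Corrections are welcome!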
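
On the `//text()` versus `/text()` question, a minimal sketch on a made-up table fragment may help (the markup below is hypothetical, not the real page). `/text()` returns only the text nodes that are direct children of the matched element, while `//text()` returns text nodes at any depth beneath it. The last lines also illustrate one common reason a `tbody` path fails, assuming the raw HTML has no `tbody` tag: browsers insert `tbody` into the DOM they display, but lxml parses the source as sent and does not add it.

```python
from lxml import etree

# Hypothetical fragment for illustration only; not the real page markup.
snippet = '<table><tr><td>2020-07-08</td><td><b>12</b> units</td></tr></table>'
html = etree.HTML(snippet)

# /text(): only text nodes that are direct children of the second cell.
# The "12" lives inside <b>, so it is not included.
direct = html.xpath('//tr[1]/td[2]/text()')

# //text(): text nodes at any depth below the second cell.
nested = html.xpath('//tr[1]/td[2]//text()')

# The same contrast at row level: <tr> itself holds no text here,
# but its descendants do.
row_direct = html.xpath('//tr[1]/text()')
row_nested = html.xpath('//tr[1]//text()')

# A path copied from browser dev tools often contains tbody; it matches
# nothing when the source HTML has no tbody tag.
with_tbody = html.xpath('//table/tbody/tr')
without_tbody = html.xpath('//table//tr')

print(direct, nested, row_direct, row_nested)
print(len(with_tbody), len(without_tbody))
```

If the real page does have whitespace between its cells, those whitespace runs would also show up as text nodes under `//tr[n]//text()`, which would explain why the script above picks the values at indices 1, 3, 5 and 7.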
Ganzhou Housing Authority Transaction Scraper