Python Web Scraping Basics: Writing a Scraper as a Set of Functions

Latest recommended article published on 2023-07-24 11:06:29

㤅uu



Article link: https://blog.csdn.net/m0_65050363/article/details/129877205


1. Import the required modules (packages)

import requests
import csv
from bs4 import BeautifulSoup
from tqdm import tqdm

2. A function that sends the request, receives the response, and returns the page source

def get_response(link: str) -> str:
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
    }
    response = requests.get(url=link, headers=headers)
    # Return the page source only on HTTP 200; otherwise return an empty string
    return response.text if response.status_code == 200 else ''
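
As written, get_response has no timeout and lets network errors propagate, so one failed request would stop the whole run. The following is a minimal sketch of a more defensive variant, not part of the original article: the name get_response_safe and the timeout parameter are hypothetical, and it assumes that returning an empty string on any request error is acceptable to the caller.

```python
import requests

# Same browser-like User-Agent header as in get_response
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/110.0.0.0 Safari/537.36'
}


def get_response_safe(link: str, timeout: float = 10.0) -> str:
    """Hypothetical variant of get_response: returns '' instead of raising."""
    try:
        # timeout stops the call from waiting indefinitely on a stalled server
        response = requests.get(link, headers=HEADERS, timeout=timeout)
    except requests.RequestException:
        # Covers connection errors, timeouts, and malformed URLs
        return ''
    return response.text if response.status_code == 200 else ''
```

Because an empty string parses to an empty tree, a page that fails this way simply contributes no rows further down the pipeline.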

3. A function that extracts the page information

def get_data(html_tree):
    # 1. Select the <li> tags that hold the second-hand housing listings on this page
    lilist = html_tree.select('html > body > div.content ul.sellListContent > li')
    data = []
    # 2. Extract the information for each listing
    for i in lilist:
        # Listing title
        houseTitle = i.select_one('li > div.info.clear > div.title > a').text
        # Price spans: the first holds the total price, the second the unit price
        priceInfo = i.select('li > div.info.clear > div.priceInfo span')
        # '万元' (ten thousand yuan) is appended as the unit of the total price
        total_price, unit_price = priceInfo[0].text + '万元', priceInfo[1].text
        data.append([houseTitle, unit_price, total_price])
    return data
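
The selectors in get_data can be checked offline before running the scraper against the site. The sketch below repeats a compact copy of get_data and runs it on a small hand-written HTML fragment. The fragment is an assumption built only from the selectors above (the class names totalPrice and unitPrice and the sample values are invented for illustration), not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to match the selectors used by get_data
SAMPLE_HTML = """
<html><body><div class="content"><ul class="sellListContent">
<li><div class="info clear">
  <div class="title"><a>Sample listing A</a></div>
  <div class="priceInfo">
    <div class="totalPrice"><span>120</span></div>
    <div class="unitPrice"><span>15000 yuan/m2</span></div>
  </div>
</div></li>
<li><div class="info clear">
  <div class="title"><a>Sample listing B</a></div>
  <div class="priceInfo">
    <div class="totalPrice"><span>95.5</span></div>
    <div class="unitPrice"><span>12000 yuan/m2</span></div>
  </div>
</div></li>
</ul></div></body></html>
"""


def get_data(html_tree):
    """Compact copy of the article's get_data, with the same selectors."""
    data = []
    for item in html_tree.select('html > body > div.content ul.sellListContent > li'):
        title = item.select_one('li > div.info.clear > div.title > a').text
        spans = item.select('li > div.info.clear > div.priceInfo span')
        # First span: total price (unit appended); second span: unit price
        total_price, unit_price = spans[0].text + '万元', spans[1].text
        data.append([title, unit_price, total_price])
    return data


rows = get_data(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
for row in rows:
    print(row)
```

If the site changes its markup, the same kind of fragment makes it quick to see which selector stopped matching.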

4. The main function, which also writes the collected data to a CSV file

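def main():
    # Open the output file (Lianjia second-hand housing); newline='' lets the csv module control line endings
    with open('./链家二手房.csv', 'w', encoding='utf-8', newline='') as file:
        writer = csv.writer(file)
        # Write the header row: title, unit price, total price
        writer.writerow(['标题', '单价', '总价'])
        for page in tqdm(range(1, 101), desc='链家二手房爬虫'):
            URL = f'https://cd.lianjia.com/ershoufang/pg{page}/'
            # 1. Request the link and get the page source
            htmlStr = get_response(URL)
            # 2. Parse the page source into a tree structure
            soup = BeautifulSoup(htmlStr, 'html.parser')
            # 3. Extract the information
            result = get_data(soup)
            # Write all rows from this page at once
            writer.writerows(result)
    # The with block closes the file automatically, even if an error occurs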

# Call the function here
# A function does not run when it is defined, only when it is called
main()
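
The CSV-writing step in main() can also be tried on its own, without any network access. The sketch below writes a header row and some data rows to an in-memory buffer instead of a file. The helper name rows_to_csv_text, the English column names, and the sample row are hypothetical and used only for illustration.

```python
import csv
import io


def rows_to_csv_text(header, rows):
    """Write a header and rows with csv.writer and return the CSV text."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    writer.writerow(header)   # one row: the column names
    writer.writerows(rows)    # many rows at once, as main() does per page
    return buffer.getvalue()


text = rows_to_csv_text(
    ['title', 'unit_price', 'total_price'],
    [['Sample listing A', '15,000 yuan/m2', '120万元']],
)
print(text)
```

The unit price in the sample row contains a comma, so csv.writer wraps that field in quotes. This is the main reason to use the csv module rather than joining fields with commas by hand.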