自学爬虫项目(一)

最新推荐文章于 2024-04-01 18:02:56 发布

壹乐

最新推荐文章于 2024-04-01 18:02:56 发布

阅读量5k

点赞数 30

分类专栏： python爬虫学习文章标签： python 爬虫

本文链接：https://blog.csdn.net/weixin_51894643/article/details/113245044

版权

python爬虫学习专栏收录该内容

2 篇文章 1 订阅

订阅专栏

引言

本人是只有python语言基础的小白，进入大学前从未接触过编程知识，学习的专业也与编程无关。机缘巧合之下，有幸接触到编程，对其产生浓厚的兴趣，并开始学习。

此文旨在记录生活，总结心得，若有不足之处，欢迎批评指正。

文章目录

一、明确目标

用多协程爬取安客居前十页的二手房源的名称，价格，几房几厅，大小，建造年份，联系人，地址。

二、分析过程

1. 首先打开安客居的网站

安客居的url:https://beijing.anjuke.com/sale/

进入网站后按F12进入开发者模式，在network中的Doc中的第0个文件可以看到我们想要的信息都在网页的html中，所以我们可以直接点击开发者框左上角的一个小箭头，来查找我们需要的信息的标签。在这里插入图片描述
我们需要的每一个信息的标签都可以通过以下方式获取

2. 下面接着分析网址

url：https://beijing.anjuke.com/sale/

不难看出，beijing是代表二手房源所在城市，但是我们还需要获取前十页的网页数据，接下来就点开下一页网址看看有无规律可循

第二页：https://beijing.anjuke.com/sale/p2/
第三页：https://beijing.anjuke.com/sale/p3/
第四页：https://beijing.anjuke.com/sale/p4/

通过观察前四页，发先在原本的网址上多了一个p2,p3,p4，可想而知p就是pag的意思，数字则代表相应页码

若想获取前十页数据则需要在加入一个循环便能做到，接下来便是代码实现

三、代码实现

首先第一步将我们上面需要用到的模块导入

from gevent import monkey
monkey.patch_all()
#让程序变成异步模式。
import random,csv,gevent,time,requests

from bs4 import BeautifulSoup

from fake_useragent import UserAgent

from gevent.queue import Queue

根据前面分析网址规律，用for循环构造出前十页的网址，并将这些网址放进队列

work = Queue()
# 创建队列对象，并赋值给work。
for pag in range(10):
    # 利用循环将前十页的网址获取到
    url_list = 'https://fuzhoushi.anjuke.com/sale/p'+str(pag)

    work.put_nowait(url_list)
   # 把构造好的网址用put_nowait方法添加进列表里

定义一个爬取网页的函数，用开发者选项里的小箭头依次找到二手房源的名称，价格，几房几厅，大小，建造年份，联系人，地址的标签，代码实现如下：

def House_Spider():
    # 定义House_Spider函数
    house_items = []
    # 创建一个空列表，到时候用来装各个信息
    pagnum = 0
    # 页码
    while not work.empty():
    # 当列表不是空的时候，执行下面的程序
        pagnum += 1

        print('正在爬取第{}页数据'.format(pagnum))
        # 显示正在爬取的页码
        url = work.get_nowait()

        headers= {'User-Agent':str(UserAgent().random)}
        # 随机获取请求头
        res = requests.get(url,headers=headers)
        # 获取网页源代码
        bs =BeautifulSoup(res.text,'html.parser')
        # 解析网页源码数据
        items = bs.find_all('div',tongji_tag="fcpc_ersflist_gzcount")
        # 提取每一个房源的全部信息
        for item in items:
        # 遍历循环items得到每个房源信息
            title = item.find('h3')['title']
            # 房源标题
            price = item.find('p',class_="property-price-total").text
            # 二手房价格
            room_num = item.find('p',class_="property-content-info-text property-content-info-attribute").text
            # 房子型号，即几室几厅
            area = item.find('div',class_="property-content-info").text.strip()[40:50]
            # 房子面积
            house_age = item.find('div',class_="property-content-info").text.strip()[-7:-1]
            #房子建造年份
            if house_age[-2:-1] != '年':
                house_age = '无'
            else:
                pass

            call_name = item.find('span',class_="property-extra-text").text
            # 联系人
            house_addr = item.find('div',class_="property-content-info property-content-info-comm").text
            # 房源地址
            item1 = [title,room_num,area,price,house_age,call_name,house_addr]
            # 将各个信息放入item1列表中
            house_items.append(item1)
            # 将每个列表添加到house_items列表中，用于下步存储
            
    return house_items
    #返回house_items的值

所有数据已经拿到了，接下来就是要将我们拿到的数据存到本地，用csv模块便能实现，打开的文件最后一定要记得关闭。

def House_File(house_items):
# 定义一个存储数据的文件，并传入我们要存储的数据
    house_file = open('house_price.csv','w',newline='',encoding='gbk')
    # 打开一个名为house_price.csv的文件(没有则会创建)，w为写入模式，newline=''区分换行符，encoding='gbk'表示编码格式
    line = ['名称','房型','面积','价格','建造年份','联系人','地址']
    
    w = csv.writer(house_file)
    # 用csv.writer()函数创建一个w对象
    w.writerow(line)
    # 在第一行写入line列表的信息
    for house_item in house_items:
    # 用遍历house_items得到每个房源的信息
        w.writerow(house_item)
        # 逐行写入
    house_file.close()
    # 关闭文件

数据和存储的代码都搞定，接下来就是启动整个项目啦！

task_list = []
# 创建一个任务列表
for x in range(5):
# 创建5只爬虫来为我们服务
    task = gevent.spawn(House_Spider,work)
    # 创建一个任务task
    task_list.append(task)
    # 将任务全部导入任务列表
    gevent.joinall(task_list)
    # 启动任务列表内的任务
start = time.time()
# 记录项目开始时间
print('任务开始'.center(20,'-'))
# 程序开始
house_items = House_Spider()
# 提取数据，解析数据，筛选拿到想要的数据
House_File(house_items)
# 将数据存储到本地
end = time.time()
# 记录结束时间
print('任务完成'.center(20,'-'))
# 程序结束
print('共耗时:{:.2f}秒'.format(end-start).center(20,'-'))
# 打印共消耗的时间

最后运行的结果为：
在这里插入图片描述

四、代码整合

#获取前十页的二手房源的 名称 价格 几房几厅 大小 建造年份 联系人 地址 标签 
# url ： https://fuzhoushi.anjuke.com/sale/
from gevent import monkey
monkey.patch_all()
 
import random,csv,gevent,time

import requests

from bs4 import BeautifulSoup

from fake_useragent import UserAgent

from gevent.queue import Queue
  
work = Queue()
# 创建队列对象，并赋值给work。
for pag in range(10):
    # 利用循环将前十页的网址获取到
    url_list = 'https://fuzhoushi.anjuke.com/sale/p'+str(pag)

    work.put_nowait(url_list)
   # 把构造好的网址用put_nowait方法添加进列表里
def House_Spider():
    # 定义House_Spider函数
    house_items = []
    # 创建一个空列表，到时候用来装各个信息
    pagnum = 0
    # 页码
    while not work.empty():
    # 当列表不是空的时候，执行下面的程序
        pagnum += 1

        print('正在爬取第{}页数据'.format(pagnum))
        # 显示正在爬取的页码
        url = work.get_nowait()

        headers= {'User-Agent':str(UserAgent().random)}
        # 随机获取请求头
        res = requests.get(url,headers=headers)
        # 获取网页源代码
        bs =BeautifulSoup(res.text,'html.parser')
        # 解析网页源码数据
        items = bs.find_all('div',tongji_tag="fcpc_ersflist_gzcount")
        # 提取每一个房源的全部信息
        for item in items:
        # 遍历循环items得到每个房源信息
            title = item.find('h3')['title']
            # 房源标题
            price = item.find('p',class_="property-price-total").text
            # 二手房价格
            room_num = item.find('p',class_="property-content-info-text property-content-info-attribute").text
            # 房子型号，即几室几厅
            area = item.find('div',class_="property-content-info").text.strip()[40:50]
            # 房子面积
            house_age = item.find('div',class_="property-content-info").text.strip()[-7:-1]
            #房子建造年份
            if house_age[-2:-1] != '年':
                house_age = '无'
            else:
                pass

            call_name = item.find('span',class_="property-extra-text").text
            # 联系人
            house_addr = item.find('div',class_="property-content-info property-content-info-comm").text
            # 房源地址
            item1 = [title,room_num,area,price,house_age,call_name,house_addr]
            # 将各个信息放入item1列表中
            house_items.append(item1)
            # 将每个列表添加到house_items列表中，用于下步存储
            
    return house_items

def House_File(house_items):
# 定义一个存储数据的文件，并传入我们要存储的数据
    house_file = open('house_price.csv','w',newline='',encoding='gbk')
    # 打开一个名为house_price.csv的文件(没有则会创建)，w为写入模式，newline=''区分换行符，encoding='gbk'表示编码格式
    line = ['名称','房型','面积','价格','建造年份','联系人','地址']
    
    w = csv.writer(house_file)
    # 用csv.writer()函数创建一个w对象
    w.writerow(line)
    # 在第一行写入line列表的信息
    for house_item in house_items:
    # 用遍历house_items得到每个房源的信息
        w.writerow(house_item)
        # 逐行写入
    house_file.close()
    # 关闭文件
task_list = []
# 创建一个任务列表
for x in range(5):
# 创建5只爬虫来为我们服务
    task = gevent.spawn(House_Spider,work)
    # 创建一个任务task
    task_list.append(task)
    # 将任务全部导入任务列表
    gevent.joinall(task_list)
    # 启动任务列表内的任务
start = time.time()
# 记录项目开始时间
print('任务开始'.center(20,'-'))
# 程序开始
house_items = House_Spider()
# 提取数据，解析数据，筛选拿到想要的数据
House_File(house_items)
# 将数据存储到本地
end = time.time()
# 记录结束时间
print('任务完成'.center(20,'-'))
# 程序结束
print('共耗时:{:.2f}秒'.format(end-start).center(20,'-'))
# 打印共消耗的时间

五、更多

还有很多地方可以改进，比如可以用一个input来替代网页中的beijing，可以搜索自己想搜索的城市等等。

虽然东西不多，但刚开始学习，以后也会慢慢的多写一点，希望能和大家一起进步~
　　我也只是个菜鸟，文中错误的地方，欢迎拍砖~

壹乐

关注

30
点赞
踩
68

收藏

觉得还不错? 一键收藏
打赏
22
评论
自学爬虫项目(一)

引言本人是只有python语言基础的小白，进入大学前从未接触过编程知识，学习的专业也与编程无关。机缘巧合之下，有幸接触到编程，对其产生浓厚的兴趣，并开始学习。此文旨在记录生活，总结心得，若有不足之处，欢迎批评指正。文章目录引言一、明确目标二、分析过程三、代码实现一、明确目标用多协程爬取安客居前十页的二手房源的名称，价格，几房几厅，大小，建造年份，联系人，地址。二、分析过程1. 首先打开安客居的网站安客居的url:https://beijing.anjuke.com/sale/进入网站后按F
复制链接

扫一扫