Web Scraping
Step 1: Decide which method to use
- Find the API endpoint
- requests
- selenium
Step 2: Analyze the page structure and the data to scrape
1. Ordinary page content: a single request is enough (see the sketch below)
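For an ordinary static page, one GET request plus BeautifulSoup parsing covers it. A minimal sketch (the URL and selector are placeholders, not a real target):

import requests
from bs4 import BeautifulSoup

# Placeholder URL and selector; swap in the real target
res = requests.get('https://example.com')
soup = BeautifulSoup(res.text, 'lxml')
print(soup.select_one('h1').text)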
2. Pages that only finish loading once you scroll to the bottom (like JD.com listings)
# Scroll down in 800px steps so lazy-loaded content has time to render
height = 800
for _ in range(13):
    b.execute_script(f'window.scrollTo(0,{height})')
    height += 800
    time.sleep(1)
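The fixed loop above assumes the page is roughly 13 screens tall. An alternative sketch keeps scrolling until document.body.scrollHeight stops growing, so the page length doesn't need to be known in advance:

last_height = b.execute_script('return document.body.scrollHeight')
while True:
    b.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(1)  # give lazy-loaded content time to render
    new_height = b.execute_script('return document.body.scrollHeight')
    if new_height == last_height:  # height stopped growing: bottom reached
        break
    last_height = new_height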
3. Pages where you scrape a listing first, then must click into each item to get its details
# search_result is the list of clickable result elements found earlier
for a in search_result:
    a.click()  # opens the detail page in a new tab
    time.sleep(2)
    b.switch_to.window(b.window_handles[-1])  # switch to the newest tab
    soup = BeautifulSoup(b.page_source, 'lxml')
    names = soup.select_one('.wx-tit> h1').text
    print(names)
    b.close()  # close the detail tab
    b.switch_to.window(b.window_handles[0])  # back to the results tab
Step 3: Using each of the three methods
1. Finding the API endpoint (single page)
import requests
import csv

# Hit the JSON API directly instead of parsing HTML
res = requests.get('https://lewan.baidu.com/lewanapi?action=aladdin_rank_games&gameSource=standalone')
result = res.json()
list1 = []
for item in result['result']['data']['annual']:
    names = item['gameName']
    score = item['gameScore']
    types = item['gameTypes']
    index1 = item['gameQueryIndex']
    list1.append([names, score, types, index1])
writer1 = csv.writer(open('files/games.csv', 'w', encoding='utf-8', newline=''))
writer1.writerow(['names', 'score', 'types', 'index1'])
writer1.writerows(list1)
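Opening the CSV inline like this never explicitly closes the file. The same write wrapped in a with block is a safer pattern, since the data is guaranteed to be flushed:

with open('files/games.csv', 'w', encoding='utf-8', newline='') as f:
    writer1 = csv.writer(f)
    writer1.writerow(['names', 'score', 'types', 'index1'])
    writer1.writerows(list1)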
2. requests (single page)
import requests
import csv
from bs4 import BeautifulSoup

# Browser-like headers so the request is less likely to be blocked
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
}
res = requests.get('target page URL', headers=headers)  # placeholder URL
soup1 = BeautifulSoup(res.text, 'lxml')
divs = soup1.select('.mod_figure.mod_figure_v_default.mod_figure_list_box>div')
list1 = []
for div in divs:
    names = div.select_one('.list_item>a').attrs['title']
    titles = div.select_one('.list_item>div>div').attrs['title']
    pic = div.select_one('.list_item img').attrs['src']
    list1.append([names, titles, pic])
writer1 = csv.writer(open('files/tencent_variety.csv', 'w', encoding='utf-8', newline=''))
writer1.writerow(['show name', 'theme', 'cover'])
writer1.writerows(list1)
3. selenium (single page)
from selenium.webdriver import ChromeOptions, Chrome
from selenium.webdriver.common.keys import Keys
import time
import csv
from bs4 import BeautifulSoup

options = ChromeOptions()
# Hide the "controlled by automated software" banner
options.add_experimental_option('excludeSwitches', ['enable-automation'])
# Skip loading images to speed the page up
options.add_experimental_option('prefs', {'profile.managed_default_content_settings.images': 2})
b = Chrome(options=options)
b.get('target URL')  # placeholder URL

# Type the query into the search box and submit
search_box = b.find_element_by_id('search box id')  # placeholder id
search_box.send_keys('search query')
search_box.send_keys(Keys.ENTER)
time.sleep(2)

soup1 = BeautifulSoup(b.page_source, 'lxml')
divs = soup1.select('.mod_figure.mod_figure_v_default.mod_figure_list_box>div')
list1 = []
for div in divs:
    names = div.select_one('.list_item>a').attrs['title']
    titles = div.select_one('.list_item>div>div').attrs['title']
    pic = div.select_one('.list_item img').attrs['src']
    list1.append([names, titles, pic])
writer1 = csv.writer(open('files/tencent_variety.csv', 'w', encoding='utf-8', newline=''))
writer1.writerow(['show name', 'theme', 'cover'])
writer1.writerows(list1)
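Note that the find_element_by_id / find_element_by_class_name helpers used in these notes were removed in Selenium 4. If they raise AttributeError, the equivalent By-locator calls are:

from selenium.webdriver.common.by import By

search_box = b.find_element(By.ID, 'search box id')    # replaces find_element_by_id
next_btn = b.find_element(By.CLASS_NAME, 'pn-next')    # replaces find_element_by_class_name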
Step 4: Scraping multiple pages
Click "next page" to reach page 2, scroll down again, scrape its content... and so on until the last page. This repeated operation is wrapped in fun2 below; a driver loop that calls it is sketched after the function.
def fun2():
    # Go to the next page
    next_btn = b.find_element_by_class_name('pn-next')
    next_btn.click()
    # Scroll down step by step so lazy-loaded items render
    height = 800
    for _ in range(12):
        b.execute_script(f'window.scrollTo(0,{height})')
        height += 800
        time.sleep(2)
    soup1 = BeautifulSoup(b.page_source, 'lxml')
    lis = soup1.select('.gl-warp.clearfix>li')
    list1 = []
    for li in lis:
        price = li.select_one('div.p-price > strong > i').text
        # Some items have no shop name element; select_one returns None then
        try:
            shop_name = li.select_one('a.curr-shop.hd-shopname').text
        except AttributeError:
            shop_name = ''
        list1.append([price, shop_name])
    # Append each page's rows to the same CSV
    writer1 = csv.writer(open('files/jd_phones.csv', 'a', encoding='utf-8', newline=''))
    writer1.writerows(list1)
    time.sleep(2)
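fun2 handles one page; a driver loop has to call it until the last page. A sketch of one stopping condition, assuming JD marks the exhausted next button by adding 'disabled' to its class (verify against the live page):

while True:
    next_btn = b.find_element_by_class_name('pn-next')
    # Assumed last-page signal: the button's class gains 'disabled'
    if 'disabled' in next_btn.get_attribute('class'):
        break
    fun2()  # clicks next and scrapes the newly loaded page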