day1requests

Summary

01 Understanding Web Crawlers

  1. Basic steps of a crawler
    A crawler is a program that fetches data from other people's websites through code. It works in three steps (a minimal sketch of all three together follows this list):
    1)Fetch the page data - requests, or a browser-automation tool (Selenium)
    2)Parse the data - regular expressions, CSS selectors, XPath
    3)Store the data - spreadsheet files (csv, openpyxl) or a database
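
    A minimal sketch of the three steps together; the URL and the regex below are placeholders for illustration, not from the course material:

    import csv
    import re
    import requests

    # 1) fetch - placeholder URL
    response = requests.get('https://example.com/', timeout=10)

    # 2) parse - a toy regex; a real page needs a pattern matched to its HTML
    titles = re.findall(r'<h1>(.+?)</h1>', response.text)

    # 3) store - write the parsed results into a csv file
    with open('result.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['title'])
        for title in titles:
            writer.writerow([title])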

  2. Fetching page data

    1. Understanding websites
      A web page consists of three parts: HTML, CSS, and JavaScript (JS)
      HTML - determines what content the page contains
      CSS - determines how that content looks (its styling)
      JS - makes the page content change dynamically

    2. Using requests
      requests is a third-party library for making network requests over HTTP(S)
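      Being third-party, it must be installed first (e.g. with pip install requests). A quick sanity check that it is importable:

      import requests
      print(requests.__version__)   # prints the installed version string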

02requests

import requests
from re import findall

# 1. Request a web page
# requests.get(url)  -   fetches the page at the given url and returns a response
# url - Uniform Resource Locator (the web address)
response = requests.get('https://www.sohu.com/')
print(type(response), response)    # <class 'requests.models.Response'> <Response [200]>

# 2. Read the response content
# 1) Raw response body (binary) - mainly for images, video, and audio
print(response.content)
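
# For example, binary content can be saved straight to a file. The favicon URL
# below is an illustrative guess, not part of the original notes.
img_response = requests.get('https://www.sohu.com/favicon.ico')
with open('favicon.ico', 'wb') as f:
    f.write(img_response.content)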

# 2) Response body as text    -   for web pages
print(response.text)

# 3) Parse the response body as JSON      -   for endpoints that return JSON data
print(response.json())
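
# Before reading the body it can help to confirm the request succeeded;
# status_code and raise_for_status() are standard parts of the Response API.
print(response.status_code)     # 200 means success; 4xx/5xx indicate errors
response.raise_for_status()     # raises requests.HTTPError on a 4xx/5xx status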


# 3. Add client information - disguise the request as a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}
response = requests.get('https://book.douban.com/', headers=headers)
print(response)
print(response.text)


# 4. Send cookies - needed for sites like zhihu that expect a logged-in session
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
    'cookie': '_zap=39beb782-a6cf-4483-8719-8e2d846da00d; _xsrf=zMuRHL0ug0AKhKYfxromKD0K2pb49WzV; d_c0="AMDePoka3xKPToqc24ryAdbkUVa-BnMNCLc=|1616989878"; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1616989879; SESSIONID=JoqxQeb0aYuI0CciYGx0er5NN2MwIbEn2MoqPCwApGO; __snaker__id=tuq6EBkbDlGyqzju; JOID=W1wcAU9aRxN4b24mS1EHjk7pf_lZHSVpBjMGSXA2E2s6WSBIL-IPfxNvbiBGCZfpl_fNO_UEOytACg879iVoT8g=; osd=VVoXBU1UQRh8bWAgQFUFgEjie_tXGy5tBD0AQnQ0HW0xXSJGKekLfR1pZSREB5Hik_XDPf4AOSVGAQs5-CNjS8o=; gdxidpyhxdE=35inWkvfQNd786Mr70NuCWk%5CVniAea8PACBGQcr6zC%2BvSsRJ0nRrS8iXJjegcBuudCAmkm%2BERI1pyK63DqoXAWOgu%2BzOIWaEdC5PZb0W3lzECO6%2BgnnMZ0c0zvtyeeWWcaDaokRnBSIoN3R0ap3Q9vcyfNmSe7rP2jARUZw7E4MMVDLM%3A1616990780534; _9755xjdesxxd_=32; YD00517437729195%3AWM_NI=xi4L0%2BZP%2BFywy2%2BsWsWxSY%2B3aOsYtX6ShFhFIen%2FfMfx%2B50CU2xm4C4JHgVXxDGHVHhPa%2BlIkbw0MWZDrgbwD22mIJ89HQ0GHRfdMaV5CPyaQtfY2aHyYeklCzX96bDLT2k%3D; YD00517437729195%3AWM_NIKE=9ca17ae2e6ffcda170e2e6eea6d479ab899d88c75988b08fb3c54f978a8b85f46895e9ffb7b77a969988d8c52af0fea7c3b92abcb7fcadb3219a90e5a7fc7dbbeee594d67995a89dbaaa39a391bdbaf754f68da8a7eb4e8bbba2d0e547a6a8e5d6b34997eeb88eb749f8968888d554bc9c89a9d544ac90afccef43a9beb9d1d05d9a9cbe88e17d88eefe8ecc42fce788b2b34d8c8eaa85d06b96eb8db1fc6b93acaabab16b8ca79786ea6881a698dab56a88949cd3ea37e2a3; YD00517437729195%3AWM_TID=Lftr4M6kyApFUEUBFFcv0DqgQ5uBSC%2FF; captcha_session_v2="2|1:0|10:1616989902|18:captcha_session_v2|88:aEM2RE5hdllTZU42MVlzWGJnYm9TNytFOExDTjdULy8ySEhmdDJHbW5pdUR6aWdHdTM0RE5wa1JwTnM0VTZBYQ==|dbeafd5f8991f9c969035f081369a2a7df86787e910261ed42a34360ce1c69f0"; z_c0="2|1:0|10:1616990313|4:z_c0|92:Mi4xaW5CWUdRQUFBQUFBd040LWlScmZFaWNBQUFDRUFsVk5hTjJJWUFDN0NBVmJOZXZhMHotdG9QU1R1cXFnU0NNS1J3|a1c1166e239c385fea71a694d583b66f40dfcd23e553192253270520e98d6cfc"; tst=h; tshl=; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1617003964; KLBRSID=53650870f91603bc3193342a80cf198c|1617003965|1617003881'
}
response = requests.get('https://www.zhihu.com/', headers=headers)
print(response.headers)    # the headers the server sent back, not the ones we sent
print(response.text)
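
# Cookies can also be passed separately from the headers via the cookies
# parameter (a dict of name/value pairs; the values below are placeholders).
cookie_dict = {'_zap': 'placeholder-id', 'z_c0': 'placeholder-token'}
response = requests.get('https://www.zhihu.com/',
                        headers={'User-Agent': headers['User-Agent']},
                        cookies=cookie_dict)
print(response.status_code)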


# 5. Extract data with a regular expression - movie titles from Douban's Top250 page
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}
response = requests.get('https://movie.douban.com/top250', headers=headers)
print(response.text)
result = findall(r'<img width="100" alt="(.+?)"', response.text)
print(result)

03requestsJson

import requests
from re import findall
import time

# pause briefly before requesting, so requests are not fired too quickly
time.sleep(1)


# response = requests.get('https://www.toutiao.com/')
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}
url = 'https://www.toutiao.com/api/pc/feed/?min_behot_time=0&category=__all__&utm_source=toutiao&widen=1&tadrequire=true&_signature=_02B4Z6wo00d01OMZcbgAAIDC74SAL2Quu5jjPXUAAFitQXZUPpEz0t6WQp2bx.YSJ7Kwks6umgdaCFTnlJLqG5xt7XbxK6OW.Ll4t2ofjFmJ9Pelj-WlovpDwU7lKu8Ep-PIU43aA2S2R.UY9f'
response = requests.get(url, headers=headers)
# print(response.json())
# Fix Chinese text that shows up as \uXXXX escape sequences:
# response.text.encode().decode('unicode_escape')
print(findall(r'"title":\s*".+?"', response.text.encode().decode('unicode_escape')))
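
# When an endpoint returns well-formed JSON, response.json() usually avoids the
# escaping problem, since the JSON parser decodes \uXXXX sequences itself. The
# 'data'/'title' field names below are assumptions about this feed's payload,
# inferred from the regex above, and would need verifying.
data = response.json()
print([item['title'] for item in data.get('data', []) if 'title' in item])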

Homework:

Homework 1: Save the Top250 movies' titles, ranks, ratings, and so on into a csv file (a rough sketch follows this list)
Homework 2: Pick a site you are interested in, crawl it with requests, and write down your crawling strategy
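
A rough sketch of one approach to Homework 1; the title pattern comes from the code above, but the rank and rating patterns are guesses at the Top250 page's HTML and would need checking against the live page:

import csv
import re
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}

with open('top250.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['rank', 'title', 'rating'])
    # the Top250 list spans 10 pages of 25 movies, paged by the start parameter
    for start in range(0, 250, 25):
        url = f'https://movie.douban.com/top250?start={start}'
        html = requests.get(url, headers=headers).text
        titles = re.findall(r'<img width="100" alt="(.+?)"', html)
        # the next two patterns are assumptions about the page structure
        ranks = re.findall(r'<em class="">(\d+)</em>', html)
        ratings = re.findall(r'property="v:average">(.+?)</span>', html)
        for row in zip(ranks, titles, ratings):
            writer.writerow(row)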