When writing a crawler, scraping a simple page is straightforward. For example:
import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent so the site does not reject the request
user = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
headers = {"User-Agent": user}
response = requests.get("https://www.baidu.com/", headers=headers)
print(response.status_code)
# Use the encoding detected from the content to avoid garbled text
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
This gives you a structured, parseable version of the page.
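Once you have the soup object, you can query it instead of printing the whole document. A minimal sketch of how that looks (the HTML snippet below is made-up sample data standing in for response.text, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML standing in for response.text
html = """<html><head><title>Example Page</title></head>
<body><a href="/a">first</a><a href="/b">second</a></body></html>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)                         # text of the <title> tag
print([a["href"] for a in soup.find_all("a")])   # every link target on the page
```

find_all() and attribute access like a["href"] are the usual way to pull specific fields out of the parsed tree rather than dumping the whole document.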
Some sites are a bit more involved and can only be accessed with the user's cookie. In that case you first need to find your cookie: log in to the site you want to scrape, press F12 to open the browser's developer tools, refresh the page, click the Network tab, and then click the first request in the list on the left. The request's headers will appear in the side panel, and the account's Cookie value can be found among them.
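As an aside, instead of sending the raw string in a Cookie header (as the code below does), requests also accepts a dict through its cookies= parameter. A small helper like the hypothetical cookie_header_to_dict below can convert the value copied from DevTools; this is a sketch, not part of the original code:

```python
def cookie_header_to_dict(cookie_header):
    """Turn a raw 'Cookie' header copied from DevTools into a dict
    suitable for requests.get(url, cookies=...)."""
    # Each "name=value" pair is separated by "; "; split on the first "="
    # only, since cookie values may themselves contain "=".
    pairs = (item.split("=", 1) for item in cookie_header.split("; "))
    return {name: value for name, value in pairs}

print(cookie_header_to_dict("UN=weixin_43654083; AU=391"))
# {'UN': 'weixin_43654083', 'AU': '391'}
```

Either form works; the dict form just makes individual cookies easier to inspect or edit.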
For example, the code looks like this:
import requests
from bs4 import BeautifulSoup

# Same browser-like User-Agent as before
user = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
cookie = "uuid_tt_dd=10_6647180090-1565705226664-828296; dc_session_id=10_1565705226664.653162; smidV2=201909021440594a9f393f93f293496a0b3490a2ecb61500d17bcb038fb0ee0; UserName=weixin_43654083; UserInfo=0beba30009e74a71b81d1eca59e9d1d6; UserToken=0beba30009e74a71b81d1eca59e9d1d6; UserNick=%E5%86%85%E5%B8%88%E5%A4%A7%E6%A0%91%E8%8E%93%E5%B0%8F%E9%98%9F; AU=391; UN=weixin_43654083; BT=1570070714264; p_uid=U000000; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_6647180090-1565705226664-828296!1788*1*PC_VC!5744*1*weixin_43654083; __gads=Test; firstDie=1; Hm_lvt_eb5e3324020df43e5f9be265a8beb7fd=1574508727; Hm_ct_eb5e3324020df43e5f9be265a8beb7fd=5744*1*weixin_43654083!6525*1*10_6647180090-1565705226664-828296; announcement=%257B%2522isLogin%2522%253Atrue%252C%2522announcementUrl%2522%253A%2522https%253A%252F%252Fblogdev.blog.csdn.net%252Farticle%252Fdetails%252F103053996%2522%252C%2522announcementCount%2522%253A0%252C%2522announcementExpire%2522%253A3600000%257D; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1574559007,1574559191,1574559204,1574559745; Hm_lpvt_6bcd52f51e9b3dce32bec415ac=1574561316; dc_tos=q1gbac"
# Send the cookie alongside the User-Agent so the site treats us as logged in
headers = {"User-Agent": user, "Cookie": cookie}
response = requests.get("URL", headers=headers)
print(response.status_code)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
With that, you can retrieve the structured page of a site that requires login.
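If you plan to make several requests to the same logged-in site, a requests.Session keeps the headers (and any cookies the server later sets) across calls, so you only configure them once. A minimal sketch assuming the same user and cookie strings as above:

```python
import requests

# Hypothetical stand-ins for the User-Agent and Cookie values shown earlier
user = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."
cookie = "UN=weixin_43654083; AU=391"

session = requests.Session()
# Headers set on the session are sent with every request it makes
session.headers.update({"User-Agent": user, "Cookie": cookie})

# Each call now carries the login cookie automatically:
# response = session.get("URL")
# soup = BeautifulSoup(response.text, 'html.parser')
```

The session also reuses the underlying TCP connection, which makes scraping several pages from one site noticeably faster.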