Learning Web Scraping

canaryW

Article link: https://blog.csdn.net/cobracanary/article/details/109312445

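Yesterday a senior student I had never met suddenly assigned me a job: collecting data by hand. That annoyed me a lot, so starting today I am going to learn web scraping properly, so that this kind of task can be automated in the future.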

Scrape the Chinese university rankings from a website and write them to an Excel file:

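import os

import pandas as pd
import requests
from bs4 import BeautifulSoup

ans = []
html = requests.get('https://www.eol.cn/e_html/gk/dxpm/index.shtml')
if html.status_code == 200:
    # Let requests guess the encoding from the body so the Chinese text decodes correctly
    html.encoding = html.apparent_encoding
    content = BeautifulSoup(html.text, 'html.parser')
    # The ranking table sits inside this div; each data row is a <tr> with this inline style
    rank = content.find('div', {'class': "university-list rk-con on"})
    trs = rank.find_all('tr', {'style': "color: #333;"})
    for tr in trs:
        # One list of cell texts per table row
        ans.append([td.text for td in tr.find_all('td')])
else:
    print('Request failed, HTTP status code:', html.status_code)

data = pd.DataFrame(ans, columns=['Rank', 'University', 'Score'])

print(data)
# to_excel does not create missing directories, so make sure the folder exists
os.makedirs('dataset', exist_ok=True)
data.to_excel('dataset/data.xlsx', index=False)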
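
The script above assumes that the ranking container exists and that every row has exactly three cells. As a minimal sketch of a more defensive variant, the row extraction could be wrapped in a helper. The `extract_rows` name and the inline sample markup are made up for illustration and are not taken from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that mimics the structure the scraper looks for
SAMPLE_HTML = """
<div class="university-list rk-con on">
  <table>
    <tr><th>Rank</th><th>University</th><th>Score</th></tr>
    <tr style="color: #333;"><td>1</td><td>University A</td><td>100</td></tr>
    <tr style="color: #333;"><td>2</td><td>University B</td></tr>
    <tr style="color: #333;"><td>3</td><td>University C</td><td>98.5</td></tr>
  </table>
</div>
"""


def extract_rows(html_text, n_cols=3):
    """Return the text of each data row, skipping rows without n_cols cells."""
    soup = BeautifulSoup(html_text, 'html.parser')
    container = soup.find('div', {'class': "university-list rk-con on"})
    if container is None:
        # The page layout changed or the request returned something else
        return []
    rows = []
    for tr in container.find_all('tr', {'style': "color: #333;"}):
        cells = [td.text.strip() for td in tr.find_all('td')]
        if len(cells) == n_cols:
            rows.append(cells)
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Skipping malformed rows keeps `pd.DataFrame(rows, columns=[...])` from failing on a column-count mismatch; whether to skip or to raise is a design choice that depends on how much you trust the page.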

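Zhihu has anti-scraping measures: it inspects the headers of each HTTP request to decide whether it comes from a crawler. So we add header information to the scraper to make the request look like it comes from a browser: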

import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent, so the request does not announce itself as a script
head = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'}
html = requests.get('https://www.zhihu.com/question/375537442', headers=head)
if html.status_code == 200:
    html.encoding = html.apparent_encoding
    soup = BeautifulSoup(html.text, 'html.parser')
    # Collect the <a> tags that carry the UserLink-link class
    authors = soup.find_all('a', {'class': "UserLink-link"})
    print(authors)
else:
    print('Request failed, HTTP status code:', html.status_code)
    print(html.headers)
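
To see why the header matters, it helps to look at what requests sends by default. The sketch below builds a request without sending it, so it runs offline; the `BROWSER_UA` constant is just the User-Agent string from the example above given a name for illustration:

```python
import requests

BROWSER_UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36')

# The User-Agent requests uses when none is supplied, e.g. "python-requests/2.x"
default_ua = requests.utils.default_user_agent()

# Prepare the request without sending it, so the outgoing headers can be inspected
prepared = requests.Request(
    'GET',
    'https://www.zhihu.com/question/375537442',
    headers={'User-Agent': BROWSER_UA},
).prepare()
sent_ua = prepared.headers['User-Agent']

print('default:', default_ua)
print('sent:   ', sent_ua)
```

A default value that starts with `python-requests` is easy for a server to filter, which is presumably what the User-Agent override works around.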

Also, to check the type of an element in Python, use isinstance(object, type).
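
As a small sketch of where this is handy in scraping: cells pulled from a page usually arrive as strings, but may also be numbers or None. The `to_score` helper below is hypothetical and only illustrates branching on type with `isinstance`:

```python
def to_score(value):
    """Convert a scraped cell to float, or return None if it is not numeric."""
    # bool is a subclass of int, so rule it out before the numeric check
    if isinstance(value, bool):
        return None
    # A tuple of types checks against several types at once
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value.strip())
        except ValueError:
            return None
    return None


scores = [to_score(v) for v in ['98.5', ' 100 ', 87, None, 'N/A']]
print(scores)  # [98.5, 100.0, 87.0, None, None]
```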

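Logging in with a cookie:
Many sites require a login before they can be scraped. One way to handle this is to reuse a cookie: log in to the site in a browser, look up the login information stored for that site, copy the cookie, and pass it along with the request in Python.
For example, logging in to CSDN: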

import requests

cookie="uuid_tt_dd=10_36884947900-1570263579649-276619; dc_session_id=10_1570263579649.130695; UN=cobracanary; BT=1594003729607; __gads=ID=ac09c5a469de205f-2258ed2f7bc200aa:T=1594691090:S=ALNI_MYr8CovfEXZzFot-ZSGeLURdVEBFg; UserName=cobracanary; UserNick=canaryW; AU=A99; p_uid=U000000; log_Id_click=148; announcement=%257B%2522isLogin%2522%253Atrue%252C%2522announcementUrl%2522%253A%2522https%253A%252F%252Flive.csdn.net%252Froom%252Fyzkskaka%252F5n5O4pRs%253Futm_source%253D1598583200%2522%252C%2522announcementCount%2522%253A0%257D; dc_sid=5eeb7317c65d9fce5e2b7a8e4e1b8652; c_first_ref=cn.bing.com; c_first_page=https%3A//blog.csdn.net/weixin_37719937/article/details/97417842; c_segment=11; c_ref=https%3A//cn.bing.com/; csrfToken=0pz4jueKc55gugrLbxiNDhjS; c_page_id=default; dc_tos=qiwgjt; log_Id_pv=392; log_Id_view=899; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1603865249,1603865363,1603867978,1603868017; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1603868109; Hm_up_6bcd52f51e9b3dce32bec4a3997715ac=%7B%22uid_%22%3A%7B%22value%22%3A%22cobracanary%22%2C%22scope%22%3A1%7D%2C%22islogin%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isonline%22%3A%7B%22value%22%3A%221%22%2C%22scope%22%3A1%7D%2C%22isvip%22%3A%7B%22value%22%3A%220%22%2C%22scope%22%3A1%7D%7D; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_36884947900-1570263579649-276619!5744*1*cobracanary"
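# The cookies= argument expects a mapping of individual cookie names to values,
# so the raw string copied from the browser is sent in the Cookie header instead
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:81.0) Gecko/20100101 Firefox/81.0','Cookie':cookie}
html=requests.get('https://csdn.net',headers=headers)
print(html.text)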
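
If you would rather use the `cookies=` argument of requests, the copied string first has to be split into name/value pairs. A minimal sketch, assuming the usual `name=value; name2=value2` layout; the `parse_cookie_string` helper and the sample string are made up for illustration:

```python
def parse_cookie_string(raw):
    """Split a "name=value; name2=value2" string into a dict."""
    cookies = {}
    for part in raw.split(';'):
        # partition splits on the first "=" only, so values may contain "="
        name, sep, value = part.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


# Made-up cookie string, not a real session
raw = "UserName=example_user; token=abc=123; theme=dark"
cookie_dict = parse_cookie_string(raw)
print(cookie_dict)
```

The resulting dict can then be passed as `requests.get(url, cookies=cookie_dict)`.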

Using regular expressions in a scraper:

import re

# soup is a BeautifulSoup object built as in the earlier examples
items=soup.find_all('a',{'class':"J_ClickStat",'id':re.compile(r'J_Itemlist_TLink_[0-9]*')})
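
One detail worth knowing: according to the Beautiful Soup documentation, a compiled pattern passed as a filter is applied with its `search()` method, so it matches anywhere in the attribute value. The sketch below uses plain `re` on a few made-up id strings to show what that means for the pattern above, and how a stricter pattern (my own variant, not from the original code) behaves:

```python
import re

# The pattern from the example above: "*" also allows zero digits
pattern = re.compile(r'J_Itemlist_TLink_[0-9]*')
# A stricter variant: at least one digit, applied to the whole string
strict = re.compile(r'J_Itemlist_TLink_[0-9]+')

# Made-up id values for illustration
ids = [
    'J_Itemlist_TLink_123',
    'J_Itemlist_TLink_',
    'x_J_Itemlist_TLink_9',
    'J_Itemlist_Pic_123',
]

loose_hits = [i for i in ids if pattern.search(i)]
strict_hits = [i for i in ids if strict.fullmatch(i)]

print(loose_hits)
print(strict_hits)
```

To get the stricter behavior inside `find_all`, the pattern can be anchored, for example `re.compile(r'^J_Itemlist_TLink_[0-9]+$')`.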

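Scraping AJAX-generated dynamic data from a web page
Today I wanted to scrape stock data.
But the HTML I fetched contained no values at all. It turned out that the stock data on this page is generated dynamically with AJAX.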

Using selenium to get the complete HTML
selenium lets you drive a browser from code, which makes it possible to get the data that AJAX loads dynamically.

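from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the chromedriver path through a Service object;
# the older executable_path argument was deprecated and later removed
driver = webdriver.Chrome(service=Service('D:/softwares/chromedriver/chromedriver.exe'))
driver.maximize_window()
driver.get('http://quotes.money.163.com/old/#query=EQA&DataType=HS_RANK&sort=PERCENT&order=desc&count=10&page=0')
# page_source reflects the page as the browser currently renders it,
# including content that JavaScript has added so far
soup = BeautifulSoup(driver.page_source, 'html.parser')
stocks = soup.find_all('a', {'target': '_blank'})
driver.quit()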
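
Reading `page_source` right after `get()` can still race with the AJAX call. One common remedy is an explicit wait. The sketch below is only an outline under assumptions: the function names and the CSS selector are hypothetical, and `webdriver.Chrome()` without arguments assumes that Selenium can locate a Chrome driver on the machine. The parsing step is split out so it can be tried offline on sample markup:

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, css_selector, timeout=10):
    """Load url in Chrome and return the HTML once css_selector matches an element."""
    # Imported here so the parsing helper below works without selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: a Chrome driver can be found automatically
    try:
        driver.get(url)
        # Block until the element exists or the timeout expires
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()


def extract_link_texts(page_source):
    """Return the text of every <a target="_blank"> in the given HTML."""
    soup = BeautifulSoup(page_source, 'html.parser')
    return [a.text.strip() for a in soup.find_all('a', {'target': '_blank'})]


# Offline check with made-up markup
sample = ('<a target="_blank" href="/s/1">Stock A</a>'
          '<a href="/x">skip</a>'
          '<a target="_blank">Stock B</a>')
names = extract_link_texts(sample)
print(names)
```

With a real page, the two steps would be chained as `extract_link_texts(fetch_rendered_html(url, 'a[target="_blank"]'))`, where the selector should be something that only appears once the data has loaded.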