Python Web Scraping

Scraping a web page title

 

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive"
           }
url = "https://movie.douban.com/top250?start=101"
res = requests.get(url=url, headers=headers)   # res holds the HTTP response
res.encoding = 'UTF-8'

soup = BeautifulSoup(res.text, "html.parser")  # parse the response text with the chosen parser

head = soup.select("#content > div > div.article > ol > li:nth-child(1) > div > div.info > div.hd > a > span:nth-child(1)")
print(head)
for i in head:
    print(i.get_text())
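One thing the snippet above does not do is check whether the request actually succeeded before parsing. A minimal hardened variant (my own addition, using requests' built-in raise_for_status() and a timeout):

import requests
from bs4 import BeautifulSoup

url = "https://movie.douban.com/top250?start=101"
try:
    res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)  # trimmed headers for brevity
    res.raise_for_status()   # raises requests.HTTPError on 4xx/5xx status codes
except requests.RequestException as e:
    print("request failed:", e)
else:
    res.encoding = 'UTF-8'
    soup = BeautifulSoup(res.text, "html.parser")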

The selector string passed to select() can be copied straight from the browser's developer tools: inspect the target element, right-click the highlighted node in the Elements panel, and choose Copy → Copy selector.

Getting the list of all titles on the page

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive"
           }
url = "https://movie.douban.com/top250?start=101"
res = requests.get(url=url, headers=headers)   # res holds the HTTP response
res.encoding = 'UTF-8'

soup = BeautifulSoup(res.text, "html.parser")  # parse the response text with the chosen parser

head = soup.select("#content > div > div.article > ol > li > div > div.info > div.hd > a > span:nth-child(1)")

for i in head:
    print(i.get_text())

Compared with the code above, the li:nth-child(1) step in the selector has been changed to plain li, i.e. the child-specific part after the colon has been removed, so the result now contains every title on the page rather than just the first one.
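To make the difference concrete, here is a tiny self-contained sketch (the HTML below is made up for illustration): li:nth-child(1) matches only the first <li>, while plain li matches all of them.

from bs4 import BeautifulSoup

html = """
<ol>
  <li><span class="title">肖申克的救赎</span></li>
  <li><span class="title">霸王别姬</span></li>
  <li><span class="title">阿甘正传</span></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

first_only = soup.select("ol > li:nth-child(1) > span")  # only the first <li>
all_items = soup.select("ol > li > span")                # every <li>

print([s.get_text() for s in first_only])  # ['肖申克的救赎']
print([s.get_text() for s in all_items])   # ['肖申克的救赎', '霸王别姬', '阿甘正传']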

Scraping the full Douban Top 250 movie list

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive",
            "cookie": "p_h5_u=F0E2CDDF-339E-4DB7-A832-3EB9CF0574C7; selectedStreamLevel=LD; Hm_lvt_be772be263fd7bf6cdd60433bd714862=1581386367,1581515417,1581674685,1581675681; token=9QCm_jwuffebIWTReplyU; Hm_lpvt_be772be263fd7bf6cdd60433bd714862=1581699602"
           }

for i in range(10):
    url = "https://movie.douban.com/top250?start={}".format(i*25)  # each page lists 25 movies, so page i starts at i*25
    res = requests.get(url=url, headers=headers)   # res holds the HTTP response
    res.encoding = 'UTF-8'
    soup = BeautifulSoup(res.text, "html.parser")  # parse the response text with the chosen parser
    selector = "#content > div > div.article > ol > li > div > div.info > div.hd > a > span:nth-child(1)"
    head = soup.select(selector)

    for j in head:
        print(j.get_text())
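A small variation on the loop above (my own addition, not from the original): collect the titles into a list instead of printing them one by one, and sleep briefly between pages so the crawler is gentler on the server.

import time
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # trimmed headers for brevity
selector = "#content > div > div.article > ol > li > div > div.info > div.hd > a > span:nth-child(1)"

titles = []
for i in range(10):
    url = "https://movie.douban.com/top250?start={}".format(i * 25)  # 25 movies per page
    res = requests.get(url, headers=headers, timeout=10)
    res.encoding = 'UTF-8'
    soup = BeautifulSoup(res.text, "html.parser")
    titles.extend(span.get_text() for span in soup.select(selector))
    time.sleep(1)  # one-second pause between pages

print(len(titles))  # expect 250 titles in total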

Reading the course list from the 伯禹 (Boyu) learning platform

from bs4 import BeautifulSoup

filename = r"E:\Documents\Desktop\伯禹学习平台.html"  # raw string so the backslashes are not treated as escape sequences
with open(filename, "r", encoding='UTF-8') as f:   # context manager closes the file automatically
    data = f.read()
print(data)  # optional: dump the raw HTML

soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a GuessedAtParserWarning
selector = "#root > div > div.page-course > div > div.sidebar > div > div.ui-card-body > div > ul > li > div > div.title-wrapper > div > a"

head = soup.select(selector)

for j in head:
    print(j.get_text())

Admittedly this is a quick-and-dirty approach: save the page manually in the browser first, then call BeautifulSoup to parse the local copy.
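If saving pages by hand becomes tedious, the same cookie trick used in the Douban example above would in principle let requests fetch the logged-in page directly. A hedged sketch, where both the URL and the cookie value are placeholders you would have to copy from your own browser:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0",
    "cookie": "PASTE-YOUR-BROWSER-COOKIE-HERE",  # placeholder: copy the real cookie from the browser's devtools
}
url = "https://example.com/course-list"  # placeholder: the actual course-list URL of the platform
res = requests.get(url, headers=headers, timeout=10)
res.encoding = 'UTF-8'
soup = BeautifulSoup(res.text, "html.parser")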
