Web Crawler Basics

我当时害怕极了

Tags: python, networking

Article link: https://blog.csdn.net/m0_46163065/article/details/105878766

1. Basic concepts

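Web crawler: a program that imitates the way a person requests web pages. It requests pages automatically, downloads the data, and then extracts the valuable parts according to a set of rules.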

HTTP: the HyperText Transfer Protocol, a method for publishing and receiving HTML pages. The server listens on port 80 by default (443 for HTTPS).

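URL: Uniform Resource Locator

A URL has the following structure:
scheme://host:port/path/?query-string=xxx#anchor
scheme: the access protocol, usually http, https, ftp, and so on.
host: the host name or domain name, for example www.baidu.com
port: the port number. When you visit a site, the browser uses port 80 by default for http and 443 for https.
path: the path used to locate the resource on the server.
query-string: the query string, that is, everything after the question mark (?).
anchor: the anchor. The back end generally ignores it; the front end uses it to jump to a position in the page.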
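
To see these parts concretely, the standard library's urllib.parse.urlsplit can take a URL apart. This is only a minimal sketch; the URL below is a made-up example and does not come from the projects later in this post.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, used only for illustration
parts = urlsplit("https://www.baidu.com:443/s/?wd=python&pn=10#results")

print(parts.scheme)    # https
print(parts.hostname)  # www.baidu.com
print(parts.port)      # 443
print(parts.path)      # /s/
print(parts.query)     # wd=python&pn=10
print(parts.fragment)  # results (the anchor)

# parse_qs turns the query string into a dict that maps each key to a list of values
print(parse_qs(parts.query))
```

Note that urlsplit calls the anchor the "fragment".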

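When you request a URL in a browser, the browser encodes it first. Apart from ASCII letters, digits, and a small set of punctuation characters (for example - _ . ~ and delimiters such as / ? & =), every character is encoded as a percent sign followed by the hexadecimal value of each of its bytes.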
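
The same encoding can be reproduced with urllib.parse.quote and reversed with unquote. A small sketch, assuming a made-up search term:

```python
from urllib.parse import quote, unquote

word = "电影 top10"  # hypothetical search term, for illustration only

# Non-ASCII text is converted to UTF-8 bytes, then each byte becomes %XX
encoded = quote(word)
print(encoded)           # %E7%94%B5%E5%BD%B1%20top10
print(unquote(encoded))  # 电影 top10
```

Each Chinese character takes three bytes in UTF-8, which is why it turns into three %XX groups, and the space becomes %20.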

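Common request methods: HTTP/1.1 defines 8 methods. Only GET and POST are covered here.
1. GET: generally used when the request only retrieves data from the server and has no effect on server-side resources.
2. POST: used when the request sends data to the server (for example logging in) or uploads data, and so affects server-side resources.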

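Note: as an anti-crawler measure, some sites and servers do not play by the rules. A request that ought to be a GET may have to be sent as a POST, so check what the site actually expects case by case.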
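
The difference between the two methods can be seen without sending anything by building request objects with the standard library's urllib.request.Request. This is a sketch only; the example.com endpoints and the form fields are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoints; the requests are only built here, never sent

# GET: parameters travel in the query string of the URL
get_req = Request("https://example.com/search?" + urlencode({"q": "python"}))

# POST: parameters travel in the request body, which must be bytes
form = urlencode({"user": "alice", "password": "secret"}).encode("utf-8")
post_req = Request("https://example.com/login", data=form)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

urllib picks POST automatically as soon as a body is supplied through data. With the requests library the equivalent calls would be requests.get(url, params=...) and requests.post(url, data=...).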

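Common request header fields:
1. User-Agent: identifies the client, normally the browser name. When a page is requested, the server can tell from this field which browser sent the request. If we send the request from a crawler without setting it, the User-Agent names a Python client (for example python-requests or Python-urllib), and a site with anti-crawler measures can easily tell that it is dealing with a crawler. That is why we set this value to disguise the crawler.
2. Referer: indicates which URL the current request came from.
3. Cookie: HTTP is stateless, which means that when the same person sends two requests the server has no way of knowing that they come from the same person. A cookie is used to identify them. Sites that require login before access generally need the cookie to be sent with the request.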
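
As a rough sketch of how these three headers are attached to a request, the snippet below builds a urllib.request.Request without sending it. The URL, the Referer, and the cookie value are placeholders invented for illustration.

```python
from urllib.request import Request

# Placeholder values for illustration; a real cookie comes from a logged-in session
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Referer": "https://example.com/",
    "Cookie": "sessionid=abc123",
}
req = Request("https://example.com/profile", headers=headers)

# urllib stores header names via str.capitalize(), so "User-Agent" is kept as "User-agent"
for name, value in req.header_items():
    print(name, ":", value)
```

With the requests library the same dict is passed as requests.get(url, headers=headers), which is the form used in the projects in section 2.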

As a beginner, I have only studied the urllib, requests, and beautifulsoup libraries so far, and I use xpath, re regular expressions, and beautifulsoup to parse the scraped pages.

2. A few basic small projects

2.1 Scraping pages from "电影天堂" (dytt8.net)

import requests
from lxml import etree

BASE_DOMAIN = "https://dytt8.net"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"}


def get_detail_urls(url):
    """Return the absolute URLs of every detail page linked from one list page."""
    response = requests.get(url, headers=headers)
    # The site is served as GBK; skip any bytes that cannot be decoded
    text = response.content.decode("gbk", errors="ignore")
    html = etree.HTML(text)
    detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
    return [BASE_DOMAIN + detail_url for detail_url in detail_urls]


def parse_info(info, rule):
    """Remove the field label (rule) and the surrounding whitespace from a line."""
    return info.replace(rule, "").strip()


def parse_detail_page(url):
    """Scrape one detail page into a dict."""
    movie = {}
    response = requests.get(url, headers=headers)
    text = response.content.decode("gbk", errors="ignore")
    html = etree.HTML(text)
    # xpath() returns a list, so take element 0
    title = html.xpath('//h1//font[@color="#07519a"]/text()')[0]
    movie["title"] = title
    zoomE = html.xpath('//div[@id="Zoom"]')[0]
    movie["img"] = zoomE.xpath('.//img/@src')
    infos = zoomE.xpath('.//text()')
    # enumerate gives the index needed to read the lines that follow a label
    for index, info in enumerate(infos):
        if info.startswith("◎年　　代"):
            movie["year"] = parse_info(info, "◎年　　代")
        elif info.startswith("◎产　　地"):
            movie["chandi"] = parse_info(info, "◎产　　地")
        elif info.startswith("◎类　　别"):
            movie["cat"] = parse_info(info, "◎类　　别")
        elif info.startswith("◎豆瓣评分"):
            movie["score"] = parse_info(info, "◎豆瓣评分")
        elif info.startswith("◎主　　演"):
            # The cast spans several lines: read on until the next "◎" label
            actors = [parse_info(info, "◎主　　演")]
            for x in range(index + 1, len(infos)):
                actor = infos[x].strip()
                if actor.startswith("◎"):
                    break
                if actor:
                    actors.append(actor)
            movie["actor"] = actors
        elif info.startswith("◎简　　介"):
            # The synopsis follows the label: collect lines up to the download section
            profile = [parse_info(info, "◎简　　介")]
            for x in range(index + 1, len(infos)):
                line = infos[x].strip()
                if line.startswith('【下载地址】'):
                    break
                profile.append(line)
            movie["profile"] = "".join(profile)
    download_urls = html.xpath('//td[@bgcolor="#fdfddf"]/a/@href')
    # Some pages may have no download link in this cell
    movie["download_url"] = download_urls[0] if download_urls else None
    return movie


def spider():
    base_url = "https://dytt8.net/html/gndy/dyzz/list_23_{}.html"
    movies = []
    for x in range(1, 8):  # list pages 1 to 7
        url = base_url.format(x)
        for detail_url in get_detail_urls(url):
            movies.append(parse_detail_page(detail_url))

    print(movies)


if __name__ == "__main__":
    spider()
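
The field extraction inside parse_detail_page can be tried offline on a list of strings. The sketch below covers only the year and the multi-line cast; the sample lines are invented stand-ins for what zoomE.xpath('.//text()') returns, and extract_fields is a hypothetical helper written for this illustration.

```python
YEAR = "◎年　　代"
CAST = "◎主　　演"
SYNOPSIS = "◎简　　介"


def extract_fields(infos):
    """Collect the year and the multi-line cast from a list of text lines."""
    movie = {}
    for index, info in enumerate(infos):
        if info.startswith(YEAR):
            movie["year"] = info.replace(YEAR, "").strip()
        elif info.startswith(CAST):
            actors = [info.replace(CAST, "").strip()]
            # Keep reading until the next "◎" label starts a new field
            for line in infos[index + 1:]:
                line = line.strip()
                if line.startswith("◎"):
                    break
                if line:
                    actors.append(line)
            movie["actor"] = actors
    return movie


# Made-up sample lines, for illustration only
sample = [
    YEAR + " 2019",
    CAST + " Actor A",
    "      Actor B",
    "      Actor C",
    SYNOPSIS,
]
print(extract_fields(sample))
# {'year': '2019', 'actor': ['Actor A', 'Actor B', 'Actor C']}
```

The index from enumerate is what makes the look-ahead possible: once the cast label is found, the inner loop starts at the following line.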

2.2 Scraping Douban movie listings with xpath

import requests
from lxml import etree

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
           "Referer": "https://movie.douban.com/"}
url = "https://movie.douban.com/cinema/nowplaying/changsha/"
response = requests.get(url, headers=headers)
text = response.text
html = etree.HTML(text)
# The first <ul class="lists"> holds the films that are now showing
ul = html.xpath("//ul[@class='lists']")[0]
lis = ul.xpath("./li")
movies = []
for li in lis:
    # Each <li> carries the film's details in data-* attributes
    title = li.xpath("@data-title")[0]
    score = li.xpath("@data-score")[0]
    region = li.xpath("@data-region")[0]
    actors = li.xpath("@data-actors")[0]
    thumbnail = li.xpath(".//img/@src")[0]
    movie = {
        "title": title,
        "score": score,
        "region": region,
        "actors": actors,
        "thumbnail": thumbnail
    }
    movies.append(movie)
print(movies)
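
The attribute lookups can be practised offline on a small snippet. The sketch below deliberately swaps lxml for the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup. The snippet is invented, and data-score is assumed to be the attribute that holds the rating.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet shaped like the list the script reads
sample = """
<div>
  <ul class="lists">
    <li data-title="Movie A" data-score="8.1" data-region="China">
      <img src="https://example.com/a.jpg" />
    </li>
    <li data-title="Movie B" data-score="7.4" data-region="France">
      <img src="https://example.com/b.jpg" />
    </li>
  </ul>
</div>
""".strip()

root = ET.fromstring(sample)
movies = []
for li in root.findall(".//ul[@class='lists']/li"):
    movies.append({
        "title": li.get("data-title"),    # same idea as li.xpath("@data-title")[0]
        "score": li.get("data-score"),
        "region": li.get("data-region"),
        "thumbnail": li.find(".//img").get("src"),
    })
print(movies)
```

Real pages are rarely well-formed XML, which is why the script above uses lxml's etree.HTML for the live site.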

2.3 Scraping the classical poetry site gushiwen.org with re regular expressions

import requests
import re

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"}


def parse_page(url):
    # headers must be passed by keyword; the second positional argument of requests.get is params
    response = requests.get(url, headers=headers)
    text = response.text
    # "." does not match newlines by default, hence re.DOTALL
    titles = re.findall(r'<div class="cont">.*?<b>(.*?)</b>', text, re.DOTALL)
    dynasties = re.findall(r'<p class="source">.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    authors = re.findall(r'<p class="source">.*?<a.*?>.*?<a.*?>(.*?)</a>', text, re.DOTALL)
    contents_tags = re.findall(r'<div class="contson".*?>(.*?)</div>', text, re.DOTALL)
    contents = []
    for i in contents_tags:
        x = re.sub(r'<.*?>', "", i)   # remove the leftover tags such as <br />
        contents.append(x.strip())    # remove surrounding whitespace and newlines
    poems = []
    for value in zip(titles, dynasties, authors, contents):
        title, dynasty, author, content = value
        poem = {
            "title": title,
            "dynasty": dynasty,
            "author": author,
            "content": content
        }
        poems.append(poem)
    for poem in poems:
        print(poem)
        print("*" * 40)


def main():
    # Page through the site to see how the URL changes: only the page number differs
    base_url = "https://www.gushiwen.org/default_{}.aspx"
    for i in range(1, 12):
        # Format the template into a new variable so the placeholder survives for the next page
        url = base_url.format(i)
        parse_page(url)


if __name__ == "__main__":
    main()
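
The regular expressions can be checked offline before any request is made. The sketch below runs the same patterns on an invented snippet that only assumes the markup shape those patterns expect; it also shows page URLs being built from a template kept in its own variable.

```python
import re

# Made-up snippet shaped like the markup the patterns expect
sample = """
<div class="cont">
  <p><a href="#"><b>Poem One</b></a></p>
  <p class="source"><a href="#">Tang</a><span>:</span><a href="#">Author A</a></p>
  <div class="contson" id="c1">
    line one<br />
    line two
  </div>
</div>
"""

# re.DOTALL lets "." cross line breaks; ".*?" is lazy, so each match stops as early as possible
titles = re.findall(r'<div class="cont">.*?<b>(.*?)</b>', sample, re.DOTALL)
dynasties = re.findall(r'<p class="source">.*?<a.*?>(.*?)</a>', sample, re.DOTALL)
authors = re.findall(r'<p class="source">.*?<a.*?>.*?<a.*?>(.*?)</a>', sample, re.DOTALL)
raw = re.findall(r'<div class="contson".*?>(.*?)</div>', sample, re.DOTALL)
contents = [re.sub(r'<.*?>', "", block).strip() for block in raw]

print(titles, dynasties, authors)
print(contents)

# Keep the template untouched and format a fresh URL for every page
template = "https://www.gushiwen.org/default_{}.aspx"
urls = [template.format(i) for i in range(1, 4)]
print(urls)
```

If a template variable is overwritten by its own formatted result, the placeholder disappears after the first page and every later iteration requests page 1 again.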

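Those are three simple entry-level projects. Personally I find xpath and re regular expressions really handy. There is a lot more that I have not covered, and I may write it up later when I have time; as a beginner I still have to keep working at it. These simple crawlers were written purely out of interest and do not involve any advanced knowledge. Feel free to get in touch and discuss.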
