Python crawlers: my first crawlers (Python basics, the requests library, the BeautifulSoup library, and the re regular-expression library)

轻随风去

Article link: https://blog.csdn.net/weixin_41706016/article/details/90175657

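Getting started with Python crawlers
1. This is my first blog post and I am not yet comfortable with the CSDN editor, so please excuse the layout. If anything is wrong or could be improved, leave a comment. Thanks.
2. I am a student teaching myself Python web scraping from the book 《从零开始学python网络爬虫》. I am writing these posts to keep my notes and to learn together with others. Let's keep at it!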

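Python basics

The basics cover variables, strings and four common string methods, functions, control statements, four data structures, and file operations.

Variables: Python variables need no prior declaration and no declared type; a name is created on first assignment. Strings are the type used most often in crawlers.
Strings: a sequence of characters enclosed in single or double quotes.
- Strings support + and *: + concatenates, as with strings in C++, and * repeats the string the given number of times.
- Indexing and slicing: a[0] is the first character; a[1:5] is the second through fifth characters (indices 1 to 4, because the end index is excluded).
- Four methods:
  - split(): s.split('sep') splits s at each occurrence of the separator and returns a list. With no argument it splits on runs of whitespace (spaces, tabs, newlines).
  - replace(): s.replace('old', 'new', count) returns a copy of s in which the first count occurrences of 'old', counted from the left, are replaced by 'new'. Without count, every occurrence is replaced.
  - strip(): s.strip('chars') removes from both ends of s any characters that appear in 'chars'. The argument is treated as a set of characters, not as a whole string. With no argument it removes whitespace.
  - format(): if s = 'ab{}cd', then s.format('12') returns 'ab12cd'. Each {} is a placeholder filled by the corresponding argument, and a string may contain any number of them.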
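A short sketch of the string operations listed above. The sample strings are made up for illustration; the comments show what each expression evaluates to.

```python
# Concatenation and repetition
joined = 'ab' + 'cd'            # 'abcd'
repeated = 'ab' * 3             # 'ababab'

# Indexing and slicing (the end index is excluded)
a = 'python'
first_char = a[0]               # 'p'
middle = a[1:5]                 # 'ytho'

# split(): explicit separator vs. default whitespace splitting
parts = 'a,b,,c'.split(',')             # ['a', 'b', '', 'c']
words = '  hello   world\n'.split()     # ['hello', 'world']

# replace(): only the first 2 occurrences are replaced
replaced = 'aaaa'.replace('a', 'b', 2)  # 'bbaa'

# strip(): the argument is a set of characters, not a substring
stripped = 'xxhixyx'.strip('xy')        # 'hi'

# format(): several placeholders are allowed
formatted = 'ab{}cd{}'.format('12', 34)  # 'ab12cd34'

print(parts, words, replaced, stripped, formatted)
```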
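Functions
- Definition: def function(param1, param2, ...):
  The function body is indented one level beneath the def line (PEP 8 recommends four spaces per level).
Conditionals and loops
- Conditionals: if-elif-else works much like if / else if / else in C++.
- Loops: for and while
  - for loop:
    for item in iterable:
        do(item)    # the body must be indented
    for i in range(1, 11):    # i runs from 1 to 10
        do(i)
  - while loop:
    while condition:
        do()    # the body must be indented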
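To make the function, conditional, and loop syntax concrete, here is a small sketch. The function names describe, sum_to, and count_halvings are invented for this example.

```python
def describe(n):
    """Classify a number with if-elif-else."""
    if n < 0:
        return 'negative'
    elif n == 0:
        return 'zero'
    else:
        return 'positive'


def sum_to(n):
    """Add up 1..n with a for loop; range(1, n + 1) excludes n + 1."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def count_halvings(n):
    """Count how many times n can be halved before reaching 1 (while loop)."""
    steps = 0
    while n > 1:
        n //= 2
        steps += 1
    return steps


print(describe(-3), sum_to(10), count_halvings(8))
```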
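Data structures
- List: an ordered, mutable sequence.
  - Each element has a position, or index; the first index is 0.
  - To create a list, put comma-separated items inside square brackets.
- Tuple: like a list but immutable, written with parentheses.
- Dictionary: a mapping from keys to values, written as {key: value}.
- Set: like a mathematical set; elements are unordered and unique, so converting data to a set removes duplicates. A set is written with curly braces; an empty set must be written set(), because {} creates an empty dictionary.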
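A minimal sketch of the four data structures; all values are sample data chosen for illustration.

```python
# List: ordered and mutable, indexed from 0
songs = ['song A', 'song B', 'song A']
first_song = songs[0]
songs.append('song C')

# Tuple: ordered but immutable
point = (3, 4)

# Dictionary: key -> value lookup
info = {'rank': 1, 'title': 'song A'}
info['duration'] = '3:45'       # add a new key

# Set: unordered, no duplicates, handy for de-duplication
unique_songs = set(songs)
empty_set = set()               # {} would create an empty dict instead

print(first_song, point, info, sorted(unique_songs))
```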
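File operations
- Open a file: f = open('path', 'mode')
- Close a file: f.close()
- Read the contents: content = f.read()
- Write to the file: f.write(text). write() accepts only strings, so convert numbers with str() first.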
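The same operations can be sketched with a with block, which closes the file automatically even if an error occurs. The file name crawler_demo.txt is hypothetical, and the file is placed in the system temp directory so the example does not depend on any particular path.

```python
import os
import tempfile

# Hypothetical demo file in the temp directory
path = os.path.join(tempfile.gettempdir(), 'crawler_demo.txt')

# 'w' creates or overwrites the file; numbers must be converted with str()
with open(path, 'w', encoding='utf-8') as f:
    f.write('rank: ' + str(1) + '\n')

# 'r' opens the file for reading
with open(path, 'r', encoding='utf-8') as f:
    content = f.read()

print(content)
os.remove(path)     # clean up the demo file
```

Passing encoding='utf-8' explicitly avoids depending on the platform default, which matters when the scraped text contains Chinese characters.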

First crawlers

1. The requests library

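import requests

# Beijing listings page of Xiaozhu short-term rentals (小猪短租)
url = 'http://bj.xiaozhu.com/'

# Add a request header so the crawler presents itself as a browser,
# which makes it easier to fetch data from sites that check the User-Agent
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}

res = requests.get(url, headers=headers)
# Prints <Response [200]> when the request succeeds;
# a code such as 400 or 404 means the request failed
print(res)
# Print the page source
print(res.text)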

2. The BeautifulSoup library
Reference: 爬虫之soup.select()用法浅析 (a brief look at how soup.select() is used in crawlers)

Crawler 1: scraping the Kugou music chart

The data is selected with BeautifulSoup's .select() method.

import requests
from bs4 import BeautifulSoup
import time

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}

def get_info(url):
    wb_data = requests.get(url,headers=headers)
    soup = BeautifulSoup(wb_data.text,'lxml')
    ranks = soup.select('#rankWrap > div.pc_temp_songlist > ul > li > span.pc_temp_num')
    titles = soup.select('#rankWrap > div.pc_temp_songlist > ul > li > a')
    times = soup.select('#rankWrap > div.pc_temp_songlist > ul > li > span.pc_temp_tips_r > span')
    # The loop variable is named duration, not time, so it does not shadow the time module
    for rank,title,duration in zip(ranks,titles,times):
        # The title attribute has the form "singer - song"
        data = {
            'rank': rank.get_text().strip(),
            'singer': title.get("title").split('-')[0].strip(),
            'song': title.get("title").split('-')[1].strip(),
            'duration': duration.get_text().strip(),
            'link': title.get("href")
        }
        print(data)
        print(data['singer']+"  ",end=" ")
        print(data['song'])
        print("---------------------------------------------------------------------------")

if __name__ == '__main__':
    # The chart spans pages 1 to 23, so the range stops at 24
    urls = ['https://www.kugou.com/yy/rank/home/{}-8888.html'.format(i) for i in range(1,24)]
    for page,url in enumerate(urls, start=1):
        print('Page '+str(page)+':')
        get_info(url)
        time.sleep(0.5)    # pause between requests
      
# Selectors copied with the browser's Copy selector (scratch notes):
#rankWrap > div.pc_temp_songlist > ul > li:nth-child(4) > span.pc_temp_num

#rankWrap > div.pc_temp_songlist > ul > li:nth-child(1)
#rankWrap > div.pc_temp_songlist > ul > li:nth-child(1) > a

#rankWrap > div.pc_temp_songlist > ul > li:nth-child(1) > span.pc_temp_tips_r > span

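1. In the page URL, https://www.kugou.com/yy/rank/home/1-8888.html, the 1 is the page number; replace it with 2, 3, 4 and so on up to 23 to reach the other pages.
2. Copy selector in the browser gives #rankWrap > div.pc_temp_songlist > ul > li:nth-child(1) > span.pc_temp_num, which matches only the first entry's rank. Replacing li:nth-child(1) with li matches the rank of every entry on the page.
3. get_text() versus get(): get_text() returns the text inside a tag, while get('name') returns the value of the tag's attribute called name, such as href or title.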
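Notes 2 and 3 can be tried offline on a small HTML fragment. The sketch below assumes a made-up fragment that only imitates the structure of the chart page (the songs and links are invented), and it uses Python's built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Invented fragment that mimics the structure the selectors expect
html = """
<div id="rankWrap">
  <div class="pc_temp_songlist">
    <ul>
      <li><span class="pc_temp_num"> 1 </span><a href="/song/1.html" title="Singer A - Song A">Singer A - Song A</a></li>
      <li><span class="pc_temp_num"> 2 </span><a href="/song/2.html" title="Singer B - Song B">Singer B - Song B</a></li>
    </ul>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# li:nth-child(1) matches only the first list item
first_only = soup.select('#rankWrap > div.pc_temp_songlist > ul > li:nth-child(1) > span.pc_temp_num')
# plain li matches every list item
all_ranks = soup.select('#rankWrap > div.pc_temp_songlist > ul > li > span.pc_temp_num')
links = soup.select('#rankWrap > div.pc_temp_songlist > ul > li > a')

rank_texts = [r.get_text().strip() for r in all_ranks]   # text inside the tag
hrefs = [a.get('href') for a in links]                    # attribute value
titles = [a.get('title') for a in links]

print(len(first_only), rank_texts, hrefs, titles)
```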

Crawler 2: scraping the novel 斗破苍穹

import requests
from bs4 import BeautifulSoup
import re
import time

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}

# Output files: f for the BeautifulSoup version, f1 for the regex version.
# encoding='utf-8' avoids UnicodeEncodeError where the default encoding is not UTF-8
f = open('C:/Users/16579/Desktop/2.txt','w+',encoding='utf-8')
f1 = open('C:/Users/16579/Desktop/1.txt','w+',encoding='utf-8')

# Version 1: select the data with BeautifulSoup's .select() method
def get_chapter(url):
    res = requests.get(url,headers=headers)
    res.encoding = "utf-8"
    soup = BeautifulSoup(res.text, 'lxml')
    titles = soup.select('body > div.main > div.entry-tit > h1')
    bodys = soup.select('body > div.main > div.entry-text > div.m-post > p')
    # get_text() returns the text without the tags. str(tag).strip("</h1>") is unsafe:
    # strip() removes any of the characters < / h 1 >, including ones in the title itself
    title = titles[0].get_text().strip()
    print(title)
    f.write(title+'\n')
    for body in bodys:
        f.write('   '+body.get_text().strip()+'\n')
    f.write('\n')

# Version 2: select the data with regular expressions
def get_chapter1(url):
    res = requests.get(url,headers=headers)
    html = res.content.decode('utf-8')
    # findall returns a list; '第' is the literal first character of each chapter heading
    titles = re.findall('<h1>第(.*?)</h1>',html,re.S)
    contents = re.findall('<p>(.*?)</p>',html,re.S)
    print('第'+titles[0])
    f1.write('第'+titles[0]+'\n')
    # Drop the first and last <p> matches, which are treated as not part of the chapter text
    del contents[0]
    contents.pop()
    for content in contents:
        f1.write("  "+content+'\n')

# Collect the address of every chapter from the table-of-contents page
def get_all_chapter_links(url):
    res = requests.get(url,headers=headers)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'lxml')
    links = soup.select('body > div.main > div.xsbox.clearfix > ul > li > a')
    for count,link in enumerate(links, start=1):
        get_chapter1('http://www.doupoxs.com'+link.get("href"))
        print('Chapter ' + str(count) + ' done')
        time.sleep(1)

# Entry point
if __name__ == '__main__':
    urls = 'http://www.doupoxs.com/doupocangqiong/'
    get_all_chapter_links(urls)
    f.close()
    f1.close()

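Sometimes the crawl does not get through all the chapters: after running for a while it hangs because the connection fails. The point of failure varies, sometimes after a few dozen chapters and sometimes after more than a thousand. (I have not yet worked out how to fix this.)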
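One possible cause, offered as a guess rather than a diagnosis: requests.get waits indefinitely unless a timeout is passed, so a stalled connection can hang the whole script. A common remedy is to set a timeout and retry failed requests. The helper below is a minimal sketch of that idea; fetch_with_retry and its parameters are hypothetical names, and it takes the fetching function as an argument so that it can be tried without a network connection.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on any exception wait `delay` seconds and try again.

    Re-raises the last exception if all `retries` attempts fail.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(delay)
    raise last_error


# Offline demonstration with a fake fetcher that fails twice, then succeeds
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('simulated network failure')
    return 'page for ' + url

result = fetch_with_retry(flaky_fetch, 'http://example.com/chapter1', retries=3, delay=0)
print(result, len(calls))

# Assumed usage in the crawler (not run here):
# res = fetch_with_retry(lambda u: requests.get(u, headers=headers, timeout=10), url)
```

With a timeout in place, a stalled request raises an exception instead of blocking forever, and the retry loop gets a chance to try again.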

Crawler 3: scraping jokes from 糗事百科 (Qiushibaike)

The data is selected with regular expressions.

import requests
import re

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}
info_lists = []

def judgment_sex(class_name):
    if class_name == 'womenIcon':
        return 'female'
    else:
        return 'male'

def get_info(url):
    # Pass the headers; they were defined above but never sent
    res = requests.get(url,headers=headers)
    # Raw strings keep \D and \d from being read as string escapes
    ids = re.findall(r'<h2>(.*?)</h2>',res.text,re.S)
    levels = re.findall(r'<div class="articleGender \D+Icon">(.*?)</div>',res.text,re.S)
    sexs = re.findall(r'<div class="articleGender (.*?)">',res.text,re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>',res.text,re.S)
    laughs = re.findall(r'<span class="stats-vote"><i class="number">(\d+)</i>',res.text,re.S)
    # ' 评论' ("comments") is literal page text that follows the comment count
    comments = re.findall(r'<i class="number">(\d+)</i> 评论',res.text,re.S)
    # user_id rather than id, so the built-in id() is not shadowed
    for user_id,level,sex,content,laugh,comment in zip(ids,levels,sexs,contents,laughs,comments):
        info = {
            'id':user_id,               # user name
            'level':level,              # age
            'sex':judgment_sex(sex),    # gender
            'content':content,          # joke text
            'laugh':laugh,              # number of "funny" votes
            'comment':comment           # number of comments
        }
        info_lists.append(info)

if __name__ == '__main__':
    urls = ['http://www.qiushibaike.com/text/page/{}/'.format(str(i)) for i in range(1,36)]
    for url in urls:
        get_info(url)
    # Open the file once instead of once per entry; with encoding='utf-8'
    # the UnicodeEncodeError handler is no longer needed
    with open('C:/Users/16579/Desktop/三寸人间.txt', 'a+', encoding='utf-8') as f:
        for info_list in info_lists:
            f.write(info_list['id']+'\n')
            f.write(info_list['level'] + '\n')
            f.write(info_list['sex'] + '\n')
            f.write(info_list['content'] + '\n')
            f.write(info_list['laugh'] + '\n')
            f.write(info_list['comment'] + '\n\n')
            print(info_list)
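The patterns in this crawler depend on two details: the non-greedy (.*?) group and the re.S flag. The sketch below shows what each one changes, using an invented HTML string rather than the real page.

```python
import re

# Invented sample: the first <span> contains a line break
html = ('<div class="content">\n<span>line one\nline two</span>\n</div>'
        '<div class="content"><span>second</span></div>')

# Without re.S, "." does not match a newline, so the first span is missed
without_flag = re.findall(r'<span>(.*?)</span>', html)

# With re.S, "." also matches newlines, so both spans are found
with_flag = re.findall(r'<span>(.*?)</span>', html, re.S)

# A greedy (.*) runs from the first <span> to the last </span> in one match
greedy = re.findall(r'<span>(.*)</span>', html, re.S)

print(without_flag)
print(with_flag)
print(len(greedy))
```

Note also that zip() stops at the shortest of its input lists, so if one pattern matches fewer times than the others on a page, the extra items are silently dropped.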
