Python Learning, Day 1

Last modified: 2022-02-11 17:00:55


First published: 2021-11-27 10:22:50

Article link: https://blog.csdn.net/weixin_46649870/article/details/121559042


Zhihu Hot List

Finished result

[Image: screenshot of the finished result]

0. Preparation

0.1 Installing and configuring PyCharm
0.2 Installing Python and setting up the environment
0.3 Installing and using pip (optional: Python 2.7.9+ and Python 3.4+ already ship with pip)
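
Before moving on, it can help to confirm the interpreter version and check that pip and the two third-party packages used later are available. This is only a minimal sketch, assuming the packages are installed under their usual import names (`requests`, and `bs4` for BeautifulSoup). `check_environment` is a hypothetical helper name, not part of any library.

```python
import importlib.util
import sys


def check_environment(modules=("pip", "requests", "bs4")):
    """Return a dict mapping each module name to whether it can be imported."""
    # find_spec returns None when a top-level module is not installed
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_environment().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

Anything reported as missing can be installed with `pip install requests beautifulsoup4`.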

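1. Python syntax

If you already have experience with another programming language, there is no need to start from zero. Python's syntax is much simpler, so I suggest reading a complete demo straight through and looking up specific points only when you get stuck; this saves a lot of time.

A complete demo from an experienced Python developer
Python Basics Tutorial | Runoob (菜鸟教程)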

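2. Exploration and practice

2.1 Inspecting the page

Open the Zhihu hot list -> press F12 to open the developer tools -> click the button marked with the red box -> hover over the area marked with the black box -> look at the structure highlighted by the blue box in the developer tools

[Image: annotated screenshot of the page and the developer tools]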

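2.2 Analyzing the structure

A quick look shows that the hot-list entries are stored in a tags under the div tag with id='TopstoryContent', so we only need to extract the contents of those a tags.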
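
The structure described here can be tried offline on a small hand-written snippet before touching the real site. This is a minimal sketch: `SAMPLE_HTML` is a made-up stand-in, not Zhihu's real markup, and it only mirrors the one fact observed above (links inside the div with id `TopstoryContent`). `extract_links` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup that only imitates the observed structure
SAMPLE_HTML = """
<div id="TopstoryContent">
  <a href="https://example.com/q/1" title="First topic">First topic</a>
  <a href="https://example.com/q/2" title="Second topic">Second topic</a>
</div>
<a href="https://example.com/other" title="Outside">Outside</a>
"""


def extract_links(html):
    """Return (title, href) pairs for <a> tags inside div#TopstoryContent."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="TopstoryContent")
    if container is None:  # the div is missing, e.g. on an error page
        return []
    return [(a.get("title"), a.get("href")) for a in container.find_all("a")]


print(extract_links(SAMPLE_HTML))
```

The link outside the div is ignored, and a page without the div yields an empty list instead of raising an error.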

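2.3 Writing the code

Based on the structure observed in the previous step, start with a rough first version of the code

import requests
from bs4 import BeautifulSoup

url = 'http://www.zhihu.com/hot'  # Zhihu hot-list URL
resp = requests.get(url)  # request the page
html = resp.content  # raw response body (bytes)
html = str(html, 'utf-8')  # decode as UTF-8 so Chinese text is not garbled
# Get every a tag under the div tag with id=TopstoryContent
bf = BeautifulSoup(html, 'html.parser')
hop = bf.find('div', id="TopstoryContent").find_all('a')
# Print the data in hop
print(hop)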

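2.4 Getting past the anti-scraping check and login

Error: 'NoneType' object has no attribute 'find_all'
Setting a breakpoint shows that the request returns 403, which means Zhihu has anti-scraping measures and requires login. The 403 page does not contain the TopstoryContent div, so bf.find(...) returns None and the call to find_all fails. After some reading I found several ways around this; for now I take the simplest one and add a header and a cookie to the request (copy both from Request Headers, under the Headers tab of a request in the browser's Network panel).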

import requests
from bs4 import BeautifulSoup

cookie = '''xxx'''
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/80.0.3987.132 Safari/537.36",
           "Cookie": cookie}
url = 'http://www.zhihu.com/hot'  # Zhihu hot-list URL
resp = requests.get(url, headers=headers)  # request the page with the header and cookie
html = resp.content  # raw response body (bytes)
html = str(html, 'utf-8')  # decode as UTF-8 so Chinese text is not garbled
# Get every a tag under the div tag with id=TopstoryContent
bf = BeautifulSoup(html, 'html.parser')
hop = bf.find('div', id="TopstoryContent").find_all('a')
print(hop)

Running it again works without errors
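
Copying the user-agent and cookie by hand is easy to get wrong. One option is to paste the whole Request Headers block from the browser as text and convert it to a dict. This is a minimal sketch, assuming the pasted text has one `name: value` pair per line. `parse_raw_headers` is a hypothetical helper, and the header values shown are placeholders.

```python
def parse_raw_headers(raw):
    """Turn pasted 'name: value' lines into a dict usable as headers=..."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority"
        if not line or line.startswith(":"):
            continue
        # Split at the first colon only, so values like URLs stay intact
        name, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon at all
            headers[name.strip()] = value.strip()
    return headers


RAW = """
:authority: www.zhihu.com
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
cookie: xxx
"""

headers = parse_raw_headers(RAW)
print(headers)
```

The resulting dict can be passed directly as `requests.get(url, headers=headers)`.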

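3. Implementing the core feature

3.1 Creating the txt files

import codecs

# Folder where the hot-list files are saved
save_path = 'E:/PythonTest/eg1/zhihuTop'

# Loop over every a tag and read its content (the hot-list title)
for src in hop:
    try:
        # Get the hot-list title
        title = src.get('title')

        # Follow the link into the hot-list entry and fetch its content,
        # the same way the title was fetched
        goto = src.get('href')
        rep = requests.get(goto, headers=headers)
        contents = str(rep.content, 'utf-8')
        contentBF = BeautifulSoup(contents, 'html.parser')
        text = contentBF.find('span', itemprop='text')
        content = text.contents[0]

        # Use the hot-list title as the file name
        chapter = save_path + "/" + str(title) + ".txt"
        # Append the hot-list content to the txt file
        with codecs.open(chapter, 'a', encoding='utf-8') as f:
            f.write(str(text))
        print(content)
    except Exception as e:
        print(e)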

A test run shows that the content was fetched, but not all of it...
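
One thing the loop above does not handle is titles that are not valid file names. A title may contain characters such as `?` or `/`, which Windows rejects, and `src.get('title')` returns None for links without a title attribute. Below is a minimal sketch of a sanitizer, assuming it is acceptable to replace offending characters with an underscore. `safe_filename`, the fallback name `untitled`, and the length limit are hypothetical choices.

```python
import re

# Characters Windows does not allow in file names, plus a few control characters
_INVALID = re.compile(r'[\\/:*?"<>|\r\n\t]')


def safe_filename(title, fallback="untitled", max_length=100):
    """Return a file-system-friendly name derived from a hot-list title."""
    if not title:  # None or empty string
        return fallback
    # Replace invalid characters, then trim spaces and dots from both ends
    cleaned = _INVALID.sub("_", str(title)).strip(" .")
    return cleaned[:max_length] or fallback


print(safe_filename('What is "A/B" testing?'))
print(safe_filename(None))
```

In the loop, `chapter = save_path + "/" + safe_filename(title) + ".txt"` would then replace the direct use of `str(title)`.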

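3.2 Open problems:

3.2.1 Zhihu does not return the full content in a single request; the complete text appears only after clicking "Show all". A plain HTTP request cannot click a button while scraping; that needs a browser-automation tool such as Selenium, which I do not know how to use yet.

3.2.2 After studying Zhihu's page structure more carefully, I found that the hot-list home page already shows the complete content! That is simpler than the approach I first had in mind, so fetching the Zhihu hot list is solved. Getting content that sits behind the "Show all" button is still an open problem.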

3.3 Complete code

import codecs
import os

import requests
from bs4 import BeautifulSoup

# Folder where the downloaded hot-list entries are saved
save_path = 'E:/PythonTest/eg1/zhihuTop'
os.makedirs(save_path, exist_ok=True)  # create the folder if it does not exist yet

cookie = '''paste your own cookie here'''
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/80.0.3987.132 Safari/537.36",
           "Cookie": cookie}
url = 'https://www.zhihu.com/hot'
resp = requests.get(url, headers=headers)
html = resp.content
html = str(html, 'utf-8')
bf = BeautifulSoup(html, 'html.parser')
hop = bf.find('div', id="TopstoryContent").find_all('a')
for src in hop:
    try:
        title = src.get('title')
        # src.contents[1] is the second child of the a tag; its children are saved as the entry's content
        if src.contents[1]:
            hopContent = src.contents[1].contents
            chapter = save_path + "/" + str(title) + ".txt"
            with codecs.open(chapter, 'a', encoding='utf-8') as f:
                f.write(str(hopContent))
    except Exception as e:
        print(e)

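Finally

I am new to Python and still unfamiliar with many of its methods, so the code style may not be quite standard. If you spot mistakes, please point them out in the comments so we can improve together.
Zhihu is only the example used here; hot lists or other content on other sites can generally be fetched the same way.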
