Basic Web Crawler (Principles)

aiyuan7045

Tags: crawler, python

Original link: http://www.cnblogs.com/flawlessm/p/10537977.html

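Web crawler: a program (a script) that simulates a browser and browses web pages automatically.

Purpose: automatically collect the resources you need, in bulk.

Environment: Python 3

Module: requests (a third-party library)

Installation: open cmd and run pip install requests

Example: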

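import re

import requests

# Download a web page (requests needs the scheme, e.g. http://, in the URL)
url = 'http://www.jingcaiyuedu.com'

# Send an HTTP request the way a browser would
response = requests.get(url)

# Set the encoding used to decode the response body
response.encoding = 'utf-8'

# HTML source of the target novel's home page
html = response.text

# Title of the novel
title = re.findall(r'<meta property="og:title" content="(.*?)"/>', html)[0]

# Get each chapter's info (chapter name, url)
dl = re.findall(r'<dl id="list">.*?</dl>', html, re.S)[0]

chapter_info_list = re.findall(r'href="(.*?)">(.*?)<', dl)

print(chapter_info_list)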
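To see what the regular expressions in the example extract without touching the network, the following is a minimal sketch that runs them against a small, made-up page source. SAMPLE_HTML, parse_novel, and the chapter paths are hypothetical names introduced only for illustration. Resolving the links with urljoin rests on the assumption that the site uses relative chapter links, which the original does not confirm.

```python
import re
from urllib.parse import urljoin

# Hypothetical page source, shaped like what the example's patterns look for.
SAMPLE_HTML = '''
<meta property="og:title" content="Sample Novel"/>
<dl id="list">
<dd><a href="/novel/1.html">Chapter 1</a></dd>
<dd><a href="/novel/2.html">Chapter 2</a></dd>
</dl>
'''


def parse_novel(html, base_url):
    """Return (title, [(absolute_url, chapter_name), ...]) from page source."""
    title = re.findall(r'<meta property="og:title" content="(.*?)"/>', html)[0]
    # re.S lets "." match newlines, so the whole <dl> block is captured.
    dl = re.findall(r'<dl id="list">.*?</dl>', html, re.S)[0]
    chapters = re.findall(r'href="(.*?)">(.*?)<', dl)
    # Assumption: chapter links may be relative, so resolve them against base_url.
    return title, [(urljoin(base_url, href), name) for href, name in chapters]


title, chapters = parse_novel(SAMPLE_HTML, 'http://www.jingcaiyuedu.com')
print(title)
print(chapters)
```

Note that indexing with [0] raises IndexError when a pattern finds nothing, so on a real page it may be worth checking that the list is non-empty first.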

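Steps for developing a crawler:

- Target data: the pages of the website

- Analyze how the data is loaded: work out which URL the target data comes from

- Download the data

- Clean and process the data

- Persist the data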
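The last three steps can be sketched as three small functions. This is only one possible arrangement, not the author's implementation: the function names download, clean, and persist, the 10-second timeout, and the choice of JSON as the storage format are all assumptions made for illustration.

```python
import json
import re


def download(url):
    """Download step: fetch the page source (needs the third-party requests library)."""
    import requests  # imported here so the offline helpers below work without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    response.encoding = 'utf-8'
    return response.text


def clean(html):
    """Cleaning step: extract link/name pairs and strip stray whitespace."""
    pairs = re.findall(r'href="(.*?)">(.*?)<', html)
    return [{'url': href.strip(), 'name': name.strip()} for href, name in pairs]


def persist(records, path):
    """Persistence step: save the cleaned records as UTF-8 JSON."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
```

For instance, persist(clean(download(url)), 'chapters.json') would chain the three steps; the file name chapters.json is only illustrative.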

Reprinted from: https://www.cnblogs.com/flawlessm/p/10537977.html
