A First Web Crawler

Latest recommended article published on 2024-07-24 10:38:58

SZTU_青衫酒

Category: Python Web Crawlers. Tags: crawler

Article link: https://blog.csdn.net/weixin_62135607/article/details/127655247

Step 1: Fetch the page

#!/usr/bin/python
# coding: utf-8

import requests  # import the requests package

link = "http://www.santostang.com/"  # URL of the target page
# Set a browser User-Agent in the request headers so the request looks like it comes from a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

r = requests.get(link, headers=headers)  # request the page
print(r.text)  # r.text holds the HTML source of the fetched page

Result: the HTML source of the page is printed to the console.
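
The listing above assumes the request always succeeds. As a minimal sketch of a more defensive version, the helper below adds a timeout and a status check. The name `fetch_html` and the 10-second default are assumptions made for illustration and are not part of the original post.

```python
import requests

# Same User-Agent string as in the listing above.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
                  "Gecko/20091201 Firefox/3.5.6"
}


def fetch_html(url, timeout=10):
    """Return the page's HTML as a string, or None if the request fails.

    fetch_html is a hypothetical helper; the timeout value is an assumption.
    """
    try:
        r = requests.get(url, headers=HEADERS, timeout=timeout)
        r.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, and bad status codes.
        print(f"Request failed: {exc}")
        return None
    return r.text
```

Calling `fetch_html("http://www.santostang.com/")` should return the same text that `r.text` holds in the original listing, or `None` when the site cannot be reached.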

Step 2: Extract the data you need

#!/usr/bin/python
# coding: utf-8

import requests
from bs4 import BeautifulSoup  # import BeautifulSoup from the bs4 library

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "html.parser")  # parse the HTML with BeautifulSoup
# Find the first article title: locate the h1 element whose class is "post-title",
# take its <a> child, read the text inside it, and strip() the surrounding whitespace
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)

Result: the title of the first article is printed.
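
`soup.find` returns only the first match, and it returns `None` when nothing matches, so chaining `.a.text` onto it can raise an `AttributeError`. The sketch below shows one way to collect every title with `find_all` and skip headings that have no link. The sample markup and the name `extract_titles` are made up for illustration; the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the selector above expects.
SAMPLE_HTML = """
<article><h1 class="post-title"><a href="/a"> First post </a></h1></article>
<article><h1 class="post-title"><a href="/b">Second post</a></h1></article>
<article><h1 class="post-title">No link here</h1></article>
"""


def extract_titles(html):
    """Return the link text of every h1.post-title that contains an <a>."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for h1 in soup.find_all("h1", class_="post-title"):
        if h1.a is not None:  # skip headings without a link
            titles.append(h1.a.get_text(strip=True))
    return titles


print(extract_titles(SAMPLE_HTML))
```

Passing `r.text` from the listing above instead of `SAMPLE_HTML` would apply the same extraction to the live page.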

Step 3: Store the data

import requests
from bs4 import BeautifulSoup  # import BeautifulSoup from the bs4 library

link = "http://www.santostang.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "html.parser")  # parse the HTML with BeautifulSoup
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)

# Open title_test.txt in append mode (the file is created if it does not exist),
# then use f.write to write the title string; encoding="utf-8" keeps non-ASCII titles portable
with open('title_test.txt', "a+", encoding="utf-8") as f:
    f.write(title)

Result: the title is printed and also written to title_test.txt.
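
Because the file is opened in append mode, running the script several times writes the titles back to back on one line. A minimal sketch of one way around this is to write one title per line and read them back as a list. The helper names `save_title` and `load_titles` are assumptions for illustration, not part of the original post.

```python
from pathlib import Path


def save_title(title, path="title_test.txt"):
    """Append one title per line, encoded as UTF-8 (hypothetical helper)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\n")


def load_titles(path="title_test.txt"):
    """Read the saved titles back as a list of strings (hypothetical helper)."""
    return Path(path).read_text(encoding="utf-8").splitlines()
```

In the script above, `save_title(title)` could take the place of the `with open(...)` block, and `load_titles()` would then return every title saved so far.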