A Python Beginner's Study Diary - Crawler Without a Shadow - Scraping dynamic web page content with BeautifulSoup + CSS selectors: Knewone

xiaofeng1qaz

Categories: Python Learner, Crawler Learner, Data Analysis
Tags: Python, web crawler

Article link: https://blog.csdn.net/xiaofeng1qaz/article/details/79571490

Blog, day two

Test page: Knewone, https://knewone.com/discover?page=

Goal: scrape the title, the img, and the href of each item in the first section of the page.
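To make the goal concrete, here is a minimal sketch of pulling those three fields out of a single item with CSS selectors. The HTML fragment is invented for illustration and the real Knewone markup may differ; it is shaped only to match the selectors used in the code further down. The name `extract_items` is also hypothetical, and the built-in `html.parser` is used here so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one item shaped to match the selectors in this post.
SAMPLE_HTML = """
<article>
  <a class="cover-inner" href="/things/example">
    <img src="https://example.com/cover.jpg">
  </a>
  <section class="content">
    <h4><a href="/things/example" title="Example thing">Example thing</a></h4>
  </section>
</article>
"""


def extract_items(html):
    """Return one dict per item with its img src, title and href."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
    imgs = soup.select("a.cover-inner > img")         # cover images
    anchors = soup.select("section.content > h4 > a")  # title links
    return [
        {"img": img.get("src"), "title": a.get("title"), "link": a.get("href")}
        for img, a in zip(imgs, anchors)
    ]


items = extract_items(SAMPLE_HTML)
print(items)
```

The `>` in each selector matches direct children only, so `a.cover-inner > img` picks up the image sitting immediately inside the cover link and ignores images elsewhere on the page.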

Environment: Python 3.5, Jupyter Notebook

Packages: requests, BeautifulSoup, time, pandas

Code (with pagination):

import requests
from bs4 import BeautifulSoup as bs
import time
import pandas as pd

url = 'https://knewone.com/discover?page='
info = []  # collected records, one dict per item

def getpage(url, data=None):  # fetch and parse one page
    web = requests.get(url)  # fetch the page
    soup = bs(web.text, 'lxml')  # parse the page
    # print(soup)  # uncomment to inspect the fetched and parsed result
    imgs = soup.select('a.cover-inner > img')  # image elements
    titles = soup.select('section.content > h4 > a')  # title elements
    links = soup.select('section.content > h4 > a')  # item links; the href is not a full URL and still needs the site prefix
    if data is None:
        for img, title, link in zip(imgs, titles, links):
            data = {  # one record per item
                'img': img.get('src'),
                'title': title.get('title'),
                'link': link.get('href')
            }
            info.append(data)
            print(data)

def getmorpages(start, end):  # pagination
    for one in range(start, end):
        getpage(url + str(one))
        time.sleep(1)  # pause one second between requests

getmorpages(1, 10)  # fetches pages 1 to 9, since range() excludes the end value

df = pd.DataFrame(info)  # save the results to a local file
df.to_csv('1-3练习-爬取动态网页.csv')
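As the comment on `links` notes, the scraped `href` values are not full URLs and need the site prefix. One possible way to add it is `urllib.parse.urljoin` from the standard library. This is only a sketch: `BASE_URL` is taken from the test page address, while the helper name `absolute_link` and the path `/things/example` are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://knewone.com/"  # site root, taken from the test page URL


def absolute_link(href, base=BASE_URL):
    """Resolve a relative href against the site root; absolute URLs pass through."""
    return urljoin(base, href)


# "/things/example" is a made-up path used only for illustration.
print(absolute_link("/things/example"))
```

Inside `getpage`, the same call could wrap `link.get('href')` before the record is stored, so the CSV already contains clickable links.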