Python Crawler: Step by Step, Part 1

柠檬大叔

Tags: python selenium chrome

Article link: https://blog.csdn.net/weixin_41670197/article/details/105691434

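Crawlers come in two kinds: web-page crawlers and API crawlers. A web-page crawler collects data the way we normally browse a page, while an API crawler calls the target server's interfaces directly to get the data it wants.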
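To make the API-crawler side concrete, here is a minimal sketch. The endpoint `https://example.com/api/search`, its `keyword` and `page` parameters, and the JSON shape are all hypothetical, made up for illustration; they are not NetEase's real interface.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only (not a real NetEase API).
SEARCH_API = "https://example.com/api/search"


def build_search_url(keyword, page=1):
    """Build the request URL; urlencode percent-encodes non-ASCII keywords."""
    return f"{SEARCH_API}?{urlencode({'keyword': keyword, 'page': page})}"


def parse_products(body):
    """Extract (name, price) pairs from a JSON body.

    Assumes a shape like {"items": [{"name": ..., "price": ...}]}.
    """
    payload = json.loads(body)
    return [(item["name"], item["price"]) for item in payload.get("items", [])]


# A made-up response body standing in for what such an API might return.
sample_body = '{"items": [{"name": "Casual pants", "price": 129}]}'
print(build_search_url("裤子"))
print(parse_products(sample_body))
```

A real crawler would fetch the URL (for example with `urllib.request.urlopen`) and hand the response body to `parse_products`. No browser is involved, which is why API crawlers tend to be faster than the Selenium approach used in this post.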

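This is only a demo. Malicious crawling is illegal, so take care, everyone.

Today let's have a little fun with a Python crawler and scrape the pants listings from NetEase Yanxuan (网易严选).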

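Tools used

python: the language
selenium: a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. A download link is attached.
chromedriver: works together with the selenium library (an automated testing tool that drives the browser to perform actions such as clicking and scrolling). Chrome can only be driven once ChromeDriver is installed.
Tip1: here is the chromedriver download page; download the version that matches your Chrome
Tip2: for installation on Windows, see here
Tip3: on macOS, drop the downloaded file straight into /usr/local/bin/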
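Because the driver has to match the browser, it can help to compare major versions before running anything. A small sketch, assuming version strings shaped like the output of `chromedriver --version`; the sample strings are illustrative, and `major_version` and `versions_match` are hypothetical helper names, not part of selenium:

```python
import re


def major_version(version_text):
    """Return the major version from text such as 'ChromeDriver 114.0.5735.90 (...)'."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def versions_match(chrome_text, driver_text):
    """True when the browser and the driver share the same major version."""
    return major_version(chrome_text) == major_version(driver_text)


# Illustrative strings; paste in what your own machine reports.
print(versions_match("Google Chrome 114.0.5735.198", "ChromeDriver 114.0.5735.90"))
print(versions_match("Google Chrome 115.0.5790.102", "ChromeDriver 114.0.5735.90"))
```

The browser's version is shown on the chrome://version page, and the driver's version is printed by `chromedriver --version`.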

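Approach:

Open NetEase Yanxuan --> log in --> search for '裤子' (pants) in the search box --> save the title, image, and price of every product on the page to a local Excel file (the script below does not automate the login step)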

Let's get started:

Create a new project
Create index.py


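from selenium import webdriver
from selenium.webdriver.common.by import By

import xlwt


# Return every element matching the CSS selector (an empty list if none match)
def find(driver, selector):
    # find_elements never raises NoSuchElementException; it returns [] instead,
    # so no try/except is needed. Selenium 4 removed find_elements_by_css_selector
    # in favor of this form.
    return driver.find_elements(By.CSS_SELECTOR, selector)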


# chromedriver is looked up on PATH (e.g. /usr/local/bin); recent Selenium 4
# releases no longer accept the executable_path argument
driver = webdriver.Chrome()
driver.set_window_size(1366, 768)

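url = "https://you.163.com"  # target URL

# Implicit wait: if WebDriver cannot find an element in the DOM right away,
# it keeps retrying for up to 10 seconds before giving up.
# (An explicit WebDriverWait(driver, 10, 0.5) is the alternative.)
driver.implicitly_wait(10)
print('Opening the page')
driver.get(url)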



# Find the search box
inputView = find(driver, '.yx-cp-searchInput')
# Type the search keyword
inputView[0].send_keys('裤子')
# Click the search button
find(driver, '.yx-cp-searchButton')[0].click()


# Create an Excel workbook with a header row (name, price, image)

work_book = xlwt.Workbook(encoding='utf-8')
sheet = work_book.add_sheet('网易严选-裤子')
sheet.write(0, 0, '名称')
sheet.write(0, 1, '价格')
sheet.write(0, 2, '图片')

# Find every product on the results page
mainEl = '.m-searchResult div:nth-of-type(2) > .resultList > ul '  # shared selector prefix
dataList = find(driver, mainEl + '> li')
# Row 0 holds the headers, so product rows start at 1
for index, item in enumerate(dataList, start=1):
    # Search inside the current <li> instead of re-querying the whole page with nth-child
    title = find(item, '.bd > .name')[0].text
    price = find(item, '.bd > .price > span:nth-child(2)')[0].text
    img = find(item, '.hd > a > img')[0].get_attribute('src')
    sheet.write(index, 0, title)
    sheet.write(index, 1, price)
    sheet.write(index, 2, img)
    print('Item ' + str(index))

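# The target directory must already exist
path = '/Users/hanyu/Desktop/网易数据'
work_book.save(path + '/' + '网易严选-裤子' + '.xls')
print('Saved successfully')
driver.quit()  # close the browser and stop chromedriver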
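One fragile spot in the script is the `[0]` indexing: if a product card lacks a price or an image, the lookup returns an empty list and `[0]` raises IndexError. Below is a minimal sketch of a more forgiving extraction step, assuming the same selectors as the script. `first` and `extract_row` are hypothetical helper names, and the locator strategy is passed as the plain string "css selector", which is the value of selenium's `By.CSS_SELECTOR`.

```python
CSS = "css selector"  # same value as selenium's By.CSS_SELECTOR


def first(parent, selector):
    """Return the first element under parent matching selector, or None."""
    matches = parent.find_elements(CSS, selector)
    return matches[0] if matches else None


def extract_row(item):
    """Return (title, price, image URL) for one product <li>.

    Any piece that is missing from the card becomes an empty string
    instead of raising IndexError.
    """
    name = first(item, ".bd > .name")
    price = first(item, ".bd > .price > span:nth-child(2)")
    img = first(item, ".hd > a > img")
    return (
        name.text if name else "",
        price.text if price else "",
        (img.get_attribute("src") or "") if img else "",
    )
```

Inside the loop, `title, price, img = extract_row(item)` would replace the three separate lookups, and the `sheet.write` calls stay unchanged.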
