Learning resources:
https://www.zhihu.com/question/47883186
Tools: I use Python 2.7 with Spyder, bundled together via Anaconda. To install a package, search for it at https://anaconda.org/meloncholy/ , then Shift + right-click an empty spot on the desktop to open a command prompt and run the install command there.
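For example, the install commands look like the following (the exact package names and channel availability are assumptions; check the search results on anaconda.org for the package you actually need):

```shell
# Install from a specific anaconda.org channel with -c
conda install -c meloncholy selenium
# Common packages used by the scraper below are also on the default channels
conda install beautifulsoup4 pandas lxml
```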
When matching a ChromeDriver build to your Chrome version, an approximate match is good enough. A version mapping table is available at http://blog.csdn.net/huilan_same/article/details/51896672?locationNum=11&fps=1 ; download the matching build from there. If you need a different version of Chrome, leave a comment, though my collection is not necessarily complete.
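To check which builds you already have before consulting the mapping table (assuming both binaries are on your PATH; on Windows, Chrome's version is also visible under Help → About Google Chrome):

```shell
chromedriver --version
```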
The source code I wrote to scrape the page data:
# -*- coding: utf-8 -*-
"""
Created on Wed Aug 09 23:09:29 2017
@author: Administrator
"""
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
from bs4 import BeautifulSoup
import csv, time
import pandas as pd

driver = webdriver.Chrome()
first_url = 'http://www.yidianzixun.com/channel/c6'
driver.get(first_url)
time.sleep(5)

# Click the refresh icon once, then repeatedly send DOWN to scroll
# the feed so that more items lazy-load into the page.
driver.find_element_by_class_name('icon-refresh').click()
for i in range(1, 90):
    driver.find_element_by_class_name('icon-refresh').send_keys(Keys.DOWN)
    time.sleep(3)

# Parse the fully loaded page and extract title, source, comment count
# and article link from each feed item.
soup = BeautifulSoup(driver.page_source, 'lxml')
articles = []
for article in soup.find_all(class_='item doc style-small-image style-content-middle'):
    title = article.find(class_='doc-title').get_text()
    source = article.find(class_='source').get_text()
    comment = article.find(class_='comment-count').get_text()
    link = 'http://www.yidianzixun.com' + article.get('href')
    articles.append([title, source, comment, link])
driver.quit()

#data = pd.to_datetime(articles)
# Write the scraped rows to CSV. In Python 2 the csv module needs the
# file opened in binary mode, otherwise Windows inserts blank lines.
with open('yidian.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['文章标题', '作者', '评论数', '文章地址'])
    for row in articles:
        writer.writerow(row)
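The extraction step in the middle of the script can be exercised offline against a small static HTML snippet, which is handy for debugging the selectors without launching a browser. The markup below is an assumption that only mirrors the class names the scraper targets, not a capture of the real page:

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Hypothetical stand-in for one feed item, using the same class names
# the scraper searches for on the live page.
sample_html = """
<a class="item doc style-small-image style-content-middle" href="/article/0Abc1234">
  <div class="doc-title">Sample headline</div>
  <span class="source">Sample source</span>
  <span class="comment-count">42</span>
</a>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
rows = []
# Passing the full space-separated class string matches the complete
# class attribute, exactly as in the scraper above.
for article in soup.find_all(class_='item doc style-small-image style-content-middle'):
    title = article.find(class_='doc-title').get_text()
    source = article.find(class_='source').get_text()
    comment = article.find(class_='comment-count').get_text()
    link = 'http://www.yidianzixun.com' + article.get('href')
    rows.append([title, source, comment, link])

print(rows)
```

If a selector is wrong, `find` returns `None` and `get_text()` raises an AttributeError, so this kind of small fixture surfaces selector typos immediately.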