Python Beginner Crawler Example: [Getting Started with Python] A Crawler Project Walkthrough

Latest recommended article published on 2021-06-03 12:58:24

weixin_39938935


Tags: Python beginner crawler example

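1. HTML Crash Course

HTML: HyperText Markup Language.

The first line of the document (typically the <!DOCTYPE html> declaration) states that this is an HTML document. The root tag is html, which contains head and body: head holds header information about the page, and body describes what the page should render.

The character encoding is declared as UTF-8.

The front-end technology stack:

html

css: Cascading Style Sheets

js: JavaScript

Tree relationships: ancestor, parent, child, sibling, descendant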
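To make the tree structure concrete, here is a minimal sketch that uses Python's standard html.parser module to print each tag indented by its depth. The sample page, the VOID_TAGS set, and the TreePrinter class are made up for illustration only.

```python
from html.parser import HTMLParser

# A small, made-up page used only for illustration.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Demo</title>
</head>
<body>
<div><p>Hello</p><p>World</p></div>
</body>
</html>"""

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"meta", "br", "img", "link", "input", "hr"}


class TreePrinter(HTMLParser):
    """Records every opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


printer = TreePrinter()
printer.feed(SAMPLE_HTML)
print("\n".join(printer.lines))
```

In the printed tree, head and body are children of html and siblings of each other, and the two p tags are descendants of body.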

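2. XPath

/: selects from the root node

//: selects matching nodes anywhere below the current node, regardless of their position in the document

.: selects the current node

..: selects the parent of the current node

@: selects attributes

Examples:

/html: selects the root element html

body/div: selects all div elements that are children of body

//div: selects all div elements, wherever they are in the HTML document

@lang: selects all attributes named lang

Wildcards

*: matches any element node

@*: matches any attribute node

Examples:

//*: selects all elements in the document

//title[@*]: selects all title elements that have at least one attribute

|: in a path expression, | is the union operator; for example, //body/div | //body/li selects all div elements and all li elements under body

//div | //li: selects all div and li elements in the document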
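The expressions above can be tried without installing anything by using xml.etree.ElementTree from the standard library. Note the assumptions: ElementTree implements only a subset of XPath (paths are relative to an element, and there is no | operator), so this is a rough sketch of the ideas rather than full XPath; the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample document.
SAMPLE = """<html>
  <body>
    <div lang="en">first</div>
    <div><p title="note">nested</p></div>
    <ul><li>item</li></ul>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# body/div: div elements that are direct children of body
body_divs = root.findall("./body/div")

# //div: every div below the current node, at any depth
all_divs = root.findall(".//div")

# //*[@lang]: every element that carries a lang attribute
with_lang = root.findall(".//*[@lang]")

# //*: every element below the root
all_elements = root.findall(".//*")

# ElementTree has no "|" operator, so the union is built by hand.
divs_and_lis = root.findall(".//div") + root.findall(".//li")

# ..: the parent of the p element
parent_of_p = root.find(".//p/..")

print(len(body_divs), len(all_divs), len(with_lang), len(all_elements))
print([el.tag for el in divs_and_lis], parent_of_p.tag)
```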

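3. Introducing BeautifulSoup

What is BeautifulSoup?

It is a Python library for extracting data from HTML or XML files.

Install command: pip install beautifulsoup4

Enter the install command above in PyCharm's Terminal window to install it.

Here I opened a random stock on 同花顺 (Tonghuashun), went to Company Profile -> Executives, used F12 to find the corresponding HTML file, and saved it locally as 000004.html.

Then write the parsing code:

'''What is BeautifulSoup?
It is a Python library for extracting data from HTML or XML files.
pip install beautifulsoup4'''
from bs4 import BeautifulSoup

html_doc = "E:/Python/PythonStudy/000004.html"
# The saved file is GBK-encoded, so it must be opened as GBK.
with open(html_doc, "r", encoding="gbk") as html_file:
    html_handle = html_file.read()

soup = BeautifulSoup(html_handle, 'html.parser')
print(soup)

Result:

Note: the downloaded document is GBK-encoded, so forcing UTF-8 when opening it raises an error.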
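The encoding error can be reproduced in isolation. This is a small sketch using only built-in string methods; the sample text is arbitrary and stands in for the contents of the saved file.

```python
# Arbitrary sample text standing in for the saved page's contents.
text = "高管介绍"
gbk_bytes = text.encode("gbk")

# Decoding GBK bytes as UTF-8 fails, which is the error described above.
try:
    gbk_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("UTF-8 decode failed:", exc)

# Decoding with the encoding the bytes were written in works.
decoded = gbk_bytes.decode("gbk")
print(decoded)
```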

4. Using the Selectors in BeautifulSoup

'''What is BeautifulSoup?
It is a Python library for extracting data from HTML or XML files.
pip install beautifulsoup4'''
from bs4 import BeautifulSoup
import re

html_doc = "E:/Python/PythonStudy/000004.html"
with open(html_doc, "r", encoding="gbk") as html_file:
    html_handle = html_file.read()

soup = BeautifulSoup(html_handle, 'html.parser')
# print(soup)

# Get the document head
# print(soup.head)

# Get the first matching node in the document
print(soup.p)

# Get the attributes of that node
print(soup.p.attrs)

# Get all matching nodes
ps = soup.find_all("p")
# print(ps)

# Locate by id
result = soup.find_all(id="quotedata")
print(result)

# Search by CSS class
jobs = soup.find_all("td", class_="jobs")
print(jobs)

names = soup.find_all("a", class_="turnto")
print(names)

# Pull out the 2-5 character names that sit between ">" and "</a>"
# (the closing tag is part of the pattern so that no markup is captured)
r = re.findall(r">([^<>]{2,5})</a>", str(names))
print(r)
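The regular-expression step at the end can be checked on a plain string. In this sketch, names_text is a made-up stand-in for str(names); it compares a loose pattern, which simply takes up to five characters after any >, with a stricter one that stops at the closing </a> tag.

```python
import re

# Made-up stand-in for str(names): two <a class="turnto"> tags in a list.
names_text = ('[<a class="turnto" href="#">张三</a>, '
              '<a class="turnto" href="#">欧阳小明</a>]')

# Loose pattern: any 2-5 characters after ">", which also swallows markup.
loose = re.findall(r">(.{2,5})", names_text)

# Stricter pattern: 2-5 non-bracket characters that are followed by "</a>".
strict = re.findall(r">([^<>]{2,5})</a>", names_text)

print(loose)
print(strict)
```

When the elements are already available as BeautifulSoup tags, calling get_text() on each tag avoids the regular expression altogether.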

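5. Setting Up Scrapy

Run pip install scrapy in PyCharm to install Scrapy. In my case it failed with error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools". A web search turned up two fixes. One is to download Microsoft Visual C++ 14.0 from the official site, as the error message suggests. The other is to download the Twisted wheel that matches your setup, Twisted‑18.9.0‑cp37‑cp37m‑win_amd64.whl, from https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted (the number after cp is the Python version and amd64 means 64-bit), then run pip install E:\Python\Twisted-18.9.0-cp37-cp37m-win_amd64.whl (use the directory where you saved the downloaded Twisted file).

After installing the manually downloaded Twisted, run pip install scrapy again to install Scrapy.

import scrapy

html = scrapy.Request("http://stockpage.10jqka.com.cn/600004/company/#detail")
print(html)

If the request object is printed, Scrapy has been installed successfully.

How do Scrapy and Beautiful Soup differ? Beautiful Soup is just a library, a third-party Python library, whereas Scrapy is a framework. The difference is this: with a library, you bring it into your project and call it directly from your own code; a framework does much of the work for you automatically, and you fill in your own logic at the places the framework provides.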
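The library-versus-framework distinction can be sketched with a toy example. Everything here is hypothetical: extract_numbers stands in for a library function, and MiniFramework is a made-up class that is not part of Scrapy; it only mimics the idea that the framework owns the loop and calls back into your code.

```python
# Library style: our code is in charge and calls the helper when it wants to.
def extract_numbers(text):
    """Hypothetical stand-in for a library function."""
    return [int(token) for token in text.split() if token.isdigit()]


library_result = extract_numbers("page 1 of 20")


# Framework style: the framework owns the loop and calls back into our code.
class MiniFramework:
    """Toy 'framework' (not Scrapy) that drives a user-supplied callback."""

    def __init__(self, pages):
        self.pages = pages

    def run(self, callback):
        results = []
        for page in self.pages:
            # The framework decides when the callback runs.
            results.append(callback(page))
        return results


def parse(page):
    """Our only job is to fill in what happens for each page."""
    return page.upper()


framework_result = MiniFramework(["a", "b"]).run(parse)
print(library_result, framework_result)
```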

At this point, stock_spider has been created under part6.

The right way to continue is to open this project in PyCharm.

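6. How Scrapy Is Used

Go into the stock_spider directory and create a spider with: scrapy genspider tonghuashun http://stockpage.10jqka.com.cn/600004/company/#detail

However, the shell kept reporting that scrapy is not recognized as an internal or external command, so I had to reinstall scrapy under my Python installation directory.

Then run the spider-creation command again: scrapy genspider tonghuashun http://stockpage.10jqka.com.cn/600004/company/#detail

Open tonghuashun.py

Open the page http://stockpage.10jqka.com.cn/600004/company/#detail, press F12, locate the element, then right-click -> Copy -> Copy XPath to get the path of the node to be parsed.

tonghuashun.py looks like this:

# -*- coding: utf-8 -*-
import scrapy


class TonghuashunSpider(scrapy.Spider):
    name = 'tonghuashun'
    allowed_domains = ['stockpage.10jqka.com.cn']
    start_urls = ['http://stockpage.10jqka.com.cn/600004/company/#detail/']

    def parse(self, response):
        # //*[@id="ml_001"]/table/tbody/tr[1]/td[1]/a
        res_selector = response.xpath("//*[@id=\"ml_001\"]/table/tbody/tr[1]/td[1]/a")
        print(res_selector)

For testing, create main.py under stock_spider with the following content; it is used to run, debug and verify the spider.

Running it raised "ModuleNotFoundError: No module named 'win32api'", so I went to the Python installation directory and ran pip install pywin32.

But after running main.py, nothing was output.

Analysing the page with F12 showed that the real URL is http://basic.10jqka.com.cn/600004/company.html; the XPath itself was correct.

Dynamic page: the page content is fetched from a database or some other source and then rendered

Static page: what you see is what you get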

After the change, tonghuashun.py looks like this:

# -*- coding: utf-8 -*-
import scrapy


class TonghuashunSpider(scrapy.Spider):
    name = 'tonghuashun'
    allowed_domains = ['stockpage.10jqka.com.cn']
    # start_urls = ['http://stockpage.10jqka.com.cn/600004/company/#detail/']
    start_urls = ['http://basic.10jqka.com.cn/600004/company.html']

    def parse(self, response):
        # //*[@id="ml_001"]/table/tbody/tr[1]/td[1]/a
        res_selector = response.xpath("//*[@id=\"ml_001\"]/table/tbody/tr[1]/td[1]/a")
        print(res_selector)

'''Dynamic page: the page content is fetched from a database or some other source and then rendered
Static page: what you see is what you get'''

The result is as follows:

Of course, what we actually want is the text inside the a tag. How do we get it? Just append /text() to the selector.

The text values can then be obtained with res_selector.extract(), as shown below:

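7. Locating Elements

Debugging element location the way shown above is slow. Even before creating the spider, the scrapy shell command is available: it gives direct feedback on whether an element has been located, so it can be used to debug XPath expressions.

Run scrapy shell http://basic.10jqka.com.cn/600004/company.html, then enter response.xpath("//*[@id=\"ml_001\"]/table/tbody/tr[1]/td[1]/a/text()").extract() to check whether the element is found.

Next, work out how to locate all the directors:

Then put this code into the project:

Here is a further example: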

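8. The Crawler

Crawling 同花顺 too quickly may get our local IP banned by the site, and since we are only here to learn, the exercises below use http://pycs.greedyai.com/ instead.

First, create a spider.

Modify stock.py as follows:

# -*- coding: utf-8 -*-
import scrapy
from urllib import parse


class StockSpider(scrapy.Spider):
    name = 'stock'
    # Domain only: no scheme and no trailing slash
    allowed_domains = ['pycs.greedyai.com']
    start_urls = ['http://pycs.greedyai.com/']

    def parse(self, response):
        post_urls = response.xpath("//a/@href").extract()
        for post_url in post_urls:
            yield scrapy.Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail, dont_filter=True)

    def parse_detail(self, response):
        print("callback invoked")

url=parse.urljoin(response.url, post_url): joins the link with the base URL. If post_url already contains a domain it is left unchanged; if it does not, the domain of response.url is prepended.

callback=self.parse_detail: names the function that will parse the response to this request

dont_filter=True: tells Scrapy not to filter this request, so it bypasses the duplicate-request filter (and the offsite check against allowed_domains)

yield: hands each request over to Scrapy for processing. It resembles return, except that the function is not finished: the loop resumes and yields the next request.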
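Both urljoin and yield can be tried outside Scrapy. In this sketch the hrefs are made up and build_requests is a hypothetical helper that yields plain URL strings where the spider would yield scrapy.Request objects.

```python
import types
from urllib import parse


def build_requests(base_url, hrefs):
    """Yield absolute URLs one at a time, the way parse() yields requests."""
    for href in hrefs:
        # urljoin keeps absolute links as they are and resolves relative ones.
        yield parse.urljoin(base_url, href)


base = "http://pycs.greedyai.com/"
# Made-up links: root-relative, relative, and already absolute.
hrefs = ["/600004.html", "detail/1.html", "http://example.com/other.html"]

gen = build_requests(base, hrefs)
is_generator = isinstance(gen, types.GeneratorType)

# Nothing has run yet; values are produced only as the consumer asks for them.
urls = list(gen)
print(is_generator)
print(urls)
```

Because the function yields instead of returning a list, the consumer (Scrapy's engine, in the spider's case) can start working on the first value before the loop has finished.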

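Then modify main.py and run it.

9. Locating Page Elements

10. Processing the Scraped Data

Pipelines are where the scraped data is processed. For the program to reach the pipelines, the item must first be defined in items.py.

Splitting the editor window makes it convenient to define the fields.

After defining the fields, populate them in stock.py.

Once all of this is done, the pipeline still has to be enabled in settings.py.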
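The settings.py change is not shown in the text; a typical fragment would look like the sketch below. It assumes the project module is named stock_spider and that the pipeline class is StockPipeline in pipelines.py, as in this article's listings; the priority value 300 is an arbitrary choice.

```python
# settings.py (fragment)
# Assumption: project module "stock_spider", pipeline class "StockPipeline".
ITEM_PIPELINES = {
    # The integer sets the order when several pipelines are enabled;
    # pipelines with lower values run first.
    "stock_spider.pipelines.StockPipeline": 300,
}
```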

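Now set a breakpoint in pipelines.py and run main.py in Debug mode; you can see that all the data has been captured.

The relevant code at this point is as follows: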

stock.py

# -*- coding: utf-8 -*-
import re
from urllib import parse

import scrapy

from stock_spider.items import StockItem


class StockSpider(scrapy.Spider):
    name = 'stock'
    # Domain only: no scheme and no trailing slash
    allowed_domains = ['pycs.greedyai.com']
    start_urls = ['http://pycs.greedyai.com/']

    def parse(self, response):
        post_urls = response.xpath("//a/@href").extract()
        for post_url in post_urls:
            yield scrapy.Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail, dont_filter=True)

    def parse_detail(self, response):
        stock_item = StockItem()
        # Names of the board members
        stock_item["names"] = self.get_tc(response)
        # Sex
        stock_item["sexes"] = self.get_sex(response)
        # Age
        stock_item["ages"] = self.get_age(response)
        # Stock code
        stock_item["codes"] = self.get_code(response)
        # Positions
        stock_item["leaders"] = self.get_leader(response, len(stock_item["names"]))
        # File-storage logic could be written here, but Scrapy expects it in
        # pipelines; to be handled there, the data must be an item whose
        # fields are defined in items.py
        yield stock_item

    def get_tc(self, response):
        tc_names = response.xpath("//*[@id=\"ml_001\"]/table/tbody/tr[1]/td[1]/a/text()").extract()
        return tc_names

    def get_sex(self, response):
        # //*[@id="ml_001"]/table/tbody/tr[1]/td[1]/div/table/thead/tr[2]/td[1]
        infos = response.xpath("//*[@class=\"intro\"]/text()").extract()
        sex_list = []
        for info in infos:
            try:
                # "[男女]" matches either character; a "|" inside the
                # brackets would match a literal "|"
                sex = re.findall("[男女]", info)[0]
                sex_list.append(sex)
            except IndexError:
                continue
        return sex_list

    def get_age(self, response):
        infos = response.xpath("//*[@class=\"intro\"]/text()").extract()
        age_list = []
        for info in infos:
            try:
                age = re.findall(r"\d+", info)[0]
                age_list.append(age)
            except IndexError:
                continue
        return age_list

    def get_code(self, response):
        infos = response.xpath('/html/body/div[3]/div[1]/div[2]/div[1]/h1/a/@title').extract()
        code_list = []
        for info in infos:
            code = re.findall(r"\d+", info)[0]
            code_list.append(code)
        return code_list

    def get_leader(self, response, length):
        tc_leaders = response.xpath("//*[@class=\"tl\"]/text()").extract()
        tc_leaders = tc_leaders[0:length]
        return tc_leaders
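The regular expressions used in get_sex and get_age can be checked on plain strings. The intro texts below are made up; the real page content may be formatted differently.

```python
import re

# Made-up intro strings; the last one has neither a sex nor an age.
intros = ["男，45岁，本科学历。", "女，38岁，硕士。", "暂无简介"]

sexes, ages = [], []
for info in intros:
    sex_match = re.findall("[男女]", info)   # either character
    age_match = re.findall(r"\d+", info)     # first run of digits
    if sex_match:
        sexes.append(sex_match[0])
    if age_match:
        ages.append(age_match[0])

# Inside a character class "|" is a literal, so "[男|女]" also matches "|".
pipe_hits = re.findall("[男|女]", "a|b")

print(sexes, ages, pipe_hits)
```

Entries without a match are skipped, so the resulting lists can be shorter than the list of names; rows built by index then no longer line up.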

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class StockSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class StockItem(scrapy.Item):
    names = scrapy.Field()
    sexes = scrapy.Field()
    ages = scrapy.Field()
    codes = scrapy.Field()
    leaders = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class StockSpiderPipeline(object):
    def process_item(self, item, spider):
        return item


class StockPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item

11. Data Processing

The data-processing code lives in pipelines.py: it takes the scraped data and writes it to a file in a fixed format.

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import os


class StockSpiderPipeline(object):
    def process_item(self, item, spider):
        return item


class StockPipeline(object):
    def __init__(self):
        # Open (or create) the output file when the pipeline is instantiated
        self.file = open("executive_prep.csv", "a+")

    def process_item(self, item, spider):
        # If the file is still empty, write the header row first
        # (executive name, sex, age, stock code, position);
        # then append the item's rows in either case
        if not os.path.getsize("executive_prep.csv"):
            self.file.write("高管姓名,性别,年龄,股票代码,职位\n")
        self.write_content(item)
        self.file.flush()
        return item

    def write_content(self, item):
        names = item["names"]
        sexes = item["sexes"]
        ages = item["ages"]
        codes = item["codes"]
        leaders = item["leaders"]
        for i in range(len(names)):
            result = names[i] + "," + sexes[i] + "," + ages[i] + "," + codes[i] + "," + leaders[i] + "\n"
            self.file.write(result)
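As an alternative sketch of the same write-the-header-once logic, the standard csv module can do the quoting and joining. The function name append_rows, the English column names, and the sample item are all assumptions made for illustration; this is not the article's pipeline.

```python
import csv
import os
import tempfile

# Hypothetical English column names for the header row.
HEADER = ["name", "sex", "age", "code", "position"]


def append_rows(path, item):
    """Append one row per executive; write the header only if the file is empty."""
    is_empty = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_empty:
            writer.writerow(HEADER)
        # zip stops at the shortest list, so a short list cannot raise IndexError.
        rows = zip(item["names"], item["sexes"], item["ages"],
                   item["codes"], item["leaders"])
        writer.writerows(rows)


# Made-up sample item with two executives.
sample_item = {
    "names": ["Zhang San", "Li Si"],
    "sexes": ["男", "女"],
    "ages": ["45", "38"],
    "codes": ["600004", "600004"],
    "leaders": ["董事长", "董事"],
}

out_path = os.path.join(tempfile.mkdtemp(), "executive_prep.csv")
append_rows(out_path, sample_item)
append_rows(out_path, sample_item)  # second call must not repeat the header

with open(out_path, encoding="utf-8", newline="") as f:
    lines = list(csv.reader(f))
print(lines)
```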

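Run main.py; the generated executive_prep.csv looks like this:

So far we have covered a basic crawler built on the Scrapy framework: how each module of the framework works, from scraping the data to processing it and persisting it (either to a file or to a database). As beginners we write to a file here, and with that the whole flow is connected end to end.

The data can also be stored in a database such as Neo4j, but it must first be converted to the .csv format that Neo4j requires for import.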
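As a rough sketch of what that conversion might involve: Neo4j's bulk import reads header fields such as :ID and :LABEL for node files and :START_ID, :END_ID and :TYPE for relationship files. The property names, the Executive label, the WORKS_AT relationship type, and the exec_N id scheme below are assumptions chosen for illustration, not the article's actual schema.

```python
import csv
import io


def to_neo4j_nodes(rows):
    """Render executive rows as a node CSV with a Neo4j-style header."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["executiveId:ID", "name", "sex", "age:int", ":LABEL"])
    for i, row in enumerate(rows):
        # exec_<n> is a made-up id scheme; ids must be unique across nodes.
        writer.writerow([f"exec_{i}", row["name"], row["sex"], row["age"], "Executive"])
    return out.getvalue()


def to_neo4j_relationships(rows):
    """Render executive -> stock links as a relationship CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for i, row in enumerate(rows):
        # Assumes the stock node file uses the stock code as its id.
        writer.writerow([f"exec_{i}", row["code"], "WORKS_AT"])
    return out.getvalue()


# Made-up sample row.
sample_rows = [{"name": "Zhang San", "sex": "M", "age": "45", "code": "600004"}]

nodes_csv = to_neo4j_nodes(sample_rows)
rels_csv = to_neo4j_relationships(sample_rows)
print(nodes_csv)
print(rels_csv)
```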

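Neo4j download: https://neo4j.com/download-center/#panel2-3. After downloading and unpacking it, start the service with bin/neo4j start. The initial username/password is neo4j/neo4j; change the password when prompted.

Assuming the related data has already been crawled into CSV files and converted to the CSV format Neo4j requires, all of it can be imported into Neo4j with the following command:

bin/neo4j-admin import --nodes executive.csv --nodes stock.csv --nodes concept.csv --nodes industry.csv --relationships executive_stock.csv --relationships stock_industry.csv --relationships stock_concept.csv

The data is stored in the graph.db folder by default. If the graph.db folder already contains data, you can delete it before running the command.

After restarting the Neo4j service, the knowledge graph can be viewed at localhost:7474.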
