Web Scraping

Scraping notes:
1. Install Scrapy; once Anaconda is set up, open a terminal and type scrapy to verify the install.

2. When scraping list-style sites, look for what the pages have in common; generalize the parts that differ and you can crawl the whole site.
Problems:
① How to keep separately scraped fields aligned, i.e. matching each scraped name with its data.
② Data storage: a database or a CSV file.
③ The hard part is writing the XPath expressions.

3. JSON-format output; mind Excel's encoding when exporting CSV (see the export sketch after this list).

4. yield works somewhat like recursion: the spider keeps following next-page links, and when there is no next link it stops crawling (see the full spider sketch at the end of this post).
5. Hard parts:
Ajax requests,
form (POST) requests

6. To get a scraping job, you need to know anti-scraping countermeasures.
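
For problem ② (storage) and note 3 (JSON output, Excel encoding), Scrapy's built-in feed exports can write JSON or CSV without any extra code. A minimal sketch, assuming the spider is named quotes; the utf-8-sig setting writes a UTF-8 BOM, which is one known way to keep Chinese text readable when Excel opens the CSV:

```python
# Run from the project directory; -o picks the output format
# from the file extension (standard Scrapy CLI behavior):
#   scrapy crawl quotes -o quotes.json
#   scrapy crawl quotes -o quotes.csv

# settings.py -- write a UTF-8 BOM so Excel detects the encoding
# and non-ASCII text is not garbled:
FEED_EXPORT_ENCODING = 'utf-8-sig'
```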

1.C:\Users\>scrapy

2.C:\Users\石悦政>scrapy shell http://quotes.toscrape.com/
This is the site we are going to scrape.

3.
In [1]: view(response)
Out[1]: True

A page identical to the live site pops up in the browser: this is the scraped copy, opened from a path on the local disk.
In [10]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]').extract_first()

//Note: the XPath expression goes inside quotes within the parentheses.
//extract_first() is chained on with a leading dot.
//Without /text() the whole element is extracted; adding it extracts only the text inside the tag.


Out[10]: '<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>'

In [11]: response.xpath('/html/body/div[1]/div[2]/div[1]/div[1]/span[1]/text()').extract_first()


Out[11]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

//Scrape the second quote:

In [14]: response.xpath('/html/body/div/div[2]/div[1]/div[2]/span[1]').extract_first()

Out[14]: '<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>'

//Notice that only the index inside the last div[] differs between quotes.
//So replace it with *, and use extract() instead of extract_first():
In [16]: response.xpath('/html/body/div/div[2]/div[1]/div[*]/span[1]/text()').extract()

Out[18]:
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

//Scrape the authors:
In [19]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[2]/small/text()').extract()
Out[19]: ['Albert Einstein']

In [20]: response.xpath('/html/body/div/div[2]/div[1]/div[2]/span[2]/small/text()').extract()
Out[20]: ['J.K. Rowling']


//Again, generalize the pattern:
In [21]: response.xpath('/html/body/div/div[2]/div[1]/div[*]/span[2]/small/text()').extract()
Out[21]:
['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']

//Scrape the tags:
In [22]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/div/text()').extract()
Out[22]:
['\n            Tags:\n            ',
 ' \n            \n            ',
 '\n            \n            ',
 '\n            \n            ',
 '\n            \n            ',
 '\n            \n        ']

//The tag text is only whitespace; replace the trailing /text() with
//      '/meta/@content'  to read the meta element's content attribute:
In [23]: response.xpath('/html/body/div[1]/div[2]/div[1]/div[1]/div/meta/@content').extract()
Out[23]: ['change,deep-thoughts,thinking,world']

In [25]: response.xpath('/html/body/div/div[2]/div[1]/div[2]/div/meta/@content').extract()
Out[25]: ['abilities,choices']

//Generalize the pattern:

In [26]: response.xpath('/html/body/div/div[2]/div[1]/div[*]/div/meta/@content').extract()
Out[26]:
['change,deep-thoughts,thinking,world',
 'abilities,choices',
 'inspirational,life,live,miracle,miracles',
 'aliteracy,books,classic,humor',
 'be-yourself,inspirational',
 'adulthood,success,value',
 'life,love',
 'edison,failure,inspirational,paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor,obvious,simile']

//To solve the alignment problem (① above), scrape each quote block as a whole:
In [27]: quote=response.xpath('/html/body/div/div[2]/div[1]/div[1]')

In [28]: quote
Out[28]: [<Selector xpath='/html/body/div/div[2]/div[1]/div[1]' data='<div class="quote" itemscope itemtype="h'>]

//The first block's text:
In [30]: quote.xpath('./span[1]/text()').extract_first()
Out[30]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

//The first block's tags:
In [31]: quote.xpath('./div/meta/@content').extract_first()
Out[31]: 'change,deep-thoughts,thinking,world'


//The first block's author:
In [32]: quote.xpath('./span[2]/small/text()').extract_first()
Out[32]: 'Albert Einstein'


//Automate the scraping with a loop over all blocks:
In [33]: quotes=response.xpath('/html/body/div/div[2]/div[1]/div[*]')
    ...: for q in quotes:
    ...:     print(q.xpath('./span[1]/text()').extract_first())
    ...:     print(q.xpath('./span[2]/small/text()').extract_first())
    ...:     print(q.xpath('./div[1]/meta/@content').extract_first())
![Loop output (1)](https://img-blog.csdnimg.cn/20200930145357637.jpg?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NDc3Mzk4OQ==,size_16,color_FFFFFF,t_70#pic_center)
![Loop output (2)](https://img-blog.csdnimg.cn/20200930145150353.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NDc3Mzk4OQ==,size_16,color_FFFFFF,t_70#pic_center)


Note that Python's for loop takes no parentheses:
for i in range(8):
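
The loop above interleaves text, author, and tags in its printed output, which answers problem ①: because every field is taken from the same quote block q, they stay aligned. A small shell continuation (a sketch reusing the XPaths worked out above) that collects the fields into dicts instead of printing:

```python
# Still in the scrapy shell: one dict per quote block keeps the
# text, author, and tags of the same quote together.
items = []
for q in response.xpath('/html/body/div/div[2]/div[1]/div[*]'):
    items.append({
        'text': q.xpath('./span[1]/text()').extract_first(),
        'author': q.xpath('./span[2]/small/text()').extract_first(),
        'tags': q.xpath('./div[1]/meta/@content').extract_first(),
    })

# items[0] -> {'text': '“The world as we have created it ...”',
#              'author': 'Albert Einstein',
#              'tags': 'change,deep-thoughts,thinking,world'}
```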



Now create a new scraping project:

```
C:\Users\石悦政>f:

F:\>cd F:\学习\1730089873\FileRecv\2020大三上学期\泰迪杯\爬虫项目

F:\学习\1730089873\FileRecv\2020大三上学期\泰迪杯\爬虫项目>scrapy startproject quotes

F:\学习\1730089873\FileRecv\2020大三上学期\泰迪杯\爬虫项目>cd quotes

F:\学习\1730089873\FileRecv\2020大三上学期\泰迪杯\爬虫项目\quotes>scrapy genspider jdscriper quotes.toscrape.com
```


scrapy genspider takes the spider name and the start domain:
scrapy genspider jdscriper quotes.toscrape.com
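
genspider only generates a skeleton (name, allowed_domains, start_urls, and an empty parse). A sketch of how it might be filled in with the shell results above; the pager XPath (li.next > a) is an assumption about the site's markup, and yielding the next request is the behavior note 4 describes: when there is no next link, extract_first() returns None and the crawl stops.

```python
# quotes/spiders/jdscriper.py -- a sketch, not the generated file verbatim.
import scrapy


class JdscriperSpider(scrapy.Spider):
    name = 'jdscriper'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # One dict per quote block, so the fields stay aligned (problem ①).
        for q in response.xpath('/html/body/div/div[2]/div[1]/div[*]'):
            yield {
                'text': q.xpath('./span[1]/text()').extract_first(),
                'author': q.xpath('./span[2]/small/text()').extract_first(),
                'tags': q.xpath('./div[1]/meta/@content').extract_first(),
            }
        # Assumed pager markup: <li class="next"><a href="/page/2/">.
        # On the last page there is no such link, next_page is None,
        # no further request is yielded, and the crawl simply stops.
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```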