Web Scraping

Scraping notes:
1. Install Scrapy; once Anaconda is set up, open a terminal and type scrapy to verify the install.

2. When scraping list-style sites, look for what the pages have in common; generalize the parts that differ and you can crawl the whole site.
Problems:
① How to keep separately scraped fields aligned, i.e. matching each scraped name with its data.
② Data storage: a database or a CSV file.
③ The hard part is writing the XPath expressions.

3. JSON-format output; mind Excel's encoding when exporting CSV (see the export sketch after this list).

4. yield works somewhat like recursion: the spider keeps following next-page links, and when there is no next link it stops crawling (see the full spider sketch at the end of this post).
5. Hard parts:
Ajax requests,
form (POST) requests

6. To get a scraping job, you need to know anti-scraping countermeasures.
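
For problem ② (storage) and note 3 (JSON output, Excel encoding), Scrapy's built-in feed exports can write JSON or CSV without any extra code. A minimal sketch, assuming the spider is named quotes; the utf-8-sig setting writes a UTF-8 BOM, which is one known way to keep Chinese text readable when Excel opens the CSV:

```python
# Run from the project directory; -o picks the output format
# from the file extension (standard Scrapy CLI behavior):
#   scrapy crawl quotes -o quotes.json
#   scrapy crawl quotes -o quotes.csv

# settings.py -- write a UTF-8 BOM so Excel detects the encoding
# and non-ASCII text is not garbled:
FEED_EXPORT_ENCODING = 'utf-8-sig'
```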

1.C:\Users\>scrapy

2.C:\Users\石悦政>scrapy shell http://quotes.toscrape.com/
This is the site we are going to scrape.

3.
In [1]: view(response)
Out[1]: True

A page identical to the live site pops up in the browser: this is the scraped copy, opened from a path on the local disk.
In [10]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[1]').extract_first()

//Note: the XPath expression goes inside quotes within the parentheses.
//extract_first() is chained on with a leading dot.
//Without /text() the whole element is extracted; adding it extracts only the text inside the tag.


Out[10]: '<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>'

In [11]: response.xpath('/html/body/div[1]/div[2]/div[1]/div[1]/span[1]/text()').extract_first()


Out[11]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

//Scrape the second quote:

In [14]: response.xpath('/html/body/div/div[2]/div[1]/div[2]/span[1]').extract_first()

Out[14]: '<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>'

//Notice that only the index inside the last div[] differs between quotes.
//So replace it with *, and use extract() instead of extract_first():
In [16]: response.xpath('/html/body/div/div[2]/div[1]/div[*]/span[1]/text()').extract()

Out[18]:
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

//Scrape the authors:
In [19]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/span[2]/small/text()').extract()
Out[19]: ['Albert Einstein']

In [20]: response.xpath('/html/body/div/div[2]/div[1]/div[2]/span[2]/small/text()').extract()
Out[20]: ['J.K. Rowling']


//Again, generalize the pattern:
In [21]: response.xpath('/html/body/div/div[2]/div[1]/div[*]/span[2]/small/text()').extract()
Out[21]:
['Albert Einstein',
 'J.K. Rowling',
 'Albert Einstein',
 'Jane Austen',
 'Marilyn Monroe',
 'Albert Einstein',
 'André Gide',
 'Thomas A. Edison',
 'Eleanor Roosevelt',
 'Steve Martin']

//Scrape the tags:
In [22]: response.xpath('/html/body/div/div[2]/div[1]/div[1]/div/text()').extract()
Out[22]:
['\n            Tags:\n            ',
 ' \n            \n            ',
 '\n            \n            ',
 '\n            \n            ',
 '\n            \n            ',
 '\n            \n        ']

//The tag text is only whitespace; replace the trailing /text() with
//      '/meta/@content'  to read the meta element's content attribute:
In [23]: response.xpath('/html/body/div[1]/div[2]/div[1]/div[1]/div/meta/@content').extract()
Out[23]: ['change,deep-thoughts,thinking,world']

In [25]: response.xpath('/html/body/div/div[2]/div[1]/div[2]/div/meta/@content').extract()
Out[25]: ['abilities,choices']

//Generalize the pattern:

In [26]: response.xpath('/html/body/div/div[2]/div[1]/div[*]/div/meta/@content').extract()
Out[26]:
['change,deep-thoughts,thinking,world',
 'abilities,choices',
 'inspirational,life,live,miracle,miracles',
 'aliteracy,books,classic,humor',
 'be-yourself,inspirational',
 'adulthood,success,value',
 'life,love',
 'edison,failure,inspirational,paraphrased',
 'misattributed-eleanor-roosevelt',
 'humor,obvious,simile']

//To solve the alignment problem (① above), scrape each quote block as a whole:
In [27]: quote=response.xpath('/html/body/div/div[2]/div[1]/div[1]')

In [28]: quote
Out[28]: [<Selector xpath='/html/body/div/div[2]/div[1]/div[1]' data='<div class="quote" itemscope itemtype="h'>]

//The first block's text:
In [30]: quote.xpath('./span[1]/text()').extract_first()
Out[30]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

//The first block's tags:
In [31]: quote.xpath('./div/meta/@content').extract_first()
Out[31]: 'change,deep-thoughts,thinking,world'


//The first block's author:
In [32]: quote.xpath('./span[2]/small/text()').extract_first()
Out[32]: 'Albert Einstein'


//Automate the scraping with a loop over all blocks:
In [33]: quotes=response.xpath('/html/body/div/div[2]/div[1]/div[*]')
    ...: for q in quotes:
    ...:     print(q.xpath('./span[1]/text()').extract_first())
    ...:     print(q.xpath('./span[2]/small/text()').extract_first())
    ...:     print(q.xpath('./div[1]/meta/@content').extract_first())
![Loop output (1)](https://img-blog.csdnimg.cn/20200930145357637.jpg?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NDc3Mzk4OQ==,size_16,color_FFFFFF,t_70#pic_center)
![Loop output (2)](https://img-blog.csdnimg.cn/20200930145150353.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NDc3Mzk4OQ==,size_16,color_FFFFFF,t_70#pic_center)


Note that Python's for loop takes no parentheses:
for i in range(8):
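
The loop above interleaves text, author, and tags in its printed output, which answers problem ①: because every field is taken from the same quote block q, they stay aligned. A small shell continuation (a sketch reusing the XPaths worked out above) that collects the fields into dicts instead of printing:

```python
# Still in the scrapy shell: one dict per quote block keeps the
# text, author, and tags of the same quote together.
items = []
for q in response.xpath('/html/body/div/div[2]/div[1]/div[*]'):
    items.append({
        'text': q.xpath('./span[1]/text()').extract_first(),
        'author': q.xpath('./span[2]/small/text()').extract_first(),
        'tags': q.xpath('./div[1]/meta/@content').extract_first(),
    })

# items[0] -> {'text': '“The world as we have created it ...”',
#              'author': 'Albert Einstein',
#              'tags': 'change,deep-thoughts,thinking,world'}
```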



Now create a new scraping project:

```
C:\Users\石悦政>f:

F:\>cd F:\学习\1730089873\FileRecv\2020大三上学期\泰迪杯\爬虫项目

F:\学习\1730089873\FileRecv\2020大三上学期\泰迪杯\爬虫项目>scrapy startproject quotes

F:\学习\1730089873\FileRecv\2020大三上学期\泰迪杯\爬虫项目>cd quotes

F:\学习\1730089873\FileRecv\2020大三上学期\泰迪杯\爬虫项目\quotes>scrapy genspider jdscriper quotes.toscrape.com
```


scrapy genspider takes the spider name and the start domain:
scrapy genspider jdscriper quotes.toscrape.com
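
genspider only generates a skeleton (name, allowed_domains, start_urls, and an empty parse). A sketch of how it might be filled in with the shell results above; the pager XPath (li.next > a) is an assumption about the site's markup, and yielding the next request is the behavior note 4 describes: when there is no next link, extract_first() returns None and the crawl stops.

```python
# quotes/spiders/jdscriper.py -- a sketch, not the generated file verbatim.
import scrapy


class JdscriperSpider(scrapy.Spider):
    name = 'jdscriper'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # One dict per quote block, so the fields stay aligned (problem ①).
        for q in response.xpath('/html/body/div/div[2]/div[1]/div[*]'):
            yield {
                'text': q.xpath('./span[1]/text()').extract_first(),
                'author': q.xpath('./span[2]/small/text()').extract_first(),
                'tags': q.xpath('./div[1]/meta/@content').extract_first(),
            }
        # Assumed pager markup: <li class="next"><a href="/page/2/">.
        # On the last page there is no such link, next_page is None,
        # no further request is yielded, and the crawl simply stops.
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```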