A First Look at Python Web Crawlers

Latest recommended article published on 2024-09-21 17:56:48

Richard_Jason

Article tags: python, web crawler

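I learned on Python 2.7.x, so these notes use urllib.
It turns out Sublime Text can run Python code directly: press Ctrl+B and the output appears in the panel below.
dir(urllib) lists the names (functions and classes) the module provides.
help(urllib.urlopen) shows the parameters and documentation for that function.

urlopen takes three main parameters: the first is the url, the second is data (the POST body), and the third is proxies.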

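You can call dir() on any object to see the methods available on it.

So typing baidu.com and ending up at www.baidu.com is a 301 redirect!

403: access forbidden
30x: redirection
50x: server-side error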

Python's urllib has a function that downloads a web page straight to a file:
urllib.urlretrieve(url, 'save path')

The page is downloaded directly.

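#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib

url = 'http://www.163.com'
html = urllib.urlopen(url)
# print html.read()
print html.getcode()  # HTTP status code of the response, e.g. 200
urllib.urlretrieve(url, 'c:/aaa.html')  # save the page to a local file

html.close()  # close the response once you are done with it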
 

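Always remember to close the response with close()!

Another way to save a page to disk:

# html is the response object returned by urllib.urlopen(url)
html = html.read()
with open('c:/a1.txt','wb') as f:
    f.write(html)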
 

Python lets you chain method calls:

html = urllib.urlopen(url).read()

print html
 

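Otherwise, leave read() off the first call and then write html = html.read().

If you are only performing the operations one after another, you can write it this way.

You will often run into encoding errors and garbled text. When that happens, check which encoding the page uses by looking at its source, for example utf-8 or GBK.
Then call decode() on the result of read() to decode it, and encode() to re-encode it into whatever encoding you want.
You can pass ('gbk', 'ignore') to decode; 'ignore' skips bytes that cannot be decoded, which helps because a single page can contain more than one encoding.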
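
The decode-then-encode step can be tried without any network access. The snippet below is only a sketch: the byte string stands in for what read() would return, and the stray \xff byte is an assumption added to imitate a page with a badly encoded character.

```python
# -*- coding: utf-8 -*-
# Simulated page bytes: GBK text with one invalid byte in the middle.
raw = u'编码'.encode('gbk') + b'\xff' + b' test'

try:
    raw.decode('gbk')             # strict decoding fails on the bad byte
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

text = raw.decode('gbk', 'ignore')    # undecodable bytes are dropped
utf8_bytes = text.encode('utf-8')     # re-encode into the encoding you want
```

With 'ignore' the bad byte silently disappears, so it is worth using only when losing a few characters is acceptable.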

In the os module, getcwd() returns the current working directory as an absolute path,
and chdir('another path') switches to another directory.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib

def callback(a, b, c):
    # a: number of blocks transferred so far
    # b: size of each block in bytes
    # c: total size of the file in bytes
    down_pro = 100.0 * a * b / c
    if down_pro > 100:
        down_pro = 100
    print '%.2f%%' % down_pro

url = 'http://www.iplaypython.com/'
local = 'c:/123.txt'
urllib.urlretrieve(url, local, callback)
 

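The key part is the callback function.
It takes three arguments: the number of data blocks transferred so far, the size of each block in bytes, and the total size of the file.

Also, to print a literal percent sign in a format string you have to write two percent signs (%%).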
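
The progress arithmetic can be checked on its own, without downloading anything. This is a small sketch, not the author's code: download_percent and format_percent are hypothetical helper names, and the guard for a non-positive total is an added assumption for servers that do not report a file size.

```python
def download_percent(block_count, block_size, total_size):
    """Return download progress as a percentage between 0 and 100."""
    if total_size <= 0:
        # No usable size was reported, so no percentage can be computed.
        return 0.0
    percent = 100.0 * block_count * block_size / total_size
    # The last block is usually only partly filled, so cap the result at 100.
    return min(percent, 100.0)


def format_percent(percent):
    # '%%' inside a format string produces one literal percent sign.
    return '%.2f%%' % percent
```

A callback for urlretrieve could then just print format_percent(download_percent(a, b, c)).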

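Lesson 3: getting a page's encoding. This version reads the encoding from the HTTP response headers.

There is also a third-party library that detects a page's encoding automatically.
The steps: download the chardet module, install it, run character-set detection, and wrap it in a function.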

import urllib
url = 'http://www.163.com'
info = urllib.urlopen(url).info()

print info.getparam('charset')
 

getparam is a method of the object returned by info(); it returns the charset declared in the Content-Type header.
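
To see what this lookup does without opening a connection, the header can be parsed locally. The sketch below uses the standard library's email.message.Message as a stand-in for the headers object; charset_from_content_type is a hypothetical helper name and the header values are made-up samples.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the lower-cased charset named in a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()


declared = charset_from_content_type('text/html; charset=GBK')
missing = charset_from_content_type('text/html')
```

When the server sends no charset at all the result is None, which is the case where a detector such as chardet becomes useful.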

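The third-party module for detecting a page's encoding is chardet (character set detection).

Import it, call read() on the object returned by urlopen, and pass the resulting bytes to
chardet.detect(content)
It returns the most likely encoding together with a confidence value.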

import urllib
import chardet
url = 'http://www.iplaypython.com'
content = urllib.urlopen(url).read()
print chardet.detect(content)
 

I wrapped it in a function:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import chardet

url = 'http://www.iplaypython.com'

def automatic(url):
    # Download the page and return the encoding chardet considers most likely
    con = urllib.urlopen(url).read()
    res = chardet.detect(con)
    end = res['encoding']
    return end

print automatic(url)
urls = ['http://www.baidu.com',
        'http://www.163.com',
        'http://www.jd.com'
]
for x in urls:
    print automatic(x)

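Lesson 4

The urllib2 module!

Pay attention to the difference in encodings between foreign and Chinese sites, for example Google versus Baidu.
GBK is a Chinese encoding!

Encoding matters!

Getting around sites that block crawlers...

CSDN, for example, rejects a plain request and returns 403.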

import random
import urllib2

url = 'http://blog.csdn.net/qq_28295425'

# Headers copied from a real browser request so the server does not answer 403.
# GET is the request method rather than a header, so it is not listed here.
my_headers = {
    'Host': 'blog.csdn.net',
    'Referer': 'http://blog.csdn.net/experts.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36'
}

req = urllib2.Request(url, headers=my_headers)
# The same headers can also be added one at a time:
# req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36')
# req.add_header('Host','blog.csdn.net')
# req.add_header('Referer','http://blog.csdn.net/experts.html')

asd = urllib2.urlopen(req)
print asd.read()

# Variant: pick a random User-Agent for each request
user_agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36',
    'Mozilla/5.5 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36'
]

def get_content(url, headers):
    random_header = random.choice(headers)
    req = urllib2.Request(url)
    req.add_header('User-Agent', random_header)
    req.add_header('Host', 'blog.csdn.net')
    req.add_header('Referer', 'http://blog.csdn.net/')
    content = urllib2.urlopen(req).read()
    return content

print get_content(url, user_agents)

Note that random.choice picks from a list ([]).

First pass the url to Request, then add the header information, then pass the resulting req to urlopen.
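
The request-building half of that sequence can be exercised without sending anything. The following is a sketch under a few assumptions: build_request is a hypothetical helper name, the try/except import is there only so the same lines also run on Python 3 (where Request lives in urllib.request), and nothing is passed to urlopen.

```python
import random

try:
    from urllib2 import Request            # Python 2, as used in these notes
except ImportError:
    from urllib.request import Request     # Python 3 location of the same class

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36',
    'Mozilla/5.5 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36',
]


def build_request(url, user_agents, referer):
    """Create a Request carrying a randomly chosen User-Agent and a Referer."""
    req = Request(url)
    req.add_header('User-Agent', random.choice(user_agents))
    req.add_header('Referer', referer)
    return req


req = build_request('http://blog.csdn.net/qq_28295425', USER_AGENTS,
                    'http://blog.csdn.net/')
```

Request stores header names capitalized, so the value is read back with req.get_header('User-agent').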

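Lesson 5: an image-downloading crawler

For example, grabbing the images from a Baidu Tieba thread.

Always declare the source encoding at the top of the file.
The -*- is typed with a plain hyphen; no Shift key is needed.

# -*- coding: utf-8 -*-

The main thing I learned is the regex function re.findall(pattern, content), which returns all the matches.

Also, don't be silly and write everything as a regex: use regex syntax only for the parts that vary, and copy the parts that never change verbatim.
It is best to wrap this in a function.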
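
As a rough illustration of that advice, the function below copies the fixed markup literally and uses regex syntax only for the URL that changes. The HTML snippet, the class name pic, and the function name find_image_urls are all made up for this example.

```python
import re


def find_image_urls(html):
    """Return the src of every <img class="pic"> tag that points at a .jpg."""
    # The tag and attribute text is copied as-is; only the URL is a pattern.
    return re.findall(r'<img class="pic" src="([^"]+\.jpg)"', html)


SAMPLE_HTML = ('<img class="pic" src="http://example.com/a.jpg">'
               '<img class="logo" src="http://example.com/logo.png">'
               '<img class="pic" src="http://example.com/b.jpg">')

urls = find_image_urls(SAMPLE_HTML)
```

Because the pattern has exactly one group, findall returns a flat list of the captured URLs.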

Here is my own version, written without functions, that extracts proxy IPs and ports from kuaidaili:

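#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Crawl the proxies first; setting headers and wrapping this in a function come later
import re
import urllib2

url1 = 'http://www.kuaidaili.com/'  # sample row: <td data-title="IP">123.182.216.241</td>
html1 = urllib2.urlopen(url1)
html = html1.read()
html1.close()
# Escape the dots so they match a literal '.', not any character
regexip = r'data-title="IP">(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
regexport = r'data-title="PORT">(\d{1,5})'  # port numbers can have up to 5 digits
proxyip = re.findall(regexip, html)
proxyport = re.findall(regexport, html)
# zip stops at the shorter list, so fewer than 10 results cannot raise IndexError
for ip, port in zip(proxyip, proxyport)[:10]:
    print ip + ' : ' + port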
 

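Lesson 6 uses a third-party module: BeautifulSoup, imported from bs4.

The convenient part is that you don't have to write regular expressions. You pass BeautifulSoup the page as a string and then search by HTML tag. The first argument of find_all is the tag name, and it matches every element with that tag. The second argument filters by class; it has to be written class_ so that it doesn't clash with Python's class keyword. find_all returns a list of all matching elements. Each element is an object that behaves much like a dictionary keyed by attribute name, so looking up src gives you its value: element['src'] is the image URL. You can look up other attributes by using other keys.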

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib

def get_content(url):
    html = urllib.urlopen(url)
    content = html.read()
    html.close()
    return content

def get_image(info):
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(info, 'html.parser')
    all_img = soup.find_all('img', class_="BDE_Image")
    for img in all_img:
        print img['src']

url = 'http://tieba.baidu.com/p/4656488748'
info = get_content(url)
get_image(info)  # get_image prints the URLs itself and returns None

Check whether a file exists:

os.path.exists(filename)
 

Print the current time, formatted:

>>> import time
>>> ISOTIMEFORMAT='%Y-%m-%d %X'
>>> time.strftime( ISOTIMEFORMAT, time.localtime() )
'2016-08-17 16:31:28'
 

Calling time.localtime() on its own returns a tuple (a time.struct_time)!

Code I wrote to fetch shared Xunlei member accounts:

# -*- coding: utf-8 -*-
import urllib
import re
import os

url1 = 'http://xlfans.com/'
# Captures the account after '迅雷会员号共享' and the password after '密码'
regex = r'迅雷会员号共享(.+?)密码(.*)<'
# Captures the link of each post on the index page
regex1 = r'class="item"><a href="(.+?)">'
ml = 'c:/xunlei.txt'

def get_html(url):
    html1 = urllib.urlopen(url)
    html = html1.read()
    html1.close()
    return html

def get_re(html):
    # Append every account/password pair found on the page to the output file
    xunlei = re.findall(regex, html)
    with open(ml, 'a') as f:
        for a in xunlei:
            b = a[0] + ' ' + a[1]
            f.write(b + '\n')

def get_new(html):
    # Return the first post link found on the index page
    new = re.findall(regex1, html)
    return new[0]

# f = open(ml,'wb')
# f.write('1')
# f.close()
html = get_html(url1)
url = get_new(html)
new_html = get_html(url)
# Start from a clean file on every run
if os.path.exists(ml):
    os.remove(ml)
get_re(new_html)
print 'please look c:/xunlei.txt thankyou!'
print 'newurl= '+url

Writing a directory scanner in Python

Read a file line by line:

readlines()
s = 'c:/1.txt'
os.path.splitext(s)
 

splitext splits a file path into two parts: the path plus file name, and the extension.
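
A few sample calls make the split easier to picture. This is just an illustrative sketch; the extra paths beyond c:/1.txt are invented for the example.

```python
import os.path

root, ext = os.path.splitext('c:/1.txt')

# Only the last dot counts as the start of the extension.
tar_root, tar_ext = os.path.splitext('backup.tar.gz')

# A name without a dot gets an empty extension.
plain_root, plain_ext = os.path.splitext('README')
```

Note that the extension keeps its leading dot, which matters when comparing it against a list such as ['.php', '.asp'].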

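.empty() checks whether it is empty.

I read through an example that simulates a form submission. Capture the traffic and look at the data being submitted, then build a data dictionary directly (you can set or change the parameters yourself). Next build the headers, which also come from the captured traffic. The data then has to be encoded with urllib: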

data = urllib.urlencode(data)   # encode the dict in URL-encoded form
# Both data and headers are needed, so urllib2.Request is used
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
result = response.read().decode('gbk')
# First create a request from url, data and headers, then open it with urlopen
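
Here is a self-contained sketch of the same steps that stops before the request is sent, so it can be run offline. The URL, form fields, and the helper name build_post_request are placeholders rather than values from the captured traffic, and the try/except import only exists so the lines also run on Python 3.

```python
try:
    from urllib import urlencode             # Python 2
    from urllib2 import Request
except ImportError:
    from urllib.parse import urlencode       # Python 3 locations
    from urllib.request import Request


def build_post_request(url, form, headers):
    """URL-encode the form fields and attach them as the request body."""
    body = urlencode(form).encode('ascii')   # the body must be bytes on Python 3
    return Request(url, body, headers)


# A list of pairs keeps the field order predictable.
form = [('username', 'demo'), ('password', 's3cret!')]
headers = {'User-Agent': 'Mozilla/5.0'}
req = build_post_request('http://example.com/login', form, headers)
```

Supplying a body is what turns the request into a POST; without data the same Request would be sent as a GET.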
 

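This is really neat. Now I understand how disconnect detection works once you are connected: on a timer, ping Baidu every so often,
and reconnect when the link has dropped.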
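
The timed check-and-reconnect idea could be sketched roughly like this. It is not the original program: keep_alive, can_connect, reconnect, and the injectable sleep parameter are hypothetical names, and the loop is bounded by max_checks so the sketch can finish.

```python
import time


def keep_alive(can_connect, reconnect, interval_seconds, max_checks,
               sleep=time.sleep):
    """Check the link max_checks times and reconnect whenever it is down.

    Returns how many times reconnect() was called.
    """
    attempts = 0
    for _ in range(max_checks):
        if not can_connect():
            reconnect()
            attempts += 1
        sleep(interval_seconds)   # wait before the next check
    return attempts


# Dry run with fake functions: the link is down on the second check only.
states = iter([True, False, True])
log = []
attempts = keep_alive(lambda: next(states), lambda: log.append('reconnect'),
                      60, 3, sleep=lambda seconds: None)
```

In a real tool can_connect would be a ping-style check and reconnect would redo the login; passing them in as arguments keeps the loop easy to test.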

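import os
import subprocess

    # Check whether the network is currently reachable (a method of a class not shown here)
    def canConnect(self):
        fnull = open(os.devnull, 'w')
        # ping exits with 0 on success; on Linux add a count such as -c 1, or ping never stops
        result = subprocess.call('ping www.baidu.com', shell = True, stdout = fnull, stderr = fnull)
        fnull.close()
        return result == 0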

http://cuiqingcai.com/2083.html

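From another article, on regular expressions: .* matches any number of arbitrary characters, and adding ? makes it non-greedy.
re.S turns on dot-matches-all mode, so the dot can also match a newline.

The usual approach is to compile the pattern first:
pattern = re.compile(r'your regex')
and then
result = pattern.findall(text_to_search)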
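
A small comparison shows what the ? and re.S change in practice. The HTML string here is an invented sample, and the three pattern names are only labels for this sketch.

```python
import re

html = '<p>first</p>\n<p>second\nline</p>'

greedy = re.compile(r'<p>(.*)</p>', re.S)     # grabs as much as possible
lazy = re.compile(r'<p>(.*?)</p>', re.S)      # stops at the nearest </p>
no_dotall = re.compile(r'<p>(.*?)</p>')       # '.' cannot cross a newline

greedy_result = greedy.findall(html)
lazy_result = lazy.findall(html)
no_dotall_result = no_dotall.findall(html)
```

The greedy pattern swallows both paragraphs as one match, the non-greedy one returns them separately, and without re.S the paragraph containing a newline is missed entirely.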

He set it up to wait for input: each time you press Enter the index is incremented and the next joke is shown,
looping over and over.

Always write exception handling!

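• re.I (full name: IGNORECASE): ignore case (the full name is given in parentheses, likewise below)
• re.M (full name: MULTILINE): multi-line mode; changes the behavior of '^' and '$'
• re.S (full name: DOTALL): dot-matches-all mode; changes the behavior of '.'
• re.L (full name: LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale
• re.U (full name: UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on Unicode character properties
• re.X (full name: VERBOSE): verbose mode. The pattern can span multiple lines, whitespace is ignored, and comments can be added.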
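
Three of these flags are easy to try on a short string. The text and patterns below are invented for illustration only.

```python
import re

text = 'Python\npython crawler'

# re.I: match regardless of case.
any_case = re.findall(r'python', text, re.I)

# Without re.M, '^' only matches at the very start of the string.
start_only = re.findall(r'^python', text, re.I)

# With re.M, '^' also matches right after each newline.
each_line = re.findall(r'^python', text, re.I | re.M)

# re.X: whitespace in the pattern is ignored and comments are allowed.
number_pair = re.compile(r'''
    (\d{1,3})   # first number
    \.          # a literal dot
    (\d{1,3})   # second number
''', re.X)
pairs = number_pair.findall('host 10.20')
```

Flags are combined with the | operator, as in re.I | re.M above.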

The three parameters of Request are the url, the POST data, and the headers.

Reposted from: http://blog.csdn.net/qq_28295425/article/details/53729758