Much of the code here comes from mistakes I made myself and crawler experience I slowly accumulated; these are really just notes for my own use. Language skills build up gradually: without gathering small streams you cannot form a river or sea, and without single steps you cannot cover a thousand miles. Whatever others say, stick to your own direction and keep going; time passes before we notice. Whatever the thing is, persistence brings results. This first snippet deals with sites that refuse access to scripts, so the request has to be tweaked a little.
import sys, urllib2

# Pretend to be Firefox so the site does not reject the script.
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request("http://blog.csdn.net/lotluck", headers=headers)
content = urllib2.urlopen(req).read()           # page is UTF-8
encoding = sys.getfilesystemencoding()          # local encoding
print content.decode('UTF-8').encode(encoding)  # convert to the local encoding
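The same idea in Python 3, where urllib2 was folded into urllib.request (a sketch against the same URL as above; the user-agent string is just an example):

```python
import urllib.request

# Pretend to be a regular browser so sites that block the default
# Python user agent will still serve the page.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "Gecko/20100101 Firefox/115.0"),
}
req = urllib.request.Request("http://blog.csdn.net/lotluck", headers=headers)
print(req.get_header("User-agent"))  # the header is attached to the request

# The actual fetch (needs network access):
# content = urllib.request.urlopen(req).read().decode("utf-8")
```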
The next snippet is from Huang Ge's video. I practiced it myself, using HttpWatch to capture the traffic, and hit quite a few problems along the way; regular expressions came up too, and I am still studying those bit by bit. This is Huang Ge's Python crawler for collecting search-engine suggestion words.
# coding: utf-8
import urllib
import urllib2
import re

gic = urllib.quote("科技")
print gic
url = "http://sug.so.360.cn/suggest/word?callback=suggest_so&encodein=utf-8&encodeout=utf-8&word=" + gic
headers = {
    "Host": "sug.so.360.cn",
    "Referer": "http://www.so.com/",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.56 Safari/537.17",
}
req = urllib2.Request(url)
for key in headers:
    req.add_header(key, headers[key])
html = urllib2.urlopen(req).read()
# Pull every double-quoted string out of the JSONP response.
ss = re.findall(r'"(.*?)"', html)
for item in ss:
    print item
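In Python 3 the quoting and the JSONP parsing can be sketched without touching the network; the sample response string below is made up to mimic the shape of the 360 suggest payload:

```python
import re
import urllib.parse

word = urllib.parse.quote("科技")  # percent-encode the UTF-8 bytes
print(word)  # %E7%A7%91%E6%8A%80

# A made-up JSONP payload shaped like the 360 suggest response:
sample = 'suggest_so({"query":"科技","result":[{"word":"科技馆"},{"word":"科技部"}]})'
# Grab just the suggestion words instead of every quoted string.
words = re.findall(r'"word":"(.*?)"', sample)
print(words)  # ['科技馆', '科技部']
```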
# print the primes between 1 and 100
from math import sqrt

result = []
for num in range(2, 100):
    is_prime = True
    for sn in range(2, int(sqrt(num)) + 1):
        if num % sn == 0:
            is_prime = False
            break
    if is_prime:
        result.append(num)
print result
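The same primes can be generated with a sieve of Eratosthenes, which replaces the trial division with crossing out multiples; a small Python 3 sketch:

```python
def primes_below(n):
    """Sieve of Eratosthenes: mark multiples of each prime as composite."""
    is_prime = [True] * n
    is_prime[:2] = [False, False]  # 0 and 1 are not prime
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            # The first useful multiple is p*p; smaller ones were
            # already crossed out by smaller primes.
            for multiple in range(p * p, n, p):
                is_prime[multiple] = False
    return [i for i, flag in enumerate(is_prime) if flag]

print(primes_below(100))  # [2, 3, 5, 7, 11, ...]
```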
Emulating a switch statement with a dictionary:
from __future__ import division
x = 1
y = 2
operator = "/"
result = {
    "+": x + y,
    "-": x - y,
    "*": x * y,
    "/": x / y,
}
print result.get(operator)
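One caveat with the dict-as-switch: every expression is evaluated when the dict is built, so `x / y` runs even when `operator` is `"+"` (and would raise ZeroDivisionError if `y` were 0). Wrapping each branch in a function defers the work until a branch is actually chosen; a Python 3 sketch:

```python
# Each value is a function, so nothing is computed until dispatch.
ops = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,  # only runs when "/" is selected
}

x, y, operator = 1, 2, "/"
print(ops[operator](x, y))  # 0.5
```

With this shape, `ops["+"](5, 0)` works fine even though division by zero would fail, because the `/` branch is never called.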
import urllib2
response = urllib2.urlopen('http://www.xiyoumobile.com/')
html = response.read()
print html
# get the links on a page and download the articles
import urllib
import time

url = [''] * 20
con = urllib.urlopen('http://blog.sina.com.cn/twocold').read()
a = con.find(r'<a href=')
href = con.find(r'href=', a)
html = con.find(r'.html', href)
i = 0
while a != -1 and href != -1 and html != -1 and i < 2:
    url[i] = con[href + 6:html + 5]
    print url[i]
    a = con.find(r'<a href=', html)
    href = con.find(r'href=', a)
    html = con.find(r'.html', href)
    i = i + 1
else:
    print 'find end'
j = 0
while j < 2:
    content = urllib.urlopen(url[j]).read()
    open(r'lal/' + url[j][-31:], 'w+').write(content)
    print 'downloading', url[j]
    j = j + 1
    time.sleep(15)
else:
    print 'download article finished'
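String `find` with hard-coded offsets (the `href+6` slicing above) breaks as soon as the markup changes slightly; the standard library's `html.parser` can pull the links out instead. A Python 3 sketch on a made-up page fragment:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A made-up fragment standing in for the downloaded page:
page = '<a href="/post/1.html">one</a> <a href="/post/2.html">two</a>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/post/1.html', '/post/2.html']
```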
# grab the first article link from Han Han's blog
import urllib
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
href = con.find(r'<a href=')
print href
html = con.find(r'.html', href)
print html
url = con[href + 9:html + 5]
print url