Much of the code here comes from mistakes I made myself and crawler experience I slowly accumulated; these are really just notes for my own use. Language skills build up gradually: without gathering small streams you cannot form a river or sea, and without single steps you cannot cover a thousand miles. Whatever others say, stick to your own direction and keep going; time passes before we notice. Whatever the thing is, persistence brings results. This first snippet deals with sites that refuse access to scripts, so the request has to be tweaked a little.
import sys, urllib2

# Pretend to be Firefox so the site does not reject the script.
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
req = urllib2.Request("http://blog.csdn.net/lotluck", headers=headers)
content = urllib2.urlopen(req).read()           # page is UTF-8
encoding = sys.getfilesystemencoding()          # local encoding
print content.decode('UTF-8').encode(encoding)  # convert to the local encoding
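The same idea in Python 3, where urllib2 was folded into urllib.request (a sketch against the same URL as above; the user-agent string is just an example):

```python
import urllib.request

# Pretend to be a regular browser so sites that block the default
# Python user agent will still serve the page.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "Gecko/20100101 Firefox/115.0"),
}
req = urllib.request.Request("http://blog.csdn.net/lotluck", headers=headers)
print(req.get_header("User-agent"))  # the header is attached to the request

# The actual fetch (needs network access):
# content = urllib.request.urlopen(req).read().decode("utf-8")
```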
The next snippet is from Huang Ge's video. I practiced it myself, using HttpWatch to capture the traffic, and hit quite a few problems along the way; regular expressions came up too, and I am still studying those bit by bit. This is Huang Ge's Python crawler for collecting search-engine suggestion words.
# coding: utf-8
import urllib
import urllib2
import re

gic = urllib.quote("科技")
print gic
url = "http://sug.so.360.cn/suggest/word?callback=suggest_so&encodein=utf-8&encodeout=utf-8&word=" + gic
headers = {
    "Host": "sug.so.360.cn",
    "Referer": "http://www.so.com/",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.56 Safari/537.17",
}
req = urllib2.Request(url)
for key in headers:
    req.add_header(key, headers[key])
html = urllib2.urlopen(req).read()
# Pull every double-quoted string out of the JSONP response.
ss = re.findall(r'"(.*?)"', html)
for item in ss:
    print item
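In Python 3 the quoting and the JSONP parsing can be sketched without touching the network; the sample response string below is made up to mimic the shape of the 360 suggest payload:

```python
import re
import urllib.parse

word = urllib.parse.quote("科技")  # percent-encode the UTF-8 bytes
print(word)  # %E7%A7%91%E6%8A%80

# A made-up JSONP payload shaped like the 360 suggest response:
sample = 'suggest_so({"query":"科技","result":[{"word":"科技馆"},{"word":"科技部"}]})'
# Grab just the suggestion words instead of every quoted string.
words = re.findall(r'"word":"(.*?)"', sample)
print(words)  # ['科技馆', '科技部']
```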
# print the primes between 1 and 100
from math import sqrt

result = []
for num in range(2, 100):
    is_prime = True
    for sn in range(2, int(sqrt(num)) + 1):
        if num % sn == 0:
            is_prime = False
            break
    if is_prime:
        result.append(num)
print result
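The same primes can be generated with a sieve of Eratosthenes, which replaces the trial division with crossing out multiples; a small Python 3 sketch:

```python
def primes_below(n):
    """Sieve of Eratosthenes: mark multiples of each prime as composite."""
    is_prime = [True] * n
    is_prime[:2] = [False, False]  # 0 and 1 are not prime
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            # The first useful multiple is p*p; smaller ones were
            # already crossed out by smaller primes.
            for multiple in range(p * p, n, p):
                is_prime[multiple] = False
    return [i for i, flag in enumerate(is_prime) if flag]

print(primes_below(100))  # [2, 3, 5, 7, 11, ...]
```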
Emulating a switch statement with a dictionary:
from __future__ import division
x = 1
y = 2
operator = "/"
result = {
    "+": x + y,
    "-": x - y,
    "*": x * y,
    "/": x / y,
}
print result.get(operator)
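One caveat with the dict-as-switch: every expression is evaluated when the dict is built, so `x / y` runs even when `operator` is `"+"` (and would raise ZeroDivisionError if `y` were 0). Wrapping each branch in a function defers the work until a branch is actually chosen; a Python 3 sketch:

```python
# Each value is a function, so nothing is computed until dispatch.
ops = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,  # only runs when "/" is selected
}

x, y, operator = 1, 2, "/"
print(ops[operator](x, y))  # 0.5
```

With this shape, `ops["+"](5, 0)` works fine even though division by zero would fail, because the `/` branch is never called.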
import urllib2
response = urllib2.urlopen('http://www.xiyoumobile.com/')
html = response.read()
print html
# get the links on a page and download the articles
import urllib
import time

url = [''] * 20
con = urllib.urlopen('http://blog.sina.com.cn/twocold').read()
a = con.find(r'<a href=')
href = con.find(r'href=', a)
html = con.find(r'.html', href)
i = 0
while a != -1 and href != -1 and html != -1 and i < 2:
    url[i] = con[href + 6:html + 5]
    print url[i]
    a = con.find(r'<a href=', html)
    href = con.find(r'href=', a)
    html = con.find(r'.html', href)
    i = i + 1
else:
    print 'find end'
j = 0
while j < 2:
    content = urllib.urlopen(url[j]).read()
    open(r'lal/' + url[j][-31:], 'w+').write(content)
    print 'downloading', url[j]
    j = j + 1
    time.sleep(15)
else:
    print 'download article finished'
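String `find` with hard-coded offsets (the `href+6` slicing above) breaks as soon as the markup changes slightly; the standard library's `html.parser` can pull the links out instead. A Python 3 sketch on a made-up page fragment:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A made-up fragment standing in for the downloaded page:
page = '<a href="/post/1.html">one</a> <a href="/post/2.html">two</a>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/post/1.html', '/post/2.html']
```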
# grab the first article link from Han Han's blog
import urllib
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
href = con.find(r'<a href=')
print href
html = con.find(r'.html', href)
print html
url = con[href + 9:html + 5]
print url