Python: Using Proxies in Web Crawlers

Using a proxy with the urllib module

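Using a proxy with urllib / urllib2 is somewhat cumbersome: you first build a ProxyHandler object, then use it to build an opener for opening pages, and finally install that opener so that subsequent requests go through it.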

The proxy format is "http://112.25.41.136:80"; if a username and password are required, it is "http://user:password@112.25.41.136:80".

<span style="color:#276904"><span style="color:#ffffff"><code><span style="color:#f92672">import</span> <span style="color:#ffffff">urllib.request</span>

<span style="color:#ffffff">proxy</span><span style="color:#f92672">=</span><span style="color:#e6db74">"http://112.25.41.136:80"</span>
<span style="color:#75715e"># Build ProxyHandler object by given proxy</span>
<span style="color:#ffffff">proxy_support</span><span style="color:#f92672">=</span><span style="color:#ffffff">urllib</span><span style="color:#f92672">.</span><span style="color:#ffffff">request</span><span style="color:#f92672">.</span><span style="color:#ffffff">ProxyHandler</span>({<span style="color:#e6db74">'http'</span>:<span style="color:#ffffff">proxy</span>})
<span style="color:#75715e"># Build opener with ProxyHandler object</span>
<span style="color:#ffffff">opener</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">urllib</span><span style="color:#f92672">.</span><span style="color:#ffffff">request</span><span style="color:#f92672">.</span><span style="color:#ffffff">build_opener</span>(<span style="color:#ffffff">proxy_support</span>)
<span style="color:#75715e"># Install opener to request</span>
<span style="color:#ffffff">urllib</span><span style="color:#f92672">.</span><span style="color:#ffffff">request</span><span style="color:#f92672">.</span><span style="color:#ffffff">install_opener</span>(<span style="color:#ffffff">opener</span>)
<span style="color:#75715e"># Open url (timeout is in seconds)</span>
<span style="color:#ffffff">r</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">urllib</span><span style="color:#f92672">.</span><span style="color:#ffffff">request</span><span style="color:#f92672">.</span><span style="color:#ffffff">urlopen</span>(<span style="color:#e6db74">'http://icanhazip.com'</span>,<span style="color:#ffffff">timeout</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">10</span>)
</code></span></span>
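Installing an opener changes the behavior of every later `urlopen()` call in the process. When that is not wanted, the opener can be used directly instead. The following is a minimal sketch of that variant; `build_proxy_opener` is a hypothetical helper name, and the proxy address is the same placeholder as above.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that routes http and https requests through proxy_url.

    build_proxy_opener is a hypothetical helper name, not part of urllib.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    # Passing our own ProxyHandler replaces the default one in the opener.
    return urllib.request.build_opener(handler)


# Placeholder address from the text above; replace it with a live proxy.
opener = build_proxy_opener("http://112.25.41.136:80")
```

A request would then be made with `opener.open('http://icanhazip.com', timeout=10)`, leaving plain `urllib.request.urlopen()` untouched.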

Using a proxy with the requests module

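The example here configures a proxy for a single request. For repeated requests, you can build on a Session object instead.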

If you need to use a proxy, you can configure an individual request by passing the proxies argument to any request method:

<span style="color:#276904"><span style="color:#ffffff"><code><span style="color:#f92672">import</span> <span style="color:#ffffff">requests</span>

<span style="color:#ffffff">proxies</span> <span style="color:#f92672">=</span> {
  <span style="color:#e6db74">"http"</span>: <span style="color:#e6db74">"http://10.10.1.10:3128"</span>,
  <span style="color:#e6db74">"https"</span>: <span style="color:#e6db74">"http://10.10.1.10:1080"</span>,
}

<span style="color:#ffffff">r</span><span style="color:#f92672">=</span><span style="color:#ffffff">requests</span><span style="color:#f92672">.</span><span style="color:#ffffff">get</span>(<span style="color:#e6db74">"http://icanhazip.com"</span>, <span style="color:#ffffff">proxies</span><span style="color:#f92672">=</span><span style="color:#ffffff">proxies</span>)
<span style="color:#f92672">print</span>(<span style="color:#ffffff">r</span><span style="color:#f92672">.</span><span style="color:#ffffff">text</span>)
</code></span></span>
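For many requests through the same proxy, a Session can carry the proxy settings so they do not have to be repeated on each call. This is a rough sketch only; `make_proxied_session` is a hypothetical helper name and the addresses are the placeholders used above.

```python
import requests


def make_proxied_session(proxies):
    """Return a requests.Session whose requests use the given proxies.

    make_proxied_session is a hypothetical helper name for this sketch.
    """
    session = requests.Session()
    # session.proxies is a plain dict keyed by URL scheme.
    session.proxies.update(proxies)
    return session


session = make_proxied_session({
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
})
```

Calls such as `session.get("http://icanhazip.com")` would then go through the configured proxies. Note that proxy environment variables, if set, may take precedence over `session.proxies`; passing `proxies=` per request avoids that ambiguity.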

You can also configure proxies through the environment variables HTTP_PROXY and HTTPS_PROXY.

<span style="color:#276904"><span style="color:#ffffff"><code><span style="color:#f6aa11">export </span><span style="color:#ffffff">HTTP_PROXY</span><span style="color:#f92672">=</span><span style="color:#e6db74">"http://10.10.1.10:3128"</span>
<span style="color:#f6aa11">export </span><span style="color:#ffffff">HTTPS_PROXY</span><span style="color:#f92672">=</span><span style="color:#e6db74">"http://10.10.1.10:1080"</span>
python
<span style="color:#555555">>>> </span>import requests
<span style="color:#555555">>>> </span><span style="color:#ffffff">r</span><span style="color:#f92672">=</span>requests.get<span style="color:#f92672">(</span><span style="color:#e6db74">"http://icanhazip.com"</span><span style="color:#f92672">)</span>
<span style="color:#555555">>>> </span>print(r.text)
</code></span></span>
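To check from inside Python which proxies will be picked up from the environment, the standard library's `urllib.request.getproxies_environment()` can be used. The sketch below sets the variables itself so that it is self-contained; in practice the `export` lines in the shell do this. The addresses are placeholders.

```python
import os
import urllib.request

# Set both upper- and lower-case spellings so the result does not depend on
# variables that may already exist in the surrounding environment.
for name, value in (("HTTP_PROXY", "http://10.10.1.10:3128"),
                    ("HTTPS_PROXY", "http://10.10.1.10:1080")):
    os.environ[name] = value
    os.environ[name.lower()] = value

# getproxies_environment() scans os.environ for variables ending in "_proxy"
# and returns a dict keyed by scheme.
env_proxies = urllib.request.getproxies_environment()
print(env_proxies["http"], env_proxies["https"])
```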

If your proxy requires HTTP Basic Auth, you can use the http://user:password@host/ syntax:

<span style="color:#276904"><span style="color:#ffffff"><code><span style="color:#ffffff">proxies</span> <span style="color:#f92672">=</span> {
    <span style="color:#e6db74">"http"</span>: <span style="color:#e6db74">"http://user:pass@10.10.1.10:3128/"</span>,
}
</code></span></span>
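If the username or password contains characters such as `@` or `:`, they need to be percent-encoded before being placed in the URL. A minimal sketch of building such a URL, assuming a hypothetical helper `make_proxy_url` and made-up credentials:

```python
from urllib.parse import quote


def make_proxy_url(host, port, user=None, password=None):
    """Build an http proxy URL, percent-encoding any credentials.

    make_proxy_url is a hypothetical helper name for this sketch.
    """
    if user is None:
        return "http://{}:{}".format(host, port)
    # safe="" makes quote() encode every reserved character, including "@" and ":".
    credentials = "{}:{}".format(quote(user, safe=""), quote(password or "", safe=""))
    return "http://{}@{}:{}".format(credentials, host, port)


# "p@ss" is a made-up password that would otherwise break URL parsing.
proxies = {"http": make_proxy_url("10.10.1.10", 3128, "user", "p@ss")}
print(proxies["http"])
```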

Example script

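Here we build a proxy pool class, using the elite (high-anonymity) proxies from gatherproxy as an example. Other sources, such as Xici proxy (xicidaili), can be handled in the same way.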

<span style="color:#276904"><span style="color:#ffffff"><code><span style="color:#75715e">#! /usr/bin/env python</span>
<span style="color:#75715e"># -*- coding: utf-8 -*-</span>

<span style="color:#ffffff">__author__</span><span style="color:#f92672">=</span><span style="color:#e6db74">"Platinhom"</span>
<span style="color:#ffffff">__date__</span><span style="color:#f92672">=</span><span style="color:#e6db74">"2016.1.29 23:30"</span>

<span style="color:#f92672">import</span> <span style="color:#ffffff">re</span><span style="color:#f92672">,</span><span style="color:#ffffff">requests</span><span style="color:#f92672">,</span><span style="color:#ffffff">random</span>

<span style="color:#ffffff">header</span><span style="color:#f92672">=</span>{<span style="color:#e6db74">'User-Agent'</span>:<span style="color:#e6db74">'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'</span>}

<span style="color:#f92672">class</span> <span style="color:#a6e22e">GatherProxy</span>(<span style="color:#f6aa11">object</span>):
	<span style="color:#e6db74">'''To get proxy from http://gatherproxy.com/'''</span>
	<span style="color:#ffffff">url</span><span style="color:#f92672">=</span><span style="color:#e6db74">'http://gatherproxy.com/proxylist'</span>
	<span style="color:#ffffff">pre1</span><span style="color:#f92672">=</span><span style="color:#ffffff">re</span><span style="color:#f92672">.</span><span style="color:#f6aa11">compile</span>(<span style="color:#e6db74">r'<tr.*?>(?:.|</span><span style="color:#960050">\</span><span style="color:#e6db74">n)*?</tr>'</span>)
	<span style="color:#ffffff">pre2</span><span style="color:#f92672">=</span><span style="color:#ffffff">re</span><span style="color:#f92672">.</span><span style="color:#f6aa11">compile</span>(<span style="color:#e6db74">r"(?<=</span><span style="color:#960050">\</span><span style="color:#e6db74">(</span><span style="color:#960050">\</span><span style="color:#e6db74">').+?(?=</span><span style="color:#960050">\</span><span style="color:#e6db74">'</span><span style="color:#960050">\</span><span style="color:#e6db74">))"</span>)

	<span style="color:#f92672">def</span> <span style="color:#a6e22e">getelite</span>(<span style="color:#ffffff">self</span>,<span style="color:#ffffff">pages</span><span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>,<span style="color:#ffffff">uptime</span><span style="color:#f92672">=</span><span style="color:#ae81ff">70</span>,<span style="color:#ffffff">fast</span><span style="color:#f92672">=</span><span style="color:#ffffff">True</span>):
		<span style="color:#e6db74">'''Get elite anonymity proxies
		pages defines how many pages to get
		uptime defines the uptime (L/D) filter
		fast defines whether to keep only fast proxies with a short response time'''</span>

		<span style="color:#ffffff">proxies</span><span style="color:#f92672">=</span><span style="color:#f6aa11">set</span>()
		<span style="color:#f92672">for</span> <span style="color:#ffffff">i</span> <span style="color:#f92672">in</span> <span style="color:#f6aa11">range</span>(<span style="color:#ae81ff">1</span>,<span style="color:#ffffff">pages</span><span style="color:#f92672">+</span><span style="color:#ae81ff">1</span>):
			<span style="color:#ffffff">params</span><span style="color:#f92672">=</span>{<span style="color:#e6db74">"Type"</span>:<span style="color:#e6db74">"elite"</span>,<span style="color:#e6db74">"PageIdx"</span>:<span style="color:#f6aa11">str</span>(<span style="color:#ffffff">i</span>),<span style="color:#e6db74">"Uptime"</span>:<span style="color:#f6aa11">str</span>(<span style="color:#ffffff">uptime</span>)}
			<span style="color:#ffffff">r</span><span style="color:#f92672">=</span><span style="color:#ffffff">requests</span><span style="color:#f92672">.</span><span style="color:#ffffff">post</span>(<span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">url</span><span style="color:#f92672">+</span><span style="color:#e6db74">"/anonymity/t=Elite"</span>,<span style="color:#ffffff">params</span><span style="color:#f92672">=</span><span style="color:#ffffff">params</span>,<span style="color:#ffffff">headers</span><span style="color:#f92672">=</span><span style="color:#ffffff">header</span>)
			<span style="color:#f92672">for</span> <span style="color:#ffffff">td</span> <span style="color:#f92672">in</span> <span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">pre1</span><span style="color:#f92672">.</span><span style="color:#ffffff">findall</span>(<span style="color:#ffffff">r</span><span style="color:#f92672">.</span><span style="color:#ffffff">text</span>):
				<span style="color:#f92672">if</span> <span style="color:#ffffff">fast</span> <span style="color:#f92672">and</span> <span style="color:#e6db74">'center fast'</span> <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> <span style="color:#ffffff">td</span>:
					<span style="color:#f92672">continue</span> 
				<span style="color:#f92672">try</span>:
					<span style="color:#ffffff">tmp</span><span style="color:#f92672">=</span> <span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">pre2</span><span style="color:#f92672">.</span><span style="color:#ffffff">findall</span>(<span style="color:#f6aa11">str</span>(<span style="color:#ffffff">td</span>))
					<span style="color:#f92672">if</span>(<span style="color:#f6aa11">len</span>(<span style="color:#ffffff">tmp</span>)<span style="color:#f92672">==</span><span style="color:#ae81ff">2</span>):
						<span style="color:#ffffff">proxies</span><span style="color:#f92672">.</span><span style="color:#ffffff">add</span>(<span style="color:#ffffff">tmp</span>[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">+</span><span style="color:#e6db74">":"</span><span style="color:#f92672">+</span><span style="color:#f6aa11">str</span>(<span style="color:#f6aa11">int</span>(<span style="color:#e6db74">'0x'</span><span style="color:#f92672">+</span><span style="color:#ffffff">tmp</span>[<span style="color:#ae81ff">1</span>],<span style="color:#ae81ff">16</span>)))
				<span style="color:#f92672">except</span> <span style="color:#f6aa11">ValueError</span>:
					<span style="color:#f92672">pass</span>
		<span style="color:#f92672">return</span> <span style="color:#ffffff">proxies</span>

<span style="color:#f92672">class</span> <span style="color:#a6e22e">ProxyPool</span>(<span style="color:#f6aa11">object</span>):
	<span style="color:#e6db74">'''A proxypool class to obtain proxy'''</span>

	<span style="color:#ffffff">gatherproxy</span><span style="color:#f92672">=</span><span style="color:#ffffff">GatherProxy</span>()

	<span style="color:#f92672">def</span> <span style="color:#a6e22e">__init__</span>(<span style="color:#ffffff">self</span>):
		<span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">pool</span><span style="color:#f92672">=</span><span style="color:#f6aa11">set</span>()

	<span style="color:#f92672">def</span> <span style="color:#a6e22e">updateGatherProxy</span>(<span style="color:#ffffff">self</span>,<span style="color:#ffffff">pages</span><span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>,<span style="color:#ffffff">uptime</span><span style="color:#f92672">=</span><span style="color:#ae81ff">70</span>,<span style="color:#ffffff">fast</span><span style="color:#f92672">=</span><span style="color:#ffffff">True</span>):
		<span style="color:#e6db74">'''Use GatherProxy to update proxy pool'''</span>
		<span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">pool</span><span style="color:#f92672">.</span><span style="color:#ffffff">update</span>(<span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">gatherproxy</span><span style="color:#f92672">.</span><span style="color:#ffffff">getelite</span>(<span style="color:#ffffff">pages</span><span style="color:#f92672">=</span><span style="color:#ffffff">pages</span>,<span style="color:#ffffff">uptime</span><span style="color:#f92672">=</span><span style="color:#ffffff">uptime</span>,<span style="color:#ffffff">fast</span><span style="color:#f92672">=</span><span style="color:#ffffff">fast</span>))

	<span style="color:#f92672">def</span> <span style="color:#a6e22e">removeproxy</span>(<span style="color:#ffffff">self</span>,<span style="color:#ffffff">proxy</span>):
		<span style="color:#e6db74">'''Remove a proxy from pool'''</span>
		<span style="color:#f92672">if</span> (<span style="color:#ffffff">proxy</span> <span style="color:#f92672">in</span> <span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">pool</span>):
			<span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">pool</span><span style="color:#f92672">.</span><span style="color:#ffffff">remove</span>(<span style="color:#ffffff">proxy</span>)

	<span style="color:#f92672">def</span> <span style="color:#a6e22e">randomchoose</span>(<span style="color:#ffffff">self</span>):
		<span style="color:#e6db74">'''Random Get a proxy from pool'''</span>
		<span style="color:#75715e"># random.sample() no longer accepts a set (Python 3.11+), so pick from a list</span>
		<span style="color:#f92672">if</span> (<span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">pool</span>):
			<span style="color:#f92672">return</span> <span style="color:#ffffff">random</span><span style="color:#f92672">.</span><span style="color:#ffffff">choice</span>(<span style="color:#f6aa11">list</span>(<span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">pool</span>))
		<span style="color:#f92672">else</span>:
			<span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">updateGatherProxy</span>()
			<span style="color:#f92672">return</span> <span style="color:#ffffff">random</span><span style="color:#f92672">.</span><span style="color:#ffffff">choice</span>(<span style="color:#f6aa11">list</span>(<span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">pool</span>))

	<span style="color:#f92672">def</span> <span style="color:#a6e22e">getproxy</span>(<span style="color:#ffffff">self</span>):
		<span style="color:#e6db74">'''Get a dict format proxy randomly'''</span>
		<span style="color:#ffffff">proxy</span><span style="color:#f92672">=</span><span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">randomchoose</span>()
		<span style="color:#ffffff">proxies</span><span style="color:#f92672">=</span>{<span style="color:#e6db74">'http'</span>:<span style="color:#e6db74">'http://'</span><span style="color:#f92672">+</span><span style="color:#ffffff">proxy</span>,<span style="color:#e6db74">'https'</span>:<span style="color:#e6db74">'http://'</span><span style="color:#f92672">+</span><span style="color:#ffffff">proxy</span>}
		<span style="color:#75715e">#r=requests.get('http://icanhazip.com',proxies=proxies,timeout=1)</span>
		<span style="color:#f92672">try</span>:
			<span style="color:#ffffff">r</span><span style="color:#f92672">=</span><span style="color:#ffffff">requests</span><span style="color:#f92672">.</span><span style="color:#ffffff">get</span>(<span style="color:#e6db74">'http://dx.doi.org'</span>,<span style="color:#ffffff">proxies</span><span style="color:#f92672">=</span><span style="color:#ffffff">proxies</span>,<span style="color:#ffffff">timeout</span><span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>)
			<span style="color:#f92672">if</span> (<span style="color:#ffffff">r</span><span style="color:#f92672">.</span><span style="color:#ffffff">status_code</span> <span style="color:#f92672">==</span> <span style="color:#ae81ff">200</span> ):
				<span style="color:#f92672">return</span> <span style="color:#ffffff">proxies</span>
			<span style="color:#f92672">else</span>:
				<span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">removeproxy</span>(<span style="color:#ffffff">proxy</span>)
				<span style="color:#f92672">return</span> <span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">getproxy</span>()
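		<span style="color:#f92672">except</span> <span style="color:#ffffff">requests</span><span style="color:#f92672">.</span><span style="color:#ffffff">RequestException</span>: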
			<span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">removeproxy</span>(<span style="color:#ffffff">proxy</span>)
			<span style="color:#f92672">return</span> <span style="color:#ffffff">self</span><span style="color:#f92672">.</span><span style="color:#ffffff">getproxy</span>()
</code></span></span>
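The `getproxy` method above retries by calling itself, so a pool full of dead proxies can recurse for a long time. A bounded loop is one alternative. The sketch below is not part of the original script: `pick_working_proxy`, its `check` callable and `max_tries` are assumptions introduced for illustration, and the checker is a stand-in for a real test request.

```python
import random


def pick_working_proxy(pool, check, max_tries=5):
    """Return a proxy from pool that passes check, or None after max_tries.

    pool is a set of "host:port" strings and check is any callable that
    returns True for a usable proxy. Both are assumptions of this sketch.
    Proxies that fail the check are removed from pool.
    """
    for _ in range(max_tries):
        if not pool:
            return None
        # random.choice needs a sequence, so convert the set first.
        proxy = random.choice(sorted(pool))
        if check(proxy):
            return proxy
        pool.discard(proxy)
    return None


# Stand-in checker for demonstration; a real one would issue a test request.
pool = {"10.0.0.1:8080", "10.0.0.2:3128"}
chosen = pick_working_proxy(pool, lambda p: p.endswith(":3128"))
print(chosen)
```

Returning `None` instead of recursing leaves it to the caller to decide whether to refresh the pool or give up.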

Example 2: Boosting CSDN blog views through urllib proxies (reposted)

Reposted from FadeTrack's "Python Crawler Primer (Part 2)". It uses Xici proxy (xicidaili) as the proxy source.

<span style="color:#276904"><span style="color:#ffffff"><code><span style="color:#75715e"># Boost CSDN blog view counts</span>
<span style="color:#f92672">import</span> <span style="color:#ffffff">urllib.request</span>
<span style="color:#f92672">import</span> <span style="color:#ffffff">re</span><span style="color:#f92672">,</span><span style="color:#ffffff">random</span>
<span style="color:#f92672">from</span> <span style="color:#ffffff">multiprocessing.dummy</span> <span style="color:#f92672">import</span> <span style="color:#ffffff">Pool</span> <span style="color:#f92672">as</span> <span style="color:#ffffff">ThreadPool</span> 
<span style="color:#ffffff">time_out</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">3</span> <span style="color:#75715e"># global timeout of 3 seconds</span>
<span style="color:#ffffff">count</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
<span style="color:#ffffff">proxies</span> <span style="color:#f92672">=</span> [<span style="color:#ffffff">None</span>]
<span style="color:#ffffff">headers</span> <span style="color:#f92672">=</span> {<span style="color:#e6db74">'User-Agent'</span>:<span style="color:#e6db74">'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36'</span>}
<span style="color:#f92672">def</span> <span style="color:#a6e22e">get_proxy</span>():
    <span style="color:#75715e"># Modify the global proxy list</span>
    <span style="color:#f92672">global</span> <span style="color:#ffffff">proxies</span>
    <span style="color:#ffffff">req</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">urllib</span><span style="color:#f92672">.</span><span style="color:#ffffff">request</span><span style="color:#f92672">.</span><span style="color:#ffffff">Request</span>(<span style="color:#e6db74">'http://www.xicidaili.com/'</span>,<span style="color:#ffffff">None</span>,<span style="color:#ffffff">headers</span>)
    <span style="color:#f92672">try</span>:
        <span style="color:#75715e"># The network call is what can fail, so it belongs inside the try</span>
        <span style="color:#ffffff">response</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">urllib</span><span style="color:#f92672">.</span><span style="color:#ffffff">request</span><span style="color:#f92672">.</span><span style="color:#ffffff">urlopen</span>(<span style="color:#ffffff">req</span>,<span style="color:#ffffff">timeout</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">time_out</span>)
    <span style="color:#f92672">except</span> <span style="color:#f6aa11">Exception</span>:
        <span style="color:#f92672">print</span>(<span style="color:#e6db74">'Unable to fetch proxy information!'</span>)
        <span style="color:#f92672">return</span>
    <span style="color:#ffffff">html</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">response</span><span style="color:#f92672">.</span><span style="color:#ffffff">read</span>()<span style="color:#f92672">.</span><span style="color:#ffffff">decode</span>(<span style="color:#e6db74">'utf-8'</span>)
    <span style="color:#ffffff">p</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">re</span><span style="color:#f92672">.</span><span style="color:#f6aa11">compile</span>(<span style="color:#e6db74">r'''<tr</span><span style="color:#960050">\</span><span style="color:#e6db74">sclass[^>]*></span><span style="color:#960050">\</span><span style="color:#e6db74">s+
                                    <td>.+</td></span><span style="color:#960050">\</span><span style="color:#e6db74">s+
                                    <td>(.*)?</td></span><span style="color:#960050">\</span><span style="color:#e6db74">s+
                                    <td>(.*)?</td></span><span style="color:#960050">\</span><span style="color:#e6db74">s+
                                    <td>(.*)?</td></span><span style="color:#960050">\</span><span style="color:#e6db74">s+
                                    <td>(.*)?</td></span><span style="color:#960050">\</span><span style="color:#e6db74">s+
                                    <td>(.*)?</td></span><span style="color:#960050">\</span><span style="color:#e6db74">s+
                                    <td>(.*)?</td></span><span style="color:#960050">\</span><span style="color:#e6db74">s+
                                </tr>'''</span>,<span style="color:#ffffff">re</span><span style="color:#f92672">.</span><span style="color:#ffffff">VERBOSE</span>)
    <span style="color:#ffffff">proxy_list</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">p</span><span style="color:#f92672">.</span><span style="color:#ffffff">findall</span>(<span style="color:#ffffff">html</span>)
    <span style="color:#f92672">for</span> <span style="color:#ffffff">each_proxy</span> <span style="color:#f92672">in</span> <span style="color:#ffffff">proxy_list</span>[<span style="color:#ae81ff">1</span>:]:
        <span style="color:#f92672">if</span> <span style="color:#ffffff">each_proxy</span>[<span style="color:#ae81ff">4</span>] <span style="color:#f92672">==</span> <span style="color:#e6db74">'HTTP'</span>:
            <span style="color:#ffffff">proxies</span><span style="color:#f92672">.</span><span style="color:#ffffff">append</span>(<span style="color:#ffffff">each_proxy</span>[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">+</span><span style="color:#e6db74">':'</span><span style="color:#f92672">+</span><span style="color:#ffffff">each_proxy</span>[<span style="color:#ae81ff">1</span>])
<span style="color:#f92672">def</span> <span style="color:#a6e22e">change_proxy</span>():
    <span style="color:#75715e"># Pick a random element from the list</span>
    <span style="color:#ffffff">proxy</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">random</span><span style="color:#f92672">.</span><span style="color:#ffffff">choice</span>(<span style="color:#ffffff">proxies</span>)
    <span style="color:#75715e"># Check whether the element is usable (None means no proxy)</span>
    <span style="color:#f92672">if</span> <span style="color:#ffffff">proxy</span> <span style="color:#f92672">is</span> <span style="color:#ffffff">None</span>:
        <span style="color:#ffffff">proxy_support</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">urllib</span><span style="color:#f92672">.</span><span style="color:#ffffff">request</span><span style="color:#f92672">.</span><span style="color:#ffffff">ProxyHandler</span>({})
    <span style="color:#f92672">else</span>:
        <span style="color:#ffffff">proxy_support</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">urllib</span><span style="color:#f92672">.</span><span style="color:#ffffff">request</span><span style="color:#f92672">.</span><span style="color:#ffffff">ProxyHandler</span>({<span style="color:#e6db74">'http'</span>:<span style="color:#ffffff">proxy</span>})
    <span style="color:#ffffff">opener</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">urllib</span><span style="color:#f92672">.</span><span style="color:#ffffff">request</span><span style="color:#f92672">.</span><span style="color:#ffffff">build_opener</span>(<span style="color:#ffffff">proxy_support</span>)
    <span style="color:#ffffff">opener</span><span style="color:#f92672">.</span><span style="color:#ffffff">addheaders</span> <span style="color:#f92672">=</span> [(<span style="color:#e6db74">'User-Agent'</span>,<span style="color:#ffffff">headers</span>[<span style="color:#e6db74">'User-Agent'</span>])]
    <span style="color:#ffffff">urllib</span><span style="color:#f92672">.</span><span style="color:#ffffff">request</span><span style="color:#f92672">.</span><span style="color:#ffffff">install_opener</span>(<span style="color:#ffffff">opener</span>)
    <span style="color:#f92672">print</span>(<span style="color:#e6db74">'Switched proxy: </span><span style="color:#e6db74">%</span><span style="color:#e6db74">s'</span> <span style="color:#f92672">%</span> (<span style="color:#e6db74">'local machine'</span> <span style="color:#f92672">if</span> <span style="color:#ffffff">proxy</span> <span style="color:#f92672">is</span> <span style="color:#ffffff">None</span> <span style="color:#f92672">else</span> <span style="color:#ffffff">proxy</span>))
<span style="color:#f92672">def</span> <span style="color:#a6e22e">get_req</span>(<span style="color:#ffffff">url</span>):
    <span style="color:#75715e"># Fake the request headers first, using a dict</span>
    <span style="color:#ffffff">blog_header</span> <span style="color:#f92672">=</span> {
                <span style="color:#e6db74">'User-Agent'</span>:<span style="color:#e6db74">'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36'</span>,
                <span style="color:#e6db74">'Host'</span>:<span style="color:#e6db74">'blog.csdn.net'</span>,
                <span style="color:#e6db74">'Referer'</span>:<span style="color:#e6db74">'http://blog.csdn.net/'</span>,
                <span style="color:#e6db74">'GET'</span>:<span style="color:#ffffff">url</span>
                } 
    <span style="color:#ffffff">req</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">urllib</span><span style="color:#f92672">.</span><span style="color:#ffffff">request</span><span style="color:#f92672">.</span><span style="color:#ffffff">Request</span>(<span style="color:#ffffff">url</span>,<span style="color:#ffffff">headers</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">blog_header</span>)
    <span style="color:#f92672">return</span> <span style="color:#ffffff">req</span>
<span style="color:#75715e"># Visit the blog</span>
<span style="color:#f92672">def</span> <span style="color:#a6e22e">look_blog</span>(<span style="color:#ffffff">url</span>):
    <span style="color:#75715e"># Switch the IP first</span>
    <span style="color:#ffffff">change_proxy</span>()
    <span style="color:#ffffff">req</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">get_req</span>(<span style="color:#ffffff">url</span>)
    <span style="color:#f92672">try</span>:
        <span style="color:#ffffff">urllib</span><span style="color:#f92672">.</span><span style="color:#ffffff">request</span><span style="color:#f92672">.</span><span style="color:#ffffff">urlopen</span>(<span style="color:#ffffff">req</span>,<span style="color:#ffffff">timeout</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">time_out</span>)
    <span style="color:#f92672">except</span>:
        <span style="color:#f92672">return</span>
    <span style="color:#f92672">else</span>:
        <span style="color:#f92672">print</span>(<span style="color:#e6db74">'Visit succeeded!'</span>)
<span style="color:#75715e"># Visit repeatedly</span>
<span style="color:#f92672">def</span> <span style="color:#a6e22e">click_blog</span>(<span style="color:#ffffff">url</span>):
    <span style="color:#f92672">for</span> <span style="color:#ffffff">i</span> <span style="color:#f92672">in</span> <span style="color:#f6aa11">range</span>(<span style="color:#ae81ff">0</span>,<span style="color:#ffffff">count</span>):
        <span style="color:#f92672">if</span>(<span style="color:#ffffff">i</span> <span style="color:#f92672">==</span> <span style="color:#ffffff">count</span>):
            <span style="color:#f92672">break</span>
        <span style="color:#f92672">print</span>(<span style="color:#e6db74">'Visiting blog </span><span style="color:#e6db74">%</span><span style="color:#e6db74">s, attempt </span><span style="color:#e6db74">%</span><span style="color:#e6db74">d'</span> <span style="color:#f92672">%</span> (<span style="color:#ffffff">url</span>,<span style="color:#ffffff">i</span>))
        <span style="color:#ffffff">look_blog</span>(<span style="color:#ffffff">url</span>)
<span style="color:#75715e"># Get the list of the blog's posts</span>
<span style="color:#f92672">def</span> <span style="color:#a6e22e">get_blog_list</span>(<span style="color:#ffffff">url</span>):
    <span style="color:#ffffff">req</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">get_req</span>(<span style="color:#ffffff">url</span>)
    <span style="color:#f92672">try</span>:
        <span style="color:#ffffff">response</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">urllib</span><span style="color:#f92672">.</span><span style="color:#ffffff">request</span><span style="color:#f92672">.</span><span style="color:#ffffff">urlopen</span>(<span style="color:#ffffff">req</span>,<span style="color:#ffffff">timeout</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">time_out</span>)
    <span style="color:#f92672">except</span>:
        <span style="color:#f92672">print</span>(<span style="color:#e6db74">'Unrecoverable error'</span>)
        <span style="color:#f92672">return</span> <span style="color:#ffffff">None</span>
    <span style="color:#75715e"># CSDN serves utf-8, so no transcoding is needed</span>
    <span style="color:#ffffff">html</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">response</span><span style="color:#f92672">.</span><span style="color:#ffffff">read</span>()
    <span style="color:#75715e"># Store the regular expression pattern</span>
    <span style="color:#ffffff">regx</span> <span style="color:#f92672">=</span> <span style="color:#e6db74">'<span class="link_title"><a href="(.+?)">'</span>
    <span style="color:#ffffff">pat</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">re</span><span style="color:#f92672">.</span><span style="color:#f6aa11">compile</span>(<span style="color:#ffffff">regx</span>)
    <span style="color:#75715e"># Writing list1 = re.findall('<span class="link_title"><a href="(.+?)">',str(html)) here would give the same result</span>
    <span style="color:#ffffff">blog_list</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">re</span><span style="color:#f92672">.</span><span style="color:#ffffff">findall</span>(<span style="color:#ffffff">pat</span>,<span style="color:#f6aa11">str</span>(<span style="color:#ffffff">html</span>))
    <span style="color:#f92672">return</span> <span style="color:#ffffff">blog_list</span>
<span style="color:#f92672">if</span> <span style="color:#ffffff">__name__</span> <span style="color:#f92672">==</span> <span style="color:#e6db74">'__main__'</span>:
    <span style="color:#75715e"># Initialize basic parameters (count is already a module-level name, so no global statement is needed)</span>
    <span style="color:#75715e"># Fetch proxies</span>
    <span style="color:#ffffff">get_proxy</span>()
    <span style="color:#f92672">print</span>(<span style="color:#e6db74">'Number of available proxies: </span><span style="color:#e6db74">%</span><span style="color:#e6db74">d'</span> <span style="color:#f92672">%</span> <span style="color:#f6aa11">len</span>(<span style="color:#ffffff">proxies</span>))
    <span style="color:#ffffff">blogurl</span> <span style="color:#f92672">=</span> <span style="color:#f6aa11">input</span>(<span style="color:#e6db74">'Enter blog URL: '</span>)
    <span style="color:#75715e"># This was originally my default input, out of laziness</span>
    <span style="color:#f92672">if</span> <span style="color:#f6aa11">len</span>(<span style="color:#ffffff">blogurl</span>) <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span>:
        <span style="color:#ffffff">blogurl</span> <span style="color:#f92672">=</span> <span style="color:#e6db74">'http://blog.csdn.net/bkxiaoc/'</span>
    <span style="color:#f92672">print</span>(<span style="color:#e6db74">'Blog address: </span><span style="color:#e6db74">%</span><span style="color:#e6db74">s'</span> <span style="color:#f92672">%</span> <span style="color:#ffffff">blogurl</span>)
    <span style="color:#f92672">try</span>:
        <span style="color:#ffffff">count</span> <span style="color:#f92672">=</span> <span style="color:#f6aa11">int</span>(<span style="color:#f6aa11">input</span>(<span style="color:#e6db74">'Enter the number of times: '</span>))
    <span style="color:#f92672">except</span> <span style="color:#f6aa11">ValueError</span>:
        <span style="color:#f92672">print</span>(<span style="color:#e6db74">'Invalid argument'</span>)
        <span style="color:#ffffff">quit</span>()
    <span style="color:#75715e"># Accept only counts from 1 to 999; this also rejects negative numbers</span>
    <span style="color:#f92672">if</span> <span style="color:#ffffff">count</span> <span style="color:#f92672">not</span> <span style="color:#f92672">in</span> <span style="color:#f6aa11">range</span>(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1000</span>):
        <span style="color:#f92672">print</span>(<span style="color:#e6db74">'Count is too large or too small'</span>)
        <span style="color:#ffffff">quit</span>()
    <span style="color:#f92672">print</span>(<span style="color:#e6db74">'Count confirmed: </span><span style="color:#e6db74">%</span><span style="color:#e6db74">d'</span> <span style="color:#f92672">%</span> <span style="color:#ffffff">count</span>)
    <span style="color:#75715e"># Get the list of blog posts; my blog had only one page when I tested this, so only one page of the list is fetched</span>
    <span style="color:#ffffff">blog_list</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">get_blog_list</span>(<span style="color:#ffffff">blogurl</span> <span style="color:#f92672">+</span> <span style="color:#e6db74">'?viewmode=contents'</span>)
    <span style="color:#f92672">if</span> <span style="color:#f6aa11">len</span>(<span style="color:#ffffff">blog_list</span>) <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span>:
        <span style="color:#f92672">print</span>(<span style="color:#e6db74">'No blog post list found'</span>)
        <span style="color:#ffffff">quit</span>()
    <span style="color:#f92672">print</span>(<span style="color:#e6db74">'Starting!!!!!!!!!!!!!!!!!!!!'</span>)
    <span style="color:#75715e"># Turn each relative link into a full URL, then hand the list to a thread pool</span>
    <span style="color:#ffffff">index</span> <span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>
    <span style="color:#f92672">for</span> <span style="color:#ffffff">each_link</span> <span style="color:#f92672">in</span> <span style="color:#ffffff">blog_list</span>:
        <span style="color:#75715e"># Prepend the site prefix</span>
        <span style="color:#ffffff">each_link</span> <span style="color:#f92672">=</span> <span style="color:#e6db74">'http://blog.csdn.net'</span> <span style="color:#f92672">+</span> <span style="color:#ffffff">each_link</span>
        <span style="color:#ffffff">blog_list</span>[<span style="color:#ffffff">index</span>] <span style="color:#f92672">=</span> <span style="color:#ffffff">each_link</span>
        <span style="color:#ffffff">index</span> <span style="color:#f92672">+=</span> <span style="color:#ae81ff">1</span>
    <span style="color:#75715e"># Use half as many threads as there are posts, but at least one (a pool of size 0 raises ValueError). let's go</span>
    <span style="color:#ffffff">pool</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">ThreadPool</span>(<span style="color:#f6aa11">max</span>(<span style="color:#ae81ff">1</span>, <span style="color:#f6aa11">len</span>(<span style="color:#ffffff">blog_list</span>) <span style="color:#f92672">//</span> <span style="color:#ae81ff">2</span>))
    <span style="color:#ffffff">results</span> <span style="color:#f92672">=</span> <span style="color:#ffffff">pool</span><span style="color:#f92672">.</span><span style="color:#f6aa11">map</span>(<span style="color:#ffffff">click_blog</span>, <span style="color:#ffffff">blog_list</span>)
    <span style="color:#ffffff">pool</span><span style="color:#f92672">.</span><span style="color:#ffffff">close</span>()
    <span style="color:#ffffff">pool</span><span style="color:#f92672">.</span><span style="color:#ffffff">join</span>()
    <span style="color:#f92672">print</span>(<span style="color:#e6db74">'Task complete!!!!!!!!!!!!!!!!!!!!'</span>)</code></span></span>
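
The last part of the script (complete the links, size the pool, map a worker over the list) can also be written more compactly. The following is only a minimal sketch, not the original code: the helper names `absolutize`, `pool_size` and `visit_all` are made up for illustration, `ThreadPool` is assumed to come from `multiprocessing.pool`, and the worker passed in stands in for `click_blog`.

```python
from multiprocessing.pool import ThreadPool
from urllib.parse import urljoin


def absolutize(base, links):
    """Turn relative post links into absolute URLs (hypothetical helper)."""
    return [urljoin(base, link) for link in links]


def pool_size(n_links):
    """Half as many threads as links, but never fewer than one."""
    return max(1, n_links // 2)


def visit_all(worker, links):
    """Run worker over every link in a thread pool and collect the results."""
    if not links:
        return []
    # map() blocks until every task has finished, so leaving the
    # with-block (which terminates the pool) is safe here.
    with ThreadPool(pool_size(len(links))) as pool:
        return pool.map(worker, links)


if __name__ == '__main__':
    # Stand-in worker: the real script would pass click_blog instead.
    links = absolutize('http://blog.csdn.net', ['/bkxiaoc/article/details/1'])
    print(visit_all(len, links))
```

`urljoin` avoids doubled or missing slashes when the prefix and the link are glued together, and the `max(1, ...)` guard keeps the pool usable when the blog has only one post.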