(Based on personal experience, for reference only; apologies for any errors.)
Basics
Modules:
requests, re
Usage example:
A simple case from the heibanke level-passing crawler challenge:
# encoding=utf-8
import requests
import re

url_start = "http://www.heibanke.com/lesson/crawler_ex00"
r = requests.get(url_start)
number = re.findall(r'<h3>.*?(\d{5}).*', r.text)  # capture the 5-digit number on the page
while number:
    r = requests.get(url_start + '/' + number[0])
    number = re.findall(r'<h3>.*?(\d{5}).*', r.text)
    print(number)
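As a sanity check on the regex used above, it can be run against a literal HTML snippet. The snippet below is made up for illustration; the real challenge page differs:

```python
import re

# Hypothetical HTML resembling one challenge page (for illustration only).
html = '<h3>the next number is 12345</h3>'

# The lazy .*? skips ahead to the first run of exactly five digits and captures it.
number = re.findall(r'<h3>.*?(\d{5}).*', html)
print(number)  # ['12345']
```

Because `findall` is used with a single capture group, the result is a list of the captured strings rather than of full matches.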
Additional helper modules:
hackhttp, BeautifulSoup
Usage example:
#!/usr/bin/env python
# encoding=utf-8
import hackhttp
from bs4 import BeautifulSoup

url = 'https://.......'
http = hackhttp.hackhttp()
code, head, html, redirect_url, log = http.http(url)
soup = BeautifulSoup(html, 'lxml')
soup.title  # access a tag directly by its name
content = soup.find_all(name='x', attrs={'class': '...'})  # filter by tag name and attributes
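The parsing part does not depend on hackhttp; BeautifulSoup works on any HTML string. A minimal self-contained sketch (the markup below is invented for illustration; a real page would come from an HTTP response):

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only.
html = """
<html><head><title>Demo</title></head>
<body>
  <div class="post">first</div>
  <div class="post">second</div>
</body></html>
"""

# 'html.parser' ships with Python; the example above uses 'lxml',
# which requires the lxml package to be installed.
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)  # Demo
posts = soup.find_all(name='div', attrs={'class': 'post'})
print([p.get_text() for p in posts])  # ['first', 'second']
```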
Using multiple threads
Multithreading template:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import sys
from queue import Queue
import threading
from bs4 import BeautifulSoup as bs
import re

headers = {.....}

class MySpider(threading.Thread):
    def __init__(self, que):
        threading.Thread.__init__(self)
        self._que = que

    def run(self):
        while not self._que.empty():
            url = self._que.get()
            try:
                self.spider(url)
            except Exception as e:
                print(e)

    def spider(self, url):
        r = requests.get(url=url, headers=headers)
        soup = bs(r.content, 'lxml')
        ......  # process the page content here

def main():
    que = Queue()
    for i in range(...):
        que.put('https://......')
    threads = []
    thread_count = 4  # number of threads
    for i in range(thread_count):
        threads.append(MySpider(que))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    main()
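The template requires network access, but the thread-plus-queue pattern itself can be exercised offline. Below is a minimal sketch in the same shape, with a made-up squaring task standing in for spider():

```python
import threading
from queue import Queue, Empty

results = []
lock = threading.Lock()

class Worker(threading.Thread):
    """Consumes items from a shared queue, mirroring the MySpider pattern."""
    def __init__(self, que):
        threading.Thread.__init__(self)
        self._que = que

    def run(self):
        while True:
            try:
                # Non-blocking get: avoids hanging if another thread
                # drains the queue between an empty() check and get().
                item = self._que.get_nowait()
            except Empty:
                break
            with lock:  # protect the shared result list
                results.append(item * item)

que = Queue()
for i in range(8):
    que.put(i)

threads = [Worker(que) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Note one design choice: the template's `while not self._que.empty(): self._que.get()` can block forever if another thread takes the last item between the check and the get; `get_nowait()` with the `Empty` exception sidesteps that race.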
Summary:
To improve efficiency, crawlers usually need to be combined with multithreading, so the templates above are collected here for easy reuse.