Web Crawler Study Notes (1)


Rules for Learning Web Crawling

  1. Only crawl information that is public; never crawl non-public information.
  2. When crawling, behave like an ordinary human visitor; never send high-frequency requests.
  3. Never profit from crawled data; use it only briefly for learning. (The one exception: a revenue model that benefits the public and harms no party's interests, including the data provider's.) Make money from ordinary data analysis instead.

Note: these notes were written in Jupyter.

Chapter 1: A Review of Python Programming

Multiprocessing with os.fork (available only on Unix-like systems).

import os

if __name__ == '__main__':
    print('Current process (%s) is starting ...' % os.getpid())
    pid = os.fork()
    if pid < 0:
        print('error in fork')
    elif pid == 0:
        print('I am child process ({0}) and my parent process is ({1})'.format(os.getpid(), os.getppid()))
    else:
        print('I ({0}) created a child process ({1}).'.format(os.getpid(), pid))


Current process (84833) is starting ...
I (84833) created a child process (85025).
I am child process (85025) and my parent process is (84833)

Creating multiple processes with the multiprocessing module

%%writefile temp/test1.py
import os
from multiprocessing import Process

# Code run by each child process
def run_proc(name):
    print('child process {0} ({1}) is running'.format(name, os.getpid()))

if __name__ == '__main__':
    print('Parent process is {0}'.format(os.getpid()))
    p_list = []
    for i in range(5):
        p = Process(target=run_proc, args=(str(i),))
        print("Process will start.")
        p.start()
        p_list.append(p)
    for p in p_list:
        p.join()
    print("Process end.")
Writing temp/test1.py
! python3 ./temp/test1.py
Parent process is 97948
Process will start.
Process will start.
Process will start.
Process will start.
Process will start.
child process 1 (97952) is running
child process 0 (97951) is running
child process 2 (97953) is running
child process 3 (97954) is running
child process 4 (97955) is running
Process end.

The result differed from the book: when the cell was run directly in Jupyter, the run_proc function never executed.
The workaround is to first write the code to a file with

%%writefile temp1.py

and then run it with

! python3 temp1.py

The likely reason: the notebook kernel is itself a running program, and on macOS and Windows multiprocessing starts child processes by spawning a fresh interpreter that re-imports the main module, so functions defined only in notebook cells cannot be found by the children. Saving the code to a local file and invoking it from the command line avoids this.
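Outside Jupyter, the same write-then-run pattern can be scripted directly with the standard library. A minimal sketch (the file name temp_demo.py is arbitrary):

```python
import pathlib
import subprocess
import sys

# Write the script to disk, as %%writefile does in the notebook
script = pathlib.Path('temp_demo.py')  # arbitrary file name
script.write_text("print('hello from a separate interpreter')\n")

# Run it in a fresh interpreter, as "! python3 ..." does
result = subprocess.run([sys.executable, str(script)],
                        capture_output=True, text=True)
print(result.stdout.strip())  # -> hello from a separate interpreter
script.unlink()  # clean up the temporary file
```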

%%writefile temp/test2.py
from multiprocessing import Pool
import os, time, random

def run_task(name):
    print("task {0} (Process {1}) is running...".format(name, os.getpid()))
    time.sleep(0.001)
    print("task {0} is done.".format(name))

if __name__ == '__main__':
    print('Current process {0}'.format(os.getpid()))
    p = Pool(processes=3)
    for i in range(5):
        p.apply_async(run_task, args=(i,))
    print('Waiting for all subprocesses done')
    p.close()
    p.join()
    print('All subprocesses done')
Writing temp/test2.py
! python3 temp/test2.py
Current process 97984
Waiting for all subprocesses done
task 0 (Process 97986) is running...
task 0 is done.
task 1 (Process 97986) is running...
task 1 is done.
task 2 (Process 97986) is running...
task 3 (Process 97987) is running...
task 2 is done.
task 4 (Process 97986) is running...
task 3 is done.
task 4 is done.
All subprocesses done

Inter-process communication

%%writefile temp/test3.py
from multiprocessing import Process, Queue
import os, time, random

# Code run by the writer processes
def proc_writer(q, urls):
    print('Process{0} is writing'.format(os.getpid()))
    for url in urls:
        q.put(url)
        print('Put {0} to queue ...'.format(url))
        time.sleep(random.random())

# Code run by the reader process
def proc_read(q):
    print('Process{0} is reading'.format(os.getpid()))
    while True:
        url = q.get(True)
        print('get {0} from queue.'.format(url))

if __name__ == '__main__':
    # The parent process creates the Queue and passes it to the children.
    q = Queue()
    proc_writer1 = Process(target=proc_writer, args=(q, ['url1', 'url2', 'url3']))
    proc_writer2 = Process(target=proc_writer, args=(q, ['url4', 'url5', 'url6']))
    proc_reader = Process(target=proc_read, args=(q,))
    # Start the writer subprocesses
    proc_writer1.start()
    proc_writer2.start()
    # Start the reader subprocess
    proc_reader.start()
    # Wait for the writers to finish
    proc_writer1.join()
    proc_writer2.join()
    # The reader loops forever, so terminate it once the writers are done
    proc_reader.terminate()
    
Writing temp/test3.py
! python3 temp/test3.py
Process98055 is writing
Put url1 to queue ...
Process98056 is writing
Put url4 to queue ...
Process98057 is reading
get url1 from queue.
get url4 from queue.
Put url2 to queue ...
get url2 from queue.
Put url5 to queue ...
get url5 from queue.
Put url3 to queue ...
get url3 from queue.
Put url6 to queue ...
get url6 from queue.

The difference between Pipe and Queue:
a Pipe connects exactly two processes, while a Queue can be shared among many processes.
A pipe is like a tube: data goes in one end and comes out the other.

%%writefile temp/test1.4_1.py
import multiprocessing
import os, time, random

def proc_send(pipe, urls):
    for url in urls:
        print("Process {0} send {1}.".format(os.getpid(), url))
        pipe.send(url)
        time.sleep(random.random())

def proc_recv(pipe, num):
    while num:
        print('Process {0} receives {1}'.format(os.getpid(), pipe.recv()))
        time.sleep(random.random())
        num = num - 1

if __name__ == '__main__':
    # Pipe() returns a pair of duplex Connection endpoints
    pipe = multiprocessing.Pipe()
    p1 = multiprocessing.Process(target=proc_send, args=(pipe[0], ['url_' + str(i) for i in range(10)]))
    p2 = multiprocessing.Process(target=proc_recv, args=(pipe[1], 10))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
Overwriting temp/test1.4_1.py
! python3 temp/test1.4_1.py
Process 669 send url_0.
Process 670 receives url_0
Process 669 send url_1.
Process 670 receives url_1
Process 669 send url_2.
Process 670 receives url_2
Process 669 send url_3.
Process 669 send url_4.
Process 670 receives url_3
Process 670 receives url_4
Process 669 send url_5.
Process 670 receives url_5
Process 669 send url_6.
Process 670 receives url_6
Process 669 send url_7.
Process 670 receives url_7
Process 669 send url_8.
Process 670 receives url_8
Process 669 send url_9.
Process 670 receives url_9

Creating multiple threads

%%writefile temp/test1.4_2.py
import random,time,threading
def thread_run(urls):
    print('Current {0} is running...'.format(threading.current_thread().name))
    for url in urls:
        print("{0} ---->>> {1}".format(threading.current_thread().name,url))
        time.sleep(random.random())
if __name__ == '__main__':
    print('{0} is running...'.format(threading.current_thread().name))
    t1 = threading.Thread(target=thread_run,name='Thread_1',args=(['url_'+str(i) for i in range(3)],))
    t2 = threading.Thread(target=thread_run,name='Thread_2',args=(['url_'+str(i) for i in range(4,7)],))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print('{0} is done'.format(threading.current_thread().name))

Overwriting temp/test1.4_2.py
! python3 temp/test1.4_2.py
MainThread is running...
Current Thread_1 is running...
Thread_1 ---->>> url_0
Current Thread_2 is running...
Thread_2 ---->>> url_4
Thread_1 ---->>> url_1
Thread_2 ---->>> url_5
Thread_1 ---->>> url_2
Thread_2 ---->>> url_6
MainThread is done

The example above uses the Thread class directly; alternatively, you can subclass it.

%%writefile temp/test1.4_3.py
import threading, random, time

class MyThread(threading.Thread):
    def __init__(self, name, urls):
        threading.Thread.__init__(self, name=name)
        self.urls = urls
    def run(self):
        print('{0} is running...'.format(threading.current_thread().name))
        for url in self.urls:
            print('{0} --->>> {1}'.format(threading.current_thread().name, url))
            time.sleep(random.random())

if __name__ == '__main__':
    t1 = MyThread('Thread_1', ['url_' + str(i) for i in range(3)])
    t2 = MyThread('Thread_2', ['url_' + str(i) for i in range(4, 7)])
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    
Overwriting temp/test1.4_3.py
! python3 temp/test1.4_3.py
Thread_1 is running...
Thread_1 --->>> url_0
Thread_2 is running...
Thread_2 --->>> url_4
Thread_2 --->>> url_5
Thread_1 --->>> url_1
Thread_1 --->>> url_2
Thread_2 --->>> url_6

Thread synchronization

%%writefile temp/test1.4_4.py
import threading
mylock = threading.RLock()
num = 0
class myThread(threading.Thread):
    def __init__(self,name):
        threading.Thread.__init__(self,name=name)
    def run(self):
        global num
        while True:
            mylock.acquire()
            print('{0} locked,Number:{1}'.format(threading.current_thread().name,num))
            if num >=4:
                mylock.release()
                print('{0} released ,Number:{1}'.format(threading.current_thread().name,num))
                break
            num +=1
            print('{0} released ,Number:{1}'.format(threading.current_thread().name,num))
            mylock.release()
if __name__ == '__main__':
    t1 = myThread('Thread_1')
    t2 = myThread('Thread_2')
    t1.start()
    t2.start()
    
            
        
Overwriting temp/test1.4_4.py
! python3 temp/test1.4_4.py
Thread_1 locked,Number:0
Thread_1 released ,Number:1
Thread_1 locked,Number:1
Thread_1 released ,Number:2
Thread_1 locked,Number:2
Thread_1 released ,Number:3
Thread_1 locked,Number:3
Thread_1 released ,Number:4
Thread_1 locked,Number:4
Thread_1 released ,Number:4
Thread_2 locked,Number:4
Thread_2 released ,Number:4

The Global Interpreter Lock (GIL)

In CPython, multithreaded code cannot run on more than one CPU core at a time, because the GIL allows only one thread to execute Python bytecode at once. Therefore:

  • For CPU-bound work across multiple cores, multithreading is not recommended; use multiprocessing instead.
  • Multithreading is well suited to IO-intensive work.
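The second point can be seen directly: CPython releases the GIL while a thread waits on a blocking call, so IO-style waits overlap across threads. A small timing sketch, with time.sleep standing in for blocking IO:

```python
import threading
import time

def io_task():
    time.sleep(0.2)  # blocking wait; the GIL is released while sleeping

start = time.time()
threads = [threading.Thread(target=io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# Four 0.2 s waits overlap instead of adding up to 0.8 s
print('elapsed: {0:.2f}s, overlapped: {1}'.format(elapsed, elapsed < 0.6))
```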

Coroutines

  • To the CPU, coroutines run on a single thread.
  • Coroutines with the gevent library (built on greenlet): when one greenlet blocks on IO, gevent switches to another, unblocked piece of code, and switches back once the original block clears. It is a deliberately scheduled form of serial execution.
  • In network programming, blocking on IO is the situation encountered most often.
import gevent
import urllib.request

def run_task(url):
    print('Visit -->> {0}'.format(url))
    try:
        response = urllib.request.urlopen(url)
        data = response.read()
        print("{0} bytes received from {1}".format(len(data), url))
    except Exception as e:
        print(e)

if __name__ == "__main__":
    urls = ['https://github.com', 'https://www.python.org/', 'https://www.cnblogs.com']
    greenlets = [gevent.spawn(run_task, url) for url in urls]
    gevent.joinall(greenlets)
Visit -->> https://github.com
276650 bytes received from https://github.com
Visit -->> https://www.python.org/
49821 bytes received from https://www.python.org/
Visit -->> https://www.cnblogs.com
74555 bytes received from https://www.cnblogs.com

  • The book's urllib2 has been replaced with urllib.request here, but execution still appears to be serial.
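The serial behavior is expected: urlopen is a blocking call, and gevent only switches greenlets at cooperative yield points, which plain urllib never reaches. Calling gevent.monkey.patch_all() before importing urllib replaces the blocking stdlib IO with cooperative versions. The switching itself can be seen with explicit yields (a sketch, assuming gevent is installed):

```python
import gevent

order = []

def task(name):
    for step in range(2):
        order.append('{0}-{1}'.format(name, step))
        gevent.sleep(0)  # yield control to any other runnable greenlet

# Two greenlets alternate at each yield point
gevent.joinall([gevent.spawn(task, 'a'), gevent.spawn(task, 'b')])
print(order)  # -> ['a-0', 'b-0', 'a-1', 'b-1']
```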

Distributed processes

  • Distribute processes across multiple machines, communicating and coordinating over the network.
%%writefile temp/test1.4_5.py
# -*- coding: utf-8 -*-
import random, time
import queue
from multiprocessing.managers import BaseManager
from multiprocessing import freeze_support

# Step 1: create task_queue and result_queue to hold tasks and results.
task_queue = queue.Queue()
result_queue = queue.Queue()

class Queuemanager(BaseManager):
    pass

# Step 2: register the two queues on the network with the register method;
# the callable parameter binds each Queue object and exposes it over the network.
# Queuemanager.register('get_task_queue',callable=lambda:task_queue)
# Queuemanager.register('get_result_queue',callable=lambda:result_queue)
def return_task_queue():
    global task_queue
    return task_queue
def return_result_queue():
    global result_queue
    return result_queue

def macos_run():
    Queuemanager.register('get_task_queue', callable=return_task_queue)
    Queuemanager.register('get_result_queue', callable=return_result_queue)

    # Step 3: bind port 8001 and set the auth key 'qiye'. This initializes the manager.
    manager = Queuemanager(address=("127.0.0.1", 8001), authkey=bytes('qiye', encoding='utf8'))

    # Step 4: start the manager and listen on its channel.
    manager.start()
    try:
        # Step 5: obtain the network-accessible Queue proxies.
        task = manager.get_task_queue()
        result = manager.get_result_queue()

        # Step 6: add tasks
        for url in ['ImageUrl_' + str(i) for i in range(10)]:
            print('put task {0}'.format(url))
            task.put(url)
        # Fetch the results
        print('try get result ...')
        for i in range(10):
            print('result is {0}'.format(result.get(timeout=10)))
    except:
        print('manager error')
    finally:
        # Shut down the manager
        manager.shutdown()

if __name__ == '__main__':
    freeze_support()
    macos_run()
Overwriting temp/test1.4_5.py
! python3 temp/test1.4_5.py 
put task ImageUrl_0
put task ImageUrl_1
put task ImageUrl_2
put task ImageUrl_3
put task ImageUrl_4
put task ImageUrl_5
put task ImageUrl_6
put task ImageUrl_7
put task ImageUrl_8
put task ImageUrl_9
try get result ...
result is ImageUrl_0 --->>> success
result is ImageUrl_1 --->>> success
result is ImageUrl_2 --->>> success
result is ImageUrl_3 --->>> success
result is ImageUrl_4 --->>> success
result is ImageUrl_5 --->>> success
result is ImageUrl_6 --->>> success
result is ImageUrl_7 --->>> success
result is ImageUrl_8 --->>> success
result is ImageUrl_9 --->>> success
%%writefile temp/test1.4_6.py
# coding=utf-8
import time
from multiprocessing.managers import BaseManager

# Subclass BaseManager
class QueueManager(BaseManager):
    pass
QueueManager.register('get_task_queue')
QueueManager.register('get_result_queue')

server_addr = '127.0.0.1'
print('Connect to server {0}'.format(server_addr))
m = QueueManager(address=(server_addr, 8001), authkey=bytes('qiye', encoding='utf8'))
# Connect over the network
m.connect()
# Get the queue proxies
task = m.get_task_queue()
result = m.get_result_queue()
# Pull tasks from the task queue and write results back
while not task.empty():
    image_url = task.get(True, timeout=5)
    print('run task download {0}'.format(image_url))
    time.sleep(1)
    result.put('{0} --->>> success'.format(image_url))
print('work done')

Overwriting temp/test1.4_6.py

This worker must run at the same time as the server above.

The book's program initially would not run, failing with

string argument without an encoding

Fix: change the original line

manager = Queuemanager(address=("",8001),authkey='qiye')

to

manager = Queuemanager(address=("",8001),authkey=bytes('qiye',encoding='utf8'))

Likely cause: a Python 2 / Python 3 incompatibility. Python 2 strings are byte strings, while in Python 3 the authkey must be passed explicitly as bytes.


It then still failed, with

pickle.PicklingError: Can't pickle <function <lambda> at 0x7fc29001e160>: attribute lookup <lambda> on __main__ failed

Fix: the original registrations

Queuemanager.register('get_task_queue',callable=lambda:task_queue)
Queuemanager.register('get_result_queue',callable=lambda:result_queue)

work only on Linux, where child processes are forked and inherit the lambdas. On macOS and Windows, multiprocessing spawns a fresh interpreter and must pickle the registered callables, and a lambda cannot be pickled because it has no importable name; replacing the lambdas with the module-level named functions shown above fixes this. freeze_support() is also called: it is required for frozen Windows executables and is harmless elsewhere.
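The pickling failure can be reproduced in isolation: pickle serializes a function by its importable name, and a lambda has none. A minimal sketch:

```python
import pickle

# A named, importable function pickles fine (by reference to its name)
ok = len(pickle.dumps(len)) > 0

# A lambda has no importable name, so pickle refuses it
try:
    pickle.dumps(lambda x: x)
    lambda_ok = True
except (pickle.PicklingError, AttributeError):
    lambda_ok = False

print('named picklable: {0}, lambda picklable: {1}'.format(ok, lambda_ok))
```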

Network programming

TCP programming

  • Server
%%writefile temp/1.5_0.py
# coding:utf-8
import socket
import threading
import time

def dealClient(sock, addr):
    # Receive data from the client and send replies back
    print('Accept new connect from {0}'.format(addr))
    sock.send(b'Hello, I am the server!')
    while True:
        data = sock.recv(1024)
        time.sleep(1)
        if not data or data.decode('utf-8') == 'exit':
            break
        print('-->>{0}'.format(data.decode('utf-8')))
        sock.send('Loop_Msg: {0}!'.format(data.decode('utf-8')).encode('utf-8'))
    # Close the socket
    sock.close()
    print('Connection from {0}:{1} closed'.format(addr[0], addr[1]))

if __name__ == '__main__':
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(('127.0.0.1', 9999))  # 127.0.0.1 is the local loopback address
    s.listen(5)
    print('Waiting for connection ...')
    while True:
        sock, addr = s.accept()
        t = threading.Thread(target=dealClient, args=(sock, addr))
        t.start()
Overwriting temp/1.5_0.py
! python3 temp/1.5_0.py
Waiting for connection ...
Accept new connect from ('127.0.0.1', 54652)
-->>Hello,I am a client
Connection from 127.0.0.1:54652 closed
^C
Traceback (most recent call last):
  File "temp/1.5_0.py", line 25, in <module>
    sock,addr = s.accept()
  File "/Users/xiaohanli/opt/anaconda3/lib/python3.8/socket.py", line 292, in accept
    fd, addr = self._accept()
KeyboardInterrupt
  • Client
%%writefile temp/1.5_1.py
# coding=utf-8
import socket
s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
s.connect(('127.0.0.1',9999))
print('-->>'+s.recv(1024).decode('utf-8'))
s.send(b'Hello,I am a client')
print('-->>'+s.recv(1024).decode('utf-8'))
s.send(b'exit')
s.close()

Writing temp/1.5_1.py

As before, the client and the server must run in separate terminals.
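For quick experiments without two terminals, a server thread and a client can share one script. A self-contained sketch (binding to port 0 asks the OS for any free port, avoiding clashes with a server already on 9999):

```python
import socket
import threading

def serve_once(srv):
    # Accept one connection, echo what it sends, then exit
    conn, _ = srv.accept()
    data = conn.recv(1024)
    conn.sendall(b'echo:' + data)
    conn.close()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(('127.0.0.1', 0))  # port 0: the OS picks a free port
srv.listen(1)
port = srv.getsockname()[1]

t = threading.Thread(target=serve_once, args=(srv,))
t.start()

cli = socket.create_connection(('127.0.0.1', port))
cli.sendall(b'ping')
reply = cli.recv(1024).decode('utf-8')
print(reply)  # -> echo:ping
cli.close()
t.join()
srv.close()
```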

UDP programming

  • Server
%%writefile temp/1.5_2.py
# coding:utf-8
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(('127.0.0.1', 9999))
print('Bind UDP on 9999 ...')
while True:
    data, addr = s.recvfrom(1024)
    print('Received from %s:%s' % addr)
    s.sendto(b'Hello,%s' % data, addr)
Overwriting temp/1.5_2.py
! python3 temp/1.5_2.py
Bind UDP on 9999 ...
^C
Traceback (most recent call last):
  File "temp/1.5_2.py", line 7, in <module>
    data,addr = s.recvfrom(1024)
KeyboardInterrupt
  • Client
%%writefile temp/1.5_3.py
# coding:utf-8
import socket
s= socket.socket(socket.AF_INET,socket.SOCK_DGRAM)
for data in [b'Hello',b'World']:
    s.sendto(data,('127.0.0.1',9999))
    print(s.recv(1024).decode('utf-8'))
s.close()
Overwriting temp/1.5_3.py

The programs above still misbehave, and I have not found the cause.
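One likely culprit: UDP is connectionless, so if the client starts before the server, sendto silently succeeds and the subsequent recv blocks forever. The pair can be checked in a single script by putting the server in a background thread (a sketch; port 0 avoids clashing with anything already bound to 9999):

```python
import socket
import threading

srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(('127.0.0.1', 0))  # let the OS pick a free port
port = srv.getsockname()[1]

def serve(n):
    # Answer exactly n datagrams, then exit
    for _ in range(n):
        data, addr = srv.recvfrom(1024)
        srv.sendto(b'Hello,' + data, addr)

t = threading.Thread(target=serve, args=(2,))
t.start()

cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
replies = []
for data in [b'World', b'UDP']:
    cli.sendto(data, ('127.0.0.1', port))
    replies.append(cli.recv(1024).decode('utf-8'))
print(replies)  # -> ['Hello,World', 'Hello,UDP']
t.join()
srv.close()
cli.close()
```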
