Processes and Threads: Understanding the GIL in Depth, Plus a Few Small Experiments

Generalzy

Last modified on 2023-09-27 21:59:47


First published on 2022-09-10 23:10:40

Article link: https://blog.csdn.net/General_zy/article/details/126794199


threading in os

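All resources associated with a process are recorded in its process control block (PCB), which marks the process as owning or currently using them. A process is also the unit of preemptive CPU scheduling, and it has a complete virtual address space of its own. When the scheduler switches between processes, each process has a different virtual address space, whereas the threads inside one process share a single address space.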

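Dimension	Multi-process	Multi-thread	Summary
Data sharing and synchronization	Sharing data is complex; synchronization is simple	Sharing data is simple; synchronization is complex	Each has pros and cons
Memory and CPU	High memory use, costly switching, lower CPU utilization	Low memory use, cheap switching, higher CPU utilization	Threads win
Creation, destruction, switching	Complex and slow	Simple and fast	Threads win
Programming and debugging	Simple to program and debug	Complex to program and debug	Processes win
Reliability	Processes do not affect one another	One crashing thread brings down the whole process	Processes win
Distribution	Suits multi-core and multi-machine setups; scaling out to several machines is easy	Suits multi-core	Processes win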

Processes, threads, and coroutines.

GIL in python


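In a multithreaded environment, the Python virtual machine runs as follows:

Acquire the GIL
Switch to a thread and run it
Run until a specified number of bytecode instructions have executed (since CPython 3.2 the limit is a time interval, 5 ms by default, rather than an instruction count), or until the thread voluntarily yields control (for example by calling sleep(0))
Put the thread to sleep
Release the GIL
Repeat all of the steps above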
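
To make the "run until forced to yield" step more concrete: in CPython 3.2 and later the forced hand-off is controlled by a switch interval that can be read and changed through the `sys` module. The sketch below only inspects and temporarily adjusts that interval; the helper name `with_switch_interval` is made up for this illustration.

```python
import sys


def with_switch_interval(seconds, func):
    """Run func() under a temporary GIL switch interval, then restore it.

    Hypothetical helper, for illustration only.
    """
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        return func()
    finally:
        sys.setswitchinterval(previous)


if __name__ == "__main__":
    # CPython's default is 0.005 s (5 ms).
    print("current switch interval:", sys.getswitchinterval())
    # Ask the interpreter to request a GIL hand-off more often.
    print("inside helper:", with_switch_interval(0.001, sys.getswitchinterval))
```

A shorter interval makes threads take turns more often but does not let them run in parallel.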

The Global Interpreter Lock (GIL) is a feature of the Python interpreter (CPython). It exists to guarantee correct access control over Python objects in a multithreaded environment.

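The GIL is in fact a mutex. It protects the interpreter's internal data structures and prevents multiple threads from executing Python bytecode at the same time. At any given moment only one thread can execute Python bytecode; the others are blocked.

The GIL limits how much parallelism Python threads can achieve, especially for CPU-bound tasks. Only one thread can execute Python bytecode at a time while the others wait for the GIL to be released, so multithreading does not let several CPU cores execute Python bytecode simultaneously.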

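It is worth noting, however, that the GIL has no significant performance impact on IO-bound tasks. Most of their time is spent waiting for external IO to complete rather than executing Python bytecode, and a thread releases the GIL while it waits in a blocking call.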
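
A minimal sketch of that point, using `time.sleep` as a stand-in for blocking IO (an assumption of this example, not something measured in the original post). The function names are illustrative.

```python
import threading
import time


def fake_io(delay):
    # time.sleep releases the GIL while waiting, much like a blocking read.
    time.sleep(delay)


def run_threads(n, delay):
    """Start n threads that each wait `delay` seconds; return elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # The four waits overlap, so the total stays close to one wait.
    print("elapsed: %.2f s" % run_threads(4, 0.2))
```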

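If you need to make full use of a multi-core processor for compute-intensive work, consider using multiple processes (the multiprocessing module) instead of multiple threads. Each process has its own Python interpreter and its own GIL, so processes can execute Python bytecode in parallel without being constrained by the GIL.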
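
One way this might look, as a sketch: `concurrent.futures.ProcessPoolExecutor` (built on multiprocessing) spreads a CPU-bound function over worker processes. The prime-counting task and its name are placeholders chosen for the example.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """CPU-bound toy task: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


if __name__ == "__main__":
    limits = [20_000, 20_000, 20_000, 20_000]
    # Each task runs in a worker process with its own interpreter and GIL.
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(count_primes, limits)))
```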

multiprocessing

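Unix/Linux provides the fork() system call, which is unusual. An ordinary function is called once and returns once, but fork() is called once and returns twice: the operating system copies the current process (the parent) into a new one (the child) and then returns in both. In the child it always returns 0; in the parent it returns the child's process ID. The reason is that one parent can fork many children, so the parent has to record each child's ID, whereas a child only needs to call getppid() to obtain its parent's ID.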
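
The "returns twice" behaviour can be sketched with `os.fork()` (Unix only). The function name and the exit code are arbitrary choices for this example.

```python
import os
import sys


def fork_and_wait(child_exit_code):
    """Fork once; the child exits at once, the parent returns its exit code."""
    sys.stdout.flush()  # avoid duplicating buffered output in the child
    pid = os.fork()
    if pid == 0:
        # Child branch: fork() returned 0 here.
        print("child:  pid=%d, parent=%d" % (os.getpid(), os.getppid()), flush=True)
        os._exit(child_exit_code)  # leave without running the parent's cleanup
    # Parent branch: fork() returned the child's pid.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print("parent: pid=%d" % os.getpid(), flush=True)
    print("child exited with", fork_and_wait(7))
```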

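Differences between spawn and fork

fork: apart from the resources strictly needed at startup, all variables, packages, and data are inherited from the parent. The inheritance is copy-on-write, meaning the child shares some of the parent's memory pages, so startup is fast. Because the child mostly runs on the parent's state (including, for example, locks that were held at the moment of the fork), it is considered the unsafe option.

spawn: builds the child process from scratch. The data the child needs is copied into the child's own space, and the child has its own Python interpreter, so the parent's packages have to be loaded again. Startup is therefore slower, but since all the data belongs to the child, it is the safer option.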

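Start method	Description
spawn	The parent process starts a fresh Python interpreter process. The child process inherits only the resources necessary to run the process object's `run()` method. In particular, unnecessary file descriptors and handles from the parent process are not inherited. Starting a process with this method is rather slow compared to fork or forkserver. Available on Unix and Windows. The default on Windows (and on macOS since Python 3.8).
fork	The parent process uses `os.fork()` to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child. Note that safely forking a multithreaded process is tricky. Available on Unix only. Historically the default on Unix (Python 3.14 changed the default on POSIX platforms other than macOS to forkserver).
forkserver	When the program starts and selects the forkserver start method, a server process is started. From then on, whenever a new process is needed, the parent process connects to the server and asks it to fork a new process. The fork server process is single-threaded, so it is safe for it to use `os.fork()`. No unnecessary resources are inherited. Available on Unix platforms that support passing file descriptors over Unix pipes.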
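
A small sketch of choosing a start method explicitly. `multiprocessing.get_context` returns an object with the same API as the module but bound to one start method, so the global default stays untouched. The `square` function is just a placeholder task.

```python
import multiprocessing as mp


def square(x):
    return x * x


if __name__ == "__main__":
    print("available:", mp.get_all_start_methods())
    # Bind this pool to "spawn" without changing the process-wide default.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))
```

Because spawn re-imports the main module in every child, the `if __name__ == "__main__":` guard is required here.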

Caveats

Child processes do not share the parent's variables

from multiprocessing import Process
import time
 
# define global str_list
str_list = ['ppp', 'yyy']
 
 
def add_str1():
    """Child process 1"""
    print('In process one: ', str_list)
    for x in 'thon':
        str_list.append(x * 3)
        time.sleep(1)
        print('In process one: ', str_list)
 
 
def add_str2():
    """Child process 2"""
    print('In process two: ', str_list)
    for x in 'thon':
        str_list.append(x)
        time.sleep(1)
        print('In process two: ', str_list)
 
 
if __name__ == '__main__':
    p1 = Process(target=add_str1)
    p1.start()
    p2 = Process(target=add_str2)
    p2.start()
    p1.join()
    p2.join()
----------------------------------------------------
In process one:  ['ppp', 'yyy']
In process two:  ['ppp', 'yyy']
In process one:  ['ppp', 'yyy', 'ttt']
In process two:  ['ppp', 'yyy', 't']
In process two: In process one:   ['ppp', 'yyy', 'ttt', 'hhh']['ppp', 'yyy', 't', 'h']

In process one: In process two:   ['ppp', 'yyy', 'ttt', 'hhh', 'ooo']['ppp', 'yyy', 't', 'h', 'o']

In process one:  ['ppp', 'yyy', 'ttt', 'hhh', 'ooo', 'nnn']
In process two:  ['ppp', 'yyy', 't', 'h', 'o', 'n']

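multiprocessing.Lock cannot be pickled when used with Pool

Pool hands tasks to its worker processes through an internal queue, so the task arguments are pickled before they are put on the queue. A multiprocessing.Lock refuses to be pickled outside of process creation (it raises RuntimeError: Lock objects should only be shared between processes through inheritance), so passing one as a task argument fails.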
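
The refusal can be reproduced without a pool at all by pickling the lock directly. This is a minimal sketch; `can_pickle` is a hypothetical helper written for this example.

```python
import multiprocessing
import pickle


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False on RuntimeError."""
    try:
        pickle.dumps(obj)
        return True
    except RuntimeError as exc:
        print("not picklable:", exc)
        return False


if __name__ == "__main__":
    print(can_pickle(("task", 1)))             # plain data pickles fine
    print(can_pickle(multiprocessing.Lock()))  # refused outside process creation
```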

import os
import multiprocessing
import time
 
 
def write_file(lock):
    print(os.getpid(), os.getppid())
    lock.acquire()
    with open('./t.log', 'a') as f:
        f.write("test")
    lock.release()
 
 
if __name__ == '__main__':
    lock = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=5)
    for i in range(5):
        handler = pool.apply_async(write_file, (lock, ))
        # print(handler.get())
    try:
        while True:
            time.sleep(3600)
            continue
    except KeyboardInterrupt:
        pool.close()
        pool.join()

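Platform differences:
On Linux (and on macOS before Python 3.8, when fork was still the default there) the child is forked, so the code already executed by the main process is not run again. On Windows the child is spawned and re-imports the main module, so the code executed by the main process runs again in the child. If the code that creates child processes were re-executed this way, it would create child processes recursively without limit, so Python raises an error instead.
That is why, on Windows, Process() must be placed under if __name__ == '__main__':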

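Writing to the same file from multiple threads or processes does not garble it

In practice, whether multiple threads or multiple processes write to the same file at once, the file's format does not get scrambled; every write appears to be atomic.
On Linux this works without problems mainly because of how the system handles appends: in append mode (O_APPEND) each write is positioned at the end of the file atomically.
Appending to a File from Multiple Processes
For concurrent writes from multiple threads or processes, the best approach is still to take a lock or use a queue and control the writes in user space; that is the foolproof way.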
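
A sketch of the "take a lock in user space" advice for the threaded case. The file location, tags, and function names are assumptions made for this example; for multiple processes a `multiprocessing.Lock` inherited by the children would play the same role.

```python
import os
import tempfile
import threading

write_lock = threading.Lock()


def append_lines(path, tag, count):
    """Append `count` lines to `path`, one lock-protected write at a time."""
    for i in range(count):
        with write_lock:
            with open(path, "a") as f:
                f.write("%s-%d\n" % (tag, i))


def run_writers(path, writers, count):
    """Start `writers` threads that all append to the same file."""
    threads = [
        threading.Thread(target=append_lines, args=(path, "w%d" % n, count))
        for n in range(writers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "t.log")
    run_writers(path, writers=4, count=50)
    with open(path) as f:
        print(len(f.readlines()), "lines written")
```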

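Workarounds for global variables not being shared

To share data between processes, use Manager (from multiprocessing import Process, Manager). The types Manager supports are: list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Queue, Value, and Array.

A manager runs as a separate child process that holds the real objects and acts as a server; other processes access the shared objects through proxies, which act as clients. Manager() returns a started SyncManager instance (SyncManager is a subclass of BaseManager), which can be used to create shared objects and return proxies for accessing them.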
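
As a rough sketch of the proxy model described above: several processes append to one `manager.list()`, and every `append` travels through a proxy to the manager process. The names `add_name` and `user-N` are illustrative.

```python
from multiprocessing import Manager, Process


def add_name(shared, name):
    # With a manager proxy, this call is forwarded to the manager process.
    shared.append(name)


if __name__ == "__main__":
    with Manager() as manager:
        names = manager.list()
        workers = [
            Process(target=add_name, args=(names, "user-%d" % i))
            for i in range(3)
        ]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        # All three appends landed in the single list owned by the manager.
        print(sorted(names))
```

The proxy is passed to the children explicitly as an argument; a module-level `Manager()` would be started again in every process that imports the module.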

from multiprocessing import Process, Array
import time

# Create shared memory holding a list
shm = Array('i', [1,2,3,4,5])
# shm = Array('i', range(5))
# shm = Array('i', 5)  # allocates space for 5 elements


def fun(shm):
    # shm is iterable
    for i in shm:
        print(i)
    # Modify the shared memory
    print(list(shm))
    shm[3] = 1000


if __name__ == '__main__':
    p = Process(target=fun, args=(shm,))
    p.start()
    p.join()

    print("=================")
    for i in shm:
        print(i)

Using Pool's initializer parameter (Pool is not recommended; the approach has drawbacks)


import os
import multiprocessing
import time
 
 
def write_file():
    print(os.getpid(), os.getppid())
    lock.acquire()
    with open('./t.log', 'a') as f:
        f.write("test")
    lock.release()
 
 
def init_lock(l):
    global lock
    lock = l
 
if __name__ == '__main__':
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=5, initializer=init_lock, initargs=(l, ))
    for i in range(5):
        handler = pool.apply_async(write_file)
        print(handler.get())
    try:
        while True:
            time.sleep(3600)
            continue
    except KeyboardInterrupt:
        pool.close()
        pool.join()

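uwsgi + Django and locks

Django's built-in server is a single-process, multithreaded web server.
uwsgi starts multiple workers, which turns Django into a multi-process, multithreaded web server.

How should locking work in that case?

Requirements and background

Linux and Windows create child processes differently. The output below shows that on Windows the child re-runs the top level of the whole file: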

from multiprocessing import Process
import os
def run_proc(name):
    print('Run child process %s (%s)...' % (name, os.getpid()))
if __name__ == '__main__':
    print('Parent process %s.' % os.getpid())
    p = Process(target=run_proc, args=('test',))
    print('Child process will start.')
    p.start()
    p.join()
print('Child process end.')
print(1)
---------------------------------------------Linux
Parent process 3268.
Child process will start.
Run child process test (3269)...
Child process end.
1
---------------------------------------------windows
Parent process 10424.
Child process will start.
Child process end.
1
Run child process test (6540)...
Child process end.
1

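Ordinary global variables are not shared
The processes in a process pool are not created by the same current parent process
Use a process pool to simulate multiple users

Implementation: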

main.py
# Program entry point
from g import mutex
from multiprocessing import Process
from time import sleep


def user_operation(username):
    with mutex:
        print("%s acquired lock %s, starting" % (username, hex(id(mutex))))
        print("file upload starting")
        sleep(3)
        print("file upload ending")


if __name__ == '__main__':
    users = []
    for i in range(4):
        user = Process(target=user_operation, args=("user-%s" % i,))
        user.start()
        users.append(user)
    for user in users:
        user.join()

g.py
# Global variables
from multiprocessing import Manager,Lock
mutex = Manager().Lock()
or
mutex = Lock()

This code raises an error on Windows:

An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

    if __name__ == '__main__':
        freeze_support()
        ...

The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.

Change it to:

from multiprocessing import Process
from time import sleep


def user_operation(username, mu):
    with mu:
        print("%s acquired lock %s, starting" % (username, hex(id(mu))))
        print("file upload starting")
        sleep(3)
        print("file upload ending")


if __name__ == '__main__':
    from g import mutex

    users = []
    for i in range(4):
        user = Process(target=user_operation, args=("user-%s" % i, mutex))
        user.start()
        users.append(user)
    for user in users:
        user.join()

----------------------------------------------------------
user-0 acquired lock 0x156c7a12310, starting
file upload starting
file upload ending
user-1 acquired lock 0x27b5fd72310, starting
file upload starting
file upload ending
user-2 acquired lock 0x27ea9202310, starting
file upload starting
file upload ending
user-3 acquired lock 0x204c3f91310, starting
file upload starting
file upload ending

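This achieves the same mutual exclusion, but the printed lock id differs in every process. That is expected: id() is the address of each child's own local handle (on Windows every child receives its own unpickled copy), so different ids do not by themselves mean different underlying locks.

The Go implementation behaves the same as Python on Linux, which is the correct behavior. Its goroutines all live in one process and share a single mutex through a pointer, so the printed address is identical.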

package main

import (
	"fmt"
	"sync"
	"time"
)

var a []string

func init()  {
	a = make([]string,0,5)
}

func userOperation(username string,wg *sync.WaitGroup,mutex *sync.Mutex){
	defer wg.Done()
	mutex.Lock()
	fmt.Printf("user %s acquired the lock\n", username)
	fmt.Println("file upload starting")
	fmt.Println(mutex)
	a = append(a,username)
	time.Sleep(time.Second*3)
	fmt.Println("file upload ending")
	mutex.Unlock()
}


func main() {
	mutex :=sync.Mutex{}
	wg:=sync.WaitGroup{}
	wg.Add(4)
	for i:=0;i<4;i++{
		go userOperation(fmt.Sprintf("user-%d",i),&wg,&mutex)
	}
	wg.Wait()
	fmt.Println(a)
}
----------------------------------------------------
user user-3 acquired the lock
file upload starting
0xc000006030
file upload ending
user user-0 acquired the lock
file upload starting
0xc000006030
file upload ending
user user-1 acquired the lock
file upload starting
0xc000006030
file upload ending
user user-2 acquired the lock
file upload starting
0xc000006030
file upload ending
[user-3 user-0 user-1 user-2]

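A few small experiments

The singleton pattern under multithreading and multiprocessing

Under multithreading a singleton is guaranteed to be globally unique, but what about multiprocessing?
Child processes do not share the parent's variables.

So each process maintains its own singleton.

Verification

Create the app server

Using Flask as an example, create a singleton as a module-level instance in a package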

# view.py--------------------------------------------------------------
from single import indexer
from flask import Blueprint
from flask import jsonify
from flask import request

view_bp = Blueprint("view_bp", __name__)


@view_bp.route("/get")
def get_name():
    print("lock", id(indexer.mu))
    print("indexer", id(indexer))
    return jsonify({"code": 1, "msg": indexer.get()})


@view_bp.route("/set")
def set_name():
    print("lock", id(indexer.mu))
    print("indexer", id(indexer))
    return jsonify({"code": 1, "msg": indexer.set(request.args.get("name"))})


# single.py ----------------------------------------------------------------------
from multiprocessing import Lock


class __Indexer:
    names = set()

    def __init__(self):
        self.mu = Lock()

    def set(self, name):
        self.names.add(name)

    def get(self):
        return self.names.pop()


indexer = __Indexer()


# main.py -----------------------------------------------------------------------
from flask import Flask
from view import view_bp

app = Flask(__name__)
app.register_blueprint(view_bp)

if __name__ == '__main__':
    # Start with multiple processes
    app.run("0.0.0.0", port=8000, processes=4)

Replace Flask's Werkzeug server with gunicorn (uwsgi would work as well)

# gunicorn.conf

bind = "0.0.0.0:5000"
# 4 workers
workers = 4
backlog = 2048
pidfile = "log/gunicorn.pid"
# accesslog = "log/access.log"
# errorlog = "log/debug.log"
timeout = 600
debug=False
capture_output = True

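Hypothesis

If indexer were globally unique, names would be globally unique too, so a single set followed by a single get would not raise an error.

In practice

After one set and one get the program crashes, and the memory addresses printed by the two requests differ.

Conclusion: each of the four processes has its own indexer, so it is not unique.

Replace the `set` with `Manager().list()`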

from multiprocessing import Lock
from multiprocessing import Manager


class __Indexer:
    # names = set()
    names = Manager().list()

    def __init__(self):
        self.mu = Lock()

    def set(self, name):
        # self.names.add(name)
        self.names.append(name)

    def get(self):
        return self.names.pop()


indexer = __Indexer()

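Verifying again, it still reports a pop from an empty container, presumably because each worker imports single.py and starts its own Manager, so the lists are still separate.

Implement the singleton with a class method (ignoring high concurrency)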

from multiprocessing import Lock
from multiprocessing import Manager


class _Indexer:
    names = set()
    # names = Manager().list()
    instance = None

    @classmethod
    def new(cls):
        if cls.instance is None:
            cls.instance = cls()
            return cls.instance
        else:
            return cls.instance

    def __init__(self):
        self.mu = Lock()

    def set(self, name):
        self.names.add(name)
        # self.names.append(name)

    def get(self):
        return self.names.pop()


# indexer = _Indexer()
indexer = _Indexer.new()

Move `indexer` out

# from single import indexer
from single import _Indexer
from flask import Blueprint
from flask import jsonify
from flask import request

view_bp = Blueprint("view_bp", __name__)

indexer = _Indexer.new()

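It still fails: the name ends up in the names of whichever process happened to handle the set request, so when another process handles the get, it naturally fails.

Re-enable `Manager()`

So far, one thing is certain: every process has its own indexer.

Add an ordinary lock

Put it in `single.py`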

from multiprocessing import Lock
from multiprocessing import Manager

mu = Lock()


class _Indexer:
    names = set()
    # names = Manager().list()
    instance = None

    @classmethod
    def new(cls):
        print("global",mu)
        with mu:
            if cls.instance is None:
                cls.instance = cls()
                return cls.instance
            else:
                return cls.instance

    def __init__(self):
        self.mu = Lock()

    def set(self, name):
        self.names.add(name)
        # self.names.append(name)

    def get(self):
        return self.names.pop()

# indexer = _Indexer()
# indexer = _Indexer.new()

Right at startup, the output shows that four locks were created.

Put it in `main.py`

from flask import Flask
from view import view_bp
from multiprocessing import Lock

app = Flask(__name__)
app.register_blueprint(view_bp)

if __name__ == '__main__':
    mu = Lock()
    app.run("0.0.0.0", port=8000, processes=4)

Circular import; it crashes.

Add `Manager().Lock()`

from multiprocessing import Lock
from multiprocessing import Manager

# mu = Lock()
mu = Manager().Lock()


class _Indexer:
    names = set()
    # names = Manager().list()
    instance = None

    @classmethod
    def new(cls):
        print(mu)
        with mu:
            if cls.instance is None:
                cls.instance = cls()
                return cls.instance
            else:
                return cls.instance

    def __init__(self):
        self.mu = Lock()

    def set(self, name):
        self.names.add(name)
        # self.names.append(name)

    def get(self):
        return self.names.pop()

# indexer = _Indexer()
# indexer = _Indexer.new()

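Four locks are created here as well.
Still not globally unique.

By the end, I had confused myself too.

In short: the processes of an HTTP server do not share data, and a singleton is not a singleton across them. To get a real singleton, base it on whether a file exists, so that every process can "see" this file lock: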

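import os

LOCK_PATH = "./mutex.lock"


def lock():
    """Try to take the lock; return True on success, False if already held."""
    try:
        # O_CREAT | O_EXCL makes "check, then create" one atomic step,
        # so two processes can never both succeed.
        fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def unlock():
    os.remove(LOCK_PATH)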
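
An existence-based lock file stays behind if the process holding it dies before calling unlock(). One possible alternative on Unix, sketched here under that assumption, is an advisory lock taken with `fcntl.flock`, which the kernel drops when the descriptor is closed or the process exits. The helper names and the lock path are invented for this example.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def file_lock(path):
    """Hold an exclusive advisory lock on `path` for the with-block (Unix)."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until nobody else holds it
        yield
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


def try_lock(path):
    """Non-blocking probe: True if the lock could be taken right now."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False
    finally:
        os.close(fd)


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.gettempdir(), "upload.lock")
    with file_lock(lock_path):
        print("held elsewhere?", not try_lock(lock_path))
    print("free again?", try_lock(lock_path))
```

Every worker process that opens the same path competes for the same lock, which is the property the experiments above were looking for.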

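How to saturate the CPU

As is well known, a process is the smallest unit of resource allocation, and a thread is the smallest unit of CPU scheduling.

Saturating the CPU then only requires starting as many CPU-bound tasks as there are cores. Compare the following Python and Go code: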

from concurrent.futures import ThreadPoolExecutor
from hashlib import sha512
from os import cpu_count


def task():
    # Define the hash task
    count = 1
    while True:
        sha512_factory = sha512()
        for _ in range(count):
            sha512_factory.update("hello world".encode("utf-8"))
        sha512_factory.hexdigest()
        count += 1


def main():
    print(cpu_count())
    with ThreadPoolExecutor(cpu_count()) as pool:
        for i in range(cpu_count()):
            pool.submit(task)


if __name__ == '__main__':
    main()

package main

import (
	"crypto/sha256"
	"fmt"
	"runtime"
)

func task() {
	count := 1
	for {
		sha256C := sha256.New()
		for i := 0; i < count; i++ {
			sha256C.Write([]byte("hello world"))
		}
		sha256C.Sum([]byte("hello world"))
		count += 1
	}
}

func main() {
	fmt.Println(runtime.NumCPU())

	for i := 0; i < runtime.NumCPU(); i++ {
		go task()
	}

	select {}
}

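The Python code only reaches about 1/(number of cores) CPU utilization, whereas the Go code reaches 100%. The culprit is Python's GIL: as the name implies, the global interpreter lock locks the Python interpreter shared by the whole process, so of all the threads started, only one can be executed by the interpreter at any moment. The effect is the same as running the computation on a single thread, which is why CPU utilization only reaches 1/(number of cores).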
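
For comparison, a sketch of the same hash task spread over processes instead of threads, which is one way Python can keep every core busy. The loop is bounded (the `rounds` parameter is an addition for this example) so that the demo terminates.

```python
from hashlib import sha512
from multiprocessing import Process
from os import cpu_count


def hash_rounds(rounds):
    """Bounded version of the hash task; returns the last digest."""
    digest = b""
    for count in range(1, rounds + 1):
        h = sha512()
        for _ in range(count):
            h.update(b"hello world")
        digest = h.digest()
    return digest


if __name__ == "__main__":
    # One process per core; each has its own interpreter and its own GIL.
    workers = [
        Process(target=hash_rounds, args=(2000,))
        for _ in range(cpu_count() or 1)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("all workers finished")
```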
