I have a multi-threaded application in Python whose threads read very large dicts (loaded from disk and never modified); the dicts are too large to copy into thread-local storage. The threads then process huge amounts of data, using the dicts purely as read-only lookup tables:
# single threaded
from sys import stdin, stdout

d1, d2, d3 = read_dictionaries()
for line in stdin:
    stdout.write(compute(line, d1, d2, d3) + line)
I am trying to speed this up by using threads, which would then each read its own input and write its own output, but since the dicts are huge, I want the threads to share the storage.
IIUC, every time a thread reads from the dict, it has to lock it, and that imposes a performance cost on the application. This data locking is not necessary because the dicts are read-only.
Does CPython actually lock the data individually or does it just use the GIL?
If, indeed, there is per-dict locking, is there a way to avoid it?
Solution
Multithreaded processing in Python will not help you here; it is better to use the multiprocessing module, because threading only gives a real speedup in a limited number of cases (mostly I/O-bound ones).
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
(From the official documentation.)
Without any concrete code from your side, I can only recommend splitting your big dictionary into several parts, processing every part with Pool.map, and merging the results in the main process.
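A minimal sketch of that idea, with a hypothetical split_dict helper and a placeholder process_chunk standing in for your real computation:

# Minimal sketch: split one large dict into parts, process each part in a
# worker via Pool.map, and merge the results in the main process.
from multiprocessing import Pool

def split_dict(d, n_parts):
    # split d into roughly n_parts smaller dicts
    items = list(d.items())
    step = max(1, len(items) // n_parts)
    return [dict(items[i:i + step]) for i in range(0, len(items), step)]

def process_chunk(chunk):
    # placeholder computation over one part of the dictionary
    return {key: len(str(value)) for key, value in chunk.items()}

if __name__ == "__main__":
    big_dict = {i: "value-%d" % i for i in range(100000)}  # stand-in for your dict
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, split_dict(big_dict, 4))
    merged = {}  # merge the partial results in the main process
    for part in partial_results:
        merged.update(part)
    print(len(merged))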
Unfortunately, it is hard to share a large amount of memory between Python processes efficiently (we are not talking about shared-memory patterns based on mmap here). But you can read different parts of your dictionary in different processes, or read the entire dictionary in the main process and hand small chunks to the child processes; a sketch of the latter follows.
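A sketch of the "read it all in the parent, hand out small chunks" idea, under the assumption that each input line only needs a small, predictable slice of the dictionary (here the stripped line itself is the key; in your code the needed keys would come from compute):

# Sketch: load the whole dict in the parent, then send each worker only the
# lines it must process plus the small slice of the dict those lines need.
from multiprocessing import Pool

def worker(task):
    lines, small_dict = task
    # placeholder: look every line up in the slice it was shipped with
    return [small_dict.get(line.strip(), "?") for line in lines]

def make_tasks(lines, big_dict, chunk_size=1000):
    for i in range(0, len(lines), chunk_size):
        chunk = lines[i:i + chunk_size]
        needed = {line.strip() for line in chunk}      # keys this chunk will use
        yield chunk, {k: big_dict[k] for k in needed if k in big_dict}

if __name__ == "__main__":
    big_dict = {"a": 1, "b": 2, "c": 3}                # stand-in for the real dict
    lines = ["a\n", "b\n", "x\n", "c\n"] * 500
    with Pool(processes=4) as pool:
        results = pool.map(worker, make_tasks(lines, big_dict))
    print(sum(len(r) for r in results))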
Also, I should warn you to be very careful with multiprocessing algorithms, because every extra megabyte is multiplied by the number of processes.
So, based on your pseudocode I can assume two possible algorithms, depending on what the compute function looks like:
# "Stateless"
for line in stdin:
res = compute_1(line) + compute_2(line) + compute_3(line)
print res, line
# "Shared" state
for line in stdin:
res = compute_1(line)
res = compute_2(line, res)
res = compute_3(line, res)
print res, line
In the first case you can create several workers, each based on the Process class and each reading just one of the dictionaries (a good way to keep the memory usage of every process down), and compute the result like a production line; see the sketch below.
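A sketch of such a production line, where load_d1/load_d2/load_d3 and the lookup standing in for compute_i are placeholders for your real loading and computation code:

# Production line for the "stateless" case: each stage is a Process that
# loads its own dictionary, adds the term it can compute, and forwards the
# partial result over a Queue to the next stage.
from multiprocessing import Process, Queue

SENTINEL = None  # marks the end of the input stream

def load_d1(): return {"a": 1}      # stand-in: read the 1st dict from disk
def load_d2(): return {"a": 10}     # stand-in: read the 2nd dict from disk
def load_d3(): return {"a": 100}    # stand-in: read the 3rd dict from disk

def stage(in_q, out_q, loader):
    d = loader()                    # each worker owns exactly one dictionary
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            break
        line, partial = item
        out_q.put((line, partial + d.get(line.strip(), 0)))  # += compute_i(line)

if __name__ == "__main__":
    q0, q1, q2, q3 = Queue(), Queue(), Queue(), Queue()
    stages = [Process(target=stage, args=(q0, q1, load_d1)),
              Process(target=stage, args=(q1, q2, load_d2)),
              Process(target=stage, args=(q2, q3, load_d3))]
    for p in stages:
        p.start()
    for line in ["a\n", "b\n", "a\n"]:   # stand-in for reading stdin
        q0.put((line, 0))
    q0.put(SENTINEL)
    for line, res in iter(q3.get, SENTINEL):
        print(res, line, end="")
    for p in stages:
        p.join()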
In the second case you have shared state: each worker needs the result of the previous one. That is the worst case for multithreading/multiprocessing programming. But you can still write the algorithm so that the workers are connected by Queues and push each result on as soon as it is ready, without waiting for the whole cycle to finish; you just share the Queue instances between your processes, as in the sketch below.
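A sketch of that chained variant, with placeholder compute_1/compute_2 functions; each stage pushes its result into the next Queue immediately:

# "Shared" state over a pipeline: the same Queue skeleton, but every stage
# feeds the previous stage's result into its own computation.
from multiprocessing import Process, Queue

SENTINEL = None

def compute_1(line):        # placeholder for the real first step
    return len(line)

def compute_2(line, res):   # placeholder: needs the previous result
    return res * 2

def chained_stage(in_q, out_q, func, needs_previous):
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            break
        line, res = item
        res = func(line, res) if needs_previous else func(line)
        out_q.put((line, res))   # push immediately, no waiting for the full cycle

if __name__ == "__main__":
    q0, q1, q2 = Queue(), Queue(), Queue()
    stages = [Process(target=chained_stage, args=(q0, q1, compute_1, False)),
              Process(target=chained_stage, args=(q1, q2, compute_2, True))]
    for p in stages:
        p.start()
    for line in ["one\n", "three\n"]:    # stand-in for stdin
        q0.put((line, None))
    q0.put(SENTINEL)
    for line, res in iter(q2.get, SENTINEL):
        print(res, line, end="")
    for p in stages:
        p.join()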