I have a dataset of 370k records stored in a pandas DataFrame, and I need to compute the sentiment polarity of every row. I tried multiprocessing, threading, Cython, and loop unrolling, but none of them helped and the estimated computation time was 22 hours. The task is as follows:
%matplotlib inline
from numba import jit, autojit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
with open('data/full_text.txt', encoding="ISO-8859-1") as f:
    strdata = f.readlines()

data = []
for string in strdata:
    data.append(string.split('\t'))

df = pd.DataFrame(data, columns=["uname", "date", "UT", "lat", "long", "msg"])
df = df.drop('UT', axis=1)
df[['lat', 'long']] = df[['lat', 'long']].apply(pd.to_numeric)
from textblob import TextBlob
from tqdm import tqdm
df['polarity']=np.zeros(len(df))
Threading:
from queue import Queue
from threading import Thread
from time import time
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='(%(threadName)-10s) %(message)s',
)

class DownloadWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            lowIndex, highIndex = self.queue.get()
            for i in range(lowIndex, highIndex - 1):
                df['polarity'][i] = TextBlob(df['msg'][i]).sentiment.polarity
            self.queue.task_done()

def main():
    ts = time()
    # Create a queue to communicate with the worker threads
    queue = Queue()
    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        worker.daemon = True
        worker.start()
    # Put the tasks into the queue as a tuple
    for i in tqdm(range(0, len(df) - 1, 62936)):
        logging.debug('Queueing')
        queue.put((i, i + 62936))
    queue.join()
    print('Took {}'.format(time() - ts))

main()
Multiprocessing with loop unrolling:
import multiprocessing

def assign_polarity(df):
    for i in tqdm(range(0, len(df), 5)):
        df['polarity'][i] = TextBlob(df['msg'][i]).sentiment.polarity
        df['polarity'][i + 1] = TextBlob(df['msg'][i + 1]).sentiment.polarity
        df['polarity'][i + 2] = TextBlob(df['msg'][i + 2]).sentiment.polarity
        df['polarity'][i + 3] = TextBlob(df['msg'][i + 3]).sentiment.polarity
        df['polarity'][i + 4] = TextBlob(df['msg'][i + 4]).sentiment.polarity

pool = multiprocessing.Pool(processes=2)
r = pool.map(assign_polarity, df)
pool.close()
How can I increase the speed of this computation, or store the results in the DataFrame in a faster way? My laptop configuration:
RAM: 8 GB
Physical cores: 2
Logical cores: 8
Windows 10
Multiprocessing gave me a higher computation time.
Threading was executed sequentially (I think because of the GIL).
Loop unrolling gave me the same computation speed.
Cython was giving me errors while importing libraries.
Solution:
ASD -- I noticed that storing something in a DataFrame iteratively is VERY slow. I'd try storing your TextBlob polarities in a list (or another structure) and then converting that list into a column of the DataFrame.
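For example, a minimal sketch of that approach, assuming df and TextBlob are already set up as in the question (the polarities variable name is just illustrative):

from textblob import TextBlob
from tqdm import tqdm

# Collect the polarity scores in a plain Python list; appending to a list is cheap,
# whereas writing df['polarity'][i] row by row repeats a chained-indexing write on every iteration.
polarities = []
for msg in tqdm(df['msg']):
    polarities.append(TextBlob(msg).sentiment.polarity)

# Assign the whole column in one step instead of 370k individual writes.
df['polarity'] = polarities

The TextBlob call itself still dominates the runtime, but the single column assignment at the end removes the per-row DataFrame writes, which is usually what makes the iterative version so slow.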