How to increase Python speed in loops?

I have a dataset of 370k records stored in a pandas DataFrame that needs to be processed. I tried multiprocessing, threading, Cython, and loop unrolling, but none of them worked and the estimated computation time was 22 hours. The task is as follows:

%matplotlib inline

from numba import jit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from textblob import TextBlob
from tqdm import tqdm

with open('data/full_text.txt', encoding="ISO-8859-1") as f:
    strdata = f.readlines()

data = []
for string in strdata:
    data.append(string.split('\t'))

df = pd.DataFrame(data, columns=["uname", "date", "UT", "lat", "long", "msg"])
df = df.drop('UT', axis=1)
df[['lat', 'long']] = df[['lat', 'long']].apply(pd.to_numeric)
df['polarity'] = np.zeros(len(df))

Threading:

from queue import Queue
from threading import Thread
from time import time
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='(%(threadName)-10s) %(message)s',
)

class DownloadWorker(Thread):

    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            lowIndex, highIndex = self.queue.get()
            for i in range(lowIndex, highIndex - 1):
                # df['polarity'][i] = ... is chained indexing, which triggers
                # pandas' SettingWithCopyWarning; df.loc is the safe form.
                df.loc[i, 'polarity'] = TextBlob(df['msg'][i]).sentiment.polarity
            self.queue.task_done()

def main():
    ts = time()
    # Create a queue to communicate with the worker threads
    queue = Queue()
    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        worker.daemon = True
        worker.start()
    # Put the tasks into the queue as (lowIndex, highIndex) tuples
    for i in tqdm(range(0, len(df) - 1, 62936)):
        logging.debug('Queueing')
        queue.put((i, i + 62936))
    queue.join()
    print('Took {}'.format(time() - ts))

main()

Multiprocessing with loop unrolling:

import multiprocessing

def assign_polarity(df):
    # Loop unrolled by a factor of 5 (assumes len(df) is a multiple of 5)
    for i in tqdm(range(0, len(df), 5)):
        df.loc[i, 'polarity'] = TextBlob(df['msg'][i]).sentiment.polarity
        df.loc[i + 1, 'polarity'] = TextBlob(df['msg'][i + 1]).sentiment.polarity
        df.loc[i + 2, 'polarity'] = TextBlob(df['msg'][i + 2]).sentiment.polarity
        df.loc[i + 3, 'polarity'] = TextBlob(df['msg'][i + 3]).sentiment.polarity
        df.loc[i + 4, 'polarity'] = TextBlob(df['msg'][i + 4]).sentiment.polarity

pool = multiprocessing.Pool(processes=2)
# Note: mapping over a DataFrame iterates its column labels, not its rows,
# and writes made inside worker processes do not propagate back to the parent.
r = pool.map(assign_polarity, df)
pool.close()

How can I increase the speed of computation, or store the results in the DataFrame in a faster way? My laptop configuration:

Ram: 8GB

Physical cores: 2

Logical cores: 8

Windows 10

Implementing multiprocessing gave me a higher computation time.

Threading was executed sequentially (I think because of the GIL).

Loop unrolling gave me the same computation speed.

Cython gave me errors while importing libraries.

Solution

ASD -- I noticed that storing something in a DataFrame iteratively is VERY slow. I'd try storing your TextBlob polarities in a list (or another structure) and then converting that list into a column of the DataFrame.
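As a sketch of that suggestion: build the values in a plain Python list and assign the whole column once, instead of 370k individual per-cell writes. The `score` function below is a hypothetical stand-in for `TextBlob(msg).sentiment.polarity`, so the example runs without textblob installed:

```python
import pandas as pd

def score(msg):
    # Stand-in for TextBlob(msg).sentiment.polarity: a toy scorer
    # so this sketch runs without textblob installed.
    return 1.0 if 'good' in msg else -1.0 if 'bad' in msg else 0.0

df = pd.DataFrame({'msg': ['good day', 'bad news', 'hello']})

# Accumulate the results in a list, then assign the column in one step.
# This avoids the pandas indexing overhead paid on every df.loc write.
polarities = [score(msg) for msg in df['msg']]
df['polarity'] = polarities

print(df['polarity'].tolist())  # → [1.0, -1.0, 0.0]
```

A single column assignment lets pandas allocate the underlying array once, which is the main reason this pattern is so much faster than cell-by-cell writes.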
