I read a lot of blog posts about this problem, and none of them explained the root cause clearly, so I was left only half understanding it. Below is the failing code, a simple word count.
# dask: word count over HDFS
from hdfs import Client
from distributed import Client as Cl
from collections import defaultdict

hdfs = Client('http://192.168.175.139:9870')  # connect to HDFS
print(hdfs.list('/'))

client = Cl('172.26.244.71:8786')  # connect to the Dask distributed scheduler
print(client.ncores())  # show the compute resources available in the Dask cluster

filenames = hdfs.list('/test/input1')
print(filenames)

def count_words(fn):
    fn = '/test/input1/' + fn
    word_counts = defaultdict(int)
    with hdfs.read(fn) as f:  # uses the global hdfs client -- this is the problem
        for line in f.readlines():
            for word in line.split():
                word_counts[word] += 1
    return word_counts

# counts = count_words(filenames[0])
# print(counts)

future = client.submit(count_words, filenames[0])
counts = future.result()
When the program raises
TypeError: can't pickle _thread.lock objects
the cause, in my case, was how I used Dask distributed. In plain terms: the code treats the HDFS connection as a global variable. When client.submit ships count_words to the cluster, Dask must serialize (pickle) the function together with the globals it references, including the hdfs client. That client object holds open connections and thread locks, which cannot be pickled, hence the error. Even if it could be pickled, the other machines in the cluster would not share that connection and could not reach HDFS through it. The fix is to open the HDFS connection inside the function, so whichever worker in the cluster runs the task creates its own connection, and the problem goes away. This took me two days to figure out; the root cause was not understanding well enough how Python functions and their global references get shipped around. Lesson learned!
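To see the essence of the error in isolation, here is a minimal sketch using only the standard library (no HDFS, no Dask). Pickling any object that contains a _thread.lock fails with exactly this TypeError; this is presumably what happens when Dask tries to pickle the global hdfs client, since it keeps such objects internally for its connections. The Holder class is just a made-up example, and the exact error wording varies slightly across Python versions.

import pickle
import threading

class Holder:
    def __init__(self):
        self.lock = threading.Lock()  # an unpicklable _thread.lock lives inside the object

try:
    pickle.dumps(Holder())
except TypeError as e:
    # Python 3.7 prints "can't pickle _thread.lock objects",
    # newer versions print "cannot pickle '_thread.lock' object"
    print(e)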
The corrected code:
# dask: word count over HDFS
from hdfs import Client
from distributed import Client as Cl
from collections import defaultdict

hdfs = Client('http://192.168.175.139:9870')  # connect to HDFS (used only on the driver)
print(hdfs.list('/'))

client = Cl('172.26.244.71:8786')  # connect to the Dask distributed scheduler
print(client.ncores())  # show the compute resources available in the Dask cluster

filenames = hdfs.list('/test/input1')
print(filenames)

def count_words(fn):
    # open the HDFS connection inside the function, so the worker that runs
    # this task creates its own client instead of receiving a pickled one
    hdfs = Client('http://192.168.175.139:9870')
    fn = '/test/input1/' + fn
    word_counts = defaultdict(int)
    with hdfs.read(fn) as f:
        for line in f.readlines():
            for word in line.split():
                word_counts[word] += 1
    return word_counts

# counts = count_words(filenames[0])
# print(counts)

future = client.submit(count_words, filenames[0])
counts = future.result()
print(counts)
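Once a single file works, the same pattern extends to every file in the directory with client.map and client.gather. The sketch below assumes the corrected count_words above is already defined, and that every Dask worker has the hdfs package installed and can reach the NameNode; merging the per-file dictionaries on the driver is just one illustrative choice.

# count words in all files under /test/input1 in parallel
futures = client.map(count_words, filenames)   # one task per file
per_file_counts = client.gather(futures)       # list of dicts, one per file

total = defaultdict(int)
for wc in per_file_counts:                     # merge the partial counts on the driver
    for word, n in wc.items():
        total[word] += n

print(sorted(total.items(), key=lambda kv: kv[1], reverse=True)[:10])  # ten most frequent words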