Preface
I recently read several papers on MapReduce-based parallel training of neural networks, and my advisor asked me to implement the idea myself to get a deeper feel for how it works.
MapReduce is a Java-based framework, so at first I intended to write the deep learning code in Java. But the dl4j framework turned out to be very hard to use, and almost all of the deep learning tutorials online are written for Python, so in the end I decided to implement the MapReduce-based neural network in Python. If that proves infeasible, I will fall back to implementing the network in Java.
The rough study plan is as follows:
1. Install Python3 on Ubuntu
Reference: ubuntu下安装python3
2. Implement the simplest possible MapReduce example in Python, such as WordCount
Reference 1: 用Python模拟MapReduce分布式计算
Reference 2: 用python写MapReduce函数——以WordCount为例
3. Run Python on Hadoop
Reference 1: 在Hadoop上运行Python脚本
Reference 2: Writing An Hadoop MapReduce Program In Python
4. PyCharm (Linux) + Hadoop + Spark environment setup
Reference 1: Pycharm(linux)+Hadoop+Spark(环境搭建)
Reference 2: PyCharm搭建Spark开发环境 + 第一个pyspark程序
Reference 3: Spark下使用python写wordCount
Reference 4: Ubuntu 16.04 + PyCharm + spark 运行环境配置
5. Parallelize the backpropagation (BP) algorithm on Spark
6. Reproduce the code from the papers
I. Installing Python3 on Ubuntu
Running python -V to check the version shows that the python command is not available:
root@ubuntu:/# python -V
Command 'python' not found, did you mean:
command 'python3' from deb python3
command 'python' from deb python-is-python3
Go into /usr/bin and run ls -l | grep python; it shows that python3 is a symlink to python3.8:
root@ubuntu:/usr/bin# ls -l | grep python
lrwxrwxrwx 1 root root 23 Jun 2 03:49 pdb3.8 -> ../lib/python3.8/pdb.py
lrwxrwxrwx 1 root root 31 Sep 23 2020 py3versions -> ../share/python3/py3versions.py
lrwxrwxrwx 1 root root 9 Sep 23 2020 python3 -> python3.8
-rwxr-xr-x 1 root root 5490352 Jun 2 03:49 python3.8
lrwxrwxrwx 1 root root 33 Jun 2 03:49 python3.8-config -> x86_64-linux-gnu-python3.8-config
lrwxrwxrwx 1 root root 16 Mar 13 2020 python3-config -> python3.8-config
-rwxr-xr-x 1 root root 384 Mar 27 2020 python3-futurize
-rwxr-xr-x 1 root root 388 Mar 27 2020 python3-pasteurize
-rwxr-xr-x 1 root root 3241 Jun 2 03:49 x86_64-linux-gnu-python3.8-config
lrwxrwxrwx 1 root root 33 Mar 13 2020 x86_64-linux-gnu-python3-config -> x86_64-linux-gnu-python3.8-config
Running python3 -V shows that Ubuntu ships with Python 3 out of the box, version 3.8.10:
root@ubuntu:/# python3 -V
Python 3.8.10
Run pip3 -V to check the pip version:
root@ubuntu:/# pip3 -V
pip 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.8)
Run pip3 list to see which packages are installed:
root@ubuntu:/# pip3 list
Package Version
---------------------- --------------------
apturl 0.5.2
bcrypt 3.1.7
blinker 1.4
Brlapi 0.7.0
certifi 2019.11.28
chardet 3.0.4
Click 7.0
colorama 0.4.3
command-not-found 0.3
cryptography 2.8
This confirms that Ubuntu already has a working Python 3 environment.
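As one extra sanity check (my own addition, not in the original steps), you can also ask the interpreter itself for its version:
python3 -c 'import sys; print(sys.version)'    # should report 3.8.10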
II. Implementing WordCount
1. Running the Python MapReduce job locally
1) Code
mapper.py:
#!/usr/bin/env python3
import sys

# Emit "word<TAB>1" for every word on every line of standard input.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))
reducer.py:
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0
word = None

# The input arrives sorted by key, so all counts for a given word are adjacent.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:  # if count is not a number, silently skip the line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

if current_word == word and current_word is not None:  # don't forget to emit the last word
    print("%s\t%s" % (current_word, current_count))
Note the shebang line #!/usr/bin/env python3: because no python command is configured on this machine (there is no python symlink), the shebang has to say python3 rather than python.
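Alternatively (my own note, not part of the original steps), Ubuntu 20.04 ships the python-is-python3 package that the "command not found" message above already suggested; installing it creates a python -> python3 symlink, after which a plain python shebang would work too:
apt install python-is-python3    # afterwards `python -V` should report Python 3.8.10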
2) Testing the code locally
Work in the /opt/PycharmProjects/MapReduce/WordCount directory. To be safe, grant both scripts execute permission first:
root@ubuntu:/opt/PycharmProjects/MapReduce/WordCount# chmod +x mapper.py
root@ubuntu:/opt/PycharmProjects/MapReduce/WordCount# chmod +x reducer.py
Test mapper.py:
root@ubuntu:/opt/PycharmProjects/MapReduce/WordCount# echo "aa bb cc dd aa cc" | python3 mapper.py
aa 1
bb 1
cc 1
dd 1
aa 1
cc 1
Test reducer.py (the mapper output must be piped through sort first):
root@ubuntu:/opt/PycharmProjects/MapReduce/WordCount# echo "foo foo quux labs foo bar quux" | python3 mapper.py | sort -k1,1 | python3 reducer.py
bar 1
foo 3
labs 1
quux 2
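The sort -k1,1 between the two scripts stands in for the shuffle-and-sort step that Hadoop performs between the map and reduce phases, which is what reducer.py relies on. A complete local dry run against a real file would look like this (input.txt and output.txt are placeholder names of my own):
cat input.txt | python3 mapper.py | sort -k1,1 | python3 reducer.py > output.txt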
2. Running the Python MapReduce job on a Hadoop cluster
Method 1: Hadoop Streaming
Hadoop Streaming is a utility that ships with Hadoop and makes MapReduce programming much more convenient: the mapper and reducer can be implemented as executable commands, scripts, or programs in other languages, which lets you exploit Hadoop's parallel computing framework to process large datasets without writing Java.
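In other words, the mapper and reducer are just programs that read stdin and write stdout, and a streaming job is submitted along these lines (a generic sketch with placeholder paths; -files is the generic option that replaces the deprecated -file, as the warning in the log further below also points out):
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files mapper.py,reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input <HDFS input path> \
-output <HDFS output path>
The concrete script used in this experiment is shown in step (4) below.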
(1) Download the text files
root@ubuntu:/home/wuwenjing/Downloads/dataset/gutenberg# wget http://www.gutenberg.org/files/5000/5000-8.txt
root@ubuntu:/home/wuwenjing/Downloads/dataset/gutenberg# wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
(2) Create a directory on HDFS and upload the two books:
root@ubuntu:/hadoop/hadoop-2.9.2/sbin# hdfs dfs -mkdir /user/MapReduce/input
root@ubuntu:/hadoop/hadoop-2.9.2/sbin# hdfs dfs -put /home/wuwenjing/Downloads/dataset/gutenberg/*.txt /user/MapReduce/input
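To double-check the upload (an optional step of my own), list the directory:
hdfs dfs -ls /user/MapReduce/input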
(3) Locate the streaming jar: look for the hadoop-streaming*.jar file under the share directory:
root@ubuntu:/hadoop# find ./ -name "*streaming*.jar"
./hadoop-2.9.2/share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar
./hadoop-2.9.2/share/hadoop/tools/sources/hadoop-streaming-2.9.2-sources.jar
./hadoop-2.9.2/share/hadoop/tools/sources/hadoop-streaming-2.9.2-test-sources.jar
(4) Since the full streaming command line is quite long, put it in a shell script named run.sh and run that instead:
root@ubuntu:/hadoop/hadoop-2.9.2# vim run.sh
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-file /opt/PycharmProjects/MapReduce/WordCount/mapper.py -mapper /opt/PycharmProjects/MapReduce/WordCount/mapper.py \
-file /opt/PycharmProjects/MapReduce/WordCount/reducer.py -reducer /opt/PycharmProjects/MapReduce/WordCount/reducer.py \
-input /user/MapReduce/input/*.txt -output /user/MapReduce/output
root@ubuntu:/hadoop/hadoop-2.9.2# source run.sh
This run reports an error; the log output is as follows:
21/08/30 23:24:34 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
21/08/30 23:24:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
packageJobJar: [/opt/PycharmProjects/MapReduce/WordCount/mapper.py, /opt/PycharmProjects/MapReduce/WordCount/reducer.py] [] /tmp/streamjob4294615261368991400.jar tmpDir=null
21/08/30 23:24:35 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
21/08/30 23:24:35 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
21/08/30 23:24:35 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
21/08/30 23:24:35 INFO mapred.FileInputFormat: Total input files to process : 2
21/08/30 23:24:35 INFO mapreduce.JobSubmitter: number of splits:2
21/08/30 23:24:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1574514939_0001
21/08/30 23:24:36 INFO mapred.LocalDistributedCacheManager: Localized file:/opt/PycharmProjects/MapReduce/WordCount/mapper.py as file:/hadoop/hadoop-2.9.2/tmp/mapred/local/1630391076183/mapper.py
21/08/30 23:24:36 INFO mapred.LocalDistributedCacheManager: Localized file:/opt/PycharmProjects/MapReduce/WordCount/reducer.py as file:/hadoop/hadoop-2.9.2/tmp/mapred/local/1630391076184/reducer.py
21/08/30 23:24:36 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
21/08/30 23:24:36 INFO mapreduce.Job: Running job: job_local1574514939_0001
21/08/30 23:24:36 INFO mapred.LocalJobRunner: OutputCommitter set in config null
21/08/30 23:24:36 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
21/08/30 23:24:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
21/08/30 23:24:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
21/08/30 23:24:36 INFO mapred.LocalJobRunner: Waiting for map tasks
21/08/30 23:24:36 INFO mapred.LocalJobRunner: Starting task: attempt_local1574514939_0001_m_000000_0
21/08/30 23:24:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
21/08/30 23:24:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
21/08/30 23:24:36 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
21/08/30 23:24:36 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/MapReduce/input/5000-8.txt:0+1428843
21/08/30 23:24:36 INFO mapred.MapTask: numReduceTasks: 1
21/08/30 23:24:36 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
21/08/30 23:24:36 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
21/08/30 23:24:36 INFO mapred.MapTask: soft limit at 83886080
21/08/30 23:24:36 INFO mapred