Parallel Training of Neural Networks (Part 1): Implementing a Parallel Algorithm Based on MapReduce

Preface

I've recently been reading papers on parallel training of neural networks with MapReduce, and my advisor asked me to implement the approach myself to get a deeper feel for how it works.

MapReduce is a Java-based framework, so my first thought was to write the deep learning code in Java. The dl4j framework turned out to be painful to use, however, and the deep learning tutorials online are almost all written in Python, so in the end I decided to implement the neural network on top of MapReduce in Python. If that proves unworkable, I will fall back to implementing the neural network in Java.

My rough study plan is as follows:
1. Install Python 3 on Ubuntu
Reference: Installing Python 3 on Ubuntu
2. Implement the simplest possible MapReduce example in Python, e.g. WordCount
Reference 1: Simulating MapReduce distributed computing in Python
Reference 2: Writing MapReduce functions in Python, using WordCount as an example
3. Run Python on Hadoop
Reference 1: Running Python scripts on Hadoop
Reference 2: Writing An Hadoop MapReduce Program In Python
4. Set up PyCharm (Linux) + Hadoop + Spark
Reference 1: PyCharm (Linux) + Hadoop + Spark environment setup
Reference 2: Setting up a Spark development environment in PyCharm + a first pyspark program
Reference 3: Writing WordCount in Python on Spark
Reference 4: Ubuntu 16.04 + PyCharm + Spark environment configuration
5. Parallelize the BP algorithm on Spark
6. Reproduce the code from the papers

I. Installing Python 3 on Ubuntu

Running python -V to check the version shows that the python command is not available:

root@ubuntu:/# python -V

Command 'python' not found, did you mean:

  command 'python3' from deb python3
  command 'python' from deb python-is-python3
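The error message itself points at a fix: on this Ubuntu release, installing the python-is-python3 package (apt install python-is-python3) makes the plain python command an alias for python3. It is not strictly needed here, since python3 works directly, as the checks below show.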

Going into /usr/bin and running ls -l | grep python shows that python3 is a symlink to python3.8:

root@ubuntu:/usr/bin# ls -l | grep python 
lrwxrwxrwx 1 root root          23 Jun  2 03:49 pdb3.8 -> ../lib/python3.8/pdb.py
lrwxrwxrwx 1 root root          31 Sep 23  2020 py3versions -> ../share/python3/py3versions.py
lrwxrwxrwx 1 root root           9 Sep 23  2020 python3 -> python3.8
-rwxr-xr-x 1 root root     5490352 Jun  2 03:49 python3.8
lrwxrwxrwx 1 root root          33 Jun  2 03:49 python3.8-config -> x86_64-linux-gnu-python3.8-config
lrwxrwxrwx 1 root root          16 Mar 13  2020 python3-config -> python3.8-config
-rwxr-xr-x 1 root root         384 Mar 27  2020 python3-futurize
-rwxr-xr-x 1 root root         388 Mar 27  2020 python3-pasteurize
-rwxr-xr-x 1 root root        3241 Jun  2 03:49 x86_64-linux-gnu-python3.8-config
lrwxrwxrwx 1 root root          33 Mar 13  2020 x86_64-linux-gnu-python3-config -> x86_64-linux-gnu-python3.8-config

Running python3 -V confirms that Ubuntu ships with Python 3, version 3.8.10:

root@ubuntu:/# python3 -V
Python 3.8.10

Run pip3 -V to check the pip version:

root@ubuntu:/# pip3 -V
pip 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.8)

Run pip3 list to see the installed packages:

root@ubuntu:/# pip3 list
Package                Version             
---------------------- --------------------
apturl                 0.5.2               
bcrypt                 3.1.7               
blinker                1.4                 
Brlapi                 0.7.0               
certifi                2019.11.28          
chardet                3.0.4               
Click                  7.0                 
colorama               0.4.3               
command-not-found      0.3                 
cryptography           2.8

So Ubuntu already provides a working Python 3 environment.

II. Implementing WordCount

1. Running the Python MapReduce job locally

1) Code
mapper.py:

#!/usr/bin/env python3
import sys

# map phase: emit "<word>\t1" for every word read from stdin
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))

reducer.py:

#!/usr/bin/env python3
import sys

# reduce phase: input arrives sorted by key, so all copies of a word are adjacent
current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:  # skip lines where count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

if word == current_word:  # don't forget to emit the last word
    print("%s\t%s" % (current_word, current_count))

Note the shebang #!/usr/bin/env python3: since no python alias is configured on this machine and the plain python command does not work, the shebang must say python3. Note also that reducer.py assumes its input is sorted by key, which Hadoop's shuffle phase guarantees; this is why the local test below pipes the mapper output through sort -k1,1.
2) Testing the code locally
In the /opt/PycharmProjects/MapReduce/WordCount directory, grant execute permission first, just to be safe:

root@ubuntu:/opt/PycharmProjects/MapReduce/WordCount# chmod +x mapper.py
root@ubuntu:/opt/PycharmProjects/MapReduce/WordCount# chmod +x reducer.py
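Since both scripts start with the #!/usr/bin/env python3 shebang and are now executable, they can also be invoked directly, e.g. echo "aa bb" | ./mapper.py, instead of being passed to python3.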

Test mapper.py:

root@ubuntu:/opt/PycharmProjects/MapReduce/WordCount# echo "aa bb cc dd aa cc" | python3 mapper.py
aa	1
bb	1
cc	1
dd	1
aa	1
cc	1

Test reducer.py:

root@ubuntu:/opt/PycharmProjects/MapReduce/WordCount# echo "foo foo quux labs foo bar quux" | python3 mapper.py | sort -k1,1 | python3 reducer.py
bar	1
foo	3
labs	1
quux	2
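As an extra sanity check on the whole map-sort-reduce chain, a small Python script can drive the same pipeline through subprocess and compare the result against collections.Counter. This is just a sketch: the name check_pipeline.py is my own, and it assumes mapper.py and reducer.py sit in the current directory.

#!/usr/bin/env python3
# check_pipeline.py: compare mapper.py | sort | reducer.py with a plain Counter
import subprocess
from collections import Counter

text = "foo foo quux labs foo bar quux"  # same test string as above

# equivalent to: echo "$text" | python3 mapper.py | sort -k1,1 | python3 reducer.py
out = subprocess.run("python3 mapper.py | sort -k1,1 | python3 reducer.py",
                     shell=True, input=text, capture_output=True,
                     text=True, check=True).stdout

# parse the "word\tcount" lines back into a dict of ints
mr_counts = {w: int(c) for w, c in
             (line.split("\t") for line in out.strip().split("\n"))}

assert mr_counts == Counter(text.split())
print("MapReduce pipeline matches collections.Counter:", mr_counts)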

2. Running the Python MapReduce job on a Hadoop cluster

Method 1: Hadoop Streaming

Hadoop Streaming is a utility that makes MapReduce programming easier: the Mapper and Reducer can be implemented as executables, scripts, or programs in other languages, while still exploiting the parallelism of the Hadoop framework to process large amounts of data.
(1) Download the text files

root@ubuntu:/home/wuwenjing/Downloads/dataset/gutenberg# wget http://www.gutenberg.org/files/5000/5000-8.txt
root@ubuntu:/home/wuwenjing/Downloads/dataset/gutenberg# wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt

(2) Create a directory on HDFS and upload the two books:

root@ubuntu:/hadoop/hadoop-2.9.2/sbin# hdfs dfs -mkdir /user/MapReduce/input
root@ubuntu:/hadoop/hadoop-2.9.2/sbin# hdfs dfs -put /home/wuwenjing/Downloads/dataset/gutenberg/*.txt /user/MapReduce/input
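At this point, hdfs dfs -ls /user/MapReduce/input can be used to confirm that both books actually landed on HDFS.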

(3) Locate the streaming jar: search for the hadoop-streaming*.jar file under the share directory:

root@ubuntu:/hadoop# find ./ -name "*streaming*.jar"
./hadoop-2.9.2/share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar
./hadoop-2.9.2/share/hadoop/tools/sources/hadoop-streaming-2.9.2-sources.jar
./hadoop-2.9.2/share/hadoop/tools/sources/hadoop-streaming-2.9.2-test-sources.jar

(4) The command for the streaming interface is quite long, so put it into a shell script named run.sh and run that instead:

root@ubuntu:/hadoop/hadoop-2.9.2# vim run.sh
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-file /opt/PycharmProjects/MapReduce/WordCount/mapper.py -mapper  /opt/PycharmProjects/MapReduce/WordCount/mapper.py \
-file /opt/PycharmProjects/MapReduce/WordCount/reducer.py -reducer  /opt/PycharmProjects/MapReduce/WordCount/reducer.py \
-input /user/MapReduce/input/*.txt -output /user/MapReduce/output
root@ubuntu:/hadoop/hadoop-2.9.2# source run.sh

This step fails; the log output is as follows:

21/08/30 23:24:34 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
21/08/30 23:24:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
packageJobJar: [/opt/PycharmProjects/MapReduce/WordCount/mapper.py, /opt/PycharmProjects/MapReduce/WordCount/reducer.py] [] /tmp/streamjob4294615261368991400.jar tmpDir=null
21/08/30 23:24:35 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
21/08/30 23:24:35 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
21/08/30 23:24:35 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
21/08/30 23:24:35 INFO mapred.FileInputFormat: Total input files to process : 2
21/08/30 23:24:35 INFO mapreduce.JobSubmitter: number of splits:2
21/08/30 23:24:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1574514939_0001
21/08/30 23:24:36 INFO mapred.LocalDistributedCacheManager: Localized file:/opt/PycharmProjects/MapReduce/WordCount/mapper.py as file:/hadoop/hadoop-2.9.2/tmp/mapred/local/1630391076183/mapper.py
21/08/30 23:24:36 INFO mapred.LocalDistributedCacheManager: Localized file:/opt/PycharmProjects/MapReduce/WordCount/reducer.py as file:/hadoop/hadoop-2.9.2/tmp/mapred/local/1630391076184/reducer.py
21/08/30 23:24:36 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
21/08/30 23:24:36 INFO mapreduce.Job: Running job: job_local1574514939_0001
21/08/30 23:24:36 INFO mapred.LocalJobRunner: OutputCommitter set in config null
21/08/30 23:24:36 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
21/08/30 23:24:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
21/08/30 23:24:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
21/08/30 23:24:36 INFO mapred.LocalJobRunner: Waiting for map tasks
21/08/30 23:24:36 INFO mapred.LocalJobRunner: Starting task: attempt_local1574514939_0001_m_000000_0
21/08/30 23:24:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
21/08/30 23:24:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
21/08/30 23:24:36 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
21/08/30 23:24:36 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/MapReduce/input/5000-8.txt:0+1428843
21/08/30 23:24:36 INFO mapred.MapTask: numReduceTasks: 1
21/08/30 23:24:36 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
21/08/30 23:24:36 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
21/08/30 23:24:36 INFO mapred.MapTask: soft limit at 83886080
21/08/30 23:24:36 INFO mapred
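The log is cut off before the actual failure appears, but the very first WARN line already suggests one cleanup: the per-file -file option is deprecated in favor of the generic -files option. A rewrite of run.sh along those lines (a sketch; I have not rerun it against this cluster) would be:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files /opt/PycharmProjects/MapReduce/WordCount/mapper.py,/opt/PycharmProjects/MapReduce/WordCount/reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input /user/MapReduce/input/*.txt \
-output /user/MapReduce/output

Another failure mode worth ruling out with streaming jobs is a pre-existing output directory; hdfs dfs -rm -r /user/MapReduce/output removes it before resubmitting.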