Since 40 million rows of data need to be imported into the database and R is too slow for the job, I decided to load the data with Hadoop Streaming instead.
Reference example:
http://blog.csdn.net/zhaoyl03/article/details/8657031/
How to run:
${HADOOP_HOME}/bin/hadoop \
jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-file mapper.py \
-file reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input /testdata.txt \
-output /streaminput3 \
-jobconf mapred.reduce.tasks=2
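The mapper.py and reducer.py come from the referenced example and are not reproduced in the original post; a minimal word-count pair that would produce output of the form shown below might look like this (a sketch in Python 2, matching the style of the script further down; Streaming's default tab-separated key/value format is assumed).
mapper.py:
#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every whitespace-separated token
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print '%s\t%s' % (word, 1)
reducer.py:
#!/usr/bin/env python
# reducer.py -- sum the counts per word; Streaming sorts mapper output by key first
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print '%s\t%s' % (current_word, current_count)
        current_word, current_count = word, int(count)
if current_word is not None:
    print '%s\t%s' % (current_word, current_count)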
Run output:
[root@master codes]# /usr/hadoop/bin/hadoop fs -cat /streaminput3/part-00000
15/08/23 06:45:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(The 1
(which 1
2008). 1
Actually, 1
Alchemy 1
Art 1
But 1
Csárdi 1
GDB, 1
GitHub. 1
I 7
I'll 1
Matloff 1
NSP, 1
R's 2
Software 2
Stanford 1
The 2
a 10
across 1
add 1
adding 2
also 3
and 3
another 1
around 2
at 1
before 1
before, 1
beginning, 1
book 1
but 6
by 2
cleaning 1
code 2
different 1
do 3
does 1
dump.frames(). 1
enhancements 1
especially 1
establishes 1
estimation, 1
facilities 1
finally 1
for 7
fun 1
general). 1
give 1
gotten 2
had 3
hadn't 2
hard, 1
having 1
higher 1
huge 1
idea, 1
in 7
it 3
it. 1
kernel 1
know, 1
latest 1
lot 1
main 1
might 1
more 1
much 1
nice 1
nodes 1
nonparametric 2
not 1
of 7
on 4
parallel 3
place 1
planning 1
portion 1
problem 1
properties 1
ran 1
regression 1
regression, 1
regression. 1
researchers 1
ridge 1
running 1
than 1
that 4
the 10
time 1
to 6
to. 2
tools 2
up 1
various 1
was 2
way 1
which 3
while 1
work 1
wrote 1
you 2
your 1
Test on a virtual machine
${HADOOP_HOME}/bin/hadoop jar \
${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-file mapperSQL.py -mapper mapperSQL.py \
-input /foreastdatatest.csv \
-output /streaminput/ \
-jobconf mapred.reduce.tasks=2
#!/usr/bin/env python
# mapperSQL.py -- read CSV rows from stdin and insert each one into MySQL
import sys
import MySQLdb

# insert one parsed line, e.g. ['1029418', '2014-07-31 23:45:00', '575.914531958545']
def input(value):
    try:
        conn = MySQLdb.connect(host='master', user='root', db='test', port=3306)
        cur = conn.cursor()
        cur.execute("insert into DataPowerPrediction values(%s,%s,%s)", value)
        conn.commit()   # MySQLdb turns autocommit off, so commit explicitly
        cur.close()
        conn.close()
    except MySQLdb.Error, e:
        pass            # skip rows that fail to insert

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading/trailing whitespace and the surrounding double quotes
    line = line.strip()
    line = line.replace('"', '')
    # split the CSV line into fields
    words = line.split(',')
    print words         # echo the parsed row so it shows up in the mapper output
    input(words)
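Before submitting the job, the mapper can be smoke-tested locally by piping a few lines of the CSV through it (the file name here simply mirrors the HDFS input above and is illustrative):
head -n 5 foreastdatatest.csv | python mapperSQL.py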
Test on the cluster
hadoop \
jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.2-transwarp.jar \
-file mapper.py -mapper mapper.py \
-input ./test/forecastdata \
-output ./test/forcastdatatosqloutputfile2 \
-jobconf mapred.map.tasks=8 \
-jobconf mapred.reduce.tasks=2
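For 40 million rows, opening a connection and committing once per line is likely to be the bottleneck; a variant of the mapper that keeps one connection per map task and loads rows in batches with executemany might look like the sketch below (same host, table, and column layout assumed as above; the batch size of 1000 is an arbitrary starting point).
#!/usr/bin/env python
# mapper_batch.py -- one connection per map task, inserts in batches
import sys
import MySQLdb

BATCH = 1000        # rows per executemany()/commit; tune for your setup
conn = MySQLdb.connect(host='master', user='root', db='test', port=3306)
cur = conn.cursor()
rows = []

for line in sys.stdin:
    rows.append(line.strip().replace('"', '').split(','))
    if len(rows) >= BATCH:
        cur.executemany("insert into DataPowerPrediction values(%s,%s,%s)", rows)
        conn.commit()
        rows = []

if rows:            # flush the final partial batch
    cur.executemany("insert into DataPowerPrediction values(%s,%s,%s)", rows)
    conn.commit()

cur.close()
conn.close()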