Big Data Website Log Offline Analysis Project

Integrating Hive with HBase

HBaseIntegration - Apache Hive - Apache Software Foundation

Notes:

Version information

As of Hive 0.9.0 the HBase integration requires at least HBase 0.92, earlier versions of Hive were working with HBase 0.89/0.90

That is, Hive 0.9.0 requires at least HBase 0.92.

Version information

Hive 1.x will remain compatible with HBase 0.98.x and lower versions. Hive 2.x will be compatible with HBase 1.x and higher. (See HIVE-10990 for details.) Consumers wanting to work with HBase 1.x using Hive 1.x will need to compile Hive 1.x stream code themselves.

That is, Hive 1.x remains compatible with HBase 0.98.x and lower.

HIVE-705 added native Hive/HBase integration: HBase tables can be accessed with Hive QL statements, including SELECT and INSERT, and Hive can even join and union Hive tables with HBase tables.

Required jar (shipped with Hive):

hive-hbase-handler-x.y.z.jar

Example: connecting to a standalone (single-node) HBase:

$HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.9.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.92.0.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.4.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar --hiveconf hbase.master=hbase.yoyodyne.com:60000

The setting passed with --hiveconf can instead be written into hive-site.xml.

Example: connecting to an HBase cluster:

$HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.9.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.92.0.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.4.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar --hiveconf hbase.zookeeper.quorum=zk1.yoyodyne.com,zk2.yoyodyne.com,zk3.yoyodyne.com

Again, the setting passed with --hiveconf can instead be written into hive-site.xml.

On the Hive server side:

start the metastore as usual: hive --service metastore

Start the client CLI: hive

To work with an HBase table from Hive, the columns must be mapped.

CREATE TABLE hbase_table_1(key int, value string)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")

TBLPROPERTIES ("hbase.table.name" = "xyz", "hbase.mapred.output.outputtable" = "xyz");

The hbase.columns.mapping property is mandatory.

hbase.table.name is optional; it names the table in HBase and allows the Hive table to use a different name. In the example above the Hive table is hbase_table_1 while the HBase table is xyz. If it is omitted, the two names are the same.

hbase.mapred.output.outputtable is optional, but required when inserting data into the table; its value is passed on to hbase.mapreduce.TableOutputFormat.

For the mapping cf1:val in hbase.columns.mapping, after the table is created HBase only shows cf1, not val: val is a row-level qualifier, whereas cf1 (the column family) is table-level metadata in HBase.

Walkthrough:

In Hive:

CREATE TABLE hbase_table_1(key int, value string)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")

TBLPROPERTIES ("hbase.table.name" = "xyz", "hbase.mapred.output.outputtable" = "xyz");

In the HBase shell:

list

desc 'xyz'

In Hive:

insert into hbase_table_1 values(1,'zhangsan');

In the HBase shell:

scan 'xyz'

Creating an external Hive table requires that the corresponding table already exist in HBase.

In the HBase shell:

create 'tb_user', 'info'

In Hive:

create external table hive_tb_user (

key int,

name string,

age int,

sex string,

likes array<string>

)

row format

delimited

collection items terminated by '-'

stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

with serdeproperties("hbase.columns.mapping"=":key,info:name,info:age,info:sex,info:likes")

tblproperties("hbase.table.name"="tb_user", "hbase.mapred.output.outputtable"="tb_user");

from hive_tb_user

insert into table hive_tb_user

select 1,'zhangsan',25,'female',array('climbing','reading','shopping') limit 1;

In the HBase shell:

scan 'tb_user'

put 'tb_user', '1', 'info:likes', 'like1-like2-like3-like4'

Hive and HBase

Add the following configuration on the Hive server side:

in hive-site.xml:

<property>

  <name>hbase.zookeeper.quorum</name>

  <value>node2,node3,node4</value>

</property>

hive --service metastore

On the client side, simply start hive:

hive

  1. A Hive internal (managed) table requires that no corresponding table exist in HBase yet.
  2. A Hive external table requires that the corresponding table already exist in HBase.
  3. The column mapping is declared with
    1. WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:id,cf:username,cf:age")
  4. STORED BY names the class that handles storage: when Hive writes, this class stores the data in HBase, and when Hive reads, it maps the HBase data back onto the Hive schema.
    1. STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  5. TBLPROPERTIES names the HBase table that the Hive table maps to; hbase.mapred.output.outputtable is the HBase table that Hive INSERT statements write into.

TBLPROPERTIES ("hbase.table.name" = "my_table", "hbase.mapred.output.outputtable" = "my_table");

Create an external table (the corresponding table must already exist in HBase):

CREATE external TABLE hbase_my_table(key int, value string)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")

TBLPROPERTIES ("hbase.table.name" = "my_table", "hbase.mapred.output.outputtable" = "my_table");

Column mapping between HBase and Hive

HBase table my_table (column family cf)    ->    Hive column

rowkey         ->   key

cf:id          ->   myid

cf:username    ->   myname

cf:age         ->   myage

Hive table:

create external table my_table_hbase (

   key     int,

   myid    int,

   myname  string,

   myage   int

)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:id,cf:username,cf:age")

TBLPROPERTIES ("hbase.table.name" = "my_table", "hbase.mapred.output.outputtable" = "my_table");

Create a Hive internal table (no corresponding table may exist in HBase yet):

CREATE TABLE hbase_table_1(key int, value string)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")

TBLPROPERTIES ("hbase.table.name" = "xyz", "hbase.mapred.output.outputtable" = "xyz");

Sqoop: introduction, installation and data import

Sqoop is a tool for moving data between relational databases (Oracle, MySQL, PostgreSQL, ...) and Hadoop.

Website: http://sqoop.apache.org/

Versions (the two lines are completely incompatible; Sqoop1 is the most widely used):

sqoop1: 1.4.x

sqoop2: 1.99.x

Similar tools:

DataX, Alibaba's data exchange tool

Sqoop's architecture is very simple, among the simplest in the Hadoop ecosystem.

In Sqoop1 the client talks to Hadoop directly: each command is parsed and turned into a corresponding MapReduce job for execution.

Sqoop export

Sqoop installation and testing

Unpack the tarball.

Configure the environment variables:

SQOOP_HOME

PATH

Add the JDBC driver jar for your database.

Configure sqoop-env.sh (a sketch follows below).

Comment out lines 134-147 of bin/configure-sqoop to silence the unnecessary warnings.
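A minimal sketch of conf/sqoop-env.sh; the variable names come from sqoop-env-template.sh, but the installation paths below are assumptions and must be adjusted to the local machine:

# conf/sqoop-env.sh -- point Sqoop at the local Hadoop/HBase/Hive installations
# (paths are examples only)
export HADOOP_COMMON_HOME=/opt/hadoop-2.6.5
export HADOOP_MAPRED_HOME=/opt/hadoop-2.6.5
export HBASE_HOME=/opt/hbase-0.98.12.1
export HIVE_HOME=/opt/apache-hive-1.2.1-bin
#export ZOOCFGDIR=/opt/zookeeper/conf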

Test the installation:

sqoop version

sqoop list-databases --connect jdbc:mysql://node4:3306/ --username root --password 123456

sqoop help

sqoop help command

Run directly on the command line:

sqoop list-databases --connect jdbc:mysql://node1:3306 --username hive --password hive123

Or put the Sqoop arguments into an options file:

sqoop1.txt

######################

list-databases

--connect

jdbc:mysql://node4:3306

--username

hive

--password

hive123

######################

Then run:

sqoop --options-file sqoop1.txt

[root@node4 sqoop-1.4.6]# sqoop help list-databases

usage: sqoop list-databases [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:

   --connect <jdbc-uri>                         Specify JDBC connect

                                                string

   --connection-manager <class-name>            Specify connection manager

                                                class name

   --connection-param-file <properties-file>    Specify connection

                                                parameters file

   --driver <class-name>                        Manually specify JDBC

                                                driver class to use

   --hadoop-home <hdir>                         Override

                                                $HADOOP_MAPRED_HOME_ARG

   --hadoop-mapred-home <dir>                   Override

                                                $HADOOP_MAPRED_HOME_ARG

   --help                                       Print usage instructions

-P                                              Read password from console

   --password <password>                        Set authentication

                                                password

   --password-alias <password-alias>            Credential provider

                                                password alias

   --password-file <password-file>              Set authentication

                                                password file path

   --relaxed-isolation                          Use read-uncommitted

                                                isolation for imports

   --skip-dist-cache                            Skip copying jars to

                                                distributed cache

   --username <username>                        Set authentication

                                                username

   --verbose                                    Print more information

                                                while working

Generic Hadoop command-line arguments:

(must preceed any tool-specific arguments)

Generic options supported are

-conf <configuration file>     specify an application configuration file

-D <property=value>            use value for given property

-fs <local|namenode:port>      specify a namenode

-jt <local|resourcemanager:port>    specify a ResourceManager

-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster

-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.

-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is

bin/hadoop command [genericOptions] [commandOptions]

To export from Hive to MySQL, Sqoop must be installed on the Hive host (for example, where the Hive client runs).

$CONDITIONS: when importing with a free-form query (-e/--query), the WHERE clause must contain the literal token $CONDITIONS, which Sqoop replaces with its own split conditions.

[root@server3 ~]# sqoop help import

usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:

   --connect <jdbc-uri>                         Specify JDBC connect

                                                string

   --connection-manager <class-name>            Specify connection manager

                                                class name

   --connection-param-file <properties-file>    Specify connection

                                                parameters file

   --driver <class-name>                        Manually specify JDBC

                                                driver class to use

   --hadoop-home <hdir>                         Override

                                                $HADOOP_MAPRED_HOME_ARG

   --hadoop-mapred-home <dir>                   Override

                                                $HADOOP_MAPRED_HOME_ARG

   --help                                       Print usage instructions

-P                                              Read password from console

   --password <password>                        Set authentication

                                                password

   --password-alias <password-alias>            Credential provider

                                                password alias

   --password-file <password-file>              Set authentication

                                                password file path

   --relaxed-isolation                          Use read-uncommitted

                                                isolation for imports

   --skip-dist-cache                            Skip copying jars to

                                                distributed cache

   --username <username>                        Set authentication

                                                username

   --verbose                                    Print more information

                                                while working

Import control arguments:

   --append                                                   Imports data

                                                              in append

                                                              mode

   --as-avrodatafile                                          Imports data

                                                              to Avro data

                                                              files

   --as-parquetfile                                           Imports data

                                                              to Parquet

                                                              files

   --as-sequencefile                                          Imports data

                                                              to

                                                              SequenceFile

                                                              s

   --as-textfile                                              Imports data

                                                              as plain

                                                              text

                                                              (default)

   --autoreset-to-one-mapper                                  Reset the

                                                              number of

                                                              mappers to

                                                              one mapper

                                                              if no split

                                                              key

                                                              available

   --boundary-query <statement>                               Set boundary

                                                              query for

                                                              retrieving

                                                              max and min

                                                              value of the

                                                              primary key

   --columns <col,col,col...>                                 Columns to import from the table

                                                              

   --compression-codec <codec>                                Compression

                                                              codec to use

                                                              for import

   --delete-target-dir                 Imports data  in delete mode

   --direct                                                   Use direct

                                                              import fast

                                                              path

   --direct-split-size <n>                                    Split the

                                                              input stream

                                                              every 'n'

                                                              bytes when

                                                              importing in

                                                              direct mode

-e,--query <statement>                                        Import

                                                              results of

                                                              SQL

                                                              'statement'

   --fetch-size <n>                                           Set number

                                                              'n' of rows

                                                              to fetch

                                                              from the

                                                              database

                                                              when more

                                                              rows are

                                                              needed

   --inline-lob-limit <n>                                     Set the

                                                              maximum size

                                                              for an

                                                              inline LOB

-m,--num-mappers <n>                                          Use 'n' map

                                                              tasks to

                                                              import in

                                                              parallel

   --mapreduce-job-name <name>                                Set name for

                                                              generated

                                                              mapreduce

                                                              job

   --merge-key <column>                                       Key column

                                                              to use to

                                                              join results

   --split-by <column-name>                                   Column of

                                                              the table

                                                              used to

                                                              split work

                                                              units

   --table <table-name>                                       Table to

                                                              read

   --target-dir <dir>                                         HDFS plain

                                                              table

                                                              destination

   --validate                                                 Validate the

                                                              copy using

                                                              the

                                                              configured

                                                              validator

   --validation-failurehandler <validation-failurehandler>    Fully

                                                              qualified

                                                              class name

                                                              for

                                                              ValidationFa

                                                              ilureHandler

   --validation-threshold <validation-threshold>              Fully

                                                              qualified

                                                              class name

                                                              for

                                                              ValidationTh

                                                              reshold

   --validator <validator>                                    Fully

                                                              qualified

                                                              class name

                                                              for the

                                                              Validator

   --warehouse-dir <dir>                                      HDFS parent

                                                              for table

                                                              destination

   --where <where clause>                                     WHERE clause

                                                              to use

                                                              during

                                                              import

-z,--compress                                                 Enable

                                                              compression

Incremental import arguments:

   --check-column <column>        Source column to check for incremental

                                  change

   --incremental <import-type>    Define an incremental import of type

                                  'append' or 'lastmodified'

   --last-value <value>           Last imported value in the incremental

                                  check column

Output line formatting arguments:

   --enclosed-by <char>               Sets a required field enclosing

                                      character

   --escaped-by <char>                Sets the escape character

   --fields-terminated-by <char>      Sets the field separator character

   --lines-terminated-by <char>       Sets the end-of-line character

   --mysql-delimiters                 Uses MySQL's default delimiter set:

                                      fields: ,  lines: \n  escaped-by: \

                                      optionally-enclosed-by: '

   --optionally-enclosed-by <char>    Sets a field enclosing character

Input parsing arguments:

   --input-enclosed-by <char>               Sets a required field encloser

   --input-escaped-by <char>                Sets the input escape

                                            character

   --input-fields-terminated-by <char>      Sets the input field separator

   --input-lines-terminated-by <char>       Sets the input end-of-line

                                            char

   --input-optionally-enclosed-by <char>    Sets a field enclosing

                                            character

Code generation arguments:

   --bindir <dir>                        Output directory for compiled

                                         objects

   --class-name <name>                   Sets the generated class name.

                                         This overrides --package-name.

                                         When combined with --jar-file,

                                         sets the input class.

   --input-null-non-string <null-str>    Input null non-string

                                         representation

   --input-null-string <null-str>        Input null string representation

   --jar-file <file>                     Disable code generation; use

                                         specified jar

   --map-column-java <arg>               Override mapping for specific

                                         columns to java types

   --null-non-string <null-str>          Null non-string representation

   --null-string <null-str>              Null string representation

   --outdir <dir>                        Output directory for generated

                                         code

   --package-name <name>                 Put auto-generated classes in

                                         this package

Generic Hadoop command-line arguments:

(must preceed any tool-specific arguments)

Generic options supported are

-conf <configuration file>     specify an application configuration file

-D <property=value>            use value for given property

-fs <local|namenode:port>      specify a namenode

-jt <local|resourcemanager:port>    specify a ResourceManager

-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster

-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.

-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is

bin/hadoop command [genericOptions] [commandOptions]

At minimum, you must specify --connect and --table

Arguments to mysqldump and other subprograms may be supplied

after a '--' on the command line.

Importing from the command line:

Import data from MySQL into HDFS:

sqoop import --connect jdbc:mysql://node4/log_results --username hivehive --password hive --as-textfile --table dimension_browser --columns id,browser_name,browser_version --target-dir /sqoop/test1 --delete-target-dir -m 1

Or write the arguments to a file and run that:

sqoop2.txt

import

--connect

jdbc:mysql://node4/log_results

--username

hivehive

--password

hive

--as-textfile

--table

dimension_browser

--columns

id,browser_name,browser_version

--target-dir

/sqoop/test1

--delete-target-dir

-m

1

Run:

sqoop --options-file sqoop2.txt

A free-form SQL query can also drive the import:

sqoop3.txt

import

--connect

jdbc:mysql://node4/log_results

--username

hivehive

--password

hive

--as-textfile

#--query is the same as -e

-e

select id, browser_name, browser_version from dimension_browser where $CONDITIONS

--target-dir

/sqoop/test2

--delete-target-dir

-m

1

Run:

sqoop --options-file sqoop3.txt

Specifying the field delimiter of the files written to HDFS:

sqoop4.txt

import

--connect

jdbc:mysql://node1/log_results

--username

hive

--password

hive123

--as-textfile

-e

select id, browser_name, browser_version from dimension_browser where $CONDITIONS

--target-dir

/sqoop/test2-1

--delete-target-dir

-m

1

--fields-terminated-by

\t

Run:

sqoop --options-file sqoop4.txt

Import into HDFS and create the table in Hive. The default field delimiter is a comma, so it does not have to be specified:

sqoop5.txt

import

--connect

jdbc:mysql://node1/log_results

--username

hive

--password

hive123

--as-textfile

#--query is the same as -e

-e

select id, browser_name, browser_version from dimension_browser where $CONDITIONS

--hive-import

--create-hive-table

--hive-table

hive_browser_dim

--target-dir

/my/tmp

-m

1

--fields-terminated-by

,

Run:

sqoop --options-file sqoop5.txt

Export (from HDFS/Hive to MySQL):

hdfs://mycluster/sqoop/data/mydata.txt

1,zhangsan,hello world

2,lisi,are you ok

3,wangwu,fine thanks

4,zhaoliu,what are you doing

5,qunqi,just say hello

sqoop6.txt

export

--connect

jdbc:mysql://node4/log_results

--username

hivehive

--password

hive

--columns

id,myname,myversion

--export-dir

/user/hive/warehouse/hive_browser_dim/

-m

1

--table

mybrowserinfo

--input-fields-terminated-by

,

sqoop6-1.txt

export

--connect

jdbc:mysql://node4/log_results

--username

hivehive

--password

hive

--columns

id,myname,myversion

--export-dir

/user/hive/warehouse/hive_browser_dim/

-m

1

--table

mybrowserinfo1

Run:

sqoop --options-file sqoop6-1.txt

Or run:

sqoop --options-file sqoop6.txt

With comma-delimited data the delimiter does not need to be specified.

When the data uses Hive's default delimiter, \001 must be specified explicitly in the Sqoop options file:

sqoop11.txt

export

--connect

jdbc:mysql://node1/log_results

--username

hive

--password

hive123

--columns

id,name,msg

--export-dir

/user/hive/warehouse/tb_log2

-m

1

--table

tb_loglog

--input-fields-terminated-by

\001

SQL analysis of user browse depth

Four ways to pivot rows into columns:

join

union

DECODE (Oracle)

case when

Requirement:

Bucket users by how many pages they visited and count how many users fall into each bucket.

First from the per-user point of view.

The results end up in the stats_view_depth table in MySQL (a sketch of its schema follows).
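For reference, a sketch of what the target MySQL table could look like; the column list mirrors the Hive table defined in step 2 below, while the exact types and the unique key are assumptions (the key matches the --update-key used in the Sqoop export later):

-- hypothetical MySQL DDL for the result table
CREATE TABLE stats_view_depth (
  platform_dimension_id INT NOT NULL,
  data_dimension_id     INT NOT NULL,
  kpi_dimension_id      INT NOT NULL,
  pv1 INT, pv2 INT, pv3 INT, pv4 INT,
  pv5_10 INT, pv10_30 INT, pv30_60 INT, pv60_plus INT,
  created DATE,
  UNIQUE KEY (platform_dimension_id, data_dimension_id, kpi_dimension_id)
);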

1. In Hive, create the table mapped to the HBase event log table

CREATE EXTERNAL TABLE event_logs(

key string, pl string, en string, s_time bigint, p_url string, u_ud string, u_sd string

) ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

with serdeproperties('hbase.columns.mapping'=':key,log:pl,log:en,log:s_time,log:p_url,log:u_ud,log:u_sd')

tblproperties('hbase.table.name'='eventlog');

2. Create the Hive table that mirrors the MySQL result table

The HQL analysis results are stored in this Hive table and then exported to MySQL with Sqoop.

CREATE TABLE `stats_view_depth` (

  `platform_dimension_id` bigint ,

  `data_dimension_id` bigint ,

  `kpi_dimension_id` bigint ,

  `pv1` bigint ,

  `pv2` bigint ,

  `pv3` bigint ,

  `pv4` bigint ,

  `pv5_10` bigint ,

  `pv10_30` bigint ,

  `pv30_60` bigint ,

  `pv60_plus` bigint ,

  `created` string

) row format delimited fields terminated by '\t';

3. Create a temporary table in Hive

The intermediate results of the HQL analysis are stored in this temporary table.

CREATE TABLE `stats_view_depth_tmp`(`pl` string, `date` string, `col` string, `ct` bigint);

pl: platform

date: date

col: bucket name; its values correspond to the columns of the MySQL table: pv1, pv2, pv4....

ct: the count for that bucket

So col maps to the pv-prefixed columns in MySQL.

4. Write the UDFs

(two dimensions: platform dimension and date dimension)

package com.sxt.transformer.hive;


import com.sxt.transformer.model.dim.base.PlatformDimension;
import com.sxt.transformer.service.IDimensionConverter;
import com.sxt.transformer.service.impl.DimensionConverterImpl;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import java.io.IOException;

/**
 * UDF for the platform dimension.
 *
 * @author root
 *
 */
public class PlatformDimensionUDF extends UDF {
    private IDimensionConverter converter = new DimensionConverterImpl();

    /**
     * Returns the dimension id for the given platform name.
     *
     * @param platform
     * @return
     */
    public IntWritable evaluate(Text platform) {
        PlatformDimension dimension = new PlatformDimension(platform.toString());

        try {
            int id = this.converter.getDimensionIdByValue(dimension);
            return new IntWritable(id);
        } catch (IOException e) {
            throw new RuntimeException("failed to get the platform dimension id", e);
        }
    }
}

package com.sxt.transformer.hive;


import java.io.IOException;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import com.sxt.common.DateEnum;
import com.sxt.transformer.model.dim.base.DateDimension;
import com.sxt.transformer.service.IDimensionConverter;
import com.sxt.transformer.service.impl.DimensionConverterImpl;
import com.sxt.util.TimeUtil;

/**
 * UDF for the date dimension.
 *
 * @author root
 *
 */
public class DateDimensionUDF extends UDF {
    private IDimensionConverter converter = new DimensionConverterImpl();

    /**
     * Returns the dimension id for the given day (format: yyyy-MM-dd).
     *
     * @param day
     * @return
     */
    public IntWritable evaluate(Text day) {
        DateDimension dimension = DateDimension.buildDate(TimeUtil.parseString2Long(day.toString()), DateEnum.DAY);
        try {
            int id = this.converter.getDimensionIdByValue(dimension);
            System.out.println(day.toString());
            System.out.println(id);
            return new IntWritable(id);
        } catch (IOException e) {
            throw new RuntimeException("failed to get the date dimension id for " + day.toString(), e);
        }
    }
}

5. Build and upload

Package the project into a jar.

Upload bds3.jar to the /sxt/transformer directory in HDFS (see the commands below).
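A sketch of the upload step, assuming the jar has been built locally as bds3.jar and that the HDFS path matches the one used in the CREATE FUNCTION statements below:

# create the target directory (idempotent) and upload/overwrite the jar
hdfs dfs -mkdir -p /sxt/transformer
hdfs dfs -put -f bds3.jar /sxt/transformer/
hdfs dfs -ls /sxt/transformer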

6. Create the Hive functions

create function platformFunc as 'com.sxt.transformer.hive.PlatformDimensionUDF' using jar 'hdfs://mycluster/sxt/transformer/bds3.jar';

create function dateFunc as 'com.sxt.transformer.hive.DateDimensionUDF' using jar 'hdfs://mycluster/sxt/transformer/bds3.jar';

7. Write the HQL (browse depth from the user point of view). Note: the date range is supplied externally.

from (

  select

    pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as day, u_ud,

    (case when count(p_url) = 1 then "pv1"

      when count(p_url) = 2 then "pv2"

      when count(p_url) = 3 then "pv3"

      when count(p_url) = 4 then "pv4"

      when count(p_url) >= 5 and count(p_url) <10 then "pv5_10"

      when count(p_url) >= 10 and count(p_url) <30 then "pv10_30"

      when count(p_url) >=30 and count(p_url) <60 then "pv30_60"  

      else 'pv60_plus' end) as pv

  from event_logs

  where

    en='e_pv'

    and p_url is not null

    and pl is not null

    and s_time >= unix_timestamp('2017-08-23','yyyy-MM-dd')*1000

    and s_time < unix_timestamp('2017-08-24','yyyy-MM-dd')*1000

  group by

    pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), u_ud

) as tmp

insert overwrite table stats_view_depth_tmp

  select pl,day,pv,count(distinct u_ud) as ct where u_ud is not null group by pl,day,pv;

How do we know which pv bucket a given visitor (for example a pv10_30 visitor) belongs to?

By aggregation:

query the HBase table, group by u_ud, and count how many pv events each visitor generated;

a CASE WHEN over that count then assigns the visitor to a bucket:

89155407 pv3

62439313 pv5_10

41469129 pv10_30

37005838 pv30_60

08257218 pv3

This gives, for every visitor, the bucket they belong to.

Aggregating per bucket and counting the distinct u_ud values then tells how many visitors are in each bucket.

`pl` string, `date` string, `col` string, `ct` bigint

website  2019-11-18  pv10  300

website  2019-11-18  pv10  400

website  2019-11-18  pv10  500

website  2019-11-18  pv10  300

website  2019-11-18  pv5_10  20

website  2019-11-18  pv10_30  40

website  2019-11-18  pv30_60  10

website  2019-11-18  pv60_plus  120

Overall: how many visitors are there in each pv bucket?

How many visitors are in pv1?

Aggregation again,

then pivot rows into columns to get the final result

-- i.e. turn the many rows of the temporary table into one row per platform and date.

Row-to-column example:

std prj score

S1 M 100

S1 E 98

S1 Z 80

S2 M 87

S2 E 88

S2 Z 89

std M E Z

S1 100 98 80

S2 87 88 89

select std, score from my_score where prj='M';

select std, score from my_score where prj='E';

select std, score from my_score where prj='Z';

select std, t1.score, t2.score, t3.score from t1 join t2 on t1.std=t2.std

 join t3 on t1.std=t3.std;

SELECT t1.std, t1.score, t2.score, t3.score

from

(select std, score from my_score where prj='M') t1

 join

(select std, score from my_score where prj='E') t2

 on t1.std=t2.std

 join (select std, score from my_score where prj='Z') t3

 on t1.std=t3.std;

Using UNION ALL:

select tmp.std, sum(tmp.M), sum(tmp.E), sum(tmp.Z) from (

select std, score as 'M', 0 as 'E', 0 as 'Z' from tb_score where prj='M' UNION ALL

select std, 0 as 'M', score as 'E', 0 as 'Z' from tb_score where prj='E' UNION ALL

select std, 0 as 'M', 0 as 'E', score as 'Z' from tb_score where prj='Z'

) tmp group by tmp.std;
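The same pivot can also be written with CASE WHEN (the fourth approach listed above); a minimal sketch against the same example tb_score table:

-- pivot with CASE WHEN instead of JOIN / UNION ALL
select std,
       sum(case when prj = 'M' then score else 0 end) as M,
       sum(case when prj = 'E' then score else 0 end) as E,
       sum(case when prj = 'Z' then score else 0 end) as Z
from tb_score
group by std;

On Oracle the CASE expression could be replaced by DECODE(prj, 'M', score, 0).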

with tmp as

(

select pl,`date` as date1,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all

select pl,`date` as date1,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all

select pl,`date` as date1,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all

select 'all' as pl,`date` as date1,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all

select 'all' as pl,`date` as date1,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus'

)

from tmp

insert overwrite table stats_view_depth

select 2,3,6,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),'2017-01-10' group by pl,date1;

The 2, 3, 6 in the SELECT above are placeholder dimension ids; the UDFs written earlier supply the real values, as in the query below.

with tmp as

(

select pl,`date` as date1,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all

select pl,`date` as date1,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all

select pl,`date` as date1,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all

select 'all' as pl,`date` as date1,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all

select 'all' as pl,`date` as date1,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus'

)

from tmp

insert overwrite table stats_view_depth

select platformFunc(pl),dateFunc(date1), 6 ,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),date1 group by pl,date1;

8. Write the Sqoop export script (user point of view)

sqoop --options-file sqoop_1

Contents of the file sqoop_1:

export

--connect

jdbc:mysql://node4:3306/log_results

--username

hive

--password

hive123

--table

stats_view_depth

--export-dir

/hive/bigdater.db/stats_view_depth/*

--input-fields-terminated-by

"\\t"

--update-mode

allowinsert

--update-key

platform_dimension_id,data_dimension_id,kpi_dimension_id

Browse depth from the session point of view

9. Write the HQL (browse depth from the session point of view). Note: the date range is supplied externally.

from (

select pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as day, u_sd,

 (case when count(p_url) = 1 then "pv1"

 when count(p_url) = 2 then "pv2"

 when count(p_url) = 3 then "pv3"

 when count(p_url) = 4 then "pv4"

 when count(p_url) >= 5 and count(p_url) <10 then "pv5_10"

 when count(p_url) >= 10 and count(p_url) <30 then "pv10_30"

 when count(p_url) >=30 and count(p_url) <60 then "pv30_60"  

 else 'pv60_plus' end) as pv

from event_logs

where en='e_pv' and p_url is not null and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000

group by pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), u_sd

) as tmp

insert overwrite table stats_view_depth_tmp

select pl,day,pv,count(distinct u_sd) as ct where u_sd is not null group by pl,day,pv;

with tmp as

(

select pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all

select pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all

select pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all

select pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all

select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all

select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all

select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all

select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all

select 'all' as pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all

select 'all' as pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all

select 'all' as pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all

select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all

select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all

select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all

select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all

select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus'

)

from tmp

insert overwrite table stats_view_depth select platform_convert(pl),date_convert(date),6,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),'2015-12-13' group by pl,date;

10. Write the Sqoop export script (session point of view)

sqoop --options-file sqoop_2

Contents of the file sqoop_2:

export

--connect

jdbc:mysql://hh:3306/report

--username

hive

--password

hive

--table

stats_view_depth

--export-dir

/hive/bigdater.db/stats_view_depth/*

--input-fields-terminated-by

"\\01"

--update-mode

allowinsert

--update-key

platform_dimension_id,data_dimension_id,kpi_dimension_id

Shell script

view_depth_run.sh

#!/bin/bash

startDate=''

endDate=''

until [ $# -eq 0 ]

do

if [ $1'x' = '-sdx' ]; then

shift

startDate=$1

elif [ $1'x' = '-edx' ]; then

shift

endDate=$1

fi

shift

done

if [ -n "$startDate" ] && [ -n "$endDate" ]; then

echo "use the arguments of the date"

else

echo "use the default date"

startDate=$(date -d last-day +%Y-%m-%d)

endDate=$(date +%Y-%m-%d)

fi

echo "run of arguments. start date is:$startDate, end date is:$endDate"

echo "start run of view depth job "

## insert overwrite

echo "start insert user data to hive tmp table"

hive  -e "from (select pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as day, u_ud, (case when count(p_url) = 1 then 'pv1' when count(p_url) = 2 then 'pv2' when count(p_url) = 3 then 'pv3' when count(p_url) = 4 then 'pv4' when count(p_url) >= 5 and count(p_url) <10 then 'pv5_10' when count(p_url) >= 10 and count(p_url) <30 then 'pv10_30' when count(p_url) >=30 and count(p_url) <60 then 'pv30_60'  else 'pv60_plus' end) as pv from event_logs where en='e_pv' and p_url is not null and pl is not null and s_time >= unix_timestamp('$startDate','yyyy-MM-dd')*1000 and s_time < unix_timestamp('$endDate','yyyy-MM-dd')*1000 group by pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), u_ud) as tmp insert overwrite table stats_view_depth_tmp select pl,day,pv,count(distinct u_ud) as ct where u_ud is not null group by pl,day,pv"

echo "start insert user data to hive table"

hive  -e "with tmp as (select pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all select pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all select pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all select 'all' as pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all select 'all' as pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all select 'all' as pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' ) from tmp insert overwrite table stats_view_depth select platform_convert(pl),date_convert(date),5,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),date group by pl,date"

echo "start insert session date to hive tmp table"

hive  -e "from (select pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as day, u_sd, (case when count(p_url) = 1 then 'pv1' when count(p_url) = 2 then 'pv2' when count(p_url) = 3 then 'pv3' when count(p_url) = 4 then 'pv4' when count(p_url) >= 5 and count(p_url) <10 then 'pv5_10' when count(p_url) >= 10 and count(p_url) <30 then 'pv10_30' when count(p_url) >=30 and count(p_url) <60 then 'pv30_60'  else 'pv60_plus' end) as pv from event_logs where en='e_pv' and p_url is not null and pl is not null and s_time >= unix_timestamp('$startDate','yyyy-MM-dd')*1000 and s_time < unix_timestamp('$endDate','yyyy-MM-dd')*1000 group by pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), u_sd ) as tmp insert overwrite table stats_view_depth_tmp select pl,day,pv,count(distinct u_sd) as ct where u_sd is not null group by pl,day,pv"

## insert into

echo "start insert session data to hive table"

hive --database bigdater -e "with tmp as (select pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all select pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all select pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all select 'all' as pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all select 'all' as pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all select 'all' as pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' ) from tmp insert into table stats_view_depth select platform_convert(pl),date_convert(date),6,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),'2015-12-13' group by pl,date"

## sqoop

echo "run the sqoop script,insert hive data to mysql table"

sqoop export --connect jdbc:mysql://hh:3306/report --username hive --password hive --table stats_view_depth --export-dir /hive/bigdater.db/stats_view_depth/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,data_dimension_id,kpi_dimension_id

echo "complete run the view depth job"

Project tuning

I. Purpose of tuning

Make full use of the machines so that MR jobs finish faster and, on limited hardware, as many MR jobs as possible can run.

II. Overview

An MR job consists of a mapper phase and a reducer phase. The mapper phase covers reading the input, the map processing itself, and writing the output (sort & merge); the reducer phase covers fetching the mapper output, merging it (sort & merge), the reduce processing, and writing the result. Of these seven sub-steps, the ones with the most tuning headroom are the map output, the reducer-side merge, and the number of reducers. Put differently, although performance tuning in general covers CPU, memory, disk I/O and network, for an MR job the levers that matter are memory, disk I/O and network, and the guiding idea is to reduce network transfer and disk I/O. The tuning below therefore covers five areas: server tuning, code tuning, mapper tuning, reducer tuning and runner tuning.

III. Server tuning

Server tuning covers server parameter tuning and JVM tuning. Because HBase stores the raw data for this analysis, HBase needs some tuning as well. And beyond parameter tuning, as with any Java program, some JVM tuning is also required.

HDFS tuning

1. dfs.datanode.failed.volumes.tolerated: number of failed disks a DataNode tolerates; default 0, i.e. any disk failure takes the DataNode down. Raise it when multiple disks are mounted.

2. dfs.replication: replication factor, default 3.

3. dfs.namenode.handler.count: number of NameNode handler threads, default 10.

4. dfs.datanode.handler.count: number of DataNode handler threads, default 10.

5. dfs.datanode.max.transfer.threads: number of concurrent data-transfer threads a DataNode serves, default 4096.

It is usually set to 85%-90% of the system's open-file limit; check the limit with ulimit -a and raise it in /etc/security/limits.conf (vim /etc/security/limits.conf). Do not set it too large.

Append at the end of that file:

* soft nofile 65535

* hard nofile 65535

Note: the file-handle limit must not be too large; any value below 1,000,000 is fine, and -1 is generally not used.

Troubleshooting: if the limit is set very high, you may get an "unable to load session" message at the next login; boot into single-user mode and fix the file there.

Single-user mode:

At boot press 'a' to reach the selection screen, press 'e' to edit, select the second line ('kernel ...'), press 'e' again, append a space and the word single at the end, press Enter to go back, then press 'b' to boot into single-user mode. Once it is up, restore the file, save, exit and reboot.

6. io.file.buffer.size: buffer size for reading/writing data, default 4096. Usually left alone; if changed, use a multiple of 4096 (a multiple of the physical page size).
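A sketch of how a couple of these settings would look in hdfs-site.xml; the values are purely illustrative, not recommendations:

<!-- hdfs-site.xml (example values only) -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>1</value>
</property>
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>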

HBase tuning

1. Set the RegionServer heap size; the default is 1 GB, 4 GB is recommended.

Set HBASE_HEAPSIZE=4g in conf/hbase-env.sh.

2. hbase.regionserver.handler.count: number of client RPC handler threads, default 10. Rule of thumb: use a smaller value when puts and scans dominate and a larger value when gets and deletes dominate, to avoid memory problems caused by frequent GC.

3. Manage splits and compactions yourself. By default a region is split automatically when it reaches hbase.hregion.max.filesize (10 GB); 20-500 regions per RegionServer is considered the sweet spot. Compaction is a very important HBase housekeeping mechanism, but it is expensive in memory and CPU, so on heavily loaded clusters it is usually disabled and triggered manually instead.

4. hbase.balancer.period: balancer interval, default 300000 ms (5 minutes); on heavily loaded clusters it can be increased.

5. hfile.block.cache.size: fraction of the heap used for the HFile block cache, default 0.4. Increase it on read-heavy systems and decrease it on write-heavy systems, but do not set it to 0.

6. hbase.regionserver.global.memstore.upperLimit: upper bound on the fraction of the heap used by memstores, default 0.4; when it is reached, memstores are flushed to disk.

7. hbase.regionserver.global.memstore.lowerLimit: lower bound, default 0.38; after flushing, memstore usage must not exceed this fraction.

8. hbase.hregion.memstore.flush.size: a memstore is flushed once it grows beyond this size; default 134217728 (128 MB).

9. hbase.hregion.memstore.block.multiplier: memstore blocking multiplier, default 4. When a memstore exceeds 4 * hbase.hregion.memstore.flush.size, writes to that region are blocked, which in the worst case can end in an OOM error.
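Apart from the heap size (which goes into hbase-env.sh), these parameters are set in hbase-site.xml; a sketch with illustrative values:

<!-- hbase-site.xml (example values only) -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>30</value>
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.3</value>
</property>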

MapReduce tuning

1. mapreduce.task.io.sort.factor: number of streams merged at once during sort/merge, default 10.

2. mapreduce.task.io.sort.mb: memory used for merging and for the map-side sort buffer, default 100 MB.

3. mapreduce.map.sort.spill.percent: buffer-fill threshold that triggers a spill to disk, default 0.80.

4. mapreduce.reduce.shuffle.parallelcopies: number of threads a reducer uses to copy map output, default 5.

5. mapreduce.reduce.shuffle.input.buffer.percent: fraction of the reducer heap used to buffer copied map output, default 0.70; increasing it reduces spills of map data to disk and can improve performance.

6. mapreduce.reduce.shuffle.merge.percent: threshold at which the shuffle starts merging and spilling to disk, default 0.66; if memory allows, increasing it reduces the number of spills. Use it together with mapreduce.reduce.shuffle.input.buffer.percent.

7. mapreduce.task.timeout: how long a task may go without reporting progress before it is considered dead, default 600000 ms (10 minutes); 0 disables the check.
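These properties can be set cluster-wide in mapred-site.xml or per job from the driver; a minimal driver sketch (the class name and the values are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // per-job overrides of the sort/shuffle parameters described above (example values)
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.85f);
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        Job job = Job.getInstance(conf, "shuffle-tuning-example");
        // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true)
    }
}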

IV. Code tuning

Code tuning mainly means hoisting objects that mappers and reducers would otherwise create over and over out of the per-record path, just like tuning any other Java program.

V. Mapper tuning

Mapper tuning has a single goal: reduce the amount of map output. This is done by adding a combine phase and by compressing the map output.

About combine:

A custom combiner extends the Reducer class. Its characteristics:

it takes the map output key/value pairs as both its input and its output, and its job is to reduce network traffic by pre-merging part of the data on the map node.

It works best when the map output values are numeric and easy to aggregate.

Compression:

Enable compression and choose the codec when submitting the job (see the sketch below).
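A minimal driver sketch of both knobs; IntSumReducer (shipped with Hadoop) stands in for a project-specific combiner and only applies when the map output values are counts, and the codec follows the Snappy recommendation given in the runner-tuning section below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class MapperTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 1) compress the intermediate map output to cut shuffle traffic
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "mapper-tuning-example");
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 2) pre-aggregate on the map node before the data is shuffled
        job.setCombinerClass(IntSumReducer.class);
        // ... mapper, reducer, input and output setup as usual
    }
}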

VI. Reducer tuning

Reducer tuning is done through parameters and through the number of reducers.

Tuning the number of reducers:

Requirement: the result with several reducers must be identical to the result with a single reducer; adding reducers must never change the outcome.

Rule of thumb: once the map phase reaches 100%, the reduce phase should reach 33% as early as possible; check with hadoop job -status job_id or in the web UI.

Reason: the number of map tasks is determined by the RecordReaders the InputFormat returns, while the reduce side has three parts: fetching the map output, merging all of it, and the reduce processing itself. The first part depends on the maps finishing, so when the data volume is large and a single reducer cannot meet the performance requirement, raising the number of reducers solves the problem.

Advantage: makes full use of the cluster.

Drawback: some MR jobs cannot exploit multiple reducers, for example a job that computes a global top N.

VII. Runner tuning

Runner tuning means setting job parameters at submission time, either in code or in an XML file.

Parameters 1-8 are used in ActiveUserRunner (the before and configure methods); parameter 9 is used in TransformerBaseRunner (the initScans method). A sketch follows the list.

1. mapred.child.java.opts: JVM options for the child task processes, effective for both map and reduce tasks; default -Xmx200m.

2. mapreduce.map.java.opts: JVM options for map tasks; empty by default, in which case mapred.child.java.opts is used.

3. mapreduce.reduce.java.opts: JVM options for reduce tasks; empty by default, in which case mapred.child.java.opts is used.

4. mapreduce.job.reduces: number of reducers, default 1; can also be changed with job.setNumReduceTasks.

5. mapreduce.map.speculative: enable speculative execution for the map phase, default true; in practice false is usually the better choice. Can be set with job.setMapSpeculativeExecution.

6. mapreduce.reduce.speculative: enable speculative execution for the reduce phase, default true; in practice false is usually the better choice. Can be set with job.setReduceSpeculativeExecution.

7. mapreduce.map.output.compress: whether to compress the map output, default false; set it to true when network transfer needs to be reduced.

8. mapreduce.map.output.compress.codec: the codec used for map output compression, default org.apache.hadoop.io.compress.DefaultCodec; SnappyCodec is recommended (older versions required installing Snappy separately, see http://www.cnblogs.com/chengxin1982/p/3862309.html; the situation in newer versions is unclear).

9. HBase scan parameters

By default HBase hands back rows one at a time, so a mapper fetches the next row from HBase after processing each row; setting the scan cache lets it fetch a batch of rows per round trip and reduces network traffic.
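A minimal sketch of how these parameters might be applied; the values, the job name and the use of TableMapReduceUtil are illustrative and not the project's actual ActiveUserRunner/TransformerBaseRunner code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.mapreduce.Job;

public class RunnerTuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("mapreduce.map.java.opts", "-Xmx512m");     // parameter 2 (example value)
        conf.set("mapreduce.reduce.java.opts", "-Xmx1024m"); // parameter 3 (example value)

        Job job = Job.getInstance(conf, "runner-tuning-example");
        job.setNumReduceTasks(3);                 // parameter 4
        job.setMapSpeculativeExecution(false);    // parameter 5
        job.setReduceSpeculativeExecution(false); // parameter 6

        // parameter 9: fetch HBase rows in batches instead of one by one
        Scan scan = new Scan();
        scan.setCaching(500);        // rows per RPC (example value)
        scan.setCacheBlocks(false);  // MR scans should not churn the block cache
        // the scan is then handed to TableMapReduceUtil.initTableMapperJob(...)
    }
}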

Architecture design and project wrap-up

Project workflow

Module details

How each module computes its metrics

Problems encountered

How they were solved

Summary:

Write the offline project up for your résumé.

第五天 大数据网站日志离线分析项目

hive和hbase的整合

HBaseIntegration - Apache Hive - Apache Software Foundation

注意事项:

版本信息

Avro Data Stored in HBase Columns

As of Hive 0.9.0 the HBase integration requires at least HBase 0.92, earlier versions of Hive were working with HBase 0.89/0.90

Hive 0.9.0与HBase 0.92兼容。

版本信息

Hive 1.x will remain compatible with HBase 0.98.x and lower versions. Hive 2.x will be compatible with HBase 1.x and higher. (See HIVE-10990 for details.) Consumers wanting to work with HBase 1.x using Hive 1.x will need to compile Hive 1.x stream code themselves.

Hive 1.x仍然和HBase 0.98.x兼容。

HIVE-705提出的原生支持的Hive和HBase的整合。可以使用Hive QL语句访问HBase的表,包括SELECT和INSERT。甚至让hive做Hive表和HBase表的join操作和union操作。

需要jar包(hive自带)

hive-hbase-handler-x.y.z.jar

连接单节点hbase的示例:

$HIVE_SRC/build/dist/bin/hive –auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.9.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.92.0.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.4.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar --hiveconf hbase.master=hbase.yoyodyne.com:60000

其中--hiveconf表示可以将此配置写到hive-site.xml中。

连接到hbase集群的示例:

$HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.9.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.92.0.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.4.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar --hiveconf hbase.zookeeper.quorum=zk1.yoyodyne.com,zk2.yoyodyne.com,zk3.yoyodyne.com

其中--hiveconf表示可以将此配置写到hive-site.xml中。

在hive的服务端:

然后正常启动:hive --service metastore

启动客户端CLI:hive

要在hive中操作hbase的表,需要对列进行映射。

CREATE TABLE hbase_table_1(key int, value string)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")

TBLPROPERTIES ("hbase.table.name" = "xyz", "hbase.mapred.output.outputtable" = "xyz");

必须指定hbase.columns.mapping属性。

hbase.table.name属性可选,用于指定hbase中对应的表名,允许在hive表中使用不同的表名。上例中,hive中表名为hbase_table_1,hbase中表名为xyz。如果不指定,hive中的表名与hbase中的表名一致。

hbase.mapred.output.outputtable属性可选,向表中插入数据的时候是必须的。该属性的值传递给了hbase.mapreduce.TableOutputFormat使用。

在hive表定义中的映射hbase.columns.mapping中的cf1:val在创建完表之后,hbase中只显示cf1,并不显示val,因为val是行级别的,cf1才是hbase中表级别的元数据。

具体操作:

hive:

CREATE TABLE hbase_table_1(key int, value string)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")

TBLPROPERTIES ("hbase.table.name" = "xyz", "hbase.mapred.output.outputtable" = "xyz");

hbase:

list

desc 'xyz'

hive操作:

insert into hbase_table_1 values(1,'zhangsan');

hbase操作:

scan 'xyz'

建立外部表要求hbase中必须有表对应

hbase操作:

create 'tb_user', 'info'

hive操作:

create external table hive_tb_user (

key int,

name string,

age int,

sex string,

likes array<string>

)

row format

delimited

collection items terminated by '-'

stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

with serdeproperties("hbase.columns.mapping"=":key,info:name,info:age,info:sex,info:likes")

tblproperties("hbase.table.name"="tb_user", "hbase.mapred.output.outputtable"="tb_user");

from hive_tb_user

insert into table hive_tb_user

select 1,'zhangsan',25,'female',array('climbing','reading','shopping') limit 1;

hbase操作:

scan 'tb_user'

put 'tb_user', 1, 'info:likes', 'like1-like2-like3-like4'

hive和hbase

要求在hive的server端中添加配置信息:

hive-site.xml中添加

<property>

  <name>hbase.zookeeper.quorum</name>

  <value>node2,node3,node4</value>

</property>

hive --service metastore

客户端直接启动hive就行了

hive

  1. 创建hive的内部表,要求hbase中不能有对应的表
  2. 创建hive的外部表,要求hbase中一定要有对应的表
  3. 映射关系通过
    1. WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:id,cf:username,cf:age")
  4. stored by指定hive中存储数据的时候,由该类来处理,该类会将数据放到hbase的存储中,同时在hive读取数据的时候,由该类负责处理hbase的数据和hive的对应关系
    1. STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

  5. 指定hive表和hbase中的哪张表对应,outputtable负责当hive insert数据的时候将数据写到hbase的哪张表
    1. TBLPROPERTIES ("hbase.table.name" = "my_table", "hbase.mapred.output.outputtable" = "my_table");

创建外部表,要求hbase中有对应的表

CREATE external TABLE hbase_my_table(key int, value string)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")

TBLPROPERTIES ("hbase.table.name" = "my_table", "hbase.mapred.output.outputtable" = "my_table");

关于hbase和hive的列对应关系

hbase表 my_table(列族为 cf),其各列与hive表列的对应关系如下:

rowkey        → key

cf:id         → myid

cf:username   → myname

cf:age        → myage

hive表

create external table my_table_hbase (

   key     int,

   myid    int,

   myname  string,

   myage   int

)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:id,cf:username,cf:age")

TBLPROPERTIES ("hbase.table.name" = "my_table", "hbase.mapred.output.outputtable" = "my_table");

创建hive的内部表:要求hbase中不能有对应的表。

CREATE TABLE hbase_table_1(key int, value string)

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")

TBLPROPERTIES ("hbase.table.name" = "xyz", "hbase.mapred.output.outputtable" = "xyz");

sqoop介绍+安装+数据导入

Sqoop:在关系数据库(oracle、mysql、postgresql等)与hadoop(HDFS/Hive/HBase)之间传输数据的工具

官网:http://sqoop.apache.org/

版本:(两个版本完全不兼容,sqoop1使用最多)

sqoop1:1.4.x

sqoop2:1.99.x

同类产品

DataX:阿里顶级数据交换工具

sqoop架构非常简单,是hadoop生态系统中架构最简单的框架之一。

sqoop1由client端直接接入hadoop,任务经解析后生成对应的mapreduce作业执行

sqoop导入:

sqoop导出

sqoop安装和测试

解压

配置环境变量

SQOOP_HOME

PATH

添加数据库驱动包

配置sqoop-env.sh

注释掉bin/configure-sqoop中的第134-147行以关闭不必要的警告信息。

测试

sqoop version

sqoop list-databases --connect jdbc:mysql://node4:3306/ --username root --password 123456

sqoop help

sqoop help command

直接在命令行执行:

sqoop list-databases --connect jdbc:mysql://node1:3306 --username hive --password hive123

将sqoop的命令放到文件中:

sqoop1.txt

######################

list-databases

--connect

jdbc:mysql://node4:3306

--username

hive

--password

hive123

######################

命令行执行:

sqoop --options-file sqoop1.txt

[root@node4 sqoop-1.4.6]# sqoop help list-databases

usage: sqoop list-databases [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:

   --connect <jdbc-uri>                         Specify JDBC connect

                                                string

   --connection-manager <class-name>            Specify connection manager

                                                class name

   --connection-param-file <properties-file>    Specify connection

                                                parameters file

   --driver <class-name>                        Manually specify JDBC

                                                driver class to use

   --hadoop-home <hdir>                         Override

                                                $HADOOP_MAPRED_HOME_ARG

   --hadoop-mapred-home <dir>                   Override

                                                $HADOOP_MAPRED_HOME_ARG

   --help                                       Print usage instructions

-P                                              Read password from console

   --password <password>                        Set authentication

                                                password

   --password-alias <password-alias>            Credential provider

                                                password alias

   --password-file <password-file>              Set authentication

                                                password file path

   --relaxed-isolation                          Use read-uncommitted

                                                isolation for imports

   --skip-dist-cache                            Skip copying jars to

                                                distributed cache

   --username <username>                        Set authentication

                                                username

   --verbose                                    Print more information

                                                while working

Generic Hadoop command-line arguments:

(must preceed any tool-specific arguments)

Generic options supported are

-conf <configuration file>     specify an application configuration file

-D <property=value>            use value for given property

-fs <local|namenode:port>      specify a namenode

-jt <local|resourcemanager:port>    specify a ResourceManager

-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster

-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.

-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is

bin/hadoop command [genericOptions] [commandOptions]

从hive导出到MySQL,则需要在hive的主机(比如hive的客户端所在的位置)安装sqoop。

$CONDITIONS

[root@server3 ~]# sqoop help import

usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:

   --connect <jdbc-uri>                         Specify JDBC connect

                                                string

   --connection-manager <class-name>            Specify connection manager

                                                class name

   --connection-param-file <properties-file>    Specify connection

                                                parameters file

   --driver <class-name>                        Manually specify JDBC

                                                driver class to use

   --hadoop-home <hdir>                         Override

                                                $HADOOP_MAPRED_HOME_ARG

   --hadoop-mapred-home <dir>                   Override

                                                $HADOOP_MAPRED_HOME_ARG

   --help                                       Print usage instructions

-P                                              Read password from console

   --password <password>                        Set authentication

                                                password

   --password-alias <password-alias>            Credential provider

                                                password alias

   --password-file <password-file>              Set authentication

                                                password file path

   --relaxed-isolation                          Use read-uncommitted

                                                isolation for imports

   --skip-dist-cache                            Skip copying jars to

                                                distributed cache

   --username <username>                        Set authentication

                                                username

   --verbose                                    Print more information

                                                while working

Import control arguments:

   --append                                                   Imports data

                                                              in append

                                                              mode

   --as-avrodatafile                                          Imports data

                                                              to Avro data

                                                              files

   --as-parquetfile                                           Imports data

                                                              to Parquet

                                                              files

   --as-sequencefile                                          Imports data

                                                              to

                                                              SequenceFile

                                                              s

   --as-textfile                                              Imports data

                                                              as plain

                                                              text

                                                              (default)

   --autoreset-to-one-mapper                                  Reset the

                                                              number of

                                                              mappers to

                                                              one mapper

                                                              if no split

                                                              key

                                                              available

   --boundary-query <statement>                               Set boundary

                                                              query for

                                                              retrieving

                                                              max and min

                                                              value of the

                                                              primary key

   --columns <col,col,col...>         指定将数据库表中的哪些列数据导入

                                                              

   --compression-codec <codec>                                Compression

                                                              codec to use

                                                              for import

   --delete-target-dir                 Imports data  in delete mode

   --direct                                                   Use direct

                                                              import fast

                                                              path

   --direct-split-size <n>                                    Split the

                                                              input stream

                                                              every 'n'

                                                              bytes when

                                                              importing in

                                                              direct mode

-e,--query <statement>                                        Import

                                                              results of

                                                              SQL

                                                              'statement'

   --fetch-size <n>                                           Set number

                                                              'n' of rows

                                                              to fetch

                                                              from the

                                                              database

                                                              when more

                                                              rows are

                                                              needed

   --inline-lob-limit <n>                                     Set the

                                                              maximum size

                                                              for an

                                                              inline LOB

-m,--num-mappers <n>                                          Use 'n' map

                                                              tasks to

                                                              import in

                                                              parallel

   --mapreduce-job-name <name>                                Set name for

                                                              generated

                                                              mapreduce

                                                              job

   --merge-key <column>                                       Key column

                                                              to use to

                                                              join results

   --split-by <column-name>                                   Column of

                                                              the table

                                                              used to

                                                              split work

                                                              units

   --table <table-name>                                       Table to

                                                              read

   --target-dir <dir>                                         HDFS plain

                                                              table

                                                              destination

   --validate                                                 Validate the

                                                              copy using

                                                              the

                                                              configured

                                                              validator

   --validation-failurehandler <validation-failurehandler>    Fully

                                                              qualified

                                                              class name

                                                              for

                                                              ValidationFa

                                                              ilureHandler

   --validation-threshold <validation-threshold>              Fully

                                                              qualified

                                                              class name

                                                              for

                                                              ValidationTh

                                                              reshold

   --validator <validator>                                    Fully

                                                              qualified

                                                              class name

                                                              for the

                                                              Validator

   --warehouse-dir <dir>                                      HDFS parent

                                                              for table

                                                              destination

   --where <where clause>                                     WHERE clause

                                                              to use

                                                              during

                                                              import

-z,--compress                                                 Enable

                                                              compression

Incremental import arguments:

   --check-column <column>        Source column to check for incremental

                                  change

   --incremental <import-type>    Define an incremental import of type

                                  'append' or 'lastmodified'

   --last-value <value>           Last imported value in the incremental

                                  check column

Output line formatting arguments:

   --enclosed-by <char>               Sets a required field enclosing

                                      character

   --escaped-by <char>                Sets the escape character

   --fields-terminated-by <char>      Sets the field separator character

   --lines-terminated-by <char>       Sets the end-of-line character

   --mysql-delimiters                 Uses MySQL's default delimiter set:

                                      fields: ,  lines: \n  escaped-by: \

                                      optionally-enclosed-by: '

   --optionally-enclosed-by <char>    Sets a field enclosing character

Input parsing arguments:

   --input-enclosed-by <char>               Sets a required field encloser

   --input-escaped-by <char>                Sets the input escape

                                            character

   --input-fields-terminated-by <char>      Sets the input field separator

   --input-lines-terminated-by <char>       Sets the input end-of-line

                                            char

   --input-optionally-enclosed-by <char>    Sets a field enclosing

                                            character

Code generation arguments:

   --bindir <dir>                        Output directory for compiled

                                         objects

   --class-name <name>                   Sets the generated class name.

                                         This overrides --package-name.

                                         When combined with --jar-file,

                                         sets the input class.

   --input-null-non-string <null-str>    Input null non-string

                                         representation

   --input-null-string <null-str>        Input null string representation

   --jar-file <file>                     Disable code generation; use

                                         specified jar

   --map-column-java <arg>               Override mapping for specific

                                         columns to java types

   --null-non-string <null-str>          Null non-string representation

   --null-string <null-str>              Null string representation

   --outdir <dir>                        Output directory for generated

                                         code

   --package-name <name>                 Put auto-generated classes in

                                         this package

Generic Hadoop command-line arguments:

(must preceed any tool-specific arguments)

Generic options supported are

-conf <configuration file>     specify an application configuration file

-D <property=value>            use value for given property

-fs <local|namenode:port>      specify a namenode

-jt <local|resourcemanager:port>    specify a ResourceManager

-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster

-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.

-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is

bin/hadoop command [genericOptions] [commandOptions]

At minimum, you must specify --connect and --table

Arguments to mysqldump and other subprograms may be supplied

after a '--' on the command line.

命令行导入:

从MySQL导入数据到HDFS:

sqoop import --connect jdbc:mysql://node4/log_results --username hivehive --password hive --as-textfile --table dimension_browser --columns id,browser_name,browser_version --target-dir /sqoop/test1 --delete-target-dir -m 1

将语句写入文件并运行:

sqoop2.txt

import

--connect

jdbc:mysql://node4/log_results

--username

hivehive

--password

hive

--as-textfile

--table

dimension_browser

--columns

id,browser_name,browser_version

--target-dir

/sqoop/test1

--delete-target-dir

-m

1

命令行:

sqoop --options-file sqoop2.txt

可以指定SQL执行导入:

sqoop3.txt

import

--connect

jdbc:mysql://node4/log_results

--username

hivehive

--password

hive

--as-textfile

#--query is the same as -e

-e

select id, browser_name, browser_version from dimension_browser where $CONDITIONS

--target-dir

/sqoop/test2

--delete-target-dir

-m

1

命令行:

sqoop --options-file sqoop3.txt

指定导出文件的分隔符:

sqoop4.txt

import

--connect

jdbc:mysql://node1/log_results

--username

hive

--password

hive123

--as-textfile

-e

select id, browser_name, browser_version from dimension_browser where $CONDITIONS

--target-dir

/sqoop/test2-1

--delete-target-dir

-m

1

--fields-terminated-by

\t

命令行:

sqoop --options-file sqoop4.txt

导入到HDFS并同时在Hive中创建表(sqoop导入文本文件时默认的字段分隔符就是逗号,因此可以不显式指定)

sqoop5.txt

import

--connect

jdbc:mysql://node1/log_results

--username

hive

--password

hive123

--as-textfile

#--query is the same as -e

-e

select id, browser_name, browser_version from dimension_browser where $CONDITIONS

--hive-import

--create-hive-table

--hive-table

hive_browser_dim

--target-dir

/my/tmp

-m

1

--fields-terminated-by

,

命令行:

sqoop --options-file sqoop5.txt

导出:

hdfs://mycluster/sqoop/data/mydata.txt

1,zhangsan,hello world

2,lisi,are you ok

3,wangwu,fine thanks

4,zhaoliu,what are you doing

5,qunqi,just say hello

sqoop6.txt

export

--connect

jdbc:mysql://node4/log_results

--username

hivehive

--password

hive

--columns

id,myname,myversion

--export-dir

/user/hive/warehouse/hive_browser_dim/

-m

1

--table

mybrowserinfo

--input-fields-terminated-by

,

sqoop6-1.txt

export

--connect

jdbc:mysql://node4/log_results

--username

hivehive

--password

hive

--columns

id,myname,myversion

--export-dir

/user/hive/warehouse/hive_browser_dim/

-m

1

--table

mybrowserinfo1

命令行:

sqoop --options-file sqoop6-1.txt

命令行执行:

sqoop --options-file sqoop6.txt

字段分隔符是逗号时,可以不指定分隔符参数。

若hive表使用默认分隔符\001,则需要在sqoop文件中显式指定:

sqoop11.txt

export

--connect

jdbc:mysql://node1/log_results

--username

hive

--password

hive123

--columns

id,name,msg

--export-dir

/user/hive/warehouse/tb_log2

-m

1

--table

tb_loglog

--input-fields-terminated-by

\001

用户浏览深度SQL分析

四种行转列:

join

union

DECODE(oracle)

case when

需求:

按用户的访问次数(浏览深度)分组,统计每组有多少人。

站在统计用户的角度

MySQL中的stats_view_depth表

1. 在hive中创建hbase的event_log对应表

CREATE EXTERNAL TABLE event_logs(

key string, pl string, en string, s_time bigint, p_url string, u_ud string, u_sd string

) ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

with serdeproperties('hbase.columns.mapping'=':key,log:pl,log:en,log:s_time,log:p_url,log:u_ud,log:u_sd')

tblproperties('hbase.table.name'='eventlog');

2. 创建mysql在hive中的对应表

hive中的表:HQL分析的结果先保存到该表,然后通过sqoop工具导出到mysql

CREATE TABLE `stats_view_depth` (

  `platform_dimension_id` bigint ,

  `data_dimension_id` bigint ,

  `kpi_dimension_id` bigint ,

  `pv1` bigint ,

  `pv2` bigint ,

  `pv3` bigint ,

  `pv4` bigint ,

  `pv5_10` bigint ,

  `pv10_30` bigint ,

  `pv30_60` bigint ,

  `pv60_plus` bigint ,

  `created` string

) row format delimited fields terminated by '\t';

3. hive创建临时表

把hql分析之后的中间结果存放到当前的临时表。

CREATE TABLE `stats_view_depth_tmp`(`pl` string, `date` string, `col` string, `ct` bigint);

pl:平台

date:日期

col:列名,对应mysql表stats_view_depth中的pv前缀列(pv1、pv2、pv3、pv4……)

ct:该列对应的值(人数)

4. 编写UDF

(platformdimension & datedimension)两个维度

package com.sxt.transformer.hive;


import com.sxt.common.DateEnum;
import com.sxt.transformer.model.dim.base.DateDimension;
import com.sxt.transformer.model.dim.base.PlatformDimension;
import com.sxt.transformer.service.IDimensionConverter;
import com.sxt.transformer.service.impl.DimensionConverterImpl;
import com.sxt.util.TimeUtil;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import java.io.IOException;

/**
 * 操作platform dimension相关的udf
 *
 * @author root
 *
 */
public class PlatformDimensionUDF extends UDF {
    private IDimensionConverter converter = new DimensionConverterImpl();

    /**
     * 根据给定的platform名称返回id
     *
     * @param platform
     * @return
     */
    public IntWritable evaluate(Text platform) {
        PlatformDimension dimension = new PlatformDimension(platform.toString());

        try {
            int id = this.converter.getDimensionIdByValue(dimension);
            return new IntWritable(id);
        } catch (IOException e) {
            throw new RuntimeException("获取id异常");
        }
    }
}

package com.sxt.transformer.hive;


import java.io.IOException;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

import com.sxt.common.DateEnum;
import com.sxt.transformer.model.dim.base.DateDimension;
import com.sxt.transformer.service.IDimensionConverter;
import com.sxt.transformer.service.impl.DimensionConverterImpl;
import com.sxt.util.TimeUtil;

/**
 * 操作日期dimension 相关的udf
 *
 * @author root
 *
 */
public class DateDimensionUDF extends UDF {
    private IDimensionConverter converter = new DimensionConverterImpl();

    /**
     * 根据给定的日期(格式为:yyyy-MM-dd)返回对应的id
     *
     * @param day
     * @return
     */
    public IntWritable evaluate(Text day) {
        DateDimension dimension = DateDimension.buildDate(TimeUtil.parseString2Long(day.toString()), DateEnum.DAY);
        try {
            int id = this.converter.getDimensionIdByValue(dimension);
            System.out.println(day.toString());
            System.out.println(id);
            return new IntWritable(id);
        } catch (IOException e) {
            throw new RuntimeException("获取id异常" + day.toString());
        }
    }
}

5. 上传

打包

bds3.jar上传到hdfs的/sxt/transformer文件夹中

6. 创建hive的function

create function platformFunc as 'com.sxt.transformer.hive.PlatformDimensionUDF' using jar 'hdfs://mycluster/sxt/transformer/bds3.jar';

create function dateFunc as 'com.sxt.transformer.hive.DateDimensionUDF' using jar 'hdfs://mycluster/sxt/transformer/bds3.jar';

7. hql编写(统计用户角度的浏览深度)<注意:时间为外部给定>

from (

  select

    pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as day, u_ud,

    (case when count(p_url) = 1 then "pv1"

      when count(p_url) = 2 then "pv2"

      when count(p_url) = 3 then "pv3"

      when count(p_url) = 4 then "pv4"

      when count(p_url) >= 5 and count(p_url) <10 then "pv5_10"

      when count(p_url) >= 10 and count(p_url) <30 then "pv10_30"

      when count(p_url) >=30 and count(p_url) <60 then "pv30_60"  

      else 'pv60_plus' end) as pv

  from event_logs

  where

    en='e_pv'

    and p_url is not null

    and pl is not null

    and s_time >= unix_timestamp('2017-08-23','yyyy-MM-dd')*1000

    and s_time < unix_timestamp('2017-08-24','yyyy-MM-dd')*1000

  group by

    pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), u_ud

) as tmp

insert overwrite table stats_view_depth_tmp

  select pl,day,pv,count(distinct u_ud) as ct where u_ud is not null group by pl,day,pv;

如何知道该访客是pv10的?

聚合操作

需要从hbase表中查询数据,对u_ud聚合,计算出多少个pv事件

case when得出该访客属于pv10

89155407 pv3

62439313 pv5_10

41469129 pv10_30

37005838 pv30_60

08257218 pv3

总的得出所有人属于pv10

对所有的pv10聚合,计算u_ud的总数,得出pv10的有多少人

`pl` string, `date` string, `col` string, `ct` bigint

website  2019-11-18  pv10  300

website  2019-11-18  pv10  400

website  2019-11-18  pv10  500

website  2019-11-18  pv10  300

website  2019-11-18  pv5_10  20

website  2019-11-18  pv10_30  40

website  2019-11-18  pv30_60  10

website  2019-11-18  pv60_plus  120

总的得出每个pv区间各有多少人

pv1人数是多少?

聚合操作

行转列 → 结果

--把临时表的多行数据,转换成一行

行转列

std prj score

S1 M 100

S1 E 98

S1 Z 80

S2 M 87

S2 E 88

S2 Z 89

std M E Z

S1 100 98 80

S2 87 88 89

select std, score from my_score where prj='M';

select std, score from my_score where prj='E';

select std, score from my_score where prj='Z';

select std, t1.score, t2.score, t3.score from t1 join t2 on t1.std=t2.std

 join t3 on t1.std=t3.std;

SELECT t1.std, t1.score, t2.score, t3.score

from

(select std, score from my_score where prj='M') t1

 join

(select std, score from my_score where prj='E') t2

 on t1.std=t2.std

 join (select std, score from my_score where prj='Z') t3

 on t1.std=t3.std;

采用union all的方式:

select tmp.std, sum(tmp.M), sum(tmp.E), sum(tmp.Z) from (

select std, score as 'M', 0 as 'E', 0 as 'Z' from tb_score where prj='M' UNION ALL

select std, 0 as 'M', score as 'E', 0 as 'Z' from tb_score where prj='E' UNION ALL

select std, 0 as 'M', 0 as 'E', score as 'Z' from tb_score where prj='Z'

) tmp group by tmp.std;

with tmp as

(

select pl,`date` as date1,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all

select pl,`date` as date1,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all

select pl,`date` as date1,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all

select 'all' as pl,`date` as date1,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all

select 'all' as pl,`date` as date1,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus'

)

from tmp

insert overwrite table stats_view_depth

select 2,3,6,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),'2017-01-10' group by pl,date1;

2、3、6只是占位的假数据,实际应编写UDF获取对应的维度id(见下面使用platformFunc/dateFunc的版本)。

with tmp as

(

select pl,`date` as date1,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all

select pl,`date` as date1,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all

select pl,`date` as date1,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all

select pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all

select 'all' as pl,`date` as date1,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all

select 'all' as pl,`date` as date1,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all

select 'all' as pl,`date` as date1,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus'

)

from tmp

insert overwrite table stats_view_depth

select platformFunc(pl),dateFunc(date1), 6 ,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),date1 group by pl,date1;

8. sqoop脚本编写(统计用户角度)

sqoop --options-file sqoop_1

文件sqoop_1内容:

export

--connect

jdbc:mysql://node4:3306/log_results

--username

hive

--password

hive123

--table

stats_view_depth

--export-dir

/hive/bigdater.db/stats_view_depth/*

--input-fields-terminated-by

"\\t"

--update-mode

allowinsert

--update-key

platform_dimension_id,data_dimension_id,kpi_dimension_id

站在会话角度的浏览深度

9. hql编写(统计会话角度的浏览深度)<注意:时间为外部给定>

from (

select pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as day, u_sd,

 (case when count(p_url) = 1 then "pv1"

 when count(p_url) = 2 then "pv2"

 when count(p_url) = 3 then "pv3"

 when count(p_url) = 4 then "pv4"

 when count(p_url) >= 5 and count(p_url) <10 then "pv5_10"

 when count(p_url) >= 10 and count(p_url) <30 then "pv10_30"

 when count(p_url) >=30 and count(p_url) <60 then "pv30_60"  

 else 'pv60_plus' end) as pv

from event_logs

where en='e_pv' and p_url is not null and pl is not null and s_time >= unix_timestamp('2015-12-13','yyyy-MM-dd')*1000 and s_time < unix_timestamp('2015-12-14','yyyy-MM-dd')*1000

group by pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), u_sd

) as tmp

insert overwrite table stats_view_depth_tmp

select pl,day,pv,count(distinct u_sd) as ct where u_sd is not null group by pl,day,pv;

with tmp as

(

select pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all

select pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all

select pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all

select pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all

select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all

select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all

select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all

select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all

select 'all' as pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all

select 'all' as pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all

select 'all' as pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all

select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all

select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all

select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all

select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all

select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus'

)

from tmp

insert overwrite table stats_view_depth select platform_convert(pl),date_convert(date),6,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),'2015-12-13' group by pl,date;

10. sqoop脚本编写(统计会话角度)

sqoop --options-file sqoop_2

文件sqoop_2内容

export

--connect

jdbc:mysql://hh:3306/report

--username

hive

--password

hive

--table

stats_view_depth

--export-dir

/hive/bigdater.db/stats_view_depth/*

--input-fields-terminated-by

"\\01"

--update-mode

allowinsert

--update-key

platform_dimension_id,data_dimension_id,kpi_dimension_id

shell脚本编写

view_depth_run.sh

#!/bin/bash

startDate=''

endDate=''

until [ $# -eq 0 ]

do

if [ $1'x' = '-sdx' ]; then

shift

startDate=$1

elif [ $1'x' = '-edx' ]; then

shift

endDate=$1

fi

shift

done

if [ -n "$startDate" ] && [ -n "$endDate" ]; then

echo "use the arguments of the date"

else

echo "use the default date"

startDate=$(date -d last-day +%Y-%m-%d)

endDate=$(date +%Y-%m-%d)

fi

echo "run of arguments. start date is:$startDate, end date is:$endDate"

echo "start run of view depth job "

## insert overwrite

echo "start insert user data to hive tmp table"

hive  -e "from (select pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as day, u_ud, (case when count(p_url) = 1 then 'pv1' when count(p_url) = 2 then 'pv2' when count(p_url) = 3 then 'pv3' when count(p_url) = 4 then 'pv4' when count(p_url) >= 5 and count(p_url) <10 then 'pv5_10' when count(p_url) >= 10 and count(p_url) <30 then 'pv10_30' when count(p_url) >=30 and count(p_url) <60 then 'pv30_60'  else 'pv60_plus' end) as pv from event_logs where en='e_pv' and p_url is not null and pl is not null and s_time >= unix_timestamp('$startDate','yyyy-MM-dd')*1000 and s_time < unix_timestamp('$endDate','yyyy-MM-dd')*1000 group by pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), u_ud) as tmp insert overwrite table stats_view_depth_tmp select pl,day,pv,count(distinct u_ud) as ct where u_ud is not null group by pl,day,pv"

echo "start insert user data to hive table"

hive  -e "with tmp as (select pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all select pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all select pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all select 'all' as pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all select 'all' as pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all select 'all' as pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' ) from tmp insert overwrite table stats_view_depth select platform_convert(pl),date_convert(date),5,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),date group by pl,date"

echo "start insert session date to hive tmp table"

hive  -e "from (select pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd') as day, u_sd, (case when count(p_url) = 1 then 'pv1' when count(p_url) = 2 then 'pv2' when count(p_url) = 3 then 'pv3' when count(p_url) = 4 then 'pv4' when count(p_url) >= 5 and count(p_url) <10 then 'pv5_10' when count(p_url) >= 10 and count(p_url) <30 then 'pv10_30' when count(p_url) >=30 and count(p_url) <60 then 'pv30_60'  else 'pv60_plus' end) as pv from event_logs where en='e_pv' and p_url is not null and pl is not null and s_time >= unix_timestamp('$startDate','yyyy-MM-dd')*1000 and s_time < unix_timestamp('$endDate','yyyy-MM-dd')*1000 group by pl, from_unixtime(cast(s_time/1000 as bigint),'yyyy-MM-dd'), u_sd ) as tmp insert overwrite table stats_view_depth_tmp select pl,day,pv,count(distinct u_sd) as ct where u_sd is not null group by pl,day,pv"

## insert into

echo "start insert session data to hive table"

hive --database bigdater -e "with tmp as (select pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all select pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all select pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all select pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' union all select 'all' as pl,date,ct as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv1' union all select 'all' as pl,date,0 as pv1,ct as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv2' union all select 'all' as pl,date,0 as pv1,0 as pv2,ct as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv3' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,ct as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv4' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,ct as pv5_10,0 as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv5_10' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,ct as pv10_30,0 as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv10_30' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,ct as pv30_60,0 as pv60_plus from stats_view_depth_tmp where col='pv30_60' union all select 'all' as pl,date,0 as pv1,0 as pv2,0 as pv3,0 as pv4,0 as pv5_10,0 as pv10_30,0 as pv30_60,ct as pv60_plus from stats_view_depth_tmp where col='pv60_plus' ) from tmp insert into table stats_view_depth select platform_convert(pl),date_convert(date),6,sum(pv1),sum(pv2),sum(pv3),sum(pv4),sum(pv5_10),sum(pv10_30),sum(pv30_60),sum(pv60_plus),'2015-12-13' group by pl,date"

## sqoop

echo "run the sqoop script,insert hive data to mysql table"

sqoop export --connect jdbc:mysql://hh:3306/report --username hive --password hive --table stats_view_depth --export-dir /hive/bigdater.db/stats_view_depth/* --input-fields-terminated-by "\\01" --update-mode allowinsert --update-key platform_dimension_id,data_dimension_id,kpi_dimension_id

echo "complete run the view depth job"

项目优化

一、调优的目的

充分的利用机器的性能,更快的完成mr程序的计算任务。甚至是在有限的机器条件下,能够支持运行足够多的mr程序。

二、调优的总体概述

从mr程序的内部运行机制,我们可以了解到一个mr程序由mapper和reducer两个阶段组成,其中mapper阶段包括数据的读取、map处理以及写出操作(排序和合并/sort&merge),而reducer阶段包含mapper输出数据的获取、数据合并(sort&merge)、reduce处理以及写出操作。那么在这七个子阶段中,能够进行较大力度调优的就是map输出、reducer数据合并以及reducer个数这三个方面。也就是说虽然性能调优包括cpu、内存、磁盘io以及网络这四个大方面,但是从mr程序的执行流程中,我们可以知道主要需要调优的是内存、磁盘io以及网络。在mr程序中调优,主要考虑的就是减少网络传输和减少磁盘IO操作,故本次课程的mr调优主要包括服务器调优、代码调优、mapper调优、reducer调优以及runner调优这五个方面。

三、服务器调优

服务器调优主要包括服务器参数调优和jvm调优。在本次项目中,由于我们使用hbase作为我们分析数据的原始数据存储表,所以对于hbase我们也需要进行一些调优操作。除了参数调优之外,和其他一般的java程序一样,还需要进行一些jvm调优。

hdfs调优

1. dfs.datanode.failed.volumes.tolerated: 允许发生磁盘错误的磁盘数量,默认为0,表示不允许datanode发生磁盘异常。当挂载多个磁盘的时候,可以修改该值。

2. dfs.replication: 复制因子,默认3

3. dfs.namenode.handler.count: namenode节点并发线程量,默认10

4. dfs.datanode.handler.count:datanode之间的并发线程量,默认10。

5. dfs.datanode.max.transfer.threads:datanode提供的数据流操作的并发线程量,默认4096。

一般将其设置为linux系统的文件句柄数的85%~90%之间,查看文件句柄数语句ulimit -a,修改vim /etc/security/limits.conf, 不能设置太大

文件末尾,添加

* soft nofile 65535

* hard nofile 65535

注意:句柄数不能够太大,可以设置为1000000以下的所有数值,一般不设置为-1。

异常处理:当设置句柄数较大的时候,重新登录可能出现unable load session的提示信息,这个时候采用单用户模式进行修改操作即可。

单用户模式:

启动的时候按'a'键,进入选择界面,然后按'e'键进入kernel修改界面,然后选择第二行'kernel...',按'e'键进行修改,在最后添加空格+single即可,按回车键回到修改界面,最后按'b'键进行单用户模式启动,当启动成功后,还原文件后保存,最后退出(exit)重启系统即可。

6. io.file.buffer.size: 读取/写出数据的buffer大小,默认4096,一般不用设置,推荐设置为4096的整数倍(物理页面的整数倍大小)。

hbase调优

1. 设置regionserver的内存大小,默认为1g,推荐设置为4g。

修改conf/hbase-env.sh中的HBASE_HEAPSIZE=4g

2. hbase.regionserver.handler.count: 修改客户端并发线程数,默认为10。设置规则为,当put和scan操作比较多的时候,将其设置为比较小的值;当get和delete操作比较多的时候,将其设置为比较大的值。原因是防止频繁GC操作导致内存异常。

3. 自定义hbase的分割和紧缩操作,默认情况下hbase的分割机制是当region大小达到hbase.hregion.max.filesize(10g)的时候进行自动分割,推荐每个regionserver的region个数在20~500个为最佳。hbase的紧缩机制是hbase的一个非常重要的管理机制,hbase的紧缩操作是非常消耗内存和cpu的,所以一般机器压力比较大的话,推荐将其关闭,改为手动控制。

4. hbase.balancer.period: 设置hbase的负载均衡时间,默认为300000(5分钟),在负载比较高的集群上,将其值可以适当的改大。

5. hfile.block.cache.size:修改hfile文件块缓存(block cache)占内存的比例,默认0.4。在读应用比较多的系统中,可以适当的增大该值,在写应用比较多的系统中,可以适当的减少该值,不过不推荐修改为0。

6. hbase.regionserver.global.memstore.upperLimit:修改memstore的内存占用比率上限,默认0.4,当达到该值的时候,会进行flush操作将内容写到磁盘中。

7. hbase.regionserver.global.memstore.lowerLimit: 修改memstore的内存占用比率下限,默认0.38,进行flush操作后,memstore占用的内存比率必须不大于该值。

8. hbase.hregion.memstore.flush.size: 当memstore的值大于该值的时候,进行flush操作。默认134217728(128M)。

9. hbase.hregion.memstore.block.multiplier: 修改memstore阻塞块大小比率值,默认为4。也就是说在memstore的大小超过4*hbase.hregion.memstore.flush.size的时候就会触发写阻塞操作。最终可能会导致出现oom异常。

mapreduce调优

1. mapreduce.task.io.sort.factor: mr程序进行合并排序的时候,打开的文件数量,默认为10个.

2. mapreduce.task.io.sort.mb: mr程序进行合并排序操作的时候或者mapper写数据的时候,内存大小,默认100M

3. mapreduce.map.sort.spill.percent: mr程序进行flush操作的阈值,默认0.80。

4. mapreduce.reduce.shuffle.parallelcopies:mr程序reducer copy数据的线程数,默认5。

5. mapreduce.reduce.shuffle.input.buffer.percent: reduce复制map数据的时候指定的内存堆大小百分比,默认为0.70,适当的增加该值可以减少map数据的磁盘溢出,能够提高系统性能。

6. mapreduce.reduce.shuffle.merge.percent:reduce进行shuffle的时候,用于启动合并输出和磁盘溢写过程的阈值,默认为0.66。如果允许,适当增大其比例能够减少磁盘溢写次数,提高系统性能。同mapreduce.reduce.shuffle.input.buffer.percent一起使用。

7. mapreduce.task.timeout:mr程序的task执行情况汇报过期时间,默认600000(10分钟),设置为0表示不进行该值的判断。

四、代码调优

代码调优,主要是在mapper和reducer中,把多次重复创建的对象提取出来复用(例如提为成员变量,避免在map()/reduce()里反复new),这个和一般java程序的代码调优思路一样。
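
下面给出一个最小的示意mapper(非项目原始代码),演示把输出对象提为成员变量复用、避免在map()中反复new;日志按\t切分、取第一列作为统计key只是假设的示例写法:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// 示意:把频繁使用的输出对象提为成员变量,map()中只修改值,不再每条记录都创建新对象
public class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text outKey = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 假设日志以\t分隔,第一列是要统计的维度
        String[] fields = value.toString().split("\t");
        outKey.set(fields[0]);
        context.write(outKey, one);
    }
}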

五、mapper调优

mapper调优主要就是就一个目标:减少输出量。我们可以通过增加combine阶段以及对输出进行压缩设置进行mapper调优。

combine介绍:

实现自定义combine需要继承Reducer类,特点:

以map的输出key/value键值对作为输入输出键值对,作用是减少网络输出,在map节点上就合并一部分数据。

比较适合map输出是数值型、便于累加统计的场景。

压缩设置:

在提交job的时候分别设置启动压缩和指定压缩方式。
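
下面是一段示意性的driver草图(非项目原始实现),演示提交job时同时设置combine和map输出压缩;其中LogMapper沿用上面"代码调优"的示例类,SnappyCodec要求集群已安装Snappy native库:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombineCompressDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 开启map输出压缩,并指定压缩编码(Snappy需要集群已安装对应的native库)
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "mapper-tuning-demo");
        job.setJarByClass(CombineCompressDriver.class);
        job.setMapperClass(LogMapper.class);          // 上面"代码调优"示例中的计数mapper
        job.setCombinerClass(IntSumReducer.class);    // combine阶段在map端先做局部求和,减少网络输出
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

用reducer逻辑直接充当combiner的前提是reduce逻辑满足结合律(如求和、计数),否则需要单独实现combiner类。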

六、reducer调优

reducer调优主要是通过参数调优和设置reducer的个数来完成。

reducer个数调优:

要求:一个reducer和多个reducer的执行结果一致,不能因为多个reducer导致执行结果异常。

规则:一般要求在hadoop集群中的执行mr程序,map执行完成100%后,尽量早的看到reducer执行到33%,可以通过命令hadoop job -status job_id或者web页面来查看。

原因: map的任务数是由InputFormat划分的切片(InputSplit)数量决定的;而reducer由三部分构成,分别为读取mapper输出数据、合并所有输出数据以及reduce处理,其中第一步要依赖map的执行,所以在数据量比较大、一个reducer无法满足性能要求的情况下,我们可以通过调高reducer的个数来解决该问题。

优点:充分利用集群的优势

缺点:有些mr程序没法利用多reducer的优点,比如获取top n的mr程序。

七、runner调优

runner调优其实就是在提交job的时候设置job参数,一般都可以通过代码和xml文件两种方式进行设置。

1~8详见ActiveUserRunner(before和configure方法),9详见TransformerBaseRunner(initScans方法)

1. mapred.child.java.opts: 修改task子进程(child)执行的jvm参数,针对map和reducer均有效,默认:-Xmx200m

2. mapreduce.map.java.opts: 修改map阶段task子进程的jvm参数,默认为空,当为空的时候,使用mapred.child.java.opts。

3. mapreduce.reduce.java.opts:修改reduce阶段task子进程的jvm参数,默认为空,当为空的时候,使用mapred.child.java.opts。

4. mapreduce.job.reduces: 修改reducer的个数,默认为1。可以通过job.setNumReduceTasks方法来进行更改。

5. mapreduce.map.speculative:是否启动map阶段的推测执行,默认为true。其实一般情况设置为false比较好。可通过方法job.setMapSpeculativeExecution来设置。

6. mapreduce.reduce.speculative:是否需要启动reduce阶段的推测执行,默认为true,其实一般情况设置为false比较好。可通过方法job.setReduceSpeculativeExecution来设置。

7. mapreduce.map.output.compress:设置是否启动map输出的压缩机制,默认为false。在需要减少网络传输的时候,可以设置为true。

8. mapreduce.map.output.compress.codec:设置map输出压缩机制,默认为org.apache.hadoop.io.compress.DefaultCodec,推荐使用SnappyCodec(旧版本需要单独安装Snappy本地库,较新的发行版一般已自带,具体以集群环境为准;安装参考:http://www.cnblogs.com/chengxin1982/p/3862309.html)

9. hbase参数设置

由于hbase默认是一条一条数据拿取的,在mapper节点上执行的时候是每处理一条数据后就从hbase中获取下一条数据,通过设置cache值可以一次获取多条数据,减少网络数据传输。
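
下面是一段示意性的runner草图(并非项目中ActiveUserRunner/TransformerBaseRunner的原始实现),集中演示上述几个参数的代码设置方式;其中EventLogMapper是假设的TableMapper实现,hbase表名eventlog沿用前文:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RunnerTuningDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // 2~3:map/reduce子进程的jvm参数
        conf.set("mapreduce.map.java.opts", "-Xmx1024m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx2048m");
        // 7:开启map输出压缩
        conf.setBoolean("mapreduce.map.output.compress", true);

        Job job = Job.getInstance(conf, "runner-tuning-demo");
        job.setJarByClass(RunnerTuningDemo.class);
        // 4:reducer个数;5~6:关闭推测执行
        job.setNumReduceTasks(3);
        job.setMapSpeculativeExecution(false);
        job.setReduceSpeculativeExecution(false);

        // 9:设置scan的cache,一次从hbase拉取多条数据,减少RPC次数
        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false); // MR全表扫描不占用regionserver的block cache
        TableMapReduceUtil.initTableMapperJob("eventlog", scan,
                EventLogMapper.class, Text.class, Text.class, job); // EventLogMapper为假设的TableMapper

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}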

架构设计与项目总结

项目流程

模块细节

模块的计算方式

问题

如何解决的

总结:

写离线项目简历
