Common commands for Linux, Spark, Hadoop, HBase, Hive, Kafka, etc.
1. Linux
- View disk info:
fdisk -l
- View disk usage:
df -h
- View memory:
free -h
- Create nested directories:
mkdir -p
- Date utilities:
date --help
e.g. date -d '2020-9-3' +%s
takes a date string and returns the Unix timestamp
date -d '2020-9-3' +%F
returns the full date (YYYY-MM-DD)
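The same two conversions can be reproduced in Python (a small sketch for illustration, not part of the original notes; epoch is pinned to UTC here, whereas GNU date uses the local timezone):

```python
from datetime import datetime, timezone

# Parse the date string, mirroring: date -d '2020-9-3' +%s
d = datetime.strptime('2020-9-3', '%Y-%m-%d')

# Epoch seconds, computed in UTC so the result is machine-independent
epoch = int(d.replace(tzinfo=timezone.utc).timestamp())

# Full date, mirroring: date -d '2020-9-3' +%F
full = d.strftime('%Y-%m-%d')

print(epoch)  # 1599091200
print(full)   # 2020-09-03
```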
- View the IP address:
ip addr
ifconfig -a
(the digits after "inet addr:" are the machine's IP address)
View the IP on Windows: ipconfig
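Pulling that address out programmatically can be sketched in Python; the abbreviated `ifconfig` output below is made up for illustration:

```python
import re

# Hypothetical, abbreviated `ifconfig -a` output
sample = """eth0  Link encap:Ethernet  HWaddr 00:0c:29:ab:cd:ef
      inet addr:192.168.206.129  Bcast:192.168.206.255  Mask:255.255.255.0
lo    Link encap:Local Loopback
      inet addr:127.0.0.1  Mask:255.0.0.0"""

# Grab the digits that follow "inet addr:" on each interface
addrs = re.findall(r'inet addr:(\d+\.\d+\.\d+\.\d+)', sample)
print(addrs)  # ['192.168.206.129', '127.0.0.1']
```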
- Shut down:
shutdown now
poweroff
- Reboot:
reboot
- Find a file:
find / -name filename
- Comment syntax:
Linux shell: #
SQL: --
Spark / Java / Scala: //
- Show line numbers in vi:
:set nu
Shell scripts:
- Cache-clearing script: freemem.sh
#!/bin/bash
used=`free -m | awk 'NR==2' | awk '{print $3}'`
free=`free -m | awk 'NR==2' | awk '{print $4}'`
echo "===========================" >> /var/log/mem.log
date >> /var/log/mem.log
echo "Memory usage | [Use:${used}MB][Free:${free}MB]" >> /var/log/mem.log
if [ $free -ge 100 ] ; then
sync && echo 1 > /proc/sys/vm/drop_caches
sync && echo 2 > /proc/sys/vm/drop_caches
sync && echo 3 > /proc/sys/vm/drop_caches
echo "OK" >> /var/log/mem.log
else
echo "Not required" >> /var/log/mem.log
fi
- Make the script executable:
chmod u+x freemem.sh
- Run the script:
./freemem.sh
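The awk pipeline in freemem.sh takes the 3rd and 4th fields of the second line of `free -m`; the same extraction can be sketched in Python (the sample output below is made up, with the column layout of a typical procps `free`):

```python
# Hypothetical `free -m` output
sample = """              total        used        free      shared  buff/cache   available
Mem:           7821        3205        1240         310        3375        4002
Swap:          2047           0        2047"""

line2 = sample.splitlines()[1]     # awk 'NR==2' -> the "Mem:" line
fields = line2.split()             # awk splits on runs of whitespace
used, free = fields[2], fields[3]  # $3 and $4
print(used, free)  # 3205 1240
```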
2. Spark
- Start Spark:
from the Spark install root: sbin/start-all.sh
or from inside the sbin directory: ./start-all.sh
- Start the spark shell, three ways:
start the Hive metastore first: hive --service metastore
(1) local mode: spark-shell
(2) self-hosted standalone cluster: spark-shell --master spark://hadoop-single:7077
(3) Spark on YARN (start YARN first): spark-shell --master yarn
- Exit the spark shell:
:quit
- Fixing stack-overflow errors:
edit the config: vi /opt/install/spark/conf/spark-defaults.conf
set: spark.driver.extraJavaOptions="-Xss30M"
3. Hadoop
- Start Hadoop:
start-all.sh
- Count the lines of an HDFS file:
hdfs dfs -cat /app/data/exam/meituan_waimai_meishi.csv | wc -l
- HDFS:
delete a directory: hdfs dfs -rm -r /input/data
(data is the directory to delete; its contents are deleted along with it)
delete a file: hdfs dfs -rm -f /input/filename
(filename is the file to delete)
- Leave safe mode:
hdfs dfsadmin -safemode leave
4. ZooKeeper
- Start:
zkServer.sh start
5. Python
pyspark:
- For a Python environment with pyspark, simply run:
pyspark
(web UI: hadoop-single:8888)
- For a plain Python environment, run:
jupyter notebook --allow-root
(web UI: hadoop-single:7070)
- Exit the Python REPL:
Ctrl + d
quit()
exit()
- Import modules:
import numpy as np
import pandas as pd
- pyspark imports:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.linalg import Vector  # vector types
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
- Start jupyter on Windows:
in cmd, run: jupyter notebook
6. Kafka
- Start Kafka:
kafka-server-start.sh ./config/server.properties
- Create a topic:
kafka-topics.sh --create --zookeeper 192.168.206.129:2181 --topic testDemo --partitions 3 --replication-factor 1
- Start a console producer to feed in data:
kafka-console-producer.sh --topic testDemo --broker-list 192.168.206.129:9092
- Start a console consumer that reads from the beginning:
kafka-console-consumer.sh --bootstrap-server 192.168.206.129:9092 --topic testDemo --from-beginning
- Delete a topic:
kafka-topics.sh --zookeeper 127.0.0.1:2181 --delete --topic testDemo
- List all topics:
kafka-topics.sh --zookeeper 192.168.206.129:2181 --list
- Describe a topic:
kafka-topics.sh --zookeeper 192.168.206.129:2181 --describe --topic testDemo
- Show a topic's message counts (latest offset per partition):
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 192.168.206.129:9092 --topic testDemo --time -1 --offsets 1
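GetOffsetShell prints one `topic:partition:offset` line per partition; for a topic that has never been truncated, the total message count is the sum of the latest offsets. A Python sketch, with made-up output:

```python
# Hypothetical GetOffsetShell output for a 3-partition topic
output = """testDemo:0:42
testDemo:1:40
testDemo:2:43"""

# Sum the last field (the offset) of every line
total = sum(int(line.rsplit(':', 1)[1]) for line in output.splitlines())
print(total)  # 125
```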
- Change a topic's retention time:
kafka-topics.sh --zookeeper 192.168.206.129:2181 --alter --topic test --config retention.ms=10000
7. MySQL
- Drop a table:
DROP TABLE table_name;
- Truncate a table:
truncate table student;
- Run an SQL script from Linux:
① log in to mysql: mysql -uroot -pok
② run the script: source /home/a.sql
- Check which hosts each user may log in from (in the mysql database):
select user,host,password from user;
Functions:
- ceil() / ceiling(): round up
ceil(1.2) = 2
- floor(): round down
floor(1.2) = 1
- round(): round to the nearest integer
8. HBase
- Start HBase:
start-hbase.sh
- Enter the HBase shell:
hbase shell
- Show the version:
version
- Show cluster status:
status
- Count the total rows of a table:
①: count 'table_name'
count 'table_name', {INTERVAL => 1000000}
by default the running count is printed every 1000 rows; INTERVAL customizes that
②: run an MR job using the RowCounter tool
from the Linux command line: hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'table_name'
the result appears near the end of the output (about 5 lines from the bottom), e.g. ROWS=727
Namespaces:
- List namespaces:
list_namespace
- Create a namespace:
create_namespace 'events_db'
- List the tables in a namespace:
list_namespace_tables 'events_db'
- Drop a namespace: (it must be empty, otherwise an error is raised)
drop_namespace 'events_db'
- Alter a namespace:
alter_namespace 'events_db', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'}
Creating tables:
- Create a table inside a namespace:
user_friends is the table name, uf the column family:
create 'events_db:user_friends','uf'
with 3 column families profile, region, registration:
create 'events_db:users','profile','region','registration'
- Drop a table: (disable it first, then drop)
disable 'events_db:user_friends'
drop 'events_db:user_friends'
- Delete a column family: (drops the question family from table exam:analysis)
alter 'exam:analysis', 'delete' => 'question'
- Add a column family:
alter 'user', NAME => 'f2'
Example 1: table customer, column families addr and order, rowkey jsmith
- Create the table:
create 'customer', {NAME=>'addr'}, {NAME=>'order'}
- List all HBase tables:
list
- Show the table's details:
desc 'customer'
- Put data (adds column city with value montreal):
put 'customer', 'jsmith', 'addr:city', 'montreal'
- Get data (everything under rowkey jsmith):
get 'customer', 'jsmith'
Example 2: table user, column families baseinfo and privateinfo
- Create the table:
create 'user',{NAME=>'baseinfo'},{NAME=>'privateinfo'}
- Show the user table's details:
desc 'user'
- Scan the whole table:
scan 'user'
- Insert data into the user table:
put 'user','002','baseinfo:name','kitty'
put 'user','002','baseinfo:age',18
put 'user','002','privateinfo:tall','1.8'
put 'user','002','privateinfo:health','good'
put 'user','001','baseinfo:name','hello'
put 'user','001','baseinfo:age',20
put 'user','001','privateinfo:tall','1.75'
put 'user','001','privateinfo:health','good'
put 'user','003','baseinfo:name','dava'
put 'user','003','baseinfo:age',25
put 'user','003','privateinfo:tall','1.8'
put 'user','003','privateinfo:health','bad'
- Get data by rowkey:
get 'user','001'
- Two ways to get a single column family:
get 'user','001','baseinfo'
get 'user','001',{COLUMNS=>['baseinfo']}
- Get a single column:
get 'user','001',{COLUMNS=>['baseinfo:name']}
- Update data:
put 'user','001','baseinfo:name','jack'
- Load a data file into HBase: (via the ImportTsv MapReduce tool, run from the Linux command line)
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
-Dimporttsv.separator=, \
-Dimporttsv.columns="HBASE_ROW_KEY,order:numb,order:date" \
customer file:///home/vagrant/hbase_import_data.csv
9. Hive
- Start the Hive metastore:
hive --service metastore
- Start the metastore in the background:
nohup hive --service metastore &
- Hive CLI:
hive
- Use the beeline client:
start the hiveserver2 service first: nohup hive --service hiveserver2 &
then, in a new window, connect with beeline: beeline -u jdbc:hive2://localhost:10000
- Hive local mode:
set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=50000000;
set hive.exec.mode.local.auto.tasks.max=10;
- Enabling Hive local mode:
set hive.exec.mode.local.auto=true; (default false)
a job actually runs in local mode only when all of the following hold:
1. its input size is below hive.exec.mode.local.auto.inputbytes.max (default 128MB)
2. its number of map tasks is below hive.exec.mode.local.auto.tasks.max (default 4)
3. its number of reduce tasks is 0 or 1
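The three conditions above can be expressed as a small predicate (a sketch for illustration; the default thresholds mirror the values quoted above):

```python
def runs_locally(input_bytes, num_maps, num_reduces,
                 max_bytes=128 * 1024 * 1024, max_tasks=4):
    """True if a Hive job qualifies for local mode under the three rules."""
    return (input_bytes < max_bytes      # rule 1: input small enough
            and num_maps < max_tasks     # rule 2: few enough map tasks
            and num_reduces in (0, 1))   # rule 3: at most one reducer

print(runs_locally(50_000_000, 2, 1))   # True
print(runs_locally(200_000_000, 2, 1))  # False: input too large
print(runs_locally(50_000_000, 2, 2))   # False: more than one reducer
```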
- Dynamic partition settings:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.optimize.sort.dynamic.partition=true;
set hive.auto.convert.join=false;
- Run an HQL script from the Linux command line:
hive -f map_hbase.hql
- Passing parameters into Hive:
inside an SQL file:
set a variable: set hivevar:dt='2020-09-04';
use the variable: insert into table xxx partition(dt=${hivevar:dt})
to pass a variable in from a script:
hive -hivevar dt='2020-09-04' -f xxx.sql
- Hive CREATE TABLE:
① create a table from a query (CTAS):
create table ctas_employee as select * from employee;
- Inserting data:
① insert everything queried from ctas_employee into employee:
insert into employee select * from ctas_employee;
② insert into two tables, employee and employee_external, in one pass:
from ctas_employee
insert into table employee select *
insert into table employee_external select *;
③ load data from a file:
with local: copies the file from Linux into Hive's directory
without local: moves the file from HDFS into Hive's directory
syntax:
load data [local] inpath '...' [overwrite] into table xxx
④ write query results out to a directory (only overwrite is supported here):
from employee
insert overwrite [local] directory '/root/employee2.txt' select *;
⑤ load data into a dynamic-partition table: (year and month are columns of employee_partitioned)
from ctas_partitioned
insert into table employee_partitioned partition(year,month)
select name,work_place,sex_age,skills_score,depart_title,'2020','07'
- Export data:
EXPORT TABLE employee TO '/tmp/output3';
- Import data:
IMPORT TABLE employee FROM '/tmp/output3';
- Add partitions:
alter table employee add
partition (year=2019,month=7)
partition (year=2020,month=7)
partition (year=2020,month=8);
- Force-drop a database (even if non-empty):
drop database dataName cascade;
- Lateral view queries:
select name,wps,skills_score from employee
lateral view explode(work_place) work_place_single as wps;
multiple lateral views:
select name,wps,skill,score from employee
lateral view explode(work_place) work_place_single as wps
lateral view explode(skills_score) sks as skill,score;
- WITH (CTE) usage:
SELECT * FROM (SELECT * FROM employee) a;
is equivalent to
with a
as (SELECT * FROM employee)
SELECT * FROM a
- union vs union all:
union all keeps duplicates:
select * from emp_basic eb where eb.emp_name='Erhart Scambler'
union all
select * from emp_basic eb where eb.emp_name='Erhart Scambler';
union removes duplicates:
select * from emp_basic eb where eb.emp_name='Erhart Scambler'
union
select * from emp_basic eb where eb.emp_name='Erhart Scambler';
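The difference is easy to verify with any SQL engine; here is a sketch using Python's built-in sqlite3 (not Hive, but the union semantics are the same):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE emp_basic (emp_name TEXT)")
conn.execute("INSERT INTO emp_basic VALUES ('Erhart Scambler')")

q = "SELECT * FROM emp_basic WHERE emp_name='Erhart Scambler'"

# union all keeps duplicates: the same row comes back twice
print(len(conn.execute(f"{q} UNION ALL {q}").fetchall()))  # 2

# union removes duplicates: only one copy survives
print(len(conn.execute(f"{q} UNION {q}").fetchall()))      # 1
```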
- Using aggregate functions:
select col1, col2, aggregate_function(colX)
from ...
group by col1, col2 (every selected column except the aggregated colX must appear in group by)
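A concrete instance of that pattern, again sketched with sqlite3 and made-up columns (the rule is the same in Hive):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE emp (dept TEXT, city TEXT, salary INT)")
conn.executemany("INSERT INTO emp VALUES (?,?,?)",
                 [('IT', 'NY', 100), ('IT', 'NY', 200), ('HR', 'LA', 150)])

# dept and city are selected un-aggregated, so both must appear in
# GROUP BY; salary is the aggregated column (colX) and stays out of it
rows = conn.execute("""SELECT dept, city, AVG(salary)
                       FROM emp
                       GROUP BY dept, city""").fetchall()
print(sorted(rows))  # [('HR', 'LA', 150.0), ('IT', 'NY', 150.0)]
```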
- like vs rlike:
like: % matches any sequence of zero or more characters
rlike: accepts Java regular expressions
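The contrast can be illustrated in Python: LIKE's % is a plain wildcard, while RLIKE takes a full regular expression (sketched here with sqlite3 for the LIKE side and Python's re module standing in for Java regexes):

```python
import re
import sqlite3

conn = sqlite3.connect(':memory:')

# LIKE: % matches any run of zero or more characters
like_full = conn.execute("SELECT 'Erhart' LIKE 'Er%'").fetchone()[0]
like_zero = conn.execute("SELECT 'Er' LIKE 'Er%'").fetchone()[0]
print(like_full, like_zero)  # 1 1  (the zero-character case also matches)

# RLIKE-style matching: a real regex with character classes and repetition
print(bool(re.search(r'^Er[a-z]+$', 'Erhart')))  # True
print(bool(re.search(r'^Er[0-9]+$', 'Erhart')))  # False
```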
10. Sqoop
- Example 1: create a database and table in MySQL, then fully import into Hive
① create the database and table in MySQL:
CREATE DATABASE IF NOT EXISTS sales_source DEFAULT CHARSET utf8 COLLATE utf8_general_ci;
USE sales_source;
DROP TABLE IF EXISTS customer;
CREATE TABLE customer
(
customer_number INT(11) NOT NULL AUTO_INCREMENT,
customer_name VARCHAR(128) NOT NULL,
customer_street_address VARCHAR(256) NOT NULL,
customer_zip_code INT(11) NOT NULL,
customer_city VARCHAR(32) NOT NULL,
customer_state VARCHAR(32) NOT NULL,
PRIMARY KEY (customer_number)
);
insert rows:
INSERT INTO customer
( customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
)
VALUES
('Big Customers', '7500 Louise Dr.', '17050',
'Mechanicsburg', 'PA')
, ( 'Small Stores', '2500 Woodland St.', '17055',
'Pittsburgh', 'PA')
, ('Medium Retailers', '1111 Ritter Rd.', '17055',
'Pittsburgh', 'PA'
)
, ('Good Companies', '9500 Scott St.', '17050',
'Mechanicsburg', 'PA')
, ('Wonderful Shops', '3333 Rossmoyne Rd.', '17050',
'Mechanicsburg', 'PA')
, ('Loyal Clients', '7070 Ritter Rd.', '17055',
'Pittsburgh', 'PA')
;
② create the database and table in Hive:
create database sales_rds;
USE sales_rds;
DROP TABLE IF EXISTS sales_rds.customer;
CREATE TABLE sales_rds.customer
(
customer_number INT ,
customer_name VARCHAR(128) ,
customer_street_address VARCHAR(256) ,
customer_zip_code INT ,
customer_city VARCHAR(32) ,
customer_state VARCHAR(32)
);
③ full import of the customer table with sqoop (the customer table must already exist in Hive, as created above):
sqoop import \
--connect jdbc:mysql://localhost:3306/sales_source \
--username root \
--password ok \
--table customer \
--hive-import \
--hive-table sales_rds.customer \
--hive-overwrite \
--target-dir temp
where:
--target-dir temp names a temporary staging directory
--fields-terminated-by sets the field separator
④ query:
select * from customer;
- Example 2: incremental import of the sales_order table
the steps match Example 1:
① create the table in MySQL:
CREATE TABLE sales_order
(
order_number INT(11) NOT NULL AUTO_INCREMENT,
customer_number INT(11) NOT NULL,
product_code INT(11) NOT NULL,
order_date DATETIME NOT NULL,
entry_date DATETIME NOT NULL,
order_amount DECIMAL(18,2) NOT NULL,
PRIMARY KEY (order_number)
);
② create the table in Hive:
CREATE TABLE sales_rds.sales_order
(
order_number INT ,
customer_number INT,
product_code INT ,
order_date timestamp ,
entry_date timestamp ,
order_amount DECIMAL(18,2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;
③ incremental import with sqoop:
(the # notes below are explanations; strip them before running, since a line-continuation \ must be the last character on its line)
sqoop import \
--connect jdbc:mysql://localhost:3306/sales_source \
--username root \
--password root \
--table sales_order \
--hive-import \
--hive-table sales_rds.sales_order \
--fields-terminated-by '\t' \    # field separator
--lines-terminated-by '\n' \     # line separator
--check-column entry_date \      # check column: new rows are detected by this column's value
--incremental append \           # incremental mode: append or lastmodified
--last-value '1900-1-1'          # the check column's maximum value from the previous run
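What --incremental append does can be sketched in plain Python: keep only the rows whose check column exceeds the saved last-value, then advance the checkpoint (the rows below are made up; real sqoop of course reads them from MySQL):

```python
# Source rows: (order_number, entry_date)
source = [(1, '2020-09-01'), (2, '2020-09-02'), (3, '2020-09-03')]

last_value = '2020-09-01'  # checkpoint saved after the previous run

# Import only rows whose check column is newer than the checkpoint
new_rows = [r for r in source if r[1] > last_value]
print(new_rows)  # [(2, '2020-09-02'), (3, '2020-09-03')]

# Advance the checkpoint for the next run
last_value = max(r[1] for r in new_rows)
print(last_value)  # 2020-09-03
```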
the same import can also be saved as a sqoop job:
create the job:
sqoop job \
--create myjob \
-- import \
--connect jdbc:mysql://localhost:3306/sales_source \
--username root \
--password ok \
--table sales_order \
--hive-import \
--hive-table sales_rds.sales_order \
--fields-terminated-by '\t' \
--lines-terminated-by '\n' \
--check-column entry_date \
--incremental append \
--last-value '1900-1-1'
run the job:
sqoop job --exec myjob
delete the job:
sqoop job --delete myjob
Note: to run the job without a password prompt, store the password in HDFS:
echo -n "ok" > sqoopPWD.pwd
hdfs dfs -mkdir -p /sqoop/pwd
hdfs dfs -put sqoopPWD.pwd /sqoop/pwd/
hdfs dfs -chmod 400 /sqoop/pwd/sqoopPWD.pwd
and enable saved passwords by uncommenting this property in sqoop/conf/sqoop-site.xml:
<property>
<name>sqoop.metastore.client.record.password</name>
<value>true</value>
<description>If true, allow saved passwords in the metastore.
</description>
</property>
11. Zeppelin
- Start:
cd /opt/install/zeppelin081/bin/
./zeppelin-daemon.sh start