Hadoop shell commands:
Grant permissions to users: hdfs dfs -chmod -R 755 /
Change the owner: hdfs dfs -chown -R larry /
Why merge/compress many small files before uploading to HDFS: the NameNode stores the block locations of every file (this metadata is memory-resident), so a large number of small files bloats NameNode memory and degrades file-lookup performance.
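A back-of-the-envelope sketch of the small-files cost, assuming the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per file/block object (the exact figure varies by Hadoop version; the file counts below are illustrative):

```python
# Rough estimate of NameNode heap consumed by filesystem metadata.
# Assumes ~150 bytes of heap per inode/block object (a commonly cited
# rule of thumb, not an exact figure for any specific Hadoop version).
BYTES_PER_OBJECT = 150

def namenode_memory(num_files, blocks_per_file=1):
    """Approximate NameNode heap used by file + block objects."""
    objects = num_files * (1 + blocks_per_file)  # one inode + its blocks
    return objects * BYTES_PER_OBJECT

# 10 million 1 KB files (one block each) vs. the same ~10 GB of data
# merged into 80 files of one 128 MB block each.
small  = namenode_memory(10_000_000)  # 3.0e9 bytes, roughly 2.8 GiB of heap
merged = namenode_memory(80)          # 24 KB of heap
print(small, merged)
```

The point is the ratio, not the absolute numbers: merging the same data into block-sized files shrinks the metadata footprint by several orders of magnitude.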
Different Hadoop users hold different permissions, so the directory contents each user can see may differ; the root user does not necessarily have all permissions.
Hadoop can also operate on local files.
List the class files contained in a jar:
jar -tf spark-examples-1.6.0-hadoop2.6.0.jar
Run a Hadoop jar:
hadoop jar mahout-examples-0.11.2-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
By default, hdfs dfs -put uploads a local file in a single thread.
hdfs dfs -get file downloads a file from HDFS to the local filesystem.
DistCp performs distributed cluster-to-cluster copies and can also be used for multi-threaded uploads of local data. Reference:
https://hadoop.apache.org/docs/r1.0.4/cn/distcp.html
bash$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
When Spark runs on YARN, check its progress:
>
yarn application -list | grep SPARK
Application-Id                     Application-Name  Application-Type  User       Queue                                   State    Final-State  Progress  Tracking-URL
application_1468582953415_1281732  Spark shell       SPARK             recsys     root.bdp_jmart_recsys.recsys_szda       RUNNING  UNDEFINED    10%       http://172.18.149.130:36803
application_1468582953415_1275117  sparksql          SPARK             mart_risk  root.bdp_jmart_risk.bdp_jmart_risk_hkh  RUNNING  UNDEFINED    10%       http://172.18.143.152:5396
>
yarn application -status application_1468582953415_1275117  — check the status of a single application
Collect logs with YARN:
yarn logs -applicationId <application ID>
This collects an application's run logs, but they can only be viewed after the application finishes (YARN aggregates the logs on completion), and log aggregation must be enabled (it is off by default): set yarn.log-aggregation-enable to true.
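A minimal yarn-site.xml fragment for the setting just mentioned (the property name is the standard YARN one; add it on every NodeManager and restart):

```xml
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
```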
Check running applications with the yarn command:
> yarn application -list
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):2
Application-Id                  Application-Name                                           Application-Type  User    Queue         State    Final-State  Progress  Tracking-URL
application_1470800211073_0030  17778-thriftserver                                         SPARK             hadoop  root.default  RUNNING  UNDEFINED    10%       http://192.168.177.78:4041
application_1470800211073_0021  org.apache.spark.sql.hive.thriftserver.HiveThriftServer2   SPARK             hadoop  root.default  RUNNING  UNDEFINED    10%       http://192.168.177.77:4040
(Spark applications expose a Tracking-URL.)
Check task status via the YARN REST API:
> curl --compressed -H "Accept: application/json" -X GET "http://BDS-TEST-002:8088/ws/v1/cluster/apps/application_1470800211073_0030"
{
"app": {
"id": "application_1470800211073_0030",
"user": "hadoop",
"name": "17778-thriftserver",
"queue": "root.default",
"state": "RUNNING",
"finalStatus": "UNDEFINED",
"
progress": 10,
"trackingUI": "ApplicationMaster",
"trackingUrl": "http://BDS-TEST-002:8088/proxy/application_1470800211073_0030/",
"diagnostics": "",
"clusterId": 1470800211073,
"applicationType": "SPARK",
"applicationTags": "",
"startedTime": 1471260225117,
"finishedTime": 0,
"elapsedTime": 64056828,
"amContainerLogs": "http://BDS-TEST-002:8042/node/containerlogs/container_1470800211073_0030_01_000001/hadoop",
"amHostHttpAddress": "BDS-TEST-002:8042",
"allocatedMB": 7168,
"allocatedVCores": 3,
"runningContainers": 3,
"memorySeconds": 459101287,
"vcoreSeconds": 192149,
"preemptedResourceMB": 0,
"preemptedResourceVCores": 0,
"numNonAMContainerPreempted": 0,
"numAMContainerPreempted": 0
}
}
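A minimal sketch of consuming this response in Python. The JSON is embedded as a literal (trimmed to a few fields from the response above) so the sketch runs without a cluster; in practice you would fetch it with urllib.request.urlopen(url):

```python
import json

# A trimmed copy of the /ws/v1/cluster/apps/{appid} response shown above,
# embedded as a literal so this runs without a live ResourceManager.
response = '''{
  "app": {
    "id": "application_1470800211073_0030",
    "state": "RUNNING",
    "finalStatus": "UNDEFINED",
    "progress": 10,
    "trackingUrl": "http://BDS-TEST-002:8088/proxy/application_1470800211073_0030/"
  }
}'''

app = json.loads(response)["app"]
print(f'{app["id"]}: {app["state"]} ({app["progress"]}%)')
# → application_1470800211073_0030: RUNNING (10%)
```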
The URL format is:
http://{http address of service}/ws/{version}/{resourcepath}
Also, look up the ResourceManager address in yarn-site.xml; the config below contains two webapp addresses because ZooKeeper-based HA is in use.
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>BDS-TEST-001:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>BDS-TEST-002:8088</value>
</property>
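With an HA pair, only the active ResourceManager serves the REST API, so a client can simply try each webapp address in turn. A sketch of that fallback, assuming the two addresses from the config above (the function names are illustrative, not a YARN API):

```python
from urllib.error import URLError
from urllib.request import urlopen

# The two webapp addresses from the HA config in yarn-site.xml above.
RM_WEBAPPS = ["BDS-TEST-001:8088", "BDS-TEST-002:8088"]

def app_urls(app_id, webapps=RM_WEBAPPS):
    """Candidate REST URLs for one application, one per ResourceManager."""
    return [f"http://{host}/ws/v1/cluster/apps/{app_id}" for host in webapps]

def fetch_app(app_id, timeout=5):
    """Return the response body from the first RM that answers."""
    for url in app_urls(app_id):
        try:
            with urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except URLError:
            continue  # standby or unreachable RM: try the next address
    raise RuntimeError("no ResourceManager reachable")
```

This is purely client-side fallback; newer Hadoop versions can also redirect REST calls from the standby to the active RM, in which case either address works directly.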
Hadoop HA