Projects_System Administrator

I have been working on the NYU HPC team as a Linux system administrator since last September. It has been a great experience to work there as a graduate student. The first project after I got there was to deploy the account management mechanism. Since the person who had been in charge of account management had just left, my colleagues wanted to know how the details worked.

Generally, for account creation, we assign the user to its group ID and sync the user's NetID password to the HPC account using LDAP. We create the /scratch and /home directories and set up permissions, a disk space quota (on HPC, the block soft limit is 5TB and the hard limit is 6TB) and an inode quota (soft limit 1 million, hard limit 1,001,000). On the login nodes, we use Rocks commands to create a subgroup /users/netID in the cpuset, cpu, cpuacct, and memory subsystems, then set on which CPUs the processes can be scheduled and on which memory nodes they can obtain memory.
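A minimal sketch of the quota part, assuming the filesystems use standard Linux disk quotas managed with setquota (the NetID is the one from the useradd example below; the mount points are illustrative):

## block limits are in 1 KiB units: 5 TiB soft / 6 TiB hard; inode limits: 1,000,000 soft / 1,001,000 hard
setquota -u ak5879 5368709120 6442450944 1000000 1001000 /scratch
setquota -u ak5879 5368709120 6442450944 1000000 1001000 /home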

cgroups (aka control groups) are a Linux kernel feature for limiting, policing, and accounting the resource usage of certain processes (actually, process groups).
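As a generic illustration of that last step (in practice we used the Rocks tooling), an equivalent setup with the libcgroup tools would look roughly like the following; the CPU range and memory node are illustrative:

## create subgroup /users/netID in the cpuset, cpu, cpuacct and memory subsystems
cgcreate -g cpuset,cpu,cpuacct,memory:/users/ak5879
## restrict which CPUs the processes can be scheduled on and which memory nodes they can allocate from
cgset -r cpuset.cpus=0-7 users/ak5879
cgset -r cpuset.mems=0 users/ak5879
## move the user's login shell (here the current shell, $$, as an example) into the group
cgclassify -g cpuset,cpu,cpuacct,memory:/users/ak5879 $$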

Add-user command:

/usr/sbin/useradd -g 100 -s /usr/local/bin/rbash -u 2670296 -c "Anastasiya Kolchynska" -m ak5879

This command assigns the user to a group and creates the home directory if it does not exist.

 

Pig

Pig is one component of the Hadoop ecosystem. It is used to analyze large data sets by representing them as data flows, and it is generally used with Hadoop. Programmers write Pig scripts in the Pig Latin language; internally, Pig converts these scripts into a series of map/reduce jobs.

Pig has map, bag, and tuple complex types.

WordCount.pig example

File = LOAD 'hdfs://babar.es.its.nyu.edu:8020/user/bl1810/tmp/inputfile';
HackathonMatch = FOREACH File generate 'hackathon' as key, (($0 matches '.*hackathon.*') ? 1 : 0) as value;
DecMatch = FOREACH File generate 'Dec' as key, (($0 matches '.*Dec.*') ? 1 : 0) as value;
ChicagoMatch = FOREACH File generate 'Chicago' as key, (($0 matches '.*Chicago.*') ? 1 : 0) as value;
JavaMatch = FOREACH File generate 'Java' as key, (($0 matches '.*Java.*') ? 1 : 0) as value;
hackathonWords = FOREACH (GROUP HackathonMatch BY key) generate group AS key, SUM(HackathonMatch.value) as value; 
DecWords = FOREACH (GROUP DecMatch BY key) generate group AS key, SUM(DecMatch.value) as value ; 
ChicagoWords = FOREACH (GROUP ChicagoMatch BY key) generate group AS key, SUM(ChicagoMatch.value) as value; 
JavaWords = FOREACH (GROUP JavaMatch BY key) generate group AS key, SUM(JavaMatch.value) as value; 
TotalMatchWords = UNION hackathonWords, DecWords, ChicagoWords, JavaWords;
GroupTotal = Group TotalMatchWords by 1;
FinalWordslist = FOREACH GroupTotal generate FLATTEN(TotalMatchWords); 
store FinalWordslist into 'hdfs://babar.es.its.nyu.edu:8020/user/bl1810/tmp/output3';
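To run the script on the cluster, a minimal sketch (assuming the pig client is installed and configured to talk to the Hadoop cluster):

pig -x mapreduce WordCount.pig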

 

Hive 

Schema-on-read; does not support row-level updates and deletes.

select * from w1 where year > 1949; (Runs map/reduce jobs underneath)
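As an illustration of schema-on-read, a rough sketch of how a table like w1 might be defined over files already sitting in HDFS; only the table name w1 comes from the query above, the column names and location are illustrative:

## the schema is only applied when a query reads the files; nothing is parsed or validated at definition time
hive -e "CREATE EXTERNAL TABLE w1 (station STRING, year INT, temperature INT)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
         LOCATION '/user/bl1810/tmp/weather';"
## runs as map/reduce jobs underneath
hive -e "SELECT * FROM w1 WHERE year > 1949;"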

 

Oozie

Oozie is a Java web application used to schedule Hadoop jobs. It combines Hadoop jobs sequentially into one logical unit of work, and it detects the completion of tasks through callbacks and polling.
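A minimal sketch of submitting and checking a workflow from the command line (assumes a job.properties pointing at a workflow.xml already uploaded to HDFS; the Oozie server URL is illustrative):

## submit and start the workflow
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
## poll the status of a running workflow job
oozie job -oozie http://localhost:11000/oozie -info <workflow-job-id>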

 

MongoDB

MongoDB is a document-oriented database. It stores data as whole documents, similar to JSON documents.

Joins are not needed, which saves time compared to SQL.

## Run mongod on the default port.

>mongod 

## Go to the binary folder and start the mongo shell.

>mongo

## Import data into a collection named bank_data.

>mongoimport --jsonArray --db test --collection bank_data < Your_json_file_path

## Count how many records are in bank_data.

>db.bank_data.count()

## Picks up the first record

>db.bank_data.findOne()

## Retrieve all the documents; find() returns a cursor over them.

> db.bank_data.find()

> db.bank_data.find()[6]

> db.bank_data.find({ last_name : "SMITH" }).count();

> db.bank_data.find({ last_name : "SMITH" })[50]

## Use a projection to return only the fields we want.

> db.bank_data.find({ last_name : "SMITH" }, { first_name : 1, last_name : 1})

 

HBase

HBase is a distributed, column-oriented database built on top of Hadoop. It internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.

HBase stores de-normalized data, while an RDBMS stores normalized data.

Denormalization vs. Normalization

Denormalization is generally used to either:

  • Avoid a certain number of queries
  • Remove some joins

The basic idea of denormalization is that you add redundant data, or group some of it, so that you can get at that data more easily and at a smaller cost, which is better for performance.

A quick example?

  • Consider a "Posts" and a "Comments" table, for a blog
    • For each post, you'll have several rows in the "Comments" table
    • This means that, to display a list of posts with the associated number of comments, you'll have to:
      • Do one query to list the posts
      • Do one query per post to count how many comments it has (yes, those can be merged into only one, to get the numbers for all posts at once)
      • Which means several queries.
  • Now, if you add a "number of comments" field to the Posts table:
    • You only need one query to list the posts
    • There is no need to query the Comments table: the number of comments is already denormalized into the Posts table.
    • And only one query that returns one more field is better than several queries.

Now, there are some costs, yes:

    • First, this costs some space, both on disk and in memory, as you have some redundant information:
      • The number of comments is stored in the Posts table
      • And you can also find that number by counting rows in the Comments table
    • Second, each time someone adds or removes a comment, you have to:
      • Save or delete the comment, of course
      • But also update the corresponding number in the Posts table.
      • Still, if your blog has a lot more people reading than writing comments, this is probably not so bad.
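Tying this back to HBase: a rough sketch of how the denormalized blog example could be modeled as a single, randomly accessible HBase row per post (the table, column family, and row key names are illustrative, not from any real schema):

## one row per post; the comment count lives in the same row as the post itself
echo "create 'posts', 'content', 'stats'
put 'posts', 'post42', 'content:title', 'My first post'
put 'posts', 'post42', 'stats:comment_count', '2'
get 'posts', 'post42'" | hbase shell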

 

After I was done with this small project, I then helped one of the HPC consultants with Hadoop benchmarking, which measures how fast our cluster is in terms of distributed I/O by running TestDFSIO. It stresses the cluster to see whether it can handle high I/O. It also measures how fast map/reduce is by running TeraSort. The benchmarking also includes some use cases, such as PageRank.

Globus is a connected set of services for data management. It can be used for moving data between your local machine and the cluster. It is based on GridFTP.

Hadoop testing

TestDFSIO (TestDFSIO.java is saved at the web link): tests how fast your cluster is in terms of I/O. It is a read and write test. Because the read test does not generate its own files, we need to perform the write operation first. The most important outputs are throughput (mb/sec) and average I/O rate (mb/sec).
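A rough sketch of the corresponding commands; the jar name and path mirror the CDH layout used for TeraSort below and are an assumption, and 1000 files of 1000 MB each matches the sample output further down:

## write test first (the read test reuses the files written here)
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-test.jar TestDFSIO -write -nrFiles 1000 -fileSize 1000
## read test
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-test.jar TestDFSIO -read -nrFiles 1000 -fileSize 1000
## clean up the benchmark files afterwards
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-test.jar TestDFSIO -clean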

## TeraGen: generate the TeraSort input (each row is 100 bytes; 1000 rows for a quick smoke test, 100,000,000 rows for a real run)
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen 1000 /user/cloudera/terasort-input
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teragen 100000000 /user/cloudera/terasort-input
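For completeness, a sketch of the two stages that follow TeraGen (same jar; the output paths are illustrative):

sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar terasort /user/cloudera/terasort-input /user/cloudera/terasort-output
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-examples.jar teravalidate /user/cloudera/terasort-output /user/cloudera/terasort-validate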

Sample TestDFSIO output:

----- TestDFSIO ----- : write
           Date & time: Fri Apr 08 2011
       Number of files: 1000
Total MBytes processed: 1000000
     Throughput mb/sec: 4.989
Average IO rate mb/sec: 5.185
 IO rate std deviation: 0.960
    Test exec time sec: 1113.53

----- TestDFSIO ----- : read
           Date & time: Fri Apr 08 2011
       Number of files: 1000
Total MBytes processed: 1000000
     Throughput mb/sec: 11.349
Average IO rate mb/sec: 22.341
 IO rate std deviation: 119.231
    Test exec time sec: 544.842

Throughput is calculated as follows: the total number of megabytes handled by all map tasks divided by the total time they used. For example, the write run above processed 1,000,000 MB at 4.989 mb/sec, which implies roughly 200,000 seconds of cumulative map-task time.

Average I/O rate: the sum of the rates of the individual map tasks divided by the number of map tasks.

TeraSort: basically, the goal of TeraSort is to sort 1TB of data (or any other amount of data you want) as fast as possible. It is a benchmark that combines testing the HDFS and MapReduce layers of a Hadoop cluster. It has three stages: TeraGen, TeraSort, and TeraValidate.

TeraGen generates output data that is byte for byte equivalent to the C version including the newlines and specific keys. It divides the desired number of rows by the desired number of tasks and assigns ranges of rows to each map. The map jumps the random number generator to the correct value for the first row and generates the following rows.

TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N-1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i-1] <= key < sample[i] are sent to reduce i. This guarantees that the outputs of reduce i are all less than the outputs of reduce i+1. To speed up the partitioning, the partitioner builds a two-level trie that quickly indexes into the list of sample keys based on the first two bytes of the key. TeraSort generates the sample keys by sampling the input before the job is submitted and writing the list of keys into HDFS. The input and output formats, which are used by all three applications, read and write the text files in the right format. The output of the reduce has replication set to 1, instead of the default 3, because the contest does not require the output data to be replicated onto multiple nodes.

TeraValidate ensures that the output is globally sorted. It creates one map per file in the output directory, and each map ensures that each key is greater than or equal to the previous one. The map also generates records with the first and last keys of the file, and the reduce ensures that the first key of file i is greater than the last key of file i-1. Any problems are reported as output of the reduce, with the keys that are out of order.

MRBench (see src/test/org/apache/hadoop/mapred/MRBench.java) loops a small job a number of times. As such, it is a very complementary benchmark to the "large-scale" TeraSort benchmark suite, because MRBench checks whether small job runs are responsive and run efficiently on your cluster. It puts its focus on the MapReduce layer, as its impact on the HDFS layer is very limited.
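A rough sketch of invoking it (the jar path is the same assumption as for TestDFSIO above; the loop count is illustrative):

## run the same small job 50 times and report the average runtimes
sudo -u hdfs hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/hadoop-test.jar mrbench -numRuns 50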

Reposted from: https://www.cnblogs.com/touchdown/p/5174292.html
