The Hama Graph Computing Model

Epiphany_7725

Last modified: 2024-03-13 11:05:56

Tags: big data spatial computing, python, distributed systems, hadoop

First published: 2023-12-14 00:15:00

Article link: https://blog.csdn.net/m0_54428859/article/details/134874572

Getting Started

This guide assumes that HDFS and Apache ZooKeeper are already installed and running.

System environment: JDK 8, CentOS 7

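Hama Single-Node Installation and Configuration

Download the Hama distribution from http://hama.apache.org/downloads.html and pick a suitable version. I chose the latest one here (the commands below use 0.7.0).

After downloading, extract the archive into /usr/local with sudo tar -zxf ~/下载/hama-dist-0.7.0.tar.gz -C /usr/local (~/下载 is the Downloads folder on a Chinese-locale system). Then, from inside /usr/local, rename the extracted folder with sudo mv ./hama-0.7.0/ ./hama so that Hama lives in /usr/local/hama.

Go into the conf folder under hama and edit hama-env.sh to set the Java home path, that is, add: export JAVA_HOME=/export/server/jdk

Edit hama-site.xml, which is Hama's core configuration file. Its contents are as follows: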

  <configuration>
    <property>
      <name>bsp.master.address</name>
      <value>node1</value>
      <description>The address of the bsp master server. Either the
      literal string "local" or a host:port for distributed mode
      </description>
    </property>
 
    <property>
      <name>fs.default.name</name>
      <value>hdfs://node1:9000</value>
      <description>
        The name of the default file system. Either the literal string
        "local" or a host:port for HDFS.
      </description>
    </property>
 
    <property>
      <name>hama.zookeeper.quorum</name>
      <value>localhost</value>
      <description>Comma separated list of servers in the ZooKeeper Quorum.
      For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
      By default this is set to localhost for local and pseudo-distributed modes
      of operation. For a fully-distributed setup, this should be set to a full
      list of ZooKeeper quorum servers. If HAMA_MANAGES_ZK is set in hama-env.sh
      this is the list of servers which we will start/stop zookeeper on.
      </description>
    </property>
  </configuration>

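Here, bsp.master.address is the address of the BSPMaster (a host, or host:port). Because this is single-node mode, it is set to this machine's own hostname, node1. Pay particular attention to fs.default.name: it is the address and port of the Hadoop NameNode, because Hama relies on Hadoop's HDFS distributed file system. The remaining property, hama.zookeeper.quorum, configures ZooKeeper. For a single-node setup it is simply set to the local machine (localhost), and the ZooKeeper port is normally left at its fixed default.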
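As a quick sanity check on the settings above, here is a minimal sketch that lists the name/value pairs of a Hadoop-style site file using only the JDK. The class SiteXmlCheck and the method readProperties are hypothetical helpers introduced for illustration. They are not part of Hama, and Hama itself loads this file through its own configuration classes.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Hama): reads <property> entries from a site XML.
class SiteXmlCheck {

    // Returns the name -> value pairs in document order.
    static Map<String, String> readProperties(InputStream xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(xml);
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            // Covers malformed XML and properties missing a <name> or <value>.
            throw new IllegalArgumentException("not a readable site XML", e);
        }
    }

    public static void main(String[] args) {
        // Inline copy of the three properties configured above (descriptions omitted).
        String xml = "<configuration>"
                + "<property><name>bsp.master.address</name><value>node1</value></property>"
                + "<property><name>fs.default.name</name><value>hdfs://node1:9000</value></property>"
                + "<property><name>hama.zookeeper.quorum</name><value>localhost</value></property>"
                + "</configuration>";
        Map<String, String> props = readProperties(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        props.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

To check the real file, pass a java.io.FileInputStream opened on conf/hama-site.xml instead of the inline string.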

Hama Single-Node Example: PageRank

PageRank
(1) Generate randomgraph by running the following command:
./bin/hama jar hama-examples-0.7.0.jar gen fastgen -v 100 -e 10 -o randomgraph -t 2

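Note: because my configured file system is HDFS, the generated files are written to HDFS, not to the local disk! I fell into this trap and spent a long time searching locally without finding them.

This is one of Hama's example commands; it generates a random graph.

./bin/hama: the path to the Hama launcher script.

jar hama-examples-0.7.0.jar: the JAR file containing the Java program to run.

gen fastgen: the Hama example job to run; it generates a random graph with the fastgen generator.

-v 100: the number of vertices in the generated graph.

-e 10: the number of edges per vertex in the generated graph.

-o randomgraph: the output directory for the generated graph.

-t 2: the number of tasks used to generate the graph.

Running this command generates a random graph with 100 vertices and 10 edges per vertex, written to a directory named randomgraph, using 2 tasks for the generation.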

randomgraph stores the 100 generated vertices. Each line describes the other vertices that one vertex links to, and each vertex has 10 edges.
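To make the shape of this data concrete, here is a minimal sketch of a generator that gives every vertex a fixed number of randomly chosen targets. It is an illustrative stand-in, not Hama's fastgen implementation. The class name RandomGraphSketch, the seed, and the printed line layout (vertex, a tab, then space-separated targets) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative stand-in for a random graph generator; NOT Hama's fastgen code.
class RandomGraphSketch {

    // Returns one adjacency list per vertex: row v holds the targets of vertex v.
    static List<List<Integer>> generate(int vertices, int edgesPerVertex, long seed) {
        Random random = new Random(seed);
        List<List<Integer>> graph = new ArrayList<>();
        for (int v = 0; v < vertices; v++) {
            List<Integer> targets = new ArrayList<>();
            for (int e = 0; e < edgesPerVertex; e++) {
                // Duplicate targets and self-links are allowed in this sketch.
                targets.add(random.nextInt(vertices));
            }
            graph.add(targets);
        }
        return graph;
    }

    public static void main(String[] args) {
        // Same sizes as the command above: 100 vertices, 10 edges per vertex.
        List<List<Integer>> graph = generate(100, 10, 42L);
        // Print the first three vertices, one per line (assumed layout).
        for (int v = 0; v < 3; v++) {
            StringBuilder line = new StringBuilder().append(v).append('\t');
            for (int target : graph.get(v)) {
                line.append(target).append(' ');
            }
            System.out.println(line.toString().trim());
        }
    }
}
```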

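Running PageRank on this graph produces a ranking of the vertices.

The results show that every node's PageRank score is roughly the same. Every vertex has the same number of edges (10), linked at random, so no vertex is strongly favored.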
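The following minimal sketch runs PageRank by plain power iteration on a graph of the same shape, so the spread of the scores can be inspected directly. This is not Hama's PageRank example. The class name PageRankSketch, the damping factor 0.85, the 30 iterations, and the seed are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Single-machine PageRank sketch; NOT the Hama example implementation.
class PageRankSketch {

    // Builds out[v] = targets of vertex v, with the same out-degree for every vertex.
    static int[][] randomGraph(int vertices, int edgesPerVertex, long seed) {
        Random random = new Random(seed);
        int[][] out = new int[vertices][edgesPerVertex];
        for (int v = 0; v < vertices; v++) {
            for (int e = 0; e < edgesPerVertex; e++) {
                out[v][e] = random.nextInt(vertices);
            }
        }
        return out;
    }

    // Power iteration; assumes every vertex has at least one outgoing edge.
    static double[] pageRank(int[][] out, double damping, int iterations) {
        int n = out.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            // Every vertex receives the same "random jump" share.
            Arrays.fill(next, (1.0 - damping) / n);
            for (int v = 0; v < n; v++) {
                // A vertex splits its damped score evenly across its out-edges.
                double share = damping * rank[v] / out[v].length;
                for (int target : out[v]) {
                    next[target] += share;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        double[] rank = pageRank(randomGraph(100, 10, 42L), 0.85, 30);
        double min = Arrays.stream(rank).min().getAsDouble();
        double max = Arrays.stream(rank).max().getAsDouble();
        double sum = Arrays.stream(rank).sum();
        System.out.printf("min=%.5f max=%.5f sum=%.5f%n", min, max, sum);
    }
}
```

In this sketch a vertex's score depends on how many vertices happen to link to it, so scores are not identical; with every vertex handing out the same number of random links, none can dominate.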

Computing Pi

Estimate the value of pi with the Monte Carlo method.

The code is as follows:

package org.example; // matches the class name used in the run command at the end of this article

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.HamaConfiguration;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPJob;
import org.apache.hama.bsp.BSPJobClient;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.ClusterStatus;
import org.apache.hama.bsp.FileOutputFormat;
import org.apache.hama.bsp.NullInputFormat;
import org.apache.hama.bsp.TextOutputFormat;
import org.apache.hama.bsp.sync.SyncException;

public class PiEstimator
{
    // Temporary output directory on the configured file system, unique per run
    private static Path TMP_OUTPUT = new Path("/tmp/pi-"
        + System.currentTimeMillis());

    public static class MyEstimator
        extends
        BSP<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable>
    {
        public static final Log LOG = LogFactory.getLog(MyEstimator.class);
        private String masterTask;
        private static final int iterations = 10000;

        @Override
        public void bsp(
            BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
            throws IOException, SyncException, InterruptedException
        {
            // Count the random points in [-1, 1] x [-1, 1] that fall inside the unit circle
            int in = 0;
            for (int i = 0; i < iterations; i++)
            {
                double x = 2.0 * Math.random() - 1.0, y = 2.0 * Math.random() - 1.0;
                if ((Math.sqrt(x * x + y * y) < 1.0))
                {
                    in++;
                }
            }

            // This task's local estimate of pi
            double data = 4.0 * in / iterations;

            // Send the local estimate to the master task, then wait at the barrier
            peer.send(masterTask, new DoubleWritable(data));
            peer.sync();
        }

        @Override
        public void setup(
            BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
            throws IOException
        {
            // Choose one as a master
            this.masterTask = peer.getPeerName(peer.getNumPeers() / 2);
        }

        @Override
        public void cleanup(
            BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer)
            throws IOException
        {
            // Only the master task averages the received estimates and writes the result
            if (peer.getPeerName().equals(masterTask))
            {
                double pi = 0.0;
                int numPeers = peer.getNumCurrentMessages();
                DoubleWritable received;
                while ((received = peer.getCurrentMessage()) != null)
                {
                    pi += received.get();
                }

                pi = pi / numPeers;
                peer.write(new Text("Estimated value of PI is"),
                    new DoubleWritable(pi));
            }
        }
    }

    // Prints the first non-empty output file, then deletes the temporary directory
    static void printOutput(HamaConfiguration conf) throws IOException
    {
        FileSystem fs = FileSystem.get(conf);
        FileStatus[] files = fs.listStatus(TMP_OUTPUT);
        for (int i = 0; i < files.length; i++)
        {
            if (files[i].getLen() > 0)
            {
                FSDataInputStream in = fs.open(files[i].getPath());
                IOUtils.copyBytes(in, System.out, conf, false);
                in.close();
                break;
            }
        }

        fs.delete(TMP_OUTPUT, true);
    }

    public static void main(String[] args) throws InterruptedException,
        IOException, ClassNotFoundException
    {
        // BSP job configuration
        HamaConfiguration conf = new HamaConfiguration();

        BSPJob bsp = new BSPJob(conf, PiEstimator.class);
        // Set the job name
        bsp.setJobName("Pi Estimation Example");
        bsp.setBspClass(MyEstimator.class);
        bsp.setInputFormat(NullInputFormat.class);
        bsp.setOutputKeyClass(Text.class);
        bsp.setOutputValueClass(DoubleWritable.class);
        bsp.setOutputFormat(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(bsp, TMP_OUTPUT);

        BSPJobClient jobClient = new BSPJobClient(conf);
        ClusterStatus cluster = jobClient.getClusterStatus(true);

        if (args.length > 0)
        {
            // Number of tasks taken from the first command-line argument
            bsp.setNumBspTask(Integer.parseInt(args[0]));
        } else
        {
            // Set to maximum
            bsp.setNumBspTask(cluster.getMaxTasks());
        }

        long startTime = System.currentTimeMillis();
        if (bsp.waitForCompletion(true))
        {
            printOutput(conf);
            System.out.println("Job Finished in "
                + (System.currentTimeMillis() - startTime) / 1000.0
                + " seconds");
        }
    }
}

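This code is a Java program built on the Apache Hama framework that estimates π with the Monte Carlo method. Apache Hama is a framework for big-data analytics based on the Bulk Synchronous Parallel (BSP) computing model.

The main class is PiEstimator. It contains a nested static class, MyEstimator, which extends the BSP class provided by Apache Hama. BSP is a generic class that takes five type parameters: input key, input value, output key, output value, and message type.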

public static class MyEstimator extends BSP<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable>

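MyEstimator overrides three methods of the BSP class: setup, bsp, and cleanup.

setup is called once at the start of the computation to prepare it. In this example, it picks one of the peers as the master task.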

public void setup(BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer) throws IOException

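bsp contains the main computation. It generates random points in the square [-1, 1] × [-1, 1], counts how many fall inside the unit circle, and sends the fraction of points inside the circle (multiplied by 4) to the master task.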

public void bsp(BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer) throws IOException, SyncException, InterruptedException
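The estimate computed inside bsp can be tried on its own, without Hama. The sketch below is a minimal stand-alone version under a few assumptions: the class name MonteCarloPiSketch is hypothetical, a seeded java.util.Random replaces Math.random() so runs are repeatable, and the square root is dropped because comparing the squared distance with 1 gives the same answer.

```java
import java.util.Random;

// Stand-alone version of the per-task estimate; no Hama classes involved.
class MonteCarloPiSketch {

    // The circle covers pi/4 of the square [-1, 1] x [-1, 1],
    // so 4 * (points inside / total points) approximates pi.
    static double estimate(int iterations, Random random) {
        int inside = 0;
        for (int i = 0; i < iterations; i++) {
            double x = 2.0 * random.nextDouble() - 1.0;
            double y = 2.0 * random.nextDouble() - 1.0;
            if (x * x + y * y < 1.0) {
                inside++;
            }
        }
        return 4.0 * inside / iterations;
    }

    public static void main(String[] args) {
        // 10000 iterations, as in MyEstimator; the seed is arbitrary.
        System.out.println(estimate(10000, new Random(42L)));
    }
}
```

With only 10000 samples a single estimate is fairly noisy, which is one reason the job averages the estimates from several tasks.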

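cleanup is called once at the end of the computation. If the current task is the master task, it collects the results from all peers, averages them, and writes the estimate of π to the output.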

public void cleanup(BSPPeer<NullWritable, NullWritable, Text, DoubleWritable, DoubleWritable> peer) throws IOException
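The compute, send, sync, then aggregate pattern can be imitated on one machine with threads. The sketch below is only an analogy under stated assumptions: each thread plays one peer, a shared queue stands in for the master's message inbox, and a CyclicBarrier stands in for peer.sync(). The class name BspAverageSketch and the per-peer seeds are hypothetical, and none of this uses Hama's messaging.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.Random;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

// Thread-based imitation of the BSP job; an analogy, not Hama code.
class BspAverageSketch {

    static double run(int numPeers, int iterations) {
        Queue<Double> masterInbox = new ConcurrentLinkedQueue<>();
        CyclicBarrier barrier = new CyclicBarrier(numPeers);
        int masterIndex = numPeers / 2; // same choice as setup(): the middle peer
        double[] result = new double[1];
        List<Thread> peers = new ArrayList<>();

        for (int p = 0; p < numPeers; p++) {
            final int peerIndex = p;
            Thread peer = new Thread(() -> {
                // "bsp": local computation, then send the estimate to the master.
                Random random = new Random(1000L + peerIndex);
                int inside = 0;
                for (int i = 0; i < iterations; i++) {
                    double x = 2.0 * random.nextDouble() - 1.0;
                    double y = 2.0 * random.nextDouble() - 1.0;
                    if (x * x + y * y < 1.0) {
                        inside++;
                    }
                }
                masterInbox.add(4.0 * inside / iterations);

                // "sync": no peer continues until every peer has sent its message.
                try {
                    barrier.await();
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }

                // "cleanup": only the master averages what it received.
                if (peerIndex == masterIndex) {
                    double sum = 0.0;
                    int count = 0;
                    for (double value : masterInbox) {
                        sum += value;
                        count++;
                    }
                    result[0] = sum / count;
                }
            });
            peers.add(peer);
            peer.start();
        }

        for (Thread peer : peers) {
            try {
                peer.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println("Estimated value of PI is " + run(4, 10000));
    }
}
```

The barrier is what guarantees that the master sees all estimates before it averages them; without it, the master could read its inbox while other peers are still computing.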

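The main method of PiEstimator configures the BSP job: the input and output formats, the output key and value classes, and the number of tasks (the first command-line argument if given, otherwise the cluster maximum). It then starts the job and waits for it to finish. When the job completes, it prints the output (the estimate of π) and the elapsed time.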

public static void main(String[] args) throws InterruptedException, IOException, ClassNotFoundException

Compiling the code

The required Maven dependencies are:

    <dependencies>
        <!-- commons-logging -->
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
            <version>1.2</version>
        </dependency>


        <!-- Hadoop HDFS -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.7</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hama</groupId>
            <artifactId>hama-core</artifactId>
            <version>0.7.0</version>
        </dependency>

    </dependencies>

Finally, run the following command to compute pi: ./bin/hama jar PiEstimator.jar org.example.PiEstimator