Overview of Spark, YARN, and HDFS

Spark is a relatively recent addition to the Hadoop ecosystem: an analytics engine and framework that can run queries up to 100 times faster than traditional Hadoop MapReduce jobs. In addition to the performance boost, developers can write Spark jobs in Scala, Python, or Java. Spark can load data directly from disk, from memory, and from other storage systems such as Amazon S3, the Hadoop Distributed File System (HDFS), HBase, and Cassandra.

Submitting Spark Jobs

Because Spark workloads are iterative, scripts are often developed interactively, either as standalone scripts or in a notebook. This results in compact scripts of tight, functional code.

A Spark script can be submitted to a Spark cluster using various methods:

  1. Running the script directly on the head node
  2. Using the acluster submit command from the client
  3. Interactively in an IPython shell or Jupyter Notebook on the cluster
  4. Using the spark-submit script from the client

To run a script on the head node, simply execute python example.py on the cluster. Developing locally on test data and pushing the same analytics scripts to the cluster is a key feature of Anaconda Cluster. With a cluster created and Spark scripts developed, you can use the acluster submit command to automatically push the script to the head node and run it on the Spark cluster.

The examples below use the acluster submit command to run the Spark examples, but any of the methods above can be used to submit a job to the Spark cluster.
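For reference, here is a minimal sketch of the kind of script you might submit. It is a hypothetical word count, not an example from the Anaconda Cluster docs: the input path and application name are placeholders, and the pyspark import is guarded only so the tokenize helper can be exercised off-cluster.

```python
# spark_example.py -- a hypothetical word-count script; the input path
# and app name are placeholders.

def tokenize(line):
    """Lowercase a line and split it into words."""
    return line.lower().split()

# Guard the import so the helper above can be tested without a cluster;
# on the cluster, pyspark is importable and the job below runs.
try:
    from pyspark import SparkConf, SparkContext
except ImportError:
    SparkContext = None

if __name__ == '__main__' and SparkContext is not None:
    conf = SparkConf().setAppName('wordcount')
    sc = SparkContext(conf=conf)
    counts = (sc.textFile('/tmp/input.txt')        # placeholder input path
                .flatMap(tokenize)
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    print(counts.take(10))
    sc.stop()
```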

Running Spark in Different Modes

Anaconda Cluster can install Spark in standalone mode via the spark-standalone plugin or with the YARN resource manager via the spark-yarn plugin. YARN can be useful when cluster resource management is a concern, e.g., when resources must be shared by many tasks, users, and applications.

Spark scripts can be configured to run in standalone mode:

from pyspark import SparkConf

conf = SparkConf()
conf.setMaster('spark://<HOSTNAME_SPARK_MASTER>:7077')

or with YARN by setting yarn-client as the master within the script:

from pyspark import SparkConf

conf = SparkConf()
conf.setMaster('yarn-client')

You can also submit jobs with YARN by setting --master yarn-client as an option to the spark-submit command:

spark-submit --master yarn-client spark_example.py

Working with Data in HDFS

Moving data in and around HDFS can be difficult. If you need to move data from your local machine to HDFS, from Amazon S3 to HDFS, from Amazon S3 to Redshift, from HDFS to Hive, and so on, we recommend using odo, which is part of the Blaze ecosystem. Odo efficiently migrates data from the source to the target through a network of conversions.

Use odo to upload a file:

# Load local data into HDFS
from odo import odo

auth = {'user': 'hdfs', 'port': '14000'}
odo('./iris.csv', 'hdfs://{}:/tmp/iris/iris.csv'.format(HEAD_NODE_IP),
    **auth)

Use odo to upload data from a URL:

# Load data from a URL into HDFS
from odo import odo

auth = {'user': 'hdfs', 'port': '14000'}
url = 'https://raw.githubusercontent.com/ContinuumIO/blaze/master/blaze/examples/data/iris.csv'
odo(url, 'hdfs://{}:/tmp/iris/iris.csv'.format(HEAD_NODE_IP),
    **auth)
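Once the file is in HDFS, a Spark job can read it back using the same URI. The following is a minimal sketch: the parse_iris helper is our own illustration, <HEAD_NODE_IP> remains a placeholder, and the pyspark import is guarded so the helper can be tested without a cluster.

```python
def parse_iris(line):
    """Split one CSV row of the iris data into its fields."""
    return line.strip().split(',')

try:
    from pyspark import SparkContext
except ImportError:   # no cluster available; the Spark job below is skipped
    SparkContext = None

if __name__ == '__main__' and SparkContext is not None:
    sc = SparkContext(appName='read-iris')
    rows = (sc.textFile('hdfs://<HEAD_NODE_IP>/tmp/iris/iris.csv')
              .map(parse_iris))
    print(rows.take(5))
    sc.stop()
```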

If you are unfamiliar with Spark or SQL, we recommend using Blaze to express selections, aggregations, group-bys, etc. in a dataframe-like style. Blaze provides Python users with a familiar interface for querying data that lives in other data storage systems.
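For example, a selection and a group-by in Blaze look like the following sketch. The tiny in-memory table here stands in for data that would normally live in HDFS or Hive, and the import is guarded in case Blaze is not installed locally.

```python
# A small in-memory table standing in for data in HDFS/Hive; Blaze can
# also wrap CSV files, SQL databases, and HDFS URIs.
measurements = [('setosa', 5.1), ('setosa', 4.9), ('virginica', 6.3)]

try:
    from blaze import data, by
except ImportError:   # Blaze not installed; the sketch below is skipped
    data = None

if data is not None:
    t = data(measurements, fields=['species', 'sepal_length'])
    long_sepals = t[t.sepal_length > 5.0]             # selection
    means = by(t.species, avg=t.sepal_length.mean())  # group-by + aggregation
    print(means)
```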
