Image Classification with Hadoop Streaming

Note: this article was originally posted on a previous version of the 500px engineering blog. A lot has changed since it was originally posted on Feb 1, 2015. In future posts, we will cover how our image classification solution has evolved and what other interesting machine learning projects we have.

TLDR: This post provides an overview of how to perform large scale image classification using Hadoop Streaming. First, we look at each component individually and identify things that need to be taken into account for the final integration. Then we go over how to ‘glue’ all the pieces together to produce the desired results.

Introduction

Until recently, image search and discovery at 500px was based mainly on meta information provided by users, which included tags, title, and description. The quality of such search and discovery directly depends on how well users describe their photos. Since this is such a tedious process, most photos are simply left untagged or have inconsistent keywords. A recent update from Facebook illustrates the scale of the problem:

~2 Billion photos are shared on Facebook per day!

Luckily, due to recent advancements in Deep Neural Networks (NN) and their efficient implementations using GPUs, it is now possible to perform automatic image classification (tagging) with near-human performance. Getting close to human capabilities, or even beating them, does not mean we’ve achieved real Artificial Intelligence (AI). Deep NN algorithms do not ‘understand’ the content of images nor the concepts that describe them. Instead, they learn invariant representations of patterns (features) found in images and create a mapping function that maps the pixels of an image to a combination of the patterns found in it. This mapping function is essentially a specially crafted mathematical function that takes a set of numbers (pixels) and outputs another set of numbers (features).

Despite not being real AI, deep NN algorithms applied to image classification tasks have improved significantly over the last few years. This is best illustrated by the results of the ImageNet competition.

Improvements of image classification algorithms on ImageNet

Additional advantages of using NN algorithms include:

  • calculating keyword weights in order to better control search results
  • calculating similarity between images
  • reverse image search
  • extracting aesthetic qualities of images

There are many open source projects like Theano, Torch, Caffe, DL4J, and Toronto DeepNet that provide tools for implementing NNs. For our approach we chose Caffe because it was one of the easiest to start with and it satisfied our requirements:

  • availability of pre-trained models, which is important considering that training a model takes weeks and requires a large training set like ImageNet
  • a Python wrapper, so we can use many existing machine learning libraries like scikit-learn in the same environment
  • an active community, decent documentation, and examples

So we have the library and models for automatic image recognition. Let’s look at the steps we need to perform to classify one image (sketched in code after the list):

  1. download an image
  2. resize and crop to fit the model’s input dimensions
  3. convert RGB pixels into an array of real numbers
  4. call Caffe’s APIs
  5. extract predicted labels and features
  6. save results
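
As a rough illustration, here is a minimal sketch of these six steps for a single image, using Caffe’s Python wrapper as it looked around 2015 (Python 2). The model files, label list, and photo URL below are placeholder assumptions, not our production code:

# Sketch of steps 1-6 for one image; file names and the URL are placeholders.
import json
import urllib

import numpy as np
import caffe

# Load a pre-trained model, e.g. the BVLC reference CaffeNet.
net = caffe.Classifier(
    'deploy.prototxt', 'bvlc_reference_caffenet.caffemodel',
    mean=np.load('ilsvrc_2012_mean.npy').mean(1).mean(1),
    channel_swap=(2, 1, 0),  # the model expects BGR channel order
    raw_scale=255,           # and pixel values in [0, 255]
    image_dims=(256, 256))   # resize target before cropping

labels = [line.strip() for line in open('synset_words.txt')]

# 1. download an image
urllib.urlretrieve('https://example.com/photo.jpg', 'photo.jpg')

# 2-3. load it as an array of real numbers; predict() below also handles
# resizing and cropping to the model's input dimensions
image = caffe.io.load_image('photo.jpg')

# 4. call Caffe's API to run the forward pass
probs = net.predict([image])[0]

# 5. extract the top predicted labels (features could similarly be read
# from an intermediate layer via net.blobs)
top5 = [(labels[i], float(probs[i])) for i in probs.argsort()[::-1][:5]]

# 6. save the results
with open('photo_classification.json', 'w') as f:
    json.dump(top5, f)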

When operating with hundreds of millions of images, the download step alone may take weeks to complete on a single machine. A faster approach is to perform these steps in parallel. We achieved this by parallelizing execution both within one machine, using the GPU and multiprocessing, and across many machines, relying on Hadoop.

Hadoop Streaming for Image Classification

Hadoop is a framework for distributed processing and storage. Normally, it requires writing Java code to run your jobs. Conveniently, it also offers Hadoop Streaming, which allows you to use any scripting language to create custom jobs. A Hadoop Streaming job runs your script in a child process: stdin and stdout are used to pass data to and from your code, and stderr is used for logging and debugging. Hadoop Streaming assumes that your input and output data are textual, with one item per line. This means you cannot simply point Hadoop at a folder containing images; it would try to read each image as a text file and eventually fail.
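
To make that contract concrete, a Hadoop Streaming mapper is just a script along these lines (a generic, hypothetical sketch):

#!/usr/bin/env python
# Hadoop Streaming contract: records arrive one per line on stdin,
# results leave via stdout, diagnostics go to stderr.
import sys

for line in sys.stdin:
    record = line.strip()
    sys.stderr.write('processing: %s\n' % record)  # visible in task logs
    # ... do the actual work here ...
    sys.stdout.write(record + '\n')                # becomes job output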

The big challenge with Hadoop is installing and configuring the cluster; it could take a full-time devops position just to do that. Our shortcut was to use Elastic MapReduce (EMR), Amazon’s web service that provides APIs to configure a Hadoop cluster and run jobs. Another advantage of EMR is seamless integration with Amazon S3: when running an EMR job, you specify where on S3 to read data from and save results to. The only limitation is that the input parameter needs to point to a file or a folder of files on S3; it will not traverse subfolders.

So the input to EMR should be either a text file or a folder with text files on S3. Since we are dealing with a very large collection of photos, we generated a list of files, where each file contained around 1000 records (lines) in the following format:

photo_id  s3://input_photo.jpg  s3://output_classification.json
photo_id  s3://input_photo.jpg  s3://output_classification.json
photo_id  s3://input_photo.jpg  s3://output_classification.json
...

By default, EMR outputs data as one large list split among many part-xxxxx files. These split files can be very large and are not aligned with record boundaries, which makes parsing them hard. For our classification pipeline we saved the predictions for each image as a separate JSON file. This is why we explicitly specified the destination location in the input for Hadoop, so that each worker can save the results for each image separately.

We performed image classification on GPU-based EC2 instances. To our surprise, we found that these g2.2xlarge instances are highly unreliable. The most common problem we observed was a failure of the CUDA driver to find the GPU card. From the Hadoop side it looked like a stuck worker whose containers were blocked by the GPU driver. The workaround was to kill the failed instance and let EMR bring up a new node and restore the HDFS cluster.

There were also problems resulting in failed jobs that Hadoop/EMR could not handle, including exceeding memory configuration limits, running out of disk space, S3 connectivity issues, bugs in the code, and so on. This is why we needed the ability to resume long-running jobs that failed. Our approach was to write the photo_id into Hadoop’s output for each successfully processed image. In case of a failed job, we would parse the part-xxxxx files and exclude the successfully processed photo ids from the input files.
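
A minimal sketch of that resume step, assuming each output line holds a single photo_id (the helper names here are ours, for illustration only):

# Build the set of already-processed ids from a failed run's part-xxxxx
# files, then drop those ids from an input file before resubmitting.
import glob

def load_processed_ids(output_dir):
    processed = set()
    for part_file in glob.glob(output_dir + '/part-*'):
        for line in open(part_file):
            processed.add(line.strip())  # assumes one photo_id per line
    return processed

def write_resume_input(input_file, resumed_file, processed):
    with open(resumed_file, 'w') as out:
        for line in open(input_file):
            photo_id = line.split()[0]  # first column of the input format
            if photo_id not in processed:
                out.write(line)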

Bootstrapping and configuring EMR with mrjob

Before we can schedule a job on EMR, we first need to bootstrap the cluster and then configure Hadoop. Amazon provides command line tools to perform both actions. However, we took an easier path and used mrjob, a Python library that provides abstractions over the EMR APIs and makes developing and testing MapReduce jobs very easy.

It’s worth mentioning that mrjob is not the slowest Python framework for Hadoop, contrary to a blog post by Cloudera. Instead, it is as fast as any Hadoop Streaming job can get. It simply does more than other frameworks by offering protocols that serialize and deserialize raw text into native Python objects. Of course, this feature can be disabled in the code.

To bootstrap and configure EMR via mrjob, you create a YAML config file; ours was called classifier_job.conf.
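
A sketch of what such a config might contain, using option names from mrjob’s EMR runner of that era; all values are illustrative, derived from the memory discussion below rather than copied from our actual file:

# classifier_job.conf — an illustrative sketch, not the original file
runners:
  emr:
    ami_version: 3.4.0              # a Hadoop 2.x (YARN) AMI
    ec2_instance_type: g2.2xlarge   # GPU instances for Caffe
    num_ec2_core_instances: 20
    bootstrap:
    - sh classifier_job.sh#         # install Numpy, Caffe, etc. on each node
    jobconf:
      # 12GB of each node's 15GB goes to YARN containers
      yarn.nodemanager.resource.memory-mb: 12288
      # one map task (Caffe model + 5 parallel downloads) needs ~2.7GB,
      # so allocate 3GB per container: 4 containers per node
      mapreduce.map.memory.mb: 3072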

It’s important to highlight the memory-specific settings, both for YARN (Hadoop’s resource manager) and for MapReduce. One map operation required around 2.7GB of RAM: 1.7GB to keep the Caffe model in memory plus 1GB for downloading and preparing 5 images in parallel. This is why we allocated 3GB per YARN container. Each g2.2xlarge worker had 15GB of RAM, of which 12GB was allocated to MapReduce and 3GB was left to the OS. This gave us 4 containers per node (12GB / 3GB), or 80 containers for a 20-node cluster.

The bash script classifier_job.sh that installed Numpy, Caffe, and other libraries would run on both the master and worker nodes. To make sure it did not fail on the master node, which has no GPU, we had a switch in the script.
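
A minimal sketch of such a switch, assuming GPU detection via lspci (the install helper is hypothetical):

# in classifier_job.sh
if lspci | grep -qi nvidia; then
    # worker node with a GPU: set up the CUDA driver and GPU-enabled Caffe
    install_cuda_and_gpu_caffe      # hypothetical helper function
else
    # master node (no GPU): skip CUDA so the bootstrap does not fail
    echo "no GPU detected, skipping CUDA setup" >&2
fi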

Running the classification job

The image classification job consisted of a single map operation that included downloading images, preparing batch data, performing classification on the GPU, and saving the results. To justify the overhead of loading data into and out of the GPU, images needed to be processed in batches. Unfortunately, mrjob does not support batch processing out of the box. The workaround was to override the run_mapper method and manually control reading lines from stdin.

The batch size was set to 50 images based on the available GPU memory of a g2.2xlarge instance. The two main methods that handled the batch operation were run_mapper, for preparing batch data, and process_batch, for the batch classification task.
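
A sketch of how this can be structured with mrjob; the process_batch internals are only outlined, and the details are assumptions:

import sys

from mrjob.job import MRJob

BATCH_SIZE = 50  # sized to fit the g2.2xlarge GPU memory

class ClassifierJob(MRJob):

    def run_mapper(self, step_num=0):
        # Bypass mrjob's line-at-a-time mapper and read stdin ourselves,
        # so the GPU can be fed fixed-size batches.
        batch = []
        for line in sys.stdin:
            # input format: photo_id, source image URI, destination URI
            photo_id, photo_uri, output_uri = line.split()
            batch.append((photo_id, photo_uri, output_uri))
            if len(batch) == BATCH_SIZE:
                self.process_batch(batch)
                batch = []
        if batch:
            self.process_batch(batch)  # flush the final partial batch

    def process_batch(self, batch):
        # 1. download and preprocess images with 5 parallel processes
        # 2. run one Caffe forward pass for the whole batch on the GPU
        # 3. save one JSON result per image to its S3 destination
        # 4. emit each photo_id to stdout so a failed job can be resumed
        raise NotImplementedError('outline only')

if __name__ == '__main__':
    ClassifierJob.run()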

Among all the major steps of process_batch, downloading images was the slowest.

download_preprocess_for_caffe_batch was parallelized to perform downloading using 5 processes. Normally this download step would be a good candidate for a separate map operation. However, that would require storing all the resized images in HDFS (the Hadoop Distributed File System), which in turn would require a very large EMR cluster. Alternatively, it is possible to increase the number of parallel processes from 5 to, say, 10. But doing so would increase the map task’s memory footprint and, as a result, reduce the number of containers per node (e.g. from 4 containers to 3).
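
A sketch of how the five-process pool can be structured; the function name comes from the post, but its body here is an assumption:

# Parallel download/preprocess step using multiprocessing.
from multiprocessing import Pool

NUM_DOWNLOAD_PROCESSES = 5

def download_and_preprocess(item):
    photo_id, photo_uri, output_uri = item
    # download from S3, resize/crop to the model's input dimensions and
    # convert the RGB pixels into a float array (details omitted)
    pass

def download_preprocess_for_caffe_batch(batch):
    pool = Pool(processes=NUM_DOWNLOAD_PROCESSES)
    try:
        # pool.map preserves input order, so results line up with photo ids
        return pool.map(download_and_preprocess, batch)
    finally:
        pool.close()
        pool.join()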

Here are the stats for the final run:

  • 20 nodes EMR cluster
  • 80 containers (4 containers per node)
  • ~600 images/s (30 images/s per node)
  • ~80% average CPU load across the cluster
  • several days to complete

General tips

Just a compilation of random tips we found useful while working on this project.

  • Use an EMR job flow to start a persistent cluster, so you can avoid provisioning and bootstrapping every time you want to run a job. An example of a possible workflow using mrjob:
# start cluster. It will return job flow id
python -m mrjob.tools.emr.create_job_flow \
--conf-path=classifier_job.conf

# schedule a job using the job flow id
python classifier_job.py -r emr \
--emr-job-flow-id=j-396IAJN5Y6BCB \
--conf-path=classifier_job.conf \
--no-output --output-dir=s3://output s3://input
  • Use EC2 spot instances to save EMR costs by specifying the bid price in the config file:
ec2_master_instance_bid_price: '0.07'
ec2_core_instance_bid_price: '0.07'
  • Use Hadoop’s counters and stderr to output debug and profiling info. For example (inside a mapper’s processing loop, with from sys import stderr at the top of the module):
...
except Exception as e:
    # counters are aggregated per (group, name) and shown in the job UI
    self.increment_counter('error', str(e))
    self.increment_counter('count', 'failed photos', count)
    # anything written to stderr ends up in the container logs
    stderr.write(str(e) + '\n')
...
  • ssh to EMR cluster and run:
# to view worker nodes and # of running containers
yarn node -list

# to view running jobs and get urls for the monitoring dashboard
yarn application -list
  • Specify the S3 location where EMR logs should be stored in mrjob’s config file:
s3_log_uri: s3n://path_to_store_emr_logs

These logs record when and where each map/reduce attempt started and whether it succeeded or failed. For failed tasks, you can look at the corresponding container logs, which contain your code’s stderr output and any other exceptions.

Summary

We showed how to perform a large scale image classification task using Caffe, Elastic MapReduce (EMR), and mrjob. In our approach we optimized for simplicity and speed of development.

The combination of Hadoop Streaming and machine learning libraries available in Python opens up interesting opportunities for large scale data processing and analysis.


from: https://developers.500px.com/image-classification-with-hadoop-streaming-1aa18b81e22b#.b27jerwf8

