Three years ago, I was tasked with migrating the data analysis stack of a company with millions of daily active users from its traditional SQL-based infrastructure to Apache Hadoop. Back then, I considered securing the cluster, although it wasn’t strictly necessary because the cluster was only accessed internally by the data engineers and scientists. I soon found out that it is not as simple as setting a few options in some XML file. In fact, I was surprised by how much extra complexity is needed for something that seemed pretty trivial: “I just want my cluster to have auth!” After a couple of days of struggling with Kerberos, the KDC, keytab files, principals, and other things that I didn’t 100% understand back then, I decided to move on. Why fix it if it’s not broken?

Fast-forward three years, and I found myself working on a different Big Data project in a different organization on a different continent. This time, we had to have Hadoop security, as we were installing a multi-tenant solution. This is where it all came back to haunt me.

What is Kerberos

Kerberos is, in fact, the go-to solution for centralized authentication for most network admins. Many people use it every day without even knowing it, partly because Microsoft adopted Kerberos, extended it, and made sure its version doesn’t quite work with the one everybody else uses (kind of like what they did with POSIX). It is the cornerstone of authentication in Active Directory, and it is commonly deployed alongside LDAP and Samba.

Truth is, you don’t really need to fully understand how it works under the hood, although I strongly recommend that you do take a few minutes to learn the basic concepts from this amazing tutorial (believe me, I’ve watched a few of them). The main idea is, instead of different services managing passwords and access themselves, let a centralized server do it and issue “tickets” for the user. The user can present this ticket as the token, and using a very cleverly designed protocol the password does not need to be passed around in the network. We’ll explore that in a simple analogy later.

Why does Hadoop use Kerberos and make my life harder

Rule of thumb: If you’re implementing your own auth and security utils from scratch, chances are you’re screwing up. Small errors in the design or implementation of the protocols and routines can cause catastrophic vulnerabilities. You should not reinvent the wheel. This is why the Hadoop project did not invent its own authentication scheme. There are many, many modules involved in a Hadoop cluster, and new projects based on Hadoop emerge pretty frequently. Using a well-known standard is a good engineering decision.

What do I need to know?

Hadoop is known to be a nightmare in configuration and maintenance, so it is important to understand a few key concepts here, explained in simple English:


KDC: Imagine that, instead of showing your driver’s license everywhere, there was a machine that scanned your driver’s license and issued you disposable tickets with your name on them. That machine is called the KDC (Key Distribution Centre) in Kerberos. You could then use such a ticket to get into the club or something. So, simply put, the KDC is the central authority that checks who you are and hands out tickets.

Principal: Think of Kerberos principals as the name on an ID card. It is the equivalent of a user entity in the Kerberos framework.

Keytab File: A keytab file is close to an RSA private key in its functionality: it holds the long-term secret keys of a principal, so whoever has the file can authenticate as that principal without typing a password. Basically, it is the “driver's license” in the ticket machine analogy. It’s what you present to the KDC for it to issue you a ticket, though it is not the only way to do so: the machine can also take your password or PIN, but in secure Hadoop we mostly use keytab files. Of course, the keytab or password doesn’t get sent to the KDC in the clear, and there are more steps involved, but let’s not focus on those details for now. Why not show the driver’s license (keytab file) itself to the club bouncer instead of the ticket? Because we don’t really trust bouncers not to copy it, and we don’t want to keep carrying our “driver’s license” around and risk losing it. That’s part of why Kerberos is so secure. Also, the ticket that you get for presenting at the club is kind of written in bouncer language, so if someone steals it, they can’t use it at the bank.
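
To make the "name on the ID card" idea a bit more concrete: a principal name has the shape primary/instance@REALM, where the instance part is optional. The sketch below only splits such a string into its parts. The class name PrincipalParts and its fields are made up for this illustration and are not part of any Kerberos or Hadoop API.

```java
public class PrincipalParts {
    final String primary;   // user or service name, e.g. "HTTP" or "root"
    final String instance;  // optional qualifier, e.g. a host name or "admin"; empty if absent
    final String realm;     // the auth domain, conventionally upper case

    PrincipalParts(String primary, String instance, String realm) {
        this.primary = primary;
        this.instance = instance;
        this.realm = realm;
    }

    /** Splits "primary/instance@REALM" (or "primary@REALM") into its three parts. */
    static PrincipalParts parse(String principal) {
        int at = principal.lastIndexOf('@');
        if (at < 0) {
            throw new IllegalArgumentException("no realm in: " + principal);
        }
        String name = principal.substring(0, at);
        String realm = principal.substring(at + 1);
        int slash = name.indexOf('/');
        String primary = slash < 0 ? name : name.substring(0, slash);
        String instance = slash < 0 ? "" : name.substring(slash + 1);
        return new PrincipalParts(primary, instance, realm);
    }

    public static void main(String[] args) {
        PrincipalParts p = parse("HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM");
        System.out.println("primary  = " + p.primary);
        System.out.println("instance = " + p.instance);
        System.out.println("realm    = " + p.realm);
    }
}
```

For Hadoop service principals, the instance part is the host name, which is why host name resolution matters so much later on.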

Hadoop Kerberos Configs

First of all, I’d highly suggest using Ambari if you’re going with the secure cluster for the first time. Trying to figure out Hadoop Security without using Ambari is like learning how to moonwalk wearing ice cleats.

To secure the cluster, you need to install Kerberos first. The instructions here are written for CentOS 7, but the process is fairly similar on Debian-based systems. Install the Kerberos client on all cluster machines. Here, we assume one master node called oldtrafford and two slaves called amfield and stamford:

pdsh -w oldtrafford,amfield,stamford sudo yum install -y krb5-workstation krb5-libs krb5-auth-dialog

Install the Kerberos server on the master node:


yum install krb5-server krb5-libs krb5-auth-dialog krb5-workstation

The most important config file in Kerberos is /etc/krb5.conf, which, among other things, tells every host where the KDC and the admin server are. In /etc/krb5.conf, set the following:

[libdefaults]
...
default_realm = IMANAKBARI.COM
...
[realms]
IMANAKBARI.COM = {
    kdc = oldtrafford.imanakbari.com
    admin_server = oldtrafford.imanakbari.com
}
[domain_realm]
.imanakbari.com = IMANAKBARI.COM
imanakbari.com = IMANAKBARI.COM

Copy the /etc/krb5.conf file to all hosts in the cluster.


The configs above define a realm named IMANAKBARI.COM; writing realm names in all caps is the convention. They then point to the domain names of the KDC and the admin server for this realm. Notice that in Kerberos we always use FQDNs, the “full” names of the hosts. A realm is essentially an auth domain. It can represent an institution’s or a company’s auth system. For instance, at the University of Waterloo, we use the realm CSCLUB.UWATERLOO.CA for our campus network authentication.

Then we use the kdb5_util utility to create the Kerberos database. Be careful not to lose the KDC master password.

sudo kdb5_util create -s

Now, update the /var/kerberos/krb5kdc/kdc.conf file on the server:


[realms]
IMANAKBARI.COM = {
    acl_file = /var/kerberos/krb5kdc/kadm5.acl
    dict_file = /usr/share/dict/words
    admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
    supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
}

Update ACL on the Kerberos server by editing the /var/kerberos/krb5kdc/kadm5.acl file:



Start the KDC:


sudo service krb5kdc start
sudo service kadmin start

And make sure it runs on startup:


sudo chkconfig krb5kdc on
sudo chkconfig kadmin on

Create the admin principal (the other principals will be made by Ambari itself)


sudo kadmin.local
addprinc root/admin@IMANAKBARI.COM

Download JCE policy 8 and place it in Java libs on all hosts:


sudo unzip -o -j -q jce_policy-8.zip -d <JAVA_HOME>/jre/lib/security/

And now you can run the Kerberization wizard in Ambari by going to Admin>Kerberos and clicking Enable Kerberos.

The process takes quite a while, and it has to restart all services. By default, Ambari generates all the keytabs needed by HDFS, YARN, SPNEGO, Spark, etc.

Now, you cannot expect everything to just work once the wizard finishes. Troubleshooting Kerberos has had me pulling my hair out, and I’m sure I’m not the only one. Here are a few of the problems that I had to figure out:

Troubleshooting a Kerberos-enabled Cluster

The _HOST value

In the HDFS configs, Hadoop uses a trick so that you do not have to write a separate XML config for each host in the cluster: the _HOST macro in principal names automatically resolves to each host’s name. That is a recipe for disaster, because in a lot of environments a host may have several different “names”. This is typical in “multi-homed” environments, on which a very good tutorial is available here.

In order to avoid errors like this when starting the cluster after Kerberization, you need to make sure you know what each variable in Hadoop configurations translates to:


Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/HDFS/", line 361, in <module>
    NameNode().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 375, in execute
    method(env)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 978, in restart
    self.start(env, upgrade_type=upgrade_type)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/", line 99, in start
    upgrade_suspended=params.upgrade_suspended, env=env)
  File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
    return fn(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/", line 234, in namenode
    create_hdfs_directories()
  File "/var/lib/ambari-agent/cache/common-services/HDFS/", line 301, in create_hdfs_directories
    mode=0777,
  File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 166, in __init__
    self.env.run()
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
    self.run_action(resource, action)
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
    provider_action()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 606, in action_create_on_execute
    self.action_delayed("create")
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 603, in action_delayed
    self.get_hdfs_resource_executor().action_delayed(action_name, self)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 330, in action_delayed
    self._assert_valid()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 289, in _assert_valid
    self.target_status = self._get_file_status(target)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 432, in _get_file_status
    list_status = self.util.run_command(target, 'GETFILESTATUS', method='GET', ignore_status_codes=['404'], assertable_result=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 177, in run_command
    return self._run_command(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 250, in _run_command
    raise WebHDFSCallException(err_msg, result_dict)
resource_management.libraries.providers.hdfs_resource.WebHDFSCallException: Execution of 'curl -sS -L -w '%{http_code}' -X GET --negotiate -u : 'http://oldtrafford.imanakbari.com:50070/webhdfs/v1/tmp?op=GETFILESTATUS'' returned status_code=403.
<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/><title>Error 403 org.apache.hadoop.security.authentication.client.AuthenticationException</title></head><body><h2>HTTP ERROR 403</h2><p>Problem accessing /webhdfs/v1/tmp. Reason:<pre>    org.apache.hadoop.security.authentication.client.AuthenticationException</pre></p><hr /><i><small>Powered by Jetty://</small></i><br/><br/>

or this one:


resource_management.libraries.providers.hdfs_resource.WebHDFSCallException: Execution of 'curl -sS -L -w '%{http_code}' -X GET --negotiate -u : 'http://oldtrafford.imanakbari.com:50070/webhdfs/v1/tmp?op=GETFILESTATUS'' returned status_code=403.<html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/><title>Error 403 org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Invalid argument (400) - Cannot find key of appropriate type to decrypt AP REP - AES256 CTS mode with HMAC SHA1-96)</title></head><body><h2>HTTP ERROR 403</h2><p>Problem accessing /webhdfs/v1/tmp. Reason:<pre>    org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Invalid argument (400) - Cannot find key of appropriate type to decrypt AP REP - AES256 CTS mode with HMAC SHA1-96)</pre></p><hr /><i><small>Powered by Jetty://</small></i><br/><br/><br/>

When running into a problem like this, the first thing to check is the upper/lower case of the realm name in the configurations. Kerberos is, of course, case-sensitive.

But another source of the problem is _HOST substitution. By default, _HOST is replaced with the result of Java’s InetAddress.getLocalHost().getCanonicalHostName().toLowerCase(), unless a config named hadoop.security.dns.interface is set in core-site.xml.

The hadoop.security.dns.interface setting essentially tells Hadoop to take, on each host, the machine’s IP address on the given interface (say enp0s4), do a reverse DNS lookup on that address, and replace _HOST with the resulting domain name. You can also tell Hadoop which DNS server to use for this lookup with the hadoop.security.dns.nameserver configuration.
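
To get a feel for what such an interface-based lookup returns on a given machine, here is a rough approximation using only the JDK. It is a sketch, not Hadoop's own code: the class InterfaceHostLookup and the method reverseNames are names invented for this example, and the lookup goes through the JVM's default resolver, so it does not reproduce the effect of hadoop.security.dns.nameserver.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class InterfaceHostLookup {
    /**
     * Reverse-resolves every address bound to the named interface.
     * Returns an empty list if no such interface exists on this machine.
     */
    static List<String> reverseNames(String interfaceName) {
        List<String> names = new ArrayList<>();
        try {
            NetworkInterface nic = NetworkInterface.getByName(interfaceName);
            if (nic == null) {
                return names;
            }
            for (InetAddress address : Collections.list(nic.getInetAddresses())) {
                // getCanonicalHostName() performs the reverse lookup and falls back
                // to the textual IP address when no name can be found.
                String name = address.getCanonicalHostName().toLowerCase();
                names.add(address.getHostAddress() + " -> " + name);
            }
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // enp0s4 is just the example interface name used in the text.
        String interfaceName = args.length > 0 ? args[0] : "enp0s4";
        List<String> names = reverseNames(interfaceName);
        if (names.isEmpty()) {
            System.out.println("No addresses found on interface " + interfaceName);
        }
        for (String line : names) {
            System.out.println(line);
        }
    }
}
```

Running it on each node with the interface you plan to configure shows which name, if any, each address maps back to.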

On the other hand, when Ambari generates the keytabs, it uses the hostnames you registered in Ambari, which should typically be the “full” host names (FQDNs), like oldtrafford.imanakbari.com. You can check those names in the Hosts tab of Ambari.

You can check the principal names in the generated keytabs using the klist -kt command:

$ klist -kt /etc/security/keytabs/spnego.service.keytab
Keytab name: FILE:/etc/security/keytabs/spnego.service.keytab
KVNO Timestamp         Principal
---- ----------------- -----------------------
   4 18/03/20 13:08:29 HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM
   4 18/03/20 13:08:29 HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM
   4 18/03/20 13:08:29 HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM
   4 18/03/20 13:08:29 HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM
   4 18/03/20 13:08:29 HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM

So, in communications between Hadoop components, the identity (principal name) that SPNEGO on oldtrafford presents is HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM. This must match exactly what HTTP/_HOST@${realm} translates to. If _HOST gets resolved to oldtrafford and not oldtrafford.imanakbari.com, everything gets messed up, because the other components expect different credentials from the ones that are presented.
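
The comparison can be scripted rather than eyeballed. The sketch below pulls the principal column out of klist -kt output and checks it against the principal you expect; the match is exact and case-sensitive, like Kerberos itself. The class KeytabPrincipalCheck and its methods are hypothetical helpers written for this post, and the parsing simply assumes the column layout that klist -kt prints (principal as the last field of each entry line).

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class KeytabPrincipalCheck {
    /** Collects the distinct principals from `klist -kt` output (last field of each entry line). */
    static Set<String> principalsFrom(String klistOutput) {
        Set<String> principals = new LinkedHashSet<>();
        for (String line : klistOutput.split("\\R")) {
            String[] fields = line.trim().split("\\s+");
            String last = fields[fields.length - 1];
            // Entry lines end in name@REALM; the header and separator lines contain no '@'.
            if (last.contains("@")) {
                principals.add(last);
            }
        }
        return principals;
    }

    /** True only on an exact, case-sensitive match. */
    static boolean keytabContains(String klistOutput, String expectedPrincipal) {
        return principalsFrom(klistOutput).contains(expectedPrincipal);
    }

    /** Flags the classic mistake: same principal, different upper/lower case. */
    static boolean differsOnlyByCase(String a, String b) {
        return a.equalsIgnoreCase(b) && !a.equals(b);
    }

    public static void main(String[] args) {
        String klistOutput =
              "Keytab name: FILE:/etc/security/keytabs/spnego.service.keytab\n"
            + "KVNO Timestamp         Principal\n"
            + "---- ----------------- -----------------------\n"
            + "   4 18/03/20 13:08:29 HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM\n";
        String shortName = "HTTP/oldtrafford@IMANAKBARI.COM";
        String fullName = "HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM";
        System.out.println(shortName + " in keytab: " + keytabContains(klistOutput, shortName));
        System.out.println(fullName + " in keytab: " + keytabContains(klistOutput, fullName));
    }
}
```

If the expected principal is missing, printing principalsFrom(...) next to it usually makes the short-name versus FQDN (or case) difference obvious.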

The best way to troubleshoot these problems and make sure these configs are correct is to inspect what the configurations resolve to. If hadoop.security.dns.interface is not set, the following Java snippet can let you check what _HOST will be replaced with on each machine:
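
A minimal version of such a check, built directly on the expression quoted earlier, could look like this. The class name HostMacroCheck and the expand helper are only illustrative, and the principal pattern in main is just an example; substitute the patterns from your own XML configs.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostMacroCheck {
    /** The default _HOST value as described above: canonical local host name, lower-cased. */
    static String resolveHostMacro() throws UnknownHostException {
        return InetAddress.getLocalHost().getCanonicalHostName().toLowerCase();
    }

    /** Replaces the _HOST macro in a principal pattern with the given host name. */
    static String expand(String principalPattern, String host) {
        return principalPattern.replace("_HOST", host);
    }

    public static void main(String[] args) throws UnknownHostException {
        String host = resolveHostMacro();
        System.out.println("_HOST resolves to: " + host);
        // Example pattern only; use the ones from your own configuration.
        System.out.println("Principal becomes: " + expand("HTTP/_HOST@IMANAKBARI.COM", host));
    }
}
```

Run it on every node in the cluster and compare the printed principal with the output of klist -kt for that node's keytab.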

If there is a discrepancy, you can use the hadoop.security.dns.interface and hadoop.security.dns.nameserver configurations. To check what the DNS returns, you can use the nslookup tool:

$ nslookup name = amfield.imanakbari.com.

Also, keep in mind that if no DNS is configured, the /etc/hosts file will be used, and the order of the names on each line determines what a reverse lookup returns. For example, with the following entry, looking up 192.168.1.100 will return amfield.mycluster.imanakbari.com and not amfield.imanakbari.com, which can cause errors.

192.168.1.100 amfield.mycluster.imanakbari.com amfield.imanakbari.com
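
The ordering rule is simple enough to model: in a hosts file, the first name after the IP address is treated as the canonical name and the rest as aliases. The toy parser below mimics that rule so you can sanity-check an entry such as "192.168.1.100 amfield.mycluster.imanakbari.com amfield.imanakbari.com" before rolling it out. It is only a sketch; HostsFileOrder and canonicalNameFor are invented names, and the real lookup is done by the system resolver, not by this code.

```java
public class HostsFileOrder {
    /**
     * Returns the first host name listed for the given IP in hosts-file content,
     * or null if the IP has no entry. Comments (# ...) and blank lines are skipped.
     */
    static String canonicalNameFor(String hostsContent, String ip) {
        for (String rawLine : hostsContent.split("\\R")) {
            String line = rawLine.replaceAll("#.*", "").trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] fields = line.split("\\s+");
            // fields[0] is the address, fields[1] the canonical name, the rest are aliases.
            if (fields.length >= 2 && fields[0].equals(ip)) {
                return fields[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String hosts = "127.0.0.1 localhost\n"
                     + "192.168.1.100 amfield.mycluster.imanakbari.com amfield.imanakbari.com\n";
        System.out.println(canonicalNameFor(hosts, "192.168.1.100"));
    }
}
```

Swapping the two names on the line makes amfield.imanakbari.com the canonical one, which is the name the keytab principals in this setup use.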

krb_error 41 Message stream modified

I’d also like to share my experience with another error that has not been discussed enough on Stack Overflow or the Cloudera forums. When using the Spark client inside a Docker container, I ran into the following error when starting a Spark session on secure YARN:

Exception: krb_error 41 Message stream modified (41)

Although for many others the problem was the upper/lower case of the configurations, in my case the problem was the renew_lifetime = 7d setting in the /etc/krb5.conf file. Removing it did the trick.

Conclusion

There are a lot of bells and whistles in the Hadoop/Kerberos configurations. I recommend (1) taking the time to understand the core concepts, (2) using Ambari to generate the configs and keytabs, at least the first time you deal with this, and (3) making sure the principal names in the Hadoop XML config files match the ones in the generated keytabs.

Acknowledgments

Credits are due to Faizul Bari for the KRB installation instructions.

Translated from: https://medium.com/@blackvvine/taming-the-three-headed-beast-understanding-kerberos-for-trouble-shooting-hadoop-security-12f6c152fe97
