Problem

A colleague was setting up a test environment on an old server, ran into a Kerberos problem, and asked me to take a look.

The current user had already run kinit:

$ klist
Ticket cache: FILE:/tmp/krb5cc_1059
Default principal: [email protected]

Valid starting       Expires              Service principal
03/17/23 17:29:32    03/16/24 17:29:32    krbtgt/[email protected]
	renew until 03/14/33 17:29:32

Running hdfs dfs -ls / failed with:

23/03/17 17:29:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/03/17 17:29:34 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
23/03/17 17:29:34 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
23/03/17 17:29:34 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
23/03/17 17:29:34 INFO retry.RetryInvocationHandler: java.io.IOException: DestHost:destPort <REDACTED NAMENODE>:8020 , LocalHost:localPort <REDACTED LOCAL HOSTNAME>/<REDACTED LOCAL IP>:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS], while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over <REDACTED NAMENODE>/<REDACTED NAMENODE IP>:8020 after 1 failover attempts. Trying to failover after sleeping for 1160ms.
23/03/17 17:29:35 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
23/03/17 17:29:35 INFO retry.RetryInvocationHandler: java.io.IOException: DestHost:destPort <REDACTED NAMENODE>:8020 , LocalHost:localPort <REDACTED LOCAL HOSTNAME>/<REDACTED LOCAL IP>:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS], while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over <REDACTED NAMENODE>/<REDACTED NAMENODE IP>:8020 after 2 failover attempts. Trying to failover after sleeping for 2093ms.

Troubleshooting

At first I assumed it was something trivial and checked the hosts file, but that couldn't be the cause: if hosts were wrong, the NameNode hostname wouldn't resolve to an IP at all.

Then I checked network connectivity from this server to the NameNode; it was fine.

I hadn't hit this particular failure before. In past cases, an authentication failure came with a concrete reason, such as clock skew being too large. This one gave nothing.

Turn on DEBUG logging and take a look:

export HADOOP_ROOT_LOGGER=DEBUG,console
export HADOOP_OPTS="-Dsun.security.krb5.debug=true -Djavax.net.debug=ssl"

Try dumping the user information:

hadoop org.apache.hadoop.security.UserGroupInformation

Output:

$ hadoop org.apache.hadoop.security.UserGroupInformation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.5.0-152/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.5.0-152/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Getting UGI for current user
23/03/17 18:00:43 DEBUG security.SecurityUtil: Setting hadoop.security.token.service.use_ip to true
23/03/17 18:00:43 DEBUG util.Shell: setsid exited with exit code 0
Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
23/03/17 18:00:43 DEBUG security.Groups: Creating new Groups object
23/03/17 18:00:43 DEBUG util.NativeCodeLoader: Trying to load the custom-built native-hadoop library...
23/03/17 18:00:43 DEBUG util.NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: /usr/hdp/3.1.5.0-152/hadoop/lib/native/libhadoop.so: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /usr/hdp/3.1.5.0-152/hadoop/lib/native/libhadoop.so)
23/03/17 18:00:43 DEBUG util.NativeCodeLoader: java.library.path=:/usr/hdp/3.1.5.0-152/hadoop/lib/native/Linux-amd64-64:/usr/hdp/3.1.5.0-152/hadoop/lib/native/Linux-amd64-64:/usr/hdp/3.1.5.0-152/hadoop/lib/native
23/03/17 18:00:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/03/17 18:00:43 DEBUG util.PerformanceAdvisory: Falling back to shell based
23/03/17 18:00:43 DEBUG security.JniBasedUnixGroupsMappingWithFallback: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
23/03/17 18:00:43 DEBUG security.Groups: Group mapping impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; cacheTimeout=7200000; warningDeltaMs=5000
Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
>>> KdcAccessibility: reset
>>> KdcAccessibility: reset
>>>KinitOptions cache name is /tmp/krb5cc_1059
>>>DEBUG <CCacheInputStream> client principal is carpo@<REDACTED>
>>>DEBUG <CCacheInputStream> server principal is krbtgt/<REDACTED>@<REDACTED>
>>>DEBUG <CCacheInputStream> key type: 18
>>>DEBUG <CCacheInputStream> auth time: Fri Mar 17 17:52:12 CST 2023
>>>DEBUG <CCacheInputStream> start time: Fri Mar 17 17:52:12 CST 2023
>>>DEBUG <CCacheInputStream> end time: Sat Mar 16 17:52:12 CST 2024
>>>DEBUG <CCacheInputStream> renew_till time: Mon Mar 14 17:52:12 CST 2033
>>> CCacheInputStream: readFlags() FORWARDABLE; RENEWABLE; INITIAL; PRE_AUTH;
>>>DEBUG <CCacheInputStream> client principal is carpo@<REDACTED>
>>>DEBUG <CCacheInputStream> server principal is X-CACHECONF:/krb5_ccache_conf_data/fast_avail/krbtgt/<REDACTED>@<REDACTED>@<REDACTED>
>>>DEBUG <CCacheInputStream> key type: 0
>>>DEBUG <CCacheInputStream> auth time: Thu Jan 01 08:00:00 CST 1970
>>>DEBUG <CCacheInputStream> start time: null
>>>DEBUG <CCacheInputStream> end time: Thu Jan 01 08:00:00 CST 1970
>>>DEBUG <CCacheInputStream> renew_till time: null
>>> CCacheInputStream: readFlags()
>>> unsupported key type found the default TGT: 18
23/03/17 18:00:43 DEBUG security.UserGroupInformation: hadoop login
23/03/17 18:00:43 DEBUG security.UserGroupInformation: hadoop login commit
23/03/17 18:00:43 DEBUG security.UserGroupInformation: using local user:UnixPrincipal: carpo
23/03/17 18:00:43 DEBUG security.UserGroupInformation: Using user: "UnixPrincipal: carpo" with name carpo
23/03/17 18:00:43 DEBUG security.UserGroupInformation: User entry: "carpo"
23/03/17 18:00:43 DEBUG security.UserGroupInformation: UGI loginUser:carpo (auth:SIMPLE)
User: carpo
Group Ids:
23/03/17 18:00:43 DEBUG security.Groups: GroupCacheLoader - load.
Groups: user
UGI: carpo (auth:SIMPLE)
Auth method SIMPLE
Keytab false
============================================================

When authentication succeeds, the output looks like this (without DEBUG enabled):

User: [email protected]
Group Ids:
Groups: carpo
UGI: [email protected] (auth:KERBEROS)
Auth method KERBEROS
Keytab false

So the run above failed, and the reason is:

unsupported key type found the default TGT: 18

This is because the Java runtime on this server does not support encryption type 18, i.e. AES-256 (aes256-cts-hmac-sha1-96).
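
A quick way to check whether a JVM allows AES-256 is to ask the JCE policy for its key-length limit. A minimal sketch, assuming a JDK 7/8 with jrunscript on the PATH:

# prints 128 under the export-limited JCE policy,
# 2147483647 once the unlimited-strength policy jars are installed
$ jrunscript -e 'print(javax.crypto.Cipher.getMaxAllowedKeyLength("AES"))'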

Some search results suggested changing the default encryption types in /etc/krb5.conf:

[libdefaults]
default_tkt_enctypes = rc4-hmac aes256-cts aes128-cts des3-cbc-sha1 des-cbc-md5 des-cbc-crc
default_tgs_enctypes = rc4-hmac aes256-cts aes128-cts des3-cbc-sha1 des-cbc-md5 des-cbc-crc
permitted_enctypes = rc4-hmac aes256-cts aes128-cts des3-cbc-sha1 des-cbc-md5 des-cbc-crc

I tried this; it didn't help.
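
For what it's worth, klist -e (a standard MIT Kerberos flag) shows which enctypes the cached ticket actually carries, so you can tell whether the krb5.conf change took effect after a fresh kinit:

$ kdestroy
$ kinit carpo
$ klist -e    # each ticket line gains an "Etype (skey, tkt): ..." suffix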

Possibly because our KDC only supports aes256-cts:

# cat /var/kerberos/krb5kdc/kdc.conf
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88
restrict_anonymous_to_tgt = true

[realms]
 OSS3.COM = {
  master_key_type = aes256-cts
  max_life = 365d
  max_renewable_life = 3650d
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  default_principal_flags = +preauth
  ; admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  pkinit_identity = FILE:/var/kerberos/krb5kdc/kdc.crt,/var/kerberos/krb5kdc/kdc.key
  pkinit_anchors = FILE:/var/kerberos/krb5kdc/kdc.crt
  pkinit_anchors = FILE:/var/kerberos/krb5kdc/cacert.pem
  pkinit_pool = FILE:/var/lib/ipa-client/pki/ca-bundle.pem
 }
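
(Strictly speaking, master_key_type only governs the KDC's master key. To see which enctypes a principal's keys were actually generated with on the KDC side, kadmin's getprinc shows them; a sketch, where admin/admin is a placeholder for whatever admin principal you have:

# each key is listed as "Key: vno N, <enctype>"
$ kadmin -p admin/[email protected] -q "getprinc [email protected]"
)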

It might also be because the keys in the keytab were generated with aes256-cts:

klist -kte carpo.keytab
Keytab name: FILE:/<REDACTED>/carpo.keytab
KVNO Timestamp         Principal
---- ----------------- --------------------------------------------------------
   4 12/13/22 14:30:49 [email protected] (aes256-cts-hmac-sha1-96)
   4 12/13/22 14:30:49 [email protected] (aes128-cts-hmac-sha1-96)

Check this server's OS:

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.5 (Santiago)

Could it be that the OS is simply too old to support enctype 18? Our new cluster deployments all require CentOS 7.6.

I was in a hurry to get off work, so I said, "No good, the OS is too old."

My team lead said: "Are you kidding? This server used to host the XX platform; authentication definitely worked on it before."

Hearing that stirred up my competitive streak.

It's like someone telling you there is definitely water at 100 meters: if there's no water yet, it's only because you haven't dug down to 100 meters.

It worked before and doesn't now (the configuration changed when the cluster was migrated), so the OS version has nothing to do with it.

"In Linux, everything is a file", so it had to be some configuration file.

(Push me far enough and I'll just rsync every file over from a working machine.)

Further searching revealed that the JDK's lack of AES-256 support comes down to the JCE policy files.

Because of US cryptography export regulations, high-strength encryption such as AES-256 is export-restricted.

The fix is simply to download the JCE unlimited-strength policy files and put them into $JAVA_HOME/jre/lib/security/.
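
A minimal sketch of that install, assuming Oracle JDK 8 and its jce_policy-8.zip download (the archive and folder names differ per JDK version):

# back up the export-limited jars, then drop in the unlimited-strength ones
$ cd $JAVA_HOME/jre/lib/security/
$ cp local_policy.jar local_policy.jar.bak
$ cp US_export_policy.jar US_export_policy.jar.bak
$ unzip -j ~/jce_policy-8.zip 'UnlimitedJCEPolicyJDK8/*.jar' -d .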

But when I looked in $JAVA_HOME/jre/lib/security/, the JCE jars were already there:

# cd /usr/java/jre/lib/security/
# ll
total 172
-rw-r--r-- 1 root root 4054 Dec 12 2017 blacklist
-rw-r--r-- 1 root root 1273 Dec 12 2017 blacklisted.certs
-rw-r--r-- 1 root root 113484 Dec 12 2017 cacerts
-rw-r--r-- 1 root root 2466 Dec 12 2017 java.policy
-rw-r--r-- 1 root root 33404 Dec 12 2017 java.security
-rw-r--r-- 1 root root 98 Dec 12 2017 javaws.policy
-rw-r--r-- 1 root root 3527 Dec 12 2017 local_policy.jar
-rw-r--r-- 1 root root 0 Dec 12 2017 trusted.libraries
-rw-r--r-- 1 root root 3026 Dec 12 2017 US_export_policy.jar

Yet the error message said plainly that the enctype was unsupported.

I compared this directory against the same path on another server, also Red Hat 6.5, where authentication worked, and found that the two JCE jars differed in size between the working server and the failing one:

$ cd /usr/java/jre/lib/security/
$ ll
total 180
-rw-r--r-- 1 root root 4054 Mar 17 18:16 blacklist
-rw-r--r-- 1 root root 1273 Mar 17 18:16 blacklisted.certs
-rw-r--r-- 1 root root 113484 Mar 17 18:16 cacerts
-rw-r--r-- 1 root root 2466 Mar 17 18:16 java.policy
-rw-r--r-- 1 root root 33404 Mar 17 18:16 java.security
-rw-r--r-- 1 root root 98 Mar 17 18:16 javaws.policy
-rw-r--r-- 1 root root 3035 Mar 17 18:16 local_policy.jar
-rw-r--r-- 1 root root 3527 Mar 17 18:16 local_policy.jar.20181029
-rw-r--r-- 1 root root 0 Mar 17 18:16 trusted.libraries
-rw-r--r-- 1 root root 3023 Mar 17 18:16 US_export_policy.jar
-rw-r--r-- 1 root root 3026 Mar 17 18:16 US_export_policy.jar.20181029

After scp-ing those two jars over, hdfs dfs -ls / authenticated successfully.
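
Roughly what that copy plus verification looked like (good-host stands in for the working server):

# on the broken server: pull the unlimited-policy jars from the working host
$ cd /usr/java/jre/lib/security/
$ scp good-host:/usr/java/jre/lib/security/local_policy.jar .
$ scp good-host:/usr/java/jre/lib/security/US_export_policy.jar .

# verify: the auth method should now read KERBEROS instead of SIMPLE
$ hadoop org.apache.hadoop.security.UserGroupInformation
$ hdfs dfs -ls /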

Summary

Sometimes the information online doesn't solve your exact problem directly; it also takes some luck, your own understanding, and a little 🤏 persistence.

Of course, had my team lead not told me "it definitely worked before", I probably would have given up with "the OS is too old to support it".

References

  1. https://www.cnblogs.com/tommyjiang/p/15008787.html
  2. https://www.jianshu.com/p/cc523d5a715d