## The problem

A colleague was setting up a test environment on an old server, hit a Kerberos problem, and asked me to take a look.
The current user had already run `kinit`:

```
$ klist
Ticket cache: FILE:/tmp/krb5cc_1059
Default principal: carpo@<REDACTED>

Valid starting     Expires            Service principal
03/17/23 17:29:32  03/16/24 17:29:32  krbtgt/<REDACTED>@<REDACTED>
        renew until 03/14/33 17:29:32
```
But running `hdfs dfs -ls /` fails with:

```
23/03/17 17:29:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/03/17 17:29:34 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
23/03/17 17:29:34 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
23/03/17 17:29:34 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
23/03/17 17:29:34 INFO retry.RetryInvocationHandler: java.io.IOException: DestHost:destPort <REDACTED NAMENODE>:8020 , LocalHost:localPort <REDACTED LOCAL HOSTNAME>/<REDACTED LOCAL IP>:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS], while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over <REDACTED NAMENODE>/<REDACTED NAMENODE IP>:8020 after 1 failover attempts. Trying to failover after sleeping for 1160ms.
23/03/17 17:29:35 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
23/03/17 17:29:35 INFO retry.RetryInvocationHandler: java.io.IOException: DestHost:destPort <REDACTED NAMENODE>:8020 , LocalHost:localPort <REDACTED LOCAL HOSTNAME>/<REDACTED LOCAL IP>:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS], while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over <REDACTED NAMENODE>/<REDACTED NAMENODE IP>:8020 after 2 failover attempts. Trying to failover after sleeping for 2093ms.
```
## The grind

At first I assumed it was something trivial. I checked the hosts file, but that couldn't be it: if hosts were broken, the NameNode hostname wouldn't resolve to an IP at all, and the log shows it did.
Then I checked connectivity from this server to the NameNode; no problem there.
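For reference, those two checks amount to roughly this (the hostname is a placeholder for the redacted value):

```bash
getent hosts namenode-host    # does the hostname resolve (hosts file / DNS)?
nc -vz namenode-host 8020     # can we reach the NameNode RPC port?
```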
I hadn't hit this particular failure before. In the cases I had seen, a failed authentication logs a concrete reason, such as clocks being out of sync with too large an offset. Here there was no reason at all.
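(Clock skew, for example, is quick to rule out, since Kerberos by default rejects requests more than five minutes off from the KDC. A sketch; `kdc-host` is a placeholder and is assumed to also serve NTP, as FreeIPA servers usually do:)

```bash
date                 # local clock
ntpdate -q kdc-host  # query the offset against the KDC without changing anything
```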
Time to turn on DEBUG:

```bash
export HADOOP_ROOT_LOGGER=DEBUG,console
export HADOOP_OPTS="-Dsun.security.krb5.debug=true -Djavax.net.debug=ssl"
```
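With those exported, re-running the failing command prints the whole Kerberos handshake to the console; capturing it to a file makes it easier to diff against a healthy host (my habit, not part of the original transcript):

```bash
hdfs dfs -ls / 2>&1 | tee /tmp/hdfs-krb-debug.log
```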
Let's dump the current user's login information:

```bash
hadoop org.apache.hadoop.security.UserGroupInformation
```

Output:
```
$ hadoop org.apache.hadoop.security.UserGroupInformation
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.5.0-152/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/3.1.5.0-152/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Getting UGI for current user
23/03/17 18:00:43 DEBUG security.SecurityUtil: Setting hadoop.security.token.service.use_ip to true
23/03/17 18:00:43 DEBUG util.Shell: setsid exited with exit code 0
Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
23/03/17 18:00:43 DEBUG security.Groups: Creating new Groups object
23/03/17 18:00:43 DEBUG util.NativeCodeLoader: Trying to load the custom-built native-hadoop library...
23/03/17 18:00:43 DEBUG util.NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: /usr/hdp/3.1.5.0-152/hadoop/lib/native/libhadoop.so: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by /usr/hdp/3.1.5.0-152/hadoop/lib/native/libhadoop.so)
23/03/17 18:00:43 DEBUG util.NativeCodeLoader: java.library.path=:/usr/hdp/3.1.5.0-152/hadoop/lib/native/Linux-amd64-64:/usr/hdp/3.1.5.0-152/hadoop/lib/native/Linux-amd64-64:/usr/hdp/3.1.5.0-152/hadoop/lib/native
23/03/17 18:00:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/03/17 18:00:43 DEBUG util.PerformanceAdvisory: Falling back to shell based
23/03/17 18:00:43 DEBUG security.JniBasedUnixGroupsMappingWithFallback: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
23/03/17 18:00:43 DEBUG security.Groups: Group mapping impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; cacheTimeout=7200000; warningDeltaMs=5000
Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
>>> KdcAccessibility: reset
>>> KdcAccessibility: reset
>>>KinitOptions cache name is /tmp/krb5cc_1059
>>>DEBUG <CCacheInputStream> client principal is carpo@<REDACTED>
>>>DEBUG <CCacheInputStream> server principal is krbtgt/<REDACTED>@<REDACTED>
>>>DEBUG <CCacheInputStream> key type: 18
>>>DEBUG <CCacheInputStream> auth time: Fri Mar 17 17:52:12 CST 2023
>>>DEBUG <CCacheInputStream> start time: Fri Mar 17 17:52:12 CST 2023
>>>DEBUG <CCacheInputStream> end time: Sat Mar 16 17:52:12 CST 2024
>>>DEBUG <CCacheInputStream> renew_till time: Mon Mar 14 17:52:12 CST 2033
>>> CCacheInputStream: readFlags() FORWARDABLE; RENEWABLE; INITIAL; PRE_AUTH;
>>>DEBUG <CCacheInputStream> client principal is carpo@<REDACTED>
>>>DEBUG <CCacheInputStream> server principal is X-CACHECONF:/krb5_ccache_conf_data/fast_avail/krbtgt/<REDACTED>@<REDACTED>@<REDACTED>
>>>DEBUG <CCacheInputStream> key type: 0
>>>DEBUG <CCacheInputStream> auth time: Thu Jan 01 08:00:00 CST 1970
>>>DEBUG <CCacheInputStream> start time: null
>>>DEBUG <CCacheInputStream> end time: Thu Jan 01 08:00:00 CST 1970
>>>DEBUG <CCacheInputStream> renew_till time: null
>>> CCacheInputStream: readFlags()
>>> unsupported key type found the default TGT: 18
23/03/17 18:00:43 DEBUG security.UserGroupInformation: hadoop login
23/03/17 18:00:43 DEBUG security.UserGroupInformation: hadoop login commit
23/03/17 18:00:43 DEBUG security.UserGroupInformation: using local user:UnixPrincipal: carpo
23/03/17 18:00:43 DEBUG security.UserGroupInformation: Using user: "UnixPrincipal: carpo" with name carpo
23/03/17 18:00:43 DEBUG security.UserGroupInformation: User entry: "carpo"
23/03/17 18:00:43 DEBUG security.UserGroupInformation: UGI loginUser:carpo (auth:SIMPLE)
User: carpo
Group Ids:
23/03/17 18:00:43 DEBUG security.Groups: GroupCacheLoader - load.
Groups: user
UGI: carpo (auth:SIMPLE)
Auth method SIMPLE
Keytab false
============================================================
```
When authentication succeeds, this command (even without DEBUG) reports the Kerberos principal and `Auth method KERBEROS` instead of falling back to the local Unix user with `auth:SIMPLE`. The run above clearly failed, and buried in the debug output is the actual reason:

```
>>> unsupported key type found the default TGT: 18
```
This means the Java runtime on this server does not support encryption type 18, which is AES-256 (aes256-cts-hmac-sha1-96).
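(If you'd rather see names than numeric enctype IDs, MIT Kerberos's `klist -e` prints the encryption type of each cached ticket; a quick check I'd add here, not part of the original transcript:)

```bash
klist -e   # appends "Etype (skey, tkt): ..." to each ticket entry
```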
A bit of searching turned up suggestions to change the default encryption types in /etc/krb5.conf:

```
[libdefaults]
default_tkt_enctypes = rc4-hmac aes256-cts aes128-cts des3-cbc-sha1 des-cbc-md5 des-cbc-crc
default_tgs_enctypes = rc4-hmac aes256-cts aes128-cts des3-cbc-sha1 des-cbc-md5 des-cbc-crc
permitted_enctypes = rc4-hmac aes256-cts aes128-cts des3-cbc-sha1 des-cbc-md5 des-cbc-crc
```
I tried that; it didn't help.
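(One caveat when testing enctype changes: an existing ticket cache keeps its old key type, so a fresh ticket is needed for a fair test. A sketch, reusing the principal and keytab that appear later in this post:)

```bash
kdestroy                      # throw away the old ticket cache
kinit -kt carpo.keytab carpo  # get a fresh TGT from the keytab
klist -e                      # confirm the enctype of the new ticket
```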
Possibly because our KDC only supports aes256-cts in the first place (it is the master key type):

```
# cat /var/kerberos/krb5kdc/kdc.conf
[kdcdefaults]
 kdc_ports = 88
 kdc_tcp_ports = 88
 restrict_anonymous_to_tgt = true

[realms]
 OSS3.COM = {
  master_key_type = aes256-cts
  max_life = 365d
  max_renewable_life = 3650d
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  default_principal_flags = +preauth
  ; admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  pkinit_identity = FILE:/var/kerberos/krb5kdc/kdc.crt,/var/kerberos/krb5kdc/kdc.key
  pkinit_anchors = FILE:/var/kerberos/krb5kdc/kdc.crt
  pkinit_anchors = FILE:/var/kerberos/krb5kdc/cacert.pem
  pkinit_pool = FILE:/var/lib/ipa-client/pki/ca-bundle.pem
 }
```
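(To see which key types the KDC actually stores for a given principal, `getprinc` on the KDC host lists every key and its enctype; a sketch, assuming `kadmin.local` access:)

```bash
kadmin.local -q "getprinc carpo"   # the "Key:" lines show each stored enctype
```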
It could also be because the keys in the keytab are aes256-cts:

```
$ klist -kte carpo.keytab
Keytab name: FILE:/<REDACTED>/carpo.keytab
KVNO Timestamp         Principal
---- ----------------- --------------------------------------------------------
   4 12/13/22 14:30:49 carpo@<REDACTED> (aes256-cts-hmac-sha1-96)
   4 12/13/22 14:30:49 carpo@<REDACTED> (aes128-cts-hmac-sha1-96)
```
I checked the server's OS:

```
# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
```

Could it simply be that the system is too old to support encryption type 18? After all, our new cluster deployments all require at least CentOS 7.6.
I was anxious to get off work, so I said, "No way, the OS is too old."

My team lead shot back: "Are you kidding? This server used to host the XX platform; authentication definitely worked on it before."

That verdict lit up my competitive streak.

It's like being told there is definitely water at 100 meters down: if there's no water yet, it's only because I haven't dug to 100 meters.

It worked before and fails now (the cluster migration changed the configuration), so the OS version can't be the culprit.

"In Linux, everything is a file," so some configuration file had to be the problem.

(Worst case, I'd just rsync every file over from a working machine.)
More searching revealed the real story: the JDK refuses AES-256 out of the box because of JCE, the Java Cryptography Extension "Unlimited Strength" policy. Under US cryptography export regulations, high-strength ciphers like AES-256 were export restricted, so stock JDKs shipped with a limited policy. The fix is to download the JCE policy files and drop them into $JAVA_HOME/jre/lib/security/.
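(Roughly like this; for Oracle JDK 8 the download is `jce_policy-8.zip`, and the archive layout here is from memory, so treat it as a sketch:)

```bash
unzip jce_policy-8.zip
cp UnlimitedJCEPolicyJDK8/local_policy.jar      $JAVA_HOME/jre/lib/security/
cp UnlimitedJCEPolicyJDK8/US_export_policy.jar  $JAVA_HOME/jre/lib/security/
```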
But when I went to look at $JAVA_HOME/jre/lib/security/, JCE seemed to be installed already:

```
# cd /usr/java/jre/lib/security/
# ll
total 172
-rw-r--r-- 1 root root   4054 Dec 12  2017 blacklist
-rw-r--r-- 1 root root   1273 Dec 12  2017 blacklisted.certs
-rw-r--r-- 1 root root 113484 Dec 12  2017 cacerts
-rw-r--r-- 1 root root   2466 Dec 12  2017 java.policy
-rw-r--r-- 1 root root  33404 Dec 12  2017 java.security
-rw-r--r-- 1 root root     98 Dec 12  2017 javaws.policy
-rw-r--r-- 1 root root   3527 Dec 12  2017 local_policy.jar
-rw-r--r-- 1 root root      0 Dec 12  2017 trusted.libraries
-rw-r--r-- 1 root root   3026 Dec 12  2017 US_export_policy.jar
```
Yet the error message was unambiguous: the key type is not supported.
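(In hindsight, there's a more direct test than eyeballing jar files: ask the JVM itself for the maximum AES key length it allows. This one-liner isn't from the original session, but `Cipher.getMaxAllowedKeyLength` is a standard check; 128 means the restricted policy, 2147483647 means unlimited:)

```bash
$JAVA_HOME/bin/jrunscript -e 'print(javax.crypto.Cipher.getMaxAllowedKeyLength("AES"))'
```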
I compared this directory against the same directory on another RHEL 6.5 server where authentication works, and the two JCE jars differ in size. Note the `.20181029` backups on the working server: their sizes (3527 and 3026) exactly match the jars on the failing server, i.e. the failing server still has the stock, restricted jars.

```
$ cd /usr/java/jre/lib/security/
$ ll
total 180
-rw-r--r-- 1 root root   4054 Mar 17 18:16 blacklist
-rw-r--r-- 1 root root   1273 Mar 17 18:16 blacklisted.certs
-rw-r--r-- 1 root root 113484 Mar 17 18:16 cacerts
-rw-r--r-- 1 root root   2466 Mar 17 18:16 java.policy
-rw-r--r-- 1 root root  33404 Mar 17 18:16 java.security
-rw-r--r-- 1 root root     98 Mar 17 18:16 javaws.policy
-rw-r--r-- 1 root root   3035 Mar 17 18:16 local_policy.jar
-rw-r--r-- 1 root root   3527 Mar 17 18:16 local_policy.jar.20181029
-rw-r--r-- 1 root root      0 Mar 17 18:16 trusted.libraries
-rw-r--r-- 1 root root   3023 Mar 17 18:16 US_export_policy.jar
-rw-r--r-- 1 root root   3026 Mar 17 18:16 US_export_policy.jar.20181029
```
After scp-ing the two jars over from the working server, `hdfs dfs -ls /` authenticated successfully.
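(The copy was essentially this; `good-host` stands in for the working server, and backing up the originals first costs nothing:)

```bash
cd /usr/java/jre/lib/security/
cp local_policy.jar     local_policy.jar.bak          # keep the restricted originals
cp US_export_policy.jar US_export_policy.jar.bak
scp good-host:/usr/java/jre/lib/security/local_policy.jar     .
scp good-host:/usr/java/jre/lib/security/US_export_policy.jar .
```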
## Takeaways

Sometimes what you find online doesn't map directly onto your problem; you also need a bit of luck, your own background knowledge, and a little 🤏 persistence.

Of course, if my team lead hadn't handed me the conclusion "it definitely used to work," I probably would have given up with "the OS is too old to support it."
## References
- https://www.cnblogs.com/tommyjiang/p/15008787.html
- https://www.jianshu.com/p/cc523d5a715d