Background

We recently stood up another new cluster (OSE), running HDP 3.1.5.0.

So it was time to migrate data again.

Because OSD could not be restarted, I configured a one-way Kerberos trust from OSD to OSE:

  1. Modified the auth_to_local configuration on OSE and restarted the HDFS service,
  2. so that when an OSD principal accesses OSE, the auth_to_local rules map it to the corresponding OS user.
  3. Then tested it: an OSD domain user could access files on OSE's HDFS.
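For context, the auth_to_local half of such a one-way trust is a set of mapping rules in core-site.xml on the trusting cluster (OSE here). A minimal sketch, with a made-up realm name OSD.EXAMPLE.COM standing in for the real one:

    <property>
      <name>hadoop.security.auth_to_local</name>
      <value>
        RULE:[1:$1@$0](.*@OSD\.EXAMPLE\.COM)s/@.*//
        DEFAULT
      </value>
    </property>

The resulting mapping can be checked without touching HDFS at all, via `hadoop org.apache.hadoop.security.HadoopKerberosName someuser@OSD.EXAMPLE.COM`.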

But when we started copying data from OSD to OSE, distcp simply hung.

There was no error output at all. At first I suspected a problem renewing the HDFS_DELEGATION_TOKEN, so I added the NameNode addresses of both clusters to -Dmapreduce.job.hdfs-servers.token-renewal.exclude=, but that didn't help.
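For the record, the hung invocation had roughly this shape (the addresses and paths below are placeholders, not the exact command used):

    hadoop distcp \
      -Dmapreduce.job.hdfs-servers.token-renewal.exclude=<OSD NN address>,<OSE NN address> \
      hdfs://<OSD Active NameNode IP>/src/path \
      hdfs://<OSE nameservice>/dst/path

The exclude option tells the job submitter not to attempt delegation-token renewal against those NameNodes, which is the usual fix when one side cannot renew the other's tokens. It just wasn't the fix here.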

Clearly my own skills weren't up to it, so I turned to Google.

Search keywords: "distcp stuck Build file listing completed"

Investigation

The first Google result was the official "Apache Hadoop Distributed Copy – DistCp Guide". That is Apache's distcp tutorial, but I just wanted a direct fix; the next two pages described a problem similar to mine and gave me the ideas that led to a solution.

Reference 1:
http://people.apache.org/~liuml07/2017/07/05/DistCp-gets-stuck-with-build-listing/

Reference 2:
https://community.cloudera.com/t5/Support-Questions/Distcp-got-stuck-with-the-below-and-doesn-t-do-anything/m-p/292259

Below is a brief account of how I solved the problem.

My environment:

  • The client configuration on this node is OSE's
  • kinit was done with an OSD Kerberos user

  1. I confirmed that the one-way Kerberos trust was configured correctly and in effect.
  2. Following Reference 1, I tested a plain cross-cluster copy:

    hadoop fs -cp hdfs://<OSD Active NameNode IP>/tmp/testfile /tmp/testfile

This proved the HDFS services themselves were fine.

  3. Then, following Arun66's troubleshooting in Reference 2:

    We tried to run a sample MR job to test, then it failed with the following exception

That reminded me of the most basic MR job in every book: wordcount.

Following the Apache Hadoop docs: https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v1.0

WordCount.java:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Save the code into a file named WordCount.java.

Set up the environment:

export JAVA_HOME=/usr/java/default
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

If JAVA_HOME is already set in your environment, skip the first line.

Compile:

$ hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class

HADOOP_CLASSPATH is exported above so that hadoop can find com.sun.tools.javac.Main.

Input and output directories:

/user/joe/wordcount/input - input directory in HDFS
/user/joe/wordcount/output - output directory in HDFS

Prepare the input files:

vi file01
Hello World Bye World

vi file02
Hello Hadoop Goodbye Hadoop

hdfs dfs -put file* /user/joe/wordcount/input

Run wordcount:

$ hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output

And sure enough, it failed:

Error: java.io.IOException: Initialization of all the collectors failed. Error in last collector was: java.io.IOException: Invalid "mapreduce.task.io.sort.mb": 3276

Some identifiers below have been replaced with [REDACTED].



Full error output:

[[REDACTED]@[REDACTED] ~]$ hadoop jar wc.jar WordCount /user/[REDACTED]/input /user/[REDACTED]/output
22/12/20 21:11:32 INFO client.AHSProxy: Connecting to Application History server at [REDACTED].com/[REDACTED]:10200
22/12/20 21:11:32 INFO hdfs.DFSClient: Created token for [REDACTED]: HDFS_DELEGATION_TOKEN owner=[REDACTED]@[REDACTED], renewer=yarn, realUser=, issueDate=1671541893766, maxDate=1672146693766, sequenceNumber=75, masterKeyId=37 on ha-hdfs:osea
22/12/20 21:11:32 INFO security.TokenCache: Got dt for hdfs://osea; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:osea, Ident: (token for [REDACTED]: HDFS_DELEGATION_TOKEN owner=[REDACTED]@[REDACTED], renewer=yarn, realUser=, issueDate=1671541893766, maxDate=1672146693766, sequenceNumber=75, masterKeyId=37)
22/12/20 21:11:32 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
22/12/20 21:11:32 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/[REDACTED]/.staging/job_1670950620170_0046
22/12/20 21:11:33 INFO input.FileInputFormat: Total input files to process : 2
22/12/20 21:11:33 INFO mapreduce.JobSubmitter: number of splits:2
22/12/20 21:11:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1670950620170_0046
22/12/20 21:11:33 INFO mapreduce.JobSubmitter: Executing with tokens: [Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:osea, Ident: (token for [REDACTED]: HDFS_DELEGATION_TOKEN owner=[REDACTED]@[REDACTED], renewer=yarn, realUser=, issueDate=1671541893766, maxDate=1672146693766, sequenceNumber=75, masterKeyId=37)]
22/12/20 21:11:33 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.5.0-152/0/resource-types.xml
22/12/20 21:11:33 INFO impl.TimelineClientImpl: Timeline service address: ose688.[REDACTED]:8188
22/12/20 21:11:34 INFO impl.YarnClientImpl: Submitted application application_1670950620170_0046
22/12/20 21:11:34 INFO mapreduce.Job: The url to track the job: http://ose688.[REDACTED]:8088/proxy/application_1670950620170_0046/
22/12/20 21:11:34 INFO mapreduce.Job: Running job: job_1670950620170_0046
22/12/20 21:11:42 INFO mapreduce.Job: Job job_1670950620170_0046 running in uber mode : false
22/12/20 21:11:42 INFO mapreduce.Job: map 0% reduce 0%
22/12/20 21:11:47 INFO mapreduce.Job: Task Id : attempt_1670950620170_0046_m_000000_0, Status : FAILED
Error: java.io.IOException: Initialization of all the collectors failed. Error in last collector was: java.io.IOException: Invalid "mapreduce.task.io.sort.mb": 3276
at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:423)
at org.apache.hadoop.mapred.MapTask.access$100(MapTask.java:82)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:710)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:782)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
Caused by: java.io.IOException: Invalid "mapreduce.task.io.sort.mb": 3276
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:991)
at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:408)
... 9 more

(The remaining task attempts, _m_000000_1 through _m_000001_2, failed with the same exception; the repeated stack traces are omitted.)


In Ambari, change mapreduce.task.io.sort.mb to 2047 (the value must be <= 2047) and restart the affected services (Spark and YARN; it is best to restart Hive as well, since its configuration also carries this parameter).
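Outside of Ambari, the same change corresponds to this fragment in mapred-site.xml (the value is in MB):

    <property>
      <name>mapreduce.task.io.sort.mb</name>
      <value>2047</value>
    </property>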

Run wordcount again:

$ hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output


Full log of the successful run:


[[REDACTED]@[REDACTED] ~]$ hdfs dfs -rm -r /user/[REDACTED]/output
22/12/20 21:15:25 INFO fs.TrashPolicyDefault: Moved: 'hdfs://osea/user/[REDACTED]/output' to trash at: hdfs://osea/user/[REDACTED]/.Trash/Current/user/[REDACTED]/output1671542125954

[[REDACTED]@[REDACTED] ~]$ hadoop jar wc.jar WordCount /user/[REDACTED]/input /user/[REDACTED]/output
22/12/20 21:15:32 INFO client.AHSProxy: Connecting to Application History server at ose688.[REDACTED]/[REDACTED]:10200
22/12/20 21:15:32 INFO hdfs.DFSClient: Created token for [REDACTED]: HDFS_DELEGATION_TOKEN owner=[REDACTED]@[REDACTED], renewer=yarn, realUser=, issueDate=1671542133482, maxDate=1672146933482, sequenceNumber=76, masterKeyId=37 on ha-hdfs:osea
22/12/20 21:15:32 INFO security.TokenCache: Got dt for hdfs://osea; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:osea, Ident: (token for [REDACTED]: HDFS_DELEGATION_TOKEN owner=[REDACTED]@[REDACTED], renewer=yarn, realUser=, issueDate=1671542133482, maxDate=1672146933482, sequenceNumber=76, masterKeyId=37)
22/12/20 21:15:32 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
22/12/20 21:15:32 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/[REDACTED]/.staging/job_1670950620170_0047
22/12/20 21:15:32 INFO input.FileInputFormat: Total input files to process : 2
22/12/20 21:15:32 INFO mapreduce.JobSubmitter: number of splits:2
22/12/20 21:15:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1670950620170_0047
22/12/20 21:15:33 INFO mapreduce.JobSubmitter: Executing with tokens: [Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:osea, Ident: (token for [REDACTED]: HDFS_DELEGATION_TOKEN owner=[REDACTED]@[REDACTED], renewer=yarn, realUser=, issueDate=1671542133482, maxDate=1672146933482, sequenceNumber=76, masterKeyId=37)]
22/12/20 21:15:33 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.5.0-152/0/resource-types.xml
22/12/20 21:15:33 INFO impl.TimelineClientImpl: Timeline service address: ose688.[REDACTED]:8188
22/12/20 21:15:34 INFO impl.YarnClientImpl: Submitted application application_1670950620170_0047
22/12/20 21:15:34 INFO mapreduce.Job: The url to track the job: http://ose688.[REDACTED]:8088/proxy/application_1670950620170_0047/
22/12/20 21:15:34 INFO mapreduce.Job: Running job: job_1670950620170_0047
22/12/20 21:15:43 INFO mapreduce.Job: Job job_1670950620170_0047 running in uber mode : false
22/12/20 21:15:43 INFO mapreduce.Job: map 0% reduce 0%
22/12/20 21:15:52 INFO mapreduce.Job: map 100% reduce 0%
22/12/20 21:15:59 INFO mapreduce.Job: map 100% reduce 100%
22/12/20 21:16:00 INFO mapreduce.Job: Job job_1670950620170_0047 completed successfully
22/12/20 21:16:00 INFO mapreduce.Job: Counters: 53
File System Counters
FILE: Number of bytes read=79
FILE: Number of bytes written=747874
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=260
HDFS: Number of bytes written=41
HDFS: Number of read operations=11
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Rack-local map tasks=2
Total time spent by all maps in occupied slots (ms)=103928
Total time spent by all reduces in occupied slots (ms)=43184
Total time spent by all map tasks (ms)=12991
Total time spent by all reduce tasks (ms)=5398
Total vcore-milliseconds taken by all map tasks=12991
Total vcore-milliseconds taken by all reduce tasks=5398
Total megabyte-milliseconds taken by all map tasks=106422272
Total megabyte-milliseconds taken by all reduce tasks=44220416
Map-Reduce Framework
Map input records=2
Map output records=8
Map output bytes=82
Map output materialized bytes=85
Input split bytes=210
Combine input records=8
Combine output records=6
Reduce input groups=5
Reduce shuffle bytes=85
Reduce input records=6
Reduce output records=5
Spilled Records=12
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=312
CPU time spent (ms)=5030
Physical memory (bytes) snapshot=5385678848
Virtual memory (bytes) snapshot=27895767040
Total committed heap usage (bytes)=6113198080
Peak Map Physical memory (bytes)=2508402688
Peak Map Virtual memory (bytes)=9314967552
Peak Reduce Physical memory (bytes)=395964416
Peak Reduce Virtual memory (bytes)=9350410240
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=50
File Output Format Counters
Bytes Written=41
[[REDACTED]@[REDACTED] ~]$ hdfs dfs -ls /user/[REDACTED]/output
Found 2 items
-rw-r--r-- 3 [REDACTED] hdfs 0 2022-12-20 21:15 /user/[REDACTED]/output/_SUCCESS
-rw-r--r-- 3 [REDACTED] hdfs 41 2022-12-20 21:15 /user/[REDACTED]/output/part-r-00000
[[REDACTED]@[REDACTED] ~]$ hdfs dfs -cat /user/[REDACTED]/output/part-r-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2


Running distcp again, the copy now proceeded normally.

Root cause

MapTask.MapOutputBuffer implements its sort buffer as a single Java byte[] array.

mapreduce.task.io.sort.mb is validated before that array is allocated. The problem is that a Java array cannot hold more than 2^31 - 1 elements (even on a 64-bit JVM); that is a limit of the Java language itself. In practice this caps mapreduce.task.io.sort.mb at <= 2047.

2^31 - 1 is a count of bytes; converted to whole megabytes, that is 2047.
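The arithmetic behind the cap can be checked directly in the shell: the buffer length is a signed 32-bit int, so 3276 MB cannot fit while 2047 MB can.

```shell
# Maximum Java array length, in bytes (the length field is a signed 32-bit int)
echo $(( 2**31 - 1 ))                    # 2147483647
# Whole megabytes that fit under that cap
echo $(( (2**31 - 1) / (1024 * 1024) ))  # 2047
# The configured 3276 MB expressed in bytes: larger than the cap, hence rejected
echo $(( 3276 * 1024 * 1024 ))           # 3435134976
```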