Setting the number of Map tasks in Hadoop

Foreword

As described in many documents, the number of Mappers is not directly controllable by default, because it is determined by the size and number of inputs: the input is divided into blocks, and one Mapper is started per block. If the number of input files is huge but each file is smaller than the HDFS blockSize, the number of Mappers started will equal the number of files (each file occupies one block), and it is then likely that the number of Mappers exceeds the allowed limit and the job fails. This logic is correct, but it is only the default behavior; with some custom settings, the number of Mappers can in fact be controlled.

In Hadoop, setting the number of Map tasks is not as straightforward as setting the number of Reduce tasks: you cannot directly tell Hadoop through the API how many Map tasks to start.

You may be wondering: doesn't the API provide org.apache.hadoop.mapred.JobConf.setNumMapTasks(int n)? Can't this value set the number of Map tasks? The API does exist, but its documentation notes: "Note: This is only a hint to the framework." The value is only a hint to the Hadoop framework and is not decisive. In other words, even if you set it, you will not necessarily get the effect you want.

1. Introduction to InputFormat

Before setting the number of Map tasks, it is very important to understand the basics related to Map-Reduce input.

This interface (org.apache.hadoop.mapred.InputFormat) describes the input specification of a Map-Reduce job (input-specification): it splits all input files into logical InputSplits, each of which is assigned to a separate Mapper; it also provides a concrete RecordReader implementation that reads records from a logical InputSplit and passes them to the Mapper for processing.

InputFormat has several concrete implementations, such as FileInputFormat (the base abstract class for file-based input), DBInputFormat (for database-based input, i.e. data from a table that can be queried with SQL), KeyValueTextInputFormat (a special FileInputFormat for plain text files, where the file is split into lines at carriage return or line feed and each line is split into Key and Value by key.value.separator.in.input.line), CompositeInputFormat, DelegatingInputFormat, and so on. FileInputFormat and its subclasses cover most application scenarios.

Through the above brief introduction, we know that the InputFormat determines the InputSplits, and each InputSplit is assigned to a separate Mapper, so it is the InputFormat that determines the actual number of Map tasks.

2. Factors affecting the number of Maps in FileInputFormat

In everyday use, FileInputFormat is the most commonly used InputFormat, and it has many concrete subclasses. The factors affecting the number of Maps discussed below apply only to FileInputFormat and its subclasses; for other InputFormats, consult the corresponding getSplits(JobConf job, int numSplits) implementation.

Please see the following code snippet (excerpted from org.apache.hadoop.mapred.FileInputFormat.getSplits, hadoop-0.20.205.0 source code):

long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
long minSize = Math.max(job.getLong("mapred.min.split.size", 1), minSplitSize);

for (FileStatus file : files) {
  Path path = file.getPath();
  FileSystem fs = path.getFileSystem(job);
  long length = file.getLen();
  BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
  if ((length != 0) && isSplitable(fs, path)) {
    long blockSize = file.getBlockSize();
    long splitSize = computeSplitSize(goalSize, minSize, blockSize);

    long bytesRemaining = length;
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      String[] splitHosts = getSplitHosts(blkLocations, length - bytesRemaining, splitSize, clusterMap);
      splits.add(new FileSplit(path, length - bytesRemaining, splitSize, splitHosts));
      bytesRemaining -= splitSize;
    }

    if (bytesRemaining != 0) {
      splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                               blkLocations[blkLocations.length - 1].getHosts()));
    }
  } else if (length != 0) {
    String[] splitHosts = getSplitHosts(blkLocations, 0, length, clusterMap);
    splits.add(new FileSplit(path, 0, length, splitHosts));
  } else {
    // Create empty hosts array for zero length files
    splits.add(new FileSplit(path, 0, length, new String[0]));
  }
}
return splits.toArray(new FileSplit[splits.size()]);

protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}

totalSize: the total size of all inputs for the entire Map-Reduce job.

numSplits: comes from job.getNumMapTasks(), i.e. the value set via org.apache.hadoop.mapred.JobConf.setNumMapTasks(int n) when the job is submitted; it is the Map-count hint given to the MR framework.

goalSize: the total input size divided by the hinted number of Map tasks, i.e. how much data each Mapper is expected to process. It is only an expectation; the amount actually processed is determined by the computeSplitSize logic discussed below.

minSplitSize: defaults to 1 and can be reset by a subclass through protected void setMinSplitSize(long minSplitSize). Except in special circumstances it stays at 1.

minSize: the larger of mapred.min.split.size (default 1) and minSplitSize.

blockSize: the HDFS block size, 64MB by default; in practice HDFS is often configured with 128MB.

splitSize: the size of each split; the number of Maps is then basically totalSize / splitSize.

Next, let's look at the logic of computeSplitSize: first take the smaller of goalSize (the amount of data each Mapper is expected to process) and the HDFS blockSize, then take the larger of that result and minSize (driven by mapred.min.split.size).
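To make the formula concrete, here is a small stand-alone sketch (the 10GB input, 64MB block size and 256MB minimum are hypothetical numbers) that reproduces computeSplitSize and prints the approximate Map count:

public class SplitSizeDemo {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalSize = 10L * 1024 * 1024 * 1024; // 10GB of input (hypothetical)
        long blockSize = 64L * 1024 * 1024;        // 64MB HDFS block size
        int numSplits = 2;                         // the setNumMapTasks() hint
        long goalSize = totalSize / numSplits;     // 5GB expected per Mapper

        // Default mapred.min.split.size = 1 -> splitSize = blockSize = 64MB, ~160 maps
        System.out.println(totalSize / computeSplitSize(goalSize, 1, blockSize));

        // mapred.min.split.size raised to 256MB -> splitSize = 256MB, ~40 maps
        System.out.println(totalSize / computeSplitSize(goalSize, 256L * 1024 * 1024, blockSize));
    }
}

With the default minSize of 1 the split size is capped at one block (64MB, about 160 Mappers); raising mapred.min.split.size to 256MB lifts the split size to 256MB and cuts the count to about 40.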

3. How to adjust the number of Maps

With the analysis in Section 2, it is now easy to see how to adjust the number of Maps.

3.1 Decrease the number of Mappers created when the Map-Reduce job starts

When processing large amounts of data, a common situation is that the job starts too many Mappers, exceeding the system limit and causing Hadoop to throw an exception and terminate. The way to resolve this is to reduce the number of Mappers, as follows:

3.1.1 The input files are huge, but they are not small files

This can be done by increasing the input handled by each Mapper, i.e. by increasing minSize or increasing blockSize, thereby reducing the number of Mappers required. Increasing blockSize is usually not practical, because a file's block size is fixed when the file is written to HDFS (governed by dfs.block.size at write time); changing the setting only affects files written afterwards, and existing data would have to be rewritten with the larger block size. So usually the only practical option is to increase minSize, that is, to increase the value of mapred.min.split.size, as in the sketch below.
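For instance (a minimal sketch; the 256MB value and the bare JobConf setup are only illustrative), raising mapred.min.split.size makes every split at least that large:

import org.apache.hadoop.mapred.JobConf;

public class FewerMappers {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Each InputSplit will be at least 256MB, because
        // splitSize = max(minSize, min(goalSize, blockSize))
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
        // ... set input/output paths, mapper, reducer, then submit the job
    }
}

The same setting can also be passed on the command line as -D mapred.min.split.size=268435456 when the job is launched through ToolRunner/GenericOptionsParser.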

3.1.2 The number of input files is huge, and they are all small files

A small file is one whose size is smaller than blockSize. In this situation increasing mapred.min.split.size does not help. You need to use CombineFileInputFormat, derived from FileInputFormat, which combines multiple input files into one InputSplit for a single Mapper to process, thus reducing the number of Mappers. The details will be expanded in a later update; a rough sketch is given below.
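As a rough sketch only: newer Hadoop releases ship a ready-made org.apache.hadoop.mapred.lib.CombineTextInputFormat; in hadoop-0.20.205.0 itself you would have to subclass CombineFileInputFormat and provide your own RecordReader. Assuming such a newer release, the job setup could look like this (the 256MB cap is illustrative):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.CombineTextInputFormat;

public class CombineSmallFiles {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Pack many small files into each InputSplit instead of one split per file
        conf.setInputFormat(CombineTextInputFormat.class);
        // Upper bound on the combined split size, so a single Mapper
        // does not receive an unbounded amount of data
        conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);
        // ... set input/output paths, mapper, reducer, then submit the job
    }
}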

3.2 Increase the number of Mappers created when Map-Reduce job starts

Increasing the number of Mappers is done by reducing the input handled by each Mapper, i.e. reducing splitSize. By the computeSplitSize formula this means making min(goalSize, blockSize) smaller, either by reducing blockSize (only effective for newly written data) or by lowering goalSize through a larger setNumMapTasks() hint, while keeping mapred.min.split.size small enough that it does not push splitSize back up (see the sketch below).
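A minimal sketch (the hint of 400 tasks and the 10GB input are illustrative assumptions): with 10GB of input and 64MB blocks, asking for 400 Map tasks makes goalSize about 25MB, so splitSize = max(1, min(25MB, 64MB)) ≈ 25MB and roughly 400 Mappers are started:

import org.apache.hadoop.mapred.JobConf;

public class MoreMappers {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Larger hint -> smaller goalSize -> splitSize drops below blockSize
        conf.setNumMapTasks(400);
        // Keep the minimum split size at its default so it does not override goalSize
        conf.setLong("mapred.min.split.size", 1);
        // ... set input/output paths, mapper, reducer, then submit the job
    }
}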
