Background:
Recently I needed to parse a large number of HTML blocks stored in a single file. The file is huge, close to 1 TB, so I wanted to process it with Hadoop. Its contents look like this:
htmlds.txt
<html>
<title>title1</title>
<div>xxxx</div>
....
</html>
<html>
<title>title2</title>
<div>xxxxxxx</div>
...
</html>
....
Here is the problem: Hadoop's default map logic applies the same processing to every line, but I need to treat each <html>...</html> block as one unit of processing. That requires customizing Hadoop's input format.
Basics [1]:
1. What is an input format?
In MapReduce, the abstract class InputFormat defines two methods: getSplits, which computes the input splits, and createRecordReader, which defines how the data in a split is read:
public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException;
}
The RecordReader class does the actual work of loading the data and turning it into key/value pairs suitable for the Mapper. It is called repeatedly on an input split until the whole split has been consumed, and each record it produces results in one call to the Mapper's map() method.
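To make this driving loop concrete, here is a small self-contained sketch in plain Java. ToyRecordReader and ToyLineRecordReader are illustrative stand-ins invented for this example, not the real Hadoop classes:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for the RecordReader contract described above.
interface ToyRecordReader<K, V> {
    boolean nextKeyValue(); // advance to the next record; false at end of split
    K getCurrentKey();
    V getCurrentValue();
}

// A toy reader that yields (byteOffset, line) pairs, like LineRecordReader does.
class ToyLineRecordReader implements ToyRecordReader<Long, String> {
    private final String[] lines;
    private int index = -1;
    private long pos = 0;    // byte position after the current line
    private long keyPos = 0; // byte offset of the current line

    ToyLineRecordReader(String[] lines) { this.lines = lines; }

    public boolean nextKeyValue() {
        if (index + 1 >= lines.length) return false;
        index++;
        keyPos = pos;
        pos += lines[index].length() + 1; // +1 for the stripped newline
        return true;
    }
    public Long getCurrentKey() { return keyPos; }
    public String getCurrentValue() { return lines[index]; }
}

public class Main {
    // The framework's driving loop: every successful nextKeyValue() becomes
    // one call to Mapper.map(key, value, context).
    static List<String> drive(ToyRecordReader<Long, String> reader) {
        List<String> records = new ArrayList<>();
        while (reader.nextKeyValue()) {
            records.add(reader.getCurrentKey() + ":" + reader.getCurrentValue());
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(drive(new ToyLineRecordReader(new String[] {"ab", "cde"})));
        // prints [0:ab, 3:cde]
    }
}
```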
2. What input formats are there?
The abstract class FileInputFormat extends InputFormat and is the parent of all input classes that operate on files. Common concrete InputFormat implementations include TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, CombineTextInputFormat, SequenceFileInputFormat, and DBInputFormat.
TextInputFormat
TextInputFormat is the default InputFormat. It provides a LineRecordReader, which uses each line of the input file as the value and that line's byte offset in the file as the key. Each record is one line of input: the key is a LongWritable holding the line's byte offset within the whole file, and the value is the line's content, excluding any line terminators (newline and carriage return).
KeyValueTextInputFormat
Each line is one record, split by a separator into key and value. The separator can be set in the driver class via conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, ...); the default separator is the tab character (\t). The key is then the Text that precedes the first tab on each line.
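A plain-Java sketch of that split rule (splitKeyValue is an illustrative helper written for this example, not the real KeyValueLineRecordReader code):

```java
public class Main {
    // KeyValueTextInputFormat splits each line at the FIRST occurrence of the
    // separator: the part before it is the key, the part after is the value.
    // If the separator is absent, the whole line becomes the key and the
    // value is empty.
    static String[] splitKeyValue(String line, char sep) {
        int i = line.indexOf(sep);
        if (i < 0) return new String[] { line, "" };
        return new String[] { line.substring(0, i), line.substring(i + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("title1\tbody text\twith a second tab", '\t');
        System.out.println(kv[0]); // prints: title1
        System.out.println(kv[1]); // the second tab stays inside the value
    }
}
```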
Implementation plan:
With the background above, we know what an input format is. How do we change it? By providing our own InputFormat implementation. Since what we need is to read multiple lines at a time, we can base our work on LineRecordReader. LineRecordReader finishes after reading a single line; to read one <html>...</html> block per record, we only have to change that logic so a read does not finish until we encounter </html>.
So we define a custom HtmlLineRecordReader class, extending RecordReader, and modify the reading logic inside nextKeyValue.
HtmlLineRecordReader.java
public boolean nextKeyValue() throws IOException {
  if (key == null) {
    key = new LongWritable();
  }
  key.set(pos);
  if (value == null) {
    value = new Text();
  }
  int newSize = 0;
  // flag tells readLine whether to clear the accumulated value before
  // appending; it is true only for the first line of a new record
  boolean flag = true;
  // curValue holds the single line just read; value accumulates all the
  // lines read so far for this record
  Text curValue = new Text();
  int cycleCount = 0;
  do {
    if (curValue.getLength() == 0) {
      cycleCount++;
    } else {
      cycleCount = 0;
    }
    while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
      if (pos == 0) {
        newSize = skipUtfByteOrderMark();
      } else {
        // overridden readLine that also takes curValue and flag
        newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos), curValue, flag);
        pos += newSize;
      }
      if ((newSize == 0) || (newSize < maxLineLength)) {
        break;
      }
      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
    }
    // if the line just read is not "</html>", keep accumulating into value
    // instead of clearing it on the next read
    if (!curValue.toString().equals("</html>")) {
      flag = false;
    }
    // stop once we hit "</html>"; cycleCount guards against runs of
    // consecutive blank lines in my data set (a rare case; adjust the limit
    // to suit your own data)
  } while (!curValue.toString().equals("</html>") && cycleCount < 20);
  if (newSize == 0) {
    key = null;
    value = null;
    return false;
  } else {
    return true;
  }
}
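Stripped of Hadoop's buffer management, the record-grouping rule that this nextKeyValue implements can be sketched in a few lines of plain Java (groupHtmlBlocks is an illustrative helper written for this example, not part of the project):

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    // Keep appending lines to the current record until a line equal to
    // "</html>" is seen, then emit the accumulated block as one record.
    static List<String> groupHtmlBlocks(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder block = new StringBuilder();
        for (String line : text.split("\n", -1)) {
            if (line.isEmpty()) continue;  // skip blank lines, as cycleCount tolerates
            block.append(line).append('\n');
            if (line.equals("</html>")) {  // end of one record
                records.add(block.toString());
                block.setLength(0);        // clear, like flag == true
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "<html>\n<title>title1</title>\n</html>\n"
                    + "<html>\n<title>title2</title>\n</html>\n";
        System.out.println(groupHtmlBlocks(data).size()); // prints 2
    }
}
```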
The readLine method lives in HtmlLineReader.java:
public int readLine(Text str, int maxLineLength, int maxBytesToConsume,
                    Text curValue, boolean flag) throws IOException {
  if (this.recordDelimiterBytes != null) {
    return readCustomLine(str, maxLineLength, maxBytesToConsume, curValue, flag);
  } else {
    return readDefaultLine(str, maxLineLength, maxBytesToConsume, curValue, flag);
  }
}
Then we overload the readCustomLine and readDefaultLine methods:
/**
 * Overloaded readCustomLine method
 */
private int readCustomLine(Text str, int maxLineLength, int maxBytesToConsume,
                           Text curValue, boolean flag) throws IOException {
  if (flag) {
    str.clear();
  }
  curValue.clear();
  int txtLength = 0; // tracks str.getLength(), as an optimization
  long bytesConsumed = 0;
  int delPosn = 0;
  int ambiguousByteCount = 0; // to capture the ambiguous characters count
  do {
    int startPosn = bufferPosn; // start from previous end position
    if (bufferPosn >= bufferLength) {
      startPosn = bufferPosn = 0;
      bufferLength = fillBuffer(in, buffer, ambiguousByteCount > 0);
      if (bufferLength <= 0) {
        if (ambiguousByteCount > 0) {
          str.append(recordDelimiterBytes, 0, ambiguousByteCount);
          curValue.append(recordDelimiterBytes, 0, ambiguousByteCount);
          bytesConsumed += ambiguousByteCount;
        }
        break; // EOF
      }
    }
    for (; bufferPosn < bufferLength; ++bufferPosn) {
      if (buffer[bufferPosn] == recordDelimiterBytes[delPosn]) {
        delPosn++;
        if (delPosn >= recordDelimiterBytes.length) {
          bufferPosn++;
          break;
        }
      } else if (delPosn != 0) {
        bufferPosn--;
        delPosn = 0;
      }
    }
    int readLength = bufferPosn - startPosn;
    bytesConsumed += readLength;
    int appendLength = readLength - delPosn;
    if (appendLength > maxLineLength - txtLength) {
      appendLength = maxLineLength - txtLength;
    }
    bytesConsumed += ambiguousByteCount;
    if (appendLength >= 0 && ambiguousByteCount > 0) {
      // appending the ambiguous characters (refer case 2.2)
      str.append(recordDelimiterBytes, 0, ambiguousByteCount);
      curValue.append(recordDelimiterBytes, 0, ambiguousByteCount);
      ambiguousByteCount = 0;
      // since it is now certain that the split did not split a delimiter we
      // should not read the next record: clear the flag otherwise duplicate
      // records could be generated
      unsetNeedAdditionalRecordAfterSplit();
    }
    if (appendLength > 0) {
      str.append(buffer, startPosn, appendLength);
      curValue.append(buffer, startPosn, appendLength);
      txtLength += appendLength;
    }
    if (bufferPosn >= bufferLength) {
      if (delPosn > 0 && delPosn < recordDelimiterBytes.length) {
        ambiguousByteCount = delPosn;
        bytesConsumed -= ambiguousByteCount; // to be consumed in next
      }
    }
  } while (delPosn < recordDelimiterBytes.length
      && bytesConsumed < maxBytesToConsume);
  if (bytesConsumed > Integer.MAX_VALUE) {
    throw new IOException("Too many bytes before delimiter: " + bytesConsumed);
  }
  return (int) bytesConsumed;
}
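The heart of readCustomLine is the byte-by-byte delimiter scan. Here is a simplified, self-contained sketch of that scan (findDelimiterEnd is an illustrative helper invented for this example); like the code above, on a mismatch mid-match it resets the partial match and re-examines the mismatched byte:

```java
public class Main {
    // Walk the buffer byte by byte, tracking how many delimiter bytes have
    // matched so far (delPosn); a complete match ends the record.
    static int findDelimiterEnd(byte[] buffer, byte[] delim) {
        int delPosn = 0;
        for (int pos = 0; pos < buffer.length; pos++) {
            if (buffer[pos] == delim[delPosn]) {
                delPosn++;
                if (delPosn == delim.length) {
                    return pos + 1; // index just past the delimiter
                }
            } else if (delPosn != 0) {
                pos--;      // re-check this byte against delim[0], as the
                delPosn = 0; // bufferPosn--/delPosn = 0 pair does above
            }
        }
        return -1; // no complete delimiter in this buffer
    }

    public static void main(String[] args) {
        byte[] data = "abc</html>xyz".getBytes();
        System.out.println(findDelimiterEnd(data, "</html>".getBytes())); // prints 10
    }
}
```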
/**
 * Overloaded readDefaultLine method
 */
private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume,
                            Text curValue, boolean flag) throws IOException {
  if (flag) {
    str.clear();
  }
  curValue.clear();
  int txtLength = 0; // tracks str.getLength(), as an optimization
  int newlineLength = 0; // length of terminating newline
  boolean prevCharCR = false; // true if prev char was CR
  long bytesConsumed = 0;
  do {
    int startPosn = bufferPosn; // starting from where we left off the last time
    if (bufferPosn >= bufferLength) {
      startPosn = bufferPosn = 0;
      if (prevCharCR) {
        ++bytesConsumed; // account for CR from previous read
      }
      bufferLength = fillBuffer(in, buffer, prevCharCR);
      if (bufferLength <= 0) {
        break; // EOF
      }
    }
    for (; bufferPosn < bufferLength; ++bufferPosn) { // search for newline
      if (buffer[bufferPosn] == LF) {
        newlineLength = (prevCharCR) ? 2 : 1;
        ++bufferPosn; // at next invocation proceed from following byte
        break;
      }
      if (prevCharCR) { // CR + notLF, we are at notLF
        newlineLength = 1;
        break;
      }
      prevCharCR = (buffer[bufferPosn] == CR);
    }
    int readLength = bufferPosn - startPosn;
    if (prevCharCR && newlineLength == 0) {
      --readLength; // CR at the end of the buffer
    }
    bytesConsumed += readLength;
    int appendLength = readLength - newlineLength;
    if (appendLength > maxLineLength - txtLength) {
      appendLength = maxLineLength - txtLength;
    }
    if (appendLength > 0) {
      str.append(buffer, startPosn, appendLength);
      curValue.append(buffer, startPosn, appendLength);
      txtLength += appendLength;
    }
  } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);
  if (bytesConsumed > Integer.MAX_VALUE) {
    throw new IOException("Too many bytes before newline: " + bytesConsumed);
  }
  return (int) bytesConsumed;
}
That completes the custom Hadoop input format. To recap:
- Provide a new InputFormat implementation; its two important classes are:
- HtmlLineRecordReader: its nextKeyValue method sets the condition that decides when the accumulated value gets cleared;
- HtmlLineReader: its readCustomLine and readDefaultLine methods perform the clearing according to that indicator.
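For completeness, the InputFormat wrapper itself can be as small as the following sketch. HtmlInputFormat is an assumed class name (the article only names the reader classes), and the constructor arguments for HtmlLineRecordReader depend on how it is declared in your project; the structure simply mirrors TextInputFormat:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hands out our multi-line reader; FileInputFormat supplies getSplits.
public class HtmlInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        return new HtmlLineRecordReader(); // adapt to your constructor
    }
}
```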
Once you understand the above, you'll find it is quite easy: just adjust the logic to your own needs; the remaining files mostly need nothing more than class renames (compression support and the like). If you want to try it right away, I have published the project on an open-source platform (DoubleDue/htmlmr on gitee.com), complete with a test data set, so you can download and run it yourself.
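Finally, a hypothetical driver sketch showing how the custom format would be wired into a job. HtmlInputFormat, HtmlMapper, and the job name are placeholder names for illustration; the actual driver is in the repository:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HtmlParseDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "html block parse");
        job.setJarByClass(HtmlParseDriver.class);
        job.setInputFormatClass(HtmlInputFormat.class); // our custom input format
        job.setMapperClass(HtmlMapper.class); // map() now receives one <html>...</html> block per call
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```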
References:
[1] Original article by CSDN blogger 机器熊技术大杂烩: https://blog.csdn.net/majianxiong_lzu/article/details/89206198