Project link
Continuously updated; stars are welcome.
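Preface
This project follows the book 《自己动手写Java虚拟机》 (Write Your Own Java Virtual Machine) to build a JVM.
The book implements a JVM in Go with no JIT, no PGO, and not even a GC, so it is fair to call it quite useless. I then rewrote that JVM in Java. A Java virtual machine written in Java takes uselessness to the next level. Does this approach count as bootstrapping?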
According to Wikipedia's definition of bootstrapping:
In computer science, bootstrapping is the technique for producing a self-compiling compiler — that is, a compiler (or assembler) written in the source programming language that it intends to compile. An initial core version of the compiler (the bootstrap compiler) is generated in a different language (which could be assembly language); successive expanded versions of the compiler are developed using this minimal subset of the language.
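Java is not a subset of the interpreter implemented here (the project interprets class files; it does not compile Java), so this does not count as bootstrapping.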
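Ecosystem aside (rewriting a project that runs locally does not depend much on the ecosystem), Go and Java both lean toward simplicity in their design, so the port is not hard:
- The stackful goroutines used in the original project can simply be replaced with threads; there are so few of them that no pooling is needed.
- A Channel can be replaced with a Buffer plus a Semaphore.
- The meta-capabilities that Go's type keyword provides can also be achieved in Java through runtime reflection, for example getClass().getName().
- Java generics add some convenience too, and memory and IO are clearly not a problem here.
- Libraries: Go's flag package for command-line parsing can be replaced with jcommander.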
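To make the Channel point concrete, here is a minimal sketch of a bounded channel built from a plain queue and two semaphores. The class name `SimpleChannel` and its methods are hypothetical and are not taken from the project; the real port may be structured differently.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Semaphore;

/** Hypothetical bounded channel; a sketch, not the project's actual class. */
final class SimpleChannel<T> {
    private final Queue<T> buffer = new ArrayDeque<>();
    private final Semaphore freeSlots;                       // permits = remaining capacity
    private final Semaphore filledSlots = new Semaphore(0);  // permits = queued items

    SimpleChannel(int capacity) {
        freeSlots = new Semaphore(capacity);
    }

    /** Blocks while the buffer is full, like a send on a buffered Go channel. */
    void send(T value) throws InterruptedException {
        freeSlots.acquire();
        synchronized (buffer) {
            buffer.add(value);
        }
        filledSlots.release();
    }

    /** Blocks while the buffer is empty, like a receive on a Go channel. */
    T receive() throws InterruptedException {
        filledSlots.acquire();
        T value;
        synchronized (buffer) {
            value = buffer.remove();
        }
        freeSlots.release();
        return value;
    }
}
```

A producer thread can call `send` while the main thread calls `receive`; the two semaphores keep both sides from overrunning the buffer. In practice `java.util.concurrent.ArrayBlockingQueue` gives the same behavior out of the box.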
In summary, this JVM poses no real technical difficulty; it is just a larger demo for learning how the JVM works.
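Class file search
The main files are in the classpath directory, and jcommander is used to parse command-line arguments.
jcommander documentation: https://jcommander.org/
Classes are searched for on the class path, which Java divides into three parts: the bootstrap class path, the extension class path, and the user class path.
The class path is specified by the user through command-line arguments.
The execution order is: initialize the class path -> look up the class supplied by the user.
The Entry interface represents a class path item and is implemented by four classes: DirEntry, ZipEntry, CompositeEntry, and WildcardEntry. DirEntry represents a class path in directory form, ZipEntry represents one in zip or jar form, CompositeEntry represents several paths joined by the path separator, and WildcardEntry represents a path ending in * that refers to all JAR files in a directory.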
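The Entry design described above can be sketched roughly as follows. All names here (`ClassPathEntry`, `DirClassPathEntry`, `ZipClassPathEntry`, `CompositeClassPathEntry`, `readClass`, `of`) are hypothetical stand-ins chosen to avoid clashing with `java.util.zip.ZipEntry`. The project's real signatures may differ, and the wildcard case is left out for brevity.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipFile;

/** One class path item; readClass returns null when the class is not found here. */
interface ClassPathEntry {
    byte[] readClass(String className) throws IOException; // e.g. "java/lang/Object.class"

    /** Picks an implementation from the shape of the path string. */
    static ClassPathEntry of(String path) {
        if (path.contains(File.pathSeparator)) {
            List<ClassPathEntry> parts = new ArrayList<>();
            for (String p : path.split(File.pathSeparator)) {
                parts.add(of(p));
            }
            return new CompositeClassPathEntry(parts);
        }
        String lower = path.toLowerCase();
        if (lower.endsWith(".jar") || lower.endsWith(".zip")) {
            return new ZipClassPathEntry(Paths.get(path));
        }
        return new DirClassPathEntry(Paths.get(path));
    }
}

final class DirClassPathEntry implements ClassPathEntry {
    private final Path dir;
    DirClassPathEntry(Path dir) { this.dir = dir.toAbsolutePath(); }

    @Override
    public byte[] readClass(String className) throws IOException {
        Path file = dir.resolve(className);
        return Files.isRegularFile(file) ? Files.readAllBytes(file) : null;
    }
}

final class ZipClassPathEntry implements ClassPathEntry {
    private final Path archive;
    ZipClassPathEntry(Path archive) { this.archive = archive.toAbsolutePath(); }

    @Override
    public byte[] readClass(String className) throws IOException {
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            java.util.zip.ZipEntry item = zip.getEntry(className);
            if (item == null) return null;
            try (InputStream in = zip.getInputStream(item)) {
                return in.readAllBytes(); // available since Java 9
            }
        }
    }
}

final class CompositeClassPathEntry implements ClassPathEntry {
    private final List<ClassPathEntry> entries;
    CompositeClassPathEntry(List<ClassPathEntry> entries) { this.entries = entries; }

    @Override
    public byte[] readClass(String className) throws IOException {
        for (ClassPathEntry e : entries) {   // first entry that has the class wins
            byte[] data = e.readClass(className);
            if (data != null) return data;
        }
        return null;
    }
}
```

Returning null for "not found" keeps the composite lookup a simple loop; throwing an exception instead would work just as well.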
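Class file parsing
The basic data unit of a class file is the byte, and multi-byte data is stored in big-endian order.
The key piece is the ClassReader class, which helps with byte-level operations.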
import java.nio.ByteBuffer;

/**
 * @author treblez
 * @Description Helper class for reading data
 */
public class ClassReader {
    private final ByteBuffer buf;
    ClassReader(byte[] data){
        buf = ByteBuffer.allocate(data.length+5);
        buf.put(data);
        // Note: reset the position to 0 so that reads start from the beginning
        buf.rewind();
    }
    public byte readUint8() {
        return buf.get();
    }
    public char readUint16() {
        byte[] tmp = new byte[2];
        buf.get(tmp,0,2);
        return (char) (((tmp[0] & 0xFF) << 8) | (tmp[1] & 0xFF));
    }
    public int readUint32() {
        byte[] tmp = new byte[4];
        buf.get(tmp,0,4);
        // Mind the operator precedence: & binds more loosely than <<
        return ((tmp[3]&0xff) |((tmp[2]&0xff) << 8) | ((tmp[1]&0xff) << 16) | ((tmp[0]&0xff) << 24));
    }
    public long readUint64() {
        byte[] tmp = new byte[8];
        buf.get(tmp,0,8);
        // Each byte is masked before shifting; byte 4 is widened to long so bit 31 does not sign-extend
        return (((long)(tmp[0] & 0xFF) << 56) | ((long)(tmp[1] & 0xFF) << 48) | ((long)(tmp[2] & 0xFF) << 40)
                | ((long)(tmp[3] & 0xFF) << 32) |
                ((long)(tmp[4] & 0xFF) << 24) | ((tmp[5] & 0xFF) << 16) | ((tmp[6] & 0xFF) << 8) | (tmp[7] & 0xFF));
    }
    /**
     * Reads a uint16 table whose size is given by the leading uint16
     */
    public char[] readUint16s() {
        var n = readUint16();
        char[] s = new char[n];
        for(int i=0;i<n;i++){
            s[i] = readUint16();
        }
        return s;
    }
    public byte[] readBytes(int n) {
        byte[] ret = new byte[n];
        buf.get(ret, 0, n);
        return ret;
    }
}
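As a side note, ByteBuffer reads in big-endian order by default, so the same unsigned reads can be delegated to its built-in getters. The sketch below shows that alternative; `BigEndianReads` is a hypothetical helper that is not part of the project, and it widens the results so that values above the signed range stay positive.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: unsigned big-endian reads built on ByteBuffer's own getters. */
final class BigEndianReads {
    /** u2 as a non-negative int. */
    static int readUint16(ByteBuffer buf) {
        return buf.getShort() & 0xFFFF;
    }

    /** u4 as a non-negative long. */
    static long readUint32(ByteBuffer buf) {
        return buf.getInt() & 0xFFFFFFFFL;
    }

    /** u8; Java has no unsigned long, so the raw 64 bits are returned as-is. */
    static long readUint64(ByteBuffer buf) {
        return buf.getLong();
    }
}
```

For example, wrapping the bytes CA FE BA BE and calling `readUint32` yields `0xCAFEBABEL`.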
The byte stream is read in the following order:
void read(ClassReader reader) throws Exception {
    // Verify the magic number
    readAndCheckMagic(reader);
    // Verify the version number
    readAndCheckVersion(reader);
    // Read the constant pool
    constantPool = new ConstantPool().readConstantPool(reader);
    // Class access flags (bitmask)
    accessFlags = reader.readUint16();
    /*
     * Class and superclass indices. thisClass must be a valid constant pool index;
     * superClass is 0 only in Object.class and must be valid in every other file
     */
    thisClass = reader.readUint16();
    superClass = reader.readUint16();
    // Interface index table: the names of all interfaces this class implements
    interfaces = reader.readUint16s();
    // Field table
    fields = MemberInfo.readMembers(reader, constantPool);
    // Method table
    methods = MemberInfo.readMembers(reader, constantPool);
    // Attribute table
    attributes = AttributeInfo.readAttributes(reader, constantPool);
}
The magic number must be 0xCAFEBABE.
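A magic number check can be as small as the sketch below. The name `readAndCheckMagic` matches the call in the read method above, but this body is only an assumption about what it does: it reads from a ByteBuffer directly instead of the project's ClassReader, and `MagicCheck` is a hypothetical holder class.

```java
import java.nio.ByteBuffer;

/** Hypothetical holder class for the magic number check. */
final class MagicCheck {
    static final int MAGIC = 0xCAFEBABE;

    /** Consumes the first four bytes and rejects anything that is not a class file. */
    static void readAndCheckMagic(ByteBuffer buf) {
        int magic = buf.getInt(); // ByteBuffer is big-endian by default
        if (magic != MAGIC) {
            throw new ClassFormatError("magic!");
        }
    }
}
```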
The class, superclass, and interface table are all stored as constant pool indices.
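Fields, methods, and classes all have access flags implemented as a bitmask. For a field or method, the access flags are followed by a constant pool index giving its name, then another constant pool index giving its descriptor, and finally the attribute table.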
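Testing a bitmask flag is a single AND. The sketch below uses a few class-level flag values from the JVM specification; the `AccessFlags` class and its `has` helper are hypothetical and only illustrate the idea.

```java
/** Hypothetical helper with a handful of class access flags from the JVM specification. */
final class AccessFlags {
    static final int ACC_PUBLIC    = 0x0001;
    static final int ACC_FINAL     = 0x0010;
    static final int ACC_SUPER     = 0x0020;
    static final int ACC_INTERFACE = 0x0200;
    static final int ACC_ABSTRACT  = 0x0400;

    /** True when the given flag bit is set in accessFlags. */
    static boolean has(int accessFlags, int flag) {
        return (accessFlags & flag) != 0;
    }
}
```

An ordinary public class typically has access flags 0x0021, so `has(0x0021, ACC_PUBLIC)` and `has(0x0021, ACC_SUPER)` are true while `has(0x0021, ACC_FINAL)` is false.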
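The constant pool holds a great deal of constant information, including numeric and string constants, class and interface names, and field and method names. An 8-bit unsigned integer tags the type of each constant: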
int CONSTANT_CLASS = 7;
int CONSTANT_FIELDREF = 9;
int CONSTANT_METHODREF = 10;
int CONSTANT_INTERFACE_METHODREF = 11;
int CONSTANT_STRING = 8;
int CONSTANT_INTEGER = 3;
int CONSTANT_FLOAT = 4;
int CONSTANT_LONG = 5;
int CONSTANT_DOUBLE = 6;
int CONSTANT_NAME_AND_TYPE = 12;
int CONSTANT_UTF8 = 1;
int CONSTANT_METHOD_HANDLE = 15;
int CONSTANT_METHOD_TYPE = 16;
int CONSTANT_INVOKE_DYNAMIC = 18;
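/**
 * Reads the constant's data
 *
 * @param reader the class file reader
 */
void readInfo(ClassReader reader) throws IOException;
/**
 * Reads the tag, creates the concrete constant with new, then calls readInfo to read its data
 * @param reader the class file reader
 * @param cp the constant pool that owns the constant
 * @return the parsed constant
 * @throws Exception if the tag is not recognized
 */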
static ConstantInfo readConstantInfo(ClassReader reader, ConstantPool cp) throws Exception {
var tag = reader.readUint8();
ConstantInfo ret = switch (tag) {
case CONSTANT_INTEGER -> new ConstantIntegerInfo();
case CONSTANT_FLOAT -> new ConstantFloatInfo();
case CONSTANT_LONG -> new ConstantLongInfo();
case CONSTANT_DOUBLE -> new ConstantDoubleInfo();
case CONSTANT_UTF8 -> new ConstantUtf8Info();
case CONSTANT_STRING -> new ConstantStringInfo(cp);
case CONSTANT_CLASS -> new ConstantClassInfo(cp);
case CONSTANT_FIELDREF -> new ConstantFieldRefInfo(cp);
case CONSTANT_METHODREF -> new ConstantMethodRefInfo(cp);
case CONSTANT_INTERFACE_METHODREF -> new ConstantInterfaceMethodRefInfo(cp);
case CONSTANT_NAME_AND_TYPE -> new ConstantNameAndTypeInfo();
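// The next three entries support the Java SE 7 invokedynamic instruction,
// i.e. the method referenced by the call site specifier is resolved at run time and then executed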
case CONSTANT_METHOD_TYPE -> new ConstantMethodTypeInfo();
case CONSTANT_METHOD_HANDLE -> new ConstantMethodHandleInfo();
case CONSTANT_INVOKE_DYNAMIC -> new ConstantInvokeDynamicInfo();
default -> throw new Exception("java.lang.ClassFormatError: constant pool tag!");
};
ret.readInfo(reader);
return ret;
}
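The loop that drives readConstantInfo is not shown here, so the following is only a sketch of how such a loop usually looks, based on two rules from the JVM specification: valid indices start at 1, and a long or double constant occupies two slots. `ConstantPoolSketch` is hypothetical, handles only three tags, and uses DataInputStream rather than the project's ClassReader.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

/** Hypothetical reader that handles only the Utf8, Integer and Long tags. */
final class ConstantPoolSketch {
    static Object[] read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int count = in.readUnsignedShort();   // constant_pool_count
        Object[] pool = new Object[count];    // index 0 is never used
        for (int i = 1; i < count; i++) {
            int tag = in.readUnsignedByte();
            switch (tag) {
                case 1 -> pool[i] = in.readUTF();   // CONSTANT_UTF8: u2 length + MUTF-8 bytes
                case 3 -> pool[i] = in.readInt();   // CONSTANT_INTEGER
                case 5 -> {                         // CONSTANT_LONG takes two slots
                    pool[i] = in.readLong();
                    i++;
                }
                default -> throw new ClassFormatError("constant pool tag " + tag);
            }
        }
        return pool;
    }
}
```

With a pool count of 5 holding a Utf8, a Long, and an Integer, the entries land at indices 1, 2, and 4, and index 3 stays empty.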
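A method's bytecode is stored in the attribute table. Deprecated (use discouraged) and Synthetic (not present in the source file) act as markers; SourceFile gives the source file name; ConstantValue holds the value of a constant expression; the Code attribute stores bytecode and other method information; Exceptions lists the exceptions a method may throw; and LineNumberTable and LocalVariableTable store a method's line number and local variable information.
Note also that strings in the constant pool are stored in MUTF-8 (modified UTF-8) and have to be decoded by hand.
The core code is as follows: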
protected CoderResult decodeLoop(final ByteBuffer source, final CharBuffer target) {
// Track the position of the source buffer, so that consumed but
// unused octets can be "put back". The value of this variable is
// explicitly incremented each time a character is successfully
// decoded, in order to avoid having to query the source buffer via
// an unnecessary method invocation inside the loop.
int sourcePosition = source.position();
while (true) {
try {
final byte a = source.get();
// The first three bits of the first octet determine the
// length of the octet sequence. Simultaneously checking the
// fourth bit is a cheap way to avoid an explicit check for
// an invalid leading octet of 1111xxxx.
//
// Shifting the four high bits to the four low bits makes
// the switch labels nearly contiguous. This enables the
// compiler to use the tableswitch instruction, rather than
// the lookupswitch instruction. The distinction could be
// significant for a tight loop, though in this case a
// modern JIT compiler would probably be able to optimize
// away any difference.
switch ((a & 0xFF) >> 4) {
case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7: {
// first octet 0xxxxxxx
// 000000000aaaaaaa as 0aaaaaaa
// final char ch = (char)(a);
// target.put(ch);
target.put((char)(a));
sourcePosition += 1;
break;
}
case 12: case 13: {
// first octet 110xxxxx
// 00000aaaaabbbbbb as 110aaaaa 10bbbbbb
final byte b = source.get();
if ((b & 0xC0) != 0x80) {
return CoderResult.malformedForLength(2);
}
// final char ch = (char)(((a & 0x1F) << 6) | (b & 0x3F));
// target.put(ch);
target.put((char)(((a & 0x1F) << 6) | (b & 0x3F)));
sourcePosition += 2;
break;
}
case 14: {
// first octet 1110xxxx
// aaaabbbbbbcccccc as 1110aaaa 10bbbbbb 10cccccc
final byte b = source.get();
if ((b & 0xC0) != 0x80) {
return CoderResult.malformedForLength(2);
}
final byte c = source.get();
if ((c & 0xC0) != 0x80) {
return CoderResult.malformedForLength(3);
}
// final char ch = (char)(((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F));
// target.put(ch);
target.put((char)(((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F)));
sourcePosition += 3;
break;
}
// case 8: case 9: case 10: case 11:
// first octet 10xxxxxx
// case 15:
// first octet 1111xxxx
default: {
return CoderResult.malformedForLength(1);
}
}
} catch (final BufferUnderflowException e) {
// "Put back" unused octets of a partial character.
source.position(sourcePosition);
return CoderResult.UNDERFLOW;
} catch (final BufferOverflowException e) {
// "Put back" unused octets of a full character.
source.position(sourcePosition);
return CoderResult.OVERFLOW;
}
}
}
}
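For comparison, the JDK already ships an MUTF-8 decoder: DataInputStream.readUTF reads a two-byte length followed by modified UTF-8 bytes. The sketch below prepends that length to raw bytes and delegates to it. `Mutf8` is a hypothetical helper, not the project's decoder, and it is limited to inputs of at most 65535 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

/** Hypothetical helper that decodes raw MUTF-8 bytes through the JDK. */
final class Mutf8 {
    static String decode(byte[] bytes) throws IOException {
        if (bytes.length > 0xFFFF) {
            throw new IllegalArgumentException("too long for a u2 length prefix");
        }
        // readUTF expects a big-endian u2 byte count in front of the payload
        byte[] framed = new byte[bytes.length + 2];
        framed[0] = (byte) (bytes.length >>> 8);
        framed[1] = (byte) bytes.length;
        System.arraycopy(bytes, 0, framed, 2, bytes.length);
        return new DataInputStream(new ByteArrayInputStream(framed)).readUTF();
    }
}
```

Two things set MUTF-8 apart from standard UTF-8: the null character is encoded as the two bytes C0 80, and supplementary characters are stored as a surrogate pair with each half encoded in three bytes. Both cases go through `decode` unchanged.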