Writing a Java Virtual Machine in Java: Reading and Parsing Class Files

SakamataZ

Last modified 2022-03-29 22:43:49


Category: Compiler Principles. Tags: golang, compiler principles, java

First published 2021-11-16 23:06:52

Article link: https://blog.csdn.net/treblez/article/details/121348712


Project link
Updated continuously; stars are welcome.


Preface

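This project follows the book 《自己动手写Java虚拟机》 (Write Your Own Java Virtual Machine) to build a JVM.
The book writes a JVM in Go with no JIT, no PGO, and not even a GC, so it is fairly useless in practice. I then rewrote that JVM in Java, and a Java virtual machine written in Java takes the uselessness one step further. Does this approach count as bootstrapping?
According to Wikipedia's definition of bootstrapping: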

In computer science, bootstrapping is the technique for producing a self-compiling compiler — that is, a compiler (or assembler) written in the source programming language that it intends to compile. An initial core version of the compiler (the bootstrap compiler) is generated in a different language (which could be assembly language); successive expanded versions of the compiler are developed using this minimal subset of the language.

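The Java used to write this interpreter is not a subset of what the interpreter itself supports, so this does not count as bootstrapping.
Ecosystem aside (rewriting a project that runs locally does not depend much on the ecosystem), Go and Java both lean toward simplicity in their design, so the port is not difficult: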

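The stackful goroutines used in the original project can simply be replaced with threads, and no pooling is needed because there are very few of them.
A Channel can be replaced with a buffer plus a Semaphore.
The meta-programming ability that Go's type keyword provides can also be achieved in Java through runtime reflection with getClass().getName().
Java generics bring some convenience as well, and memory and IO are clearly not a concern here.
Libraries: Go's flag package for command-line parsing can be replaced with jcommander.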
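As a rough illustration of the Channel point above, here is a minimal sketch of a buffered channel built from a queue and two semaphores. The class name SimpleChannel and its methods are hypothetical and are not taken from the project; the sketch also assumes a capacity of at least 1, so it does not model Go's unbuffered channels.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Semaphore;

/** Hypothetical stand-in for a buffered Go channel (capacity must be >= 1). */
public class SimpleChannel<T> {
    private final Queue<T> buffer = new ArrayDeque<>();
    private final Semaphore freeSlots;                 // permits = remaining capacity
    private final Semaphore items = new Semaphore(0);  // permits = queued elements

    public SimpleChannel(int capacity) {
        freeSlots = new Semaphore(capacity);
    }

    /** Blocks while the buffer is full, similar to `ch <- v` in Go. */
    public void send(T value) throws InterruptedException {
        freeSlots.acquire();
        synchronized (buffer) {
            buffer.add(value);
        }
        items.release();
    }

    /** Blocks while the buffer is empty, similar to `<-ch` in Go. */
    public T receive() throws InterruptedException {
        items.acquire();
        T value;
        synchronized (buffer) {
            value = buffer.remove();
        }
        freeSlots.release();
        return value;
    }

    /** Sends 0..4 from a producer thread and returns the sum seen by the consumer. */
    static int demo() {
        SimpleChannel<Integer> ch = new SimpleChannel<>(2);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) {
                    ch.send(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < 5; i++) {
                sum += ch.receive();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 10
    }
}
```

In real code java.util.concurrent.ArrayBlockingQueue already provides this behavior; the hand-rolled version only shows that the buffer plus Semaphore combination is enough.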

In short, this JVM is not technically difficult; it is just a somewhat larger demo for learning how the JVM works.

Class file search

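The main files are under the classpath directory, and jcommander is used to parse command-line arguments.
jcommander documentation: https://jcommander.org/
Classes are searched for on the class path. The Java class path has three parts: the bootstrap class path, the extension class path, and the user class path.
The class path is specified by the user through command-line arguments.
The execution order is: initialize the class path -> look up the class requested by the user.
The Entry interface represents a class path item and is implemented by four classes: DirEntry, ZipEntry, CompositeEntry, and WildcardEntry. DirEntry represents a class path in directory form, ZipEntry represents one in zip or jar form, CompositeEntry represents several paths joined by the path separator, and WildcardEntry represents a path ending in *, which refers to all files in a directory.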
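The following sketch shows one way the Entry abstraction described above could look. Only the names Entry, DirEntry, ZipEntry, CompositeEntry, and WildcardEntry come from the article; the readClass signature, the kindOf helper, and the wrapper class EntrySketch are assumptions made for illustration and may differ from the project's code.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class EntrySketch {
    /** A class path item that can load the bytes of a class file such as "java/lang/Object.class". */
    interface Entry {
        byte[] readClass(String className) throws IOException;
    }

    /** Directory-style class path item (assumed shape). */
    static final class DirEntry implements Entry {
        private final Path absDir;

        DirEntry(String path) {
            absDir = Paths.get(path).toAbsolutePath();
        }

        @Override
        public byte[] readClass(String className) throws IOException {
            return Files.readAllBytes(absDir.resolve(className));
        }
    }

    /** Hypothetical helper: picks the entry type from the shape of the path string. */
    static String kindOf(String path) {
        if (path.contains(File.pathSeparator)) {
            return "CompositeEntry";
        }
        if (path.endsWith("*")) {
            return "WildcardEntry";
        }
        String lower = path.toLowerCase();
        if (lower.endsWith(".jar") || lower.endsWith(".zip")) {
            return "ZipEntry";
        }
        return "DirEntry";
    }

    public static void main(String[] args) throws IOException {
        // Write a two-byte fake class file into a temp directory and read it back.
        Path dir = Files.createTempDirectory("cp");
        Files.write(dir.resolve("Demo.class"), new byte[] {(byte) 0xCA, (byte) 0xFE});
        Entry entry = new DirEntry(dir.toString());
        System.out.println(entry.readClass("Demo.class").length); // prints 2
        System.out.println(kindOf("lib/*"));                      // prints WildcardEntry
    }
}
```

Checking for the path separator first means a composite path whose last element ends in * is still split before the wildcard is expanded.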

Class file parsing


The basic data unit of a class file is the byte, and multi-byte data is stored in big-endian order.
The key class is ClassReader, which assists with byte-level reads.
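To make the big-endian point concrete, the small sketch below reads the same two bytes three ways. The class and method names are made up for this example. java.nio.ByteBuffer defaults to big-endian order, so it matches the class file layout without further configuration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    /** ByteBuffer's default order is big-endian: the first byte is the high byte. */
    static int bigEndianU2(byte[] d) {
        return ByteBuffer.wrap(d).getShort() & 0xFFFF;
    }

    /** Same bytes read as little-endian, shown only for contrast. */
    static int littleEndianU2(byte[] d) {
        return ByteBuffer.wrap(d).order(ByteOrder.LITTLE_ENDIAN).getShort() & 0xFFFF;
    }

    /** Manual big-endian assembly: mask each byte, then shift and combine. */
    static int manualU2(byte[] d) {
        return ((d[0] & 0xFF) << 8) | (d[1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] data = {0x12, 0x34};
        System.out.println(Integer.toHexString(bigEndianU2(data)));    // prints 1234
        System.out.println(Integer.toHexString(littleEndianU2(data))); // prints 3412
        System.out.println(Integer.toHexString(manualU2(data)));       // prints 1234
    }
}
```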

import java.nio.ByteBuffer;

/**
 * @author treblez
 * @Description Helper class for reading data
 */
public class ClassReader {
    private final ByteBuffer buf;
    ClassReader(byte[] data){
        buf = ByteBuffer.allocate(data.length+5);
        buf.put(data);
        // Note: rewind() resets the position to 0 so that reads start at the first byte
        buf.rewind();
    }
    
    public byte readUint8() {
        return buf.get();
    }
    
    public char readUint16() {
        byte[] tmp = new byte[2];
        buf.get(tmp,0,2);
        return (char) (((tmp[0] & 0xFF) << 8) | (tmp[1] & 0xFF));
    }
    
    public int readUint32()  {
        byte[] tmp = new byte[4];
        buf.get(tmp,0,4);
        // Mind operator precedence: mask each byte before shifting
        return  ((tmp[3]&0xff) |((tmp[2]&0xff) << 8) | ((tmp[1]&0xff)  << 16) | ((tmp[0]&0xff) << 24));
    }
    
    public long readUint64() {
        byte[] tmp = new byte[8];
        buf.get(tmp,0,8);
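        // Mask each byte first, then cast to long before shifting so the value is never sign-extended
        return  (((long)(tmp[0] & 0xFF) << 56) | ((long)(tmp[1] & 0xFF) << 48) | ((long)(tmp[2] & 0xFF) << 40)
                | ((long)(tmp[3] & 0xFF) << 32) |
                ((long)(tmp[4] & 0xFF) << 24) | ((long)(tmp[5] & 0xFF) << 16) | ((long)(tmp[6] & 0xFF) << 8) | (long)(tmp[7] & 0xFF));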
    }
    /**
     * Reads a uint16 table whose size is given by the leading uint16
     */
    public char[] readUint16s() {
        var n = readUint16();
        char[] s = new char[n];
        for(int i=0;i<n;i++){
            s[i] = readUint16();
        }
        return s;
    }

    public byte[] readBytes(int n) {
        byte[] ret = new byte[n];
        buf.get(ret, 0, n);
        return ret;
    }
}
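The precedence note in readUint32 deserves a closer look, because in Java the shift operator binds tighter than the bitwise AND. The sketch below (class and method names are hypothetical) compares three ways of placing one byte at bits 24 to 31 of a long.

```java
public class ShiftPrecedenceDemo {
    /** Buggy: parsed as b & (0xFF << 24), so the mask lands on the wrong bits. */
    static long unparenthesized(byte b) {
        return b & 0xFF << 24;
    }

    /** Still risky: the shift happens in int, so a high bit turns into a sign bit. */
    static long intShift(byte b) {
        return (b & 0xFF) << 24;
    }

    /** Correct: mask to 0..255, widen to long, then shift. */
    static long longShift(byte b) {
        return (long) (b & 0xFF) << 24;
    }

    public static void main(String[] args) {
        byte b = (byte) 0x80;
        System.out.println(unparenthesized(b)); // prints -16777216
        System.out.println(intShift(b));        // prints -2147483648
        System.out.println(longShift(b));       // prints 2147483648
    }
}
```

Only the last form yields the unsigned value, which is why every term of a 64-bit read needs both the parentheses and the long cast.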

The byte stream is read in the following order:

    void read(ClassReader reader) throws Exception {
        // Check the magic number
        readAndCheckMagic(reader);
        // Check the version number
        readAndCheckVersion(reader);
        // Read the constant pool
        constantPool = new ConstantPool().readConstantPool(reader);
        // Class access flags (bitmask)
        accessFlags = reader.readUint16();
        /*
         * Class and superclass indices. thisClass must be a valid constant pool index;
         * superClass is 0 only in Object.class and must be valid in every other file
         */
        thisClass = reader.readUint16();
        superClass = reader.readUint16();
        // Interface index table: names of all interfaces implemented by this class
        interfaces = reader.readUint16s();
        // Field table
        fields = MemberInfo.readMembers(reader, constantPool);
        // Method table
        methods = MemberInfo.readMembers(reader, constantPool);
        // Attribute table
        attributes = AttributeInfo.readAttributes(reader, constantPool);
    }
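The first two steps of the read order above can be sketched as follows. The article does not show the bodies of readAndCheckMagic and readAndCheckVersion, so everything here is an assumption: the sketch reads from a ByteBuffer rather than the project's ClassReader, and the accepted major version range of 45 to 52 is chosen for illustration only.

```java
import java.nio.ByteBuffer;

public class HeaderCheck {
    static final int MAGIC = 0xCAFEBABE;

    /** Throws if the first four bytes are not the class file magic number. */
    static void readAndCheckMagic(ByteBuffer buf) {
        int magic = buf.getInt();
        if (magic != MAGIC) {
            throw new ClassFormatError("magic: 0x" + Integer.toHexString(magic));
        }
    }

    /** Reads minor then major version; the 45..52 range is an assumption of this sketch. */
    static int readAndCheckVersion(ByteBuffer buf) {
        int minor = buf.getShort() & 0xFFFF;
        int major = buf.getShort() & 0xFFFF;
        if (major < 45 || major > 52) {
            throw new UnsupportedClassVersionError(major + "." + minor);
        }
        return major;
    }

    public static void main(String[] args) {
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        ByteBuffer buf = ByteBuffer.wrap(header);
        readAndCheckMagic(buf);
        System.out.println(readAndCheckVersion(buf)); // prints 52
    }
}
```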

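The magic number must be 0xCAFEBABE. The class, superclass, and interface table are all stored as constant pool indices.
Fields, methods, and classes all carry access flags implemented as a bitmask. For a field or method, the access flags are followed by constant pool indices that give its name and descriptor, and finally by an attribute table.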
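As a small illustration of the bitmask idea, the sketch below decodes a few class-level access flags. The flag values are the ones defined by the JVM specification; the class name and the describe helper are hypothetical.

```java
import java.util.StringJoiner;

public class AccessFlagsDemo {
    static final int ACC_PUBLIC    = 0x0001;
    static final int ACC_FINAL     = 0x0010;
    static final int ACC_SUPER     = 0x0020;
    static final int ACC_INTERFACE = 0x0200;
    static final int ACC_ABSTRACT  = 0x0400;

    /** A flag is set when its bit survives the AND with the mask. */
    static boolean isSet(int flags, int mask) {
        return (flags & mask) != 0;
    }

    /** Lists the names of the flags that are set, in ascending bit order. */
    static String describe(int flags) {
        StringJoiner names = new StringJoiner(" ");
        if (isSet(flags, ACC_PUBLIC)) names.add("public");
        if (isSet(flags, ACC_FINAL)) names.add("final");
        if (isSet(flags, ACC_SUPER)) names.add("super");
        if (isSet(flags, ACC_INTERFACE)) names.add("interface");
        if (isSet(flags, ACC_ABSTRACT)) names.add("abstract");
        return names.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(0x0021)); // prints public super
        System.out.println(describe(0x0601)); // prints public interface abstract
    }
}
```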
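The constant pool holds a large amount of constant information, including numeric and string constants, class and interface names, and field and method names. An 8-bit unsigned integer tags the type of each constant: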

    int CONSTANT_CLASS = 7;
    int CONSTANT_FIELDREF = 9;
    int CONSTANT_METHODREF = 10;
    int CONSTANT_INTERFACE_METHODREF = 11;
    int CONSTANT_STRING = 8;
    int CONSTANT_INTEGER = 3;
    int CONSTANT_FLOAT = 4;
    int CONSTANT_LONG = 5;
    int CONSTANT_DOUBLE = 6;
    int CONSTANT_NAME_AND_TYPE = 12;
    int CONSTANT_UTF8 = 1;
    int CONSTANT_METHOD_HANDLE = 15;
    int CONSTANT_METHOD_TYPE = 16;
    int CONSTANT_INVOKE_DYNAMIC = 18;

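    /**
     * Reads the constant's data
     *
     * @param reader source of the class file bytes
     */
    void readInfo(ClassReader reader) throws IOException;

    /**
     * Reads the tag, creates the concrete constant with new, then calls readInfo to read its data
     * @param reader source of the class file bytes
     * @param cp the constant pool being built
     * @return the parsed constant
     * @throws Exception if the tag is unknown
     */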
    static ConstantInfo readConstantInfo(ClassReader reader, ConstantPool cp) throws Exception {
        var tag = reader.readUint8();
        ConstantInfo ret = switch (tag) {
            case CONSTANT_INTEGER -> new ConstantIntegerInfo();
            case CONSTANT_FLOAT -> new ConstantFloatInfo();
            case CONSTANT_LONG -> new ConstantLongInfo();
            case CONSTANT_DOUBLE -> new ConstantDoubleInfo();
            case CONSTANT_UTF8 -> new ConstantUtf8Info();
            case CONSTANT_STRING -> new ConstantStringInfo(cp);
            case CONSTANT_CLASS -> new ConstantClassInfo(cp);
            case CONSTANT_FIELDREF -> new ConstantFieldRefInfo(cp);
            case CONSTANT_METHODREF -> new ConstantMethodRefInfo(cp);
            case CONSTANT_INTERFACE_METHODREF -> new ConstantInterfaceMethodRefInfo(cp);
            case CONSTANT_NAME_AND_TYPE -> new ConstantNameAndTypeInfo();
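            // The next three support the invokedynamic instruction added in Java SE 7,
            // which resolves the method referenced by the call site specifier at run time and then invokes it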
            case CONSTANT_METHOD_TYPE -> new ConstantMethodTypeInfo();
            case CONSTANT_METHOD_HANDLE -> new ConstantMethodHandleInfo();
            case CONSTANT_INVOKE_DYNAMIC -> new ConstantInvokeDynamicInfo();
            default -> throw new Exception("java.lang.ClassFormatError: constant pool tag!");
        };
        ret.readInfo(reader);
        return ret;
    }
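The article does not show the loop that drives readConstantInfo, so the sketch below is an assumption-laden illustration of one detail that such a loop has to handle: valid indices start at 1, and each CONSTANT_Long or CONSTANT_Double occupies two slots. Instead of building ConstantInfo objects it only records the tag of each slot and skips the payload. The class and method names are hypothetical, and the payload sizes follow the JVM specification.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ConstantPoolWalk {
    /** Returns tags by slot; slot 0 and the second slot of a long/double stay 0. */
    static int[] readTags(ByteBuffer buf) {
        int count = buf.getShort() & 0xFFFF;   // constant_pool_count
        int[] tags = new int[count];
        for (int i = 1; i < count; i++) {      // valid indices are 1..count-1
            int tag = buf.get() & 0xFF;
            tags[i] = tag;
            switch (tag) {
                case 1:                        // Utf8: u2 length + bytes
                    skip(buf, buf.getShort() & 0xFFFF);
                    break;
                case 3: case 4:                // Integer, Float
                    skip(buf, 4);
                    break;
                case 5: case 6:                // Long, Double take two slots
                    skip(buf, 8);
                    i++;
                    break;
                case 7: case 8: case 16:       // Class, String, MethodType
                    skip(buf, 2);
                    break;
                case 9: case 10: case 11: case 12: case 18: // refs, NameAndType, InvokeDynamic
                    skip(buf, 4);
                    break;
                case 15:                       // MethodHandle
                    skip(buf, 3);
                    break;
                default:
                    throw new ClassFormatError("constant pool tag " + tag);
            }
        }
        return tags;
    }

    static void skip(ByteBuffer buf, int n) {
        buf.position(buf.position() + n);
    }

    public static void main(String[] args) {
        byte[] data = {
            0x00, 0x04,                       // constant_pool_count = 4
            0x05, 0, 0, 0, 0, 0, 0, 0, 0x2A,  // #1 Long 42, also occupies #2
            0x01, 0x00, 0x02, 'H', 'i'        // #3 Utf8 "Hi"
        };
        System.out.println(Arrays.toString(readTags(ByteBuffer.wrap(data)))); // prints [0, 5, 0, 1]
    }
}
```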

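A method's bytecode is stored in the attribute table. Deprecated (use discouraged) and Synthetic (not present in the source file) act as markers, SourceFile gives the source file name, ConstantValue holds the value of a constant expression, the Code attribute stores the bytecode and other method information, Exceptions lists the exceptions a method may throw, and LineNumberTable and LocalVariableTable store a method's line numbers and local variable information.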
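Every attribute shares the same outer layout: a u2 index of its name in the constant pool, a u4 length, and then that many payload bytes. The sketch below walks such a table and keeps the raw payloads. It is not the project's AttributeInfo.readAttributes; the RawAttribute class and the String array standing in for the constant pool are simplifications assumed for this example.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class AttributeWalk {
    /** Name and undecoded payload of one attribute (hypothetical helper type). */
    static final class RawAttribute {
        final String name;
        final byte[] info;

        RawAttribute(String name, byte[] info) {
            this.name = name;
            this.info = info;
        }
    }

    /** utf8ByIndex stands in for the constant pool: index -> Utf8 string. */
    static List<RawAttribute> readAttributes(ByteBuffer buf, String[] utf8ByIndex) {
        int count = buf.getShort() & 0xFFFF;          // attributes_count
        List<RawAttribute> out = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            int nameIndex = buf.getShort() & 0xFFFF;  // attribute_name_index
            int length = buf.getInt();                // attribute_length (u4)
            byte[] info = new byte[length];
            buf.get(info);                            // payload, decoded later by name
            out.add(new RawAttribute(utf8ByIndex[nameIndex], info));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] pool = {null, "Deprecated", "SourceFile", "Demo.java"};
        byte[] data = {
            0x00, 0x02,                          // two attributes
            0x00, 0x01, 0, 0, 0, 0,              // Deprecated, length 0
            0x00, 0x02, 0, 0, 0, 2, 0x00, 0x03   // SourceFile, payload = index 3
        };
        for (RawAttribute a : readAttributes(ByteBuffer.wrap(data), pool)) {
            System.out.println(a.name + " " + a.info.length);
        }
        // prints "Deprecated 0" then "SourceFile 2"
    }
}
```

Because the length is always present, a parser can skip attributes it does not recognize and still stay aligned with the rest of the file.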
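Also note that strings in the constant pool are stored in MUTF-8 (modified UTF-8), which has to be decoded by hand into Java characters.
The core code is as follows: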

        protected CoderResult decodeLoop(final ByteBuffer source, final CharBuffer target) {
            // Track the position of the source buffer, so that consumed but
            // unused octets can be "put back". The value of this variable is
            // explicitly incremented each time a character is successfully
            // decoded, in order to avoid having to query the source buffer via
            // an unnecessary method invocation inside the loop.
            int sourcePosition = source.position();
            while (true) {
                try {
                    final byte a = source.get();
                    // The first three bits of the first octet determine the
                    // length of the octet sequence. Simultaneously checking the
                    // fourth bit is a cheap way to avoid an explicit check for
                    // an invalid leading octet of 1111xxxx.
                    //
                    // Shifting the four high bits to the four low bits makes
                    // the switch labels nearly contiguous. This enables the
                    // compiler to use the tableswitch instruction, rather than
                    // the lookupswitch instruction. The distinction could be
                    // significant for a tight loop, though in this case a
                    // modern JIT compiler would probably be able to optimize
                    // away any difference.
                    switch ((a & 0xFF) >> 4) {
                        case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7: {
                            // first octet 0xxxxxxx
                            // 000000000aaaaaaa as 0aaaaaaa
                         // final char ch = (char)(a);
                         // target.put(ch);
                            target.put((char)(a));
                            sourcePosition += 1;
                            break;
                        }
                        case 12: case 13: {
                            // first octet 110xxxxx
                            // 00000aaaaabbbbbb as 110aaaaa 10bbbbbb
                            final byte b = source.get();
                            if ((b & 0xC0) != 0x80) {
                                return CoderResult.malformedForLength(2);
                            }
                         // final char ch = (char)(((a & 0x1F) << 6) | (b & 0x3F));
                         // target.put(ch);
                            target.put((char)(((a & 0x1F) << 6) | (b & 0x3F)));
                            sourcePosition += 2;
                            break;
                        }
                        case 14: {
                            // first octet 1110xxxx
                            // aaaabbbbbbcccccc as 1110aaaa 10bbbbbb 10cccccc
                            final byte b = source.get();
                            if ((b & 0xC0) != 0x80) {
                                return CoderResult.malformedForLength(2);
                            }
                            final byte c = source.get();
                            if ((c & 0xC0) != 0x80) {
                                return CoderResult.malformedForLength(3);
                            }
                         // final char ch = (char)(((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F));
                         // target.put(ch);
                            target.put((char)(((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F)));
                            sourcePosition += 3;
                            break;
                        }
                     // case 8: case 9: case 10: case 11:
                            // first octet 10xxxxxx
                     // case 15:
                            // first octet 1111xxxx
                        default: {
                            return CoderResult.malformedForLength(1);
                        }
                    }
                } catch (final BufferUnderflowException e) {
                    // "Put back" unused octets of a partial character.
                    source.position(sourcePosition);
                    return CoderResult.UNDERFLOW;
                } catch (final BufferOverflowException e) {
                    // "Put back" unused octets of a full character.
                    source.position(sourcePosition);
                    return CoderResult.OVERFLOW;
                }
            }
        }
    }
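As an alternative to a hand-written decoder, the JDK's DataInputStream.readUTF already understands modified UTF-8. The sketch below is not the approach taken in the article; it frames the raw bytes of a CONSTANT_Utf8 entry with the u2 length prefix that readUTF expects. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class MutfDemo {
    /** Decodes MUTF-8 bytes (without the u2 length prefix; at most 65535 bytes). */
    static String decodeMutf8(byte[] bytes) {
        try {
            ByteArrayOutputStream framed = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(framed);
            out.writeShort(bytes.length);   // readUTF expects a u2 length first
            out.write(bytes);
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed.toByteArray()));
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In MUTF-8 the null character is encoded as the two bytes C0 80.
        byte[] bytes = {0x41, (byte) 0xC0, (byte) 0x80, 0x42};
        String s = decodeMutf8(bytes);
        System.out.println(s.length());        // prints 3
        System.out.println((int) s.charAt(1)); // prints 0
    }
}
```

The two-byte encoding of the null character is one of the places where MUTF-8 departs from standard UTF-8, which is why a plain UTF-8 decoder is not a safe substitute.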