[Compiler Principles Course Project 1] --- Preprocessing + Identifier Recognition + Lexical Analysis in Java

安澜122

Last modified 2024-07-06 22:33:37

Tags: java ide

First published 2024-07-06 22:11:14

Article link: https://blog.csdn.net/qq_62711363/article/details/140236410

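[Task] Before lexical analysis, preprocess the source program submitted by the programmer and strip comments and other unnecessary characters, so that the lexical analysis itself is simpler.

[Input] The source program as a string.

[Output] The processed source program as a string.

[Problem] Design a program that removes C-style comments from an arbitrary string, including:

1. Single-line comments of the form //…;

2. Multi-line comments of the form /*…*/.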
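The two comment forms above can be sketched as a small state machine over a single string. This is only an illustrative sketch: the class name CommentStripper and the method strip are hypothetical, and it assumes that string and character literals (including escaped quotes) must be kept verbatim and that each block comment is replaced by one space, as a C compiler would do.

```java
// Hypothetical helper for illustration; not part of the original project.
final class CommentStripper {
    private enum State { CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT }

    static String strip(String src) {
        StringBuilder out = new StringBuilder();
        State state = State.CODE;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            char next = (i + 1 < src.length()) ? src.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && next == '/') {
                        state = State.LINE_COMMENT;
                        i++;
                    } else if (c == '/' && next == '*') {
                        state = State.BLOCK_COMMENT;
                        i++;
                    } else {
                        if (c == '"') state = State.STRING;
                        else if (c == '\'') state = State.CHAR;
                        out.append(c);
                    }
                    break;
                case STRING:
                case CHAR:
                    out.append(c);
                    if (c == '\\' && i + 1 < src.length()) {
                        out.append(next); // keep the escaped character, e.g. \" or \\
                        i++;
                    } else if ((state == State.STRING && c == '"')
                            || (state == State.CHAR && c == '\'')) {
                        state = State.CODE; // closing quote reached
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\n') { // a line comment ends at the newline, which is kept
                        state = State.CODE;
                        out.append(c);
                    }
                    break;
                case BLOCK_COMMENT:
                    if (c == '*' && next == '/') {
                        state = State.CODE;
                        i++;
                        out.append(' '); // assumption: a block comment becomes one space
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("int a = 1; // counter\nchar *s = \"// kept\"; /* gone */ int b;"));
    }
}
```

Tracking literals as separate states is what keeps a "//" inside a string from being mistaken for a comment.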

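[Task] Following the word-formation rules of the given source language, recognize all legal identifiers in an arbitrary string.

[Input] A string.

[Output] A stream of word symbols, one word per line.

[Problem] Design a program that recognizes, in an arbitrary string, every substring that can be regarded as a C "name". Note:

1. Formation rule: a word that starts with a letter, followed by any number of letters and digits; length at most 15; case-insensitive; the underscore is treated as the 27th letter.

2. Keywords are reserved: the language definition reserves certain words as keywords, and these may not be used as "names" (variable names, constant names, function names, label names, and so on).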
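The two rules above can also be written as one regular expression plus a keyword lookup. The following is a minimal sketch under stated assumptions: the class IdentifierRule and the method isLegalName are hypothetical names, the keyword set is a deliberately tiny subset chosen for illustration, and "case-insensitive" is interpreted as comparing the lowercased word against the keyword table.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original project.
final class IdentifierRule {
    // A letter or underscore first, then at most 14 more letters, digits,
    // or underscores, which gives a total length of at most 15.
    private static final Pattern NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]{0,14}");

    // Assumed keyword subset, for illustration only.
    private static final Set<String> KEYWORDS = Set.of("int", "if", "else", "while", "return");

    static boolean isLegalName(String word) {
        if (!NAME.matcher(word).matches()) {
            return false;
        }
        // Case-insensitive language: "IF" and "if" are the same reserved word.
        return !KEYWORDS.contains(word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String w : new String[] {"_count1", "1abc", "IF", "a234567890123456"}) {
            System.out.println(w + " -> " + isLegalName(w));
        }
    }
}
```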

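[Task] Following the word-formation rules of the given source language, recognize all legal word symbols of the language in an arbitrary string and output them as equal-length 2-tuples.

[Input] The source program as a string.

[Output] The string (stream) of word symbols, each presented as an equal-length 2-tuple.

[Problem] Design a program that, following the word-formation rules of the given source language, recognizes all legal word symbols of the language in an arbitrary string and outputs them as equal-length 2-tuples.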
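One possible reading of "equal-length" is that every tuple is padded to the same printed width. The sketch below illustrates that reading only: the class TokenFormatter, the method format, and the column widths (15 for the word, matching the maximum identifier length, and 4 for the attribute) are all assumptions rather than part of the assignment.

```java
// Hypothetical helper for illustration; not part of the original project.
final class TokenFormatter {
    // Left-justify the word in 15 columns and the attribute in 4 columns,
    // so every tuple is 23 characters wide as long as both values fit.
    static String format(String lexeme, String attribute) {
        return String.format("(%-15s, %-4s)", lexeme, attribute);
    }

    public static void main(String[] args) {
        System.out.println(format("counter", "1"));
        System.out.println(format("x", "12"));
        System.out.println(format("+", "_"));
    }
}
```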

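Approach: the goal of this experiment is to implement a simple lexical analyzer. A Lexer class provides the utility methods that do the preparatory work, and the Preprocessor class uses a state machine to remove comments.

Lexer class: contains the utility methods (read: reads the input file; shouldSkipLine: decides whether a line is irrelevant and should be skipped; splitIntoWords: splits a line into words; isValidIdentifier: checks whether a string is a valid identifier; isPunctuation: checks whether a character is a punctuation mark; isStringLiteral: checks whether a string is a string literal; isConstant: checks whether a string is a number).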

package Lexer;

import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class Lexer {
    // Paths of the input and output files
    private static final String INPUT_FILE_PATH = "input.txt";
    private static final String OUTPUT_FILE_PATH = "Output_Lexer.txt";

    public static void main(String[] args) {
        try {
            // Preprocessing and lexical analysis
            preprocessAndAnalyze();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Runs preprocessing, then identifier recognition and lexical analysis
    private static void preprocessAndAnalyze() throws IOException {
        Preprocessor.preprocess2(INPUT_FILE_PATH, OUTPUT_FILE_PATH);
        System.out.println("Preprocessing finished; result written to: " + OUTPUT_FILE_PATH);
        List<String> lines = read(OUTPUT_FILE_PATH);
        IdentifierRecognizer.identify(lines);
        LexicalAnalyzer.analyze(lines);
    }

    // Reads the file, skipping lines that should not be analyzed
    private static List<String> read(String filePath) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!shouldSkipLine(line)) {
                    lines.add(line);
                }
            }
        }
        return lines;
    }

    // Decides whether the current line should be skipped
    private static boolean shouldSkipLine(String line) {
        return line.contains("print(") || line.contains("printf(") || line.contains("scanf(")
                || line.startsWith("#include") || line.contains("error:");
    }

    // Splits a line into words
    public static List<String> splitIntoWords(String line) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : line.toCharArray()) {
            if (Character.isWhitespace(c)) {
                if (!word.isEmpty()) {
                    words.add(word.toString());
                    word.setLength(0);
                }
            } else if (isPunctuation(c)) {
                if (!word.isEmpty()) {
                    words.add(word.toString());
                    word.setLength(0);
                }
                words.add(Character.toString(c));
            } else {
                word.append(c);
            }
        }
        if (!word.isEmpty()) {
            words.add(word.toString());
        }
        return words;
    }

    // Checks whether a string is a valid identifier
    public static boolean isValidIdentifier(String word) {
        if (word.length() > 15 || word.isEmpty()) return false;
        if (!Character.isLetter(word.charAt(0)) && word.charAt(0) != '_') return false;
        for (char c : word.substring(1).toCharArray()) {
            if (!Character.isLetterOrDigit(c) && c != '_') return false;
        }
        return true;
    }

    // Checks whether a character is a punctuation mark
    public static boolean isPunctuation(char c) {
        String punctuations = ".,;:!?'\"(){}[]<>-+*/";
        return punctuations.indexOf(c) != -1;
    }

    // Checks whether a word is a string literal
    public static boolean isStringLiteral(String word) {
        return word.startsWith("\"") && word.endsWith("\"");
    }

    // Checks whether a word is a numeric constant
    public static boolean isConstant(String word) {
        try {
            Double.parseDouble(word);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
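A note on isConstant: Double.parseDouble accepts more than plain decimal numbers, for example a type suffix ("1f"), an exponent ("1e3"), or a hexadecimal floating-point literal ("0x1p3"). If only unsigned decimal integers and fractions should count as constants, a regular expression is one stricter alternative. The sketch below assumes exactly that narrower definition; the class NumericConstant and its method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original project.
final class NumericConstant {
    // Assumed definition: one or more digits, optionally followed by a fraction.
    private static final Pattern DECIMAL = Pattern.compile("\\d+(\\.\\d+)?");

    static boolean isDecimalConstant(String word) {
        return DECIMAL.matcher(word).matches();
    }

    // Same check as the isConstant method in the listing above, kept for comparison.
    static boolean parsesAsDouble(String word) {
        try {
            Double.parseDouble(word);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String w : new String[] {"42", "3.14", "1f", "1e3", "0x1p3"}) {
            System.out.println(w + ": parseDouble=" + parsesAsDouble(w)
                    + ", regex=" + isDecimalConstant(w));
        }
    }
}
```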

IdentifierRecognizer class: takes the source code as a list of lines, splits each line into words, and prints every word that satisfies the identifier formation rule and is not a reserved keyword.

package Lexer;

import java.util.List;

// Takes the source lines, splits each line into words, and prints the words
// that satisfy the identifier formation rule and are not reserved keywords
public class IdentifierRecognizer {
    public static void identify(List<String> lines) {
        for (String line : lines) { // iterate over the source lines
            List<String> words = Lexer.splitIntoWords(line); // split the line into words
            for (String word : words) {
                // A legal identifier matches the formation rule and is not a keyword
                if (Lexer.isValidIdentifier(word) && !Identifier.isKeyword(word)) {
                    System.out.println(word);
                }
            }
        }
    }
}

Identifier class: a hand-written table of keywords and operators.

package Lexer;

import java.util.*;
public class Identifier {
    // Keywords, including main
    static final Set<String> keywords = new HashSet<>(Arrays.asList(
            "auto", "short", "int", "long", "float", "double", "char", "struct",
            "union", "enum", "typedef", "const", "unsigned", "signed", "extern",
            "register", "static", "volatile", "void", "if", "else", "switch",
            "case", "for", "do", "while", "goto", "continue", "break", "default",
            "sizeof", "return", "main", "define","print","printf","scanf","include"
    ));

    static final Set<String> keywords_short = new HashSet<>(Arrays.asList(
            "int", "long", "float", "double", "char","byte","void","short","boolean","String"
    ));

    // Operators and delimiters
    static final Set<String> operators = new HashSet<>(Arrays.asList(
            "+", "-", "*", "/", "%", "++", "--", "==", "!=", ">", ">=", "<", "<=",
            "&&", "||", "!", "=", "+=", "-=", "*=", "/=", "%=", "<<=", ">>=", "&=",
            "^=", "|=", "&", "|", "^", "~", "<<", ">>", "?", ":", ",", ".", "->",
            "(", ")", "[", "]", "{", "}", ";"
    ));

    // Returns true if s is a reserved keyword
    public static boolean isKeyword(String s) {
        return keywords.contains(s);
    }

    // Returns true if s is one of the basic type names
    public static boolean isKeyword_short(String s) {
        return keywords_short.contains(s);
    }

    // Returns true if s is an operator or delimiter
    public static boolean isOperator(String s) {
        return operators.contains(s);
    }

}
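The operator table contains multi-character operators such as "++" and "<<=", while Lexer.splitIntoWords emits punctuation one character at a time. One common way to recognize such operators is a longest-match ("maximal munch") scan. The sketch below illustrates the idea on its own: the class OperatorScanner, its methods, and the reduced operator set are assumptions for illustration, not the project's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the original project.
final class OperatorScanner {
    // Assumed subset of the operator table; the longest entry has 3 characters.
    private static final Set<String> OPERATORS = Set.of(
            "+", "-", "=", "<", ">", "++", "--", "==", "+=", "<=", ">=",
            "<<", ">>", "<<=", ">>=");
    private static final int MAX_LEN = 3;

    // Splits text into operators (longest match first) and runs of other
    // non-blank characters.
    static List<String> scan(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder other = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            String op = longestOperatorAt(text, i);
            if (op != null || Character.isWhitespace(text.charAt(i))) {
                if (other.length() > 0) { // flush the pending non-operator word
                    tokens.add(other.toString());
                    other.setLength(0);
                }
                if (op != null) {
                    tokens.add(op);
                    i += op.length();
                } else {
                    i++; // skip the whitespace character
                }
            } else {
                other.append(text.charAt(i));
                i++;
            }
        }
        if (other.length() > 0) {
            tokens.add(other.toString());
        }
        return tokens;
    }

    // Tries the longest candidate first, so "<<=" wins over "<<" and "<".
    private static String longestOperatorAt(String text, int start) {
        for (int len = Math.min(MAX_LEN, text.length() - start); len >= 1; len--) {
            String candidate = text.substring(start, start + len);
            if (OPERATORS.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(scan("a<<=b++")); // prints [a, <<=, b, ++]
    }
}
```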

Preprocessor class: removes comments.

package Lexer;

import java.io.*;
import java.nio.file.*;

class Preprocessor {

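    // Preprocessing method 1: remove comments with a regular expression
    // (string literals are not recognized by this version)
    public static String preprocess1(String sourceCode) {
        // One left-to-right pass over both comment forms, so whichever comment
        // starts first wins: a "//" inside /* ... */ cannot swallow the closing "*/"
        return sourceCode.replaceAll("/\\*[\\s\\S]*?\\*/|//.*", "");
    }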

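    // Preprocessing method 2: read the file line by line and strip comments
    private static boolean inComment = false; // whether we are inside a block comment
    private static boolean inStringLiteral = false; // whether we are inside a string literal

    // Preprocesses the file and writes the result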
    public static void preprocess2(String inputFilePath, String outputFilePath) throws IOException {
        Path inputPath = Paths.get(inputFilePath);
        Path outputPath = Paths.get(outputFilePath);

        try (BufferedReader reader = Files.newBufferedReader(inputPath);
             BufferedWriter writer = Files.newBufferedWriter(outputPath)) {

            String line;
            while ((line = reader.readLine()) != null) {
                String processedLine = processLine(line);
                if (!processedLine.isEmpty()) {
                    writer.write(processedLine);
                    writer.newLine();
                }
            }

            if (inComment) {
                writer.write("error: unterminated comment");
                writer.newLine();
            }
        }
    }

    // Processes one line and removes its comments
    private static String processLine(String line) {
        StringBuilder processedLine = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char curChar = line.charAt(i);
            char nextChar = (i + 1 < line.length()) ? line.charAt(i + 1) : '\0';

            // Handle string literals
            if (curChar == '"' && !inComment) {
                inStringLiteral = !inStringLiteral;
                processedLine.append(curChar);
                continue;
            }

            // Handle comments
            if (!inStringLiteral) {
                if (!inComment) {
                    // Start of a multi-line comment
                    if (curChar == '/' && nextChar == '*') {
                        inComment = true;
                        i++;
                        // Start of a single-line comment
                    } else if (curChar == '/' && nextChar == '/') {
                        break;
                    } else {
                        processedLine.append(curChar);
                    }
                } else if (curChar == '*' && nextChar == '/') {
                    // End of a multi-line comment
                    inComment = false;
                    i++;
                }
            } else {
                processedLine.append(curChar);
            }
        }
        return processedLine.toString();
    }
}

LexicalAnalyzer class: the lexical analysis module.

package Lexer;

import java.util.*;

// Lexical analysis module
class LexicalAnalyzer {
    private static final Map<String, Integer> identifierLocation = new LinkedHashMap<>();
    private static final Map<String, Integer> constantTimes = new HashMap<>();
    public static void analyze(List<String> lines) {
        for (String line : lines) {
            List<String> words = Lexer.splitIntoWords(line);
            for (String word : words) {
                if (Lexer.isValidIdentifier(word) && !Identifier.isKeyword(word)) {
                    // An identifier that is not a keyword: number it in order of first
                    // appearance; size() + 1 advances only when a new entry is inserted
                    identifierLocation.putIfAbsent(word, identifierLocation.size() + 1);
                    System.out.println("(" + word + ", " + identifierLocation.get(word) + ")");
                } else if (Lexer.isConstant(word)) {
                    // A numeric constant: count how many times it has occurred so far
                    constantTimes.merge(word, 1, Integer::sum);
                    System.out.println("(" + word + ", " + constantTimes.get(word) + ")");
                } else if (Lexer.isStringLiteral(word) || Identifier.isOperator(word)) {
                    // A string literal or an operator
                    System.out.println("(" + word + ", _)");
                }
            }
        }
    }
}
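When numbering identifiers with Map.putIfAbsent, it helps to remember that Java evaluates method arguments before the call, so a counter expression such as index++ is incremented even when the key is already present. The sketch below compares two numbering strategies on the same input; the class SymbolNumbering and its method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original project.
final class SymbolNumbering {
    // Counter passed as an argument: it is incremented on every call,
    // even when the word is already in the table, so numbers can be skipped.
    static Map<String, Integer> withCounter(List<String> words) {
        Map<String, Integer> table = new LinkedHashMap<>();
        int index = 1;
        for (String w : words) {
            table.putIfAbsent(w, index++);
        }
        return table;
    }

    // Number derived from the table size: it advances only when a new
    // word is actually inserted, so the numbers stay consecutive.
    static Map<String, Integer> withSize(List<String> words) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (String w : words) {
            table.putIfAbsent(w, table.size() + 1);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> words = List.of("a", "a", "b");
        System.out.println(withCounter(words)); // prints {a=1, b=3}
        System.out.println(withSize(words));    // prints {a=1, b=2}
    }
}
```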

Project structure and run screenshots: [figures]

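Start: the Lexer class program starts.
Main function main: the program's entry point; it calls the preprocessAndAnalyze method.
Preprocessing and lexical analysis preprocessAndAnalyze: the method that carries out the preprocessing and lexical analysis tasks.
Call Preprocessor.preprocess2 for preprocessing: the input file is processed with the Preprocessor.preprocess2 method.
Write the preprocessing result to a file: the result is written to the specified output file, and a message tells the user that preprocessing is complete.
Identifier recognition IdentifierRecognizer.identify: prints the legal identifiers found in the preprocessed lines.
Lexical analysis LexicalAnalyzer.analyze: performs lexical analysis on the preprocessed lines.
End: the program finishes.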