Parsing a passage of text with regular expressions

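1. Regular expressions
".*?": lazily matches any characters. At least one side needs a literal anchor, otherwise the lazy match is empty. For example, "DOB (.*?);" matches from 'DOB ' through the next ';' (the full match includes both delimiters; group 1 holds only the text between them).
"(?<=(,|China))": lookbehind. Matches a position preceded by ',' or 'China'; the ',' or 'China' itself is not part of the match.
"(?=;)": lookahead. Matches a position followed by ';'; the ';' itself is not part of the match. For example, (?<=(,|China)).*?(?=;) extracts the text that starts after ',' or 'China' and ends before the next ';'.
"([\u4e00-\u9fa5]+(\W[\u4e00-\u9fa5]+)*)": matches Chinese text. Runs of Chinese characters may be joined by single non-word characters such as '(' or ')'.
"A(?!B)": matches an A that is not followed by B. For example, "(.*?)(?= \((?![A-Z]{3,}))" matches text up to a ' (' whose parenthesis is not immediately followed by three or more consecutive uppercase letters.
"([ ]?\w+)*": matches consecutive runs of letters, digits, or underscores, each optionally preceded by one space. For example, "Company Number([ ]?\w+)*" matches the letters and digits that follow 'Company Number'.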
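The building blocks listed above can be tried out in isolation before combining them. The following is a minimal sketch in plain Java; the class name, the `first` helper, and the "DOB" and "ACME LTD" inputs are made up for illustration, while the COSCO and Company Number strings are excerpts of the samples used in the example further down.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBuildingBlocks {
    // Hypothetical helper: returns the given group of the first match, or null if there is none.
    static String first(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // Lazy match between two literal anchors: group 0 keeps them, group 1 drops them.
        String dob = "DOB 01 Jan 1970; POB Cairo;"; // made-up sample
        System.out.println(first("DOB (.*?);", dob, 0)); // DOB 01 Jan 1970;
        System.out.println(first("DOB (.*?);", dob, 1)); // 01 Jan 1970

        // Lookbehind and lookahead: neither ',' nor ';' ends up in the match.
        System.out.println(first("(?<=(,|China)).*?(?=;)", "ACME LTD,Room 2105; more", 0)); // Room 2105

        // Negative lookahead: " (DALIAN" is skipped because capitals follow the '('.
        String name = "COSCO SHIPPING TANKER (DALIAN) CO., LTD. (a.k.a. COSCO SHIPPING TANKER DALIAN;";
        System.out.println(first("(.*?)(?= \\((?![A-Z]{3,}))", name, 1)); // COSCO SHIPPING TANKER (DALIAN) CO., LTD.

        // Space-separated word runs after a literal prefix; trim() drops the leading space.
        String company = "Company Number IMO 5938411 [DPRK4].";
        System.out.println(first("Company Number(([ ]?\\w+)*)", company, 1).trim()); // IMO 5938411
    }
}
```

Note that the lookbehind example yields the text after the first ',' it finds, so an input with a space after the comma would return a match with a leading space.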

2. Example: parsing a passage of text with regular expressions

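// The test below is Groovy (list literals, join(), each {}); java.util.* is imported by default.
// Assumed dependencies: Apache Commons Lang 3 for StringUtils and JUnit 4 for @Test.
import java.util.regex.Matcher
import java.util.regex.Pattern
import org.apache.commons.lang3.StringUtils
import org.junit.Test

@Test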
    public void test4() {
        Map<String, String> map = new HashMap<>();
        String content = "CHANG AN SHIPPING & TECHNOLOGY (Chinese Traditional: 長安海運技術有限公司) (a.k.a. CHANG AN SHIPPING AND TECHNOLOGY), Room 2105, DL1849, Trend Centre, 29-31 Cheung Lee Street, Chai Wan, Hong Kong, China; Secondary sanctions risk: North Korea Sanctions Regulations, sections 510.201 and 510.210; Transactions Prohibited For Persons Owned or Controlled By U.S. Financial Institutions: North Korea Sanctions Regulations section 510.214; Company Number IMO 5938411 [DPRK4]."
        //content = "ALWEFAQ LTD, 15 Grognet Street, Mosta MST 3613, Malta; 22 Freedom Street, Famagusta, Cyprus; Registration Number C 68939 (Malta) [LIBYA3] (Linked To: MUSBAH, Nourddin Milood M; Linked To: WADI, Musbah Mohamad M)."


        // Escape the dots so that only the literal "a.k.a." prefix matches
        Matcher matcher_aka = Pattern.compile("(a\\.k\\.a\\. (.*?)[;()])+").matcher(content);
        List<String> akaList = new ArrayList<>();
        while (matcher_aka.find()) {
            String AKA = matcher_aka.group(2);
            akaList.add(AKA);
        }
        if (akaList.size() > 0) {
            map.put("SDN_a.k.a.", akaList.join(";"));
        }
        // The first pair of parentheses usually holds the Chinese name
        Matcher matcher_china_name = Pattern.compile("\\((.*?)\\)[, ]").matcher(content);
        if (matcher_china_name.find()) {
            String group = matcher_china_name.group(1);
            Matcher matcher_simplified = Pattern.compile("Chinese (Simplified|Traditional): ([\\u4e00-\\u9fa5]+(\\W[\\u4e00-\\u9fa5]+)*)").matcher(group);
            List<String> chList = new ArrayList<>();
            while (matcher_simplified.find()) {
                String simplified = matcher_simplified.group(2);
                chList.add(simplified);
            }
            if (chList.size() > 0) {
                map.put("SDN_Chinese_Name", chList.join(";"));
            }
        }

        Matcher matcher_Program = Pattern.compile("(\\[(.*?)])+").matcher(content);
        List<String> programList = new ArrayList<>();
        while (matcher_Program.find()) {
            String cate = matcher_Program.group(2);
            programList.add(StringUtils.trimToEmpty(cate));
        }
        if (programList.size() > 0) {
            map.put("SDN_Program", programList.join(";"));
        }
        // Sample entries the English-name pattern below has to handle:
        //BEIJING SUKBAKSO, Qixingmen Store, No. 8 Apartment, Fangcaodi West Road, Chaoyang District, Beijing 100020,
        //COSCO SHIPPING TANKER (DALIAN) CO., LTD. (a.k.a. COSCO SHIPPING TANKER DALIAN; a.k.a. DALI
        Matcher matcher_en_name = Pattern.compile("(.*?)((?= \\((?![A-Z]{3,}))|[A-Z]{3,}\\.?(?=,))").matcher(content);
        if (matcher_en_name.find()) {
            String name = matcher_en_name.group();
            map.put("SDN_English_Name", name);
        }

        // Place names that end an address, usually a country or a region such as Hong Kong
        String[] addressEnd = ["Hong Kong", "Taiwan", "Germany", "Iran", "United Arab Emirates", "Korea, North", "Thailand", "Marshall Islands", "Russia", "Malaysia", "Kuwait", "British"];
        // Place-name words that appear inside an address
        String[] wordContain = ["Street", "District", "Road", " St\\W", " Rd\\W", " Lu\\W", " RD\\(S\\)", "Block", "Avenue", "Highway", "Tower", "dong", "Province", "Frankfurt am Main", "Xinjiang", "Jumeirah Bay"];
        // Cities that appear inside an address; used to detect forms such as "Moscow 101000"
        String[] cityContain = ["Beijing", "Shanghai", "Moscow", "Berlin", "Kuala Lumpur"];
        List<String> addressContain = new ArrayList<>(Arrays.asList(wordContain));
        cityContain.each { city -> addressContain.add(city + " \\d+") }
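        String name_en = map.get("SDN_English_Name");
        // Pattern.quote keeps characters such as '(' or '.' in the name from being read as regex syntax;
        // the name alternative is left out entirely when no English name was found
        String namePrefix = name_en == null ? "" : Pattern.quote(name_en) + ", |";
        Matcher matcher_address = Pattern.compile("(?<=(\\),|China[)]?;|" + namePrefix + addressEnd.join(";|") + ";))(.*?)(" + addressContain.join("|") + ")(.*?)(?=[;\\[])").matcher(content);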
        List<String> addrlist = new ArrayList<>();
        while (matcher_address.find()) {
            String address = matcher_address.group().trim();
            addrlist.add(StringUtils.trimToEmpty(address));
        }
        if (addrlist.size() > 0) {
            map.put("SDN_Address", addrlist.join(";"));
        }

        Matcher matcher_Registration = Pattern.compile("Registration Number(([ /]?\\w+)*( \\(([ ]?\\w+)*\\))*)").matcher(content);
        if (matcher_Registration.find()) {
            String registration = matcher_Registration.group(1).trim();
            map.put("SDN_Registration_Number", registration);
        }
        // Sample: Company Number 58462 (Marshall Islands) [IRAN-EO13846].
        Matcher matcher_Company_num = Pattern.compile("Company Number(([ ]?\\w+)*( \\(([ ]?\\w+)*\\))?)").matcher(content);
        if (matcher_Company_num.find()) {
            String group = matcher_Company_num.group(1).trim();
            map.put("SDN_Company_Number", group);
        }


        Iterator it = map.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry entry = it.next();
            System.out.println(entry.key + " = " + entry.value);
        }

    }
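
The test above repeats the same three steps for several fields: loop over find(), collect one capture group, and join the values with ';'. As a rough sketch, those steps could be factored into a helper in plain Java. The class and method names are hypothetical, and trimming every value is an assumption (the original only trims the program codes). The first input is an excerpt of the sample above; "[A] x [B]" and "[A][B]" are made-up inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SdnFieldExtractor {
    // Collects the given capture group from every match and joins the values with ';'.
    // Returns null when the pattern never matches, so the caller can skip the map entry.
    static String joinAll(String regex, int group, String content) {
        Matcher m = Pattern.compile(regex).matcher(content);
        List<String> values = new ArrayList<>();
        while (m.find()) {
            values.add(m.group(group).trim());
        }
        return values.isEmpty() ? null : String.join(";", values);
    }

    public static void main(String[] args) {
        String content = "CHANG AN SHIPPING & TECHNOLOGY (a.k.a. CHANG AN SHIPPING AND TECHNOLOGY), "
                + "Room 2105; Company Number IMO 5938411 [DPRK4].";
        System.out.println(joinAll("(a\\.k\\.a\\. (.*?)[;()])+", 2, content)); // CHANG AN SHIPPING AND TECHNOLOGY
        System.out.println(joinAll("(\\[(.*?)])+", 2, content));               // DPRK4

        // Separate tags are found one by one and joined.
        System.out.println(joinAll("(\\[(.*?)])+", 2, "[A] x [B]")); // A;B
        // Directly adjacent tags are consumed by the outer (...)+ in a single match,
        // and group 2 then keeps only the last repetition.
        System.out.println(joinAll("(\\[(.*?)])+", 2, "[A][B]"));    // B
    }
}
```

The last call shows a side effect of the outer `(...)+` wrapper: if two bracketed codes ever appear back to back, only the last one is captured. Dropping the wrapper and reading group 1 of `\[(.*?)]` would return both.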


3. Result

SDN_Address = Room 2105, DL1849, Trend Centre, 29-31 Cheung Lee Street, Chai Wan, Hong Kong, China
SDN_Company_Number = IMO 5938411
SDN_a.k.a. = CHANG AN SHIPPING AND TECHNOLOGY
SDN_Program = DPRK4
SDN_English_Name = CHANG AN SHIPPING & TECHNOLOGY
SDN_Chinese_Name = 長安海運技術有限公司
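
The result above covers only the first sample. As a quick check that the patterns carry over, the sketch below applies three of them to the second, commented-out sample (the ALWEFAQ LTD entry). It is written in plain Java; the class name and the `first` helper are hypothetical, and the patterns are copied from the test. The address pattern is left out because it depends on the word lists.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SecondSampleCheck {
    // Hypothetical helper: trimmed text of the given group in the first match, or null if there is none.
    static String first(String regex, int group, String content) {
        Matcher m = Pattern.compile(regex).matcher(content);
        return m.find() ? m.group(group).trim() : null;
    }

    public static void main(String[] args) {
        String content = "ALWEFAQ LTD, 15 Grognet Street, Mosta MST 3613, Malta; "
                + "22 Freedom Street, Famagusta, Cyprus; Registration Number C 68939 (Malta) [LIBYA3] "
                + "(Linked To: MUSBAH, Nourddin Milood M; Linked To: WADI, Musbah Mohamad M).";

        // The name ends at the first run of capitals directly followed by a comma ("LTD,").
        String namePattern = "(.*?)((?= \\((?![A-Z]{3,}))|[A-Z]{3,}\\.?(?=,))";
        System.out.println(first(namePattern, 0, content)); // ALWEFAQ LTD

        // Word runs after the prefix, plus an optional parenthesized suffix.
        String regPattern = "Registration Number(([ /]?\\w+)*( \\(([ ]?\\w+)*\\))*)";
        System.out.println(first(regPattern, 1, content)); // C 68939 (Malta)

        // Sanctions program code in square brackets.
        System.out.println(first("(\\[(.*?)])+", 2, content)); // LIBYA3
    }
}
```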