Regular Expressions

Latest recommended article published on 2024-08-26 20:42:01

kd_myway

Article link: https://blog.csdn.net/qq_15797229/article/details/54143913

Escape characters

[Figure: simple escape characters]
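As a minimal sketch of how escaping behaves in Java: a backslash in front of a special character makes it literal, and inside a Java string literal that backslash is itself written as "\\". The class name EscapeDemo and its matches helper are made up for this illustration.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Returns true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("a\\.b", "a.b"));   // true: escaped dot is a literal '.'
        System.out.println(matches("a\\.b", "axb"));   // false
        System.out.println(matches("a.b", "axb"));     // true: unescaped dot matches any character
        System.out.println(matches("\\$\\d+", "$42")); // true: escaped '$' is a literal dollar sign
        System.out.println(matches("a\\tb", "a\tb"));  // true: \t matches a tab character
    }
}
```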

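Standard character sets

Expressions that can match any one of several kinds of characters.
They are case-sensitive: the uppercase form means the opposite of the lowercase form (for example, \d matches a digit and \D matches any non-digit).

[Figure: standard character sets]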
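The lowercase/uppercase pairing can be sketched as follows. The helper mask (a name invented here) replaces every character matched by a one-character class with '#', so the output shows which characters each class covers.

```java
class StandardSetDemo {
    // Replaces every character matched by the given class with '#'.
    static String mask(String input, String charClass) {
        return input.replaceAll(charClass, "#");
    }

    public static void main(String[] args) {
        String input = "a1 b2";
        System.out.println(mask(input, "\\d")); // a# b#   (digits)
        System.out.println(mask(input, "\\D")); // #1##2   (everything except digits)
        System.out.println(mask(input, "\\s")); // a1#b2   (whitespace)
        System.out.println(mask(input, "\\S")); // ## ##   (everything except whitespace)
        System.out.println(mask(input, "\\w")); // ## ##   (letters, digits, underscore)
        System.out.println(mask(input, "\\W")); // a1#b2   (everything except \w)
    }
}
```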

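Custom character sets

Square brackets [] match any single character listed inside them.

[Figure: custom character sets]

Regex special characters lose their special meaning inside square brackets, except ^ (negation when it comes first) and - (a range between two characters); the backslash also still escapes.
Standard character sets other than the dot keep their meaning inside square brackets, so the custom set includes that whole set. For example, [\d.\-+] matches a digit, a dot, -, or +.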
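A short sketch of these bracket rules, using the [\d.\-+] set from the text. The class name CustomSetDemo and the sample inputs are illustrative only.

```java
import java.util.regex.Pattern;

class CustomSetDemo {
    // Returns true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Digits, dot, minus and plus are all members of the set.
        System.out.println(matches("[\\d.\\-+]+", "+3.14-2")); // true
        System.out.println(matches("[\\d.\\-+]+", "3,14"));    // false: ',' is not in the set

        // Inside brackets the dot is literal, not "any character".
        System.out.println(matches("[.]", "."));               // true
        System.out.println(matches("[.]", "x"));               // false

        // ^ at the start negates the set; - between characters forms a range.
        System.out.println(matches("[^abc]", "d"));            // true
        System.out.println(matches("[^abc]", "a"));            // false
        System.out.println(matches("[a-c]", "b"));             // true
    }
}
```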

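Quantifiers

Special characters that specify how many times the preceding element may match.

Greedy mode (the default): a quantifier matches as many characters as possible.
Non-greedy mode: a quantifier followed by "?" matches as few characters as possible.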
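The difference between the two modes shows up when the same input is searched with and without the trailing "?". This is a minimal sketch; the helper firstMatch and the sample strings are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        System.out.println(firstMatch("<.+>", html));   // <b>bold</b>  (greedy: runs to the last '>')
        System.out.println(firstMatch("<.+?>", html));  // <b>          (non-greedy: stops at the first '>')

        String digits = "123456";
        System.out.println(firstMatch("\\d{2,4}", digits));  // 1234 (takes the maximum of 4)
        System.out.println(firstMatch("\\d{2,4}?", digits)); // 12   (takes the minimum of 2)
    }
}
```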

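Character boundaries

The markers in this group match positions rather than characters: positions that satisfy some condition.

[Figure: character boundaries]

\b matches a position where exactly one of the two adjacent characters is a \w character, that is, the edge of a word (the start or end of the input counts as a non-\w neighbor).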
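One way to see that \b matches a position and consumes no characters is a whole-word replacement. The sketch below assumes a helper named replaceWord, which is not part of the original article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Replaces occurrences of word only where it stands alone as a whole word.
    static String replaceWord(String text, String word, String replacement) {
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // The "cat" inside "concat" has a \w character on its left, so \b does not match there.
        System.out.println(replaceWord("cat concat cat", "cat", "dog")); // dog concat dog

        // Without \b every occurrence is replaced.
        System.out.println("cat concat cat".replaceAll("cat", "dog"));   // dog condog dog
    }
}
```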

Regex matching modes

[Figure: matching modes]
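In Java, matching modes are passed as flags to Pattern.compile. The sketch below assumes the modes meant here correspond to the standard flags Pattern.CASE_INSENSITIVE, Pattern.DOTALL and Pattern.MULTILINE; the helper countMatches is invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchModeDemo {
    // Counts how many times regex is found in input under the given flags (0 = no flags).
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // CASE_INSENSITIVE: letters match regardless of case.
        System.out.println(countMatches("hello", 0, "HeLLo"));                        // 0
        System.out.println(countMatches("hello", Pattern.CASE_INSENSITIVE, "HeLLo")); // 1

        // DOTALL: the dot also matches line terminators.
        System.out.println(countMatches("a.b", 0, "a\nb"));                           // 0
        System.out.println(countMatches("a.b", Pattern.DOTALL, "a\nb"));              // 1

        // MULTILINE: ^ and $ match at the start and end of every line.
        System.out.println(countMatches("^\\w+$", 0, "one\ntwo"));                    // 0
        System.out.println(countMatches("^\\w+$", Pattern.MULTILINE, "one\ntwo"));    // 2
    }
}
```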

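Alternation and grouping

Backreferences (\nnn)

Each pair of parentheses () is assigned a number; capturing groups are numbered automatically from 1, in the order of their left parentheses.
A backreference refers to the string that a group has already captured.

([a-z]{2})\1

[Figure: capturing groups]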
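The pattern ([a-z]{2})\1 and the left-parenthesis numbering rule can be tried out with a few lines of Java. This is only a sketch; BackreferenceDemo, firstMatch and the sample inputs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // \1 must repeat exactly the text that group 1 captured.
        System.out.println(firstMatch("([a-z]{2})\\1", "xyabab")); // abab
        System.out.println(firstMatch("([a-z]{2})\\1", "abcd"));   // null

        // Groups are numbered by the position of their left parenthesis.
        Matcher m = Pattern.compile("((a)(b))").matcher("ab");
        if (m.matches()) {
            System.out.println(m.group(1) + " " + m.group(2) + " " + m.group(3)); // ab a b
        }
    }
}
```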

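Lookaround (zero-width assertions)

[Figure: lookaround assertions]

[a-z]+(?=ing)

(?=exp): positive lookahead. The position must be followed by text that matches exp.

[a-z]+(?!\d+)

(?!exp): negative lookahead. The position must not be followed by text that matches exp.

[a-z]+(?<=go)

(?<=exp): positive lookbehind. The position must be preceded by text that matches exp.

[a-z]+(?<!go)

(?<!exp): negative lookbehind. The position must not be preceded by text that matches exp.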
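The four patterns above can be run against small inputs to see that the asserted text is checked but never included in the match. The helper firstMatch and the sample strings are assumptions for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Returns the first substring of input matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Letters that are followed by "ing"; "ing" itself is not part of the match.
        System.out.println(firstMatch("[a-z]+(?=ing)", "going"));      // go

        // Letters not followed by digits; the engine backtracks one letter to satisfy this.
        System.out.println(firstMatch("[a-z]+(?!\\d+)", "abc123"));    // ab

        // Letters whose match ends right after "go".
        System.out.println(firstMatch("[a-z]+(?<=go)", "cargo ship")); // cargo

        // Letters whose match does not end right after "go".
        System.out.println(firstMatch("[a-z]+(?<!go)", "cargo"));      // carg
    }
}
```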

Exercise (phone numbers)

(0\d{2,3}-\d{7,8})|(1[358]\d{9})
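A quick way to check this pattern is to run it with matches(), which requires the whole string to fit one of the two branches: a landline (0, a 2 to 3 digit area code, a hyphen, 7 to 8 digits) or an 11-digit mobile number starting with 13, 15 or 18. The class PhoneDemo and the sample numbers are hypothetical.

```java
import java.util.regex.Pattern;

class PhoneDemo {
    private static final Pattern PHONE =
            Pattern.compile("(0\\d{2,3}-\\d{7,8})|(1[358]\\d{9})");

    // Returns true if the whole string is a landline or mobile number accepted by the pattern.
    static boolean isPhone(String s) {
        return PHONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPhone("010-12345678")); // true: landline, 3-digit area code
        System.out.println(isPhone("0755-1234567")); // true: landline, 4-digit area code
        System.out.println(isPhone("13812345678"));  // true: mobile
        System.out.println(isPhone("12812345678"));  // false: second digit is not 3, 5 or 8
        System.out.println(isPhone("1381234567"));   // false: only 10 digits
    }
}
```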

Exercise (email addresses)

[\w\-]+@[a-zA-Z0-9]+(\.[A-Za-z]{2,4}){1,2}
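The sketch below runs this pattern against a few made-up addresses. It also shows what the pattern rejects: a dot in the local part, and domain suffixes longer than four letters. EmailDemo and isEmail are illustrative names.

```java
import java.util.regex.Pattern;

class EmailDemo {
    private static final Pattern EMAIL =
            Pattern.compile("[\\w\\-]+@[a-zA-Z0-9]+(\\.[A-Za-z]{2,4}){1,2}");

    // Returns true if the whole string is accepted by the exercise pattern.
    static boolean isEmail(String s) {
        return EMAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEmail("user_name@example.com"));  // true
        System.out.println(isEmail("a-b@mail.com.cn"));        // true: two suffix parts
        System.out.println(isEmail("user@example"));           // false: no suffix
        System.out.println(isEmail("first.last@example.com")); // false: '.' not allowed before '@'
        System.out.println(isEmail("user@example.museum"));    // false: suffix longer than 4 letters
    }
}
```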

Regular expressions in Java

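import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Basic regular expression usage
 * @author L J
 */
public class RegexDemo {
    public static void main(String[] args) {
        // Test the string "sdjfign@90384" against the pattern \w+

        // Compile the pattern
        Pattern p = Pattern.compile("\\w+");

        // Create a Matcher for the input
        Matcher m = p.matcher("sdjfign@90384");

        // matches() tries to match the entire input against the pattern
//      boolean result = m.matches();
//      System.out.println(result); //false

        // find() scans the input for the next subsequence that matches the pattern
//      System.out.println(m.find()); //true
//      System.out.println(m.find()); //true
//      System.out.println(m.find()); //false

        // group() returns the text found by the most recent find()
        while(m.find()) {
            // group() and group(0) both return the substring matched by the whole expression
            System.out.println(m.group());
        }
    }
}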

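import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Regular expression usage (groups)
 * @author L J
 */
public class RegexDemo2 {
    public static void main(String[] args) {
        // Search "sdj23**hhf89**jfj443" for the pattern ([a-z]+)([0-9]+)

        // Compile the pattern
        Pattern p = Pattern.compile("([a-z]+)([0-9]+)");

        // Create a Matcher for the input
        Matcher m = p.matcher("sdj23**hhf89**jfj443");

        while(m.find()) {
            // group() and group(0) return the substring matched by the whole expression
            System.out.println(m.group());
            // group(1) is the run of letters, group(2) the run of digits
            System.out.println(m.group(1));
            System.out.println(m.group(2));
        }
    }
}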

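import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Regular expression usage (replacement)
 * @author L J
 */
public class RegexDemo3 {
    public static void main(String[] args) {
        // Compile the pattern
        Pattern p = Pattern.compile("[0-9]");

        // Create a Matcher for the input
        Matcher m = p.matcher("sdj23**hhf89**jfj443");

        // Replace every digit with "/"
        String newStr = m.replaceAll("/");
        System.out.println(newStr); //sdj//**hhf//**jfj///
    }
}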

import java.util.Arrays;

/**
 * Regular expression usage (splitting)
 * @author L J
 */
public class RegexDemo4 {
    public static void main(String[] args) {
        String str = "sdj23hhf89jfj443";
        // Split on runs of digits; the trailing empty string is dropped
        String[] arrs = str.split("\\d+");
        System.out.println(Arrays.toString(arrs));//[sdj, hhf, jfj]
    }
}

How a web crawler works

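import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Web crawler: extract the links from a page
 * @author L J
 */
public class WebSpider {
    public static void main(String[] args) {
        String str = getURLContent("http://www.163.com", "gbk");
        // Group 1 captures the value of each href attribute
        List<String> result = getMatcherSubstrs(str, "href=\"([\\w\\s./:]+?)\"");
        for (String r : result) {
            System.out.println(r);
        }
    }

    /**
     * Collects group 1 of every match of regex in destStr
     */
    public static List<String> getMatcherSubstrs(String destStr, String regex) {
        // This alternative pattern matches every complete <a>...</a> element.
        // It has no capturing group, so it would need m.group() instead of m.group(1).
        //Pattern p = Pattern.compile("<a[\\s\\S]+?</a>");
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(destStr);
        List<String> result = new ArrayList<String>();
        while(m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    /**
     * Fetches the source of a web page
     * @param urlStr page URL
     * @param charset name of the character encoding used to decode the page
     * @return page source (whatever was read before an error, if one occurs)
     */
    public static String getURLContent(String urlStr, String charset) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new URL(urlStr).openStream(), Charset.forName(charset)))) {
            // Read the page line by line; readLine() drops the line terminators
            String temp;
            while((temp = reader.readLine()) != null) {
                sb.append(temp);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return sb.toString();
    }
}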
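The link-extraction step can also be tried without a network connection. The sketch below applies the same href pattern to a hard-coded HTML string; the class HrefExtractDemo, the method extractLinks and the sample markup are all hypothetical. Because the character class only allows word characters, whitespace, '.', '/' and ':', links containing other characters such as '?' or '=' are skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractDemo {
    // Same pattern as in WebSpider: group 1 captures the href value.
    private static final Pattern HREF = Pattern.compile("href=\"([\\w\\s./:]+?)\"");

    // Returns the captured href values in the order they appear in html.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample markup, not taken from a real page.
        String html = "<a href=\"http://www.example.com/index.html\">Home</a>"
                + "<a href=\"/news/today\">News</a>"
                + "<a href=\"page?id=1\">Query</a>";
        // prints [http://www.example.com/index.html, /news/today]
        System.out.println(extractLinks(html));
    }
}
```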