Java Regular Expressions (Part 2): The Java API

1. Which classes in the Java API deal with regular expressions?

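The regex classes live in the package `java.util.regex`. The two main ones are `Pattern` and `Matcher`:
- `Pattern`: represents a compiled regular expression; it is independent of any particular string being processed
- `Matcher`: represents one matching operation; it applies a regular expression to a specific string, and the string is processed through it
In Java, a regular expression must first be written as a string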
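
To make the division of labor concrete, here is a minimal sketch. The class name `PatternMatcherSketch` and the helper `containsDigits` are made up for illustration; only `Pattern.compile()`, `Pattern.matcher()` and `Matcher.find()` come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternMatcherSketch {
    // Compiled once; the Pattern knows nothing about any input text.
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Hypothetical helper: each call creates a fresh Matcher that binds
    // the shared Pattern to one concrete input.
    static boolean containsDigits(CharSequence input) {
        Matcher m = DIGITS.matcher(input);
        return m.find();
    }

    public static void main(String[] args) {
        System.out.println(containsDigits("room 101"));   // true
        System.out.println(containsDigits("no numbers")); // false
    }
}
```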

2. How is a regular expression written in Java?

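Java has no special syntax for writing a regular expression directly; a regular expression has to be written as a string
- Inside a string literal, `\` is itself an escape character, so to put a regex `\` into a string literal you write two of them: `\\`
- To match a literal `\`, you need four: `\\\\`. For example, the expression `<(\w+)>(.*)</\1>` is written as the string literal `"<(\\w+)>(.*)</\\1>"`
- A simple rule: every `\` in the regular expression becomes two `\` in the string literal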
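A regular expression written as a string can be compiled into a `Pattern` object. For example, `String regex = "<(\\w+)>(.*)</\\1>";` can be compiled with `Pattern pattern = Pattern.compile(regex);`
- `Pattern` is the object-oriented representation of a regular expression. Compiling, put simply, converts the string into an internal structure; conceptually, that structure is a finite automaton
- Compiling has a cost, and a `Pattern` object depends only on the regular expression, not on the text being processed, so it can safely be shared between threads. When the same regular expression is applied to many texts, reuse one `Pattern` object and avoid recompiling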
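`compile()` has an overload that takes an extra argument specifying match flags: `public static Pattern compile(String regex, int flags)`. Single-line (dot-all) mode, multiline mode and case-insensitive mode correspond to the constants `Pattern.DOTALL`, `Pattern.MULTILINE` and `Pattern.CASE_INSENSITIVE`. Several flags can be combined with `|`, for example: `Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);`
- There is also the flag `Pattern.LITERAL`. In this mode the metacharacters in the regex string lose their special meaning and are treated as ordinary characters
- `Pattern` has a static method `public static String quote(String s)` that, like `Pattern.LITERAL`, makes every character of `s` be treated as an ordinary character. `quote()` essentially wraps `s` in `\Q` and `\E`. For example, if `s` is the string literal `"\\d{6}"`, `quote()` returns the equivalent of the string literal `"\\Q\\d{6}\\E"`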
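
The points above can be sketched together as follows. The class `CompileSketch` and its helpers `innerText` and `containsLiteral` are hypothetical names chosen for this example; the flags and `Pattern.quote()` are the JDK members described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompileSketch {
    // The regex <(\w+)>(.*)</\1> as a string literal: every \ is doubled.
    // DOTALL lets . match line terminators; flags are combined with |.
    static final Pattern TAG = Pattern.compile(
            "<(\\w+)>(.*)</\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: text between a matching pair of tags, or null.
    static String innerText(String html) {
        Matcher m = TAG.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    // Hypothetical helper: searches for 'literal' as plain text, so any
    // metacharacters in it lose their special meaning.
    static boolean containsLiteral(String text, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(innerText("<b>line1\nline2</b>"));
        System.out.println(Pattern.quote("\\d{6}"));                   // \Q\d{6}\E
        System.out.println(containsLiteral("code 123456", "\\d{6}"));  // false
        System.out.println(containsLiteral("use \\d{6} here", "\\d{6}")); // true
    }
}
```

Because `TAG` is a static constant, it is compiled once and can be shared by every caller, which is the reuse the text recommends.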

3. How does `String`'s `split()` method work: `public String[] split(String regex)`

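`split()` treats its `regex` argument as a regular expression, not as plain text. If the separator is a metacharacter, such as `. $ | ( ) [ { ^ ? * + \`, it has to be escaped. For example, to split on a dot, write: `String[] fields = str.split("\\.");`
- If the separator is supplied by the user and not known in advance, `Pattern.quote()` can be used to make it be treated as a plain string
Because the argument is a regular expression, the separator does not have to be a single character. For example, one or more whitespace characters or dots can serve as the separator: with `String str = "abc def hello.\n world";` and `String[] fields = str.split("[\\s.]+");`, the content of `fields` is `[abc, def, hello, world]`
- Note that trailing empty strings are not included in the returned array, but leading and interior empty strings are
- If no separator matching `regex` is found in the string, the returned array has length 1 and its only element is the original string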
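
The rules above can be checked with a short sketch. `SplitSketch` and `splitOn` are hypothetical names; the behavior shown is that of `String.split()` with its default limit.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitSketch {
    // Hypothetical helper: splits on a separator that is only known at
    // run time, quoting it so metacharacters are taken literally.
    static String[] splitOn(String text, String separator) {
        return text.split(Pattern.quote(separator));
    }

    public static void main(String[] args) {
        // Escaped dot as separator.
        System.out.println(Arrays.toString("a.b.c".split("\\.")));   // [a, b, c]
        // Leading and interior empty strings are kept, trailing ones dropped.
        System.out.println(Arrays.toString(",a,,b,,".split(",")));   // [, a, , b]
        // No separator found: one element, the original string.
        System.out.println("abc".split(",").length);                 // 1
        // "|" is a metacharacter; quoting makes it a plain separator.
        System.out.println(Arrays.toString(splitOn("1|2|3", "|")));  // [1, 2, 3]
    }
}
```

Forgetting the escape is a common mistake: `"a.b.c".split(".")` treats every character as a separator, so all pieces are empty and the result is an empty array.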

4. How does `Pattern`'s `split()` method, `public String[] split(CharSequence input)`, differ from `String`'s `split()` method, `public String[] split(String regex)`?

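`Pattern.split()` takes a `CharSequence`, which is more general: `String`, `StringBuilder`, `StringBuffer`, `CharBuffer` and others all implement that interface
In general, if `regex` is longer than one character or contains metacharacters, `String.split()` first has to compile `regex` into a `Pattern` object and then call `Pattern.split()`. In that case, prefer `Pattern.split()` on a reused `Pattern` to avoid recompiling
If `regex` is a single character that is not a metacharacter, `String.split()` uses a simpler and more efficient implementation, so in that case prefer `String.split()`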
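
A minimal sketch of the reuse case, assuming many inputs share the same non-trivial separator. The names `PatternSplitSketch`, `SEPARATOR` and `splitAll` are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PatternSplitSketch {
    // Compiled once and reused for every input.
    static final Pattern SEPARATOR = Pattern.compile("[\\s.]+");

    // Hypothetical helper: accepts any CharSequence, not only String.
    static List<String[]> splitAll(List<? extends CharSequence> lines) {
        List<String[]> rows = new ArrayList<>();
        for (CharSequence line : lines) {
            rows.add(SEPARATOR.split(line));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<CharSequence> lines = Arrays.<CharSequence>asList(
                "abc def hello.\n world", new StringBuilder("x.y z"));
        for (String[] row : splitAll(lines)) {
            System.out.println(Arrays.toString(row));
        }
        // [abc, def, hello, world]
        // [x, y, z]
    }
}
```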

5. How does `String`'s `matches()` method work: `public boolean matches(String regex)`

`String.matches()` simply delegates to the static `matches()` method of `Pattern`:

public static boolean matches(String regex, CharSequence input) {
    Pattern p = Pattern.compile(regex); // compile regex into a Pattern object
    Matcher m = p.matcher(input); // matcher() creates a Matcher for this input
    return m.matches(); // Matcher.matches() reports whether the whole input matches
}
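
Two consequences follow from that implementation: every call recompiles the regex, and the match must cover the entire input. The sketch below illustrates both; `MatchesSketch`, `SIX_DIGITS` and `isSixDigits` are hypothetical names.

```java
import java.util.regex.Pattern;

class MatchesSketch {
    // Precompiled alternative to calling String.matches() repeatedly.
    static final Pattern SIX_DIGITS = Pattern.compile("\\d{6}");

    // Hypothetical helper: true only if the whole input is six digits.
    static boolean isSixDigits(CharSequence input) {
        return SIX_DIGITS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("123456".matches("\\d{6}"));            // true
        System.out.println("a123456".matches("\\d{6}"));           // false: not a full match
        System.out.println(SIX_DIGITS.matcher("a123456").find());  // true: a substring matches
        System.out.println(isSixDigits("654321"));                 // true
    }
}
```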

6. How does `Matcher`'s search method `find()` work?

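A `Matcher` keeps an internal position, initially 0. `find()` searches from that position for a substring that matches the regular expression. When one is found, it returns `true` and advances the internal position. Information about the matched substring is available through these methods:
- `public String group()`: the complete matched substring
- `public int start()`: the index in the whole string at which the match starts
- `public int end()`: the index just past the last matched character
`group()` actually calls `group(0)`, which returns the content of group 0. Group 0 is a special capturing group that stands for the entire matched substring. Besides group 0, `Matcher` has the following methods for getting more information about groups:
- `public int groupCount()`: the number of capturing groups (group 0 is not counted)
- `public String group(int group)`: the content of the group numbered `group`
- `public String group(String name)`: the content of the group named `name`
- `public int start(int group)`: the start index of the group numbered `group`
- `public int end(int group)`: the index just past the end of the group numbered `group`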
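
The methods listed above are typically used inside a `while (m.find())` loop. The following is a small sketch; the class `FindSketch`, the date pattern and the helper `describeMatches` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindSketch {
    // One named group (year) and two numbered groups; the named group
    // is also reachable by its number, 1.
    static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(\\d{2})-(\\d{2})");

    // Hypothetical helper: one description per match found in the text.
    static List<String> describeMatches(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) { // each call resumes after the previous match
            out.add(m.group() + "@" + m.start() + "-" + m.end()
                    + " year=" + m.group("year") + " month=" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : describeMatches("from 2011-06-02 to 2012-12-31")) {
            System.out.println(s);
        }
        // 2011-06-02@5-15 year=2011 month=06
        // 2012-12-31@19-29 year=2012 month=12
    }
}
```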

7. How do `String`'s replacement methods work?

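`String` has several replacement methods:
- `public String replace(char oldChar, char newChar)`
- `public String replace(CharSequence target, CharSequence replacement)`
- `public String replaceAll(String regex, String replacement)`
- `public String replaceFirst(String regex, String replacement)`
The first `replace()` works on a single character and the second on a `CharSequence`; both treat their arguments as plain text. `replaceAll()` and `replaceFirst()` treat the `regex` argument as a regular expression. They differ in that `replaceAll()` replaces every matching substring, while `replaceFirst()` replaces only the first one found
- In `replaceAll()` and `replaceFirst()`, the `replacement` argument is not treated as a plain string either: a dollar sign followed by a number (such as `$1`) refers to a capturing group. The character `$` is a metacharacter in `replacement`, so to insert a literal `$` it must be escaped as `\$`
- If the replacement string is supplied by the user, the following static method of `Matcher` makes it be treated as a plain string and avoids interference from metacharacters: `public static String quoteReplacement(String s)`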
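`String`'s `replaceAll()` and `replaceFirst()` actually call methods of `Pattern` and `Matcher`. For example, the code of `replaceAll()` is: `public String replaceAll(String regex, String replacement) { return Pattern.compile(regex).matcher(this).replaceAll(replacement); }`
`replaceAll()` and `replaceFirst()` are both defined in `Matcher`. Besides these one-shot replacements, `Matcher` also defines methods for replacing while searching: `public Matcher appendReplacement(StringBuffer sb, String replacement)` and `public StringBuffer appendTail(StringBuffer sb)` (Java 9 added overloads that take a `StringBuilder`). These two methods are meant to be used together with `find()`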
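
The difference between literal and regex-aware replacement can be sketched as follows. `ReplaceSketch`, `swapDate` and `replaceLiteral` are hypothetical names for this example; the `String` and `Matcher` methods are the ones listed above.

```java
import java.util.regex.Matcher;

class ReplaceSketch {
    // Hypothetical helper: reorders a yyyy-MM-dd date by referring to
    // capturing groups in the replacement string.
    static String swapDate(String text) {
        return text.replaceAll("(\\d{4})-(\\d{2})-(\\d{2})", "$2/$3/$1");
    }

    // Hypothetical helper: the replacement comes from the user, so it is
    // quoted to keep $ and \ from being interpreted.
    static String replaceLiteral(String text, String regex, String userReplacement) {
        return text.replaceAll(regex, Matcher.quoteReplacement(userReplacement));
    }

    public static void main(String[] args) {
        System.out.println(swapDate("due 2011-06-02"));                        // due 06/02/2011
        System.out.println(replaceLiteral("price: 10 USD", "\\d+ USD", "$10")); // price: $10
        // replace() treats both arguments as plain text.
        System.out.println("a.b".replace(".", "$"));                           // a$b
        System.out.println("cat cat".replaceFirst("cat", "dog"));              // dog cat
    }
}
```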

8. Give the output of the following code and explain it

public static void replaceCat() {
    Pattern p = Pattern.compile("cat");
    Matcher m = p.matcher("one cat, two cat, three cat");
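    StringBuffer sb = new StringBuffer(); // holds the final result of the replacement
    int foundNum = 0;
    while (m.find()) {
        // Besides the search position, Matcher keeps an append position, initially 0. After a match is found, appendReplacement() does three things:
        // 1) appends to sb the text between the append position and the start of the current match: "one " the first time, ", two " the second time
        // 2) appends the replacement string to sb
        // 3) moves the append position to just after the current match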
        m.appendReplacement(sb, "dog");
        foundNum++;
        if(foundNum == 2) {
            break;
        }
    }
    m.appendTail(sb); // append everything after the append position to sb
    System.out.println(sb.toString());
}

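// Output: one dog, two dog, three cat
// Analysis: the loop breaks after the second match, so appendTail() copies the rest of the input, including the third "cat", unchanged; see the comments in the code for the details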
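
The same find-and-append loop can be wrapped into a reusable helper. This is only a sketch under assumptions: the class `ReplaceFirstNSketch` and the method `replaceFirstN` are not part of the JDK, and the replacement string is still interpreted by `appendReplacement()`, so `$` and `\` keep their special meaning.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceFirstNSketch {
    // Hypothetical helper: replaces at most n matches of regex in text.
    static String replaceFirstN(String text, String regex, String replacement, int n) {
        Matcher m = Pattern.compile(regex).matcher(text);
        StringBuffer sb = new StringBuffer();
        int replaced = 0;
        while (replaced < n && m.find()) {
            // Copies the text before the match, then the replacement.
            m.appendReplacement(sb, replacement);
            replaced++;
        }
        // Copies whatever follows the last replaced match, unchanged.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "one cat, two cat, three cat";
        System.out.println(replaceFirstN(s, "cat", "dog", 2)); // one dog, two dog, three cat
        System.out.println(replaceFirstN(s, "cat", "dog", 5)); // one dog, two dog, three dog
    }
}
```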
