Easy regular expressions with Apache Groovy - OpenSource.net

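Regular expressions: love them or hate them, they are an unavoidable fact of life for most programmers, especially for data wranglers like me who constantly wrestle with data. (If you haven't installed Groovy yet, please read the introduction to this series.)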

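This quote is repeated often, though who said it first remains unclear, and it seems obligatory to include it:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.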

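Even so, regular expressions remain a powerful tool for recognizing and transforming text, especially text that was never meant to be recognized or transformed. Groovy at least tries to make regular expressions, if not friendlier, then at least groovier.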

So what does Groovy bring to regular expressions beyond Java's basic facilities?

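In the previous article, you learned about slashy strings, which eliminate the need to escape backslashes. They come in handy when writing regular expressions.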
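As a small illustration of why that matters, here is a sketch comparing the same pattern text written as an ordinary quoted string and as a slashy string. The date-like pattern and the variable names are made up for this example and do not come from the scripts in this article.

```groovy
// In an ordinary quoted string, every backslash has to be doubled.
def quoted = '\\d{4}-\\d{2}-\\d{2}'

// In a slashy string, the backslashes are written just once.
def slashy = /\d{4}-\d{2}-\d{2}/

// Both spellings produce exactly the same String.
assert slashy instanceof String
assert quoted == slashy

// Either one can be handed to Java's String.matches().
assert '2024-03-20'.matches(slashy)
assert !'March 20'.matches(slashy)
```

The slashy version is easier to read and easier to compare against a pattern copied from documentation or another tool.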

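Suppose you are reading some HTML from somewhere and you are looking for the terms defined in a description list.
A description list looks like this: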

<dl>
  <dt>winter</dt>
    <dd>winter solstice to spring equinox</dd>
  <dt>spring</dt>
    <dd>spring equinox to summer solstice</dd>
  <dt>summer</dt>
    <dd>summer solstice to fall equinox</dd>
  <dt>autumn</dt>
    <dd>fall equinox to winter solstice</dd>
</dl>

This renders as:

winter
winter solstice to spring equinox
spring
spring equinox to summer solstice
summer
summer solstice to fall equinox
autumn
fall equinox to winter solstice

The terms being defined appear between the <dt> and </dt> tags.

Using slashy string notation, a regular expression that finds these terms looks like this:

~/<dt>.*<\/dt>/

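Note the leading ~, which indicates that the string that follows is to be compiled into a regular expression (a java.util.regex.Pattern). Note also that you must escape the / character.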
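To make the role of ~ concrete, the following sketch checks what the expression above actually produces. The variable name dtPattern is chosen only for this illustration.

```groovy
import java.util.regex.Pattern

// The ~ operator turns the slashy string into a compiled Pattern.
def dtPattern = ~/<dt>.*<\/dt>/
assert dtPattern instanceof Pattern

// Inside a slashy string, \/ stands for a plain / character,
// so the backslash does not appear in the compiled pattern text.
assert dtPattern.pattern() == '<dt>.*</dt>'

// The Pattern can be used through the ordinary Java API.
assert dtPattern.matcher('<dt>winter</dt>').matches()
```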

Suppose you are reading an HTML document and want to see which terms it defines.
A script like this comes close to what you need:

String html = """
<html>
  <head>
    <title>Seasons</title>
  </head>
  <body>
    <h1>Learning about the seasons</h1>
    <p>What is the formal definition of the four seasons of the year?</p>
    <dl>
      <dt>winter</dt>
      <dd>winter solstice to spring equinox</dd>
      <dt>spring</dt>
      <dd>spring equinox to summer solstice</dd>
      <dt>summer</dt>
      <dd>summer solstice to fall equinox</dd>
      <dt>autumn</dt>
      <dd>fall equinox to winter solstice</dd>
    </dl>
  </body>
</html>
"""
// find every <dt>...</dt> occurrence in the document
def dts = html =~ /<dt>.*<\/dt>/
for (f in dts) {
  println f
}

When you run it, you get:

$ groovy Groovy11a.groovy
<dt>winter</dt>
<dt>spring</dt>
<dt>summer</dt>
<dt>autumn</dt>

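The =~ symbol is Groovy's find operator. Used this way, it yields an instance of the Matcher class. You can iterate over the matches in the Matcher instance to get the matching substrings.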
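The script above prints the tags along with the terms. One way to get only the terms, sketched here on a shortened version of the sample HTML, is to add a capture group to the pattern; each match that the Matcher yields is then a list holding the whole match followed by the captured group. The variable names are chosen for this illustration.

```groovy
import java.util.regex.Matcher

def html = '''
<dl>
  <dt>winter</dt>
  <dd>winter solstice to spring equinox</dd>
  <dt>spring</dt>
  <dd>spring equinox to summer solstice</dd>
</dl>
'''

// The parentheses capture just the text between the tags.
def matcher = html =~ /<dt>(.*)<\/dt>/
assert matcher instanceof Matcher

// With a capture group, each match is a list: [whole match, group 1].
def terms = []
for (match in matcher) {
  terms << match[1]
}
assert terms == ['winter', 'spring']

// A Matcher can also be indexed directly to reach a particular match.
def first = (html =~ /<dt>(.*)<\/dt>/)[0]
assert first == ['<dt>winter</dt>', 'winter']
```

This works line by line because . does not match a newline by default, so .* cannot run past the end of the line it started on.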

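Sometimes dealing with a Matcher object is more than you need. Maybe you just want to determine whether a pattern of interest is present on any given line of data. In that case, you can use the match operator, ==~: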

String html = """
<html>
  <head>
    <title>Seasons</title>
  </head>
  <body>
    <h1>Learning about the seasons</h1>
    <p>What is the formal definition of the four seasons of the year?</p>
    <dl>
      <dt>winter</dt>
      <dd>winter solstice to spring equinox</dd>
      <dt>spring</dt>
      <dd>spring equinox to summer solstice</dd>
      <dt>summer</dt>
      <dd>summer solstice to fall equinox</dd>
      <dt>autumn</dt>
      <dd>fall equinox to winter solstice</dd>
    </dl>
  </body>
</html>
"""

// print each line that contains a <dt>...</dt> pair
for (line in html.split(/\n/)) {
  if (line ==~ /.*<dt>.*<\/dt>.*/)
    println line
}

When you run this code, you see:

$ groovy Groovy11b.groovy
      <dt>winter</dt>
      <dt>spring</dt>
      <dt>summer</dt>
      <dt>autumn</dt>

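There are a couple of important things to note. The match operator matches (or fails to match) the entire string, so you have to change the pattern to allow text before and after the pattern of interest. Also note that all you are doing is identifying that the pattern occurs somewhere in the test string; the result is a simple true or false, not the matching text.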
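The difference between the two operators is easy to see on a single line. This is a minimal sketch using one line from the sample HTML, with its leading spaces kept on purpose.

```groovy
def line = '      <dt>winter</dt>'

// ==~ must match the entire string, so the leading spaces defeat this pattern.
assert !(line ==~ /<dt>.*<\/dt>/)

// Allowing arbitrary text before and after makes the whole line match.
assert line ==~ /.*<dt>.*<\/dt>.*/

// Trimming the line first is another way to get a whole-string match.
assert line.trim() ==~ /<dt>.*<\/dt>/

// =~ only needs to find the pattern somewhere; in a boolean context
// the resulting Matcher is true when at least one match exists.
assert line =~ /<dt>.*<\/dt>/

// The result of ==~ is a plain Boolean, not a Matcher.
assert (line ==~ /.*<dt>.*<\/dt>.*/) instanceof Boolean
```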

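Where this kind of whole-string matching is particularly useful in my work is in Groovy's switch statement, where a case can be a pattern. For example, you can use a switch statement and the Groovy String enhancement method takeBetween(), which you learned about two articles ago, to extract the terms and definitions from the sample HTML, like this: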

1  String html = """
2  <html>
3    <head>
4      <title>Seasons</title>
5    </head>
6    <body>
7      <h1>Learning about the seasons</h1>
8      <p>What is the formal definition of the four seasons of the year?</p>
9      <dl>
10        <dt>winter</dt>
11        <dd>winter solstice to spring equinox</dd>
12        <dt>spring</dt>
13        <dd>spring equinox to summer solstice</dd>
14        <dt>summer</dt>
15        <dd>summer solstice to fall equinox</dd>
16        <dt>autumn</dt>
17        <dd>fall equinox to winter solstice</dd>
18      </dl>
19    </body>
20  </html>
21  """
22  def definitionList = []
23  def definition = [:]
24  html.split(/\n/).each { line ->
25    switch (line) {
26    case ~/.*<dt>.*<\/dt>.*/:
27      definition.term = line.takeBetween("<dt>", "</dt>")
28      break
29    case ~/.*<dd>.*<\/dd>.*/:
30      definition.definition = line.takeBetween("<dd>", "</dd>")
31      break
32    default:
33      break
34    }
35    if (definition.containsKey("term") && definition.containsKey("definition")) {
36      definitionList << definition
37      definition = [:]
38    }
39  }
40  println "Definition(s) encountered in HTML:"
41  definitionList.each { d ->
42    println "${d.term}: ${d.definition}"
43  }

Here are some notes on the code above:

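In lines 24-39, I use the each() method, which Groovy provides for arrays as well as lists, and a closure to loop over each line of the HTML (in the previous examples, I used the Groovy for… in statement, but I generally prefer each in this situation).
In lines 25-34, I use the Groovy switch statement with pattern matching to detect lines containing <dt>…</dt> and <dd>…</dd>. I use the takeBetween() method to get the text between those tags and save it in the definition map defined in line 23.
In lines 35-38, I check whether the definition map has both the term and definition keys and, if so, append it to definitionList before resetting it to an empty map.
In lines 41-43, I use the each() method defined for lists and a closure to loop over each definition map in definitionList.
I could be groovier and use the specialized collect() method for lists, but I'll leave that for a later article.
I could just print the definition map each time it is filled instead of accumulating it in definitionList. But in my experience, it is usually necessary to separate those two steps and do some intermediate processing in between.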

When you run this code, you get:

$ groovy Groovy11c.groovy
Definition(s) encountered in HTML:
winter: winter solstice to spring equinox
spring: spring equinox to summer solstice
summer: summer solstice to fall equinox
autumn: fall equinox to winter solstice
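One detail of the switch statement in the last script is worth a closer look: a case that is a pattern has to match the whole value, just as the match operator does. The sketch below isolates that behavior. The classify() helper and the two pattern variables are hypothetical names introduced only for this illustration.

```groovy
// Hypothetical helper: report what kind of line this is.
def classify(String line) {
  switch (line) {
    case ~/.*<dt>.*<\/dt>.*/:
      return 'term'
    case ~/.*<dd>.*<\/dd>.*/:
      return 'definition'
    default:
      return 'other'
  }
}

assert classify('      <dt>winter</dt>') == 'term'
assert classify('      <dd>winter solstice to spring equinox</dd>') == 'definition'
assert classify('    <dl>') == 'other'

// switch decides each case by calling isCase() on the case value.
// For a Pattern, isCase() succeeds only when the whole string matches.
def loosePattern = ~/.*<dt>.*<\/dt>.*/
def strictPattern = ~/<dt>.*<\/dt>/
assert loosePattern.isCase('      <dt>winter</dt>')
assert !strictPattern.isCase('      <dt>winter</dt>')
```

This is why the case patterns in the script begin and end with .* even though only the tags themselves are of interest.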