4.8 String extraction

substr(x, start, stop)
substring(text, first, last = 1000000L)

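substr and substring split and extract strings by position. They do not use regular expressions themselves, but combined with the regular-expression functions regexpr, gregexpr and regexec they make it easy to pull the required information out of large amounts of text, much like the regmatches function mentioned earlier.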
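As a minimal sketch of that combination: regexpr reports where the first match starts and how long it is, and substr then cuts the match out by position. The sample vector x and the pattern "[0-9]+" are assumptions made up for illustration.

```r
# Hypothetical sample data, not from the original text
x <- c("order 1234 shipped", "no digits here", "id 56")

m <- regexpr("[0-9]+", x)          # start of the first match, -1 if none
start <- as.integer(m)             # drop the attributes, keep the positions
len <- attr(m, "match.length")     # length of each match, -1 if none
hit <- start > 0                   # elements that actually matched

# Cut the matches out by position
res <- substr(x[hit], start[hit], start[hit] + len[hit] - 1)
res
## [1] "1234" "56"

# regmatches does the same bookkeeping internally
same <- identical(res, regmatches(x, m))
```

Elements without a match are dropped here, which is also what regmatches does when given the result of regexpr.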

The arguments are largely the same:

x/text: the character vector to extract from
start/first: vector of starting positions
stop/last: vector of ending positions
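One difference visible in the signatures is that last in substring has a default (1000000L), while stop in substr must always be supplied. A small sketch of what that default allows; the variable names are chosen only for illustration:

```r
s <- "abcdef"

# With last omitted, substring extracts through the end of the string
tail_part <- substring(s, 3)
tail_part
## [1] "cdef"

# The substr equivalent needs an explicit stop position
tail_part2 <- substr(s, 3, nchar(s))

# A vector of start positions gives every suffix of the string
suffixes <- substring(s, 1:6)
suffixes
## [1] "abcdef" "bcdef"  "cdef"   "def"    "ef"     "f"
```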

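The return values differ:

substr returns as many substrings as the length of its first argument x.
substring returns as many substrings as the length of the longest of its three arguments; shorter vectors are recycled.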

x <- "123456789"
substr(x, c(2, 4), c(4, 5, 8))

## [1] "234"

substring(x, c(2, 4), c(4, 5, 8))

## [1] "234"     "45"      "2345678"

substr("abcdef", 2, 4)

## [1] "bcd"

substring("abcdef", 1:6, 1:6)

## [1] "a" "b" "c" "d" "e" "f"

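In the first example above (x <- "123456789"), x has length 1, so substr returns a single substring: of the second and third arguments only the first pair is used, start position 2 and stop position 4. For substring, the longest of the three arguments is c(4, 5, 8); under the recycling rule the first argument is effectively c(x, x, x) and the second becomes c(2, 4, 2), so the extracted ranges are 2-4, 4-5 and 2-8.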
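The recycling described here can be checked by writing the recycled arguments out by hand. This is only a sketch; the variable names are chosen for illustration.

```r
x <- "123456789"

# Recycling done implicitly by substring
implicit <- substring(x, c(2, 4), c(4, 5, 8))

# The same call with every argument expanded to length 3
explicit <- substring(rep(x, 3), c(2, 4, 2), c(4, 5, 8))

identical(implicit, explicit)
## [1] TRUE

# substr is limited to length(x) results, so only the first pair (2, 4) is used
identical(substr(x, c(2, 4), c(4, 5, 8)), substr(x, 2, 4))
## [1] TRUE
```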

x <- c("123456789", "abcdefghijklmnopq")
substr(x, c(2, 4), c(4, 5, 8))

## [1] "234" "de"

substring(x, c(2, 4), c(4, 5, 8))

## [1] "234"     "de"      "2345678"

More advanced string extraction

# Extract text matching a fixed pattern from strings, as a substitute for stringr::str_extract
# Extract only the first match
extract_str <- function(text, pattern) regmatches(text, regexpr(pattern, text))
# Extract all matches
gextract_str <- function(text, pattern) regmatches(text, gregexpr(pattern, text))

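For example, extracting runs of consecutive digits (the pattern used here, "(\\d+){3}", matches runs of at least three digits):

# Two examples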
extract_str(text = "abd123da345das", pattern = "(\\d+){3}")

## [1] "123"

gextract_str(text = "abd123da345das", pattern = "(\\d+){3}")

## [[1]]
## [1] "123" "345"
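One caveat when treating extract_str as a replacement for stringr::str_extract: regmatches drops elements that have no match, so the result can be shorter than the input, whereas str_extract returns NA for them. A hedged sketch of a length-preserving variant follows; the name extract_str_na is hypothetical and does not come from the cited post.

```r
# Hypothetical helper: like extract_str, but keeps one result per input
# element and fills non-matches with NA
extract_str_na <- function(text, pattern) {
  m <- regexpr(pattern, text)
  pos <- as.integer(m)                    # match starts, -1 where no match
  out <- rep(NA_character_, length(text))
  hit <- !is.na(pos) & pos != -1L         # elements that matched
  out[hit] <- regmatches(text, m)         # regmatches returns only the matches
  out
}

res_na <- extract_str_na(c("abd123da345das", "no digits"), "\\d+")
res_na
## [1] "123" NA
```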

The examples come from https://recology.info/2018/10/limiting-dependencies/