split on last occurrence of digit, take 2nd part
  • Date: 2012-05-24 05:57:43
  • Tags:
  • regex
  • r

If I have a string and want to split on the last digit, keeping the last part of the split, how can I do that?

x <- c("ID", paste0("X", 1:10, state.name[1:10]))

I'd like:

 [1] NA            "Alabama"     "Alaska"      "Arizona"     "Arkansas"   
 [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"    
[11] "Georgia"    

but would be satisfied with:

 [1] "ID"          "Alabama"     "Alaska"      "Arizona"     "Arkansas"   
 [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"    
[11] "Georgia"    

I can get the first part with:

unlist(strsplit(x, "[^0-9]*$"))

but I want the second part.

Thank you in advance.

Best answer
library(stringr)
unlist(lapply(str_split(x, "[0-9]"), tail,n=1))

gives

[1] "ID"          "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California"  "Colorado"    "Connecticut" "Delaware"   
[10] "Florida"     "Georgia"

I would look at the stringr documentation to find what is (most likely) a better way to do this.

Answers

You can do this in one simple step with a regular expression:

gsub("(^.*\\d+)(\\w*)", "\\2", x)

This results in:

 [1] "ID"          "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California"  "Colorado"    "Connecticut"
 [9] "Delaware"    "Florida"     "Georgia"  

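What the regex does:

  1. "(^.*\\d+)(\\w*)": Look for two groups of characters.
    • The first group (^.*\\d+) matches, from the start of the string, any run of characters followed by at least one digit. Because .* is greedy, the group extends through the last digit in the string.
    • The second group (\\w*) matches zero or more word characters (letters, digits, or underscore) after that.
  2. The "\\2" as the second argument to gsub() replaces the matched text with whatever the second group captured. A string with no digit, such as "ID", does not match and is returned unchanged.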
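To see what each group captures, one option is to inspect the match with regexec() and regmatches() from base R. This is only an illustrative sketch; the variable names pattern and groups are not part of the original answer.

```r
x <- c("ID", paste0("X", 1:10, state.name[1:10]))
pattern <- "(^.*\\d+)(\\w*)"

# regexec() records the full match plus each capture group for every element;
# regmatches() extracts the corresponding substrings as a list.
groups <- regmatches(x, regexec(pattern, x))

groups[[1]]   # "ID" contains no digit, so there is no match: character(0)
groups[[11]]  # full match, group 1, group 2: "X10Georgia" "X10" "Georgia"
```

Element 1 shows why gsub() leaves "ID" untouched: with no match there is nothing to replace.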

This looks a bit hacky, but it works:

state.pt2 <- unlist(strsplit(x,"^.[0-9]+"))
state.pt2[state.pt2!=""]

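It would be nicer to get rid of the "" entries generated by the match at the start of each string, but I can't figure out how.

Here is another approach using substr and gregexpr, which avoids having to subset the result: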

substr(x,unlist(lapply(gregexpr("[0-9]",x),max))+1,nchar(x))
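The one-liner above can be unpacked into steps. The following is a sketch of the same idea with intermediate variables (digit_pos, last_digit and tail_part are names chosen here for illustration), plus an optional last step, not in the original answer, that maps strings without a digit to NA as the question prefers.

```r
x <- c("ID", paste0("X", 1:10, state.name[1:10]))

# gregexpr() returns, per string, the start positions of every digit,
# or -1 when the string contains no digit at all.
digit_pos <- gregexpr("[0-9]", x)

# The largest position is the position of the last digit.
last_digit <- vapply(digit_pos, max, integer(1))

# Start one character after the last digit. For "ID" the start is 0,
# which substr() treats like 1, so the whole string comes back.
tail_part <- substr(x, last_digit + 1, nchar(x))

# Optional (an addition for illustration): strings with no digit become NA.
tail_part[last_digit < 0] <- NA
tail_part
```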

gsubfn

Try this gsubfn (http://gsubfn.googlecode.com) solution:

> library(gsubfn)
> strapply(x, ".*\\d(\\w*)|$", ~ if (nchar(z)) z else NA, simplify = TRUE)
 [1] NA            "Alabama"     "Alaska"      "Arizona"     "Arkansas"   
 [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"    
[11] "Georgia"    

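The pattern matches everything up to the last digit followed by word characters, and captures those word characters; if that fails, it matches the end of the line instead (to ensure that something always matches). If the first alternative matched, its backreference is returned; otherwise the backreference is empty, so NA is returned.

Note that the formula ~ if (nchar(z)) z else NA is shorthand for the function function(z) if (nchar(z)) z else NA, and that function could be used in place of the formula at the cost of a few more keystrokes.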

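gsub

A similar strategy also works with plain gsub, but it takes two lines and a slightly more complex regular expression. Here the second alternative slurps up the strings that the first alternative does not match: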

> s <- gsub(".*\\d(\\w*)|.*", "\\1", x)
> ifelse(nchar(s), s, NA)
 [1] NA            "Alabama"     "Alaska"      "Arizona"     "Arkansas"   
 [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"    
[11] "Georgia"    
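If this is needed in more than one place, the two steps could be wrapped in a small helper. The sketch below is a variation, not the answerer's code: the name after_last_digit is hypothetical, and it tests for a digit with grepl() instead of relying on an empty result, so a string that ends in a digit yields "" and not NA.

```r
# Hypothetical helper: return the text after the last digit, or NA when
# the string contains no digit at all.
after_last_digit <- function(x) {
  # Greedy .* consumes everything up to and including the last digit.
  out <- sub(".*[0-9]", "", x)
  # Strings without any digit are left unchanged by sub(); mark them NA.
  out[!grepl("[0-9]", x)] <- NA
  out
}

x <- c("ID", paste0("X", 1:10, state.name[1:10]))
after_last_digit(x)
```

Whether "nothing after the last digit" should count as NA or as an empty string depends on the data, so adjust the grepl() step accordingly.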

EDIT: minor improvements




