I have some Pascal-cased text that I'm trying to split into separate tokens/words.
For example, "Hello123AIIsCool" would become ["Hello", "123", "AI", "Is", "Cool"].
Some Conditions
- "Words" will always start with an upper-cased letter. E.g., "Hello"
- A contiguous sequence of digits should be kept together. E.g., "123" -> ["123"], not ["1", "2", "3"]
- A contiguous sequence of upper-cased letters should be kept together, except when the last letter is the start of a new word as defined in the first condition. E.g., "ABCat" -> ["AB", "Cat"], not ["ABC", "at"]
- There is no guarantee that each condition will have a match in a string. E.g., "Hello", "HelloAI", "HelloAIIsCool", "Hello123", "123AI", "AIIsCool", and any other combination I haven't provided are potential candidates.
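The conditions above can be written down as a small table of expected results, which makes it easy to check any attempt against them. This is only a sketch: the expected lists for inputs other than "Hello123AIIsCool", "123" and "ABCat" are my reading of the conditions rather than outputs stated above, and the names EXPECTED and failing_cases are made up for illustration.

```python
from typing import Callable

# Expected tokens per input. Only the first three entries are spelled out
# in the conditions; the rest are inferred from them (an assumption).
EXPECTED: dict[str, list[str]] = {
    "Hello123AIIsCool": ["Hello", "123", "AI", "Is", "Cool"],
    "123": ["123"],
    "ABCat": ["AB", "Cat"],
    "Hello": ["Hello"],
    "HelloAI": ["Hello", "AI"],
    "HelloAIIsCool": ["Hello", "AI", "Is", "Cool"],
    "Hello123": ["Hello", "123"],
    "123AI": ["123", "AI"],
    "AIIsCool": ["AI", "Is", "Cool"],
}


def failing_cases(extract: Callable[[str], list[str]]) -> list[str]:
    """Return the inputs for which `extract` disagrees with EXPECTED."""
    return [text for text, want in EXPECTED.items() if extract(text) != want]
```

Passing any candidate function to failing_cases returns the inputs it gets wrong; an empty list means it agrees with every entry in the table.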
I've tried a couple of variations. The following two attempts come fairly close to what I want, but not exactly.
Version 0
import re

def extract_v0(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]*"  # a capital followed by zero or more lowercase letters
    num_pattern = r"\d+"           # a run of digits
    pattern = f"{word_pattern}|{num_pattern}"
    extracts: list[str] = re.findall(
        pattern=pattern, string=string
    )
    return extracts

string = "Hello123AIIsCool"
extract_v0(string)

['Hello', '123', 'A', 'I', 'Is', 'Cool']
Version 1
import re

def extract_v1(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]+"     # a capital followed by at least one lowercase letter
    num_pattern = r"\d+"              # a run of digits
    upper_pattern = r"[A-Z][^a-z]*"   # a capital followed by anything that is not lowercase
    pattern = f"{word_pattern}|{num_pattern}|{upper_pattern}"
    extracts: list[str] = re.findall(
        pattern=pattern, string=string
    )
    return extracts

string = "Hello123AIIsCool"
extract_v1(string)

['Hello', '123', 'AII', 'Cool']
Best Option So Far
This uses a combination of regex and looping. It works, but is it the best solution? Or is there something fancy that can do it?
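import re

def extract_v2(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]+"
    num_pattern = r"\d+"
    upper_pattern = r"[A-Z][A-Z]*"
    groups = []
    # Consume matches one pattern at a time, in priority order.
    for pattern in [word_pattern, num_pattern, upper_pattern]:
        while string.strip():
            group = re.search(pattern=pattern, string=string)
            if group is not None:
                groups.append(group)
                # Blank out the match with spaces of the same length so that
                # later match offsets still refer to the original positions
                # (a single space shifts them, e.g. for "12345Ab6").
                blank = " " * (group.end() - group.start())
                string = string[:group.start()] + blank + string[group.end():]
            else:
                break
    # Restore left-to-right order using the match start offsets.
    ordered = sorted(groups, key=lambda g: g.start())
    return [grp.group() for grp in ordered]

string = "Hello123AIIsCool"
extract_v2(string)

['Hello', '123', 'AI', 'Is', 'Cool']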
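For comparison, here is a sketch of how the same tokens might be produced in a single pass with re.findall, using a negative lookahead so that a run of capitals gives up its last letter when a lowercase letter follows. The name extract_single_pass is hypothetical, and the pattern assumes ASCII letters only; it has been checked against the examples in this post, not against every possible input.

```python
import re

# Alternation order matters: capitalized words first, then digit runs,
# then runs of capitals that are NOT immediately followed by a lowercase
# letter (the lookahead makes "AII" backtrack to "AI" before "Is").
TOKEN_PATTERN = re.compile(r"[A-Z][a-z]+|\d+|[A-Z]+(?![a-z])")


def extract_single_pass(string: str) -> list[str]:
    """Split Pascal-cased text into words, digit runs and acronyms."""
    return TOKEN_PATTERN.findall(string)


print(extract_single_pass("Hello123AIIsCool"))
# ['Hello', '123', 'AI', 'Is', 'Cool']
```

Because findall scans left to right, the tokens come back already in order, so no sorting by match position is needed.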