Some Conditions

Question

I have some Pascal-cased text that I'm trying to split into separate tokens/words. For example, "Hello123AIIsCool" would become ["Hello", "123", "AI", "Is", "Cool"].

Some Conditions

"Words" will always start with an upper-cased letter. E.g., "Hello"
A contiguous sequence of numbers should be left together. E.g., "123" -> ["123"], not ["1", "2", "3"]
A contiguous sequence of upper-cased letters should be kept together except when the last letter is the start to a new word as defined in the first condition. E.g., "ABCat" -> ["AB", "Cat"], not ["ABC", "at"]
There is no guarantee that each condition will have a match in a string. E.g., "Hello", "HelloAI", "HelloAIIsCool", "Hello123", "123AI", "AIIsCool", and any other combination I haven't provided are potential candidates.

I've tried a couple of variations. The following two attempts come pretty close to what I want, but not quite.

Version 0

import re


def extract_v0(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]*"
    num_pattern = r"\d+"
    pattern = f"{word_pattern}|{num_pattern}"
    extracts: list[str] = re.findall(
        pattern=pattern, string=string
    )
    return extracts


string = "Hello123AIIsCool"
extract_v0(string)

['Hello', '123', 'A', 'I', 'Is', 'Cool']

Version 1

import re


def extract_v1(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]+"
    num_pattern = r"\d+"
    upper_pattern = r"[A-Z][^a-z]*"
    pattern = f"{word_pattern}|{num_pattern}|{upper_pattern}"
    extracts: list[str] = re.findall(
        pattern=pattern, string=string
    )
    return extracts


string = "Hello123AIIsCool"
extract_v1(string)

['Hello', '123', 'AII', 'Cool']

Best Option So Far

This uses a combination of regex and looping. It works, but is it the best solution? Or is there some fancy regex that can do it?

import re


def extract_v2(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]+"
    num_pattern = r"\d+"
    upper_pattern = r"[A-Z][A-Z]*"
    groups = []
    for pattern in [word_pattern, num_pattern, upper_pattern]:
        while string.strip():
            group = re.search(pattern=pattern, string=string)
            if group is not None:
                groups.append(group)
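                # Blank out the match with spaces of the same width so that
                # the offsets of later matches stay comparable when sorting.
                string = string[:group.start()] + " " * len(group.group()) + string[group.end():]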
            else:
                break
    
    ordered = sorted(groups, key=lambda g: g.start())
    return [grp.group() for grp in ordered]


string = "Hello123AIIsCool"
extract_v2(string)

['Hello', '123', 'AI', 'Is', 'Cool']

Answer 1

Code:

import re


def extract_v1(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]+"
    num_pattern = r"\d+"
    upper_pattern = r"[A-Z]+(?![a-z])"  # Fixed
    pattern = f"{word_pattern}|{num_pattern}|{upper_pattern}"
    extracts: list[str] = re.findall(
        pattern=pattern, string=string
    )
    return extracts


string = "Hello123AIIsCool"
extract_v1(string)

Result:

['Hello', '123', 'AI', 'Is', 'Cool']

The fixed upper_pattern matches as many uppercase letters as possible and, if a lowercase letter follows, stops one letter short of it.
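
To make the backtracking visible, here is a minimal sketch that runs the fixed upper_pattern on its own against two strings from the question. The variable names upper_only_abcat and upper_only_ai are made up for this illustration.

```python
import re

# Uppercase run that must not be immediately followed by a lowercase letter.
upper_pattern = r"[A-Z]+(?![a-z])"

# Greedy [A-Z]+ first grabs "ABC"; the lookahead then sees "a" and fails,
# so the engine backtracks to "AB" and leaves "C" for the word pattern.
upper_only_abcat = re.findall(upper_pattern, "ABCat")

# Same idea: "AII" is tried first, then shortened to "AI" because "s" follows.
upper_only_ai = re.findall(upper_pattern, "AIIsCool")

print(upper_only_abcat)  # ['AB']
print(upper_only_ai)     # ['AI']
```

The leftover capitals ("C" in "ABCat", the second "I" in "AIIsCool") are not lost in the full pattern, because word_pattern is tried first at each position and picks them up as "Cat" and "Is".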

Answer 2

Using re.sub and split():

import re

def pascal_case_split(identifier):
    return re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', re.sub('([0-9]+)', r' \1', identifier))).split()

a = pascal_case_split("Hello123AIIsCool")
a

['Hello', '123', 'AI', 'Is', 'Cool']
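
Since the nested calls are hard to read, the following sketch unrolls them into three steps and shows the intermediate strings. The names step1, step2, step3 and tokens are introduced here only for illustration.

```python
import re

identifier = "Hello123AIIsCool"

# 1. Put a space in front of every run of digits.
step1 = re.sub(r"([0-9]+)", r" \1", identifier)

# 2. Put a space in front of every run of uppercase letters.
step2 = re.sub(r"([A-Z]+)", r" \1", step1)

# 3. Put a space in front of every capitalised word; this is what
#    separates "Is" from the uppercase run "AII".
step3 = re.sub(r"([A-Z][a-z]+)", r" \1", step2)

# split() with no arguments collapses the repeated spaces.
tokens = step3.split()

print(repr(step1))  # 'Hello 123AIIsCool'
print(repr(step2))  # ' Hello 123 AIIs Cool'
print(repr(step3))  # '  Hello 123 AI Is  Cool'
print(tokens)       # ['Hello', '123', 'AI', 'Is', 'Cool']
```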

Reference

Answer 3

re.findall should do the job with a lot less work. re.X allows whitespace inside the pattern.

>>> re.findall(
...   r'([A-Z]{2,} (?![a-z]) | \d+ | [A-Z] [a-z]*)',
...   'Hello123AIIsCool',
...   re.X
... )
['Hello', '123', 'AI', 'Is', 'Cool']
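
Because re.X (re.VERBOSE) also permits # comments inside the pattern, the same expression can be spread over several lines and annotated. This is only a sketch of that layout; TOKEN_PATTERN and split_pascal are hypothetical names, not part of the answer above.

```python
import re

# Same alternatives as the one-line pattern, written in verbose mode.
TOKEN_PATTERN = re.compile(
    r"""
      [A-Z]{2,} (?![a-z])   # two or more capitals, not followed by lowercase
    | \d+                   # a run of digits
    | [A-Z] [a-z]*          # one capital plus any lowercase letters
    """,
    re.VERBOSE,
)


def split_pascal(text: str) -> list[str]:
    """Return the tokens found in a Pascal-cased string."""
    return TOKEN_PATTERN.findall(text)


print(split_pascal("Hello123AIIsCool"))  # ['Hello', '123', 'AI', 'Is', 'Cool']
print(split_pascal("ABCat"))             # ['AB', 'Cat']
```

With no capturing group in the pattern, findall returns the whole match for each token, so the outer parentheses of the one-line version are not needed here.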

Answer 4

You can try:

[A-Z](?:[a-z]+|(?:[A-Z](?![a-z]))+)?|\d+

See the demo.

import re

pattern = r"[A-Z](?:[a-z]+|(?:[A-Z](?![a-z]))+)?|\d+"
text = "Hello123AIIsCoolAndHTML5IsAMarkupLanguage"

print(re.findall(pattern, text))
# ['Hello', '123', 'AI', 'Is', 'Cool', 'And', 'HTML', '5', 'Is', 'A', 'Markup', 'Language']
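
As a quick sanity check, the sketch below runs the single-pattern answers against the sample strings listed in the question. The expected lists are my reading of the stated conditions, not output supplied by the asker, and PATTERNS, EXPECTED and results are names invented for this example. The Answer 3 pattern is written without the verbose-mode spaces and without its outer group.

```python
import re

# The findall-based patterns from the answers above.
PATTERNS = {
    "answer_1": r"[A-Z][a-z]+|\d+|[A-Z]+(?![a-z])",
    "answer_3": r"[A-Z]{2,}(?![a-z])|\d+|[A-Z][a-z]*",
    "answer_4": r"[A-Z](?:[a-z]+|(?:[A-Z](?![a-z]))+)?|\d+",
}

# Assumed expected tokens, derived from the conditions in the question.
EXPECTED = {
    "Hello": ["Hello"],
    "HelloAI": ["Hello", "AI"],
    "HelloAIIsCool": ["Hello", "AI", "Is", "Cool"],
    "Hello123": ["Hello", "123"],
    "123AI": ["123", "AI"],
    "AIIsCool": ["AI", "Is", "Cool"],
    "ABCat": ["AB", "Cat"],
    "Hello123AIIsCool": ["Hello", "123", "AI", "Is", "Cool"],
}

# Run every pattern over every sample string.
results = {
    name: {text: re.findall(pattern, text) for text in EXPECTED}
    for name, pattern in PATTERNS.items()
}

for name, outputs in results.items():
    mismatches = [text for text, got in outputs.items() if got != EXPECTED[text]]
    print(name, "OK" if not mismatches else f"differs on {mismatches}")
```

On these inputs the three patterns agree; strings that are not Pascal-cased (for example, ones starting with a lowercase letter) are outside what the question specifies and are not covered here.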
