English 中文(简体)
在 Python 中查找包含一定数量某些字符的子字符串
原标题:Regex expression to find the substring that contains certain amount of certain characters with Python
  • 时间:2024-05-23 00:39:39
  •  标签:
  • python
  • regex

I have a string like "GGACATCGCGGTGGATCGAC"

How can I find all substrings that contain, for example, three Gs in any order? For this string, it would be GCGG or GTGG , or GGACATCG . Also need to find such substrings for several letters, like, substrings with two Gs and one C ( GGAC , GGATC , CGG )

问题回答

Sliding Window:

  • The best way to solve this problem is to write an O(N) algorithm.
  • If the number of chars are manageable, then it can be tuned reasonably good. If not, you need to write several methods and make your algorithms modular to be easy to debug and maintain.

Code

def _sliding_window(s, char_a, char_b, count_a, count_b):
    res = []
    starts = []

    for i in range(len(s)):
        char = s[i]

        if char in (char_a, char_b):
            starts.append(i)

    for start in starts:
        As, Bs = 0, 0
        temp = ""

        for i in range(start, len(s)):
            char = s[i]

            if char == char_a:
                As += 1
            if char == char_b:
                Bs += 1

            temp += char

            if As == count_a and Bs == count_b:
                res.append(temp)
                break

    return res


s = "GGACATCGCGGTGGATCGCCACGCGACATCGCGGTGGATCGACGGACATCCGCGGTGCGATCCGACGACATGCGCGCG"

print(_sliding_window(s,  G ,  C , 2, 1), "
")
print(_sliding_window(s,  G ,  C , 3, 0), "
")
print(_sliding_window(s,  G ,  C , 1, 2), "
")
print(_sliding_window(s,  G ,  C , 0, 3), "
")



Note:

  • 在上述算法中,我们首先找到起始指数, 然后我们执行滑动窗口。

  • 锡尔有些逻辑问题,我会交给你处理

  • 如果您没有太多数据, 您可以使用 O( N) 2 算法, 它很容易写 。

Prints

[ GGAC ,  GCG ,  CGG ,  GGATC ,  GATCG ,  GCG ,  GCG ,  CGG ,  GGATC ,  GATCG ,  GACG ,  CGG ,  GGAC ,  GCG ,  CGG ,  GTGC ,  GCG ,  GACG ,  GACATG ,  GCG ,  GCG ,  GCG ] 

[ GGTG ,  GTGG ,  GGTG ,  GTGG ,  GGTG ] 

[ GACATC ,  CATCG ,  CGC ,  CGC ,  GCC ,  CACG ,  CGC ,  CGAC ,  GACATC ,  CATCG ,  CGC ,  CGAC ,  GACATC ,  CCG ,  CGC ,  CGATC ,  GATCC ,  CCG ,  CGAC ,  CGAC ,  CATGC ,  CGC ,  CGC ] 

[ CCAC ,  CATCC ] 

Regex Pattern

d 很难为每个案件写出模式。 但写出模式的方式如下:

  • Define a lookahead with what s necessary to be.
  • Define your patterns one by one for each case and use pipes to join them.

For 3G in any order:

(?=.*G.*G.*G)([G].*?[G].*?[G])

Code

import re

p = r (?=.*G.*G.*G)([G].*?[G].*?[G]) 

s = """
GGACATCGCGGTGGATCGACGGACATCGCGGTGGATCGACGGACATCGCGGTGGATCGAC GACATGCGCGCG. GGACATGCGCGCGGGACATGCGCGCG. GGACATGCGCGCG. 
GCGG or GTGG, or GGACATCG. GCGG or GTGG, or GGACATGCGCGCG. GGACATGCGCGCG. GGACATGCGCGCGGGACATGCGCGCG. GGACATGCGCGCG. 
GCGG or GTGG, or GGACATCG. GCGG or GTGG, or GGACATGCGCGCG. GGACATGCGCGCG. GGACATGCGCGCGGGACATGCGCGCG. GGACATGCGCGCG. 
"""

find_3G = re.findall(p, s)

print(find_3G)

# Note that  GCG. G  is also part of those found.


Prints

[ GGACATCG , GGTG , GATCGACG , GACATCGCG , GTGG , GACGG , GCGG , GGATCG , GACATGCG , GCG. G , GACATGCG , GCGG , GACATGCG , GCG. G , GACATGCG , GCGG , GTGG , GGACATCG , GCGG , GTGG , GGACATG , GCGCG , GGACATG , GCGCG , GGACATG , GCGCG , GGACATG , GCGCG , GGACATG , GCGCG , GCGG , GTGG , GGACATCG , GCGG , GTGG , GGACATG , GCGCG , GGACATG , GCGCG , GGACATG , GCGCG , GGACATG , GCGCG , GGACATG , GCGCG ]


2G和1C的天真野蛮力量算法如下:

def _pattern(s):
    return s.count( G ) == 2 and s.count( C ) == 1

def _collect(s):
    res = []
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            part = s[i:j]
            if _pattern(part):
                res.append(part)
    return res


s = """
GGACATCGCGGTGGATCGACGGACATCGCGGTGGATCGACGGACATCGCGGTGGATCGAC GACATGCGCGCG. GGACATGCGCGCGGGACATGCGCGCG. GGACATGCGCGCG. 
GCGG or GTGG, or GGACATCG. GCGG or GTGG, or GGACATGCGCGCG. GGACATGCGCGCG. GGACATGCGCGCGGGACATGCGCGCG. GGACATGCGCGCG. 
GCGG or GTGG, or GGACATCG. GCGG or GTGG, or GGACATGCGCGCG. GGACATGCGCGCG. GGACATGCGCGCGGGACATGCGCGCG. GGACATGCGCGCG. 
"""
G, C = 2, 1

print(_collect(s))

Prints

[ GGAC , GGACA , GGACAT , GGAC , GGACA , GGACAT , GCG , CGG , CGGT , TGGATC , GGATC , GATCG , GATCGA , GACG , ACGG , ACGGA , CGG , CGGA , GGAC , GGACA , GGACAT , GCG , CGG , CGGT , TGGATC , GGATC , GATCG , GATCGA , GACG , ACGG , ACGGA , CGG , CGGA , GGAC , GGACA , GGACAT , GCG , CGG , CGGT , TGGATC , GGATC , GATCG , GATCGA , GAC G , GAC GA , GACATG , GACATG , ATGCG , TGCG , GCG , GCG , GCG , GCG. , GCG. , CG. G , . GGAC , . GGACA , . GGACAT , GGAC , GGACA , GGACAT , GGAC , GGACA , GGACAT , GACATG , ATGCG , TGCG , GCG , GCG , GCG , CGG , GGAC , GGACA , GGACAT , GACATG , ATGCG , TGCG , GCG , GCG , GCG , GCG. , GCG. , CG. G , . GGAC , . GGACA , . GGACAT , GGAC , GGACA , GGACAT , GGAC , GGACA , GGACAT , GACATG , ATGCG , TGCG , GCG , GCG , GCG , GCG. , GCG. , GCG. , CG. G , G. GC , . GCG , GCG , GCG , GCG , CGG , CGG , CGG o , CGG or , CGG or , , or GGAC , , or GGACA , , or GGACAT , or GGAC , or GGACA , or GGACAT , or GGAC , or GGACA , or GGACAT , r GGAC , r GGACA , r GGACAT , GGAC , GGACA , GGACAT , GGAC , GGACA , GGACAT , ATCG. G , TCG. G , CG. G , G. GC , . GCG , GCG , GCG , CGG , CGG , CGG o , CGG or , CGG or , , or GGAC , , or GGACA , , or GGACAT , or GGAC , or GGACA , or GGACAT , or GGAC , or GGACA , or GGACAT , r GGAC , r GGACA , r GGACAT , GGAC , GGACA , GGACAT , GGAC , GGACA , GGACAT , GACATG , ATGCG , TGCG , GCG , GCG , GCG , GCG. , GCG. , CG. G , . GGAC , . GGACA , . GGACAT , GGAC , GGACA , GGACAT , GGAC , GGACA , GGACAT , GACATG , ATGCG , TGCG , GCG , GCG , GCG , GCG. , GCG. , CG. G , . GGAC , . GGACA , . GGACAT , GGAC , GGACA , GGACAT , GGAC , GGACA , GGACAT , GACATG , ATGCG , TGCG , GCG , GCG , GCG , CGG , GGAC , GGACA , GGACAT , GACATG , ATGCG , TGCG , GCG , GCG , GCG , GCG. , GCG. , CG. G , . GGAC , . GGACA , . GGACAT , GGAC , GGACA , GGACAT , GGAC , GGACA , GGACAT , GACATG , ATGCG , TGCG , GCG , GCG , GCG , GCG. , GCG. , GCG. , CG. G , G. GC , . GCG , GCG , GCG , GCG , CGG , CGG , CGG o , CGG or , CGG or , , or GGAC , , or GGACA , , or GGACAT , or GGAC , or GGACA , or GGACAT , or GGAC , or GGACA , or GGACAT , r GGAC , r GGACA , r GGACAT , GGAC , GGACA , GGACAT , GGAC , GGACA , GGACAT , ATCG. G , TCG. G , CG. G , G. GC , . GCG , GCG , GCG , CGG , CGG , CGG o , CGG or , CGG or , , or GGAC , , or GGACA , , or GGACAT , or GGAC , or GGACA , or GGACAT , or GGAC , or GGACA , or GGACAT , r GGAC , r GGACA , r GGACAT , GGAC , GGACA , GGACAT , GGAC , GGACA , GGACAT , GACATG , ATGCG , TGCG , GCG , GCG , GCG , GCG. , GCG. , CG. G , . GGAC , . GGACA , . GGACAT , GGAC , GGACA , GGACAT , GGAC , GGACA , GGACAT , GACATG , ATGCG , TGCG , GCG , GCG , GCG , GCG. , GCG. , CG. G , . GGAC , . GGACA , . GGACAT , GGAC , GGACA , GGACAT , GGAC , GGACA , GGACAT , GACATG , ATGCG , TGCG , GCG , GCG , GCG , CGG , GGAC , GGACA , GGACAT , GACATG , ATGCG , TGCG , GCG , GCG , GCG , GCG. , GCG. , CG. G , . GGAC , . GGACA , . GGACAT , GGAC , GGACA , GGACAT , GGAC , GGACA , GGACAT , GACATG , ATGCG , TGCG , GCG , GCG , GCG , GCG. , GCG. , GCG. ]


Note

  • 这个算法可以用O( N) 时间复杂性来写。 我只是想在这里显示我们如何解决问题。

  • 您可以添加处理边缘案件的方法 。

  • 评级非常灵活,易于维持,易于调试。

首先,你应该注意,匹配的子字符串必须用想要的字母来开始和结束。

在此限制下, 将子字符串与 3 个精确的 < code> G 匹配, 将 < code> G 以可选的非 < code> G 字符捕捉到 < code> G/code> 字符, 从而可以相当直截了当:

(?=((?:G[^G]*){2}G))

说明:https://regex101.com/r/ajXtbq/1, rel=>https://regex101.com/r/ajXtbq/1

将子字符串与两个确切的 G 和一个 C 匹配起来的参与要多一些。 它会从下面开始捕捉每个字母的所需数目并捕捉以下内容, 然后捕捉一个由先前所捕捉的指定字母之一构成的子字符串, 其后面的字符不是该字母, 直到它处于一个点, 以下是什么是先前捕捉到其他想要的字母之后的内容之一 :

(?=([^C]*C)(.*$))(?=((?:[^G]*G){2})(.*$))(?=(?=[CG])(1[^C]*(?=4$)|3[^G]*(?=2$)))

说明:https://regex101.com/r/uCtCfq/1, rel=>https://regex101.com/r/uCtCfq/1

你可以试试这个:

import re

string = "GGACATCGCGGTGGATCGAC"

# Find all substrings with three Gs
three_gs = re.findall(r"G.*G.*G", string)
print(three_gs)  # Output: [ GCGG ,  GTGG ,  GGACATCG ]

# Find all substrings with two Gs and one C
two_gs_one_c = re.findall(r"G.*G.*C|C.*G.*G", string)
print(two_gs_one_c)  # Output: [ GGAC ,  GGATC ,  CGG ]

或者你可以尝试这样更动态的方法:

import re

def find_substrings(string, chars, count):
    pattern =  | .join(f"{char}.*" * count for char in chars)
    return re.findall(pattern, string)

string = "GGACATCGCGGTGGATCGAC"

# Find all substrings with three Gs
three_gs = find_substrings(string, [ G ], 3)
print(three_gs)  # Output: [ GCGG ,  GTGG ,  GGACATCG ]

# Find all substrings with two Gs and one C
two_gs_one_c = find_substrings(string, [ G ,  C ], [2, 1])
print(two_gs_one_c)  # Output: [ GGAC ,  GGATC ,  CGG ]

Re.findall 函数返回字符串中模式的所有非重叠匹配。 模式 G. *G. *G. *G 匹配包含 3 Gs 的任何子字符串, 而 G. *G. *C. *G. *G 匹配包含 2 Gs 和 1 C 的任何子字符串 。





相关问题
Can Django models use MySQL functions?

Is there a way to force Django models to pass a field to a MySQL function every time the model data is read or loaded? To clarify what I mean in SQL, I want the Django model to produce something like ...

An enterprise scheduler for python (like quartz)

I am looking for an enterprise tasks scheduler for python, like quartz is for Java. Requirements: Persistent: if the process restarts or the machine restarts, then all the jobs must stay there and ...

How to remove unique, then duplicate dictionaries in a list?

Given the following list that contains some duplicate and some unique dictionaries, what is the best method to remove unique dictionaries first, then reduce the duplicate dictionaries to single ...

What is suggested seed value to use with random.seed()?

Simple enough question: I m using python random module to generate random integers. I want to know what is the suggested value to use with the random.seed() function? Currently I am letting this ...

How can I make the PyDev editor selectively ignore errors?

I m using PyDev under Eclipse to write some Jython code. I ve got numerous instances where I need to do something like this: import com.work.project.component.client.Interface.ISubInterface as ...

How do I profile `paster serve` s startup time?

Python s paster serve app.ini is taking longer than I would like to be ready for the first request. I know how to profile requests with middleware, but how do I profile the initialization time? I ...

Pragmatically adding give-aways/freebies to an online store

Our business currently has an online store and recently we ve been offering free specials to our customers. Right now, we simply display the special and give the buyer a notice stating we will add the ...

Converting Dictionary to List? [duplicate]

I m trying to convert a Python dictionary into a Python list, in order to perform some calculations. #My dictionary dict = {} dict[ Capital ]="London" dict[ Food ]="Fish&Chips" dict[ 2012 ]="...

热门标签