Question

In an HTML file, I use the following regex to detect opening and closing script tags:

<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>

In short, it is meant to catch: <script "NOT THIS</s" > "NOT THIS</s" </script>

It works, but it needs a really long time to detect <script>, even minutes or hours for long strings.

The following works perfectly, even for long strings:

<script[^<]*>[^<]*</script>

However, for other tags, characters such as < and > may also appear as attribute values.

Python test:

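import re
# The pattern from above, as a raw string; flags are combined with "|".
pattern = re.compile(r'<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>', re.I | re.DOTALL)
# Short input: returns quickly.
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
# Long input: this is the search that takes a very long time.
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()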

How can I fix it? The inner part of the regex (after <script>) should be changed and simplified.
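For illustration, here is a minimal sketch of one way the pattern could be restructured. It assumes the slowdown comes from the `+` nested inside the starred group, which lets the engine split the same run of text in exponentially many ways. The name SCRIPT_RE is hypothetical, and the rewritten body is not behavior-identical to the original: it repeats the whole alternation and also accepts an empty body.

```python
import re

# Hypothetical rewrite of the question's pattern. Each pass through a starred
# group consumes either one non-"<" character or a "<" plus what follows it,
# so a given stretch of text can be split in only one way.
SCRIPT_RE = re.compile(
    r"<script(?:[^<]|<(?:[^/]|/[^s]))*>"  # opening tag: may contain "<" but not "</s"
    r"(?:[^<]|<(?:[^/]|/[^s]))*"          # body: same restriction
    r"</script>",
    re.I,
)

easy = '11<script type="text/javascript"> easy>example</script>22'
hard = '<script type="text/javascript">' + ("hard example" * 50) + "</script>"

easy_match = SCRIPT_RE.search(easy)
hard_match = SCRIPT_RE.search(hard)

print(easy_match.group())  # the script element without the surrounding digits
print(hard_match.span())   # start and end offsets of the long match
```

Because the match object exposes start() and end(), the exact position of each tag is still available with this approach.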

PS :) Anticipating answers about the wrong approach, such as using regex for HTML parsing: I know many HTML/XML parsers very well, I know what to expect from HTML code that is often broken, and regex is really useful here.

Comment: well, I need to handle things like:
each <a < document like this.border="5px;">
and my approach is to use parsers and regex together. BeautifulSoup is only 2k lines; it does not handle every HTML document and just extends the regexes from sgmllib.

The main reason is that I must know the exact position where every tag starts and stops, and every piece of broken HTML must be handled.
BS is not perfect; sometimes this happens:
BeautifulSoup('< scriPt >a<aa>s< /script>').findAll('script') == []

@Cylian: atomic grouping, as you know, is not available in Python's re.
So a non-greedy match of everything, .*?, until <\s*/\s*tag\s*> is the winner for now.

I know it is not perfect in a case like re.search(r'<\s*script.*?<\s*/\s*script\s*>', '< script </script> shit </script>').group(), but I can handle the rejected tail in the next parsing pass.

It's pretty obvious that parsing HTML with regex is not a battle won in a single fight.

Answer 1

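I don't know Python, but I do know regular expressions.

If you use the greedy/non-greedy operators, you can get a much simpler regex.

This assumes there are no nested scripts.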
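In Python, a non-greedy pattern of this kind might look like the sketch below. The exact pattern, the name SCRIPT_BLOCK_RE, and the helper find_script_blocks are assumptions made for illustration. The optional whitespace inside the tags follows the broken-HTML cases mentioned in the question's comments.

```python
import re

# Lazy ".*?" stops at the first closing script tag; re.S lets "." cross newlines.
SCRIPT_BLOCK_RE = re.compile(r"<\s*script\b.*?<\s*/\s*script\s*>", re.I | re.S)

def find_script_blocks(html):
    """Return (start, end, text) for each script block, left to right."""
    return [(m.start(), m.end(), m.group()) for m in SCRIPT_BLOCK_RE.finditer(html)]

sample = 'a<script type="text/javascript">x = "</b>";</script>b< SCRIPT >y< /script >c'
for start, end, text in find_script_blocks(sample):
    print(start, end, text)
```

Because the match is lazy, a nested input such as `<script>a<script>b</script>c</script>` ends at the first closing tag, which is why the no-nesting assumption matters.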

Answer 2

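Use an HTML parser like Beautiful Soup.

See the great answers to "Can I remove script tags with BeautifulSoup?" (https://stackoverflow.com/questions/55598524/can-i-remove-script-stative-tags-with-beautifulsoup).

If your only tool is a hammer, every problem starts to look like a nail. Regular expressions are a powerful hammer, but they are not always the best solution to a given problem.

I guess you want to remove scripts from user-posted HTML for security reasons. If security is the main concern, regular expressions are hard to get right, because there are so many things a hacker can modify to fool your regex but that most browsers will happily evaluate... A specialized parser is easier to use, performs better, and is safer.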
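As a rough illustration of the parser-based approach, the sketch below removes script elements using Python's standard-library html.parser, which stands in for Beautiful Soup here so that the example has no third-party dependency. The class ScriptStripper and the function strip_scripts are hypothetical names. The sketch also drops comments and doctype declarations, and it is not a vetted sanitizer.

```python
from html.parser import HTMLParser

class ScriptStripper(HTMLParser):
    """Rebuilds the markup it is fed, leaving out script elements."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True  # the parser now reports the body as raw data
        else:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

def strip_scripts(html):
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

print(strip_scripts("<p>a &amp; b</p><script>alert(1)</script><br/>"))
```

The parser tracks tag boundaries itself, so a `</b>` inside a script string or a `>` inside an attribute value does not need special handling in the calling code.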

If you are still wondering "why can't I use regex?", read