In an HTML file, I use the following regex to detect opening and closing script tags:
<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>
In short, it will match: <script "NOT THIS</s" > "NOT THIS</s" </script>, where "NOT THIS</s" stands for any text that does not contain </s.
It works, but it takes a really long time to detect <script>, even minutes or hours for long strings.
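The slowdown can be reproduced on tiny inputs. The sketch below is only an illustration: the helper name `time_match` and the body sizes are my own choices, and the timings will differ between machines. It times the pattern above on script bodies of a few characters. Because `[^<]+` sits inside a group that is itself repeated with `*`, the engine can split a run of non-`<` characters in very many ways, and it tries them before it falls back to the `>` right after `<script`.

```python
import re
import time

# The pattern from the question, written as a raw string.
SLOW_RE = re.compile(
    r'<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>'
    r'(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>',
    re.I | re.DOTALL,
)


def time_match(body_len):
    """Return (seconds, matched_text) for a script body of body_len chars."""
    html = '<script>' + 'x' * body_len + '</script>'
    start = time.perf_counter()
    m = SLOW_RE.search(html)
    return time.perf_counter() - start, m.group()


# Keep the sizes small: each extra character roughly doubles the work.
timings = {n: time_match(n) for n in (12, 15, 18)}
for n, (secs, _) in timings.items():
    print(f'body of {n:2d} chars: {secs:.4f} s')
```

The match itself is still correct at every size; only the time grows, which is why the pattern looks fine on short inputs.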
This simpler pattern works perfectly, even for long strings:
<script[^<]*>[^<]*</script>
However, for other tags, for example
Python test:
import re
pattern = re.compile(r'<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>', re.I + re.DOTALL)
# short input: returns quickly
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
# long input: this call takes an extremely long time
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()
How can I fix it? The inner part of the regex (after <script>) should be changed and simplified.
PS :) I anticipate your answers about the wrong approach, like using regex for HTML parsing. I know many HTML/XML parsers very well, and I know what to expect from often-broken HTML code; regex is really useful here.
comment:
well, I need to handle:
each <a < document like this.border="5px;">
and my approach is to use parsers and regex together.
BeautifulSoup is only about 2k lines; it does not handle every HTML document and just extends the regexes from sgmllib.
The main reason is that I must know the exact position where every tag starts and stops, and every broken HTML document must be handled.
BS is not perfect; sometimes this happens:
BeautifulSoup('< scriPt\n>a<aa>s< /script>').findAll('script') == []
@Cylian:
Atomic grouping, as you know, is not available in Python's re.
So non-greedy matching of everything, .*? until <\s*/\s*tag\s*>, is the winner for now.
I know that it is not perfect in a case like this:
re.search(r'<\s*script.*?<\s*/\s*script\s*>', '< script </script> shit </script>').group()
but I can handle the rejected tail in the next parsing pass.
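Since I need the exact positions, the non-greedy approach can be sketched as below. This is a minimal sketch and not a complete solution: `SCRIPT_RE` and `script_spans` are names I made up for the example, and the inputs are only test strings.

```python
import re

# Non-greedy body up to the first closing tag; whitespace is tolerated
# inside the tags, and matching ignores case.
SCRIPT_RE = re.compile(r'<\s*script.*?<\s*/\s*script\s*>', re.I | re.DOTALL)


def script_spans(html):
    """Return (start, end, text) for every script block found in html."""
    return [(m.start(), m.end(), m.group()) for m in SCRIPT_RE.finditer(html)]


# The broken tag that BeautifulSoup missed is found, with its position.
html = '11< scriPt\n>a<aa>s< /script>22'
spans = script_spans(html)

# A long body is no longer a problem for the non-greedy pattern.
long_html = '<script type="text/javascript">' + ('hard example' * 50) + '</script>'
long_spans = script_spans(long_html)

# The imperfect case: the match stops at the first closing tag, and the
# rest is left over for the next parsing pass.
broken = '< script </script> tail </script>'
first = SCRIPT_RE.search(broken)
leftover = broken[first.end():]
```

Each tuple gives the start and end offsets in the original string, so the found block can be cut out or handed to a parser without losing track of where it came from.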
It's pretty obvious that HTML parsing with regex is not a one-battle fight.