Optimized regex for N words around a given word (UTF-8)

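I'm trying to find an optimized regex that returns the N words (if available) around another word, in order to build a summary. The string is UTF-8, so the definition of a "word" is broader than just [a-z]. The string that serves as the reference word may sit in the middle of a word, or may not be directly surrounded by spaces.

I already have the following, which works, but it seems to be greedy and chokes when looking for more than 6 or 7 words around the other one: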

/(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,4}lorem(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,4}/u

This is the PHP method I've built, but I need help making the regex less greedy and making it work for any number of words around.

/**
 * Finds N words around a specified word in a string.
 *
 * @param string $string The complete string to look in.
 * @param string $find The string to look for.
 * @param integer $before The number of words to look for before $find.
 * @param integer $after The number of words to look for after $find.
 * @return mixed False if $find was not found and all the words around otherwise.
 */
private function getWordsAround($string, $find, $before, $after)
{
    $matches = array();
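    // Pass the delimiter so that a "/" inside $find is escaped as well.
    $find = preg_quote($find, '/');
    $regex = '(?:[^\s\r\n]+[\s\r\n]+[^\s\r\n]*){0,' . (int)$before . '}' .
        $find . '(?:[^\s\r\n]*[\s\r\n]+[^\s\r\n]+){0,' . (int)$after . '}';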
    if (preg_match("/$regex/u", $string, $matches)) {
        return $matches[0];
    } else {
        return false;
    }
}

If I have the following $string:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, 
felis non vehicula suscipit, enim quam adipiscing turpis, eget rutrum 
eros velit non enim. Sed commodo cursus vulputate. Aliquam id diam sed arcu 
fringilla venenatis. Cras vitae ante ut tellus malesuada convallis. Vivamus 
luctus ante vel ligula eleifend condimentum. Donec a vulputate velit. 
Suspendisse velit risus, volutpat at dapibus vitae, viverra vel nulla."

and I call getWordsAround($string, 'consectetur', 8, 8), I want to get the following result:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, 
felis non vehicula suscipit,"

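Thank you for your help.

Answers

How about using a regex or some other method to split the input text into an array of words? Then loop over the words looking for the target word. Once it is found, grab the required slice of the array, join it together and print it.

To preserve the original whitespace between words, you can keep it attached to the end of each word.

This could also be implemented as a stream parser rather than splitting the whole string first.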
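The split-and-slice idea above could look roughly like the following. This is a minimal sketch, not the answerer's code: the function name wordsAroundBySplit is made up for illustration, and it assumes $find is a single word or a part of one (it never spans whitespace).

```php
<?php
/**
 * Sketch: split the text into words, find the first word containing $find,
 * and return a slice of words around it. Hypothetical helper name.
 */
function wordsAroundBySplit(string $string, string $find, int $before, int $after)
{
    if ($find === '') {
        return false;
    }
    // Each element of $m[0] is one word plus the whitespace that follows it,
    // so joining the pieces restores the original spacing.
    if (!preg_match_all('/\S+\s*/u', $string, $m)) {
        return false; // no words at all, or invalid UTF-8
    }
    $words = $m[0];
    foreach ($words as $i => $word) {
        if (strpos($word, $find) !== false) {
            $start  = max(0, $i - max(0, $before));
            $length = ($i - $start) + 1 + max(0, $after);
            // Drop the whitespace trailing the last word of the slice.
            return rtrim(implode('', array_slice($words, $start, $length)));
        }
    }
    return false;
}
```

The whole string is split up front, so this trades memory for simplicity; the stream-parser variant mentioned above would avoid that.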

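As mentioned earlier, the problem is a very large amount of backtracking. To solve this, I tried to use lookbehind and lookahead to anchor the match to the string. So I came up with: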

/consectetur(?<=((?:\S+\s+){0,8})\s*consectetur)\s*(?=((?:\S+\s+){0,8}))/

Unfortunately, this does not work, because variable-length lookbehinds are not supported in PCRE (or in Perl, for that matter). So we are left with:

/consectetur\s*(?:\S+\s+){0,8}/

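This only captures the match string and up to 8 words after the match. However, if you use the PREG_OFFSET_CAPTURE flag, you can get the offset of the match ($matches[0][1]), take the substring up to that point, reverse it with strrev, get the first 0-8 words (using /\s*(?:\S+\s+){0,8}/), reverse that match, and recombine: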

$s = "put test string here";
$matches = array();
if (preg_match('/consectetur\s*(?:\S+\s+){0,8}/', $s, $matches, PREG_OFFSET_CAPTURE)) {
  $before = strrev(substr($s, 0, $matches[0][1]));
  $before_match = array();
  preg_match('/\s*(?:\S+\s+){0,8}/', $before, $before_match);
  echo strrev($before_match[0]) . $matches[0][0];
}

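You can make this a bit faster on large strings by taking only a safe margin of characters before the offset, say 100. Then you are only reversing 100 characters rather than the whole prefix.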
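A minimal sketch of that margin idea follows. The function name wordsAroundWithMargin and the $margin parameter (counted in bytes) are assumptions made for illustration. Each pattern is built the same way as above, with the word counts made configurable.

```php
<?php
// Hypothetical helper; $margin is an assumed tuning knob, in bytes.
function wordsAroundWithMargin(string $s, string $find, int $before, int $after, int $margin = 100)
{
    $pattern = '/' . preg_quote($find, '/') . '\s*(?:\S+\s+){0,' . max(0, $after) . '}/';
    if (!preg_match($pattern, $s, $matches, PREG_OFFSET_CAPTURE)) {
        return false;
    }
    $offset = $matches[0][1];             // byte offset of the match
    $start  = max(0, $offset - $margin);  // look back at most $margin bytes

    // Reverse only the slice just before the match, not the whole prefix.
    // No /u flag here: the byte-reversed slice is not valid UTF-8, but the
    // pattern only cuts at ASCII whitespace and the result is reversed back.
    $slice = strrev(substr($s, $start, $offset - $start));
    preg_match('/\s*(?:\S+\s+){0,' . max(0, $before) . '}/', $slice, $beforeMatch);

    return strrev($beforeMatch[0]) . $matches[0][0];
}
```

If the margin is too small to hold the requested words, fewer words come back. A word cut by the margin is dropped rather than returned in part, because every unit in the reversed pattern must end in whitespace. For the same reason, the very first word of the string is never included, which is a property of the pattern above.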

All that being said, a solution that does not use regular expressions may work better.

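Here is an internal PHP function (written in C) that does what you want. It is unlikely that you will be able to beat its performance with a user-land function.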

There should be no problem using this with UTF-8 strings, since ' ', '\n' and '\r' (and in general all the ASCII characters) cannot appear as part of another character sequence. So if you pass valid UTF-8 data to both parameters you should be fine. Reversing UTF-8 data as you would normally reverse single-byte encodings (with strrev) would indeed mean trouble, but this function doesn't do that.

PHP_FUNCTION(surrounding_text)
{
    struct circ_array {
        int *offsets;
        int cur;
        int size;
    } circ_array;
    long before;
    long after;
    char *haystack, *needle;
    int haystack_len, needle_len;
    int i, in_word = 0, in_match = 0;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ssll",
        &haystack, &haystack_len, &needle, &needle_len, &before, &after) 
        == FAILURE)
        return;

    if (needle_len == 0) {
        php_error_docref(NULL TSRMLS_CC, E_WARNING,
            "Cannot have empty needle");
        return;
    }

    if (before < 0 || after < 0) {
        php_error_docref(NULL TSRMLS_CC, E_WARNING,
            "Number of words after and before should be non-negative");
        return;
    }

    /* saves beginning of match and words before */
    circ_array.offsets = safe_emalloc(before + 1, sizeof *circ_array.offsets, 0);
    circ_array.cur = 0;
    circ_array.size = before + 1;

    for (i = 0; i < haystack_len; i++) {
        if (haystack[i] == needle[in_match]) {
            in_match++;
            if (!in_word) {
                in_word = 1;
                circ_array.offsets[circ_array.cur % circ_array.size] = i;
                circ_array.cur++;
            }
            if (in_match == needle_len)
                break; /* found */
        } else {
            int is_sep = haystack[i] == ' ' || haystack[i] == '\n'
                || haystack[i] == '\r';

            if (in_match)
                in_match = 0;

            if (is_sep) {
                if (in_word)
                    in_word = 0;
            } else { /* not a separator */
                if (!in_word) {
                    in_word = 1;
                    circ_array.offsets[circ_array.cur % circ_array.size] = i;
                    circ_array.cur++;
                }
            }
        }
    }

    if (in_match != needle_len) {
        efree(circ_array.offsets);
        RETURN_FALSE;
    }


    /* find words after; in_word is 1 */
    for (i++; i < haystack_len; i++) {
        int is_sep = haystack[i] == ' ' || haystack[i] == '\n'
            || haystack[i] == '\r';
        if (is_sep) {
            if (in_word) {
                if (after == 0)
                    break;
                after--;
                in_word = 0;
            }
        } else {
            if (!in_word)
                in_word = 1;
        }
    }

    {
        char *result;
        int start, result_len;
        if (circ_array.cur < circ_array.size)
            start = circ_array.offsets[0];
        else
            start = circ_array.offsets[circ_array.cur % circ_array.size];

        result_len = i - start;
        result = emalloc(result_len + 1);
        memcpy(result, &haystack[start], result_len);
        result[result_len] = '\0';

        efree(circ_array.offsets);
        RETURN_STRINGL(result, result_len, 0);
    }

}

From my tests, the C function is about 4 times faster than Atputah's version (and it does not have the strrev problem).

This works fine:

(?:[^\s\r\n]*[\s\r\n]+){0,8}(?:[^\s\r\n]*)consectetur(?:[^\s\r\n]*)(?:[\s\r\n]+[^\s\r\n]*){0,8}

Gives:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras auctor, felis non vehicula suscipit,

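However, the performance of this regular expression is absolutely terrible. I really don't know how to make it more efficient, short of doing it without regular expressions.

The reason for the "absolutely terrible" performance is that the engine tries to start a match at every character, then advances dozens of characters before finding out that it cannot find the string you're looking for after all, and throws everything away.

The problem with this regex is that it makes the regex engine backtrack catastrophically. The number of attempts grows exponentially with the length of the string, and that is no good. You might want to look into atomic grouping to improve performance.

Alternatively, you could find the first occurrence of the given word and walk outwards from it, word by word, up to the desired length. Pseudo-ish code: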
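As an illustration of the atomic-grouping suggestion, here is one way the pattern could be written with possessive quantifiers (PCRE shorthand for atomic groups). This is an unbenchmarked sketch with a made-up function name, wordsAroundPossessive. It also adds a negative lookahead, which is not in the patterns above: without it, the possessive "words before" part would swallow the target word and could never give it back.

```php
<?php
// Hypothetical helper: possessive quantifiers keep the engine from
// backtracking into the "words before" and "words after" groups.
function wordsAroundPossessive(string $string, string $find, int $before, int $after)
{
    $q = preg_quote($find, '/');
    $regex = '/(?:(?!\S*' . $q . ')\S++\s++){0,' . max(0, $before) . '}+' // words not containing $find
           . '\S*?' . $q . '\S*+'                                         // the word containing $find
           . '(?:\s++\S++){0,' . max(0, $after) . '}+/u';                 // words after it
    return preg_match($regex, $string, $m) ? $m[0] : false;
}
```

Each failed attempt at a given start position now costs at most $before words of scanning, because the engine cannot retry the same position with fewer words.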

$pos = strpos($string, $find);
$result = $find;

$count = 0;
foreach $word before $pos {    // walking backwards from $pos
    $result = $word . $result;
    $count++;
    if ($count >= $target)
        break;
}

$count = 0;                    // restart the count for the words after
foreach $word after $pos {
    $result .= $word;
    $count++;
    if ($count >= $target)
        break;
}

Of course, finding the words before and after, and handling partial strings, can get really messy.
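The pseudo-code above could be fleshed out along these lines. It is a sketch under stated assumptions: the name wordsAroundByScan is hypothetical, and only space, tab, CR and LF count as separators. Scanning byte by byte is safe for UTF-8 here because those ASCII bytes never occur inside a multi-byte sequence.

```php
<?php
// Hypothetical helper: locate $find with strpos, then walk outwards word by word.
function wordsAroundByScan(string $string, string $find, int $before, int $after)
{
    if ($find === '' || ($pos = strpos($string, $find)) === false) {
        return false;
    }
    $ws  = " \t\r\n";
    $len = strlen($string);

    // Extend left to the start of the word that contains the match.
    $start = $pos;
    while ($start > 0 && strpos($ws, $string[$start - 1]) === false) $start--;
    // Then step back over up to $before (whitespace, word) pairs.
    for ($n = 0; $n < $before && $start > 0; $n++) {
        $i = $start;
        while ($i > 0 && strpos($ws, $string[$i - 1]) !== false) $i--; // whitespace
        if ($i === 0) break;                                           // nothing but whitespace left
        while ($i > 0 && strpos($ws, $string[$i - 1]) === false) $i--; // word
        $start = $i;
    }

    // Extend right to the end of the word that contains the match.
    $end = $pos + strlen($find);
    while ($end < $len && strpos($ws, $string[$end]) === false) $end++;
    // Then step forward over up to $after (whitespace, word) pairs.
    for ($n = 0; $n < $after && $end < $len; $n++) {
        $i = $end;
        while ($i < $len && strpos($ws, $string[$i]) !== false) $i++;  // whitespace
        if ($i === $len) break;
        while ($i < $len && strpos($ws, $string[$i]) === false) $i++;  // word
        $end = $i;
    }

    return substr($string, $start, $end - $start);
}
```

Nothing is copied or split until the final substr, so the cost is proportional to the size of the returned window plus the strpos search.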




