Split text into words problem PHP, complicated problem
  • Time: 2009-10-21 12:56:09

I am trying to split a text into words:

$delimiterList = array(" ", ".", "-", ",", ";", "_", ":",
           "!", "?", "/", "(", ")", "[", "]", "{", "}", "<", ">", "\r", "\n",
           '"');
$words = mb_split($delimiterList, $string);

I am stuck in some cases where I have to deal with numbers.

E.g. if I have the text "Look at this.My score is 3.14, and I am happy about it.", the array is now:

[0]=>Look,
[1]=>at,
[2]=>this,
[3]=>My,
[4]=>score,
[5]=>is,
[6]=>3,
[7]=>14,
[8]=>and, ....

So the 3.14 is also split into 3 and 14, which should not happen in my case. I mean a point should separate two strings, but not two numbers. It should look like this:

[0]=>Look,
[1]=>at,
[2]=>this,
[3]=>My,
[4]=>score,
[5]=>is,
[6]=>3.14,
[7]=>and, ....

But I have no idea how to avoid this situation!

Does anyone have an idea how to solve this problem?

Thanx, Granit

Best answer

Or use:

<?php
$str = "Look at this.My score is 3.14, and I am happy about it.";

// alternative to handle Marko's example (updated)
// /([\s_;?!\/\(\)\[\]{}<>\r\n"]|\.$|(?<=\D)[:,.\-]|[:,.\-](?=\D))/

var_dump(preg_split('/([\s\-_,:;?!\/\(\)\[\]{}<>\r\n"]|(?<!\d)\.(?!\d))/',
                    $str, -1, PREG_SPLIT_NO_EMPTY));

array(13) {
  [0]=>
  string(4) "Look"
  [1]=>
  string(2) "at"
  [2]=>
  string(4) "this"
  [3]=>
  string(2) "My"
  [4]=>
  string(5) "score"
  [5]=>
  string(2) "is"
  [6]=>
  string(4) "3.14"
  [7]=>
  string(3) "and"
  [8]=>
  string(1) "I"
  [9]=>
  string(2) "am"
  [10]=>
  string(5) "happy"
  [11]=>
  string(5) "about"
  [12]=>
  string(2) "it"
}
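
The pattern above can be wrapped in a small helper to make its behaviour easier to probe. This is only a sketch: the function name splitWords is hypothetical, and the second call is an input of my own, added to show one limitation. Because the dot is only treated as a delimiter when it has no digit on either side, a sentence-ending dot right after a number stays attached to that number.

```php
<?php
// Hypothetical helper; the pattern is the one used in the answer above.
function splitWords(string $text): array
{
    // Split on whitespace and punctuation, and on "." only when it has
    // no digit directly before it and no digit directly after it.
    $pattern = '/([\s\-_,:;?!\/\(\)\[\]{}<>\r\n"]|(?<!\d)\.(?!\d))/';
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

$words = splitWords("Look at this.My score is 3.14, and I am happy about it.");
echo implode('|', $words), "\n";
// prints: Look|at|this|My|score|is|3.14|and|I|am|happy|about|it

// Limitation: the dot after "3" is preceded by a digit, so it is not a delimiter.
$edge = splitWords("The score is 3. Nice");
echo implode('|', $edge), "\n";
// prints: The|score|is|3.|Nice
```

If trailing dots like the one in "3." matter, trimming each token with rtrim($word, '.') afterwards is one possible workaround.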
Answers

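My first idea was preg_match_all('/\w+/', $string, $matches);, but that gives a result similar to the one you already have. The problem is that numbers separated by a dot are very ambiguous. The dot can be either a decimal point or the end of a sentence, so we need a way to change the string so that the double meaning goes away.

For example, in this text there are several parts that we want to keep as one word: "My score is 3.14, and I am happy about it. It's not 334,3 and today is 2009-12-11:12:13".

We start by building a search-and-replace dictionary that encodes the exceptions as something that won't be split: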
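
To see why this first idea is not enough, here is a minimal sketch run on a shortened version of the sentence from the question. Since \w only matches letters, digits and underscores, the decimal point ends the match just like any other punctuation.

```php
<?php
// Sample sentence based on the one in the question.
$string = "My score is 3.14, and I am happy about it.";

// \w+ matches runs of letters, digits and underscores, so "." always ends a word.
preg_match_all('/\w+/', $string, $matches);

echo implode('|', $matches[0]), "\n";
// prints: My|score|is|3|14|and|I|am|happy|about|it
```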

$encode = array(
    '/(\d+?)\.(\d+?)/' => '\1DOT\2',
    '/(\d+?),(\d+?)/' => '\1COMMA\2',
    '/(\d+?)-(\d+?)-(\d+?) (\d+?):(\d+?):(\d+?)/' => '\1DASH\2DASH\3SPACE\4COLON\5COLON\6'
);

Next, we encode the exceptions:

foreach ($encode as $regex => $repl) {
    $string = preg_replace($regex, $repl, $string);
}

Then we split the string:

preg_match_all('/\w+/', $string, $matches);

And convert the encoded words back:

$decode = array(
    'search'  => array('DOT', 'COMMA', 'DASH', 'SPACE', 'COLON'),
    'replace' => array('.',   ',',     '-',    ' ',     ':'    )
);
foreach ($matches as $k => $v) {
    $matches[$k] = str_replace($decode['search'], $decode['replace'], $v);
}
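
Put together, the encode, split and decode steps might look like the sketch below. The function name splitKeepingNumbers and the sample sentence (including its date) are my own assumptions for illustration; the regexes and placeholder words are the ones from the steps above.

```php
<?php
// Hypothetical helper that chains the encode, split and decode steps.
function splitKeepingNumbers(string $string): array
{
    // 1. Encode the exceptions as placeholders made of word characters.
    $encode = array(
        '/(\d+?)\.(\d+?)/' => '\1DOT\2',
        '/(\d+?),(\d+?)/' => '\1COMMA\2',
        '/(\d+?)-(\d+?)-(\d+?) (\d+?):(\d+?):(\d+?)/' => '\1DASH\2DASH\3SPACE\4COLON\5COLON\6',
    );
    foreach ($encode as $regex => $repl) {
        $string = preg_replace($regex, $repl, $string);
    }

    // 2. Split: each encoded number or date is now a single \w+ run.
    preg_match_all('/\w+/', $string, $matches);

    // 3. Decode the placeholders back to the original punctuation.
    return str_replace(
        array('DOT', 'COMMA', 'DASH', 'SPACE', 'COLON'),
        array('.', ',', '-', ' ', ':'),
        $matches[0]
    );
}

$words = splitKeepingNumbers("My score is 3.14, not 334,3. Today is 2009-12-12 11:12:13.");
echo implode('|', $words), "\n";
// prints: My|score|is|3.14|not|334,3|Today|is|2009-12-12 11:12:13
```

One caveat of this sketch: input that already contains the placeholder words in upper case (for example "DOTS") would be altered by the decode step, so less common placeholder strings may be a safer choice.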

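You can make the regexes for the exceptions as simple or as complex as you like, but some ambiguity will always get through, such as where one sentence ends and the next begins at a number: "The count is only 3.3, nothing other than 3.5."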

Use ". " instead of "." in $delimiterList.
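
A rough sketch of what this suggestion amounts to is shown below. It makes one assumption beyond the original code: instead of passing the array to mb_split, the delimiters are quoted with preg_quote and joined into a single pattern for preg_split. The sample sentence is also my own.

```php
<?php
// Delimiter list from the question, with ". " (dot plus space) instead of ".".
$delimiterList = array(" ", ". ", "-", ",", ";", "_", ":",
    "!", "?", "/", "(", ")", "[", "]", "{", "}", "<", ">", "\r", "\n", '"');

// Quote each delimiter so it is matched literally, then join into one alternation.
$quoted = array_map(function ($d) {
    return preg_quote($d, '/');
}, $delimiterList);
$pattern = '/(?:' . implode('|', $quoted) . ')/';

$words = preg_split($pattern, "Look at this. My score is 3.14, and I am happy.",
                    -1, PREG_SPLIT_NO_EMPTY);
echo implode('|', $words), "\n";
// prints: Look|at|this|My|score|is|3.14|and|I|am|happy.
```

The decimal point survives because no space follows it, but so does the final dot in "happy.", and a sentence break written without a space (as in "this.My") would not be split either.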




