English 中文(简体)
Java 一次性地(或以最有效率的方式)重划多个不同的子体
原标题:Java Replacing multiple different substring in a string at once (or in the most efficient way)
  • 时间:2009-08-25 07:52:34
  •  标签:

I need to replace many different sub-string in a string in the most efficient way. is there another way other then the brute force way of replacing each field using string.replace ?

最佳回答

如果你在很长的时间里运作,或者你正在许多扼子上工作,那么使用 j就值得。 牵线搭桥(这需要时间来编集,因此,如果你的投入很小,或经常改变搜索模式,那么它就没有效率。

下面是一个充分的例子,它基于从地图上取得的口号清单。 (Uses StringUtils from Papa Commons Lang)。

Map<String,String> tokens = new HashMap<String,String>();
tokens.put("cat", "Garfield");
tokens.put("beverage", "coffee");

String template = "%cat% really needs some %beverage%.";

// Create pattern of the format "%(cat|beverage)%"
String patternString = "%(" + StringUtils.join(tokens.keySet(), "|") + ")%";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(template);

StringBuffer sb = new StringBuffer();
while(matcher.find()) {
    matcher.appendReplacement(sb, tokens.get(matcher.group(1)));
}
matcher.appendTail(sb);

System.out.println(sb.toString());

定期表述汇编后,一般都非常快速地扫描投入(尽管你的定期表达方式复杂或涉及背轨,但为了确认这一点,你仍然需要制定基准!)

问题回答

Algorithm

取代配对机的最有效途径之一是使用Aho-Corasick_utila>,有表演者(profant=thra=thad)。 http://www.magicwerk.org/res/collections/javadoc/org/magicwerk/brownies/ Collections/BigList.html” rel=“noreferer”> 实施情况。

Simple Code

简单的解决办法杠杆是:Apache sStringUtils.replace Each 如下:

  private String testStringUtils(
    final String text, final Map<String, String> definitions ) {
    final String[] keys = keys( definitions );
    final String[] values = values( definitions );

    return StringUtils.replaceEach( text, keys, values );
  }

这在大案文上放慢。

Fast Code

https://github.com/robert-bor/aho-corasick”rel=“noreferer”> “Aho-Corasick”算法的Bor s implementation采用相同方法签名的façade,使执行更为复杂:

  private String testBorAhoCorasick(
    final String text, final Map<String, String> definitions ) {
    // Create a buffer sufficiently large that re-allocations are minimized.
    final StringBuilder sb = new StringBuilder( text.length() << 1 );

    final TrieBuilder builder = Trie.builder();
    builder.onlyWholeWords();
    builder.removeOverlaps();

    final String[] keys = keys( definitions );

    for( final String key : keys ) {
      builder.addKeyword( key );
    }

    final Trie trie = builder.build();
    final Collection<Emit> emits = trie.parseText( text );

    int prevIndex = 0;

    for( final Emit emit : emits ) {
      final int matchIndex = emit.getStart();

      sb.append( text.substring( prevIndex, matchIndex ) );
      sb.append( definitions.get( emit.getKeyword() ) );
      prevIndex = emit.getEnd() + 1;
    }

    // Add the remainder of the string (contains no more matches).
    sb.append( text.substring( prevIndex ) );

    return sb.toString();
  }

Benchmarks

关于基准,利用建立了缓冲。 如下:

  private final static int TEXT_SIZE = 1000;
  private final static int MATCHES_DIVISOR = 10;

  private final static StringBuilder SOURCE
    = new StringBuilder( randomNumeric( TEXT_SIZE ) );

在<代码>MATCHES_DIVISOR>上规定可注入的变数:

  private void injectVariables( final Map<String, String> definitions ) {
    for( int i = (SOURCE.length() / MATCHES_DIVISOR) + 1; i > 0; i-- ) {
      final int r = current().nextInt( 1, SOURCE.length() );
      SOURCE.insert( r, randomKey( definitions ) );
    }
  }

基准代码本身(JMH似乎超高技能:

long duration = System.nanoTime();
final String result = testBorAhoCorasick( text, definitions );
duration = System.nanoTime() - duration;
System.out.println( elapsed( duration ) );

1,000,000 : 1,000

一个简单的缩微标,有1 000 000个特性和1 000个随机布置的阵列取代。

  • testStringUtils: 25 seconds, 25533 millis
  • testBorAhoCorasick: 0 seconds, 68 millis

无争辩。

10,000 : 1,000

采用10 000个特性和1 000个对应体取代:

  • testStringUtils: 1 seconds, 1402 millis
  • testBorAhoCorasick: 0 seconds, 37 millis

鸿沟已经缩小。

1,000 : 10

采用1 000个特性和10个对应体取代:

  • testStringUtils: 0 seconds, 7 millis
  • testBorAhoCorasick: 0 seconds, 19 millis

在短训中,设立Aho-Corasick eclips的间接费用通过<条码>StringUtils.replaceEach采用了简便武力办法。

一种基于案文长度的混合办法是有可能的,以取得最佳的两种执行。

Implementations

考虑比较其他超过1个甲基溴版本的执行,包括:

Papers

与算法有关的文件和资料:

这为我工作:

String result = input.replaceAll("string1|string2|string3","replacementString");

例:

String input = "applemangobananaarefruits";
String result = input.replaceAll("mango|are|ts","-");
System.out.println(result);

<Output: apple-banana-frui-

如果你要改变一段时期,那么使用“强硬”手段(但衡量你的表现)通常会更为有效:

String str = "The rain in Spain falls mainly on the plain";
StringBuilder sb = new StringBuilder(str);
// do your replacing in sb - although you ll find this trickier than simply using String
String newStr = sb.toString();

每当你在“强硬”上取而代之时,就会产生新的强硬物,因为强硬是不可改变的。 强硬分子是 mu的,也就是说,可以像你想要的那样加以改变。

<代码>StringBuilder将更有效地替换,因为其特性阵列缓冲可被指定为所需时间。

当然,真正的问题是,这是否是一种选择太远? 我的第一个问题是,你是否衡量了这一问题,并断定这是一个问题。

检查:

String.format(str,STR[])

例如:

String.format( "Put your %s where your %s is", "money", "mouth" );

The following is based on oddT ∗∗ E/CN.6/2009/1。 这种解决办法存在这样的问题,即如果替代物含有经常表达中具有特别意义的特性,那么你就能够取得意外结果。 我还希望能够选择性地进行对案件不敏感的查询。 这里我要谈谈:

/**
 * Performs simultaneous search/replace of multiple strings. Case Sensitive!
 */
public String replaceMultiple(String target, Map<String, String> replacements) {
  return replaceMultiple(target, replacements, true);
}

/**
 * Performs simultaneous search/replace of multiple strings.
 * 
 * @param target        string to perform replacements on.
 * @param replacements  map where key represents value to search for, and value represents replacem
 * @param caseSensitive whether or not the search is case-sensitive.
 * @return replaced string
 */
public String replaceMultiple(String target, Map<String, String> replacements, boolean caseSensitive) {
  if(target == null || "".equals(target) || replacements == null || replacements.size() == 0)
    return target;

  //if we are doing case-insensitive replacements, we need to make the map case-insensitive--make a new map with all-lower-case keys
  if(!caseSensitive) {
    Map<String, String> altReplacements = new HashMap<String, String>(replacements.size());
    for(String key : replacements.keySet())
      altReplacements.put(key.toLowerCase(), replacements.get(key));

    replacements = altReplacements;
  }

  StringBuilder patternString = new StringBuilder();
  if(!caseSensitive)
    patternString.append("(?i)");

  patternString.append( ( );
  boolean first = true;
  for(String key : replacements.keySet()) {
    if(first)
      first = false;
    else
      patternString.append( | );

    patternString.append(Pattern.quote(key));
  }
  patternString.append( ) );

  Pattern pattern = Pattern.compile(patternString.toString());
  Matcher matcher = pattern.matcher(target);

  StringBuffer res = new StringBuffer();
  while(matcher.find()) {
    String match = matcher.group(1);
    if(!caseSensitive)
      match = match.toLowerCase();
    matcher.appendReplacement(res, replacements.get(match));
  }
  matcher.appendTail(res);

  return res.toString();
}

我的单位测试案例如下:

@Test
public void replaceMultipleTest() {
  assertNull(ExtStringUtils.replaceMultiple(null, null));
  assertNull(ExtStringUtils.replaceMultiple(null, Collections.<String, String>emptyMap()));
  assertEquals("", ExtStringUtils.replaceMultiple("", null));
  assertEquals("", ExtStringUtils.replaceMultiple("", Collections.<String, String>emptyMap()));

  assertEquals("folks, we are not sane anymore. with me, i promise you, we will burn in flames", ExtStringUtils.replaceMultiple("folks, we are not winning anymore. with me, i promise you, we will win big league", makeMap("win big league", "burn in flames", "winning", "sane")));

  assertEquals("bcaacbbcaacb", ExtStringUtils.replaceMultiple("abccbaabccba", makeMap("a", "b", "b", "c", "c", "a")));
  assertEquals("bcaCBAbcCCBb", ExtStringUtils.replaceMultiple("abcCBAabCCBa", makeMap("a", "b", "b", "c", "c", "a")));
  assertEquals("bcaacbbcaacb", ExtStringUtils.replaceMultiple("abcCBAabCCBa", makeMap("a", "b", "b", "c", "c", "a"), false));

  assertEquals("c colon  backslash temp backslash  star  dot  star ", ExtStringUtils.replaceMultiple("c:\temp\*.*", makeMap(".", " dot ", ":", " colon ", "\", " backslash ", "*", " star "), false));
}

private Map<String, String> makeMap(String ... vals) {
  Map<String, String> map = new HashMap<String, String>(vals.length / 2);
  for(int i = 1; i < vals.length; i+= 2)
    map.put(vals[i-1], vals[i]);
  return map;
}
public String replace(String input, Map<String, String> pairs) {
  // Reverse lexic-order of keys is good enough for most cases,
  // as it puts longer words before their prefixes ("tool" before "too").
  // However, there are corner cases, which this algorithm doesn t handle
  // no matter what order of keys you choose, eg. it fails to match "edit"
  // before "bed" in "..bedit.." because "bed" appears first in the input,
  // but "edit" may be the desired longer match. Depends which you prefer.
  final Map<String, String> sorted = 
      new TreeMap<String, String>(Collections.reverseOrder());
  sorted.putAll(pairs);
  final String[] keys = sorted.keySet().toArray(new String[sorted.size()]);
  final String[] vals = sorted.values().toArray(new String[sorted.size()]);
  final int lo = 0, hi = input.length();
  final StringBuilder result = new StringBuilder();
  int s = lo;
  for (int i = s; i < hi; i++) {
    for (int p = 0; p < keys.length; p++) {
      if (input.regionMatches(i, keys[p], 0, keys[p].length())) {
        /* TODO: check for "edit", if this is "bed" in "..bedit.." case,
         * i.e. look ahead for all prioritized/longer keys starting within
         * the current match region; iff found, then ignore match ("bed")
         * and continue search (find "edit" later), else handle match. */
        // if (better-match-overlaps-right-ahead)
        //   continue;
        result.append(input, s, i).append(vals[p]);
        i += keys[p].length();
        s = i--;
      }
    }
  }
  if (s == lo) // no matches? no changes!
    return input;
  return result.append(input, s, hi).toString();
}

Summary: Single class implementation of Dave s answer, to automatically choose the most efficient of the two algorithms.

This is a full, single class implementation based on the above excellent answer from Dave Jarvis. The class automatically chooses between the two different supplied algorithms, for maximum efficiency. (This answer is for people who would just like to quickly copy and paste.)

ReplaceStrings class:

package somepackage

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;
import org.ahocorasick.trie.Trie.TrieBuilder;
import org.apache.commons.lang3.StringUtils;

/**
 * ReplaceStrings, This class is used to replace multiple strings in a section of text, with high
 * time efficiency. The chosen algorithms were adapted from: https://stackoverflow.com/a/40836618
 */
public final class ReplaceStrings {

    /**
     * replace, This replaces multiple strings in a section of text, according to the supplied
     * search and replace definitions. For maximum efficiency, this will automatically choose
     * between two possible replacement algorithms.
     *
     * Performance note: If it is known in advance that the source text is long, then this method
     * signature has a very small additional performance advantage over the other method signature.
     * (Although either method signature will still choose the best algorithm.)
     */
    public static String replace(
        final String sourceText, final Map<String, String> searchReplaceDefinitions) {
        final boolean useLongAlgorithm
            = (sourceText.length() > 1000 || searchReplaceDefinitions.size() > 25);
        if (useLongAlgorithm) {
            // No parameter adaptations are needed for the long algorithm.
            return replaceUsing_AhoCorasickAlgorithm(sourceText, searchReplaceDefinitions);
        } else {
            // Create search and replace arrays, which are needed by the short algorithm.
            final ArrayList<String> searchList = new ArrayList<>();
            final ArrayList<String> replaceList = new ArrayList<>();
            final Set<Map.Entry<String, String>> allEntries = searchReplaceDefinitions.entrySet();
            for (Map.Entry<String, String> entry : allEntries) {
                searchList.add(entry.getKey());
                replaceList.add(entry.getValue());
            }
            return replaceUsing_StringUtilsAlgorithm(sourceText, searchList, replaceList);
        }
    }

    /**
     * replace, This replaces multiple strings in a section of text, according to the supplied
     * search strings and replacement strings. For maximum efficiency, this will automatically
     * choose between two possible replacement algorithms.
     *
     * Performance note: If it is known in advance that the source text is short, then this method
     * signature has a very small additional performance advantage over the other method signature.
     * (Although either method signature will still choose the best algorithm.)
     */
    public static String replace(final String sourceText,
        final ArrayList<String> searchList, final ArrayList<String> replacementList) {
        if (searchList.size() != replacementList.size()) {
            throw new RuntimeException("ReplaceStrings.replace(), "
                + "The search list and the replacement list must be the same size.");
        }
        final boolean useLongAlgorithm = (sourceText.length() > 1000 || searchList.size() > 25);
        if (useLongAlgorithm) {
            // Create a definitions map, which is needed by the long algorithm.
            HashMap<String, String> definitions = new HashMap<>();
            final int searchListLength = searchList.size();
            for (int index = 0; index < searchListLength; ++index) {
                definitions.put(searchList.get(index), replacementList.get(index));
            }
            return replaceUsing_AhoCorasickAlgorithm(sourceText, definitions);
        } else {
            // No parameter adaptations are needed for the short algorithm.
            return replaceUsing_StringUtilsAlgorithm(sourceText, searchList, replacementList);
        }
    }

    /**
     * replaceUsing_StringUtilsAlgorithm, This is a string replacement algorithm that is most
     * efficient for sourceText under 1000 characters, and less than 25 search strings.
     */
    private static String replaceUsing_StringUtilsAlgorithm(final String sourceText,
        final ArrayList<String> searchList, final ArrayList<String> replacementList) {
        final String[] searchArray = searchList.toArray(new String[]{});
        final String[] replacementArray = replacementList.toArray(new String[]{});
        return StringUtils.replaceEach(sourceText, searchArray, replacementArray);
    }

    /**
     * replaceUsing_AhoCorasickAlgorithm, This is a string replacement algorithm that is most
     * efficient for sourceText over 1000 characters, or large lists of search strings.
     */
    private static String replaceUsing_AhoCorasickAlgorithm(final String sourceText,
        final Map<String, String> searchReplaceDefinitions) {
        // Create a buffer sufficiently large that re-allocations are minimized.
        final StringBuilder sb = new StringBuilder(sourceText.length() << 1);
        final TrieBuilder builder = Trie.builder();
        builder.onlyWholeWords();
        builder.ignoreOverlaps();
        for (final String key : searchReplaceDefinitions.keySet()) {
            builder.addKeyword(key);
        }
        final Trie trie = builder.build();
        final Collection<Emit> emits = trie.parseText(sourceText);
        int prevIndex = 0;
        for (final Emit emit : emits) {
            final int matchIndex = emit.getStart();

            sb.append(sourceText.substring(prevIndex, matchIndex));
            sb.append(searchReplaceDefinitions.get(emit.getKeyword()));
            prevIndex = emit.getEnd() + 1;
        }
        // Add the remainder of the string (contains no more matches).
        sb.append(sourceText.substring(prevIndex));
        return sb.toString();
    }

    /**
     * main, This contains some test and example code.
     */
    public static void main(String[] args) {
        String shortSource = "The quick brown fox jumped over something. ";
        StringBuilder longSourceBuilder = new StringBuilder();
        for (int i = 0; i < 50; ++i) {
            longSourceBuilder.append(shortSource);
        }
        String longSource = longSourceBuilder.toString();
        HashMap<String, String> searchReplaceMap = new HashMap<>();
        ArrayList<String> searchList = new ArrayList<>();
        ArrayList<String> replaceList = new ArrayList<>();
        searchReplaceMap.put("fox", "grasshopper");
        searchReplaceMap.put("something", "the mountain");
        searchList.add("fox");
        replaceList.add("grasshopper");
        searchList.add("something");
        replaceList.add("the mountain");
        String shortResultUsingArrays = replace(shortSource, searchList, replaceList);
        String shortResultUsingMap = replace(shortSource, searchReplaceMap);
        String longResultUsingArrays = replace(longSource, searchList, replaceList);
        String longResultUsingMap = replace(longSource, searchReplaceMap);
        System.out.println(shortResultUsingArrays);
        System.out.println("----------------------------------------------");
        System.out.println(shortResultUsingMap);
        System.out.println("----------------------------------------------");
        System.out.println(longResultUsingArrays);
        System.out.println("----------------------------------------------");
        System.out.println(longResultUsingMap);
        System.out.println("----------------------------------------------");
    }
}

Needed Maven dependencies:

(如果需要,请上页)。

    <!-- Apache Commons utilities. Super commonly used utilities.
    https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.10</version>
    </dependency>

    <!-- ahocorasick, An algorithm used for efficient searching and 
    replacing of multiple strings.
    https://mvnrepository.com/artifact/org.ahocorasick/ahocorasick -->
    <dependency>
        <groupId>org.ahocorasick</groupId>
        <artifactId>ahocorasick</artifactId>
        <version>0.4.0</version>
    </dependency>




相关问题
热门标签