Implement JVM String

Question

The String class has some methods whose implementation I cannot understand. replace is one of them:

    public String replace(CharSequence target, CharSequence replacement) {
        return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
                this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
    }

Does this have any significant advantage over a simpler and faster method such as this one?

    public static String replace(String string, String searchFor, String replaceWith) {
        StringBuilder result = new StringBuilder();
        int index = 0;
        int beginIndex = 0;
        while ((index = string.indexOf(searchFor, index)) != -1) {
            result.append(string.substring(beginIndex, index) + replaceWith);
            index += searchFor.length();
            beginIndex = index;
        }
        result.append(string.substring(beginIndex, string.length()));
        return result.toString();
    }

Statistics with Java 7:
1,000,000 iterations
replace "b" with "x" in "abc"
result: "axc"

Times:
string.replace: 485 ms
string.replaceAll: 490 ms
optimized replace: 180 ms
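
For anyone who wants to repeat the measurement, here is a minimal sketch of a timing loop. The class name ReplaceBenchmark and the helper timeMillis are made up for illustration, the lambdas assume Java 8 or later (the numbers above were taken on Java 7), and the empty-search guard is an addition to the method shown earlier. A loop like this ignores JIT warm-up, so the output is only a rough indication and depends heavily on the JDK version.

```java
import java.util.function.UnaryOperator;

class ReplaceBenchmark {
    // The indexOf-based replacement from the question, with a guard added
    // because an empty search string would otherwise never advance.
    static String replace(String string, String searchFor, String replaceWith) {
        if (searchFor.isEmpty()) {
            throw new IllegalArgumentException("searchFor must not be empty");
        }
        StringBuilder result = new StringBuilder();
        int index = 0;
        int beginIndex = 0;
        while ((index = string.indexOf(searchFor, index)) != -1) {
            result.append(string, beginIndex, index).append(replaceWith);
            index += searchFor.length();
            beginIndex = index;
        }
        result.append(string, beginIndex, string.length());
        return result.toString();
    }

    // Hypothetical helper: runs op repeatedly and returns elapsed milliseconds.
    static long timeMillis(int iterations, UnaryOperator<String> op, String input) {
        long start = System.nanoTime();
        int sink = 0;
        for (int i = 0; i < iterations; i++) {
            sink += op.apply(input).length(); // keep the result "used"
        }
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        if (sink < 0) {
            System.out.println(sink); // practically unreachable; discourages dead-code elimination
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println("string.replace:    "
                + timeMillis(n, s -> s.replace("b", "x"), "abc") + " ms");
        System.out.println("string.replaceAll: "
                + timeMillis(n, s -> s.replaceAll("b", "x"), "abc") + " ms");
        System.out.println("optimized replace: "
                + timeMillis(n, s -> replace(s, "b", "x"), "abc") + " ms");
    }
}
```

For numbers you intend to rely on, a proper harness such as JMH is a better choice than a hand-written loop.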

By contrast, code such as the Java 7 split method is heavily optimized to avoid pattern compilation and regular expression processing where possible:

    public String[] split(String regex, int limit) {
        /* fastpath if the regex is a
           (1) one-char String and this character is not one of the
               RegEx meta characters ".$|()[{^?*+\\", or
           (2) two-char String and the first char is the backslash and
               the second is not the ascii digit or ascii letter.
         */
        char ch = 0;
        if (((regex.value.length == 1 &&
              ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
             (regex.length() == 2 &&
              regex.charAt(0) == '\\' &&
              (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
              ((ch-'a')|('z'-ch)) < 0 &&
              ((ch-'A')|('Z'-ch)) < 0)) &&
            (ch < Character.MIN_HIGH_SURROGATE ||
             ch > Character.MAX_LOW_SURROGATE))
        {
            int off = 0;
            int next = 0;
            boolean limited = limit > 0;
            ArrayList<String> list = new ArrayList<>();
            while ((next = indexOf(ch, off)) != -1) {
                if (!limited || list.size() < limit - 1) {
                    list.add(substring(off, next));
                    off = next + 1;
                } else {    // last one
                    //assert (list.size() == limit - 1);
                    list.add(substring(off, value.length));
                    off = value.length;
                    break;
                }
            }
            // If no match was found, return this
            if (off == 0)
                return new String[]{this};

            // Add remaining segment
            if (!limited || list.size() < limit)
                list.add(substring(off, value.length));

            // Construct result
            int resultSize = list.size();
            if (limit == 0)
                while (resultSize > 0 && list.get(resultSize - 1).length() == 0)
                    resultSize--;
            String[] result = new String[resultSize];
            return list.subList(0, resultSize).toArray(result);
        }
        return Pattern.compile(regex).split(this, limit);
    }

Following the logic of the replaceAll method:

    public String replaceAll(String regex, String replacement) {
        return Pattern.compile(regex).matcher(this).replaceAll(replacement);
    }

the split implementation would simply be:

    public String[] split(String regex, int limit) {
        return Pattern.compile(regex).split(this, limit);
    }

The performance loss would be close to the one measured for the replace methods. For some reason, Oracle provides a fast path for some methods but not for others.

java string methods jvm implementation

marcolopes Jun 09 '14 at 13:54

1 answer

Alex · Accepted Answer · 2014-06-09T17:11:15+0000

Are you sure that your proposed method really is faster than the regex-based one used by the String class, not only for your own test input but for every possible input a program could throw at it? It relies on String.indexOf to do the substring matching, which is itself a naive implementation that is prone to bad worst cases. It is possible that Pattern implements a more sophisticated matching algorithm, such as KMP, to avoid redundant comparisons.
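
To make the "bad worst cases" concrete, here is a small sketch of a textbook naive substring search that counts character comparisons. It is an illustration only and not the JDK source for String.indexOf; the class NaiveSearchCost and its methods are hypothetical names.

```java
import java.util.Arrays;

class NaiveSearchCost {
    // Textbook naive search: try every start position and compare
    // character by character. Returns the number of comparisons made
    // until the first match is found or the text is exhausted.
    static long countComparisons(String text, String pattern) {
        long comparisons = 0;
        for (int i = 0; i + pattern.length() <= text.length(); i++) {
            int j = 0;
            while (j < pattern.length()) {
                comparisons++;
                if (text.charAt(i + j) != pattern.charAt(j)) {
                    break;
                }
                j++;
            }
            if (j == pattern.length()) {
                break; // first match found, stop searching
            }
        }
        return comparisons;
    }

    // Builds a string of n copies of c (kept Java 7 compatible).
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        // Pathological input: every alignment matches 499 characters
        // before failing on the last one.
        String text = repeat('a', 1000);
        String pattern = repeat('a', 499) + "b";
        System.out.println(countComparisons(text, pattern)); // 501 * 500 = 250500
    }
}
```

With a text of length n and a pattern of length m, this kind of input costs about (n - m + 1) * m comparisons, whereas an algorithm such as KMP stays linear in n + m. Inputs like "abc" never get anywhere near that case, which is why a single micro-benchmark says little about the general picture.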

In general, the Java team takes the performance of the core libraries very seriously and maintains many internal tests using a wide range of real-world data. I have never encountered a situation where regular expression processing was the bottleneck. My standing advice is to start by writing the simplest possible code that works correctly, and not even to think about rewriting Java built-ins until profiling proves that they are a bottleneck and you have exhausted all other optimization options.

As for your last edit: first, I would not describe the split method as highly optimized. It handles one special case, which happens to be extremely common and is guaranteed not to suffer from the poor worst-case complexity described above for the naive string matching algorithm: splitting on a single-character literal token.
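
To see how narrow that special case is, the condition from the quoted split source can be restated as a standalone predicate. This is a sketch under the assumption that the Java 7 source quoted in the question is the reference; the class SplitFastPath and the method takesFastPath are hypothetical names and not part of the JDK.

```java
class SplitFastPath {
    // Returns true if the quoted Java 7 split would skip the regex engine
    // for this argument, restated from the condition in that source.
    static boolean takesFastPath(String regex) {
        char ch;
        if (regex.length() == 1) {
            // Case 1: a single character that is not a regex metacharacter.
            ch = regex.charAt(0);
            if (".$|()[{^?*+\\".indexOf(ch) != -1) {
                return false;
            }
        } else if (regex.length() == 2 && regex.charAt(0) == '\\') {
            // Case 2: a backslash followed by a character that is neither
            // an ASCII digit nor an ASCII letter (i.e. an escaped literal).
            ch = regex.charAt(1);
            boolean asciiAlnum = (ch >= '0' && ch <= '9')
                    || (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z');
            if (asciiAlnum) {
                return false;
            }
        } else {
            return false;
        }
        // Surrogate characters are always left to the regex engine.
        return ch < Character.MIN_HIGH_SURROGATE || ch > Character.MAX_LOW_SURROGATE;
    }

    public static void main(String[] args) {
        System.out.println(takesFastPath(","));    // true: plain literal
        System.out.println(takesFastPath("."));    // false: metacharacter
        System.out.println(takesFastPath("\\."));  // true: escaped literal
        System.out.println(takesFastPath("\\d"));  // false: character class
        System.out.println(takesFastPath(", "));   // false: two literals
    }
}
```

Anything beyond a single literal character, even a two-character delimiter, falls through to Pattern.compile.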

It is quite possible that the same special case could be optimized for replace and would provide some measurable improvement. But look at what it takes to achieve this simple optimization: about 50 lines of code. Those lines of code come at a cost, especially when they are part of what is probably the most widely used class in the Java library. The cost comes in many forms:

Resources: those are 50 lines of code that some developer has to spend time writing, testing, documenting, and maintaining for the lifetime of the Java language.
Risk: those are 50 opportunities for subtle bugs that slip past initial testing.
Complexity: those are 50 additional lines of code that any developer who wants to understand how the method works now has to take the time to read and understand.
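
For a sense of scale, here is a rough sketch of what a single-character fast path for replace might look like, written as a static helper outside String. It is not how the JDK implements replace; the class ReplaceFastPath is a hypothetical name, and excluding surrogate characters is an assumption borrowed from the split condition quoted in the question.

```java
class ReplaceFastPath {
    // Replaces every occurrence of target in s. A one-character,
    // non-surrogate target is handled with indexOf(char); everything
    // else is delegated to the built-in String.replace.
    static String replace(String s, CharSequence target, CharSequence replacement) {
        String t = target.toString();
        if (t.length() == 1 && !Character.isSurrogate(t.charAt(0))) {
            char ch = t.charAt(0);
            int next = s.indexOf(ch);
            if (next == -1) {
                return s; // nothing to replace
            }
            String r = replacement.toString();
            StringBuilder sb = new StringBuilder(s.length() + r.length());
            int off = 0;
            do {
                sb.append(s, off, next).append(r); // text before the match, then the replacement
                off = next + 1;
            } while ((next = s.indexOf(ch, off)) != -1);
            return sb.append(s, off, s.length()).toString(); // remaining tail
        }
        return s.replace(target, replacement);
    }

    public static void main(String[] args) {
        System.out.println(replace("abc", "b", "x"));      // axc
        System.out.println(replace("a.b.c", ".", "$1"));   // a$1b$1c
        System.out.println(replace("aXXbXX", "XX", "-"));  // a-b-
    }
}
```

Even this stripped-down version adds a second code path with its own edge cases (no match, surrogates, replacement text containing $ or \), which is exactly the kind of cost listed above.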

Now your question boils down to "why is this method optimized to handle a special case, but not that one?" or, more generally, "why was this feature not implemented?" No one except the original author can answer that definitively, but the answer almost always is that either there was not enough demand for the feature, or the benefit of having it was judged not to be worth the cost of adding it.
