The fastest way to compare strings (alphanumeric)

Question

The fastest way to compare strings (alphanumeric)

I have a performance issue related to string comparison (in Java).

I am working on a project that is supposed to sort a huge list (TableViewer in Eclipse). Anyway, I have identified a bottleneck for calling compareTo () for the string to be compared.

Is there a way to optimize string comparison performance? I searched and googled to no avail ...

Since the project is strictly limited to the Win32 environment, I thought it might be possible to use this ...

Any suggestions are welcome.

EDIT: I forgot to mention that I will need both a digital comparison and a literal string comparison.

EDIT2: The goal is to speed up the user interface, because it is unacceptable to wait a few seconds each time you click on the table title to sort. I am studying, perhaps, value caching to speed up the comparison. Since the strings are pretty static, I think that would be possible.

EDIT3: I know that many of you were concerned about the try () application - catch (). Actually, this is less of a concern, because even if I delete this code and execute only the catch block (one compareTo ()), it still runs at the same speed as the source code. If, however, I also comment on compareTo (); leaving me only the overhead of the comparison function (getting tags, etc.), it is lightning fast. So I still need a better way to compare strings. Either by caching, or by magic.

Unfortunately, changing the sorting algorithm is not possible, but I doubt it is slow because it manages to sort pure integers pretty quickly.

UPDATE:

The comparison function is implemented as part of the TableViewer environment for sorting operations, which means that I do not implement a specific sorting algorithm, but implements SWT / JFace. I only implement the comparison function.

What's even more interesting is the fact that sorting code doubles faster than string comparisons. It’s faster to sort columns with only numbers than with actual literal rows .... Which leads me to the conclusion that something suspicious is happening in the compareTo () method ...

Here is the core of the function:

// e1Label and e2Label is Strings to be compared // // Be smart about the comparison and use non-lexical comparison if // possible (ie if both strings are actually numbers...) // // Warning: This is only "semi-smart" as the sorting might get "a bit" // messed up if some of the values in a column can be parsed as // doubles while others can not... // try { // Try using numeric (double) comparison of label values // double e1_double = Double.parseDouble(e1Label); double e2_double = Double.parseDouble(e2Label); rc = Double.compare(e1_double, e2_double); } catch (NumberFormatException e) { // Use lexical comparison if double comparison is not possible // rc = e1Label.compareToIgnoreCase(e2Label); }

+6

java performance optimization eclipse

Anders hansson Aug 27 '09 at 9:16

source share

12 answers

Although the compareTo () function is the bottleneck, it probably stands out in the profiler because it is the function that is most called in your loop.

It might be helpful to know how exactly your sort functions function. You might be better off changing the sorting algorithm, since there will be much more speed.

+6

Toad Aug 27 '09 at 9:21

source share

These are almost certainly exceptions that slow down the comparison. Throwing and catching an exception is an expensive operation, and you get an exception with each non-numeric cell value.

Consider using a regular expression first to check if the value is numeric, and if not, do not try to parse it.

 private static final Pattern numberPattern = Pattern.compile("[-+0-9.e]+"); // ... // e1Label and e2Label is Strings to be compared // // Be smart about the comparison and use non-lexical comparison if // possible (ie if both strings are actually numbers...) // // Warning: This is only "semi-smart" as the sorting might get "a bit" // messed up if some of the values in a column can be parsed as // doubles while others can not... // if (numberPattern.matches(e1Label) && numberPattern.matches(e2Label)) { try { // Try using numeric (double) comparison of label values // double e1_double = Double.parseDouble(e1Label); double e2_double = Double.parseDouble(e2Label); rc = Double.compare(e1_double, e2_double); } catch (NumberFormatException e) { // Use lexical comparison if double comparison is not possible // rc = e1Label.compareToIgnoreCase(e2Label); } } else { rc = e1Label.compareToIgnoreCase(e2Label); }

+4

finnw Aug 27 '09 at 9:44

source share

Do not store values as String objects. Create your own wrapper that only calls Double.parseDouble once for each line. Cache response (either value or exception). It can probably also cache the case-insensitive version of the string.

+3

Bill michell Aug 27 '09 at 9:59

source share

I really doubt that you can significantly speed up String.compareTo (). The solution probably relates to aclling compareTo () less often. But it is impossible to tell you how to do this without knowing more about your algorithm.

+2

Guillaume Aug 27 '09 at 9:28

source share

Even if you can squeeze a little more performance from your compareTo (), I think the main problem is the size of the list. Even if, presumably, today you can reduce the sorting delay by something acceptable (1 second?), What if next year the application should display a list that is twice as large? Sorting algorithms are O (n log n), so doubling the size of the list will make sorting much slower.

For a reliable solution, consider virtual tables (using the SWT.VIRTUAL attribute). Then you can implement a basic data provider that does not have to do a full sort. Exactly how you implement it will depend on where your data comes from. If it comes from a database, you might consider placing indexes in all sortable fields. If there is no way to do this, there are other strategies that you can use, for example, if you have a quick way to divide the table into pieces (for example, rows starting with "A", rows starting with "B", etc. .), Then you can start by simply extracting the rows in the first fragment, sorting them and displaying them, since the user always starts at the top of the table. Sorting subsequent fragments may continue in the background thread.

+1

Todd owner Aug 27 '09 at 10:33

source share

If you need both literals and numeric comparisons, what data does this string contain? Do they always represent numbers?

If they contain only numbers, then it is probably much faster to store them as numbers (in addition to being cleaner).

And if you need a “literal” comparison (which I interpret as sorting “100” to “20”), you can easily implement this on int or long with some math, which is probably still much faster than String comparison.

0

Joachim sauer Aug 27 '09 at 9:28

source share

As renier and Guillaume said, String.compareTo () is not to blame. This should be slower than a numerical comparison, but in fact it is not so important.

Even if your list consists of a million elements, it should not take more than a second.

If this is an option, I would do a background search, that is, bind some indexing to strings.

You should really analyze what operations will occur most often, single inserts, bulk unsorted inserts, bulk partially sorted inserts, sorting, deleting, etc.

Depending on the most common operation, you choose a more suitable data structure.

0

Sint Aug 27 '09 at 9:55

source share

Why not sort the list once at the beginning, saving the update using insertion sort? Then, when you want to change the order from ascending to descending, information already exists. If you want to sort by another column, just keep the list around so that you disable this column? Or does this not work in SWT? (It's been a while since I used it)

0

Pod Aug 27 '09 at 9:56

source share

It seems to me that you need to avoid calling String.compareTo () as often as you. There are two ways to do this.

1) Add buckets to the form to avoid all of these comparisons.

Depending on the number of rows to be sorted (in thousands?) ?, using full sorting in the bin may require too much overhead in terms of space and garbage collections.

To avoid the fact that you can perform constant rounds of sorting in the form of a bucket, so the lines are sorted into lists containing all the lines, where, say, the first 10 letters match. Then you can use the built-in sort to sort each bucket.

2) Make a hash of each line and sort the hashes (making sure you handle conflicts). Then you can simply rearrange the lines. This is probably the easiest solution.

Using any of these solutions should allow you to sort millions of lines in less than a second.

0

Jørgen fogh Aug 27 '09 at 9:58

source share

Based on your recent clarification, here is the second answer: create a class: Item , which can be used to represent a numerical or alphanumeric value and can determine if this is the case in advance , thus avoiding the overhead of analyzing the value and handling any exceptions while calling the compareTo method.

 public class Item implements Comparable<Item> { private final String s; private final double d; private final boolean numeric; public Item(String s) { double tmpD; boolean tmpNumeric; try { // Do the work of parsing / catching exceptions *upfront*. tmpD = Double.parseDouble(s); tmpNumeric = true; } catch(NumberFormatException ex) { // Parse failed so must be a String. tmpD = 0.0; tmpNumeric = false; } this.s = s; this.d = tmpD; this.numeric = tmpNumeric; } public String asString() { return s; } public double asDouble() { if (!numeric) { throw new IllegalStateException("Not a numeric value: " + s); } return d; } public boolean isNumeric() { return numeric; } @Override public boolean equals(Object o) { if (this == o) return true; if (!(o instanceof Item)) return false; Item item = (Item) o; return Double.compare(item.d, d) == 0 && s.equals(item.s); } @Override public int hashCode() { int result; long temp; result = s.hashCode(); temp = d != +0.0d ? Double.doubleToLongBits(d) : 0L; result = 31 * result + (int) (temp ^ (temp >>> 32)); return result; } public int compareTo(Item item) { int ret; if (numeric && item.isNumeric()) { // Both items are numeric so do fast comparison. double diff = d - item.asDouble(); if (diff > 0.0) { ret = 1; } else if (diff < 0.0) { ret = -1; } else { ret = 0; } } else { ret = s.compareTo(item.asString()); } return ret; } }

0

Adamski Aug 27 '09 at 10:07

source share

Why not try trie?

http://algs4.cs.princeton.edu/52trie/

http://en.wikipedia.org/wiki/Radix_tree

As Robert Sedgwick writes: "Suggestion H. The average number of nodes considered for a search will be missed in a trie constructed from N random keys over an alphabet of size R is ~ logR N." [Sedgwick, Robert; Wayne, Kevin (2011-02-21). Algorithms (4th Edition) (Kindle Locations 12674-12676). Pearson Education (USA). Kindle Edition.]

0

Andrew Waterman May 29 '13 at 23:58

source share

Adamski · Accepted Answer · 2009-08-27T09:40:15+0000

If you have knowledge of your String content, you can pre-compute and store additional information to speed up the comparison. For example, suppose your String contains only capital letters AZ. You can rank String based on the first three letters; eg.

AAA: = 1
AAB: = 2
...
ABA: = 27

Then you can speed up your compareTo by first comparing each rank of String (quick comparison based on int) and then doing only a full String comparison if the ranks were equal.

The fastest way to compare strings (alphanumeric)

More articles: