Java - How to Measure Matcher Processing

Suppose I had the brilliant idea of writing an HTML link-tag parser to explore the Internet, and I use a regular expression to find and record every link that appears on a page. The code currently works fine, but I want to add some members that reflect the "operation status".

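import java.util.ArrayList;
import java.util.Collection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkScanner {

    // Group 1 captures the href value, group 2 the link text of each <a> tag
    private static final Pattern hrefPattern = Pattern.compile("<a\\b[^>]*href=\"(.*?)\".*?>(.*?)</a>");

    public Collection<String> scan(String html) {
        ArrayList<String> links = new ArrayList<>();
        Matcher hrefMatcher = hrefPattern.matcher(html);
        while (hrefMatcher.find()) {
            String link = hrefMatcher.group(1);
            links.add(link);
        }
        return links;
    }
}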

How can I measure this process?


For example, consider this hypothetical implementation that tracks progress:

public class LinkScannerWithStatus {

    private int matched;
    private int total;

    // hrefPattern is the same pattern as in LinkScanner
    public Collection<String> scan(String html) {
        ArrayList<String> links = new ArrayList<>();
        Matcher hrefMatcher = hrefPattern.matcher(html);
        total = hrefMatcher.getFindCount(); // hypothetical: assume getFindCount() exists
        while (hrefMatcher.find()) {
            String link = hrefMatcher.group(1);
            links.add(link);
            matched++; // assume progress can be measured linearly as matched / total
        }
        return links;
    }
}

I don't know where to start... I don't even know if "Matcher processing" is grammatically correct :S

+4
4 answers

Unfortunately, Matcher does not have a listener interface for measuring progress. Providing one would probably be too expensive.

String, region, . . , . , , .

, hitEnd, , . , .

, URL- , , URL- .

, , . I/O , HTML.
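Since Matcher offers no progress callback, one rough approximation is to compare the end offset of each match with the length of the input. The sketch below assumes the whole page is already in memory as a String; the class name OffsetProgressScanner and the IntConsumer callback are hypothetical names chosen for illustration, not part of any answer in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: reports progress as the end offset of the latest match.
class OffsetProgressScanner {

    private static final Pattern HREF_PATTERN =
            Pattern.compile("<a\\b[^>]*href=\"(.*?)\".*?>(.*?)</a>");

    // Collects href values and passes a percentage (0..100) to the callback after each match.
    static List<String> scan(String html, IntConsumer onPercent) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF_PATTERN.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
            // matcher.end() is the offset just past the current match
            onPercent.accept((int) (100L * matcher.end() / html.length()));
        }
        onPercent.accept(100); // no more matches: the input is exhausted
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"a.html\">A</a> text <a href=\"b.html\">B</a></p>";
        List<String> links = scan(html, percent -> System.out.println(percent + "%"));
        System.out.println(links);
    }
}
```

The percentage only moves when a match is found, so a long stretch of text without links shows no progress until the next match or the end of the input.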

+2

, DOM parser HTML, , DOM, .

- HTML XML SAX. , , .
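As a rough illustration of the SAX route, the sketch below collects href attributes with the JDK's built-in SAX parser. It assumes the input is well-formed XHTML, which real-world HTML often is not; the class name SaxLinkScanner is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: only works when the markup is well-formed XML (XHTML).
class SaxLinkScanner {

    static List<String> scan(String xhtml) {
        List<String> links = new ArrayList<>();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xhtml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    // Called once per opening tag, in document order
                    if ("a".equalsIgnoreCase(qName)) {
                        String href = attributes.getValue("href");
                        if (href != null) {
                            links.add(href);
                        }
                    }
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"a.html\">A</a><p><a href=\"b.html\">B</a></p></body></html>";
        System.out.println(scan(page));
    }
}
```

Because the handler is invoked as the parser advances through the document, it is also a natural place to report progress.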

+2

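Here is one approach: wrap the input CharSequence in another CharSequence. The Matcher reads its input through the wrapper, so the CountingCharSequence below can report each character offset it is asked for to a progress listener.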

, CharSequence.toString() String, . , , , . toString() , , , , . .

, "100%" , , " ". : P

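import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegExProgress {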

    // the original LinkScanner provided by Victor, changed to accept a CharSequence
    public static class LinkScanner {
        private static final Pattern hrefPattern = Pattern.compile("<a\\b[^>]*href=\"(.*?)\".*?>(.*?)</a>");
        public Collection<String> scan(CharSequence html) {
            ArrayList<String> links = new ArrayList<>();
            Matcher hrefMatcher = hrefPattern.matcher(html);
            while (hrefMatcher.find()) {
                String link = hrefMatcher.group(1);
                links.add(link);
            }
            return links;
        }
    }

    interface ProgressListener {
        void listen(int characterOffset);
    }

    static class SyncedProgressListener implements ProgressListener {
        private final int size;
        private final double blockSize;
        private final double percentageOfBlock;

        private int block;

        public SyncedProgressListener(int max, int blocks) {
            this.size = max;
            this.blockSize = (double) size / (double) blocks - 0.000_001d;
            // each completed block stands for 100 / blocks percent of the input
            this.percentageOfBlock = 100.0d / (double) blocks;

            this.block = 0;
            print();
        }

        public synchronized void listen(int characterOffset) {
            if (characterOffset >= blockSize * (block + 1)) {
                this.block = (int) ((double) characterOffset / blockSize);
                print();
            }
        }

        private void print() {
            System.out.printf("%d%%%n", (int) (block * percentageOfBlock));
        }
    }

    static class CountingCharSequence implements CharSequence {

        private final CharSequence wrapped;
        private final int start;
        private final int end;

        private ProgressListener progressListener;

        public CountingCharSequence(CharSequence wrapped, ProgressListener progressListener) {
            this.wrapped = wrapped;
            this.progressListener = progressListener;
            this.start = 0;
            this.end = wrapped.length();
        }

        public CountingCharSequence(CharSequence wrapped, int start, int end, ProgressListener pl) {
            this.wrapped = wrapped;
            this.progressListener = pl;
            this.start = start;
            this.end = end;
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            // this may not be needed, as charAt() has to be called eventually
            System.out.printf("subSequence(%d, %d)%n", start, end);
            int newStart = this.start + start;
            // both offsets are relative to this sequence, so shift each by this.start
            int newEnd = this.start + end;
            progressListener.listen(newStart);
            return new CountingCharSequence(wrapped, newStart, newEnd, progressListener);
        }

        @Override
        public int length() {
            System.out.printf("length(): %d%n", end - start);
            return end - start;
        }

        @Override
        public char charAt(int index) {
            //System.out.printf("charAt(%d)%n", index);
            int realIndex = start + index;
            progressListener.listen(realIndex);
            return this.wrapped.charAt(realIndex);
        }

        @Override
        public String toString() {
            System.out.printf(" >>> toString() <<< %n");
            // return only the window this sequence covers, as the CharSequence contract requires
            return wrapped.subSequence(start, end).toString();
        }
    }

    public static void main(String[] args) throws Exception {
        LinkScanner scanner = new LinkScanner();
        String content = new String(Files.readAllBytes(Paths.get("regex - Java - How to measure a Matcher processing - Stack Overflow.htm")));
        SyncedProgressListener pl = new SyncedProgressListener(content.length(), 10);
        CountingCharSequence ccs = new CountingCharSequence(content, pl);
        Collection<String> urls = scanner.scan(ccs);
        // OK, I admit, this is because of an off-by one error
        System.out.printf("100%% - %d%n", urls.size());

    }
}
+1

, , , LinkedList.

You can calculate the total number of matches up front with StringUtils.countMatches from Apache Commons Lang: int countMatches = StringUtils.countMatches(text, target);

So search for the string "href", or perhaps a tag or some other component of a link, and you will hopefully get a reasonably accurate count of the links on the page; you can then parse them one by one. This is not ideal, because countMatches does not accept a regex as the target parameter.
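If you would rather not depend on StringUtils, the same estimate can be sketched with String.indexOf. The class name HrefCountEstimate and the helper countOccurrences are hypothetical; the total is only an estimate, since "href=" can also appear outside the tags the regex matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: estimate the total up front, then count matches against it.
class HrefCountEstimate {

    // Counts non-overlapping occurrences of target in text.
    static int countOccurrences(String text, String target) {
        if (target.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(target, from)) != -1) {
            count++;
            from += target.length();
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<a href=\"a.html\">A</a> <a href=\"b.html\">B</a>";
        int total = countOccurrences(html, "href="); // estimated number of links
        Matcher matcher = Pattern.compile("<a\\b[^>]*href=\"(.*?)\".*?>(.*?)</a>").matcher(html);
        int matched = 0;
        while (matcher.find()) {
            matched++;
            System.out.printf("%d of about %d: %s%n", matched, total, matcher.group(1));
        }
    }
}
```

This costs one extra pass over the input, which is cheap compared with the regex scan itself.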

0