How to get the number of words on a web page?

Question

I need to get the total number of words on a web page. I know about the System.Net.WebClient class, but its DownloadString() method returns all the HTML markup, whereas I only need the text so that I can count the words.

Any ideas / suggestions are welcome.

+7

c# asp.net

Manish May 23 '11 at 10:36

5 answers

Use the HTML Agility Pack to download and parse an HTML document.

Then you can query the document object and extract the inner text of all nodes.

+6

Odded May 23 '11 at 10:41

I see two options:

1. Use an HTML library to parse the string into a tree.
2. Use some simple text hacks.

Option 1 is clearly cleaner, but introduces additional dependencies on third-party libraries.

Option 2 involves several steps:

Remove tags, together with their content, where the content is not wanted, for example scripts and style sheets.
Remove all other tags while keeping their content, i.e. extract the text from them.
Split the remainder using the string.Split function, with all whitespace characters as separators and the option to ignore empty result strings.
Count the number of entries that Split returns.
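
The steps above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the answerer's code; the function name `count_words` and the choice of tags to drop are assumptions made for the example. The same sequence maps onto C# with Regex and string.Split.

```python
import html
import re

# Tags removed together with their content (an assumed list; extend as needed).
_DROP_WITH_CONTENT = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_ANY_TAG = re.compile(r"<[^>]+>")


def count_words(markup: str) -> int:
    """Rough word count for an HTML string using plain text processing."""
    # Drop scripts and style sheets, including everything inside them.
    text = _DROP_WITH_CONTENT.sub(" ", markup)
    # Comments are not visible text either.
    text = _COMMENT.sub(" ", text)
    # Replace every remaining tag with a space so neighbouring words stay apart.
    text = _ANY_TAG.sub(" ", text)
    # Decode entities such as &nbsp; or &amp; into real characters.
    text = html.unescape(text)
    # split() with no arguments splits on any whitespace and drops empty strings.
    return len(text.split())
```

Regular expressions are not a real HTML parser, so this will miscount on malformed or unusual markup; it is only meant to show the order of the steps.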

Obviously, this does not work well for all languages. For example, Japanese and Chinese do not put spaces between words.

+1

CodesInChaos May 23 '11 at 10:44

http://www.wordcounttool.com/ ... is the easiest way to find out

+1

sharon Aug 17 '11 at 15:27

If you need to count only the words that are actually visible to the user (i.e. ignoring content hidden with CSS and including content created dynamically with JavaScript), you will probably need to automate a browser or a browser control.

Perhaps this can be done completely with client-side JavaScript:

Load the first web page in an iframe.
Once everything has fully loaded, query the runtime DOM to retrieve only the content that is visible to the user.
Write the results to a content area of the outer page.
Repeat for the next web page.

0

Daniel Renshaw May 23 '11 at 11:28

Richard Schneider · Accepted Answer · 2011-05-23T10:42:05+0000

Check out the HTML Agility Pack. It allows you to apply XPath expressions to an HTML document.

You want to find all the text nodes, and then count the words. //text() is XPath for getting all text nodes.
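
To illustrate the idea of collecting text nodes and counting the words in them, here is a small sketch in Python. It does not use the HTML Agility Pack or XPath; it swaps in the standard library's `html.parser` as a stand-in parser, and the names `TextNodeCollector` and `count_words_in_text_nodes` are hypothetical. Unlike a plain //text() query, it deliberately skips text inside script and style elements, which is an assumption about what should count as a word.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects text nodes, skipping those inside <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_nodes = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Called once per run of text between tags, entities already decoded.
        if self._skip_depth == 0:
            self.text_nodes.append(data)


def count_words_in_text_nodes(markup: str) -> int:
    parser = TextNodeCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Count whitespace-separated words in each text node and add them up.
    return sum(len(node.split()) for node in parser.text_nodes)
```

Because words are counted per text node, a word split by inline markup (for example one letter wrapped in a bold tag) is counted as two words.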
