How to get the number of words on a web page?

I need to get the total number of words on a web page. I know about the System.Net.WebClient class, but its DownloadString() method returns all the HTML markup, whereas I only need the text so that I can count the words.

Any ideas / suggestions are welcome.

+7
5 answers

Check out the HTML Agility Pack. It lets you apply XPath expressions to an HTML document.

You want to find all the text nodes, and then count the words. //text() is XPath for getting all text nodes.
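
To illustrate the idea of collecting every text node and then counting words, here is a minimal sketch in Python, using only the standard library's html.parser as a stand-in for the HTML Agility Pack (it does not use XPath). The names TextNodeCollector and count_words are made up for this example, and skipping script and style content is an assumption about what should count as a word.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects text nodes, roughly what //text() would select."""

    # Assumption: text inside these elements should not count as words.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Called once per run of text between tags.
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_words(html):
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    # Join with spaces so words in adjacent elements do not fuse,
    # then split on any run of whitespace.
    return len(" ".join(collector.chunks).split())
```

Joining the text nodes with spaces is a judgment call: it keeps the contents of adjacent table cells or list items apart, but it also splits a word that is interrupted by inline markup, such as a word that is only partly in bold.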

+5

Use the HTML Agility Pack to download and parse an HTML document.

Then you can query the document object and extract the inner text of all nodes.

+6

I see two options:

  • Use an HTML library to parse the string into a tree.
  • Use some simple text hacks.

Option 1 is clearly cleaner, but adds a dependency on a third-party library.

Option 2 takes several steps:

  • Remove tags, together with their content, whose content you do not want, for example scripts and style sheets.
  • Remove all other tags while keeping their content, i.e. extract the text from them.
  • Split the remainder with string.Split, using all whitespace characters as separators and the option to omit empty entries.
  • Count the number of entries Split returns.

Obviously, this does not work well for all languages. For example, Japanese / Chinese have no spaces between words.
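
The text-hack steps above can be sketched roughly as follows. This is Python rather than C#, chosen only to keep the example dependency-free; in .NET, Regex.Replace and string.Split with StringSplitOptions.RemoveEmptyEntries would play the same roles. The function name count_words and both regular expressions are assumptions made for this illustration, and regular expressions will mishandle unusual markup, such as a ">" inside an attribute value.

```python
import html
import re

# Step 1: script and style elements, including everything inside them.
UNWANTED_BLOCKS = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Step 2: any remaining tag; its surrounding text is kept.
ANY_TAG = re.compile(r"<[^>]+>")


def count_words(markup):
    text = UNWANTED_BLOCKS.sub(" ", markup)
    # Replace tags with a space so "a<br>b" stays two words.
    text = ANY_TAG.sub(" ", text)
    # Decode entities such as &amp; and &nbsp; into characters.
    text = html.unescape(text)
    # Steps 3 and 4: split() with no arguments splits on any run of
    # whitespace and never returns empty strings; count the pieces.
    return len(text.split())
```

For example, count_words("<p>One two</p><script>var x = 1;</script>") counts only the two words in the paragraph.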

+1

http://www.wordcounttool.com/ ... is the easiest way to find out

+1

If you need to count only the words that are actually visible to the user (i.e. ignoring content hidden with CSS and including content created dynamically with JavaScript), you probably need to automate a browser or a browser control.

Perhaps this can be done completely with client-side JavaScript:

  • Load the first webpage in an iframe.
  • After everything is fully loaded, query the runtime DOM to retrieve only the content that is visible to the user.
  • Write the results to the content area of the outer page.
  • Repeat for the next web page.

0
