I see two options:
- Use an HTML library to parse the string into a tree.
- Use some simple text hacks.
Option 1 is clearly cleaner, but introduces additional dependencies on third-party libraries.
There are several steps:
- Remove tags (with content) whose contents do not suit you. For example, scripts and style sheets.
- Remove all other tags while retaining their contents / extract text from other tags
- Split the remainder using the string.Split function, with all whitespace characters as separators and the option to drop empty result strings.
- Count the number of entries Split returns.
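A minimal sketch of these steps, written in Python purely as an illustration: the standard-library `html.parser` takes the place of a third-party HTML library, and `str.split()` stands in for string.Split with empty entries dropped. The names `TextExtractor` and `count_words` are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.chunks.append(data)


def count_words(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Joining with a space keeps words in adjacent tags apart.
    text = " ".join(parser.chunks)
    # split() with no arguments splits on runs of whitespace
    # and never returns empty strings.
    return len(text.split())


page = "<style>p { color: red; }</style><p>Hello   brave</p><p>new world</p>"
print(count_words(page))
```

Joining the text chunks with a space is a judgment call: it keeps `<p>one</p><p>two</p>` as two words, but it also splits a word that is interrupted by inline markup, such as `<b>Hel</b>lo`.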
Obviously, this does not work well for all languages. For example, Japanese / Chinese have no spaces between words.
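To see the limitation concretely, here is a small Python illustration (the sample sentences are my own). Splitting on whitespace treats a whole Japanese sentence as a single word:

```python
english = "I am a student"
japanese = "私は学生です"  # roughly the same sentence, written without spaces

english_count = len(english.split())
japanese_count = len(japanese.split())

print(english_count)   # 4
print(japanese_count)  # 1
```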
CodesInChaos