I am retrieving data from webpages using urllib2. The content of all the pages is in English, so there is no issue with non-English text. The pages are encoded, however, and they sometimes contain HTML entities such as £ or the copyright symbol, etc.
I want to check whether parts of a page contain certain keywords; however, I want the check to be case-insensitive (for obvious reasons).
What is the best way to convert the returned page content to all lowercase letters?
    import urllib2

    def get_page_content_as_lower_case(url):
        request = urllib2.Request(url)
        page = urllib2.urlopen(request)
        temp = page.read()
        return str(temp).lower()
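As a side note, str(temp).lower() never decodes the bytes, so any non-ASCII characters pass through untouched. A minimal sketch of decoding first, assuming Python 2's urllib2 as above (get_page_text_lower is a hypothetical name, and the iso-8859-1 fallback is my assumption, not something stated in the question):

    import urllib2

    def get_page_text_lower(url):
        # Hypothetical variant: decode the response before lowercasing.
        response = urllib2.urlopen(url)
        raw = response.read()
        # In Python 2, response.headers supports getparam() for reading the
        # charset out of the Content-Type header; fall back to iso-8859-1
        # when the server declares nothing.
        charset = response.headers.getparam('charset') or 'iso-8859-1'
        return raw.decode(charset).lower()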
[[Update]]
I don't need to use urllib2 to fetch the data; in fact, I can use BeautifulSoup instead, since I need to retrieve the data from specific element(s) on the page, for which BS is a much better choice. I have changed the title to reflect this.
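For illustration, a sketch of that approach, assuming BeautifulSoup 4 (BS3 imports differ) and a hypothetical element_id parameter; BeautifulSoup sniffs the document's encoding itself and hands back unicode, so lowercasing behaves correctly regardless of the page's charset:

    import urllib2
    from bs4 import BeautifulSoup  # assumes BeautifulSoup 4

    def get_element_text_lower(url, element_id):
        # BS decodes the page to unicode internally, whatever its charset.
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        element = soup.find(id=element_id)  # element_id is illustrative
        return element.get_text().lower()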
HOWEVER, the problem still remains that the fetched data is in some non-ASCII encoding, presumably utf-8. I checked one of the pages and the encoding was iso-8859-1.
Since I am only interested in English, I want to know how I can obtain a lowercase ASCII version of the data extracted from the page, so that I can carry out a case-insensitive test for whether a keyword is in the text.
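One way to get there, assuming the extracted text is already a unicode object (which is what BeautifulSoup returns), is to normalize and then encode to ASCII with errors ignored; this is a sketch, not necessarily the best way:

    import unicodedata

    def to_lowercase_ascii(text):
        # NFKD normalization splits accented characters into base letter
        # plus combining mark, so an accented 'e' survives as 'e'; anything
        # with no ASCII equivalent (like the copyright sign) is dropped
        # by the 'ignore' error handler.
        normalized = unicodedata.normalize('NFKD', text)
        return normalized.encode('ascii', 'ignore').lower()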
I assume that the fact that I have restricted myself to English only (from English-speaking websites) reduces the choice of encodings? I don't know much about encodings, but I assume that the valid choices are:

- ASCII
- iso-8859-1
- utf-8

Is this a valid assumption, and if so, is there a way to write a "reliable" function that takes an encoded string containing English text and returns a lowercase ASCII version of it?
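Under exactly that assumption (the input is a raw byte string in one of the three encodings above), a minimal sketch of such a function could try the strict encodings first and fall back to iso-8859-1, which accepts any byte sequence and so must come last:

    def to_lower_ascii(raw_bytes):
        # ascii and utf-8 raise UnicodeDecodeError on bytes they cannot
        # handle; iso-8859-1 never fails, so it serves as the fallback.
        for encoding in ('ascii', 'utf-8', 'iso-8859-1'):
            try:
                text = raw_bytes.decode(encoding)
                break
            except UnicodeDecodeError:
                continue
        return text.encode('ascii', 'ignore').lower()

    print to_lower_ascii('Copyright \xc2\xa9 ACME Ltd')  # -> 'copyright  acme ltd'

The copyright sign in the example is utf-8 encoded, decodes cleanly, and is then dropped by the 'ignore' handler, leaving pure lowercase ASCII to match keywords against.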