Return an ASCII string from a (possibly encoded) page fetched using urllib2 or BeautifulSoup

I am retrieving data from a web page using urllib2. The content of all the pages is in English, so there is no problem with non-English text. However, the pages are encoded, and they sometimes contain HTML entities such as &pound; or the copyright symbol, etc.

I want to check whether parts of the page contain certain keywords; however, I want the check to be case-insensitive (for obvious reasons).

What is the best way to convert returned page content to all lowercase letters?

def get_page_content_as_lower_case(url):
    request = urllib2.Request(url)
    page = urllib2.urlopen(request)
    temp = page.read()
    return str(temp).lower()  # this doesn't work because the page contains utf-8 data

Update:

I don't need to use urllib2 to get the data; in fact, I can use BeautifulSoup instead, since I need to get the data from specific element(s) on the page, for which BS is a much better choice. I have changed the title to reflect this.

HOWEVER, the problem still remains that the extracted data is in some encoding other than plain ASCII (presumably utf-8). I checked one of the pages, though, and its encoding was iso-8859-1.

Since I'm only interested in English, I want to know how I can get a lowercase ASCII version of the data extracted from the page, so I can run case-insensitive tests for whether a keyword is in the text.

I assume that limiting myself to English only (from English-language sites) narrows down the possible encodings. I am not very well versed in encodings, but I assume the valid options are:

  • ASCII
  • ISO-8859-1
  • UTF-8

Is this a valid assumption, and if so, is there perhaps a way to write a “reliable” function that takes an encoded string containing English text and returns a lowercase ASCII version of it?
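For reference, here is a rough sketch of the kind of function I have in mind (the order in which the encodings are tried is just my guess; iso-8859-1 accepts any byte sequence, so it serves as a last-resort fallback):

def to_lower_ascii(raw_bytes):
    # Try the candidate encodings in order; iso-8859-1 never raises
    # UnicodeDecodeError, so the loop always yields a unicode object.
    for enc in ('ascii', 'utf-8', 'iso-8859-1'):
        try:
            text = raw_bytes.decode(enc)
            break
        except UnicodeDecodeError:
            continue
    # Drop any non-ASCII characters, then lowercase.
    return text.encode('ascii', 'ignore').lower()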

3 answers

BeautifulSoup stores data as Unicode internally, so you don’t need to manually manipulate characters.
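For example, BeautifulSoup 3 can resolve HTML entities such as &pound; into their Unicode characters at parse time via the convertEntities option:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<p>&pound;10 &copy; 2012</p>',
                     convertEntities=BeautifulSoup.HTML_ENTITIES)
print repr(soup.p.string)  # u'\xa310 \xa9 2012' -- already a unicode object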

To find keywords (case-insensitively) in the text (not in attribute values or tag names):

#!/usr/bin/env python
import urllib2
from contextlib import closing

import regex  # pip install regex
from BeautifulSoup import BeautifulSoup

with closing(urllib2.urlopen(URL)) as page:
    soup = BeautifulSoup(page)

print soup(text=regex.compile(ur'(?fi)\L<keywords>',
                              keywords=['your', 'keywords', 'go', 'here']))

Example (Unicode words by @tchrist)

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex
from BeautifulSoup import BeautifulSoup, Comment

html = u'''<div attr="PoSt in attribute should not be found">
<!-- it must not find post inside a comment either -->
<ol>
<li> tag names must not match
<li> Post will be found
<li> the same with post
<li> and post
<li> and poſt
<li> this is ignored
</ol>
</div>'''

soup = BeautifulSoup(html)

# remove comments
comments = soup.findAll(text=lambda t: isinstance(t, Comment))
for comment in comments:
    comment.extract()

# find text with keywords (case-insensitive)
print ''.join(soup(text=regex.compile(ur'(?fi)\L<opts>',
                                      opts=['post', 'li'])))
# compare it with '.lower()'
print '.lower():'
print ''.join(soup(text=lambda t: any(k in t.lower()
                                      for k in ['post', 'li'])))
# or exact match
print 'exact match:'
print ''.join(soup(text=' the same with post\n'))

Output

 Post will be found
 the same with post
 and post
 and poſt

.lower():
 Post will be found
 the same with post
 and post

exact match:
 the same with post

Case-insensitive string searches are more complicated than just searching in the lowercased string. For example, a German user would expect both STRASSE and Straße to match the search term Straße, but 'STRASSE'.lower() == 'strasse' (and you cannot simply replace double s with ß either: there is no ß in Trasse, for example). Other languages (Turkish in particular) have similar complications.

If you want to support languages other than English, you should use a library that can handle proper case folding (e.g. Matthew Barnett's regex module).
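A quick sketch of the difference, comparing the regex module's full case folding (the (?f) flag) with plain .lower():

# -*- coding: utf-8 -*-
import regex  # pip install regex

# Full case folding matches STRASSE against straße ...
print regex.search(ur'(?fi)straße', u'STRASSE') is not None  # True
# ... while simple lowercasing does not.
print u'STRASSE'.lower() == u'straße'                        # False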

That said, here is a way to extract the page content:

import contextlib
import urllib2

def get_page_content(url):
    with contextlib.closing(urllib2.urlopen(url)) as uh:
        content = uh.read().decode('utf-8')
    return content
# You can call .lower() on the result, but that won't work in general

Or with Requests:

import requests

page_text = requests.get(url).text
lowercase_text = page_text.lower()

(Requests will automatically decode the response.)
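If the guess is wrong, the encoding can be inspected and overridden before reading .text (encoding is a documented attribute of the Requests response object; the URL below is just for illustration):

import requests

url = 'http://example.com/'  # hypothetical URL for illustration
r = requests.get(url)
print r.encoding       # encoding inferred from the HTTP headers
r.encoding = 'utf-8'   # override if needed, before accessing r.text
print r.text.lower()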

As @tchrist says, .lower() will not do the job for Unicode text.

You can check out this alternative regex implementation, which implements case folding for case-insensitive Unicode matching: http://code.google.com/p/mrab-regex-hg/

The case folding tables are also available: http://unicode.org/Public/UNIDATA/CaseFolding.txt
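If you need folding outside a regex, here is a small sketch that reads the simple one-to-one foldings out of CaseFolding.txt (each data line has the form "<code>; <status>; <mapping>; # <name>"; the function name is mine):

def load_simple_folds(path):
    """Build a {char: folded_char} dict from the C and S lines of CaseFolding.txt."""
    folds = {}
    for line in open(path):
        line = line.split('#', 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(';')[:3]]
        # C = common, S = simple (one-to-one) foldings; F (full) lines are skipped.
        if status not in ('C', 'S'):
            continue
        cp, fp = int(code, 16), int(mapping, 16)
        # Skip astral code points so this also works on narrow Python 2 builds.
        if cp <= 0xFFFF and fp <= 0xFFFF:
            folds[unichr(cp)] = unichr(fp)
    return folds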

