I use a beautiful soup, and I write the cover and have the following code in it:
print soup.originalEncoding #self.addtoindex(page, soup) links=soup('a') for link in links: if('href' in dict(link.attrs)): link['href'].replace('..', '') url=urljoin(page, link['href']) if url.find("'") != -1: continue url = url.split('?')[0] url = url.split('#')[0] if url[0:4] == 'http': newpages.add(url) pages = newpages
It is assumed that link['href'].replace('..', '') captures links that come out as .. /contact/orderform.aspx,../contact/requestconsult.aspx, etc. However, it does not work. The links still have the leading ".." Is there something I can't see?
sdiener
source share