How to recursively get all links from a site using Beautiful Soup (Python)

I want to recursively get all the links from a site, then follow those links and collect the links from those pages as well. The depth should be 5-10, and it should return an array of all links found. Preferably with Beautiful Soup / Python. Thanks!

I have tried this so far and it is not working ... any help would be appreciated.

    from BeautifulSoup import BeautifulSoup
    import urllib2

    def getLinks(url):
        if len(url) == 0:
            return [url]
        else:
            files = []
            page = urllib2.urlopen(url)
            soup = BeautifulSoup(page.read())
            universities = soup.findAll('a', {'class': 'institution'})
            for eachuniversity in universities:
                files += getLinks(eachuniversity['href'])
            return files

    print getLinks("http://www.utexas.edu/world/univ/alpha/")
+7
python beautifulsoup
2 answers

Recursive algorithms are used to reduce a large problem to smaller ones with the same structure and then combine the results. They usually consist of a base case that does not recurse, and a recursive case that does. For example, say you were born in 1986 and you want to calculate your age. You could write:

    def myAge(currentyear):
        if currentyear == 1986:  # Base case, does not lead to recursion.
            return 0
        else:                    # Recursive case.
            return 1 + myAge(currentyear - 1)

I don't really see the point of using recursion for your problem as written. My first suggestion is to put a limit in your code. What you gave us will run forever, because the program gets stuck in endlessly nested loops; it never reaches an end and never starts returning. So you could keep a variable outside the function that is updated every time you go down a level and that, past some point, stops the function from starting a new for loop and makes it start returning what it has found.

But then you end up mutating global variables, you are using recursion in a strange way, and the code gets messy.
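For illustration only, a rough sketch of that counter idea might look like the following; the names visits and MAX_VISITS are made up for this sketch, and the module-level state it mutates is exactly what makes it messy:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    visits = 0        # module-level counter the recursion updates
    MAX_VISITS = 50   # arbitrary budget for this sketch

    def getLinks(url):
        global visits
        if visits >= MAX_VISITS:
            return []            # budget spent: stop starting new for loops
        visits += 1
        found = []
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        for anchor in soup.findAll('a', {'class': 'institution'}):
            found.append(anchor['href'])
            found += getLinks(anchor['href'])
        return found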

Now, having read the comments and seen what you actually want (which, I must say, is not entirely clear), you can borrow from the recursive approach in your code without writing all of it recursively:

    def recursiveUrl(url, depth):
        if depth == 5:
            return url
        else:
            page = urllib2.urlopen(url)
            soup = BeautifulSoup(page.read())
            newlink = soup.find('a')  # find just the first link on the page
            if newlink is None or not newlink.get('href'):
                return url
            else:
                return url, recursiveUrl(newlink['href'], depth + 1)

    def getLinks(url):
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        links = soup.findAll('a', {'class': 'institution'})
        results = []
        for link in links:
            results.append(recursiveUrl(link['href'], 0))
        return results

Now there is still a problem: links do not always point to web pages; they can also point to files and images. That is why I wrote the if/else statement in the recursive, URL-opening part of the function. Another problem is that your first website has 2166 institution links, so creating 2166 * 5 BeautifulSoup objects is not fast. The code above runs the recursive function 2166 times. That by itself should not be a problem, but you are dealing with large html (or php) files, so making 2166 * 5 soups takes a huge amount of time.
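For what it's worth, here is a minimal sketch of how one might skip non-page links and avoid fetching the same URL twice. It assumes the Content-Type header is a good enough test for "is this an HTML page", and the names fetch_html and seen are made up for the sketch:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    seen = set()  # URLs already fetched, so we never parse the same page twice

    def fetch_html(url):
        # Hypothetical helper: return a soup for url, or None if it was
        # already visited or does not look like an HTML page.
        if url in seen:
            return None
        seen.add(url)
        page = urllib2.urlopen(url)
        content_type = page.info().getheader('Content-Type', '')
        if 'text/html' not in content_type:
            return None  # a file or an image, not a page worth parsing
        return BeautifulSoup(page.read())

getLinks and recursiveUrl would then build their soups through fetch_html and treat a None return as a dead end.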

+1

The number of crawled pages will grow exponentially, and this involves many problems that may not look complicated at first glance. Look at the scrapy architecture overview to see how it should be done in real life:

(Scrapy architecture overview diagram)

Among other great features, scrapy will not crawl the same page twice (unless you force it to) and can be limited to a maximum depth with DEPTH_LIMIT.

Even better, scrapy has built-in link extractors.
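As an illustration (not the one true setup), a minimal CrawlSpider combining a built-in LinkExtractor with a DEPTH_LIMIT of 5 might look like this; the spider name is made up and the start URL is just the one from the question:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class InstitutionSpider(CrawlSpider):
        name = 'institutions'                                   # arbitrary spider name
        start_urls = ['http://www.utexas.edu/world/univ/alpha/']
        custom_settings = {'DEPTH_LIMIT': 5}                    # stop following links past depth 5

        # Follow every link the built-in extractor finds and record each
        # visited URL; duplicate requests are filtered by scrapy itself.
        rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

        def parse_item(self, response):
            yield {'url': response.url}

You could run it with something like scrapy runspider institutions_spider.py -o links.json (file name assumed), which writes every visited URL to links.json.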

+4
