Get the first link in a Wikipedia article not in parentheses

So, I'm interested in the theory that if you go to a random Wikipedia article and repeatedly click the first link that is not inside parentheses, in about 95% of cases you eventually end up at the Philosophy article.

I wanted to write a Python script that follows the links for me and, at the end, prints a nice list of the articles that were visited (linkA -> linkB -> linkC, etc.).

I have managed to get the HTML DOM of the web pages, and I've managed to strip out some unnecessary links and the top description bar which leads to disambiguation pages. So far I have concluded that:

  • The DOM begins with the table which you see on the right on some pages, for example on Human. We want to ignore those links.
  • The valid link elements all have a <p> element somewhere as their ancestor (most often the parent or grandparent, if the link sits inside a <b> tag or similar). The top bar which leads to disambiguation pages does not appear to contain any <p> elements.
  • Invalid links contain some special words followed by a colon, e.g. Wikipedia: (a rough sketch of these checks follows this list).
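To make those checks concrete, here is a minimal, untested sketch of how they could be expressed with minidom instead of the str() trick I use in my script further down; has_ancestor is just a hypothetical helper name:

def has_ancestor(node, tag_name):
    """Walk up the parentNode chain and report whether any ancestor
    element has the given tag name (e.g. 'p' or 'table')."""
    node = node.parentNode
    while node is not None:
        # Only element nodes have a tagName; the document and text nodes do not.
        if getattr(node, 'tagName', None) == tag_name:
            return True
        node = node.parentNode
    return False

# A candidate <a> tag would then only be considered if
#   has_ancestor(tag, 'p') and not has_ancestor(tag, 'table')
# and a prefix filter (like validWikiArticleLinkString below) rejects the
# special pages such as Wikipedia:, File:, Help: and so on.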

So far so good. But it's the parentheses that bother me. In the article about Human, for example, the first link not inside parentheses is "/wiki/Species", but the script finds "/wiki/Taxonomy", which is inside them.

I have no idea how to handle this programmatically, since I would have to look for text in some combination of parent/child nodes, which might not always be the same. Any ideas? (I have put a rough, untested sketch of the kind of thing I mean after my code below.)

My code can be seen below, but it's something I whipped up quite fast and I'm not very proud of it. It is commented, though, so you can see my line of thought (hopefully :)).

 """Wikipedia fun""" import urllib2 from xml.dom.minidom import parseString import time def validWikiArticleLinkString(href): """ Takes a string and returns True if it contains the substring '/wiki/' in the beginning and does not contain any of the "special" wiki pages. """ return (href.find("/wiki/") == 0 and href.find("(disambiguation)") == -1 and href.find("File:") == -1 and href.find("Wikipedia:") == -1 and href.find("Portal:") == -1 and href.find("Special:") == -1 and href.find("Help:") == -1 and href.find("Template_talk:") == -1 and href.find("Template:") == -1 and href.find("Talk:") == -1 and href.find("Category:") == -1 and href.find("Bibcode") == -1 and href.find("Main_Page") == -1) if __name__ == "__main__": visited = [] # a list of visited links. used to avoid getting into loops opener = urllib2.build_opener() opener.addheaders = [('User-agent', 'Mozilla/5.0')] # need headers for the api currentPage = "Human" # the page to start with while True: infile = opener.open('http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % currentPage) html = infile.read() # retrieve the contents of the wiki page we are at htmlDOM = parseString(html) # get the DOM of the parsed HTML aTags = htmlDOM.getElementsByTagName("a") # find all <a> tags for tag in aTags: if "href" in tag.attributes.keys(): # see if we have the href attribute in the tag href = tag.attributes["href"].value # get the value of the href attribute if validWikiArticleLinkString(href): # if we have one of the link types we are looking for # Now come the tricky parts. We want to look for links in the main content area only, # and we want the first link not in parentheses. # assume the link is valid. invalid = False # tables which appear to the right on the site appear first in the DOM, so we need to make sure # we are not looking at a <a> tag somewhere inside a <table>. pn = tag.parentNode while pn is not None: if str(pn).find("table at") >= 0: invalid = True break else: pn = pn.parentNode if invalid: # go to next link continue # Next we look at the descriptive texts above the article, if any; eg # This article is about .... or For other uses, see ... (disambiguation). # These kinds of links will lead into loops so we classify them as invalid. # We notice that this text does not appear to be inside a <p> block, so # we dismiss <a> tags which aren't inside any <p>. pnode = tag.parentNode while pnode is not None: if str(pnode).find("p at") >= 0: break pnode = pnode.parentNode # If we have reached the root node, which has parentNode None, we classify the # link as invalid. if pnode is None: invalid = True if invalid: continue ###### this is where I got stuck: # now we need to look if the link is inside parentheses. below is some junk # for elem in tag.parentNode.childNodes: # while elem.firstChild is not None: # elem = elem.firstChid # print elem.nodeValue print href # this will be the next link newLink = href[6:] # except for the /wiki/ part break # if we have been to this link before, break the loop if newLink in visited: print "Stuck in loop." break # or if we have reached Philosophy elif newLink == "Philosophy": print "Ended up in Philosophy." break else: visited.append(currentPage) # mark this currentPage as visited currentPage = newLink # make the the currentPage we found the new page to fetch time.sleep(5) # sleep some to see results as debug 
1 answer

I found a Python script on GitHub (http://github.com/JensTimmerman/scripts/blob/master/philosophy.py) that plays this game. It uses BeautifulSoup to parse the HTML, and to solve the parentheses problem it simply removes the text between parentheses before parsing out the links.
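I don't have that exact code handy, but the idea looks roughly like this (a minimal, untested sketch assuming BeautifulSoup 4 and the same printable page URL as in your script; first_link is just an illustrative name):

import re
import urllib2
from bs4 import BeautifulSoup

def first_link(title):
    """Return the first /wiki/ link in the article body that is not
    inside parentheses, by stripping parenthesised runs first."""
    url = 'http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % title
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib2.urlopen(req).read()
    soup = BeautifulSoup(html, 'html.parser')
    for p in soup.find_all('p'):
        markup = unicode(p)
        # Drop (non-nested) parenthesised runs before looking for links.
        # Caveat: this naive regex can also eat links whose URL itself
        # contains parentheses, so real code needs to be more careful.
        markup = re.sub(r'\([^()]*\)', '', markup)
        m = re.search(r'href="(/wiki/[^"]*)"', markup)
        if m:
            return m.group(1)  # still needs your "special page" filtering
    return None

With your Human example, that should return "/wiki/Species" rather than "/wiki/Taxonomy", assuming the first real article paragraph is the first <p> on the printable page.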

