Get the first link in a Wikipedia article not in parentheses

So, I'm interested in the theory that if you go to a random Wikipedia article and repeatedly click the first link that is not inside parentheses, in about 95% of cases you eventually end up at the Philosophy article.

I wanted to write a Python script that follows the links for me and, at the end, prints a nice list of the articles that were visited (linkA -> linkB -> linkC, etc.).

I have managed to get the HTML DOM of the web pages, and I've managed to strip out some unnecessary links and the top description bar which leads to disambiguation pages. So far I have concluded that:

  • The DOM begins with the table which you see on the right on some pages, for example on Human. We want to ignore those links.
  • The valid link elements all have a <p> element somewhere as their ancestor (most often the parent or grandparent, if the link sits inside a <b> tag or similar). The top bar which leads to disambiguation pages does not appear to contain any <p> elements.
  • Invalid links contain some special words followed by a colon, e.g. Wikipedia: (a rough sketch of these checks follows this list).
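To make those checks concrete, here is a minimal, untested sketch of how they could be expressed with minidom instead of the str() trick I use in my script further down; has_ancestor is just a hypothetical helper name:

def has_ancestor(node, tag_name):
    """Walk up the parentNode chain and report whether any ancestor
    element has the given tag name (e.g. 'p' or 'table')."""
    node = node.parentNode
    while node is not None:
        # Only element nodes have a tagName; the document and text nodes do not.
        if getattr(node, 'tagName', None) == tag_name:
            return True
        node = node.parentNode
    return False

# A candidate <a> tag would then only be considered if
#   has_ancestor(tag, 'p') and not has_ancestor(tag, 'table')
# and a prefix filter (like validWikiArticleLinkString below) rejects the
# special pages such as Wikipedia:, File:, Help: and so on.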

So far so good. But it's the parentheses that bother me. In the article about Human, for example, the first link not inside parentheses is "/wiki/Species", but the script finds "/wiki/Taxonomy", which is inside them.

I have no idea how to handle this programmatically, since I would have to look for text in some combination of parent/child nodes, which might not always be the same. Any ideas? (I have put a rough, untested sketch of the kind of thing I mean after my code below.)

My code can be seen below, but it's something I whipped up quite fast and I'm not very proud of it. It is commented, though, so you can see my line of thought (hopefully :)).

 """Wikipedia fun""" import urllib2 from xml.dom.minidom import parseString import time def validWikiArticleLinkString(href): """ Takes a string and returns True if it contains the substring '/wiki/' in the beginning and does not contain any of the "special" wiki pages. """ return (href.find("/wiki/") == 0 and href.find("(disambiguation)") == -1 and href.find("File:") == -1 and href.find("Wikipedia:") == -1 and href.find("Portal:") == -1 and href.find("Special:") == -1 and href.find("Help:") == -1 and href.find("Template_talk:") == -1 and href.find("Template:") == -1 and href.find("Talk:") == -1 and href.find("Category:") == -1 and href.find("Bibcode") == -1 and href.find("Main_Page") == -1) if __name__ == "__main__": visited = [] # a list of visited links. used to avoid getting into loops opener = urllib2.build_opener() opener.addheaders = [('User-agent', 'Mozilla/5.0')] # need headers for the api currentPage = "Human" # the page to start with while True: infile = opener.open('http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % currentPage) html = infile.read() # retrieve the contents of the wiki page we are at htmlDOM = parseString(html) # get the DOM of the parsed HTML aTags = htmlDOM.getElementsByTagName("a") # find all <a> tags for tag in aTags: if "href" in tag.attributes.keys(): # see if we have the href attribute in the tag href = tag.attributes["href"].value # get the value of the href attribute if validWikiArticleLinkString(href): # if we have one of the link types we are looking for # Now come the tricky parts. We want to look for links in the main content area only, # and we want the first link not in parentheses. # assume the link is valid. invalid = False # tables which appear to the right on the site appear first in the DOM, so we need to make sure # we are not looking at a <a> tag somewhere inside a <table>. pn = tag.parentNode while pn is not None: if str(pn).find("table at") >= 0: invalid = True break else: pn = pn.parentNode if invalid: # go to next link continue # Next we look at the descriptive texts above the article, if any; eg # This article is about .... or For other uses, see ... (disambiguation). # These kinds of links will lead into loops so we classify them as invalid. # We notice that this text does not appear to be inside a <p> block, so # we dismiss <a> tags which aren't inside any <p>. pnode = tag.parentNode while pnode is not None: if str(pnode).find("p at") >= 0: break pnode = pnode.parentNode # If we have reached the root node, which has parentNode None, we classify the # link as invalid. if pnode is None: invalid = True if invalid: continue ###### this is where I got stuck: # now we need to look if the link is inside parentheses. below is some junk # for elem in tag.parentNode.childNodes: # while elem.firstChild is not None: # elem = elem.firstChid # print elem.nodeValue print href # this will be the next link newLink = href[6:] # except for the /wiki/ part break # if we have been to this link before, break the loop if newLink in visited: print "Stuck in loop." break # or if we have reached Philosophy elif newLink == "Philosophy": print "Ended up in Philosophy." break else: visited.append(currentPage) # mark this currentPage as visited currentPage = newLink # make the the currentPage we found the new page to fetch time.sleep(5) # sleep some to see results as debug 
1 answer

I found a Python script on GitHub (http://github.com/JensTimmerman/scripts/blob/master/philosophy.py) that plays this game. It uses BeautifulSoup to parse the HTML, and to solve the parentheses problem it simply removes the text between parentheses before parsing out the links.
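I don't have that exact code handy, but the idea looks roughly like this (a minimal, untested sketch assuming BeautifulSoup 4 and the same printable page URL as in your script; first_link is just an illustrative name):

import re
import urllib2
from bs4 import BeautifulSoup

def first_link(title):
    """Return the first /wiki/ link in the article body that is not
    inside parentheses, by stripping parenthesised runs first."""
    url = 'http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % title
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib2.urlopen(req).read()
    soup = BeautifulSoup(html, 'html.parser')
    for p in soup.find_all('p'):
        markup = unicode(p)
        # Drop (non-nested) parenthesised runs before looking for links.
        # Caveat: this naive regex can also eat links whose URL itself
        # contains parentheses, so real code needs to be more careful.
        markup = re.sub(r'\([^()]*\)', '', markup)
        m = re.search(r'href="(/wiki/[^"]*)"', markup)
        if m:
            return m.group(1)  # still needs your "special page" filtering
    return None

With your Human example, that should return "/wiki/Species" rather than "/wiki/Taxonomy", assuming the first real article paragraph is the first <p> on the printable page.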

