I am interested in the theory that if you go to a random Wikipedia article and repeatedly click the first link that is not inside parentheses, in 95% of cases you will end up at the Philosophy article.
I wanted to write a Python script that follows the links for me and, at the end, prints a nice list of the articles that were visited (linkA -> linkB -> linkC, etc.).
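To make the goal concrete, here is a minimal sketch of the link-following loop on its own, with no network access. The names `follow_chain`, `format_chain` and `toy_links` are hypothetical, and the toy mapping is invented for illustration; it is not real Wikipedia data. The function that finds the first valid link is passed in as a parameter, so the loop can be tried out before the HTML parsing works.

```python
def follow_chain(start, first_link, target="Philosophy", max_steps=100):
    """Follow first links from `start` until `target`, a loop, or a dead end.

    `first_link` is any callable that maps an article title to the next
    title, or returns None when no valid link exists.
    """
    visited = [start]
    current = start
    while current != target and len(visited) <= max_steps:
        nxt = first_link(current)
        # Stop on a dead end or when an article repeats (a cycle).
        if nxt is None or nxt in visited:
            break
        visited.append(nxt)
        current = nxt
    return visited


def format_chain(visited):
    """Render the visited articles as 'linkA -> linkB -> linkC'."""
    return " -> ".join(visited)


# Toy stand-in for Wikipedia (invented data): title -> first valid link.
toy_links = {"Human": "Species", "Species": "Biology", "Biology": "Philosophy"}
chain = follow_chain("Human", toy_links.get)
print(format_chain(chain))  # Human -> Species -> Biology -> Philosophy
```

In a real run, `first_link` would download the article and return its first valid link.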
I managed to get the HTML DOM of the web pages, and managed to strip out some unnecessary links and the top description bar that leads to disambiguation pages. So far I have concluded that:
- The DOM begins with the table that you see on the right on some pages, for example in Human. We want to ignore these links.
- Valid link elements have a <p> element somewhere as their ancestor (most often a parent or grandparent, if the link is inside a <b> tag or similar). The top bar that leads to disambiguation pages does not contain any <p> elements.
- Invalid links contain some special words followed by a colon, for example Wikipedia:
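The three observations above can be sketched with the standard library's `html.parser`, written here for Python 3 (an assumption; the posted code targets Python 2). The class name `ParagraphLinkCollector`, the helper `is_article_href` and the sample HTML are hypothetical. The marker list mirrors the substrings checked in the posted `validWikiArticleLinkString`. Real Wikipedia markup is messier than this fragment, so treat it as an illustration rather than a finished filter.

```python
from html.parser import HTMLParser

# Substrings that mark a link as "special" (same list as the posted code).
EXCLUDED_MARKERS = ("(disambiguation)", "File:", "Wikipedia:", "Portal:",
                    "Special:", "Help:", "Template_talk:", "Template:",
                    "Talk:", "Category:", "Bibcode", "Main_Page")


def is_article_href(href):
    """True if href starts with /wiki/ and contains no excluded marker."""
    return (href.startswith("/wiki/")
            and not any(marker in href for marker in EXCLUDED_MARKERS))


class ParagraphLinkCollector(HTMLParser):
    """Collect article links that sit inside a <p> and outside any <table>."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.table_depth = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "p":
            self.in_p = True
        elif tag == "a" and self.in_p and self.table_depth == 0:
            href = dict(attrs).get("href") or ""
            if is_article_href(href):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1
        elif tag == "p":
            self.in_p = False


# Invented fragment: an infobox table, a top bar, then a body paragraph.
sample = (
    '<table><tr><td><p><a href="/wiki/Infobox_link">x</a></p></td></tr></table>'
    '<div><a href="/wiki/Human_(disambiguation)">other uses</a></div>'
    '<p><b><a href="/wiki/Human">Humans</a></b> are '
    '<a href="/wiki/Wikipedia:About">meta</a> '
    '<a href="/wiki/Species">species</a>.</p>'
)
collector = ParagraphLinkCollector()
collector.feed(sample)
print(collector.links)  # ['/wiki/Human', '/wiki/Species']
```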
So far, so good. But it is the parentheses that trip me up. In the article on Human, for example, the first link not inside parentheses is "/wiki/Species", but the script finds "/wiki/Taxonomy", which is inside them.
I have no idea how to do this programmatically, since I have to look for text in some combination of parent/child nodes, which may not always be the same. Any ideas?
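One possible way to avoid reasoning about parent/child combinations is to walk the paragraph in document order and keep a running count of open parentheses seen in text so far. A link is accepted only when that count is zero. The sketch below assumes Python 3 and the standard `html.parser`. The class name `FirstLinkOutsideParens` and the sample paragraph are hypothetical, and the sample only imitates the Human article's opening sentence. Parentheses that appear inside an `href` attribute are not counted, because only text nodes reach `handle_data`.

```python
from html.parser import HTMLParser


class FirstLinkOutsideParens(HTMLParser):
    """Find the first /wiki/ link in a <p> that is not inside parentheses."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paren_depth = 0
        self.first_link = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paren_depth = 0  # each paragraph starts balanced
        elif (tag == "a" and self.in_p and self.paren_depth == 0
              and self.first_link is None):
            href = dict(attrs).get("href") or ""
            if href.startswith("/wiki/"):
                self.first_link = href

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Text nodes arrive in document order, whatever element they sit in.
        if not self.in_p:
            return
        for ch in data:
            if ch == "(":
                self.paren_depth += 1
            elif ch == ")" and self.paren_depth > 0:
                self.paren_depth -= 1


# Invented paragraph shaped like the Human example.
sample = (
    '<p>Humans (<a href="/wiki/Taxonomy">taxonomically</a> '
    '<i>Homo sapiens</i>) are a <a href="/wiki/Species">species</a>.</p>'
)
parser = FirstLinkOutsideParens()
parser.feed(sample)
print(parser.first_link)  # /wiki/Species
```

Counting depth rather than using a single flag keeps nested parentheses from ending the skipped region too early.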
My code can be seen below, but it is something I put together very quickly and am not very proud of. It is commented, however, so you can see my line of thought (hopefully :)).
    """Wikipedia fun"""
    import urllib2
    from xml.dom.minidom import parseString
    import time

    def validWikiArticleLinkString(href):
        """ Takes a string and returns True if it contains the substring
            '/wiki/' in the beginning and does not contain any of the
            "special" wiki pages.
        """
        return (href.find("/wiki/") == 0
                and href.find("(disambiguation)") == -1
                and href.find("File:") == -1
                and href.find("Wikipedia:") == -1
                and href.find("Portal:") == -1
                and href.find("Special:") == -1
                and href.find("Help:") == -1
                and href.find("Template_talk:") == -1
                and href.find("Template:") == -1
                and href.find("Talk:") == -1
                and href.find("Category:") == -1
                and href.find("Bibcode") == -1
                and href.find("Main_Page") == -1)

    if __name__ == "__main__":
        visited = []
pg-robban