I am doing web scraping as part of an academic project, where it is important that every link is resolved to the actual content it points to. Annoyingly, there are some important edge cases with social media management sites, where users post their links in order to track who clicks on them.
For example, consider this link to linkis.com, which links to http: // + bit.ly + / 1P1xh9J (link split up due to SO posting restrictions), which in turn redirects to http://conservatives4palin.com. The problem occurs because the original linkis.com link does not redirect automatically. Instead, the user must click the cross in the upper right corner to navigate to the original URL.
In addition, there are several variants of the page (see, for example, linkis.com 2, where the cross is in the lower left corner of the site). These are the only two variants I have found, but there may be more. Please note that I use a web scraper very similar to this one. The functionality for navigating to the real link does not have to be stable or keep working over time, as this is a one-time academic project.
How can I automatically navigate to the source URL? Would it be better to develop a regex that finds the appropriate link?
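To make the idea concrete, here is a rough sketch of the kind of thing I have in mind, using an HTML parser rather than a regex. It assumes (I have not verified this against the real linkis.com markup) that the cross is an ordinary `<a>` element whose `href` points to a host other than linkis.com. The names `ExternalLinkFinder`, `find_external_links`, `resolve_redirects` and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import urlopen


class ExternalLinkFinder(HTMLParser):
    """Collects hrefs of <a> tags that point away from the wrapper site."""

    def __init__(self, wrapper_host):
        super().__init__()
        self.wrapper_host = wrapper_host.lower()
        self.external_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        host = urlparse(href).netloc.lower()
        # Relative links have no host; skip them and the wrapper's own links.
        is_wrapper = host == self.wrapper_host or host.endswith("." + self.wrapper_host)
        if host and not is_wrapper:
            self.external_links.append(href)


def find_external_links(html, wrapper_host="linkis.com"):
    """Return all absolute links in `html` that leave `wrapper_host`."""
    finder = ExternalLinkFinder(wrapper_host)
    finder.feed(html)
    return finder.external_links


def resolve_redirects(url, timeout=10):
    """Follow HTTP redirects (e.g. a URL shortener) and return the final URL."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


# Hypothetical wrapper page; the real markup may look quite different.
sample_html = """
<div class="bar">
  <a href="/about">About</a>
  <a href="http://linkis.com/signup">Sign up</a>
  <a class="close" href="http://example.com/article">x</a>
</div>
"""

if __name__ == "__main__":
    print(find_external_links(sample_html))
```

Each candidate returned by `find_external_links` could then be passed to `resolve_redirects` to follow the shortener to the final page. If the wrapper builds the link with JavaScript instead of plain HTML, this would not find it, and something like a regex over the page source or a headless browser would probably be needed.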