Go to the source URL on social media management websites

I am doing web scraping as part of an academic project, where it is important that all links are followed through to the actual content. Annoyingly, there are some important error cases with "social media management" sites, where users post their links in order to track who clicks on them.

For example, consider this link on linkis.com, which links to http: // + bit.ly + / 1P1xh9J (split up due to SO posting restrictions), which in turn links to http://conservatives4palin.com. The problem arises because the original linkis.com link does not redirect automatically. Instead, the user has to click the cross in the upper right corner to go to the original URL.

In addition, there are several variations (see, for example, linkis.com 2, where the cross is in the lower left corner of the site). These are the only two variations I have found, but there may be more. Note that I am using a web scraper very similar to this one. The functionality for following through to the real link does not need to be stable or keep working over time, as this is a one-time academic project.

How can I automatically go to the source URL? Would the best approach be to write a regex that finds the relevant link?

+6
5 answers

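Both sites follow the same pattern: the target page is displayed inside an iframe.

To get the final URL, read the iframe's src and follow its redirects, something like this: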

import requests
from bs4 import BeautifulSoup

urls = ["http://linkis.com/conservatives4palin.com/uGXam", "http://linkis.com/paper.li/gsoberon/jozY2"]
response_data = []

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # The wrapped page is loaded in an iframe; its src is the shortened link.
    short_url = soup.find("iframe", {"id": "source_site"})['src']
    # requests follows redirects, so .url is the final destination.
    response_data.append(requests.get(short_url).url)

print(response_data)
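If the iframe is missing (for example on a layout variation that has not been seen yet), the lookup above raises an error. A minimal sketch of a more forgiving extraction step, using only the standard library instead of BeautifulSoup; the helper names `IframeSrcFinder` and `find_iframe_src` and the sample HTML are made up for illustration, and the id `source_site` is taken from the code above:

```python
from html.parser import HTMLParser


class IframeSrcFinder(HTMLParser):
    """Records the src of the first <iframe> whose id matches."""

    def __init__(self, iframe_id):
        super().__init__()
        self.iframe_id = iframe_id
        self.src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "iframe" and self.src is None and attrs.get("id") == self.iframe_id:
            self.src = attrs.get("src")


def find_iframe_src(html, iframe_id="source_site"):
    """Return the iframe's src, or None if the page has no such iframe."""
    finder = IframeSrcFinder(iframe_id)
    finder.feed(html)
    return finder.src


# Hypothetical page fragment, for illustration only.
sample = '<html><body><iframe id="source_site" src="http://bit.ly/1P1xh9J"></iframe></body></html>'
print(find_iframe_src(sample))
print(find_iframe_src("<p>no iframe here</p>"))
```

A `None` result can then be logged and skipped rather than stopping the whole scrape.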
+1

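If the site uses JavaScript to build the HTML you get, there are two options:

  • Work out what the JavaScript does and reproduce it yourself.
  • Drive a real browser that runs the JavaScript for you, for example with Selenium.

An example with Selenium: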

from selenium import webdriver
from selenium.webdriver.common.by import By

# Create a Chrome browser instance; other drivers are also available
driver = webdriver.Chrome()

# Request a page
driver.get('http://linkis.com/conservatives4palin.com/uGXam')

# Select elements on the page and trigger events.
# Selenium also supports XPath and CSS selectors.
# Click the element with the given id ('some_id' is a placeholder)
driver.find_element(By.ID, 'some_id').click()
+2

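Alternatively, since the long URL appears in the page's JavaScript, you can pull it out with a regex without parsing the page (the response is streamed, so the download stops as soon as the URL is found):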

try:
    from html import unescape  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape

import re
import requests
from contextlib import closing

CHUNKSIZE = 1024
reurl = re.compile(r'"longUrl":"(.*?)"')
buffer = ""
with closing(requests.get("http://linkis.com/conservatives4palin.com/uGXam", stream=True)) as res:
    # Make sure iter_content yields text even if the server sends no charset
    res.encoding = res.encoding or "utf-8"
    for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
        buffer += chunk
        match = reurl.search(buffer)
        if match:
            # Undo HTML entities and the JavaScript backslash escapes
            print(unescape(match.group(1)).replace('\\', ''))
            break
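The extraction step can also be tried out offline. The sketch below assumes the `longUrl` value is a JSON string literal, so `json.loads` can undo escapes such as `\/` instead of stripping every backslash; that assumption, the helper name `extract_long_url`, and the sample script fragment are mine, not taken from the site:

```python
import json
import re

# Matches "longUrl":"..." and captures the quoted value, allowing escaped characters.
LONG_URL_RE = re.compile(r'"longUrl"\s*:\s*("(?:[^"\\]|\\.)*")')


def extract_long_url(page_text):
    """Return the decoded longUrl value, or None if it is not present."""
    match = LONG_URL_RE.search(page_text)
    if match is None:
        return None
    # json.loads undoes JSON escapes such as \/ and \u0026.
    return json.loads(match.group(1))


# Hypothetical script fragment, for illustration only.
sample = r'var data = {"id":"uGXam","longUrl":"http:\/\/conservatives4palin.com\/2015\/11\/post.html?a=1\u0026b=2"};'
print(extract_long_url(sample))
```

If the real pages turn out not to use JSON-style escaping, the unescape-and-strip approach above is the one to keep.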
+1

Say you can capture the href attribute's value:

s = 'href="/url/go/?url=http%3A%2F%2Fbit.ly%2F1P1xh9J"'

Then you need to do the following:

import urllib.parse
s = s.partition('http')
s = s[1] + urllib.parse.unquote(s[2][0:-1])
s = urllib.parse.unquote(s)

and s will now be the original bit.ly link!
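Slicing on 'http' works for this exact string but would break if the path itself contained "http" or if more query parameters followed. A small sketch of the same idea using the standard library's query-string parser; the helper name `target_from_href` is hypothetical, and it assumes the bare href value (without the surrounding `href="..."`) has already been captured:

```python
from urllib.parse import parse_qs, urlsplit


def target_from_href(href, param="url"):
    """Return the decoded value of the `param` query parameter, or None."""
    query = urlsplit(href).query
    values = parse_qs(query).get(param)
    return values[0] if values else None


print(target_from_href("/url/go/?url=http%3A%2F%2Fbit.ly%2F1P1xh9J"))
```

`parse_qs` percent-decodes the value, so no separate `unquote` call is needed.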

0

Try the following code:

import requests

url = 'http://'+'bit.ly'+'/1P1xh9J'
realsite = requests.get(url)
print(realsite.url)

This prints the desired result:

http://conservatives4palin.com/2015/11/robert-tracinski-the-climate-change-inquisition-begins.html?utm_source=twitterfeed&utm_medium=twitter
-1