Given a set of n raw URLs, I would like to classify each one as news, blog, photo, or video.
For example, if a link points to a photo, is it enough to check that the URL ends in an image file extension in order to classify it as a photo?
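To make the extension idea concrete, here is a minimal sketch of the check I have in mind. The extension sets and the function name `classify_by_extension` are my own placeholders for illustration, not taken from any library, and the lists are certainly incomplete.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical extension sets, chosen for illustration only.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"}
VIDEO_EXTENSIONS = {".mp4", ".webm", ".mov", ".avi", ".mkv"}


def classify_by_extension(url):
    """Return 'photo', 'video', or None using only the URL path suffix."""
    # Use the path component only, so query strings and fragments are ignored.
    path = urlparse(url).path
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "photo"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return None  # no recognised extension, so the URL alone tells us nothing
```

This would catch a direct link such as `http://example.com/a/cat.JPG?size=large`, but it returns `None` for photo pages served without an extension, which is why I doubt the extension alone is sufficient.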
For video, blog, and news, a fixed set of known domains (e.g. http://www.youtube.com ) does not seem to be enough to classify raw URLs.
Can the classification be done by examining the content each URL points to? Or are there any open source tools for this?
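As a rough illustration of what I mean by examining the content, the sketch below asks the server for the `Content-Type` header and maps it to a coarse category. The function names and the category labels (`"page"`, `"unknown"`) are assumptions of mine; this only separates photo and video from HTML pages, and telling news from blog would still need some analysis of the page itself.

```python
import urllib.request


def classify_by_content_type(content_type):
    """Map a Content-Type header value to a coarse category."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type.startswith("image/"):
        return "photo"
    if media_type.startswith("video/"):
        return "video"
    if media_type in ("text/html", "application/xhtml+xml"):
        return "page"  # could be news or blog; the header cannot tell
    return "unknown"


def classify_url(url, timeout=10):
    """Issue a HEAD request and classify by the returned Content-Type."""
    # Assumes the server answers HEAD requests; some servers do not.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return classify_by_content_type(response.headers.get("Content-Type", ""))
```

Note that a video hosting page is itself served as HTML, so this check would label it `"page"` rather than video, which is part of what I am unsure how to handle.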