Classify Content by URL

Given n raw URLs, I would like to classify each one as news, blog, photo, or video.

For example, if a link leads to a photo, is the presence of an image file extension in the URL enough to classify it as a photo?

For video, blog, and news, it seems that matching against a set of known domains (e.g. http://www.youtube.com) is not enough to classify raw URLs.

Can classification be done by examining the content of the page? Or are there any open source tools for this?

1 answer

The only URLs that can be classified even somewhat reliably are those that point directly to a standalone media file (e.g. http://foo.com/foo.jpg, which is certainly an image). Otherwise, you need to analyze the content of the page.
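As a rough illustration of the extension check described above, here is a minimal Python sketch. The function name `classify_by_extension` and the two extension sets are assumptions made for this example, not something the answer specifies. The function returns `None` whenever the URL alone is inconclusive, which is the case where the page content would have to be analyzed instead.

```python
import posixpath
from urllib.parse import urlparse

# Illustrative extension sets (an assumption for this sketch; extend as needed).
PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"}
VIDEO_EXTENSIONS = {".mp4", ".webm", ".mov", ".avi", ".mkv", ".flv"}


def classify_by_extension(url):
    """Return 'photo', 'video', or None when the URL alone is inconclusive."""
    # Use only the path component so query strings and fragments are ignored.
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    if ext in PHOTO_EXTENSIONS:
        return "photo"
    if ext in VIDEO_EXTENSIONS:
        return "video"
    # News, blog, and everything else cannot be decided from the extension.
    return None


for url in (
    "http://foo.com/foo.jpg",
    "http://foo.com/clip.MP4?t=10",
    "http://www.youtube.com/watch?v=abc",
):
    print(url, "->", classify_by_extension(url))
```

Note that a YouTube watch URL yields `None` here: it has no file extension, so the URL by itself says nothing about the page being a video.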

, Flash , , - Flash-. , , (Google !), - , . - , (ROI). , ClueWeb09 - , - .

" ".
