Write a Python script that recursively goes through the links on a page

I am doing a project for my school in which I would like to compare fraudulent emails. I found this site: http://www.419scam.org/emails/. What I would like to do is save each scam in a separate document so that I can analyze them afterwards. Here is my code:

    import BeautifulSoup, urllib2

    address = 'http://www.419scam.org/emails/'
    html = urllib2.urlopen(address).read()

    f = open('test.txt', 'wb')
    f.write(html)
    f.close()

This saves the entire HTML page to a text file. Now I would like to follow the fraud links in that HTML and save the contents of each linked page:

 <a href="2011-12/01/index.htm">01</a> <a href="2011-12/02/index.htm">02</a> <a href="2011-12/03/index.htm">03</a> 

and so on.

Once I have these, I still need to go one level further and open and save the pages behind each of those hrefs. Any idea how to do all of this in one Python script?

Thanks!

+4
5 answers

You have chosen the right tool in BeautifulSoup. Technically you could do it all in one script, but you may want to segment the work, because it looks like you will be dealing with tens of thousands of emails, each of which is a separate request, and that will take time.

The BeautifulSoup documentation will help you a lot, but here is a small piece of code to get you started. It finds all the <a> tags that link to the email index pages, extracts their href values, and prepends the base URL so the pages can be accessed directly.

    from bs4 import BeautifulSoup
    import re
    import urllib2

    soup = BeautifulSoup(urllib2.urlopen("http://www.419scam.org/emails/"))
    tags = soup.find_all(href=re.compile("20......../index\.htm"))
    links = []
    for t in tags:
        links.append("http://www.419scam.org/emails/" + t['href'])

're' is Python's regular-expression module. In the find_all call I told BeautifulSoup to find all tags in the soup whose href attribute matches that regular expression. I chose this regex so that it matches only the email index pages rather than every href link on the page; I noticed that the links on the index page all follow this pattern in their URLs.

With all the 'a' tags in hand, I then looped over them, extracting the string from the href attribute with t['href'] and prepending the rest of the URL to get plain string URLs.

After reading this documentation, you should get an idea of how to extend these methods to capture the individual emails.
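As a rough sketch of that extension, continuing from the snippet above: the regex for the per-email links, the 'emails' output directory, and the filename scheme are my assumptions and should be checked against the actual markup of the index pages.

    # Sketch only: assumes each index page links to the individual emails
    # with relative hrefs ending in .htm; verify this against the real pages.
    import os

    if not os.path.isdir('emails'):
        os.makedirs('emails')

    for index_url in links:
        index_soup = BeautifulSoup(urllib2.urlopen(index_url))
        base = index_url.rsplit('/', 1)[0] + '/'
        for a in index_soup.find_all('a', href=re.compile(r"\.htm$")):
            email_url = base + a['href']
            # build a flat filename like 2011-12-01-some-email.htm (assumed scheme)
            name = email_url.replace("http://www.419scam.org/emails/", "").replace('/', '-')
            out = open(os.path.join('emails', name), 'wb')
            out.write(urllib2.urlopen(email_url).read())
            out.close()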

+5

You may also find value in requests and lxml.html. requests is another way to make HTTP requests, and lxml is an alternative for parsing XML and HTML content.

There are many ways to search an HTML document, but you can start with cssselect.

    import requests
    from lxml.html import fromstring

    url = 'http://www.419scam.org/emails/'
    doc = fromstring(requests.get(url).content)

    atags = doc.cssselect('a')
    # using .get('href', '') syntax because not all a tags will have an href
    hrefs = (a.attrib.get('href', '') for a in atags)

Or, as suggested in the comments, you can use .iterlinks(). Note that you still have to filter if you only want the 'a' tags. Either way, calling .make_links_absolute() is likely to be useful. Since this is your homework, play around with it.

    doc.make_links_absolute(base_url=url)
    hrefs = (l[2] for l in doc.iterlinks() if l[0].tag == 'a')

The rest is up to you ... how to walk through and open all the individual scam links.
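A minimal sketch of that step, continuing from the snippets above: the href filters, the 'emails' directory, and the file naming below are assumptions you would need to adjust to the site's real structure.

    # Sketch only: the href filters and output names are assumptions.
    import os

    if not os.path.isdir('emails'):
        os.makedirs('emails')

    doc.make_links_absolute(base_url=url)
    index_urls = [h for h in (l[2] for l in doc.iterlinks() if l[0].tag == 'a')
                  if h.endswith('/index.htm')]

    for index_url in index_urls:
        index_doc = fromstring(requests.get(index_url).content)
        index_doc.make_links_absolute(base_url=index_url)
        for a in index_doc.cssselect('a'):
            href = a.attrib.get('href', '')
            if href.endswith('.htm') and not href.endswith('/index.htm'):
                name = href.replace(url, '').replace('/', '-')
                with open(os.path.join('emails', name), 'wb') as f:
                    f.write(requests.get(href).content)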

+3

To get all the links on the page, you can use BeautifulSoup. Check out this page; it may help, as it describes how to do exactly what you need.

To save all the pages, you can do the same as in your current code, but inside a loop that iterates over all the links you have extracted and stored, say, in a list; a small sketch of such a loop follows below.
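Here is a minimal sketch of that loop, assuming the extracted links are already absolute URLs collected in a list called scam_links (the variable name and output filenames are placeholders of mine):

    # Sketch only: scam_links and the output filenames are placeholders.
    import urllib2

    scam_links = [
        'http://www.419scam.org/emails/2011-12/01/index.htm',
        # ... the rest of the links extracted with BeautifulSoup
    ]

    for i, link in enumerate(scam_links):
        html = urllib2.urlopen(link).read()
        f = open('scam_%04d.txt' % i, 'wb')
        f.write(html)
        f.close()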

+2

You can also use the standard library's HTML parser and specify the type of tag you are looking for.

    from HTMLParser import HTMLParser
    import urllib2

    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for attr in attrs:
                    if attr[0] == 'href':
                        print attr[1]

    address = 'http://www.419scam.org/emails/'
    html = urllib2.urlopen(address).read()

    f = open('test.txt', 'wb')
    f.write(html)
    f.close()

    parser = MyHTMLParser()
    parser.feed(html)
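If you would rather collect the hrefs for downloading later instead of printing them, a small variation of the same parser could look like this (the LinkCollector class and its links attribute are my additions, not part of the original answer):

    # Sketch: collects hrefs into a list instead of printing them.
    from HTMLParser import HTMLParser
    import urllib2

    class LinkCollector(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href':
                        self.links.append(value)

    address = 'http://www.419scam.org/emails/'
    html = urllib2.urlopen(address).read()

    collector = LinkCollector()
    collector.feed(html)
    print collector.links  # relative hrefs; prepend the base URL before fetching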
+2

Here is a solution using lxml + XPath and urllib2:

    #!/usr/bin/env python2 -u
    # -*- coding: utf8 -*-
    import cookielib, urllib2
    from lxml import etree

    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    page = opener.open("http://www.419scam.org/emails/")
    page.addheaders = [('User-agent', 'Mozilla/5.0')]
    reddit = etree.HTML(page.read())

    # XPath expression: we get all links under body/p[2] containing *.htm
    for node in reddit.xpath('/html/body/p[2]/a[contains(@href,".htm")]'):
        for i in node.items():
            url = 'http://www.419scam.org/emails/' + i[1]
            page = opener.open(url)
            page.addheaders = [('User-agent', 'Mozilla/5.0')]
            lst = url.split('/')
            try:
                if lst[6]:  # else it is a "month" link
                    filename = '/tmp/' + url.split('/')[4] + '-' + url.split('/')[5]
                    f = open(filename, 'w')
                    f.write(page.read())
                    f.close()
            except:
                pass

    # vim:ts=4:sw=4
+2
