You have selected the right tool in BeautifulSoup. Technically, you could do it all in one script, but you may want to segment the work: it looks like you will be dealing with tens of thousands of emails, each a separate request, and fetching them all will take time.
This page will help you a lot, but here is a small piece of code to get you started. It gets all the 'a' tags that link to the email index pages, extracts their href values, and prepends the site's base URL so the pages can be requested directly.
    from bs4 import BeautifulSoup
    import re
    import urllib2

    soup = BeautifulSoup(urllib2.urlopen("http://www.419scam.org/emails/"))
    tags = soup.find_all(href=re.compile(r"20......../index\.htm"))
    links = []
    for t in tags:
        links.append("http://www.419scam.org/emails/" + t['href'])
're' is Python's regular-expression module. In the find_all call, I told BeautifulSoup to find all tags in the soup whose href attribute matches that regular expression. I chose this regex to capture only the email index pages rather than every link on the page, because I noticed that all of the index links followed this URL pattern.
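As a quick sanity check of that pattern, here is what it accepts and rejects (the sample hrefs below are made up for illustration; the real index paths on the site may differ slightly):

```python
import re

# Same pattern as above: literal "20", eight arbitrary characters
# (a date-stamped directory name), then "/index.htm"
pattern = re.compile(r"20......../index\.htm")

print(bool(pattern.search("2011-01-23/index.htm")))  # True: a date-stamped index page
print(bool(pattern.search("about.htm")))             # False: an unrelated link
```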
With all the 'a' tags in hand, I then looped over them, extracting the string from the href attribute with t['href'] and prepending the rest of the URL, to end up with plain string URLs.
After reading the documentation, you should have an idea of how to extend these methods to fetch the individual emails.
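To sketch that next step: assuming each index page links to individual email pages via relative .htm hrefs (a guess about the site's layout, not something I have verified -- adjust the pattern to match the real markup), you could resolve those hrefs against the index page's folder and then fetch each resulting URL. The helper below is self-contained and works on a plain HTML string:

```python
import re

BASE = "http://www.419scam.org/emails/"

def email_links(index_url, index_html):
    """Extract absolute URLs of individual emails from one index page.

    Assumes email pages are linked as relative .htm hrefs on the index
    page; the pattern is illustrative and may need tweaking.
    """
    # The folder that relative links resolve against
    folder = index_url.rsplit("/", 1)[0] + "/"
    hrefs = re.findall(r'href="([^"]+\.htm)"', index_html)
    # Skip the index page itself; resolve the rest against the folder
    return [folder + h for h in hrefs if h != "index.htm"]

# Illustrative usage with made-up HTML:
html = '<a href="email-001.htm">x</a> <a href="index.htm">self</a>'
urls = email_links(BASE + "2011-01-23/index.htm", html)
# urls == ["http://www.419scam.org/emails/2011-01-23/email-001.htm"]
```

You would then read each URL with urllib2.urlopen(url).read(); given the tens of thousands of pages mentioned above, adding a short time.sleep between requests is a sensible courtesy.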