Creating a daemon with Python libtorrent to fetch metadata for 100k+ torrents

I am trying to fetch metadata for 10k+ torrents per day using Python libtorrent.

This is the current flow of the code:

  • Start a libtorrent session.
  • Get the total number of torrents uploaded in the last day for which we need to fetch metadata.
  • Fetch those torrent hashes from the database in chunks.
  • Build a magnet URI from each hash and add it to the session, which creates a handle per magnet URI.
  • Sleep for a second while metadata is being fetched, and keep checking whether the metadata has arrived.
  • If the metadata has arrived, insert it into the database; otherwise check whether we have been waiting for it for about 10 minutes, and if so, remove the handle, i.e. stop looking for that torrent's metadata for now.
  • Do the above indefinitely, and save the session state for future runs.

This is what I have tried so far:

    #!/usr/bin/env python
    # This file runs as a client/daemon and fetches torrent metadata
    # (i.e. .torrent files) from magnet URIs.

    import libtorrent as lt          # libtorrent bindings
    import tempfile                  # temp dirs used as save_path while fetching metadata
    import shutil                    # removing the temp directory trees
    import os
    import os.path
    from time import sleep
    from datetime import date, timedelta
    import MySQLdb                   # DB connectivity

    session = lt.session(lt.fingerprint("UT", 3, 4, 5, 0), flags=0)
    session.listen_on(6881, 6891)
    session.add_extension('ut_metadata')
    session.add_extension('ut_pex')
    session.add_extension('smart_ban')
    session.add_extension('metadata_transfer')

    session_save_filename = "/magnet2torrent/magnet_to_torrent_daemon.save_state"

    if os.path.isfile(session_save_filename):
        with open(session_save_filename, 'rb') as f:
            session.load_state(lt.bdecode(f.read()))
        print('session loaded from file')
    else:
        print('new session started')

    session.add_dht_router("router.utorrent.com", 6881)
    session.add_dht_router("router.bittorrent.com", 6881)
    session.add_dht_router("dht.transmissionbt.com", 6881)
    session.add_dht_router("dht.aelitis.com", 6881)

    session.start_dht()
    session.start_lsd()
    session.start_upnp()
    session.start_natpmp()

    alive = True
    while alive:
        # open the database connection
        db_conn = MySQLdb.connect(host='', user='', passwd='', db='',
                                  unix_socket='/mysql/mysql.sock')

        # select all records where enabled = 0 and uploaded since yesterday
        subset_count = 100
        yesterday = (date.today() - timedelta(1)).strftime('%Y-%m-%d %H:%M:%S')

        total_count = 0
        total_count_query = ("SELECT COUNT(*) AS total_count FROM content "
                             "WHERE upload_date > '" + yesterday + "' AND enabled = '0'")
        try:
            cursor = db_conn.cursor()
            cursor.execute(total_count_query)
            total_count = cursor.fetchone()[0]
            print(total_count)
        except MySQLdb.Error:
            print("Error: unable to select data")

        # round up so the final partial page is not skipped
        total_pages = (total_count + subset_count - 1) // subset_count

        current_page = 1
        while current_page <= total_pages:
            from_count = (current_page - 1) * subset_count

            hashes = []
            get_hashes_query = ("SELECT hash FROM content "
                                "WHERE upload_date > '" + yesterday + "' AND enabled = '0' "
                                "ORDER BY record_num DESC "
                                "LIMIT " + str(from_count) + ", " + str(subset_count))
            try:
                cursor = db_conn.cursor()
                cursor.execute(get_hashes_query)
                for row in cursor.fetchall():
                    hashes.append(row[0].upper())
            except MySQLdb.Error:
                print("Error: unable to select data")

            handles = []
            for info_hash in hashes:
                tempdir = tempfile.mkdtemp()
                add_magnet_uri_params = {
                    'save_path': tempdir,
                    'duplicate_is_error': True,
                    'storage_mode': lt.storage_mode_t(2),
                    'paused': False,
                    'auto_managed': True,
                }
                magnet_uri = ("magnet:?xt=urn:btih:" + info_hash +
                              "&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80"
                              "&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80"
                              "&tr=udp%3A%2F%2Ftracker.ccc.de%3A80")
                handles.append(lt.add_magnet_uri(session, magnet_uri, add_magnet_uri_params))

            while len(handles) != 0:
                # iterate over a copy: removing items from a list while
                # iterating over it skips elements
                for h in list(handles):
                    if h.has_metadata():
                        torinfo = h.get_torrent_info()
                        final_info_hash = str(torinfo.info_hash()).upper()
                        torfile = lt.create_torrent(torinfo)
                        torcontent = lt.bencode(torfile.generate())
                        try:
                            cursor = db_conn.cursor()
                            cursor.execute("INSERT INTO dht_tfiles (hash, tdata) VALUES (%s, %s)",
                                           [final_info_hash, torcontent])
                            db_conn.commit()
                        except MySQLdb.Error as e:
                            try:
                                print("MySQL Error [%d]: %s" % (e.args[0], e.args[1]))
                            except IndexError:
                                print("MySQL Error: %s" % str(e))
                        shutil.rmtree(h.save_path())   # remove the temp data directory
                        session.remove_torrent(h)      # drop the handle from the session
                        handles.remove(h)
                    elif h.status().active_time > 600:
                        # searching for more than 10 minutes (600 s); give up on it
                        shutil.rmtree(h.save_path())
                        session.remove_torrent(h)
                        handles.remove(h)
                sleep(1)

            current_page += 1

            # save session state after each page
            with open(session_save_filename, "wb") as f:
                f.write(lt.bencode(session.save_state()))

        print('sleep60')
        sleep(60)

        # save session state before the next pass
        with open(session_save_filename, "wb") as f:
            f.write(lt.bencode(session.save_state()))

I left the above script running overnight and found that only about 1200 torrents' metadata were fetched in the overnight session, so I am looking for performance improvements to the script.

I even tried decoding the save_state file and noticed that I am connected to 700+ DHT nodes, so it is not as if DHT is not working.

What I plan to do is keep the handles active in the session indefinitely until the metadata is fetched, rather than removing a handle after 10 minutes without metadata, as I do now.
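A minimal sketch of that change, assuming a hypothetical store_metadata() helper that wraps the INSERT-and-cleanup logic from the script above:

    # Sketch only: keep every handle until its metadata arrives, instead of
    # dropping it after 600 seconds of searching.
    while handles:
        for h in list(handles):    # iterate over a copy so removal is safe
            if h.has_metadata():
                store_metadata(h)  # hypothetical helper: DB insert + rmtree
                session.remove_torrent(h)
                handles.remove(h)
        sleep(5)                   # polling once per second gains nothing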

I have a few questions regarding the Python libtorrent bindings:

  • How many handles can I keep running at once? Is there a limit on the number of handles?
  • Will running 10k+ or 100k handles slow my system down, or exhaust resources? If so, which resources? I mean RAM, network?
  • I am behind a firewall; could a blocked inbound port be causing the slow metadata fetch speed?
  • Can a DHT router, e.g. router.bittorrent.com or any other, BAN my IP address for sending too many requests?
  • Can other peers BAN my IP address if they find out that I am making too many requests only to fetch metadata?
  • Can I run multiple instances of this script, or perhaps make it multi-threaded? Would that improve performance?
  • If I run multiple instances of the same script, will each script get a unique node ID depending on the IP and port it uses, and is this a viable solution?

Is there a better approach to achieve what I am trying to do?

python bittorrent magnet-uri libtorrent libtorrent-rasterbar
1 answer

I cannot answer the questions specific to the libtorrent APIs, but some of your questions apply to bittorrent in general.

Will running 10k+ or 100k handles slow my system down, or exhaust resources? If so, which resources? I mean RAM, network?

Metadata downloads should not use many resources, since they are not full torrent downloads, i.e. they should not allocate the actual files or anything like that. But they will need some memory/disk space for the metadata itself as soon as they grab its first piece.
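As a rough order-of-magnitude check (the ~50 KB average .torrent size here is an assumption, not a measured figure): 100,000 torrents × 50 KB ≈ 5 GB of metadata in total, which matters for disk space but stays modest for a daemon that writes each result to the database and removes its temp directory as it goes.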

I am behind a firewall; could a blocked inbound port be causing the slow metadata fetch speed?

Yes. By reducing the number of peers that can establish connections, it becomes harder to fetch metadata (or even establish any connection at all) on swarms with a low peer count.

NAT can cause the same problem.
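With the deprecated pre-1.2 session API that the question uses, one quick sanity check is whether a listen socket was actually opened and whether the port-mapping helpers are running; a sketch (is_listening() is assumed to be available in those bindings):

    # Sketch, using the old-style session API from the question.
    session.listen_on(6881, 6891)
    if not session.is_listening():
        print('no listen socket open; inbound peers cannot reach this client')
    session.start_upnp()    # ask the router to forward the listen port via UPnP
    session.start_natpmp()  # NAT-PMP alternative for routers that support it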

Can a DHT router, e.g. router.bittorrent.com or any other, BAN my IP address for sending too many requests?

router.bittorrent.com is a bootstrap node, not a server as such. Lookups do not query a single node; they query many different ones (out of millions). But yes, individual nodes can ban, or more likely rate-limit, you.

This can be mitigated by looking up randomly distributed IDs to spread the load across the DHT keyspace.
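One way to apply that in the script above is to drip-feed the magnets into the session in small batches rather than adding 100 at once, so no single burst of lookups hits the DHT; a sketch (the batch size, delay, and add_magnet() helper are all assumptions to adapt):

    # Sketch: add magnets in small batches to keep the lookup rate gentle.
    batch_size = 10
    for i in range(0, len(hashes), batch_size):
        for info_hash in hashes[i:i + batch_size]:
            add_magnet(info_hash)  # hypothetical wrapper around lt.add_magnet_uri()
        sleep(5)                   # pause between batches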

Can I run multiple instances of this script, or perhaps make it multi-threaded? Would that improve performance?

AIUI libtorrent is sufficiently non-blocking or multi-threaded that you can schedule many torrents at once.

I do not know whether libtorrent has a rate limit for outgoing DHT requests.
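If you want to experiment with several sessions anyway, here is a rough sketch of sharding the hash list across worker processes, each with its own session and listen port (the worker count and the fetch_metadata() helper are assumptions, not tested guidance):

    # Sketch: one libtorrent session per process, each on a distinct port.
    from multiprocessing import Process
    import libtorrent as lt

    def worker(index, hash_shard):
        s = lt.session()
        s.listen_on(6881 + index, 6881 + index)  # unique listen port per worker
        s.start_dht()
        for info_hash in hash_shard:
            fetch_metadata(s, info_hash)  # hypothetical: add magnet, poll, store

    num_workers = 4
    shards = [hashes[i::num_workers] for i in range(num_workers)]
    procs = [Process(target=worker, args=(i, shard)) for i, shard in enumerate(shards)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()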

If I run multiple instances of the same script, will each script get a unique node ID depending on the IP and port it uses, and is this a viable solution?

If you mean the DHT node ID, it is derived from the IP (according to BEP 42), not the port, although a small random element is thrown in, so only a limited number of distinct IDs can be obtained per IP.
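For illustration, a rough sketch of the BEP 42 rule for IPv4; the crc32c() call is assumed to come from the third-party crc32c package, and the BEP itself is the authoritative reference for the exact bit layout:

    # Sketch of BEP 42 (IPv4): the top 21 bits of the node ID are derived from
    # the external IP plus a 3-bit value r, so one IP yields at most 8 valid
    # ID prefixes. Assumes `pip install crc32c`.
    import os
    import crc32c

    def bep42_node_id(ip, r):
        octets = [int(o) & m for o, m in zip(ip.split('.'), (0x03, 0x0f, 0x3f, 0xff))]
        octets[0] |= (r & 0x07) << 5                 # fold r into the masked IP
        c = crc32c.crc32c(bytes(bytearray(octets)))
        node_id = bytearray(os.urandom(20))          # the rest of the ID is random
        node_id[0] = (c >> 24) & 0xff                # first 21 bits come from the CRC
        node_id[1] = (c >> 16) & 0xff
        node_id[2] = ((c >> 8) & 0xf8) | (node_id[2] & 0x07)
        node_id[19] = r                              # last byte records r
        return bytes(node_id)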

Some of this may also be applicable to your scenario: http://blog.libtorrent.org/2012/01/seeding-a-million-torrents/

Another option is my own DHT implementation, which includes a CLI to bulk-fetch torrents.

