I have a fairly large MySQL table, about 30 million rows and 6 columns, roughly 2 GB when loaded into memory.
I work with both Python and R. In R, loading the table into memory takes about 90 seconds. In Python, it takes about 40 minutes.
I tried this both with SQLAlchemy (using the pymysql driver) and with plain mysql.connector. The code is simple either way. For example, with SQLAlchemy:
    import sqlalchemy
    import pandas as pd

    db_engine = sqlalchemy.create_engine("mysql+pymysql://user:pass@host/database")
    cnx = db_engine.connect()
    table = pd.read_sql('select * from my_table', cnx)
    cnx.close()
Without SQLAlchemy, using mysql.connector directly:
    import mysql.connector
    import pandas as pd

    cnx = mysql.connector.connect(**db_details)
    cursor = cnx.cursor()
    cursor.execute('select * from my_table')
    table = pd.DataFrame(data=list(cursor), columns=cursor.column_names)
    cnx.close()
In either case, it is much, much slower than R, which does not make much sense to me. Why is this, and is there a way to speed it up? A hacky solution is fine too.
To add: pandas is not the problem here. In the second code snippet, if I just return list(cursor) instead of putting it into a DataFrame, it takes about as long.
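Something like the following timing sketch (reusing db_details and the table name from the snippet above) makes the split visible; practically all of the time goes into the fetch, not into building the DataFrame:

    import time
    import mysql.connector
    import pandas as pd

    cnx = mysql.connector.connect(**db_details)
    cursor = cnx.cursor()
    cursor.execute('select * from my_table')

    t0 = time.time()
    rows = list(cursor)  # pulling the rows from the cursor is the slow part
    t1 = time.time()
    table = pd.DataFrame(data=rows, columns=cursor.column_names)
    t2 = time.time()

    print('fetch: %.1f s, DataFrame: %.1f s' % (t1 - t0, t2 - t1))
    cnx.close()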
Edit: The database runs on the same machine as R/Python, so bandwidth should be the same for both.
In R, I use DBI, and the R code that I use is (mostly):
    require(DBI)
    cnx <- dbConnect(dbDriver("MySQL"), dbname="database", username="user", password="pass", host="host")
    table <- dbGetQuery(cnx, "select * from my_table")
******** SOLVED ********
Thanks to the helpful comments, especially from @roganjosh, it seems the problem is that the default MySQL connector is written in pure Python rather than C, which makes it very slow. The solution is to use MySQLdb, which is a native C connector.
In my particular setup, running Python 3 with Anaconda, this was not directly possible because the original MySQLdb only supports Python 2. However, for Python 3 there is a drop-in MySQLdb implementation called mysqlclient.
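mysqlclient installs with pip install mysqlclient and is imported under the name MySQLdb, so the second snippet above only needs the import and the connection call swapped. A sketch, using the same placeholder connection details as before:

    import MySQLdb  # provided by the mysqlclient package on Python 3
    import pandas as pd

    cnx = MySQLdb.connect(host="host", user="user", passwd="pass", db="database")
    cursor = cnx.cursor()
    cursor.execute('select * from my_table')
    # MySQLdb cursors have no column_names attribute, so take the names from description
    table = pd.DataFrame(data=list(cursor), columns=[d[0] for d in cursor.description])
    cnx.close()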
Using this implementation, reading the entire table now takes about 5 minutes. Not as fast as R, but much better than the roughly 40 minutes it took before.
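The same switch works through SQLAlchemy by changing the dialect in the connection URL from mysql+pymysql to mysql+mysqldb; a sketch with the same placeholder credentials:

    import sqlalchemy
    import pandas as pd

    # "mysql+mysqldb" tells SQLAlchemy to use the MySQLdb/mysqlclient driver
    db_engine = sqlalchemy.create_engine("mysql+mysqldb://user:pass@host/database")
    cnx = db_engine.connect()
    table = pd.read_sql('select * from my_table', cnx)
    cnx.close()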