Loading a MySQL table in Python takes a long time compared to R

I have a rather large MySQL table: about 30M rows, 6 columns, about 2 GB when loaded into memory.

I work with both Python and R. In R, I can load the table into memory in about 90 seconds. But in Python, it takes 40 minutes.

I tried this both with SQLAlchemy and with plain mysql.connector. The code is simple; for example, with SQLAlchemy:

    import sqlalchemy
    import pandas as pd

    db_engine = sqlalchemy.create_engine("mysql+pymysql://user:pass@host/database")
    cnx = db_engine.connect()
    table = pd.read_sql('select * from my_table', cnx)
    cnx.close()

Without SQLAlchemy, using mysql.connector directly:

    import mysql.connector
    import pandas as pd

    cnx = mysql.connector.connect(**db_details)
    cursor = cnx.cursor()
    cursor.execute('select * from my_table')
    table = pd.DataFrame(data=list(cursor), columns=cursor.column_names)
    cnx.close()

Either way, it is much, much slower than R, which does not make sense to me. Why is this, and is there a way to speed it up? Even hacky solutions are welcome.

To add, pandas is not what makes this slow. In the second code snippet, if I just return list(cursor) instead of building a DataFrame from it, it takes roughly the same amount of time.
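Something along these lines can be used to time the two steps separately (a sketch, reusing the db_details placeholder from above):

    import time
    import mysql.connector
    import pandas as pd

    cnx = mysql.connector.connect(**db_details)  # same placeholder dict as above
    cursor = cnx.cursor()

    start = time.time()
    cursor.execute('select * from my_table')
    rows = list(cursor)  # materialize the full result set
    print('fetch took %.1f s' % (time.time() - start))

    start = time.time()
    table = pd.DataFrame(data=rows, columns=cursor.column_names)
    print('DataFrame construction took %.1f s' % (time.time() - start))

    cnx.close()

The fetch step dominates the runtime; the DataFrame construction is comparatively cheap.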

EDIT: The database runs on the same machine as R / Python, so everything should be the same in terms of bandwidth.

In R, I use DBI, and my R code is (roughly):

    require(DBI)
    cnx <- dbConnect(dbDriver("MySQL"), dbname="database", username="user", password="pass", host="host")
    table <- dbGetQuery(cnx, "select * from my_table")

******** SOLVED ********

Thanks to the helpful comments, especially from @roganjosh, it appears the problem is that the default MySQL connectors are written in Python rather than C, which makes them very slow. The solution is to use MySQLdb, which is a native C connector.

In my particular setup, running Python 3 with Anaconda, this was not directly possible because MySQLdb only supports Python 2. However, for Python 3 there is a MySQLdb implementation called mysqlclient.

With this implementation, reading the entire table now takes about 5 minutes. Not as fast as R, but much better than the 40 or so minutes it took before.

python mysql r
2 answers

Thanks to the helpful comments, especially from @roganjosh, it appears the problem is that the default MySQL connectors are written in Python rather than C, which makes them very slow. The solution is to use MySQLdb, which is a native C connector.

In my particular setup, running Python 3 with Anaconda, this was not directly possible because MySQLdb only supports Python 2. However, for Python 3 there is a MySQLdb implementation called mysqlclient.

With this implementation, reading the entire table now takes about 5 minutes. Not as fast as R, but much better than the 40 or so minutes it took before.
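For completeness, the change itself is small. A minimal sketch, reusing the placeholder credentials from the question, and assuming mysqlclient is installed (e.g. via pip install mysqlclient); it installs itself as the MySQLdb module, which SQLAlchemy's mysql+mysqldb dialect uses:

    import sqlalchemy
    import pandas as pd

    # "mysql+mysqldb" routes through the native C connector instead of pymysql
    db_engine = sqlalchemy.create_engine("mysql+mysqldb://user:pass@host/database")
    table = pd.read_sql('select * from my_table', db_engine)

pd.read_sql accepts the engine directly, so the explicit connect/close from the question is optional here.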

I am still open to suggestions to make this faster, but I suspect it is about as good as it gets.


There is also ultramysql, a pure C/C++ MySQL driver, which you can use through the umysqldb adapter. Both projects are inactive, but they can be useful for a one-off job; I would not use them in production, though.

Since pymysql is a pure-Python driver, you can also try running it on PyPy.

