Python offers many different ways to fetch the data needed for processing.
Although the majority of examples you can find around use CSV file loads, fetching data from a database is still the most common approach in practice.
For all tests I’ll use the SALES table from the Oracle SH sample schema.
1.
In this example, a classic iteration over a cx_Oracle cursor result set will be used:
import cx_Oracle
import os
import datetime
import pandas as pd

os.environ["ORACLE_HOME"] = "/path/to/oracle/home"
con = cx_Oracle.connect(user='sh', password='xxx', dsn='(DESCRIPTION = (ADDRESS_LIST = (ADDRESS = (PROTOCOL = tcp)(HOST = localhost)(PORT = 9999))) (CONNECT_DATA = (SID = orcl)))')
sql = 'select * from sales'
cur = con.cursor()
cur.arraysize = 1000  # fetch rows from the server in batches of 1000
counter = 0
beginTime = datetime.datetime.now()
try:
    cur.execute(sql)
    for row in cur:  # iterate over the result set row by row
        counter += 1
except cx_Oracle.DatabaseError as ora_exception:
    error, = ora_exception.args
    print("Oracle Error Code: ", error.code)
    print("Oracle Error Message: ", error.message)
print('elapsed_time = ', datetime.datetime.now() - beginTime)
print(counter)
cur.close()
con.close()
elapsed_time = 0:00:01.591582
918843
Elapsed time for processing approximately 1 million records (918,843) is 1.59 seconds.
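All of the measurements below follow the same pattern as above (take a timestamp, run the fetch, print the difference). As a side note, that boilerplate could be wrapped in a small helper; here is a minimal sketch of such a context manager (my addition, not part of the measured code):
import datetime
from contextlib import contextmanager

@contextmanager
def timed(label='elapsed_time'):
    # Print the wall-clock time spent inside the "with" block.
    begin = datetime.datetime.now()
    try:
        yield
    finally:
        print(label, '=', datetime.datetime.now() - begin)

# Usage, equivalent to the manual beginTime/print pair above:
# with timed():
#     cur.execute(sql)
#     counter = sum(1 for _ in cur)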
2.
In this example, instead of iterating over the result set, the fetchall() method will be used.
This is the part of the code that I’ve changed:
r = cur.execute(sql).fetchall()  # fetch the whole result set into a list in one call
counter = len(r)
elapsed_time = 0:00:01.588170
918843
Elapsed time is almost identical to the first case.
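A related variant that sits between row-by-row iteration and fetchall() is fetchmany(), which returns the result set in fixed-size batches and keeps memory use bounded. I haven’t timed it here, but a minimal sketch (same cur and sql as above) would look like this:
# Sketch only: batch fetching with fetchmany(); not part of the measured cases.
cur.execute(sql)
counter = 0
while True:
    rows = cur.fetchmany(1000)  # at most 1000 rows per round trip
    if not rows:
        break
    counter += len(rows)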
3.
In the third example I’ll use Pandas instead of a cursor. There are several variations, labeled as subcases a) – c).
a) Getting the number of records by using the PROD_ID column name:
df_sales = pd.read_sql(sql, con)  # load the full result set into a DataFrame
counter = df_sales['PROD_ID'].size  # row count via a single column
elapsed_time = 0:00:04.065327
918843
Elapsed time to process the same amount of data is 4.06 sec, which is about 2.5 times slower than the cursor approach (cases 1 & 2).
b) Several other ways to get the number of records, all running at roughly the same speed as case 3a:
# Uncomment one of the following options:
#counter = len(df_sales.index)
#counter = df_sales.shape[0]
#counter = len(df_sales.axes[0])
c) Here is the slowest way to get the number of records, as I’m using the values instead of the index:
counter = len(df_sales.values)
elapsed_time = 0:00:06.027399
918843
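The extra cost is plausible because df.values has to assemble a single NumPy array out of all the DataFrame columns; with mixed column types the result is an object array, which means an extra conversion and copy. A small illustration with hypothetical data (not the SALES table):
df = pd.DataFrame({'PROD_ID': [13, 14],
                   'TIME_ID': pd.to_datetime(['1998-01-01', '1998-01-02'])})
print(len(df.index))    # cheap: just the length of the index
print(len(df.values))   # slower: builds the combined array first
print(df.values.dtype)  # object, because the column types are mixed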
4.
Pandas case – selecting only the first column (instead of all columns) from the SALES table
sql = 'select prod_id from sales'
df_sales = pd.read_sql(sql, con)
counter = df_sales['PROD_ID'].size
elapsed_time = 0:00:03.744749
918843
Elapsed time is 3.74 sec, which is the best result I could get by using the Pandas framework for client-side processing.
5.
Leaving the SQL engine to do its job of counting the records – example with a cursor:
sql = 'select count(*) as cnt from sales'
cur = con.cursor()
cur.execute(sql)
counter = cur.fetchone()[0]  # single row, single column: the count
elapsed_time = 0:00:00.004216
918843
6.
Leaving the SQL engine to do its job of counting the records – example with Pandas:
sql = 'select count(*) as cnt from sales'
df_sales = pd.read_sql(sql, con)  # single-row, single-column DataFrame
a)
counter = df_sales.iloc[0]['CNT']
elapsed_time = 0:00:00.013347
918843
b)
counter = df_sales.iat[0, 0]
elapsed_time = 0:00:00.009238
918843
c)
counter = df_sales.at[0, 'CNT']
elapsed_time = 0:00:00.002658
918843
Summary:
When you need to process data from a database, the best approach is to let the database SQL engine do the job.
Results with the SQL engine (cases 5 & 6) are many times (600 times or more) faster than the procedural approaches.
That is expected (relational databases are built for exactly this purpose), yet you can still very often find code where client-side processing does the database engine’s job, which is the wrong approach.
Another interesting point is that the Pandas package is about 2.5 times slower than a cursor (result set) approach, so in cases where a cursor covers all the needed functionality, you shouldn’t reach for Pandas.
If you do need Pandas, the best you can do is leave all heavy processing to the SQL database engine, and fetch and manipulate only the data volume you actually need.
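To make that recommendation concrete, here is a sketch contrasting the two styles (assuming the same SH schema and connection as above; AMOUNT_SOLD is a column of the SALES table). Only the second variant ships a small, already-aggregated result over the network:
# Client-side: fetch ~1M rows, then aggregate in Pandas (slow).
df = pd.read_sql('select prod_id, amount_sold from sales', con)
totals = df.groupby('PROD_ID')['AMOUNT_SOLD'].sum()

# Server-side: let the SQL engine aggregate, fetch only the result (fast).
totals = pd.read_sql('select prod_id, sum(amount_sold) as total_sold from sales group by prod_id', con)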