python - Reduce Memory Usage in Pandas
I can't figure out a way to reduce my program's memory usage any further. This is the most efficient implementation to date:
import glob
import json

import pandas as pd

columns = ['eventName', 'sessionId', 'eventTime', 'items', 'currentPage', 'browserType']
df = pd.DataFrame(columns=columns)
l = []

for i, file in enumerate(glob.glob("*.log")):
    print("Going through log file #%s named %s..." % (i + 1, file))
    with open(file) as myfile:
        l += [json.loads(line) for line in myfile]
    tempdata = pd.DataFrame(l)
    for column in tempdata.columns:
        if column not in columns:
            try:
                tempdata.drop(column, axis=1, inplace=True)
            except ValueError:
                print("Oh no! We've got a problem with the %s column! It doesn't exist!" % column)
    l = []
    df = df.append(tempdata, ignore_index=True)
    # slow version, but memory efficient
    # length = len(df)
    # length_temp = len(tempdata)
    # for i in range(1, length_temp):
    #     update_progress((i * 100.0) / length_temp)
    #     for column in columns:
    #         df.at[length + i, column] = tempdata.at[i, column]
    tempdata = 0

print("Data frame initialized and filled! Sorting...")
df.sort(columns=["sessionId", "eventTime"], inplace=True)
print("Done sorting... Changing indices...")
df.index = range(1, len(df) + 1)
print("Storing in pickle...")
df.to_pickle('data.pkl')
Basically, I'm reading JSON log files into a pandas DataFrame, but the .append function is what's causing the issue. It creates two different objects in memory, causing huge memory usage. In addition, the .to_pickle method of pandas seems to be a huge memory hog, because the biggest spike in memory happens when writing the pickle. Is there an easy way to reduce memory? The commented-out code does the job but takes 100-1000x longer. I'm at 45% memory usage at max during the .to_pickle part, and at 30% during the reading of the logs. The more logs there are, the higher that number goes. Any help is appreciated.
Best, f
If you need to build up a DataFrame in pieces, it is much more efficient to construct a list of the component frames and combine them all in one step using concat. See the first approach below.
# df = 10 rows of dummy data

In [10]: %%time
    ...: dfs = []
    ...: for _ in xrange(1000):
    ...:     dfs.append(df)
    ...: df_concat = pd.concat(dfs, ignore_index=True)
    ...:
Wall time: 42 ms

In [11]: %%time
    ...: df_append = pd.DataFrame(columns=df.columns)
    ...: for _ in xrange(1000):
    ...:     df_append = df_append.append(df, ignore_index=True)
    ...:
Wall time: 915 ms
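Applied to the code in the question, that means collecting one frame per log file in a list and calling pd.concat once at the end, rather than appending inside the loop. A minimal sketch under that assumption (the names `frames` and `records` are illustrative, not from the original):

import glob
import json

import pandas as pd

columns = ['eventName', 'sessionId', 'eventTime', 'items', 'currentPage', 'browserType']
frames = []  # illustrative name: one DataFrame per log file

for file in glob.glob("*.log"):
    with open(file) as myfile:
        records = [json.loads(line) for line in myfile]
    tempdata = pd.DataFrame(records)
    # keep only the columns of interest; absent columns are simply skipped
    tempdata = tempdata[[c for c in columns if c in tempdata.columns]]
    frames.append(tempdata)

# one concat at the end instead of repeated appends inside the loop
df = pd.concat(frames, ignore_index=True)

The reason this helps is that each append copies the entire accumulated frame into a new object, so appending in a loop does quadratic copying, while a single concat copies each row only once.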