python - Reduce Memory Usage in Pandas
I can't figure out a way to reduce my program's memory usage any further. This is the most efficient implementation to date:
import glob
import json

import pandas as pd

columns = ['eventName', 'sessionId', 'eventTime', 'items', 'currentPage', 'browserType']
df = pd.DataFrame(columns=columns)
l = []

for i, file in enumerate(glob.glob("*.log")):
    print("Going through log file #%s named %s..." % (i + 1, file))
    with open(file) as myfile:
        l += [json.loads(line) for line in myfile]
    tempdata = pd.DataFrame(l)
    for column in tempdata.columns:
        if column not in columns:
            try:
                tempdata.drop(column, axis=1, inplace=True)
            except ValueError:
                print("Oh no! We've got a problem with the %s column! It doesn't exist!" % column)
    l = []
    df = df.append(tempdata, ignore_index=True)
    # slow version, but memory efficient
    # length = len(df)
    # length_temp = len(tempdata)
    # for i in range(1, length_temp):
    #     update_progress((i * 100.0) / length_temp)
    #     for column in columns:
    #         df.at[length + i, column] = tempdata.at[i, column]
    tempdata = 0

print("Data frame initialized and filled! Sorting...")
df.sort(columns=["sessionId", "eventTime"], inplace=True)
print("Done sorting... Changing indices...")
df.index = range(1, len(df) + 1)
print("Storing in pickle...")
df.to_pickle('data.pkl')
Basically, I'm reading JSON log files into a pandas DataFrame, but the .append function is what's causing the issue. It creates two different objects in memory, causing huge memory usage. In addition, the .to_pickle method of pandas seems to be a huge memory hog, because the biggest spike in memory happens when writing the pickle. Is there an easy way to reduce memory? The commented-out code does the job but takes 100-1000x longer. I'm at 45% memory usage at max during the .to_pickle part, and at 30% during the reading of the logs. The more logs there are, the higher that number goes. Any help is appreciated.
Best, f
If you need to build up a DataFrame in pieces, it is much more efficient to construct a list of the component frames and combine them all in one step using concat. See the first approach below.
# df = 10 rows of dummy data

In [10]: %%time
    ...: dfs = []
    ...: for _ in xrange(1000):
    ...:     dfs.append(df)
    ...: df_concat = pd.concat(dfs, ignore_index=True)
    ...:
Wall time: 42 ms

In [11]: %%time
    ...: df_append = pd.DataFrame(columns=df.columns)
    ...: for _ in xrange(1000):
    ...:     df_append = df_append.append(df, ignore_index=True)
    ...:
Wall time: 915 ms
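Applied to the code in the question, that means collecting one frame per log file in a list and calling pd.concat once at the end, rather than appending inside the loop. A minimal sketch under that assumption (the names `frames` and `records` are illustrative, not from the original):

import glob
import json

import pandas as pd

columns = ['eventName', 'sessionId', 'eventTime', 'items', 'currentPage', 'browserType']
frames = []  # illustrative name: one DataFrame per log file

for file in glob.glob("*.log"):
    with open(file) as myfile:
        records = [json.loads(line) for line in myfile]
    tempdata = pd.DataFrame(records)
    # keep only the columns of interest; absent columns are simply skipped
    tempdata = tempdata[[c for c in columns if c in tempdata.columns]]
    frames.append(tempdata)

# one concat at the end instead of repeated appends inside the loop
df = pd.concat(frames, ignore_index=True)

The reason this helps is that each append copies the entire accumulated frame into a new object, so appending in a loop does quadratic copying, while a single concat copies each row only once.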