R Function Slow, Looking to Increase Speed/Performance


I've built a prediction function in R, but when I run it, it's very slow, and I'm using a sample of only 1% of the data I'll be using in production. The function is intended to predict the next word given a series of n-grams (two-word, three-word, or four-word combinations created from a corpus).

I pass the words to the function, for example "i can", along with a series of three-word combinations. The output is ranked in decreasing order of count, for example "i can read", with a count of 4.

Here is the two-word n-gram that is passed as a matrix, with its dimensions and the example data at position 100.

    > dim(bigram_index)
    [1] 46201     3
    > bigram_index[,1][100]
    [1] "abandon"
    > bigram_index[,2][100]
    [1] "contemporary"
    > bigram_index[,3][100]
    [1] "1"

Here is the prediction function:

    predict.next.word <- function(word, ng_matrix) {
      ngram_df <- data.frame(predicted = character(), count = numeric(),
                             stringsAsFactors = FALSE)
      # Row count of the matrix that was passed in (not the global bigram_index)
      col_ng_matrix <- nrow(ng_matrix)
      if (ncol(ng_matrix) == 3) {
        # Two-word n-grams: match on the first word, predict the second
        for (i in 1:col_ng_matrix) {
          first_word <- ng_matrix[,1][i]
          second_word <- ng_matrix[,2][i]
          count_word <- ng_matrix[,3][i]
          if (word[1] == first_word && !is.na(first_word)) {
            matched_factor <- structure(c(second_word, count_word), .Names = c("predicted", "count"))
            ngram_df[i,] <- as.list(matched_factor)
          }
        }
      } else if (ncol(ng_matrix) == 4) {
        # Three-word n-grams: match on the first two words, predict the third
        for (i in 1:col_ng_matrix) {
          first_word <- ng_matrix[,1][i]
          second_word <- ng_matrix[,2][i]
          third_word <- ng_matrix[,3][i]
          count_word <- ng_matrix[,4][i]
          if (word[1] == first_word && !is.na(first_word) && word[2] == second_word && !is.na(second_word)) {
            matched_factor <- structure(c(third_word, count_word), .Names = c("predicted", "count"))
            ngram_df[i,] <- as.list(matched_factor)
          }
        }
      } else if (ncol(ng_matrix) == 5) {
        # Four-word n-grams: match on the first three words, predict the fourth
        for (i in 1:col_ng_matrix) {
          first_word <- ng_matrix[,1][i]
          second_word <- ng_matrix[,2][i]
          third_word <- ng_matrix[,3][i]
          fourth_word <- ng_matrix[,4][i]
          count_word <- ng_matrix[,5][i]
          if (word[1] == first_word && !is.na(first_word) && word[2] == second_word
              && !is.na(second_word) && word[3] == third_word && !is.na(third_word)) {
            # This assignment was missing from the original branch
            matched_factor <- structure(c(fourth_word, count_word), .Names = c("predicted", "count"))
            ngram_df[i,] <- as.list(matched_factor)
          }
        }
      }
      ngram_df <- transform(ngram_df, count = as.numeric(count))
      return(ngram_df[order(ngram_df$count, decreasing = TRUE),])
    }

Using the smallest n-gram (only two-word), here are the timing results:

    > system.time(predict.next.word(c("abandon"), bigram_index))
       user  system elapsed
     92.125  59.395 152.149

Again, the n-gram matrix passed here is only 1% of the production data, and with the three- and four-word n-grams it takes even longer. Please provide any insight on how to improve this function's speed.

Instead of looping through the matrix yourself, consider writing a function that performs the key actions of your for() loop, and then use apply() (with MARGIN=2 for columns, 1 for rows; I think you'll be using the latter) to apply that function to each row or column (with the FUN= argument set equal to your function). Depending on the output format, apply might not be suitable. At that point, look at the plyr package, dplyr, or, my favorite (but with more of a learning curve than dplyr), the data.table package.
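As a rough illustration of the apply() idea, here is a minimal sketch. The toy `trigram_index` matrix and the helper name `row_matches` are made up for this example and are not from the question; the real matrix may need extra handling.

```r
# Hypothetical toy data shaped like the three-word matrix in the question:
# columns are word 1, word 2, word 3, count (all stored as character).
trigram_index <- matrix(
  c("i",   "can",  "read", "4",
    "i",   "can",  "see",  "2",
    "i",   "will", "go",   "7",
    "you", "can",  "read", "1"),
  ncol = 4, byrow = TRUE
)

# The "key action" of the loop body: does one row start with the given words?
# isTRUE() turns an NA comparison result into FALSE.
row_matches <- function(row, word) {
  isTRUE(all(row[seq_along(word)] == word))
}

# MARGIN = 1 applies the function to each row; word = ... is passed through.
hits <- apply(trigram_index, 1, row_matches, word = c("i", "can"))

# Build the ranked result once, instead of growing a data frame row by row.
matched <- trigram_index[hits, , drop = FALSE]
result <- data.frame(
  predicted = matched[, 3],
  count = as.numeric(matched[, 4]),
  stringsAsFactors = FALSE
)
result <- result[order(result$count, decreasing = TRUE), ]
```

Note that apply() still calls the R function once per row, so this mainly tidies the code and avoids growing the data frame inside the loop; it is not the same as a fully vectorized comparison.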

In general, take a look at Hadley's book chapter on the topic: http://adv-r.had.co.nz/performance.html

Currently, your code doesn't take advantage of the fact that so-called "vectorized" R code performs its loops in C, making them much faster (forgive me if this description is technically incorrect; I'm just getting the idea across).
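To make "vectorized" concrete, here is a small sketch for the two-word case. It compares the whole first column against the input word in one expression rather than testing one row at a time. The toy matrix and the function name `predict_bigram` are assumptions for illustration only.

```r
# Hypothetical toy data shaped like bigram_index: word 1, word 2, count.
bigram_index <- matrix(
  c("abandon", "contemporary", "1",
    "abandon", "ship",         "5",
    "able",    "to",           "9",
    NA,        "orphan",       "2"),
  ncol = 3, byrow = TRUE
)

predict_bigram <- function(word, ng_matrix) {
  # One comparison over the entire column; which() drops the NA results,
  # so no explicit is.na() check is needed.
  hits <- which(ng_matrix[, 1] == word[1])
  out <- data.frame(
    predicted = ng_matrix[hits, 2],
    count = as.numeric(ng_matrix[hits, 3]),
    stringsAsFactors = FALSE
  )
  # Rank the candidates by decreasing count.
  out[order(out$count, decreasing = TRUE), ]
}

res <- predict_bigram("abandon", bigram_index)
```

Wrapping both versions in system.time() on the real matrix would show how much the row-by-row loop costs; no timings are claimed here.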

For a more specific example, it might be helpful to see your input (use dput(data)) and your desired output. Then I'd have an easier time digesting what you want your function to accomplish.

Some general points that might help, at least a little:

  1. You call ncol(ng_matrix) several times; instead, do nc.ngm <- ncol(ng_matrix) once at the start. The savings are minimal, but the idea is still useful.
  2. Instead of defining first_word, second_word, etc., do words <- ng_matrix[i,]. Then use the previously mentioned object to get count_word by doing count_word <- words[nc.ngm], and get the other words as numbered_words <- words[-nc.ngm]. To compare the elements of the word object with the elements of words, you can make use of mapply or logical indexing. Again, it's a little hard to follow without an example. In general, try to do things "in bulk" (vectorize).
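Putting the two points together, one possible shape for a single function that handles any of the n-gram matrices is sketched below. It assumes the layout described in the question (context words first, then the predicted word, then the count). It uses a short loop over the few context columns rather than mapply, and the name `predict_ngram` and the toy data are hypothetical.

```r
predict_ngram <- function(word, ng_matrix) {
  nc <- ncol(ng_matrix)   # computed once, as suggested in point 1
  n_ctx <- nc - 2         # context columns; the last two are prediction and count
  stopifnot(length(word) >= n_ctx)

  # Start with every row selected, then narrow down one column at a time.
  # Each step is a vectorized comparison over all rows; NA entries never match.
  keep <- rep(TRUE, nrow(ng_matrix))
  for (j in seq_len(n_ctx)) {
    keep <- keep & !is.na(ng_matrix[, j]) & ng_matrix[, j] == word[j]
  }

  out <- data.frame(
    predicted = ng_matrix[keep, nc - 1],
    count = as.numeric(ng_matrix[keep, nc]),
    stringsAsFactors = FALSE
  )
  out[order(out$count, decreasing = TRUE), ]
}

# Hypothetical toy three-word matrix: word 1, word 2, word 3, count.
trigram_toy <- matrix(
  c("i",   "can",  "read", "4",
    "i",   "can",  "see",  "2",
    "i",   "will", "go",   "7",
    NA,    "can",  "read", "1"),
  ncol = 4, byrow = TRUE
)

res3 <- predict_ngram(c("i", "can"), trigram_toy)
```

The loop here runs once per context column (one to three iterations), not once per row, so the work per iteration is done in bulk.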
