R Function Slow, Looking to Increase Speed/Performance
I've built a prediction function in R, but when I run it, it's very slow, and I'm only using a sample of 1% of the data I'll be using in production. The function is intended to predict the next word given a series of n-grams (two-word, three-word, or four-word combinations created from my corpus).
I pass the words to the function, for example "i can", along with the matrix of three-word combinations. The output is ranked in decreasing order of count, e.g. "i can read" with a count of 4.
Here is the two-word n-gram matrix that gets passed in, its dimensions, and example data at position 100:
dim(bigram_index)
[1] 46201     3
bigram_index[,1][100]
[1] "abandon"
bigram_index[,2][100]
[1] "contemporary"
bigram_index[,3][100]
[1] "1"
Here is the prediction function:
predict.next.word <- function(word, ng_matrix){
  ngram_df <- data.frame(predicted = character(), count = numeric(),
                         stringsAsFactors = FALSE)
  col_ng_matrix <- nrow(ng_matrix)  # number of rows to scan
  if (ncol(ng_matrix) == 3){
    for (i in 1:col_ng_matrix){
      first_word <- ng_matrix[,1][i]
      second_word <- ng_matrix[,2][i]
      count_word <- ng_matrix[,3][i]
      if (word[1] == first_word && !is.na(first_word)){
        matched_factor <- structure(c(second_word, count_word),
                                    .Names = c("predicted", "count"))
        ngram_df[i,] <- as.list(matched_factor)
      }
    }
  } else if (ncol(ng_matrix) == 4){
    for (i in 1:col_ng_matrix){
      first_word <- ng_matrix[,1][i]
      second_word <- ng_matrix[,2][i]
      third_word <- ng_matrix[,3][i]
      count_word <- ng_matrix[,4][i]
      if (word[1] == first_word && !is.na(first_word) &&
          word[2] == second_word && !is.na(second_word)){
        matched_factor <- structure(c(third_word, count_word),
                                    .Names = c("predicted", "count"))
        ngram_df[i,] <- as.list(matched_factor)
      }
    }
  } else if (ncol(ng_matrix) == 5){
    for (i in 1:col_ng_matrix){
      first_word <- ng_matrix[,1][i]
      second_word <- ng_matrix[,2][i]
      third_word <- ng_matrix[,3][i]
      fourth_word <- ng_matrix[,4][i]
      count_word <- ng_matrix[,5][i]
      if (word[1] == first_word && !is.na(first_word) &&
          word[2] == second_word && !is.na(second_word) &&
          word[3] == third_word && !is.na(third_word)){
        matched_factor <- structure(c(fourth_word, count_word),
                                    .Names = c("predicted", "count"))
        ngram_df[i,] <- as.list(matched_factor)
      }
    }
  }
  ngram_df <- transform(ngram_df, count = as.numeric(count))
  return(ngram_df[order(ngram_df$count, decreasing = TRUE),])
}
Using the smallest n-gram matrix (two-word only), here are the timing results:
system.time(predict.next.word(c("abandon"), bigram_index))
   user  system elapsed
 92.125  59.395 152.149
Again, the n-gram matrix passed in is only 1% of the production data, and with the three- and four-word n-grams it takes even longer. Can anyone provide insight on how to improve this function's speed?
Instead of looping over the matrix row by row yourself, write a function that performs the key actions of the for() loop, and use apply() (with MARGIN=2 for columns, 1 for rows; I think you'll be using the latter) to apply that function across the matrix, with the FUN= argument set equal to your function. Depending on the output format, apply might not be suitable; at that point, look at the plyr package, dplyr, or my favorite (though it has more of a learning curve than dplyr), the data.table package.
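For instance, here is a minimal sketch of the apply() idea (the helper name match_row is mine, and it assumes the column layout shown above: context words, then the predicted word, then a count):

# Check one row of the n-gram matrix against the query words.
match_row <- function(row, word) {
  ctx <- row[seq_len(length(row) - 2)]  # context columns only
  !anyNA(ctx) && all(ctx == word)
}

# MARGIN = 1 applies the function over rows, giving a logical index.
hits <- apply(bigram_index, 1, match_row, word = "abandon")
matches <- data.frame(predicted = bigram_index[hits, 2],
                      count     = as.numeric(bigram_index[hits, 3]),
                      stringsAsFactors = FALSE)
matches[order(matches$count, decreasing = TRUE), ]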
In general, take a look at Hadley's book chapter on the topic: http://adv-r.had.co.nz/performance.html
Currently, your code doesn't take advantage of the fact that so-called "vectorized" R code performs its loops in C, making them much faster (forgive me if that description is technically incorrect; I'm just getting the idea across).
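To make that concrete, here is a sketch of a fully vectorized replacement (the function name is mine; it assumes, as above, that the last column is the count and the second-to-last is the predicted word):

predict.next.word.vec <- function(word, ng_matrix) {
  nc   <- ncol(ng_matrix)
  keep <- rep(TRUE, nrow(ng_matrix))
  # Compare whole columns at once; this loop runs once per query word
  # (at most three times), not once per row of the matrix.
  for (j in seq_along(word)) {
    keep <- keep & !is.na(ng_matrix[, j]) & ng_matrix[, j] == word[j]
  }
  out <- data.frame(predicted = ng_matrix[keep, nc - 1],
                    count     = as.numeric(ng_matrix[keep, nc]),
                    stringsAsFactors = FALSE)
  out[order(out$count, decreasing = TRUE), ]
}

# e.g. predict.next.word.vec("abandon", bigram_index)

Because the element-wise comparisons happen in compiled code, this should take a small fraction of the time of the row-by-row loop.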
For a more specific example, it would be helpful to see your actual input (use dput(data)) and desired output; I'd have an easier time digesting what you want the function to accomplish.
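For example, something like this produces a small, copy-pasteable slice of the data:

dput(head(bigram_index, 5))  # first 5 rows in a form others can paste into R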
Some general points that should help, at least a little:
- You call ncol(ng_matrix) several times; instead, do nc.ngm <- ncol(ng_matrix) once at the start. The savings will be minimal, but the idea is still useful.
- Instead of defining first_word, second_word, etc. one at a time, do words <- ng_matrix[i,]. Then use the previously-mentioned object to get count_word by doing count_word <- words[nc.ngm], and the other words with numbered_words <- words[-nc.ngm]. Compare the word object's elements to the words elements, making use of mapply logic (see the sketch after this list). Again, this is a little hard to follow without an example, but in general: do things "in bulk" (vectorize).
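As a rough sketch of those two points combined (names are illustrative, and this version still loops for clarity; the real win is replacing the loop entirely, as shown earlier):

nc.ngm <- ncol(ng_matrix)                  # computed once, not per iteration
for (i in seq_len(nrow(ng_matrix))) {
  words          <- ng_matrix[i, ]         # grab the whole row at once
  count_word     <- words[nc.ngm]          # last column is the count
  numbered_words <- words[-nc.ngm]         # everything except the count
  # Compare the query words to the row's context words element-wise.
  if (all(mapply(identical, word, numbered_words[seq_along(word)]))) {
    # ...store the predicted word and count_word here...
  }
}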