R - Non-spreadsheet-esque Data Import and Organization
Basic problem

I have data of the form

    doc
    id 1
    var1 a
    var2 b
    ...
    varstar 453
    varstar 3432
    varstar 32
    ...
    varn
    doc
    var1 a
    var2 b
    (and so on)

where doc denotes the beginning of each record/observation, vari denotes a variable, and varstar denotes a variable of interest that may have more than one entry.
I would like to:

1. Import the data using R (or Python, though I'm a bit rusty there).
2. For each record, retrieve id and all instances of varstar.
3. Store them in a manner that allows easy manipulation/merging later, for example a list or a binary (possibly sparse) matrix (for my application varstar is a category, so overlap is expected).
It seems like this should be easy, but I am more familiar with CSV/spreadsheet data and perhaps don't know the right words to Google. I would prefer not to, e.g., create an entire SQL database, because I don't need the entire database. Of course, it may be easier to build the more elaborate organization and then pick out the choice pieces.
Context - my application

I want to retrieve U.S. patent classes via the Google/USPTO bulk downloads. Here id is the patent number, while varstar is the patent class, FSC. Then I want to merge this with the NBER patent data. My application hinges crucially on patent class designation. The NBER data, while nice in many respects, reports only a single "main" class for each patent. That is not sufficient because, based on a casual perusal of the data and a paper by Volodin (2010), patents are commonly given several top-level classes.
Volodin, Dmitry (2010). "NBER Patent Data Technological Classification Issues Relevant for Research in Inventor Mobility". Working paper. udel.edu/~volodin/pat/draft.pdf.
Assuming dat.txt looks like:

    doc
    id 1
    var1 a
    var2 b
    ...
    varstar 453
    varstar 3432
    varstar 32
    ...
    varn
    doc
    id 2
    var1 a
    var2 b
    varstar 111
    varstar 222
    varstar 333333
    ...

then this is a possible framework:
    library(dplyr)

    dat <- readLines("dat.txt")

    # each record starts at a "doc" line and ends just before the next one
    doc_starts <- which(grepl("^doc", dat))
    doc_ends <- lead(doc_starts) - 1
    doc_ends[length(doc_ends)] <- length(dat)  # last record runs to end of file

    # list-ified
    lapply(seq_along(doc_starts), function(i) {
      chunk <- dat[doc_starts[i]:doc_ends[i]]
      id <- gsub("^id +", "", chunk[which(grepl("^id", chunk))])
      varstars <- gsub("^varstar +", "", chunk[which(grepl("^varstar", chunk))])
      list(id=id, varstar=varstars)
    })

    ## [[1]]
    ## [[1]]$id
    ## [1] "1"
    ##
    ## [[1]]$varstar
    ## [1] "453"  "3432" "32"
    ##
    ##
    ## [[2]]
    ## [[2]]$id
    ## [1] "2"
    ##
    ## [[2]]$varstar
    ## [1] "111"    "222"    "333333"

    # data.frame-d (one row per id/varstar pair)
    # data_frame() is deprecated in current tibble releases; tibble() is its replacement
    bind_rows(lapply(seq_along(doc_starts), function(i) {
      chunk <- dat[doc_starts[i]:doc_ends[i]]
      id <- gsub("^id +", "", chunk[which(grepl("^id", chunk))])
      varstars <- gsub("^varstar +", "", chunk[which(grepl("^varstar", chunk))])
      data_frame(id=id, varstar=varstars)
    }))

    ## Source: local data frame [6 x 2]
    ##
    ##   id varstar
    ## 1  1     453
    ## 2  1    3432
    ## 3  1      32
    ## 4  2     111
    ## 5  2     222
    ## 6  2  333333
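Since the question also allows Python and mentions a binary (possibly sparse) matrix, here is a minimal sketch of the same record-splitting idea that goes one step further and builds an id-by-category indicator. It assumes the one "key value" pair per line layout used for dat.txt above. The names parse_records and indicator_matrix are illustrative only, not from any library, and the "sparse matrix" is just a set of (row, column) index pairs rather than a real sparse-matrix type.

```python
def parse_records(lines):
    """Return one {"id": ..., "varstar": [...]} dict per "doc"-delimited record."""
    records = []
    current = None
    for raw in lines:
        # Split "key value" on the first space; a bare "doc" line has no value.
        key, _, value = raw.strip().partition(" ")
        if key == "doc":
            current = {"id": None, "varstar": []}
            records.append(current)
        elif current is not None and key == "id":
            current["id"] = value.strip()
        elif current is not None and key == "varstar":
            current["varstar"].append(value.strip())
        # Every other key (var1, varn, "...") is ignored.
    return records


def indicator_matrix(records):
    """Return (ids, categories, cells); cells holds the (row, col) pairs that are 1."""
    ids = [rec["id"] for rec in records]
    categories = sorted({v for rec in records for v in rec["varstar"]})
    column = {cat: j for j, cat in enumerate(categories)}
    cells = {(i, column[v]) for i, rec in enumerate(records) for v in rec["varstar"]}
    return ids, categories, cells


# Sample text mirroring the dat.txt layout assumed above.
sample = """\
doc
id 1
var1 a
var2 b
...
varstar 453
varstar 3432
varstar 32
...
varn
doc
id 2
var1 a
var2 b
varstar 111
varstar 222
varstar 333333
...
"""

records = parse_records(sample.splitlines())
ids, categories, cells = indicator_matrix(records)

# Expand the sparse cells into dense 0/1 rows for inspection.
for i, rec_id in enumerate(ids):
    row = [1 if (i, j) in cells else 0 for j in range(len(categories))]
    print(rec_id, row)
```

For a real file, passing the open file handle works the same way, e.g. `with open("dat.txt") as fh: records = parse_records(fh)`. Unlike the `^doc` regular expression in the R code, this sketch matches keys exactly, so a line starting with a longer word such as "document" would not open a new record.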