R - Non-spreadsheet-esque Data Import and Organization

- June 15, 2013

Basic problem

I have data of the form:

    doc
    id 1
    var1
    var2 b
    ...
    varstar 453
    varstar 3432
    varstar 32
    ...
    varn
    doc
    var1
    var2 b
    (and so on)

Here doc marks the beginning of each record (observation), vari denotes the i-th variable, and varstar denotes the variable of interest, which may have more than one entry per record.

What I want to do:
Import the data using R (or Python, though I'm a bit rusty there).
For each record, retrieve the id and all instances of varstar.
Store them in a manner that allows easy manipulation and merging later, for example as a list or a binary (possibly sparse) matrix (in my application varstar is a category, and overlap is expected).
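As a rough illustration of the two storage options mentioned above, the base R sketch below builds both from a small long-format table. The `pairs` data frame and its values are made up for the example; a real table would come from the import step.

```r
# Hypothetical long-format pairs (id, varstar), invented for illustration.
pairs <- data.frame(
  id      = c("1", "1", "1", "2", "2"),
  varstar = c("453", "3432", "32", "453", "111"),
  stringsAsFactors = FALSE
)

# List form: one character vector of varstar values per id.
by_id <- split(pairs$varstar, pairs$id)

# Binary incidence matrix: rows are ids, columns are varstar values,
# 1 where the id carries that value and 0 otherwise.
counts    <- table(id = pairs$id, varstar = pairs$varstar)
incidence <- unclass(counts > 0) * 1L

print(by_id)
print(incidence)
```

With many categories a dense matrix can get large. In that case the same pairs can be converted to integer row and column indices and passed to `Matrix::sparseMatrix()` instead.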

It seems this should be easy, but I am mostly familiar with CSV/spreadsheet data, and perhaps I don't know the right words to google. I would prefer not to, e.g., create an entire SQL database, because I don't need the entire database. Of course, it may turn out to be easier to build the more elaborate organization and then pick out the choice pieces.

Context - application

I want to retrieve U.S. patent classes via the Google/USPTO bulk downloads. Here id is the patent number, while varstar is the patent class (FSC). I then want to merge the result with the NBER patent data. My application hinges crucially on the patent class designation. The NBER data, while nice in many respects, reports only a single "main" class for each patent. That is not sufficient, because based on a casual perusal of the data and on the paper by Volodin (2010), patents are commonly given several top-level classes.

Volodin, Dmitry. (2010) "NBER patent data technological classification issues relevant research in inventor mobility", working paper. udel.edu/~volodin/pat/draft.pdf.
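The merge step described above could look roughly like the following base R sketch. Both tables, their column names (`patent`, `class`, `main_class`) and all values are hypothetical stand-ins, not the actual layout of the bulk downloads or the NBER files.

```r
# Hypothetical long-format class table: one row per (patent, class) pair.
classes <- data.frame(
  patent = c("1", "1", "1", "2", "2"),
  class  = c("453", "3432", "32", "111", "222"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for per-patent data with a single main class.
main <- data.frame(
  patent     = c("1", "2", "3"),
  main_class = c("453", "111", "999"),
  stringsAsFactors = FALSE
)

# Left join: keep every patent in `main`; a patent with several classes
# yields several rows, and one with no match gets class = NA.
merged <- merge(main, classes, by = "patent", all.x = TRUE)

# Flag classes that differ from the recorded main class.
merged$is_secondary <- !is.na(merged$class) & merged$class != merged$main_class

print(merged)
```

Keeping the classes in long format means the join key stays a single column, and the per-patent class count can be recovered later with `table(merged$patent)`.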

Assuming dat.txt looks like:

    doc
    id 1
    var1
    var2 b
    ...
    varstar 453
    varstar 3432
    varstar 32
    ...
    varn
    doc
    id 2
    var1
    var2 b
    varstar 111
    varstar 222
    varstar 333333
    ...

then this is a possible framework:

    library(dplyr)

    dat <- readLines("dat.txt")

    # Each record starts at a "doc" line and ends just before the next one
    doc_starts <- which(grepl("^doc", dat))
    doc_ends <- lead(doc_starts) - 1
    doc_ends[length(doc_ends)] <- length(dat)  # last record runs to end of file

    # list-ified

    lapply(seq_along(doc_starts), function(i) {

      chunk <- dat[doc_starts[i]:doc_ends[i]]

      # Strip the field name, keeping only the value
      id <- gsub("^id +", "", chunk[grepl("^id", chunk)])
      varstars <- gsub("^varstar +", "", chunk[grepl("^varstar", chunk)])

      list(id=id, varstar=varstars)

    })

    ## [[1]]
    ## [[1]]$id
    ## [1] "1"
    ##
    ## [[1]]$varstar
    ## [1] "453"  "3432" "32"
    ##
    ##
    ## [[2]]
    ## [[2]]$id
    ## [1] "2"
    ##
    ## [[2]]$varstar
    ## [1] "111"    "222"    "333333"

    # data.frame-d

    bind_rows(lapply(seq_along(doc_starts), function(i) {

      chunk <- dat[doc_starts[i]:doc_ends[i]]

      id <- gsub("^id +", "", chunk[grepl("^id", chunk)])
      varstars <- gsub("^varstar +", "", chunk[grepl("^varstar", chunk)])

      # tibble() replaces the deprecated data_frame(); id is recycled per varstar
      tibble(id=id, varstar=varstars)

    }))

    ## (output as printed by the dplyr version of the time; the header differs in current versions)
    ## source: local data frame [6 x 2]
    ##
    ##   id varstar
    ## 1  1     453
    ## 2  1    3432
    ## 3  1      32
    ## 4  2     111
    ## 5  2     222
    ## 6  2  333333
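For comparison, here is a vectorized base R sketch of the same idea that avoids the per-record loop. It is an alternative under stated assumptions, not the approach used above: it assumes one field per line and exactly one id line per record, and the in-memory `dat` vector is a hypothetical stand-in for `readLines("dat.txt")`.

```r
# Hypothetical in-memory stand-in for readLines("dat.txt").
dat <- c("doc", "id 1", "var1", "var2 b", "varstar 453", "varstar 3432",
         "varstar 32", "varn",
         "doc", "id 2", "var1", "var2 b", "varstar 111", "varstar 222",
         "varstar 333333")

# Record index for every line: increments at each line that starts a record.
rec <- cumsum(grepl("^doc", dat))

is_id      <- grepl("^id +", dat)
is_varstar <- grepl("^varstar +", dat)

# One id per record, named by record index so it can be looked up by name.
ids <- setNames(sub("^id +", "", dat[is_id]), rec[is_id])

# Long format: one row per varstar line, tagged with the id of its record.
long <- data.frame(
  id      = unname(ids[as.character(rec[is_varstar])]),
  varstar = sub("^varstar +", "", dat[is_varstar]),
  stringsAsFactors = FALSE
)

print(long)
```

Because the id is looked up by record index, a record without an id line would show up as `NA` in `long$id` rather than stopping the import.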
