Efficiently capture variable number of patterns with Julia regex?
I am pulling data out of giant text files. The sections I am interested in look like this:

    ...
    section:numberofsurvivorspervault
    subsection:1958
    xy:1_1034
    xy:2_2334
    subsection:1959
    xy:1_1334
    xy:2_2874
    xy:7_12
    ...
    section:meancapsperghoul
    subsection:1962
    xy:1_234
    xy:2_121
    ....

The sections/subsections are randomly scattered throughout the text file and have variable numbers of xy pairs. Right now I am readall'ing the full text, capturing each section, and adding the pairs to a DataFrame with:
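    function pushparametricdata(df, full)
        # Outer pass: one match per section/subsection block, with all of its xy lines in group 3.
        for m in eachmatch(r"section:(.*)\r\nsubsection:([0-9]*)\r\n((xy:[0-9]*_.*?\r\n)+)"m, full)
            # Inner pass: a second scan over the captured xy block.
            for r in eachmatch(r"xy:([0-9]+)_(.*?)\r\n"m, m.captures[3])
                # int/float are the string conversions of older Julia releases;
                # current Julia spells them parse(Int, s) and parse(Float64, s).
                push!(df, [m.captures[1], int(m.captures[2]), int(r.captures[1]), float(r.captures[2])])
            end
        end
    end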
This works OK, but I think it allocates at least twice the memory it needs because of the two regexes, and @time shows that 80% of the run is GC. Is there a way this can be done without making the intermediate copy? (From what I can tell, it's not possible with a single regex.)

It depends on how much you need to validate the rest of the text file. For instance, if you don't need syntax validation, in that you know for sure the text file has the correct section-subsection-item structure, you could use this regex:
    (?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d*)[\r\n]+)xy:(\d*_.*+[\r\n]+)
and iterate over each xy pair.

Example:
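    # Run this inside a function: sec and sub are declared before the loop
    # so that their values persist from one iteration to the next.
    sec = sub = nothing
    for m in eachmatch(r"(?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d*)[\r\n]+)xy:(\d*_.*+[\r\n]+)"m, full)
        # Groups 1 and 2 are unset unless this match started a new section/subsection.
        if m.captures[2] !== nothing
            sub = m.captures[2]
            if m.captures[1] !== nothing
                sec = m.captures[1]
            end
        end
        item = m.captures[3]   # includes the trailing line break
        print("section: ", sec, " -- subsection: ", sub, " -- item: ", item)
    end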
*Please forgive me, this is my first time trying to code in Julia.

Prints:
    section: numberofsurvivorspervault -- subsection: 1958 -- item: 1_1034
    section: numberofsurvivorspervault -- subsection: 1958 -- item: 2_2334
    section: numberofsurvivorspervault -- subsection: 1959 -- item: 1_1334
    section: numberofsurvivorspervault -- subsection: 1959 -- item: 2_2874
    section: numberofsurvivorspervault -- subsection: 1959 -- item: 7_12
    section: meancapsperghoul -- subsection: 1962 -- item: 1_234
    section: meancapsperghoul -- subsection: 1962 -- item: 2_121
This expression uses \G to match at the end of the last match. It tries to match in this order:

1. If there was a previous match, try to match an xy pair in m.captures[3], anchored at the end of the last match, leaving the 1st and 2nd capturing groups unset.
2. If (1) doesn't match, try to match both a subsection and an xy pair in m.captures[2] and m.captures[3], again anchored at the end of the last match, leaving the 1st capturing group unset.
3. Otherwise, try to perform a full match, with section, subsection and xy pair.
This example works on your subject text, and serves as a starting point for a working example depending on the actual structure of your text files. Take into account that it would fail if, for example, you have subsections missing.
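To connect this back to the question's goal of typed rows, here is a minimal sketch of the same single-pass idea on current Julia. The names `XY_RE` and `parse_rows` are hypothetical, and the pattern is a variant of the one above (an assumption, not the original): it captures the x and y parts in separate groups so that no second regex is needed. It collects tuples instead of pushing to a DataFrame so that it runs without extra packages.

```julia
# Hypothetical variant of the \G pattern: groups 3 and 4 hold x and y separately.
const XY_RE = r"(?:\G|(?:\G|^section:([^\r\n]*)[\r\n]+)subsection:(\d+)[\r\n]+)xy:(\d+)_([^\r\n]*)[\r\n]+"m

# Hypothetical helper: scan `full` once and return (section, subsection, x, y) rows.
function parse_rows(full::AbstractString)
    rows = Tuple{String,Int,Int,Float64}[]
    sec = ""   # last section seen; kept across matches
    sub = 0    # last subsection seen; kept across matches
    for m in eachmatch(XY_RE, full)
        # Groups 1 and 2 are `nothing` when the match continued from the previous xy line.
        if m.captures[2] !== nothing
            sub = parse(Int, m.captures[2])
            if m.captures[1] !== nothing
                sec = String(m.captures[1])
            end
        end
        push!(rows, (sec, sub, parse(Int, m.captures[3]), parse(Float64, m.captures[4])))
    end
    return rows
end

# Sample text modelled on the question's excerpt, with unrelated lines in between.
text = "junk\nsection:numberofsurvivorspervault\nsubsection:1958\nxy:1_1034\nxy:2_2334\n" *
       "subsection:1959\nxy:1_1334\nxy:2_2874\nxy:7_12\nother stuff\n" *
       "section:meancapsperghoul\nsubsection:1962\nxy:1_234\nxy:2_121\n"

rows = parse_rows(text)
for row in rows
    println(row)
end
```

Each row could be handed to `push!(df, row)` instead of being collected, assuming the DataFrame's columns match these types. The caveat above still applies: a section without a subsection line, or an xy line that does not directly follow a match, is silently skipped.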