Efficiently capture variable number of patterns with Julia regex?

- March 15, 2011

I'm pulling data out of giant text files, and the sections I'm interested in look like this:

    ...
    section:numberofsurvivorspervault
    subsection:1958
    xy:1_1034
    xy:2_2334
    subsection:1959
    xy:1_1334
    xy:2_2874
    xy:7_12
    ...
    section:meancapsperghoul
    subsection:1962
    xy:1_234
    xy:2_121
    ....

The sections/subsections are randomly scattered throughout the text file and have variable numbers of xy pairs. Right now I'm readall'ing the full text, capturing each match, and adding them to a DataFrame with:

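    function pushparametricdata(df, full)
        # Outer regex: one match per section/subsection block, with all its xy lines in group 3.
        for m in eachmatch(r"section:(.*)\r\nsubsection:([0-9]*)\r\n((xy:[0-9]*_.*?\r\n)+)"m, full)
            # Inner regex: split the captured block into individual xy pairs.
            for r in eachmatch(r"xy:([0-9]+)_(.*?)\r\n"m, m.captures[3])
                # parse(...) replaces the int()/float() string conversions of early Julia versions.
                push!(df, [m.captures[1], parse(Int, m.captures[2]), parse(Int, r.captures[1]), parse(Float64, r.captures[2])])
            end
        end
    end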

This works OK, but I think it allocates at least twice the memory it needs because of the two regexes, and @time shows 80% of the run is GC. Is there a way this can be done without making an intermediate copy? (From what I can tell, it's not possible with a single regex.)

It depends on how much you need to validate the rest of the text file. For instance, if you don't need syntax validation, in that you know for sure the text file has the correct section-subsection-item structure, you could use this regex:

    (?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d*)[\r\n]+)xy:(\d*_.*+[\r\n]+)

iterating over each xy pair.

Example:

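    function printitems(full)
        # Last section/subsection seen. Declared outside the loop so they persist across matches.
        sec = sub = nothing
        for m in eachmatch(r"(?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d*)[\r\n]+)xy:(\d*_.*+[\r\n]+)"m, full)
            if m.captures[2] !== nothing
                sub = m.captures[2]
                if m.captures[1] !== nothing
                    sec = m.captures[1]
                end
            end
            item = m.captures[3]   # includes the trailing line break
            print("section: ", sec, " -- subsection: ", sub, " -- item: ", item)
        end
    end

    printitems(full)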

*Please forgive me, this is my first time coding in Julia.

Prints:

    section: numberofsurvivorspervault -- subsection: 1958 -- item: 1_1034
    section: numberofsurvivorspervault -- subsection: 1958 -- item: 2_2334
    section: numberofsurvivorspervault -- subsection: 1959 -- item: 1_1334
    section: numberofsurvivorspervault -- subsection: 1959 -- item: 2_2874
    section: numberofsurvivorspervault -- subsection: 1959 -- item: 7_12
    section: meancapsperghoul -- subsection: 1962 -- item: 1_234
    section: meancapsperghoul -- subsection: 1962 -- item: 2_121

This expression uses \G to match at the end of the last match. It tries to match in this order:

1. If there was a previous match, try to match an xy pair in m.captures[3], anchored to the end of the last match, leaving the 1st and 2nd capturing groups unset.
2. If (1) doesn't match, try to match both a subsection and an xy pair in m.captures[2] and m.captures[3], again anchored to the end of the last match, leaving the 1st capturing group unset.
3. Otherwise, try to perform a full match: section, subsection and xy pair.
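To make the three cases concrete, here is a small sketch that labels each match by which capturing groups were set. The `sample` text and the `branch` helper are made up for illustration, and the sample uses "\n" line endings rather than the "\r\n" of the real files.

```julia
# Hypothetical sample mirroring the layout from the question ("\n" line endings).
sample = """
section:numberofsurvivorspervault
subsection:1958
xy:1_1034
xy:2_2334
subsection:1959
xy:1_1334
"""

re = r"(?:\G|(?:\G|^section:(.*)[\r\n]+)subsection:(\d*)[\r\n]+)xy:(\d*_.*+[\r\n]+)"m

# Label a match by the optional groups that took part in it.
function branch(m::RegexMatch)
    m.captures[1] !== nothing && return :section      # full match: section + subsection + xy
    m.captures[2] !== nothing && return :subsection   # subsection + xy, anchored by \G
    return :item                                      # xy pair only, anchored by \G
end

branches = [branch(m) for m in eachmatch(re, sample)]
println(branches)
```

On this sample the labels should come out as a full match first, then a bare xy pair, then a subsection match, which is the reverse of the order in which the alternatives are tried.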

This example works on the subject text and serves as a starting point for a working solution, depending on the actual structure of your text files. Take into account that it will fail if, for example, subsections are missing.
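As one possible way to build on that starting point, the sketch below collects typed rows in a single pass instead of printing them. The function name `parsexy` and the tuple row layout are assumptions, not part of the original code. The regex is a variant of the one above: it uses `[^\r\n]*` so that a "\r" never ends up inside a capture, and it splits the xy pair into two groups. Whether this actually reduces allocations or GC time compared with the two-regex version would need to be measured on the real files.

```julia
# Sketch only: `parsexy` and the row layout are hypothetical names chosen for illustration.
function parsexy(full::AbstractString)
    # Groups: 1 = section name, 2 = subsection, 3 = x, 4 = y.
    re = r"(?:\G|(?:\G|^section:([^\r\n]*)[\r\n]+)subsection:(\d+)[\r\n]+)xy:(\d+)_([^\r\n]*)[\r\n]+"m
    rows = Tuple{String,Int,Int,Float64}[]
    sec = ""   # current section, carried over from earlier matches
    sub = 0    # current subsection, carried over from earlier matches
    for m in eachmatch(re, full)
        c = m.captures
        if c[2] !== nothing
            sub = parse(Int, c[2])
            if c[1] !== nothing
                sec = String(c[1])
            end
        end
        push!(rows, (sec, sub, parse(Int, c[3]), parse(Float64, c[4])))
    end
    return rows
end

sample = "section:meancapsperghoul\r\nsubsection:1962\r\nxy:1_234\r\nxy:2_121\r\n"
rows = parsexy(sample)
foreach(println, rows)
```

The caveat above still applies to this variant: an xy line that sits directly under a section line, with no subsection in between, is silently skipped rather than reported.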
