Python - Regex text between two strings
I am trying to extract data fields from PDF text using regex.
The text is:
"sample experian customer\n2288150 - experian sample reports\ndata dictionary report\nfiltered by:\ncustom selection\nmarketing element:\npage 1 of 284\n2014-11-11 21:52:01 pm\nexperian , marks used herein service marks or registered trademarks of experian.\n© experian 2014 rights reserved. confidential , proprietary.\n**data dictionary**\ndate of birth acquired public , proprietary files. these sources provide, @ minimum, year of birth; month provided available. exact date of birth @ various levels of detail available \n\n\n\n\n\nnote: records coded dob exclusive of estimated age (101e)\n**element number**\n0100\ndescription\ndate of birth / exact age\n**data dictionary**\n\n\n\n\n\n\n\n\n\n\nfiller, 3 bytes\n**element number**\n0000\n**description**\nenhancement mandatory append\n**data dictionary**\n\n\nwhen there insufficient data match customer's record our enrichment master estimated age, median estimated age based on ages of other adult individuals in same zip+4 area provided. \n\n\n\n\n\n\n00 = unknown\n**element number**\n0101e\n**description**\nestimated age\n"
The field names are in bold. The text between two field names is the field value.
First, I tried to extract the 'description' field using the following regex:
pattern = re.compile('\ndescription\n(.*?)\ndata dictionary\n')
re.findall(pattern, text)
The results are correct:
['date of birth / exact age', 'enhancement mandatory append']
But using the same idea to extract the 'data dictionary' field gives an empty result:
pattern = re.compile('\ndata dictionary\n(.*?)\nelement number\n')
re.findall(pattern, text)
Results:
[]
Any idea why?
'.' doesn't match newlines by default, and the 'data dictionary' values span several lines. Try:
pattern = re.compile('\ndata dictionary\n(.*?)\nelement number\n', flags=re.DOTALL)
re.findall(pattern, text)
Notice how I passed re.DOTALL as the flags argument to re.compile.
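To see the difference side by side, here is a minimal sketch. The `text` below is a shortened, made-up stand-in for the extracted PDF text (not the full sample from the question), and the variable names are only illustrative:

```python
import re

# Shortened stand-in for the PDF text: the first 'data dictionary' value
# spans several lines, the second one fits on a single line.
text = (
    "\ndata dictionary\n"
    "date of birth acquired from public files.\n\n"
    "note: records coded dob\n"
    "element number\n"
    "0100\n"
    "description\n"
    "date of birth / exact age\n"
    "data dictionary\n"
    "filler, 3 bytes\n"
    "element number\n"
    "0000\n"
)

body = r"\ndata dictionary\n(.*?)\nelement number\n"

# Without DOTALL, '.' stops at every newline, so only a value that
# sits on a single line can match.
without_flag = re.findall(body, text)

# With DOTALL, '.' also matches '\n', so the lazy group can span lines.
with_flag = re.findall(body, text, flags=re.DOTALL)

print(without_flag)
print(with_flag)
```

This also suggests why the 'description' pattern worked without the flag: each of those values sits on a single line, so '.' never had to cross a newline. If you prefer not to pass a flag, putting the inline form (?s) at the start of the pattern should behave the same way.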