Parsing HTML with REs

Published: Wednesday, Dec 26, 2007 Last modified: Monday, May 27, 2024

My repo has better stuff:

12:18 < hendry> Could anyone shed some light on a regex problem I have here?

12:22 < deltab> hendry: it doesn't work for all the arpagraphs

12:22 < deltab> ^paragraphs

12:22 < deltab> only those on one line

12:25 < deltab> hendry: you need the s option to make . match newlines, as The-Fixer mentions

12:28 < deltab> hendry: or you can pass the flag to re.search('<body>(.*?)</body>', re.S)

12:28 < deltab> er, with the input too

12:28 < hendry> thanks, I was trying to figure out this 's' meaning

12:31 < deltab> it makes . match any character including newlines

12:31 < deltab> also called re.DOTALL

12:36 < The-Fixer> you can ignore case also: re.search("(?si)<body>(.*?)</body>", '<BODY>ab\ncd\nef</body>').group(1) #==> 'ab\ncd\nef'
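
To make the flag discussion above concrete, here is a small sketch of my own (not from the chat). It assumes the calls being discussed were re.search, and the variable names are made up for illustration. It shows that "." stops at newlines unless DOTALL is set, and that the flags can go either in the third argument or inline at the start of the pattern.

```python
import re

html = "<BODY>ab\ncd\nef</body>"

# Without DOTALL, "." does not match "\n", so a multi-line body fails to match.
no_dotall = re.search(r"<body>(.*?)</body>", html, re.I)

# Flags passed as the third argument: re.S (DOTALL) plus re.I (IGNORECASE).
with_flags = re.search(r"<body>(.*?)</body>", html, re.S | re.I)

# The same two flags written inline at the start of the pattern.
inline = re.search(r"(?si)<body>(.*?)</body>", html)

body_with_flags = with_flags.group(1)
body_inline = inline.group(1)
print(repr(body_with_flags))
```

Both spellings should capture the same text; the inline form is handy when you can only supply a pattern string and not a flags argument.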

12:47 < moshez> don't do stupid crap

12:47 < moshez> like, say, parsing HTML with REs


10:35 <Pythy> re.search("(?si)<body>([^>]*)</body>", '<BoDy>ab\ncd\nef</bOdY>').group(1) #==> 'ab\ncd\nef'  # for the case where no `>' will appear between the <tags>.

10:38 <Pythy> (If a `<' might appear, a more involved rex could be used.  Handling nesting delimiters is trickier still, as additional code is needed to track the depth.  But, as was noted, for a robust solution, a parser is the way to go.)
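
As a rough illustration of the "use a parser" advice, here is a minimal sketch using html.parser from the Python standard library. The BodyText class and body_text helper are hypothetical names of mine, not something from the chat, and this is only one of several parsers you could reach for. It collects the text inside the body element, so nested tags and a stray `>' in the text do not need special handling in a pattern.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect the text that appears between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BoDy> arrives as "body".
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep text only while inside the body element.
        if self.in_body:
            self.chunks.append(data)


def body_text(html):
    """Return the concatenated text found inside the body element."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample = '<html><BoDy class="x">ab\n<p>cd > ef</p>\n</bOdY></html>'
result = body_text(sample)
print(repr(result))
```

Note that this drops the markup and keeps only the text; if you want the inner HTML itself, you would record the tags in handle_starttag and handle_endtag as well.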

Should be using: