Parsing HTML with REs

Published: Wednesday, Dec 26, 2007 Last modified: Monday, Dec 9, 2024

My repo has better stuff:

12:18 < hendry> Could anyone shed some light on a regex problem I have here? http://rafb.net/paste/results/U1015016.html

12:22 < deltab> hendry: it doesn't work for all the arpagraphs

12:22 < deltab> ^paragraphs

12:22 < deltab> only those on one line

12:25 < deltab> hendry: you need the s option to make . match newlines, as The-Fixer mentions

12:28 < deltab> hendry: or you can pass the flag to re.search:  re.search('<body>(.*?)</body>', re.S)

12:28 < deltab> er, with the input too

12:28 < hendry> thanks, I was trying to figure out this 's' meaning

12:31 < deltab> it makes . match any character including newlines

12:31 < deltab> also called re.DOTALL

12:36 < The-Fixer> you can ignore case also: re.search("(?si)<body>(.*?)</body>", '<BODY>ab\ncd\nef</body>').group(1) #==> 'ab\ncd\nef'

12:47 < moshez> don't do stupid crap

12:47 < moshez> like, say, parsing HTML with REs

Update:

10:35 <Pythy> re.search("(?si)<body>([^>]*)</body>", '<BoDy>ab\ncd\nef</bOdY>').group(1) #==> 'ab\ncd\nef'  # for the case where no `>' will appear between
          the <tags>.

10:38 <Pythy> (If a `<' might appear, a more involved rex could be used.  Handling nesting delimiters is trickier still, as additional code is needed to
          track the depth.  But, as was noted, for a robust solution, a parser is the way to go.)

Should be using: http://www.crummy.com/software/BeautifulSoup/