Parsing HTML with REs
Published: Wednesday, Dec 26, 2007 Last modified: Thursday, Nov 14, 2024
My repo has better stuff:
12:18 < hendry> Could anyone shed some light on a regex problem I have here? http://rafb.net/paste/results/U1015016.html
12:22 < deltab> hendry: it doesn't work for all the arpagraphs
12:22 < deltab> ^paragraphs
12:22 < deltab> only those on one line
12:25 < deltab> hendry: you need the s option to make . match newlines, as The-Fixer mentions
12:28 < deltab> hendry: or you can pass the flag to re.search: re.search('<body>(.*?)</body>', re.S)
12:28 < deltab> er, with the input too
12:28 < hendry> thanks, I was trying to figure out this 's' meaning
12:31 < deltab> it makes . match any character including newlines
12:31 < deltab> also called re.DOTALL
12:36 < The-Fixer> you can ignore case also: re.search("(?si)<body>(.*?)</body>", '<BODY>ab\ncd\nef</body>').group(1) #==> 'ab\ncd\nef'
12:47 < moshez> don't do stupid crap
12:47 < moshez> like, say, parsing HTML with REs
Update:
10:35 <Pythy> re.search("(?si)<body>([^>]*)</body>", '<BoDy>ab\ncd\nef</bOdY>').group(1) #==> 'ab\ncd\nef' # for the case where no `>' will appear between
the <tags>.
10:38 <Pythy> (If a `<' might appear, a more involved rex could be used. Handling nesting delimiters is trickier still, as additional code is needed to
track the depth. But, as was noted, for a robust solution, a parser is the way to go.)
Should be using: http://www.crummy.com/software/BeautifulSoup/