Parsing HTML with Regex (stackoverflow.com)
56 points by babawere on Jan 6, 2013 | hide | past | favorite | 33 comments


While I agree with the answer, what most people take from it is "don't use regex to scrape data from HTML", which isn't exactly its point. Parsing HTML and scraping are two different things.

If you know the exact HTML you are working with, using regex to extract the data is, in my opinion, the superior way of doing it: fewer lines and generally less complexity. (Extracting the name and ID of an Amazon product from a single known site is a different job from extracting all the links out of an arbitrary page.)
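For example (the markup, IDs, and class names here are invented for illustration), pulling one known field out of a fixed page can be a couple of lines:

```python
import re

# Hypothetical snippet of a page whose exact markup we know and control:
page = '<span id="productTitle" class="a-size-large">Example Widget</span>'

# With known, fixed markup, a targeted regex is short and sufficient:
m = re.search(r'<span id="productTitle"[^>]*>([^<]+)</span>', page)
title = m.group(1) if m else None
print(title)  # Example Widget
```

The moment the markup can vary (attribute order, nesting, entities), this approach falls apart — which is exactly the scraping/parsing distinction above.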


With today's widely available DOM manipulation tools (jsdom, phantom, zombie) and proper HTML parsers in javascript, there is absolutely no reason to use RegExps.


> (jsdom, phantom, zombie) and proper HTML parsers in javascript

Well, for starters, if you're not using Javascript. All of the tools you mentioned are Javascript-based.


There is an XPath implementation for everything under the sun now - I saw and used one even in Erlang - and such an implementation beats regexes in readability and lines of code 9 times out of 10.

The only real reason to use regexes is when dealing with html so broken that parts of it are inaccessible through parser.
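To illustrate with stdlib Python (ElementTree supports a limited XPath subset; the XHTML snippet is made up), the declarative query says what you want instead of how the bytes look:

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed XHTML fragment:
xhtml = "<html><body><a href='/a'>A</a><p><a href='/b'>B</a></p></body></html>"

doc = ET.fromstring(xhtml)

# ".//a" finds every <a> at any depth, regardless of surrounding markup:
hrefs = [a.get("href") for a in doc.findall(".//a")]
print(hrefs)  # ['/a', '/b']
```

The equivalent regex would have to anticipate attribute order, whitespace, and nesting by hand.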


On the other hand, the world would really benefit from someone updating the libxml2 HTML parser to match the HTML(5) spec, since it has popular bindings for many other languages (e.g. lxml in python). The current implementation is broken in lots of (not so) edge cases. This can be a problem when there is the choice of being fast and wrong (using libxml2) or being slow and correct (using html5lib or one of the other high level implementations of the standardised algorithm).


The answer is pretty entertaining, but in context it's pedantic to the extreme. The poster's question was about matching opening tags that don't contain a closing slash, which is a tiny (regular) subset of HTML. You don't need pushdown automata to recognize these.

English, like any other natural language, is (at least mostly) a context-free language too, but you wouldn't go around telling people that you shouldn't ever use regexen to match certain constructions in English text, right?


I wouldn't call a natural language context-free. They're not formal languages at all.


I'm well aware that English isn't a formal language, that's why I added the qualifier "mostly". The great majority of expressions in natural languages can, in fact, be accounted for with CFGs, and purely CFG-based Phrase Structure Grammars have been proposed (see the work of Gazdar and Pullum on Generalized Phrase Structure Grammar, from the early 80's, if you're interested). Many of Chomsky's original claims about the weak generative capacity of CFGs with respect to natural language that gave rise to transformational syntactic frameworks have since been disproven.

Whether or not there is an absolutely snug fit between CFGs formally and natural language "in the wild", so to speak, is another topic, and rather beside the point of the analogy. Context Sensitive Grammars are overly expressive, Regular Grammars much too weak, for much the same reason why they are too weak for HTML. Were there a perfect English language parser, you would not need it in order to match regular subsets of English, just as you do not need a full HTML parser in order to match regular subsets of HTML.


I had one class where we had to build a multi-threaded search engine in Java, and parsing HTML with regex was a requirement. Regex was the downfall of about 50% of the class, and the majority of the students who did well still had slight issues with their HTML parsing. Moral of the story is that regex is a poor solution for HTML. Not to mention, hours debugging regex is one of the least meaningful or rewarding experiences you can have as a programmer.


The obvious workaround for that broken requirement is to use regular expressions to handle the tokenization, and then write a simple recursive descent parser on top of that. Even if this is playing fast and loose with the requirements, it would work out fine, and you would almost certainly get a good grade if you explain why you did it.
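A minimal sketch of that split (token rules and tree shape invented for the example): the regex only lexes tags and text, and the recursion handles the nesting that a regex alone cannot:

```python
import re

# Lexer: a tag is "<", optional "/", a name, anything up to ">"; everything
# else between tags is text. Nesting is deliberately NOT the regex's job.
TOKEN = re.compile(r"<(/?)(\w+)[^>]*>|([^<]+)")

def parse(html):
    tokens = TOKEN.findall(html)
    pos = 0

    def node(name):
        # Recursive descent: collect children until our closing tag.
        nonlocal pos
        children = []
        while pos < len(tokens):
            closing, tag, text = tokens[pos]
            pos += 1
            if text:
                children.append(text)
            elif closing:
                break  # naively assume it closes the current element
            else:
                children.append(node(tag))
        return (name, children)

    return node("root")

print(parse("<ul><li>one</li><li>two</li></ul>"))
```

This is nowhere near a real HTML parser (no attributes, no error recovery, no void elements), but it is the shape of the workaround: FSM for tokens, recursion for structure.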


Yes, this ^

Because, no, you can't parse XHTML with regex. As easily shown by the pumping lemma and all that jazz.

But, there's no freaking reason why you can't tokenize an XML start tag with a regex! In fact, you'll probably find that most uses of parsers in real life have regexes to tokenize down at the level that they can handle, before using a parser on the resulting tokens for the part that actually needs to be a CFG (among other reasons, because a compiled FSM is a lot faster than even a limited LALR parser).

Looking at this specific example, we can refer to the definitions for start tags [1] and empty element tags [2] in XML, and see that all their constituent rules form a regular language (if you don't believe me, it's not too hard to go check for yourself). So, especially since the original question doesn't even mention 'parsing', can we all please just shut up? (unless you actually want to figure out the horrible mess necessary to define a regex from the spec :P )

1. http://www.w3.org/TR/xml11/#sec-starttags

2. http://www.w3.org/TR/xml11/#dt-eetag
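For the curious, a rough and deliberately simplified transcription of those productions (the real Name and AttValue rules admit far more — Unicode ranges, entity references — so this is a sketch, not a spec-complete pattern):

```python
import re

# Simplified stand-ins for the spec's Name and Attribute productions:
NAME = r"[A-Za-z_:][\w.:-]*"
ATTR = r"\s+" + NAME + r"\s*=\s*(?:\"[^<\"]*\"|'[^<']*')"

# STag / EmptyElemTag, approximately: "<" Name (S Attribute)* S? "/"? ">"
START_TAG = re.compile("<" + NAME + "(?:" + ATTR + ")*\\s*/?>")

print(bool(START_TAG.fullmatch('<a href="/x">')))  # True
print(bool(START_TAG.fullmatch("<br/>")))          # True
print(bool(START_TAG.fullmatch("</a>")))           # False (an end tag)
```

Every rule involved is regular — no recursion, no nesting — which is the whole point being made above.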


> the majority of the students who did well still had slight issues with their HTML parsing.

I'd like to hire the minority students, since it seems that they quite literally accomplished the impossible!


> the majority of the students who did well still had slight issues with their HTML parsing.

There were a few students who actually passed all of our professor's test cases (at least that's what they claim).


Did the other 50% succeed by telling the prof they got that he was joking?


You know you've been on SO too long when you see this title and go, "I'm pretty sure I know this question already." And you're right.


Related SO post/comment - Oh Yes You Can Use Regexes to Parse HTML! - http://stackoverflow.com/a/4234491/12195 (HN - http://news.ycombinator.com/item?id=2741780)

This comment is from Tom Christiansen of Programming Perl / Perl Cookbook fame, and it includes the following caveat:

So while it certainly can be done (this posting serves as an existence proof of this incontrovertible fact), that doesn’t mean it should be.


The ultimate regex repost!


There are plenty of good, real html parsers, so there's no need to try regex.

Xpath is nice for scraping HTML, though it always turns out the stuff you need is in the middle of a bunch of other text.


I would imagine that valid HTML could be parsed as XML, e.g. with the Python ElementTree XML API


That would be true for XHTML but not for HTML, as tags like <p> do not always require an end tag, which would break XML parsing.
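A quick demonstration with Python's stdlib ElementTree (snippets invented): well-formed XHTML parses, but HTML's optional end tags do not:

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML is valid XML, so this parses fine:
ok = ET.fromstring("<html><body><p>hi</p></body></html>")

# But HTML allows an unclosed <p>, which the XML parser rejects:
try:
    ET.fromstring("<html><body><p>one<p>two</body></html>")
    failed = False
except ET.ParseError:
    failed = True
print(failed)  # True
```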


I was under the impression that HTML4 introduced XML requirements, thus requiring the </p> tag.


A programmer has a problem which requires parsing.

He decides to use regex.

Now he has two problems.


A HN user reposts a SO post from ages ago.

Another HN user decides to repost a relevant joke from ages ago.

Now HN is going down the drain.


In audio/video format... best when he goes off the deep end and begins speaking in tongues towards the end of the post http://www.youtube.com/watch?v=pQgNRKpmFuo


If you need something fast but not necessarily 100% correct, such as for a real-time code syntax highlighter in JavaScript, RegEx is fine.
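A toy example of that trade-off (keyword list and colors made up): one regex pass, no parsing, wrong on plenty of edge cases, but fast:

```python
import re

# One alternation, first match wins; named groups tell us which rule fired.
TOKENS = re.compile(
    r'(?P<comment>#[^\n]*)'      # line comment
    r'|(?P<string>"[^"\n]*")'    # naive string literal (no escapes!)
    r'|(?P<kw>\b(?:def|return|if|else)\b)'  # tiny invented keyword list
)

# ANSI colors for the demo:
COLORS = {"comment": "\033[90m", "string": "\033[32m", "kw": "\033[35m"}

def highlight(code):
    def paint(m):
        return COLORS[m.lastgroup] + m.group(0) + "\033[0m"
    return TOKENS.sub(paint, code)

print(highlight('def f(): return "hi"  # demo'))
```

It will happily mis-color a `#` inside a string, which is precisely the "not necessarily 100% correct" part — and for a live highlighter, nobody cares.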


You cannot parse HTML with regex. You can find and match strings, but you can't actually parse html with regex.

Chuck Norris can parse HTML with regex.


"asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system"

This cracked me up :)


Hilarious and entertaining.


I reckon that technically regex is a tool that can be used to parse HTML. It's just that you could only use it in a very trivial way that would be better suited to other tools.


No. Technically, you cannot parse HTML with regular expressions. You can find certain strings in HTML which is a different thing.


At the very least, you can use regex to match individual characters as you scan the HTML for parsing. It's an inefficient and stupid way to do it, but it is still something you can do. And in that case, regex is technically a tool that you are using to parse HTML, even though 99.9% of the work is being done by non-regex code.


Ruby regular expressions have a \g operator which lets you call a subexpression - so technically they could be used to parse HTML if you're a masochist.


Propriety requires that someone point out that any "regular expression" permitting recursion is not, in fact, a regular expression in the formal sense. Luckily for anybody wanting to do it anyway, parsing expression grammars can handle that sort of thing with theoretical aplomb:

http://en.wikipedia.org/wiki/Parsing_expression_grammar



