While I agree with the answer, what most people take from it is "don't use regex to scrape data from HTML", which isn't exactly its point. Parsing HTML and scraping data from it are two different things.
If you know the exact HTML you are working with, using regex to extract the data is, in my opinion, a superior way of doing it: fewer lines and generally less complexity. (Taking the name and ID of an Amazon product from a single known site is a different problem from taking all the links out of an arbitrary page.)
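A sketch of that first case in Python (the markup and field names below are invented stand-ins for whatever the known page actually contains):

```python
import re

# Hypothetical product-page markup -- these patterns are tied to this
# exact, known layout and would break on anything else.
html = '<span id="productTitle">Acme Widget</span>\n<input name="ASIN" value="B000TEST00">'

title = re.search(r'<span id="productTitle">([^<]*)</span>', html).group(1)
asin = re.search(r'name="ASIN" value="([^"]*)"', html).group(1)
print(title, asin)
```

Two one-liners, no parser, no DOM traversal; the trade-off is that the patterns are brittle against any change in the page.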
With today's widely available DOM manipulation tools (jsdom, PhantomJS, Zombie) and proper HTML parsers in JavaScript, there is absolutely no reason to use regexes.
There is an XPath implementation for everything under the sun now (I've seen and used one even in Erlang), and such an implementation beats regexes in readability and LOC nine times out of ten.
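Even Python's standard library ships a small XPath subset in ElementTree, which is often enough; a sketch:

```python
import xml.etree.ElementTree as ET

# ElementTree supports a limited XPath subset -- enough for simple
# "all links in this fragment" queries, with no regex in sight.
doc = ET.fromstring('<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>')
hrefs = [a.get("href") for a in doc.findall(".//a")]
print(hrefs)
```

(ElementTree needs well-formed XML; for full XPath 1.0 over real-world HTML you'd reach for something like lxml.)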
The only real reason to use regexes is when dealing with HTML so broken that parts of it are inaccessible through a parser.
On the other hand, the world would really benefit from someone updating the libxml2 HTML parser to match the HTML(5) spec, since it has popular bindings for many other languages (e.g. lxml in Python). The current implementation is broken in lots of (not-so-)edge cases. This becomes a problem when the choice is between being fast and wrong (libxml2) and being slow and correct (html5lib or one of the other high-level implementations of the standardised algorithm).
The answer is pretty entertaining, but in context it's pedantic in the extreme. The poster's question was about matching opening tags that don't contain a closing slash, which is a tiny (regular) subset of HTML. You don't need a pushdown automaton to recognize it.
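That regular subset really is matchable with a plain regex; a sketch (the pattern is illustrative, not spec-complete):

```python
import re

# Opening tags with no closing slash: a letter-initial name, any
# attribute soup that stays inside the tag, and a '>' not preceded by '/'.
open_tag = re.compile(r'<[A-Za-z][A-Za-z0-9]*\b[^>]*(?<!/)>')

for s in ("<p>", '<a href="x">', "<br/>", "</p>"):
    print(s, bool(open_tag.fullmatch(s)))
```

It accepts `<p>` and `<a href="x">` while rejecting `<br/>` and `</p>` -- exactly the distinction the original question was asking about.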
English, like any other natural language, is (at least mostly) a context-free language too, but you wouldn't go around telling people that they should never use regexen to match certain constructions in English text, right?
I'm well aware that English isn't a formal language; that's why I added the qualifier "mostly". The great majority of expressions in natural languages can, in fact, be accounted for with CFGs, and purely CFG-based phrase structure grammars have been proposed (see Gazdar and Pullum's work on Generalized Phrase Structure Grammar from the early 80s, if you're interested). Many of Chomsky's original claims about the weak generative capacity of CFGs with respect to natural language, which gave rise to transformational syntactic frameworks, have since been disproven.
Whether or not there is an absolutely snug fit between CFGs formally and natural language "in the wild", so to speak, is another topic, and rather beside the point of the analogy. Context-sensitive grammars are overly expressive, and regular grammars much too weak, for much the same reason they are too weak for HTML. Were there a perfect English-language parser, you would not need it to match regular subsets of English, just as you do not need a full HTML parser to match regular subsets of HTML.
I had one class where we had to build a multi-threaded search engine in Java, and parsing HTML with regex was a requirement. Regex was the downfall of about 50% of the class, and most of the students who did well still had slight issues with their HTML parsing. The moral of the story is that regex is a poor solution for HTML. Not to mention, hours spent debugging regex is one of the least meaningful or rewarding experiences you can have as a programmer.
The obvious workaround for that broken requirement is to use regular expressions to handle the tokenization, and then write a simple recursive descent parser on top of that. Even if this is playing fast and loose with the requirements, it would work out fine, and you would almost certainly get a good grade if you explained why you did it.
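A minimal sketch of that split, assuming well-formed input with no self-closing or void tags (all names here are invented):

```python
import re

# Regex does the tokenizing; recursion does the nesting.
# Each token is (closing_name, opening_name, text) -- two of the three are empty.
TOKEN = re.compile(r'</(\w+)>|<(\w+)[^>]*>|([^<]+)')

def parse(html):
    tokens = TOKEN.findall(html)
    pos = 0

    def node(name):
        nonlocal pos
        children = []
        while pos < len(tokens):
            close, open_, text = tokens[pos]
            pos += 1
            if close:                     # closing tag ends this node
                return (name, children)
            if open_:                     # opening tag starts a child node
                children.append(node(open_))
            elif text.strip():
                children.append(text.strip())
        return (name, children)

    return node("root")

print(parse("<div><p>hi</p><p>bye</p></div>"))
```

The regex never has to count anything; the call stack of `node` is the pushdown automaton.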
Because, no, you can't parse XHTML with regex, as is easily shown by the pumping lemma and all that jazz.
But there's no freaking reason why you can't tokenize an XML start tag with a regex! In fact, you'll probably find that most real-life parsers use regexes to tokenize at the level they can handle before running a parser over the resulting tokens for the part that actually needs a CFG (among other reasons, because a compiled FSM is a lot faster than even a limited LALR parser).
Looking at this specific example, we can refer to the definitions for start tags [1] and empty element tags [2] in XML, and see that all their constituent rules form a regular language (if you don't believe me, it's not too hard to check for yourself). So, especially since the original question doesn't even mention 'parsing', can we all please just shut up? (Unless you actually want to work out the horrible mess necessary to derive a regex from the spec :P)
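For the curious, a rough ASCII-only approximation of those productions (the real Name rule admits a much larger Unicode range, so this is a sketch of the shape, not the spec):

```python
import re

# Simplified STag / EmptyElemTag: a name, zero or more quoted attributes,
# and an optional trailing '/' for empty-element tags.
NAME = r'[A-Za-z_:][A-Za-z0-9._:-]*'
ATTR = rf'\s+{NAME}\s*=\s*("[^"]*"|\'[^\']*\')'
START_TAG = re.compile(rf'<{NAME}(?:{ATTR})*\s*/?>')

for s in ('<a href="x">', "<br/>", "</a>", "<a href=x>"):
    print(s, bool(START_TAG.fullmatch(s)))
```

Note there is no recursion or counting anywhere: the productions compose by concatenation and repetition only, which is exactly what makes the language regular.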
I reckon that, technically, regex is a tool that can be used to parse HTML. It's just that you could only use it in a very trivial way that would be better suited to other tools.
At the very least, you can use regex to match individual characters as you scan the HTML for parsing. It's an inefficient and stupid way to do it, but it is still something you can do. And in that case, regex is technically a tool that you are using to parse HTML, even though 99.9% of the work is being done by non-regex code.
Ruby regular expressions have a \g operator that lets you call a subexpression recursively, so technically they could be used to parse HTML, if you're a masochist.
Propriety requires that someone point out that any "regular expression" permitting recursion is not, in fact, a regular expression in the formal sense. Luckily for anybody wanting to do it anyway, parsing expression grammars can handle that sort of thing with theoretical aplomb:
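The recursion that a formally regular expression can't express is exactly what a PEG rule like `element <- '<b>' element* '</b>'` captures; here's a hand-rolled sketch of that single rule (no PEG library, names invented):

```python
def element(s, pos=0):
    """Return the index just past one balanced <b>...</b> starting at pos, else None."""
    if not s.startswith("<b>", pos):
        return None
    pos += 3
    # element* : greedily consume zero or more nested elements
    while (nxt := element(s, pos)) is not None:
        pos = nxt
    if s.startswith("</b>", pos):
        return pos + 4
    return None

def balanced(s):
    return element(s) == len(s)

print(balanced("<b><b></b><b></b></b>"), balanced("<b><b></b>"))
```

The function calling itself is the recursion the grammar needs -- the very thing a true regular expression is, by definition, denied.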