BBCode to HTML transformation rules


I have written a very simple BBCode parser in C# that transforms BBCode to HTML. Currently it supports only the [b], [i] and [u] tags. I know that BBCode is always considered valid regardless of what the user has typed. I cannot find a strict specification of how to transform BBCode to HTML.


  1. Does a standard "BBCode to HTML" specification exist?
  2. How should I handle "[b][b][/b][/b]"? For now the parser yields "<b>[b][/b]</b>".
  3. How should I handle "[b][i][u]zzz[/b][/i][/u]" input? Currently my parser is smart enough to produce "<b><i><u>zzz</u></i></b>" for this case, but I wonder whether that approach is "too smart".

More details

I have found some ready-to-use BBCode parser implementations, but they are too heavy/complex for me and, what is worse, they use tons of regular expressions and do not produce the markup I expect. Ideally, I want XHTML as the output. To infer the "BBCode to HTML" transformation rules I am using this online parser: http://www.bbcode.org/playground.php. It produces HTML that is intuitively correct in my opinion. The only thing I dislike is that it does not produce XHTML. For example, "[b][i]zzz[/b][/i]" is transformed to "<b><i>zzz</b></i>" (note the order of the closing tags). Firebug of course shows this as "<b><i>zzz</i></b><i></i>". As I understand it, browsers fix such cases of wrongly ordered closing tags, but I have two doubts:

  1. Should I rely on this browser behavior and not try to produce XHTML?
  2. Maybe "[b][i]zzz[/b]ccc[/i]" should be understood as "<b>[i]zzz</b>ccc[/i]"? That looks logical for such improper formatting, but it conflicts with what popular forums output (bold-italic "zzz" followed by italic "ccc", rather than bold "[i]zzz" followed by plain "ccc[/i]").



On your first question, I don't think that relying on browsers to correct any kind of mistake is a good idea, regardless of the scope of your project (well, maybe except when you're actually running bug tests on the browser itself). Some browsers might do an awesome job of that while others might fail miserably. The best way to make sure the output syntax is correct (or at least as correct as possible) is to send it to the browser with correct syntax in the first place.

Regarding your second question, since you're trying to have correct BBCode converted to correct HTML, if your input is [b][i]zzz[/b]ccc[/i], its correct HTML equivalent would be <i><b>zzz</b>ccc</i> and not <b>[i]zzz</b>ccc[/i]. And this is where things get complicated, as you would not be writing just a converter anymore, but also a syntax checker/corrector. I have written a similar script in PHP for a rather weird game engine scripting language, but the logic could easily be applied to your case. Basically, I had a flag set for each opening tag and checked whether the closing tag was in the right position. Of course, this gives limited functionality, but for what I needed it did the trick. If you need more advanced search patterns, I think you're stuck with regex.
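The flag-per-tag idea generalizes to a stack of open tags. Here is a minimal sketch of that, written in Python for brevity (the logic should port directly to C#). The function name `bbcode_to_html` and the close-and-reopen policy for misnested tags are my own assumptions, not part of any BBCode specification, and the regex is used only to find tag tokens, not to match nesting.

```python
import html
import re

# Tokenizer only: finds [b], [/b], [i], [/i], [u], [/u].
TOKEN = re.compile(r"\[(/?)([biu])\]", re.IGNORECASE)

def bbcode_to_html(text):
    out = []     # output fragments
    stack = []   # currently open tags, innermost last
    pos = 0
    for m in TOKEN.finditer(text):
        out.append(html.escape(text[pos:m.start()]))  # plain text before the tag
        pos = m.end()
        closing, tag = m.group(1) == "/", m.group(2).lower()
        if not closing:
            stack.append(tag)
            out.append(f"<{tag}>")
        elif tag in stack:
            # Close everything opened after `tag`, close `tag`, then reopen the rest.
            reopened = []
            while True:
                top = stack.pop()
                out.append(f"</{top}>")
                if top == tag:
                    break
                reopened.append(top)
            for t in reversed(reopened):
                stack.append(t)
                out.append(f"<{t}>")
        else:
            # Assumption: a closing tag with no matching opener is printed literally.
            out.append(html.escape(m.group(0)))
    out.append(html.escape(text[pos:]))
    while stack:  # close whatever is still open at the end
        out.append(f"</{stack.pop()}>")
    return "".join(out)
```

With this policy, `bbcode_to_html("[b][i]zzz[/b]ccc[/i]")` gives `<b><i>zzz</i></b><i>ccc</i>`, which is properly nested and renders the same as the <i><b>zzz</b>ccc</i> form above. The trade-off is that heavily misnested input can leave empty elements in the output.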


If you're only going to implement B, I and U, which aren't terribly important tags, why not simply keep a counter for each of those tags: +1 each time it is opened, and -1 each time it's closed.

At the end of a forum post (or whatever), if there are still-open tags, simply close them. If the user puts in invalid BBCode, it may look strange for the duration of their post, but it won't be disastrous.
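A rough sketch of the counter idea, in Python for brevity (it should translate to C# almost line for line). The name `convert_with_counters` is made up for illustration, and two details are assumptions of mine: a closing tag whose counter is already zero is left as literal text, and leftover tags are closed in a fixed u, i, b order because counters alone do not remember the order in which tags were opened.

```python
import html
import re
from collections import Counter

TOKEN = re.compile(r"\[(/?)([biu])\]")

def convert_with_counters(text):
    open_count = Counter()  # tag name -> number of currently open tags

    def replace(m):
        closing, tag = m.group(1), m.group(2)
        if not closing:
            open_count[tag] += 1
            return f"<{tag}>"
        if open_count[tag] == 0:
            return m.group(0)  # assumption: stray closing tag stays literal
        open_count[tag] -= 1
        return f"</{tag}>"

    # Escape the user's text first; html.escape leaves [, ] and / untouched.
    body = TOKEN.sub(replace, html.escape(text))
    # Close whatever is still open (fixed order, since counters keep no order).
    tail = "".join(f"</{tag}>" * open_count[tag] for tag in "uib")
    return body + tail
```

For example, `convert_with_counters("[b]bold [i]both")` returns `<b>bold <i>both</i></b>`. Note that counters only balance the tags; they do not reorder them, so `[b][i]zzz[/b][/i]` still comes out as `<b><i>zzz</b></i>`.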

Regarding invalid user-submitted markup, you have at least three options:

  1. Strip it out
  2. Print it literally, i.e. don't convert it to HTML
  3. Attempt to fix it.

I don't recommend 3. It gets really tricky really fast. 1 and 2 are both reasonable options.
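To make options 1 and 2 concrete, here is a small sketch in Python (portable to C#). It treats a tag as valid only if it pairs up with properly nested partners; everything else is either stripped or printed literally. The function name `convert`, the `invalid` parameter, and this particular definition of "invalid" are assumptions for illustration only.

```python
import html
import re

TOKEN = re.compile(r"\[(/?)([biu])\]")

def convert(text, invalid="literal"):
    tokens = list(TOKEN.finditer(text))
    matched = set()  # indexes of tokens that belong to a properly nested pair
    stack = []       # indexes of opening tokens not yet closed
    for idx, m in enumerate(tokens):
        if not m.group(1):
            stack.append(idx)
        elif stack and tokens[stack[-1]].group(2) == m.group(2):
            matched.add(stack.pop())
            matched.add(idx)
        # otherwise: a closing tag that does not match the innermost opener

    out, pos = [], 0
    for idx, m in enumerate(tokens):
        out.append(html.escape(text[pos:m.start()]))
        pos = m.end()
        if idx in matched:
            out.append(f"<{m.group(1)}{m.group(2)}>")  # real HTML tag
        elif invalid == "literal":
            out.append(m.group(0))                     # option 2: print as typed
        # invalid == "strip": option 1, emit nothing
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

For the input `[b][i]zzz[/b]ccc[/i]`, the default gives `[b]<i>zzz[/b]ccc</i>`, while `invalid="strip"` gives `<i>zzzccc</i>`. Either way the emitted HTML is always well nested, because only pairs that match at the top of the stack are converted.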

As for how to parse BBCode, I strongly recommend against using regex. BBCode is actually a fairly complex language. Most significantly, it supports nesting of tags. Regex can't handle arbitrary nesting, at least not without engine-specific extensions. That's one of the fundamental limitations of regex, and it makes regex a bad choice for parsing languages like HTML and BBCode.

For my own project, rbbcode, I use a parsing expression grammar (PEG). I recommend using something similar. In general, these types of tools are called "compiler compilers," "compiler generators," or "parser generators." Using one of these is probably the sanest approach, as it allows you to specify the grammar of BBCode in a clean, readable format. You'll have fewer bugs this way than if you use regex or attempt to build your own state machine.
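To give a feel for what a PEG looks like, here is a toy grammar for the [b]/[i]/[u] subset together with a hand-written recursive-descent parser that follows it, in Python for brevity. This is not rbbcode's actual grammar, just a sketch; a parser generator would produce the equivalent code from the grammar for you. The name `to_html` is hypothetical.

```python
import html

# Toy PEG (illustrative only):
#   document <- node*
#   node     <- element / any single character
#   element  <- "[" tag "]" (!close node)* close     where close is "[/" tag "]"
TAGS = ("b", "i", "u")

def to_html(text):
    memo = {}  # packrat-style cache: start position -> element parse result

    def parse_element(pos):
        if pos in memo:
            return memo[pos]
        result = None
        for tag in TAGS:
            opener, closer = f"[{tag}]", f"[/{tag}]"
            if text.startswith(opener, pos):
                inner, end = parse_nodes(pos + len(opener), closer)
                if text.startswith(closer, end):
                    result = (f"<{tag}>{inner}</{tag}>", end + len(closer))
                break  # only one opener can match at a given position
        memo[pos] = result
        return result

    def parse_nodes(pos, stop=None):
        out = []
        while pos < len(text):
            if stop is not None and text.startswith(stop, pos):
                break  # the "!close" lookahead from the grammar
            parsed = parse_element(pos)
            if parsed is None:
                out.append(html.escape(text[pos]))  # fall back to literal text
                pos += 1
            else:
                fragment, pos = parsed
                out.append(fragment)
        return "".join(out), pos

    return parse_nodes(0)[0]
```

Here `to_html("[b][b][/b][/b]")` yields `<b><b></b></b>`, and the misnested `[b][i]zzz[/b]ccc[/i]` yields `[b]<i>zzz[/b]ccc</i>`: tags that do not form a well-nested element simply fall through to literal text. The cache keeps unclosed openers from causing exponential backtracking, though very deep nesting would still hit Python's recursion limit in this sketch.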

