Blog
Dubious Use Of HTML Entities
Date: 16/6/2005
I have no idea if this is legal or not, but I have an HTML page that uses HTML entities inside a font tag to encode characters in a particular charset. Or at least that is what I think it's doing. My HTML control of course barfs on it, and I'm interested in fixing it to cope.

Firstly the font tag is:
<font face="&iuml;&frac14;&shy;&iuml;&frac14;&sup3; &iuml;&frac14;.$B!k.(B&atilde;&#130;.$B!-.(B&atilde;&#130;&middot;&atilde;&#131;&#131;&atilde;&#130;&macr;">
And the meta tag in the head of the HTML document says charset=ISO-2022-JP, so that font face should decode to a usable Japanese font name.

The question is how?

My understanding of entities is that each one decodes to a Unicode code point (a UTF-32 value), which you then ram into the output text stream as raw Unicode. And the bits between entities get decoded using the document's character set, as declared in the meta tag at the top of the document.
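
Here is a rough Python sketch of that model, just to pin down what I mean. The sample bytes and the name `decode_normally` are made up for illustration, and `html.unescape` follows HTML5 rules, so it only approximates what any particular HTML control does.

```python
import html

def decode_normally(raw: bytes, charset: str) -> str:
    """Decode document bytes with the declared charset, then expand entities."""
    text = raw.decode(charset)   # bytes between entities use the document charset
    return html.unescape(text)   # each entity becomes one Unicode code point

# Hypothetical sample: one katakana character in ISO-2022-JP followed by an entity.
sample = "\u30b4 &eacute;".encode("iso2022_jp")
result = decode_normally(sample, "iso2022_jp")
print([f"U+{ord(c):04X}" for c in result])
```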

This doesn't appear to be the case in this document. I think what it might be doing is encoding the UTF-8 bytes of the font face as entities, one entity per byte. But the problem with that is that in the general case you can have entities that decode to values greater than 255, and those can't be a single byte of UTF-8 or of the document charset.
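
To make that guess concrete, here is a minimal sketch that treats each entity as one byte and refuses anything above 255. The function name `entities_to_bytes` is hypothetical, it only handles named and decimal numeric entities, and it ignores any text between them.

```python
import re
from html.entities import name2codepoint

ENTITY = re.compile(r"&(#\d+|\w+);")

def entities_to_bytes(text: str) -> bytes:
    """Turn a run of entities into one byte per entity.

    Raises ValueError for any entity above 255, which is the general-case
    problem: such a value cannot be a single byte.
    """
    out = bytearray()
    for match in ENTITY.finditer(text):
        ref = match.group(1)
        code = int(ref[1:]) if ref.startswith("#") else name2codepoint[ref]
        if code > 255:
            raise ValueError(f"&{ref}; is U+{code:04X}, not a byte")
        out.append(code)
    return bytes(out)

raw = entities_to_bytes("&iuml;&frac14;&shy;&iuml;&frac14;&sup3;")
print(raw.hex())                                        # the six byte values, as hex
print([f"U+{ord(c):04X}" for c in raw.decode("utf-8")])  # code points after UTF-8 decoding
```

An entity such as `&euro;` (U+20AC) makes the function raise, which is the case that would need a different strategy.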

Ok my brain hurts. Does anyone know what is going on here?
Comments:
fret
16/06/2005 5:54am
Ok, from what I can figure out, '&iuml;&frac14;&shy;' decodes to the Unicode character U+FF2D (which looks like an 'M') and '&iuml;&frac14;&sup3;' decodes to the Unicode character U+FF33 (which looks like an 'S'). Then there is a space and after that things fall apart.
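
The arithmetic behind those two values, assuming each entity stands for one byte of a 3-byte UTF-8 sequence. The helper name `decode_utf8_3byte` is just for illustration.

```python
def decode_utf8_3byte(b0: int, b1: int, b2: int) -> int:
    """Combine a 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) into a code point."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

# &iuml; = 0xEF, &frac14; = 0xBC, &shy; = 0xAD, &sup3; = 0xB3
print(f"U+{decode_utf8_3byte(0xEF, 0xBC, 0xAD):04X}")  # U+FF2D
print(f"U+{decode_utf8_3byte(0xEF, 0xBC, 0xB3):04X}")  # U+FF33
```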
fret
16/06/2005 6:09am
The next group of entities, '&iuml;&frac14;', is the start of a 3-byte UTF-8 character, but the next character is a period (0x2E), which is not a valid UTF-8 trailing byte. Thus the decoding of UTF-8 breaks.

The first 2 unicode characters form the prefix of some common Japanese font names. So I feel I'm on the right track.
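
The check that fails on the third group can be sketched like this. UTF-8 trailing bytes must fall in the range 0x80 to 0xBF, and a period does not. The function name `is_continuation` is my own.

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 trailing bytes always have the bit pattern 10xxxxxx (0x80 to 0xBF)."""
    return (byte & 0xC0) == 0x80

lead, second, third = 0xEF, 0xBC, ord(".")  # &iuml;, &frac14;, then a literal period
print(is_continuation(second))  # True
print(is_continuation(third))   # False

try:
    bytes([lead, second, third]).decode("utf-8")
except UnicodeDecodeError as err:
    print("decode failed:", err.reason)
```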
fret
16/06/2005 6:40am
I'm expecting the font name to be something like:

  • MS ゴシック
  • MS 明朝

Or some such. I don't actually have these fonts on my system or even have access to a system with these fonts.
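
One possible explanation for the part that falls apart, and this is purely a guess: if each '.' in front of `$B` and `(B` in the font tag is really an ESC byte (0x1B) that got displayed as a period, then those runs are ISO-2022-JP escape sequences rather than stray ASCII. The sketch below tries that reading in three steps: decode with the document charset, expand the entities, then treat the resulting characters as bytes of UTF-8. The name `recover_face` and the ESC substitution are assumptions, not something the page confirms.

```python
import re
from html.entities import name2codepoint

ESC = "\x1b"
# The font face exactly as it appears in the post.
ORIGINAL = ("&iuml;&frac14;&shy;&iuml;&frac14;&sup3; &iuml;&frac14;.$B!k.(B"
            "&atilde;&#130;.$B!-.(B&atilde;&#130;&middot;"
            "&atilde;&#131;&#131;&atilde;&#130;&macr;")
# Assumption: the '.' before "$B" and "(B" was an ESC byte in the real file.
FACE = ORIGINAL.replace(".$B", ESC + "$B").replace(".(B", ESC + "(B")

def expand(match: re.Match) -> str:
    ref = match.group(1)
    # Numeric references are taken at face value (130 means U+0082), unlike HTML5.
    return chr(int(ref[1:]) if ref.startswith("#") else name2codepoint[ref])

def recover_face(face: str) -> str:
    text = face.encode("ascii").decode("iso2022_jp")  # step 1: document charset
    text = re.sub(r"&(#\d+|\w+);", expand, text)       # step 2: entities to code points
    return text.encode("latin-1").decode("utf-8")      # step 3: code points as UTF-8 bytes

name = recover_face(FACE)
print([f"U+{ord(c):04X}" for c in name])
```

Under that assumption the two escape sequences decode to '°' (0xB0) and '´' (0xB4), which are exactly the missing UTF-8 trailing bytes, and the whole face comes out as ＭＳ Ｐゴシック. That would fit the family of names I was expecting, but it rests entirely on the ESC guess.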
Carlos Rocha
17/06/2005 4:49pm
LOL