Bug #662: Incorrect Unicode encoding of Exif UserComment tag - Exiv2

Bug #662

Incorrect Unicode encoding of Exif UserComment tag

Added by Leo Sutic almost 12 years ago. Updated over 11 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Category:

metadata

Target version:

0.20

Start date:

18 Dec 2009

Due date:

% Done:

100%

Estimated time:

Description

In the Exif UserComment tag, the characters may be encoded as Ascii, Unicode, JIS or "Undefined".

This bug concerns the encoding when choosing Unicode encoding.

Exiv2 uses UTF-8. Exiftool, Windows and Microsoft Photo Info use UCS-2 and decode UserComments written by exiv2 as a jumble of glyphs. Exiv2, OTOH, can't decode the tags written by the aforementioned programs. There is a problem here.

So, which one is right?

The Exif 2.1 spec references the "Unicode Standard, The Unicode Consortium, 1991, Addison-Wesley". This is the 1.0.0 version of the Unicode spec, and back then UCS-2 was the default encoding. The Exif 2.2 spec makes no reference to any specific version of the Unicode standard, but we can assume that it is the intention of the Exif standards body to make Exif 2.2 backwards compatible with Exif 2.1.

I would therefore say that it is Exiv2 that is in error here.

If the Exiv2 team finds this argument valid, we need a short and a long term solution. The short term solution is intended to bridge the gap between now and until all images whose UserComment fields have been written as UTF-8 have been converted to UCS-2. The long term solution is to completely switch to UCS-2.

My suggested short-term solution is this:

This only applies to UserComment tags marked "Unicode".
Make Exiv write UCS-2 tags, always. Optionally, allow the user to apecify a "UTF-8"-encoding, in case someone really needs it.
On read, decode the 8-byte charset specifier. If it is Unicode:
1. Look at the tag size. If odd, it can't be UCS-2 (as it is a fixed-size, 2-byte ecoding). Decode as UTF-8.
2. Look for a zero byte in the text. If one is found, the text can't be UTF-8, as that encoding has no zero bytes, and the UserComment field isn't null-terminated. Decode as UCS-2.
3. If no zero bytes, and even tag length - check for any bytes > 0x7f or <= 0x8. If none, decode as UTF-8, as it is likely to be an Ascii comment written as UTF-8. (The limits have been chosen to allow everything from tab to the end of Ascii.)
4. Decode as UCS-2.

There will always be corner cases, but right now exiv2 UserComment tags can't be read by any other Exif viewer, so the fact that this hasn't been changed yet tells me that people don't use exiv2 to write UserComment tags all that much. We are therefore trading failure on corner cases of a feature that isn't used all that much against having that feature work at all with other programs. I think the trade-off it worth it.

Additionally:

A conversion tool that reads UTF-8 and writes UCS-2 could be created.
Perhaps some parameter could be passed to the parser to force parsing of UserComment as UTF-8 or UCS-2, for the corner cases.

I'd send a patch, but I'd like to run this past you first to see if there is any interest in accepting such a patch, should I write it, before I go through the work of doing it.

Files

unicode_support.patch (11.8 KB) unicode_support.patch		Leo Sutic, 07 Jan 2010 09:52
exiv2-exifcomment-unicode.patch (14.8 KB) exiv2-exifcomment-unicode.patch	The patch.	Leo Sutic, 11 Jan 2010 15:09
exiv2-bug662.jpg (1.34 KB) exiv2-bug662.jpg	To be placed in test/data/	Leo Sutic, 11 Jan 2010 15:09
convert_bug.patch (1.05 KB) convert_bug.patch		Leo Sutic, 12 Jan 2010 09:40
exiv2-exifcomment-unicode.patch (6.64 KB) exiv2-exifcomment-unicode.patch		Leo Sutic, 12 Jan 2010 10:09

Related issues

Associated revisions

Revision 1998 (diff)
Added by Andreas Huggel almost 12 years ago

Published convertStringCharset() in the API (for #662).

Revision 2000 (diff)
Added by Andreas Huggel almost 12 years ago

#662: Patch exiv2-exifcomment-unicode.patch from Leo Sutic (unmodified, without exiv2-bug662.jpg).

Revision 2001 (diff)
Added by Andreas Huggel almost 12 years ago

#662: Mostly formatting changes and a few tweaks. Move exifcomment tests to bugfixes-test.sh

Revision 2002 (diff)
Added by Andreas Huggel almost 12 years ago

#662: Updated expected test results.

Revision 2003 (diff)
Added by Andreas Huggel almost 12 years ago

#662: Fixes by Leo Sutic. Added carriage return to the special characters.

Revision 2005 (diff)
Added by Andreas Huggel almost 12 years ago

#662: Charset conversion on read and write (and if needed on copy).

Revision 2006 (diff)
Added by Andreas Huggel almost 12 years ago

#662: Added CommentValue::detectCharset and an optional parameter for the encoding to CommentValue::comment().

Revision 2011 (diff)
Added by Andreas Huggel almost 12 years ago

#662: Code tweak and updated expected test results.

Revision 2013 (diff)
Added by Andreas Huggel almost 12 years ago

#662: Detect and interpret a BOM.

Revision 2027 (diff)
Added by Andreas Huggel almost 12 years ago

#662: Added new option -n and action fixcom to exiv2 utility.

History

Updated by Andreas Huggel almost 12 years ago

Leo,

Thanks for your thoughts.

In the Exif UserComment tag, the characters may be encoded as Ascii, Unicode, JIS or "Undefined".

This bug concerns the encoding when choosing Unicode encoding.

Technically, this is a duplicate of #562. I'll comment here but we should eventually close this bug as a duplicate.

Exiv2 uses UTF-8.

Exiv2 actually doesn't do any conversion at all. It just writes the string that the applications/user provides to the field. So it's up to them to do the right thing for now.

I would therefore say that it is Exiv2 that is in error here.

Yes. Exif UserComment tags with charset set to UNICODE should be encoded in UCS-2.

If the Exiv2 team finds this argument valid, we need a short and a long term solution. The short term solution is intended to bridge the gap between now and until all images whose UserComment fields have been written as UTF-8 have been converted to UCS-2. The long term solution is to completely switch to UCS-2.

Since we don't have control over what happens to images which have already been modified, we can aim for the long term solution straight away. The concern is that applications which use Exiv2 should continue to function if possible, in particular if they already convert the comments correctly.

This only applies to UserComment tags marked "Unicode".

Make Exiv write UCS-2 tags, always.

Ok. Note that there is some existing code to convert between UCS-2 and UTF-8, used for certain Windows XP tags. There is also code used to convert between XMP and IPTC datasets, which detects a charset and converts to UTF-8 (r1908). Aim to reuse existing code and generalize where needed.

Optionally, allow the user to apecify a "UTF-8"-encoding, in case someone really needs it.

Not desirable. Instead provide backward compatibility similar to what was done for #571 with r1908.

On read, decode the 8-byte charset specifier. If it is Unicode:

Look at the tag size. If odd, it can't be UCS-2 (as it is a fixed-size, 2-byte ecoding). Decode as UTF-8.

Look for a zero byte in the text. If one is found, the text can't be UTF-8, as that encoding has no zero bytes, and the UserComment field isn't null-terminated. Decode as UCS-2.

If no zero bytes, and even tag length - check for any bytes > 0x7f or <= 0x8. If none, decode as UTF-8, as it is likely to be an Ascii comment written as UTF-8. (The limits have been chosen to allow everything from tab to the end of Ascii.)

Decode as UCS-2.

Consider something similar to what Exiftool does. It writes the UserComment tag with an Exif character code
"ASCII" if the text consists of only 7-bit characters, else it uses the Exif
character code "UNICODE" and encodes the text in UTF-16.
It encodes the UTF-16 string using the same byte order as the rest of the
Exif/TIFF structure and without a BOM.
On read it expects a UTF-16 encoded text, has some intelligence to guess the
byte order, and interprets a BOM if there is one. It doesn't seems to have any
provision for UTF-8 encoded UserComment text.

The proposed logic based on size and existence of 0-bytes in the comment will fail in many cases because some cameras write 0-byte (sometimes other characters) as fillers to this field, presumably so that it can be modified later without having to re-write the complete TIFF structure.

I'd send a patch, [...]

Please do :) I'll be happy to discuss further and comment as the work progresses.

Andreas

Updated by Andreas Huggel almost 12 years ago

Related info¶

General remarks about the lack of Exiv2 characterset conversions
http://dev.exiv2.org/boards/3/topics/show/220#message-221
http://dev.exiv2.org/boards/3/topics/show/62#message-66

The Metadata Working Group has published a technical specification with useful information on the topic.

Bugs #562, #592: Exif.Photo.UserComment unicode comment doesn't work (Debian bug #486884). Exiv2 should decode to and from UTF-16 / UCS-2.

Exiv2 vs Exiftool handling of Exif UNICODE user comments

Bug #571: Convert character set when writing XMP sidecar. Adds UTF-8 charset detection logic and charset conversion to Exiv2

Project

General

Profile

Exiv2

Issues

Bug #662

Incorrect Unicode encoding of Exif UserComment tag

Associated revisions

History

Updated by Andreas Huggel almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Related info¶

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Leo Sutic almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Andreas Huggel almost 12 years ago

Updated by Andreas Huggel over 11 years ago