who's mishandling UTF-8?

Added by Matěj Cepl over 11 years ago

Hi,

Could somebody with deeper knowledge of EXIF/IPTC tell me where the issue in http://trac.yorba.org/ticket/2247 might lie? IMHO, jbrout just passes the information down to exiv2. Are all my images irrevocably corrupted?

Thank you for any response,

Matěj


Replies (8)

RE: who's mishandling UTF-8? - Added by Robin Mills over 11 years ago

I believe that jbrout uses pyexiv2 and exiv2. I built it for them a couple of years ago, around exiv2 0.18, I think. I haven't heard this story before and I hope exiv2 isn't responsible. Perhaps Andreas can comment on the library's UTF-8 handling.

It's not really clear to me from the bug report that exiv2 is responsible for this. Have you discussed this with jbrout and/or fdiba?

RE: who's mishandling UTF-8? - Added by Andreas Huggel over 11 years ago

You need to take this question back to the application which wrote the metadata to this image. Let them decide what is wrong and if necessary forward their analysis to whoever they think is the root cause for the observed issue.

This is what exiv2 and hd show for the relevant tags from the sample picture in the referred bug report:

Iptc.Application2.Keywords                   String      7  Máminy
Iptc.Application2.Keywords                   String      5  Praha
Iptc.Application2.Keywords                   String      8  Květná
Xmp.dc.subject                               XmpBag      3  Máminy, Praha, Květná

Iptc.Application2.Keywords                     7
  0000  4d c3 a1 6d 69 6e 79                             M..miny

Iptc.Application2.Keywords                     5
  0000  50 72 61 68 61                                   Praha

Iptc.Application2.Keywords                     8
  0000  4b 76 c4 9b 74 6e c3 a1                          Kv..tn..

00001c70  3e 0a 20 20 3c 64 63 3a  73 75 62 6a 65 63 74 3e  |>.  <dc:subject>|
00001c80  0a 20 20 20 3c 72 64 66  3a 42 61 67 3e 0a 20 20  |.   <rdf:Bag>.  |
00001c90  20 20 3c 72 64 66 3a 6c  69 3e 4d c3 83 c2 a1 6d  |  <rdf:li>M....m|
00001ca0  69 6e 79 3c 2f 72 64 66  3a 6c 69 3e 0a 20 20 20  |iny</rdf:li>.   |
00001cb0  20 3c 72 64 66 3a 6c 69  3e 50 72 61 68 61 3c 2f  | <rdf:li>Praha</|
00001cc0  72 64 66 3a 6c 69 3e 0a  20 20 20 20 3c 72 64 66  |rdf:li>.    <rdf|
00001cd0  3a 6c 69 3e 4b 76 c3 84  e2 80 ba 74 6e c3 83 c2  |:li>Kv.....tn...|
00001ce0  a1 3c 2f 72 64 66 3a 6c  69 3e 0a 20 20 20 3c 2f  |.</rdf:li>.   </|
00001cf0  72 64 66 3a 42 61 67 3e  0a 20 20 3c 2f 64 63 3a  |rdf:Bag>.  </dc:|

The IPTC keywords are UTF-8 encoded; the only thing missing is the Iptc.Envelope.CharacterSet dataset, which declares the IPTC character set. (That could easily be added.)
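
For illustration, the missing dataset is small. In the IPTC IIM encoding, a dataset consists of a 0x1C marker, a record number, a dataset number and a two-byte big-endian length, followed by the value. CharacterSet is dataset 1:90, and its value for UTF-8 is the escape sequence ESC % G. The Python sketch below only builds those bytes to show what is absent from the file; the helper name iim_dataset is hypothetical, and this is not how Exiv2 itself writes the tag.

```python
import struct

# ISO 2022 escape sequence that IPTC IIM uses to declare UTF-8 (ESC % G).
UTF8_ESCAPE = b"\x1b%G"


def iim_dataset(record, dataset, value):
    """Build one standard IIM dataset (hypothetical helper, illustration only).

    Layout: tag marker 0x1C, record number, dataset number,
    two-byte big-endian length, then the value bytes.
    """
    if len(value) >= 0x8000:
        raise ValueError("standard datasets hold fewer than 32768 bytes")
    return struct.pack(">BBBH", 0x1C, record, dataset, len(value)) + value


# 1:90 is Iptc.Envelope.CharacterSet, 2:25 is Iptc.Application2.Keywords.
charset_dataset = iim_dataset(1, 90, UTF8_ESCAPE)
keyword_dataset = iim_dataset(2, 25, "Máminy".encode("utf-8"))

print(charset_dataset.hex())
print(keyword_dataset.hex())
```

A reader that sees the 1:90 dataset knows to decode the keyword bytes as UTF-8; without it, the encoding is left to guesswork.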

The XMP strings are not UTF-8 encoded. It looks like something went wrong when the XMP data was written to the file.
From the XMP data it is also clear that that was not done with Exiv2.

Andreas
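
One possible reading of the XMP bytes in the dump above, offered as an assumption rather than something established in this thread: they look like the correct UTF-8 keyword bytes that were misread as Windows-1252 text and then encoded to UTF-8 a second time. A short Python 3 sketch that reverses that round trip on bytes copied from the dump (the function name is made up for this example):

```python
# Bytes copied from the hex dump of the XMP packet (the two <rdf:li> values).
xmp_maminy = bytes.fromhex("4d c3 83 c2 a1 6d 69 6e 79")
xmp_kvetna = bytes.fromhex("4b 76 c3 84 e2 80 ba 74 6e c3 83 c2 a1")

# Bytes copied from the IPTC keyword dump, for comparison.
iptc_maminy = bytes.fromhex("4d c3 a1 6d 69 6e 79")
iptc_kvetna = bytes.fromhex("4b 76 c4 9b 74 6e c3 a1")


def undo_double_encoding(data):
    """Reverse a UTF-8 -> cp1252 -> UTF-8 round trip (assumed failure mode).

    Raises UnicodeError if the input does not fit that pattern.
    """
    garbled_text = data.decode("utf-8")          # the text as it appears in XMP
    original_bytes = garbled_text.encode("cp1252")  # back to the first encoding
    return original_bytes.decode("utf-8")


for damaged in (xmp_maminy, xmp_kvetna):
    print(damaged.decode("utf-8"), "->", undo_double_encoding(damaged))
```

If the guess holds, the repaired XMP bytes match the IPTC bytes exactly. Values containing bytes that Windows-1252 leaves undefined would not survive this round trip, so the function is not a general repair tool.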

RE: who's mishandling UTF-8? - Added by Matěj Cepl over 11 years ago

Yes, that's exactly what I hoped to get here. I have filed a new ticket against jbrout for this issue.

Thank you very much

RE: who's mishandling UTF-8? - Added by Olivier Tilloy about 11 years ago

"From the XMP data it is also clear that that was not done with Exiv2."

Andreas, do you mean that the tags were not written using libexiv2’s writer methods?
Note that pyexiv2 relies solely on libexiv2 to read and write metadata; it performs no custom I/O of its own.
Could that possibly mean that those faulty tags were in fact written by another application using another library (I’m extrapolating here)?

RE: who's mishandling UTF-8? - Added by Andreas Huggel about 11 years ago

Olivier,

Yes, that's exactly what I mean. The XMP metadata in this image was not written by Exiv2. With what you're saying I take it that means it was also not done with pyexiv2.

Andreas

RE: who's mishandling UTF-8? - Added by Matěj Cepl about 11 years ago

Thanks. We have been working on this for the last couple of days with the maintainers of both pyexiv2 and jbrout (https://bugs.launchpad.net/pyexiv2/+bug/621201, and https://code.google.com/p/jbrout/issues/detail?id=160), and the conclusion is that the XMP metadata were hopelessly corrupted by previous versions of jbrout (or the auxiliary programs it uses). However, it seems to be almost impossible to remove the corrupted metadata (https://code.google.com/p/jbrout/issues/detail?id=161); even

exiv2 -d x picture.jpg

(from exiv2-0.20-1.fc14.x86_64) doesn't seem to be able to remove corrupted XMP data from the attached image. Any help?

p20090318_024642.jpg (1.76 MB): testing image with corrupted metadata

RE: who's mishandling UTF-8? - Added by Matěj Cepl about 11 years ago

OK, this is not completely PEBKAC, but there was a mistaken assumption on my part. I thought (based on my understanding of https://code.google.com/p/jbrout/issues/detail?id=161#c1) that the problem was in XMP, but in the end it looks like it was rather the IPTC "Iptc.Application2.Keywords" tags that were corrupted. This simple script:

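#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 script; uses the pyexiv2 API as written in the original post.

import codecs
import sys

import pyexiv2

# One known-good keyword per line, stored as UTF-8.
KNOWN_KEYWORDS_FILE = "/home/matej/archiv/2010/projekty/cleanKeywords/knownKeywords.txt"

with codecs.open(KNOWN_KEYWORDS_FILE, encoding="utf-8") as f:
    knownKeywords = set(f.read().splitlines())

# The image path is the first command-line argument.
metadata = pyexiv2.ImageMetadata(unicode(sys.argv[1], "utf-8"))
metadata.read()
if 'Iptc.Application2.Keywords' in metadata.iptc_keys:
    tag = metadata['Iptc.Application2.Keywords']
    goodKeywords = []
    for raw in tag.raw_values:
        # Raw values are byte strings; decode them as UTF-8 before comparing.
        keyword = unicode(raw, "utf-8")
        # Keep only known keywords, dropping duplicates but preserving order.
        if keyword in knownKeywords and keyword not in goodKeywords:
            goodKeywords.append(keyword)
    tag.values = goodKeywords
    metadata.write(preserve_timestamps=True)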

allowed me to clear all unwanted keywords from my photos, so I have clean data now. Hopefully.

Thanks for all help.
