who's mishandling UTF-8?
Added by Matěj Cepl over 11 years ago
Hi,
Could somebody with deeper knowledge of EXIF/IPTC tell me where the issue described in http://trac.yorba.org/ticket/2247 could lie? IMHO, jbrout just passes the information down to exiv2. Are all my images irrevocably corrupted?
Thank you for any response,
Matěj
Replies (8)
RE: who's mishandling UTF-8? - Added by Robin Mills over 11 years ago
I believe that jbrout uses pyexiv2 and exiv2. I built it for them a couple of years ago - around exiv2 0.18, I think. I haven't heard this story before and I hope exiv2 isn't responsible. Perhaps Andreas can comment on the library's UTF-8 performance.
It's not really clear to me from the bug report that exiv2 is responsible for this. Have you discussed this with jbrout and/or fdiba?
RE: who's mishandling UTF-8? - Added by Andreas Huggel over 11 years ago
You need to take this question back to the application which wrote the metadata to this image. Let them decide what is wrong and if necessary forward their analysis to whoever they think is the root cause for the observed issue.
This is what exiv2 and hd show for the relevant tags from the sample picture in the referenced bug report:
Iptc.Application2.Keywords                   String      7  Máminy
Iptc.Application2.Keywords                   String      5  Praha
Iptc.Application2.Keywords                   String      8  Květná
Xmp.dc.subject                               XmpBag      3  Máminy, Praha, KvÄ›tná
Iptc.Application2.Keywords 7  0000  4d c3 a1 6d 69 6e 79            M..miny
Iptc.Application2.Keywords 5  0000  50 72 61 68 61                  Praha
Iptc.Application2.Keywords 8  0000  4b 76 c4 9b 74 6e c3 a1         Kv..tn..
00001c70  3e 0a 20 20 3c 64 63 3a 73 75 62 6a 65 63 74 3e  |>.  <dc:subject>|
00001c80  0a 20 20 20 3c 72 64 66 3a 42 61 67 3e 0a 20 20  |.   <rdf:Bag>.  |
00001c90  20 20 3c 72 64 66 3a 6c 69 3e 4d c3 83 c2 a1 6d  |  <rdf:li>M....m|
00001ca0  69 6e 79 3c 2f 72 64 66 3a 6c 69 3e 0a 20 20 20  |iny</rdf:li>.   |
00001cb0  20 3c 72 64 66 3a 6c 69 3e 50 72 61 68 61 3c 2f  | <rdf:li>Praha</|
00001cc0  72 64 66 3a 6c 69 3e 0a 20 20 20 20 3c 72 64 66  |rdf:li>.    <rdf|
00001cd0  3a 6c 69 3e 4b 76 c3 84 e2 80 ba 74 6e c3 83 c2  |:li>Kv.....tn...|
00001ce0  a1 3c 2f 72 64 66 3a 6c 69 3e 0a 20 20 20 3c 2f  |.</rdf:li>.   </|
00001cf0  72 64 66 3a 42 61 67 3e 0a 20 20 3c 2f 64 63 3a  |rdf:Bag>.  </dc:|
The IPTC keywords are UTF-8 encoded; it's just that the Iptc.Envelope.CharacterSet dataset, which indicates the IPTC character set, is missing. (That could easily be added.)
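For illustration, a minimal sketch of adding that dataset with pyexiv2 (this assumes pyexiv2 0.3's dictionary-style assignment and that the library accepts the raw IIM escape sequence as the value; the file name is just a placeholder, and the snippet is untested):

# Minimal sketch (untested): declare the IPTC character set as UTF-8.
import pyexiv2

metadata = pyexiv2.ImageMetadata('picture.jpg')   # placeholder file name
metadata.read()
# '\x1b%G' is the ISO 2022 escape sequence (ESC % G) that IIM uses to declare UTF-8.
metadata['Iptc.Envelope.CharacterSet'] = ['\x1b%G']
metadata.write()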
The XMP strings are not UTF-8 encoded. It looks like something went wrong when the XMP data was written to the file.
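For what it's worth, the bytes in the dump above look like a classic double encoding: UTF-8 text that was mistakenly decoded with an 8-bit codepage and then re-encoded as UTF-8. A small Python 2 sketch reproducing the observed byte sequences (the choice of cp1252 is my guess, based on the e2 80 ba sequence, which is UTF-8 for U+203A):

# -*- coding: utf-8 -*-
# Sketch: double-encode the keywords and compare with the hexdump above.
# Assumption: the intermediate (wrong) decode used a Windows-1252-like codepage.
for word in (u"Máminy", u"Květná"):
    mangled = word.encode("utf-8").decode("cp1252").encode("utf-8")
    print word.encode("utf-8"), "->", " ".join("%02x" % ord(c) for c in mangled)
# Prints 4d c3 83 c2 a1 6d 69 6e 79 for "Máminy" and
# 4b 76 c3 84 e2 80 ba 74 6e c3 83 c2 a1 for "Květná", matching the XMP dump.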
From the XMP data it is also clear that that was not done with Exiv2.
Andreas
RE: who's mishandling UTF-8? - Added by Matěj Cepl over 11 years ago
Yes, that's exactly what I hoped to get here. I have filed a new ticket against jbrout for this issue.
Thank you very much
RE: who's mishandling UTF-8? - Added by Matěj Cepl over 11 years ago
Plot thickens ... https://bugs.launchpad.net/pyexiv2/+bug/621201 (from https://code.google.com/p/jbrout/issues/detail?id=158#c1)
RE: who's mishandling UTF-8? - Added by Olivier Tilloy about 11 years ago
From the XMP data it is also clear that that was not done with Exiv2.
Andreas, do you mean that the tags were not written using libexiv2’s writer methods?
Note that pyexiv2 relies solely on libexiv2 to read and write metadata; it doesn't do any custom I/O operations of its own.
Could that possibly mean that those faulty tags were in fact written by another application using another library (I’m extrapolating here)?
RE: who's mishandling UTF-8? - Added by Andreas Huggel about 11 years ago
Olivier,
Yes, that's exactly what I mean. The XMP metadata in this image was not written by Exiv2. From what you're saying, I take it that it was also not written with pyexiv2.
Andreas
RE: who's mishandling UTF-8? - Added by Matěj Cepl about 11 years ago
Thanks. We have been working on this for the last couple of days with the maintainers of both pyexiv2 and jbrout (https://bugs.launchpad.net/pyexiv2/+bug/621201 and https://code.google.com/p/jbrout/issues/detail?id=160), and the conclusion is that the XMP metadata was hopelessly corrupted by previous versions of jbrout (or the auxiliary programs it uses). However, it seems to be almost impossible to remove the corrupted metadata (https://code.google.com/p/jbrout/issues/detail?id=161); even
exiv2 -d x picture.jpg
(from exiv2-0.20-1.fc14.x86_64) doesn't seem to be able to remove corrupted XMP data from the attached image. Any help?
p20090318_024642.jpg (1.76 MB): testing image with corrupted metadata
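One thing that might be worth trying is deleting the XMP properties through pyexiv2 rather than the command line. A rough sketch, untested against the attached file and assuming pyexiv2 0.3's ImageMetadata supports deleting individual keys with del:

# Rough sketch (untested): strip all XMP properties from an image with pyexiv2.
import sys
import pyexiv2

metadata = pyexiv2.ImageMetadata(sys.argv[1])
metadata.read()
for key in list(metadata.xmp_keys):   # copy the key list before mutating
    del metadata[key]
metadata.write(preserve_timestamps=True)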
RE: who's mishandling UTF-8? - Added by Matěj Cepl about 11 years ago
OK, this is not completely PEBKAC, but there was a mistaken assumption on my part. I thought (based on my understanding of https://code.google.com/p/jbrout/issues/detail?id=161#c1) that the problem was in XMP, but in the end it looks like it was rather the IPTC tags "Iptc.Application2.Keywords" that were corrupted. This simple script:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import codecs
import pyexiv2

# Whitelist of known-good keywords, one per line, stored as UTF-8.
knownKeywords = codecs.open("/home/matej/archiv/2010/projekty/cleanKeywords/knownKeywords.txt",
                            encoding="utf-8").read().split("\n")
goodKeywords = []

metadata = pyexiv2.ImageMetadata(unicode(sys.argv[1], "utf-8"))
metadata.read()

if 'Iptc.Application2.Keywords' in metadata.iptc_keys:
    tag = metadata['Iptc.Application2.Keywords']
    # Decode each raw keyword as UTF-8 and keep it only if it is on the
    # whitelist, skipping duplicates.
    for a in tag.raw_values:
        a = unicode(a, "utf-8")
        if a in knownKeywords and a not in goodKeywords:
            goodKeywords.append(a)
    tag.values = goodKeywords
    metadata.write(preserve_timestamps=True)
allowed me to clear all unwanted keywords from my photos, so I have clean data now. Hopefully.
Thanks for all the help.