Recognizing xmp sidecar files

Added by Matthias Baas over 10 years ago

Hi,

A while back, the code for recognizing xmp sidecar files changed. The current version of libexiv2 requires the file to contain either an xpacket processing instruction or an xmpmeta tag, but according to the xmp specification both of these are optional and don't have to be present in a serialized packet. Before the more restrictive xmp check, libexiv2 processed files without the optional data just fine, but the current version throws an error saying the file contains an unknown image type.

Is there any chance the test could be relaxed again, so that those minimal xmp files can be read? (Alternatively, the isXmpType() function could also check for the rdf:RDF tag, even though an application could put other data before the actual xmp data. See section 7.3.1 in the xmp spec: "Other XML data may appear around the rdf:RDF element.")
Or maybe the decision about which handler to pick could also take the file suffix into account. I think that when a file has a ".xmp" suffix and contains something that looks like xml, it's really the xmp handler that should take care of it and no other handler.
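
To make the relaxed check concrete, here is a minimal sketch in Python (not the actual isXmpType() from xmpsidecar.cpp). The function name, the marker list and the 4096-byte window are assumptions for illustration only. It also assumes an 8-bit encoding and the conventional "x:" and "rdf:" prefixes, which the xmp spec does not mandate.

```python
# Hypothetical sketch of a relaxed, content-based check; names and the
# window size are illustrative assumptions, not libexiv2 code.
XMP_MARKERS = (b"<?xpacket", b"<x:xmpmeta", b"<rdf:RDF")


def looks_like_xmp(data: bytes, window: int = 4096) -> bool:
    """Return True if any known marker occurs in the first `window` bytes."""
    head = data[:window]
    return any(marker in head for marker in XMP_MARKERS)
```

With this check, a minimal packet that starts directly with rdf:RDF is accepted, while a gpx file is rejected as long as none of the markers shows up in the inspected window.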

Thanks,

- Matthias -


Replies (5)

RE: Recognizing xmp sidecar files - Added by Andreas Huggel over 10 years ago

Yes, this was changed because of a complaint that Exiv2 attempted to parse large GPX files and almost choked on them. The related discussion is here http://dev.exiv2.org/boards/3/topics/682#message-691 and the change here http://dev.exiv2.org/projects/exiv2/repository/revisions/2372

Can we do something that makes everybody happy?

Andreas

RE: Recognizing xmp sidecar files - Added by Matthias Baas over 10 years ago

I tried running exiv2 (the version from MacPorts) on the gpx file that was posted in that other thread. Reading the file takes about 2 seconds (which doesn't sound too bad to me), but it's true that memory consumption steadily climbs to around 100MB, which seems a bit excessive.
The library uses expat to parse the xml file, doesn't it? In that case, I don't see why reading a large xml file should result in such a memory pattern. expat is just streaming the data to handler functions and as long as no xmp data is encountered, I would expect that nothing really happens and memory usage should just stay the same. So it seems we actually have two separate issues here: 1) finding that memory bug and 2) making the xmp sidecar file detection more robust. Would 2) actually still be an issue if 1) was fixed?
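
As an illustration of the streaming behaviour described here, the following sketch uses Python's binding to expat (xml.parsers.expat). It feeds a document to the parser in small chunks and only keeps a counter, so nothing is accumulated on the Python side. The function name and the chunk size are assumptions, and this says nothing about what the XMP-SDK does on top of expat.

```python
import io
import xml.parsers.expat


def count_elements_streaming(stream, chunk_size=64 * 1024):
    """Feed `stream` to expat chunk by chunk and count start tags.

    Only a counter is kept, so no document tree is built here.
    """
    count = 0

    def on_start(name, attrs):
        nonlocal count
        count += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            parser.Parse(b"", True)  # signal end of input
            break
        parser.Parse(chunk, False)
    return count
```

For example, `count_elements_streaming(io.BytesIO(data), chunk_size=100)` walks a large gpx-like document 100 bytes at a time.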

From what I gathered from the xmp specs, the xmp packet could be part of any xml document and there could also be other information surrounding the packet. In practice though, I think it's fair enough to assume that xmp sidecar files really only contain an xmp packet and nothing else. If it's ok for the library to make this assumption, then it should probably scan for the rdf:RDF tag or even the rdf:Description tag, as these must be present in the file. An additional obstacle is that the file may not be UTF-8, but UTF-16 or UTF-32. I didn't check whether libexiv2 currently supports this, but at least I didn't see any checks for byte order marks in the code (except for the UTF-8 BOM), and string comparisons seemed to be using 8-bit strings.
But frankly, I think I would rather examine the file suffix and simply check if it's ".xmp" (case-insensitive) or not. I'm not aware of any software that uses a different suffix for the xmp sidecar files. Then you could leave it up to expat to do all the decoding and error checking.
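
Both ideas from this post can be sketched briefly in Python. The helper names below are hypothetical. The BOM table relies only on the constants in the standard codecs module, and the suffix check is the plain case-insensitive comparison described above.

```python
import codecs
import os

# Order matters: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes,
# so the UTF-32 entries must be tested first.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def has_xmp_suffix(path: str) -> bool:
    """Case-insensitive check for a ".xmp" file suffix."""
    return os.path.splitext(path)[1].lower() == ".xmp"
```

A file without a BOM yields None here, in which case the encoding would have to be left to the xml parser.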

- Matthias -

RE: Recognizing xmp sidecar files - Added by Andreas Huggel over 10 years ago

> The library uses expat to parse the xml file, doesn't it?

It uses the Adobe XMP-SDK which uses expat, yes.

> So it seems we actually have two separate issues here: 1) finding that memory bug and 2) making the xmp sidecar file detection more robust. Would 2) actually still be an issue if 1) was fixed?

Right. Probably not as nobody would notice :)

> From what I gathered from the xmp specs, the xmp packet could be part of any xml document and there could also be other information surrounding the packet. In practice though, I think it's fair enough to assume that xmp sidecar files really only contain an xmp packet and nothing else. If it's ok for the library to make this assumption, then it should probably scan for the rdf:RDF tag or even the rdf:Description tag, as these must be present in the file. An additional obstacle is that the file may not be UTF-8, but UTF-16 or UTF-32. I didn't check whether libexiv2 currently supports this, but at least I didn't see any checks for byte order marks in the code (except for the UTF-8 BOM), and string comparisons seemed to be using 8-bit strings.

Assuming there is only XMP data in a sidecar is fine; we're probably doing that now. Chances are Exiv2 can't handle UTF-16 or UTF-32 sidecars; I've never seen one. XMP-SDK supposedly converts all strings to UTF-8, so comparisons should be fine though.

Each Exiv2 image format (incl. the XMP sidecar "image" format) comes with a small function to test whether a given image is of that particular format. That's also where the recent change was implemented. In this case the function is isXmpType() at the end of xmpsidecar.cpp. Any change to this function is simple, especially compared to messing with the XMP-SDK black box.

> But frankly, I think I would rather examine the file suffix and simply check if it's ".xmp" (case-insensitive) or not. I'm not aware of any software that uses a different suffix for the xmp sidecar files.

That's not an option, as the input may not be a file. Exiv2 can also parse an image in memory, so the check must be based on the image contents; preferably it is sufficient to look at a few bytes near the start.

Andreas

RE: Recognizing xmp sidecar files - Added by Matthias Baas over 10 years ago

I finally had some more time looking into this...

Andreas Huggel wrote:

> Assuming there is only XMP data in a sidecar is fine; we're probably doing that now. Chances are Exiv2 can't handle UTF-16 or UTF-32 sidecars; I've never seen one. XMP-SDK supposedly converts all strings to UTF-8, so comparisons should be fine though.

I did a quick test where isXmpType() always returned true and passed a UTF-16 sidecar file to the library. It read the file just fine, so UTF-16 would actually be supported if isXmpType() let it through.

>> But frankly, I think I would rather examine the file suffix and simply check if it's ".xmp" (case-insensitive) or not. I'm not aware of any software that uses a different suffix for the xmp sidecar files.

> That's not an option, as the input may not be a file. Exiv2 can also parse an image in memory, so the check must be based on the image contents; preferably it is sufficient to look at a few bytes near the start.

I see.

Personally, I think isXmpType() should really just check whether the data is xml or not (to keep the check simple and to avoid rejecting valid data). In the ideal case, the data would get streamed to the xml parser, which would either trigger an error early on if it's not valid xml/xmp or just scan through the file without doing much else if it's an xml file without xmp data in it. Unfortunately, as I saw in the code, the entire file is read into memory before it's processed. But the really big memory hit happened in the constructor of the SXMPMeta class, which is part of Adobe's xmp toolkit, so it seems the Adobe code either has a bug or is just really inefficient. I noticed that the xmp toolkit version shipping with libexiv2 is 4.4.0, but on the Adobe site there is a version 5.1.2. Maybe it would be worth upgrading (hoping they have fixed that memory problem)?

Trying to detect more in isXmpType() is actually not as straightforward as it seems. You would have to start a proper xml decoding pass and check if you get some valid tags back and then scan the first few tags to see if there's anything that looks like xmp.
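
A rough sketch of such a decoding pass, again using Python's expat binding rather than anything from libexiv2. The function name, the helper exception and the limit of five elements are assumptions for illustration. The two namespace URIs are the ones used for rdf:RDF and x:xmpmeta. Matching on namespaces rather than prefixes means the check does not depend on the "rdf:" or "x:" spelling, and expat takes care of decoding a UTF-16 input that starts with a BOM.

```python
import xml.parsers.expat

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XMPMETA_NS = "adobe:ns:meta/"


class _Stop(Exception):
    """Raised from the handler to abort parsing early."""


def sniff_xmp(data: bytes, max_elements: int = 5) -> bool:
    """Parse only the first few start tags and look for xmpmeta or rdf:RDF."""
    found = False
    seen = 0

    def on_start(name, attrs):
        nonlocal found, seen
        seen += 1
        # With namespace_separator=" ", names arrive as "<uri> <local>".
        if name in (RDF_NS + " RDF", XMPMETA_NS + " xmpmeta"):
            found = True
            raise _Stop
        if seen >= max_elements:
            raise _Stop

    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    parser.StartElementHandler = on_start
    try:
        parser.Parse(data, True)
    except _Stop:
        pass
    except xml.parsers.expat.ExpatError:
        return False  # not well-formed xml (or truncated before any match)
    return found
```

The parser stops as soon as a match is found or the element limit is reached, so a large gpx file is abandoned after its first few tags.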

- Matthias -

RE: Recognizing xmp sidecar files - Added by Andreas Huggel over 10 years ago

>> Assuming there is only XMP data in a sidecar is fine; we're probably doing that now. Chances are Exiv2 can't handle UTF-16 or UTF-32 sidecars; I've never seen one. XMP-SDK supposedly converts all strings to UTF-8, so comparisons should be fine though.

> I did a quick test where isXmpType() always returned true and passed a UTF-16 sidecar file to the library. It read the file just fine, so UTF-16 would actually be supported if isXmpType() let it through.

Great! :)

> I noticed that the xmp toolkit version shipping with libexiv2 is 4.4.0, but on the Adobe site there is a version 5.1.2. Maybe it would be worth upgrading? (hoping they have fixed that memory problem)

Yes, absolutely. The issue for this upgrade (#742) has been open for too long.

-ahu.
