GSoC 2013 "Cloud Ready" Project Specification » History » Version 13

Robin Mills, 15 Apr 2013 22:11

h1. GSoC 2013 "Cloud Ready" Project Specification

There are four subprojects:

# HTTP I/O support (GSoC 2013 Student)
# exiv2(.exe) running as a service (daemon) on a web socket
# Client-side use of the exiv2 service (using the web socket)
# JSON support

This is quite a large project.  Robin Mills intends to implement the daemon/web-socket support during Spring 2013.  The GSoC student is expected to implement the HTTP I/O support.  The proposal to have a GSoC student join us will be made via KDE: http://community.kde.org/GSoC/2013/Ideas#Exiv2_.22Cloud_Ready.22_Project

h2. 1 HTTP I/O support (GSoC 2013 Student)

Today Exiv2 supports files that are available on the local file system. These files can be memory mapped if the host OS supports this feature.

With the increasing interest in "cloud" computing, it has become ever more common for files to reside in remote locations which are not mapped to the file system. Very common cases today are ftp and http, for example http://bla/bla/bla/file.jpg. There are also many "Cloud" storage products, such as AWS, DropBox, Google Drive, Sky Drive, Box, iCloud, Just Cloud and more.

The proposal is to support http, ftp and ssh. This can be done by deriving a new class from the BasicIO abstract class. The exiv2 command would then accept a URL in place of a filename. For example:

<pre>
exiv2 -pt http://clanmills.com/files/Robin.jpg
exiv2 -pt ftp://username:password@clanmills.com/Robin.jpg
exiv2 -pt ssh://username:password@clanmills.com/Robin.jpg
</pre>

In most image files, the meta-data is stored in the first 100k of the file, so the implementation should read blocks from the server only on demand and avoid copying the complete file.

The simplest possible implementation of this proposal is for exiv2 to detect the protocol and use a helper application such as curl or ssh. This implementation probably requires copying the complete file from the remote storage to a temporary file in the local file system. While such an implementation can be constructed quickly, it does not satisfy the project aim of making efficient use of bandwidth.

It is very desirable to use a robust implementation of the web protocols, and a library such as libcurl should be considered. The selection of the protocol support library must respect build implications. We should be careful to avoid adding a large library (such as boost) to the build dependencies. Additionally, the implementation is required to be written in C++ and run on Mac/Windows/Linux without dependency on platform frameworks such as .Net, Java, or Cocoa. It may be that build switches can be provided to enable Exiv2 to use platform frameworks. This could be especially useful on mobile platforms such as Android and iOS.

The implementation should provide bi-directional support (both read and write), with read access being the first priority.

!http://clanmills.com/files/Exiv2CloudReady.png!

[[http://clanmills.com/files/Exiv2CloudReady.pdf]]

h2. 2 and 3 Exiv2 daemon server and client

The aim is to enable exiv2 to run as a service (daemon) on a web socket. I imagine two types of client:

# exiv2 itself, of course
# a JavaScript/WebSocket client

This could work something like this:

<pre>
Server:      # exiv2 --daemon --port 54321
Client:      $ exiv2 -pt exv://server:54321:/Robin.jpg
Even better: $ exiv2 -pt exv://server:54321:/http://clanmills.com/files/Robin.jpg
</pre>

I don't want to go into detail about the JavaScript API here, but it could look something like this:

<pre>
<script src="js/Exiv2.js"></script>
<script>
var exiv2     = new Exiv2( { server : 'clanmills.com' , port : 54321 });
// parse the reply as JSON rather than eval-ing text received from the server
var metadata  = JSON.parse(exiv2.command('--JSON -pt /Robin.jpg'));
// or even better
var metadata  = JSON.parse(exiv2.command('--JSON -pt http://clanmills.com/files/Robin.jpg'));
</script>
</pre>

To get the most from this functionality, we should provide JSON (and/or XML) support, which I discuss below.

h2. 4 JSON Support

Five years ago, I became interested in exiv2 in order to implement a GeoTagging application. I decided to write it in Python as an excuse to learn the language. I used the pyexiv2 wrapper, written by Olivier, and the project was a success. Building exiv2 and pyexiv2 on Windows and MacOSX was a challenge (to say the least).

Since then, I've worked steadily on the exiv2 msvc and msvc64 build environments and I believe both are working very well.

Sadly, building pyexiv2 remains a challenge because it requires boost and the scons build utility. (scons is/was another GSoC project.) The consequence is that my python script seldom uses the latest exiv2 and is not available on all my machines (Windows/Cygwin/Mac/Kubuntu). The script is stable (it has hardly been changed in 5 years), however building the pyexiv2 wrapper is a maintenance challenge. pyexiv2 has to be built for specific versions of python (2.6, 2.7 etc), architecture (32/64 bit) and platform (windows/cygwin/macosx/linux).

This is not a criticism of Olivier's pyexiv2 wrapper. Olivier has done a very good job. Python wrappers which link C++ are a severe maintenance challenge. I haven't worked with Perl's C++ support (XS and/or SWIG) for years, however I anticipate similar pain and trouble.

JSON to the rescue. My proposal is to provide a JSON interface to read and write meta-data in the exiv2 command-line utility.

As a sample application to prove our JSON support, we will provide wrappers for Perl and Python. The wrappers can be written entirely in the scripting language and use the language's JSON support. There is no need to get involved with C++ integration challenges such as boost/scons/pyexiv2, XS and SWIG. When reading from files, the wrapper will call exiv2.exe ONCE to capture all the JSON. When writing to files, the wrapper will call exiv2.exe ONCE. This strategy will enable the wrappers to run on all platforms on which exiv2.exe is available.

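
To make the "call exiv2 ONCE" idea concrete, here is a minimal Python sketch of such a wrapper. It is only an illustration under stated assumptions: the JSON option proposed above does not exist yet, so the sketch parses the existing text output of exiv2 -pt (name, type, count, value, as in the sample output later on this page) and converts it to JSON-ready data. The names parse_pt and read_metadata are invented for this example.

```python
import json
import subprocess


def parse_pt(text):
    """Turn the text printed by 'exiv2 -pt' into a dictionary.

    Assumes each line holds: tag name, type, count, and an optional value,
    separated by runs of whitespace. Lines that do not fit are skipped.
    If a tag name is repeated, the last occurrence wins.
    """
    metadata = {}
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 3 or not parts[2].isdigit():
            continue
        value = parts[3].strip() if len(parts) == 4 else ""
        metadata[parts[0]] = {
            "type": parts[1],
            "count": int(parts[2]),
            "value": value,
        }
    return metadata


def read_metadata(path, exiv2="exiv2"):
    """Run exiv2 exactly once on 'path' and return the parsed metadata."""
    result = subprocess.run([exiv2, "-pt", path],
                            capture_output=True, text=True)
    return parse_pt(result.stdout)


def to_json(metadata):
    """Serialise the parsed metadata with the language's own JSON support."""
    return json.dumps(metadata, indent=2, sort_keys=True)
```

No C++ is linked into the script, so the only per-platform requirement is that exiv2(.exe) is on the path. A real wrapper would read JSON emitted by exiv2 itself rather than scrape text.
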
h2. Expected results:

# A deployed web service that provides Exiv2 services.
# A JavaScript library that enables developers to use the Exiv2 service.
# An engineering assessment of the effort involved in providing access to cloud servers such as AWS.

h2. GSoC Mentor:

Robin Mills http://clanmills.com/files/CV.pdf

I've been a volunteer on the Exiv2 project for 5 years.  I worked for Adobe for 10 years, where I implemented reading PDF and JDF files over http (without copying the complete file).  I'm now a freelance contractor and I've been working on a mobile app which uses WebSockets.  I've worked on both server and client code.

h2. Project Notes:

If you wish to submit a proposal, or discuss this project with me, then please do the following:

* Confirm with me that you have good C++ skills.
* Download and build Exiv2.

When you read the code, here are some matters you may wish to consider:

h3. 1)	Exiv2 BasicIO abstract class and HttpIO concrete class (for reading)

* I don’t remember the API in detail, however it has stream-style methods (open/close/tell/seek/read/write).
* You should derive a new class from BasicIO. It could be called HttpIO or something like that.
* HttpIO should allocate memory for the complete file and maintain a map of the blocks which have been copied from the server.
* When a read is requested, HttpIO should ensure the appropriate blocks have been fetched, update the map, and return the data.
* The copy from the server should use HTTP’s ‘byte range’ to limit the number of bytes to be copied.

Why do we need to allocate memory for the complete file? I think we just need to allocate enough space for the metadata, don't we?

# Some older HTTP servers don't support "byte range", so you have to copy the whole file.
# I’m hoping we can use the Memory Mapping IO code.  So you populate the memory with data “just in time”.
# Some file formats (eg PDF) are random access and the meta-data can be anywhere in the file.  Most JPGs have the meta-data in the first 100k, however we want our code to handle other possibilities.
# The map tells us which blocks to transmit to the server when we’ve modified the file.
# The map is very simple – an array of bools.  I suggest a block size of 8*1024 – however make sure that’s a const that we can tune.  You might also want to always prepopulate the first 100k on open.  So, when you get the “open” call, you do a 100k GET from the server and you’re in business.  Good, eh?

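
The block map described above can be sketched as follows. This is an illustration in Python only (the real class would be written in C++ and derived from BasicIO). The names BlockCache and fetch are invented for this sketch and are not part of Exiv2. fetch(first, last) stands for whatever performs the byte-range GET, and is assumed to return exactly the requested bytes, both ends inclusive.

```python
BLOCK_SIZE = 8 * 1024      # tunable constant, as suggested above
PREFETCH = 100 * 1024      # bytes requested eagerly on open


class BlockCache:
    """Buffer for a remote file plus a map of the blocks already copied."""

    def __init__(self, size, fetch, block_size=BLOCK_SIZE, prefetch=PREFETCH):
        self.size = size
        self.fetch = fetch                       # hypothetical transport hook
        self.block_size = block_size
        self.data = bytearray(size)              # space for the complete file
        n_blocks = (size + block_size - 1) // block_size
        self.present = [False] * n_blocks        # the "array of bools"
        if size and prefetch:
            self.read(0, min(prefetch, size))    # populate the head on open

    def read(self, offset, count):
        """Return up to 'count' bytes at 'offset', fetching missing blocks."""
        end = min(offset + count, self.size)
        if offset >= end:
            return b""
        last = (end - 1) // self.block_size
        block = offset // self.block_size
        while block <= last:
            if self.present[block]:
                block += 1
                continue
            # Extend over a run of missing blocks so they cost one request.
            run_end = block
            while run_end < last and not self.present[run_end + 1]:
                run_end += 1
            lo = block * self.block_size
            hi = min((run_end + 1) * self.block_size, self.size) - 1
            chunk = self.fetch(lo, hi)
            if len(chunk) != hi - lo + 1:
                raise IOError("short read from server")
            self.data[lo:hi + 1] = chunk
            for index in range(block, run_end + 1):
                self.present[index] = True
            block = run_end + 1
        return bytes(self.data[offset:end])
```

Because requests are rounded out to whole blocks, the 100k prefetch actually asks for 13 blocks of 8k. A repeated read of the same bytes is served from the buffer without touching the server.
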
h3. 2)	HttpIO and writing

* I’ve never needed to write "byte-ranges" over http.  We need to research this.
* However, the map should record which parts of the file have changed, so that only those blocks are sent to the server.

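
As a small illustration of the second point, a per-block "dirty" map can be collapsed into the byte ranges that would need to be sent back. This Python sketch only computes the ranges; how (or whether) they can be written over http is exactly the open research question above. The function name dirty_ranges is invented for this example.

```python
def dirty_ranges(dirty, block_size, file_size):
    """Collapse a per-block dirty map into inclusive (first, last) byte ranges.

    'dirty' is a list of bools, one per block. Adjacent dirty blocks are
    merged so that each run is reported as a single range, and the final
    range is clipped to the end of the file.
    """
    ranges = []
    start = None
    for index, flag in enumerate(dirty):
        if flag and start is None:
            start = index                       # a new run of dirty blocks
        elif not flag and start is not None:
            last = min(index * block_size, file_size) - 1
            ranges.append((start * block_size, last))
            start = None
    if start is not None:                       # run reaches the last block
        last = min(len(dirty) * block_size, file_size) - 1
        ranges.append((start * block_size, last))
    return ranges
```

For a 40000 byte file with 8k blocks where blocks 1, 2 and 4 have changed, this yields two ranges: one covering blocks 1 and 2, and one covering the partial final block.
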
h3. 3)	Protocol support library

* I respect libcurl.  I believe it supports http/https/ftp/sftp.
* I’m sure it can do ‘byte-ranges’ for http GETs.
* I don’t know if it can do byte-ranges on other protocols.

h3. 4)	Other protocols

* file:///, data: and "-" (meaning stdin).
* Other protocols may be possible (smb, nfs, ssh).  Needs to be investigated.
* Cloud protocols (AWS, DropBox etc).  Needs to be investigated.

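
Whatever set of protocols is eventually supported, exiv2 first has to decide what kind of "filename" it has been given. The following Python sketch shows one way that decision might look. It is an assumption for illustration, not Exiv2 code: the function name classify and the list of known schemes are invented here, and single-letter prefixes are treated as Windows drive letters rather than protocols.

```python
KNOWN_SCHEMES = {"http", "https", "ftp", "sftp", "ssh", "file", "data"}


def classify(path):
    """Return 'stdin', one of KNOWN_SCHEMES, or 'local' for a plain file."""
    if path == "-":
        return "stdin"
    scheme, separator, _rest = path.partition(":")
    # No colon, a drive letter such as C:, or a non-alphabetic prefix
    # all mean an ordinary path on the local file system.
    if not separator or len(scheme) < 2 or not scheme.isalpha():
        return "local"
    scheme = scheme.lower()
    return scheme if scheme in KNOWN_SCHEMES else "local"
```

The result could then be used to choose which concrete BasicIO class to construct, with "local" falling through to the existing file handling.
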
h3. 5)  User Interface, test harness and Platforms

Exiv2 is a library.  There is no user interface.  Exiv2 includes about 20 sample applications which are all command-line programs.  The main application is exiv2(.exe) which does many things and is used by the test suite.  The test suite is written in bash.  On Windows, the test suite is run from Cygwin - however it can test libraries built with Visual Studio as well as GCC and Clang.

All code is required to build, execute and test correctly on the major platforms: Windows/MacOSX/Linux.  I will provide help to port from the development system to the others.

h3. 6) Some thoughts about implementation

There is a Memory Mapped IO class in Exiv2.  I think we can use that to implement the HTTP read support.  We can allocate memory for the complete file, then populate the memory "just in time" when the user makes a read request.

The priority is to have HTTP/read support (without copying the whole file).
Writing back isn’t so interesting (most HTTP servers don’t allow PUT).
Other protocols are interesting, and the "quick and dirty" solution is to copy the complete file.

Updating only those parts of a file which have changed is very effectively implemented by rsync.  Perhaps we should investigate whether we can incorporate that in our solution.  I personally update clanmills.com (which has 6GB/100,000+ files) using rsync (over ssh) and I am always astonished by its speed and reliability.

h1. Notes about prototyping a solution

1) First of all, for all folks interested in contributing to this project, I recommend that you register with the Exiv2 forum *AND* add a watch to this page.  It's my intention to update this page quite frequently.  I'll make the same information available to everybody who would like to be involved.

2) If you don't know anything about HTTP's GET verb, HTTP headers and body - now's the time to learn.  Google something up, visit the library, or ask around.  Everybody involved in web programming knows about this.  So, I'm not going to discuss it here.

3) When you inspect the Exiv2 code base, you will discover that the concrete classes for opening and reading files are derived from the abstract BasicIO class.  I recommend that you "instrument" those functions with printf (or cout) statements to report when they are called and with which arguments.  You should then be able to run the command exiv2 -pt foo.jpg and see the IO calls being made.

Here are my efforts (I've run them on Windows/Cygwin/MacOSX):

http://clanmills.com/exiv2/basicio.zip

4) Download and build curl.
Like Exiv2, curl is both a library and a very useful command-line tool.  Have a look at the man page for curl: http://linux.about.com/od/commands/l/blcmdl1_curl.htm and you'll discover the very interesting --range from-to option.  You'll also find curl --verbose helpful as it shows you what's being done by curl:

<pre>
Robins-iMac:temp rmills$ curl http://clanmills.com/files/CV.pdf > CV.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 38539  100 38539    0     0  91844      0 --:--:-- --:--:-- --:--:--  106k
Robins-iMac:temp rmills$ ls -alt CV.pdf
-rw-r--r--  1 rmills  staff  38539 Mar  1 17:41 CV.pdf

Robins-iMac:temp rmills$ curl --verbose http://clanmills.com/files/CV.pdf > /dev/null
* About to connect() to clanmills.com port 80 (#0)
*   Trying 173.254.28.62...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* connected
* Connected to clanmills.com (173.254.28.62) port 80 (#0)
> GET /files/CV.pdf HTTP/1.1
> User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
> Host: clanmills.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Sat, 02 Mar 2013 01:41:41 GMT
< Server: Apache
< Last-Modified: Sat, 10 Nov 2012 20:55:11 GMT
< Accept-Ranges: bytes
< Content-Length: 38539
< Vary: Accept-Encoding
< Content-Type: application/pdf
<
{ [data not shown]
100 38539  100 38539    0     0  91458      0 --:--:-- --:--:-- --:--:--  105k
* Connection #0 to host clanmills.com left intact
* Closing connection #0

Robins-iMac:temp rmills$ curl http://clanmills.com/files/CV.pdf | od -a | head

0000000   %   P   D   F   -   1   .   4  nl   %   G   l  si   "  nl   5
0000020  sp   0  sp   o   b   j  nl   <   <   /   L   e   n   g   t   h
0000040  sp   6  sp   0  sp   R   /   F   i   l   t   e   r  sp   /   F
0000060   l   a   t   e   D   e   c   o   d   e   >   >  nl   s   t   r
0000100   e   a   m  nl   x  fs   m   ]   k dc3   ^   6   u  rs   X   q
0000120   k   *  gs   6   i   R   t   ^   Y  em   N   d   ^   7   U   R
0000140 eot   H stx   $   ?   U   w   H   5  gs   X   Z   $  gs  ht   {
0000160   A   +   Y dc2   b   ]   V syn   V   6   r   3   |   !   ?   7
0000200 bel   $ soh   s   @   <  sp   x   .   V  so   ;   S   [   c   a
0000220   >   $   A   \   N   }   <   8   x   r   (   . dc4   >   *   ]

Robins-iMac:temp rmills$ curl --range 23-60 http://clanmills.com/files/CV.pdf
<</Length 6 0 R/Filter /FlateDecode>>
Robins-iMac:temp rmills$
</pre>

5) And now, you might be able to implement the HttpIO class!

You know how to get the length of the file (it's in the Content-Length: header).  You'll find the HTTP verb HEAD is designed to provide this information.  And you know how to read a range of bytes!

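
The two requests just mentioned (a HEAD to learn the length, and a GET with a byte range) can be sketched with Python's standard library as below. This is a sketch for experimenting from a script, not the proposed implementation, which is expected to call libcurl from C++. The function names are invented for this example. The range is inclusive at both ends, like curl's --range 23-60.

```python
import urllib.request


def range_header(first, last):
    """Value of the HTTP Range header for bytes first..last inclusive."""
    return "bytes=%d-%d" % (first, last)


def head_request(url):
    """Build (but do not send) a HEAD request for 'url'."""
    return urllib.request.Request(url, method="HEAD")


def range_request(url, first, last):
    """Build (but do not send) a GET request for one byte range."""
    return urllib.request.Request(
        url, headers={"Range": range_header(first, last)})


def remote_size(url):
    """Ask the server for the file length via HEAD and Content-Length."""
    with urllib.request.urlopen(head_request(url)) as response:
        return int(response.headers["Content-Length"])


def read_range(url, first, last):
    """Fetch bytes first..last; fail if the server ignores the range."""
    with urllib.request.urlopen(range_request(url, first, last)) as response:
        # 206 Partial Content means the range was honoured. A server
        # without byte-range support answers 200 with the whole file.
        if response.status != 206:
            raise IOError("server did not honour the Range header")
        return response.read()
```

A function like read_range is the kind of transport hook an HttpIO class needs: given a first and last byte, return just those bytes, and make it obvious when the server falls back to sending everything.
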
If you download the file http://clanmills.com/LargsPanorama.jpg:

<pre>
1008 rmills@rmills-linux:/Windows/Users/rmills/clanmills $ cd ~/temp
1009 rmills@rmills-linux:~/temp $ curl http://clanmills.com/LargsPanorama.jpg > Largs.jpg
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  485k  100  485k    0     0   115k      0  0:00:04  0:00:04 --:--:--  126k
1010 rmills@rmills-linux:~/temp $ open Largs.jpg
1011 rmills@rmills-linux:~/temp $ exiv2 -pt Largs.jpg
Exif.Image.Orientation                       Short       1  top, left
Exif.Image.XResolution                       Rational    1  72
Exif.Image.YResolution                       Rational    1  72
Exif.Image.ResolutionUnit                    Short       1  inch
Exif.Image.Software                          Ascii      29  Adobe Photoshop CS Macintosh
Exif.Image.DateTime                          Ascii      20  2007:01:28 11:28:40
Exif.Image.ExifTag                           Long        1  164
Exif.Photo.ColorSpace                        Short       1  Uncalibrated
Exif.Photo.PixelXDimension                   Long        1  2160
Exif.Photo.PixelYDimension                   Long        1  345
Exif.Thumbnail.Compression                   Short       1  JPEG (old-style)
Exif.Thumbnail.XResolution                   Rational    1  72
Exif.Thumbnail.YResolution                   Rational    1  72
Exif.Thumbnail.ResolutionUnit                Short       1  inch
Exif.Thumbnail.JPEGInterchangeFormat         Long        1  302
Exif.Thumbnail.JPEGInterchangeFormatLength   Long        1  1688
1012 rmills@rmills-linux:~/temp $
</pre>

The aim of the project is to write a new HttpIO class (derived from BasicIO), so that the command

exiv2 -pt http://clanmills.com/LargsPanorama.jpg

produces the same output as above.

The "quick and dirty" solution is: when "open" is called, use curl to download the file to /tmp/LargsPanorama.jpg, then delegate everything to the FileIO class.  Works?  Of course!  Efficient?  No!  You copied the whole file.

Here are some things to think about:

# The clever solution is to use the byte-range feature of curl to copy only those bytes actually requested by Exiv2.
# We don't want to invoke external programs like curl.  We want to link and call libcurl for ourselves.
# What happens if exiv2 requests the same bytes more than once?  We want to cache them of course.
# What about other protocols:  FTP/SSH and so on.  Well, that's what the project's about.

I hope this all makes sense.  I know you'll ask me when you're confused.

This photo is of the beautiful town of Largs in Scotland where I was born. !http://clanmills.com/LargsPanorama.jpg!