PixelTone
Make my day more heavenly. Perhaps we do not yet have an app that will carry this out for us, but we do have technology to apply such commands to our images. PixelTone, a recent collaboration from Adobe Research and the University of Michigan, is an exploratory iPad (app)lication that uses natural language, spoken or written, in conjunction with standard user interface controls and gestures, to edit images.
Language gives us the freedom to convey our specific intentions succinctly in a practically unlimited variety of forms and nuances. We have been communicating among ourselves through spoken language for likely hundreds of thousands, perhaps millions, of years, yet only recently have we been able to speak effectively to our software applications.
Several of us in Adobe Research, in collaboration with the University of Michigan, are attempting to evolve image editing into a form which supplements our familiar interactions through the addition of natural language; think Siri for Photoshop. What would such a thing look like? What would you say to it? How would you interact with it? How much more creative and how much simpler could our image editing experiences be? These are some of the questions we sought to explore.
At some point in human evolution we acquired this extra communication skill which greatly opened up the power of conveying very specific and nuanced intentions. Gestures and simple utterances only go so far; a hand wave may indicate that I am leaving, but to convey my intention of going to the store to purchase some coffee and returning within the hour is very difficult without language. And the same can be said of how we communicate with our image editing software. A simple gesture might indicate that I want to apply a gradient adjustment to my image, but to convey my full intention of applying a bit of contrast and perhaps some color warming to the shadows in a graded fashion across only the sky and not the trees could drive a normally sane Photoshop user …, well let’s just say it’s a challenge! And yet one simple line of natural language, whether spoken or written, is all that it takes to clearly specify the task. With PixelTone we are building upon this evolution of man-machine interactions to foster a more creative and yet simpler experience for image editing.
PixelTone combines gestures with speech to simplify such image processing tasks as applying gradients of effects or localized adjustments. You can point to or scrub over a region and say “make it greener here”, for example.
Another task that is greatly simplified with language is tagging an image. It is a big chore to type labels to be placed on an image, but with PixelTone it becomes a breeze. And objects, once tagged, can be edited by name.
Key to PixelTone’s robustness is a language interpreter built to handle an unlimited vocabulary. This is made possible by grammatical parsing and our understanding of sentence forms. After the speech is recognized, it is parsed into its parts of speech and cast into one of our sentence templates. The figure below shows an example, in the “interpreter” box, of parsing into verb (VX), noun (NX) and adjectival (AX) expressions. For example, the sentence “make the shadows on the left slightly brighter” would be interpreted as: VX = make, NX = shadows on the left, AX = slightly brighter. Template interpretation allows this to be recast as: apply the brighten filter with a parameter setting of “slight” to the shadow tonal areas, modulated by a linear gradient mask applied towards the left side of the image. That’s a lot going on for such a simple input sentence. Hence the power of natural language!
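To make the template idea concrete, here is a minimal Python sketch of the example sentence above. It is not PixelTone's actual interpreter: the word lists, the `Parse` and `EditOp` types, the numeric amounts and the single "verb, noun expression, adjectival expression" template are all assumptions made up for illustration, and a real system would use a proper part-of-speech tagger rather than fixed lookups.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical word lists, far smaller than any real vocabulary.
VERBS = {"make", "set", "turn"}
ARTICLES = {"the", "a", "an"}
DEGREES = {"slightly": 0.25, "somewhat": 0.5, "much": 1.0}
FILTERS = {"brighter": "brighten", "darker": "darken", "warmer": "warm"}
TONES = {"shadows", "midtones", "highlights"}
SIDES = {"left", "right", "top", "bottom"}


@dataclass
class Parse:
    vx: str  # verb expression
    nx: str  # noun expression
    ax: str  # adjectival expression


@dataclass
class EditOp:
    filter_name: str
    amount: float
    tonal_region: Optional[str]
    gradient_toward: Optional[str]


def parse(sentence: str) -> Parse:
    """Split a 'VERB NOUN-PHRASE [DEGREE] ADJECTIVE' sentence into VX, NX, AX."""
    words = sentence.lower().strip(" .!?").split()
    if len(words) < 2 or words[0] not in VERBS:
        raise ValueError("sentence does not fit the toy template")
    # AX is the final word, plus a degree adverb directly before it if present.
    ax_start = len(words) - 1
    if ax_start > 1 and words[ax_start - 1] in DEGREES:
        ax_start -= 1
    # NX is everything between the verb and AX, minus a leading article.
    nx_words = words[1:ax_start]
    if nx_words and nx_words[0] in ARTICLES:
        nx_words = nx_words[1:]
    return Parse(words[0], " ".join(nx_words), " ".join(words[ax_start:]))


def to_edit(p: Parse) -> EditOp:
    """Recast a parse as a filter, an amount, a tonal region and a mask side."""
    ax_words = p.ax.split()
    adjective = ax_words[-1]
    if adjective not in FILTERS:
        raise ValueError(f"no filter known for {adjective!r}")
    # Default to a medium amount when no degree adverb was given.
    amount = DEGREES[ax_words[0]] if len(ax_words) == 2 else 0.5
    nx_words = p.nx.split()
    tone = next((w for w in nx_words if w in TONES), None)
    side = next((w for w in nx_words if w in SIDES), None)
    return EditOp(FILTERS[adjective], amount, tone, side)


if __name__ == "__main__":
    parsed = parse("make the shadows on the left slightly brighter")
    print(parsed)
    print(to_edit(parsed))
```

The point of the sketch is only that a short sentence fans out into several independent decisions (which filter, how much, which tones, which mask), each read from a different part of the parse.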
Now here’s the really cool part – PixelTone does not care whether you want to make the image brighter or shinier or lighter. Whereas most editing applications have a restrictive set of vocabulary that must be used, and you must memorize them or know where to find them on menu lists, PixelTone provides the flexibility of using natural language naturally; that is, you don’t have to know the specific vocabulary, you just have to be close. You could say “it’s too dark”, “too dim”, “too gloomy”, “too drab”, etc. and the image gets brightened. In fact, you don’t even have to be so close; you can say “make it shimmer”, “make it shine”, “make it shake” or “make it heavenly” and PixelTone will find the semantically closest action or set of actions to implement.
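One way to picture "semantically closest" is as a nearest-neighbor lookup. The Python sketch below is a toy under loudly stated assumptions, not a description of how PixelTone measures closeness: the three axes (brightness, contrast, saturation), every number in the tables, and the rule that "too X" means "do the opposite of X" are all invented here for illustration. A real system would more plausibly draw on a lexical database or learned word vectors.

```python
import math

# Hypothetical actions, each a direction in (brightness, contrast, saturation).
ACTIONS = {
    "brighten": (1.0, 0.0, 0.0),
    "darken": (-1.0, 0.0, 0.0),
    "increase_contrast": (0.0, 1.0, 0.0),
    "saturate": (0.0, 0.0, 1.0),
}

# Hand-assigned, made-up word vectors on the same three axes.
WORDS = {
    "dark": (-1.0, 0.0, 0.0),
    "dim": (-0.9, -0.2, 0.0),
    "gloomy": (-0.8, 0.0, -0.4),
    "drab": (-0.8, -0.3, -0.4),
    "shine": (0.9, 0.3, 0.0),
    "shimmer": (0.7, 0.5, 0.1),
    "heavenly": (0.7, -0.2, 0.3),
    "vivid": (0.1, 0.3, 0.9),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_action(phrase):
    """Return the action nearest to the last known word, or None if none is known."""
    words = phrase.lower().replace("'", " ").replace("\u2019", " ").split()
    known = [w for w in words if w in WORDS]
    if not known:
        return None
    vec = WORDS[known[-1]]
    # Assumed rule: "too dark" is a complaint, so the wish is the opposite direction.
    if "too" in words:
        vec = tuple(-x for x in vec)
    return max(ACTIONS, key=lambda name: cosine(vec, ACTIONS[name]))


if __name__ == "__main__":
    for phrase in ["it's too dark", "too drab", "make it shimmer", "make it heavenly"]:
        print(phrase, "->", closest_action(phrase))
```

With these invented numbers, all four phrases land on the same brighten action, which mirrors the behavior described above: the wording only has to be close. Returning the top few actions instead of a single one would be a natural extension for phrases that call for a set of actions.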
So we may not be able to guarantee a complete experience of a heavenly day, but with PixelTone or one of its future incarnations we should be able to ensure that at least your photographs and your experiences of editing them become more heavenly.
PixelTone was put together by a team from Adobe Research (myself, Gregg Wilensky, along with Mira Dontcheva, Walter Chang, Aseem Agarwala and Jason Linder) and, from the University of Michigan, our summer intern Gierad Laput and his professor Eytan Adar.
For more information, see the Adobe Research project page:
http://www.adobe.com/technology/projects/editing-images-with-natural-language.html
The page also includes a paper that will be presented at the CHI 2013 Conference on Human Factors in Computing Systems in May.
More information is available on John Nack’s blog:
http://blogs.adobe.com/jnack/2013/02/feedback-please-voice-driven-photo-editing.html