MVTron gets Xuggle-er with MediaTools

It’s been a while since I did much with MVTron, but I took a brief look hack at it the other day, and it’s actually coming along, I think. That said, it isn’t something I’m actively working on, so there probably aren’t going to be any spectacular results or anything.

Earlier, my approach to MVTron was to wrap AviSynth in the hopes that it could take advantage of all the fast AviSynth plugins people have written in C. I managed to get a scene detector working that processed my test video in about twice the video’s duration. That struck me as a pretty poor performance, but I was still hopeful, and I was planning to get my plan in motion from there.

Well, it’s kind of embarrassing to say this, but I completely forgot about that AviSynth-Groovy success story until just now when I read my own blog. That said, this time around I duplicated that success in less than a day without interfacing to AviSynth at all, using Xuggle’s (relatively) new MediaTools API. And now I’m usually looking at processing times only 170% the test video’s duration. Thanks, Xuggle!

Random-Access Pass

It’s probably not worth mentioning this, but before I turned to MediaTools, I first tried to improve the Xuggler code I was already using. I have a random-access video class that wraps a Xuggler container and uses seek to go backwards. I’m well aware that video isn’t usually encoded for fast random access, so the idea of this interface is that, as long as you’re asking for frames in the right order, you get them as efficiently as usual, but when you need to back up, that’s still easy to do. It does a great job of abstracting away the usual packet-pushing Xuggler API… if you’re lucky.

The problem was that the first few frames after a seek weren’t necessarily the right ones, even if the rest of them were. Well, I tried to gloss over that by copying off a few frames after each keyframe the first time through the file, and then using the preserved ones in lieu of corrupt ones later on. It worked like a charm, but only if there was only one seek needed. Apparently, the first seek might corrupt a few frames, but seeking again after that just breaks the whole video.

I briefly considered having the random-access wrapper open a new Xuggler container object and fast-forward from the beginning every time it needed to seek backwards, but I passed on that for the time being, in favor of MediaTools.


The MediaTools library also “does a great job of abstracting away the usual packet-pushing Xuggler API.” Rather than trying to hide the serial access, though, MediaTools embraces it. Essentially, you build a bundle of objects that represents the data flow (e.g., source video to conversion routines to destination file), and then in one magical line you execute it.

A MediaTool pipeline is built using the concept of events. You make a listener and register it with an object that broadcasts events, and the broadcasting object keeps track of that reference and calls the listener with events as they happen. This is a familiar concept in Java GUI programming and probably a lot of other things (things I haven’t experienced yet), but this time around I find it funny, ’cause seeing this use of listeners gave me the epiphany I needed to appreciate Haskell’s arrows. I’ve never used arrows, but I miss them already. I think MediaTools made an appropriate choice though; arrows may be a more natural abstraction, but I think they aren’t as easy to get the hang of as listeners, which defeats the purpose of an easier-alternative API.

(Incidentally, I think Marsyas uses a very similar pipeline metaphor. I haven’t looked deeply enough into Marsyas to draw a meaningful comparison, though.)

In my own MediaTools-powered code, I’m finding that I’m using absolutely no Groovy except in my sandbox scripts, and even my Groovy sandboxes are mostly copied-and-pasted from equivalent Java ones. Groovy still excels after the pipeline has done its job, when I want to do things like print out totals and histograms, but I think all the subclassing and put-this-one-in-that-one code is as brief in Java as it’ll ever be.

In Groovy’s favor, a network of listeners like this is naturally hierarchical. (It’s a bit of a pain to write a listener that waits for more than one event before doing something, since you have to manage state manually. As long as you’re managing to avoid that pain, your pipeline has a single input and several outputs that branch out from there. In fact, the “one magical line” that executes the pipeline depends on there being only one root input, and I think it would get pretty complicated with more than one.) When you’re taking a bunch of building blocks and putting them together hierarchically, you have a great candidate for a Groovy markup-style DSL.

The thing is, I’m not accustomed to making markup-style DSLs just yet, and even if I were, this is not the time. The listeners I’ve made and considered making have all sorts of different structures, so for the time being I don’t have many patterns worth abstracting away.

The Guts

So, what all did I accomplish in less than a day? Here are the details, in English about as clear as I can muster.

Scene Detection, a Diabolical Transition

As I mentioned, I implemented the scene detector. The algorithm I’m using is slightly different than the one I was using before. Before, I had an algorithm that took a tolerance and a clip as input and that output a list of booleans that indicated where the breaks were. However, giving the tolerance at the beginning like that was just a temporary kludge. I really want MVTron to detect the tolerance by default most of the time; having to explicitly tell it things about every video you give it ruins the fun that would come from just throwing it at a directory of music and videos and seeing what it came up with. In order for auto-detected tolerances to be practical, they need to be detected after the preprocessing is done, so that new and better tolerance detection methods can be swapped in without taking hours to re-scan all the footage.

As it turns out, translating my algorithm from tolerance-given-at-preprocessing-time to tolerance-given-never wasn’t very straightforward. Surprise surprise, right? It actually might not be the caveat you’d expect.

Under ideal circumstances, I would just take a difference between adjacent frames at preprocessing time, find outliers at composition time, and be done with it. However, I’m lucky enough that one of my test videos has an example of a pretty understandable phenomenon. There are frames A-B-C in this video where it’s obvious that frames A and C would constitute a scene break except that frame B is blurred almost exactly halfway between them. (I’m pretty sure it was telecined so that the pixel rows actually alternated between the two scenes, then scaled down so that those rows blended together.) This makes frame breaks A-B and B-C have intensity far below the other scene breaks, even getting dangerously close to the breaks between animation cels, but despite this, I’d rather have both A-B and B-C detected as scene breaks rather than neither.

The original algorithm worked by measuring all the A-C intensities as well (or else adding the A-B and B-C intensities, which in my tests has similar results), and then putting “unexpected” scene breaks detected that way (that is, A-C scene breaks without apparent A-B or B-C breaks) into the result as a special case.

The new algorithm doesn’t make a boolean stream until an intensity stream is finally analyzed to determine a threshold. Instead it works on combining intensity streams. In fact, it only uses the A-B intensity stream, so in a sense it combines that stream with itself. Here it is in Groovy:

// Some product fuzzy logic operations, which operate on values between 0 and 1,
// inclusive. The or defined here will be used below.
def not = { 1 - it };
def and = { a, b -> a * b };
def or = { a, b -> not( and( not( a ), not( b ) ) ) };

// By now, diff has been defined such that
// diff == [ 0 ] + (frame break intensities gained from preprocessing) + [ 0 ].
// If there are 0 frames, then diff == [ 0 ] instead.

// The focusedDiff accentuates each intensity of diff in one of two ways: If an
// intensity is a local maximum, it's augmented with its greatest neighbor. If
// it isn't a local maximum, it's augmented with itself. This has the effect of
// holding back local maxima with small neighbors, which are crisp already.
def focusedDiff =
    numberOfFrames == 0 ? [ 0 ] : [ 0 ] + (1..<numberOfFrames).collect { or(
        diff[ it ],
        Math.min( diff[ it ], Math.max( diff[ it - 1 ], diff[ it + 1 ] ) )
    ) } + [ 0 ];

(For what it’s worth, since min( J, max( I, K ) ) = max( min( I, J ), min( J, K ) ), this sequence operation could be reasoned about without referring to the elements, calling it something like “an accentuation zip of a sequence by the pairwise maximum of the pairwise minimum of a zero-bookended version of itself.”)

Scene Detection Grit

To evaluate the numeric difference between two frames, I calculate the root-mean-square of the differences in R, G, and B values. This is the same as the root-mean-square of the pixel color differences, for a pixel difference calculation that’s proportional to the colors’ Euclidean distance in the sRGB color space. My code makes it easy to swap in a better color metric (or a better frame metric altogether) if that turns out to help, but this is working just fine. I actually think it might be slightly less useful than the same metric without all the squaring and root-taking, which is closer to what I used in AviSynth (which was the average luma of the absolute difference of the frames).

To guess a tolerance for scene breaks, I just take the maximum frame break intensity and divide by two. This algorithm could definitely use some work, but I’ll have to test on some more videos or read up some more before I can figure out how to improve it.

Audio Analysis

I’m just barely understanding most of what I’m trying to do here. I read a lot about DFTs and DCTs, and I only learn enough to program something, and I guess if I can explain it to a computer then I must understand it perfectly, but it doesn’t feel that way. I think I’m shooting for a Mel-frequency cepstrum here at some point, and I have a pretty good idea of how to get one from the sample data, but I’ve had to work myself up to the concepts one at a time, and I stopped coding on this after I’d successfully invoked an FFT and made a histogram of the peak frequencies at each audio frame in a test video. (Yeah, I’ve only tested this with videos rather than music so far, mostly ’cause the videos were handy and I knew I could load them.)

I didn’t try to write the FFT code myself. Instead I’m using a library, JTransforms. I don’t really care where I get my FFT, as long as it’s fast, accurate, and feature-complete. If I don’t have to maintain it, awesome.


The point I’m trying to make is, it probably took me about as long to write this summary as it took to try some stuff, fail, and start MVTron all over with MediaTools, and yet the MediaTools code was faster and tackled things I hadn’t even gotten around to attempting before. I think that says something for Xuggler/MediaTools.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s