A couple of months ago, I took a stab at making an AMV (anime music video). I’ve done it before… once… but this time, like lots of times, I had absolutely no time to work on it because I’d jam-packed myself with more homework than I knew what to do with.
So, I found myself pacing along, in a sleep-deprived state, talking at a friend about how I tried to do such-and-such and had to do wossname instead and never found time for, um, doing that AMV thing, and that since it was me of all people doing it, I might as well save time in an insanely difficult way by, say, programming my computer to do it all for me.
An automatic AMV generator. Sometimes I amaze myself. Oftentimes I come up with some idea that blows my socks off and then realize it’s just a telephone or a wheel. This time, whether or not it had been done before, I knew I was particularly persnicketily picky about my AMVs… so picky that I was sure no other automatic AMV generator could be exactly what I wanted.
To make myself clear, this is something that doesn’t exist yet, but it’s something I’ve been actively working on for a few weeks. And I’m calling it MVTron.
Things I could insist on putting into an AMV generator:
- Face and voice recognition, so as to make sure the clips chosen follow a particular main character or group of characters
- Speech recognition and natural language comprehension, so as to make witty parallels between the song and the video
- Instrument recognition, so as to classify the genre of the song and pick clips that complement it
- Text identification, so as to avoid some of the text-annotated scenes common in comedy anime, which would only distract the viewer from what’s actually happening in the scene
Yeah, yeah, yeah. I’m not constructing a super-high-tech surveillance system that doubles as a CAPTCHA crack. But just in case anybody does construct one, I’m pretty sure MVTron wouldn’t mind borrowing some of that spy tech in the name of art.
Things I think would suffice and are actually pretty widely known:
- Beat detection is a must. If nothing else, the video choices need to at least be on beat. Otherwise, I think people will just see it as a random sequence of clips… and, essentially, yeah, that’s probably what it would be.
- Scene detection is there mainly for polish. Without it, I think MVTron wouldn’t hesitate to cut to a scene that itself cuts to another scene two frames later, and picky viewers (like me) would notice the sloppy editing.
- Frequency analysis is how I could identify traits of the audio besides just its volume. I’ll get to how exactly I might need this in a second.
- Sequence alignment is the way I’d match up video with audio. Apparently HMMs (Hidden Markov Models) are pretty good at that, so I get the feeling I should have paid a bit more attention when I checked out that book in the school library a few years ago. For now, all I need is something that works, and I think a dynamic programming solution like the Smith-Waterman algorithm will be acceptable. (For all I know, it’s better. I can’t exactly tell. I’m finding myself in several new frontiers here.)
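To make the last bullet concrete, here's a toy sketch of a Smith-Waterman-style dynamic program matching two numeric feature streams (say, per-beat intensity scores). The similarity function and gap penalty are placeholders I made up for illustration, not anything MVTron has settled on:

```python
# Smith-Waterman local alignment over two feature sequences.
# sim() and the gap penalty are arbitrary illustrative choices.

def smith_waterman(a, b, gap=1.0):
    """Return the best local-alignment score and its end indices."""
    def sim(x, y):
        # Reward similar feature values, penalize dissimilar ones.
        return 1.0 - abs(x - y)

    rows, cols = len(a) + 1, len(b) + 1
    h = [[0.0] * cols for _ in range(rows)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            h[i][j] = max(
                0.0,                                        # restart
                h[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # match/mismatch
                h[i - 1][j] - gap,                          # gap in b
                h[i][j - 1] - gap,                          # gap in a
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    return best, best_pos

# The video features [0.9, 0.8, 0.2] line up best against the
# middle of the audio features.
audio = [0.1, 0.9, 0.8, 0.2, 0.1]
video = [0.9, 0.8, 0.2]
score, end = smith_waterman(audio, video)
```

The appeal of the local (rather than global) variant is that a clip only has to match *some* stretch of the song well, which is exactly the situation here.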
In order for beat detection to help the video sync up, there has to be some way to detect motion in the video, to know when someone slams a table or falls over or something and causes a sudden decrease in energy in the video that can be followed through with in the audio. (You don’t expect me to let it just change the scene every beat, do you? Sheesh, that would be annoying.) Fortunately, I’m pretty sure this kind of motion detection can be done at the same time as scene detection.
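One common way to get both signals in a single pass is plain frame differencing: the same per-frame "change" score that spikes at a hard cut also tracks motion within a scene. Here's a minimal numpy sketch; the grayscale toy frames and the cut threshold are made up for illustration:

```python
import numpy as np

def frame_changes(frames):
    """Mean absolute pixel difference between consecutive grayscale
    frames (2-D numpy arrays): one change score per frame pair."""
    return [float(np.mean(np.abs(b.astype(float) - a.astype(float))))
            for a, b in zip(frames, frames[1:])]

def cut_indices(changes, threshold=50.0):
    """Frame-pair indices whose change score jumps past the (arbitrary)
    cut threshold; moderate sub-threshold values suggest in-scene motion."""
    return [i for i, c in enumerate(changes) if c > threshold]

# Three identical dark frames, then a hard cut to a bright frame.
dark = np.zeros((4, 4))
bright = np.full((4, 4), 200.0)
changes = frame_changes([dark, dark, dark, bright])
cuts = cut_indices(changes)
```

A real implementation would want a smarter cut detector (histogram comparison, adaptive thresholds), but the point is that motion and cuts can share the same difference computation.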
Beats are one thing, but I like pop, and beats are definitely not all there is in pop. I want MVTron to be able to detect a soft part of the song and coordinate it with tranquil video or to detect an extended pitch shift in the song and coordinate it with a video clip that has synchronized careening or acceleration/deceleration. (Did that make sense? I essentially want the audio and video to say “whooooaah” at the same time, lol.)
With all of this in mind, what I need is a way to look at a video or audio stream and extract its beats/scene changes, intensity (volume and movement), its “bending,” and how confident that “bending” estimate is.
For video, I expect to calculate “bending” by finding the center of the movement in every consecutive pair of frames and calculating the magnitude of the second derivative of that. (This is intentionally like tracking the center of mass of the main object and calculating the acceleration of that point.) The immediate variance (the inverse of the “confidence” I was talking about) can be calculated by finding the expected squared distance of any high-change point from the calculated center. (This unfortunately means that large, solid-colored objects will probably have a bit of trouble being recognized.)
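As a sketch of that idea, assuming grayscale frames as 2-D numpy arrays: the "center of movement" is the change-weighted centroid of a difference image, "bending" is the discrete second derivative of the centroid path, and the variance is the change-weighted squared distance from the centroid. All the function names here are hypothetical:

```python
import numpy as np

def motion_centroid(prev, cur):
    """Centroid of per-pixel change between two grayscale frames,
    weighted by absolute difference; None if nothing changed."""
    d = np.abs(cur.astype(float) - prev.astype(float))
    total = d.sum()
    if total == 0:
        return None
    rows, cols = np.indices(d.shape)
    return float((rows * d).sum() / total), float((cols * d).sum() / total)

def centroid_variance(prev, cur, centroid):
    """Change-weighted expected squared distance from the centroid --
    the inverse of the 'confidence' described above."""
    d = np.abs(cur.astype(float) - prev.astype(float))
    rows, cols = np.indices(d.shape)
    sq = (rows - centroid[0]) ** 2 + (cols - centroid[1]) ** 2
    return float((sq * d).sum() / d.sum())

def bending(centroids):
    """Magnitude of the discrete second derivative of the motion
    centroid path, one value per interior point."""
    c = np.asarray(centroids, dtype=float)
    accel = c[2:] - 2 * c[1:-1] + c[:-2]
    return np.hypot(accel[:, 0], accel[:, 1])

# Demo: a bright pixel jumps two columns; the change centroid lands
# midway between the old and new positions.
f0 = np.zeros((4, 4)); f0[1, 1] = 255
f1 = np.zeros((4, 4)); f1[1, 3] = 255
c = motion_centroid(f0, f1)
```

Note the caveat from above shows up directly: a large solid-colored object produces change only at its edges, so the centroid and variance get noisy.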
For audio, I expect to calculate “bending” by taking a Fourier transform, finding the dominant pitch, and taking the derivative of that. The immediate variance here will similarly be calculated by taking the expected squared distance from the dominant frequency to any particular frequency contributing to the sound, weighted by that other frequency’s amplitude. There might need to be some logarithmic rescaling here.
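The dominant-pitch half of that is straightforward with numpy's FFT routines. A rough sketch, assuming a mono chunk of samples (the sample rate and test tone are arbitrary):

```python
import numpy as np

def dominant_pitch(samples, rate):
    """Frequency (Hz) of the strongest FFT bin in a mono chunk,
    skipping the DC component."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)
    return freqs[int(np.argmax(spectrum[1:])) + 1]

def pitch_variance(samples, rate, center):
    """Amplitude-weighted expected squared distance (Hz^2) of the
    spectrum from the dominant pitch."""
    spectrum = np.abs(np.fft.rfft(samples))[1:]
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)[1:]
    return float(((freqs - center) ** 2 * spectrum).sum() / spectrum.sum())

# A pure 440 Hz tone should come out as 440 Hz with near-zero variance.
rate = 8000
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 440 * t)
pitch = dominant_pitch(tone, rate)
var = pitch_variance(tone, rate, pitch)
```

The "bending" value would then be the frame-to-frame difference of `pitch` over successive chunks, and the logarithmic rescaling mentioned above would apply to `freqs` before the distance calculation, since pitch perception is logarithmic.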
I’m not sure whether I’ve chosen exactly the right formulas here, but it’s all heuristics. If it feels wrong and I can fudge something to fix it, I will. I also intend to use some sort of adaptive normalization, so as to make the difficulties in comparing these various dissimilar units negligible.
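One cheap form of adaptive normalization is to rescale each feature stream against an exponential moving estimate of its own mean and variance, so beat strength, motion, and "bending" all end up in comparable unitless ranges. A sketch, with an arbitrary smoothing factor:

```python
def adaptive_normalize(values, alpha=0.1):
    """Normalize a feature stream against exponential moving estimates
    of its mean and variance. alpha is an arbitrary smoothing choice."""
    mean, var, out = values[0], 1.0, []
    for v in values:
        mean = (1 - alpha) * mean + alpha * v
        var = (1 - alpha) * var + alpha * (v - mean) ** 2
        out.append((v - mean) / (var ** 0.5))
    return out
```

A flat stream normalizes to all zeros, and a sudden spike stands out regardless of the stream's original units, which is the whole point of the fudging.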
So, what I have in mind here is a way to preprocess video and audio streams in such a way that I can store exactly what I want to know about them in small files that can then be automatically sorted through, arranged, and rebuilt into a final product. I think the design is set. I’ve proven to myself that I have the audio and video processing capability, and now I just need to code. Right? This is pretty hopeful.