Remove Moving Objects from Video (medium.com/syncedreview)
132 points by Yuqing7 on July 16, 2019 | 21 comments


A related, better implementation of the inpainting, though I wonder how it'd look with the automasking in this.

https://nbei.github.io/video-inpainting.html


There is a related issue that I have some personal interest in: removing specific audio from video.

Background:

There are streamers on Twitch who play copyrighted music as part of their streams. Twitch allows this, but then scans the videos against a music fingerprint database and mutes any section in which copyrighted music played. The other parts are unaffected unless they contained some other copyrighted audio. YouTube also recognizes copyrighted music, but demonetizes the entire video as a result. Needless to say, demonetization hurts the streamer.

Alternatives:

I don't want to get into whether or not Twitch and YouTube are right in doing the copyrighted audio matching and the subsequent actions they take. Some streamers who've been affected by this have started playing royalty-free/copyright-free music or music from lesser known artists that are less likely to be in these music fingerprint databases.

My question:

Is it possible to just subtract the audio of a copyrighted track from a video after it has been detected as having been played in that video?


It's possible, but the quality isn't that great even with state of the art machine learning:

https://towardsdatascience.com/audio-ai-isolating-vocals-fro...


The only tool in the audio toolbox is essentially the Fourier transform.

It would be a game changer if someone were to come up with a novel method of decomposing audio into discrete components (e.g. people speaking, specific instruments, background noise).

Likely this would require completely new hardware to capture different audio attributes in addition to simply capturing a stream of vibrations from a microphone.


The tools to do this exist. It's usually called 'blind source separation', as in "What are the N distinct audio signals which sum up to best explain a given compound signal, without knowing the possible source signals ahead of time." Usually it's done with some sort of matrix factorization, Principal Component Analysis, and/or Independent Component Analysis. It's also used for non-audio signals, like pulling the discrete firings out of noisy EEG signals. It's definitely not a foolproof solution but in a lot of applications it can get you going, at least.
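
For a concrete flavour of the matrix factorization route, here is a minimal sketch using non-negative matrix factorization with multiplicative updates. NMF is my pick of one common option, not necessarily what the comment had in mind, and the function name `nmf` and its parameters are made up for illustration. It assumes `V` is a non-negative matrix such as a magnitude spectrogram (frequency bins by time frames); the columns of `W` then act as spectral templates and the rows of `H` as their activations over time.

```python
import numpy as np

def nmf(V, rank, iters=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (bins x frames) as W @ H.

    Illustrative helper: uses the classic multiplicative updates for the
    squared-error objective, starting from a random positive guess.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(iters):
        # Each update keeps the factors non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Each estimated source would then be rebuilt from one component, e.g. `np.outer(W[:, k], H[k])`, and turned back into audio with the phase of the original mix. That last step is where much of the "not foolproof" part tends to show up.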


By the problem setup it isn't blind source. It is sound = song plus other: a mixture model with 2 components.

Edit. If you know the song, it should be something simple: cross-correlate the audio with the known song, find the peak, solve for the gain, and subtract the scaled and shifted song from the original track. It will be rubbish if the gain and timing have errors. You might need to do it in little chunks and interpolate the gain and shifts.
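
A rough sketch of those steps, assuming mono float arrays at the same sample rate, a single global shift, and a single global gain (no chunking or interpolation). The function name `subtract_known_song` is made up for illustration.

```python
import numpy as np

def subtract_known_song(mix, song):
    """Align `song` to `mix` by cross-correlation, fit one gain, subtract."""
    # Full cross-correlation; index k corresponds to lag k - (len(song) - 1).
    xcorr = np.correlate(mix, song, mode="full")
    lag = int(np.argmax(np.abs(xcorr))) - (len(song) - 1)

    # Place the song on the mix timeline at the estimated lag.
    aligned = np.zeros(len(mix), dtype=float)
    start, end = max(lag, 0), min(lag + len(song), len(mix))
    aligned[start:end] = song[start - lag:end - lag]

    # Least-squares gain: minimise ||mix - gain * aligned||^2.
    gain = float(np.dot(mix, aligned) / np.dot(aligned, aligned))
    return mix - gain * aligned, lag, gain
```

`np.correlate` is a direct (non-FFT) correlation, so on real recordings you would probably want an FFT-based correlation and the chunked version described above. This is only meant to show the shape of the idea.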

Edit 2. More generally, you might want to worry about the song having passed through some unknown transfer function (i.e. it is being played and recorded through shitty equipment). Then you have an interesting inverse problem. If everything is linear it will involve a regularized deconvolution. Will be tricky then.
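
One way to sketch the linear case: treat the unknown transfer function as a short FIR filter and fit it with ridge (Tikhonov) regularized least squares. Assumptions on my part: the song is already roughly time-aligned with the mix, the filter is linear and time-invariant, and the names `estimate_filter`, `taps` and `ridge` are hypothetical.

```python
import numpy as np

def estimate_filter(song, mix, taps=32, ridge=1e-3):
    """Fit h so that mix ~= (h convolved with song) + other; return h and the residual."""
    n = len(mix)
    # Pad or trim the song to the length of the mix.
    padded = np.concatenate([song, np.zeros(max(0, n - len(song)))])[:n]
    # Column k of A is the song delayed by k samples.
    A = np.column_stack(
        [np.concatenate([np.zeros(k), padded[:n - k]]) for k in range(taps)]
    )
    # Ridge solution: (A^T A + ridge * I) h = A^T mix.
    h = np.linalg.solve(A.T @ A + ridge * np.eye(taps), A.T @ mix)
    return h, mix - A @ h
```

The residual is the estimate of "other". A larger `ridge` trades some bias for stability when the song has little energy at some frequencies, which is where this kind of inverse problem gets tricky.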


It is still reducible to the more general blind source problem, right? We can conveniently "forget" that we know what the sources are, so now we are blindfolded and can still use the same techniques to solve it.


It will do worse with fewer assumptions. The more you know, the better you can estimate.


Sorry I didn't see your responses until now. Indeed, there are many ways to slice the specific problem. I was specifically responding to the parent's statement:

> It would be a game changer if someone were to come up with a novel method of decomposing audio into discrete components

It's something that has been generally addressed and ~works. It will obviously depend on the specifics of the application, and yes if you can constrain the problem space further you ought to do better!


There is research like https://ai.googleblog.com/2018/04/looking-to-listen-audio-vi... which separates voice using video as additional data. I think there was a similar version which separated two musical instruments also using video.


I think something imperfect would be welcome given the current set of consequences: just removing the Fourier transform of a song, given a start time and an end time, should get rid of enough to keep copyright holders and content creators somewhat happy.


This is very difficult at present. Most methods of doing something like this rely upon spectral analysis of the target you wish to remove, and then use FFT decomp/recomp to subtract the result. The results are poor.

Someone mentioned the software that can isolate instruments from one another. But the problem is that this usually relies upon the instruments or vocals having a different position in a stereo signal, and different frequency ranges, to perform that extraction. This is much easier to do in such cases, especially since the material you are working with is generally clean source material. Here you have a mixed signal that may or may not contain stereo data, but certainly contains a lot of background noise, as well as other game audio you wish to retain.

What is needed is a far more intelligent method of identifying the spectral signature of a known source by matching the target with the source in the time domain, and then using that identified signal in the target itself to cancel itself out.

In practice this would mean comparing the amplitude modulation of each spectral band in the source to the target to identify the spectral components that need to be removed. Then, using the corresponding "matched" bands and the source as a guide, you would generate a signal from the target audio itself that represents only the portion of the spectral signature matched to the source. This newly generated signal can then be used to remove that audio from the target with 180° phase cancellation.

Keep in mind that wherever the spectral bands overlap with other audio content you wish to retain, there will be a lot of artifacting and signal loss or phasing. A second pass would need to use the equivalent of inpainting to reconstruct those missing components.

Overall, it's a very hard problem.


If the track is decompressed by the game, played, and then recompressed by the streamer, you can’t just subtract it, because that will leave a garbled mix of all the things thrown away by the lossy compression. It might be possible to determine which parts of the re-encoded stream are the copyrighted track, though, and drop these. If you were able to do this perfectly you would still lose quality, but not that much.


Is AI the right tool for this? It seems more akin to traditional noise cancellation, where the "noise" is the track you want to remove.


I was looking for a decent inpainting implementation for photos earlier in the week, and it appears there are no decent open source implementations.


Second thing that's felt like a genuine, useful, innovation coming out of the last wave of AI hype. (The first being deep-fakes.)


That looks fantastic! I believe that Adobe Premiere and After Effects already offer that feature called content-aware fill for video and it seems to work very well [1].

Of course those are not open-source but it's really inspiring to see such uses of AI. Another very interesting one when it comes to video has been [2]

[1] https://www.youtube.com/watch?v=25ltIoHtiO4

[2] https://github.com/avinashpaliwal/Super-SloMo


The technique you refer to is "inpainting", but such sophistication isn't necessary. If the object of interest is moving across a largely static background (pans and rotations are easily compensated for), then the missing background image you need in frame N is available in adjacent frames.

I've used such a filter to remove dirt and dust from 8mm films via avisynth, a program that goes back more than 20 years, though the filter in question is not quite that old -- at least 10 years.

Here is the filter in question: http://avisynth.nl/index.php/RemoveDirt
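
To illustrate the "borrow the background from adjacent frames" idea (this is not RemoveDirt's algorithm, just a minimal sketch of the principle): assume a static, already stabilised camera, grayscale frames stacked as a `(T, H, W)` float array, and a boolean mask per frame marking the object. The masks and the name `fill_from_neighbours` are assumptions for the example.

```python
import numpy as np

def fill_from_neighbours(frames, masks):
    """Replace masked pixels with the per-pixel median of the unmasked frames.

    frames: (T, H, W) float array; masks: (T, H, W) bool, True where the object is.
    """
    # Hide the object so it does not contribute to the background estimate.
    visible = np.where(masks, np.nan, frames)
    # Median over time, ignoring hidden samples. A pixel that is masked in
    # every frame stays NaN, since there is nothing to borrow from.
    background = np.nanmedian(visible, axis=0)
    # Only the masked pixels are touched; everything else is kept as-is.
    return np.where(masks, background, frames)
```

With a moving camera you would first warp the neighbouring frames into the coordinates of frame N, and for pixels that are never uncovered you are back to needing real inpainting.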


Fascinating. Is this based on an earlier still-frame method or is it entirely unique to video?


Just like with that face blending (is that the correct term?), seeing this I'm getting even more afraid of how this technology can be used for malicious purposes, especially in media.

It still looks amazing of course


This is pretty awesome, to say the least.



