Have you seen https://www.descript.com/? It transcribes video and lets you edit the transcript, and those edits to the transcript are reflected in the video. You can even train it on voices if you have enough content.
I mean you can always write an EDL. The issue with most filmic material, however, is that moving pictures have their own internal timing, movements, directional changes and so on. Ignoring the content of shots is something you might do with long shots of talking heads, but literally everything else won't let you get away with it, in my experience.
This would be particularly useful for speeches or presentations where the content is the important part and the visuals don't change much (or wouldn't create jarring cuts when just editing based on the transcript).
Amazing! Love it so much. My brain was running wild with possibilities (like having an autocomplete from the corpus, live audio only previews, and the above)
I didn't realise there was a GitHub! You should add a link to your tutorial.
I guess there would need to be an intermediate step. Videogrep helps to surface useful ngrams and there would still be a manual/creative step to stitch them together in a way that works.
This is awesome! I’ve considered building something nearly identical over the years, as I’ve definitely used VTT files to aid in searching for content to edit, but never did because getting all the FFmpeg stuff to work made my head hurt. I’m so glad someone else has done the hard work for me and that it’s been documented so well!
If anyone else decides to give this a try on video files with multiple audio tracks, there doesn't seem to be an easy way to tell it to select a certain track.
I got it working by manually adding `-map 0:2` (`2` being the track ID I'm interested in) when calling ffmpeg.
You'll have to make that edit in both `videogrep/transcribe.py` and `moviepy/audio/io/readers.py`.
And I'm not sure how easy adding proper support for that would be, considering that moviepy doesn't currently handle multiple audio tracks (https://github.com/Zulko/moviepy/issues/1654).
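For context, this is roughly the kind of ffmpeg call involved; the helper name is made up, but the `-map 0:2` stream selection is the relevant part (16 kHz mono wav is what vosk-style recognizers typically expect):

```python
# Hypothetical helper, not part of videogrep: extract one specific audio
# stream to a mono 16 kHz wav, using ffmpeg's -map option to pick the
# stream by index.
import subprocess

def extract_audio_track(video_path: str, wav_path: str, stream_index: int = 2) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_path,
            "-map", f"0:{stream_index}",   # pick the desired stream from input 0
            "-ac", "1",                    # mono
            "-ar", "16000",                # 16 kHz sample rate
            wav_path,
        ],
        check=True,
    )

# extract_audio_track("episode.mkv", "episode.wav", stream_index=2)
```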
Back in 2011-12, my MFA (poetry) thesis project was a sort of poetic ~conversation between myself and (selected) poems generated by a program I wrote, using transcripts of Glenn Beck's TV show.
I really, really wanted to be able to generate video ~performances of the generated poem in each pair for my thesis reading (and for evolving the project beyond the thesis). I have to imagine videogrep could support that in some form, at least if I had the footage. (Not that I want to re-heat that particular project at this point).
This is very cool! I wonder if Videogrep works better with videos sourced from Youtube (consistent formats, framerates, bitrates) compared to arbitrary sources.
I've used ffmpeg to chop video bits and merge them before, with mixed results. It would struggle to cut at exact frames, or the audio would go out of sync, or the frame rate would get messed up.
I gave up and decided to tackle the problem on the playback side. Just as players respect subtitle srt/vtt files, I wish there were a "jumplist" format (like a playlist, but "intra-file") that you could place alongside video/audio files, and players would automatically play the media as per the markers in the file, managing any prebuffering etc. for smooth playback.
For a client project, I did this with the ExoPlayer lib on Android, which kinda already has "evented" playback support where you can queue events on the playback timeline. A "jumplist" file is a simple .jls CSV file with the same filename as the video file.
Each line contains:
<start-time>,<end-time>,<extra-features>
"extra-features" could be playback speed, pan, zoom, whatever.
Code parses the file and queues events on the playback timeline (on tick 0, jump to the first <start-time>; at each <end-time>, go to the next <start-time>).
I set it up to buffer the whole file aggressively, but that could be improved. Downside may be that more data is downloaded than is played. Upside is that multiple people can author their own "jumplist" files without time consuming re-encode of media.
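For illustration, here's a sketch of what parsing such a hypothetical .jls file could look like - the format, filename convention, and field meanings are the proposal above, not an existing standard:

```python
# Sketch of a parser for the hypothetical .jls "jumplist" format described
# above: one CSV line per segment, <start-time>,<end-time>,<extra-features>.
import csv
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Jump:
    start: float            # seconds into the media file
    end: float
    features: dict = field(default_factory=dict)  # e.g. {"speed": "1.5"}

def load_jumplist(media_path: str) -> list[Jump]:
    jls_path = Path(media_path).with_suffix(".jls")  # same basename as the video
    jumps = []
    with open(jls_path, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 2 or row[0].startswith("#"):
                continue
            start, end, *extras = row
            features = dict(item.split("=", 1) for item in extras if "=" in item)
            jumps.append(Jump(float(start), float(end), features))
    return jumps

# A player would then seek to jumps[0].start at tick 0 and, on reaching each
# jump's end time, seek to the next jump's start.
```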
This resonates with notions of Object Oriented Ontology[1] in Cinema[2][3], which is very much about picking out and possibly stitching together key items from film.
> "All of the elements of a shot’s mise en scène, all of the non-relational objects within the film frame, are figures of a sort. The figure is the likeness of a material object, whether that likeness is by-design or purely accidental. A shot is a cluster of cinematic figures, an entanglement. Actors and props are by no means the only kinds of cinematic figures—the space that they occupy and navigate is itself a figure"
Hi Sam, I'm a big fan of your work! Coincidentally, I just made a simple POC of a video editor that works by editing text, using this speech-to-text model: https://huggingface.co/facebook/wav2vec2-large-960h-lv60-sel.... It might be cool to integrate into your Videogrep tool; it also works offline with CPU or GPU, and gives you word- or character-level timestamps.
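If anyone wants to experiment with that approach, here's a minimal sketch using the Hugging Face transformers ASR pipeline, which can return word-level timestamps for CTC models like wav2vec2. The checkpoint and audio filename below are placeholders, not necessarily the commenter's exact setup:

```python
# Minimal sketch: word-level timestamps from a CTC model (e.g. wav2vec2)
# via the transformers ASR pipeline. Model name and file are placeholders.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",
)

# For CTC models, return_timestamps can be "word" or "char".
result = asr("speech.wav", return_timestamps="word")

for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"{start:.2f}-{end:.2f}  {chunk['text']}")
```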
This needs a function where you can give it a string, and it goes and finds the longest matches from the database, then builds a video that says what the string says.
Also it would be fun if it output a kdenlive project file, so you could easily tweak the boundaries or clip orders.
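A rough sketch of the longest-match idea above, assuming a phrase index (space-joined words mapped to a clip where they're spoken) built from videogrep-style word-level transcripts; every name here is hypothetical:

```python
# Greedy "longest match" builder: for each position in the target string,
# pick the longest run of consecutive words that appears in the corpus,
# emit that clip, and continue from the next unmatched word.
from dataclasses import dataclass

@dataclass
class Clip:
    file: str
    start: float
    end: float

def build_supercut(target: str, index: dict[str, Clip], max_phrase_len: int = 8) -> list[Clip]:
    words = target.lower().split()
    clips: list[Clip] = []
    i = 0
    while i < len(words):
        match = None
        # try the longest phrase first, then shrink until something is found
        for n in range(min(max_phrase_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in index:
                match = (n, index[phrase])
                break
        if match is None:
            raise ValueError(f"no clip found for word: {words[i]!r}")
        n, clip = match
        clips.append(clip)
        i += n
    return clips
```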
Great project! Since it relies heavily on subtitle files, and as an alternative to generating your own, which websites would you recommend for finding subtitles for videos that are not on YouTube, i.e. movies and series? Preferably ones with rating systems similar to guitar tab websites - I can envisage a musical similarity in the variance and quality of user-submitted content (timing, volume, tone, punctuation, expression, improvisation, etc.), since I doubt many are composed from the actual scripts. I have never used vosk, so I'm also wondering whether it would be quicker and more reliable than filtering and spot-checking, say, a few subtitle files per video.
I just started playing around with the transcription part after seeing this blog post. Consider giving it a try.
I'm not sure how well most subtitle sources will work with this. I don't think they generally embed the word timings needed for picking out fragments (just line timings). The blog post mentions this being the case for `.srt` specifically. Not 100% sure - someone with a better understanding of subtitle formats would be able to correct me.
FWIW I'm finding the video transcription to be working quite well (and I even decided to use Japanese-speaking media because I wanted to see how well vosk handles it).
It might be my system, but the transcription is unfortunately a bit slow/single-threaded. I quickly added GNU `parallel` in front of the transcription step to speed up processing an entire season.
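For anyone who'd rather stay in Python than reach for GNU parallel, a sketch of the same idea - run the single-threaded transcription step for several files concurrently. It assumes a `videogrep --input FILE --transcribe` invocation; adjust to however you actually run the transcription step.

```python
# Run the CPU-bound, single-threaded transcription step for several
# episodes at once by invoking the CLI in a process pool.
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def transcribe(video: Path) -> Path:
    subprocess.run(["videogrep", "--input", str(video), "--transcribe"], check=True)
    return video

if __name__ == "__main__":
    episodes = sorted(Path("season1").glob("*.mkv"))
    with ProcessPoolExecutor(max_workers=4) as pool:
        for done in pool.map(transcribe, episodes):
            print("transcribed", done)
```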
I hope the subtitle website I am searching for will provide multiple formats, and I understand a lot more effort would be required to produce the .vtt with word fragments. Running a diff on the vosk text and the subtitle file text might help to iron out ambiguities.
WTF is a supercut?
...OK, apparently it means cutting a number of parts containing a given spoken text out of the source video and joining them together again. Still not sure why you would call that a supercut.
I'm not sure why you're being down-voted. It's not a term I'm familiar with either. Even just a link to the Wikipedia article would have improved the post immensely.
Has Zuckerberg deliberately had work/make-up done to look like his own avatar might in some sort of 'metaverse' world? I can't be alone in thinking a lot of those clips look more like gameplay footage than photography?
A text-based document would be so much easier for this than big, clunky Premiere.