georgemandis's comments

Definitely in the same spirit!

Clearly the next thing we need to test is removing all the vowels from words, or something like that :)


I had this same thought and won't pretend my fear was rational, haha.

One thing that I thought was fairly clear in my write-up but feels a little lost in the comments: I didn't just try this with Whisper. I also tried it with their newer gpt-4o-transcribe model, which seems considerably faster. There's no way to run that one locally.


I kind of want to take a more proper poke at this, but focus more on summarization accuracy than word-for-word accuracy, though I see the value in both.

I'm actually curious, if I run transcriptions back-to-back-to-back on the exact same audio, how much variance should I expect?

Maybe I'll try three approaches:

- A straight diff comparison (I know a lot of people are calling for this, but I really think this is less useful than it sounds)

- A "variance within the modal" test running it multiple times against the same audio, tracking how much it varies between runs

- An LLM analysis assessing whether the primary points from a talk were captured and summarized at 1x, 2x, 3x, and 4x speeds (I think this is far more useful and interesting)
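
Something like this quick loop is probably all the variance test needs. A rough sketch, assuming OPENAI_API_KEY is set, talk.m4a is the source audio, and jq is installed; the model name is just a placeholder for whichever one I end up testing:

    # Transcribe the same file three times, then diff the runs to gauge variance
    for i in 1 2 3; do
      curl -s https://api.openai.com/v1/audio/transcriptions \
        -H "Authorization: Bearer $OPENAI_API_KEY" \
        -F file=@talk.m4a \
        -F model=whisper-1 | jq -r .text > "run-$i.txt"
    done
    diff run-1.txt run-2.txt
    diff run-2.txt run-3.txt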


Hahaha. Okay, okay... I will watch it now ;)

(Thanks for your good sense of humor)


I like that your post deliberately gets to the point first and then (optionally) expands later; I think it's a good and generally underutilized format. I often advise people to structure their emails in the same way, e.g. first just cutting to the chase with the specific ask, then giving more context optionally below.

It's not my intention to bloat the information or delivery, but I also don't quite know how to follow this format, especially for this kind of talk, because it's not so much about relaying specific information (like your final script here) as it is a collection of prompts back to the audience, things to think about.

My companion tweet to this video on X included a brief TL;DR/summary where I tried, but I didn't think it was very reflective of the talk; it was more about the topics covered.

Anyway, I am overall a big fan of doing more compute at the "creation time" to compress other people's time during "consumption time" and I think it's the respectful and kind thing to do.


I watched your talk. There are so many more interesting ideas in there that resonated with me and that the summary (unsurprisingly) skipped over. I'm glad I watched it!

LLMs as the operating system, the way you interface with vibe-coding (smaller chunks) and the idea that maybe we haven't found the "GUI for AI" yet are all things I've pondered and discussed with people. You articulated them well.

I think some formats, like a talk, don't lend themselves easily to meaningful summaries. It's about giving the audience things to think about, to your point. It's the storytelling that makes the whole more than the sum of its parts, and it's why we still do it.

My post is, at the end of the day, really more about a neat trick to optimize transcriptions. This particular video might be a great example of why you may not always want to do that :)

Anyway, thanks for the time and thanks for the talk!


> I often advise people to structure their emails [..]

I frequently do the same, and eventually someone sent me this HBR article summarizing the concept nicely as "bottom line up front". It's a good primer for those interested.

https://hbr.org/2016/11/how-to-write-email-with-military-pre...


Interesting! At $0.02 to $0.04 an hour I don't suspect you've been hunting for optimizations, but I wonder if this "speed up the audio" trick would save you even more.

> We do this internally with our tool that automatically transcribes local government council meetings right when they get uploaded to YouTube

Doesn't YouTube do this for you automatically these days within a day or so?


> Doesn't YouTube do this for you automatically these days within a day or so?

Oh yeah, we do a check first and use youtube-transcript-api if there's an automatic one available:

https://github.com/jdepoix/youtube-transcript-api

The tool usually detects new uploads within ~5 minutes though, so there's rarely one available yet. It then sends the summaries to our internal Slack channel for our editors, in case there's anything interesting to follow up on from the meeting.

Probably would be a good idea to add a delay to it and wait for the automatic ones though :)
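
In case it's useful to anyone, the check itself is basically a one-liner with the CLI that package ships. A rough sketch with a hypothetical video ID; I'm going from memory on the flags, so double-check them:

    # Grab YouTube's auto-generated transcript if one exists; otherwise fall back to our own transcription
    youtube_transcript_api dQw4w9WgXcQ --languages en > transcript.txt \
      || echo "no transcript yet; transcribe the audio ourselves"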


> I wonder if this "speed up the audio" trick would save you even more.

At this point you'll need to at least check how much running ffmpeg costs. Probably less than $0.01 per hour of audio (approximate savings) but still.


> Doesn't YouTube do this for you automatically these days within a day or so?

Last time I checked, I think the Google auto-captions were noticeably worse quality than Whisper, but maybe that has changed.


Yeah, I'd like to do a more formal analysis of the outputs if I can carve out the time.

I don't think a simple diff is the way to go, at least for what I'm interested in. What I care about more is the overall accuracy of the summary—not the word-for-word transcription.

The test I want to set up is using LLMs to evaluate the summarized output and see if the primary themes/topics persist. That's more interesting and useful to me for this exercise.
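
A sketch of how I'm imagining that check, using the chat completions API as the judge. The file names and prompt are hypothetical, and it assumes jq 1.6+ for --rawfile:

    # Hypothetical LLM-as-judge pass: do the 1x and 3x summaries cover the same primary themes?
    jq -n --rawfile a summary-1x.txt --rawfile b summary-3x.txt \
      '{model: "gpt-4o", messages: [{role: "user",
        content: ("Do these two summaries cover the same primary themes? List anything missing from the second.\n\nSummary A:\n" + $a + "\n\nSummary B:\n" + $b)}]}' \
      | curl -s https://api.openai.com/v1/chat/completions \
          -H "Authorization: Bearer $OPENAI_API_KEY" \
          -H "Content-Type: application/json" \
          -d @- \
      | jq -r '.choices[0].message.content'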


Should be fixed now. Thank you!


For what it's worth, I completely agree with you, for all the reasons you're saying. With talks in particular I think it's seldom about the raw content and ideas presented and more about the ancillary ideas they provoke and inspire, like you're describing.

There is just so much content out there. And context is everything. If the person sharing it had led with some specific ideas or thoughts I might have taken the time to watch and looked for those ideas. But in the context in which it was received, a quick link with no accompanying note, I really just wanted the "gist" to know what I was even potentially responding to.

In this case, for me, it was worth it. I can go back and decide if I want to watch it. Your comment has intrigued me so I very well might!

++ to "Slower is usually better for thinking"


Oooh fun! I had a feeling there was more ffmpeg wizardry I could be leaning into here. I'll have to try this later—thanks for the idea!


In the meantime I realized that the apad part is nonsensical - it pads the end of the stream, not each silence-removed cut. I wanted to get angry at o3 for proposing this, but then I had a look at the silenceremove= documentation myself: https://ffmpeg.org/ffmpeg-filters.html#silenceremove

Good god. You couldn't make that any more convoluted and hard-to-grasp if you wanted to. You gotta love ffmpeg!

I now think this might be a good solution:

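    # Trim leading silence, then drop every silent stretch longer than 0.15s below -40 dB (RMS)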
    ffmpeg -i video-audio.m4a \
           -af "silenceremove=start_periods=1:stop_periods=-1:stop_duration=0.15:stop_threshold=-40dB:detection=rms" \
           -c:a aac -b:a 128k output.m4a -y


I love ffmpeg but the documentation is often close to incomprehensible.


Out of curiosity, how might you improve those docs? They seem fairly reasonable to me.


The documentation reads like it was written by a programmer who documented the different parameters to their implementation of a specific algorithm. Now when you as the user come along and want to use silenceremove, you'll have to carefully read through this, and build your own mental model of that algorithm, and then you'll be able to set these parameters accordingly. That takes a lot of time and energy, in this case multiple read-throughs and I'd say > 5 minutes.

Good documentation should do this work for you. It should explain somewhat atomic concepts to you, that you can immediately adapt, and compose. Where it already works is for the "detection" and "window" parameters, which are straightforward. But the actions of trimming in the start/middle/end, and how to configure how long the silence lasts before trimming, whether to ignore short bursts of noise, whether to skip every nth silence period, these are all ideas and concepts that get mushed together in 10 parameters which are called start/stop-duration/threshold/silence/mode/periods.

If you want to apply this filter, it takes a long time to build mental models for these 10 parameters. You do have some example calls, which is great, but which doesn't help if you need to adjust any of these - then you probably need to understand them all.

Some stuff I stumbled over when reading it:

"To remove silence from the middle of a file, specify a stop_periods that is negative. This value is then treated as a positive value [...]" - what? Why is this parameter so heavily overloaded?

"start_duration: Specify the amount of time that non-silence must be detected before it stops trimming audio" - parameter is named start_something, but it's about stopping? Why?

"start_periods: [...] Normally, [...] start_periods will be 1 [...]. Default value is 0."

"start_mode: Specify mode of detection of silence end at start": start_mode end at start?

It's very clunky. Every parameter has multiple modes of operation. Why is it start and stop for beginning and end, and why is "do stuff in the middle" part of the end? Why is there no global mode?

You could nitpick this stuff to death. In the end, naming things is famously one of the two hard problems in computer science (the others being cache invalidation and off-by-one errors). And writing good documentation is also very, very hard work. Just exposing the internals of the algorithm is often not great UX, because then every user has to learn how the thing works internally before they can start using it (hey, looking at you, git).

So while it's easy to point out where these docs fail, it would be a lot of work to rewrite this documentation from the top down, explaining the concepts first. Or even rewriting the interface to make this more approachable, and the parameters less overloaded. But since it's hard work, and not sexy to programmers, it won't get done, and many people will come after, having to spend time on reading and re-reading this current mess.


> "start_mode: Specify mode of detection of silence end at start": start_mode end at start?

In "start_mode", "start" means "initial", and "mode" means "method". But specifically, it's a method of figuring out where the silence ends.

> In the end, naming things is famously one of the two hard problems in computer science

It's also one of the hard problems in English.


> naming things is famously one of the two hard problems in computer science

Isn't ffmpeg made by a French person? As a francophone myself, I can tell you one of the biggest weaknesses of francophone programmers is naming things, even worse when it's in English. Maybe that's what's at play here.


Curious if this is helpful.

https://claude.ai/public/artifacts/96ea8227-48c3-484d-b30b-6...

I had Claude rewrite the documentation for silenceremove based on your feedback.


If you did it in two passes, you could find the cut points using silencedetect, use a bunch of -ss/-t/-i options based on those segments, and apad each segment with a -filter_complex chain that ends in a concat. It would be a wonderfully gnarly command for very little benefit, but it could be done.
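
Roughly, for the morbidly curious (the cut points in pass 2 are made up; thresholds to taste):

    # Pass 1: find the silences (silencedetect logs the timestamps to stderr)
    ffmpeg -i in.m4a -af silencedetect=noise=-40dB:d=0.3 -f null - 2>&1 | grep silence_

    # Pass 2: trim each kept segment, pad it, and concat them back together
    ffmpeg -ss 0 -t 12.3 -i in.m4a -ss 14.1 -t 16.4 -i in.m4a \
      -filter_complex "[0:a]apad=pad_dur=0.2[a0];[1:a]apad=pad_dur=0.2[a1];[a0][a1]concat=n=2:v=0:a=1[out]" \
      -map "[out]" -c:a aac -b:a 128k out.m4a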


I was trying to summarize a 40-minute talk with OpenAI's transcription API, but the audio was too long for their limit. So I sped it up with ffmpeg to fit within the 25-minute cap. It worked quite well (up to 3x speed) and was cheaper and faster, so I wrote about it.

Felt like a fun trick worth sharing. There’s a full script and cost breakdown.
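
The speed-up itself is just ffmpeg's atempo filter; a minimal version of the idea looks something like this (older builds cap atempo at 2.0 per instance, hence chaining two to get 3x; the post has the full script and exact flags):

    # Speed the audio up 3x so a 40-minute talk fits well under the 25-minute API cap
    ffmpeg -i talk.m4a -filter:a "atempo=2.0,atempo=1.5" -c:a aac -b:a 64k talk-3x.m4a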


You could have kept quiet and started a cheaper-than-OpenAI transcription business :)


I've already done that [1]. A fraction of the price, 24-hour limit per file, and speedup tricks like the OP's are welcome. :)

[1] https://speechischeap.com


Nice. I don't expect you to spill the beans, but is it doing OK (some customers)?

Just wondering if I can build a retirement out of APIs :)


It's sustainable, but not enough to retire on at this point.

> Just wondering if I can build a retirement out of APIs :)

I think it's possible, but you need to find a way to add value beyond the commodity itself (e.g., audio classification and speaker diarization in my case).


Can it do real-time transcription with diarization? I'm looking for that for a product feature I'm working on. Currently I've seen Speechmatics do this well, haven't heard of others.


Not yet. The gains in efficiency come from optimizing the speedup factor. Real-time audio cannot be processed any faster than 1× by definition.


Pre-processing the audio is still a valid biz; multiple types of pre-processing might be worthwhile.


Sure, but now the world is a better place because he shared something useful!


Or OpenAI will do it themselves for transcription tasks.

