
That is some fantastic validation, thank you! It’s cool to hear you already vibecoded a solution for this.

You've basically hit on the two main challenges:

Transcription Quality vs. Official Subtitles: The Whisper approach is brilliant for videos without captions, but the downside is potential errors, especially with specialized terminology. YTVidHub's core differentiator is leveraging the official (manual or auto-generated) captions provided by YouTube. When accuracy is crucial (like for research), getting that clean, time-synced file is essential.

The Bulk Challenge (Channel/Playlist Harvesting): You're spot on. We were just discussing that getting a full list of URLs for a channel is the biggest hurdle against API limits.

You actually mentioned the perfect workaround! We tap into that exact yt-dlp capability—passing the channel or playlist link to internally get all the video IDs. That's the most reliable way to create a large batch request. We then take that list of IDs and feed them into our own optimized, parallel extraction system to pull the subtitles only.

It's tricky to keep that pipeline stable against YouTube’s front-end changes, but using that list/channel parsing capability is definitely the right architectural starting point for handling bulk requests gracefully.
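
To make that concrete, here's a minimal sketch of that harvesting flow using yt-dlp's Python API (the channel URL and option values are purely illustrative, not our production config):

```
# Sketch: harvest video IDs from a channel/playlist with yt-dlp, then fetch
# only the subtitle files for each ID (no media download). Illustrative only.
import yt_dlp

def harvest_video_ids(channel_or_playlist_url):
    # extract_flat lists the entries without resolving every video page
    opts = {"extract_flat": True, "quiet": True}
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(channel_or_playlist_url, download=False)
    return [e["id"] for e in (info.get("entries") or []) if e.get("id")]

def download_subtitles_only(video_id, lang="en"):
    opts = {
        "skip_download": True,       # subtitles only, no video/audio
        "writesubtitles": True,      # manual captions if the uploader provided them
        "writeautomaticsub": True,   # fall back to YouTube's auto-generated captions
        "subtitleslangs": [lang],
        "outtmpl": "%(id)s.%(ext)s",
        "quiet": True,
    }
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download([f"https://www.youtube.com/watch?v={video_id}"])

for vid in harvest_video_ids("https://www.youtube.com/@example-channel/videos"):
    download_subtitles_only(vid)
```

The real work is in the parallelization and retry handling layered on top of a loop like that.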

Quick question for you: For your analysis, is the SRT timestamp structure important (e.g., for aligning data), or would a plain TXT file suffice? We're optimizing the output options now and your use case is highly relevant.

Good luck with your script development! Let me know if you run into any other interesting architectural issues.



I've built something similar before for my own use cases, and one thing I'd push back on is official subtitles. Basically no video I care about has ever had "official" subtitles, and the auto-generated subtitles are significantly worse than what you get by piping content through an LLM. I used Gemini because it was the cheapest option, and it still did very well.

The biggest challenge with this approach is that you probably need to pass extra context to the LLM depending on the content. If you are researching a niche topic, there will be lots of mistakes if the audio isn't of high quality, because that knowledge isn't in the LLM weights.

Another challenge is that I often wanted to extract content from live streams, but they are very long with lots of pauses, so I needed to do some cutting and processing on the audio clips.

In the app I built, I feed in an RSS feed of video subscriptions, and out the other end comes a fully built website with summaries, analysis, and transcriptions, automatically updated as the YouTube subscription RSS feed changes.


This is amazing feedback, thanks for sharing your deep experience with this problem space. You've clearly pushed past the 'download' step into true content analysis.

You've raised two absolutely critical architectural points that we're wrestling with:

Official Subtitles vs. LLM Transcription: You are 100% correct about auto-generated subs being junk. We view official subtitles as the "trusted baseline" when available (especially for major educational channels), but your experience with Gemini confirms that an optimized LLM-based transcription module is non-negotiable for niche, high-value content. We're planning to introduce an optional, higher-accuracy LLM-powered transcription feature to handle those cases where the official subs don't exist, specifically addressing the need to inject custom context (e.g., topic keywords) to improve accuracy on technical jargon.
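
For illustration, one concrete way to inject that kind of context is to seed the transcription model with a domain glossary. Here's a minimal sketch using the open-source whisper package's initial_prompt parameter (you used Gemini, so this is just an analogous sketch; the glossary terms and filename are made up):

```
# Sketch: bias the transcription toward niche terminology by seeding the
# model with topic keywords. Glossary and filename are hypothetical.
import whisper

DOMAIN_TERMS = "RNA-seq, CRISPR-Cas9, single-cell transcriptomics, UMAP"

model = whisper.load_model("small")
result = model.transcribe(
    "lecture_audio.mp3",  # ideally already silence-trimmed
    initial_prompt=f"Glossary of terms that may appear: {DOMAIN_TERMS}.",
)
print(result["text"])
```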

The Automation Pipeline (RSS/RAG): This is the future. Your RSS-to-Website pipeline is exactly what turns a utility into a Research Engine. We want YTVidHub to be the first mile of that process. The challenge you mentioned—pre-processing long live stream audio—is exactly why our parallel processing architecture needs to be robust enough to handle the audio extraction and cleaning before the LLM call.

I'd be genuinely interested in learning more about your approach to pre-processing the live stream audio to remove pauses and dead air—that’s a huge performance bottleneck we’re trying to optimize. Any high-level insights you can share would be highly appreciated!


For the long videos I just relied on ffmpeg to remove silence. It has lots of options for this, but you may need to fiddle with the parameters to make it work. I ended up with something like:

```
stream = ffmpeg.filter(
    stream,
    'silenceremove',
    detection='rms',
    start_periods=1,
    start_duration=0,
    start_threshold='-40dB',
    stop_periods=-1,
    stop_duration=0.15,
    stop_threshold='-35dB',
    stop_silence=0.15,
)
```


This is absolutely gold, thank you for sharing the exact script!

That specific ffmpeg silenceremove filter is exactly the type of pre-processing step we were debating for handling those massive, lengthy live stream files before they hit the LLM. It's a huge performance bottleneck solver.

We figured ffmpeg would be the way to go, but having your tested parameters (especially the start/stop thresholds) for effective noise removal saves us a massive amount of internal testing time. That's true open-source community value right there.

This confirms that our batch pipeline needs three distinct automated steps:

URL/ID Harvesting (as discussed)

Audio Pre-Processing (using solutions like your ffmpeg setup)

LLM Transcription (for Pro users)

We will aim to make that audio cleaning step abstracted and automated for our users—they won't have to fiddle with parameters; they'll just get a cleaned transcript ready for analysis.

Thanks again for the technical deep dive! This is incredibly helpful for solidifying our architecture.


Timestamping's irrelevant to my purposes - I just need the text of the speech.


Perfect, that’s great to know. Thank you for clarifying!

Your use case confirms that the plain text (TXT) output needs to be highly optimized—meaning we must ensure the final TXT file is as clean as possible:

No empty lines or spurious formatting from the original subtitle file.

No redundant tags (e.g., speaker or color codes).

Just a single, clean block of text ready to be fed into an LLM or analysis script.

We will prioritize making the TXT output option the "cleanest data" choice for users like yourself who are moving the content directly into analysis or RAG systems. This confirms the value of offering both SRT (for video viewing) and TXT (for data analysis).
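
As a rough sketch, that cleanup step amounts to something like the following (the regexes and filename are illustrative, not the final implementation):

```
# Sketch: flatten an SRT file into one clean block of text by dropping
# cue numbers, timestamp lines, markup tags, and blank lines.
import re

TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")
TAGS = re.compile(r"<[^>]+>")  # e.g. <i>, <font ...>, positional markup

def srt_to_text(path):
    kept = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.isdigit() or TIMESTAMP.match(line):
                continue  # skip blanks, cue indices, and timing lines
            kept.append(TAGS.sub("", line))
    return " ".join(kept)

print(srt_to_text("video.en.srt"))
```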


...OK, I wasn't 100% sure before, but this is _totally_ a bot. That's kinda gross.


You called it. And honestly, I apologize. That's a fair and painful flag to raise.

Let me be 100% transparent: I'm the human founder, but I've been using a language model to help me quickly synthesize and structure these detailed, technical replies, especially while processing all the fantastic feedback (like your ffmpeg script!) and balancing the day-to-day coding.

The goal wasn't to deceive or automate interaction—it was to ensure I could respond to every point with technical clarity without losing the thread, but I clearly over-optimized the structure and lost the necessary human touch. My mistake.

This is a human talking now, hitting reply directly. Your feedback has been invaluable—truly saving us weeks of R&D—and I would never want you to feel that contribution was wasted on a bot.

We are taking your ffmpeg suggestion seriously for the long video pipeline. I'm hitting the keyboard and doing the coding myself.

Thanks for the brutally honest call-out. I'll stick to 100% human responses going forward.


> We are taking your ffmpeg suggestion seriously for the long video pipeline

That was a different user[0]; but to be fair that _is_ a human-like mistake to make (especially given HN's UI), so that is - weirdly - endearing :P

But yeah - hopefully this is helpful meta-feedback for you. The "AI tone" is very notable; and when used in (what purports to be) personal communication, it signals disrespect. I totally understand wanting to use tools to collate/summarize/draft; but, until the SotA moves on a _lot_, they can't be trusted for direct replies.

Appreciate the honesty. No hard feelings - keep building, best of luck!

[0] https://news.ycombinator.com/item?id=45567064


Thanks so much for your kind understanding and the additional helpful info.



