Rearchiving 2M hours of digital radio, a comprehensive process (nb.no)
144 points by todsacerdoti on Aug 30, 2024 | 36 comments


I was involved with two projects to digitise 20,000 hours each of analog audio and video: I built the database used to select the tapes to digitise and to manage their destruction. One was in the late 2000s, the other in the 2020s.

One of the challenges for the project was finding working 1-inch analog video machines. The team scoured the world for machines, working or not, and managed to get several running. There is one particular part that fails; Sony only has a handful left, and the machinery to make them is no longer available. When they are gone, the media will be unplayable.

The data complexity was due in part to there being multiple versions of varying quality: a program can be split across multiple tapes, and a tape can hold multiple programs. So they needed to ensure that all tapes of the best versions were selected, and also to know what parts of other programs were on those tapes – bonus material. Finally, it was common to re-use video tapes to save money, so rare fragments of other material could sometimes be found after the end of the expected programs.


> The data complexity was due in part to there being multiple versions of varying quality

Nonono, save it all! Only a matter of time before they can be merged together to make a supercopy.

(But I get it: the practicality of saving even more data isn’t there)


Archiving of broadcast material is rather difficult: should it be the final result that was shown over the air with all the overlays and graphics superimposed, or should it be the raw feeds? There is a board of people that make these decisions.

For some program material there were low resolution copies, copies that were shortened for different purposes, copies that were edited for different markets... and so on. There was a decision tree to follow.

There was also a "never to be shown again" list, usually where people that were involved in the program (either on-screen or part of the production team) were later associated with crimes or very unsavoury behaviour, or the material was in some other way extremely controversial.


For context, this is the Norwegian National Library [1], which is tasked with archiving and preserving everything that is publicly published or broadcast in Norway. Similar to the U.S. Library of Congress.

[1] https://www.nb.no/en/


Do they also perform speech recognition/transcription, while they are at it?

If so, what tools are they using?


The re-archiving process was mainly a data migration. However, there is an initiative to use the in-house developed NB-whisper model [1] (based on OpenAI's Whisper) to do speech recognition/transcription.

[1] https://huggingface.co/collections/NbAiLab/nb-whisper-65cb83...
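For the curious, running one of those checkpoints is only a few lines with the Hugging Face transformers pipeline. A minimal sketch, where the exact model id and the audio file name are illustrative assumptions rather than our production setup:

    # Minimal sketch: transcribe one file with an NB-Whisper checkpoint via the
    # transformers pipeline. The model id and audio path are assumptions.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="NbAiLab/nb-whisper-large",  # assumed checkpoint from the linked collection
        chunk_length_s=30,                 # chunk long broadcasts into 30 s windows
    )

    result = asr("broadcast.wav")          # hypothetical input file
    print(result["text"])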


That is the model I've been playing with for my homebrew assistant, as mentioned here[1].

The medium and smaller models weren't quite up to the task, misinterpreting some crucial words here and there, but so far I've been very pleased with the large Q5 model.

It directly translates into English well enough that subsequent English LLMs understand the meaning with a high degree of accuracy, at about 5-10x real-time on my 2080 Ti.

[1]: https://news.ycombinator.com/item?id=41396649


Encoding to AAC is still, even after all these years, not an easy process.

Well, unless you have a Mac. Apple AAC is still the best quality encoder, but it's only available for macOS, and even then, the only UI officially supported by Apple is the "Music" app, so you're going to have to use a third-party command line or GUI wrapper. (XLD is good.)

Outside of that, the quality of the various alternatives has changed over the years, but the Fraunhofer encoder, which they say they are using, is a good choice, even though licensing problems mean that it isn't included in ffmpeg by default. Frustratingly, the default build does come with an encoder called "aac", which isn't Fraunhofer and has very poor quality. So you have to make your own custom build.

Even then, the low-pass cutoff defaults to a weirdly low value, leaving the user to guess, or to consult ancient wikis, to divine a suitable value.[1]
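Once you do have such a build, the invocation itself is simple enough. A rough sketch from Python, with illustrative bitrate, cutoff, and file names (not recommendations):

    # Minimal sketch: call a custom ffmpeg build compiled with libfdk_aac and raise
    # the low-pass cutoff explicitly. Bitrate, cutoff and file names are illustrative.
    import subprocess

    subprocess.run([
        "ffmpeg", "-i", "input.wav",
        "-c:a", "libfdk_aac",   # the Fraunhofer encoder (not in default ffmpeg builds)
        "-b:a", "160k",
        "-cutoff", "18000",     # otherwise the encoder picks a surprisingly low value
        "output.m4a",
    ], check=True)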

It's unfortunate that AAC remains the best (by which I mean, most-supported) choice for modern lossy audio, because making it is still a huge pain.

[1]: https://wiki.hydrogenaud.io/index.php?title=Fraunhofer_FDK_A...


Apple's great AAC encoder was (is?) also available as part of iTunes for Windows. One can extract the DLLs from the installer and use them with qaac to encode AACs. Apparently it even works in Wine: https://www.andrews-corner.org/qaac.html ! OBS also picks up the CoreAudio DLLs, so streamers install iTunes for Windows to improve the audio quality of their streams.
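If you go that route, a typical invocation looks roughly like the sketch below; the TVBR quality value and file names are illustrative, and it assumes the extracted CoreAudio DLLs are somewhere qaac can find them:

    # Rough sketch of driving qaac from Python on Windows. Assumes the CoreAudio DLLs
    # extracted from the iTunes installer are available to qaac. Values are illustrative.
    import subprocess

    subprocess.run([
        "qaac64.exe",
        "--tvbr", "91",        # Apple AAC true-VBR quality (scale 0-127)
        "input.wav",
        "-o", "output.m4a",
    ], check=True)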


For archival purposes, though, a proprietary codec in a proprietary app on a proprietary OS that could be gone in the next release is not ideal. This demands a format that will be playable at the highest quality in 50 or 100 years. Not an easy problem!


Every encoding tool I've seen and used relies on Apple AAC via the qaac wrapper. At ~160 kbps VBR it's far better than Vorbis, which needs 256 kbps for some genres like noise/industrial, and MP3 is just severely outdated, needing V0 or 320 kbps to be transparent. I use FhG AAC (the Winamp encoder) as a fail-safe when the Apple encoder chokes on something.


You can also use the Apple AAC encoder through ffmpeg (on a Mac only), with the argument "-c:a aac_at". Handy, as it lets you encode the audio track of videos in addition to pure audio container formats:

https://trac.ffmpeg.org/wiki/Encode/AAC#aac_at
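For example, something along these lines to re-encode just the audio track of a video while copying the video stream (file names and bitrate are illustrative):

    # Sketch (macOS only): encode a video's audio track with Apple's AudioToolbox AAC
    # encoder while copying the video stream. File names and bitrate are illustrative.
    import subprocess

    subprocess.run([
        "ffmpeg", "-i", "input.mov",
        "-c:v", "copy",       # leave the video stream untouched
        "-c:a", "aac_at",     # Apple's AudioToolbox AAC encoder
        "-b:a", "160k",
        "output.mp4",
    ], check=True)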


There is also fdkaac[1] as a wrapper for libfdk-aac.

Besides that, there are several Docker images with static builds of ffmpeg that include the libfdk bindings.

However, am I the only one questioning the use of a lossy codec, which is suboptimal for archival purposes? Maybe the sheer amount of data is too much for lossless...

1: https://github.com/nu774/fdkaac


I find it weird that they use M4A when the previous blog post explains that there are preferred file formats, with a link to a list that doesn't contain M4A at all.


The preferred file format list is for preservation formats, while the M4A/AAC is used for access.


"The new MP4 files is not to be archived in DPS, as they are secured on the Wowza viewing platform."

If I read this correctly it means that those files will not be preserved due to DRM?


Well, since they are preserving the original WAV, it doesn't really matter whether they preserve the MP4. (I'm not sure whether DRM has anything to do with it.)


No, why?


I would have thought reencoding MP3 to MP4 would lose quality, even going to a much higher bitrate. Why not leave MP3s as MP3s?


It sounded to me like they are encoding lossless WAV to lossy MP4, and replacing MP3s that were also from the same WAV, and in any case also keeping the WAV.

That's why it was notable that in a couple of cases where the WAV was corrupt, they kept the original MP3 as being now the 'best available' copy. In no case did they transcode MP3->MP4.



yes, sorry I didn't see your comment. We're saying the same thing.


They're using AAC for the audio, which has many benefits compared to MP3 or WAV. MP4 also offers more comprehensive metadata options.


But a re-encode is a re-encode: you're going to sacrifice some quality... (even if it's inaudible to human ears at 160 kbps)


I interpret the quote below as: if there were both a wav file and an mp3 file, they dumped the mp3 and created an mp4/aac from the wav (saving both the wav and the mp4/aac). Where there was only an mp3 file, I assume they kept the mp3. Hence, not a lossy transcoding process.

> Some radio broadcasts were stored as mp3 and wav files, with accompanying checksum files. Other broadcasts were only stored as mp3. Before the re-archiving process began, it was decided to generate new MP4 playback files from the wav files to replace the varying qualities of the old mp3 files.
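In other words, roughly this decision (a sketch with made-up names, not their actual tooling):

    # Hedged sketch of the workflow as I read it; names are made up for illustration.
    def choose_files(has_wav: bool) -> dict:
        if has_wav:
            return {
                "preservation_master": "broadcast.wav",  # lossless original is kept
                "access_copy": "broadcast.m4a",          # new m4a/aac encoded from the wav
            }
        return {
            "preservation_master": "broadcast.mp3",      # mp3 is the best available copy
            "access_copy": "broadcast.mp3",              # a copy also goes to streaming
        }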


This is correct. The mp3 is an access file sitting on our streaming servers (Wowza), outside of the preservation environment. The old access files were of very low quality, so new m4a/aac files were created.

In the cases where we only had mp3 files, the mp3 file was preserved as our master in the preservation environment, with a copy sent to the streaming servers.


Why Wowza instead of Backblaze or some other cheap service? Backblaze has no egress fees; pair it with Fastly for a super cheap CDN.


Our Wowza streaming servers are hosted in-house and are integrated with our authentication software. I don't know the nitty-gritty details of the solution though; access is outside of my domain.


How can we listen to radio from the 90s?



Are there any archives of English-speaking radio from the '90s?



What are you using for the search tech?


Cassandra and Elasticsearch


Large datasets can be tricky to handle, as the normal workflows people take for granted may no longer work as expected.

For example, many *nix filesystems will scale just fine, self-check, and de-duplicate. However, accessing a path in a high-branching-factor tree can cause problems for ls, rm, etc.

Notably, for external BLOBs we found it simple to rename each file to its sha512 hash (with a standard media-specific extension) and to include the extracted metadata as a JSON file (details of the encoding, label, stats, etc.). Then, as a quality-of-life improvement, we would pack the sub-paths based on the file's hash characters, so that fewer leaf entries were present in any given path (char[0] is the local index, char[1..k] are the sub-paths).

It sounds strange, but when the files get big it is convenient to be able to audit each part in a decoupled way, independent of the underlying filesystems/NFS/databases.
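Something along these lines, though the fan-out depth, directory layout, and metadata fields here are illustrative rather than exactly what we ran:

    # Rough sketch of the hash-addressed layout described above. The fan-out depth,
    # directory layout, and metadata fields are illustrative, not the exact scheme we ran.
    import hashlib, json, shutil
    from pathlib import Path

    def archive_blob(src: Path, root: Path, metadata: dict, depth: int = 2) -> Path:
        h = hashlib.sha512()
        with src.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
                h.update(chunk)
        digest = h.hexdigest()                                # 128 hex characters
        leaf = root.joinpath(*digest[:depth])                 # pack sub-paths from hash chars
        leaf.mkdir(parents=True, exist_ok=True)
        dst = leaf / (digest + src.suffix)                    # file is named by its hash
        shutil.copy2(src, dst)
        meta = {"sha512": digest, "original_name": src.name, **metadata}
        dst.with_name(dst.name + ".json").write_text(json.dumps(meta, indent=2))
        return dst

    # hypothetical usage:
    # archive_blob(Path("show.mkv"), Path("/archive"), {"codec": "ffv1", "label": "radio"})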

* the file hash stored in a database also implies the host OS file-set label and packed location

* the metadata is preserved alongside the media as human-readable UTF-8 JSON text, so the host node does not require knowledge of the media-specific encoding during most operations (i.e. the file server has minimal dependencies)

* the external-BLOB location of a file is set by the k-1 characters of the hash string used (k=128 chars for sha512, for example)

* normal CLI still works in each k-1 leaf path, which will hold under ((2+w)*16) file entries depending on how it's implemented. Note that Windows usually demands k<124 sub-paths deep, and caps string lengths to under 260 chars.

* BLOB file sets are trivially accessed/locked by many processes on the host OS, and additional files with extensions/analytics may be put beside the original media as JSON metadata, etc.

* operates on top of whatever filesystem you are deploying at the moment, where we assume k<=128 will sit under your file system limits

* user and group read-only permissions are supported on the host NFS or JBOD

* must turn off auto-indexing filesystem search routines on the host OS node

* a database BLOB index can be rebuilt/ported from the archive tree leaf nodes

* source file corruption is self-evident (the metadata should match inside the json file as well)

* contents are obfuscated, but duplicate file checks are trivial (note this differs from block level de-duplication, which can be impractical if the data flow rate is high)

* tree-root-node sub-paths may be externally mounted on explicitly sharded volumes to share/backup the workload as needed... given the pseudo-random hash makes specific io-busy areas unlikely.

Mind you, we only had to deal with around 40 TiB of sparse video data on that old project. Unsure whether such a method would be performant for 2M hours of content.

Sometimes the metadata format inside media is versioned in nonstandard ways... if your intent is long-term access support, then storing the name and version of the program used to read the file may be necessary to parse it properly in the future (even if a VM OS image snapshot with the codecs/parser is also archived as a read-only file).

Best of luck =3


Very interesting



