Automatic audio text transcriptions

In honour of Global Accessibility Awareness Day (GAAD) today, I’m throwing this method out into the ether that is the web. However, it’s not really the quote-unquote “technique” I’m offering, at least not in the sense that I expect anyone to use it. Rather, my aim is to get people thinking about the content they consume and produce on and for the web, period. And thinking a little differently about said web content.

After all, that’s the point of going through the effort of raising awareness. To think about things in a manner in which you aren’t typically conditioned to think about them. Or in other words, it’s not so much the result I’m most interested in here, it’s the reasons for, and the process behind, that result. It’s my hope to draw some attention towards automatic text transcriptions of audio-only podcasts, specifically.

And I’m aware such a solution is still a ways off from being practical, as in reliably usable. But it’s never too early to entertain prospects. And experiment.

Let’s be real: podcasters are not typically artists with an abundance of resources to draw from to offer their listeners text transcripts of their content. And I’m specifically referring to podcasts I tend to enjoy and truly appreciate. Namely Escape Velocity Radio, Citizen Radio and Radio Dispatch, just to name three. But who has time to transcribe each of their shows? Or better yet, who has the money to have every show professionally transcribed? Not these three.

These podcasters, or anyone producing audio content for the web more broadly, would benefit immensely from a solution that would automatically transcribe their audio content. For the accessibility aspect alone, it would immediately benefit Deaf and hard of hearing users, obviously. And what about translations into different languages? Or discoverability? What if search engines had the ability to crawl and index your actual content, the very thing that draws their users in? How exactly might that not be good for everyone interested?

Conversion to captions

So the idea was suggested to me that I give YouTube’s Automatic Captions (or as Google refers to them, “auto-caps” for short) a try. It should be stated that captions are different from transcriptions. The only difference relevant here, at least, is that captions are time-stamped. Meaning captions are synced to show when the words are spoken in the video.

But I needed to do some preparation first. I had my choice of podcasts which I wanted to transcribe, with permission of course. But before I uploaded the podcast to YouTube, I needed to change it into a video file format YouTube would accept. Following Google’s suggestion, I used the version of iMovie that shipped with my Mac (they suggest a way to do it on Windows, as well) to convert an MP3 to an M4V, in this case.

Once I did that, I simply uploaded my newly converted video to YouTube. And after an amount of time (the turnaround was relatively quick, like a day) YouTube captioned the video for me, automatically (which it does should you choose not to add captions yourself).

And while some to a lot of the “auto-caps” results can be quite hilarious in places, depending on the source material of course, the job it does perform isn’t that bad, honestly. It’s really quite a ways from being left alone to do its thing. But credit where credit is due.

Conversion to transcripts

So now how did I get the captions off YouTube, so I could edit them back into a transcript and do what I originally intended with them? Which is to have them posted alongside the podcast they came from? I simply used a shareware application that allowed me to download my video. But it also let me download the captions into a separate SubRip Subtitle file (.srt). An .srt file is essentially a text file containing all your captions but with the aforementioned time stamps placed every sentence or two throughout the file.

Now this is a perfect opportunity to make use of something called Regular Expressions (Regex), in a text editor that supports their use. What Regex will allow a user to do is strip out patterns of sequential characters in a search-and-replace operation. Please refer to this Wikipedia entry for a much more lucid explanation than I’m capable of providing at this time.
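For the curious, here is a rough sketch of what that search-and-replace might look like as a small Python script rather than in a text editor. To be clear, this is not what I actually did. It assumes the common .srt layout (a cue number, then a timing line such as 00:00:01,000 --> 00:00:04,000, then the caption text), and the function name srt_to_text and the sample captions are made up for illustration.

```python
import re

# Matches an SRT timing line, e.g. "00:00:01,000 --> 00:00:04,000".
# Assumption: the file follows the common SubRip layout.
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}.*$"
)


def srt_to_text(srt: str) -> str:
    """Drop cue numbers and timing lines, then join the caption text."""
    lines = srt.splitlines()
    kept = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            continue  # blank separator between cues
        if TIMING.match(stripped):
            continue  # timing line
        # A cue number is a digits-only line directly above a timing line.
        next_line = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if stripped.isdigit() and TIMING.match(next_line):
            continue
        kept.append(stripped)
    return " ".join(kept)


# Hypothetical sample captions, not taken from any real podcast.
sample = """1
00:00:01,000 --> 00:00:04,000
welcome back to the show

2
00:00:04,500 --> 00:00:07,250
today we are talking about
captions and transcripts
"""

print(srt_to_text(sample))
```

This only removes the time stamps and cue numbers. Fixing the misheard words and noting who is speaking would still be manual work.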

Sadly, I didn’t go this route. Instead I powered through in my text editor. Deleting time stamps, fixing errors, pausing the audio track, clarifying who was speaking, rewinding the track, retyping subtleties I missed, etc. Or in other words, driving myself mad.

Full Disclosure: I didn’t get very far in converting the generated captions into a text transcript.

Point being

Anything worth doing isn’t ever easy to do. As such, I happen to believe if we lived in a perfect world, everything would be as accessible as humanly possible, to anyone who might wish to be a part of anything. But this isn’t a perfect world. And there are exceptions. Not excuses. But exceptions. Realistically, not providing transcripts for audio content is a justifiable exception for those working on shoestring budgets.

While things aren’t quite at the point just yet where this sort of solution is usable, it will never hurt to start thinking about these things. It’s only a matter of time.

Again, my approach today is merely to draw attention to a facet of a relatively new craft that I feel is missing. And needed. I’m hoping to have people think outside of their experiences. It’s not my wish to judge anyone who may not be providing text transcriptions to their audiences. And please do not take the steps I outlined here as any sort of serious attempt at solving this issue. This was merely for fun.

Happy Global Accessibility Awareness Day!
