Live Captioning

  • 608/708 updates
  • Formats
  • Ingesting Live captions (standards)
  • Live Caption Input sources
  • Technology Updates
  • TTML
  • Television parsing

Expert sharing knowledge.

How does live captioning work?

  • Here in the US we have 608/708 putting captions in the stream.
  • That starts with a stenocaptioner. They have a special keyboard for captioning, hooked to a laptop. From there, a serial cable carries the output to hardware that converts it into 708 and inserts it into the stream, or sends it to a web service.

In the broadcast world, captioners connect via a phone line. Back in the old days sometimes you would see modem noise in the captions.

  • Broadcast is what it is, you have to work with what they're sending.
  • For the town hall, that's do-it-yourself, and more interesting. You could have stenocaptioning on site; perhaps you've been to a conference like this one with a separate screen showing the captions. We could have a complete transcript of what was said.
  • At Google we're lucky enough to have a contract with the best stenocaptioners. They sometimes go to the IETF meetings. (Aside.)
  • So, whether it's on site or off site, if you want to stream that to other places you need a network. Google has its own protocol, an HTTP POST to its server, but we created that ourselves. For each piece of stenocaptioning software (a couple of companies make them), you need to work with the vendor to support your protocol.
  • When YouTube implemented that it was a separate team; now we're taking over and fixing some of the mistakes that were made. So we need to work with the software companies to figure out what they have and work out a new standard; of course we'd love that to come out of this working group.
  • Back to the chat application. It's possible: there are companies that offer private captioning. People who are deaf and don't know or aren't comfortable working with sign language connect to these companies over the phone; the company provides a captionist, and the captions come in as another screen. Or, as in Hangouts, you can pin another screen as the captions.
  • How does it fit in? It's not exactly in-band with the video, more alongside it. Those are the things I'm aware of.
  • If these services are receiving POSTs, can people develop web apps and do it themselves if they want to?
  • Yes, they could, but if you want to keep up, a regular keyboard won't do it.
  • On Hangouts, what delays do you typically see with the captionist? Can we adjust caption timing after recording? If it's interactive, you live with what you get. Captionists always have a lag. Sometimes they can predict what someone is going to say.
  • But if you have near-live (broadcast, for example): by the time a live show airs on the west coast, they've had enough time to adjust the captions, though that takes some time.
  • If you're recording a livestream, you could do that for the VOD.
  • There's another way to do it: capture the transcript. For example, large meetings have a captionist. You can see captions, it's interactive, the captionist keeps up. But afterwards they don't even bother with the live captions: they take the transcript and use speech recognition to sync it with the video. There are outside companies that will do it for you. The alignment works pretty well.
  • It kind of sounds like HTTP POST is a de-facto standard, even if it's not written down. Do those devices all share a common format?
  • Wish I knew. When we talked with the vendors we wanted to develop our own protocol. What we got from them is close to WebVTT.
  • It's very simple stuff; any of us could invent a new protocol in an hour. But you have to work with the software people and take what they have now.
  • Interest in having a working group about live caption ingest.
  • In America: 608/708. People are talking about SCTE-35 as well. There are SCTE documents that talk about closed captioning, but I haven't seen them. Ad insertion triggers? Unsure if related. Maybe it can be used to carry text? Cue tones?
  • We're probably okay on ingest. Do we need a standard?
  • If you're working with television, 608/708. Otherwise the field is open and there are no standards.
  • Right now there is no standard for sending partial lines. Is it appropriate to send partial characters? Would like to see WebVTT extended.
  • Can we send complete lines?
  • 608 sends changes, which makes it hard: a change could go back and alter text that was already displayed.
  • WebVTT doesn't have a replace operation. We'd have to extend WebVTT either way.
  • Another suggestion: we've talked about segments. We can think of segments replacing previous segments. Hopefully we could update in a way that feels live, sending the captions again and again. You can't take that directly from a stenocaptioner.
  • In Europe, the practice is a captionist followed by an 'improver' who might correct things. Not sure if that's the same. In Europe they're also using speech recognition, which then goes to a fixer-upper. There would be a lot of delay in that.
  • Very big difference between live broadcast and live teleconferencing.
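
The HTTP POST ingest mentioned above isn't publicly specified. As a rough sketch only, assuming a JSON body and invented field names (`seq`, `start`, `text`) plus an invented endpoint, a captioning client might re-send the current state of a line on every update so the server can stay stateless:

```python
import json

def build_caption_post(seq: int, start_s: float, text: str) -> tuple[dict, bytes]:
    """Build headers and body for one hypothetical caption-ingest POST.

    The field names and JSON framing are invented for illustration; real
    stenocaptioning vendors each speak their own dialect, reportedly close
    to WebVTT.
    """
    body = json.dumps({
        "seq": seq,        # monotonically increasing, lets the server drop stale updates
        "start": start_s,  # media time in seconds when this text began
        "text": text,      # current state of the line, resent in full on each update
    }).encode("utf-8")
    headers = {"Content-Type": "application/json",
               "Content-Length": str(len(body))}
    return headers, body

# Each keystroke would trigger a new POST, e.g. to a hypothetical
# https://ingest.example.com/captions endpoint.
headers, body = build_caption_post(seq=42, start_s=12.5, text="Hello, wor")
```

Re-sending the whole line rather than deltas trades bandwidth for simplicity: a lost or reordered update is harmless because the next one fully supersedes it.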

See EBU documentation and toolkit:

  • open source toolkit: https://github.com/ebu/ebu-tt-live-toolkit
  • EBU spec: https://tech.ebu.ch/publications/tech3370

Live broadcast can tolerate a delay of a number of seconds; captions for teleconferencing are a different matter.

There are many links in the delivery chain. There might be an advantage to having standards there instead.

WebVTT live captions

  • Issue 319 has a spec from last year to extend WebVTT with live captioning. It has things like adding a character, moving to the next row down, and indicating which row something should go into. It has underline and style changes (people who have done 708 will notice), "delete till end of row", backspace, flash on, and erase displayed memory.

608/708 is command-based, not file-based. The WebVTT live captioning proposal tries to extend WebVTT with commands that replicate that model.
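
To make the command-based model concrete, here is a toy decoder; the method names are simplifications of the real 608 control codes, not the actual byte-level commands. It shows why a cue-based file format has trouble replicating this: the visible text is mutable state, not a sequence of immutable cues.

```python
class CaptionWindow:
    """Toy model of a 608-style command-driven caption decoder.

    The commands here (type, backspace, delete-to-end-of-row, roll-up,
    erase) loosely mirror 608 control codes, simplified for illustration.
    """
    def __init__(self, rows: int = 2):
        self.rows = [""] * rows          # the visible "displayed memory"

    def type_char(self, ch: str) -> None:
        self.rows[-1] += ch              # paint-on: characters appear as typed

    def backspace(self) -> None:
        self.rows[-1] = self.rows[-1][:-1]

    def delete_to_end_of_row(self) -> None:
        self.rows[-1] = ""

    def carriage_return(self) -> None:
        self.rows = self.rows[1:] + [""] # roll-up: scroll, bottom row empties

    def erase_displayed_memory(self) -> None:
        self.rows = [""] * len(self.rows)

caps = CaptionWindow()
for ch in "HELLO":
    caps.type_char(ch)
caps.backspace()               # correct a typo: "HELL" remains
caps.type_char("O")            # back to "HELLO"
caps.carriage_return()         # roll up: "HELLO" moves to the top row
for ch in "WORLD":
    caps.type_char(ch)
# caps.rows is now ["HELLO", "WORLD"]
```

Note that the backspace retroactively changes text a viewer has already seen, which is exactly the "change could go back" behavior a WebVTT replace operation would need to express.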

  • Rather than recreate what we did last year, please read it and provide comments: see https://github.com/w3c/webvtt/issues/319
  • Reading the bug. Discussion of what is important.
  • Paint-on is helpful when people pause; it gives you a sense of how fast people are speaking. Sometimes, though, the captioner is just having a brain freeze.
  • Another part of the discussion is the video being delayed. We've had this conversation, and it's surprising how many people don't want to delay the video at all if they can avoid it: less than one second of latency in to out. On the other hand, some kind of processing needs to happen, and that's actually slower than the captions, so we need to delay the captions to wait for the video to catch up.
  • Also live chat: if you have to wait for the whole line to be typed, people get disconnected from the conversation because they have to wait.
  • WebRTC? - can that help with this?
  • A spec for realtime chat.
  • Probably going to be transported over the data channel.
  • The spec is inclusive. We analyzed what features were needed to support 608/708. Then we looked at: if this were just JavaScript, what would you need to do? How would we need to extend the text track API to do this?
  • There's functionality in there to let the text track API alone support live captioning with minimal delay. And the last column is: assume we're doing live captioning and converting it into a WebVTT file (or alternatively recording a live WebRTC conversation with realtime chat into a WebVTT file), without changing it into a cue-based format. What would we need to do to WebVTT to enable that?
  • A new timestamp syntax with NOW timestamps and no end times; NOW would take the clock time, and you could successively build up the cue by plugging in these timestamps. Because we already have timestamp markup, we have the capability to do one-char-at-a-time displays.
  • Does the bug solve most problems?
  • But no progress, because no discussion. And we need to focus on version 1 before version 2 and the live captioning stuff.
  • Whose burning issue is this? The first challenge is to extend the text track API.
  • Recording in WebVTT. Maybe TTML can include something like that.
  • There are already requirements on how to do this with TTML.
  • Discussion of starting with the EBU spec, though maybe that's the wrong starting point.
  • What about bitrate? About 18 kbit/s for character-by-character updates if you're re-sending the whole line each time. (Someone did the math in his head to calculate 18 kbit/s, which was cool.)
  • Discussion of this bitrate being difficult, especially multi-language. We're in an MPD world; that's not a single-digit bitrate.
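
The ~18 kbit/s figure is easy to reproduce under plausible assumptions; the typing rate, line length, and per-update overhead below are guesses chosen for illustration, not numbers from the discussion:

```python
# Back-of-envelope bitrate for re-sending a whole line on every keystroke.
# All inputs are assumptions; the meeting only quoted the ~18 kbit/s result.
chars_per_second = 25          # a fast stenocaptioner, roughly 300 wpm
line_length = 32               # a 608 caption row is 32 character cells wide
overhead_bytes = 58            # assumed per-update framing: timestamps, cue id, transport

bytes_per_update = line_length + overhead_bytes   # 90 bytes per keystroke
bits_per_second = chars_per_second * bytes_per_update * 8
print(bits_per_second)         # 18000 bits/s, i.e. ~18 kbit/s per language
```

Which is why multi-language delivery multiplies quickly: ten caption languages at this rate already cost more than many audio-only renditions.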

Discussing why one character at a time

  • The text track API needs to be able to handle 608/708.
  • Is this a problem that needs solving?
  • A protocol is needed for single-character updates. If we're dealing with broadcast content, why do we want to send a character at a time? 608/708 compliance.
  • Conversion back to 608/708: can we reconstruct into an archive format?
  • Conversation from last year: we want to take an approach that keeps the files usable later. Getting something back that contains the cues is interesting. Players have already solved 608-to-VTT as cues, with the exception of roll-up. Displaying paint-on and pop-on is solved. Provided we get roll-up, going from 608/708 to WebVTT is fine.
  • If the source isn't 608/708, what do we do?
  • Do we care about going back to 608?
  • Maybe? Some say strong no. If you take delivery of a master with 608, don't translate the 608 to webvtt. Just keep it around.
  • Are we saying we want to take this content from the web.
  • If we have the archive we can convert to 608.
  • When we capture 608 we save it that way. We always save the original. But what we don't do is convert from 608 and discard the original.
  • Don't throw away 608.
  • And if you don't get it, use a captions track. Use the webvtt file.
  • In the case of professional masters, there will always be TTML, but also 608, and optionally an STL track. Don't throw it away!
  • Question for John: a bit earlier you were asking about how captions are generated, talking about stenocaptioners and friends. Do you know about that software?
  • No; it tends to be proprietary, with 608 as an output.
  • When YouTube works with those vendors, we have to ask them to implement our protocol.
  • The timed text in MP4 document does have the expressive power to set a new end time for text you sent previously, so you don't have to send it out again.
  • Nick: could you set an end time for that segment at a point in the future? Accuracy issues.
  • If you look at that bug, the very first proposal is a NULL end time.
  • Perhaps the first step is to support the HLS/DASH use case. The bug goes beyond that with the realtime chat use case, addressing other issues where you still want to support addition/deletion. Perhaps that isn't what needs to be solved in the WebVTT file format, but it does need to be solved at the text track API level. Then you could merge up into a rolled-up cue and put that in.
  • Last year we just discussed what possibilities there are. And there are possibilities for WebVTT.
  • Call for test cases to be generated between now and next year.
  • Without an editor driving it, it may not be possible; but if whoever takes over is engaged on live captioning, then it could be.
  • Maybe in addition to an actual live captioner, we should invite someone (or more than one person) from the companies that make the software. Agreement; the companies may well be happy to do this.
  • Is there a resource that describes current best practices for teleconference captioning?
  • Google has this for live updates internally, but not sure if it's best practice.
  • We can get with folks on this.
  • The concern is that if we don't build this into whatever spec we develop, we are boxing out any use of web-generated content for broadcast.
  • Answering the question would be very helpful.
  • I think we asked whether anyone wants backward conversion to 608. The danger is that as the web is used to create media, we can't go from web to broadcast. If it's not backwards compatible, the concern is that we'd never be able to go back.
  • The answer in practice is that in the US, 100% of captions are made to conform to the 608 standard. So ask that they provide it as 608 as well.
  • Isn't it the reverse? Going from TTML or WebVTT to 608 is "easy".
  • It's very difficult to get the 608 timing right because you have a fixed-bandwidth pipe.
  • For broadcast they pay caption agencies to do reformatting, which is the main source of business for those agencies: not creating but changing. Movie, Blu-ray, kids' version. They have equipment to do that kind of work, so I'm not thinking this is a serious problem.
  • The text track API could be taken back into the working group.
  • Get more people involved! Get onto the bugs. And we'll start writing some test cases.
  • Suggestion to add a new section to WebVTT v2 that adds first-class support for segmented VTT.
  • Right now Apple specifies how to use WebVTT as part of a live stream, with captions that can be updated in future segments. It would be good to take that spec, review it, and propose it for VTT v2.
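
In the HLS approach, each WebVTT segment is a complete, self-contained file, and a cue still open at a segment boundary is re-emitted in the next segment, which is also how a "live" cue can grow. A minimal generator sketch; the segment duration, the clamp-and-resend policy, and the MPEGTS offset are assumptions for illustration, though the `WEBVTT`/`X-TIMESTAMP-MAP` framing follows HLS practice:

```python
def fmt(t: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def vtt_segment(seg_start: float, seg_dur: float, cues: list) -> str:
    """Emit one self-contained WebVTT segment covering [seg_start, seg_start+seg_dur).

    cues: (start, end_or_None, text) tuples. An open cue (end=None) is
    clamped to the segment end and re-sent, possibly with more text, in
    the next segment; the player replaces the earlier rendering.
    """
    lines = ["WEBVTT", "X-TIMESTAMP-MAP=MPEGTS:900000,LOCAL:00:00:00.000", ""]
    seg_end = seg_start + seg_dur
    for start, end, text in cues:
        if start >= seg_end or (end is not None and end <= seg_start):
            continue  # cue does not overlap this segment
        shown_end = seg_end if end is None else min(end, seg_end)
        lines += [f"{fmt(max(start, seg_start))} --> {fmt(shown_end)}", text, ""]
    return "\n".join(lines)

# Segment 1: the cue is still being typed, so its end is clamped to 6s...
seg1 = vtt_segment(0.0, 6.0, [(1.0, None, "Hello, wor")])
# ...and segment 2 re-sends it, now longer, for the next 6 seconds.
seg2 = vtt_segment(6.0, 6.0, [(1.0, None, "Hello, world!")])
```

Updating a cue only at segment boundaries gives segment-duration granularity, which is why the issue-319 work on sub-segment (per-character) updates remains relevant alongside this approach.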