The mysterious case of the missing Sabi

Gotcha Goseiger
takenoko
Team Baron
Posts: 36477
Joined: Mon Dec 10, 2007 8:33 pm
Gender: Toast
Favorite series: All of them
Alignment: Neutral
My boom: stick
Quote: <Lunagel> That's Toei's dumb fault
Type: ISFJ Protector
Location: Yami ni umare, yami ni kisu
Contact:

The mysterious case of the missing Sabi

Post by takenoko »

So here's something that Luna noticed. In the captions that air with Gotchard on TV-Asahi, the Sabi part of Sabimaru is always missing.

Image
This is how it's supposed to look. First two kanji are his family name Tsuruhara. And then last two are for his given name, Sabimaru

But here's how the captions present his name:
Image

The very first instance even. You can see that his name has gone from 4 kanji to 3. And whenever it's just Sabimaru or Sabimaru-sempai, it's all become Maru and Maru-sempai


I mean, it's a pretty complicated kanji. Maybe it has trouble displaying or it's an encoding issue? Could this be some sort of shorthand on the captioner's part? Is it possible that the person who does the captions just made the same mistake through the whole show? Seems like someone would have caught it by this point. I dunno. It's weird, what do you guys think is going on?
Lunagel
Mofu~
Posts: 11089
Joined: Sat Jan 12, 2008 9:09 am
Favorite series: Magiranger
2nd Favorite Series: Gekiranger
Location: Japan

Re: The mysterious case of the missing Sabi

Post by Lunagel »

This is the caption as it aired on TV

Image

You can see that the 錆 looks oddly pixelated, comparatively. It's not a rare kanji, but it is one that's usually written in hiragana, and it's pretty uncommon for a name. I still think it's just an issue with our ripping software.

Re: The mysterious case of the missing Sabi

Post by takenoko »

Maybe, but when our broadcast software borked for episode 11 I had to grab the captions from kitsuneko, and they also had the same issue:

https://wiki.tvnihon.com/wiki/Kamen_Rid ... Transcript
FangTrigger
ZAIA Tool
Posts: 620
Joined: Sun May 30, 2010 12:33 pm
Favorite series: Decade
2nd Favorite Series: OOO
Alignment: Chaotic Neutral
My boom: Kamen Rider Sentai
Quote: Impossible is nothing.....Master everything!
Location: Within the darkest confines of a shadow.

Re: The mysterious case of the missing Sabi

Post by FangTrigger »

Two possible answers:

The simple answer could just be that TV-Asahi don't really care about it.

The writer in me thinks it's all part of some weird plot point where Sabimaru ends up becoming the key to a MacGuffin, and they are using his disappearing name in the subtitles to start planting the seeds, along with his sudden change to Kamen Rider Dread (he is the 'Chemy Expert' who discovered the way to locate UFO-X after all).

It does remind me of a plot point in this audiobook I was listening to. Spoilers ahead for Mistborn book 2, The Well of Ascension:
Spoiler
In the final chapters of the book, it's discovered that there's an entity that has the ability to pull a Goddess of Creation and rewrite the world, and it has been using this power to lure The Hero Of Ages to the well to release it.
Side note: if you're looking for a Sentai-type audiobook, look for a series called Tangent Knights. Graphic Audio really knocked the ball out of the park producing this one, and it hits most/all of the usual Sentai tropes while being its own story.

Re: The mysterious case of the missing Sabi

Post by takenoko »

I mean, from Luna's picture it looks like it's a weird coding issue, since it is in the broadcast. It looks like it's a different font from the rest of the text. Look at how pixelated the left part of Sabi is. For some reason the software people are using to rip the captions isn't able to pick up on that kanji.
Phoenix512
Rising to the Top
Posts: 6799
Joined: Tue Dec 11, 2007 4:29 pm

Re: The mysterious case of the missing Sabi

Post by Phoenix512 »

I would like to think that Sabi is referring to the tablet and Maru is the person.
Image

Re: The mysterious case of the missing Sabi

Post by Lunagel »

Sometimes you get the weird pixelated character on captions when it's an obscure/old character. I watch a lot of jp tv with captions (I can't hear without subtitles) and like 99% of the time there's a funky pixelated kanji it's someone's name. Whatever is causing the issue just doesn't like obscure kanji.

Re: The mysterious case of the missing Sabi

Post by takenoko »

>I would like to think that Sabi is referring to the tablet and Maru is the person.

Tablet's already got a name and it's Isaac. Isaac Asimov reference, I'm guessing.

>Whatever is causing the issue just doesn't like obscure kanji.

But Sabi isn't even an uncommon kanji, it's the kanji for rust and it's a common word. Hell, this isn't even the first Sabimaru that I've run across, there's a blade in Sekiro called Sabimaru. Spelled like this, but still, common enough:
錆び丸
ViRGE
Save the life
Posts: 2628
Joined: Fri Apr 24, 2009 5:35 pm

Re: The mysterious case of the missing Sabi

Post by ViRGE »

Thus far you've just looked at broadcast releases. Do the TTFC releases have captions? And if so, do they have the same missing kanji?

Re: The mysterious case of the missing Sabi

Post by ViRGE »

takenoko wrote: Fri Dec 08, 2023 4:31 am I mean from luna's picture it looks like it's a weird coding issue, since it is in the broadcast. It looks like it's a different font from the rest of the text. Look at how pixelated the left part of Sabi is. For some reason the software people are using to rip the captions isn't able to pick up on that kanji
So I went too far deep down the rabbit hole here, and went researching Japanese broadcast captions. (Gee, thanks for nerd sniping me on a Friday night, Takenoko! :lol: )

TL;DR: Japanese broadcast captions allow for both encoded text and bitmap characters. So based on Luna's screenshot, your assumption appears to be correct. The version of 錆 they are using is a bitmap version. Your software either does not know how to handle those bitmap characters, or does so poorly, and as a result is just dropping them. Unfortunately, I could not tell you why Toei/TV-Asahi is not using the standard character; that is a linguistic issue rather than a technical issue.

Long version: So Japan has their own standard for broadcast closed captioning, which includes their own character set for those captions: ARIB STD-B24. Because international standards (e.g. Unicode and its UTF-8 encoding) came well after regional standards, Japan just kept building off of the earliest Japanese character encoding standards such as JIS X 0208. And because Japanese has so many different characters, which were hard to represent in the days of very limited memory/bandwidth, shit gets all kinds of weird in technical terms when they want to display additional characters not found in the ARIB set.

To do that, they rely on bitmap characters, which are defined under the Dynamically Redefinable Character Set (DRCS) mechanism. DRCS characters are low-resolution bitmaps, suitable for use in the limited bandwidth available for caption data. In modern times these bitmaps stand out since TVs have local font sets from which they can render high resolution text. This is why the difference is so stark in Luna's screenshot: almost all of the text is made up of high resolution locally rendered characters; and then you have that sole, lonely, low-res bitmap of 錆.

Anyhow, to properly handle ARIB captions, software needs to both be able to render text and render bitmap characters.

This is obviously a problem when you're trying to reduce captions to a text-only transcript, since you're converting an insane mix of text + graphics to just text. In principle this would require running Optical Character Recognition (OCR) on the bitmap characters and trying to map them to the closest Unicode character, but we're getting well out of my domain of expertise at this point. (I have no idea how good kanji OCR is these days, among other things.)
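To make the OCR idea a bit more concrete, here is a minimal Python sketch of the crudest possible approach, nearest-template matching: compare the unknown bitmap against reference bitmaps of known characters and pick the one with the fewest differing pixels. The 3x3 glyphs and the function names are made up for illustration; real DRCS bitmaps are larger, and real kanji OCR is far more involved than this.

```python
def hamming(a, b):
    """Count differing pixels between two same-sized bitmaps (lists of '0'/'1' rows)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def closest_char(bitmap, templates):
    """Return the template character whose bitmap differs in the fewest pixels."""
    return min(templates, key=lambda ch: hamming(bitmap, templates[ch]))


# Toy reference glyphs (hypothetical, hand-drawn for this example).
TEMPLATES = {
    "十": ["010", "111", "010"],
    "口": ["111", "101", "111"],
    "一": ["000", "111", "000"],
}

# A slightly damaged cross: one pixel of the horizontal stroke is missing.
noisy = ["010", "110", "010"]
best = closest_char(noisy, TEMPLATES)
```

Here `best` comes out as 十, since it differs from the noisy bitmap by a single pixel. With thousands of candidate kanji and visually similar variants, a pixel count like this would be nowhere near reliable enough on its own.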

In any case, the most likely explanation is that the software you're using to rip captions can only consume the textual characters, and not the bitmap characters. And, apparently, it's quietly dropping the bitmap characters rather than doing something more useful with them or informing you about it.
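As a rough sketch of what that failure mode looks like, here is a toy Python model of a caption line as a mix of text runs and bitmap glyphs. This is not how Caption2Ass or any real ripper is written, and the class and function names are hypothetical. It just contrasts a converter that silently skips bitmaps with one that leaves a visible placeholder (〓, the geta mark conventionally used in Japanese typesetting for a missing glyph) and warns about it.

```python
import warnings
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextRun:
    text: str  # characters that were sent as encoded text


@dataclass
class BitmapGlyph:
    pixels: bytes  # raw bitmap data for one character (stand-in bytes here)


Caption = List[Union[TextRun, BitmapGlyph]]

PLACEHOLDER = "\u3013"  # 〓 geta mark


def to_text_lossy(caption: Caption) -> str:
    """Keep only the text runs; bitmap glyphs vanish without a trace."""
    return "".join(p.text for p in caption if isinstance(p, TextRun))


def to_text_flagged(caption: Caption) -> str:
    """Keep text runs, but mark each bitmap glyph and warn about it."""
    out = []
    for part in caption:
        if isinstance(part, TextRun):
            out.append(part.text)
        else:
            warnings.warn("bitmap glyph could not be converted to text")
            out.append(PLACEHOLDER)
    return "".join(out)


# "Sabimaru-sempai" with the first kanji sent as a bitmap.
line = [BitmapGlyph(b"\x00\xff"), TextRun("丸先輩")]
```

With this input, `to_text_lossy(line)` gives 丸先輩 ("Maru-sempai"), which matches what the transcripts show, while `to_text_flagged(line)` gives 〓丸先輩 and at least tells you something was lost.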

As for why Toei/TV-Asahi is using a bitmap character, that is beyond me. 錆 is in the ARIB character set (specifically, it's in the JIS X 0208 set, which ARIB builds upon). So the regular form of the character is there. Which implies that Toei is somehow using an irregular form of the character, which isn't covered by the ARIB character set. At this point we're getting into linguistics rather than computer science, and I have no idea what (if any) significance there is of using a variant on the character. All I know is that it makes pure text ripping very hard.
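If anyone wants to sanity-check the JIS X 0208 part, here is a small Python sketch. It leans on Python's built-in `shift_jis` codec, which as far as I know covers only JIS X 0201 and JIS X 0208, so a kanji that encodes to two bytes there should be in JIS X 0208. Note this only tests the legacy JIS repertoire, not ARIB STD-B24's own encoding (Python's standard library has no codec for that), and the function name is my own.

```python
def in_jis_x_0208(ch: str) -> bool:
    """Rough check: does this single character encode as 2 bytes in Shift_JIS?"""
    try:
        return len(ch.encode("shift_jis")) == 2
    except UnicodeEncodeError:
        # Not representable in the codec's repertoire at all.
        return False


sabi_in_jis = in_jis_x_0208("錆")
sabi_utf8_len = len("錆".encode("utf-8"))
```

On my understanding `sabi_in_jis` is true, and the same character takes three bytes in UTF-8, so the standard form of the kanji is available as ordinary encoded text either way.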

Which is why I'm very curious if TTFC has captions. I would expect those to be in a pure text format, such as UTF-8. Which may shed some light as to what's going on.
Last edited by ViRGE on Sat Dec 09, 2023 2:57 am, edited 1 time in total.

Re: The mysterious case of the missing Sabi

Post by takenoko »

Wow, thank you for that deep dive. I couldn't tell you why Sabi in particular has such trouble, but my guess is that Japanese in general is an overly complicated language. And whatever ad hoc system they set up to get the captions working works fine over broadcast, but fails to convert in whatever software we're using (in our case it's a lovely program called caption2assmod), which was made assuming some standard was followed that clearly wasn't for this one particular character.

Don't ask me how we found this program. I'm just amazed we found something that works to strip the captions data from a Japanese transport stream file

Re: The mysterious case of the missing Sabi

Post by Lunagel »

ViRGE wrote: Sat Dec 09, 2023 1:21 am
Which is why I'm very curious if TTFC has captions. I would expect those to be in a pure text format, such as UTF-8. Which may shed some light as to what's going on.
Image

Regular encoding.

Well that's absolutely fascinating and slightly baffling. When I type 錆 there's only one version of this character that pops up. There are two obscure forms of the kanji, 銹 and 鏽, but they're both obviously not used in either broadcast or TTFC version. If it was just one instance I'd say it was the background messing with the OCR making it fuck up, but it's every single instance so... yeah idk, the more I think about it the less sense it makes.

Re: The mysterious case of the missing Sabi

Post by takenoko »

Nah, caption2assmod is completely digital (which is probably why it can't pick up the bmp sabi). File's only 800 kb large, too small to have a database for image files. I do kind of wonder how the data in the transport stream is represented though if it's capable of both a digital version as well as one off goofy bmp like sabi tacked on. The mysteries of Japanese

Re: The mysterious case of the missing Sabi

Post by ViRGE »

Ahh, Caption2Ass. I haven't heard that name in a long, long time.

A quick check shows that there's another fork of it that appears to be newer than the mod version. Caption2Ass_PCR is circa 2020, and you may have more luck with it (though it may not be worth your time).

There's also Amatsukaze, which, although it does far more than just captions, appears to have explicit support for handling DRCS characters (in this case, you could presumably map the bitmap version of 錆 to the text version). In case you really need more accurate transcripts.
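The bitmap-to-text mapping idea can be sketched in a few lines of Python. This is not Amatsukaze's actual code or file format, just an illustration under the assumption that the broadcaster sends byte-identical bitmap data every time the character appears: hash the bitmap, then look the hash up in a table you maintain by hand. All names and the stand-in bytes are hypothetical.

```python
import hashlib

PLACEHOLDER = "\u3013"  # 〓, used when a bitmap has no known mapping


def glyph_key(bitmap: bytes) -> str:
    """Stable identifier for a bitmap: the hex digest of its raw bytes."""
    return hashlib.sha256(bitmap).hexdigest()


def resolve(bitmap: bytes, mapping: dict) -> str:
    """Return the mapped text character, or a placeholder if unmapped."""
    return mapping.get(glyph_key(bitmap), PLACEHOLDER)


# Stand-in bytes; a real entry would use the bitmap pulled from the stream.
sabi_bitmap = b"\x12\x34\x56"
mapping = {glyph_key(sabi_bitmap): "錆"}

known = resolve(sabi_bitmap, mapping)
unknown = resolve(b"some other glyph", mapping)
```

The upside over OCR is that a human identifies each bitmap once and every later occurrence is an exact lookup. The downside is that any change to the bitmap, even one pixel, produces a different hash and needs a new entry.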

Looks like someone also wrote a fairly comprehensive library for directly decoding and rendering ARIB captions during video file playback as well: https://github.com/xqq/libaribcaption/

Re: The mysterious case of the missing Sabi

Post by ViRGE »

Lunagel wrote: Sat Dec 09, 2023 3:14 am
ViRGE wrote: Sat Dec 09, 2023 1:21 am
Which is why I'm very curious if TTFC has captions. I would expect those to be in a pure text format, such as UTF-8. Which may shed some light as to what's going on.
Image

Regular encoding.
Okay, that's very neat. Thanks for checking that, Luna.

So if I'm understanding that correctly, then my assumption was correct and Toei is distributing captions with the TTFC version that aren't using ARIB encoding. (Though a proper stream rip would be needed to confirm the format)

So if a subbing group did need a cleaner, pure text transcript, it appears they could get one from TTFC.
Well that's absolutely fascinating and slightly baffling. When I type 錆 there's only one version of this character that pops up. There are two obscure forms of the kanji, 銹 and 鏽, but they're both obviously not used in either broadcast or TTFC version. If it was just one instance I'd say it was the background messing with the OCR making it fuck up, but it's every single instance so... yeah idk, the more I think about it the less sense it makes.
Oh to be a Japanese TV broadcast engineer. There's a decent chance there's a valid technical reason for converting 錆 to a bitmap. There's also a decent chance it's for a completely bonkers reason that no one wants to touch. :lol:
takenoko wrote: Sat Dec 09, 2023 3:27 am Nah, caption2assmod is completely digital (which is probably why it can't pick up the bmp sabi). File's only 800 kb large, too small to have a database for image files. I do kind of wonder how the data in the transport stream is represented though if it's capable of both a digital version as well as one off goofy bmp like sabi tacked on. The mysteries of Japanese
Would I blow your mind if I told you that this is how DVD subtitles work: they're all bitmaps?

https://www.loc.gov/preservation/digita ... lr=blogsig

The DVD subtitle format (SUB/IDX, otherwise known as VobSub) is just a bunch of bitmaps and time codes. This way DVD players don't need local fonts at all. And because the bitmaps are all simple low-res 2-bit(?) elements, they're very small. Checking my media archive, the SUB/IDX files for an episode of Star Trek TNG are 5MB. And that's for a show 2x the length of KR and with everything being bitmaps.

Image

Now replace most of those bitmaps with text, as ARIB captions allow. 800KB is very reasonable for a file that's a bunch of text (primarily 2-byte characters, at that) and the occasional bitmap.
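For a feel of the numbers, here is a back-of-the-envelope Python sketch comparing one bitmap glyph against one encoded character. The 36x36 glyph size and 2 bits per pixel are assumptions picked for illustration, not values I've checked against the ARIB spec, and the figure ignores any headers or compression.

```python
def bitmap_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Bytes needed for an uncompressed bitmap, rounded up to whole bytes."""
    total_bits = width * height * bits_per_pixel
    return (total_bits + 7) // 8


glyph_size = bitmap_bytes(36, 36, 2)   # one assumed low-res bitmap glyph
text_size = 2                          # one 2-byte encoded character
ratio = glyph_size // text_size
```

Under those assumptions a single bitmap glyph costs 324 bytes against 2 bytes for an encoded character, so 162 times as much. That is why sending text wherever possible and falling back to a bitmap only for the odd character keeps the caption data small.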