Extracting Chinese Hard Subs from a Video, Part 1

geza · on May 30, 2017

I actually solved this problem back in 2013 with a slightly more advanced technique (taking into account other signals such as motion). See http://up.csail.mit.edu/other-pubs/chi2014-smartsubs.pdf and http://up.csail.mit.edu/other-pubs/gkovacs-meng-thesis.pdf for the algorithm and https://github.com/gkovacs/extract-subtitle for the implementation (in Python).

mintplant · on May 29, 2017

In my experience Tesseract improves massively if you can identify the font the text is written in and prepare a custom trained dataset for it to use. See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTess...

contingencies · on May 30, 2017

All of the failures were directly related to improperly isolated input. In addition, a huge percentage of Chinese text is written in very few fonts.

mmanfrin · on May 29, 2017

Could isolate the text by compositing neighbor images and getting the pixels that don't differ. The background is moving, the text is not.

This would also allow you to extract out the subs without having to OCR them/get characters. Could just erase all static artifacts (including subs but also things like watermarks).
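
A minimal sketch of the "pixels that don't differ" idea, assuming grayscale frames stored as plain lists of rows. The function name `static_mask` and the `tolerance` value are made up for illustration, and real video would need a tolerance tuned for compression noise.

```python
def static_mask(frames, tolerance=10):
    """Return a 2D mask that is True where a pixel stays within
    `tolerance` of its first-frame value across all frames.

    `frames` is a list of equally sized grayscale frames, each a list
    of rows of integers in 0..255.
    """
    first = frames[0]
    height, width = len(first), len(first[0])
    mask = [[True] * width for _ in range(height)]
    for frame in frames[1:]:
        for y in range(height):
            for x in range(width):
                if abs(frame[y][x] - first[y][x]) > tolerance:
                    mask[y][x] = False
    return mask


# Tiny synthetic example: the middle column is static "text",
# the other columns are moving background.
frames = [
    [[10, 255, 20], [30, 255, 40]],
    [[90, 250, 70], [60, 255, 10]],
    [[50, 252, 200], [0, 251, 120]],
]
mask = static_mask(frames)
```

Anything that survives in `mask` is static across the window, which includes subtitles but also watermarks and any part of the scene that does not move.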

KerrickStaley · on May 29, 2017

An approach I really want to try is taking a stream of the video without subs (can easily be found online) and subtracting the two. You'd have to deal with differences in resolution and compression between the two, and also handle cases where the background is either white or black, but in theory it should work very well. I haven't had time to dig into this.
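
A rough sketch of the subtraction idea, assuming the two streams have already been aligned in time and scaled to the same resolution (the hard part, and not shown here). `subtract_frames` and the `threshold` of 40 are illustrative choices, not something from the article.

```python
def subtract_frames(subbed, clean, threshold=40):
    """Mark pixels where the subtitled frame differs from the clean one.

    Both arguments are grayscale frames of the same size, given as
    lists of rows of integers in 0..255.
    """
    return [
        [abs(s - c) > threshold for s, c in zip(sub_row, clean_row)]
        for sub_row, clean_row in zip(subbed, clean)
    ]


clean = [[100, 102, 98, 101], [99, 100, 103, 100]]
subbed = [[101, 255, 255, 100], [97, 255, 104, 102]]
diff = subtract_frames(subbed, clean)
```

The threshold absorbs small compression differences. Where the background is already close to the subtitle color the difference stays under the threshold, which is the white-background case mentioned above.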

0xfeba · on May 29, 2017

Seems like you could get gstreamer and some subtractive elements working pretty quick...

gbaygon · on May 30, 2017

Legitimate question: why would you want the first video (hardcoded subs) if you have a second stream without them, better resolution maybe?

kpozin · on May 30, 2017

In order to have access to vocabulary words. From the article: > I wanted to get a transcript of the episode’s dialog so I could study the unfamiliar vocabulary. Unfortunately, the video files I have only have hard subtitles

gbaygon · on May 30, 2017

Makes sense, thanks. I went straight to the technical details and missed that part.

stuaxo · on May 29, 2017

This has to be worth a try.

wolfgang42 · on May 29, 2017

This only works if the camera isn't fixed, though. In the frame from the post it might erase the dashboard, the car roof, and so on.

killin_dan · on May 29, 2017

Just define a small area of the screen to run on then. Subtitles are typically within a very small portion of the screen

thebooktocome · on May 30, 2017

Nope, every channel of CCTV seems to have its own subtitle convention.

killin_dan · on May 30, 2017

Which is irrelevant to stripping subs from one movie at a time.

wolfgang42 · on May 29, 2017

One possible extension of this is to erase the original subtitles using something like gimp-resynthesizer's Heal Selection, and then replace them with translated ones, all automatically. (I've redrawn the video frame from the post[1] so you can see what I mean.)

[1]: https://static.linestarve.com/ext/ycombinator-news/itm144408...

phire · on May 29, 2017

Using a 2D image resynthesis algorithm is sub-optimal for video. There is basically no chance of it picking the same result every single frame, so you will see a flickering shadow of weirdness where the old characters were.

You have lots of information about what image should be there: you have motion vectors signaling when an object or background has moved into (or out of) a blocked area. Or maybe the motion vectors indicate that the object/background is stationary and you can use information from before/after the subtitles to fill it in.

I'm not aware of such a 3d video resynthesis algorithm existing, but it should be possible.
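
For the stationary-background case only, a minimal sketch of filling from before/after frames might look like this. It assumes per-frame subtitle masks are already available and does no motion compensation at all; `fill_from_neighbors` is a hypothetical helper, not an existing library function.

```python
def fill_from_neighbors(frames, masks, index):
    """Fill masked pixels of frames[index] by copying the same pixel
    from the nearest frame in time where that pixel is not masked.

    `frames` are grayscale frames (lists of rows of ints); `masks`
    are same-shaped lists of booleans, True where a subtitle covers
    the pixel. Pixels with no unmasked neighbor are left unchanged.
    """
    target = [row[:] for row in frames[index]]
    for y, row in enumerate(target):
        for x in range(len(row)):
            if not masks[index][y][x]:
                continue
            # Search outward in time: index-1, index+1, index-2, ...
            for offset in range(1, len(frames)):
                found = False
                for j in (index - offset, index + offset):
                    if 0 <= j < len(frames) and not masks[j][y][x]:
                        row[x] = frames[j][y][x]
                        found = True
                        break
                if found:
                    break
    return target


frames = [[[5, 6]], [[255, 6]], [[7, 6]]]
masks = [[[False, False]], [[True, False]], [[False, False]]]
filled = fill_from_neighbors(frames, masks, 1)
```

Because every frame in a subtitle run borrows from the same unmasked source pixel, the result does not flicker, but it is only valid while the background really is stationary.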

sorenjan · on May 29, 2017

It's called video inpainting. I'm not familiar with it, but here's one paper with video examples I found: http://perso.telecom-paristech.fr/~gousseau/video_inpainting...

ec109685 · on May 30, 2017

Google Translate is able to do this if you point it at live video.

RandomBookmarks · on May 29, 2017

Lately I am using the Copyfish Chrome extension for help with Chinese subtitles/images. The very nice thing about it is that it plays nice with the Zhongwen dictionary, which is another essential Chrome extension for Chinese learners.

https://chrome.google.com/webstore/detail/copyfish-%F0%9F%90...

Before that I was using the "Chinese Subtitle Translator" software: https://ocr.space/blog/p/chinese-subtitles-translator.html (Source code at https://github.com/A9T9/Chinese-Subtitles-Translator )

It uses Microsoft OCR and gets very good results.

arnioxux · on May 29, 2017

http://projectnaptha.com/ is also a very similar project. I think it uses Ocrad.

amelius · on May 30, 2017

I was hoping for a tool that erases Chinese hard subs from the video.

philfrasty · on May 29, 2017

Can anyone who learned Chinese comment on the timeframe it takes to even get a basic understanding of what the subtitles say?

I used to learn English this way (watching US TV shows in English with English subs and very limited vocabulary). Eventually disabled the subs. Now watching everything in English. Love it.

edit: any tips on getting started with Chinese very welcome as well (apart from the standard stuff I find myself through Google or language courses)

ximeng · on May 30, 2017

After around 3000 hours of mostly self study I can understand 95+% of subtitles for TV shows and for a movie can follow enough dialog just through subtitles to understand what's going on at full speed. Study success scales pretty linearly with time spent as far as I can tell, assuming you have a fairly sensible study method. For 1000 hours study I'd imagine you'd get the basics of a lot of subtitles but probably not quickly enough to follow along at full speed.

Teachers can be helpful to point out some mistakes you'll miss through self study but unless you have a fair amount of cash to throw at the problem self study will likely be a better approach. For full fluency you probably want to target around 20k vocab, so it's a bit of a numbers game in terms of finding a quick way to improve your vocabulary. I use Skritter but I guess any SRS software should help a lot here.

Not entirely convinced by immersion as being necessary, I learned plenty of Chinese without it. Probably good for day to day vocab and motivation but unless your level is good you will still struggle to get into conversations where you use it, depending on how outgoing you are.

KerrickStaley · on May 29, 2017

To be honest, learning Chinese has been a long and somewhat painful process. I'm about 6.5 years in at this point (2.5 years of school + 4 years of self study) and am starting to be able to pick up a newspaper or watch a show and get the gist of what it's about. I don't have a natural knack for languages and I was never devoting 100% of my time to it (I did software in school and now for work), so your experience may differ. Also I never had a chance for immersion—people I know that have spent 6 months in Taiwan or China are often above my level even though they've been studying for less total time, so immersion is really helpful.

I've found that SRS apps like Anki or Memrise and podcasts like Popup Chinese or ChinesePod have been really helpful. I'm also developing an approach where I extract vocabulary words from videos and newspaper articles and pre-study them before watching/reading—I'm hoping to blog about this at some point.

imron · on May 30, 2017

> Can anyone who learned Chinese comment on the timeframe it takes to even get a basic understanding of what the subtitles say?

3-5 years depending on how much you study.

> any tips on getting started with Chinese very welcome as well

The best tip you'll ever get is to study every day.

For more concrete examples of what/how to study, here is an excellent post that contains a lot of advice for independent Chinese learners:

https://www.chinese-forums.com/forums/topic/43939-independen...

andai · on May 30, 2017

I can highly recommend All Japanese All The Time http://ajatt.com/

The author Khatzumoto taught himself Japanese in 18 months "by having fun", and then got a job in Japan (at Sony iirc). He then went on to use the same method to learn Chinese.

If you can get through the website's navigation, and scroll through the occasional incoherent rambling, there is solid gold advice throughout :) Good luck!

ps. While the site focuses on Japanese, the advice translates to learning any language. Plus there is advice and resources specifically for Chinese :)

RandomBookmarks · on May 29, 2017

>it takes to even get a basic understanding of what the subtitles say?

The main challenge with Chinese is that - unlike any Western language - you can not guess the characters you do not know. You either know it or you don't. Only the context might offer some clues.

Getting started with Chinese: Depends on your talent. For me (untalented) taking a "real" class was essential for a good start. Self-learning for languages does not work for me.

unityByFreedom · on May 29, 2017

> you can not guess the characters you do not know

That's not completely true.

Chinese characters are composed of parts, some of which are referred to as radicals. If you know the part/block, you can sometimes guess its meaning or sound. Also, new Chinese words are made of existing characters, so you can often guess multi-character words by knowing the individual characters. For example, the word for computer is "electric brain". You might be able to guess that if you knew that 電 means electric and 腦 means brain.

I recommend the site called http://hackingchinese.com

He has a lot of good tips for how to learn Chinese as a foreign adult

mikekchar · on May 30, 2017

Yeah, I think this is one of those things where if you get yourself into an "Asian languages are hard" mindset, you unnecessarily complicate things. I don't speak Chinese, but I speak Japanese, which I learned primarily from free reading (mostly manga, which is great because it is almost completely conversational Japanese).

While it is true that you can't guess the meaning of characters that you don't know without any context, you definitely can learn the meaning of characters that you don't know with context. In the same way, while I might understand the Greek and Roman roots for words in English, if I see the word by itself without any context, I'm not going to be able to understand it. With context, I can usually puzzle it out.

However, it is a huge mistake to ignore the benefits of learning Chinese characters. It is the reverse that is most beneficial. In Chinese the 3600 most common characters cover some insane percentage of the common words in use (only 2200 for Japanese!). These characters form a powerful mnemonic for learning vocabulary. I can easily learn to read the characters for a word and the word faster than I can learn the word by itself. As you learn more characters, it only accelerates your progress.

2-3 thousand characters seems like a daunting task, but as I always point out in these threads, adult level proficiency in a language requires somewhere around 15,000 word families (1 family = a word, all its inflections/conjugations/etc, and all related compound words). People telling you that you can be "fluent" with 2000 words are selling you something (2000 word families is the level of a toddler -- and then people are disappointed that they can only speak like a toddler...).

The absolute best thing you can do for learning a language that has Chinese characters in its writing is to learn those characters. I recommend learning them at the same time as the vocabulary. And to get back to the original comment: even now if I hear an unfamiliar word, I will usually trace what I think the characters are on my hand. Often people will correct me and once I get the correct characters the meaning is almost always obvious.

lacampbell · on May 29, 2017

> For example, the word for computer is "electric brain". You might be able to guess that if you knew that 電 means electric and 腦 means brain

Or you could be like me, idly sitting in a Malaysian restaurant, reading "馬來" and trying to figure out what "Horses Arriving" has to do with curry.

unityByFreedom · on May 30, 2017

Haha, well, I feel that one's easy too if you know that sometimes characters are used as phonetics for words in other languages. Just like English! Although you might not expect that at first from a non-phonetically written language.

Every language has exceptions to "rules". In fact, there are few hard and fast rules for any language, since every language is a mix of another.

But, once you get comfortable with a language, you can appreciate the differences and make interesting guesses about the exceptions' origins. It can be fun.

I like http://www.hanzicraft.com/ because it breaks down the characters into parts you can click on to get their definitions or origins. Hacking Chinese also has a cool resources section here [1], categorized, so you can browse through dictionaries, listening tools, practice tests, browser add-ons, etc.

[1] http://challenges.hackingchinese.com/resources/Beginner

lacampbell · on May 30, 2017

Hanzicraft looks cool, though I feel like I should learn to speak Mandarin a lot better before worrying about reading it. In fact I'd be fairly happy to be a fluent speaker and illiterate.

peterburkimsher · on May 30, 2017

If you want to type characters from parts, decomposing and recomposing from radicals, I made

http://pingtype.github.io

darklajid · on May 30, 2017

Okay, people replying to you so far seem to understand what's going on. For those that don't get it: Why are these characters used in your particular case?

rangibaby · on May 30, 2017

I only know about Japanese but in the past characters could be chosen based on their pronunciation and not just their meaning e.g. 仏蘭西 (Buddha, Orchid, West, fu ran su, France). This is called ateji:

https://en.wikipedia.org/wiki/Ateji

The ateji for Malaysia is 馬来, Horse, come, ma rai.

Nowadays loanwords are usually written alphabetically instead of ateji. フランス and マレーシア. The old way is still written for abbreviations; 仏 is the equivalent of writing "Fr" in English.

An interesting case is America (亜米利加) where the abbreviation is 米 (pronounced "bei") not 亜 (pronounced "a") because A is for Asia! (亜細亜).

As anyone reading HN should know, naming things is hard...

unityByFreedom · on May 30, 2017

Because 馬 sounds like ma and 來 sounds (a little) like lay (actually, lie).

So, Malay (ie, of Malaysia).

lacampbell · on May 30, 2017

Sorry, should have added an explanation.

The first character of 馬來 is "horse" and the second is "come/arrive". So I was sitting there puzzling it out until I realised that if I pronounced it in Mandarin, it sounded a lot like "Malay" - which is in fact what it means.

princeb · on May 30, 2017

or you could just memorize diannao as computer because that is easier.

trying to connect electric brain to computer is similar to trying to connect "breakfast" to the morning meal. it's interesting to know the compound words, but breakfast is always the meal in the morning and not always the meal after a fast.

unityByFreedom · on May 30, 2017

It's just an example of how you can intuit words you don't recognize. Memorizing all the words you don't know is.. challenging, to say the least

justinhj · on May 29, 2017

I use an app called Memrise and a website called ChinesePod. There's a really good dictionary app called Pleco.

Going beyond the standard stuff I'd recommend a professional teacher since you really need to learn Chinese from multiple angles: learning to recognize and write the characters, reading pinyin, understanding spoken Mandarin and being able to speak.

pixelperfect · on May 29, 2017

Thanks for this! I'm looking forward to part 2 also.

A bit off-topic, but does anyone have Chinese TV show suggestions? I watched a few episodes of this show (他来了) already, but I didn't like it very much.

gpetukhov · on May 30, 2017

人民的名义: A new and very popular show about corruption.

爱情公寓: I think it's like Friends in Chinese. But I haven't watched Friends.

欢乐颂: Follows the lives of several young women sharing an apartment in Shanghai.

For the list of the most popular shows check here: https://movie.douban.com/tag/国产电视剧

imron · on May 30, 2017

It's a little dated now, but check out this list:

https://www.chinese-forums.com/forums/topic/24097-tv-series-...

purplethinking · on May 30, 2017

I'm watching 外科风云 on YouTube. It's a hospital drama with some corruption thrown in. It has English subtitles as well.

Bakary · on May 30, 2017

There's In the Name of the People, which is similar to House of Cards

imron · on May 30, 2017

Hey OP, for a non-technical solution to your problem get in touch with these guys over WeChat:

http://www.bijianshang.com/page/contact/contact.php

They sell soft-copy transcriptions for any show you want.

They charge per episode so it can get pricey if you do an entire series but getting a couple of episodes is quite affordable.

KerrickStaley · on May 30, 2017

I was actually thinking of doing something like this using Amazon Mechanical Turk, maybe not to subtitle the entire show but just to get a much bigger test set than I have the patience to label myself. I'll check them out, thanks!!

imron · on May 30, 2017

No worries. Prices I've seen so far range from RMB 10 to RMB 100 per episode - probably depending on how popular the show is and whether they already have a transcript available because someone else wanted it or whether they'll need to transcribe it for you.

callesgg · on May 29, 2017

He should have used the fact that there is a black border around the text.
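
One way the border could be exploited, as a sketch only and not the article author's method: keep bright pixels that have a dark pixel close by. The name `outlined_text_mask` and the `bright`, `dark` and `radius` values are assumptions for illustration.

```python
def outlined_text_mask(gray, bright=200, dark=60, radius=2):
    """Mark bright pixels that have at least one dark pixel within
    `radius` (square neighborhood), as expected for white text drawn
    with a black outline. `gray` is a list of rows of ints in 0..255.
    """
    height, width = len(gray), len(gray[0])

    def near_dark(y, x):
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width and gray[ny][nx] <= dark:
                    return True
        return False

    return [
        [gray[y][x] >= bright and near_dark(y, x) for x in range(width)]
        for y in range(height)
    ]


# The bright pixel at index 1 touches dark pixels; the one at index 4
# sits on a mid-gray background and is rejected.
row = [[0, 255, 0, 128, 255, 128]]
text = outlined_text_mask(row, radius=1)
```

Pixels deep inside a thick stroke may be farther than `radius` from the outline, so the result is better treated as a seed for a flood fill than as a finished mask.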

KerrickStaley · on May 29, 2017

I did! That's going to be in Part 2.

contingencies · on May 30, 2017

16 year Chinese learner here (not that it's relevant). I would try to hack a solution via the following approach.

1. Determine the area (if any) near the bottom with black-or-whiteness zones of a constant height (these are likely to be subtitles) by randomly selecting 10 frames from the middle of the movie. Extra points if you have it detect the sub color.

2. For each frame with unique subs, isolate the zones vertically (handles multi-line subs).

3. Determine black-or-whiteness of each vertical column in the text area. Moving inward from the left or right edge, crop everything until the black-or-whiteness within the constant height drops below a certain threshold. In the example shown, this would deal with ′…′二′′′'′ and would look like 0,0,0.01,0,0,0.01,3

4. Crop viciously within the assumed vertical height. This should remove issues like 逯 which should be 这.

5. There is probably a clustering-based approach you can use to remove background noise, either spatially or temporally, eg. temporally via imagemagick[0]: compare frame1.jpg frame2.jpg -compose src -fuzz 10% -highlight-color white -lowlight-color black output.jpg ... alternatively, if there is a surrounding color such as in the example, you could remove any pixel-groups that don't have it.

6. In terms of detecting frames in which you have a new set of subs, just compare the last black-white-extraction of the central maybe-has-subs area (ie. most commonly used portion thereof) with a delta of the last one, remembering no subs is also an option. In many cases this may align with keyframes.

If you like to solve image processing problems we're hiring - http://8-food.com/ - email in profile.

[0] http://www.imagemagick.org/Usage/compare/#difference
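
A minimal sketch of step 3 above, reading it as "trim columns from both edges while their text-color score stays under a threshold". It assumes the text band has already been binarized (1 = subtitle-colored pixel); `column_scores`, `trim_columns` and the threshold are hypothetical names and values.

```python
def column_scores(binary_rows):
    """Fraction of subtitle-colored pixels in each column of the band."""
    height = len(binary_rows)
    width = len(binary_rows[0])
    return [sum(row[x] for row in binary_rows) / height for x in range(width)]


def trim_columns(scores, threshold=0.1):
    """Return (left, right) so that scores[left:right] drops the
    low-scoring columns at both edges. The range is empty if no
    column reaches the threshold.
    """
    left = 0
    while left < len(scores) and scores[left] < threshold:
        left += 1
    right = len(scores)
    while right > left and scores[right - 1] < threshold:
        right -= 1
    return left, right


band = [
    [0, 0, 1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 0, 0],
]
scores = column_scores(band)
left, right = trim_columns(scores, threshold=0.3)
```

Only the edges are trimmed; gaps between characters inside the kept range are left alone so multi-character lines stay intact.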

milankragujevic · on May 29, 2017

Actually very useful even for other things, thanks for sharing! For example ripping DVD subtitles to SRT, or (I'm using my imagination) maybe in the future with content-aware fill removing hard coded subtitles and replacing them with filler space?

Drdrdrq · on May 29, 2017

That should actually be possible with today's technology. Take an image and draw subtitles on it. This is the input to train a NN, while the original image is the training output. Even better, use the video stream directly... Not easy, but not impossible either.
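
The data-generation half of that idea can be sketched without any ML library. Here a small binary glyph mask stands in for real text rendering (an assumption, since drawing actual characters needs a font renderer), and `make_training_pair` is a made-up helper name.

```python
def make_training_pair(clean, glyph_mask, top, left, text_value=255):
    """Return (input_frame, target_frame) for subtitle-removal training.

    The input is a copy of `clean` with the glyph painted on at
    (top, left); the target is the untouched clean frame. Frames are
    lists of rows of ints, `glyph_mask` is a list of rows of 0/1.
    """
    burned = [row[:] for row in clean]
    for dy, mask_row in enumerate(glyph_mask):
        for dx, on in enumerate(mask_row):
            if on:
                burned[top + dy][left + dx] = text_value
    return burned, clean


clean = [[10] * 4 for _ in range(3)]
glyph = [[1, 0], [1, 1]]
burned, target = make_training_pair(clean, glyph, top=1, left=1)
```

Each pair gives a network an input with fake subtitles and the ground truth it should reconstruct; the model and training loop are out of scope for this sketch.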

netsharc · on May 30, 2017

DVD subtitles are already a separate layer to the movie stream, but it is a bitmap. Because it's a separate layer, OCR-ing should be easy.

And if you ask why it's a bitmap, that's because bitmaps support more than just plain text: color and typefaces to name 2 things. Imagine if DVD players had to implement text decoding ("Is this subtitle stream in UTF-8 or maybe some Cyrillic Code Page?") and rendering (color, placement, font files, etc...)

Seanny123 · on May 30, 2017

I totally thought the state-of-the-art for OCR would be ConvNets, but apparently it isn't? Or are there just not any easily available/usable libraries that do OCR with ConvNets? Or is the benefit of ConvNets marginal enough to not be useful?

Seanny123 · on May 30, 2017

Found a paper! Yes, it is better. Maybe not for this specific application and given there's no pre-trained network available, I can totally understand the choice made. https://www.semanticscholar.org/paper/The-recognition-of-Chi...

anewhnaccount2 · on May 30, 2017

Here's a free complete OCR solution using LSTM https://github.com/tmbdev/ocropy

manav · on May 29, 2017

I've always thought a great feature of shows/movies with subtitles would be the ability to display multiple at the same time.

It would really be useful for learning assuming the subtitles are translated well. Right now I've been able to do something with my own content by merging two subtitle files.

Maybe you could also add a transliteration of the characters if it's a language like Chinese.
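
One possible way to merge two subtitle files, sketched under the assumption that both are plain SRT with `HH:MM:SS,mmm` timestamps: keep the timing of the first file and append the text of any cue from the second file that overlaps in time. `parse_srt` and `merge_srt` are hypothetical helpers, not part of any library.

```python
import re

TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


def to_ms(stamp):
    """Convert an SRT timestamp such as 00:01:02,500 to milliseconds."""
    h, m, s, ms = (int(g) for g in TIME.match(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_srt(text):
    """Parse SRT text into (start_ms, end_ms, timing_line, body) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip malformed blocks
        start, end = lines[1].split("-->")
        cues.append((to_ms(start), to_ms(end), lines[1].strip(),
                     "\n".join(lines[2:])))
    return cues


def merge_srt(primary_text, secondary_text):
    """Append overlapping secondary cue text under each primary cue."""
    secondary = parse_srt(secondary_text)
    out = []
    for number, (start, end, timing, body) in enumerate(
            parse_srt(primary_text), start=1):
        extra = [b for s, e, _, b in secondary if s < end and e > start]
        out.append("\n".join([str(number), timing, body] + extra))
    return "\n\n".join(out) + "\n"


zh = "1\n00:00:01,000 --> 00:00:03,000\n你好\n\n2\n00:00:04,000 --> 00:00:06,000\n再见\n"
en = "1\n00:00:01,200 --> 00:00:02,900\nHello\n\n2\n00:00:04,100 --> 00:00:05,500\nGoodbye\n"
merged = merge_srt(zh, en)
```

Producing a single track avoids relying on a player to show two overlapping tracks at once. A transliteration line could be appended the same way if it is available as a third file.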

kcchouette · on May 30, 2017

I know an ocr tool to do that on any mp4 files with vapoursynth: https://bitbucket.org/YuriZero/yolocr/src/989cf68d66cddfcf7b...

cjy · on May 29, 2017

I'm also learning Mandarin and was wondering if this was possible (for a different show) just the other week! Thanks for the article, will be looking forward to Part 2 and 3. Also, is there an easy way to extract all the frames with unique subtitles?

KerrickStaley · on May 29, 2017

Once you have the text corresponding to each frame, you can de-dupe it with its neighbors based on Levenshtein distance (can't use exact-match because of recognition errors). I found that for this show subtitles generally hang on-screen for 1-3 seconds, so you wouldn't have to do many comparisons.
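
A small sketch of that de-duplication, assuming one OCR string per sampled frame in time order. The `max_ratio` cutoff of 0.3 is an illustrative guess, not a value from the article.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def dedupe(lines, max_ratio=0.3):
    """Collapse runs of consecutive near-identical OCR results,
    keeping the first string of each run."""
    kept = []
    for text in lines:
        if kept:
            longest = max(len(kept[-1]), len(text), 1)
            if levenshtein(kept[-1], text) / longest <= max_ratio:
                continue
        kept.append(text)
    return kept


ocr_lines = ["这是我的车", "这是我的车", "逯是我的车", "你要去哪里", "你要去哪里"]
unique = dedupe(ocr_lines)
```

Only neighbors are compared, so the cost stays linear in the number of frames. Keeping the first string of a run is arbitrary; voting across the run would likely give a cleaner transcript.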

peterburkimsher · on May 30, 2017

cjy - please can you help me to find more double-subtitles? (Chinese and English, synced)

I have a program to add spaces between Chinese words, colours for the tones, pinyin, and a literal translation.

http://pingtype.github.io

I already made a feature to list all the unique words in a movie, sort them by their frequency, and make a study sheet. I also made a bash script generator to use ffmpeg to cut the movie to the subtitle time.

All I need to do now is recombine the subtitles based on the words, to make videos with lots of example sentences.

It's much easier to study with a real English translation though, instead of a literal word-for-word transcription. If you could help me get more input data (names of movies or songs, srt files), that would be wonderful!
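
For readers who want to try something similar, here is a rough sketch of a frequency list and an ffmpeg command generator. It is not the Pingtype code; it assumes the subtitle text is already segmented with spaces, and the function names and `clip_NNNN.mp4` output naming are invented for the example.

```python
from collections import Counter
import shlex


def word_frequencies(segmented_lines):
    """Count words in lines that are already segmented with spaces,
    most frequent first."""
    counts = Counter()
    for line in segmented_lines:
        counts.update(line.split())
    return counts.most_common()


def ffmpeg_cut_commands(video, cues):
    """Build one ffmpeg command per (start, end) pair, where both
    timestamps are strings in HH:MM:SS.mmm form."""
    commands = []
    for number, (start, end) in enumerate(cues, start=1):
        output = f"clip_{number:04d}.mp4"
        commands.append(
            f"ffmpeg -i {shlex.quote(video)} -ss {start} -to {end} "
            f"{shlex.quote(output)}"
        )
    return commands


lines = ["我 喜欢 中文", "我 学 中文", "我 来 了"]
top_words = word_frequencies(lines)[:2]
commands = ffmpeg_cut_commands("my movie.mp4",
                               [("00:00:01.000", "00:00:03.000")])
```

The commands re-encode each clip, which is slow but cuts at the exact subtitle times; writing them to a shell script gives the same effect as a generated bash file.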

microcolonel · on May 30, 2017

It'd be very cool to use this to take a video with hard subs, inpaint the hard subs (even naïvely, or maybe with a motion compensator), and replace them with SRT/ASS subs.

stcredzero · on May 30, 2017

Is the term "hard sub" that well known outside of the fansubbing community and other kinds of video aficionados?