Use AI to automate spacing for Chinese text imports and generated texts
planned
BasDe
Apply AI language models to solve a long-standing spacing issue for Chinese texts within LingQ.
The currently implemented algorithm has many problems. They show up when importing Chinese, but also when generating texts with the Simplify Lesson functionality.
Using AI worked in Bing Chat (for instance) when I typed the following:
"Suppose we create a new language called 'space chinese'. And it would be totally the same as mandarin chinese, but with one difference. The writing would include spaces between chinese characters in places that correspond to where spaces are between the english words. How would you write the following text in space chinese: "取得成功的人往往都经历过许多失败"?"
North Sprung
BasDe: We looked into how well we are currently splitting new Chinese lessons. It seems we correctly split above 95% of words.
There are some exceptions. For example, we tend to combine quantities with the following word, as in 五年, and we occasionally append de5 (的) to the adjective, as in 大大的.
Do you see lessons where the splitting is worse than this? As it stands, 95%+ accuracy seems quite decent.
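For context, this is the kind of dictionary-based splitting being discussed; a minimal sketch with the open-source jieba segmenter (not necessarily the library LingQ uses) shows how such a tool spaces a sentence:

import jieba  # open-source Chinese word segmenter, used here only for illustration

sentence = "取得成功的人往往都经历过许多失败"
# jieba.cut yields the segmented words; join them with spaces for display.
print(" ".join(jieba.cut(sentence)))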
BasDe
North Sprung: 95% does sound decent. And I have no way to quantify the hit/fail rates that I see.
The thing is, though: imagine that in English 5% of the spacing between syllables were wrong. That would create a pretty unpleasant reading experience. I suspect the 5% that doesn't go right is not random but concentrated in particular areas, causing a multiplication of vocabulary: you have happy, sad, big, small and very, but you also get veryhappy, verysad, verybig and verysmall.
All in all, 5% is not insurmountable, and even in its current state LingQ is already a great tool for language learning. But for me as a language learner, the quality of the material I am presented with does factor into my motivation to learn from it. Maybe it's a bit of a purist approach, but the better the quality of the material, the easier and better I expect to be able to learn from it.
North Sprung
BasDe: It depends on whether we can reliably get that last 5% split correctly using a different method. Splitting Chinese is a bit trickier than splitting English.
In your experience, is the AI splitting you used 100% accurate?
BasDe
North Sprung: I haven't done extensive testing on this. I just did a small proof-of-concept (PoC) test, nothing more.
Mark Kaufmann
BasDe: Yes, we should do this. We have added a similar improvement for Japanese; Chinese is next.
BasDe
Mark Kaufmann: Cool!