Use AI to automate spacing for Chinese text imports and generated texts
planned
BasDe
Apply AI language models to solve a long-standing spacing issue for Chinese texts within LingQ.
The currently implemented algorithm has many problems. They show up when importing Chinese, but also when generating texts with the Simplify Lesson functionality.
Using AI worked in Bing Chat (for instance) when I typed the following:
"Suppose we create a new language called 'space chinese'. And it would be totally the same as mandarin chinese, but with one difference. The writing would include spaces between chinese characters in places that correspond to where spaces are between the english words. How would you write the following text in space chinese: "取得成功的人往往都经历过许多失败"?"
North Sprung
BasDe: We looked into how well we are currently splitting new Chinese lessons. It seems we correctly split above 95% of words.
There are some exceptions. For example, we tend to combine quantities with the following word, as in 五年, and we occasionally append de5 (的) to the adjective, as in 大大的.
Do you see lessons where the splitting is worse than this? As it stands, 95%+ accuracy seems quite decent.
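For context, this is the kind of dictionary-based splitting being discussed; a minimal sketch with the open-source jieba segmenter (not necessarily the library LingQ uses) shows how such a tool spaces a sentence:

import jieba  # open-source Chinese word segmenter, used here only for illustration

sentence = "取得成功的人往往都经历过许多失败"
# jieba.cut yields the segmented words; join them with spaces for display.
print(" ".join(jieba.cut(sentence)))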
BasDe
North Sprung: 95% does sound decent. And I have no way to quantify the hit/fail rates that I see.
The thing is, though: imagine that in English 5% of the spacing between syllables were wrong. That would create a pretty unpleasant reading experience. I suspect the 5% that doesn't go right is not random but concentrated in particular areas, causing a multiplication of vocabulary: you have happy, sad, big, small and very, but you also get veryhappy, verysad, verybig and verysmall.
All in all, 5% is not insurmountable, and even in its current state LingQ is already a great tool for language learning. But for me as a language learner, the quality of the material I am presented with does factor into my motivation to learn from it. Maybe it's a bit of a purist approach, but the better the quality of the material, the easier and better I expect to be able to learn from it.
North Sprung
BasDe: It depends on whether we can reliably get that last 5% split correctly using a different method. Splitting Chinese is a bit trickier than splitting English.
In your experience, is the AI splitting you used 100% accurate?
BasDe
North Sprung: I haven't done extensive testing on this. I just did a small proof-of-concept (PoC) test, nothing more.
Mark Kaufmann
BasDe: Yes, we should do this. We have added a similar improvement for Japanese; Chinese is next.
BasDe
Mark Kaufmann: Cool!