
I can imagine the challenges that you describe. It is only through efforts like yours that people will feel encouraged to produce better training datasets. I came across this dataset that has words with diacritics (though I'm not sure it's right to call them that, since they are not accent marks), which seems to be different from the dataset you are using: https://cvit-iiit-ac-in.translate.goog/research/projects/cvit-projects/indic-hw-data?_x_tr_sl=en&_x_tr_tl=hi&_x_tr_hl=hi&_x_tr_pto=tc
I can read and write Hindi/Devanagari well and am willing to help in any way possible to make incremental progress in this domain.
I get all your points, and I think they are the reason this has not been solved yet. But at times like this, I take inspiration from the story of reCAPTCHA (developed at Carnegie Mellon and later acquired by Google). The simplicity of using two words, one known and one unknown, to get practically every printed word ever transcribed is nothing short of awe-inspiring. If the Indian government were to put regional-language words into an Indian version of such a CAPTCHA, say, just to book tickets on Indian Railways, then the entirety of regional-language text could be transcribed before we know it, while also yielding valuable training datasets for ML/DL models.
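Just to make the idea concrete, here is a minimal sketch (all names and thresholds are hypothetical, not how reCAPTCHA actually implemented it) of the two-word scheme: a submission only counts as a vote on the unknown word if the known control word was answered correctly, and an unknown word is accepted once enough independent votes agree.

```python
from collections import Counter, defaultdict


class TwoWordCaptcha:
    """Hypothetical sketch of a reCAPTCHA-style two-word scheme:
    pair a known 'control' word with an unknown scanned word, and
    let only users who solve the control correctly vote on the unknown."""

    def __init__(self, consensus_votes=3):
        self.consensus_votes = consensus_votes
        self.votes = defaultdict(Counter)  # unknown word id -> answer tallies
        self.transcribed = {}              # unknown word id -> accepted answer

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        # Discard the whole submission if the control word is wrong:
        # the user has not proven they can read the image.
        if control_answer.strip().lower() != control_truth.strip().lower():
            return False
        tally = self.votes[unknown_id]
        tally[unknown_answer.strip()] += 1
        # Accept the unknown word once the leading answer has enough votes.
        answer, count = tally.most_common(1)[0]
        if count >= self.consensus_votes and unknown_id not in self.transcribed:
            self.transcribed[unknown_id] = answer
        return True
```

With a consensus threshold of two, two matching Devanagari answers from users who passed the control word would be enough to accept a transcription, while a failed control word contributes nothing.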
Nonetheless, I wish you the very best in your endeavours.