Yeah, that is probably doable. I plan to publish my methodology once I figure it out, and it should apply regardless of the language, though it will be simple.
- Is the comment unique, i.e. does it appear only once? + score if unique, - score if it is not, and the more frequently it occurs, the more negative the score gets.
- Is it longer than the average comment on the chain? (With a positive modifier that grows the longer it is.)
- How many "tokens" does it have - basically, how many words? You can have a reaaaaalllly long comment (as the length of the comment is counted in characters), but it might have markdown formatting, lots of links, or HTML in it that inflates its length beyond the meaningful text.
- I'm not sure if it will be possible to do the analysis row by row, but there are the readability tests (Flesch-Kincaid), which are built into Word and other writing tools to help make text easy to understand: https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests
- If it isn't easy to understand, or is written at a very high grade level, it might be seen to have good academic qualities.
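To make the list above a bit more concrete, here is a rough Python sketch of those signals. It is not my finished methodology, just an illustration: the function names, the way I strip markdown/HTML, and the vowel-group syllable counter are all my own assumptions, and the syllable counter in particular is only an approximation. The grade formula is the standard Flesch-Kincaid one from the Wikipedia page linked above.

```python
import re
from collections import Counter


def strip_markup(text):
    """Roughly remove HTML tags, markdown links/images and bare URLs (heuristic)."""
    text = re.sub(r"<[^>]+>", " ", text)                    # HTML tags
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)  # keep only the link label
    text = re.sub(r"https?://\S+", " ", text)               # bare URLs
    return text


def tokens(text):
    """Words left over once the markup is gone."""
    return re.findall(r"[A-Za-z']+", strip_markup(text))


def count_syllables(word):
    """Crude approximation: one syllable per group of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade(text):
    """Flesch-Kincaid grade level: 0.39*W/S + 11.8*Syl/W - 15.59."""
    words = tokens(text)
    if not words:
        return 0.0
    sentences = max(1, len(re.findall(r"[.!?]+", strip_markup(text))))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59


def comment_features(comments):
    """Return one dict of signals per comment; how to weight them is left open."""
    if not comments:
        return []
    normalised = [" ".join(c.lower().split()) for c in comments]
    counts = Counter(normalised)
    avg_len = sum(len(c) for c in comments) / len(comments)
    features = []
    for raw, norm in zip(comments, normalised):
        n = counts[norm]
        features.append({
            # +1 if the comment is unique, otherwise more negative per extra copy
            "uniqueness": 1.0 if n == 1 else -float(n - 1),
            # above 0 when longer than the average comment, in characters
            "length_vs_avg": len(raw) / avg_len - 1.0 if avg_len else 0.0,
            # word count after stripping markup
            "token_count": len(tokens(raw)),
            "fk_grade": fk_grade(raw),
        })
    return features
```

Calling `comment_features(list_of_comment_bodies)` gives one row of signals per comment, so it can run row by row once the average length and duplicate counts are known. I have left out a combined score on purpose, since the weights are the part I still need to figure out.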
There's a whole bunch of other things you can do with natural language processing and data to extract insights. I am really looking forward to finding out who swears most on the blockchain :P
Or maybe, we could find out what the most frequently mentioned colour is? blue, red, green, yellow, orange, purple, brown, or something else? :D
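The colour question is just a word count, so here is a minimal sketch of how I might do it. The `colour_counts` name and the colour list are only placeholders for illustration, and it counts whole words only, so "Fred" does not count as "red". It cannot tell the colour orange from the fruit, though.

```python
import re
from collections import Counter

# Hypothetical starting list; extend it with whatever colours you care about.
COLOURS = ("blue", "red", "green", "yellow", "orange", "purple", "brown")


def colour_counts(comments, colours=COLOURS):
    """Count whole-word, case-insensitive mentions of each colour."""
    wanted = set(colours)
    counts = Counter()
    for comment in comments:
        for word in re.findall(r"[a-z]+", comment.lower()):
            if word in wanted:
                counts[word] += 1
    return counts
```

`colour_counts(comments).most_common(1)` then gives the most frequently mentioned colour and how often it shows up.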
As an appendix to this comment, Microsoft Word tells me my grade level for this comment is...
I like to try and keep things mostly simple on HIVE, as not everyone has English as a first language.
RE: Text analytics reveal thirty two percent of comments on hive are not unique and at least ten percent add no value to discussion