Been hacking on a latent semantic analysis architecture for determining semantic similarity between a corpus and a new document, but I can't seem to get things working in the direction I want.
It's easy to take a new document and find the most similar document in the corpus, but it is extremely difficult to take a new document and determine its degree of similarity to the corpus as a whole. Considering that is exactly what I want to do, this is a little problematic.
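For the easy direction, a minimal sketch of what I mean: build a reduced LSA space and find the nearest corpus document to a query. (This uses scikit-learn; the corpus, parameters, and variable names here are illustrative, not my actual setup.)

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the real one.
corpus = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock prices fell sharply on tuesday",
    "the market rallied after the announcement",
]

# Term-document matrix, projected into a low-rank LSA space.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(X)

# Project a new document into the same space; its nearest neighbor
# by cosine similarity is the "most similar document in the corpus".
new_doc = "my cat chases the neighbor's dog"
q = svd.transform(vectorizer.transform([new_doc]))
sims = cosine_similarity(q, X_lsa)[0]
best = int(np.argmax(sims))
```

Nearest-neighbor lookup like this works fine; it's the corpus-as-a-whole score that it doesn't give you.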
(Sure, I could decide that the nearest document represents the corpus as a whole – but the problems with that approach are immediately obvious. So I need to figure out a way to generate an eigenvector for the set as a whole. Perhaps some sort of average of the eigenvectors described by the documents in the corpus. This may require some thought.)
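The averaging idea might look something like this: represent the corpus by the centroid of its document vectors in the reduced space, then score a new document against that centroid. A sketch under that assumption (the toy corpus and function names are mine):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock prices fell sharply on tuesday",
    "the market rallied after the announcement",
]

vectorizer = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(vectorizer.fit_transform(corpus))

# The "average" idea: one centroid vector standing in for the whole set.
centroid = X_lsa.mean(axis=0)

def corpus_similarity(doc: str) -> float:
    """Cosine similarity between a new document and the corpus centroid."""
    q = svd.transform(vectorizer.transform([doc]))[0]
    denom = np.linalg.norm(q) * np.linalg.norm(centroid)
    # Documents with no in-vocabulary terms project to zero; score them 0.
    return float(q @ centroid / denom) if denom else 0.0

on_topic = corpus_similarity("cats and dogs as pets")
off_topic = corpus_similarity("quantum entanglement experiments")
```

Whether a plain mean is the right aggregate (versus, say, weighting documents or scoring against the distribution of similarities) is exactly the part that needs more thought.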
I've played around with using an n-gram extractor to tokenize the content rather than a word-based solution, but that just gives me a new view of lexical similarity, which isn't what I want at all. Well, it might be what I want – I don't have enough information to make that decision quite yet. Exploring the vector space that describes all of the documents I've uploaded in the last 60 days is surprisingly disappointing. The features that keep getting called out are relatively common words rather than things I would expect to be actually useful.
Which generally makes me concerned, given what I know of the entropic measure of the quality of some of the words it's calling out. I have the creepy feeling that I'm doing something wrong. That's never good.
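For reference, the entropic measure I have in mind is the standard log-entropy global weight from the LSA literature: a term spread evenly across the corpus scores near 0, a term concentrated in one document scores near 1. A toy sketch (the counts are made up):

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
counts = np.array([
    [4, 3, 5, 4],   # a term spread evenly across documents (like "the")
    [6, 0, 0, 0],   # a term concentrated in a single document
], dtype=float)

n_docs = counts.shape[1]
p = counts / counts.sum(axis=1, keepdims=True)  # p_ij: share of term i in doc j
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)

# Log-entropy global weight: 1 + (sum_j p_ij log p_ij) / log(n_docs).
entropy_weight = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
```

So if high-entropy (evenly spread) words are still the dominant features, the weighting probably isn't being applied where I think it is.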
And now the database server I'm pulling data from seems to be down. Fantastic. Either that or they've become tired of me hitting it for more data and blocked me. Hard to tell. I would be the last one to begrudge them that decision, if that were so.
And this is what I do for fun. Man, I clearly have some real problems.