I appreciate the tool reference, and I'll add it to my bag, but my current plan is to go in a different direction with the analysis.
Right now I'm thinking I will extract the text of everything I have ever voted up, do some aggressive cleanup on it, break it into reasonable sentences with some nice filtering, and then treat that corpus as a single document. Then I'll hit it with latent semantic analysis from the gensim package to create a lexical/semantic eigenvector, and then use a nearness metric to figure out how much any given document is "like" the things I've already liked.
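Roughly, the gensim side would look something like this. This is a minimal sketch, not the finished thing: the sentence list, helper names, and the topic count are placeholders, and I'm feeding each cleaned sentence to gensim as its own pseudo-document, since that's the shape its term-document matrix wants.

```python
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess

# Placeholder: the cleaned-up sentences from everything I've upvoted.
liked_sentences = [
    "an example sentence from an upvoted post",
    "another sentence from something else I liked",
]

# Tokenize; each sentence becomes a pseudo-document in the
# term-document matrix that LSA decomposes.
texts = [simple_preprocess(s) for s in liked_sentences]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# TF-IDF weighting before the SVD usually works better than raw counts.
tfidf = models.TfidfModel(bow_corpus)

# num_topics is a knob to tune; 200 is just a conventional starting point.
lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=200)

# Index the liked corpus in the reduced semantic space, so candidate
# documents can be scored against it by cosine similarity.
index = similarities.MatrixSimilarity(lsi[tfidf[bow_corpus]])
```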
This approach has the advantage that the LSA can be updated with new information without recomputing the whole model, and it should be relatively fast once that model is actually built. Doing all the text manipulation (stripping, shifting around, filtering) is, surprisingly, going to be the slow part.
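gensim's LsiModel supports exactly that kind of incremental fold-in. Continuing the sketch above (new_sentences is a placeholder for freshly upvoted material after the same cleanup pass):

```python
# Run the new material through the same cleanup/tokenize pass.
new_texts = [simple_preprocess(s) for s in new_sentences]
new_bow = [dictionary.doc2bow(t) for t in new_texts]

# Fold the new pseudo-documents into the existing decomposition
# instead of retraining from scratch.
lsi.add_documents(tfidf[new_bow])

# The similarity index doesn't update in place, so rebuild it; that's
# just a projection, cheap next to redoing the SVD.
index = similarities.MatrixSimilarity(lsi[tfidf[bow_corpus + new_bow]])
```

One wrinkle: words outside the original vocabulary get dropped unless the Dictionary and TfidfModel are refreshed too, so the fold-in is only fully incremental for vocabulary the model has already seen.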
I'm not really fond of "keyword searching" as a means of discovery. It is very difficult to hit the sweet spot between too many keywords, which don't actually return enough content, and too few, which are overbroad. A semantic distance measure at least gives you a sort of general, multi-axis way to approach the problem.
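Concretely, scoring a candidate document against the liked corpus would be something like this (again continuing the sketch; candidate_text and the max-over-sentences aggregation are placeholders for whatever actually ends up working):

```python
# Placeholder: the text of a document I'm deciding whether to read.
candidate_text = "some new article I have not seen yet"

# Project the candidate into the same reduced semantic space.
cand_bow = dictionary.doc2bow(simple_preprocess(candidate_text))
cand_lsi = lsi[tfidf[cand_bow]]

# Cosine similarity against every liked sentence; take the max as a
# crude "how much is this like things I've liked" score. Mean or a
# top-k average are other obvious aggregation choices.
sims = index[cand_lsi]
score = float(sims.max())
print(f"similarity to liked corpus: {score:.3f}")
```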
It may not be a good one, but it's the one I'm starting with.
(In theory I could use some sort of neural net, but there is no good reason to tie myself to a single training event like that, and neural nets that get ongoing training are a real bear to code and deal with.)
Clearly I'm insane.
RE: [Programming] Beginning the Search for Discovery