sparse symbol encoding

For another experiment, I’ve decided to try to come up with a sparse encoding for text characters. My current intuition is the following:

A sparse representation should make it possible to measure the similarity between encoded items. Otherwise it is no better than noise and loses its utility further down the processing pipeline. No letter is inherently more similar to another, at least not in meaning. One can argue that certain properties allow characters to be grouped into vowels, consonants, punctuation, and so on. However, the utility I’m trying to achieve is quite different.
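
To make "similarity" concrete, here is a minimal sketch that assumes a code is a set of active bit positions and that similarity is simply the number of shared active bits. The names `random_sparse_code` and `overlap`, and the sizes 256 and 8, are made up for illustration. With random codes the overlap between different letters is close to zero, which is the "noise" case described above.

```python
import random

def random_sparse_code(n_bits, n_active, rng):
    """Return a sparse code as a frozenset of active bit positions."""
    return frozenset(rng.sample(range(n_bits), n_active))

def overlap(a, b):
    """Similarity of two codes: the number of active bits they share."""
    return len(a & b)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical sizes: 256 bits total, 8 active bits per letter.
codes = {ch: random_sparse_code(256, 8, rng) for ch in "abc"}

print(overlap(codes["a"], codes["a"]))  # 8: a code fully overlaps itself
print(overlap(codes["a"], codes["b"]))  # small for unrelated random codes
```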

I’ve decided to use bigram letter frequency to define the encoding topology: arrange the letters so that letters with a high probability of following each other have a larger overlap in their encodings. I don’t have a particular algorithm in mind for achieving this property. Nevertheless, I wanted to record the idea for future implementation. There might be something there…
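
As a starting point, here is one possible sketch. It is an assumption of mine, not a worked-out algorithm. It counts unordered letter bigrams, greedily chains letters so that frequent neighbours sit next to each other, and gives each letter a sliding window of active bits so that nearby letters in the chain share bits. The function names, the sample text, and the `width` and `stride` values are all hypothetical.

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent letter pairs inside words, ignoring pair order."""
    text = text.lower()
    counts = Counter()
    for a, b in zip(text, text[1:]):
        if a.isalpha() and b.isalpha() and a != b:
            counts[frozenset((a, b))] += 1
    return counts

def greedy_order(counts):
    """Heuristic: chain letters, always appending the unplaced letter
    that most often appears next to the last placed one."""
    letters = sorted({c for pair in counts for c in pair})
    if not letters:
        return []
    totals = Counter()
    for pair, n in counts.items():
        for c in pair:
            totals[c] += n
    order = [max(letters, key=lambda c: totals[c])]
    remaining = set(letters) - {order[0]}
    while remaining:
        last = order[-1]
        # Missing pairs count as 0; ties are broken alphabetically.
        nxt = max(sorted(remaining), key=lambda c: counts[frozenset((last, c))])
        order.append(nxt)
        remaining.remove(nxt)
    return order

def encode(order, width=6, stride=2):
    """Give the i-th letter the bit window [i*stride, i*stride + width)."""
    return {ch: frozenset(range(i * stride, i * stride + width))
            for i, ch in enumerate(order)}

sample = "the quick brown fox jumps over the lazy dog and then the other thing"
order = greedy_order(bigram_counts(sample))
codes = encode(order)
first, second = order[0], order[1]
print(len(codes[first] & codes[second]))  # 4 shared bits for chain neighbours
```

A one-dimensional chain gives each letter only two close neighbours, so it can capture just a small part of the bigram structure. It is meant as a toy to experiment with, and the codes are not very sparse at these sizes.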