Lately I’ve been thinking about the news. Mostly this involves me shouting obscenities at the radio or the internet for wasting my time with news items the depth of which couldn’t drown an ant and whose factual status makes fairy tales look like rigorous mathematical texts (you know the kind labeled “Introductory X”.) But also (and less violently) I’ve been pondering my favorite type of question, the quantification question: how would one “measure” the news?
Part of motivation for even suggesting that there is a measure of “news” is that if someone asked me if there was a measure of “information” back when I was a wee lad, I would have said they were crazy. How could one “measure” something so abstract and multifaceted as “information?” However there is a nice answer to how to measure information and this answer is given by the Shannon entropy. Of course this answer doesn’t satisfy everyone, but the nice thing about it is that it is the answer to a well defined operational question about resources.
Another thought that strikes me is that, of course Google knows the answer. Or at least there is an algorithm for Google News. Similarly Twitter has an algorithm for spotting trending topics. And of course there are less well known examples like Thoora which seeks to deliver news that is trending in social media. And probably there is academic literature out there about these algorithms, the best I could find with some small google-fu is TwitterMonitor: trend detection over the twitter stream. But all of this is very algorithm centered. The question I want to ask is what quantity are these services attempting to maximize (is it even the same quantity?)
The first observation is that clearly news has a very strong temporal component. If I took all of the newspapers, communications, books, letters, etc. that mankind has produced and regarded it without respect to time you wouldn’t convince many that there is news in this body of raw data (except that there are some monkeys who can type rather well.) Certainly also it seems that news has a time-frame. That is one could easily imagine a quantity that discusses the news of the day, the news of the week, etc.
A second observation is that we can probably define some limits. Suppose that we are examining tweets and that we are looking for news items on a day time scale. We could take the words in the different day’s tweets and make a frequency table for all of these words. A situation in which there is a maximum amount of news on the second day is then a situation where on the first day the frequency distribution over words is peeked one one word, while the second day is all concentrated on another word. One could probably also argue that, on the day time scale, if both frequency distributions were peaked on the same word, then this would not be (day scale) news (it might be week scale news, however.)
This all suggests that our friend, the news, is nothing more than the total variation distance. For two probability distributions $latex p(x)$ and $latex q(x) $, the variation distance between these distribution is $latex d(p,q)=frac{1}{2} sum_{x} |p(x)-q(x)|$ . This is also equal to $latex sup_{E subset X} |P(E)-Q(E)|$ where $latex P(E)=sum_{x in E} p(x)$ and similarly for $latex Q(E)$. Ah, so perhaps this is not as exciting as I’d hoped 🙂 But at least it gives me a new way to talk about the variational distance between two probability distributions: this is a measure of the news that we could associate with changing from one probability distribution to another.
Of course this is just one approach to thinking about how to quantify “news.” What are the drawbacks for my method and what should a real measure have that this one lacks? I mean whats the worst that could happen in thinking about this problem. Okay, so maybe you would learn how many holes it takes
to fill the Albert Hall.