Lately I’ve been thinking about the news. Mostly this involves me shouting obscenities at the radio or the internet for wasting my time with news items the depth of which couldn’t drown an ant and whose factual status makes fairy tales look like rigorous mathematical texts (you know the kind labeled “Introductory X”.) But also (and less violently) I’ve been pondering my favorite type of question, the quantification question: how would one “measure” the news?
Part of my motivation for even suggesting that there is a measure of “news” is that if someone had asked me whether there was a measure of “information” back when I was a wee lad, I would have said they were crazy. How could one “measure” something so abstract and multifaceted as “information?” However, there is a nice answer to how to measure information, and this answer is given by the Shannon entropy. Of course this answer doesn’t satisfy everyone, but the nice thing about it is that it is the answer to a well-defined operational question about resources.
Another thought that strikes me is that, of course, Google knows the answer. Or at least there is an algorithm for Google News. Similarly, Twitter has an algorithm for spotting trending topics. And of course there are less well known examples like Thoora, which seeks to deliver news that is trending in social media. There is probably academic literature out there about these algorithms; the best I could find with some small google-fu is TwitterMonitor: trend detection over the twitter stream. But all of this is very algorithm centered. The question I want to ask is: what quantity are these services attempting to maximize (and is it even the same quantity)?
The first observation is that clearly news has a very strong temporal component. If I took all of the newspapers, communications, books, letters, etc. that mankind has produced and regarded them without respect to time, I wouldn’t convince many that there is news in this body of raw data (except that there are some monkeys who can type rather well.) Certainly it also seems that news has a time-frame. That is, one could easily imagine a quantity that discusses the news of the day, the news of the week, etc.
A second observation is that we can probably define some limits. Suppose that we are examining tweets and that we are looking for news items on a day time scale. We could take the words in the different days’ tweets and make a frequency table of these words for each day. A situation in which there is a maximum amount of news on the second day is then a situation where on the first day the frequency distribution over words is peaked on one word, while on the second day it is all concentrated on another word. One could probably also argue that, on the day time scale, if both frequency distributions were peaked on the same word, then this would not be (day scale) news (it might be week scale news, however.)
This all suggests that our friend, the news, is nothing more than the total variation distance. For two probability distributions $latex p(x)$ and $latex q(x)$, the variation distance between these distributions is $latex d(p,q)=\frac{1}{2} \sum_{x} |p(x)-q(x)|$. This is also equal to $latex \sup_{E \subset X} |P(E)-Q(E)|$ where $latex P(E)=\sum_{x \in E} p(x)$ and similarly for $latex Q(E)$. Ah, so perhaps this is not as exciting as I’d hoped 🙂 But at least it gives me a new way to talk about the variational distance between two probability distributions: this is a measure of the news that we could associate with changing from one probability distribution to another.
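To make the word-frequency picture concrete, here is a minimal Python sketch of the day-to-day calculation. The helper names (`word_distribution`, `total_variation`) and the toy "tweets" are made up for illustration, and splitting on whitespace is a crude stand-in for whatever tokenization a real service would use.

```python
from collections import Counter

def word_distribution(texts):
    """Map each word to its relative frequency across a list of texts.

    Assumes the texts contain at least one word in total.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def total_variation(p, q):
    """d(p, q) = 1/2 * sum_x |p(x) - q(x)|, summed over the union of supports."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Hypothetical toy data: each list is one day's worth of tweets.
day1 = ["snow snow snow", "snow snow"]
day2 = ["bacon bacon", "bacon bacon bacon"]
day3 = ["snow bacon", "snow bacon"]

p1, p2, p3 = map(word_distribution, (day1, day2, day3))

print(total_variation(p1, p2))  # peaks on different words: 1.0 (maximum news)
print(total_variation(p1, p1))  # identical distributions: 0.0 (no news)
print(total_variation(p1, p3))  # half the mass moved: 0.5
```

The two extreme cases match the limits described above: distributions peaked on different words give the maximum value of 1, and distributions peaked on the same word give 0.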
Of course this is just one approach to thinking about how to quantify “news.” What are the drawbacks of my method, and what should a real measure have that this one lacks? I mean, what’s the worst that could happen in thinking about this problem? Okay, so maybe you would learn how many holes it takes to fill the Albert Hall.
As I recently noted on my website, Dow Jones now provides “smart” algorithms for analyzing/trading based on so-called news.
But one might also step back a bit, and pose the philosophical question: “If a tree falls in a forest, and no one is there to hear it, did it make a sound?” Or – more succinctly, there are differences between observation and the knowledge of reality.
In the context of news, perhaps it is only relevant to a practitioner to the extent that it changes people’s behavior. For example, if there is a blizzard in Podunk, that isn’t news. But if the exact same blizzard hits Seattle, it IS news. (Obviously, dog bites man isn’t news. But man bites dog is news. So the model of pure news also needs to capture the prurient/voyeuristic element.)
Hence, any “useful” algorithm that measures “news” should probably incorporate: (1) how many people the event affects, and how strongly it changes their thinking and subsequent behavior; (2) the effect that the news has, more generally, on the future.
This is a very utilitarian view — and will undoubtedly incur the wrath of theoreticians.
YES!! Let’s leave Trump stuck in the polarized Third Dimension crisis and let the Millennial kids move Internet into the EMPIRICAL Fourth Dimension Network.
I use a double fractal interfaced wiring diagram to visualize an electron/photon direct-connect–a shortcut through the algorithms that block Empirical messages. Pentium chips can’t solve the N versus NP problem for a very good reason—empirical data originates with the TV/Internet audience polls bottom up and is calculated in floating percentage averages. The Fourth Dimension is citizens’ push back, using high-speed, high-volume communication networks.
I don’t really think what you’re looking at (or what Google’s or Twitter’s algorithms are actually looking at) would be “news” in and of itself, but more a measure of spotting things which may be “news.” I use the quotes here because frequently these stories are articles which receive the most interest. This does not make them “news,” but rather “high interest and high traffic websites.” Whereas “news” tends to have a cultural or political impact of some kind, Twitter’s “Trending Topics” often lacks this aspect. Google News filters this non-news “noise” out by only utilizing known sources of “news” such as the AP wire, newspaper websites, and TV news websites.
News has several features which we can look at:
1) Novel and verified information. It is this verification that is often lacking at most news outlets, because they are so concerned with being able to put “FIRST!!!” even if that means “WRONG!!!!” since, you know, they can always write a retraction.
2) The temporal element does exist, but it varies for different topics. For example, scientific discoveries remain news for years due to the speed at which they spread throughout society (read: slowly), while political and cultural news remains so for a matter of days. This topic-relativity also needs to be accounted for in some way.
Ultimately, what it comes down to is that something is “news” if the reader hasn’t read or heard it before (or doesn’t remember it) and is also interested in learning about it.
A compact Hausdorff topological space can be reconstructed from the Banach algebra of functions on the space. Similarly, the information in a piece of news can be reconstructed from the comments about it (as long as there are enough comments).
Surely after talking about entropy you would quantify news by relative entropy between the two probability distributions, rather than total variation distance?
Actually Matt, that was my first thought, but I’m not sure it works. For example, the case where there are two different frequency-one events doesn’t have a well-defined relative entropy, right?
Apropos of this weblog’s general theme, today’s Science profiles physics grad student Aaron O’Connell in an article titled Breakthrough of the Year: Bridging the Quantum and the Classical Worlds.
Aaron’s take-home lesson for other young quantum researchers:
Hmmm … somehow … the message about the excitement of quantum physics is not getting out … even to its youngest and most skilled practitioners.
@John: To quote the late Philip K Dick, whose birthday is today, “The trouble with being educated is that it takes a long time; it uses up the better part of your life and when you are finished what you know is that you would have benefited more by going into banking.” The only drawback, as a friend of mine recently told me, is the whole “zero sum” nature of the game those in finance play.
More seriously I would ask: can I really advocate for people to go into quantum computing? Watching the good souls and brilliant minds that have been trashed upon by the system (that’s a Berkeley term, heh) doesn’t really make me comfortable doing so. I have been beyond lucky in what I’ve gotten to do, but there is not a day that goes by that I don’t think seriously about doing something different.
Damn as a physicist I should have been fine with an infinity in the definition, yes!
On Dick Lipton’s weblog, under the topic What Is Big? What Is Small?, I’ve posted an in-depth lexigraphic analysis of trends in career-related opportunities for 21st century STEM students.
The bottom line: today’s fastest-expanding career options—by far—appear to be “vampire, zombie, troll, and werewolf”. Also … get a tattoo, not a haircut. 🙂
“The bottom line: today’s fastest-expanding career options—by far—appear to be ‘vampire, zombie, troll, and werewolf’.” Ah so there is hope for academia!
Sure, I figured that was a problem also, Dave. But, wouldn’t it be great that this way you could potentially have an _infinite_ amount of news on any given day? (only log(infinity) but that’s still pretty big!)
Dave, here is an ancient academic classic:
Well, “news” is basically something you didn’t *already* know. Is that properly worked into the concept of “information”? Then it can’t be just a matter of analyzing bits, then it is relative to the recipient’s “mental state.” Well Dave, did you know about this:
My First Bacon
ThinkGeek’s My First Bacon proves someone upstairs loves me
(by Chris Pranger on 12/17/2010)
I’m sort of at a loss for how best to say this, so I’ll just say it: ThinkGeek has released a new plush toy for all ages, and it’s a large piece of talking bacon. The only explanation for this is that I did something incredibly noble while I was asleep last night as all of my dreams are coming true in one swift instance.
Hmmm, you guys think alike.
My pardons if you already noted this.
At the risk of taking this discussion too seriously, I’ll point out that you can replace the relative entropy by the Jensen-Shannon distance, and all the infinity problems go away. The relative entropy is closely related to the perplexity (the exponential of the cross-entropy), and captures the degree of “surprise” in some stochastic process (by comparing prior to posterior).
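For anyone who wants to see the infinity appear and then disappear, here is a small Python sketch of both quantities on the “two different frequency-one events” case raised above. The function names and the one-word distributions are invented for illustration; the code computes the Jensen-Shannon divergence in bits, and its square root is what is usually called the Jensen-Shannon distance.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits.

    Returns infinity when p puts mass on an outcome that q gives zero probability.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # terms with p(x) = 0 contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        total += px * math.log2(px / qx)
    return total

def js_divergence(p, q):
    """Jensen-Shannon divergence: average relative entropy to the mixture m = (p + q) / 2."""
    support = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical example: each day's word distribution sits entirely on one word.
yesterday = {"snow": 1.0}
today = {"bacon": 1.0}

print(kl_divergence(today, yesterday))  # inf: the relative entropy blows up
print(js_divergence(today, yesterday))  # 1.0 bit: finite, and the largest value possible
print(js_divergence(today, today))      # 0.0: nothing changed, no news
```

Because the mixture m gives positive probability to everything either day does, neither term can divide by zero, which is why the infinities go away.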
I’m going to take this too seriously as well:
News can be quantified not just for time, but for recipient. The phrase “that’s news to me” makes perfect sense. On the other hand, you probably wouldn’t consider something read in the stacks of a library news, no matter how surprising.
So I think news is the information needed to describe the change in some accepted body of information, e.g. people’s knowledge of the world. One consequence is that a single article in the newspaper could be a large piece of news relative to one body of knowledge, but small relative to another. For example, the contents of the wikileaks diplomatic cables had a lot of news for the public, but very little for government insiders.
The probability measure you suggest does have a high chance of catching the topic of the news, at least for a new trend.