Seduced by Big Data

What do you need if you want to be Big Brother? Big Data, of course!

Data is powerful, and “big data” is very powerful. I deal with it every day in my work as a research scientist at Adobe, where I write and use algorithms capable of processing petabytes of data. I’ve been actively recruited by the creepily named data mining company Palantir (named for the all-seeing stones in Lord of the Rings). In grad school I learned about powerful statistical methods for discovering latent information hiding in plain sight in ordinary data, and I learned just how easy it is to infer entire social networks from pairwise relationships, like who you call on the phone or who you email.

The Bush and Obama administrations have been culling records of billions of phone calls, emails, web searches, and more every day for years, with shocking disregard for your and my right to privacy. (Your local senators and congressmen have almost universally gone along with this invasive practice.) Billions a day for years is most definitely big data, and just as definitely is the reason for the construction of the NSA’s huge new data center in Utah, just across the freeway from my workplace.

Defenders of these surveillance programs say that all of this monitoring is okay because it’s only gathering metadata. It’s true that the actual content of your phone calls is not available without obtaining a more traditional sort of warrant. But the metadata being collected—phone numbers, IP addresses, which number called which number when—is extremely powerful. Phone numbers are very easily mapped to names and addresses. It would also be trivial to discover the social networks behind the phone calls. You and your friends and family would show up together on a “map” of connections, like the one I created of American senators in an earlier post. Please forgive a little reductio ad Hitlerum, but in the wrong hands, such a tool would have made Hitler’s “final solution” a simple matter of searching the computer for Jewish names and sending the Gestapo knocking. Those helping people escape would have been exposed by their connections to non-approved groups on the social graph—another easy search!

My point is that big data allows government to build tools of immense and invasive power, and that such power will prove too great a temptation for an ambitious politician to resist. And the more complete the government’s vision, the more full its grasp of every citizen’s life and relationships, the more cataclysmic the consequences should the government itself fall into unscrupulous hands.

But maybe that’s already happened.

hubris

When I saw that the various things I was working on and talking about with people brought a Tennyson poem, Doctrine and Covenants 45, a book about pre-Columbian civilizations, and my own poetic musings together in one place, I was filled with intellectual vanity.

At his request, I started telling my professor my thoughts about a thesis topic. He started to seem bored and anxious for the conversation to end—I guess he just doesn’t dig computational approaches to decipherment? Well, he asked and I answered so he can only blame himself, I suppose.

Sometimes I think I have a real contribution to make. Other times I feel like the poor, freaked-out kid in PhD Comics who is always getting dumped on and put in his place. Maybe both are true.

Visualizing Texts as Networks

Yes, Heman is in the Bible!

For a recent project in my text mining class we were required to explore ways of visualizing textual data. This is a tricky problem due to the nature of text, mainly because it is both categorical and sparse. Visualizing numerical data is, relatively speaking, a piece of cake, because real numbers can all be related to each other rather easily. But categorical data doesn’t generally have it that easy. Even if you can order the data, such as putting words in alphabetical order, it doesn’t always give you much insight into the text that the words comprise. After all, we tend to care more about what the text says than about what letters the words it uses start with.

My work was inspired by the cool stuff at Visual Complexity, and heavily influenced by Chris Harrison’s “Visualizing the Bible” page. For one of his visualizations, Harrison extracted a “social network” of sorts from the biblical text. The more often a name occurs, the bigger it and its dot are on the chart. If two names occur in the same verse, they are considered to be somehow connected and have a line drawn between them. The result is pretty cool, so I decided to apply the same general approach to the leaked emails from the University of East Anglia’s Climatic Research Unit.

I grabbed the emails from somewhere-or-other on the Internet (a torrent, maybe? The filename is cru-foia.tbz2) and set to parsing them. Unfortunately, whereas the names of the Bible have been conveniently compiled, nobody has made a list of names in the CRU emails, as far as I know. So I had to resort to automatic means. To that end, I downloaded Lev Ratinov’s LBJ NER tagger from UIUC CS and used it to identify the names in each email. If two names showed up in the same email, a connection was made between them. The more times they co-occurred, the stronger the link. Once all of the names and their connections were counted, they were dumped to a file in the GEXF graph interchange format.
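The counting and export steps can be sketched roughly as follows. This is not the code used for the project, just a minimal illustration that assumes the names have already been extracted, one list per email (the tagging step is left out entirely). The function names `cooccurrence_counts` and `to_gexf` and the sample names are made up for the example, and the GEXF output is deliberately bare-bones: nodes with labels and edges with weights, nothing else.

```python
from collections import Counter
from itertools import combinations
import xml.etree.ElementTree as ET


def cooccurrence_counts(names_per_doc):
    """Count how many documents mention each name and each pair of names."""
    name_counts = Counter()
    pair_counts = Counter()
    for names in names_per_doc:
        unique = sorted(set(names))  # count each name once per document
        name_counts.update(unique)
        # combinations() of a sorted list yields each pair in a stable order,
        # so ("Alice", "Bob") and ("Bob", "Alice") are never counted separately
        pair_counts.update(combinations(unique, 2))
    return name_counts, pair_counts


def to_gexf(name_counts, pair_counts):
    """Serialize the graph as a minimal GEXF 1.2 document and return it as a string."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="undirected")
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")
    ids = {name: str(i) for i, name in enumerate(sorted(name_counts))}
    for name, node_id in ids.items():
        ET.SubElement(nodes, "node", id=node_id, label=name)
    for i, ((a, b), weight) in enumerate(sorted(pair_counts.items())):
        ET.SubElement(edges, "edge", id=str(i), source=ids[a], target=ids[b],
                      weight=str(weight))
    return ET.tostring(gexf, encoding="unicode")


# Hypothetical input: the names found in each of three emails
emails = [["Alice", "Bob"], ["Alice", "Bob", "Carol"], ["Carol"]]
name_counts, pair_counts = cooccurrence_counts(emails)
gexf_text = to_gexf(name_counts, pair_counts)
```

Writing `gexf_text` to a `.gexf` file should give something a graph tool can open, with the edge weight standing in for how often two names showed up together.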

To actually produce a visualization, I used Gephi, an up-and-coming graph visualization tool that allows real-time, interactive visualization of complex graphs. I also worked with a couple of other datasets: a “scriptures” dataset, including the Bible, Apocrypha, Book of Mormon, and Pearl of Great Price; and a dataset covering LDS General Conference addresses from the 1880s to about 1950. The General Conference data was processed using top non-stopwords (words that aren’t things like ‘the’, ‘to’, ‘with’, etc.) rather than names, which gives less interesting results.
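For the curious, picking out top non-stopwords might look something like the sketch below. It is only an illustration: the stopword list is a tiny stand-in (a real one would be much longer), the tokenizer is a crude regular expression, and the name `top_non_stopwords` and the cutoff `k` are assumptions for the example rather than anything from the actual project.

```python
from collections import Counter
import re

# Tiny illustrative stopword list; a real one would be far longer
STOPWORDS = {"the", "to", "with", "a", "an", "and", "of", "in", "is", "that"}


def top_non_stopwords(text, k=5, stopwords=STOPWORDS):
    """Return the k most frequent words in text that are not stopwords."""
    words = re.findall(r"[a-z']+", text.lower())  # crude lowercase tokenizer
    counts = Counter(w for w in words if w not in stopwords)
    return [word for word, _ in counts.most_common(k)]


sample = "Faith and works: faith with works, faith in the gospel"
top_words = top_non_stopwords(sample, k=2)
```

The words returned for each address can then play the role that names played in the other datasets when building the co-occurrence graph.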

The results with Gephi can be quite striking, as with the “Heman” image at the beginning of this post. Here’s a video of using Gephi to work with the CRU data:

Visualizing Texts as Networks from gephi on Vimeo.