Category Archives: data mining

Seduced by Big Data

What do you need of you want to be Big Brother? Big Data, of course!

Data is powerful, and “big data” is very powerful. I deal with it every day in my work as a research scientist at Adobe, where I write and utilize algorithms capable of processing petabytes of data. I’ve been actively recruited by the creepily-named data mining company, Palantir (named for the all-seeing stones in Lord of the Rings). In grad school learned about powerful statistical methods for discovering latent information hiding in plain sight in ordinary data, and I learned just how easy it is to infer entire social networks from pairwise relationships, like who you call on the phone or who you email.

The Bush and Obama administrations have been culling records of billions of phone calls, emails, web searches, and more every day for years, with shocking disregard for your and my right to privacy. (Your local senators and congressmen have almost universally gone along with this invasive practice.) Billions a day for years is most definitely big data, and just as definitely is the cause for the construction of NSA’s huge new data center in Utah, just across the freeway from my workplace.

Defenders of these surveillance programs say that all of this monitoring is okay because it’s only gathering metadata. It’s true that the actual content of your phone calls is not available without obtaining a more traditional sort of warrant. But the metadata being collected—phone numbers, IP addresses, which number called which number when—is extremely powerful. Phone numbers are very easily mapped to names and addresses. It would also be trivial to discover the social networks behind the phone calls. You and your friends and family would show up together on a “map” of connections, like the one I created of American senators in an earlier post. Please forgive a little reductio ad Hitlerum, but in the wrong hands, such a tool would have made Hitler’s “final solution” a simple matter of searching the computer for Jewish names and sending the Gestapo knocking. Those helping people escape would have been exposed by their connections to non-approved groups on the social graph—another easy search!

My point is that big data allows government to build tools of immense and invasive power, and that such power will prove too great a temptation for an ambitious politician to resist. And the more complete the government’s vision, the more full its grasp of every citizen’s life and relationships, the more cataclysmic the consequences should the government itself fall into unscrupulous hands.

But maybe that’s already happened.

“A Republic, If You Can Keep It”

NSA 9000

I’m really disturbed (surprised isn’t quite the right word) at what I’m learning about the government’s massive, untargeted surveillance of millions of American citizens over the last 7 years. I thought we had this debate during the Bush administration and all decided it was illegal and should stop. Apparently, folks at the NSA and in the Bush and Obama administrations had different ideas.

In case you aren’t aware, a secret court has created secret law supposedly authorizing the federal government to spy on you, your friends, and your family. That means every email you read or write, every search you run on Google, every call you make on Skype. And the bureacracy asking the secret FISA court for approval to do this is so massive and so obscured by secrecy that there exists no single list of all of its activities. It’s called Big Brother, after all.

In my mind this is a clear violation of the fourth amendment:

The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no warrants shall issue, but upon probable cause, supported by oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.

No, it doesn’t mention electronic communications (not having been invented by 1789), but they are the modern analogue to “papers”. The wide sweep of surveillance as currently conducted seems to blatantly violate the requirement that no warrant be issued without a specific description of the people, places, and things involved. There is nothing specific about monitoring all phone calls.

The Constitution provides strong protections on privacy that are in this case being clearly disregarded in the name of national security. Combine it with the recent revelations of IRS targeting of conservative groups, and Justice Department intimidation of journalists, and a picture emerges of a gartantuan bureaucracy in which the systems, processes, and perceived mandates of government overwhelm by its very nature the interests of the individual.

Sign me up for the class-action lawsuit.

A Picture’s Worth A Thousand Senators: Staring Into The Gaping Ideological Chasm That Divides Congress

In this post, I’m going to introduce you to a cool-looking graph, tell you what it means, and give the technical details of its generation—all because I think America might care. Here we go.


Politics in America are hopelessly partisan, and all of the bickering serves only to cripple our nation at a moment of crisis when decisive action is called for. You know it. I know it. Barack Obama and John Boehner know it. Your grandma knows it.

Or do we know it? The belief that American politics has become more polarized in recent decades is widespread. But is there any evidence for it? While I make no attempt to provide a complete explanation for this disturbing trend in our nation’s governance, in this post I present some work that I believe provides an answer—a resounding confirmation that, according to at least one view of the situation, the politics of the United States are now more deeply divided than ever.

Though this work was done in collaboration with Michael Dimond as part of an advanced data mining course (CS676) at BYU, I believe I am the sole author of the portions of our report excerpted below.

The Cool-Looking Graph

Here’s the pretty picture:

United States Senate Legislator Similarity Network 1789-2011.

Bask in its glory—and be grateful, because that thing took a lot of work! Make sure to click on the image to see the full-sized version. (It will open in a new window/tab.)

What It Means

The above graph is a visual representation of the United States Senate across 222 years of legislative history. It is, in essence, a social network of senators across time—who voted like whom, what cliques and factions formed, etc. In other words, retroactive Facebook for America’s past politicians? No, that’s going too far….

Anyway, here’s how to interpret the graph. Each node (circle) represents a senator. An arc is drawn between two nodes if the two senators at the endpoints voted on the same bill at least once and voted the same way on bills more than 75% of the time. Size and color of nodes indicate their centrality (a measure of importance) in the network. Scanning from left (1789) to right (2011), a few trends emerge:

  1. The height of the graph increases. Much of this can be attributed to the increase in the number of states, from 13 to 50, meaning the number of senators serving simultaneously increased by 74.
  2. The graph alternates between unity and polarization. Visually, unity looks like a single “stream” of nodes, whereas polarization is the graph splitting into two components that move in slightly different directions.
  3. In recent decades, the height of the graph has continued to increase in spite of the number of senators being fixed at 100 since 1959. I assert that this corresponds to the phenomenon of increased polarization between the two parties.

I am interested in whether the flow of the graph can be correlated with developments in the American two-party system. Feel free to let me know your thoughts on that. For those wishing to play with the graph data, it’s available here.

Technical Details

This stuff gets pretty computer sciencey, so only read on if you really want to nerd out.


The graph is generated using an aggregated and sanitized version of the THOMAS congressional data from This yields 2.1 GiB of primarily XML-encoded congressional data from the 1st to the 112th congress. The data includes a record of votes by all legislators on all roll calls since the 1st congress, as well as party affiliation.

Social Graph Inference

Let L be the set of all legislators and S be the set of all sessions of congress. We define a legislator-to-legislator similarity function \sigma : L \times L \rightarrow [0,1] that returns a similarity score for all pairs of legislators that ever voted on the same roll call:

  \sigma(l_{1},l_{2})=\frac{SameVotes(l_{1},l_{2})}{PossibleVotes(l_{1},l_{2})} \\  \\  \phantom{\sigma(l_{1},l_{2})}=\frac{\sum_{s \in S : l_{1} \in s \wedge l_{2} \in s} \sum_{r\in Rolls(s)} \beta\left [vote(l_{1},r)=vote(l_{2},r) \right ]}{\sum_{s \in S : l_{1} \in s \wedge l_{2} \in s} |Rolls(s)|}


  • Rolls(s) returns the set of all roll calls (votes) occurring in session s;
  • \beta[x] is an indicator function returning 1 when x is true, 0 otherwise;
  • vote(l,r) returns the vote cast by legislator l on roll r; and
  • l \in s is true iff legislator l served in congressional session s.

We use this similarity measure to construct a legislator affinity graph as follows:

Let G=(V,E) be an undirected graph with a set of vertices V and a set of weighted edges E, such that

  • V=\{Vertex(l) : l \in L\} and
  • E=\{Edge(l_{1}, l_{2}, \sigma(l_{1},l_{2})) : (l_{1},l_{2}) \in L \times L \wedge \sigma(l_{1},l_{2}) > \theta\}


  • Vertex(l) yields the vertex associated with a given legislator l;
  • Edge(l_{1},l_{2},w) yields an undirected edge with weight w and endpoints Vertex(l_{1}) and Vertex(l_{2}),
  • and \theta \in [0,1] is a minimum similarity threshold.


In practice, the above \theta must be set high (I used 0.75) to prevent the number of edges from being excessively large. Once the graph was constructed, it was loaded into Gephi, a graph visualization tool. Betweenness centralities were computed, nodes were sized and colored, and a force-directed layout algorithm was applied. I then manually rotated the graph so that earlier senators are located on the left and more recent senators on the right, to give the effect of a rough historical timeline. I exported this as an SVG file, then loaded it in the Inkscape vector graphics program. With the benefit of 16GB of RAM, I coaxed Inkscape into rendering a 20,000 pixel width PNG image of the graph. This was finally scaled to 10,000 pixels wide for web distribution using GIMP.


Thanks to Christophe Giraud-Carrier for teaching the class for which this graph was generated, and to Michael Dimond who, though not directly working on this portion of our project, was nevertheless an excellent collaborator. And to my friend who convinced me to finally finish this post.