Category Archives: research

A Picture’s Worth A Thousand Senators: Staring Into The Gaping Ideological Chasm That Divides Congress

In this post, I’m going to introduce you to a cool-looking graph, tell you what it means, and give the technical details of its generation—all because I think America might care. Here we go.

Introduction

Politics in America are hopelessly partisan, and all of the bickering serves only to cripple our nation at a moment of crisis when decisive action is called for. You know it. I know it. Barack Obama and John Boehner know it. Your grandma knows it.

Or do we know it? The belief that American politics has become more polarized in recent decades is widespread. But is there any evidence for it? While I make no attempt to provide a complete explanation for this disturbing trend in our nation’s governance, in this post I present some work that I believe provides an answer—a resounding confirmation that, according to at least one view of the situation, the politics of the United States are now more deeply divided than ever.

Though this work was done in collaboration with Michael Dimond as part of an advanced data mining course (CS676) at BYU, I believe I am the sole author of the portions of our report excerpted below.

The Cool-Looking Graph

Here’s the pretty picture:

United States Senate Legislator Similarity Network 1789-2011.

Bask in its glory—and be grateful, because that thing took a lot of work! Make sure to click on the image to see the full-sized version. (It will open in a new window/tab.)

What It Means

The above graph is a visual representation of the United States Senate across 222 years of legislative history. It is, in essence, a social network of senators across time—who voted like whom, what cliques and factions formed, etc. In other words, retroactive Facebook for America’s past politicians? No, that’s going too far….

Anyway, here’s how to interpret the graph. Each node (circle) represents a senator. An arc is drawn between two nodes if the two senators at the endpoints voted on the same bill at least once and voted the same way on bills more than 75% of the time. Size and color of nodes indicate their centrality (a measure of importance) in the network. Scanning from left (1789) to right (2011), a few trends emerge:

  1. The height of the graph increases. Much of this can be attributed to the growth in the number of states from 13 to 50, which raised the number of senators serving simultaneously from 26 to 100, an increase of 74.
  2. The graph alternates between unity and polarization. Visually, unity looks like a single “stream” of nodes, whereas polarization is the graph splitting into two components that move in slightly different directions.
  3. In recent decades, the height of the graph has continued to increase in spite of the number of senators being fixed at 100 since 1959. I assert that this corresponds to the phenomenon of increased polarization between the two parties.

I am interested in whether the flow of the graph can be correlated with developments in the American two-party system. Feel free to let me know your thoughts on that. For those wishing to play with the graph data, it’s available here.

Technical Details

This stuff gets pretty computer sciencey, so only read on if you really want to nerd out.

Data

The graph is generated using an aggregated and sanitized version of the THOMAS congressional data from govtrack.us. This yields 2.1 GiB of primarily XML-encoded congressional data from the 1st to the 112th congress. The data includes a record of votes by all legislators on all roll calls since the 1st congress, as well as party affiliation.

Social Graph Inference

Let L be the set of all legislators and S be the set of all sessions of congress. We define a legislator-to-legislator similarity function \sigma : L \times L \rightarrow [0,1] that returns a similarity score for all pairs of legislators that ever voted on the same roll call:

  \sigma(l_{1},l_{2})=\frac{SameVotes(l_{1},l_{2})}{PossibleVotes(l_{1},l_{2})}=\frac{\sum_{s \in S : l_{1} \in s \wedge l_{2} \in s} \; \sum_{r\in Rolls(s)} \beta\left[vote(l_{1},r)=vote(l_{2},r)\right]}{\sum_{s \in S : l_{1} \in s \wedge l_{2} \in s} \left|Rolls(s)\right|}

where

  • Rolls(s) returns the set of all roll calls (votes) occurring in session s;
  • \beta[x] is an indicator function returning 1 when x is true, 0 otherwise;
  • vote(l,r) returns the vote cast by legislator l on roll r; and
  • l \in s is true iff legislator l served in congressional session s.
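To make the definition concrete, here is a minimal Python sketch of \sigma. The nested-dict layout (session → roll call → legislator → vote) is purely an assumption for illustration; it is not the actual format of the govtrack data.

```python
def sigma(sessions, l1, l2):
    """Similarity of two legislators: the fraction of roll calls, across
    the sessions both served in, on which they cast the same vote.

    `sessions` maps a session id to {roll_id: {legislator: vote}};
    a legislator "served" in a session if they appear on any of its rolls.
    (This data layout is a simplifying assumption for illustration.)
    """
    same = possible = 0
    for rolls in sessions.values():
        served = set().union(*(set(v) for v in rolls.values()))
        if l1 not in served or l2 not in served:
            continue  # only sessions where both legislators served count
        possible += len(rolls)  # denominator term: |Rolls(s)|
        for votes in rolls.values():
            # indicator beta[vote(l1, r) == vote(l2, r)]
            if l1 in votes and l2 in votes and votes[l1] == votes[l2]:
                same += 1
    return same / possible if possible else 0.0
```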

We use this similarity measure to construct a legislator affinity graph as follows:

Let G=(V,E) be an undirected graph with a set of vertices V and a set of weighted edges E, such that

  • V=\{Vertex(l) : l \in L\} and
  • E=\{Edge(l_{1}, l_{2}, \sigma(l_{1},l_{2})) : (l_{1},l_{2}) \in L \times L \wedge l_{1} \neq l_{2} \wedge \sigma(l_{1},l_{2}) > \theta\}

where

  • Vertex(l) yields the vertex associated with a given legislator l;
  • Edge(l_{1},l_{2},w) yields an undirected edge with weight w and endpoints Vertex(l_{1}) and Vertex(l_{2}); and
  • \theta \in [0,1] is a minimum similarity threshold.
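The construction itself is a straightforward pairwise sweep. Here is a sketch with the similarity computation abstracted behind an arbitrary function (all names are hypothetical):

```python
from itertools import combinations

def affinity_graph(legislators, sigma, theta=0.75):
    """Build the undirected affinity graph G = (V, E): one vertex per
    legislator, and an edge (l1, l2) with weight sigma(l1, l2) whenever
    the similarity clears the threshold theta. `sigma` is any pairwise
    similarity function returning a value in [0, 1].
    """
    vertices = set(legislators)
    edges = {}
    # combinations() visits each unordered pair once and never (l, l),
    # so self-loops and duplicate undirected edges are excluded.
    for l1, l2 in combinations(sorted(vertices), 2):
        w = sigma(l1, l2)
        if w > theta:
            edges[(l1, l2)] = w
    return vertices, edges
```

The default theta=0.75 matches the threshold used for the rendering described below.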

Rendering

In practice, the above \theta must be set high (I used 0.75) to keep the number of edges from becoming excessive. Once constructed, the graph was loaded into Gephi, a graph visualization tool. Betweenness centralities were computed, nodes were sized and colored accordingly, and a force-directed layout algorithm was applied. I then manually rotated the graph so that earlier senators sit on the left and more recent senators on the right, giving the effect of a rough historical timeline. I exported the result as an SVG file and loaded it into the Inkscape vector graphics program. With the benefit of 16 GB of RAM, I coaxed Inkscape into rendering a 20,000-pixel-wide PNG of the graph, which was finally scaled down to 10,000 pixels wide for web distribution using GIMP.

Acknowledgements

Thanks to Christophe Giraud-Carrier for teaching the class for which this graph was generated, and to Michael Dimond who, though not directly working on this portion of our project, was nevertheless an excellent collaborator. And to my friend who convinced me to finally finish this post.

Visualizing Texts as Networks

Yes, Heman is in the Bible!

For a recent project in my text mining class we were required to explore ways of visualizing textual data. This is a tricky problem because text is, by its nature, both categorical and sparse. Visualizing numerical data is, relatively speaking, a piece of cake, because real numbers relate to each other in obvious ways. Categorical data isn't so lucky. Even when you can impose an order, such as alphabetizing words, it rarely yields much insight into the text those words comprise. After all, we tend to care more about what a text says than about which letters its words start with.

My work was inspired by the cool stuff at Visual Complexity, and heavily influenced by Chris Harrison’s “Visualizing the Bible” page. For one of his visualizations, Harrison extracted a “social network” of sorts from the biblical text. The more often a name occurs, the bigger it and its dot appear on the chart. If two names occur in the same verse, they are considered connected and have a line drawn between them. The result is pretty cool, so I decided to apply the same general approach to the leaked emails from the University of East Anglia’s Climatic Research Unit.

I grabbed the emails from somewhere-or-other on the Internet (a torrent, maybe? The filename is cru-foia.tbz2) and set to parsing them. Unfortunately, whereas the names of the Bible have been conveniently compiled, nobody has made a list of the names in the CRU emails, as far as I know, so I had to resort to automatic means. To that end, I downloaded Lev Ratinov’s LBJ NER tagger from UIUC CS and used it to identify the names in each email. If two names showed up in the same email, a connection was made between them; the more times they co-occurred, the stronger the link. Once all of the names and their connections were counted, they were dumped to a file in the GEXF graph interchange format.
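The counting step boils down to a couple of Counters. Here is a minimal sketch, assuming the NER tagger has already reduced each email to a set of person names (the function name is hypothetical):

```python
from collections import Counter
from itertools import combinations

def cooccurrence(emails_names):
    """Count name occurrences and co-occurrences. Each element of
    `emails_names` is the set of person names found in one email.
    Names sharing an email get a link; repeated co-occurrence across
    emails strengthens it."""
    name_counts = Counter()
    link_counts = Counter()
    for names in emails_names:
        name_counts.update(names)
        # sorted() canonicalizes each unordered pair, so (a, b) and
        # (b, a) accumulate into the same undirected link
        for a, b in combinations(sorted(set(names)), 2):
            link_counts[(a, b)] += 1
    return name_counts, link_counts
```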

To actually produce a visualization, I used Gephi, an up-and-coming graph visualization tool that allows real-time, interactive visualization of complex graphs. I also worked with a couple of other datasets: a “scriptures” dataset, including the Bible, Apocrypha, Book of Mormon, and Pearl of Great Price; and a dataset covering LDS General Conference addresses from the 1880s to about 1950. The General Conference data was processed using the top non-stopwords (words that aren’t things like ‘the’, ‘to’, ‘with’, etc.) rather than names, which gives less interesting results.
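For the curious, the GEXF dump amounts to serializing those counts as XML. Here is a deliberately minimal sketch; the real GEXF schema carries namespaces and metadata that are omitted here for brevity:

```python
import xml.etree.ElementTree as ET

def to_gexf(nodes, edges):
    """Serialize a weighted undirected graph to a bare-bones GEXF-style
    document (schema details simplified for illustration). `nodes` is an
    iterable of labels; `edges` maps (source, target) pairs to weights."""
    gexf = ET.Element("gexf", version="1.2")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="undirected")
    nodes_el = ET.SubElement(graph, "nodes")
    ids = {}
    for i, label in enumerate(nodes):
        ids[label] = str(i)
        ET.SubElement(nodes_el, "node", id=str(i), label=label)
    edges_el = ET.SubElement(graph, "edges")
    for j, ((a, b), w) in enumerate(edges.items()):
        ET.SubElement(edges_el, "edge", id=str(j),
                      source=ids[a], target=ids[b], weight=str(w))
    return ET.tostring(gexf, encoding="unicode")
```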

The results with Gephi can be quite striking, as with the “Heman” image at the beginning of this post. Here’s a video of using Gephi to work with the CRU data:

Visualizing Texts as Networks from gephi on Vimeo.

Downloads

Real, or Ideal? OR What To Name A Post When No Cohesive Theme Binds It Together

1. Academia

Last night I went to the “Evening for New Graduate Students” at BYU (which was actually secretly or not-so-secretly open to all graduate students—note for next year 🙂 ). President Samuelson spoke first. During that short talk and during his devotional address on Tuesday, I had the feeling that I have really undervalued President Samuelson’s ideas in the past. Maybe that’s because he doesn’t have the sort of voice you might hear on TV or the radio. The last part of the program was a speech by Dr. Wynn Sterling, Dean of Graduate Studies. He presented a very exciting view of grad school and the potential to become involved to a greater degree in mankind’s quest for knowledge. He encouraged us to engage in that pursuit, even to the point of disagreeing with our advisors and their colleagues. (This seemed promising for me, since I can never seem to keep my mouth shut at lab meetings, colloquiums, and thesis defenses….)

Dr. Sterling’s view of graduate school was idealistic. It contrasts with another common vision of the graduate experience: the realistic. This is the viewpoint of the likes of PHD comics and the satirical essay How to Publish a Scientific Comment in 1 2 3 Easy Steps (which I discovered via Greg Mankiw’s blog). It also seems to be confirmed by the extreme frustration felt by some of my friends in their master’s programs.

I do not accuse Dr. Sterling of any sort of blindness or naiveté when I say that his vision is idealistic. In fact, I like to think that he presented an idealist vision as a sort of counterpoint to the difficulties and even cynicism that often afflict grad students.

2. Opinion Leaders?

When the media announce a new trend in public opinion, I often respond skeptically, asking whether their report is cause or effect. Could data-based analysis determine whether this is just paranoia, or whether there are instances of the media leading rather than merely reporting public opinion (editorial and opinion pages excluded)? Most recently, articles like this one on rising skepticism about the mission in Afghanistan have reminded me of the question.

3. Bathwater

Two retrospectives on the economists’ role in the financial crisis:

The two articles paint eerily similar and yet vitally different pictures. Largely, Eichengreen blames the crisis on selective reading and self-serving interpretation of free market economics. Krugman blames an idealistic romance with the neo-neoclassical economics that arose after Keynesianism faded. Eichengreen suggests that the future holds a prominent place for empirical economics research. Krugman highlights behavioral economics and hopes for a Keynesian renaissance.

Krugman’s paper is well-crafted, but I think Eichengreen’s is a better portrayal of reality. Maybe that’s my free-marketeer self speaking. But I just can’t help thinking there’s a baby sitting in the economic bathwater that people are dumping out their windows these days. The ideas I learned in my economics classes were not empty; they were just idealized. To abandon them wholesale now reminds me of the ideologically motivated cataclysms that Chomsky led linguistics through every decade or so. To put it another way: relativistic physics explained major gaps in the Newtonian model, but it didn’t keep Newtonian physics from being a good-enough description of the world for most purposes. Newton wasn’t wrong so much as incomplete.

But it’s Eichengreen’s focus on empiricism that really wins me over. We live in an age of data: vast—almost incomprehensibly huge—stores of data waiting to be utilized. Actually making use of it is at once one of the greatest challenges and one of the greatest opportunities of our time. (I believed that even before my two weeks in a class about data mining.) These huge amounts of data give us an opportunity to reason inductively more than ever before, whereas past models of reality relied on a small number of unproven fundamental tenets (“axioms”, “theorems”, “laws”) from which a theoretical structure was assembled by means of deductive reasoning. While these deductive systems are very powerful in addition to having much the same elegance as mathematics (an aesthetic appeal not to be underestimated), they build a very large superstructure atop a relatively small foundation. Any cracks in the foundation can threaten the whole system.

In a way, the tension between fact and theory mirrors the idealism/realism contrast mentioned earlier. Humans seem to have a cognitive bias in favor of uniform explanations of phenomena, which gives fuel to idealistic theories. Linguists face a similar crisis of empiricism versus theory; sadly(?), there isn’t likely to be a linguistic analog of global economic catastrophe to shake their academic confidence and encourage a reassessment (Tower of Babel 2: Confoundations?).

4. Why Are Academic Disciplines Polypolistic?

Or rather, when will disciplines rely less upon a small number of arbiters of what is or isn’t “credible scholarship”? Instead of a few important journals, couldn’t much of the discussion occur right here in the blogosphere? Are scholars really so ill-mannered that they can’t carry out their debates in real time before a world audience just like the open source hackers and the Wikipedians? Even the U.S. Congress seems transparent when compared to some of the academic oligarchies.

Had economics been democratized, in a sense, would it have been less susceptible to the sort of groupthink that seemingly got it into trouble? Or would it just have been a different type of groupthink? How do you kill the echo chamber without simply gagging everybody?

Speaking of open scholarly discourse, I now wish to present a(n) hypothesis [indefinite article parenthesized for correctness in certain British Commonwealth nations {hint: it’s not Fiji.}]:

5. A(n) Hypothesis

I hypothesize that music modeling will encounter much less of a data sparseness problem than word-level language modeling. This issue came up in a PhD thesis proposal I attended today, and it made me think: though I agree that music and human language are similar in many ways, music seems more closely analogous to the character-level or phonological properties of language, rather than to its word-level, syntactic properties. In other words, a phoneme trigram model’s entropy will be much closer to a note trigram model’s entropy than to a word trigram model’s entropy. Does that even make sense? And, is it correct?
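For anyone who wants to poke at the hypothesis, here is a toy sketch of the comparison: an empirical trigram-model conditional entropy that can be fed characters, words, or (in principle) notes. The estimate is unsmoothed and purely illustrative, not a serious language-modeling setup.

```python
import math
from collections import Counter

def trigram_entropy(symbols):
    """Empirical conditional entropy H(x_n | x_{n-2}, x_{n-1}) in bits,
    estimated from raw trigram counts of a symbol sequence. Works the
    same whether the symbols are characters, phonemes, words, or notes."""
    tri = Counter(zip(symbols, symbols[1:], symbols[2:]))
    ctx = Counter()  # counts of the two-symbol contexts, from the trigrams
    for (a, b, _), count in tri.items():
        ctx[(a, b)] += count
    n = sum(tri.values())
    h = 0.0
    for (a, b, c), count in tri.items():
        p_joint = count / n            # P(a, b, c)
        p_cond = count / ctx[(a, b)]   # P(c | a, b)
        h -= p_joint * math.log2(p_cond)
    return h
```

A fully predictable sequence (like a repeating two-note pattern) scores 0 bits; the hypothesis amounts to predicting that note sequences will score closer to phoneme sequences than to word sequences on this kind of measure.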

6. Terminus

And so it ends. 10 bonus points if you read this.