Blog

  • Visualizing Texts as Networks

    Yes, Heman is in the Bible!

    For a recent project in my text mining class we were required to explore ways of visualizing textual data. This is a tricky problem because text is both categorical and sparse. Visualizing numerical data is, relatively speaking, a piece of cake, because real numbers can all be related to each other rather easily. Categorical data doesn’t generally have it that easy. Even if you can order the data, such as by putting words in alphabetical order, the ordering doesn’t always give you much insight into the text that the words make up. After all, we tend to care more about what the text says than about which letters its words start with.

    My work was inspired by the cool stuff at Visual Complexity, and heavily influenced by Chris Harrison’s “Visualizing the Bible” page. For one of his visualizations, Chris Harrison extracted a “social network” of sorts from the biblical text. The more often a name occurs, the bigger it and its dot are on the chart. If two names occur in the same verse, they are considered to be somehow connected and a line is drawn between them. The result is pretty cool, so I decided to apply the same general approach to the leaked emails from the University of East Anglia’s Climatic Research Unit.

    I grabbed the emails from somewhere-or-other on the Internet (a torrent, maybe? The filename is cru-foia.tbz2) and set to parsing them. Unfortunately, whereas the names of the Bible have been conveniently compiled, nobody has made a list of names in the CRU emails, as far as I know. So I had to resort to automatic means. To that end, I downloaded Lev Ratinov’s LBJ NER tagger from UIUC CS and used it to identify the names in each email. If two names showed up in the same email, a connection was made between them. The more times they co-occurred, the stronger the link. Once all of the names and their connections were counted, they were dumped to a file in the GEXF graph interchange format.
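    The counting and export steps described above can be sketched roughly as follows. This is a minimal illustration, not the code I actually ran: it assumes the NER tagger has already produced a list of names for each email, the function names (build_cooccurrence, to_gexf) and the sample names are hypothetical, and the GEXF output is a bare-bones 1.2 document that stores the occurrence count as a node attribute and the co-occurrence count as the edge weight.

```python
from collections import Counter
from itertools import combinations
import xml.etree.ElementTree as ET


def build_cooccurrence(names_per_email):
    """Count name occurrences and pairwise co-occurrences across emails."""
    node_counts = Counter()
    edge_counts = Counter()
    for names in names_per_email:
        unique = sorted(set(names))  # count each name at most once per email
        node_counts.update(unique)
        # Sorting first means each undirected pair always appears as (a, b) with a < b.
        edge_counts.update(combinations(unique, 2))
    return node_counts, edge_counts


def to_gexf(node_counts, edge_counts):
    """Serialize the counts as a minimal GEXF 1.2 document and return it as a string."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", mode="static", defaultedgetype="undirected")
    attrs = ET.SubElement(graph, "attributes", {"class": "node"})
    ET.SubElement(attrs, "attribute", id="0", title="occurrences", type="integer")

    nodes = ET.SubElement(graph, "nodes")
    ids = {}
    for i, (name, count) in enumerate(sorted(node_counts.items())):
        ids[name] = str(i)
        node = ET.SubElement(nodes, "node", id=str(i), label=name)
        attvalues = ET.SubElement(node, "attvalues")
        ET.SubElement(attvalues, "attvalue", {"for": "0", "value": str(count)})

    edges = ET.SubElement(graph, "edges")
    for i, ((a, b), weight) in enumerate(sorted(edge_counts.items())):
        ET.SubElement(edges, "edge", id=str(i), source=ids[a],
                      target=ids[b], weight=str(weight))
    return ET.tostring(gexf, encoding="unicode")


# Hypothetical tagger output: one list of names per email.
emails = [["Alice", "Bob"], ["Alice", "Bob", "Carol"], ["Carol"]]
node_counts, edge_counts = build_cooccurrence(emails)
gexf_text = to_gexf(node_counts, edge_counts)
```

    Writing gexf_text to a file with a .gexf extension should give something a graph tool can open, with the occurrence attribute available for sizing nodes and the edge weight for link strength.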

    To actually produce a visualization, I used Gephi, an up-and-coming graph visualization tool that allows real-time, interactive visualization of complex graphs. I also worked with a couple of other datasets: a “scriptures” dataset, including the Bible, Apocrypha, Book of Mormon, and Pearl of Great Price; and a dataset covering LDS General Conference addresses from the 1880s to about 1950. The General Conference data was processed using the top non-stopwords (words that aren’t things like ‘the’, ‘to’, ‘with’, etc.) rather than names, which gives less interesting results.
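    For the top-non-stopwords variant, the idea is roughly the following. This is only a sketch: the stopword list here is a tiny illustrative stand-in, and the function name top_terms and the cutoff k are my own placeholders rather than anything from the actual processing. The resulting term lists could then stand in for the name lists when counting co-occurrences.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real list would be much longer).
STOPWORDS = {"the", "to", "with", "a", "an", "and", "of", "in", "is", "it"}


def top_terms(text, k=5):
    """Return the k most frequent words in text that are not stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]


sample = "The faith of the people and the faith of the church"
terms = top_terms(sample, k=2)
```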

    The results with Gephi can be quite striking, as with the “Heman” image at the beginning of this post. Here’s a video of using Gephi to work with the CRU data:

    http://vimeo.com/10918828

    Visualizing Texts as Networks from gephi on Vimeo.


  • If You’re A Member of Congress, Vote ‘No’ on Health Care Reform!

    To quote one of my favorite economists, Gary Becker:

    Despite the long debate, many provisions of both the House and Senate bills remain highly controversial. These include, among many others, the way the uninsured would get coverage, the de-emphasis on health savings accounts, the postponement until 2018 of the elimination of the tax advantages from expensive employer-based health plans, no increase in the ability of persons and companies in one state to contract with insurance companies located in other states, and especially the minor efforts to raise out of pocket expenses by consumers of health care in order to reduce their overuse of doctors, drugs, and even hospitals. Such a badly designed health care bill would on the whole worsen rather than improve the American health care system…. [link]

    And, from his fellow-blogger, Richard Posner:

    Because [the health care reform bill currently in congress] is unpopular among the general public, its enactment by a simple majority in both Houses would raise a valid question about the representative character of Congress…. [T]he health care program has been kicking around in Congress for a year, and the inability of its supporters to convince the public of the program’s wisdom, coupled with the program’s enormous cost and its potentially disruptive consequences for the health care industry…and indeed the entire economy, may make people question the democratic legitimacy of enacting the program with just a simple majority in the House and Senate. [link]

    I oppose the health care reform bill because our government’s fiscal health is already in dire shape, and this bill will turn a dangerous fiscal disease into a terminal one. The Democrats’ push to pass this bill is a face-saving sellout of the welfare of the American people who, as I understand it, largely oppose the legislation. If you are a member of Congress, or perhaps an intern or electronic aggregator doing the bidding of one, be aware that this man wants you to vote ‘no’! Think of the future, not just this fall’s elections.

  • “No Man Is A Failure Who Has Friends”

    So wrote George Bailey’s guardian angel, and as I was flooded by your kind thoughts and congratulations today I felt rather like George Bailey being saved by those who loved him. Though I may not be who I wish I was, I know I’m not a failure because I have the kindest, most thoughtful family and friends anybody could hope for. Thank you for your love.