Category: text mining

  • Prof. Mark Davies’ ouster at BYU

    Mark Davies was a professor of linguistics at BYU who created tools for analyzing large collections of text, a method known as corpus analysis. He ran a website, corpus.byu.edu, where these text collections were available for anyone to use.

    The site, now at english-corpora.org, and Dr. Davies’ personal website both describe administrative mismanagement that led to his departure from BYU in 2020, taking the corpus project and website with him. See here and here.

    I did my B.A. in linguistics at BYU, and remember watching with interest as the corpus page developed. It’s disheartening to see that such a valuable academic resource doesn’t have a permanent home at BYU.

    Removing the name-naming text and links, I will quote thus:

    This permanent loss of funding support was a punitive action … after Mark informed the university of serious “financial malfeasance” by the College of Humanities regarding income from the English corpora. Subsequently, administrators at BYU refused to help resolve the issue, which is part of a culture of ignoring whistleblowers and “closing ranks” and promoting “yes men” at BYU.

    It is not overly surprising that BYU would pay such little attention to academic productivity, since the primary mission of BYU is religious in nature, rather than academic. In certain respects, BYU is more like a religious seminary than an actual university. As a result, some people at BYU don’t really understand how to support and protect projects that have real academic importance and significance.

    Of course, there are two (or more) sides to every story. Part of why I removed the links to the specific callouts is that I have no way of verifying them beyond Dr. Davies’ own words.

    But the critique leveled against the university rings true for me. I’ve long since come to feel that BYU did me a disservice by shielding me from critical information about the LDS church, which was not at that time covered in any of the many religion classes I took, or any other class for that matter. It seems unconscionable to have so many professors of such high qualification, and none mention any potential issues with the church they represent, except obliquely, after hours.

    “The glory of God is intelligence; or, in other words, light and truth” – it was all over campus. But the glaring exception is casting light on, and discussing the truth about, the church itself.

    EDIT: The word “ousted” may be too strong – Dr. Davies chose to retire, but the withdrawal of funding was strong pressure on him to do so.

    NOTE: I originally posted this on Reddit; check the discussion there, which includes a response by Dr. Davies that also appears on his blog.

  • Visualizing Texts as Networks

    Yes, Heman is in the Bible!

    For a recent project in my text mining class we were required to explore ways of visualizing textual data. This is a tricky problem due to the nature of text, mainly because it is both categorical and sparse. Visualizing numerical data is, relatively speaking, a piece of cake, because real numbers can all be related to each other rather easily. But categorical data doesn’t generally have it that easy. Even if you can order the data, such as putting words in alphabetical order, it doesn’t always give you much insight into the text that the words comprise. After all, we tend to care more about what the text says than about what letters the words it uses start with.

    My work was inspired by the cool stuff at Visual Complexity, and heavily influenced by Chris Harrison’s “Visualizing the Bible” page. For one of his visualizations, Chris Harrison extracted a “social network” of sorts from the biblical text. The more often a name occurs, the bigger the name and its dot are on the chart. If two names occur in the same verse, they are considered to be somehow connected and a line is drawn between them. The result is pretty cool, so I decided to apply the same general approach to the leaked emails from the University of East Anglia’s Climatic Research Unit.

    I grabbed the emails from somewhere-or-other on the Internet (a torrent, maybe? The filename is cru-foia.tbz2) and set to parsing them. Unfortunately, whereas the names of the Bible have been conveniently compiled, nobody has made a list of names in the CRU emails, as far as I know. So I had to resort to automatic means. To that end, I downloaded Lev Ratinov’s LBJ NER tagger from UIUC CS and used it to identify the names in each email. If two names showed up in the same email, a connection was made between them. The more times they co-occurred, the stronger the link. Once all of the names and their connections were counted, they were dumped to a file in the GEXF graph interchange format.
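    For the curious, the counting-and-dumping step might look roughly like the sketch below. This is not the code I actually ran; it is a minimal standard-library illustration. It assumes the NER step has already produced one list of names per email, that a name counts once per email, and that the output targets the GEXF 1.2 draft schema. The function names `build_cooccurrence` and `to_gexf` and the `count` node attribute are made up for this example.

```python
from collections import Counter
from itertools import combinations
import xml.etree.ElementTree as ET


def build_cooccurrence(docs_names):
    """Count names and name pairs across documents.

    docs_names: iterable of lists of names, one list per email
    (assumed to come from an NER tagger run beforehand).
    """
    name_counts = Counter()
    edge_counts = Counter()
    for names in docs_names:
        # Assumption: a name counts once per email, however often it repeats.
        unique = sorted(set(names))
        name_counts.update(unique)
        # Every unordered pair of names in the same email gets one more link.
        edge_counts.update(combinations(unique, 2))
    return name_counts, edge_counts


def to_gexf(name_counts, edge_counts):
    """Serialize the counts as a GEXF string (1.2 draft schema assumed)."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", mode="static", defaultedgetype="undirected")
    # Declare a hypothetical integer node attribute holding the occurrence count.
    attrs = ET.SubElement(graph, "attributes", {"class": "node"})
    ET.SubElement(attrs, "attribute", id="0", title="count", type="integer")
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")

    ids = {}
    for i, (name, count) in enumerate(sorted(name_counts.items())):
        ids[name] = str(i)
        node = ET.SubElement(nodes, "node", id=str(i), label=name)
        values = ET.SubElement(node, "attvalues")
        ET.SubElement(values, "attvalue", {"for": "0", "value": str(count)})

    for j, ((a, b), weight) in enumerate(sorted(edge_counts.items())):
        # Edge weight is the number of emails in which both names appear.
        ET.SubElement(edges, "edge", id=str(j), source=ids[a],
                      target=ids[b], weight=str(weight))
    return ET.tostring(gexf, encoding="unicode")


if __name__ == "__main__":
    # Placeholder names, not taken from the actual emails.
    emails = [["Alice", "Bob"], ["Bob", "Alice", "Carol", "Alice"], ["Carol"]]
    counts, links = build_cooccurrence(emails)
    print(to_gexf(counts, links))
```

    Writing the resulting string to a `.gexf` file should give something Gephi can open, with edge weights reflecting how often two names co-occur.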

    To actually produce a visualization, I used Gephi, an up-and-coming graph visualization tool that allows real-time, interactive visualization of complex graphs. I also worked with a couple of other datasets: a “scriptures” dataset, including the Bible, Apocrypha, Book of Mormon, and Pearl of Great Price; and a dataset covering LDS General Conference addresses from the 1880s to about 1950. The General Conference data was processed using top non-stopwords (that is, words other than things like ‘the’, ‘to’, ‘with’, etc.) rather than names, which gives less interesting results.
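    As a rough illustration of the non-stopword approach, here is a small sketch of picking the most frequent non-stopwords from a text. The tiny stopword list, the tokenizing regex, and the function name `top_nonstopwords` are all assumptions made for this example, not what I used on the conference data.

```python
from collections import Counter
import re

# A deliberately tiny, illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "to", "with", "a", "an", "and", "of", "in", "is", "that"}


def top_nonstopwords(text, k=5, stopwords=STOPWORDS):
    """Return the k most frequent words in text that are not stopwords."""
    # Lowercase and split on anything that is not a letter or apostrophe.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return [word for word, _ in counts.most_common(k)]
```

    The words returned for each document can then stand in for names when building the network, with two words linked whenever they show up in the same document.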

    The results with Gephi can be quite striking, as with the “Heman” image at the beginning of this post. Here’s a video of using Gephi to work with the CRU data:

    http://vimeo.com/10918828

    Visualizing Texts as Networks from gephi on Vimeo.

    Downloads