Twin blogs

Friday, 24 April 2015

Doing quantitative archaeology with open source software

This short post is written for archaeologists who frequently perform common data analysis and visualisation tasks in Excel, SPSS or similar commercial packages. It was motivated by my recent observations at the Society for American Archaeology meeting in San Francisco - the largest annual meeting of archaeologists in the world - where I noticed that the great majority of archaeologists use Excel and SPSS. I wrote this post to describe why those packages might not be the best choices, and to explain what one good alternative might be. There’s nothing specific to archaeology in here, so this post is likely to be relevant to researchers in the social sciences in general. It’s also cross-posted on the Software Sustainability Institute blog.

Prevailing tools for data analysis and visualization in archaeology have severe limitations

For many archaeologists, the standard tools for any kind of quantitative analysis include Microsoft Excel, SPSS, and for more exotic methods, PAST. While these programs are widely used, they have a few limitations that are obvious to anyone who has worked with them for a long time, and that raise the question of what alternatives are available. Here are three key limitations:
  • File formats: each program has its own proprietary format, and while there is some interoperability between them, we cannot open their files in any program that we wish. And because these formats are controlled by companies rather than a community of researchers, we have no guarantee that the Excel or SPSS file format of today will be readable by any software 10 or 20 years from now. 
  • Click-trails: the main interaction with these programs is by using the mouse to point and click on menus, windows, buttons and so on. These mouse actions are ephemeral and unrecorded, so that many of the choices made during a quantitative analysis in Excel are undocumented. When a researcher wants to retrace the steps of their workflow days, months or years after the original effort, they are dependent on their memory or some external record of many of the choices made in an analysis. This can make it very difficult for another person to understand how an analysis was conducted because many of the details are not recorded. 
  • Black boxes: the algorithms that these programs use for generating results are not available for convenient inspection by the researcher. The programs are a classic black box, where data and settings go in, and a result comes out, as if by magic. For moderately complicated computations, this can make it difficult for the researcher to interpret their results, since they do not have access to all of the details of the computation. This black box design also limits the extent to which the researcher can customise or extend built-in methods to new applications.
How to overcome these limitations?

For a long time archaeologists had few options to deal with these problems because there were few alternative programs. The general alternative to using a point-and-click program is writing scripts to program algorithms for statistical analysis and visualisations. Writing scripts means that the data analysis workflow is documented and preserved, so it can be revisited in the future and distributed to others for them to inspect, reuse or extend. For many years this was only possible using ubiquitous but low-level computer languages such as C or Fortran (or exotic higher level languages such as S), which required a substantial investment of time and effort, and a robust knowledge of computer science. In recent years, however, a convergence of developments has dramatically increased the ease of using a high level programming language, specifically R, to write scripts to do statistical analysis and visualisations. As an open source programming language with special strengths in statistical analysis and visualisations, R has the potential to be a solution to the three problems of using software such as Excel and SPSS. Open source means that all of the code and algorithms that make the program operate are available for inspection and reuse, so that there is nothing hidden from the user about how the program operates (and the user is free to alter their copy of the program in any way they like, for example, to increase computation speed).
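
To make the idea of a scripted workflow concrete, here is a minimal sketch in base R. The data are invented purely for illustration (a hypothetical set of stone flake lengths from two excavation units), and the object and column names are my own assumptions rather than anything from a real project. The point is only that every step, from saving data in a plain text format to summarising and plotting it, is written down and can be re-run.

```r
# Hypothetical data, invented for illustration:
# lengths (mm) of stone flakes from two excavation units.
flakes <- data.frame(
  unit      = c("A", "A", "A", "B", "B", "B"),
  length_mm = c(21.5, 30.2, 25.1, 40.3, 38.7, 35.0)
)

# Store the data as plain text (CSV), a non-proprietary format.
# A temporary file is used here so the sketch runs anywhere.
csv_file <- tempfile(fileext = ".csv")
write.csv(flakes, csv_file, row.names = FALSE)

# Read the data back in, as a script would at the start of an analysis.
flakes <- read.csv(csv_file)

# Mean flake length per unit. This line is the permanent record
# of the analytical choice, instead of an unrecorded menu click.
mean_length <- aggregate(length_mm ~ unit, data = flakes, FUN = mean)
print(mean_length)

# Save a plot to a file from code rather than by clicking a button.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
boxplot(length_mm ~ unit, data = flakes, ylab = "Flake length (mm)")
invisible(dev.off())
```

Anyone with a copy of this script and the CSV file can repeat the analysis exactly, and can see how each number and plot was produced by reading the code.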

Three reasons why R has become easier to use

Although R was first released in 1993, it has only been in the last five years or so that it has really become accessible and a viable option for archaeologists. Until recently, only researchers steeped in computer science and fluent in other programming languages could make effective use of R. Now the barriers to getting started with R are very low, and archaeologists without any background with computers and programming can quickly get to a point where they can do useful work with R. There are three factors that are relevant to the recent increase in the usability of R, and that any new user should take advantage of:
  • the release of an Integrated Development Environment, RStudio, especially for R
  • the shift toward more user-friendly idioms of the language resulting from the prolific contributions of Hadley Wickham, and 
  • the massive growth of an active online community of users and developers from all disciplines.
1. RStudio

For the beginner user of R, the free and open source program RStudio is by far the easiest way to quickly get to the point of doing useful work. First released in 2011, it has numerous conveniences that simplify writing and running code, and handling the output. Before RStudio, an R user had little more than a blinking command line prompt to work with, and might struggle for some time to identify efficient methods for getting data in, running code (especially if more than a few lines) and then getting data and plots out for use in reports, etc. With RStudio, the barriers to doing these things are lowered substantially. The biggest help is having a text editor right next to the R console. The text editor is like a plain text editor (such as Notepad on Windows), but has many features to help with writing code. For example, it is code-aware and automatically colours the text to make it a lot easier to read (functions are one colour, objects another, etc.). The code editor has a comprehensive auto-complete feature that shows suggested options while you type, and gives in-context access to the help documentation. This makes spelling mistakes rare when writing code, which is very helpful. There is a plot pane for viewing visualisations and buttons for saving them in various formats, and a workspace pane for inspecting data objects that you've created. These kinds of features lower the cognitive burden of working with a programming language, and make it easier to be productive with a limited knowledge of the language.

2. The Hadleyverse

A second recent development that makes it easier for a new user to be productive using R is a set of contributed packages affectionately known in the R user community as the Hadleyverse. User contributed packages are add-on modules that extend the functionality of base R. Base R is what you get when you download R from r-project.org, and while it is a complete programming language, the 6000-odd user contributed packages provide ready-made functions for a vast range of data analysis and visualization tasks. Because the large number of packages can make discovering relevant ones challenging, they have been organised into 'task views' that list packages relevant to specific areas of analysis. There is a task view for archaeology, providing an annotated list of R packages useful for archaeological research. Among these user-contributed packages are a set by Hadley Wickham (Chief Scientist at RStudio and adjunct Professor at Rice University) and his collaborators that make plotting better, simplify common data analysis activities, speed up importing data into R (including from Excel and SPSS files), and improve many other common tasks. The overall result is that for many people, programming in R is shifting from the base R idioms to a new set of idioms enabled by Wickham's packages. This is an advantage for the new user of R because writing code with Wickham's packages results in code that is easier for people to read, as well as being highly efficient to compute. This is because these packages simplify many common tasks (so the user doesn't have to specify exotic options if they don't want to), use common English verbs ('filter', 'arrange', etc.), and use pipes. Pipes mean that functions are written one after the other, following the order they would appear in when you explain the code to another person in conversation. 
This is different from the base R idiom, which doesn't have pipes and instead either nests functions inside each other, requiring them to be read from the center (the inside of the nest) out to the left (the outside of the nest), or relies on temporary objects. Both are a counter-intuitive flow for most people new to programming.
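
The difference between the two styles is easiest to see side by side. The following is a small sketch, assuming the dplyr package is installed; the flake data and column names are hypothetical and invented for illustration. To isolate the effect of the pipe, both versions use the same dplyr verbs: the first nests the calls inside each other, and the second chains them with the `%>%` pipe (which comes from the magrittr package and is made available by dplyr).

```r
library(dplyr)  # assumption: the dplyr package is installed

# Hypothetical data, invented for illustration.
flakes <- data.frame(
  unit      = c("A", "A", "A", "B", "B", "B"),
  length_mm = c(21.5, 30.2, 25.1, 40.3, 38.7, 35.0)
)

# Nested style: the first step (filter) is buried in the middle,
# so the code has to be read from the inside outwards.
nested <- arrange(
  summarise(
    group_by(
      filter(flakes, length_mm > 22),
      unit
    ),
    mean_length = mean(length_mm)
  ),
  desc(mean_length)
)

# Piped style: steps appear in the order you would describe them aloud:
# take the flakes, keep those longer than 22 mm, group them by unit,
# compute the mean length, then sort from largest to smallest.
piped <- flakes %>%
  filter(length_mm > 22) %>%
  group_by(unit) %>%
  summarise(mean_length = mean(length_mm)) %>%
  arrange(desc(mean_length))

print(piped)
```

Both versions compute the same table, so the choice between them is about readability rather than results.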

3. Big open online communities of users

A third major factor in the improved accessibility of R to new users is the growth of active online communities of R users. There has long been an email list for R users, but more recently, user communities have formed around websites such as Stackoverflow. Stackoverflow is a free question-and-answer website for programmers using any language. The unique concept is that it gamifies the process of asking and answering questions, so that if you ask a good question (i.e. well-described, includes a small self-contained example of the code that is causing the problem), other users can reward your effort by upvoting your question. High quality questions can attract very quick answers, because of the size of the community active on the site. Similarly, if you post a high-quality answer to someone else's question, other users can recognise this by upvoting your answer. These voting processes make the site very useful even for the casual R user searching for answers (and who may not care for voting), because they can identify the high-quality answers by the number of votes they've received. It's often the case that if you copy and paste an error message from the R console into the Google search box, the first few results will be Q&A pages on Stackoverflow. This is a very different experience compared to using the r-help email list, where help can come slowly, if at all, and to searching the email list archives, where it's not always clear which is the best solution. Other useful outputs from the online community of R users are blogs that document how to conduct various analyses or produce visualizations (some 500 blogs are aggregated at http://www.r-bloggers.com/). The key advantage of Stackoverflow and blogs, aside from their free availability, is that they very frequently include enough code for the casual user to reproduce the described results. 
They are like a method exchange, where you can collect a method in the form of someone else's code, and adapt it to suit your own research workflow.

There's no obvious single explanation for the growth of this online community of R users. Contributing factors might include a shift from SAS (a commercial product with licensing fees) to R as the software used to teach students in many academic departments, due to the Global Financial Crisis of 2008 that forced budget reductions at many universities. This led to a greater proportion of recent generations of graduates being R users. The flexibility of R as a data analysis tool, combined with the rise of data science as an attractive career path and the demand for data mining skills in the private sector, may also have contributed to the convergence of people who are active online and who are also R users, since so many of the user contributed packages are focused on statistical analyses.

So What?

The prevailing programs used for statistical analyses in archaeology have severe limitations resulting from their corporate origins (proprietary file formats, uninspectable algorithms) and mouse-driven interfaces (impeding reproducibility). The generic solution is an open source programming language with tools for handling diverse file types and a wide range of statistical and visualization functions. In recent years R has become a very prominent and widely used language that fulfills these criteria. Here I have briefly described three recent developments that have made R highly accessible to the new user, in the hope that archaeologists who are not yet using it might adopt it as a more flexible and useful program for data analysis and visualization than their current tools. Of course it is quite likely that the popularity of R will rise and fall like that of many other programming languages, and ten years from now the fashionable choice may be Julia or something that hasn't even been invented yet. However, the general principle that scripted analyses using an open source language are better for archaeologists, and science generally, will remain true regardless of the details of the specific language.
