
Wednesday, 7 September 2016

rpostgis and RQGIS: two useful statistical tools for archaeologists


Hi all.
Just a short post to announce the recent release of two R packages: rpostgis (see also here) and RQGIS (see also here and here). The former facilitates the transfer of PostGIS "Geometry" objects (stored in PostgreSQL databases) into R spatial objects; the latter establishes an interface between R and QGIS, allowing the user to access the many QGIS geoalgorithms from within R.
I tested them briefly and I think they are very useful tools for performing and simplifying statistical and geo-statistical analyses in archaeological contexts. Here I present a quick usage example.
Firstly, I imported a set of archaeological site points stored in a PostgreSQL/PostGIS database. This is very simple with the rpostgis package: it is enough to create a database connection (as in the RPostgreSQL package) and call the "pgGetGeom" function.
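The import step can be sketched as follows; note that the database, schema, table and column names below are hypothetical placeholders, so treat this as a sketch rather than a tested recipe:

```r
# Hypothetical sketch: wrap the connection + import in a function so the
# placeholder names ("archaeo_db", "public.sites", "geom") can be swapped
# for real ones. Requires the RPostgreSQL and rpostgis packages and a
# running PostgreSQL/PostGIS server.
load_sites <- function(dbname = "archaeo_db", table = "sites") {
  conn <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                         dbname = dbname, host = "localhost")
  on.exit(DBI::dbDisconnect(conn))
  # pgGetGeom() converts the PostGIS "Geometry" column into an R spatial
  # object (a SpatialPointsDataFrame for a point layer)
  rpostgis::pgGetGeom(conn, name = c("public", table), geom = "geom")
}

# sites <- load_sites()   # then plot(sites), etc.
```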

Then I used the RQGIS package to run (within R) the QGIS geoalgorithm that builds a polygon from a layer's extent. After setting the same parameters that QGIS requires, the "run_qgis" function draws a red polygon around the outermost points of my dataset.
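In code this looks roughly like the sketch below; the algorithm id and parameter names follow QGIS 2.x processing conventions and should be verified with RQGIS::find_algorithms() and RQGIS::get_usage() on your own installation:

```r
# Hypothetical sketch: the parameter names (INPUT_LAYER, BY_FEATURE,
# OUTPUT) are assumptions based on QGIS 2.x; check them with
# RQGIS::get_usage("qgis:polygonfromlayerextent") before running.
extent_polygon <- function(sites, out_file = "extent.shp") {
  RQGIS::set_env()  # locate the local QGIS installation
  RQGIS::run_qgis(alg = "qgis:polygonfromlayerextent",
                  params = list(INPUT_LAYER = sites,
                                BY_FEATURE  = FALSE,
                                OUTPUT      = out_file),
                  load_output = out_file)
}

# poly <- extent_polygon(sites)
# plot(poly, border = "red", add = TRUE)
```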


One caveat: we must pay attention to the version of QGIS we are using. With 2.14 there is no problem, but if you are using 2.16.1 or 2.16.2 (like me) you must modify the QGIS file "AlgorithmExecutor.py" (the path for Linux users should be "/usr/share/qgis/python/plugins/processing/AlgorithmExecutor.py") as described on the web page. This problem should be corrected by the QGIS core team in the near future.
Finally, I performed a specific point pattern analysis on the data imported and created by the two packages: in this example I calculated Ripley's K function (for an archaeological example see here) in order to identify the distribution model (random, regular or clustered) of my archaeological sites.
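A minimal sketch of this step with the spatstat package (the post does not name the package used, but spatstat is the standard R tool for point pattern analysis); the coordinates below are simulated stand-ins for the imported site points:

```r
library(spatstat)  # provides ppp() and Kest()

set.seed(42)
# 50 simulated "sites" in a 1000 x 1000 m study window
sites_ppp <- ppp(runif(50, 0, 1000), runif(50, 0, 1000),
                 window = owin(c(0, 1000), c(0, 1000)))

K <- Kest(sites_ppp)  # edge-corrected estimates of Ripley's K(r)
plot(K, main = "Ripley's K for site points")
# an empirical curve above the theoretical Poisson line suggests
# clustering; below it, a regular (dispersed) pattern
```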


In my opinion these two new R packages make traditional spatial analyses in R easier and faster, and facilitate a more virtuous integration between GIS, geo-databases and statistics.
Bye.

Denis Francisci

Friday, 24 April 2015

Doing quantitative archaeology with open source software

This short post is written for archaeologists who frequently perform common data analysis and visualisation tasks in Excel, SPSS or similar commercial packages. It was motivated by my recent observations at the Society of American Archaeology meeting in San Francisco - the largest annual meeting of archaeologists in the world - where I noticed that the great majority of archaeologists use Excel and SPSS. I wrote this post to describe why those packages might not be the best choices, and to explain what one good alternative might be. There's nothing specifically about archaeology in here, so this post is likely to be relevant to researchers in the social sciences in general. It's also cross-posted on the Software Sustainability Institute blog.

Prevailing tools for data analysis and visualization in archaeology have severe limitations

For many archaeologists, the standard tools for any kind of quantitative analysis include Microsoft Excel, SPSS, and for more exotic methods, PAST. While these programs are widely used, they have a few limitations that are obvious to anyone who has worked with them for a long time, and that raise the question of what alternatives are available. Here are three key limitations:
  • File formats: each program has its own proprietary format, and while there is some interoperability between them, we cannot open their files in any program that we wish. And because these formats are controlled by companies rather than a community of researchers, we have no guarantee that the Excel or SPSS file format of today will be readable by any software 10 or 20 years from now. 
  • Click-trails: the main interaction with these programs is using the mouse to point and click on menus, windows, buttons and so on. These mouse actions are ephemeral and unrecorded, so that many of the choices made during a quantitative analysis in Excel are undocumented. When a researcher wants to retrace the steps of their workflow days, months or years after the original effort, they are dependent on their memory or some external record of many of the choices made in an analysis. This can make it very difficult for another person to understand how an analysis was conducted, because many of the details are not recorded. 
  • Black boxes: the algorithms that these programs use for generating results are not available for convenient inspection by the researcher. The programs are a classic black box, where data and settings go in, and a result comes out, as if by magic. For moderately complicated computations, this can make it difficult for the researcher to interpret their results, since they do not have access to all of the details of the computation. This black box design also limits the extent to which the researcher can customise or extend built-in methods to new applications.
How to overcome these limitations?

For a long time archaeologists had few options to deal with these problems because there were few alternative programs. The general alternative to using a point-and-click program is writing scripts to program algorithms for statistical analysis and visualisations. Writing scripts means that the data analysis workflow is documented and preserved, so it can be revisited in the future and distributed to others for them to inspect, reuse or extend. For many years this was only possible using ubiquitous but low-level computer languages such as C or Fortran (or exotic higher level languages such as S), which required a substantial investment of time and effort, and a robust knowledge of computer science. In recent years, however, there has been a convergence of developments that have dramatically increased the ease of using a high level programming language, specifically R, to write scripts to do statistical analysis and visualisations. As an open source programming language with special strengths in statistical analysis and visualisations, R has the potential to be a solution to the three problems of using software such as Excel and SPSS. Open source means that all of the code and algorithms that make the program operate are available for inspection and reuse, so that there is nothing hidden from the user about how the program operates (and the user is free to alter their copy of the program in any way they like, for example, to increase computation speed).

Three reasons why R has become easier to use

Although R was first released in 1993, it has only been in the last five years or so that it has really become accessible and a viable option for archaeologists. Until recently, only researchers steeped in computer science and fluent in other programming languages could make effective use of R. Now the barriers to getting started with R are very low, and archaeologists without any background with computers and programming can quickly get to a point where they can do useful work with R. There are three factors that are relevant to the recent increase in the usability of R, and that any new user should take advantage of:
  • the release of an Integrated Development Environment, RStudio, especially for R
  • the shift toward more user-friendly idioms of the language resulting from the prolific contributions of Hadley Wickham, and 
  • the massive growth of an active online community of users and developers from all disciplines.
1. RStudio

For the beginner user of R, the free and open source program RStudio is by far the easiest way to quickly get to the point of doing useful work. First released in 2011, it has numerous conveniences that simplify writing and running code, and handling the output. Before RStudio, an R user had little more than a blinking command line prompt to work with, and might struggle for some time to identify efficient methods for getting data in, running code (especially if more than a few lines) and then getting data and plots out for use in reports, etc. With RStudio, the barriers to doing these things are lowered substantially. The biggest help is having a text editor right next to the R console. The text editor is like a plain text editor (such as Notepad on Windows), but has many features to help with writing code. For example, it is code-aware and automatically colours the text to make it a lot easier to read (functions are one colour, objects another, etc.). The code editor has a comprehensive auto-complete feature that shows suggested options while you type, and gives in-context access to the help documentation. This makes spelling mistakes rare when writing code, which is very helpful. There is a plot pane for viewing visualisations and buttons for saving them in various formats, and a workspace pane for inspecting data objects that you've created. These kinds of features lower the cognitive burden of working with a programming language, and make it easier to be productive with a limited knowledge of the language.

2. The Hadleyverse

A second recent development that makes it easier for a new user to be productive using R is a set of contributed packages affectionately known in the R user community as the Hadleyverse. User contributed packages are add-on modules that extend the functionality of base R. Base R is what you get when you download R from r-project.org, and while it is a complete programming language, the 6000-odd user contributed packages provide ready-made functions for a vast range of data analysis and visualization tasks. Because the large number of packages can make discovering relevant ones challenging, they have been organised into 'task views' that list packages relevant to specific areas of analysis. There is a task view for archaeology, providing an annotated list of R packages useful for archaeological research. Among these user-contributed packages are a set by Hadley Wickham (Chief Scientist at RStudio and adjunct Professor at Rice University) and his collaborators that make plotting better, simplify common data analysis activities, speed up importing data into R (including from Excel and SPSS files), and improve many other common tasks. The overall result is that for many people, programming in R is shifting from the base R idioms to a new set of idioms enabled by Wickham's packages. This is an advantage for the new user of R because writing code with Wickham's packages results in code that is easier for people to read, as well as being highly efficient to compute. This is because it simplifies many common tasks (so the user doesn't have to specify exotic options if they don't want to), uses common English verbs ('filter', 'arrange', etc.), and uses pipes. Pipes mean that functions are written one after the other, following the order they would appear in when you explain the code to another person in conversation. 
This is different from the base R idiom, which doesn't have pipes and instead nests functions inside each other, requiring them to be read from the centre (the inside of the nest) outwards, or relies on temporary objects; a counter-intuitive flow for most people new to programming.
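A toy contrast (assuming the dplyr package, which re-exports the magrittr pipe, and a made-up table of finds): both versions keep finds heavier than 10 g and sort them by weight, but the piped version reads in the order you would explain it:

```r
library(dplyr)  # assumed installed; re-exports the %>% pipe

# made-up data: five finds with weights in grams
finds <- data.frame(id = 1:5, weight = c(4, 12, 25, 8, 17))

# nested idiom: read from the inside out (filter first, then arrange)
heavy_nested <- arrange(filter(finds, weight > 10), weight)

# pipe idiom: filter, then arrange, read left to right
heavy_piped <- finds %>%
  filter(weight > 10) %>%
  arrange(weight)

identical(heavy_nested, heavy_piped)  # TRUE
```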

3. Big open online communities of users

A third major factor in the improved accessibility of R to new users is the growth of active online communities of R users. There has long been an email list for R users, but more recently, user communities have formed around websites such as Stackoverflow. Stackoverflow is a free question-and-answer website for programmers using any language. The unique concept is that it gamifies the process of asking and answering questions, so that if you ask a good question (i.e. well-described, includes a small self-contained example of the code that is causing the problem), other users can reward your effort by upvoting your question. High quality questions can attract very quick answers, because of the size of the community active on the site. Similarly, if you post a high-quality answer to someone else's question, other users can recognise this by upvoting your answer. These voting processes make the site very useful even for the casual R user searching for answers (and who may not care for voting), because they can identify the high-quality answers by the number of votes they've received. It's often the case that if you copy and paste an error message from the R console into the google search box, the first few results will be Q&A pages on Stackoverflow. This is a very different experience compared to using the r-help email list, where help can come slowly, if at all, and searching the email list, where it's not always clear which is the best solution. Another useful output from the online community of R users is the collection of blogs that document how to conduct various analyses or produce visualizations (some 500 blogs are aggregated at http://www.r-bloggers.com/). The key advantage of Stackoverflow and blogs, aside from their free availability, is that they very frequently include enough code for the casual user to reproduce the described results. 
They are like a method exchange, where you can collect a method in the form of someone else's code, and adapt it to suit your own research workflow.

There's no obvious single explanation for the growth of this online community of R users. Contributing factors might include a shift from SAS (a commercial product with licensing fees) to R as the software used to teach students in many academic departments, following the Global Financial Crisis of 2008, which forced budget reductions at many universities. This led to a greater proportion of recent generations of graduates being R users. The flexibility of R as a data analysis tool, combined with the rise of data science as an attractive career path, and demand for data mining skills in the private sector, may also have contributed to the convergence of people who are active online and are also R users, since so many of the user contributed packages are focused on statistical analyses.

So What?

The prevailing programs used for statistical analyses in archaeology have severe limitations resulting from their corporate origins (proprietary file formats, uninspectable algorithms) and mouse-driven interfaces (impeding reproducibility). The generic solution is an open source programming language with tools for handling diverse file types and a wide range of statistical and visualization functions. In recent years R has become a very prominent and widely used language that fulfills these criteria. Here I have briefly described three recent developments that have made R highly accessible to the new user, in the hope that archaeologists who are not yet using it might adopt it as a more flexible and useful program for data analysis and visualization than their current tools. Of course it is quite likely that the popularity of R will rise and fall like many other programming languages, and ten years from now the fashionable choice may be Julia or something that hasn't even been invented yet. However, the general principle that scripted analyses using an open source language are better for archaeologists, and science generally, will remain true regardless of the details of the specific language.

Wednesday, 6 February 2013

Financial Candlestick chart for archaeological purposes: preliminary tests

A candlestick chart is a plot used primarily in finance to describe the price movements of a quoted stock, derivative or currency over time. It is a combination of a line chart and a bar chart:

Chart made with the R package "Quantmod"

the wick (i.e. the one or two lines coming out of the polygon) shows the highest and lowest traded prices during the time interval represented; the body (i.e. the polygon) shows the opening and closing prices. Candlestick charts may look similar to box plots, but they are totally different (source: Wikipedia, 05/02/2013).
I had often seen this kind of chart in newspapers or on TV, but only now have I taken the time to understand how it works; and so I had the "crazy" idea of applying candlestick charts to archaeological data.
More specifically, I thought about archaeological finds. For many of them - in particular for ceramic types - we know a starting date, i.e. the period in which a specific production begins; a time range of maximum diffusion, defined by an initial moment and a final one; and an end date after which there are no more traces of our object.
If we replace the 4 financial values of highest, lowest, opening and closing price with these 4 chronological values (starting date, initial and final moments of the maximum-diffusion range, end date), we can profitably use candlestick plots to describe the life span of each archaeological material found in a stratigraphic unit (US); the goal is to date the US itself by comparing the candlesticks of all the materials contained therein.

In R, candlestick charts are provided by the Quantmod package, the Quantitative Financial Modelling & Trading Framework for R (http://cran.r-project.org/web/packages/quantmod/index.html). But this package is very specific to financial purposes and requires particular data types such as time series (xts): so I put aside the idea of using Quantmod and tried to build a new R function for plotting candlesticks with non-financial data.
This is a first (simplified) example of my preliminary tests (for which I have to thank R-help mailing list: https://stat.ethz.ch/mailman/listinfo/r-help).

I built a table like this:

find, min, int_1, int_2, max
find_a, 250, 300, 400, 550
find_b, 200, 350, 400, 450
find_c, 350, 400, 450, 500
find_d, 250, 350, 450, 500
find_e, 200, 400, 500, 600

For each archaeological object (find_a, find_b, find_c, …) the table gives the starting date (min), the initial and final moments of the maximum-presence range (int_1, int_2) and the end date (max), all in approximate years.
I plotted this data frame in R using the "with()" function, which evaluates an expression in an environment built from the data. Here is the source code:

US1 <- read.table("../example.txt", header = TRUE, sep = ",")

# one box-and-whisker glyph per find: box = maximum-diffusion range
# (int_1..int_2), whiskers = starting (min) and end (max) dates
with(US1, symbols(seq_along(find), (int_1 + int_2) / 2,
                  boxplots = cbind(.4, int_2 - int_1, int_1 - min, max - int_2, 0),
                  inches = FALSE, ylim = range(US1[, -1]), xaxt = "n",
                  ylab = "Years (AD)", xlab = "Findings",
                  main = "Findings chronological distribution of US 1",
                  fg = "brown", bg = "orange", col = "brown"))
axis(1, seq_along(US1$find), labels = US1$find)

and here is the result:


Analyzing this plot, it is possible to deduce that layer US1 probably dates back to the first half of the 5th century AD; the materials find_a and find_b could be residual from earlier periods.


As I said, this is just a simple example, but the potential is clear. This method makes it possible to plot the duration of archaeological materials and to compare the datings of objects found in stratigraphic units in order to assign them a chronology. The statistical environment could provide other advantages, such as probabilistic analyses, confidence intervals, etc., giving mathematical-statistical support to the usual (and often subjective) dating of archaeological layers.
The next steps will be building a specific R function for "archaeological" candle plots, starting from the simple code written above, and testing other statistical techniques for plotting the duration of archaeological finds, such as seriation, box plots, etc.

Any suggestions, websites, literature and bibliographic references on this topic, as well as advice on R packages other than Quantmod that provide candlestick charts for non-financial data, are welcome.

by Denis Francisci

Tuesday, 22 January 2013

manageR, a useful plugin for QGIS

manageR is a QGIS plugin providing a simple and useful interface to the R statistical programming environment (http://www.r-project.org/). It was created by Carson J. Q. Farmer (http://www.ftools.ca/manageR) and can be downloaded from this repository: http://www.ftools.ca/cfarmerQgisRepo.xml. To install it in QGIS, it is enough to add this repository in the QGIS Python Plugin Installer (Plugins → Fetch Python Plugins).

One of the most interesting things is that you can take data directly from the .dbf table of a shapefile layer loaded in QGIS and process it in the R environment. Usually, when I work with PostgreSQL/PostGIS or SQLite/SpatiaLite to manage the attribute tables of vector layers, I connect the database directly to R using the RODBC or RSQLite packages. But if I have to use shapefiles and their .dbf tables, manageR could be a good solution, especially for fast and simple work.
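The direct-database route mentioned above can be sketched with RSQLite and an in-memory database (the table and column names here are made up for illustration):

```r
library(DBI)
library(RSQLite)

conn <- dbConnect(SQLite(), ":memory:")  # throwaway in-memory database
# made-up attribute table standing in for a real SpatiaLite layer
dbWriteTable(conn, "sites",
             data.frame(id = 1:3, height = c(210, 480, 955)))
# attribute data can then be pulled straight into R with SQL
high_sites <- dbGetQuery(conn, "SELECT * FROM sites WHERE height > 300")
dbDisconnect(conn)
```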

Here, I would like to present a small example of the plugin's use. In QGIS I created a distribution map of Roman funerary sites in the Trentino-Alto Adige region (Northern Italy). The sites (blue dots) are registered in a simple shapefile, and every point is associated with a record stored in a .dbf table. As usual, the .dbf table is divided into several columns, each of which contains different attributes of the sites (ID, coordinates, height, date, etc.).


I need to plot a histogram of heights above sea level to get an immediate view of the site distribution by height. I can launch manageR from QGIS.


At first sight, manageR is a simple GUI that includes an R command line, some toolbars for managing data, graphic devices, history, etc., and several buttons for running some of the most common statistical analyses.
As I said, in manageR I can import layer attributes with the button "Action → Import Layers Attribute" (or CTRL+T) and then select the column I need (in my case, "height") using the R language.


Typing in the R command line or using the "Analysis" button in the main toolbar, I can select and launch the statistical function I need and plot the diagram; in my example I plotted a histogram of the heights a.s.l. of my funerary sites.
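The same histogram can be reproduced outside manageR with one line of base R; the heights below are simulated stand-ins for the "height" column of the .dbf table:

```r
set.seed(1)
# simulated heights (m a.s.l.) standing in for the real attribute column
height <- round(rnorm(80, mean = 600, sd = 150))

hist(height,
     main = "Funerary sites by height a.s.l.",
     xlab = "Height (m a.s.l.)", col = "grey")
```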



This is a simple example, but the manageR plugin could be a very useful tool for archaeologists, even for more complex work. Its main advantage is that it works directly with the .dbf table, avoiding the need to export data or open the .dbf file in Calc/Excel.

by Denis Francisci
This work is licensed under a Creative Commons Attribution 4.0 International License.