2012-11-22

Sampling CSV headers from the Data Hub

Recently, I decided to check how useful the column headers in typical CSV files are. My hunch was that in many cases columns would be labelled ambiguously, or that the header row would be missing altogether. In such cases the data may be nearly useless, since hints about how to interpret it are lacking.

To support my assumptions about the typical CSV file, I needed sample data. Many such files are listed as downloadable resources in the Data Hub, which is one of the most extensive CKAN instances. Fortunately for me, CKAN exposes a friendly API. However, an even friendlier way to obtain the data, at least for me, was the SPARQL endpoint of Semantic CKAN, which offers access to the Data Hub data in RDF. This is the query that I used:
PREFIX dcat:    <http://www.w3.org/ns/dcat#>
SELECT ?accessURL
WHERE {
  ?s a dcat:Distribution ;
    dcat:accessURL ?accessURL .
  FILTER (STRENDS(STR(?accessURL), "csv"))
}

I saved the query in the query.txt file and executed it on the endpoint:
curl -H "Accept:text/csv" --data-urlencode "query@query.txt" http://semantic.ckan.net/sparql > files.csv

In the command, I took advantage of the content negotiation provided by OpenLink's Virtuoso and set the HTTP Accept header to the MIME type text/csv. I made curl load the query from the query.txt file and pass it in the query parameter by using the argument "query@query.txt" (thanks to @cygri for this tip). The query results were stored in the files.csv file.
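
As a quick sanity check before downloading anything, a couple of standard commands show whether the query returned a sensible list (the exact numbers will of course vary):
wc -l files.csv   # number of result rows, plus one line for the ?accessURL header
head -5 files.csv # eyeball the first few URLs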

Having a list of CSV files, I was ready to download them. I created a directory for the downloaded files and moved into it with mkdir download; cd download. To download the CSV files I executed:
tail -n+2 ../files.csv | xargs -n 1 curl -L --range 0-499 --fail --silent --show-error -O 2> fails.txt
To skip the header row containing the name of the SPARQL result variable, I used tail -n+2 and piped the list of CSV files to curl through xargs. I switched on the -L argument in order to follow redirects. To minimize the amount of downloaded data, I used --range 0-499 to request a partial response containing only the first 500 bytes from servers that honour HTTP range requests. Finally, I quieted curl's progress output with --silent, made it fail on HTTP errors with --fail, kept the error messages themselves with --show-error, and redirected them to the fails.txt file.
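
Not every server honours range requests; those that do not simply return the whole file. A check along these lines, using find's -size test, lists the downloads that came back larger than the requested 500 bytes (fails.txt is excluded, since it lives in the same directory):
find . -type f -size +500c ! -name fails.txt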

When the CSV files were retrieved, I concatenated their first lines:
find * | xargs -n 1 head -1 | sort | perl -p -e "s/^M//g" > 1st_lines.txt
head -1 output the first line of every file passed to it, one at a time, through xargs. To polish the output a bit, I sorted it and removed stray carriage returns with perl -p -e "s/^M//g" (the ^M stands for a literal carriage-return character, typed in the shell as Ctrl-V followed by Ctrl-M). Finally, I had a list with samples of CSV column headers.
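
To get a rough idea of which labels recur, the header samples can be split into individual labels and tallied. The following sketch assumes plain comma separation, so it mangles quoted labels that contain commas, but it is good enough for eyeballing the distribution:
tr ',' '\n' < 1st_lines.txt | tr -d '"' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -20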

By inspecting the samples, I found that ambiguous column labels are indeed common, as labels such as “amount” or “id” are fairly widespread. Examples of other labels that caught my attention included “A-in-A”, “Column 42” and the particularly mysterious label “X”. Disambiguating such column names would be difficult without additional contextual information, such as examples of data from the columns or supplementary documentation. Such data could be hard to use, especially for automated processing.