Convenient way to access variable labels after importing Stata data with haven

In R, some packages (e.g. haven) attach a label attribute to variables, which gives the substantive name of the variable. For example, gdppc may have the label GDP per capita.

This is extremely useful, especially when importing data from Stata. However, I still can't figure out how to use it in my workflow.

  • How can I quickly browse variables together with their labels? Right now I have to call attributes(df$var), which is hardly convenient for getting a quick overview (a la names(df)).

  • How can I use these labels in plots? Again, I can use attr(df$var, "label") to access the label string. However, this seems cumbersome.

Is there an official way to use these labels in a workflow? I can of course write a custom function that wraps attr, but it may break in the future if packages implement the label attribute differently. Ideally, I would like an official approach supported by haven (or other major packages).

3 answers

A solution using the purrr package from the tidyverse:

 df %>% map_chr(~attributes(.)$label) 
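The one-liner above assumes every column carries a label; map_chr() raises an error for a column whose label is NULL. Below is a minimal base R sketch that tolerates unlabeled columns. The helper name get_labels and the toy data frame are illustrative assumptions, not part of purrr or haven.

```r
# Hypothetical helper: collect the "label" attribute of every column,
# returning NA for columns that have none.
get_labels <- function(df) {
  vapply(df, function(x) {
    lab <- attr(x, "label", exact = TRUE)
    if (is.null(lab)) NA_character_ else as.character(lab)
  }, character(1))
}

# Toy data standing in for an imported Stata file (assumption)
df <- data.frame(gdppc = c(1200, 3400), year = c(2000, 2001))
attr(df$gdppc, "label") <- "GDP per capita"

# Named character vector: one entry per column, NA where no label exists
get_labels(df)
```

Because the result is a named character vector, it can be scanned in the console much like names(df).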

This is one of the innovations addressed in rio (full disclosure: I wrote this package). Basically, it provides various ways of importing variable labels, including haven's way of doing things and foreign's. Here's a trivial example:

Start by creating a reproducible example:

    > library("rio")
    > export(iris, "iris.dta")

Import using foreign::read.dta() (via rio::import()):

    > str(import("iris.dta", haven = FALSE))
    'data.frame': 150 obs. of 5 variables:
     $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
     $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
     $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
     $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
     $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
     - attr(*, "datalabel")= chr ""
     - attr(*, "time.stamp")= chr "15 Jan 2016 20:05"
     - attr(*, "formats")= chr "" "" "" "" ...
     - attr(*, "types")= int 255 255 255 255 253
     - attr(*, "val.labels")= chr "" "" "" "" ...
     - attr(*, "var.labels")= chr "" "" "" "" ...
     - attr(*, "version")= int -7
     - attr(*, "label.table")=List of 1
      ..$ Species: Named int 1 2 3
      .. ..- attr(*, "names")= chr "setosa" "versicolor" "virginica"

Read in using haven::read_dta() with its native variable attributes, which are stored at the variable level rather than at the data.frame level:

    > str(import("iris.dta", haven = TRUE, column.labels = TRUE))
    'data.frame': 150 obs. of 5 variables:
     $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
     $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
     $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
     $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
     $ Species     :Class 'labelled' atomic [1:150] 1 1 1 1 1 1 1 1 1 1 ...
      .. ..- attr(*, "labels")= Named int [1:3] 1 2 3
      .. .. ..- attr(*, "names")= chr [1:3] "setosa" "versicolor" "virginica"

Read in using haven::read_dta() with an alternative layout that we (the rio developers) have found more convenient:

    > str(import("iris.dta", haven = TRUE))
    'data.frame': 150 obs. of 5 variables:
     $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
     $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
     $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
     $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
     $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
     - attr(*, "var.labels")=List of 5
      ..$ Sepal.Length: NULL
      ..$ Sepal.Width : NULL
      ..$ Petal.Length: NULL
      ..$ Petal.Width : NULL
      ..$ Species     : NULL
     - attr(*, "label.table")=List of 5
      ..$ Sepal.Length: NULL
      ..$ Sepal.Width : NULL
      ..$ Petal.Length: NULL
      ..$ Petal.Width : NULL
      ..$ Species     : Named int 1 2 3
      .. ..- attr(*, "names")= chr "setosa" "versicolor" "virginica"

By moving the attributes to the data.frame level, it is much easier to get at them with attr(data, "var.labels"), etc., rather than digging through the attributes of each variable.
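To make the idea concrete, here is a rough base R sketch of hoisting per-variable labels into a single data.frame-level attribute. This is not rio's actual implementation: the function name hoist_labels and the toy data are hypothetical, and the attribute name "var.labels" is chosen only to match the str() output above.

```r
# Hypothetical helper: move each column's "label" attribute into one
# named list stored on the data frame itself.
hoist_labels <- function(df) {
  labs <- lapply(df, function(x) attr(x, "label", exact = TRUE))
  # Strip the per-column labels now that they are collected
  for (nm in names(df)) attr(df[[nm]], "label") <- NULL
  attr(df, "var.labels") <- labs
  df
}

# Toy data standing in for an imported Stata file (assumption)
df <- data.frame(gdppc = c(1200, 3400), year = c(2000, 2001))
attr(df$gdppc, "label") <- "GDP per capita"

moved <- hoist_labels(df)
# All labels are now reachable through a single attr() call
attr(moved, "var.labels")$gdppc
```

Unlabeled columns end up as NULL entries in the list, which mirrors the NULL values shown in the output above.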

Note: the attribute values are NULL here because I am just writing a built-in R dataset to a local file to make the example reproducible.


Using sapply in a simple function to return a variable list like the one in Stata's Variables window:

    library(dplyr)  # provides tibble()
    # Return a two-column table of variable names and their labels
    makeVlist <- function(dta) {
      labels <- sapply(dta, function(x) attr(x, "label"))
      tibble(name = names(labels), label = labels)
    }
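For the plotting part of the question, a small wrapper around attr can fall back to the column name when no label is present. This is only a sketch under stated assumptions: label_or_name is a hypothetical helper, not something provided by haven or dplyr, and the data frame is a toy stand-in for imported data.

```r
# Hypothetical helper: return a column's label, or its name if unlabeled.
label_or_name <- function(df, var) {
  lab <- attr(df[[var]], "label", exact = TRUE)
  if (is.null(lab)) var else lab
}

# Toy data standing in for an imported Stata file (assumption)
df <- data.frame(gdppc = c(1200, 3400), year = c(2000, 2001))
attr(df$gdppc, "label") <- "GDP per capita"

# Strings suitable for axis titles
y_title <- label_or_name(df, "gdppc")  # labeled column: uses the label
x_title <- label_or_name(df, "year")   # unlabeled column: uses the name
```

The resulting strings can then be passed as the xlab and ylab arguments of base plot(), or to labs() in ggplot2.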