---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# cspp: A Tool for the Correlates of State Policy Project Data <img src="figures/cspp.png" height="150" align="right"/>

```{r echo=FALSE, results="hide", message=FALSE, warning=F}
library("badger")
```

<!-- badges: start -->
```{r, echo = FALSE, results='asis'}
cat(
	badge_cran_release("cspp", "blue")
	,badge_cran_download("cspp", "grand-total", "blue")
)
```
<!-- badges: end -->

**cspp** is a package designed to allow a user with only basic knowledge of R to find variables on state politics and policy, create and export datasets from these variables, subset the datasets by states and years, create map visualizations, and export citations to common file formats (e.g., `.bib`). An associated web application is available [here](https://cspp.ippsr.msu.edu/cspp/).

## Recent Updates:

* Added integration with `csppData` to allow use of the most recent CSPP data (version 2.4).
* Added the `corr_plot` function, which creates correlation plots and correlation matrices.
* Added the `plot_panel` function, which facilitates the creation of timeseries plots.
* Added the `core` argument to the `get_cspp_data` function. Set it to `TRUE` to merge in common and important variables from the CSPP data. Useful for teaching purposes.
* Added integration with `csppData`, a package we created to hold the Correlates of State Policy Project data and codebook. The CSPP data is no longer bundled with this package; instead, `cspp` automatically loads the `csppData` package and its data.

## The Correlates of State Policy

[The Correlates of State Policy Project](http://ippsr.msu.edu/public-policy/correlates-state-policy) compiles more than 2,000 variables across 50 states (+ DC) from 1900-2020. The variables cover 16 broad categories:

* Demographics and Population
* Economic and Fiscal Policy
* Government
* Elections
* Policy Scores and Public Opinion
* Criminal Justice and the Legal System
* Education
* Healthcare and Health Insurance
* Welfare Policy
* Rights and Anti-Discrimination Protections
* Environment
* Drug and Alcohol Policy
* Gun Control
* Labor
* Transportation
* Regulatory Policy


## Basic Use: Finding and Returning State Politics Data

```{r eval=F}
# For the latest development version:
library(devtools)
devtools::install_github("correlatesstatepolicy/cspp")

# For CRAN version:
install.packages("cspp")
```


The primary functions in this package are `get_var_info` and `get_cspp_data`. The basic workflow for using this package is to 1) find variables of interest and 2) pull them from the full data into a dataframe within the R environment. Below is a basic working example.

```{r message = FALSE}
# Load the package
library(cspp)

# Find variables based on a category
demo_variables <- get_var_info(categories = "demographics")

# Use these variables to get a full or subsetted version of the data
cspp_data <- get_cspp_data(vars = demo_variables$variable,
                           years = seq(2000, 2010))

```

The `get_cspp_data` function returns a properly formatted state-year panel, facilitating regressions and merging based on common state identifiers.
```{r}
library(dplyr)
glimpse(cspp_data[1:15])
```
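
Because the result is keyed on state and year, it can be joined to other state-year data with an ordinary merge. The following is a minimal sketch using base R and made-up values; it assumes the state abbreviation column is named `st` (as in the `filter` calls later in this README), and `toy_panel`, `other_data`, and `my_outcome` are hypothetical names used only for illustration.

```r
# Toy stand-in for a data frame returned by get_cspp_data(); values are made up
toy_panel <- data.frame(
  st = rep(c("NC", "VA"), each = 2),
  year = rep(c(2000, 2001), times = 2),
  poptotal = c(8.0, 8.2, 7.1, 7.2)
)

# Hypothetical outside data keyed on the same state-year identifiers
other_data <- data.frame(
  st = c("NC", "NC", "VA"),
  year = c(2000, 2001, 2000),
  my_outcome = c(1.5, 1.7, 2.1)
)

# Left join: keep every state-year in the panel,
# with NA where other_data has no matching row
merged <- merge(toy_panel, other_data, by = c("st", "year"), all.x = TRUE)
merged
```

Using `all.x = TRUE` keeps the panel balanced, so state-years missing from the outside data show up as `NA` rather than silently disappearing.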


More generally, you can load the full variable list and/or the full dataset into R by calling these functions without any arguments:

```{r}
# All variables
all_variables <- get_var_info()

# Full dataset
all_data <- get_cspp_data()
```


## Finding Variables

Given the large number of variables in the data, we provide additional functionality within `get_var_info` to search for variables based on strings or categories. For instance, the following searches for `pop` and `femal` within the variable name, returning 31 variables:

```{r}
# Search for variables by name
get_var_info(var_names = c("pop","femal")) %>% dplyr::glimpse()
```

A similar line of code using the `related_to` parameter, instead of `var_names`, searches within the name **and** the description fields, returning 96 results:

```{r}
# Search by name and description:
get_var_info(related_to = c("pop", "femal")) %>% dplyr::glimpse()
```
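
To see why the second search returns more results, it can help to think of both searches as partial string matches. The sketch below is not the package's implementation; it mimics the idea with `grepl` on a made-up three-row codebook, where `toy_codebook`, its `description` column, and the variables `femalegov_toy` and `toy_density` are all hypothetical.

```r
# Made-up codebook; only poptotal is a variable name taken from this README
toy_codebook <- data.frame(
  variable = c("poptotal", "femalegov_toy", "toy_density"),
  description = c("Total state residents",
                  "Governor is a woman",
                  "Population per square mile"),
  stringsAsFactors = FALSE
)

# Combine the search strings into one case-insensitive pattern
pattern <- paste(c("pop", "femal"), collapse = "|")

in_name <- grepl(pattern, toy_codebook$variable, ignore.case = TRUE)
in_desc <- grepl(pattern, toy_codebook$description, ignore.case = TRUE)

# Name-only search (analogous to var_names)
name_hits <- toy_codebook$variable[in_name]

# Name-or-description search (analogous to related_to)
related_hits <- toy_codebook$variable[in_name | in_desc]

name_hits
related_hits
```

Here `toy_density` is found only by the broader search, because "pop" appears in its description but not in its name.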

You can also return whole categories of variables. The full list of variable categories is available in the help file (`?get_cspp_data`). Alternatively, you can list the categories with the snippet below.

```{r}
# See variable categories:
unique(get_var_info()$category)
```
```{r}
# Find variables by category:
var_cats <- get_var_info(categories = c("gun control", "labor"))
```

You can then use the variable column in this dataframe to pull data from `get_cspp_data` through `var_cats$variable`, an example of which is below.

Another way to find a variable is to load the variables into a dataframe and use RStudio's filter feature to search:

![RStudio Filter](man/figures/README-filter.png)

## Pulling data

The function `get_cspp_data` takes the following parameters, all of which are optional:

* `vars` - The specific (exact match) variable(s) to pull. Takes a single variable or a vector of variable names.
* `var_category` - The category or categories from which to pull. Takes a single category or vector of categories from the 16 listed above.
* `states` - Select which states to grab data from. States must be abbreviated and can take a vector or individual state. See `?state.abb` for an easy way to load state abbreviations.
* `years` - Takes a single year or a vector or sequence of years, such as `seq(2001, 2005)`.
* `output` - Choose to write the resulting dataframe straight to a file. Optional outputs include `csv`, `dta`, or `rdata`.
* `path` - If outputting the file, choose where to write it to. If left blank, the file will save to your working directory.

In this example, the resulting dataframe includes the variables `c("sess_length", "hou_majority", "term_length")` as well as all variables in the category `demographics` for North Carolina, Virginia, and Georgia from 1995 to 2004.

```{r eval = F}
# Get subsetted data and save to dataframe
data <- get_cspp_data(vars = c("sess_length", "hou_majority", "term_length"),
                      var_category = "demographics",
                      states = c("NC", "VA", "GA"),
                      years = seq(1995, 2004))
```
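
If you would rather not type abbreviations by hand, the vector for `states` can be built from the state datasets that ship with base R. This is a small sketch; the object names `picked` and `south` are arbitrary, and note that `state.abb` covers the 50 states only (no DC).

```r
# state.abb, state.name, and state.region ship with base R (datasets package)

# Look up abbreviations from full state names
picked <- state.abb[state.name %in% c("North Carolina", "Virginia", "Georgia")]
picked

# Or select a whole Census region
south <- state.abb[state.region == "South"]
length(south)

# Either vector could then be passed to the states argument, e.g.:
# get_cspp_data(states = south, years = seq(1995, 2004))
```

Because `state.abb` is ordered alphabetically by state name, `picked` comes back in that order rather than in the order the names were listed.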

You can also pass the output of `get_var_info` directly to the `vars` parameter of `get_cspp_data`, skipping a step:

```{r}
# Use get_var_info to generate variable vector inline
get_cspp_data(vars = get_var_info(related_to = "concealed carry")$variable,
              states = "NC",
              years = 1999)
```

The two returned variables, `bjourn` and `bprecc`, deal with concealed carry of guns in motor vehicles and whether state laws pre-empt local laws, respectively.


### Citations

Each variable in the CSPP data was collected from external sources. We've made it easy to cite the source of each variable you use with the `get_cites` function.

This function takes a variable name or vector of variable names (such as that generated by the `get_var_info` function) and returns a dataframe of citations.

```{r}
# Simple dataframe for one variable
get_cites(var_names = "poptotal") %>% dplyr::glimpse()

# Using get_var_info to return variable citations
cite_ex <- get_cites(var_names = get_var_info(related_to = "concealed carry")$variable)
cite_ex$plaintext_cite[3:4]
```

There is also an option to output the citations to a .bib, .csv or .txt file:

```{r eval=F}
get_cites(var_names = "poptotal",
         write_out = TRUE,
         file_path = "~/path/to/file.csv",
         format = "csv")
```


## Maps and Choropleths

The `generate_map` function uses the CSPP data to generate US maps with states filled in based on the value of a given variable (also called choropleths). This function returns a `ggplot` object so it is highly customizable. The optional parameters are:

* `cspp_data` - A dataframe ideally generated by the `get_cspp_data` function. Any dataframe will work as long as it has the columns `st.abb`, `year`, and any additional column from which to fill in the map.
* `var_name` - The specific variable to use to fill in the map. If left blank, it will take the first column after `year` and `st.abb`.
* `average_years` - Default is FALSE. If set to TRUE, this returns a map that averages over all of the years per state in the dataframe. So if there are multiple years of population per state, it plots the average population per state in the panel.
* `drop_NA_states` - By default, the function keeps states that are missing data, resulting in them being filled in as gray. If this is set to TRUE, the states are dropped. See the example below.
* `poly_args` - A list of arguments that determine the aesthetics of state shapes. See `ggplot2::geom_polygon` for options.

**Note**: This function will attempt to plot any variable type; however, plotting character or factor values on a map will likely result in a hard-to-interpret graph.

```{r out.width='60%'}
library(ggplot2) # optional, but needed to remove legend

# Generates a map of the percentage of the population over 65
generate_map(get_cspp_data(var_category = "demographics"),
             var_name = "pctpopover65") +
  ggplot2::theme(legend.position = "none")
```

In this example, since the dataframe passed is generated by `get_cspp_data(var_category = "demographics")` and contains all years for all states in the data, the function by default returns the value of the most recent year without missing data.
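
If you set `average_years = TRUE` instead, each state is filled according to its average across the years in the dataframe. Conceptually this is a per-state mean, which the sketch below reproduces in base R on made-up values. It is an illustration rather than the package's internal code, and dropping missing years before averaging is an assumption made here.

```r
# Toy panel with made-up values; one year is missing for NC
toy <- data.frame(
  st = rep(c("NC", "VA"), each = 3),
  year = rep(2000:2002, times = 2),
  pctpopover65 = c(12.0, 12.2, NA, 11.0, 11.2, 11.4)
)

# One value per state: the mean over all available years.
# The formula interface of aggregate() drops rows with NA by default.
state_avg <- aggregate(pctpopover65 ~ st, data = toy, FUN = mean)
state_avg
```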


If you set `drop_NA_states` to TRUE, and pass the function a dataframe containing only certain states, it only plots those states:
```{r out.width='60%'}
library(dplyr)

generate_map(get_cspp_data(var_category = "demographics") %>%
                dplyr::filter(st %in% c("NC", "VA", "SC")),
              var_name = "pctpopover65",
              poly_args = list(color = "black"),
              drop_NA_states = TRUE) +
  ggplot2::theme(legend.position = "none")
```

Since this function returns a `ggplot` object, you can customize it endlessly:
```{r out.width='60%'}
generate_map(get_cspp_data(var_category = "demographics") %>%
                dplyr::filter(st %in% c("NC", "VA", "SC", "TN", "GA", "WV", "MS", "AL", "KY")),
              var_name = "pctpopover65",
              poly_args = list(color = "black"),
              drop_NA_states = TRUE) +
  ggplot2::scale_fill_gradient(low = "white", high = "red") +
  ggplot2::theme(legend.position = "none") +
  ggplot2::ggtitle("% Population Over 65")
```

## Plot timeseries data
To facilitate the visualization of the timeseries and panel nature of the CSPP data, the `plot_panel` function takes a dataframe from `get_cspp_data` and plots a state-year panel in one of two formats. The parameters of this function are as follows:

* `cspp_data` - a dataframe generated by `get_cspp_data` or, alternatively, a dataframe with the columns `st.abb`, `year`, plus one other variable.
* `var_name` - the name of the variable to be plotted.
* `years` - the years to include in the panel.
* `colors` - three color values that are used in the plot. The first color takes the lowest values of the variable, the second the highest, and the third is the color used for NA values.
* `plot_type` - one of "grid" or "line". Defaults to "grid". Both are displayed next.

The function returns a `ggplot2` object, making it easier to change and add layers onto the generated plot.

A common research design is to use variation in policy adoption as a 'treatment' in a pseudo-experimental setting. This function makes it easy to visualize when states are subject to this treatment.

```{r}
# panel of all states' adoption of medical marijuana laws
cspp <- get_cspp_data(vars = "drugs_medical_marijuana")

# visualize panel:
plot_panel(cspp)
```
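
For a staggered-adoption design you often also need each state's first treated year. The sketch below derives it from a binary state-year indicator in base R. The panel is entirely made up (placeholder states `A`, `B`, `C`), and it assumes the policy variable is coded 0/1, which you should confirm in the codebook for the variable you use.

```r
# Made-up binary panel: 1 = policy in place, 0 = not in place
toy <- data.frame(
  st = rep(c("A", "B", "C"), each = 4),
  year = rep(1995:1998, times = 3),
  drugs_medical_marijuana = c(0, 1, 1, 1,   # A adopts in 1996
                              0, 0, 0, 0,   # B never adopts
                              0, 0, 0, 1)   # C adopts in 1998
)

# Earliest year in which each state is treated
treated <- toy[toy$drugs_medical_marijuana == 1, ]
first_year <- aggregate(year ~ st, data = treated, FUN = min)
names(first_year)[2] <- "first_treated_year"

# Keep never-treated states, with NA as their adoption year
adoption <- merge(data.frame(st = unique(toy$st)), first_year,
                  by = "st", all.x = TRUE)
adoption
```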

The function also works with continuous variables, such as the state policy liberalism score:

```{r}
plot_panel(cspp_data = get_cspp_data(vars = "pollib_median"),
           colors = c("firebrick4", "steelblue2", "gray"),
           years = seq(1960, 2010)) +
  ggplot2::ggtitle("Policy liberalism")
```

## Correlation Plots and Matrices
The `corr_plot` function generates a correlation plot (in the style of a heat map) using `ggcorrplot`. Alternatively, it outputs a correlation matrix for users to pass on to their own `ggcorrplot` call for further customization.

This function takes the following parameters:

* `data` - a dataframe. If generated by `get_cspp_data`, the function will work automatically. Otherwise, this function parses all numeric columns in the passed dataframe.
* `vars` - a vector of variable names to generate correlations between.
* `summarize` - whether to create state-specific means for each variable. If set to TRUE, the passed dataframe must contain the column `st.abb`. If FALSE, it will treat each observation independently in generating correlations.
* `labels` - whether to include labels on the correlation plot of the specific correlation value.
* `label_size` - the font size of the labels.
* `colors` - a vector of three colors that form the gradient in the plot.
* `cor_matrix` - if set to TRUE, instead of returning a ggplot object this function returns a correlation matrix.

```{r message=F}
corr_plot(data = get_cspp_data(years = seq(1990, 2010)),
          vars = c("pollib_median", "innovatescore_boehmkeskinner", "citi6013", "ranney4_control", "h_diffs"),
          colors = c("dodgerblue", "white", "firebrick"))
```
Alternatively, you can use similar code to generate a correlation matrix:
```{r message=F}
corr_plot(data = get_cspp_data(years = seq(1990, 2010)),
          vars = c("pollib_median", "innovatescore_boehmkeskinner", "citi6013", "ranney4_control", "h_diffs"),
          cor_matrix = TRUE)
```
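
The `summarize` option changes what is being correlated: state-level means rather than individual state-year observations. The sketch below shows the difference in base R on made-up data. It illustrates the idea and is not the function's internals; `var_a`, `var_b`, and the placeholder states are hypothetical.

```r
# Made-up panel: three states, two years each
toy <- data.frame(
  st = rep(c("A", "B", "C"), each = 2),
  var_a = c(1, 3, 2, 4, 5, 7),
  var_b = c(2, 2, 5, 7, 9, 11)
)

# Pooled: every state-year row is treated as an independent observation
cor_pooled <- cor(toy[, c("var_a", "var_b")])

# Summarized: average within each state first, then correlate the means
state_means <- aggregate(cbind(var_a, var_b) ~ st, data = toy, FUN = mean)
cor_summarized <- cor(state_means[, c("var_a", "var_b")])

cor_pooled
cor_summarized

# Either matrix could be handed to ggcorrplot::ggcorrplot() for plotting
```

The two matrices generally differ, because averaging removes within-state variation over time before the correlation is computed.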


## Network data
The function `get_network_data` returns a dataset from the [CSPP state networks data](http://ippsr.msu.edu/public-policy/state-networks) consisting of 120 variables. The data is structured as state dyads (an edge list).

```{r}
# Returns dataframe of state dyads
get_network_data() %>% dplyr::glimpse()
```
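
An edge list stores one row per ordered pair of states. If you need an adjacency matrix instead, base R's `xtabs` can reshape it. The sketch below uses a made-up three-state edge list; the `st.abb1` and `st.abb2` column names follow the plotting example later in this README, while `toy_flow` and its values are hypothetical.

```r
# Made-up directed edge list: one row per ordered pair of states
edges <- data.frame(
  st.abb1 = c("NC", "NC", "VA", "VA", "SC", "SC"),
  st.abb2 = c("VA", "SC", "NC", "SC", "NC", "VA"),
  toy_flow = c(10, 20, 30, 40, 50, 60)
)

# Rows are the first state in the dyad, columns the second;
# pairs absent from the edge list (here the diagonal) are filled with 0
adj <- xtabs(toy_flow ~ st.abb1 + st.abb2, data = edges)
adj
```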

The function has two optional parameters: `category` and `merge_data`. If a category or vector of categories is specified, it returns only the variables in those categories (see the data documentation in the link above). Category options are "Distance Travel Migration", "Economic", "Political", "Policy", "Demographic".

```{r}
network.df <- get_network_data(category = c("Economic", "Political"))

names(network.df)
```

`merge_data` simplifies merging in data from the `get_cspp_data` function. The object passed to `merge_data` must be a dataframe with a variable named `st.abb`, or a dataframe generated by `get_cspp_data`. If the dataframe passed to this parameter has more than one observation per state (a panel) then this function averages over all values per state prior to merging.

```{r}
cspp_data <- get_cspp_data(vars = c("sess_length", "hou_majority"), years = seq(1999, 2000))

network.df <- get_network_data(category = "Distance Travel Migration",
                               merge_data  = cspp_data)

names(network.df)

library(dplyr)

cspp_data %>%
  arrange(st) %>%
  head()
# the merged value of Alaska's hou_majority value will be mean(c(-0.129, -0.115))

```
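
The averaging step can be reproduced by hand to check what ends up in the merged data. The sketch below collapses a two-year panel to state means and joins them onto a toy edge list. Only the Alaska values come from the comment above; the Alabama values and `toy_distance` are made up, and attaching the state-level values to the first state in each dyad is an assumption made for illustration, not a description of how `get_network_data` joins them.

```r
# Two-year panel; AK values are from the comment above, AL values are made up
panel <- data.frame(
  st = c("AK", "AK", "AL", "AL"),
  year = c(1999, 2000, 1999, 2000),
  hou_majority = c(-0.129, -0.115, 0.20, 0.22)
)

# Collapse to one row per state by averaging over years
state_means <- aggregate(hou_majority ~ st, data = panel, FUN = mean)

# Toy edge list with a made-up dyadic variable
edges <- data.frame(
  st.abb1 = c("AK", "AL"),
  st.abb2 = c("AL", "AK"),
  toy_distance = c(3000, 3000)
)

# Attach the state means to the first state in each dyad (assumption)
merged_edges <- merge(edges, state_means, by.x = "st.abb1", by.y = "st")
merged_edges
```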

### Network Plotting Examples

Here are two examples of plotting the state network edgelist data using `igraph` and `ggraph`:

```{r message=F, warning=F, dpi=180}
library(ggraph)
library(igraph)

network.df <- select(network.df, from = st.abb1, to = st.abb2, ACS_Migration)

network.df %>%
  filter(from %in% c("NC", "VA", "SC", "GA")) %>%
  graph_from_data_frame() %>%
  ggraph(layout="fr") +
  geom_edge_link(aes(edge_alpha = ACS_Migration), edge_color = "royalblue") +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE, point.padding = unit(0.2, "lines")) +
  theme_void() +
  theme(legend.position = "none")
```


```{r message=F, warning=F, dpi=180}
network.df %>%
  filter(from %in% c("NC")) %>%
  graph_from_data_frame() %>%
  ggraph(layout="linear") +
  geom_edge_arc(aes(edge_alpha = ACS_Migration), edge_color = "royalblue") +
  geom_node_text(aes(label = name), size = 2) +
  theme_void() +
  theme(legend.position = "none")
```


# Citation

> Caleb Lucas and Joshua McCrain (2020). cspp: A Package for The Correlates of State Policy Project Data.
R package version 0.3.3.

# Contact

[**Caleb Lucas**](https://caleblucas.com/) - Political Scientist, RAND Corporation
<br />
[**Josh McCrain**](http://joshuamccrain.com) - Assistant Professor, University of Utah ([Twitter](https://twitter.com/joshmccrain?lang=en))