**Understanding Syntax**

The Passion Driven Statistics curriculum is intended to help students perform basic data management and statistical tests across four major statistical software platforms (R, SAS, Stata and SPSS). This web page provides a library of basic commands that the user can copy and paste into R, SAS, Stata or SPSS to perform a variety of data management tasks and basic statistical tests. Our goal is to help students use statistical computing as a building block in scientific reasoning and creativity. Rather than producing students who can think about statistics only from a software-specific perspective, these resources are meant to help students move flexibly and confidently between statistical software environments.

We use the following convention when presenting software-specific syntax. **Bold** text indicates syntax that does not need to change (some of the bolded text could be changed, but the bold indicates that it does not need to be). Unbolded text indicates syntax that needs to be adapted to your own project (e.g. the actual name of your data set, your unique variable names, etc.). In the R examples, the leading **>** is the console prompt and should not be typed.

## Contents needed for every program

**Calling in a data set**

SPSS

**GET FILE='P:\QAC\qac201\Studies**\study name\filename.sav**'.**

Stata

**use "P:\QAC\qac201\Studies**\study name\filename**"**

SAS

**LIBNAME in "P:\QAC\QAC201**\study name**";**

**DATA new; set in.**filename**;**

R

**>** newdata **<- read.table(file="**filename.txt**", sep="\t", header=T)**

**Sorting the data**

SPSS

**SORT CASES BY** UNIQUE_ID**.**

Stata

**sort** unique_id

SAS

**proc sort; by** unique_id**;**

R

**>** title_of_data_set **<-** title_of_data_set**[order(**title_of_data_set**$**unique_id**, decreasing=F),]**

## Abbreviating a data set to a smaller number of variables (i.e. columns)

**Selecting variables you want to examine**

Because many data sets are very large in terms of both observations and variables, any analyses that you conduct could take several minutes. Subsetting or abbreviating the data based on the variables that you will be examining can shorten the analytic time required to run your program. While this will not make a huge difference if you are running a program only a few times, the time you will save can be substantial if you plan to work extensively with the data.

SPSS

**/KEEP** VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8**.** (Must follow the **SAVE OUTFILE**='dataname' command)

Stata

**keep** var1 var2 var3 var4 var5 var6 var7 var8

SAS

**KEEP** VAR1 VAR2 VAR3 VAR4 VAR5 VAR6 VAR7 VAR8**;**

R

**> var.keep <- c("**VAR1**", "**VAR2**", "**VAR3**", "**VAR4**", "**VAR5**", "**VAR6**", "**VAR7**", "**VAR8**")**

**>** title_of_new_data_set **<-** new.data**[,** var.keep**]**

**Outputting your abbreviated data set**

SPSS

**SAVE OUTFILE='P:\QAC\qac201\Studies**\study name\title_of_new_data_set**'**

Stata

**save** filename

SAS

**Data libname.**title_of_new_data_set**;** **set** *dataname***;** **by** *unique_id***;**

R

**> write.table(**title_of_data_set**, file="**filename.txt**", sep="\t", row.names=F)**

**>** title_of_data_set **<-** title_of_data_set**[order(**title_of_data_set**$**unique_id**, decreasing=F),]**

## Data management tasks

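**Basic Operations:**

| Operation | SPSS | Stata | SAS | R |
|---|---|---|---|---|
| Equal to | EQ or = | == | EQ or = | == |
| Greater than or equal to | >= or GE | >= | >= or GE | >= |
| Less than or equal to | <= or LE | <= | <= or LE | <= |
| Greater than | > or GT | > | > or GT | > |
| Less than | < or LT | < | < or LT | < |
| Not equal to | NE | != | NE | != |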

**Identify missing data**

Often, you must define the response categories that represent missing data. For example, if the number 9 is used to represent a missing value for a particular variable (VAR1), you must either designate in your program that this value represents missingness or else you must recode the variable into a missing data character that your statistical software recognizes. If you do not, the 9 will be treated as a real/meaningful value and will be included in each of your analyses.

SPSS

**RECODE** var1 (9=**SYSMIS**)**.**

Stata

**replace** var1=. **if** var1==9

SAS

**if** VAR1=9 **then** VAR1=.;

R

**>** title_of_data_set**$**VAR1**[**title_of_data_set**$**VAR1==9**] <- NA**
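To see why this recode matters, the R sketch below compares the mean of VAR1 before and after the 9s are declared missing. The data frame and its values are invented for illustration; only the recode line itself follows the syntax above.

```r
# Hypothetical data: VAR1 is scored 1-5, and 9 marks a missing response.
title_of_data_set <- data.frame(VAR1 = c(1, 2, 9, 4, 9, 3))

# Before recoding, the two 9s are averaged in as if they were real answers.
mean_before <- mean(title_of_data_set$VAR1)

# Recode 9 to NA, R's missing-value marker.
title_of_data_set$VAR1[title_of_data_set$VAR1 == 9] <- NA

# After recoding, ask mean() to skip the missing values.
mean_after <- mean(title_of_data_set$VAR1, na.rm = TRUE)

print(c(before = mean_before, after = mean_after))
```

In this made-up example the mean drops from about 4.67 to 2.5 once the two 9s are excluded.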

**Recode responses to “no” based on skip patterns**

Some data sets contain skip patterns. For example, if we ask someone whether or not they have ever used marijuana, and they say “no”, it would not make sense to ask them more detailed questions about their marijuana use (e.g. quantity, frequency, onset, impairment, etc.). When analyzing more detailed questions regarding marijuana (e.g. have you ever smoked marijuana daily for a month or more?), those individuals who never used the substance may show up as missing data. Since they have never used marijuana, we can assume that their answer to this question regarding daily use is “no”. This would need to be explicitly recoded. Note that we commonly code a “no” as 0 and a “yes” as 1.

SPSS

**RECODE** var1 (**SYSMIS**=7).

Stata

**replace** var1=7 **if** var1==.

SAS

**if** VAR1=. **then** VAR1=7;

R

**>** title_of_data_set**$**VAR1**[is.na(**title_of_data_set**$**VAR1**)] <-** 7
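The commands above recode every missing value. One possible refinement, sketched here in R and not part of the syntax above, is to recode only the respondents who were actually skipped, so that people who never answered the gate question stay missing. The variable names `ever_used` and `daily_use` and all values are hypothetical.

```r
# Hypothetical data: ever_used is the gate question (1 = yes, 0 = no);
# daily_use was asked only of people who answered yes.
title_of_data_set <- data.frame(
  ever_used = c(1, 0, 1, 0, NA),
  daily_use = c(1, NA, 0, NA, NA)
)

# Flag rows where the respondent said "no" to the gate question
# and therefore has no answer to the follow-up question.
skipped <- !is.na(title_of_data_set$ever_used) &
  title_of_data_set$ever_used == 0 &
  is.na(title_of_data_set$daily_use)

# Recode only those rows to "no" (0); the last row stays missing.
title_of_data_set$daily_use[skipped] <- 0

print(title_of_data_set)
```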

**Recoding string variables into numeric**

In most software packages, it is important when preparing to run statistical analyses that all variables have response categories that are numeric rather than “string” or “character” (i.e. response categories are numbers rather than strings of characters and/or symbols). While it is not always needed, it is often recommended that all variables with string responses be recoded into numeric values. These numeric values are known as dummy codes in that they carry no direct numeric meaning.

SPSS

**RECODE** TREE ('Maple'=1) ('Oak'=2) INTO TREE_N.

Stata

**generate** TREE_N=.

**replace** TREE_N=1 if TREE=="Maple"

**replace** TREE_N=2 if TREE=="Oak"

OR by using the encode command

encode TREE, gen(TREE_N)

SAS

**IF** TREE='Maple' **then** TREE_N=1; **else if** TREE='Oak' **then** TREE_N=2;

R

(Not necessary in R)
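R handles string categories through factors, which is why no recode is usually needed. If explicit numeric codes are wanted anyway (for example, to match codes used in another package), a minimal sketch with made-up data might look like this:

```r
# Hypothetical data with a character (string) variable.
title_of_data_set <- data.frame(
  TREE = c("Maple", "Oak", "Oak", "Maple"),
  stringsAsFactors = FALSE
)

# Listing the levels explicitly fixes which code each category receives.
tree_factor <- factor(title_of_data_set$TREE, levels = c("Maple", "Oak"))

# as.integer() returns the position of each value in the level list.
title_of_data_set$TREE_N <- as.integer(tree_factor)

print(title_of_data_set)
```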

**Collapsing response categories**

If a variable has many response categories, it can be difficult to interpret the statistical analyses in which it is used. Alternatively, there may be too few subjects or observations identified by one or more response categories to allow for a successful analysis. In these cases, you would need to collapse across categories. For example, if you have the following categories for geographic region, you may want to collapse some of these categories:

*Region*: New England=1, Middle Atlantic=2, East North Central=3, West North Central=4, South Atlantic=5, East South Central=6, West South Central=7, Mountain=8, Pacific=9.

*New_Region*: East=1, West=2.

SPSS

**COMPUTE** new_region=2.

IF (region=1|region=2|region=3|region=5|region=6) new_region=1.

Stata

**generate** new_region=2

**replace** new_region=1 if region==1|region==2|region==3|region==5|region==6

OR by using the recode command

recode region (1/3 5 6=1) (4 7/9=2), gen(new_region)

SAS

**if** region=1 or region=2 or region=3 or region=5 or region=6 **then** new_region=1; **else if** region=4 or region=7 or region=8 or region=9 **then** new_region=2;

R

**>** new_region **<- rep(NA, nrow(**title_of_data_set**))**

**>** new_region**[**title_of_data_set**$**region == 1 **|** title_of_data_set**$**region == 2 **|** title_of_data_set**$**region == 3 **|** title_of_data_set**$**region == 5 **|** title_of_data_set**$**region == 6**] <-** 1

**>** new_region**[**title_of_data_set**$**region == 4 **|** title_of_data_set**$**region == 7 **|** title_of_data_set**$**region == 8 **|** title_of_data_set**$**region == 9**] <-** 2
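The long chains of comparisons in R can be shortened with the `%in%` operator. The following is a sketch of the same East/West collapse on made-up data; the final cross-tabulation is a quick check that every old code landed in the intended new category.

```r
# Hypothetical data: one row per region code, plus one missing value.
title_of_data_set <- data.frame(region = c(1:9, NA))

# Region codes that belong to the East group, per the coding scheme above.
east <- c(1, 2, 3, 5, 6)

# 1 = East, 2 = West.
new_region <- ifelse(title_of_data_set$region %in% east, 1, 2)

# %in% treats a missing region as "not East", so restore missingness by hand.
new_region[is.na(title_of_data_set$region)] <- NA

title_of_data_set$new_region <- new_region

# Cross-tabulate old against new codes to verify the recode.
print(table(title_of_data_set$region, title_of_data_set$new_region))
```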

**Aggregating variables**

In many cases, you will want to combine multiple variables into one. For example, a data set may include a variable for each of several different individual anxiety disorders. You may however be interested in anxiety more generally. In this case you could create a general anxiety variable in which those individuals who received a diagnosis of social phobia, generalized anxiety disorder, specific phobia, panic disorder, agoraphobia, or obsessive compulsive disorder would be coded “yes” and those who were free from all of these diagnoses would be coded “no”.

SPSS

**IF** (socphob=1|gad=1|specphob=1|panic=1|agora=1|ocd=1) anxiety=1.

**RECODE** anxiety (**SYSMIS**=0).

Stata

**gen** anxiety=1 if socphob==1|gad==1|specphob==1|panic==1|agora==1|ocd==1

**replace** anxiety=0 if anxiety==.

SAS

**if** socphob=1 or gad=1 or specphob=1 or panic=1 or agora=1 or ocd=1 **then** anxiety=1; **else** anxiety=0;

R

**>** anxiety **<- rep(0, nrow(**title_of_data_set**))**

**>** anxiety**[**title_of_data_set**$**socphob == 1 **|** title_of_data_set**$**gad == 1 **|** title_of_data_set**$**specphob == 1 **|** title_of_data_set**$**panic == 1 **|** title_of_data_set**$**agora == 1 **|** title_of_data_set**$**ocd == 1**] <-** 1

**Creating a continuous variable**

If you are working with a number of items that represent a single construct, it may be useful to create a composite variable or score. For example, I want to use a list of nicotine dependence symptoms meant to address the presence or absence of nicotine dependence (e.g. tolerance, withdrawal, craving, etc.). Rather than using a dichotomous variable (i.e. nicotine dependence present/absent), I want to examine the construct as a dimensional scale (i.e. number of nicotine dependence symptoms). In this case, I would want to recode each symptom variable so that yes=1 and no=0 and then sum the items so that they represent a single composite score ranging from 0 to 4 (i.e. 4 corresponding to the total number of symptoms measured and summed).

SPSS

**COMPUTE** nd_sum=**sum(**nd_symptom1, nd_symptom2, nd_symptom3, nd_symptom4**).**

Stata

**egen** nd_sum=**rowtotal(**nd_symptom1 nd_symptom2 nd_symptom3 nd_symptom4**)** (called **rsum** in older versions of Stata)

SAS

nd_sum=**sum(of** nd_symptom1 nd_symptom2 nd_symptom3 nd_symptom4**);**

R

**>** nd_sum **<-** title_of_data_set**$**nd_symptom1 **+** title_of_data_set**$**nd_symptom2 **+** title_of_data_set**$**nd_symptom3 **+** title_of_data_set**$**nd_symptom4

**>** title_of_data_set**$**nd_sum **<-** nd_sum
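One difference between packages is worth checking before relying on a composite score: the SPSS, Stata and SAS sum functions shown above skip missing items, whereas adding columns with `+` in R returns `NA` for any row with a missing item. The sketch below, on made-up data, shows both behaviors using `rowSums()`; which one is appropriate depends on your project.

```r
# Hypothetical symptom items (1 = yes, 0 = no); the third person skipped item 2.
title_of_data_set <- data.frame(
  nd_symptom1 = c(1, 0, 1),
  nd_symptom2 = c(1, 0, NA),
  nd_symptom3 = c(0, 0, 1),
  nd_symptom4 = c(1, 0, 1)
)
items <- c("nd_symptom1", "nd_symptom2", "nd_symptom3", "nd_symptom4")

# Strict version: any missing item makes the whole score missing
# (the same result as adding the columns with +).
sum_strict <- rowSums(title_of_data_set[, items])

# Lenient version: missing items are dropped from the sum,
# which in effect counts them as 0.
sum_lenient <- rowSums(title_of_data_set[, items], na.rm = TRUE)

print(data.frame(sum_strict, sum_lenient))
```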

**Renaming variables**

Given the often cryptic names that variables are given in some data sets, it can be useful to rename them into something you find meaningful (i.e. easier to remember or type).

SPSS

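**COMPUTE** newvarname=var1**.** (This creates a copy under the new name; **RENAME VARIABLES (**var1=newvarname**).** renames the variable in place.)

Stata

**rename** var1 newvarname

SAS

**RENAME** var1=newvarname**;**

R

**> names(**title_of_data_set**)[names(**title_of_data_set**)=="**VAR1**"] <- "**newvarname**"**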

**Subsetting data to a particular set of observations (i.e. rows)**

It can also be necessary to subset the data so that you are including only those observations (i.e. rows of data) that assist in answering your particular research question. For example, if you are interested in identifying demographic predictors of depression, but only among Type II diabetes patients, you would need to subset the data to observations endorsing Type II diabetes (i.e. diabetes2=1 or “yes”).

SPSS

**/SELECT**=diabetes2 EQ 1 (must be added as a command option)

Stata

**if** diabetes2==1 (put this at the end of the command)

SAS

**if** diabetes2=1; (put in the data step before sorting the data)

R

**>** title_of_subsetted_data **<-** title_of_data_set**[**title_of_data_set**$**diabetes2 == 1**,]**
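In R, rows are selected with a logical test on a column of the data set. One detail to watch, illustrated below on made-up data, is that a missing value in the test column produces an all-`NA` row in the result; wrapping the test in `which()` is one way to avoid that.

```r
# Hypothetical data: the third person has a missing diabetes2 value.
title_of_data_set <- data.frame(
  unique_id = 1:5,
  diabetes2 = c(1, 0, NA, 1, 0)
)

# A bare logical test keeps an extra all-NA row for the missing value.
with_na_rows <- title_of_data_set[title_of_data_set$diabetes2 == 1, ]

# which() returns only the row numbers where the test is TRUE.
title_of_subsetted_data <-
  title_of_data_set[which(title_of_data_set$diabetes2 == 1), ]

print(with_na_rows)
print(title_of_subsetted_data)
```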

## Descriptive statistics (one variable at a time)

Descriptive statistics are used to describe the basic features of individual variables. Also known as univariate analysis, descriptive statistics summarize one variable at a time, across the observations in your data set.

**Displaying frequency tables**

SPSS

**FREQUENCIES VARIABLES=**var1 var2 var3 **/ORDER=ANALYSIS.**

Stata

**tab1** var1 var2 var3

SAS

**PROC FREQ; tables** var1 var2 var3**;**

R

**> library(descr)**

**> freq(as.ordered(**title_of_data_set**$**VAR1**))**

**> freq(as.ordered(**title_of_data_set**$**VAR2**))**

**> freq(as.ordered(**title_of_data_set**$**VAR3**))**

**Central tendency**

SPSS

**DESCRIPTIVES VARIABLES=**var1 var2 var3

**/STATISTICS=MEAN STDDEV.**

Stata

**summarize** var1 var2 var3

SAS

**proc means; var** var1 var2 var3**;**

R

**> library(descr)**

**> freq(as.ordered(**title_of_data_set**$**var1**))**

**> freq(as.ordered(**title_of_data_set**$**var2**))**

**> freq(as.ordered(**title_of_data_set**$**var3**))**

(Or, for the mean and standard deviation:)

**> summary(**title_of_data_set**$**var1**)** # minimum, quartiles, median and mean

**> sd(**title_of_data_set**$**var1**, na.rm=T)** # standard deviation

## Descriptive statistics (comparing two variables)

Descriptive statistics can also show one variable in the context of a second (i.e. bivariate analysis).

**One categorical IV and one quantitative DV**

SPSS

**MEANS TABLES=**DV **by** IV **/CELLS MEAN COUNT STDDEV.**

Stata

**bys** IV**:** **summarize** DV

SAS

**proc sort; by** IV**; proc means; var** DV**; by** IV**;**

R

**> DV.byIV <- by(**title_of_data_set**$**DV**,** title_of_data_set**$**IV**, mean)**

**> DV.byIV** # for table

**> barplot(DV.byIV, beside=T)** # for plot

**One categorical IV and one categorical DV**

SPSS

**CROSSTABS /TABLES=**DV **by** IV **/CELLS=COUNT ROW COLUMN TOTAL.**

Stata

**tab** DV IV**, row column cell chi2**

SAS

**Proc freq; tables** DV*IV**;**

R

**> table(**title_of_data_set**$**DV**,** title_of_data_set**$**IV**)** # for table

**> prop.table(table(**title_of_data_set**$**DV**,** title_of_data_set**$**IV**))** # for cell percentages

**> prop.table(table(**title_of_data_set**$**DV**,** title_of_data_set**$**IV**),1)** # for row percentages

**> prop.table(table(**title_of_data_set**$**DV**,** title_of_data_set**$**IV**),2)** # for column percentages

**> barplot(prop.table(table(**title_of_data_set**$**DV**,** title_of_data_set**$**IV**),2)[**rows**,])** # for plots of column percentages

Note: If your IV is continuous, for graphing purposes, create meaningful categories and then use the code above.

## Descriptive statistics (adding a third variable)

**One categorical IV, one quantitative DV, and a categorical third variable**

SPSS

**MEANS TABLES=**DV **BY** IV **BY** THIRD_VAR **/CELLS MEAN COUNT STDDEV.**

Stata

**bys** IV third_var**:** **summarize** DV

SAS

**proc sort; by** IV THIRD_VAR**; proc means; var** DV**; by** IV THIRD_VAR**;**

R

**> ftable(by(**title_of_data_set**$**DV**, list(**title_of_data_set**$**IV**,** title_of_data_set**$**THIRD_VAR**), mean))** # to get table

**> barplot(by(**title_of_data_set**$**DV**, list(**title_of_data_set**$**IV**,** title_of_data_set**$**THIRD_VAR**), mean), beside=T)** # for plot

**One categorical IV, one categorical DV, and a categorical third variable**

SPSS

**CROSSTABS /TABLES=**DV **BY** IV **BY** THIRD_VAR**.**

Stata

**bys** IV third_var**:** **tab** DV

SAS

**proc sort; by** THIRD_VAR**; proc freq; tables** DV*IV**; by** THIRD_VAR**;**

R

**> ftable(**title_of_data_set**$**DV**,** title_of_data_set**$**IV**,** title_of_data_set**$**THIRD_VAR**)** # for table

**> prop.table(ftable(**title_of_data_set**$**DV**,** title_of_data_set**$**IV**,** title_of_data_set**$**THIRD_VAR**))** # for cell percentages

**> prop.table(ftable(**title_of_data_set**$**DV**,** title_of_data_set**$**IV**,** title_of_data_set**$**THIRD_VAR**),1)** # for row percentages

**> prop.table(ftable(**title_of_data_set**$**DV**,** title_of_data_set**$**IV**,** title_of_data_set**$**THIRD_VAR**),2)** # for column percentages

**> barplot(prop.table(ftable(**title_of_data_set**$**DV**,** title_of_data_set**$**IV**,** title_of_data_set**$**THIRD_VAR**),2)[**rows**,])** # for plots of column percentages

Note: If your third variable is continuous, create meaningful categories and then use the code above.

## Bivariate statistical tests

**T-test**

TBA

**Analysis of Variance (ANOVA)**

SPSS

**ONEWAY** QUAN_DV **BY** CAT_IV **/STATISTICS DESCRIPTIVES.**

Stata

**oneway** quan_DV cat_IV**, tabulate**

SAS

**proc anova; class** CAT_IV**; model** QUAN_DV = CAT_IV**; means** CAT_IV**;**

R

**> summary(aov(**DV **~** IV**, data=**title_of_data_set**))**

**Pearson Correlation**

A Pearson *correlation coefficient* evaluates the degree of linear relationship between two quantitative variables. It ranges from -1 to +1. A correlation of +1 means that there is a perfect, positive, linear relationship between the two variables. In other words, knowing the value of one variable, you can perfectly predict the value of the second.

SPSS

**CORRELATIONS /VARIABLES=**QUAN_IV QUAN_DV **/STATISTICS DESCRIPTIVES.**

Stata

**pwcorr** quan_IV quan_DV**, sig**

SAS

**Proc corr; var** QUAN_IV QUAN_DV**;**

R

**> cor.test(**title_of_data_set**$**DV**,** title_of_data_set**$**IV**)**
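As a small worked illustration of what the coefficient measures, the R sketch below computes Pearson's r by hand on five invented data points and compares it with the built-in functions. The vectors `x` and `y` are made up for this example.

```r
# Hypothetical paired measurements.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson's r: the sum of cross-products of deviations from the means,
# divided by the square root of the product of the sums of squares.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The built-in functions give the same estimate; cor.test() adds a p-value.
r_builtin <- cor(x, y)
result <- cor.test(x, y)

print(c(manual = r_manual, builtin = r_builtin))
print(result$p.value)
```

Here the cross-products sum to 6 and the sums of squares are 10 and 6, so r = 6 / sqrt(60), or about 0.77.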

**Chi-Square Test of Independence**

A *Chi-Square Test of Independence* compares frequencies of one categorical variable for different values of a second categorical variable. The null hypothesis is that the relative proportions of one variable are independent of the second variable; in other words, the proportions of one variable are the same for different values of the second variable. The alternative hypothesis is that the relative proportions of one variable are associated with the second variable.

SPSS

**CROSSTABS /TABLES=**CAT_DV **by** CAT_IV **/STATISTICS=CHISQ.**

Stata

**tab** cat_dv cat_iv**, row col chi2**

SAS

**Proc freq; tables** CAT_DV*CAT_IV **/chisq;**

R

**> chisq.test(**title_of_data_set**$**DV**,** title_of_data_set**$**IV**)**
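The null hypothesis can be made concrete by computing the counts expected under independence. The R sketch below does this on invented data (the vectors `iv` and `dv` are hypothetical). Note that for a 2x2 table `chisq.test()` applies a continuity correction by default; `correct = FALSE` is set here so that the result is the uncorrected Pearson statistic.

```r
# Hypothetical data: 30 people in group "a", 70 in group "b".
iv <- rep(c("a", "b"), times = c(30, 70))
dv <- c(rep("yes", 10), rep("no", 20),   # group a
        rep("yes", 30), rep("no", 40))   # group b

observed <- table(dv, iv)

# Uncorrected Pearson chi-square test.
test <- chisq.test(observed, correct = FALSE)

# Under independence, each expected count is
# (row total * column total) / grand total.
expected_by_hand <- outer(rowSums(observed), colSums(observed)) / sum(observed)

print(observed)
print(expected_by_hand)
print(test$statistic)
```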

## Multivariate statistical tests

**Multiple Regression**

Multiple regression is used when the DV (aka outcome variable) is quantitative.

SPSS

**REGRESSION /DEPENDENT** QUAN_DV **/METHOD=ENTER** IV THIRDVAR1 THIRDVAR2**.**

Stata

**reg** quan_DV IV THIRDVAR1 THIRDVAR2

SAS

**Proc reg; model** QUAN_DV=IV THIRDVAR1 THIRDVAR2**;**

R

**> summary(lm(**DV **~** IV **+** THIRDVAR1 **+** THIRDVAR2**, data=**title_of_data_set**))**

**Logistic Regression**

Logistic regression is used when the DV (aka outcome variable) is binary/dichotomous. Note that if the dependent variable is categorical, with more than two levels, it must be dichotomized (i.e. made into a two-level variable), so that logistic regression can be used.

SPSS

**LOGISTIC REGRESSION** BINARY_DV **with** IV THIRDVAR1 THIRDVAR2**.**

Stata

**logistic** binary_DV IV thirdvar1 thirdvar2

SAS

**Proc logistic; class** IV THIRDVAR1**;** (include the class statement when these variables are categorical) **model** BINARY_DV=IV THIRDVAR1 THIRDVAR2**;**

R

**> library(rms)** (the successor to the Design package, which is no longer available from CRAN)

**> my.ddist <- datadist(**title_of_data_set**)**

**> options(datadist="my.ddist")**

**> lrm(**DV **~** IV **+** THIRDVAR1 **+** THIRDVAR2**, data=**title_of_data_set**)** # for p-values

**> summary(lrm(**DV **~** IV **+** THIRDVAR1 **+** THIRDVAR2**, data=**title_of_data_set**))** # for odds ratios
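If an add-on package is not available, a logistic regression can also be fit with base R's `glm()`. This is an alternative to the syntax above rather than a replacement for it. The sketch uses R's built-in `mtcars` data purely as a stand-in, with `am` (0/1) playing the role of the binary DV and `wt` and `hp` the role of the predictors.

```r
# Fit a logistic regression with base R; family = binomial requests the logit link.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Coefficients (on the log-odds scale), standard errors and p-values.
print(summary(fit)$coefficients)

# Exponentiating the coefficients gives odds ratios.
odds_ratios <- exp(coef(fit))
print(odds_ratios)
```

An odds ratio below 1 means the odds of the outcome decrease as that predictor increases, holding the other predictors constant.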