ManyLabs2 Data Analysis

ManyLabs2 (Corresponding coder: Fred Hasselman)

24 October 2015

Analysis Strategy

The ManyLabs2 data analysis strategy is built around three principles that maximise research transparency:

  1. Principle of Equality: All data should be treated equally by the code. That is, the code should do its job of generating results while being as naive as possible about the particular facts of the study being analysed. This reduces the chance of bias with respect to the outcomes of a certain dataset or a particular study. If it is necessary to add study-specific code, the second principle should be regarded.

  2. Principle of Transparency: All operations that are crucial for obtaining an analysis result should be available for inspection by anyone who wishes to do so. This should be possible without the help of the authors who generated the code. The operations concern the application of data filtering rules, the computation of variables derived from original measurements, running an analysis, and constructing graphs, tables and figures. If full transparency is not possible, the third principle should be regarded.

  3. Principle of Reproducibility: The most basic requirement is that analysis results should be reproducible given the original code and the original dataset. Moreover, any new implementation of the same analysis strategy in a different context, or application of the code to a different dataset, e.g. a replication study, should not be problematic. That is, outcomes may differ between datasets, but this should not be attributable to any details of the code or the analysis strategy.

R as a parser of online code.

The pre-registered ManyLabs2 protocol describes a number of analyses per replication study that can be categorised as Primary (target replications per site), Secondary (additional analyses per site, e.g. on subgroups), and Global (analyses on the entire dataset).

These promised analyses have all been implemented in R in a transparent way and this implementation is now ready for an independent review.

Implementation

Functions available in an R package on GitHub (PDF manual) extract information and instructions about each promised analysis from an openly accessible table: the masteRkey spreadsheet.

Each row in the table represents an analysis; the columns contain specific information about that analysis.

The R package manylabRs contains the R functions that can read the information from this sheet and conduct analyses on the data.

Install the package

There are several ways to install the package.

Source from GitHub

Use the code below to install the manylabRs package directly from GitHub.

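# If devtools is not yet installed: install.packages("devtools")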
library(devtools)
install_github("ManyLabsOpenScience/manylabRs")
library(manylabRs)
library(tidyverse)
library(broom)

Download tarball from GitHub

First download the tarball, then install the package locally through the RStudio package installer: Tools >> Install Packages...
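Alternatively, the tarball can be installed from the R console. A minimal sketch, where the file name is a placeholder for the version you downloaded:

# Install a downloaded source tarball from the local file system
install.packages("manylabRs_0.1.0.tar.gz", repos = NULL, type = "source")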

Main function: get.analyses()

The main function to inspect is get.analyses().

It takes one or more analyses (studies) from the masteRkey sheet and an indication of the analysis type:

  1. global - disregards the clusters in the data and uses all valid cases for analysis; both primary and secondary analyses have a global variant.
  2. primary - the target analysis of the replication study, conducted for each lab separately.
  3. secondary - additional analyses, conducted for each lab separately.
  4. order - presentation order analyses, which disregard the clusters in the data; each presentation order is analysed separately.
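For example, the per-lab primary variant of the same analysis could be requested as sketched below, assuming analysis.type follows the numbering above (2 = primary); MyRootDir and MyRawDataDir are placeholders explained under IMPORTANT FOR REVIEWERS.

# Sketch: run the primary (per-lab) variant of analysis 1
df.primary <- get.analyses(studies       = 1,
                           analysis.type = 2,
                           rootdir       = MyRootDir,
                           indir         = list(RAW.DATA   = MyRawDataDir,
                                                MASTERKEY  = "",
                                                SOURCEINFO = ""))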

Have a look at saveConsole.R, which calls the testScript() function and creates a log file with lots of info about the analysis steps.

IMPORTANT FOR REVIEWERS

You will have to point the function get.analyses() to where you downloaded these files:

ML2_RawData_S1.rds
ML2_RawData_S2.rds

The script assumes the data are in a subdirectory of a root directory; both are passed as arguments to get.analyses().

The example below runs a global analysis for Huang.1:

df <- get.analyses(studies       = 1, 
                   analysis.type = 1, 
                   rootdir       = MyRootDir,
                   indir = list(RAW.DATA   = MyRawDataDir, 
                                MASTERKEY  = "", 
                                SOURCEINFO = ""))
## 
##  
##  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~get.analyses~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
##  # Downloaded keytable Googlesheet: ML2_masteRkey [https://docs.google.com/spreadsheets/d/1fqK3WHwFPMIjNVVvmxpMEjzUETftq_DmP5LzEhXxUHA/]
##  # 
##  # Downloaded data from OSF: 'ML2_RawData_S1.rds' and 'ML2_RawData_S2.rds'
##  # 
##  # Downloaded information about the data sources from Googlesheet: 'ML2_SourceInfo.xlsx' [https://docs.google.com/spreadsheets/d/1Qn_kVkVGwffBAmhAbpgrTjdxKLP1bb2chHjBMVyGl1s/]
##  # 
##  # Analyzing studies in study.global.include
##  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## 
##  
##  ~~~~~~HUANG.1~~~~~~
##  # 1 Huang.1 - START
## 
##  # 
##  # 1 Huang.1 - all: var.equal set to: FALSE
##  # 
##  # 1 Huang.1 - COMPLETED
##  ~~~~~~~~~~~~~~~~~~~~~~~

The object df contains two named lists (see note 1):

raw.case

This list contains data frames with the relevant variables for each analysis, before the analysis-specific variable functions (varfun) are applied. The Boolean variable case.include indicates whether a case is valid and should be included in the analysis.

head(tbl_df(df$raw.case$Huang.1))
## # A tibble: 6 x 25
##                                                     .id   source
##                                                   <chr>    <chr>
## 1 ML2_Slate1_Brazil__Portuguese_execution_illegal_r.csv brasilia
## 2 ML2_Slate1_Brazil__Portuguese_execution_illegal_r.csv brasilia
## 3 ML2_Slate1_Brazil__Portuguese_execution_illegal_r.csv brasilia
## 4 ML2_Slate1_Brazil__Portuguese_execution_illegal_r.csv brasilia
## 5 ML2_Slate1_Brazil__Portuguese_execution_illegal_r.csv brasilia
## 6 ML2_Slate1_Brazil__Portuguese_execution_illegal_r.csv brasilia
## # ... with 23 more variables: huan1.1_Y1 <dbl>, huan1.1_R0 <int>,
## #   huan2.1_Y1 <dbl>, huan2.1_R0 <int>, Source.Global <chr>,
## #   Source.Primary <chr>, Source.Secondary <chr>, Country <chr>,
## #   Location <chr>, Language <chr>, Weird <int>, Execution <chr>,
## #   SubjectPool <chr>, Setting <chr>, Tablet <chr>, Pencil <chr>,
## #   StudyOrderN <chr>, IDiffOrderN <chr>, uID <int>, study.order <int>,
## #   analysis.type <fctr>, subset <fctr>, case.include <lgl>
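Because case.include is a plain logical column, standard dplyr verbs can be used to tally or select the valid cases. A usage sketch:

# Tally valid versus excluded cases for Huang.1
dplyr::count(df$raw.case$Huang.1, case.include)

# Keep only the cases that enter the analysis
huang1.valid <- dplyr::filter(df$raw.case$Huang.1, case.include)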

aggregated

The data frame in aggregated contains the data as it was analysed, after the varfun is applied.

glimpse(tbl_df(df$aggregated$Huang.1))
## Observations: 1
## Variables: 116
## $ study.id                   <int> 1
## $ study.slate                <int> 1
## $ study.name                 <fctr> Huang
## $ study.source               <fctr> all
## $ analysis.type              <fctr> Global
## $ analysis.name              <fctr> Huang.1
## $ stat.N                     <int> 6591
## $ stat.n1                    <int> 3331
## $ stat.n2                    <int> 3260
## $ stat.cond1.name            <chr> "High"
## $ stat.cond1.column          <chr> "ignore"
## $ stat.cond1.n               <dbl> 3331
## $ stat.cond1.mean            <dbl> 11.70344
## $ stat.cond1.sd              <dbl> 84.31498
## $ stat.cond1.median          <dbl> 0
## $ stat.cond1.trimmed         <dbl> 13.03014
## $ stat.cond1.mad             <dbl> 84.5082
## $ stat.cond1.min             <dbl> -185.87
## $ stat.cond1.max             <dbl> 178
## $ stat.cond1.range           <dbl> 363.87
## $ stat.cond1.skew            <dbl> -0.0797105
## $ stat.cond1.kurtosis        <dbl> -0.541009
## $ stat.cond1.se              <dbl> 1.46089
## $ stat.cond2.name            <chr> "Low"
## $ stat.cond2.column          <chr> "ignore"
## $ stat.cond2.n               <dbl> 3260
## $ stat.cond2.mean            <dbl> -22.69856
## $ stat.cond2.sd              <dbl> 88.77561
## $ stat.cond2.median          <dbl> -16
## $ stat.cond2.trimmed         <dbl> -26.39949
## $ stat.cond2.mad             <dbl> 102.2994
## $ stat.cond2.min             <dbl> -186
## $ stat.cond2.max             <dbl> 176.28
## $ stat.cond2.range           <dbl> 362.28
## $ stat.cond2.skew            <dbl> 0.3025972
## $ stat.cond2.kurtosis        <dbl> -0.6416739
## $ stat.cond2.se              <dbl> 1.554837
## $ test.type                  <fctr> t
## $ test.estimate              <dbl> 34.402
## $ test.estimate1             <dbl> 11.70344
## $ test.estimate2             <dbl> -22.69856
## $ test.statistic             <dbl> 16.12487
## $ test.p.value               <dbl> 2.146219e-57
## $ test.parameter             <dbl> 6554.05
## $ test.conf.low              <dbl> 30.21969
## $ test.conf.high             <dbl> 38.58431
## $ test.method                <chr> "Welch Two Sample t-test"
## $ test.alternative           <fctr> two.sided
## $ test.estype                <chr> "t"
## $ test.varequal              <lgl> FALSE
## $ ESCI.ncp                   <dbl> 16.12487
## $ ESCI.ncp.lo                <dbl> 14.14495
## $ ESCI.ncp.hi                <dbl> 18.10357
## $ ESCI.N.total               <dbl> 6591
## $ ESCI.n.1                   <dbl> 3331
## $ ESCI.n.2                   <dbl> 3260
## $ ESCI.d                     <dbl> 0.39726
## $ ESCI.var.d                 <dbl> 0.00062
## $ ESCI.l.d                   <dbl> 0.34848
## $ ESCI.u.d                   <dbl> 0.44601
## $ ESCI.U3.d                  <dbl> 65.44124
## $ ESCI.cl.d                  <dbl> 61.06087
## $ ESCI.cliffs.d              <dbl> 0.22122
## $ ESCI.pval.d                <dbl> 0
## $ ESCI.g                     <dbl> 0.39722
## $ ESCI.var.g                 <dbl> 0.00062
## $ ESCI.l.g                   <dbl> 0.34844
## $ ESCI.u.g                   <dbl> 0.44596
## $ ESCI.U3.g                  <dbl> 65.43957
## $ ESCI.cl.g                  <dbl> 61.05964
## $ ESCI.pval.g                <dbl> 0
## $ ESCI.r                     <dbl> 0.19484
## $ ESCI.var.r                 <dbl> 0.00014
## $ ESCI.l.r                   <dbl> 0.17167
## $ ESCI.u.r                   <dbl> 0.21768
## $ ESCI.pval.r                <dbl> 0
## $ ESCI.fisher.z              <dbl> 0.19737
## $ ESCI.var.z                 <dbl> 0.00015
## $ ESCI.l.z                   <dbl> 0.17339
## $ ESCI.u.z                   <dbl> 0.22122
## $ ESCI.OR                    <dbl> 2.05557
## $ ESCI.l.or                  <dbl> 1.88152
## $ ESCI.u.or                  <dbl> 2.2456
## $ ESCI.pval.or               <dbl> 0
## $ ESCI.lOR                   <dbl> 0.72055
## $ ESCI.l.lor                 <dbl> 0.63208
## $ ESCI.u.lor                 <dbl> 0.80897
## $ ESCI.pval.lor              <dbl> 0
## $ ESCI.cohensQ               <lgl> NA
## $ ESCI.cohensQ.l             <lgl> NA
## $ ESCI.cohensQ.u             <lgl> NA
## $ ESCI.bootR1                <lgl> NA
## $ ESCI.bootR2                <lgl> NA
## $ ESCI.bootCI.l              <lgl> NA
## $ ESCI.bootCI.u              <lgl> NA
## $ test.ConsoleOutput         <fctr> 
##  Welch Two Sample t-test
## 
## data:  ...
## $ source.N.sources.global    <int> 64
## $ source.N.sources.primary   <int> 64
## $ source.N.sources.secondary <int> 74
## $ source.N.countries         <int> 28
## $ source.N.locations         <int> 59
## $ source.N.languages         <int> 15
## $ source.Pct.WEIRD           <dbl> 72.19608
## $ source.Tbl.Execution       <chr> "Execution\nillegal   legal \n   33...
## $ source.Tbl.subjectpool     <chr> "SubjectPool\n  No  Yes \n2368 4111 "
## $ source.Tbl.setting         <chr> "Setting\n         In a classroom  ...
## $ source.Tbl.Tablet          <chr> "Tablet\n                          ...
## $ source.Tbl.Pencil          <chr> "Pencil\nNo, the whole study was on...
## $ source.N.studyorders1      <int> 6479
## $ source.N.IDiffOrderN       <int> 782
## $ source.N.uIDs              <int> 6589
## $ source.N.studyorders2      <int> 13
## $ source.Tbl.analysistype    <chr> "analysis.type\nGlobal \n  6589 "
## $ source.Tbl.subset          <chr> "subset\n all \n6589 "
## $ source.N.cases.included    <int> 6589
## $ source.N.cases.excluded    <int> 674

The output contains descriptive statistics, sample summary characteristics, and a variety of effect size measures. It also contains the console output of the statistical test that was conducted:

cat(paste0(df$aggregated$Huang.1$test.ConsoleOutput))
## 
##  Welch Two Sample t-test
## 
## data:  High and Low
## t = 16.125, df = 6554.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  30.21969 38.58431
## sample estimates:
## mean of x mean of y 
##  11.70344 -22.69856
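The same results can be extracted programmatically. For example, the key test statistics and Cohen's d with its confidence interval can be pulled from the columns shown in the glimpse above (a usage sketch):

# Extract the main test results and effect size for Huang.1
df$aggregated$Huang.1 %>%
  dplyr::select(analysis.name, stat.N, test.statistic, test.p.value,
                ESCI.d, ESCI.l.d, ESCI.u.d)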

A closer look: Four steps to get from data to results

The code of the function get.analyses() is not very readable because it includes a lot of error checking, error reporting, and conditional statements and… well… because we are not professional software engineers, but scientists doing the best we can.

The function was created to handle a batch of many different analyses in one go and save the results to many different files. Here, we will skip the error checks and file saving and focus on the four major steps, taking the first analysis in the masteRkey spreadsheet, Huang.1, and analysing the data from the source named brasilia.

The four steps that are applied to all analyses in ManyLabs2 are:

  1. Collect information about the analysis from the masteRkey spreadsheet.
  2. Extract the data and reorganise it so it can be passed to the analysis function.
  3. Conduct the analysis.
  4. Extract the results.

Step 1 - Collect information

The first step is always to use the masteRkey spreadsheet to gather all the information needed to:

  * Extract the appropriate data (variables and cases) for this analysis and source.
  * Conduct the appropriate analysis on the extracted data.

# NOTE: This example follows some (but not all!!) steps of the main function: get.analyses()

# Select analysis 1 [Huang.1] and source 'brasilia'
runningAnalysis <- 1
runningGroup    <- 'brasilia'

# Get information about the analysis to run
masteRkeyInfo  <- get.GoogleSheet(data='ML2masteRkey')$df[runningAnalysis,]

# Get the appropriate 'raw' dataset [Slate1 or Slate2]
if (masteRkeyInfo$study.slate == 1) data(ML2.S1) else data(ML2.S2)
# Organise the information into a list object
analysisInfo   <- get.info(masteRkeyInfo, colnames(ML2.S1), subset="all")

# Use analysisInfo to generate a chain of filter instructions to select valid variables and cases
filterChain <- get.chain(analysisInfo)

Let’s have a look at the filterChain object:

filterChain
## $df
## [1] " %>% dplyr::select(1,6,149,150,156,157,520,521,522,523,524,525,526,527,528,529,530,531,534,535) %>% dplyr::filter(is.character(source))"
## 
## $vars
## $vars$Condition.High
## [1] "High %>% filter(huan1.1_Y1 > 0 & !is.na(huan1.1_Y1) & huan1.1_R0 > 0 & !is.na(huan1.1_R0))"
## 
## $vars$Condition.Low
## [1] "Low %>% filter(huan2.1_Y1 > 0 & !is.na(huan2.1_Y1) & huan2.1_R0 > 0 & !is.na(huan2.1_R0))"

It contains two fields:

  * $df: a dplyr command for selecting the appropriate variables and (if applicable) filtering on source characteristics indicated by the column masteRkey$source.include (e.g. whether the data were collected online). For the present analysis we do not have to filter on source characteristics.
  * $vars: filter commands for selecting the valid rows (cases) in each condition.

Step 2 - Extract the data

# Apply the filterChain to select the appropriate variables from ML2.S1
df.raw <- eval(parse(text=paste("ML2.S1", filterChain$df)))

# Apply the filterChain to generate a list object that represents the design cells
df.split <- get.sourceData(filterChain, df.raw[df.raw$source %in% runningGroup,], analysisInfo)

# Create a list object with data vectors and appropriate labels that can be passed to the analysis function
vars <- eval(parse(text=paste0(masteRkeyInfo$stat.vars,'(df.split)',collapse="")))
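Before running the test, the extracted data can be checked. A minimal sketch, assuming vars is a named list of data vectors and labels as described above:

# Quick sanity check of the extracted data vectors and labels
str(vars, max.level = 1)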

Step 3 - Conduct the analysis

# Get the parameters to use for the statistical analysis
stat.params <<- analysisInfo$stat.params

# Run the analysis listed in masteRkey column 'stat.test' using the data vectors in 'vars'
stat.test   <- with(vars, eval(parse(text = masteRkeyInfo$stat.test)))
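For Huang.1 the evaluated test is a Welch two-sample t-test, so stat.test is a standard htest object. It can be printed directly or flattened with broom, which was loaded earlier (a usage sketch):

# Flatten the htest object into a one-row data frame
broom::tidy(stat.test)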

Step 4 - Extract the results

# Return descriptives and summaries
describe <- get.descriptives(stat.test = stat.test, vars = vars, keytable  = masteRkeyInfo)

# Generate output
ESCI <- generateOutput(describe = describe, runningGroup = runningGroup, runningAnalysis = runningAnalysis)
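The structure of the returned object can be inspected in the usual way; a minimal sketch (the exact fields depend on generateOutput(), which presumably produces the ESCI.* values shown in the aggregated output above):

# Inspect the top-level structure of the generated results
str(ESCI, max.level = 1)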

Other algorithms


  1. The element names correspond to the analysis name in the masteRkey spreadsheet (e.g. Huang.1).