NOTE: Code review has completed.


This proposal has been downloaded and registered – If any further edits are identified please e-mail raklein@ufl.edu

Many Labs 2: Investigating Variation in Replicability Across Sample and Setting

Procedure

We pretested the amount of time required for each of the 28 effects, and created two slates of 13 and 15 effects that each required approximately 30 minutes to complete including demographics, instructions, and individual difference measures. We divided the studies across slates to be balanced on the criteria above and to avoid substantial overlap in topics. Participating labs will be assigned to complete one of the two slates, with eight labs volunteering to do both slates. As such, we expect that the effects in slates 1 and 2 will be examined by 57 samples each. Effects will be administered by a single experiment script that begins with informed consent, then presents the effects in that slate in a fully randomized order at the level of participants, then does the same for the individual difference measures, and then closes with demographics measures and debriefing. The studies will be conducted following approval of human subjects review boards. The Appendix shows the selected effects and a summary of the two slates on some of the selection criteria.

The original completed slates had 32 effects before peer review and pilot testing. One effect was removed during peer review at the request of the original authors. With the remaining 31 effects, we pilot tested both slates with participation among the collaborative team and their labs to ensure that each slate could be completed within 30 minutes. We observed that we underestimated the time required for a few effects.

As a consequence, we had to remove three effects (Ashton-James, Maddux, Galinsky, & Chartrand, 2009; Srull & Wyer, 1979; Todd, Hanko, Galinsky, & Mussweiler, 2011), shorten or remove a few individual difference measures, and slightly reorganize the slates to achieve the final 28 included effects.

Demographics

A few demographics will be included for characterizing each sample and as data for possible moderator investigations.

  • Age. Participants note their age in years in an open-response box.

  • Sex. Participants can select “male” or “female” to indicate their biological sex.

  • Race/ethnicity. Participants indicate race/ethnicity by selecting from a drop-down menu populated with options determined by the replication lead for each site. Participants can also select “other” and write an open-response. Note that response items will not be standardized as some countries have very different definitions of race/ethnicity, and in some cases these terms have little meaning and the item will be omitted.

  • Cultural origins. Three items assessing cultural origins used a drop-down menu populated by a long list of countries or territories. Translated versions may just list the most probable responses as determined by the local researcher, in which case “other” will be included as a response option, and participants will be provided an open-response box. The three items were:
    1. In which country/region were you born?,
    2. In which country/region was your primary caregiver (e.g., parent, grandparent) born?, and
    3. If you had a second primary caregiver, in which country/region was he or she born?
  • Home town. A single item “What is the name of your home town/city?” with an open-response blank will be included as another potential variable of interest for the Huang et al., 2014 effect.

  • Wealth in home town. A single item “Where do wealthier people live in your home town/city?” with North, South, and Neither as response options will be included in demographics as a potential moderator of the Huang et al., 2014 effect. This item will be included only in Slate 1.

  • Political ideology. Participants rate their political ideology on a scale with response options of: strongly left-wing, moderately left-wing, slightly left-wing, moderate, slightly right-wing, moderately right-wing, strongly right-wing. Instructions are adapted for each country of administration to ensure relevance of the ideology dimension to the local context. For example, the U.S. instructions read: “Please rate your political ideology on the following scale. In the United States, ‘liberal’ is usually used to refer to left-wing and ‘conservative’ is usually used to refer to right-wing.”

  • Education. Participants report their educational attainment on a single item “What is the highest educational level that you have attained?” using a 6-point response scale: 1 = No formal education, 2 = completed primary/elementary school, 3 = completed secondary school/high school, 4 = some university/college, 5 = completed university/college degree, 6 = completed advanced degree.

  • Socio-economic status (Adler, Boyce, Chesney, Cohen, Folkman, Kahn, & Syme, 1994). Socio-economic status will be measured with the ladder technique (Adler et al., 1994). Participants are asked to indicate their standing in their community relative to other people in the community with which they most identify on a ladder with ten steps where 1 indicates people at the bottom having the lowest standing in the community and 10 referring to people at the top having the highest standing (see here). Previous research demonstrated good convergent validities of this item with objective criteria of individual social status and also construct validity with regard to several psychological and physiological health indicators (e.g., Adler, Epel, Castellazzo, & Ickovics, 2000; Cohen, Alper, Doyle, Adler, Treanor, & Taylor, 2008). This ladder is also used in Effect 12 in Slate 1 (Anderson, Kraus, Galinsky & Keltner, 2012, Study 3). Participants in that slate will not receive the ladder item a second time. Use of the ladder item in that slate should account for its use as a dependent variable in that study.

  • Data quality. Recent research in the area of careless or insufficient effort responding has moved toward refining implementation of established scales embedded in data collection to check for aberrant response patterns (Huang et al., 2014, Meade & Craig, 2012). To further research in this area, the current project will include two items at the end of the study, just prior to demographic items. The first item asks participants “In your honest opinion, should we use your data in our analyses in this study?” and has yes/no response options (Meade & Craig, 2012). The second item is an Instructional Manipulation Check (IMC; Oppenheimer, Meyvis, & Davidenko, 2009), also used in the first Many Labs project (Klein et al., 2014). The IMC will be modified to fit the format of the current project.

Individual Difference Measures

The following individual difference measures will be presented in a randomized order after all target effects are completed, and right before the demographics items. These measures will be particularly useful in tests for moderation of effect sizes.

  • Cognitive reflection (Finucane & Gullion, 2010). The cognitive reflection task (CRT; Frederick, 2005) assesses individuals’ ability to suppress an intuitive (wrong) response in favor of a deliberative (correct) answer. The items on the original CRT are widely known, and the measure is vulnerable to practice effects (Chandler, Mueller & Paolacci, 2014). As such, we use an updated version that is logically equivalent and correlates highly with the items on the original CRT (Finucane & Gullion, 2010). The three items are: (1) If it takes 2 nurses 2 minutes to measure the blood pressure of 2 patients, how long would it take 200 nurses to measure the blood pressure of 200 patients?; (2) Soup and salad cost $5.50 in total. The soup costs a dollar more than the salad. How much does the salad cost?; and, (3) Sally is making tea. Every hour, the concentration of the tea doubles. If it takes 6 hours for the tea to be ready, how long would it take for the tea to reach half of the final concentration? Also, we will constrain the total time available to answer the three questions to 75 seconds. This will likely lower overall performance on average as it is somewhat faster than performance by some participants in pretesting.

  • Subjective well-being (Veenhoven, 2009). Subjective well-being is measured with a single item “All things considered, how satisfied are you with your life as a whole these days?” on a response scale from 1 “dissatisfied” to 10 “satisfied”. Similar items are included into numerous large-scale social surveys (cf. Veenhoven, 2009) and have shown satisfactory reliabilities (e.g., Lucas & Donnellan, 2012) and validities (Cheung & Lucas, 2014; Oswald & Wu, 2010; Sandvik, Diener, & Seidlitz, 1993).

  • Global self-esteem (Robins, Hendin, & Trzesniewski, 2001). Global self-esteem is measured using a Single-Item Self-Esteem Scale (SISE) designed as an alternative to using the Rosenberg Self-Esteem Scale (1965). The SISE consists of a single item: “I have high self-esteem”. Participants respond on a 5-point Likert scale, ranging from 1 = not very true of me to 5 = very true of me. Robins, Hendin, and Trzesniewski (2001) reported strong convergent validity with the Rosenberg Self-Esteem Scale (with rs ranging from .70 to .80) among adults. Also, the scale had predictive validity similar to that of the Rosenberg Self-Esteem Scale.

  • TIPI for Big-Five personality (Gosling, Rentfrow, & Swann, 2003). The five basic traits of human personality (Goldberg, 1981) – conscientiousness, agreeableness, neuroticism / emotional stability, openness / intellect, and extraversion – are measured with the Ten Item Personality Inventory (Gosling et al., 2003). Each trait is assessed with two items on seven-point response scales from 1 = disagree strongly to 7 = agree strongly. The scale has been translated into several languages including, among others, German (Muck, Hell, & Gosling, 2007), Dutch (Hofmans, Kuppens, & Allik, 2008), Spanish and Catalan (Renau et al., 2013), and Japanese (Oshio, Abe, Cutrone, & Gosling, 2013); for further translations, see Gosling (2014). The five scales show satisfactory retest reliabilities (cf. Gnambs, 2014) and substantial convergent validities with longer Big Five instruments (e.g., Ehrhart et al., 2009; Gosling et al., 2003; Rojas & Widiger, 2014).

  • Mood (Cohen, Sherman, Bastardi, Hsu, McGoey, & Ross, 2007). Many assessments of mood exist. We selected the single item from Cohen and colleagues (2007). Respondents answer “How would you describe your mood right now?” on a 5-point response scale: 1 = extremely bad, 2 = bad, 3 = neutral, 4 = good, 5 = extremely good.

  • Disgust Sensitivity Scale–Contamination Subscale (DS-R; Olatunji et al., 2007). The DS-R is a 25-item revision of the original Disgust Sensitivity Scale (Haidt, McCauley, & Rozin, 1994). Subscales of the DS-R were determined by factor analysis. The contamination subscale includes the 5 items related to concerns about bodily contamination. The contamination subscale is included for Effect 10 in Slate 1. It will not appear in Slate 2.

Overall Analysis Plan

Each effect will be analyzed according to the analysis plan specified below, including decision rules for data exclusion. Descriptively, the primary effect of interest is the variability in effect size for each effect across the sample of samples. The analysis for this interest will closely follow the procedure described in the first Many Labs paper and will produce a figure similar to Figure 1 of that article, showing for each effect:

  1. an aggregate effect estimate across all samples,
  2. a 99% confidence interval for the aggregate effect estimate,
  3. display of the effect estimates for each individual sample, and
  4. comparative display of the original effect estimate.

The latter will probably have two data points in cases where only a subset of the samples or participants is anticipated a priori to be comparable to the original. In that case, the figure will show the effect size for the restricted sample, as the direct comparison with the original effect size, alongside the effect size of the full sample.

Aggregate examination of the variability in effect estimates will use established meta-analytic statistics tau^2, Q and I^2 to determine if the amount of variability across samples exceeds that expected by random error. Because the study procedures are nearly identical (except for language translations), any variation exceeding random error is likely to be due to effects of sample or setting. This is the primary outcome of interest for each effect included in this study.
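As an illustration only, the heterogeneity statistics named above can be sketched as follows. The proposal does not specify an estimator for tau^2; this sketch assumes the DerSimonian-Laird method and a normal-approximation 99% confidence interval, and the function name and input format (one effect estimate and one sampling variance per sample) are hypothetical.

```python
import math
from statistics import NormalDist


def heterogeneity(effects, variances, ci_level=0.99):
    """Return Q, I^2 (in percent), tau^2, and a random-effects mean with its CI.

    Assumption: DerSimonian-Laird estimator for tau^2 (not stated in the proposal).
    effects   -- per-sample effect size estimates
    variances -- per-sample sampling variances of those estimates
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed_mean = sum(wi * y for wi, y in zip(w, effects)) / sum(w)

    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (y - fixed_mean) ** 2 for wi, y in zip(w, effects))
    df = k - 1

    # DerSimonian-Laird between-sample variance, truncated at zero
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)

    # I^2: share of total variability beyond what sampling error predicts
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects aggregate estimate and its confidence interval
    w_re = [1.0 / (v + tau2) for v in variances]
    re_mean = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    z = NormalDist().inv_cdf(0.5 + ci_level / 2.0)
    return {
        "Q": q,
        "df": df,
        "I2": i2,
        "tau2": tau2,
        "mean": re_mean,
        "ci": (re_mean - z * se, re_mean + z * se),
    }
```

A Q value well above its degrees of freedom, and a correspondingly high I^2, would indicate variability across samples beyond random error.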

In the aggregate analysis, we expect the effects identified a priori as ones that vary across samples and settings (i.e., ones with existing evidence for cultural variation) to show variability exceeding that which can be attributed to random error (i.e., to show higher values of I^2). Also, upon confirmation of the final study design, we will survey all project contributors for their predictions of relative average effect magnitude and variation in effect size across samples and settings across the effects. These predictions will be compared with the actual effect sizes and variation in those effect sizes. Further, a collaborative team may conduct a prediction market for the effects included in Many Labs 2. That will be conducted independently of the main report for this project.

In addition to the focal research questions, the order of presentation is an obvious procedural factor that may moderate effect sizes. Across the 30-minute session, effects may weaken if participants tire or if prior effects interfere with later effects. Although we did not observe this in the first Many Labs investigation, it is nonetheless a plausible moderator and cannot be ignored. Therefore, we will first examine whether each effect size differs as a function of its placement in the study procedure. To do so, we will select for each effect size the data at rank order K across all locations, where K varies from 1 (presented first) to 16 (presented last). For each value of K we will estimate each effect size, and represent both the average effect size and its variability as a function of K. With a moderator analysis we will examine whether a linear or quadratic trend in order is present. This regression trend indicates whether the overall or individual effect sizes are sensitive to order. If no trend of order is observed in the moderation analysis, then reporting the overall effect size is sufficient. Otherwise, we will report the range of effects across orders, with a focus on the effect size when the effect was administered first as the “purest” assessment of the effect magnitude without the influence of fatigue or any particular interference from one or more effects. Examination of specific interference effects is left to the follow-up commentaries on the main article (discussed next).
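One way to picture the order analysis is a weighted regression of the effect size estimated at each position K on linear and quadratic terms of K. The sketch below is a simplification under stated assumptions: it uses fixed inverse-variance weights rather than a full meta-regression, returns coefficients only (no significance tests), and all function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_order_trend(positions, effects, variances):
    """Weighted least squares of effect size on centered order position.

    Returns [intercept, linear, quadratic] coefficients.
    Assumption: inverse-variance weights; needs at least three distinct positions.
    """
    w = [1.0 / v for v in variances]
    center = sum(positions) / len(positions)
    rows = [[1.0, p - center, (p - center) ** 2] for p in positions]
    # Normal equations: (X' W X) beta = X' W y
    xtwx = [
        [sum(wi * r[i] * r[j] for wi, r in zip(w, rows)) for j in range(3)]
        for i in range(3)
    ]
    xtwy = [sum(wi * r[i] * y for wi, r, y in zip(w, rows, effects)) for i in range(3)]
    return solve_linear(xtwx, xtwy)
```

Linear and quadratic coefficients near zero would be consistent with no order trend, in which case the overall effect size would be reported.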

Additional Analysis via Commentaries

The amassed dataset will be very rich for exploring the individual effects, potential interactions between specific effects, and alternate ways to analyze the aggregate data. Our analysis plan focuses on the big picture and not, for example, on exploring potential moderating influences on each of the individual effects. These are worthy analyses, but putting everything into one paper would be overwhelming.

Instead, we proposed an adaptation of the Registered Replication Reports format at Perspectives on Psychological Science, the intended outlet for this manuscript. The main paper will be authored by the entire community of researchers contributing to Many Labs 2. As such, micro-publications of each data collection will not be necessary, as is the present format. Instead, we proposed that the Editors solicit commentaries on the main paper that can include new data analysis. These commentaries would likely (though not necessarily) focus on a particular effect and use the rich dataset to examine its boundary conditions and moderating influences.

We believe that the extremely high-powered design of Many Labs 2 offers an opportunity to demonstrate the productive interplay of exploratory and confirmatory analysis strategies. That is, teams that would like to submit commentaries would be given access to one half of the dataset to analyze and write their commentary for editorial review. Those commentaries that are accepted would then be finalized, and the analysis would be subjected to a confirmatory phase. The identical analysis would be conducted on the other half of the data as a strong confirmatory test and reported in the commentary regardless of outcome. A change in the outcome between exploratory and confirmatory phases would not be a basis for changing the editorial decision.
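The proposal does not state how the two halves would be formed. Purely as an illustration, a reproducible random split of participant identifiers could look like the sketch below; the function name, the seed value, and the use of a simple (unstratified) random split are all assumptions.

```python
import random


def split_halves(participant_ids, seed=2014):
    """Randomly split participant IDs into an exploratory and a confirmatory half.

    Assumptions: simple random split, fixed seed for reproducibility.
    IDs are sorted first so the result does not depend on input order.
    """
    ids = sorted(participant_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    mid = len(ids) // 2
    return ids[:mid], ids[mid:]
```

A split stratified by data collection site would keep each site represented in both halves; whether that is preferable is a design choice the proposal leaves open.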

We believe that this process could highlight the importance and interactivity of exploratory and confirmatory approaches to data analysis. Exploratory analysis provides an opportunity to learn from the data, and the subsequent confirmatory analysis provides a strong test of the discoveries. Finally, we will make the full dataset (plus the two halves used for the exploratory/confirmatory commentaries) and all study materials available publicly at https://osf.io/8cd4r/ so that other teams can use it for their own investigations.

Selected Effects

Next, we describe the selected effects. Each description gives a summary title of the effect with a citation, followed by an abstract describing the main idea of the original research along with the sample size, inferential test, and effect size of the key result for replication. Then, we present the materials, procedure, and analysis plan.

Many of the original studies were conducted with paper and pencil, but all of the replications will be conducted via computer. Also, many of the original studies were conducted in English, but all of the replications will be conducted in the dominant language of each setting of data collection. Any other known differences from the original study are noted in the method description.

The focus of this replication project is estimating the variability in effect magnitudes by sample and setting. As such, we aimed to identify or simplify original study designs that could be tested as simple, two-condition experiments or as correlational results. Some original studies had additional conditions that were relevant for the theoretical purposes of the investigation. In those cases, the replication designs identified the key conditions that are relevant for estimating the effect. Also, in some cases, multiple dependent variables were included in the original design. If the dependent variables could be administered quickly, they were usually retained in the replication designs. When multiple outcomes were included, because they are likely to be correlated outcomes, just one was identified as the primary object for replication and the others as secondary replications. All outcomes will be reported in the final text, but the primary outcome will be the focus for reporting purposes.


SLATE 1

1. Huang

LIVING IN THE NORTH IS NOT NECESSARILY FAVORABLE: DIFFERENT METAPHORIC ASSOCIATIONS BETWEEN CARDINAL DIRECTION AND VALENCE IN HONG KONG AND IN THE UNITED STATES (Huang, Tse & Cho, 2014, Study 1a)

People in the United States and Hong Kong have different demographic knowledge that may shape their metaphoric association between valence and cardinal direction (North/South). 180 participants from the United States and Hong Kong participated. Participants were presented with a blank map of a fictional city and were randomly assigned to indicate on the map where either a high-SES or low-SES person might live. There was an interaction between SES (high vs. low) and population (US vs. HK), F(1,176) = 20.39, MSE = 5.63, p < .001, ηp2 = 0.10. US participants expected the high-SES person to live further north (M = +0.98, SD = 1.85) than the low-SES person (M = -.69, SD = 2.19), t(78) = 3.69, p < .001, d = .82, 95% CI [.37, 1.30]. Conversely, HK participants expected the low-SES person to live further north (M = +0.63, SD = 2.75) than the high-SES person (M = -0.92, SD = 2.47), t(98) = 2.95, p = .004, d = -.59, 95% CI = [-.99, -.19].

Materials and Procedure.

Participants will be randomly assigned to read a description of either a high or low SES person. The description of high-SES person will read: “Dr. Bennett lives in the city. He is a wealthy businessman who has travelled the world. He inherited a significant amount of money from a Great Aunt, and was educated at the best schools growing up. He enjoys fine dining and going to the theater on weekends.” The description of the low-SES person will read: “Mr. Bennett lives in the city. He is unemployed. He was born and raised in the city he now calls home. He struggles to pay the rent each month, and dropped out of high school before graduation. He enjoys a good hot dog and a six pack of beers when he can.”

Then participants will view a map of a fictional city and will indicate where the person they read about might live by clicking on that location on the map. The map of the fictional city from the original study will be used (originally from Meier et al., 2011). Materials here: https://osf.io/exs7i/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_cNiQVmpTU8Xd1uB.

In the individual difference measures at the end of the slate, participants will report the name of their home town, and answer “Where do wealthier people live in your home town?” with response options being north side, south side, or neither.

Analysis plan.

The coordinates of the click on the map will be recorded (X, Y) from the top-left of the image. The mean difference between the high and low-SES conditions for north/south location of click (Y) will be compared with an independent samples t-test. All participants who indicate an area within the boundaries of the map will be included in the analysis.
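A minimal sketch of this comparison is given below. It assumes a pooled-variance (Student) independent samples t-test, that clicks are stored as (condition, x, y) tuples, and that north is at the top of the map image, so that a smaller Y (measured from the top-left) means a more northern location. The function name and data layout are hypothetical.

```python
import math
from statistics import mean, variance


def north_south_ttest(clicks, map_width, map_height):
    """Compare the Y coordinate of clicks between high- and low-SES conditions.

    clicks -- iterable of (condition, x, y), condition being "high" or "low";
              x and y are pixels from the top-left corner of the map image.
    Clicks outside the map boundaries are excluded before analysis.
    """
    inside = [
        (cond, x, y)
        for cond, x, y in clicks
        if 0 <= x <= map_width and 0 <= y <= map_height
    ]
    high = [y for cond, _, y in inside if cond == "high"]
    low = [y for cond, _, y in inside if cond == "low"]
    n1, n2 = len(high), len(low)

    # Pooled variance for the equal-variance independent samples t-test
    pooled_var = ((n1 - 1) * variance(high) + (n2 - 1) * variance(low)) / (n1 + n2 - 2)
    diff = mean(high) - mean(low)
    t = diff / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    d = diff / math.sqrt(pooled_var)  # Cohen's d on the raw Y scale
    return {"t": t, "df": n1 + n2 - 2, "d": d, "n_included": len(inside)}
```

Under the north-at-top assumption, a negative difference (high minus low) would mean the high-SES person was placed further north than the low-SES person.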

The test for replicating the cultural difference observed in Huang et al. will be conducted on the subset of participants who respond, on the wealth-in-hometown item, either that wealthy people tend to live in the North (akin to the original U.S. sample) or that wealthy people tend to live in the South (akin to the original Hong Kong sample). The entire sample will be used for investigating variation in effects across sample and setting.

Known differences from original.

Original participants were asked to guess the purpose of the study afterward, but none did and we will not be including that item.

In the original, the map was presented on paper and participants drew an “X” on it, whereas in the replication the map will be presented on a computer and participants will click to indicate the location. With a monitor presentation, this also means the study will be completed on a vertical display as opposed to a horizontal sheet of paper. The original authors suggest this may be particularly important because associations between “up” and “good” or “down” and “bad” may interfere with any North/South associations. As such, at eight data collection sites, we will randomly assign participants to complete the slate either on a regular monitor or on a Microsoft Surface tablet resting flat on the table, like a paper-and-pencil administration. A focused examination of these sites will test whether this administration format matters.

Lastly, the original analysis emphasized t-tests against zero whereas the replication will focus specifically on the difference between conditions (e.g., independent samples t-test).


2. Kay

A FUNCTIONAL BASIS FOR STRUCTURE-SEEKING: EXPOSURE TO STRUCTURE PROMOTES WILLINGNESS TO ENGAGE IN MOTIVATED ACTION (Kay, Laurin, Fitzsimons & Landau, 2014, Study 2)

In Kay, Laurin, Fitzsimons, and Landau (2014), 67 participants generated what they felt was their most important goal. Participants then read one of two scenarios where a natural event (leaves growing on trees) was described as being a structured or random event. For example, in the structured condition, a sentence read “The way trees produce leaves is one of the many examples of the orderly patterns created by nature…”, but in the random condition it read “The way trees produce leaves is one of the many examples of the natural randomness that surrounds us…”. Next, participants answered three questions about their most important goal on a scale from “1 = not very” to “7 = extremely”. The first measured subjective value of the goal and the other two measured willingness to engage in goal pursuit. Those exposed to a structured event (M = 5.26, SD = 0.88) were more willing to pursue their goal compared to those exposed to a random event (M = 4.72, SD = 1.32; t(65) = 2.00, p = .05, d = 0.50, 95% CI = [-.001, .988]).

Materials and procedure.

Participants will be asked to list their most important long-term goal. Afterwards, participants will be randomly assigned to the structured or random condition. Following the scenario script, participants will answer three questions regarding their willingness to pursue their listed goal and the subjective value of the goal. Materials here: https://osf.io/nkg5y. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_dcMjcxf2UCsCd0N

Analysis plan.

Following Kay et al. (2014), we will create an index of willingness to engage in goal pursuit for each participant by (1) regressing the mean of the two goal pursuit items on the centered mean of the goal subjective value item, (2) calculating the unstandardized residual for each participant, and (3) adding to those residuals the mean value for the self-regulation items measuring willingness to engage in goal pursuit. Then, the two conditions will be compared using an independent samples t-test.

Because of the analysis strategy, any participant with missing data on any one of the three items will not be included in analysis.
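The three-step index can be sketched as below. This is an illustrative reading of the steps, assuming a simple ordinary least squares regression fitted across both conditions and rows stored as (value, pursuit1, pursuit2), with None marking a missing response; the function name and data layout are hypothetical.

```python
from statistics import mean


def goal_pursuit_index(rows):
    """Residualized willingness-to-pursue index, one value per complete case.

    rows -- list of (value, pursuit1, pursuit2); None marks a missing response.
    Participants with any missing item are dropped (listwise deletion).
    """
    complete = [r for r in rows if None not in r]
    pursuit = [(p1 + p2) / 2.0 for _, p1, p2 in complete]
    value_mean = mean(v for v, _, _ in complete)
    value_c = [v - value_mean for v, _, _ in complete]  # centered predictor
    pursuit_mean = mean(pursuit)

    # Step 1: OLS slope; with a centered predictor the intercept is pursuit_mean
    sxx = sum(x * x for x in value_c)
    sxy = sum(x * (y - pursuit_mean) for x, y in zip(value_c, pursuit))
    slope = sxy / sxx if sxx > 0 else 0.0

    # Step 2: unstandardized residuals
    residuals = [y - (pursuit_mean + slope * x) for x, y in zip(value_c, pursuit)]

    # Step 3: add the mean of the goal pursuit items back to each residual
    return [res + pursuit_mean for res in residuals]
```

The resulting index has the same mean as the raw goal pursuit score but is uncorrelated with the subjective value item.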

Known differences from the original.

None known besides sampling and setting.


3. Alter

OVERCOMING INTUITION: METACOGNITIVE DIFFICULTY ACTIVATES ANALYTIC REASONING (Alter, Oppenheimer, Epley & Eyre, 2007, Study 4)

Alter and colleagues (2007) investigated whether a deliberate, analytic processing style can be activated by incidental disfluency cues that suggest task difficulty. Forty-one participants attempted to solve syllogisms presented in either a hard- or easy-to-read font. The manipulation of font was an incidental induction of disfluency. Participants in the hard-to-read condition answered more moderately difficult syllogisms correctly (64%) than participants in the easy-to-read condition (42%; t(39)=2.01, p=.051, d=.64 [-.004, 1.28]).

Materials and procedure.

Participants will be randomly assigned to complete syllogisms presented in easy- or hard-to-read font. Following Alter et al. (2007), the easy-to-read font will be black Myriad Web 12-point and the hard-to-read font will be 10% grey italicized Myriad Web 10-point. Items will be presented on a single page in a fixed order: instructions, six syllogisms, and a mood item.

The original authors chose six syllogisms based on difficulty determined in previous research (Johnson-Laird & Bara, 1984; Zielinski, Goodwin, & Halford, 2006): two hard (20% correct), two moderate (50% correct), and two easy (85% correct). We will use the same syllogisms. Transient mood will be measured by asking “Please circle the number that best describes your current mood:” on a 7-point scale from “very unhappy” to “very happy”. The mood item was included in the original study to evaluate whether disfluency could change mood in a way that would then affect task performance. Additionally, a manipulation check will be added after the task to assess how difficult participants thought the text was to read. Materials here: https://osf.io/sizqu/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_bPWngKJWUb5DAVv.

Analysis plan.

Similar to Alter et al. (2007), we will conduct an independent samples t-test to determine whether accuracy in solving moderately difficult syllogisms differs by font condition (fluent versus disfluent). The original study focused on the moderately difficult questions, on the basis that participants’ performance could vary enough to detect changes in processing depth. Our primary analysis strategy will be sensitive to potential differences across samples in ability on syllogisms. We will first determine which syllogisms were moderately difficult for participants by excluding any of the six items, within each sample, that were answered correctly by fewer than 25% of participants or more than 75% of participants across conditions. The remaining syllogisms will be the basis for computing mean syllogism performance for each participant.
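The within-sample item selection rule can be illustrated with the sketch below. It assumes responses are coded 1 (correct) or 0 (incorrect), one list per participant with items in the same order; the function names and the choice to raise an error when no item qualifies are hypothetical.

```python
def moderately_difficult_items(responses, low=0.25, high=0.75):
    """Return indices of items answered correctly by 25% to 75% of a sample.

    responses -- list of per-participant lists of 0/1 correctness scores,
                 pooled across font conditions within one sample.
    """
    n = len(responses)
    n_items = len(responses[0])
    rates = [sum(r[i] for r in responses) / n for i in range(n_items)]
    # Exclude items with fewer than 25% or more than 75% correct
    return [i for i, rate in enumerate(rates) if low <= rate <= high]


def participant_scores(responses, keep):
    """Mean proportion correct per participant over the retained items."""
    if not keep:
        raise ValueError("no item met the moderate-difficulty criterion")
    return [sum(r[i] for i in keep) / len(keep) for r in responses]
```

The per-participant scores would then be compared between font conditions with the independent samples t-test described above.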

For a direct comparison with the original effect size, only English in-lab samples will be used for two reasons: (1) we cannot adequately control for online participants “zooming in” on the page or otherwise making the font more readable, and (2) a different font may be used in some translated versions because the original font (Myriad Web) may not support all languages. All samples will be included in the investigation of cross-site variability in effect size.

As a secondary analysis, we will use the same two syllogisms from Alter et al. (2007) for analysis regardless of performance to perfectly mirror the original analysis.

Known differences from original.

The original was done with paper and pencil and it is unknown whether using a computer could affect fluency. However, other evidence suggests that this is not a limiting condition. A fluency effect was demonstrated in a previous online study that presented words in a large or small font, and it was shown that the size of the font influenced participants’ meta-memorial judgments (Kornell, Rhodes, Castel, & Tauber, 2011). Furthermore, in a computer-based experiment involving judging the validity of syllogisms, Morsanyi and Handley (2012; Experiment 2) found that font fluency affected participants’ bias to respond “yes,” with those in the normal font condition showing a higher rate of endorsing syllogisms as valid compared to the typical bias rate, whereas there was no difference for those in the nonfluent font condition.

We will also take additional steps for in-lab replications to ensure the computer presentation is suitable. We will ask experimenters to maximize the browser window before participants arrive, and we will obtain information about the monitors used (e.g., display size, aspect ratio, model, resolution, and typical viewing distance) from researchers or automatically recorded meta-data. Researchers will also take two pictures of their monitors, one of the difficult-to-read condition and one of the easy-to-read condition so they are available for future review. Those data will be obtained for potential subsequent analyses to determine if those factors influenced the results.

The original authors hypothesize that this effect is sensitive to task order. If people are already thinking carefully (or if they’re fatigued), the disfluency manipulation might not change how deeply they engage with the task. As such, the effect may be most detectable when it is done first. An additional difference, as noted in the analysis plan, is that a different font from the original may be used in non-English samples when Myriad Web does not support the language.


4. Graham

LIBERALS AND CONSERVATIVES RELY ON DIFFERENT SETS OF MORAL FOUNDATIONS (Graham, Haidt, & Nosek, 2009, Study 1)

People on the political left (liberal) and political right (conservative) have distinct policy preferences and may also have different moral intuitions and principles. 1,532 participants across the ideological spectrum rated whether different concepts such as purity or fairness were relevant for deciding whether something was right or wrong. Items that emphasized concerns of harm (r = -.16, p < .0005, d = .32, 95% CI [.27, .38]) or fairness (r = -.21, p < .0005, d = .43, 95% CI [.38, .48]) were deemed more relevant for moral judgment by political liberals than conservatives (“individualizing” aggregate r = -.21, p < .0005, d = .43, 95% CI [.38, .48]), whereas items that emphasized concerns for the ingroup (r = .12, p < .0005, d = .24, 95% CI [.19, .29]), authority (r = .21, p < .0005, d = .43, 95% CI [.38, .48]), or purity (r = .27, p < .0005, d = .56, 95% CI [.51, .62]) were deemed more relevant for moral judgment by political conservatives than political liberals (“binding” aggregate r = .25, p < .0005, d = 0.52, 95% CI [.46, .57]).

Materials and procedure.

The moral relevance of five foundations will be measured by 3 items each for a total of 15. Participants will read the prompt, “When you decide whether something is right or wrong, to what extent are the following considerations relevant to your thinking?” Then, they will rate the 15 moral relevance items in a randomized order on a 6-point scale from “not at all relevant” to “extremely relevant”. At the end of the study package, participants will report their political ideology along with the other demographic measures. Materials here: https://osf.io/gdbp8/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_9T7LvQedxUzUFXn

Analysis plan.

The 6 items for harm and fairness will be averaged to create an “individualizing” foundations moral relevance score, and the 9 items for ingroup, authority, and purity will be averaged to create a “binding” foundations moral relevance score. The relationship between political ideology and the “binding” and “individualizing” aggregates will be calculated using zero-order correlations. The primary target of replication is the relationship of political ideology with the “binding” foundations, and the relationship of political ideology with the “individualizing” foundations is a secondary replication. All participants who complete the corresponding measures will be included in analysis.
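The scoring described above can be sketched as follows. This is a minimal illustration, not the project's analysis script: the participant records are invented, and the coding of ideology (higher values meaning more conservative) is an assumption made only for the example.

```python
from math import sqrt
from statistics import mean

INDIVIDUALIZING = ("harm", "fairness")            # 2 foundations x 3 items = 6 items
BINDING = ("ingroup", "authority", "purity")      # 3 foundations x 3 items = 9 items


def foundation_score(ratings, foundations):
    """Average every item rating that belongs to the given foundations."""
    items = [rating for f in foundations for rating in ratings[f]]
    return mean(items)


def pearson_r(x, y):
    """Zero-order (Pearson) correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: ideology coded so that higher = more conservative (assumption).
participants = [
    {"ideology": 1, "ratings": {"harm": [6, 5, 6], "fairness": [6, 6, 5],
                                "ingroup": [2, 3, 2], "authority": [2, 2, 3], "purity": [1, 2, 2]}},
    {"ideology": 3, "ratings": {"harm": [5, 5, 4], "fairness": [5, 4, 5],
                                "ingroup": [3, 3, 2], "authority": [3, 3, 3], "purity": [2, 3, 3]}},
    {"ideology": 5, "ratings": {"harm": [4, 4, 3], "fairness": [4, 4, 4],
                                "ingroup": [4, 4, 4], "authority": [4, 5, 4], "purity": [4, 4, 3]}},
    {"ideology": 7, "ratings": {"harm": [3, 4, 3], "fairness": [3, 3, 4],
                                "ingroup": [5, 5, 6], "authority": [5, 6, 5], "purity": [5, 6, 5]}},
]

ideology = [p["ideology"] for p in participants]
individualizing = [foundation_score(p["ratings"], INDIVIDUALIZING) for p in participants]
binding = [foundation_score(p["ratings"], BINDING) for p in participants]

r_binding = pearson_r(ideology, binding)                  # primary replication target
r_individualizing = pearson_r(ideology, individualizing)  # secondary replication target
print(round(r_binding, 2), round(r_individualizing, 2))
```

With this coding, a positive correlation for the “binding” aggregate and a negative one for the “individualizing” aggregate would match the direction of the original findings.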

Known differences from original.

We are conducting a simplified version of the original analyses with approval of the original authors (and updated analyses from the original study in line with our analysis for comparative purposes). Also, we altered the text of the political ideology item to be more relevant to international samples that may not recognize “liberal” and “conservative” as having the same meaning as in the United States.


5. Rottenstreich

MONEY, KISSES, AND ELECTRIC SHOCKS: ON THE AFFECTIVE PSYCHOLOGY OF RISK (Rottenstreich & Hsee, 2001, Study 1)

Forty participants chose whether they would prefer an affectively attractive option (a kiss from a favorite movie star) or a financially attractive option ($50). In one condition, participants made the choice imagining a low probability (1%) of getting the outcome. In the other condition, participants imagined that the outcome was certain and that they just needed to choose which one. When the outcome was unlikely, 70% preferred the affectively attractive choice; when the outcome was certain, 35% preferred the affectively attractive choice (χ2(1, N = 40) = 4.91, p = .0267, Cramér’s φ = .35). This result supported the hypothesis that positive affect has greater influence on judgments made under conditions of uncertainty than on judgments about definite outcomes.

Materials and procedure.

Participants will be randomly assigned to make a choice with either a certain outcome or a 1% chance that the outcome will occur. The certain condition will read as follows: “Imagine that you have the opportunity to either meet and kiss your favorite movie star or receive $50 in cash.” In the uncertain condition, participants will read: “Imagine that you have the opportunity to take part either in a lottery that offers a 1% chance to meet and kiss your favorite movie star or a lottery that offers a 1% chance to receive $50 in cash.” In both conditions, participants will choose one of the two options. Materials here: https://osf.io/pky9m/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_brBFoL7va7coZ0x.

Analysis plan.

A two-way contingency table will be built with certainty condition (low-probability vs. certain) and choice (monetary reward vs. meeting favorite movie star) as factors. The critical replication hypothesis will be given by a χ2 test and the effect size by an odds ratio. All participants with valid data on the response will be included in the analysis.
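As a rough illustration of this plan, the sketch below computes an uncorrected Pearson χ2 and an odds ratio for a 2 x 2 table using only the Python standard library. The cell counts are an assumption: they suppose the original 40 participants were split evenly (20 per condition), so that 70% and 35% correspond to 14 and 7 people choosing the kiss. Under that assumption the result is close to the reported χ2 of 4.91.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its two-sided p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For df = 1, P(chi2 > x) equals P(|Z| > sqrt(x)) for a standard normal Z.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


def odds_ratio(a, b, c, d):
    """Odds of the first response in row 1 relative to row 2."""
    return (a * d) / (b * c)


# Assumed counts (rows: low-probability, certain; columns: kiss, money).
kiss_low, money_low = 14, 6
kiss_certain, money_certain = 7, 13

chi2, p_value = chi_square_2x2(kiss_low, money_low, kiss_certain, money_certain)
effect = odds_ratio(kiss_low, money_low, kiss_certain, money_certain)
print(round(chi2, 2), round(p_value, 4), round(effect, 2))
```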

Known differences from original.

None known.


6. Bauer

CUING CONSUMERISM: SITUATIONAL MATERIALISM UNDERMINES PERSONAL AND SOCIAL WELL-BEING (Bauer, Wilkie, Kim & Bodenhausen, 2012, Study 4)

Bauer and colleagues (2012) examined whether being in a consumer mindset would lead to less trust towards others. In Study 4, 77 participants read about a hypothetical water conservation dilemma in which they were involved. Participants were randomly assigned to a condition that referred to the participant and others in the scenario either as “consumers” or as “individuals.” Participants in the consumer condition reported less trust towards others (1 = not at all, 7 = very much) to conserve water (M = 4.08, SD = 1.56) compared to the control condition (M = 5.33, SD = 1.30), t(76) = 3.86, p = .001, d = .88, 95% CI [.41, 1.34].

Materials and procedure.

Participants will be randomly assigned to the consumer or control condition. They will first read a description of a water shortage caused by a drought in which they are one of four people who share water from the same well. In the scenario, the four people sharing the well are referred to as either “Individuals” or “Consumers,” and participants receive information about past water usage indicating that they have been using more water than others. The passage reads:

You are [Individual/Consumer] A. You and three other [individuals/consumers] live in and share a water supply for a particular area. Typically, you use the most amount of water with your daily activities (e.g., washing clothes and dishes, flushing the toilet, watering the lawn, bathing, etc.). That is, you use about 160 gallons of water per day. [Individual/Consumer] B uses 140 gallons a day, [Individual/Consumer] C uses about 120 gallons a day, and [Individual/Consumer] D uses about 100 gallons per day.

Unfortunately, this year a drought has depleted the normal supply of water in your area such that there is not enough water for you and the other three [individuals/consumers] to draw on for your typical needs. In order to maintain the water supply through this period of time, it is recommended that the total amount of water used should be cut by 25%. This is only a recommendation, so no water restriction has been explicitly imposed.

Below is a chart of each [individual’s/consumer’s] water usage and the specific activities that each individual would have to alter to cut their water usage.

[Chart of each individual’s/consumer’s water usage and the activities each would have to alter; image not reproduced here.]

On a single item from 1 = not at all to 7 = very much, participants will answer “How much do you trust the other parties to use less water?” Materials here: https://osf.io/jv46k/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_5cHQkTN7uGunuAt.

Analysis plan.

We will compare the mean trust levels between conditions with an independent-samples t-test. All participants with data will be included in analysis.
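For readers who want to see the arithmetic behind this comparison, here is a minimal sketch of a pooled-variance (Student) independent-samples t statistic and Cohen's d. The trust ratings are invented for illustration, and the choice of the pooled-variance form is an assumption; the plan above does not say whether a Welch correction will be used.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Student's t with pooled variance; returns (t, degrees of freedom, Cohen's d)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    diff = mean(group_a) - mean(group_b)
    t = diff / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    d = diff / sqrt(pooled_var)  # standardized mean difference
    return t, df, d


# Hypothetical 1-7 trust ratings, not real data.
control = [5, 6, 4, 7, 5]
consumer = [4, 3, 5, 4, 4]

t, df, d = independent_t(control, consumer)
print(round(t, 2), df, round(d, 2))
```

A positive t here means higher trust in the control condition than in the consumer condition, which is the direction reported in the original study.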

Known differences from original.

The original experiment included four additional dependent variables: (1) responsibility for the crisis, (2) obligation to cut water usage, (3) how much they viewed others as partners, and (4) how much others should use less water. The central replication will be on the trust variable, while the other four dependent variables will be retained in the procedure but not analyzed for the focal replication.


7. Miyamoto

CULTURAL VARIATION IN CORRESPONDENCE BIAS: THE CRITICAL ROLE OF ATTITUDE DIAGNOSTICITY OF SOCIALLY CONSTRAINED BEHAVIOR (Miyamoto & Kitayama, 2002, Study 1)

Miyamoto and Kitayama (2002) examined whether Americans would be more likely than Japanese to show a bias toward ascribing to an actor an attitude corresponding to the actor’s behavior. In their Study 1, 49 Japanese and 58 American undergraduates learned they would read a university student’s essay about the death penalty and infer the student’s true attitude toward the issue. The essay was either in favor or against the death penalty, and it was designed to be diagnostic or not very diagnostic of a strong attitude. After reading the essay, participants learned that the student was assigned to argue the pro- or anti-position. Then, participants estimated the essay writer’s actual attitude toward capital punishment and the extent to which they thought the student’s behavior was constrained by the assignment.
Controlling for perceived constraint, analyses compared perceived attitudes of pro- versus anti-capital punishment essay writers. American participants perceived a large difference in actual attitudes when the essay writer had been assigned to write a pro-capital punishment essay (M = 10.82, SD = 3.47) versus an anti-capital punishment essay (M = 3.30, SD = 2.62; t(56) = 6.66, p < .001, d = 1.78, 95% CI [1.16, 2.39]). Japanese participants perceived less of a difference in actual attitudes when the essay writer had been assigned to write a pro-capital punishment essay (M = 9.27, SD = 2.88) versus an anti-capital punishment essay (M = 7.02, SD = 3.06; t(47) = 1.84, p = .069, d = .53).

Materials and procedure.

Participants will be randomly assigned to read a scenario in which a student wrote either a pro- or anti-capital punishment essay. Then, they will learn that the student had been assigned to take that position, reading, “Dr. Wallace is teaching a course on international politics at a midwestern university. In his class, students discuss a variety of topics and issues every week. Typically, Dr. Wallace solicits opinions about the topics from the students. In this week’s class, the topic was capital punishment. Dr. Wallace asked Steve to write an essay [supporting/opposing] capital punishment. Steve agreed to do so and wrote the essay presented on the previous page.” For the essay page and the constraint page, participants will not be allowed to move forwards until 10 seconds have passed.

After that, participants will answer three questions estimating the writer’s true attitude, the attitude the writer would take if given the opportunity to speak freely, and the attitude of the average student at a Midwestern university (1 = against capital punishment, 15 = supports capital punishment). Participants will then answer the extent to which they believed the essay author had been constrained by the assignment using a 7-point scale (1 = strongly constrained, 7 = completely free). Finally, participants will indicate how persuasive they thought the essay was on a 7-point scale (1 = not at all persuasive, 7 = very persuasive). Materials here: https://osf.io/e426i/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_bEjoIILOJcahUuV.

Analysis plan.

An ANCOVA will compare the mean estimates of the author’s true attitude across the two conditions, covarying for perceived constraint.
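With two conditions and one covariate, the adjusted condition difference in an ANCOVA is the condition coefficient in an ordinary least squares regression of the attitude estimate on an intercept, a condition dummy, and perceived constraint. The sketch below shows only that coefficient (no F test or standard errors). The data are constructed so that the true condition effect is 4 and the constraint slope is 0.5, and the dummy coding (1 = pro essay, 0 = anti essay) is an assumption for the example.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ancova_adjusted_difference(attitude, pro_essay, constraint):
    """Return (condition effect adjusted for constraint, constraint slope)."""
    rows = [[1.0, float(c), float(z)] for c, z in zip(pro_essay, constraint)]
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, attitude)) for i in range(3)]
    _intercept, condition_effect, constraint_slope = solve_linear(xtx, xty)
    return condition_effect, constraint_slope


# Constructed data: attitude = 3 + 4 * pro_essay + 0.5 * constraint
pro_essay = [0, 0, 0, 1, 1, 1]
constraint = [1, 2, 3, 2, 3, 5]
attitude = [3.5, 4.0, 4.5, 8.0, 8.5, 9.5]

effect, slope = ancova_adjusted_difference(attitude, pro_essay, constraint)
print(round(effect, 3), round(slope, 3))
```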

Known differences from original.

Our focus is on comparing the two key conditions that elicited a cultural difference in the original research. The original experiment varied the diagnosticity of the essay. We have decided to focus on the low-diagnosticity conditions, which is where the significant cross-cultural difference in correspondence bias occurred.

Also, the original experiment included six outcome measures: 1) their estimate of the writer’s true attitude, 2) the attitude the writer would express if free to choose, 3) the attitude of the average student at a Midwestern university (changed to the attitude of the average student in each respective country for the purpose of the current project), 4) their own attitude, 5) how much constraint the writer had while composing the essay, and 6) how persuasive they thought the essay was. We will focus on the estimate of the writer’s true attitude as the primary target for replication. Item 2 will be examined as a secondary replication, items 3 and 6 as potential moderators, and item 5 as a covariate in the analysis.

The original authors suggested we alter the names and university location to be familiar for the national identity of each sample. Also, the original authors suggested that, for samples in countries without the death penalty, the prompt for the essay be changed from “I think that capital punishment should be abolished” to “I think that capital punishment should not be legalized.” Finally, the original study was pen and paper. In adapting the task to an online version, the original authors suggested we make participants spend at least 10 seconds reading the essay and the explanation of constraint. We have done this by disabling the continue button for 10 seconds.

The original authors also suggested a few caveats that may produce cross-cultural variation in the effect observed from our implementation aside from differences in correspondence bias. Specifically, the authors suggested there may be differences in how participants read and interpret the situational constraint information (e.g., samples may vary in their familiarity with the seminar course format) and essays (e.g., samples may vary in their perception of the strength of the essays or familiarity with the death penalty). Finally, the authors suggested that presenting the study in a package of other studies could disrupt the effect. Our tests of order effects should address this. However, the original authors added that even when the study appears first, participants may respond differently because they are expecting to do more tasks afterwards. Our design does not address this possibility.


8. Inbar

DISGUST SENSITIVITY PREDICTS INTUITIVE DISAPPROVAL OF GAYS (Inbar, Pizarro, Knobe & Bloom, 2009, Study 1)

Behaviors that are deemed morally wrong may be judged as more intentional (Knobe, 2006). Thus, people who judge the portrayal of gay sexual activity in the media as an intentional act may find homosexuality morally reprehensible. In Inbar et al. (2009), 44 participants read a vignette about a director’s action and rated him as more intentional when he encouraged gay kissing (M = 4.36, SD = 1.51) than when he encouraged kissing (M = 2.91, SD = 2.01; β = .41, t(39) = 3.39, p = .002, r = .48). Disgust sensitivity was related to judgments of greater intentionality in the gay kissing condition, β = .79, t(19) = 4.49, p = .0003, r = .72, and not in the kissing condition, β = -.20, t(19) = -.88, p = .38, r = .20. The correlation in the gay kissing condition was stronger than the correlation in the kissing condition, z = 2.11, p = .03, d = .64, 95% CI [.31, .96].

The authors concluded that individuals prone to disgust are more likely to interpret the gay kissing inclusion as intentional, indicating that they intuitively disapprove of homosexuality. The relationship between disgust sensitivity and intentionality ratings is the target of direct replication.

Materials and procedure.

Participants will be randomly assigned to read one of the two versions of the following scenario: “A director was working on a music video. His assistant said: ‘I took a look at the first cut of your video, and it looks to me like some of the images in it will encourage couples [homosexual men] to French kiss in public.’ The director said: ‘Look, I know that it will be encouraging couples [homosexual men] to French kiss in public, but I don’t care at all about that. I just want to make a video that will increase sales of the album.’ He included the images in the video. Sure enough, it encouraged couples [homosexual men] to French kiss in public.”

Next, participants will be asked a condition-matched set of three questions in a fixed order. First, “Did the director intentionally encourage couples [homosexual men] to French kiss in public?” answering on a 7-point scale from 1 = not at all to 7 = definitely. Second, “Is there anything wrong with couples [homosexual men] French kissing in public?” with a yes or no response. And, third, “Was it wrong of the director to make a video that he knew would encourage couples [homosexual men] to French kiss in public?” with a 7-point scale from 1 = not at all to 7 = definitely.

Finally, participants will complete the 25-item revised Disgust Sensitivity Scale (Olatunji et al., 2007) among the individual difference measures of the procedure to separate it from the experimental manipulation. Materials here: https://osf.io/bfhp7/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_0HDz8AMWXFr9yvz.

Analysis plan.

The five items of the contamination subscale of the Disgust Sensitivity Scale-Revised will be averaged to create a single index of disgust sensitivity. For the primary analysis, we will compute a correlation between disgust sensitivity and assessments of the director’s intentionality in both the gay kissing and kissing conditions, and then compare the correlations with an r-to-z transformation. The other two outcome measures will be examined as secondary analyses following the same analysis strategy. All participants with relevant data will be included in analysis.
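The r-to-z comparison of two independent correlations can be sketched as below. As a worked check, the example plugs in the correlations reported for the original study (r = .72 and r = .20) with 21 participants per condition; that per-condition n is an assumption inferred from the reported t(19) values, not a figure stated in this plan. Under that assumption the statistic comes out close to the reported z of 2.11.

```python
from math import atanh, erfc, sqrt


def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two independent correlations.

    Returns the z statistic and its two-tailed p-value.
    """
    z1, z2 = atanh(r1), atanh(r2)                # Fisher transformation
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))       # standard error of z1 - z2
    z = (z1 - z2) / se
    p_two_tailed = erfc(abs(z) / sqrt(2))
    return z, p_two_tailed


# Reported correlations; n = 21 per condition is an assumption (see lead-in).
z, p = compare_independent_correlations(0.72, 21, 0.20, 21)
print(round(z, 2), round(p, 3))
```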

Known differences from original.

The original study used an 8-item short form of the disgust sensitivity scale. The original authors suggested using the 25-item revised version because of its improved psychometric properties. In pretesting, the full 25-item scale was taking more time than desired. The original authors approved a revision using the 5-item contamination subscale.


9. Critcher

INCIDENTAL ENVIRONMENTAL ANCHORS (Critcher & Gilovich, 2008, Study 2)

In Critcher and Gilovich (2008), 207 participants predicted the relative popularity between geographic regions of a new cell phone that was entering the marketplace. In one condition, the smartphone was called the P97; in the other condition, the smartphone was called the P17. Participants in the P97 condition estimated a greater proportion of sales in the U.S. (M = 58.1%, SD = 19.6%) than did participants in the P17 condition (M = 51.9%, SD = 21.7%; t(197.5) = 2.12, p = .03, d = 0.30, 95% CI = [0.02, 0.58]). This supported the hypothesis that judgment can be influenced by incidental anchors in the environment. The mere presence of a high or low number in the name of the cell phone influenced estimates of sales of the phone.

Materials and procedure.

Participants see a picture of a smartphone with either the model number “P17” or “P97” on the phone’s display and read some background information about the smartphone. The text has been updated from the original to reflect more recent smartphones, and reads: “The Sony Ericsson P17 [P97] is an Android-powered smartphone with a 5.2” Full HD resolution display. The P17 [P97] is designed to be lightweight and compact, and features the most up-to-date processor and battery components. Purchasing a P17 [P97] allows one to use Sony cloud storage app for a year for no additional charge. The Sony Ericsson P17 [P97] has also been made to be compatible with other Sony products, making transferring data from them to your Ericsson P17 [P97] as easy as a click of a button.”

Participants learn that the phone will be introduced in the U.S. and Europe and then estimate the percentage of the smartphones that would be sold in the U.S. For participants in Asia, the phone will be planned to appear in Asia and the U.S. and they will estimate Asia sales. For participants in Europe, the phone will be planned to appear in Europe and the U.S. and they will estimate European sales. For participants in other regions, the regions will be the closest and second closest of U.S., Europe, and Asia and they will estimate the sales in the closest region. Materials here: https://osf.io/5j63p/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_2sJjfkN7AOd8xsp.

Analysis plan.

The means for the P97 and P17 groups will be compared with an independent-samples t-test. Participants whose answers do not fall between 0 and 100 will be excluded from analysis.

Known differences from original.

The original study was conducted in the U.S. and had participants consider sales between U.S. and European markets. The replication will match markets with location as described above. The pictures and descriptions have been updated to reflect more modern smartphones. The original authors see no reason why the change to the phone image should affect the result, provided that the numbers remain salient, but it remains an untested assumption.

The authors avoided administering these studies on computer, using only paper-and-pencil presentation instead, to avoid the possibility that the numeric keys on the keyboard might serve as numeric primes. That said, this methodological decision was based on speculation, and the authors report never having tested the influence of administration mode systematically. To test this factor, 10 sites will administer this as a paper-and-pencil task.


10. van Lange

DEVELOPMENT OF PROSOCIAL, INDIVIDUALISTIC, AND COMPETITIVE ORIENTATIONS: THEORY AND PRELIMINARY EVIDENCE (Van Lange, Otten, De Bruin & Joireman, 1997, Study 3)

Van Lange and colleagues (1997) proposed that social value orientations (SVOs) are rooted in social interaction experiences, among them the number of one’s siblings. In larger families, resources have to be shared more frequently, facilitating cooperation and the development of a prosocial orientation (sibling-prosocial hypothesis). In their Study 3, 631 participants reported how many siblings they had and completed a SVO measure called the triple dominance measure to identify them as prosocials, individualists, or competitors. Prosocials had more siblings (M = 2.03, SD = 1.56) than individualists (M = 1.63, SD = 1.00) and competitors (M = 1.71, SD = 1.35; F(2, 535) = 4.82, p < .01, ds = .287 [.095, .478] and .210 [-.045, .465] respectively).

Materials and procedure.

Recent advances in the measurement of SVO have introduced an alternative measure that has some psychometric advantages compared to the triple dominance measure. The SVO slider measure will be incorporated into the present replication (Murphy, Ackermann, & Handgraaf, 2011). Participants will complete the SVO slider measure, then list how many older siblings, younger siblings, brothers, and sisters they have. The SVO slider measure consists of a series of six decomposed games in which participants select from a range of possible pairs of payoffs for themselves and a fictional other (Murphy et al., 2011). Materials here: https://osf.io/wkhit/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_4TxC1myNGotHyG9.

Analysis plan.

The current replication focuses only on the observed direct positive correlation between greater prosocial orientation and number of siblings. Participants must respond to all 6 items of the SVO slider and have valid data for the two questions asking about older and younger siblings to be included in the analysis. Total number of siblings will be obtained by adding the number of younger and older siblings. SVO slider scores will be scored according to the procedure recommended by Murphy et al. (2011). The resulting SVO slider score will be correlated with the total number of siblings for the critical test.
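The two derived variables can be sketched as follows. The angle formula reflects our understanding of the Murphy et al. (2011) scoring (mean payoffs to self and other are centered at 50 and converted to an angle in degrees) and should be checked against that paper before use. The allocations in the example are invented, chosen only to make the arithmetic easy to verify.

```python
from math import atan2, degrees
from statistics import mean


def svo_angle(allocations):
    """SVO angle in degrees from six (payoff to self, payoff to other) pairs."""
    if len(allocations) != 6:
        raise ValueError("all six slider items are required for inclusion")
    mean_self = mean(s for s, _ in allocations)
    mean_other = mean(o for _, o in allocations)
    # Center both means at 50, then take the angle of the resulting vector.
    return degrees(atan2(mean_other - 50, mean_self - 50))


def total_siblings(older, younger):
    """Total sibling count as the sum of older and younger siblings."""
    return older + younger


# Hypothetical respondent who always splits payoffs equally at 85 each.
equal_split = [(85, 85)] * 6
print(round(svo_angle(equal_split), 1), total_siblings(older=2, younger=1))
```

Larger angles indicate a more prosocial orientation, so the critical test is the correlation between this angle and the sibling total.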

Known differences from original.

The original demonstration used a triple dominance measure of social value orientation with three categorical values. In discussion with the original author, the SVO slider was identified as a useful replacement to yield a continuous distribution of scores.


11. Hauser

A DISSOCIATION BETWEEN MORAL JUDGMENTS AND JUSTIFICATIONS (Hauser, Cushman, Young, Kang-Xing Jin & Mikhail, 2007, Scenarios 1+2)

The principle of the double effect suggests that acts that harm others are judged as more morally permissible if the harm is a foreseen side effect rather than the means to the greater good. Hauser and colleagues (2007) compared participant reactions to two scenarios to test this principle. In the FORESEEN SIDE EFFECT scenario, a person on an out-of-control train changes the train’s trajectory so that the train kills one person instead of five. In the MEANS TO A GREATER GOOD scenario, a person pushes a fat man in front of a train, killing him, to save five people. While 89% of subjects judged the action in the foreseen side effect scenario as permissible (95% CI [.87, .91]), only 11% of subjects in the greater good scenario judged it as permissible (95% CI [.09, .13]). The difference between the proportions was significant (χ2(1, N = 2646) = 1615.96, p < .001, w = 0.78, d = 2.51, 95% CI [2.22, 2.86]), providing evidence for the principle of the double effect.

Materials and procedure.

This study is replicated in Slate 1 and Slate 2 using different scenarios. Participants will be randomly assigned to one of two test scenarios. We will use two scenarios out of the original four described in Hauser et al. (2007). Participants in the foreseen side effect condition will read the following: “Denise is a passenger on a train whose driver has just shouted that the train’s brakes have failed, and who then fainted of the shock. On the track ahead are five people; the banks are so steep that they will not be able to get off the track in time. The track has a side track leading off to the right, and Denise can turn the train onto it. Unfortunately there is one person on the right hand track. Denise can turn the train, killing the one; or she can refrain from turning the train, letting the five die.” Then they will respond with a yes or no to the question, “Is it morally permissible for Denise to switch the train to the side track?”

Participants in the means to a greater good condition will respond to this scenario: “Frank is on a footbridge over the train tracks. He knows trains and can see that the one approaching the bridge is out of control. On the track under the bridge there are five people; the banks are so steep that they will not be able to get off the track in time. Frank knows that the only way to stop an out-of-control train is to drop a very heavy weight into its path. But the only available, sufficiently heavy weight is a large man wearing a backpack, also watching the train from the footbridge. Frank can shove the man with the backpack onto the track in the path of the train, killing him; or he can refrain from doing this, letting the five die.” Then they will respond yes or no to the question, “Is it morally permissible for Frank to shove the man?”. After responding to the scenario, participants will be asked an additional question assessing any prior experience with the task. Materials here: https://osf.io/cnk7z/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_aXaDsDUF9HvCgEB.

Analysis plan.

Subjects will be excluded from all analyses if they take fewer than four seconds to read and respond to either of the target scenarios. For the key confirmatory test comparing with the original effect size, we will include only participants that indicate having no prior experience with the task. The original authors suggested the effect may be weaker in participants with prior exposure. Prior exposure will be investigated as a moderator for the other analyses. A two-way contingency table will be built with Scenario (Means vs. Side-effect) and Response (Yes vs. No) as factors. The critical replication hypothesis will be given by a one tailed chi square test and the effect size by an odds ratio.
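The exclusion rules and the contingency table for the confirmatory test can be sketched as below. The record layout and field names (`scenario`, `permissible`, `seconds`, `prior_experience`) are hypothetical and chosen only for this illustration; the actual data export may be structured differently.

```python
from collections import Counter

MIN_SECONDS = 4.0  # responses faster than this are excluded from all analyses


def confirmatory_sample(records):
    """Keep participants who took at least four seconds and report no prior experience."""
    return [
        r for r in records
        if r["seconds"] >= MIN_SECONDS and not r["prior_experience"]
    ]


def contingency_table(records):
    """Count yes/no permissibility judgments within each scenario."""
    counts = Counter((r["scenario"], r["permissible"]) for r in records)
    return {
        scenario: {"yes": counts[(scenario, True)], "no": counts[(scenario, False)]}
        for scenario in ("side_effect", "means")
    }


# Hypothetical records, not real data.
records = [
    {"scenario": "side_effect", "permissible": True, "seconds": 12.0, "prior_experience": False},
    {"scenario": "side_effect", "permissible": True, "seconds": 2.5, "prior_experience": False},
    {"scenario": "side_effect", "permissible": False, "seconds": 9.0, "prior_experience": False},
    {"scenario": "means", "permissible": False, "seconds": 15.0, "prior_experience": False},
    {"scenario": "means", "permissible": True, "seconds": 8.0, "prior_experience": True},
    {"scenario": "means", "permissible": False, "seconds": 4.0, "prior_experience": False},
]

table = contingency_table(confirmatory_sample(records))
print(table)
```

For the moderator analyses, the same table could be built after applying only the timing exclusion, so that participants with prior exposure are retained.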

Known differences from original.

The original research had 19 scenarios divided into four sets. Participants were randomly assigned to one of the sets which contained 4 scenarios, 3 moral dilemmas and one control scenario. We chose to present participants with only one target scenario to save time and since the analyses in the original were between-subjects. Also, the original study had a control condition at the beginning in which there were no people on the alternate track, so switching was obviously permissible. This condition is rarely present in research with this common scenario and is removed for the replication. Given that this paradigm is widely known, the original authors suggested the effect may be weaker for participants who have previously been exposed to this sort of task. So, we have included the additional item assessing participants’ prior knowledge of the task. The direct comparison with the original effect size will be on the subsample that is not familiar with the task. The investigation of variation across sample and setting will include all participants other than those who were excluded initially for responding in less than four seconds.


12. Anderson

THE LOCAL-LADDER EFFECT AND SUBJECTIVE WELL-BEING (Anderson, Kraus, Galinsky & Keltner, 2012, Study 3)

Anderson and colleagues (2012) examined the relationship between sociometric status (SMS), socioeconomic status (SES), and subjective well-being. According to the authors, SMS refers to interpersonal wealth, whereas SES measures fiscal wealth. Study 3 examined whether SMS has stronger ties than SES to well-being. In a 2 X 2 between subjects design, 228 Mechanical Turk participants were presented with descriptions of people who were either relatively high or low on either socioeconomic or sociometric status, and then made upward or downward social comparisons. Then, participants wrote about what it would be like to interact with such people, and then reported subjective well-being. Results showed a significant 2 x 2 interaction (F(1,224) = 4.73, p = .03) such that participants made to feel high in sociometric status had higher subjective well-being than those in the low sociometric status condition, t(115) = 3.05, p = .003, d = .57, 95% CI [.19, .94]. There were no differences between the two socioeconomic conditions, t(109) = .06, p = .96, d = .01.

Materials and procedure.

Participants will be randomly assigned to read a prompt to make them feel either high or low in sociometric status. In the high-sociometric condition, participants read, “Think of the ladder above as representing where people stand in the important groups to which they belong. For example, these can include their groups of friends, family, work group, etc… Now please compare yourself to the people at the very bottom rung of the ladder. These are people who have absolutely NO RESPECT, NO ADMIRATION, and NO INFLUENCE in ALL of their important social groups. In particular, we’d like you to COMPARE YOURSELF TO THESE PEOPLE in terms of your own respect, admiration, and influence in your important groups.” In the low-sociometric condition, participants read, “Think of the ladder above as representing where people stand in the important groups to which they belong. For example, these can include their groups of friends, family, work group, etc… Now please compare yourself to the people at the very top rung of the ladder. These are people who are the MOST RESPECTED, the MOST ADMIRED, and the MOST INFLUENTIAL in ALL of their important social groups. In particular, we’d like you to COMPARE YOURSELF TO THESE PEOPLE in terms of your own respect, admiration, and influence in your important groups.”

Then, participants will write a short response to the following prompt: “Now imagine yourself in a getting acquainted interaction with one of these people. Think about how the SIMILARITIES AND DIFFERENCES BETWEEN YOU might impact what you would talk about, how the interaction is likely to go, and what you and the other person might say to each other. Please write a brief description about how you think this interaction would go.” Then, participants will report which rung they occupied for the relevant status on a 10-rung ladder as a manipulation check. Finally, participants will complete three dependent measures in a fixed order: Satisfaction With Life Scale (SWLS; Diener, Emmons, Larsen, & Griffin, 1985) and the Positive and Negative Affect Schedule (Watson, Clark, & Tellegen, 1988). All materials were provided by the original authors and are available here: https://osf.io/thcj9/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_b1lMXQGoBWcVLet.

Analysis plan.

Following the original authors, the three dependent measures will be standardized and averaged into a single index of subjective well-being. The mean difference in subjective well-being between high and low-sociometric status conditions will be tested with an independent-samples t-test. All participants with data will be included in the analysis.
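A minimal sketch of the composite follows. It standardizes each measure across participants and averages the three z-scores per participant. Treating the three measures as life satisfaction, positive affect, and negative affect, and reversing the sign of negative affect so that higher index values mean greater well-being, are assumptions made for this example; the plan above does not spell out either detail. The scores are invented.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize a list of scores using the sample mean and standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def well_being_index(swls, positive_affect, negative_affect):
    """Average of standardized SWLS, positive affect, and reversed negative affect."""
    z_swls = z_scores(swls)
    z_pa = z_scores(positive_affect)
    z_na_reversed = [-z for z in z_scores(negative_affect)]  # assumption: reverse-keyed
    return [mean(triple) for triple in zip(z_swls, z_pa, z_na_reversed)]


# Hypothetical scores for three participants.
index = well_being_index(
    swls=[1, 2, 3],
    positive_affect=[1, 2, 3],
    negative_affect=[3, 2, 1],
)
print(index)
```

The resulting index values would then be compared between the high and low sociometric status conditions with the independent-samples t-test named in the plan.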

Known differences from original.

We are using only the high- and low-sociometric status conditions and excluding the high- and low-socioeconomic status conditions that showed no differences in the original study.


13. Ross1

THE “FALSE CONSENSUS EFFECT”: AN EGOCENTRIC BIAS IN SOCIAL PERCEPTION AND ATTRIBUTION PROCESSES (Ross, Greene & House, 1977, Study 1, Supermarket Scenario)

People perceive a “false consensus” about the commonness of their responses among others (Ross, Greene & House, 1977). Thus, estimates of the prevalence of a particular belief, opinion or behavior are biased in the direction of the perceiver’s beliefs, opinions and behaviors. Ross and colleagues (1977, Study 1) presented 320 college undergraduates with one of four hypothetical events that culminated in a clear dichotomous choice of action. Participants first estimated what percentage of peers would choose each option, and then indicated their own choice. For each of the four scenarios, participants that chose the first option believed that a higher percentage of others would also choose that option (M = 75.4%) than participants that chose the second option (M = 54.9%; F(1,312) = 49.1, p < .001, d = .79, 95% CI [.56, 1.02] for the main effect of experimental condition; meta-analysis (random effects model) of scenario effect sizes: d = .66). A later meta-analysis revealed that this effect is robust and moderate in size across a variety of paradigms (r = .31, Mullen et al., 1985).

Materials and procedure.

This study is replicated in Slate 1 and Slate 2 using different scenarios. In Slate 1, participants will be presented with the “supermarket” vignette. Supermarket Story. “As you are leaving your neighborhood supermarket a man in a business suit asks you whether you like shopping in that store. You reply quite honestly that you do like shopping there and indicate that in addition to being close to your home the supermarket seems to have very good meats and produce at reasonably low prices. The man then reveals that a videotape crew has filmed your comments and asks you to sign a release allowing them to use the unedited film for a TV commercial that the supermarket chain is preparing.” Following the vignette, participants will be asked three questions: (1) What % of your peers do you estimate would sign the release?, (2) What % would refuse to sign it? [Total % should be 100%], and (3) Would you sign the release or refuse to sign it? Materials here: osf.io/4my2z. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_025nTEnKVG4ul6t.

Analysis plan.

An independent samples t-test will be conducted with participants’ choice (sign release/refuse to sign) as the IV and participant estimate of % of peers who would sign the release as the DV. Note that participants self-select whether to sign or refuse to sign the release, so it is not random assignment to levels of the independent variable. Participants will be included in the analysis if they respond to all three questions and their estimate for the DV (e.g., “what percent of your peers would sign the release”) falls between 0 and 100.
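The inclusion rule and grouping above could be sketched as follows. The record layout (a dict with `pct_sign`, `pct_refuse`, and `choice` keys, where `choice` is `"sign"` or `"refuse"`) is hypothetical and chosen only for illustration; the actual data export may be structured differently.

```python
def is_included(response):
    """Keep a participant only if all three questions were answered and the
    DV estimate (percent of peers who would sign) lies between 0 and 100."""
    required = ("pct_sign", "pct_refuse", "choice")
    if any(response.get(key) is None for key in required):
        return False
    return 0 <= response["pct_sign"] <= 100


def split_by_choice(responses):
    """Group DV estimates by the participant's own (self-selected) choice."""
    groups = {"sign": [], "refuse": []}
    for response in responses:
        if is_included(response):
            groups[response["choice"]].append(response["pct_sign"])
    return groups
```

The two lists returned by `split_by_choice` would then be the inputs to the independent samples t-test.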

Known differences from original.

Following the scenario estimates, Ross and colleagues also asked participants to predict the personality of the typical person who would choose each of the two alternatives on four dimensions (shyness, adventurousness, cooperativeness and trust). We are not including this secondary assessment. The personality prediction variables came after the variable of interest, thus the method of testing the effect of interest is effectively the same.


SLATE 2

14. Ross2

THE “FALSE CONSENSUS EFFECT”: AN EGOCENTRIC BIAS IN SOCIAL PERCEPTION AND ATTRIBUTION PROCESSES (Ross, Greene & House, 1977, Study 1, Traffic Ticket Scenario)

The original study was presented in Effect 13 in Slate 1.

Materials and procedure.

In Slate 2, participants will be presented with the “traffic ticket” vignette (the “supermarket” vignette will be administered to participants in Slate 1). Traffic Ticket Story. “While driving through a rural area near your home you are stopped by a county police officer who informs you that you have been clocked (with radar) at 38 miles per hour in a 25-mph zone. You believe this information to be accurate. After the policeman leaves, you inspect your citation and find that the details on the summons regarding weather, visibility, time, and location of violation are highly inaccurate. The citation informs you that you may either pay a $80 fine by mail without appearing in court or you must appear in municipal court within the next two weeks to contest the charge.” Following the vignette, participants will be asked three questions: (1) What % of your peers do you estimate would pay the $80 fine by mail?, (2) What % would go to court to contest the charge? [Total % should be 100%], and (3) Would you pay the $80 fine by mail or appear in court? Materials here: osf.io/4my2z. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_0GJy9Iu0PYFJWkd.

Analysis plan.

An independent samples t-test will be conducted with participants’ choice (pay the fine/appear in court) as the IV and participant estimate of percent of peers who would pay the fine as the DV. Note that participants self-select whether to pay the fine or appear in court, so it is not random assignment to levels of the independent variable. Participants will be included in the analysis if they respond to all three questions and their estimate for the DV (e.g., “what percent of your peers would pay the fine by mail”) falls between 0 and 100.

Known differences from original.

The original traffic ticket scenario included a $20 fine. The fine has been adjusted to $80 to reflect inflation. Following the scenario estimates, Ross and colleagues also asked participants to predict the personality of the typical person who would choose each of the two alternatives on four dimensions (shyness, adventurousness, cooperativeness and trust). We are not including this secondary assessment. The personality prediction variables came after the variable of interest, thus the method of testing the effect of interest is effectively the same.


15. Giessner

HIGH IN THE HIERARCHY: HOW VERTICAL LOCATION AND JUDGMENTS OF LEADERS’ POWER ARE INTERRELATED (Giessner & Schubert, 2007, Study 1a)

Sixty-four participants formed an impression of a manager based on a few pieces of information, including an organization chart with a vertical line connecting the manager on top with his team below. Participants were randomly assigned to one of two conditions in which the line was either short (2 cm) or long (7 cm). Then, participants evaluated the manager on a variety of qualities including the manager’s power. Participants in the long line condition (M = 5.01, SD = 0.60) perceived the manager to have greater power than participants in the short line condition (M = 4.62, SD = 0.81; t(62) = 2.20, p = .03, d = .55, 95% CI [.049, 1.06]). This result was interpreted as showing that people associate vertical position with power: higher is more powerful.

Materials and procedure.

Participants will receive the following instructions: “In this next part you will be asked to evaluate the manager of a company based on very little information.” On the next page, participants will read the following about the manager: “In the following you will see company A and a picture of a Manager A of this company. The average gross salary of the employees of company A is about 49,000 dollars. The company has 126 employees.” Participants will then see a picture of the fictional manager and an organization chart with the manager at the top with a vertical line connecting to his team either 2 or 7 cm long. Then, they will respond to the following items: (1) I think that Manager A is dominant, (2) I think that Manager A has a strong leader personality, (3) I think that Manager A is self-confident, (4) I think that Manager A has a lot of control in the company, and (5) I think that Manager A holds a very high status within the company, each on a 7-point scale from 1 = totally disagree to 7 = totally agree. Materials here: https://osf.io/79cjv/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_aVhzqs77Q05Ejqd.

Analysis plan.

Responses to the five dependent measures will be averaged and an independent samples t-test will compare mean power rating between the 2 cm and 7 cm conditions. All participants who complete at least one item of the dependent measure will be included in the analysis.
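One way to read the scoring and inclusion rule above is sketched below. Averaging over whichever items were answered is an assumption of this sketch (the text states only that the five items are averaged and that one answered item suffices for inclusion), and `power_rating` is a hypothetical helper name.

```python
def power_rating(items):
    """Mean of the answered power items (1-7 scale); None marks a skipped item.

    Returns None when no item was answered, which signals exclusion.
    """
    answered = [score for score in items if score is not None]
    if not answered:
        return None
    return sum(answered) / len(answered)


def included_ratings(participants):
    """Power ratings for every participant who answered at least one item."""
    ratings = (power_rating(items) for items in participants)
    return [rating for rating in ratings if rating is not None]
```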

Known differences from original.

The original presented employee wages in Euros; that will be converted to other currencies as needed at the current exchange rate and adjusted when deemed necessary by the lead site researcher to maintain psychological equivalence (the rounded exchange rate for USD is presented above). In addition, the authors noted there may be an effect of presentation order. As with all effects, we will examine whether presentation order influenced effect size beyond what is expected by chance.


16. Tversky

THE FRAMING OF DECISIONS AND THE PSYCHOLOGY OF CHOICE (Tversky & Kahneman, 1981, Study 10)

In Tversky and Kahneman (1981), 181 participants considered a scenario in which they were buying two items, one relatively cheap ($15) and one relatively costly ($125). Ninety-three participants were assigned to a condition in which the cheap item could be purchased for $5 less by going to a different branch of the store 20 minutes away. Eighty-eight participants saw another condition in which the costly item could be purchased for $5 less at the other branch. Therefore, the total cost for the two items, and the cost savings for traveling to the other branch, was the same across conditions. Participants were more likely to say that they would go to the other branch when the cheap item was on sale (68%) than when the costly item was on sale (29%, Z = 5.14, p = 7.4 × 10^-7, OR = 4.96, 95% CI [2.55, 9.90]). This suggests that the decision of whether to travel was influenced by the base cost of the discounted item rather than the total cost.

Materials and procedure.

Participants will receive one of two scenarios from the original with dollar amounts approximately adjusted for inflation and the consumer items being replaced with a ceramic vase and a wall hanging. Specifically, one condition will read: “Imagine that you are about to purchase a wall hanging for $250, and ceramic vase for $30. The salesman informs you that the ceramic vase you wish to buy is on sale for $20 at the other branch of the store, located 20 minutes drive away. Would you make the trip to the other store?” The second condition will read: “Imagine that you are about to purchase a ceramic vase for $30, and a wall hanging for $250. The salesman informs you that the wall hanging you wish to buy is on sale for $240 at the other branch of the store, located 20 minutes drive away. Would you make the trip to the other store?” Participants will respond “Yes, I would go to the other branch” or “No, I would not go to the other branch.” Materials here: https://osf.io/8t9ha/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_aW8rIyGPNQ2Wwh7.

Analysis plan.

A two-way contingency table will be built with Price condition ($20 vs. $240) and Response (Yes vs. No) as factors. The critical replication hypothesis will be given by a χ² test and the effect size by an odds ratio. All participants with valid responses will be included in analysis.
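For a 2 × 2 table, the test statistic and effect size named above can be computed directly from the four cell counts. The sketch below uses the uncorrected Pearson chi-square (whether a continuity correction is applied is not stated in this plan, so omitting it is an assumption) and relies on the fact that with one degree of freedom the p-value equals erfc(sqrt(χ²/2)). The cell labels are illustrative.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    a, b: Yes / No counts in the cheap-item-discount condition
    c, d: Yes / No counts in the costly-item-discount condition
    Returns (chi2, two-sided p) with 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = erfc(sqrt(chi2 / 2))
    return chi2, p


def odds_ratio(a, b, c, d):
    """Odds of answering Yes in the first condition relative to the second."""
    return (a * d) / (b * c)


# Illustrative counts only, not data from any study.
chi2, p = chi_square_2x2(30, 10, 10, 30)
effect = odds_ratio(30, 10, 10, 30)
```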

Known differences from the original.

In consultation with the original author, dollar amounts have been adjusted to be more appropriate for 2014. The stimuli were also replaced with consumer items that are relevant in 2014 and plausibly sold by a single salesperson. Further, for replications outside of the U.S., we will use amounts in the local currency that the replication team judges to be psychologically equivalent to these values. The default will be equivalence by exchange rate but may be adjusted further if there are substantial differences in wealth for the available sample.


17. Hauser

A DISSOCIATION BETWEEN MORAL JUDGMENTS AND JUSTIFICATIONS (Hauser et al., 2007, Study 1, Scenarios 3+4)

This study was presented in Effect 11 in Slate 1 using a different scenario. In Slate 2, participants will be presented with the “Ned” and “Oscar” scenarios as the GREATER GOOD and FORESEEN SIDE EFFECT scenarios. In the original study, when these two effects were compared, 72% of subjects judged the action in the foreseen side effect scenario as permissible (95% CI [.69, .74]), and 56% of subjects judged the action in the means to a greater good scenario as permissible (95% CI [.53, .59]). The difference between the proportions was significant (χ²[1, N = 2612] = 72.35, p < .001, w = 0.17, d = .34, 95% CI [.26, .42]).

Materials and procedure.

Participants will respond to one of two moral dilemmas. Unlike those described in Effect 11, these moral dilemma scenarios will be accompanied by an illustration of the situation. In the greater good condition (Scenario 3), participants will read the following: “Ned is taking his daily walks near the train tracks when he notices that the train that is approaching is out of control. Ned sees what has happened: the driver of the train saw five men walking across the tracks and slammed on the brakes, but the brakes failed and they will not be able to get off the tracks in time. Fortunately, Ned is standing next to a switch, which he can throw, that will temporarily turn the train onto a side track. There is a heavy object on the side track. If the train hits the object, the object will slow the train down, thereby giving the men time to escape. Unfortunately, the heavy object is a man, standing on the side track with his back turned. Ned can throw the switch, preventing the train from killing the men, but killing the man. Or he can refrain from doing this, letting the five die.” Then they will answer, “Is it morally permissible for Ned to throw the switch?”

In the foreseen side-effect condition (Scenario 4), participants will read: “Oscar is taking his daily walk near the train tracks when he notices that the train that is approaching is out of control. Oscar sees what has happened: the driver of the train saw five men walking across the tracks and slammed on the brakes, but the brakes failed and the driver fainted. The train is now rushing toward the five men. It is moving so fast that they will not be able to get off the track in time. Fortunately, Oscar is standing next to a switch, which he can throw, that will temporarily turn the train onto a side track. There is a heavy object on the side track. If the train hits the object, the object will slow the train down, thereby giving the men time to escape. Unfortunately, there is a man standing on the side track in front of the heavy object, with his back turned. Oscar can throw the switch, preventing the train from killing the men, but killing the man. Or he can refrain from doing this, letting the five die.” Then they will answer, “Is it morally permissible for Oscar to throw the switch?”

After responding to the scenario, participants will be asked an additional question assessing any prior experience with the task. Materials here: https://osf.io/ci864/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_5byLtAGiE0hAtjT.

Analysis plan.

Participants will be excluded from all analyses if they take fewer than four seconds to read and respond to either of the target scenarios. For the key confirmatory test comparing with the original effect size, we will include only participants who indicate having no prior experience with the task. The original authors suggested the effect may be weaker in participants with prior exposure. Prior exposure will be investigated as a moderator for the other analyses. A two-way contingency table will be built with Scenario (Greater Good vs. Foreseen Side-effect) and Response (Yes vs. No) as factors. The critical replication hypothesis will be given by a one-tailed chi-square test and the effect size by an odds ratio.
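The two analysis samples described above, and one common way of obtaining a one-tailed p-value from a 1-df chi-square statistic, might look like the following sketch. The record fields (`seconds`, `prior_exposure`) are hypothetical, and halving the two-tailed p-value when the observed difference lies in the predicted direction is a conventional approach assumed here rather than a procedure this plan spells out.

```python
from math import erfc, sqrt

MIN_SECONDS = 4  # responses faster than this are excluded from all analyses


def variation_sample(participants):
    """Everyone who took at least four seconds on the target scenario."""
    return [p for p in participants if p["seconds"] >= MIN_SECONDS]


def confirmatory_sample(participants):
    """Subset used for comparison with the original effect size:
    timing criterion met and no reported prior experience with the task."""
    return [p for p in variation_sample(participants) if not p["prior_exposure"]]


def one_tailed_p(chi2, in_predicted_direction):
    """Convert a 1-df chi-square statistic to a directional p-value."""
    two_tailed = erfc(sqrt(chi2 / 2))
    if in_predicted_direction:
        return two_tailed / 2
    return 1 - two_tailed / 2
```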

Known differences from original.

The original research had 19 scenarios divided into four sets. Participants were randomly assigned to one of the sets which contained 4 scenarios, 3 moral dilemmas and one control scenario. We chose to present participants with only one target scenario to save time and since the analyses in the original were between-subjects. Also, the original study had a control condition at the beginning in which there were no people on the alternate track, so switching was obviously permissible. This condition is rarely present in research with this common scenario and is removed for the replication. Given that this paradigm is widely known, the original authors suggested the effect may be weaker for participants who have previously been exposed to this sort of task. So, we have included the additional item assessing participants’ prior knowledge of the task. The direct comparison with the original effect size will be on the subsample that is not familiar with the task. The investigation of variation across sample and setting will include all participants other than those who were excluded initially for responding in less than four seconds.


18. Risen

WHY PEOPLE ARE RELUCTANT TO TEMPT FATE (Risen & Gilovich, 2008, Study 2)

Risen and Gilovich (2008) explored the belief that tempting fate increases bad outcomes. The authors tested whether people judge the likelihood of a negative outcome to be higher when they imagined themselves or a classmate tempting fate, compared to when they do not tempt fate. One hundred twenty participants read a scenario in which either they or a classmate (“Jon”) tempt fate (e.g., by not reading before class), or do not tempt fate (e.g., by coming to class prepared). Participants then estimated how likely it is that the protagonist (themselves or Jon) would be called on by the professor. The predicted main effect of tempting fate emerged, as participants judged the likelihood of being called on to be higher when the protagonist had tempted fate (M = 3.43, SD = 2.34) than when the protagonist had not tempted fate (M = 2.53, SD = 2.24; t(116) = 2.15, p = .034, d = 0.39).

Materials and procedure.

Participants will be randomly assigned to read one of two scenarios. Both scenarios start the same: “Imagine that you are in a large lecture with a few hundred students and you are sitting in the middle section, a little more than half-way back in the room. The professor asks a question about the readings, but no one raises his or her hand to answer.” In the tempting fate version, the scenario continues with “You have not done the reading and feel confident that you would not be able to answer the question.”, while in the control version it continues with “You have done the reading and feel confident that the professor would like your answer, but prefer not to volunteer answers in large classes.” Both scenarios end with the final sentence: “The class sits in silence for two minutes before the professor explains that if no one volunteers, he will choose someone.” After reading the scenario, the belief that tempting fate is bad luck is measured with the question, “How likely do you believe it is that the professor will call on you?” on a 10-point scale ranging from 1 = “Not at all likely” to 10 = “Extremely Likely”. Materials here: https://osf.io/3nkev/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_e8OVts4kSX6qSb3.

Analysis plan.

The two groups will be compared with an independent samples t-test. All participants that answer the dependent measure will be included in analysis. The primary confirmatory test for comparing the original and replication effect size will be based on only the samples using undergraduate students. We will examine gender as a possible moderator of the effect in a supplemental, exploratory analysis.

Known differences from original.

The original study design included self and other scenarios. No self-other differences were found. With the original author’s approval, we limited the study to the two self conditions.


19. Savani

WHAT COUNTS AS A CHOICE? U.S. AMERICANS ARE MORE LIKELY THAN INDIANS TO CONSTRUE ACTIONS AS CHOICES (Savani, Markus, Naidu, Kumar, & Berlia, 2010, Study 5)

Savani and colleagues (2010) examined cultural asymmetry in people’s construal of behavior as choices. In Study 5, 218 participants (90 Americans, 128 Indians) were randomly assigned to either recall personal actions or interpersonal actions, and then to indicate whether the actions constituted choices. The authors found no main effect of condition across cultures: β = –0.13, OR = 0.88, d=.10, t(101) = 0.71, p = .48. Among Americans, there was no difference between construing personal (M = .83, SD = .15) and interpersonal actions (M = .82, SD = .14) as choices, t(88) = .39, p = .65, d = .04. However, Indians were less likely to construe personal actions (M = .61, SD = .26) than interpersonal actions (M = .71, SD = .26) as choices, t(126) = -3.69, p = 0.0002, d = .33.

Materials and procedure.

Participants will be randomly assigned to personal- or interpersonal-choice conditions. In the personal-choice condition, participants will recall eight actions that are mostly self-focused (e.g., participants will be asked to recall the last time they made a purchase for themselves). In the interpersonal-choice condition, participants will recall eight matched actions that involve other people (e.g., participants will be asked to recall the last time they made a purchase for someone else). Following the recollection of each action, participants will indicate whether it constituted a choice. For each item participants will also rate the importance of the action on a 7-point scale (Not at all important; Slightly important; Somewhat important; Moderately important; Quite important; Very important; Extremely important). Materials here: osf.io/pd4ac. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_dcXCkJscEmiTEc5.

Analysis plan.

We will conduct a hierarchical logistic regression analysis with Choice (binary) as the dependent variable, the Importance of decision (ordered categorical) as a trial-level covariate nested within participants, and Condition (categorical) as a participant-level factor indicating whether a participant was in the personal or interpersonal condition. The effect of interest will be the odds of construing an action as a choice, depending on the condition a participant was in, controlling for the reported importance of the action.

Because some survey questions may be less suitable for non-student samples, we will only include university data collections in the primary confirmatory analysis to be compared with the original effect sizes. Data for all participants will be included to examine variability across sample and setting. However, participants must respond to all choice and importance of choice questions to be included in the analysis. The target effect size for replication will be the results obtained for participants from labs in India, to compare to the effect found in the Indian sample in the original (Indian participants were more likely to construe interpersonal actions as choices than personal actions). Although we have only a few labs from India, we are making extra efforts to recruit many participants in those labs. We anticipate that this effect will vary by sample, particularly in line with the original demonstration of cultural differences.

Known differences from original.

Our analysis plan differs from the original in order to obtain an effect size within each of the dozens of samples.


20. Norenzayan

CULTURAL PREFERENCES FOR FORMAL VERSUS INTUITIVE REASONING (Norenzayan, Smith, Kim, & Nisbett, 2002, Study 2)

Western thinking may be more rule based than East Asian thinking. 52 European American (27 men, 25 women), 52 Asian American (28 men, 24 women) and 53 East Asian participants (27 men, 26 women) were randomly assigned to either a classification (decide “which group the target object belongs to”; ⅔ of sample) or similarity judgment (decide “which group the target object is most similar to”; ⅓ of sample) condition.

All participants categorized targets into two alternative groups of 4 exemplars. Both targets and group exemplars were defined according to 4 binary features (e.g., long-stemmed or short-stemmed flowers). In Group 1, all exemplars had one feature in common with each other and with the target. In Group 2, there was no feature in common among all exemplars and the target, but one exemplar had three features in common with the target and three exemplars had two features in common with the target (see Figure 1). As a consequence, Group 2 looked more similar to the target, but there was no feature that could be used as a rule to categorize the target as a member of the group. But, for Group 1, a single feature common to all could be used as a rule for classification. Each set of targets and groups had a mirror-image target so that one group could be used for rule-based classification for one target, and the other group could be used for rule-based classification for the other target.

Figure 1. Examples of Targets and Groups.


When asked “which Group the target object belongs to”, participants across all three cultures preferred to classify based on rule (M = 67%) rather than on family resemblance (M = 33%; F(1, 100) = 44.40, p < .001, r = .55). When asked “which group the target object is more similar to”, European Americans gave many more responses based on the unidimensional rule (M = 69%) than on family resemblance (M = 31%), t(17) = 3.68, p = .002, d = 1.79, 95% CI [.64, 2.89]. On the contrary, East Asians gave fewer rule-based responses than family resemblance responses (Mrule = 41% vs. Mfamily = 59%), t(17) = 2.09, p = .05, d = 1.01. Asian Americans were intermediate, having no preference for one rule over the other (Mrule = 46% vs. Mfamily = 54%), t < 1.

Materials and procedure.

Participants will categorize target objects into one of two groups. In the belonging condition, participants will receive the instruction: “Which group does the target object belong to?” In the similarity condition, participants will receive the instruction: “Which group is the target object more similar to?” The instructions will end saying “Take your time while responding, but do not spend too much time on any single item.” Participants will judge 20 targets all with the same condition. All materials were provided by the original authors and are publicly available. In one group, all exemplars had 1 feature in common with each other and with the target. In the other group, there was no feature in common among all exemplars and the target, but 1 exemplar had 3 features in common with the target and 3 exemplars had two features in common with the target (see Figure 1). Materials here: osf.io/y3e7g. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_dhumIQeN4X1gVzD.

Analysis plan.

We will compute for each subject the percentage of rule-based responses and test whether the means of the two experimental groups (belong vs. similar) on this DV are equal with a t-test for independent samples. The effect size will be given by a standardized mean difference. All participants with data will be included in analysis.
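The per-subject score and the standardized mean difference could be computed along these lines. The trial coding (`"rule"` vs. `"family"`) is a hypothetical labeling for this sketch, and Cohen's d with a pooled standard deviation is used as one common form of standardized mean difference; the plan does not name a specific estimator.

```python
from statistics import mean, variance


def percent_rule_based(trials):
    """Percentage of one participant's trials classified by the rule-based group.

    trials: list of "rule" / "family" labels, one per target judged.
    """
    return 100 * sum(label == "rule" for label in trials) / len(trials)


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample SD."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5
```

Here `group_a` and `group_b` would hold the `percent_rule_based` scores of participants in the belonging and similarity conditions, respectively.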

For additional analysis, a few items about cultural origins of participants and their parents are present in the individual differences assessment. These could be particularly useful for follow-up moderator analysis.

Known differences from original.

The replication will use the same stimuli as the original but the implementation will be slightly different. In the original, participants assigned objects to categories by pressing a key on the keyboard, and the script then advanced automatically to the next trial. In our replication, participants will categorize the object by selecting from a multiple-choice list and will advance the page by clicking “Continue”.

The original design had a ⅔ versus ⅓ split in assignment to condition. In consultation with a reviewer, we changed to equally weighted random assignment. Also, the original study had a practice trial, and the replication does not.

It is worth noting that another study in this slate also involves similarity judgements (Tversky & Gati, 1978) though they are dissimilar in content. It will be instructive to test whether the order between those two studies makes a difference for either one.

21. Hsee

LESS IS BETTER: WHEN LOW-VALUE OPTIONS ARE VALUED MORE HIGHLY THAN HIGH-VALUE OPTIONS (Hsee, 1998, Study 1)

Hsee (1998) demonstrated the less-is-better effect wherein a less expensive gift can be perceived as more generous than a more expensive gift when the less expensive gift is relatively higher priced compared to other items in its category, and the more expensive item is a low-priced item compared to other items in its category. 83 participants imagined that they were about to study abroad and had received a goodbye gift from a friend. In one condition, participants imagined receiving a $45 scarf bought in a store where the prices of scarves ranged from $5 to $50. In the other condition, participants imagined receiving a $55 coat bought in a store where the prices of coats ranged from $50 to $500. Participants in the scarf condition considered their gift giver significantly more generous (M = 5.63) than those in the coat condition (M = 5.00; t(82) = 3.13, p = 0.002, d = .69, 95% CI [.24, 1.14]), despite the gift being objectively less expensive.

Materials and procedure.

Participants will be asked to imagine that they were about to leave the country and had received a goodbye gift from a friend. Participants will be randomly assigned to the scarf or coat condition. The scarf scenario reads “It is a wool scarf, from a nearby department store. The store carries a variety of wool scarves. The worst costs $10 and the best costs $100. The one your friend bought you costs $90.” The coat scenario reads “It is a wool coat, from a nearby department store. The store carries a variety of wool coats. The worst costs $100 and the best costs $1,000. The one your friend bought you costs $110”. Following the scenario, participants will answer a question about the generosity of gift giver on a scale from 0 to 6, where 0 indicates “not generous at all” and 6 indicates “extremely generous”. Materials here: https://osf.io/c4v8x/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_cI3khFhWnIMKE3b.

Analysis plan.

The two conditions will be compared with an independent samples t-test with rated generosity of gift giver as the dependent variable. All participants with data will be included in the analysis.

Known differences from original.

The original study included two additional questions for which statistics were not reported; one about their happiness about receiving the gift and one about the perceived expensiveness of the gift. The two additional questions were described in a footnote as showing the same effect as the generosity item. Dollar values will be approximately inflation adjusted to 2014 dollars. When necessary, dollars will be converted to the primary currency of the data collection site according to the exchange rate and adjusted as deemed necessary to maintain psychological equivalence (as determined by the local researcher).


22. Gray

MORAL TYPECASTING: DIVERGENT PERCEPTIONS OF MORAL AGENTS AND MORAL PATIENTS (Gray & Wegner, 2009, Study 1a)

Gray and Wegner (2009) examined the attribution of intentionality and responsibility as a function of perceived moral agency–the ability to direct and control one’s moral decisions. In Study 1a, 69 participants read about an event involving a person high on moral agency (an adult man) and a person low on moral agency (a baby). In one condition, the man knocked over a tray of glasses, resulting in harm to the baby. In the other condition, the baby knocked over the tray of glasses, resulting in harm to the man. Participants then rated the degree to which the person who committed the act was responsible, how intentional the act was, and how much pain was felt by the victim. The adult man (M = 5.29, SD = 1.86) was evaluated as more responsible for committing the act than the baby (M = 3.86, SD = 1.64, t(68) = 3.32, p = .001, d = .80, 95% CI [.31, 1.30]). Likewise, the adult man (M = 4.05, SD = 2.05) was rated as acting more intentionally than the baby (M = 3.07, SD = 1.55, t(68) = 2.20, p = .03, d = .53). Finally, when on the receiving end of the act, the adult man (M = 4.63, SD = 1.15) was viewed as feeling less pain compared to a baby (M = 5.76, SD = 1.55, t(68) = 3.49, p = .001, d = .85).

Materials and procedure.

Participants will be randomly assigned to read a scenario in which either a man or a baby commits an action that affects the other. For example, in the condition where the baby commits the action, the participant sees:

[Screenshot of the scenario text shown in the baby-commits-action condition]

Participants will then answer three questions: (1) “How responsible is [person who committed action] for his behavior?”, (2) “How intentional is [person who committed action]’s behavior?”, and (3) “How much pain does [person who did not commit action] feel when he gets cut?”, responding to each on a scale from 1 (“Not at all responsible”/“Completely unintentional”/“No pain at all”) to 7 (“Fully responsible”/“Completely intentional”/“Extreme pain”). Materials here: https://osf.io/szg3n/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_3ldCDrWXgDQsujz.

Analysis plan.

We will compare mean perceived responsibility between conditions with an independent samples t-test. The intentionality item will be analyzed the same way as a secondary analysis. All participants with data will be included in the analysis.
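As a rough illustration of the comparison described above, the following sketch computes a Student's independent samples t statistic and Cohen's d with Python's standard library. It assumes equal variances and pooled-SD standardization, which the plan does not specify; the function name and the ratings are hypothetical and are not data from any study. The p-value is left out because the standard library has no t-distribution CDF.

```python
import math
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Student's independent-samples t-test (equal variances assumed).

    Returns (t, df, cohens_d). Assumption: d is standardized by the pooled SD.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    diff = mean(group_a) - mean(group_b)
    t = diff / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    d = diff / math.sqrt(pooled_var)
    return t, df, d


# Hypothetical 1-7 responsibility ratings, invented for illustration only.
adult_agent = [6, 5, 7, 4, 6, 5]
baby_agent = [3, 4, 2, 5, 4, 3]
t, df, d = independent_t(adult_agent, baby_agent)
print(f"t({df}) = {t:.2f}, d = {d:.2f}")
```

The same function would apply to the intentionality item by passing that item's ratings instead.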

Known differences from original.

As did the original study, we include all three dependent variables. However, for the aggregate analyses, we will use only the responsibility item and will report results for the other two items as secondary results.


23. Zhong

WASHING AWAY YOUR SINS: THREATENED MORALITY AND PHYSICAL CLEANSING (Zhong & Liljenquist, 2006, Study 2)

Zhong and Liljenquist (2006) investigated whether moral violations can induce a desire for cleansing. In Study 2, under the guise of a study assessing personality from handwriting, 27 participants hand-copied a first-person account of an ethical act (helping a co-worker) or unethical act (sabotaging a co-worker). Then, participants rated the desirability of five cleaning products and five non-cleaning products. Participants who copied the unethical account (M = 4.95, SD = 0.84) reported that the cleansing products were more desirable than participants who copied the ethical account (M = 3.75, SD = 1.32; F(1,25) = 6.99, p = .01, d = 1.06, 95% CI [.20, 1.89]). There was no difference between the unethical (M = 3.85, SD = 1.21) and ethical (M = 3.91, SD = 1.03) conditions in ratings of non-cleansing products (F(1,25) = 0.02, p = 0.89, d = 0.05).

Materials and procedure.

The opening instructions for all participants will read: “We are conducting a study on how people’s typing skills may reflect certain aspects of their personality, and how that relationship may vary depending on the content of what they are typing. In other parts of the study, you complete some personality questionnaires that will allow us to explore this relationship. We would now like you to type the paragraph below. Although you should try to minimize errors, please don’t type any slower or faster than you would under normal conditions—simply type at the speed you naturally would for a casual word-processing task.” Participants in the unethical condition will be asked to copy the following passage: “Two years ago, when I was a junior partner at a prestigious law firm, I was coming up for promotion against another junior partner, Chris. For several months, Chris had been working on a major case for the city that would make or break his career at the firm. However, he could not locate a key zoning document, without which, it was unlikely that he would have sufficient evidence to successfully argue his case. Late one evening, as I was rummaging through a corner filing cabinet, I happened to come across the zoning document that Chris was in desperate need of. I pulled it from the cabinet and walked over to the office shredder, knowing that my promotion would now be secured.”

Participants in the ethical condition will be asked to copy this passage: “Two years ago, when I was a junior partner at a prestigious law firm, I was coming up for promotion against another junior partner, Chris. For several months, Chris had been working on a major case for the city that would make or break his career at the firm. However, he could not locate a key zoning document, without which, it was unlikely that he would have sufficient evidence to successfully argue his case. Late one evening, as I was rummaging through a corner filing cabinet, I happened to come across the zoning document that Chris was in desperate need of. I pulled it from the cabinet and placed it without a note on Chris’ desk, knowing that he would be so relieved when he arrived to work the next morning.”

Next, using a 7-point scale from 1 = not at all to 7 = very much, participants will answer “How much do you desire this product?” for five cleaning products (e.g., Dove shower soap, Crest toothpaste, Windex glass cleaner, Lysol countertop disinfectant, and Tide laundry detergent) and five control products (e.g., Post-it notes, Nantucket Nectars juice, Energizer batteries, Sony CD cases, and Snickers candy bars) presented in a randomized order. Materials here: https://osf.io/idgpt/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_8caXWyHBqIJK3lP.

Analysis plan.

The key factor of interest is whether condition affects ratings of the cleaning products, so ratings of the five cleaning products will be averaged and compared between the two conditions (ethical or unethical) with an independent samples t-test. A second comparison is whether there is a condition difference between ratings of the other products. The theoretical expectation is that this difference will be weak or near zero. This will be examined as a secondary analysis as a 2 (ethical-unethical) x 2 (cleaning-other products) mixed model ANOVA, and a follow-up independent samples t-test comparing ratings of other products between the ethical and unethical conditions.[3]

Participants who copy less than half of the target passage will be excluded from analysis.
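The exclusion rule and the cleaning-product composite can be sketched as follows. How "less than half" is measured is not stated in the plan, so the character-count rule below is purely an assumption, as are the field names, the stand-in passage, and the ratings, none of which come from the study materials.

```python
from statistics import mean

# Hypothetical keys for the five cleaning-product ratings.
CLEANING_ITEMS = ["soap", "toothpaste", "glass_cleaner", "disinfectant", "detergent"]


def copied_enough(typed_text, target_text, threshold=0.5):
    """Assumed rule: keep participants whose typed text is at least half as long as the passage."""
    return len(typed_text.strip()) >= threshold * len(target_text)


def cleaning_composite(ratings):
    """Average the five 1-7 cleaning-product ratings for one participant."""
    return mean(ratings[item] for item in CLEANING_ITEMS)


target = "x" * 200  # stand-in for the passage participants are asked to copy

# Hypothetical participants, invented for illustration only.
participants = [
    {"condition": "unethical", "typed": "x" * 190,
     "ratings": {"soap": 5, "toothpaste": 4, "glass_cleaner": 6, "disinfectant": 5, "detergent": 5}},
    {"condition": "ethical", "typed": "x" * 60,  # under half the passage, so excluded
     "ratings": {"soap": 2, "toothpaste": 3, "glass_cleaner": 2, "disinfectant": 3, "detergent": 2}},
    {"condition": "ethical", "typed": "x" * 200,
     "ratings": {"soap": 3, "toothpaste": 4, "glass_cleaner": 4, "disinfectant": 3, "detergent": 4}},
]

included = [p for p in participants if copied_enough(p["typed"], target)]
composites = [(p["condition"], cleaning_composite(p["ratings"])) for p in included]
print(composites)
```

The per-participant composites would then be compared between the ethical and unethical conditions with the independent samples t-test named in the analysis plan.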

Known differences from original.

The original was presented on pencil and paper and participants copied the text under the guise of a personality test. In the replication, the whole procedure will be administered on a computer and participants will type an adapted version of the original story under the guise of a study measuring personality and typing speed. This adaptation was recommended by the original authors.


24. Schwarz

ASSIMILATION AND CONTRAST EFFECTS IN PART-WHOLE QUESTION SEQUENCES: A CONVERSATIONAL LOGIC ANALYSIS (Schwarz, Strack & Mai, 1991, Study 1)

In Study 1, 456 participants answered a question about life satisfaction in a specific domain (“How satisfied are you with your relationship?”) and a question about life satisfaction in general (“How satisfied are you with your life-as-a-whole?”). Participants were randomly assigned to the order of answering the specific and general questions. When the specific question was asked first, the correlation between the responses to the two questions was strong (r = .67, p < .05). When the specific question was asked second, the correlation between them was weaker (r = .32, p < .05). The difference between these correlations was significant, z = 2.32, p < .01, d = 0.22, 95% CI [.12, .31].

The authors suggest that the specific-first condition makes the relationship more accessible such that participants then are more likely to incorporate information about their relationship when evaluating a more general question about their life satisfaction. Because responses to the two items are linked by the accessibility of relationship information, they should be correlated. In contrast, in the specific-second condition, relationship satisfaction is not necessarily accessible and participants may draw on any number of different areas to generate their overall life satisfaction response. Thus, the correlation between the items is weaker than in the specific-first condition.

Materials and procedure.

Participants will rate their satisfaction on an 11-point scale from 1 = very dissatisfied to 11 = very satisfied in response to two questions: “How satisfied are you currently with your life-as-a-whole?” and “Please think about your relationship to your partner (spouse or date). How satisfied are you currently with your relationship?” The items will be presented on separate screens in a randomized order. Materials here: https://osf.io/m9iv4/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_ehNrfGOaZoxxURT.

Analysis plan.

We will compute the correlation between responses to the general and specific question in each item order condition, and then compare the correlations using the Fisher r-to-z transformation. Participants with valid responses to both items will be included in the analysis.
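The comparison of two independent correlations via the Fisher r-to-z transformation can be sketched as below. The function name is hypothetical, and the group sizes in the usage line are placeholders chosen for illustration; per-condition sample sizes are not given in the summary above. A two-sided p-value is shown, which is also an assumption about how the test would be reported.

```python
import math
from statistics import NormalDist


def compare_correlations(r1, n1, r2, n2):
    """Test the difference between two independent Pearson correlations.

    Each r is transformed with Fisher's z = atanh(r); the difference is divided
    by its standard error, sqrt(1/(n1 - 3) + 1/(n2 - 3)).
    Returns (z, two_sided_p).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Correlations from the summary above; n = 50 per condition is a placeholder.
z, p = compare_correlations(0.67, 50, 0.32, 50)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```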

Known differences from original.

The original was administered in German. The original had nine conditions whereas the replication will use just two of those: one in which the question about life satisfaction is asked before the one about relationship satisfaction, and a second in which the order is reversed. In the original these conditions were dubbed the “general-specific” order and “one-specific-general” order. Here we use “specific-second” and “specific-first” to refer to these conditions. Further, in the original, the general-specific order included additional specific questions after the relationship item. Those additional specific questions are not relevant for the effect measured here so they are not retained. Finally, in the original procedure, no other measures preceded this task. The effect is about the influence of question context, so it is reasonable to presume that task order will have an impact on the estimated effect. As such, the task order analysis will be particularly important for this effect, and the most direct comparison with the original is for the conditions in which this task is administered first.


25. Shafir

CHOOSING VERSUS REJECTING: WHY SOME OPTIONS ARE BOTH BETTER AND WORSE THAN OTHERS (Shafir, 1993, Study 1)

One hundred and seventy participants imagined that they were on the jury of a custody case and had to choose between two parents. One of the parents had both more strongly positive and more strongly negative characteristics (extreme) than the other parent (average). Participants were randomly assigned to either decide to award custody to one parent or to deny custody to one parent. Participants were more likely to both award (64%) and deny (55%) custody to the extreme parent than the average parent, the sum of probabilities being significantly greater than 100% (z = 2.48, p < .02, d = 0.43, 95% CI = [0.09, 0.77]). This finding was consistent with the hypothesis that negative features are weighed more strongly when people are rejecting options, and positive features are weighed more strongly when people are selecting options (Shafir, 1993).

Materials and procedure.

All participants will read the following prompt: “Imagine that you serve on the jury of an only-child sole-custody case following a relatively messy divorce. The facts of the case are complicated by ambiguous economic, social, and emotional considerations, and you decide to base your decision entirely on the following few observations.” The last sentence will be randomized between participants to be either “To which parent would you award sole custody of the child?” or “Which parent would you deny sole custody of the child?” The parents’ characteristics will be presented in a tabular format with five features each. Parent A (average) has average income, average health, average working hours, reasonable rapport with the child, and a relatively stable social life. Parent B (extreme) has above-average income, very close relationship with the child, extremely active social life, lots of work-related travel, and minor health problems. After reading the prompt and characteristics, participants will choose which parent to award/deny custody. Materials here: https://osf.io/ek5gz/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_3DvOfaiJUOQ0WUZ.

Analysis plan.

The proportions of participants awarding and denying custody to parent B will be summed across the two groups and tested against 100% with a z test. The effect size will be computed by estimating a logistic regression model on the 2 x 2 table and then exponentiating the unstandardized beta parameter of the main effect of parent B, which can be interpreted as an odds ratio. All participants with data will be included in the analysis.
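One way to test the summed proportions against 100% is sketched below. It assumes the standard error is the square root of the summed binomial variances of the two independent groups; the plan does not spell out the formula, so this is an illustration and not necessarily the exact computation used in the original. The function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def sum_of_proportions_z(award_b, n_award, deny_b, n_deny):
    """z test of whether P(award to B) + P(deny to B) differs from 1.

    Assumption: the two groups are independent, so the variance of the sum is
    the sum of the two binomial variances p(1 - p)/n.
    Returns (z, two_sided_p).
    """
    p_award = award_b / n_award
    p_deny = deny_b / n_deny
    se = math.sqrt(p_award * (1 - p_award) / n_award + p_deny * (1 - p_deny) / n_deny)
    z = (p_award + p_deny - 1) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Hypothetical counts: 60 of 100 award custody to B, 60 of 100 deny custody to B.
z, p = sum_of_proportions_z(60, 100, 60, 100)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```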

Known differences from original.

None known.


26. Zaval

HOW WARM DAYS INCREASE BELIEF IN GLOBAL WARMING (Zaval, Keenan, Johnson & Weber, 2014, Study 3A)

Zaval et al. (2014) investigated how beliefs in climate change could be influenced by immediately available information about temperature. In Study 3A, 300 Mechanical Turk workers reported their beliefs about global warming after completing one of three scrambled sentence tasks: one with words priming the concept of heat, one with words priming the concept of cold, or a control task with no theme. There was a significant effect of condition on both global warming belief, F(2, 288) = 3.88, p = .02, and concern, F(2, 288) = 4.74, p = .01. Post hoc pairwise comparisons revealed that participants in the heat-priming condition expressed stronger belief (M = 2.7, SD = 1.1) in global warming than participants in the cold-priming (M = 2.4, SD = 1.1; t(191) = 2.08, p = .03, d = .30, 95% CI [.02, .59]) or control conditions (M = x.xx, SD = x.xx; t(xx) = x.xx, p = .02, d = .xx). Likewise, participants in the heat-priming condition expressed greater concern (M = X.xx, SD = X.xx) about global warming than participants in the cold-priming (M = x.xx, SD = x.xx; t(xx) = x.xx, p = .07, d = .xx) or control conditions (M = x.xx, SD = x.xx; t(xx) = x.xx, p = .03, d = .xx). [Note: Some relevant statistics for clarifying the effect between experimental conditions are not available in the original article. We will follow up with the original authors to obtain these values.]

Materials and procedure.

First, participants will complete a 13-item scrambled sentences task, in which they will form a complete sentence by using four of five provided words. Participants will be randomly assigned to one of two conditions using different words. In one condition, 6 of the sentences will contain a word related to heat (e.g., boil, sunburn, hot). In the other condition, 6 of the sentences will contain a word related to cold (e.g., cold, frozen, shivers).

Next, participants will respond to two questions on a scale from 1 (not at all convinced/worried) to 4 (completely convinced/worried):

  1. “How convinced are you that global warming is happening?”, and
  2. “How much do you personally worry about global warming?”.

Materials here: https://osf.io/a4sih/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_78PEVpJ0kcA62rj.

Analysis plan.

Mean differences in belief and concern about global warming between the heat-priming and cold-priming conditions will be evaluated with independent samples t-tests. The scrambled sentence task could introduce unanticipated variation across translations. As such, the direct replication test will use only English language sites. As with all other effects, all samples and settings with data will be included in analyses examining heterogeneity to see whether factors such as translation have an impact on effect estimates.

Known differences from original.

The original experiment used an initial question about current temperature perception followed by a 10-minute delay of unrelated filler material. The initial question is not relevant to the direct replication so will not be included. Also, the original experiment had a control condition that will not be included. The primary target for replication is concern about global warming. Belief in global warming will be included as a secondary replication.

Translated versions will be excluded from the direct replication test, as noted above, due to concerns that direct translations of the scrambled sentences task may be impractical. Translators will be instructed to remain as true to the original as possible, but to emphasize the construct being manipulated instead of translating the words exactly.


27. Knobe

INTENTIONAL ACTION AND SIDE EFFECTS IN ORDINARY LANGUAGE (Knobe, 2003, Study 1)

Consider an agent who knows that their behavior will have a particular side effect, but does not care whether the side effect does or does not occur. If the agent chooses to go ahead with the behavior and the side effect occurs, do people believe that the agent brought about the side effect intentionally? Knobe (2003) had participants read vignettes about such situations and found that participants were more likely to believe the agent brought about the side effect intentionally when the side effect was harmful compared to when it was helpful. In the harm condition, 82% of participants said that the agent brought about the side-effect intentionally, whereas in the help condition, 77% said that the agent did not bring about the side-effect intentionally (χ²(1, N = 78) = 27.2, p < .001, d = 1.46). Agents who brought about harmful side effects were also rated as being more blameworthy than agents who brought about helpful side effects were rated as being praiseworthy, t(120) = 8.4, p < .001, d = 1.55. The total amount of blame or praise attributed to the agent was associated with believing the agent brought about the side effect intentionally, r(120) = .53, p < .001, d = 1.25, 95% CI [.79, 2.79].

Materials and procedure.

Participants will read a vignette where a company either harms or helps the environment: “The vice-president of a company went to the chairman of the board and said, ‘We are thinking of starting a new program. It will help us increase profits, but it will also harm[help] the environment.’ The chairman of the board answered, ‘I don’t care at all about harming[helping] the environment. I just want to make as much profit as I can. Let’s start the new program.’ They started the new program. Sure enough, the environment was harmed[helped].”

After reading the vignette, participants will be asked to indicate their agreement with the statement “The chairman harmed [helped] the environment intentionally,” on a 7-point scale from 1 = strongly disagree to 7 = strongly agree. Participants will then judge how much blame the chairman deserved in the harm condition, or how much praise the chairman deserved in the help condition, on a 7-point scale (1 = no blame [praise]; 7 = a lot of blame [praise]). Materials here: https://osf.io/dcbmw/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_b1qjejRBoOI1nwx.

Analysis plan.

Ratings of intentionality in the harm and help conditions will be compared using an independent-samples t-test. This is the focal test for the direct replication. Blame and praise ratings will also be collected but are secondary analyses. All participants with data will be included in analysis.

Known differences from original.

In the original study, participants indicated whether the chairman intentionally harmed or helped the environment using a yes/no question. Subsequent research examining this effect has used a 7-point agreement scale rather than the dichotomous response. We use the updated 7-point scale.


28. Gati

STUDIES OF SIMILARITY (Tversky & Gati, 1978, Study 2)

Tversky and Gati (1978) investigated the relationship between directionality and similarity. Seventy-seven participants made 21 similarity ratings of country pairs in which one country (e.g., U.S.A.) was pre-tested as more prominent than the other (e.g., Mexico). Each pair was presented with either the more prominent country first (U.S.A.-Mexico) or the less prominent country first (Mexico-U.S.A.). In the two versions of the 21-pair survey, the more prominent country was presented first “about an equal number of times”. Results indicated that participants’ similarity ratings were higher when the less prominent country was displayed first than when the more prominent country was displayed first, t(153) = 2.99, p = .001, d = .48, 95% CI = [.16, .80], and that higher similarity ratings were given to the version of each pair that listed the more prominent country second, t(20) = 2.92, p = .001, d = 1.31, 95% CI [.33, 2.26].

They then conducted a follow-up study (N = 46) with the same design except that participants rated differences rather than similarities. Consistent with the prior result, participants’ difference ratings were higher when the more prominent country was displayed first than when the less prominent country was displayed first, t(45) = 2.24, p < .05, d = 0.67, 95% CI [.34, .98], and higher difference ratings were given to the version of each pair that listed the more prominent country first, t(20) = 2.72, p < .01, d = 1.22, 95% CI [.64, 1.78].

Materials and procedure.

A list of country pairs was assembled in which one country was rated more prominent than the other. These items were taken from the original paper and updated as noted in the differences from original section. The 21 pairs of countries include:

  1. U.S.A - Mexico,
  2. Russia - Poland,
  3. China - Albania,
  4. U.S.A - Israel,
  5. Japan - Philippines,
  6. U.S.A - Canada,
  7. Russia - Israel,
  8. England - Ireland,
  9. Germany - Austria,
  10. Russia - France,
  11. Belgium - Luxembourg,
  12. U.S.A - Russia,
  13. China - North Korea,
  14. India - Sri Lanka,
  15. U.S.A - France,
  16. Russia - Cuba,
  17. England - Jordan,
  18. France - Israel,
  19. U.S.A - Germany,
  20. Russia - Syria, and
  21. France - Algeria.

We have created one list where 11 pairs have the prominent country first and another list with the opposite pairings (10 pairs that have prominent country first). In the original study an initial pilot was performed on 68 individuals who judged which of each pair was more prominent. In the list of the pairs of countries reported here, the first country was judged as the more prominent country by at least ⅔ of the participants.

Participants will be randomly assigned to one of the two order conditions above, and will be randomly assigned to rate similarities or differences between the two countries. The similarities condition reads: “You will be presented with a number of country pairs and be asked to judge the similarity of one country to another. Please use the scale ranging from 1 = no similarity to 20 = maximal similarity.” Participants will then rate the similarity of 21 country pairs with one country being more prominent than the other. Participants will rate each pair with a single item from 1 = no similarity to 20 = maximal similarity. The differences condition will use virtually the same instructions except with “difference” in place of “similarity”, and participants will rate the countries on a scale from 1 = minimal difference to 20 = maximal difference.

Materials here: https://osf.io/a2jwk/. Test the study here: https://ufl.qualtrics.com/SE/?SID=SV_39Q9WUKfgyMRk1f.

Analysis plan.

We will perform three analyses on the data. For the primary analysis, we will analyze the data through a general linear mixed model with a random effect for the item pair nested within subject, and a fixed factor ‘order’ representing the order of the pair (prominent first vs. prominent second). Fitting this model will allow evaluation of both effects. If the intercept is significantly greater than 0, this would confirm the finding that, at the participant level, if there is an effect for the factor ‘order’, the pairs in which the prominent country appeared second will be rated as more similar than those in which the prominent country appeared first. We will convert the beta provided by this intercept term into a Cohen’s d effect size.

Second, we will recreate the original analysis used to get a participant-level effect of making similarity judgments where either the more or less prominent country comes first. We will compute an asymmetry score for each subject, calculated as the average similarity for comparisons where the prominent country appears second minus the average for the comparisons where the prominent country appears first. Using a one-sample t-test, we will test this difference score against zero (original d = .48). Third, using a matched-pairs t-test, we will compare the average score for each pair when it was prominent-first compared to prominent-second (original d = 1.31).

Because these latter two analyses do not account for the fact that the variance in ratings is crossed between participants and pairs, they will be secondary and only used as a comparison for the original analysis. All participants with data will be included in the analysis. These analyses will be repeated for the differences conditions and reported as a separate study. Because of the random assignment to similarity or difference conditions, each site will have half as much data for its critical test as the other effects. This will likely increase the standard error of its estimates by comparison.
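The participant-level asymmetry score used in the second analysis can be sketched as follows. The data layout (a list of `(prominent_first, rating)` tuples per participant), the function names, and the ratings are hypothetical and serve only to show the computation; the p-value is omitted because the standard library has no t-distribution CDF.

```python
import math
from statistics import mean, stdev


def asymmetry_score(ratings):
    """Mean rating with the prominent country second minus mean with it first.

    `ratings` is a list of (prominent_first, rating) tuples for one participant.
    """
    prominent_second = [r for first, r in ratings if not first]
    prominent_first = [r for first, r in ratings if first]
    return mean(prominent_second) - mean(prominent_first)


def one_sample_t(scores):
    """One-sample t statistic against zero. Returns (t, df)."""
    n = len(scores)
    return mean(scores) / (stdev(scores) / math.sqrt(n)), n - 1


# Hypothetical 1-20 similarity ratings, four pairs per participant for brevity.
participants = [
    [(True, 8), (False, 10), (True, 6), (False, 9)],
    [(True, 12), (False, 13), (True, 10), (False, 10)],
    [(True, 5), (False, 7), (True, 9), (False, 10)],
    [(True, 14), (False, 13), (True, 11), (False, 13)],
]
scores = [asymmetry_score(p) for p in participants]
t, df = one_sample_t(scores)
print(scores, f"t({df}) = {t:.2f}")
```

Because a matched-pairs t-test is equivalent to a one-sample t-test on the paired differences, the item-level analysis could reuse `one_sample_t` on the per-pair difference between mean prominent-second and mean prominent-first ratings.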

Known differences from original.

The original study was likely completed in a paper and pencil format. Also, in the original, the similarity and difference conditions were separate studies. A reviewer suggested including both for the present design.

In the current 21 pairs, Ceylon is changed to Sri Lanka, West Germany is changed to Germany, and U.S.S.R. is changed to Russia. Because this test will be performed in many different countries, the country in each pair that is considered most prominent might differ depending on the sample (e.g., participants in Israel might judge Israel to be more prominent than France). It is also worth noting a priori that another study in this slate involves similarity judgments (Norenzayan et al., 2002), and it may be relevant to analyze order effects between these two effects in particular.