The following post was reproduced (with permission) from the June 2014 Issue of Status: A Report on Women in Astronomy. The author is Nancy D. Morrison, The University of Toledo, Department of Physics & Astronomy.
Recently, we've heard a lot about the gender gap in wages: the full-time median salary for women is lower than that of men in almost all occupations,  and a gap persists in many occupations when age and skill level are controlled for. Explanations can be grouped broadly into three categories: bias, whether conscious or unconscious; entry of women into lower-wage occupations because of skills or preferences; and less competitiveness among women than among men.
There are many ways to slice the data. It is commonplace to say that workers in female-dominated occupations generally earn less than those in male-dominated ones. Another common observation is that women are less willing to negotiate; both are aspects of self-selection by women. Discrimination is still a factor. Another recent finding is that the salary gap is greatest in business and law, where per-hour pay for employees working longer hours is greatest; the gap thus reflects the culture and the structure of the occupation.
In science, we confront all these issues. In addition, the early stages of our careers are strongly affected by math-based tests such as the GRE, both the quantitative general test and the physics subject test, on which women tend to score lower than men. For example, on the quantitative general test in 2006-2007, the median score for women was more than 50 points lower than that for men, and the 75th percentile score was about 30 points lower.  This difference is enough to disqualify a significant number of women and minorities from graduate admission if a hard cutoff score of 700 is used, as it often is in elite programs. If we assume that women are just as good at math as men, then why the difference?
Interesting research on the performance of women and men on math-based tests has been carried out by Olga Shurchkov, Assistant Professor of Economics at Wellesley College.  In lab experiments, she assessed the performance of male and female students who were paid to solve verbal and math puzzles, in competitive and noncompetitive environments and with high and low time pressure. In her analysis, she took care to tease out various effects on the students' performance. She also carried out a labor market analysis to investigate whether her findings on time pressure and competition carry over into the workplace. Her paper provides background on the research area. The rest of this article discusses her methodology and findings, which bear on several aspects of the gender gap outlined above.
The verbal task was a "Words-in-a-Word" puzzle, which was based on an on-line game. Subjects had to form as many shorter words as possible out of a longer word. The math puzzle was designed to be a comparable task: out of a string of digits, subjects had to find as many sets of numbers as possible that add to a target number. For example, in the string 1034582614 with a target number of 117, the correct solutions include 103 + 14, 54 + 63, and 14 + 43 + 60. For both the math and the verbal puzzle, precautions were taken to ensure that the difficulty of the puzzles was constant from trial to trial. In both puzzles, points were deducted for mistakes, such as illegal words, using a digit more times than it appeared in the numerical sequence, and so forth. Therefore, negative scores were possible.
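The validity rule for the math puzzle can be sketched in a few lines of Python. This is only an illustration, not the scoring code used in the study: the function name `is_valid_solution` is hypothetical, and the rule that digits may be rearranged freely is inferred from the example above (54 + 63 uses digits that are not adjacent in the string). The point values for solutions and mistakes are not stated, so the sketch checks validity only.

```python
from collections import Counter


def is_valid_solution(digit_string, target, numbers):
    """Return True if `numbers` add to `target` using only the available digits."""
    if sum(numbers) != target:
        return False
    available = Counter(digit_string)
    used = Counter("".join(str(n) for n in numbers))
    # Counter subtraction keeps only positive counts, so a non-empty result
    # means some digit was used more often than it appears in the string.
    return not (used - available)


digits = "1034582614"
print(is_valid_solution(digits, 117, [103, 14]))     # True
print(is_valid_solution(digits, 117, [54, 63]))      # True
print(is_valid_solution(digits, 117, [14, 43, 60]))  # True
print(is_valid_solution(digits, 117, [111, 6]))      # False: the digit 1 appears only twice
```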
In each verbal and each math session, there were three rounds in which the subjects had two minutes to solve each puzzle (high time pressure or short duration) and three rounds in which they had ten minutes (low time pressure or long duration). After each round, the subjects guessed their rank within their group of four, with payment for a correct guess.
Within each time pressure regime, different payment schemes were used to test the subjects' performance under, and preference for, competition. In a noncompetitive piece-rate treatment, no winner was announced, and each subject was paid in proportion to the number of points earned. In the competitive treatment, or "tournament," only the top scorer in each foursome was paid.
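As a rough illustration of the two payment schemes, the following Python sketch computes payouts for one group of four. The per-point rate, the function names, the even split among tied winners, and the floor of zero for negative scores are all assumptions made for the example. The article states only that piece-rate pay was proportional to points, that only the top scorer was paid in the tournament, and that the potential tournament payoff was four times higher.

```python
def piece_rate_payouts(scores, rate=1.0):
    """Pay every subject in proportion to points earned.

    Assumption: negative scores earn nothing (floored at zero).
    """
    return [rate * max(s, 0) for s in scores]


def tournament_payouts(scores, rate=1.0, multiplier=4.0):
    """Pay only the top scorer(s), at `multiplier` times the piece rate."""
    best = max(scores)
    winners = [i for i, s in enumerate(scores) if s == best]
    # Assumption: tied winners split the prize evenly.
    prize = multiplier * rate * max(best, 0) / len(winners)
    return [prize if i in winners else 0.0 for i in range(len(scores))]


group = [10, 4, 7, -2]  # hypothetical scores for one foursome
print(piece_rate_payouts(group))  # [10.0, 4.0, 7.0, 0.0]
print(tournament_payouts(group))  # [40.0, 0.0, 0.0, 0.0]
```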
After completing the piece-rate and the tournament rounds, subjects were invited to choose which scheme to use for the later rounds, and their preference for competition was analyzed.
In the verbal sessions, the subjects comprised 27 groups of two men and two women and five groups of all men. Taking part in the math sessions were 84 people, 21 groups of two men and two women and three all-female groups. All the subjects were undergraduate and graduate students at Boston-area universities, mainly Harvard, who were not otherwise selected for scholastic ability. Gender was the only demographic characteristic discussed in the study. It was not emphasized at any point in the experiments, but the subjects could see the gender of the members of their group. After completing each puzzle, the subjects saw their own scores but not those of the others in their group, and they were given no information about their ranking within their group. At the end of the experiment, subjects were asked demographic questions and questions about their strategies during the experiment. They were also asked whether they thought men or women would be better at the math and verbal tasks.
In the noncompetitive setting, the mean math scores for men and women were virtually identical, and the distributions were not very different, as Figure 1 shows. While one or two male subjects scored very high, several of the men obtained negative scores by making mistakes. Thus, the men's and women's overall math abilities, as measured by their mean scores on this task, were similar. Stereotype threat was present: only 31% of male and female subjects thought that women would be better at the math task. In the tournament setting, the men did a bit better, while the women did significantly worse, showing a significant increase in the number of negative scores and a statistically significant drop in mean score. In the choice setting, significantly more men than women selected the tournament, in which the potential payoff was four times higher. This result held up when performance in the previous rounds was controlled for, which is important because the highest scorers were all male. Confidence, as measured by rank guess, was also a predictor of a subject’s likelihood of entering the tournament.
The next variations were designed to determine how the women would perform relative to the men when the stereotype threat and the time pressure were relaxed. In the verbal puzzles, stereotype threat, if anything, favored the women, since a majority of both genders said they expected women to be better at the verbal puzzles than men. In this setting there was no significant difference between the women and the men in either the noncompetitive or the competitive setting. In addition, men and women were equally likely to choose the tournament when offered the choice in the later round.
Figure 1. Distribution of math scores by gender in the noncompetitive (piece-rate) setting. Redrawn from Shurchkov’s Figure 1(a). Frequency is proportional to the number of occurrences of each score, such that all frequencies add to 1.
Figure 2. Distribution of math scores by gender under competition with high (left) and low (right) time pressure, redrawn from Shurchkov’s Figures 1(b) and 3(b) (left and right, respectively).
When the time pressure was relaxed, both genders improved their scores significantly in the math task, and now there was no significant difference between the genders in either the piece-rate or the tournament setting. There was a high peak at the right-hand end of the score distribution for both genders, in both settings. Interestingly, in the tournament, the male subjects showed a high frequency of negative scores, indicating a high share of mistakes. Figure 2 compares the score distributions from the short- and long-duration math competitions. In the choice rounds, women were nearly twice as likely to select the tournament as they were in the high-pressure setting, while the men's choices remained the same.
In the verbal task, the men and women did about equally well in the piece-rate setting; the score distributions are almost identical. In the tournament setting, both genders improved, but the women improved dramatically. In the choice setting, there was little difference between women and men in likelihood of choosing the tournament, once confidence (rank guess) and prior performance were controlled for.
Also analyzed was mistake share, the number of points lost due to invalid solutions divided by the total possible number of points. In the math task, women's mistake share in the tournament setting decreased significantly when the time pressure was reduced. In the verbal task, on the other hand, the women's mistake share was unchanged, but the men's mistake share increased when time pressure was reduced. The hypothesis that this rise in mistake share might be due to men's greater preference for risk was explored and found wanting. More likely, the men used the extra time to increase the number of solutions they found, rather than their quality.
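The mistake-share definition translates directly into code. The function name and the numbers below are hypothetical and serve only to show the arithmetic.

```python
def mistake_share(points_lost, total_possible_points):
    """Points lost to invalid solutions divided by the total possible points."""
    if total_possible_points <= 0:
        raise ValueError("total_possible_points must be positive")
    return points_lost / total_possible_points


# Hypothetical round: 3 points lost out of 20 possible.
print(mistake_share(3, 20))  # 0.15
```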
In the long-duration games, subjects had the option of quitting before the time expired. In the math tournament, women were significantly more likely than men to quit early, but there was no gender difference in the verbal tournament. Both genders were less likely to quit in the tournament than in the piece-rate setting, a result showing that competition has benefits.
Table 1 gives a brief summary of the results. The men and women performed about equally well in all but two of the settings: in the short-duration (high-pressure) math tournament, the men did much better, and in the long-duration (low-pressure) verbal tournament, the women did much better.
Table 1. Results of piece-rate (noncompetitive) and tournament (competitive) math and verbal games. M = men superior, W = women superior, E = men and women roughly equal.
Labor Market Analysis
In short, Shurchkov found that differences in math test scores between men and women were greatly diminished when stereotype threat, time pressure, and competition were removed from the setting. To see whether these findings are reflected in the labor market, she examined individual-level labor market data from the years 2003 through 2009. She grouped occupations into low- and high-pressure and math and verbal categories based on classifications from CareerCast.com.
Examples of high-pressure jobs that emphasize mathematical skills include financial analysis and management, while mathematicians, actuaries and accountants fall into a lower-pressure category. High-pressure jobs with a verbal emphasis include journalism, while people in low-pressure jobs in the verbal category include librarians, novelists and poets. Admittedly, few jobs are purely mathematical or verbal; rather, they combine these attributes in varying amounts.
Regression analysis of real earnings against gender and other demographic variables revealed a significant (at the 1% level) salary gap in high-pressure math jobs, smaller gaps in high-pressure verbal jobs, and little if any gap in lower-pressure verbal jobs. The high-pressure math jobs are also the ones with a lower share of women: “... a woman is almost 20% less likely to work at a high-pressure math job than a man of similar characteristics.” Those probability differences are reversed in sign for the low-pressure jobs, both the math and the verbal.
Shurchkov designed her experiments well to separate competing effects. She provided objective evidence that the women in the study have similar basic math ability to the men, but she confirmed conventional wisdom that women perform less well than men on mathematical tasks – where stereotype threat may be in force – in competition and under time pressure. Removing either stereotype threat or competition enabled women to perform as well as men, and removing both sources of pressure enabled women to perform better than men: they excelled in the verbal competition, earning a higher average payout than the men. Part of the reason was that the women used the extra time to improve the quality of their work, while the men appeared to aim for increased quantity, thereby increasing their mistake share.
What do these results tell us about how we ought to be preparing future scientists? Although some argue that science is more competitive than it needs to be, some competition is inherent in the process. Nor are graduate admissions procedures likely to become less competitive. Therefore, the disadvantage that accrues to women from competition is unlikely to change, unless women learn to overcome it. Unlike competition, time pressure is not usually an essential feature of solving scientific problems. The most typical situations involving time pressure are answering tough questions in oral exams or after a talk and writing proposals to deadlines. In the former, we have to "think on our feet," a skill that we can learn in graduate school and beyond. In the latter, we use verbal skills at least as much as mathematical skills.
An essential feature of both the Math General and the Physics Subject GRE is solving numerous problems under time pressure, a skill that is minimally related to career success. The experimental research described here, showing that time pressure is a significant barrier to women’s demonstrating their math skills, suggests yet another reason to “ditch,” or at least down-weight, the GRE. Indeed, research shows that students' GRE scores are poor predictors of any measure of success in graduate school. The same research shows that alternative methods – “noncognitive measures” of personal characteristics and professional skills – are much better predictors. Properly applied, they are free of gender and racial bias. Noncognitive assessment will be reviewed in a future issue of Status.
 R. Simmons and J. Bacal 2012, “How to Repair the Gender Pay Gap? Teach Negotiation Skills in College,” in Huff Post Women, November 8, 2012, Link
 C. Goldin 2014, “A Grand Gender Convergence: Its Last Chapter,” American Economic Review 2014, 104(4): 1–30, Link
 C. Miller 2014, “Using Minimum Acceptable GRE Scores for Graduate Admissions Suppresses Diversity,” presentation in Special Session 337 at the American Astronomical Society's 223rd meeting, Link
 O. Shurchkov 2012, "Under Pressure: Gender Differences in Output Quality and Quantity under Competition and Time Constraints," Journal of the European Economic Association, 10(5):1189–
 See also J. A. Johnson 2014, “Increasing diversity by ditching the GRE,” Link
 W. Sedlacek 2014, “Why Doesn't the GRE or GPA Work in Selecting Graduate Students & What Alternatives Are There?”, Link
 C. Miller and K. Stassun 2014, “A test that fails,” Nature, 510, 303-304, Link