Thursday, January 19, 2017

Don’t test me: Using Fisher’s exact test to unearth stories about statistical relationships (Repost)

This week I had an article published in Data Driven Journalism on the use of Fisher's exact test with contingency tables.  It is reprinted here.

A common problem faced by data journalists is how to determine if there is a statistical relationship between two categorical variables such as gender, race, or the share of the vote for two candidates in an election.  The simplest way to visualize the relationship is to represent the counts for each combination of the two variables in a contingency table, with the rows representing the levels of one variable and the columns representing the levels of the other variable.  The most commonly used statistical test for an association between the row and column variables is the chi-square (χ²) test.  The example in the table below illustrates this test using 2016 primary data for the American presidential election.
The columns in the above table show the primary states won by Hillary Clinton and by Bernie Sanders on the Democratic side, and the rows show where Donald Trump placed in the same primary states on the Republican side.  The total number of states in the table is 51 because the District of Columbia is included.  For example, the column percentages show that Trump won 86% of the primary states that Clinton won, while he won 55% of the states that Sanders won.  Using the chi-square test, journalists can explore relationships between the states won by Trump, Clinton, and Sanders.

The chi-square test is based on calculating expected values for each cell in the table.  In the above example, we calculate the expected value (the value that would be expected if there were no relationship among the variables) for the cell for states where Trump finished third on the Republican side and Bernie Sanders won on the Democratic side by multiplying the row total for where Trump placed third (3) by the column total for states where Sanders won (22).  This product is then divided by the total number of observations (51).  The formula for the expected value is given by:

E = (row total × column total) / N = (3 × 22) / 51 ≈ 1.29

That means that for this cell a value of 1.29 would be expected if the primary states where Trump finished third and Sanders won were completely independent of each other.  The observed value for this cell is 2, suggesting a higher count than would be expected.  An expected value is computed for each cell in the table; the difference between the observed and expected values for each cell is then squared, divided by the expected value, and summed across the cells in the table according to the formula:

χ² = Σ (O − E)² / E

where O is the observed count and E is the expected count in each cell.  If the chi-square value exceeds the critical value for the table's degrees of freedom (the number of rows minus one multiplied by the number of columns minus one) at the chosen significance level, it is concluded that there is an association between the variables.
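The calculation described above can be sketched in a few lines of Python.  The original table is not reproduced here, so the cell counts below are an assumption: they are reconstructed from the totals and percentages quoted in the text (51 states, 22 won by Sanders, 3 third-place finishes for Trump, and the 86% and 55% column percentages).  The closed-form p-value used here only applies to a table with exactly 2 degrees of freedom.

```python
import math

# Counts reconstructed from the totals and percentages quoted in the text
# (an assumption; the original table is not reproduced here).
# Rows: Trump finished 1st, 2nd, 3rd. Columns: Clinton won, Sanders won.
observed = [
    [25, 12],
    [3, 8],
    [1, 2],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Sum of (observed - expected)^2 / expected over every cell
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(row_totals) - 1) * (len(col_totals) - 1)

# For exactly 2 degrees of freedom the chi-square upper-tail probability
# has the closed form exp(-x / 2); other df need a statistics library.
assert df == 2
p_value = math.exp(-chi_square / 2)

print(f"expected (Trump 3rd, Sanders won): {expected[2][1]:.2f}")
print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.5f}")
```

Keeping the arithmetic explicit makes it easy to see which cells contribute most to the statistic; in practice a statistics package would do the same work in one call.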

But there is a problem with the chi-square test.  The test is only an approximation of the distribution of counts in contingency tables.  If more than 20% of the cells in the table have an expected value of less than five (as is the case in the table above), the chi-square approximation does not work to test the hypothesis of an association between the row variable and the column variable.  The major statistical packages will alert the user if this assumption is violated.  Violating the assumption causes the reported p-value to be incorrect and can lead to incorrect conclusions about the presence or absence of an association.  Both variables in the table are categorical, which means that they can only take a limited set of values, as with gender, political affiliation, or placement in an election.  A continuous variable, by contrast, is one that could take any value on the number line, such as temperature, height, or weight.
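The 20% rule of thumb is simple to check by hand.  The sketch below does so for the primary table, again using cell counts reconstructed from the figures quoted in the text (an assumption, since the original table is not reproduced here).

```python
# Counts reconstructed from the figures quoted in the text (an assumption).
# Rows: Trump finished 1st, 2nd, 3rd. Columns: Clinton won, Sanders won.
observed = [
    [25, 12],
    [3, 8],
    [1, 2],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Flat list of expected counts under independence
expected_cells = [r * c / n for r in row_totals for c in col_totals]

small_cells = [e for e in expected_cells if e < 5]
share_small = len(small_cells) / len(expected_cells)

# Rule of thumb from the text: trouble if more than 20% of cells fall below 5
rule_violated = share_small > 0.20

print(f"{len(small_cells)} of {len(expected_cells)} cells have expected counts below 5")
print(f"share = {share_small:.0%}, chi-square approximation questionable: {rule_violated}")
```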

To overcome these limitations, there is an exact alternative to the chi-square test called Fisher’s exact test.  Rather than the chi-square distribution, which is an approximation of the distribution of observed and expected values, Fisher’s exact test is based on the hypergeometric probability distribution, which is the exact distribution of counts in a contingency table.  The probability of a given table is:

p = (Πi Ri! × Πj Cj!) / (N! × Πij aij!)

Here the Ri! are the factorials of the row totals (for example, 5! = 5 × 4 × 3 × 2 × 1), the Cj! are the factorials of the individual column totals, N! is the factorial of the table total, and the aij! are the factorials of the individual cell values.  Πij denotes the product over all of the individual cells.  Such a formula is even more computationally intensive than the chi-square test, especially for tables with many rows and columns.  This is why the chi-square test was favored in the past: Fisher’s exact test took too much memory for computers to run.  These days that is less of an issue, and the test is easy to run in the major statistical packages, such as R, SAS, SPSS, and Stata.
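To make the hypergeometric calculation concrete, here is a minimal sketch that enumerates every table sharing the observed row and column totals.  It assumes the cell counts reconstructed from the figures quoted in the text (the original table is not reproduced here) and uses a common two-sided convention: add up the probabilities of all tables that are no more likely than the observed one.  The helper names are illustrative only.

```python
from math import comb

# Counts reconstructed from the figures quoted in the text (an assumption).
# Rows: Trump finished 1st, 2nd, 3rd. Columns: Clinton won, Sanders won.
observed = [[25, 12], [3, 8], [1, 2]]

row_totals = [sum(row) for row in observed]          # 37, 11, 3
first_col_total = sum(row[0] for row in observed)    # 29
n = sum(row_totals)                                  # 51


def weight(first_col):
    """Product of C(R_i, a_i1); dividing by C(N, C_1) gives the table's probability."""
    result = 1
    for total, count in zip(row_totals, first_col):
        result *= comb(total, count)
    return result


def first_columns(totals, remaining):
    """Yield every first column that respects the row totals and sums to `remaining`."""
    if not totals:
        if remaining == 0:
            yield ()
        return
    for count in range(min(totals[0], remaining) + 1):
        for rest in first_columns(totals[1:], remaining - count):
            yield (count,) + rest


all_weights = [weight(col) for col in first_columns(row_totals, first_col_total)]
observed_weight = weight([row[0] for row in observed])
total_weight = comb(n, first_col_total)

# Two-sided p-value: total probability of every table that is no more
# likely than the observed one (weights are exact integers, so no rounding).
p_value = sum(w for w in all_weights if w <= observed_weight) / total_weight

print(f"tables with these margins: {len(all_weights)}")
print(f"Fisher exact p-value: {p_value:.5f}")
```

For a table with two columns, the factorial formula reduces to a product of binomial coefficients, which is what `weight` computes.  With only 48 possible tables this brute-force approach is instant; larger tables are where the computational cost mentioned above starts to matter.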

In R, Fisher’s exact test and the chi-square test can be run on the US primary election table above with the fisher.test and chisq.test functions.
The output for Fisher’s exact test shows that there is a probability of 0.03653 of observing table frequencies at least as extreme as these when there is no association between the rows and columns.  The chi-square test output shows a p-value of 0.04217 for the same table.  If we were using a p-value of 0.05 as the criterion for significance, we would find a relationship with both tests in this case, though the p-values differ.  In a case where the sample size is even smaller than in the example above, the difference in p-values could be greater and could lead us to the wrong conclusion regarding whether there is a relationship between the variables.  States that Hillary Clinton won in the primary season were more likely to be won by Donald Trump, while states that Bernie Sanders won were more likely to have Trump finish second or third.

As a warning, the p-value should not be used as an indicator of the strength of the association between categorical variables.  Either the test is significant or not, which means that either the relationship is present or not.  The p-value is sensitive to sample size.  Often the odds ratio can be used to estimate the effect size but R only computes it in the fisher.test function for tables with 2 columns and 2 rows.
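As a rough illustration of an effect size, the table can be collapsed to two rows (Trump won versus Trump did not win) so that an odds ratio is defined.  Both the collapsing and the cell counts are assumptions made for this sketch: the counts are reconstructed from the figures quoted in the text.  The value computed is the simple cross-product odds ratio, which is not the same as the conditional estimate that R's fisher.test reports.

```python
# Counts reconstructed from the figures quoted in the text (an assumption).
# Rows: Trump finished 1st, 2nd, 3rd. Columns: Clinton won, Sanders won.
observed = [[25, 12], [3, 8], [1, 2]]

# Collapse 2nd and 3rd place into a single "Trump did not win" row
trump_won = observed[0]
trump_did_not_win = [sum(col) for col in zip(*observed[1:])]

# Odds that Trump won, within Clinton states and within Sanders states
odds_clinton_states = trump_won[0] / trump_did_not_win[0]
odds_sanders_states = trump_won[1] / trump_did_not_win[1]

odds_ratio = odds_clinton_states / odds_sanders_states

print(f"collapsed table: {trump_won} / {trump_did_not_win}")
print(f"sample odds ratio: {odds_ratio:.2f}")
```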

Fisher’s exact test provides a criterion for deciding whether the differences in observed percentages between two categorical variables in a sample are significant or just due to random noise in the data.  In the above example, the 86% of Clinton’s primary states that Trump also won is significantly different from the 55% of Sanders’s primary states that Trump won.  Journalists should always be careful about making these judgments by just looking at observed percentages or counts because of the subjectivity of such decisions.  Subjective decisions can be further clouded by one’s preconceived notions about the issues related to the data.

**Related Posts**

Patriotic Projections and Calculations


Statistics and Old Beliefs




Friday, January 13, 2017

6 Friends Competing for #100daysofus Sprout Grants

Occasionally a week comes along when I have a hard time getting inspired to write on a topic for my blog.  This week I came across posts from three of my friends on Facebook, Twitter, and LinkedIn who are competing for Sprout Fund grants under the 100 Days of Us competition for various worthy projects.  I shared their posts on my Facebook page for this blog.  Looking at the website for this competition, I found entries for three other people that I know.  I thought I would put their video submissions to the competition in one post to let them advocate for themselves.  You can help determine who wins by going to the page for each submission on the 100 Days of Us website, clicking like on the submission page (you can click like on more than one page), and making a donation to the Sprout Fund.  The six video submissions from my friends are presented below.  They are all worthy causes.  I don't endorse one over the others.  I clicked like on all six of their pages.  The deadline is January 19 (the day before Trump takes office).

My friend Jennifer Sweda Jordan has worked as a reporter for the Associated Press, NPR, and the Allegheny Front (a radio program on the environment in Pittsburgh).  I met her through Pax Christi.  She is now seeking a grant for a media venture for artists with intellectual disabilities.


Mila Sanina has traveled a long way from Kazakhstan to Pittsburgh.  She came here for studies at Pitt, interned at the PBS News Hour with hosts Gwen Ifill and Judy Woodruff and at CNN International, and was deputy managing editor at the Pittsburgh Post-Gazette.  She is now executive editor at Public Source and is seeking a grant to study the effect of the Trump administration on the Pittsburgh area.

I have worked with Moriah Ella Mason's father at Health Care for All PA.  She has been active in peace activism in the Middle East and is now seeking a grant to prevent discrimination against Muslims in Pittsburgh.

Nadya Kessler immigrated to Pittsburgh from Russia.  She works for Global Pittsburgh and is now seeking a grant for a project to profile successful immigrant business men and women in Pittsburgh.

Ron Gaydos has been active in the Coffee Party and many other worthy causes in Pittsburgh for years.  He is now seeking a grant for a project to promote economic development in the area.

Dave and Erin Ninehouser have been union and healthcare activists for years. They are now pursuing the Hear Yourself Think Project to counteract right wing misinformation which is poisoning our political discourse.  They are seeking a grant to assist them in this endeavor.

There are more than 140 submissions to this competition with many worthy causes covering 19 issue areas.  Many submissions cover more than one issue.  You can have a say in who receives grants.  The winners will be announced on Jan. 20 with the hope of having the best impact over the first 100 days of Trump's presidency.


The Twitter page for 100 Days of Us says that they're giving out grants of $5,000 each.  Their website says that they have a pot of $100,000 to give out, with 150 applicants for a grant.  The current size of the pot means that they can give out grants to 20 applicants, which is 13% of all applicants.  Assuming that each applicant has an equal chance of winning a grant, the six applicants featured in this post would have an expected number of grants of 0.8, which is slightly less than one.  This expected value is found by multiplying 6 by the probability of one applicant winning (13%).  With the current size of the pot it is not a certainty that any of those featured will win a grant.

Donating to the fund increases the pot thus increasing the number of grants that the Sprout Fund can give out.  In order for the fund to give grants to all 150 applicants, the fund would need a pot of $750,000.  If the fund were to distribute funds to each applicant equally with the pot of $100,000 the grant would be $666.67.  Every increase in the pot of $5,000 through donations means that one more grant can be awarded.
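The arithmetic in the last two paragraphs can be reproduced with a short sketch.  The final step, the chance that at least one of the six featured applicants wins, is an addition that assumes the 20 winners are drawn completely at random from the 150 applicants, which is not how the competition actually decides.

```python
from math import comb

pot = 100_000        # dollars available, per the competition website
grant_size = 5_000   # dollars per grant
applicants = 150
featured = 6         # applicants featured in this post

grants_available = pot // grant_size
win_probability = grants_available / applicants
expected_featured_winners = featured * win_probability

# Pot needed to fund every applicant, and the size of an equal split
pot_for_all = applicants * grant_size
equal_share = pot / applicants

# Assumption: winners are a purely random draw of 20 out of 150.
# P(no featured applicant wins) = C(144, 20) / C(150, 20)
p_none = comb(applicants - featured, grants_available) / comb(applicants, grants_available)
p_at_least_one = 1 - p_none

print(f"grants available: {grants_available} ({win_probability:.0%} of applicants)")
print(f"expected featured winners: {expected_featured_winners:.1f}")
print(f"pot needed for all: ${pot_for_all:,}; equal split: ${equal_share:,.2f}")
print(f"chance at least one featured applicant wins: {p_at_least_one:.0%}")
```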

**Related Posts**

The Ethics of Social Media Manipulation


It's All About The Likes

On Facebook, Fake News, and the Election


The Need for CSI Without Dead Bodies (& Similar Websites)

Saturday, December 31, 2016

Clairton HS vs. Bishop Guilfoyle HS: A Contrast in Poverty

Referees break up a fight in the Class A Title Game in 2014

The Pennsylvania Class A high school football championship this year was a matchup of two schools with proud football traditions: Clairton High School near Pittsburgh and Bishop Guilfoyle High School in Altoona, PA.  The two teams met in the title game in 2014 and again this year.  Guilfoyle won the game 17-0, giving them their 47th straight win and third straight state title.

On a Winning Streak: Clairton High School Football Team from Eileen Blass on Vimeo

Clairton won four straight Class A titles from 2009 to 2012, winning 60 games in a row (a state record).  This, however, is where the similarities end for the two teams.  The video above shows how the city of Clairton has changed since 1970, with the population declining from 15,051 to 6,681, a 56% decrease according to the Census Bureau (the person in the video who said it had decreased from 40,000 to 8,000 may have been including surrounding communities).  The Census Bureau currently estimates that the city has a median income of $30,207 and that 28.6% of its residents live in poverty.

Altoona, PA (where Bishop Guilfoyle is located) had a population decline over the same period from 63,115 in the 1970 census to an estimated 45,344, a 28% decrease.  According to the Census Bureau, it currently has a median income of $36,215 and 22.1% of its citizens live below the poverty level.
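The two percentage declines follow directly from the census figures quoted above.  A small sketch of the calculation (the function name is just for illustration):

```python
def percent_decline(earlier, later):
    """Percentage drop from an earlier count to a later one."""
    return (earlier - later) / earlier * 100


# Population figures as quoted in the text (1970 census vs. current estimate)
clairton_decline = percent_decline(15_051, 6_681)
altoona_decline = percent_decline(63_115, 45_344)

print(f"Clairton: {clairton_decline:.1f}% decline")
print(f"Altoona: {altoona_decline:.1f}% decline")
```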

Bishop Guilfoyle is a Catholic high school with tuition of $6,500 and has the advantage of being able to recruit players and give them financial assistance.  They can even recruit international students.  Their coach played on Penn State's 1995 Rose Bowl Championship team under coach Joe Paterno.  Clairton, as a public high school, can only use players who live in its school district (though some may move there just to play for the team).

The Census Bureau's annual Small Area Income and Poverty Estimates (SAIPE) for the school district has Clairton with a 45.6% poverty rate for its student population (age 5-17), which is the second highest in the state (children from wealthier families in this community may go to private or Catholic schools).  The Altoona School District (which does not include Guilfoyle) by contrast has an estimated poverty rate of 25.2%.  There is a higher degree of uncertainty in the estimates for smaller school districts.

The PIAA (Pennsylvania Interscholastic Athletic Association) has six classifications for its schools based on school size, with a separate championship for each class.  Of the 12 teams that made it to the championship games, five were Catholic and three of those won.  The Class 6A game was a matchup of two Catholic schools, with Pittsburgh Central Catholic (Dan Marino's alma mater) playing St. Joseph's Prep in Philadelphia.  St. Joe's won 49-7.

Sports can be a source of pride for a community, especially for one that has fallen on hard times.  In prehistoric times the best hunter in a tribe was a leader for the community, and sports fill the void that was created now that no one needs to hunt to survive.  Noam Chomsky argues that sports can be a distraction from people's everyday problems, but sports figures such as Colin Kaepernick can use their prestigious position to advocate for those less fortunate than themselves.

  **Related Posts**

Comparing McCort's Class of '16 to the Class of '88 in their College Plans

A Higher % of the McCort Class of '16 in the NHS than in the 1980's

Super Bowl XLV: A Battle of Champions Who Couldn't Compete Now Without a Salary Cap

Tuesday, December 20, 2016

US Life Expectancy Decreases and Those Who Want the Affordable Care Act Expanded Increases

This month it was reported that life expectancy in the US decreased in 2015 for the first time since 1993.  The decrease was from 78.9 years in 2014 to 78.8 in 2015.  The reasons for this decrease are unclear, though the overall death rate increased by 1.2% last year.  The top causes of death had increased rates except for cancer.  Alzheimer's disease showed the largest increase in mortality.  The study's authors caution that this decrease in life expectancy of 0.1 years (which corresponds to a decrease of about 37 days) may be a statistical aberration.

| Views of the ACA (CBS News Poll) | Dec 2016 | Feb 2015 |
|---|---|---|
| Working well, keep as is | 10% | 6% |
| Good things, but changes needed | 63% | 60% |
| Needs to be repealed entirely | 25% | 32% |

One statistic that has changed little in the last 6 years is the level of support for the Affordable Care Act (ACA or Obamacare).  The above table shows how views of the ACA have changed since February 2015 according to a CBS poll conducted this month.  Overall the poll showed that 45% of the public approve of the ACA while 50% do not.  The same respondents were asked what changes to the law were needed.  Only 10% wanted it kept as is (up from 6% in Feb 2015).  The number who wanted changes made to the law increased to 63% from 60%, while the number who want it repealed entirely decreased from 32% in 2015 to 25% this month.  The poll did break the numbers down by political party and found that 47% of Republicans wanted changes compared to 78% of Democrats and 61% of independents.  The poll did not specify which changes were needed to the law.

Table: Opinion of ACA (Pew Research).  Survey dates: Nov 30-Dec 5, 2016; Oct 20-25, 2016; Mar 7-11, 2012; Sep 22-Oct 4, 2011; Jan 5-9, 2011; Nov 4-7, 2010.  Response categories: Expand it; Leave it as is; Repeal it; Don’t Know/Refused; Expand + As is.

Another poll published this month by Pew Research showed a 48% to 47% approval/disapproval split (essentially a tie) for the ACA.  The poll did ask respondents (n=752) what changes they wanted to see to the ACA.  39% said expand it, which is virtually unchanged from October but up from 2012 and 2011 levels by 5 to 9 percentage points.  The number who wanted it left as is increased by 2 points from October but decreased from 2011 by 5 to 7 points.  The number who want it repealed decreased by 5 points from October but stayed within the range of values from 2010 to 2012.  If one adds the percentage who want it expanded to those who want it left as is, we find a consistent majority across the six-year period, ranging from 54% this month to 52% in Nov 2010.

It is a matter of interpretation exactly what "expand it" means, but other polls that asked whether the ACA is "not liberal enough" alongside approval of the law as is have found similar majorities when the two categories are added.  The conclusion is that a majority of the US public wants universal health care, with a growing percentage wanting a better bill.  Such actions are unlikely in the short term in the Federal Government with Trump in the White House (though he once supported single payer) and the GOP controlling Congress.

  **Related Posts**

POLL: Dislike of healthcare law crosses party lines, 1 in 4 Dems want repeal - (But Doesn't Ask Why)

The US and Republicans Want Health Care Law Repealed....? 

Health Care and the 2014 Election in Pennsylvania


Santorum: No One Has Ever Died Because They Didn’t Have Health Care |The New Civil Rights Movement