logistic regression warnings about probabilities of 0 or 1 and the Hauck-Donner phenomenon



Dear Lovers of logistic regression:

Several of you in POLS 707 have seen the warning in R for logistic regression. It is a (deceptively subtle) warning about fitted probabilities being 0 or 1. That happens if you have "complete separation" or something close to it.

Complete separation is the problem in which the 0 and 1 dependent variable separates itself so that it appears the inputs perfectly predict outcomes. Suppose the scatterplot looks like

                  1111111111111111
Y

    0000000000000
    <------------- X ------------->

Note, in the middle, the Y's are separated. The predicted probabilities ought to be 0 for the left set of data and 1 for the other. Remember the graph on the board where the S shaped curve turned into this shape:

                |-------------------------
                |
                |
________________|

Here the slope in the middle is infinite, so it cannot be estimated.

This can happen if you "dummy up" your variables so that a small population segment is represented by a category (or combination of categories) such as "Chinese-speaking Caucasian one-legged males who live in Dubuque, Iowa". Observations like that may have homogeneous Y's, all 0's or all 1's in your data. Then logistic regression breaks.

If you have truly complete separation or something close to it, then logit estimation fails. You probably should seriously rethink your data or your model. Other times, you just get a warning, and that's where judgment and prudence come into play.
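To see why the slope "cannot be estimated", here is a minimal sketch in Python. The data and the function name log_likelihood are made up for illustration, and this is not what R does internally. It just evaluates the logit log-likelihood on perfectly separated data for bigger and bigger slopes. The log-likelihood keeps creeping up toward 0 and never peaks, so there is no finite estimate for the fitting routine to settle on.

```python
import math

# Hypothetical, perfectly separated data: every x below 0 has y = 0,
# every x above 0 has y = 1.
x = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
y = [0, 0, 0, 1, 1, 1]


def log_likelihood(slope, intercept=0.0):
    """Bernoulli log-likelihood of a one-predictor logit model."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = intercept + slope * xi
        # log P(y=1) = -log(1 + exp(-eta)); log P(y=0) = -log(1 + exp(eta))
        signed = eta if yi == 1 else -eta
        total += -math.log1p(math.exp(-signed))
    return total


slopes = [0.5, 1.0, 2.0, 5.0, 10.0, 20.0]
values = [log_likelihood(b) for b in slopes]
for b, ll in zip(slopes, values):
    print(f"slope = {b:5.1f}   log-likelihood = {ll:.8f}")
```

Every increase in the slope improves the fit a little, which is why the software keeps iterating until the fitted probabilities are numerically 0 or 1.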

This problem is related to a problem with test statistics in logistic regression known as the Hauck-Donner effect. It gets surprisingly little treatment in the textbooks. I don't think I've ever seen it mentioned in an econometrics-style text. It is in Modern Applied Statistics with S/R by Venables and Ripley, but that treatment is somewhat brief. You will find this discussed in many statistically-oriented email lists, especially r-help, but also others. Here's an email post from Brian Ripley in the late 1990s that is more conversational:

http://www.math.yorku.ca/Who/Faculty/Monette/S-news/0049.html

I'd paraphrase the problem this way. Consider the ratio of the parameter estimate to the standard error,

b/se

In econometrics, we used to say that was a t statistic, but that's only approximate.

The statisticians seem more inclined to say it should be compared against a Normal distribution, and the justification for that goes back to the Wald Chi-Square statistic,

b^2/se^2

and if you take the square root of that, you get back

b/se

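and that is approximately standard Normal in large samples when the Null is true. So many programs (like R) report it as a z value, a number to be compared against the standard Normal distribution. Other programs like SAS report the squared value and call it a Chi-Square.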
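The z and Chi-Square versions are the same test in different clothes. Here is a small Python sketch, with a made-up estimate and standard error, that computes the two-sided Normal p-value for b/se and the 1-degree-of-freedom Chi-Square p-value for b^2/se^2. The function names are mine, not from any package.

```python
import math


def p_from_z(z):
    """Two-sided p-value for z compared against a standard Normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def p_from_chisq_1df(w):
    """Upper-tail p-value for w compared against a Chi-Square with 1 df."""
    return math.erfc(math.sqrt(w / 2.0))


b, se = 0.84, 0.35      # hypothetical estimate and standard error
z = b / se              # what R prints as the z value
w = z ** 2              # the Wald Chi-Square that SAS-style output prints

print(f"z = {z:.3f}, p = {p_from_z(z):.5f}")
print(f"Wald Chi-Square = {w:.3f}, p = {p_from_chisq_1df(w):.5f}")
```

The two p-values agree up to rounding, so whichever one your program prints, the Wald test's weaknesses come along with it.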

Anyway, if b is huge, say nearly infinite, because of complete (or nearly complete) separation, then the standard error will also be massively huge, and the resulting test statistic will be small. Hauck and Donner showed that "Wald's test statistic decreased to zero as the distance between the parameter estimate and null value increases." Ironically, then, as your Null gets more and more wrong, the Wald stat eventually gets smaller and smaller and you are less and less able to reject a wrong Null.

Hence Ripley's point in the email cited above, which says that, paradoxically, if the value of b is either near zero or very far away from zero, the test statistic can be small.
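You can watch this happen in the simplest case I can think of. This is only a sketch, not the example from the Hauck-Donner paper: an intercept-only logit fitted to n = 100 trials, testing the Null that the logit is 0 (probability 0.5). The counts are hypothetical. For this model the estimate and its standard error have closed forms, so no fitting routine is needed.

```python
import math


def wald_z(successes, n):
    """Estimate, standard error and Wald z for H0: logit(p) = 0."""
    failures = n - successes
    b = math.log(successes / failures)                  # MLE of the logit
    se = math.sqrt(1.0 / successes + 1.0 / failures)    # from the information
    return b, se, b / se


n = 100
results = {s: wald_z(s, n) for s in (60, 75, 90, 95, 99)}
for s, (b, se, z) in results.items():
    print(f"successes = {s:3d}   b = {b:6.3f}   se = {se:6.3f}   z = {z:6.3f}")
```

The estimate b keeps growing as the data move away from 50/50, but past a certain point the standard error grows faster, so z turns around and starts shrinking.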

The practical advice, then, is to run the model with all of the variables, then run it again with the questionable one removed, and conduct a likelihood ratio test. I had expected that such a test would reach the same conclusion as the Wald test, and I think most people expect that. But Hauck & Donner show it is wrong to think that. That's why, in R, it is worth running anova() with test="Chisq" on a logit model: it reports a likelihood ratio (deviance) Chi-Square test for each variable as it is added to the model in sequence, rather than a Wald test.
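Here is a rough Python sketch of how the two tests can part ways. It uses an intercept-only logit with hypothetical counts out of n = 100 and the Null that the probability is 0.5, so the "model with the variable removed" is just the model with the logit fixed at 0. The function names are made up for this illustration.

```python
import math


def wald_chisq(successes, n):
    """Wald Chi-Square (b/se)^2 for H0: logit(p) = 0."""
    failures = n - successes
    b = math.log(successes / failures)
    var = 1.0 / successes + 1.0 / failures
    return b * b / var


def lr_chisq(successes, n):
    """Likelihood ratio Chi-Square: 2 * (logLik at MLE - logLik at p = 0.5)."""
    failures = n - successes
    p_hat = successes / n
    loglik_full = successes * math.log(p_hat) + failures * math.log(1.0 - p_hat)
    loglik_null = n * math.log(0.5)
    return 2.0 * (loglik_full - loglik_null)


n = 100
table = {s: (wald_chisq(s, n), lr_chisq(s, n)) for s in (75, 90, 95, 99)}
for s, (wald, lr) in table.items():
    print(f"successes = {s:3d}   Wald = {wald:7.2f}   LR = {lr:7.2f}")
```

The likelihood ratio statistic keeps climbing as the data get further from the Null, while the Wald statistic peaks and then falls, which is the reason to prefer the likelihood ratio test when estimates are extreme.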

While browsing in JSTOR for the Hauck-Donner article (1977) I found a few others you could also get. I think you have to go into JSTOR through the KU Libraries databases list in order to use it for free. I've saved pdf copies of these, in case you have trouble accessing them. To aid my memory, here are some citations:


Wald's Test as Applied to Hypotheses in Logit Analysis
Walter W. Hauck, Jr.; Allan Donner
Journal of the American Statistical Association > Vol. 72, No. 360 (Dec., 1977), pp. 851-853
Stable URL: http://links.jstor.org/sici?sici=0162-1459%28197712%2972%3A360%3C851%3AWTAATH%3E2.0.CO%3B2-K


A Reminder of the Fallibility of the Wald Statistic
Thomas R. Fears; Jacques Benichou; Mitchell H. Gail
The American Statistician > Vol. 50, No. 3 (Aug., 1996), pp. 226-227
Stable URL: http://links.jstor.org/sici?sici=0003-1305%28199608%2950%3A3%3C226%3AAROTFO%3E2.0.CO%3B2-G


Understanding Wald's Test for Exponential Families
Nathan Mantel
The American Statistician > Vol. 41, No. 2 (May, 1987), pp. 147-148
Stable URL: http://links.jstor.org/sici?sici=0003-1305%28198705%2941%3A2%3C147%3AUWTFEF%3E2.0.CO%3B2-D

On the Use of Wald's Test in Exponential Families
Michael Vaeth
International Statistical Review / Revue Internationale de Statistique > Vol. 53, No. 2 (Aug., 1985), pp. 199-214
Stable URL: http://links.jstor.org/sici?sici=0306-7734%28198508%2953%3A2%3C199%3AOTUOWT%3E2.0.CO%3B2-F

Judging Inference Adequacy in Logistic Regression
Dennis E. Jennings
Journal of the American Statistical Association > Vol. 81, No. 394 (Jun., 1986), pp. 471-476
Stable URL: http://links.jstor.org/sici?sici=0162-1459%28198606%2981%3A394%3C471%3AJIAILR%3E2.0.CO%3B2-V

A Note on Confidence Bands for the Logistic Response Curve
Walter W. Hauck
The American Statistician > Vol. 37, No. 2 (May, 1983), pp. 158-160
Stable URL: http://links.jstor.org/sici?sici=0003-1305%28198305%2937%3A2%3C158%3AANOCBF%3E2.0.CO%3B2-8

I found at least one political science article that cites Hauck-Donner:

Issue Voting in Gubernatorial Elections: Abortion and Post-Webster Politics
Elizabeth Adell Cook; Ted G. Jelen; Clyde Wilcox
The Journal of Politics > Vol. 56, No. 1 (Feb., 1994), pp. 187-199
Stable URL: http://links.jstor.org/sici?sici=0022-3816%28199402%2956%3A1%3C187%3AIVIGEA%3E2.0.CO%3B2-Z

Other cool things I found while snooping in JSTOR:

The Danger of Extrapolating Asymptotic Local Power
Forrest D. Nelson; N. E. Savin
Econometrica > Vol. 58, No. 4 (Jul., 1990), pp. 977-981
Stable URL: http://links.jstor.org/sici?sici=0012-9682%28199007%2958%3A4%3C977%3ATDOEAL%3E2.0.CO%3B2-4


Best Subsets Logistic Regression
David W. Hosmer; Borko Jovanovic; Stanley Lemeshow
Biometrics > Vol. 45, No. 4 (Dec., 1989), pp. 1265-1270
Stable URL: http://links.jstor.org/sici?sici=0006-341X%28198912%2945%3A4%3C1265%3ABSLR%3E2.0.CO%3B2-7

--
Paul E. Johnson                       email: pauljohn_AT_ku.edu
Dept. of Political Science            http://lark.cc.ku.edu/~pauljohn
1541 Lilac Lane, Rm 504
University of Kansas                  Office: (785) 864-9086
Lawrence, Kansas 66044-3177           FAX: (785) 864-5700