logistic regression warnings about probabilities of 0 or 1 and the Hauck-Donner phenomenon
Dear Lovers of logistic regression:
Several of you in POLS 707 have seen the warning in R for logistic
regression. It is a (deceptively subtle) warning about fitted
probabilities being 0 or 1. That happens if you have "complete
separation" or something close to it.
Complete separation is the problem in which the 0 and 1 dependent
variable separates itself so that it appears the inputs perfectly
predict outcomes. Suppose the scatterplot looks like
                   1111111111111111
Y
   0000000000000
<>
Note, in the middle, the Y's are separated. The predicted probabilities
ought to be 0 for the left set of data and 1 for the other. Remember
the graph on the board where the S shaped curve turned into this shape:
                |-------------------------
                |
                |
________________|
Here the slope in the middle is infinite, so it cannot be estimated.
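To see why the slope cannot be estimated, here is a minimal sketch in Python. The data and the function names are made up for illustration, and I fix the intercept at zero just to keep it short. With perfectly separated data, the log-likelihood keeps creeping up toward zero as the slope grows, so there is no finite maximum for the fitting routine to find.

```python
import math

# Hypothetical, perfectly separated data: every x < 0 has y = 0,
# every x > 0 has y = 1.
x = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
y = [0, 0, 0, 1, 1, 1]

def log_sigmoid(t):
    """log(1 / (1 + exp(-t))), written to avoid overflow."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def loglik(slope):
    """Logit log-likelihood with the intercept fixed at 0 (a simplification)."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = slope * xi
        # log P(y = 1) = log_sigmoid(eta); log P(y = 0) = log_sigmoid(-eta)
        total += log_sigmoid(eta) if yi == 1 else log_sigmoid(-eta)
    return total

slopes = [0.5, 1, 2, 5, 10, 20]
logliks = [loglik(b) for b in slopes]
for b, ll in zip(slopes, logliks):
    print(f"slope = {b:5.1f}   log-likelihood = {ll:.6f}")
```

Every bigger slope fits a little better than the last one, which is the numerical version of the step-shaped curve above.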
This can happen if you "dummy up" your variables so that a small
population segment is represented by a category (or combination of
categories) such as "Chinese-speaking Caucasian one-legged males who
live in Dubuque, Iowa". Observations like that may have homogeneous
Y's, all 0's or all 1's in your data. Then logistic regression breaks.
If you have truly complete separation or something close to it, then
logit estimation fails. You probably should seriously rethink your data
or your model. Other times, you just get a warning, and that's where
judgment and prudence come into play.
This problem is related to a problem with test statistics in logistic
regression known as the Hauck-Donner effect. It gets surprisingly
little treatment in the textbooks. I don't think I've ever seen it
mentioned in an econometrics-style text. It is in Modern Applied
Statistics with S/R by Venables and Ripley, but that treatment is
somewhat brief. You will find this discussed in many
statistically-oriented email lists, especially r-help, but also others.
Here's an email post from Brian Ripley in the late 1990s that is more
conversational:
http://www.math.yorku.ca/Who/Faculty/Monette/S-news/0049.html
I'd paraphrase the problem this way. Consider the ratio of the
parameter estimate to the standard error,
b/se
In econometrics, we used to say that was a t statistic, but that's only
approximate.
The statisticians seem more inclined to say it should be compared
against a Normal distribution, and the justification for that goes back
to the Wald Chi-Square statistic,
b^2/se^2
and if you take the square root of that, you get back
b/se
and that is Normal. So many programs (like R) report that as a number
with a Z distribution. Other programs like SAS report the squared value
and call it a Chi-Square.
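If you want to convince yourself that the Z version and the Chi-Square version are the same test, here is a small Python sketch. The values of b and se are invented for illustration. It uses only the standard library's math.erfc to get the two tail probabilities.

```python
import math

# Hypothetical estimate and standard error, for illustration only.
b = 1.2
se = 0.5

z = b / se        # the ratio reported as a Z statistic
chisq = z ** 2    # the Wald Chi-Square on 1 degree of freedom

# Two-sided p-value from the standard Normal: P(|Z| > |z|).
p_normal = math.erfc(abs(z) / math.sqrt(2))

# Upper-tail p-value from a Chi-Square with 1 df: P(X > chisq).
# For 1 df this equals erfc(sqrt(chisq / 2)).
p_chisq = math.erfc(math.sqrt(chisq / 2))

print(f"z = {z:.3f}, chi-square = {chisq:.3f}")
print(f"p (Normal) = {p_normal:.5f}, p (Chi-Square) = {p_chisq:.5f}")
```

The two p-values agree, so it is only a matter of which scale the program chooses to print.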
Anyway, if b is huge, say nearly infinite, because of complete (or
nearly complete) separation, then the standard error will also be
massively huge, and the resulting test statistic will be small. Hauck
and Donner showed that "Wald's test statistic decreased to zero as the
distance between the parameter estimate and null value increases."
Ironically, then, beyond some point, as your Null gets more and more
wrong, the Wald stat gets smaller and smaller and you are less and less
able to reject a wrong Null.
Hence Ripley's point in the email cited above, which says that,
paradoxically, if the value of b is either near zero or very far away
from zero, the test statistic can be small.
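Here is a small Python sketch of that paradox, under assumptions I am making up for illustration: one 0/1 predictor, 100 cases in each group, and 50 successes in the baseline group. For that simple model the logit slope is the log odds ratio, and its standard error is the square root of the sum of the reciprocal cell counts, so no fitting routine is needed. The function name wald_z is mine.

```python
import math

N = 100          # hypothetical cases per group
BASELINE = 50    # hypothetical successes in the x = 0 group

def wald_z(k, n=N, k0=BASELINE):
    """Slope, standard error and Wald z for a logit model with one
    binary predictor, given k successes out of n in the x = 1 group.
    Requires 0 < k < n; at k = n the estimate is infinite."""
    slope = math.log(k / (n - k)) - math.log(k0 / (n - k0))
    se = math.sqrt(1 / k + 1 / (n - k) + 1 / k0 + 1 / (n - k0))
    return slope, se, slope / se

successes = [60, 75, 90, 95, 99]
results = [wald_z(k) for k in successes]
slope_values = [r[0] for r in results]
z_values = [r[2] for r in results]

for k, (slope, se, z) in zip(successes, results):
    print(f"k = {k:3d}   b = {slope:.3f}   se = {se:.3f}   z = {z:.3f}")
```

The slope keeps growing as the x = 1 group gets closer to all 1's, but the standard error grows faster near the end, so z rises and then turns back down.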
The practical advice, then, is to run the model with all of the
variables, and then run again with the questionable one removed, and
conduct a likelihood ratio test. I had expected that such a test would
reach the same conclusion as the Wald test. I think most people expect
it will lead to the same conclusion. But Hauck & Donner show it is wrong
to think that. So that's why, in R, when you conduct an anova() on a
logit model with test="Chisq", it reports a Chi-Square test on the
change in deviance as each variable is added to the model, rather than
a Wald test.
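To make the comparison concrete, here is a Python sketch of the likelihood ratio test next to the Wald Chi-Square. It assumes a made-up setup: one 0/1 predictor, 100 cases per group, 50 successes in the baseline group. The "full" model gives each group its own probability and the "reduced" model drops the predictor and pools the groups. The function names are mine, and this is a hand calculation, not what any particular program does internally.

```python
import math

N = 100          # hypothetical cases per group
BASELINE = 50    # hypothetical successes in the x = 0 group

def binom_loglik(k, n, p):
    """Bernoulli log-likelihood for k successes in n trials (0 < p < 1)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def lr_and_wald(k, n=N, k0=BASELINE):
    """Likelihood ratio and Wald Chi-Squares (1 df each) for the slope.
    Requires 0 < k < n."""
    # Full model: a separate fitted probability in each group.
    ll_full = binom_loglik(k0, n, k0 / n) + binom_loglik(k, n, k / n)
    # Reduced model: predictor removed, one pooled probability.
    pooled = (k0 + k) / (2 * n)
    ll_reduced = binom_loglik(k0 + k, 2 * n, pooled)
    lr = 2 * (ll_full - ll_reduced)
    # Wald: squared ratio of the log odds ratio to its standard error.
    slope = math.log(k / (n - k)) - math.log(k0 / (n - k0))
    se = math.sqrt(1 / k + 1 / (n - k) + 1 / k0 + 1 / (n - k0))
    return lr, (slope / se) ** 2

successes = [75, 90, 95, 99]
lr_values = [lr_and_wald(k)[0] for k in successes]
wald_values = [lr_and_wald(k)[1] for k in successes]

for k, lr, w in zip(successes, lr_values, wald_values):
    print(f"k = {k:3d}   LR chi-square = {lr:7.2f}   Wald chi-square = {w:7.2f}")
```

As the groups pull apart, the likelihood ratio statistic keeps climbing while the Wald statistic eventually falls, which is the reason to prefer the likelihood ratio test when the estimate is large.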
While browsing in JSTOR for the Hauck-Donner article (1977) I found a
few others you could also get. I think you have to go into JSTOR through
the KU Libraries databases list in order to use it for free. I've saved
pdf copies of these, in case you have trouble accessing them. To aid my
memory, here are some citations:
Wald's Test as Applied to Hypotheses in Logit Analysis
Walter W. Hauck, Jr.; Allan Donner
Journal of the American Statistical Association > Vol. 72, No. 360
(Dec., 1977), pp. 851-853
Stable URL:
http://links.jstor.org/sici?sici=0162-1459%28197712%2972%3A360%3C851%3AWTAATH%3E2.0.CO%3B2-K
A Reminder of the Fallibility of the Wald Statistic
Thomas R. Fears; Jacques Benichou; Mitchell H. Gail
The American Statistician > Vol. 50, No. 3 (Aug., 1996), pp. 226-227
Stable URL:
http://links.jstor.org/sici?sici=0003-1305%28199608%2950%3A3%3C226%3AAROTFO%3E2.0.CO%3B2-G
Understanding Wald's Test for Exponential Families
Nathan Mantel
The American Statistician > Vol. 41, No. 2 (May, 1987), pp. 147-148
Stable URL:
http://links.jstor.org/sici?sici=0003-1305%28198705%2941%3A2%3C147%3AUWTFEF%3E2.0.CO%3B2-D
On the Use of Wald's Test in Exponential Families
Michael Vaeth
International Statistical Review / Revue Internationale de Statistique >
Vol. 53, No. 2 (Aug., 1985), pp. 199-214
Stable URL:
http://links.jstor.org/sici?sici=0306-7734%28198508%2953%3A2%3C199%3AOTUOWT%3E2.0.CO%3B2-F
Judging Inference Adequacy in Logistic Regression
Dennis E. Jennings
Journal of the American Statistical Association > Vol. 81, No. 394
(Jun., 1986), pp. 471-476
Stable URL:
http://links.jstor.org/sici?sici=0162-1459%28198606%2981%3A394%3C471%3AJIAILR%3E2.0.CO%3B2-V
A Note on Confidence Bands for the Logistic Response Curve
Walter W. Hauck
The American Statistician > Vol. 37, No. 2 (May, 1983), pp. 158-160
Stable URL:
http://links.jstor.org/sici?sici=0003-1305%28198305%2937%3A2%3C158%3AANOCBF%3E2.0.CO%3B2-8
I found at least one political science article that cites Hauck-Donner:
Issue Voting in Gubernatorial Elections: Abortion and Post-Webster Politics
Elizabeth Adell Cook; Ted G. Jelen; Clyde Wilcox
The Journal of Politics > Vol. 56, No. 1 (Feb., 1994), pp. 187-199
Stable URL:
http://links.jstor.org/sici?sici=0022-3816%28199402%2956%3A1%3C187%3AIVIGEA%3E2.0.CO%3B2-Z
Other cool things I found while snooping in JSTOR:
The Danger of Extrapolating Asymptotic Local Power
Forrest D. Nelson; N. E. Savin
Econometrica > Vol. 58, No. 4 (Jul., 1990), pp. 977-981
Stable URL:
http://links.jstor.org/sici?sici=0012-9682%28199007%2958%3A4%3C977%3ATDOEAL%3E2.0.CO%3B2-4
Best Subsets Logistic Regression
David W. Hosmer; Borko Jovanovic; Stanley Lemeshow
Biometrics > Vol. 45, No. 4 (Dec., 1989), pp. 1265-1270
Stable URL:
http://links.jstor.org/sici?sici=0006-341X%28198912%2945%3A4%3C1265%3ABSLR%3E2.0.CO%3B2-7
--
Paul E. Johnson email: pauljohn_AT_ku.edu
Dept. of Political Science http://lark.cc.ku.edu/~pauljohn
1541 Lilac Lane, Rm 504
University of Kansas Office: (785) 864-9086
Lawrence, Kansas 66044-3177 FAX: (785) 864-5700