getting plots for subsets; funny business about as.numeric(myfactor)
Hey:
The "expand.grid" code applied to Alana's data generated more than
100,000 lines of output, because there were originally 10 values for
each of 6 variables. So I'd urge you to let me re-think and simplify
that scheme before you waste too much time.
Anyway, for some good news:
In the lab today, Alana and I were wanting to show a plot for a subset
of cases, and it was not working. I never did find out what was wrong
because we went to the faculty meeting after that, and when we returned
to the lab, we started a new R session and I (subconsciously, probably)
did not keep pushing on the plotting of subsets problem.
Just now I started thinking again and I think I have figured out what we did
wrong. In the following, the only command you have not seen from me
before is "gl". "gl" stands for "generate factor levels". (The name
"gl" is rather reminiscent of those old stories about Unix. The list
files command is "ls" and they say it is called that because the
programmer's girlfriend's name was Lisa Strathman. "mv" is for move,
but it really represents "Martha's Vineyard." There are all kinds of
nutty stories like that, most of which are made up late in the night by
people who are sending emails to their students :) "gl" is in that
mold). Anyway, gl is a way to create factor variables "automagically".
This one creates a variable with 5 categories, and there are 20
observations in each one. Perhaps you did not read the
manual on subset yet, but I think we did use it.
x <- rnorm(100)
y <- rnorm(100)
#it would work to simply do:
myfac <-gl(5,20)
# Could have specified labels here if we wanted to.
mydf <- data.frame(x,y, myfac)
# Neatness counts. Now that x,y, and myfac are in the data frame,
# remove them from the workspace
rm(x,y, myfac)
# observe: this does create a subset: subset(mydf, myfac == 3)
# observe this does too: subset(mydf, myfac %in% 3)
# and this one matches two groups: subset(mydf, myfac %in% c(3,4))
# I THINK the %in% notation is preferred, partly because it
# can have multiple matches
# This makes a plot with group 3
with( subset(mydf,myfac==3), plot(x,y))
# So does this
with (subset(mydf,myfac%in% 3), plot(x,y))
# And this will make a plot with groups 2 and 4
with (subset(mydf,myfac %in% c(2,4)), plot(x,y))
# Do you want to see the subgroups on the same plot?
# if we do "as.numeric(myfac)" it gives back integers for the groups,
# in order. So we can specify plotting characters or text
with (subset(mydf,myfac%in% c(2,3)), plot(x,y,pch=as.numeric(myfac)))
with (subset(mydf,myfac%in% c(2,3)), plot(x,y,pch=as.character(myfac)))
# I hope you remember you can use plot(..., type="n") and then
# layer on lines, text, points, and symbols if the simple little plot
# command is not beautiful enough.
# This works: I just tested it:
with (subset( mydf,myfac%in% c(2,4)) ,
{plot(x,y,type="n");
text(x,y,labels=myfac)} )
# The {} symbols are needed to keep the plot&text commands together.
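Here is a small sketch of why the %in% habit matters once you want more than one group. It rebuilds the same kind of mydf (the set.seed call is an addition, only there for reproducibility) and counts the rows each version keeps. With ==, R recycles c(2,4) down the rows instead of testing membership, so half of the wanted cases are silently dropped.

```r
set.seed(1)   # assumption: not in the original, just makes the example repeatable
mydf <- data.frame(x = rnorm(100), y = rnorm(100), myfac = gl(5, 20))

# %in% asks "is this row's group one of 2 or 4?" and keeps all 40 rows
n_in <- nrow(subset(mydf, myfac %in% c(2, 4)))

# == recycles c(2,4): odd rows are compared to 2, even rows to 4,
# so only half of groups 2 and 4 survive
n_eq <- nrow(subset(mydf, myfac == c(2, 4)))

print(c(n_in = n_in, n_eq = n_eq))
```

No warning is given for the == version, because 100 is a multiple of 2.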
Now, I wonder why this effort to plot subsets was not working when we
tried it before 4pm? I can't say for sure, but I have a pretty good
hypothesis. Look what happens when you mistype the subset command with
one equal sign, not two
subset(mydf,myfac = 2)
For reasons I do not know, that gives back ALL cases. No subsetting is
performed.
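A likely explanation, offered as a sketch rather than the last word: with one equal sign, R seems to read myfac = 2 as a named argument to subset() instead of a condition. The condition is then missing, and a missing condition means "keep every row". The row counts below show the difference. The suppressWarnings() wrapper is there because newer versions of R may warn about the extra argument.

```r
mydf <- data.frame(x = rnorm(100), y = rnorm(100), myfac = gl(5, 20))

# Two equal signs: a logical condition, keeps the 20 rows of group 2
n_right <- nrow(subset(mydf, myfac == 2))

# One equal sign: myfac = 2 is passed along as a named argument,
# no condition is supplied, and all 100 rows come back
n_wrong <- nrow(suppressWarnings(subset(mydf, myfac = 2)))

print(c(n_right = n_right, n_wrong = n_wrong))
```

Checking nrow() on the result is a cheap way to catch this kind of typo before plotting.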
Oh, one other thing. I got this all messed up at one time playing
around with value labels for the factor. Instead of naming them 1,2,3,4,5,
suppose we want the 5 groups with names a,b,c,d,e.
# assign into the data frame, since subset() looks for myfac inside mydf
mydf$myfac <- gl(5,20,labels=c("a","b","c","d","e"))
In some ways that is handy, some ways not.
The subset command now must match "a" "b", not the numbers.
subset(mydf,myfac=="a")
# use myfac, not mydf$myfac, so pch has one entry per row of the subset
with (subset(mydf,myfac%in% c("c","b")),
plot(x,y,pch=as.character(myfac)
))
If you do as.numeric(mydf$myfac), it will give you back some numbers.
They will be consecutive integers, one for each factor label, in the
order of the levels.
I'll leave you to ponder this beautiful gem. Note how the shuffling of
the letters in the new gl command does not change the as.numeric()
output at all.
> as.character(mydf$myfac)
 [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
[19] "a" "a" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
[37] "b" "b" "b" "b" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c"
[55] "c" "c" "c" "c" "c" "c" "d" "d" "d" "d" "d" "d" "d" "d" "d" "d" "d" "d"
[73] "d" "d" "d" "d" "d" "d" "d" "d" "e" "e" "e" "e" "e" "e" "e" "e" "e" "e"
[91] "e" "e" "e" "e" "e" "e" "e" "e" "e" "e"
> as.numeric(mydf$myfac)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2
[38] 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4
4 4 4 4
[75] 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
>
> myfac <- gl(5,20,labels=c("e","b","a","d","c"))
> as.numeric(myfac)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2
[38] 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4
4 4 4 4
[75] 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
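One way to see what is going on, as a sketch: the numbers from as.numeric() are positions in levels(), and gl() simply takes the labels in the order you hand them over. So shuffling the labels renames the groups but leaves the integer codes alone. The names f1 and f2 are made up for this example.

```r
f1 <- gl(5, 20, labels = c("a", "b", "c", "d", "e"))
f2 <- gl(5, 20, labels = c("e", "b", "a", "d", "c"))

# The integer codes are positions in levels(), nothing alphabetical
same_codes <- identical(as.numeric(f1), as.numeric(f2))

# The labels, in the order they were given to gl()
print(levels(f2))

# Indexing the levels by the codes gives back the labels
same_labels <- identical(levels(f2)[as.numeric(f2)], as.character(f2))

print(c(same_codes = same_codes, same_labels = same_labels))
```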
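Now, if you can get your mind around this, here is supposed to be the
best way to convert a factor whose labels are numbers (0, 1, 2, etc.)
back into those numbers:
as.numeric (levels (mydf$myfac)[as.numeric (mydf$myfac)])
It only makes sense when the labels look like numbers; with labels like
a,b,c it gives back NA. Read the help pages for levels() and factor().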
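A tiny illustration of that conversion, using a made-up factor f whose labels are the numbers 10, 20, 30 (not data from the lab). Plain as.numeric() returns the level positions, while going through levels() recovers the numbers written in the labels.

```r
f <- factor(c("10", "20", "10", "30"))

# Just the level positions
codes <- as.numeric(f)

# The numbers spelled out in the labels
values <- as.numeric(levels(f)[as.numeric(f)])

# With labels that are not numbers there is nothing to recover:
# every element comes back NA (with a coercion warning, suppressed here)
g <- gl(2, 2, labels = c("a", "b"))
lost <- suppressWarnings(as.numeric(levels(g)[as.numeric(g)]))

print(codes)
print(values)
print(lost)
```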
Factors give me a headache.
--
Paul E. Johnson email: pauljohn_AT_ku.edu
Dept. of Political Science http://lark.cc.ku.edu/~pauljohn
1541 Lilac Lane, Rm 504
University of Kansas Office: (785) 864-9086
Lawrence, Kansas 66044-3177 FAX: (785) 864-5700