Statistics: How to determine correlations between nominal variables?
Asked by Fyrius (14578), May 28th, 2010
Tl;dr: I’ve got two non-linear (nominal) variables here, and I need to figure out whether there’s a statistically significant correlation between the two.
Long version: As part of my thesis, I’ve been looking at a heckload of different languages and determining for each language how they handle two constructions, in order to figure out if there’s a connection between particular approaches. The hypothesis in a nutshell is that if a language uses approach X for construction A, it’ll use approach Y for construction B.
And now I need to figure out whether the data support the hypothesis or not. And I’m not sure how to do that.
Can you point me in the right direction?
I fail at statistics. :/
16 Answers
It sounds like there are actually four variables (A, B, X, and Y). A Chi square might help to set things up.
Imagine four boxes, two on top of two. Across the top it says “Approach” and along the left side it says “Construction.” So the two upper boxes will contain the values for Construction A (the left will be Approach X and the right will be Approach Y). The two bottom boxes will contain the values for Construction B (the left will be Approach X and the right will be Approach Y).
I love SPSS for doing this type of stats. It works pretty much like an Excel spreadsheet: you plug in the values and it crunches the numbers.
Based on what you said, I would expect that the upper left box and the lower right box will hold the majority of your values and there won’t be very high numbers in the upper right or lower left box. SPSS will help determine the p value.
If you have a language where you KNOW your hypothesis is true, you can test each of the other languages against it in a Chi Square. This will let you test your null hypothesis: all things being equal, you would NOT expect that Approach X will be any different than Approach Y for construction A.
A Chi Square won’t prove your hypothesis is true per se, but will show strength of the association.
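For anyone without SPSS, the chi-square calculation on a two-by-two table of counts can be sketched in plain Python. This is only a rough illustration: the counts are invented, the function name `chi_square_2x2` is made up for this example, and the p-value shortcut only holds for a 2×2 table (one degree of freedom).

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts, invented for illustration only (not real data).
table = [[12, 4],
         [5, 11]]
chi2, p = chi_square_2x2(table)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

Note that the chi-square approximation gets shaky when any expected count is small (a common rule of thumb is below 5), so treat the p-value with caution on small tables.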
@Kayak8’s approach is a nice way to test for some correlation. There are some other tests that are stronger (i.e. they find a correlation that chi-squared misses). I’ll try to find the one I’m thinking of.
Also, if you want to find the functional form for the correlation, you can use least squares. Your input matrix is:
[X, X^2, X^3, etc… e^X, log(X), etc…]
Stick in any functional forms of X you think might work and run the least squares algorithm from linear algebra (matrix form) and you might find something.
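A minimal sketch of that least-squares idea, assuming X and Y are numeric (which is not the case for purely nominal data). The data here is synthetic, the basis functions are just examples, and `least_squares` is a made-up helper that solves the normal equations directly rather than calling a linear algebra library.

```python
import math

def least_squares(xs, ys, basis):
    """Fit y ~ sum_k beta[k] * basis[k](x) by solving the normal equations."""
    A = [[f(x) for f in basis] for x in xs]
    k = len(basis)
    # Augmented system [A^T A | A^T y].
    M = [[sum(row[i] * row[j] for row in A) for j in range(k)]
         + [sum(row[i] * y for row, y in zip(A, ys))]
         for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        rest = sum(M[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (M[i][k] - rest) / M[i][i]
    return beta

# Candidate functional forms: constant, X, X^2, log(X).
basis = [lambda x: 1.0, lambda x: x, lambda x: x ** 2, math.log]

# Synthetic data generated from y = 2 + 0.5 * x^2, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 + 0.5 * x ** 2 for x in xs]

beta = least_squares(xs, ys, basis)
print([round(b, 6) for b in beta])
```

Coefficients near zero suggest that functional form is not contributing much. With real, noisy data a library routine would be a more robust choice than hand-rolled normal equations.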
By the way, can we look at your data (or a sample of it)? It might help in coming up with other ideas.
@Kayak8
Hang on. Approach Y can’t be used for construction A. For construction A it’s a matter of approach X versus various approaches not-X, and for B a matter of approach Y versus several approaches not-Y. You can’t take an approach to one construction and apply it to another one. It’s a specific solution to a specific problem.
As it happens, approaches X and Y do look a lot alike, which is why I’m doing this. But still I can’t assume them to be the same thing.
But thanks, I’ll look into Chi Square analysis.
To clarify a bit further what I’m dealing with: these constructions A and B are the reflexive (“I troll myself”) and the passive (“I am trolled (by him)”).
In English the reflexive is formed with a pronoun (“myself”) and the passive with an auxiliary construction, but other languages can do it differently. For each meaning there is a small number of possible ways how it can be expressed in a given language. What I want to know is whether languages with one particular approach to the reflexive tend to have one particular approach to the passive.
So your data looks like this? (Edit for readability. The commas are column delimiters.)
Language,Reflexive Approach,Passive Approach
Lang 1,A,X
Lang 2,A,Y
Lang 3,B,X
Lang 4,B,Y
etc…
No no, it’s more like this.
Language; Reflexive approach; Passive approach
Lang 1; X; Y
Lang 2; not-X; Y
Lang 3; X; not-Y
Lang 4; not-X; not-Y
Where there are several options for not-X and several options for not-Y.
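Data in that layout can be collapsed into a two-by-two table of counts (X vs. not-X against Y vs. not-Y) before running any test. A rough sketch, assuming semicolon-separated rows; the "Lang 5" row and the `tally` helper are made up for this example, and lumping every not-X option together is itself an assumption.

```python
from collections import Counter

# Placeholder rows in the "Language; Reflexive approach; Passive approach" layout.
rows = """\
Lang 1; X; Y
Lang 2; not-X; Y
Lang 3; X; not-Y
Lang 4; not-X; not-Y
Lang 5; X; Y
"""

def tally(text, target_reflexive="X", target_passive="Y"):
    """Count languages by (reflexive is target?, passive is target?)."""
    counts = Counter()
    for line in text.strip().splitlines():
        _, reflexive, passive = [field.strip() for field in line.split(";")]
        counts[(reflexive == target_reflexive, passive == target_passive)] += 1
    # Rows: X, not-X. Columns: Y, not-Y.
    return [[counts[(True, True)], counts[(True, False)]],
            [counts[(False, True)], counts[(False, False)]]]

table = tally(rows)
print(table)
```

Any approach other than the target string counts as "not", so several distinct not-X options all land in the same row.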
Here’s some actual data. Some languages don’t have passives.
Language; Reflexive approach; Passive approach
Eastern Aranda; reflexive verb affix; N/A.
Arabana; reflexive verb affix; passive verb affix.
Dehu; reflexive verb affix; syntactic transformation.
Bahasa Indonesia; pronouns; passive verb affix / auxiliaries.
What I want to know is if there’s a correlation between the reflexive verb affix approach for reflexives and the passive verb affix approach for passives.
I suspect you won’t get very far with a chi-squared test unless the correlation is very distinct. Even if there is a correlation, it’s probably pretty weak, so you will need a lot of data to tease it out.
However, that’s generally true with statistical tests…
That’s another thing… I’ve found just 32 data points. :/
I think I’m going to need an analysis that can work in relative data poverty.
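One test that is often used for small two-by-two tables is Fisher's exact test, which computes the p-value directly from the hypergeometric distribution instead of relying on a large-sample approximation. A minimal sketch, assuming the data has been collapsed to a 2×2 table; the counts are invented and `fisher_exact_two_sided` is a made-up name for this example.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test p-value for a 2x2 table of counts."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Probability of x in the top-left cell with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum over every possible table that is no more likely than the observed one.
    # The tiny tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))

# Hypothetical counts for 32 languages, invented for illustration only.
table = [[12, 4],
         [5, 11]]
p_value = fisher_exact_two_sided(table)
print(f"p = {p_value:.4f}")
```

Whether this suits a given data set depends on how the categories are collapsed and on whether the languages can be treated as independent observations.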
Most tests fail with fewer than around 30 or 40 data points… i.e. they are not really valid.
This might help.
I think you should drop the statistical tests and see what your data says just by looking. You can do something similar to a chi-squared but don’t worry about the p-value. If the correlation looks strong, you may be able to say something, as long as you can come up with a good causal explanation.
I see…
Well, I’ve done some simple calculations now, taking the ratio of verb-affix-passives to anything-else-passives for languages with verb-affix-reflexives and for languages without them, and comparing the two ratios. Looking at it that way, the answer seems pretty clear-cut, actually.
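That kind of ratio comparison amounts to an odds ratio. A rough sketch of the arithmetic, with invented counts (not the actual thesis data) and made-up variable names:

```python
def odds(affix_passives, other_passives):
    """Ratio of verb-affix passives to all other passives within one group."""
    # Undefined if other_passives is zero; a real analysis would need to handle that.
    return affix_passives / other_passives

# Hypothetical counts: (verb-affix passives, anything-else passives).
with_affix_reflexive = (12, 4)
without_affix_reflexive = (5, 11)

odds_with = odds(*with_affix_reflexive)
odds_without = odds(*without_affix_reflexive)
odds_ratio = odds_with / odds_without

print(f"odds with affix reflexive:    {odds_with:.2f}")
print(f"odds without affix reflexive: {odds_without:.2f}")
print(f"odds ratio:                   {odds_ratio:.2f}")
```

An odds ratio well above 1 means verb-affix passives are relatively more common among languages with verb-affix reflexives. On its own it says nothing about whether the difference could be down to chance.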
I’ll have to add some caveats, but I think I still have the best answer I can get with my present resources. Cool.
Thanks for your help anyway, both of you. :D
My first instinct was also a χ², but now that I’m thinking about it I wonder if there’s another way to frame the problem in terms of interval or ratio data. What you’re looking for is not really a correlation, right? Because as you point out, these are non-linear variables, and all a correlation really tells you is how well a line describes the data. In this case, the answer would be: not at all.
I like your final analysis. I’m thinking maybe there is some way to do a hypothesis test, maybe using nonparametric stats? I’ll keep thinking about it.
@nikipedia
I see. So I suppose “correlation” is technically not the right term, then. Noted. Thanks.
What would this sort of relation be called, then?
@Fyrius To be technical, you didn’t mean “non-linear” but really “categorical.”
Ah. Thanks for the correction.
I found microarray analysis (see Sturm) beneficial to use even with categorical (nominal) data. (It is easy to get a licence for academic use.) The output is very visual and easy to interpret once you understand the analysis.