General Question


What can Google's Ngram viewer tell us about our society?

Asked by gorillapaws (30865points) December 20th, 2010

Google has recently released a free web-app called Books Ngram Viewer that lets anyone view a graph of how frequently a word/phrase has appeared in books each year going back hundreds of years. Here’s an article from TIME that gives more background about it.

What can this technology tell us about our culture? Is it an effective measure of what our society is concerned with over time? Have you discovered anything cool using it?

When I search for “God”, there’s an interesting pattern: a gradual decline over the years, followed by a recent upswing in popularity. “Apocalypse” has a huge surge that builds up to right before 2000 and then drops off sharply. “Liberties” has declined dramatically since the 1800s. “Nigger” is particularly interesting because there are several distinct peaks and valleys.

I’m curious to hear what your thoughts about the Ngram Viewer are, and discoveries you have made.


31 Answers

the100thmonkey's avatar

I think one problem with the Ngram viewer is that in its basic form such as in the webapp, it is only possible to compare frequency over time. Without being able to examine the context in which the word frequencies rise and fall, it is impossible to draw any conclusions about the cultural trends – it’s useful as a complementary tool for (dis)confirming hypotheses one already has, but it’s extremely difficult to draw sound hypotheses from only this data.

Using the data with a dedicated concordancer like Xaira or AntConc (both free) would allow much deeper analysis of the datasets. Unfortunately, the data does not seem to be compatible with them. I think the problem is one of copyright.
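For anyone unfamiliar with what a concordancer does, here is a minimal keyword-in-context (KWIC) sketch. It is not how Xaira or AntConc work internally, just an illustration of the idea; the `kwic` function and the sample sentences are made up for the example, and it assumes you have the full text to hand, which the Ngram dataset does not give you.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per match of `keyword` in `text`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        # Take up to `width` characters on either side of the match.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Made-up sample text: the same word used in two different senses.
sample = ("Give me liberty or give me death. "
          "The sailors were granted shore liberty in port.")

for line in kwic(sample, "liberty"):
    print(line)
```

Both sentences would count once each in a frequency graph; only the surrounding words show that they are about different things.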

Kayak8's avatar

Thanks for sharing the link—I found it curious! I thought the following were interesting:
gay,lesbian
scrapbook
drowning

gorillapaws's avatar

@the100thmonkey I agree that you can’t draw hard conclusions using this tool alone, but I would think that, given the massive amount of data this thing is working with, many contextual subtleties will be washed out. One interesting point highlighted by @Kayak8’s search for “gay” is that the term was widely used to mean happy in earlier literature, so these semantic nuances (I think that’s the correct term) are important to bear in mind when using a tool like this.

I agree that it would be much more useful if more sophisticated tools were available for analysis. Perhaps they will be made available to serious researchers in the future (possibly with licensing payments to the publishing industry somehow).

@Kayak8 you’re welcome, and thanks for sharing.

CyanoticWasp's avatar

Thanks for the link. Very interesting.

I attempted to replicate one of @Kayak8’s graphs and thought I was looking at “drowning”, but I had misspelled it as “drowing”, and I still got a graph.

I was curious about the progressions for “evolution”, “postmodern” and (depressingly) “liberty”.

gorillapaws's avatar

Here’s another interesting one: fear, faith.

This one’s a bit sad: logic

CyanoticWasp's avatar

Take a look at the graphs for cash, credit, economy.

Integrity and honor are interesting, too.

Kayak8's avatar

mass murder is also interesting. If you run holocaust,mass murder at the same time it is more interesting.

the100thmonkey's avatar

Logic in the British English corpus.

@gorillapaws – your point about the use of the word ‘gay’ is really what I’m saying. It’s impossible to deduce that the meaning of the word has changed without other contextual information.

Kayak8's avatar

Not surprisingly, different disease names also seem to correspond with various outbreaks/fear of the disease. First I ran TB, then thinking it was referred to as tuberculosis, I put that through. Influenza was also interesting.

gorillapaws's avatar

I found this to be telling: compromise.

Hobbes's avatar

Ejaculated is another interesting one in the same vein, since it used to mean saying something loudly and suddenly.

LuckyGuy's avatar

The peak for the word Fluther was in 1966! Who knew?

iamthemob's avatar

Sin has an interesting upswing after the year 2000 after a steady decline…I wonder…what…happened…;-)

Kayak8's avatar

@iamthemob Look how Sin,Faith do together!

MissAnthrope's avatar

Using this, I just learned there’s a saint with my name! I have an unusual-ish name and had no idea!

There are also a lot of pigs and cows with my name, too. Hmm.

Anyway, thanks for posting the link to this. It’s very interesting to look at. I hesitate to make any interpretations, though.

iamthemob's avatar

@Kayak8 – I find that upsetting. But unsurprising. Sigh….

the100thmonkey's avatar

AAAAARRRGH!

It is impossible to make any kind of judgement on the data unless we are asking the right questions!

Please – if you are going to comment on the relative frequencies of words in the corpus that google has published, you need to do at least three things:

1. be clear about the hypotheses – and assumptions – that inform the query you make.

2. ask yourself whether you have a hypothesis you wish to test or whether you wish to form a hypothesis from what falls out of the data.

3. be open about – and include – your answers to the first two questions. Your posts are meaningless without this.

It’s a database, nothing more. Moreover, the tool that Google have released is incredibly limited: beyond @MissAnthrope’s discovery of isolated facts, there is little it can tell us in its current form, without being able to contextualise the results with other data.

Corpus analysis is a central feature of my job. Please be careful when using tools like this: it won’t kill you if you cock it up, but I’m seeing far too much naïveté here for me to be confident that we aren’t all going to just start repeating our prejudices while citing ‘information’ (as opposed to ‘data’) that really just reflects our propensity for confirmation bias.

Data is only as good as the questions you ask of it. The only judgement we are able to make on the data with the tool we have been provided with is “Hmmmm… that’s interesting. I wonder if [...] is true”.

CyanoticWasp's avatar

Well, damn, @the100thmonkey… I wish you’d speak up more often about confirmation bias when the topics verge into some of the various political and economic ones where most of the jellies routinely confirm each other’s prejudices. But you’re right, except… ‘data’ are plural.

iamthemob's avatar

@the100thmonkey – I would agree with @CyanoticWasp about discussions covering political/economic situations, etc…but more because it seems to be part of your expertise, and not because your post seems to be a valid response to a group of “Oooh…this is neat…weird…” reactions. Also, I would hesitate to dictate to users how to respond on this thread as if you are enforcing certain rules, particularly when the google ngram viewer demonstrates, beyond a shadow of a doubt, that your relevance has been in decline over the past 30 years.

gorillapaws's avatar

I’m with @the100thmonkey here. To use the Sin, Faith example above, there are many possible interpretations, but clearly we can’t draw many firm conclusions beyond the fact that both words tend to appear with similar frequencies over time. One might hypothesize that:
1. People who write about faith are thereby led to write about sin,
2. People who write about sin are thereby led to write about faith,
3. Some other phenomenon causes people to write about both sin and faith, or
4. They’re actually completely unrelated and the similarity in the trend lines is pure coincidence.

Other possibilities are that “faith” might have been a popular name at certain points, or that “sin tax” (or some other phrase or colloquialism) might have been common enough in a certain period to alter the curves significantly. Synonyms might also crop up and be used instead to represent the concept, reducing the frequency of the word while the frequency of the concept stays the same (or increases). The only way to draw more significant conclusions from data like this is the kind of rigorous statistical analysis that @the100thmonkey is describing, which isn’t possible using the tools as given.
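To make the synonym point concrete, here is a rough sketch. The numbers are invented (occurrences per million words, not real Ngram figures), and the choice of which words count as synonyms for the concept is itself an assumption someone would have to defend. The `concept_frequency` helper is hypothetical.

```python
# Invented counts per million words, by year. NOT real Ngram data.
freq = {
    "sin":           {1900: 300, 1950: 200, 2000: 150},
    "transgression": {1900: 20,  1950: 50,  2000: 90},
    "wrongdoing":    {1900: 10,  1950: 40,  2000: 70},
}

def concept_frequency(freq, words):
    """Sum the per-year frequencies of every word assumed to express one concept."""
    years = sorted(set().union(*(freq[w].keys() for w in words)))
    # A word with no entry for a year contributes 0 for that year.
    return {year: sum(freq[w].get(year, 0) for w in words) for year in years}

total = concept_frequency(freq, ["sin", "transgression", "wrongdoing"])
for year, count in total.items():
    print(year, "sin alone:", freq["sin"][year], "concept:", count)
```

In this made-up case the single word halves over the century while the combined figure barely moves, which is exactly the kind of thing a one-word graph would hide.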

The data might reveal patterns that are worth further exploration however, and that’s more what I was curious about. You guys have found some interesting things so far, thanks so much for sharing.

iamthemob's avatar

“The data might reveal patterns that are worth further exploration however, and that’s more what I was curious about. You guys have found some interesting things so far, thanks so much for sharing.”

I think that this, however, is the opposite of what @the100thmonkey seems to advocate. The steps indicate the need to form a hypothesis regarding what frequency means over the time periods. If one had a hypothesis about the outcome, indeed, there would be a need to control for whatever potential confirmation bias might accompany that hypothesis.

I do wonder what set @the100thmonkey off though. The handful of responses from a few users don’t really contain what appear to be any solid statements about what the user thought the data revealed – I think mine came close, and I attempted to clarify with the smiley that I was joking a bit.

But, let’s say that I were to pursue a hypothesis that the increase in the use of the words “sin” and “faith” was due to an increased fundamentalism following 9/11 and during the Bush era…it would be because this data revealed something surprising, and I drew a potential connection. I just don’t see people really claiming to see anything other than patterns.

funkdaddy's avatar

Well, despite the chance of getting called out, I found this interesting

yellow,blue,green,red,white,black,orange,brown

Colors generally follow very similar curves except at times when they would likely be used to describe races. White peaks during WW2, then white and black shoot up similarly during the civil rights movement.

I think what makes many of these interesting is that with a little bit of back story you can see history in something as simple as the relative occurrences of a word.

wundayatta's avatar

You can tell any story you want to explain any of those count changes. You could theorize that the words reflect the prevalence of concern about issues those words are often associated with (which is my theory). But so what? Is it going to help you?

Right now, this is like arguing over the meaning of an abstract painting. If you wanted to do something serious, you’d want to compare the trends for one word against a set of other words—either related or unrelated. Or better yet, you’d run a factor analysis to try to figure out which words most closely changed in lock step with each other.
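A full factor analysis is more than I can sketch here, but a simpler stand-in for “which words move in lock step” is to rank word pairs by Pearson correlation. The series below are invented for illustration (not real Ngram data), and `pearson` and `rank_pairs` are hypothetical helper names.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented frequencies at five points in time. NOT real Ngram data.
series = {
    "sin":   [30, 25, 20, 18, 22],
    "faith": [60, 50, 41, 37, 45],
    "logic": [10, 14, 19, 23, 21],
}

def rank_pairs(series):
    """Return (correlation, word_a, word_b) tuples, most correlated first."""
    pairs = [
        (pearson(series[a], series[b]), a, b)
        for a, b in combinations(sorted(series), 2)
    ]
    return sorted(pairs, reverse=True)

for r, a, b in rank_pairs(series):
    print(f"{a:>6} / {b:<6} r = {r:+.3f}")
```

Even a correlation near 1 only says the two lines have the same shape. It says nothing about why.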

But then what do you have? I am very skeptical of word counting. I think people make much more of it than is there. I think it can give you some guidance about themes that might be contained in the documents you are looking at, but the only way to test whether those themes are there (whether you are interpreting the counted words in any sensible way) is to actually read the material the words were counted from.

Who’s going to do that?

This is qualitative data analysis. This is not something that can really be quantified beyond counting words. Is anyone surprised that the term “traffic jam” first starts appearing around 1900 and has increased in a roughly linear fashion since then, except for a few downticks? The first is around World War II; the second around the oil crisis in the mid-70s; and the third, which is still continuing, started somewhere around 2005, when the wars in Afghanistan and Iraq were under way.

In other words, the use of the term “traffic jam” seems to decline when the oil supply for consumers seems to be declining and/or when the economy is declining. Except that during the Depression, the use of the phrase kept increasing, although at a slightly slower rate.

So, how do we explain this? My guess is that with less access to gasoline, there are fewer cars on the road and that, with fewer cars, there are fewer traffic jams and people write about traffic jams less often. It could also be that people care less about traffic jams at these times when they are worried about other things in the economy.

So what does it mean that the words “and” and “the” have been slowly declining over the last century?

Kayak8's avatar

I, for one, am not drawing any conclusions (I have a sense of the limits of data about which I know nothing). I just think it is a hoot to see the squiggly lines go up and down.

the100thmonkey's avatar

@iamthemob – it doesn’t take much to set me off these days. It was actually a discussion on my Facebook wall about this topic that got me going. :)

Hobbes's avatar

While I’m aware that confirmation bias is a real problem, and that these sorts of data can easily lead to it, I think it is possible for a layman to draw some common sense conclusions from these numbers. Take War for example. There are two giant spikes right around the times of World War I and World War II. Does anyone really think this is coincidence? I know that such a clear relationship is next to impossible to determine for many other words, and professional studies should not attempt to draw conclusions like this, but we’re just people on an internet forum pointing out interesting trends in data for fun. Lighten up, man.

iamthemob's avatar

@the100thmonkey – that makes much, much more sense then. ;-)

the100thmonkey's avatar

@iamthemob – yes, one of my… internet tics… is to leave out far too much detail when I’m fired up.
