"I still hear this persistent fear of people using computational analysis in the humanities bringing about scientism, or positivism. The specter of Cliometrics haunts us. This is completely backwards."
Trevor Owens, Discovery and Justification are Different: Notes on Science-ing the Humanities, 11/19/2012.
As the number of people using quantitative methods to study "cultural data" gradually increases (right now these people work in a few areas which do not interact: digital humanities, empirical film studies, computers and art history, computational social science), it is important to ask: what is statistics, and what does it mean to use statistical methods to study culture? Does using statistics immediately make you a positivist?
Here is one definition of statistics:
"A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data" (http://www.merriam-webster.com/dictionary/statistics)
The Wikipedia article drops the reference to "mathematics" and repeats the rest:
"Statistics pertains to the collection, analysis, interpretation, and presentation of data." (http://en.wikipedia.org/wiki/Outline_of_statistics).
Without the reference to mathematics, this description looks very friendly - there is nothing here which directly calls for positivism or the scientific method.
But of course this is not enough to argue that statistics and the humanities are compatible projects. So let's continue. It is standard to divide statistics into two approaches: descriptive and inferential.
"Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data." (http://www.socialresearchmethods.net/kb/statdesc.php).
Examples of descriptive statistics are the mean (a measure of central tendency) and the standard deviation (a measure of dispersion).
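In code, these two summaries are one-liners. A minimal sketch using Python's standard statistics module, on a made-up sample of height measurements (hypothetical numbers, not Quetelet's data):

```python
import statistics

# Heights (in cm) of a small hypothetical sample of soldiers.
heights = [168, 171, 169, 174, 166, 172, 170, 173, 167, 170]

mean = statistics.mean(heights)      # central tendency
stdev = statistics.pstdev(heights)   # dispersion (population standard deviation)

print(f"mean = {mean:.1f} cm, standard deviation = {stdev:.2f} cm")
# prints: mean = 170.0 cm, standard deviation = 2.45 cm
```

Two numbers stand in for the whole list - which is all that "summarizing the properties of the data in a compact form" means.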
Adolphe Quetelet. A graph showing the distribution of height measurements of soldiers. This graph was part of many early statistical studies by Quetelet, which led him to formulate the theory of the "average man" (1835): many measurements of human traits follow a normal curve. Source: E. B. Taylor, Quetelet on the Science of Man, Popular Science, Volume 1 (May 1872).
The application of descriptive statistics does not have to be followed by inferential statistics; the two serve different purposes. Descriptive statistics is only concerned with the data you have - it is a set of diverse techniques for summarizing the properties of this data in a compact form. In contrast, in inferential statistics the collected data is only a tool for making statements about what lies outside this sample (e.g., a population):
"With descriptive statistics you are simply describing what is or what the data shows. With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone." (http://www.socialresearchmethods.net/kb/statdesc.php).
Traditionally, descriptive statistics summarized the data with numbers. Since the 1970s, computers have gradually made the use of graphs for studying the data (as opposed to only illustrating the findings) equally important. This was pioneered by John Tukey, who coined the term exploratory data analysis. "Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics in easy-to-understand form, often with visual graphs, without using a statistical model or having formulated a hypothesis." (http://en.wikipedia.org/wiki/Exploratory_data_analysis). Tukey's work led to the development of the statistical and graphing software S, which in its turn led to R, today the most widely used computing platform for data exploration and analysis.
Accordingly, the current explanation of descriptive statistics on Wikipedia includes both numbers and graphs: "Descriptive statistics provides simple summaries about the sample and about the observations that have been made. Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e. simple-to-understand graphs." (http://en.wikipedia.org/wiki/Descriptive_statistics).
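The two kinds of summaries this quote distinguishes can be sketched side by side. A toy example (invented numbers) that produces both a quantitative summary and a crude text "histogram" in the spirit of Tukey's EDA:

```python
import statistics

# A small invented data set standing in for any collection of measurements.
data = [3, 7, 8, 5, 12, 14, 21, 13, 18, 9, 10, 6, 11, 4, 15]

# Quantitative summary: a few summary statistics.
print("min:", min(data), "median:", statistics.median(data), "max:", max(data))

# Visual summary: a text histogram, binning values by tens.
bins = {}
for x in data:
    low = x // 10 * 10
    bins[low] = bins.get(low, 0) + 1
for low in sorted(bins):
    print(f"{low:2d}-{low + 9:2d} | " + "#" * bins[low])
```

The printed bars show the shape of the distribution at a glance, which the three numbers alone do not.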
What we get from this is that we don't have to do statistics with numbers - visualizations are equally valid. This should make people who are nervous about digital humanities feel more relaxed. (Of course, visualizations bring their own fears - after all, the humanities have always avoided diagrams, let alone graphs, and the fact that visualization is now allowed to enter the humanities is already quite amazing. For example, the current Cambridge University Press author guide still says that illustrations should be used only when really necessary, because they distract readers from following the arguments in the text.)
Francis Galton’s first correlation diagram, showing the relation between head circumference and height, 1886. Source: Michael Friendly and Daniel Danis, The Early Origins and Development of the Scatterplot. The article suggests that this diagram was an intermediate form between a table of numbers and a true graph.
However, we are not finished yet. Besides numbers and graphs, we can also summarize data using parts of the data itself. In this scenario, there is no translation of one media type into another (for example, text translated into numbers or into graphs). Although such summaries are not (or not yet) understood as belonging to statistics, they fit the definition of descriptive statistics perfectly.
For example, we can summarize a text with numbers such as the average sentence length, the proportion of nouns to verbs, and so on. We can also use graphs: for example, a bar chart that shows the frequencies of all words used in the text in ascending order. But we can also use some words from the text itself as its summary.
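The first two kinds of text summaries - numbers and frequency counts - can be sketched with Python's standard library. The sample text is the famous opening of Pride and Prejudice; the choice of statistics is illustrative:

```python
import re
from collections import Counter

text = ("It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune, must be in want of a wife. However "
        "little known the feelings or views of such a man may be on his "
        "first entering a neighbourhood, this truth is so well fixed in the "
        "minds of the surrounding families, that he is considered as the "
        "rightful property of some one or other of their daughters.")

# Split into sentences and words with simple (deliberately crude) rules.
sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
words = re.findall(r"[a-z']+", text.lower())

# A numerical summary: average sentence length.
avg_len = len(words) / len(sentences)
print(f"{len(sentences)} sentences, average length {avg_len:.1f} words")

# Raw material for a frequency bar chart: word counts.
print("most frequent words:", Counter(words).most_common(5))
```

On a novel with tens of thousands of sentences, the same few lines compress the whole text into a handful of numbers.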
One example is the popular word cloud. It summarizes a text by showing the most frequently used words, scaled in size according to how often they occur. It carries exactly the same information as a graph that plots word frequencies - but where the latter foregrounds the graphic representation of the pattern, the former foregrounds the words themselves.
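This equivalence can be made literal: a word cloud is just the frequency table re-expressed as type size. A toy sketch (invented word list; the linear 10-40 pt scaling is an arbitrary choice):

```python
from collections import Counter

# An invented stream of words standing in for a tokenized text.
words = ("pride and prejudice pride elizabeth darcy elizabeth "
         "pride bennet darcy elizabeth pride").split()

counts = Counter(words)
max_count = max(counts.values())

# Map each word's count onto a font size between 10 pt and 40 pt.
sizes = {w: 10 + 30 * (c / max_count) for w, c in counts.items()}
for w, c in counts.most_common():
    print(f"{w:10s} count={c}  font={sizes[w]:.0f}pt")
```

The sizes dictionary is exactly the data a frequency bar chart would plot; only the rendering differs.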
Another example is the phrase net technique available on Many Eyes (http://www-958.ibm.com/software/data/cognos/manyeyes/page/Phrase_Net.html). It is a graph that shows the most frequent pairs of words in a text. Here is a phrase net showing the 20 most frequent word pairs in Jane Austen's Pride and Prejudice (1813):
An interactive version of this graph allows you to change its parameters.
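The core of a phrase net - counting frequent word pairs - can be sketched in a few lines. This toy version counts only adjacent pairs, whereas Many Eyes links words joined by a chosen pattern such as "X of Y":

```python
from collections import Counter

text = ("it is a truth universally acknowledged that a single man in "
        "possession of a good fortune must be in want of a wife")

words = text.split()
# Count adjacent word pairs (bigrams).
pairs = Counter(zip(words, words[1:]))
print(pairs.most_common(3))
```

Run on a whole novel instead of one sentence, the top pairs become the nodes and edges of the phrase net graph.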
In the previous examples, the algorithms extracted some words or phrases from a text and organized them visually - therefore it is possible to argue that they still belong to the graphical methods of descriptive statistics. But consider now the technique of topic modeling, which has recently been getting lots of attention in digital humanities (http://en.wikipedia.org/wiki/Topic_model; http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/). A topic modeling algorithm outputs a number of sets of semantically related words; each set of words is assumed to represent one theme in the text. Here are three such sets from a topic model of the articles in the journal Critical Inquiry (source: Jonathan Goodwin, Two Critical Inquiry Topic Models, 11/14/2012):
meaning theory interpretation question philosophy language point claim philosophical sense truth fact argument knowledge intention metaphor text account speech
history historical narrative discourse account contemporary terms status context social ways relation discussion essay sense form representation specific position
public war time national city american education work social economic space people urban culture corporate building united market business
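To make the idea behind such outputs concrete, here is a deliberately tiny sketch of LDA topic modeling via collapsed Gibbs sampling, on an invented four-"document" corpus. It is not the implementation behind the lists above (studies like those cited typically use tools such as MALLET); it only illustrates how sets of co-occurring words can emerge from raw text:

```python
import random
from collections import defaultdict

random.seed(0)

# Four toy "documents", two rough themes.
docs = [
    "meaning theory interpretation language truth argument".split(),
    "history narrative discourse social context representation".split(),
    "meaning language truth interpretation argument theory".split(),
    "history social discourse narrative context representation".split(),
]

K = 2                      # number of topics to infer
ALPHA, BETA = 0.1, 0.01    # Dirichlet smoothing hyperparameters
V = len({w for d in docs for w in d})

# z[d][i]: topic currently assigned to word i of document d.
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]               # document-topic counts
nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
nk = [0] * K                                # words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

for _ in range(200):  # Gibbs sampling sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
            # Resample this word's topic from its conditional distribution.
            weights = [(ndk[d][t] + ALPHA) * (nkw[t][w] + BETA) / (nk[t] + V * BETA)
                       for t in range(K)]
            k = random.choices(range(K), weights)[0]
            z[d][i] = k
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

for t in range(K):
    top = sorted(nkw[t], key=nkw[t].get, reverse=True)[:3]
    print(f"topic {t}:", top)
```

Each printed topic is a word set like those above - a summary of the corpus made of the corpus's own words.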
In my lab, we have been developing free software tools for summarizing large image and video collections. In some of our applications, the whole collection is translated into a single high-resolution visualization (so these are not summaries), but in others, a sample of an image set, or parts of the images, serve as the summaries. For example, here is a visual summary of 4535 covers of Time magazine which uses a one-pixel-wide column from each cover. It shows the evolution of the covers' design from 1923 to 2008 (left to right), compressing 4535 covers into a single image:
(For a detailed analysis of this and other related techniques for what I call exploratory media analysis, and their difference from information visualization, see my article What is Visualization?)
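The slicing operation behind the Time covers montage can be sketched abstractly: take one central column of pixels from each image and place the columns side by side, in publication order. Here each "cover" is a tiny invented grid of grayscale values:

```python
# Three hypothetical 3x3 grayscale "covers" (lists of pixel rows).
covers = [
    [[10, 20, 30], [40, 50, 60], [70, 80, 90]],   # cover 1
    [[11, 21, 31], [41, 51, 61], [71, 81, 91]],   # cover 2
    [[12, 22, 32], [42, 52, 62], [72, 82, 92]],   # cover 3
]

def central_column(img):
    """Extract the one-pixel-wide central column of an image."""
    mid = len(img[0]) // 2
    return [row[mid] for row in img]

columns = [central_column(c) for c in covers]
# Transpose so each extracted column becomes one column of the montage.
montage = [list(row) for row in zip(*columns)]
for row in montage:
    print(row)
```

With 4535 real covers, the same operation yields one wide image whose left-to-right axis is time.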
We can also think of other examples of summarizing collections or artifacts in different media by using parts of those collections or artifacts. On many web sites, a video is summarized by a series of keyframes. (In computer science, there is a whole field called video summarization devoted to developing algorithms that represent a video using selected frames, or other parts of a video.)
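The simplest video-summarization baseline - sampling keyframes at uniform intervals - can be sketched with frames represented as labels (real systems instead detect shot boundaries and pick representative frames):

```python
# A toy video as a list of frame labels: 120 frames, ~4 s at 30 fps.
frames = [f"frame_{i:03d}" for i in range(120)]

def keyframes(frames, n=6):
    """Return n frames sampled at even intervals across the video."""
    step = max(1, len(frames) // n)
    return frames[::step][:n]

print(keyframes(frames))
```

Six frames stand in for 120: a summary made of parts of the artifact itself, in the artifact's own medium.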
To summarize a complex image, we can translate it into a monochrome version that shows only the key shapes. Such images have been commonly used in many human cultures.
Picasso. Self Portrait. 1907. A summary of the face which uses outlines of the key parts.
Some symbols can also act as summaries. For example, modernity and industrialization were often summarized by images of planes, cars, gears, workers, and so on. Today in TV commercials, the network society is typically summarized by a visualization showing a globe with animated curves connecting many points.
Example of an object becoming a symbol: the gears stand in for industrialization. The cover of Sigfried Giedion's Mechanization Takes Command (1947).
There is nothing "positivist" or "scientific" about such same-media summaries, because they are what the humanities and the arts have always been about. Art images have always summarized visible or imaginary reality by representing only some essential details (like contours) and omitting others. (Compare this to the goal of descriptive statistics to capture "the basic features of the data.") A novel may summarize everything that happened over twenty years in the lives of its characters by showing us only a few of the events. And every review of a feature film includes a short summary of its narrative.
Until the development of statistics in the 19th century, all such summaries were produced manually. Statistics "industrializes" this process, substituting objective and standardized measures for subjective summarization. While at first these were only summaries of numerical data, the development of computational linguistics, digital image processing, and GIS in the 20th century also automated the production of summaries of media such as texts, images, and maps.
Given that the production of summaries is a key characteristic of human culture, I think that such traditional summaries created manually should not be opposed to more recent algorithmically produced summaries such as a word cloud or a topic model, the graphs and numerical summaries of descriptive statistics, or binary (i.e., only black and white, without any gray tones) summaries of photographs created with image processing (in Photoshop, use Image > Adjustments > Threshold). Instead, all of them can be situated on a single continuous dimension.
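The binary summary just mentioned is a one-line transformation. A sketch that mimics Photoshop's Threshold on an invented grid of grayscale values:

```python
# A "photograph" as a grid of 0-255 grayscale values (invented data).
image = [
    [ 12, 200,  90, 240],
    [180,  30, 150,  60],
    [255,  10, 128, 127],
]

# Thresholding: every pixel becomes pure black or pure white.
THRESHOLD = 128
binary = [[255 if px >= THRESHOLD else 0 for px in row] for row in image]
for row in binary:
    print(row)
```

The result keeps only the key shapes - a summary of the image made in the image's own medium.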
On the one end, we have summaries that use the same media as the original "data." They also use the same structure (e.g., the long narration of a film or a novel is summarized by a condensed narrative presented in a few sentences in a review; a visible scene is represented by the outlines of its objects). We can also place here metonymy (a key rhetorical figure) and Peirce's icon (from his 1867 semiotic triad of icon/index/symbol).
On the other end, we have summaries which use numbers and/or graphs and which present information that is impossible to see immediately by simply reading, viewing, listening to, or interacting with a cultural text or a set of texts. Examples of such summaries are a number representing the average number of words per sentence in a text with tens of thousands of sentences, or a graph showing the relative frequencies of all the words appearing in a long text.
I don't think we can find some hard, definite threshold separating summaries which can only be produced by algorithms (because of the data size) from those produced by humans. In the 1970s, without the use of any computers, French film theorists advanced the idea that most classical Hollywood films follow a single narrative formula. This is just one example of the countless "summaries" produced in all areas of the humanities (whether they are accurate is a separate question).
I don't know if my arguments will help us when we are criticized by people who keep insisting on a wrong chain of substitutions: digital humanities = statistics = science = bad. But if we keep explaining that statistics is not only about inferences and numbers, gradually we will be misunderstood less often.
(Note: in the last fifteen years, new mathematical methods for analyzing data which overlap with statistics have become widely used - referred to by "umbrella" terms such as "data mining," "data science," and "machine learning." They have very different goals than either descriptive or inferential statistics. I will address their assumptions, their use in industry, and how in digital humanities we may use them differently in a future article.)