Bluefin Labs analyzes 5 billion online comments and 2.5 million minutes of TV every month.
This visualization shows the relationships between the Gatorade brand and the male viewers of different TV shows.
Source: Bluefin Mines Social Media To Improve TV Analytics, Fast Company, 11-07-2011.
Echonest offers information on 30 million songs and 1.5 million music artists.
Source: the.echonest.com
Paul Butler's visualization of a sample of 10 million friends from Facebook, using the company's data warehouse.
Source: Paul Butler, Visualizing Friendships.
In an article called Computational Social Science (Science, vol. 323, no. 5915, 6 February 2009), leading researchers in network analysis, computational linguistics, social computing, and other fields that now work with large data write:
"The capacity to collect and analyze massive amounts of data has transformed such fields as
biology and physics. But the emergence of a data-driven 'computational social science' has been much slower. Leading journals in economics, sociology, and political science show little evidence of this field. But computational social science is occurring in Internet companies such as Google and Yahoo, and in government agencies such as the U.S. National Security Agency. Computational social science could become the exclusive domain of private companies and government agencies. Alternatively, there might emerge a privileged set of academic researchers presiding over private data from which they produce papers that cannot be critiqued or replicated. Neither scenario will serve the long-term public interest of accumulating, verifying, and disseminating knowledge."
Substitute the word "humanities" for "social science" in the paragraph above, and it perfectly describes the issues involved in large-scale analysis of cultural data. Today, digital humanities scholars mostly work with archives of digitized historical cultural materials created by libraries and universities with funding from the NEH and other institutions. These archives and their analysis are very important, but this work does not engage with the massive amounts of cultural content, and people's conversations and opinions about it, that exist on social media platforms, personal and professional web sites, and elsewhere on the web. This data offers us unprecedented opportunities to understand cultural processes and their dynamics, and to develop new concepts and models that can also be used to better understand the past. (In our lab, we refer to computational analysis of large contemporary cultural data as cultural analytics.)
Contemporary media and web industries depend on the analysis of this data. This analysis enables search, recommendations, video fingerprinting, identification of trending topics, and other crucial functions of their services. Because of its scale and technical sophistication, perhaps we should call it "computational humanities." The players in computational humanities are Google, Facebook, YouTube, Bluefin Labs, Echonest, and other companies that analyze social media signals (blogs, Twitter, etc.) and the content of media on social networks. They do not usually ask theoretical questions that relate directly to the humanities, but the types of analysis they perform and the techniques they use can easily be extended to ask such questions.
The questions posed in the paragraph I quoted above are directly applicable to "computational humanities." We can ask: Will computational humanities remain the exclusive domain of private companies and government agencies? Will we see a privileged set of academic researchers presiding over private data from which they produce computational humanities papers that cannot be critiqued or replicated?
These questions are essential for the future of the humanities. In this respect, the NEH/NSF Digging Into Data competitions are very important, as they push humanists to think on the scale of computational humanities and to collaborate with computer scientists. To quote from the description of the 2011 competition:
"The idea behind the Digging into Data Challenge is to address how "big data" changes the research landscape for the humanities and social sciences. Now that we have massive databases of materials used by scholars in the humanities and social sciences -- ranging from digitized books, newspapers, and music to transactional data like web searches, sensor data or cell phone records -- what new, computationally-based research methods might we apply?"
In our lab, we hope to contribute to bridging the gap between "digital humanities" and "computational humanities." Our data sets range from small historical datasets (for instance, 7,000-year-old stone arrowheads and paintings by Piet Mondrian and Mark Rothko) to large-scale contemporary user-generated content such as 1,000,000 manga pages or 1,000,000 images from deviantArt (the most popular social network for user-generated art). We also write papers for both humanities and computer science audiences. All our work is collaborative, involving students in digital art, media art, and computer science. And although our largest image sets are still tiny in comparison to the data analyzed by the companies I mentioned above, they are much bigger than what humanists and social scientists usually work with. The new visualization tools we have developed already allow you to explore patterns across 1,000,000 images, and we are gradually scaling them up.