The Data Revolution Will be CrowdSourced

Living in the San Francisco Bay Area, one quickly develops an allergy to any claim of a ‘revolution’ in a particular field. But it is now abundantly clear to librarians, archivists, computer scientists, and many social scientists that we are in a transformational age. Terabytes of textual and video data are being created or scanned into existence every day. While these data include silly tweets, they also include the archives of national libraries, news accounts of activities around the world, journal articles, online conversations, vital email correspondence, surveillance of crowds, videos of police encounters, and much more. If we can understand and measure meaning from all of these data describing so much of human activity, we will finally be able to test and revise our most intricate theories of how the world is socially constructed through our symbolic interactions.

But that’s a big ‘if.’ Natural language and video data, compared to other data computer scientists have been pushing around for decades, are incredibly difficult to work with. Computers were initially built for data that can be precisely manipulated as unambiguous electrical signals flowing through unambiguous logic gates. The meaning of the information encoded in our human languages, gestures, and embodied activities, however, is incredibly ambiguous and often opaque to a computer. We can program the computer to recognize certain “strings” of letters, and then to perform operations on them (much like the operator of Searle’s Chinese Room), but no one yet has programmed a computer to experience our human languages as we do. That doesn’t mean we don’t try. There are three basic approaches to helping computers understand human symbolic interaction, and language in particular:

  1. We can write rules telling them how to treat all the different multi-character strings (i.e. words) out there.

  2. We can hope that general artificial intelligence will just “figure it out.”

  3. We can show computers how we humans process language and train them, through an iterative process, to read and understand more like we do.

The first two approaches are doomed, and I’ll say more about why. The third approach provides a way forward, but it won’t be easy. It will require that researchers like us recruit hundreds or thousands of people (i.e., crowds) into our processes. So, unpacking this post’s title: our ability to make sense of and systematically analyze the dense, complex, manifold meaning inhering in now ubiquitous and massive textual and video data will depend on our ability to enlist the help of many other humans who already know how to understand language, situations, emotion, sarcasm, metaphor, the pacing of events, and all the other aspects of being an agentic organism in a socially constructed world – all the stuff of social life that computers just won’t ever understand without our help.
Not Enough Rules
The great (and horrible) thing about computers is that – as long as you use the magic words of their ‘artificial languages’ – they will do exactly what you tell them to do. For many, this fact leads to the quick conclusion that we can just write rules telling computers how to process all of our more ambiguous ‘natural languages.’ Feed it a dictionary. Feed it a thesaurus. Tell it how grammar works. Then, they imagine, the computer will be able to speak and write as we do … Would that it were so easy.
Unfortunately, the natural languages we use to communicate every day are so much more ambiguous than the artificial languages computers read that it is only a modest exaggeration to suggest that writing rules allowing a computer to pass a Turing test (i.e. to so aptly converse with a human that it could fool that human into believing it too was human) would require us to write almost as many rules as there are natural language sentences. Consider, for example, the seemingly easy challenge of parsing an address field from a thousand survey forms. The first several characters before a space are the street number, right? And then the characters after the space are the street name, no? Well… sadly, the natural world is not so well organized, even for highly structured data like addresses. Sometimes addresses start with a building name, not a street number. Sometimes, too, contrary to what we might think, addresses include two separate numeric strings, or even alphabetical characters in the street number string. In fact, there are over 40 exception rules necessary to reliably parse something as simple as the address field of a standard form.
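To make the point concrete, here is a minimal sketch in Python of the naive rule described above, along with a few invented address strings that break it. The examples and the helper name are illustrative assumptions, not a real parsing library:

```python
# A naive rule-based address parser and a few (invented) inputs that break it.

def naive_parse(address: str) -> dict:
    """Rule: everything before the first space is the street number,
    everything after it is the street name."""
    number, _, street = address.partition(" ")
    return {"street_number": number, "street_name": street}

examples = [
    "1600 Pennsylvania Avenue",    # the happy path the rule assumes
    "Wheeler Hall, Berkeley CA",   # starts with a building name, not a number
    "34-56 107th Street",          # two numeric strings in the street number
    "221B Baker Street",           # an alphabetical character in the street number
]

for addr in examples:
    print(addr, "->", naive_parse(addr))
# Only the first address comes out the way a human reader would parse it;
# each of the others demands its own exception rule.
```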
Indeed, the computer’s stupid-perfect following of instructions has inspired a genre of blog posts entitled “Falsehoods Programmers Believe About ______.” A Google search of this phrase should provide readers with ample humility about the plausibility of writing rules to teach computers natural language. If relatively simple tasks like parsing addresses, time, names, and geographic locations from structured forms generate so much frustration, imagine the difficulties inherent in parsing sentences like: “She saw him on the mountain with binoculars.” Did he have the binoculars? Was she on the mountain? Perhaps a sentence three paragraphs earlier explained that she was carrying the binoculars while walking along the beach. But when should the computer compare information across such distant sentences?
By the time even the most patient rule-writer has directed a computer to read just one newspaper, accounting for all the “what they really meant to say” situations, the monumental effort will have produced countless contradictory rules along with many that are torturously complex. Moreover, they’re likely to be poorly designed for the next newspaper, let alone War and Peace, a Twitter feed, or transcripts of local radio news.
Cognitive linguists would argue that the problem with the rule-writing approach is its distance from humans’ actual processing of language. The goal should not be to train the computer to behave like the operator of Searle’s Chinese room, but to train it to understand Chinese (or any natural language) like a fluent speaker. If our ultimate goal is to build computer programs to process terabytes of textual data as humans do, shouldn’t we be attempting to train computers to read them (and even their ambiguities) as we do?
Go is Easy
People have become very excited lately by the development of “deep learning” artificial intelligence technology. Heralded for its ability to defeat humans in complex games like Chess and Go, the technology is also spookily appealing in its mimicry of the actual human brain. It does not include ancient structures like the hippocampus, nor is it directly connected to a breathing, walking, eating mammal. But it does use simulated neurons and neural connections to learn much like we humans do. Our brains often (though not always) learn by potentiating networks of neurons in response to feedback, a process that artificial networks loosely mimic with an algorithm called back-propagation. To sketch that out very simply: some network of neurons fires together in our brains whenever we think a particular thought, imagine a specific memory, or perform a singular task. If that firing does something sensible or useful for us, a chemical propagates back through all the neurons of the network to encourage those neurons to fire together in the future. To learn how to add numbers through this mechanism, for example, is to increase the (chemical) potential that a network of neurons performing the addition function will fire whenever we see two numbers with a ‘+’ sign between them. The computer brain behind “deep learning” behaves similarly. As it gets positive or negative feedback about its performance on some task, it increases or decreases the probability that it will perform similarly the next time it faces a similar task. (More on this below).
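To make that feedback loop concrete, here is a minimal sketch of a single simulated ‘neuron’ nudged, through repeated feedback, toward adding two numbers. The task, numbers, and update rule are illustrative assumptions, not a description of any particular deep learning system:

```python
# One simulated neuron learning to add its two inputs via feedback.
import random

weights = [random.uniform(-1, 1), random.uniform(-1, 1)]  # simulated connection strengths
learning_rate = 0.1

def fire(inputs):
    """The 'neuron' fires in proportion to its weighted inputs."""
    return sum(w * x for w, x in zip(weights, inputs))

# Feedback loop: the "useful" behavior here is producing the sum of the two inputs.
for _ in range(2000):
    inputs = [random.uniform(0, 1), random.uniform(0, 1)]
    target = inputs[0] + inputs[1]
    error = target - fire(inputs)                # how useful was the firing?
    for i, x in enumerate(inputs):
        weights[i] += learning_rate * error * x  # strengthen or weaken each connection

print(weights)  # both weights drift toward 1.0: the network has 'learned' to add
```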
People have become so excited about “deep learning” technology and its potential for parsing language data because it recently did something that seems very hard indeed: it beat the World Champion of Go, arguably the most complex strategic game invented by humans. If a computer can beat one of our smartest humans at a very complex game, the reasoning goes, surely a computer can read The New York Times and give us a juicy hot take on the latest scandal. Sadly, no.
The success of “deep learning” depends crucially on domain constraints that do not resemble those of our wide open social world. In the simple world of Go, there is a clear winner and loser. The players can make only one of a finite set of moves per turn. And the space of possible actions (while more complex and dynamic than Chess or other games) is orders of magnitude smaller than in the vast social world. To understand why this matters, it’s helpful to first have an (at least hand-wavy) understanding of how AlphaGo, the winning computer, learned to play the game.
As explained above, “deep learning” does its learning through simulated neural networks. The AlphaGo computer actually uses two such learning networks. One has the task of evaluating board positions: figuring out which positions are most likely to lead to a win. The second has the task of gaming out (or simulating) the most promising moves AlphaGo could make from any given position. These two networks communicate to determine AlphaGo’s best move toward the best position, a thought process likely to seem familiar to anyone who has played the game. But writing the rules for each of these neural networks, and for their coordination on each turn, was not enough to make AlphaGo particularly good at the game.
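For a hand-wavy illustration of how two such networks might divide the labor on a single turn, here is a toy sketch. The tiny board, the stand-in ‘networks,’ and the helper functions are all invented for illustration; none of this is AlphaGo’s actual architecture or code:

```python
import random

def propose_moves(board):
    """Stand-in for a move-proposing network: here, simply every empty point."""
    return [point for point, stone in enumerate(board) if stone is None]

def estimate_win_probability(board):
    """Stand-in for a position-evaluating network: here, just a random guess.
    In a real system this estimate is learned from play."""
    return random.random()

def apply_move(board, move):
    new_board = list(board)
    new_board[move] = "black"
    return tuple(new_board)

def choose_move(board):
    # Coordinate the two judgments: among the proposed moves, play the one
    # leading to the position with the best estimated chance of winning.
    return max(propose_moves(board),
               key=lambda move: estimate_win_probability(apply_move(board, move)))

tiny_board = (None,) * 9  # a 3x3 'board', purely for illustration
print(choose_move(tiny_board))
```

The real system replaces the random guess with learned estimates and adds extensive look-ahead search; the sketch only shows the division of labor between proposing moves and evaluating positions.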
Just as our brains learn (i.e. potentiate the coordinated firing of neurons) based upon feedback, AlphaGo’s “deep learning” system also required feedback – a lot of it – to develop proficiency at the game. That feedback came in two forms: first, it learned by comparing its choices to those of excellent human players. When shown a Go board, its two neural networks would settle upon a move. Then it would learn what an identically-situated masterful human player did in the past. If it chose the same as the human, it was “rewarded” slightly, potentiating the two neural networks to perform similarly in future scenarios. Otherwise, it was “punished” slightly so that it would be less likely to make the same mistake again. This sort of learning is called “supervised machine learning” because humans (or at least data they have generated) stand over the shoulder of the machine and let it know when it is right or wrong.
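A minimal sketch of that supervised loop, using a made-up position and made-up move labels rather than real Go data, might look like this:

```python
import random

candidate_moves = ["corner", "side", "center"]
preferences = {move: 0.0 for move in candidate_moves}  # the machine's learned inclinations
expert_move = "corner"     # what a masterful human played here (assumed for illustration)
learning_rate = 0.1

def machine_choice():
    """Pick the currently highest-scoring move, breaking ties at random."""
    best = max(preferences.values())
    return random.choice([m for m, p in preferences.items() if p == best])

for _ in range(50):
    choice = machine_choice()
    if choice == expert_move:
        preferences[choice] += learning_rate  # slight "reward" for matching the human
    else:
        preferences[choice] -= learning_rate  # slight "punishment" for a mismatch

print(preferences)  # the expert's move ends up with the highest preference
```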
But even this training through millions of games played by many human masters was not enough to make AlphaGo great. Next, AlphaGo was programmed to train by playing against itself. In this step, the computer had no more humans to rely upon. It just knew the game very well, all the strategies it had learned and, crucially, what it meant to score points and win or lose. After several million games against itself, it learned to keep pursuing the strategies that allowed it to win, while eschewing the strategies that caused its clone to lose. This sort of learning – harkening back to behavioral social scientists like B.F. Skinner – is called ‘reinforcement’ learning. Even without human input, the rules for scoring in any well-defined game can be translated into ‘objective’ or ‘loss’ functions which provide feedback to the machine, reinforcing those behaviors more likely to lead to the objective of a win.
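A similarly minimal sketch of the reinforcement stage follows, where the only feedback is the game’s own scoring rule rather than a human example. The two strategies and their hidden win rates are invented for illustration:

```python
import random

win_rate = {"aggressive": 0.6, "territorial": 0.4}  # hidden property of this toy game
value = {"aggressive": 0.0, "territorial": 0.0}     # the machine's learned estimates
learning_rate = 0.05

def play_out(strategy):
    """Simulate a full game against a clone and score it: 1 for a win, 0 for a loss."""
    return 1 if random.random() < win_rate[strategy] else 0

for _ in range(5000):
    strategy = random.choice(list(value))  # try both strategies during training
    reward = play_out(strategy)            # the 'objective' function: did we win?
    value[strategy] += learning_rate * (reward - value[strategy])  # reinforce accordingly

print(value)  # the estimates approach the true win rates, so 'aggressive' wins out
```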
By now readers probably have an inkling of why Go is so easy compared to parsing a conversation or a news article. Even for formal political debates, there is no clear winner or loser, no clear method for scoring points. Nor do there seem to be obvious objective or loss functions that one could write in order to help a computer understand how to be a good conversationalist. Even a sensemaking task like accurately parsing a news article doesn’t seem to be one that can be boiled down to a concise list of rules. The social world is not a game, or at least not a single game (or well-defined list of games) with recognizable rules that players are consistently incentivized to follow.
As NYU cognitive psychologist and AI researcher Gary Marcus has put it: “In chess, there are only about 30 moves you can make at any one moment, and the rules are fixed. In Jeopardy [where the computer ‘Watson’ has also bested human champions] more than 95% of the answers are titles of Wikipedia pages. In the real world, the answer to any given question could be just about anything, and nobody has yet figured out how to scale AI to open-ended worlds at human levels of sophistication and flexibility.” One of the foundational thinkers of AI, Gerald Sussman, put it even more succinctly: “you can’t learn what you can’t represent.”
