Monday, October 13, 2014

Positivism and Big Data

This is an outcome of a conversation with Rita Kop regarding the article “The View from Nowhere.”

She writes, “data scientists use their quantitative measure, as positivists do, by putting a large number veneer over their research.”

This misrepresents positivism, just as Nathan Jurgenson does in his original article.

The core of positivism is that all knowledge is derived from experience. Its central tenet is that there is a knowable set of observation statements which constitute the totality of experience. The number of sentences doesn’t particularly matter; in some cases (e.g., in Popper’s falsifiability theory) even one such sentence will be significant.

Quine’s “Two Dogmas of Empiricism” most clearly states the core of positivism in the process of attacking it. The two dogmas are:

  1. Reductionism – that all theoretical statements can be reduced to a set of observation statements (by means of self-evident logical principles);
  2. The analytic-synthetic distinction – this is the idea that observation statements can be clearly and completely distinguished from theory, which, of course, turns out not to be true (because we have ‘theory-laden data’).

We can apply these principles to big data analytics, of course, and we can use the standard criticisms of positivism to do it:

  1. Underdetermination – this is the ‘problem of induction’ or the ‘problem of confirmation’. Theoretical statements cannot be deduced from observation statements; hence, we rely on induction. However, observational data underdetermines theory: for any given set of observation statements, an infinite set of theoretical statements is consistent with it, each equally well confirmed;
  2. Observer bias – the language we employ in order to make observation statements must exist prior to the making of the statement, and hence adds an element of theory to the statement. This language is typically a product of the culture of the investigator; hence, language introduces cultural bias.
To Quine’s two objections I am inclined to add a third: logicism. This core element of logical positivism, in particular, generally escapes challenge. It is essentially the idea that the relation between data and theory can be expressed logically, that is, in systems composed of statements and inferences.

It’s not that data scientists put a ‘veneer’ over their research by virtue of large numbers. Rather, it is that, by virtue of following positivism’s basic tenets, they subscribe to one of positivism’s core principles: a difference that does not make (an observable) difference is no difference at all. The contemporary version is “You can’t manage what you can’t measure.”

Flip this, and you get the assertive statement: if there is anything to be found, it will be found in the data. If there is any improvement to process or method that can be made, this will result in a change in the data. It is this belief that lends an air of inevitability to big data.

We can employ Quine’s objections to show how the data scientists’ beliefs are false.
  1. The same data will confirm any number of theories. It is not literally true that you can make statistics say anything you want, but even when subscribing to Carnap’s requirement (that the totality of the evidence be considered; no cherry-picking) you can make statistics say many different things. This will be true no matter how much data there is.
  2. The collection of the data will presuppose the theory (or elements of the theory) it is intended to represent (and, very often, to prove). I’ve often stated, as a way to express this principle, that you only see what you’re looking for.
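
The first objection can be sketched numerically. What follows is only an illustration in Python, with made-up observations and hypothetical names (`observations`, `theory_linear`, `make_rival`); it is not a claim about how any analytics system works. Given any finite set of observations, one can add to a ‘theory’ a term that vanishes at every observed point, producing as many rivals as one likes that fit the data exactly yet disagree about the next case.

```python
from math import prod

# Hypothetical observations: four (x, y) measurements.
observations = [(0, 0), (1, 1), (2, 2), (3, 3)]


def theory_linear(x):
    """Theory A: y is simply equal to x."""
    return x


def make_rival(c):
    """Build a rival theory that matches every observation exactly.

    The added term is zero at each observed x, so the existing
    observations cannot distinguish the rival from theory A.
    """
    xs = [x for x, _ in observations]

    def rival(x):
        return x + c * prod(x - xi for xi in xs)

    return rival


def fits(theory):
    """True if the theory reproduces every observation."""
    return all(theory(x) == y for x, y in observations)


# Any value of c gives another rival; three are enough to make the point.
rivals = [make_rival(c) for c in (1, -2, 10)]

all_fit = fits(theory_linear) and all(fits(r) for r in rivals)
predictions_at_4 = [theory_linear(4)] + [r(4) for r in rivals]

print(all_fit)           # every theory agrees with all the data
print(predictions_at_4)  # yet they disagree about the unobserved case
```

Adding a fifth observation would eliminate these particular rivals, but the same construction yields new ones for the enlarged data set, which is the sense in which more data does not help.
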
In my own epistemology, though I remain unreservedly empiricist, I have abandoned Quine’s ‘logical point of view’. In particular, I propose two major things:

  1. Theories (and abstractions in general) are not generated through a process of induction, but rather through a process of subtraction. These are not inferences to be drawn from observations, but rather, merely ways of looking at things. For example, you can see a tiger in front of you, but if you wish, you can ignore most of the detail and focus simply on the teeth, in which case you have generalized it to "a thing with teeth".
  2. Neither observations nor theories are neutral (nor indeed is there any meaningful way of distinguishing the two (which is why I don’t care whether connectivism is a theory)). Rather, any observation is experienced in the presence of the already-existing effects of previous observations, which is the basis for the phenomenon of ‘recognition’, which in turn is the basis for knowledge.
These constitute a consistent empirical epistemology; however, they are in important ways inconsistent with the core tenets of big data analytics (though this depends a lot on how the analytics are carried out).

In particular, this epistemology conflicts with the idea that you can take one large set of data, representing any number of individuals, and draw general conclusions from it, because these data are embedded in personal perspectives, which are (typically) elided in big data analytics. Hence, big data is transforming deeply contextual data into context-free data. Any principles derived from such data are thereby impacted.
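
A loose statistical analogy for what is lost when context is stripped away is Simpson's paradox, where a conclusion that holds within every context reverses once the contexts are pooled. The sketch below is only an analogy, not the epistemological argument itself; the counts are invented, and the names (`records`, `context_1`, methods `A` and `B`) are hypothetical.

```python
from fractions import Fraction

# Invented counts: (successes, attempts) for two methods in two contexts.
records = {
    "context_1": {"A": (9, 10), "B": (80, 100)},
    "context_2": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, attempts):
    """Exact success rate, avoiding floating-point rounding."""
    return Fraction(successes, attempts)


# Which method does better inside each context?
best_within = {
    context: max(methods, key=lambda m: rate(*methods[m]))
    for context, methods in records.items()
}

# Pool the counts, discarding the context they came from.
pooled = {}
for methods in records.values():
    for method, (successes, attempts) in methods.items():
        s, n = pooled.get(method, (0, 0))
        pooled[method] = (s + successes, n + attempts)

best_pooled = max(pooled, key=lambda m: rate(*pooled[m]))

print(best_within)  # method A wins in each context taken separately
print(best_pooled)  # method B wins once context is discarded
```

With these numbers, A succeeds more often than B in each context, yet the pooled, context-free totals favour B, because the two methods were applied unevenly across contexts.
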

