The Technium

The Google Way of Science


[Translations: Japanese]

There’s a dawning sense that extremely large databases of information, starting at the petabyte scale, could change how we learn things. The traditional way of doing science entails constructing a hypothesis to match observed data, or to solicit new data. Here’s a bunch of observations; what theory explains the data sufficiently well that we can predict the next observation?

It may turn out that tremendously large volumes of data are sufficient to skip the theory part and still make a predicted observation. Google was one of the first to notice this. For instance, take Google’s spell checker. When you misspell a word while googling, Google suggests the proper spelling. How does it know this? How does it predict the correctly spelled word? It is not because it has a theory of good spelling, or has mastered spelling rules. In fact, Google knows nothing about spelling rules at all.

Instead, Google operates on a very large dataset of observations showing that, for any given spelling of a word, x number of people say “yes” when asked if they meant to spell word “y.” Google’s spelling engine consists entirely of these datapoints, rather than any notion of what correct English spelling is. That is why the same system can correct spelling in any language.
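
To make that concrete, here is a minimal sketch of the frequency-driven approach, loosely in the spirit of Peter Norvig’s well-known toy spelling corrector rather than Google’s production system. Everything here is illustrative: the corpus file name is a placeholder, and Google’s real signal also includes which suggested corrections searchers actually accept.

```python
import re
from collections import Counter

# Word frequencies from a large body of observed text. "corpus.txt" is
# a placeholder; Google's real signal also includes which suggested
# corrections searchers actually accept.
WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))

def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    swaps = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """Suggest the most frequently observed candidate spelling."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=WORDS.__getitem__)
```

Notice what is absent: no spelling rules, no list of exceptions, just counts of words people have actually typed.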

In fact, Google uses the same philosophy of learning via massive data for their translation programs. They can translate from English to French, or German to Chinese, by matching up huge datasets of human-translated material. For instance, Google trained their French/English translation engine by feeding it Canadian documents, which are often released in both English and French versions. The Googlers have no theory of language, especially not of French, and no AI translator. Instead they have zillions of datapoints which, in aggregate, link “this to that” from one language to another.

Once you have such a translation system tweaked, it can translate between any pair of languages. And the translation is pretty good. Not expert level, but enough to give you the gist. You can take a Chinese web page and at least get a sense of what it means in English. Yet, as Peter Norvig, head of research at Google, once boasted to me, “Not one person who worked on the Chinese translator spoke Chinese.” There was no theory of Chinese, no understanding. Just data. (If anyone ever wanted a disproof of Searle’s riddle of the Chinese Room, here it is.)
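
As a toy illustration of how aligned data alone can yield translations, the sketch below learns word correspondences purely from parallel sentence pairs. The sentences and the crude association score are invented for this example; real statistical translators of this era (the IBM alignment models, for instance) are far more elaborate, but the principle is the same: no grammar, only counts.

```python
from collections import Counter

# Toy aligned corpus: (English, French) sentence pairs, in the spirit
# of the bilingual Canadian documents mentioned above. Data is invented.
pairs = [
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
]

en_counts, fr_counts, cooc = Counter(), Counter(), Counter()
for en, fr in pairs:
    en_words, fr_words = en.split(), fr.split()
    en_counts.update(en_words)
    fr_counts.update(fr_words)
    for e in en_words:
        for f in fr_words:
            cooc[(e, f)] += 1

def translate_word(e):
    """Pick the French word most strongly associated with an English word.

    The score rewards pairs that co-occur often relative to how common
    each word is overall: no grammar, no dictionary, just aligned data.
    """
    scores = {f: cooc[(e, f)] ** 2 / (en_counts[e] * fr_counts[f])
              for f in fr_counts if cooc[(e, f)] > 0}
    return max(scores, key=scores.get)

for word in ["house", "blue", "flower"]:
    print(word, "->", translate_word(word))   # maison, bleue, fleur
```

Running it maps “house” to “maison”, “blue” to “bleue”, and “flower” to “fleur”, even though nothing in the program knows any French.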

If you can learn how to spell without knowing anything about the rules or grammar of spelling, and if you can learn how to translate languages without having any theory or concepts about grammar of the languages you are translating, then what else can you learn without having a theory?

In a cover article in Wired this month, Chris Anderson explores the idea that perhaps you could do science without having theories.

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

There may be something to this observation. Many sciences, such as astronomy, physics, genomics, linguistics, and geology, are generating extremely huge datasets and constant streams of data at the petabyte level today. They’ll be at the exabyte level in a decade. Using old-fashioned “machine learning,” computers can extract patterns in this ocean of data that no human could ever possibly detect. These patterns are correlations. They may or may not be causative, but we can learn new things from them. In that sense they accomplish what science does, although not in the traditional manner.
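
At toy scale, “letting statistical algorithms find patterns” can be as simple as computing every pairwise correlation in a big observation table and reporting the strongest ones. The sketch below is a hypothetical miniature: the data is randomly generated with one planted relationship, which the scan finds without any hypothesis about where to look.

```python
import numpy as np

# Stand-in for a huge table of observations: rows are measurements,
# columns are variables. The data is invented; one real relationship
# is planted between variables 0 and 1.
rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 50))
data[:, 1] = 0.8 * data[:, 0] + rng.normal(scale=0.6, size=10_000)

corr = np.corrcoef(data, rowvar=False)   # 50 x 50 correlation matrix
i, j = np.triu_indices_from(corr, k=1)   # each pair of variables once

# Report the strongest correlations: patterns, not explanations.
order = np.argsort(-np.abs(corr[i, j]))
for k in order[:5]:
    print(f"var{i[k]:02d} ~ var{j[k]:02d}: r = {corr[i[k], j[k]]:+.2f}")
```

The scan surfaces the planted pair immediately, but nothing in the output says why the two variables move together; that is exactly the gap between correlation and theory.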

What Anderson is suggesting is that sometimes enough correlations are sufficient. There is a good parallel in medicine. A lot of doctoring works on the correlative approach. The doctor may never find the actual cause of an ailment, or understand it if he/she did, but he/she can correctly predict the course and treat the symptom. But is this really science? You can get things done, but if you don’t have a model, is it something others can build on?

We don’t know yet. The technical term for this approach in science is Data Intensive Scalable Computation (DISC). Other terms are “Grid Datafarm Architecture” and “Petascale Data Intensive Computing.” The emphasis in these techniques is on the data-intensive nature of the computation, rather than on the computing cluster itself. The online industry calls this style of investigation a type of “analytics.” Cloud computing companies like Google, IBM, and Yahoo (PDF), and some universities, have been holding workshops on the topic. In essence these pioneers are trying to exploit cloud computing, or the OneMachine, for large-scale science. The current tools include massively parallel software platforms like MapReduce and Hadoop (see my earlier post), cheap storage, and gigantic clusters of data centers. So far, very few scientists outside of genomics are employing these new tools. The intent of the NSF’s Cluster Exploratory program is to match scientists with large database-driven observations to computer scientists who have access to, and expertise with, cluster/cloud computing.
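
For readers who haven’t met MapReduce, here is the canonical word-count example, simulated in one process. On a real Hadoop cluster the same two functions would run across thousands of machines, with the framework handling the shuffle; this is only a sketch of the programming model.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in a document."""
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    """Combine all the counts emitted for one word."""
    return word, sum(counts)

documents = ["the data speak", "let the data speak for themselves"]

# Shuffle: group every emitted value by its key.
groups = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        groups[word].append(count)

results = [reduce_phase(word, counts) for word, counts in groups.items()]
print(sorted(results))  # [('data', 2), ('for', 1), ('let', 1), ...]
```

The appeal for data-intensive science is that map and reduce are the only pieces a scientist writes; the cluster handles distribution, fault tolerance, and scale.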

My guess is that this emerging method will be one additional tool in the evolution of the scientific method. It will not replace any current methods (sorry, no end of science!) but will complement established theory-driven science. Let’s call this data-intensive approach to problem solving Correlative Analytics. I think Chris squandered a unique opportunity by titling his thesis “The End of Theory,” because that title is a negation, the absence of something. Rather, this is the beginning of something, and that is when you have a chance to accelerate a birth by giving it a positive name. A non-negative name will also help clarify the thesis. I am suggesting Correlative Analytics rather than No Theory because I am not entirely sure that these correlative systems are model-free. I think there is an emergent, unconscious, implicit model embedded in the system that generates answers. If none of the English speakers working on Google’s Chinese Room have a theory of Chinese, we can still think of the Room as having a theory. The model may be beyond the perception and understanding of the creators of the system, and since it works, it is not worth trying to uncover it. But it may still be there. It just operates at a level we don’t have access to.

But the models’ invisibility doesn’t matter, because they work. It is not the end of theories, but the end of theories we understand. Writing in response to Chris Anderson’s article, George Dyson says this much better:

For a long time we were stuck on the idea that the brain somehow contained a “model” of reality, and that AI would be achieved by constructing similar “models.” What’s a model? There are 2 requirements: 1) Something that works, and 2) Something we understand. Our large, distributed, petabyte-scale creations, whether GenBank or Google, are starting to grasp reality in ways that work just fine but that we don’t necessarily understand.

Just as we will eventually take the brain apart, neuron by neuron, and never find the model, we will discover that true AI came into existence without ever needing a coherent model or a theory of intelligence. Reality does the job just fine.

By any reasonable definition, the “Overmind” (or Kevin’s OneComputer, or whatever) is beginning to think, though this does not mean thinking the way we do, or on any scale that we can comprehend.

What Chris Anderson is hinting at is that Science (and some very successful business) will increasingly be done by people who are not only reading nature directly, but are figuring out ways to read the Overmind.

What George Dyson is suggesting is that this new method of doing science — gathering a zillion data points and then having the OneMachine calculate a correlative answer — can also be thought of as a method of communicating with a new kind of scientist, one who can create models at levels of abstraction (in the zillionics realm) beyond our own powers.

So far Correlative Analytics, or the Google Way of Science, has primarily been deployed in sociological realms, like language translation or marketing. That’s where the zillionic data has been: all those zillions of data points generated by our collective life online. But as more of our observations and measurements of nature are captured 24/7, in real time, by an increasing variety of sensors and probes, science too will enter the realm of zillionics, and its data will be readily processed by the new tools of Correlative Analytics. In this part of science, we may get answers that work, but which we don’t understand. Is this partial understanding? Or a different kind of understanding?

Perhaps understanding and answers are overrated. “The problem with computers,” Pablo Picasso is rumored to have said, “is that they only give you answers.” These huge data-driven correlative systems will give us lots of answers — good answers — but that is all they will give us. That’s what the OneComputer does: it gives us good answers. In the coming world of cloud computing, perfectly good answers will become a commodity. The real value of the rest of science then becomes asking good questions.




Comments
  • Rama

    Nice post. Much more thoughtful and balanced than Chris Anderson’s piece.

    >> I am suggesting Correlative Analytics rather than No Theory because I am not entirely sure that these correlative systems are model-free. <<

    I believe you nailed the crux of the argument here. There IS an implicit model – it is just that it is “statistical”, not “causal”.

    For example, correlative analytics include techniques such as “nearest neighbor” methods. These have been used for a long time in machine-learning and statistics and are supported by a well-developed mathematically-rigorous theory of how they work, what their limitations are, and how they compare to other methods.

    To claim that the new approach is model-free or theory-free is just plain wrong.

  • Michael Brown

    Slartibartfast is ignoring that theory != “understanding”.
    Doc, this is the thin edge of the singularity making itself manifest….

  • http://blog.beara.ie Dave

    ‘may be beyond the perception and understanding of the creators of the system, and since it works it is not worth trying to uncover it. But it may still be there. It just operates at a level we don’t have access to.’

    ‘It’ sounds like god to me. :-(

  • Genomik

My term that I think describes this is Massive Empiricism. I guess science has always been top down: have a theory, get empirical data to support it. Now you can go bottom up; computers can do massive empiricism and then later on you develop theories.

Massemp or MassEmpire for short. Eventually, when in 20 years you develop quadrillions of data points, you can start to manufacture reality or create a world. A Mass Empire!

    Genomik

  • James Page

Is this the end of Popper’s scientific method? Popper proposed that science is distinguished by having falsifiable (negative) hypotheses. We can test Google’s translation efforts with a Turing Test. Google’s translations IMHO fail that test. So we have to be very careful when we make grand statements, as Anderson has, that this is real science.

But this will give a whole lot of aspiring Kuhns, Poppers, Simmonses, and Lakatoses something to get stuck into.

  • Allen

    So you can know the answer without understanding the question? Only through understanding the question can you properly use the answer. Douglas Adams saw it all.

    42 I say! It’s 42.

  • nordsieck

    Umm… no.

    This might work for certain classes of simple problems, but when one starts to analyze complex systems, this approach totally fails. Ask economists – people have been trying to do this to the stock market for decades and it always blows up.

  • http://dovdox.com Alan Dove

    Excellent post, Kevin, even though I disagree with its thesis. The main problem is that extrapolating from Google Analytics to model-free science leaves out science’s key contribution.

    Scientists don’t just see and predict correlations, they develop robust organizing principles, called theories, which provide vastly greater intellectual leverage than any merely correlative model. Theories are tools to accomplish a particular purpose, yes, but good theories also become something more, with uses that may extend far beyond what their original developer intended.

    A Google-style system might (eventually) be able to take all of the variations in finch beaks, beetle wings, and pigeon plumage and predict that species in general will vary within certain parameters. But if you want to do away with theories, that’s not nearly enough. The same machine would also have to construct a complete but succinct summary of Darwinian natural selection and speciation. Then, if you really want to prove the point, have the machine use that summary – and nothing else – to explain something that seems to conflict with it. Male nipples, for example.

    The same is also true in every other hard science. Don’t let the engineers tell you otherwise.

  • vanderleun

    “There may be something to this observation.”

    Ah, the faint praise phrase deployed in a delicate way. Probably the nicest way to say that the huff and the puff of Anderson’s puff piece (AKA “My next book proposal” -or- “Hey it was a cover story in my magazine.”) has a lot less in it than meets the eye.

  • http://www.manyworlds.com Steve Flinn

I think Slartibartfast above gets it exactly right — a theory (or model) is simply a convenient term for an algorithmically compressed description of a future state. Such a compressed description is required primarily for ease of communication where bandwidth is low (such as human-to-human communication). It is less necessary where large-scale computing complexes are doing the work. Of course, until those computational complexes can also ask all the right questions, and not just provide answers, it is useful to encode predictive algorithms as models comprehensible to humans, but over time even that need is bound to erode . . .

    Also agree with Roland Dobbins above that this line of thinking validates the Chinese Room argument to the extent that all Searle’s thought experiment really demonstrates is that as a practical matter we can construct a computing system that has unlimited intelligence without necessarily being conscious as we conceive it. Consciousness as we perceive it seems to be an extra layer of models (compressible algorithms) that constitute the compressed mode of communications with which we talk to others, and to ourselves. The talking to ourselves part is basically what we call consciousness. It’s not at all necessary for an arbitrarily high level of predictive intelligence, however. (What Searle did not prove, IMO, is that you could not simulate consciousness in a computer, only that it is unnecessary to achieve the ability to translate Chinese to any level of desired competency — just like the Google spell checker).

  • http://peacelovesmusings.blogspot.com PeaceLove

    I think you need to re-read Searle’s Chinese Room thought experiment, because Google’s Chinese translation theories quite literally validate Searle’s proposition, not disprove it, heh.

That’s the impression I got, too, from reading the Wikipedia entry. But I’m no philosopher or AI guy, so I figured maybe I just didn’t get it. Kevin, care to elucidate?

    • Kevin Kelly

To the 3 or 4 commenters who pointed out that Google’s translation machine validates, rather than disproves (as I had suggested), Searle’s proposition:

      Searle proposed the idea of the Chinese Room as an argument for the impossibility of an artificial mind. The Room would be translating, but would not understand because no part inside the room understood Chinese. In that respect, the Chinese Room is identical to Google.

But from the outside, most users of the Google translator can’t tell whether anyone inside Google knows or understands Chinese (imagine it continues to improve), so in that regard it behaves as if it understands Chinese. Therefore this particular mind is not impossible, and Searle’s argument for the impossibility of an artificial mind is wrong (although he is right that no one inside the Room will understand Chinese).

  • k

    When will the OneMachine/Cloud have questions? Is it asking for answers already? This would be a fascinating moment.

  • http://brendandunphy.blogspot.com/ Brendan Dunphy

“If you can learn how to spell without knowing anything about the rules or grammar of spelling, and if you can learn how to translate languages without having any theory or concepts about grammar of the languages you are translating, then what else can you learn without having a theory? ” And who (not what) has learnt to spell or translate in this way? No one I have ever heard of, and without a considerable amount of Gooooooooooooogling (life)time no one ever will, because we don’t ‘think’, remember, or ‘learn’ like a computer!

  • cjewel

    But Google hasn’t “learned” to spell in this scenario. It’s only supplying a brute-forced best guess. I’m not saying there’s anything wrong with that, I’m saying this article misstates what’s going on since it implies cognition where there is none.

  • Russell Jarvis

You are confused. Google’s translating tools support John Searle’s ‘Chinese room’ argument; they aren’t a counter-demonstration of it. It is a human who reads the product and understands the semantic content of the translation. The program just follows rules (it evaluates syntax, but the syntax doesn’t provide the program with semantics).

    The Google translating program(s) _is_ the man in the box who doesn’t understand Chinese, but has successfully replied to a question merely by syntactic (symbol) manipulation.

  • c3

Humanity and life are the questions you ask, not the answers you get.

Picasso kicks the tech-love butt. :)

  • Serge

    So… SkyNet anyone?
    Am I the only one scared of a robot dominated future?
Google becomes the first true A.I. and proceeds to enslave mankind “for our own good”

  • andrew charles

    This is just empiricism with large datasets. It’s not particularly novel, and is still hamstrung by the problem of induction.

Empirical studies can and do tell us a lot, but unless they lead to a causative theory and/or a dynamical model, we’re left blindly assuming the past will continue to repeat itself.

  • Richard

Regarding Searle’s Chinese Room experiment: I once listened (years ago) to some recorded lectures by John Searle, and my understanding of his position is that an experiment may pass the Turing Test (i.e. if an outsider cannot tell whether he is dealing with a mind or not, then the Turing Test has been passed), but that doesn’t prove that there is a mind or an artificial intelligence operating. Google’s technique may pass the Turing Test, but Searle’s argument still stands: there is no artificial mind because there is no understanding. It is this understanding that makes a mind. Perhaps all this shows is that Turing’s Test is an inadequate criterion.

  • Sid Almasi

Hi, Russell. For some reason, I still follow this debate — you respond (correctly about the theory of the thing) to Richard but ignore the fact that the Google translating program(s) is overwhelmingly replying incorrectly to questions related to Chinese translation. And not verb-misconjugation incorrectly, either — inability to determine word boundaries (some Chinese words are one character, and some are two), inability to identify proper names, etc. It is more than possible that Searle’s argument works better for some languages than for others.

  • http://www.drzoltan.com Dr. Zoltan!

    This reminds me a lot of Deep Thought.

  • http://www.nathanhangen.com/blog Nathan

    First of all, great blog, I really enjoy most every post. Secondly, great article…you’ve got me fascinated with this idea and as a science/tech/futurist wanna-be, I’m intrigued by the ideas presented here.

  • http://gilesbowkett.blogspot.com Giles Bowkett

    That isn’t the Google way! That’s Bayesian networks. The credit for this lies with Judea Pearl, Clark Glymour, and the Bayes community. Google turned it into a business, but the Bayesian people knew what they were doing.

    Cause and effect itself may just be a special case of Bayes’ Theorem.

    • Kevin Kelly

      @Giles > That isn’t the Google way! That’s Bayesian networks.

      Good point. Maybe this approach should be called the Bayesian Way of Science.

  • Sid Almasi

    I would point out that, absent any kind of ideological or predictive argument, today’s Google most certainly cannot translate average modern Chinese into English. The cases in which one can intuit the gist of a translated article or paragraph are outliers: the best it can do in most cases is identify the topic of a passage. Here’s an example, from http://translate.google.com/translate?u=http%3A%2F%2Fwww.drunkpiano-liuyu.net%2F%3Fp%3D286&sl=zh-CN&tl=en&hl=en&ie=UTF-8

“However, reportedly the world’s ‘no love for no reason, no cause of hate.’”

I would translate the original as “However, people say that in this world, ‘there is no unconditional love, and there is no unconditional hatred.’” Perhaps one day — maybe soon — Google or other automatic translation services will be able to translate languages easily (and additionally identify this as a quote from Mao Zedong), but people here shouldn’t be kidding themselves that Google language tools have any significant or reliable efficacy between Chinese and English, and nobody should be foolish enough to congratulate themselves for having ‘translated’ a language they can’t speak, read, or write. Because how do you know you’ve translated it?

  • Slartibartfast

I’ve seen this article running all over the place in the past few days and I just don’t see what the fuss is about. The point of a model or theory is that it reduces the essence of a body of data into a compressed form that is sufficient to make quantitative predictions about situations that can be outside the scope of the original data set. As a physicist, this is the example that comes to mind most obviously. Maxwell’s equations “encode” an ungodly amount of potential applications, predictions, or “data” to use the loose terminology being used in the original article. The stronger the theory, the higher the rate of compression between the expression of the theory and the amount it permits one to predict. This compression represents an _understanding_ of the underlying phenomena. If you simply want something to make predictions for practical purposes but don’t need an understanding of _why_ those predictions come about, yes, you might just skip the model. I would argue that the understanding encoded in a good (i.e. successful) theory/model paves the way towards further expansion and understanding of new phenomena. Take the spelling example you gave. While it is true that Google may be able to correct a spelling mistake in a search, it takes a linguist with an understanding of how language evolves to see how variations in spelling come about in time…

  • Roland Dobbins

    I think you need to re-read Searle’s Chinese Room thought experiment, because Google’s Chinese translation theories quite literally *validate* Searle’s proposition, not disprove it, heh.

  • http://glencampbell.name/ Glen

    So when will The Google achieve consciousness?

  • http://www.jacobmathai.org jacob mathai

    Interesting post.

  • jose

    A similar approach is discussed, in passing, in last week’s New Yorker essay on speech recognition:

    A speech recognizer, by learning the relative frequency with which particular words occur, both by themselves and within the context of other words, could be “trained” to make educated guesses. Such a system wouldn’t be able to understand what words mean, but, given enough data and computing power, it might work in certain, limited vocabulary situations, like medical transcription, and it might be able to perform machine translation with a high degree of accuracy.

  • John Le Drew

I think that the point to remember here is that there is no ‘End of the Theory’. Scientists will still have to form theories to — as you put it — ask the right questions. Scientists will still have to form theories, but they can then use the Cloud to process and compare those theories against the available data. Or, they may feed in data and see what ‘answers’ come out, and form their theories in reverse. Either way the answers will mean nothing without the questions. What is the meaning of life? That would be 42, of course…

  • http://stevendphillips.com Steven Phillips

    I liked this post better than the Wired article you mentioned. I think many of us will be remembering the term “Correlative Analytics.”

    You managed to say only one crazy thing — “If anyone ever wanted a disproof of Searle’s riddle of the Chinese Room, here it is.” This potential new form of science may prove to be quite useful, but it won’t explain qualia.

  • http://tinyrock.com Richard

Very interesting, and similar to what Stephen Wolfram suggested in A New Kind of Science about mining the computational universe. It’s fantastic to see the datasets and tools maturing enough to produce useful results now. Now where can I get me one of these big datasets…

  • http://www.eandmu.com Alasdair

    I think maybe, on a much smaller scale, we already use this method of understanding.

    Mimicry in learning.

And perhaps Intuition and Wisdom come from this type of understanding.

  • http://www.unsprungmedia.com Bruce Warila

    Correlative Science sounds fascinating. Great post! I wonder if this can be applied to things that constantly change and evolve (music tastes, weather, political views, etc)?

  • http://thenoisychannel.blogspot.com/ Daniel Tunkelang

    I was appalled by Chris Anderson’s article, because his suggestion that “correlation is enough” is not only demonstrably wrong, but also the root of much bad science. I also think he is misunderstanding how Google and others benefit from the vast increase in data.

Having more data doesn’t mean you can just analyze it for patterns and treat those as discoveries. In fact, the term “data mining” used to mean exactly that, and it was pejorative, since it would discover meaningless correlations like one between the Super Bowl winner and stock market performance.

    Having more data makes it easier to both *generate* and *test* hypotheses. But it is still important to keep these activities separate. That data hygiene is at the heart of the scientific method. Correlation does not supersede causation.

  • http://www.searchenginecaffe.com Jeff

I think calling large-scale statistical ML the “Google Way of Science” is an exaggeration. After all, Google does not use model-free machine learning for its most important system, the ranking of its search results.

    In a recent interview with Google’s Peter Norvig by Anand Rajaraman, Peter relates,

    “…Google still uses the manually-crafted formula for its search results… Google’s search team worries that machine-learned models may be susceptible to catastrophic errors on searches that look very different from the training data. They believe the manually crafted model is less susceptible to such catastrophic errors on unforeseen query types.”

    In short, ML techniques are good at understanding past data, but without principled models can we trust them to generalize on future unknown data?

    For more, see also the NYTimes article “Google Keeps Tweaking Its Search Engine”

    • Kevin Kelly

      @jeff > “Google still uses the manually-crafted formula for its search results. Google’s search team worries that machine-learned models may be susceptible to catastrophic errors on searches that look very different from the training data. ”

      That is interesting. Google’s spell checker and translation engine are also tweaked — by hand. In that way there is a model they are trying to approach — the way humans translate and spell.

  • http://wfeigenson.googlepages.com Walter Feigenson

    Great article!

    In the mid 90s I was at ClariNet, which was the Internet’s first newspaper. We had many news feeds (AP, UPI, AFP, more), and we distributed these via USENET throughout the world. Bell Labs was one of our customers, and they were taking essentially the same approach to build a lexicon for their speech processing technology. They used our news feeds because they were one of the largest sources of English-language text available at the time.

    Isn’t this massive database the basis for Google’s true future – way past “simple” Internet search?

  • http://www.ekzept.net ekzept

Datasets alone, no matter how massive, aren’t enough to produce and use hypotheses. Sure, a “god’s-eye view” can be very helpful and allow synoptic insights not possible any other way. But far more important than quantity of data is quality of data. Big datasets are often Just There. More often than not, there aren’t ways of drilling down on a case and seeing where and how it was collected. Sometimes the case isn’t informative at all. Sometimes the case is biased because of instrumental or measurement error. Sometimes the case is informative, but knowing something about the measurement could make it more so.

    The idea that you can just throw away bad cases because you have so many good ones presumes the ones that are “bad” are known. In experimental work, much new stuff is found by eliminating or ruling out or correcting out biases and effects one layer at a time, and looking at the residuals. It’s not like getting a big set of points and plotting them, or doing a regression on them. It’s more like having a conversation with them.

A good example of this can be found in the 13 June 2008 issue of Science: D. Purves and S. Pacala, “Predictive Models of Forest Dynamics.” There are many other examples, from the LHC’s detectors to geophysical prospecting and climate work.

  • Fernando

    I would call this kind of tool “brute force science”.

  • http://mndoci.com Deepak

I’ve ranted enough about this in comments and blog posts that I am beginning to reach a point of saturation. The real value of science is always asking good questions and then using the available data to try and answer them to the best of our ability.

    I am sorry, but Chris’ article is garbage and illustrates how little he really knows about science. We all know that more data means new approaches to science, especially since this has happened so quickly.

We’ve always worked with partial understanding, or in the case of medicine, less than partial understanding, but that’s precisely why medicine is beginning to fail. Not knowing mechanisms, etc., is what results in a VIOXX. Not knowing why is what creates the next disaster.

    Data by itself cannot tell us enough. It has to be used to try and answer relevant questions. If we don’t do that, we are only causing science harm and taking the easy way out.

  • http://www.apsed.com/blog/ Alain Pierrot

    Good post (once more)!

    I’m convinced that models are implied in any Correlative Analytics, since correlation implies a segmentation, qualification and symbolic representation of “data” about the (real) world.

I’m also convinced that scientific methodology has something to say about (or work upon) such ‘hidden’ models, and is able to sort them into invalid and unproven ones.

As for language translation, the actual results of Google’s approach yield only a hint of what texts are talking about; what they actually mean remains out of reach.

Just remember that a dogma such as the virginity of Mary evolved from a difficulty of translation from Semitic languages to Greek, where ‘parthenos’ mixed the meanings of young woman and virgin… Fairly naïve, illiterate people later built upon the latter sense and wreaked havoc.

Semantics don’t work without a cultural ontology, segmented with a vocabulary. The meaning of a text involves far more features, as an instantiation mediated by the available linguistic tools.

  • Rogier

    Finally, science restored as philosophy.

  • snake

Science has always been about trying to uncover the underlying causal relationships. Correlation is far less interesting and does not improve understanding. I think abandoning models can be popular only in engineering, i.e. a field oriented purely toward results, and in weak science, where weak scientists do “research” by letting the machine find stuff for them without having any reasonable insight into the data, let alone any expectations.

  • http://chalamanishruti.com/ Chalamani Shruti

    http://www.google.edu — eventually?
    *shudders*