Pattern Recognition and 'Big Data'

Posted by Cytel

May 19, 2016 6:00:00 AM

The explosion in healthcare information and “big data” has been one of the most written-about topics of the last few years. Big data in the form of electronic health records, diagnostic tests, genomics and proteomics, not to mention data from wearable devices and apps, have the potential to transform healthcare. That potential can only be realized, though, by applying advanced analytics to recognize patterns in the vast information available. Disciplines such as pattern recognition therefore play a pivotal role in the future of healthcare.


A vast potential

The potential is vast. For instance, AstraZeneca recently set up an initiative to screen around two million genome sequences for information that will guide drug discovery and development. As part of this initiative, the company has set up a number of collaborations, including one with US sequencing specialist Human Longevity. The intention is for Human Longevity to sequence the clinical samples and analyze them using machine learning, pattern recognition and other techniques. (1) At the other end of the spectrum, an article in the UK’s Guardian (2) cites the ‘Cloudy with a Chance of Pain’ study (3), a large-scale smartphone-based study investigating a correlation between the weather and symptoms of pain. This will then evolve into a ‘Citizen Science’ initiative in which participants and the public are urged to become detectives and pattern spotters, suggesting associations and hypotheses to researchers, who will then conduct formal analyses based on these proposals.

The crucial role of statistics and statisticians

In tackling big data, cognitive computers do have an advantage over people in their ability to consume huge amounts of information. Where they are less effective, however, is in exercising judgment and understanding the implications of the decisions that are made. In 2014, Google hit the headlines for all the wrong reasons when its initiative Google Flu Trends, “a poster child for the power of big data analysis” (4), was accused of “big data hubris”: in other words, the assumption that big data analysis trumps traditional data collection and analysis. Then, in October 2015, the American Statistical Association released a policy statement (5) advocating for the critical role statistical science has to play in big data analysis.

They noted in the statement that “three professional communities, all within computer science and/or statistics, are emerging as foundational to data science: (i) Database Management enables transformation, conglomeration, and organization of data resources, (ii) Statistics and Machine Learning convert data into knowledge, and (iii) Distributed and Parallel Systems provide the computational infrastructure to carry out data analysis.”

The statement also noted that “the next generation of statistical professionals needs a broader skill set and must be more able to engage with database and distributed systems experts.”

An example of Cytel consultants' recent work in Pattern Recognition

In our own recent work in Pattern Recognition, our statistical experts have been working in multidisciplinary teams alongside engineers and computer scientists to support product development.  We outline some key aspects of one such case study below:


Our client wanted to build and validate a medical device based on a statistical classifier that detects abnormality in a certain physiological function using accelerometer signals triggered by this function. The objective was twofold: to design a Proof of Concept trial whose data would be used to build the classification algorithm, and to build the statistical classifier itself using curve classification techniques.


Initial sample-size calculations were based on previous knowledge of the signal-to-noise ratio, the number of signal features that can explain variability, and the prevalence of abnormality in the sampled population. The classifier was built sequentially in blocks: each new block of data was treated as a test set, accuracy measures were calculated, and the test set was then combined with the training set to update the classifier. The learning curve of the classifier was followed in this way until the desired accuracy was obtained. Over-run samples were treated purely as a test set, providing initial estimates for the design of the Validation study. The statistical classifier was built using several techniques:

  • Segmentation using Hidden Markov Models
  • Feature extraction using time and frequency domain analysis
  • Feature selection using elastic nets and random forests
  • Predictive model using Linear Discriminant Analysis, Support Vector Machines and Random Forests
  • Extensive analysis to check for overfitting
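
The block-sequential procedure described above (test on each new block, then fold it into the training set and follow the learning curve) can be sketched as follows. This is a minimal illustration only, not the code used in the case study: the data are synthetic, a simple nearest-centroid rule stands in for the Linear Discriminant Analysis, Support Vector Machine and Random Forest models, and all names (`make_block`, `sequential_learning_curve`, the `target` accuracy, the block size of 40) are assumptions made for the example.

```python
import random
from statistics import fmean


def make_block(rng, n, shift=1.5, dim=3):
    """Synthetic stand-in for one block of labelled feature vectors."""
    block = []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = abnormal (hypothetical coding)
        x = [rng.gauss(shift * label, 1.0) for _ in range(dim)]
        block.append((x, label))
    return block


def fit_centroids(train):
    """Nearest-centroid classifier: one mean feature vector per class."""
    centroids = {}
    for label in {y for _, y in train}:
        rows = [x for x, y in train if y == label]
        centroids[label] = [fmean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, block):
    return sum(predict(centroids, x) == y for x, y in block) / len(block)


def sequential_learning_curve(blocks, target):
    """Test on each new block, then add it to the training set."""
    train = list(blocks[0])
    curve = []
    for block in blocks[1:]:
        model = fit_centroids(train)
        acc = accuracy(model, block)  # the block is still unseen here
        curve.append((len(train), acc))
        if acc >= target:  # assumed stopping rule: desired accuracy reached
            break
        train.extend(block)  # the test block now joins the training set
    return curve


rng = random.Random(0)
blocks = [make_block(rng, 40) for _ in range(6)]
curve = sequential_learning_curve(blocks, target=0.95)
for n_train, acc in curve:
    print(f"trained on {n_train:3d} samples -> block accuracy {acc:.2f}")
```

Because each block is scored before it is ever used for training, every point on the learning curve is an out-of-sample estimate, which is one simple guard against the overfitting mentioned in the last bullet.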

Cytel is continuing to grow its team of experts who are able to take on Pattern Recognition and other big-data-related projects.



With thanks to Rajat Mukherjee

Rajat Mukherjee, Ph.D. is a Principal Statistician at Cytel. His expertise includes Bayesian clinical trials, Adaptive designs, complex epidemiological studies, survival analysis and pattern recognition. 
