Russian scientists are teaching machines to understand biomedical texts -- International Aging Research Portfolio

LONDON - Dec. 20, 2013 - PRLog -- Many biomedical datasets contain millions of records, too many to structure and classify by hand. Machine learning algorithms do this automatically by relying on expert training sets: records that human experts have already classified or labeled. The automatic classification algorithms use the training sets to find similar records and assign them to the appropriate categories.

When the data sets are diverse and the training sets are large, the accuracy of commonly used algorithms can be very high. But on data sets with incomplete labels these algorithms perform poorly, often requiring additional human input and sequential training steps.

A group of Russian scientists led by Anton Kolesov proposed a solution: add a pre-classification step that fills the gaps in labels by borrowing them from the nearest neighbors for which labels are available.

The paper, titled “On Multilabel Classification Methods of Incompletely Labeled Biomedical Text Data” and published in Computational and Mathematical Methods in Medicine, describes methods that help overcome the problem of incomplete labels by adding an extra step, training set modification, before classifying the dataset with standard methods. The authors tested two algorithms for training set modification: Weighted k-Nearest Neighbour (WkNN) and Soft-Supervised Learning (SoftSL). Both approaches are based on similarity measurements between data vectors, and both significantly improved the classification accuracy of the common algorithms, Support Vector Machines (SVM) and Random Forest (RF).
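
To make the idea of training set modification concrete, here is a minimal sketch of a similarity-weighted nearest-neighbour label fill. It is not the authors' WkNN or SoftSL implementation. The function names, the cosine similarity measure, the scoring rule (sum of neighbour similarities divided by k) and the `k` and `threshold` values are all assumptions chosen for illustration, and the feature vectors are assumed to be non-negative, as with term-frequency features.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def enrich_labels(vectors, labels, k=3, threshold=0.5):
    """Return a copy of `labels` with labels added from similar labelled records.

    Hypothetical scoring rule: for each record, take its k most similar
    labelled neighbours and score every label as the sum of the similarities
    of the neighbours carrying it, divided by k. Labels scoring at or above
    `threshold` are added. Existing labels are never removed, and the input
    is not modified.
    """
    enriched = [set(label_set) for label_set in labels]
    for i, vec in enumerate(vectors):
        # Only records that already carry at least one label can donate labels.
        candidates = [
            (cosine(vec, other), j)
            for j, other in enumerate(vectors)
            if j != i and labels[j]
        ]
        neighbours = sorted(candidates, reverse=True)[:k]
        scores = {}
        for sim, j in neighbours:
            for label in labels[j]:  # votes use the original labels only
                scores[label] = scores.get(label, 0.0) + sim / k
        for label, score in scores.items():
            if score >= threshold:
                enriched[i].add(label)
    return enriched


# Toy data: four records with three features each; record 2 has no labels yet.
vectors = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [1.0, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
labels = [{"aging"}, {"aging", "genetics"}, set(), {"oncology"}]

enriched = enrich_labels(vectors, labels, k=2, threshold=0.6)
print([sorted(label_set) for label_set in enriched])
# [['aging'], ['aging', 'genetics'], ['aging'], ['oncology']]
```

In this toy run the unlabeled record picks up "aging" because both of its nearest labelled neighbours carry it, while "genetics", carried by only one of them, stays below the threshold. The enriched label sets would then be passed to a standard classifier such as SVM or Random Forest.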

"The International Aging Research Portfolio draws data from a great many sources and relies on machine learning algorithms, including Support Vector Machines (SVM), to automatically classify, link and analyze biomedical grants and publications. But since it is a non-profit effort built and maintained by volunteers, there are relatively few training sets and the datasets are incompletely labeled, which significantly diminishes the accuracy of the non-probabilistic classifier. The new method proposed by Anton to tackle the problem uses nearest neighbors to enrich the datasets when the labels are clearly lacking. This pre-classification step improved the classification accuracy and laid the foundation for further improvements of the knowledge management system.

I am certain that many other teams working with biomedical text data may find the results of our experiments useful and may consider implementing the pre-classification step on incompletely labelled biomedical datasets," said Maria Litovchenko, a graduate student at the Ludwig Maximilian University of Munich and a co-author of the study.

The methods presented in the paper may be extrapolated to non-biomedical texts and to any data sets with incomplete labels.

# # #

About the International Aging Research Portfolio:
IARP is an independent non-profit initiative serving the aging research community, run by a volunteer team of over 100 developers and category editors. As the only centralized knowledge management system containing international grant databases, publications and project information, IARP provides highly granular, current information to scientists, funding organizations and policy makers, as well as a platform for collaboration and research. Presently, the system incorporates grant databases from the National Institutes of Health, European Commission, Canadian Institutes of Health Research, Australian National Health and Medical Research Council and other sources, and provides a categorized directory of research projects linked to related publications within the MEDLINE abstract database.

Media Contact
International Aging Research Portfolio
***@agingportfolio.org
+16265937957
