Russian scientists are teaching machines to understand biomedical texts
One of the common problems in automatic classification of biomedical text data is the lack of sufficient text labels. A team of Russian computer and biomedical scientists found a way around this problem by adding a pre-classification step.
By: International Aging Research Portfolio
When data sets are diverse and have large training sets, the accuracy of commonly used classification algorithms can be very high. But on data sets with incomplete labels these algorithms perform poorly, often requiring additional human input and sequential training steps.
A group of Russian scientists led by Anton Kolesov proposed a solution: a pre-classification step that fills the gaps in labels by borrowing them from the nearest neighbors for which labels are available.
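The idea of filling label gaps from nearest neighbors can be illustrated with a minimal sketch. This is not the algorithm from the paper: the function names (`cosine`, `enrich_labels`), the bag-of-words cosine similarity, and the parameters `k`, `min_sim` and `min_share` are all assumptions chosen only for illustration, and the sample texts and labels are made up.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def enrich_labels(texts, labels, k=2, min_sim=0.3, min_share=0.5):
    """Hypothetical pre-classification step (an illustration, not the paper's method).

    Each document keeps its own labels and additionally receives every label
    carried by at least `min_share` of its nearest neighbors. Neighbors are the
    `k` most similar documents whose similarity is at least `min_sim`.
    """
    vectors = [Counter(t.lower().split()) for t in texts]
    enriched = []
    for i, vec in enumerate(vectors):
        # Score every other document against document i, most similar first.
        scored = sorted(
            ((cosine(vec, vectors[j]), j) for j in range(len(texts)) if j != i),
            key=lambda pair: pair[0],
            reverse=True,
        )
        neighbors = [j for sim, j in scored[:k] if sim >= min_sim]
        own = set(labels[i])
        if not neighbors:
            enriched.append(own)
            continue
        # Count how many neighbors carry each label (original labels only).
        votes = Counter(label for j in neighbors for label in labels[j])
        borrowed = {
            label for label, n in votes.items() if n / len(neighbors) >= min_share
        }
        enriched.append(own | borrowed)
    return enriched


# Made-up example: the third document has no labels at all.
texts = [
    "telomere shortening in aging cells",
    "telomere length and cellular aging",
    "aging cells and telomere attrition",
    "insulin signaling in diabetic mice",
    "insulin resistance in diabetic patients",
]
labels = [{"telomeres", "aging"}, {"telomeres"}, set(), {"diabetes"}, {"diabetes"}]

enriched = enrich_labels(texts, labels)
for text, tags in zip(texts, enriched):
    print(sorted(tags), "<-", text)
```

The enriched label sets would then serve as the training set for a standard classifier. The thresholds matter: with a low `min_sim` or `min_share`, unrelated documents start to donate labels, so both would need tuning on real data.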
The paper, “On Multilabel Classification Methods of Incompletely Labeled Biomedical Text Data,” published in Computational and Mathematical Methods in Medicine, describes methods that help overcome the problem of incomplete labels by adding an extra step, training set modification, before the dataset is classified with standard methods. In the article, the authors tested two algorithms for training set modification.
"The International Aging Research Portfolio draws data from many sources and relies on machine learning algorithms, including Support Vector Machines (SVM), to automatically classify, link and analyze biomedical grants and publications. But since it is a non-profit effort built and maintained by volunteers, there are relatively few training sets and the datasets are incompletely labeled, which significantly diminishes the accuracy of the non-probabilistic classifier. The new method proposed by Anton to tackle the problem uses nearest neighbors to enrich the datasets where labels are clearly lacking. This pre-classification step improved the classification accuracy and laid the foundation for further improvements of the knowledge management system.
I am certain that many other teams working with biomedical text data may find the results of our experiments useful and may consider implementing the pre-classification step on incompletely-
The methods presented in the paper may be extrapolated to non-biomedical texts and to any data sets with incomplete labels.
# # #
About the International Aging Research Portfolio:
IARP is an independent non-profit initiative serving the aging research community, run by a volunteer team of over 100 developers and category editors. As the only centralized knowledge management system containing international grant databases, publications and project information, IARP provides highly granular, current information to scientists, funding organizations and policy makers, as well as a platform for collaboration and research. Presently, the system incorporates grant databases from the National Institutes of Health, European Commission, Canadian Institutes of Health Research, Australian National Health and Medical Research Council and other sources, and provides a categorized directory of research projects linked to related publications within the MEDLINE abstract database.