The data necessary to answer myriad questions - about, say, the correlations between the industrial use of certain chemicals and incidents of disease, or between patterns of news coverage and voter-poll results - may all be online in the form of plain text.
However, extracting data from plain text and organising it for quantitative analysis may be prohibitively time-consuming.
Researchers from the Massachusetts Institute of Technology (MIT) in the US have developed a new approach to information extraction.
Conventional machine-learning systems learn from training examples that have been annotated by humans. For instance, humans might label parts of speech in a set of texts, and the machine-learning system will try to identify patterns that resolve ambiguities, such as when "her" is a direct object and when it is an adjective.
Typically, computer scientists will try to feed their machine-learning systems as much training data as possible.
That generally increases the chances that a system will be able to handle difficult problems.
In the new research, by contrast, the scientists trained their system on scant data.
A machine-learning system will generally assign each of its classifications a confidence score, which is a measure of the statistical likelihood that the classification is correct, given the patterns discerned in the training data.
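As a rough illustration of what a confidence score can look like, the sketch below turns raw classifier scores into probabilities with a softmax and reports the probability of the winning label. This is one common convention, not necessarily the one the MIT system uses; the function names and the example scores are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_confidence(labels, scores):
    """Return the best label and its probability as a confidence score."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


# Hypothetical raw scores for the two readings of "her".
label, confidence = classify_with_confidence(
    ["direct object", "adjective"], [2.0, 0.5]
)
print(label, round(confidence, 2))
```

A score close to 1 means the patterns seen in training strongly favour one label; a score near 0.5 in a two-way choice means the classifier is close to guessing.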
With the new system, if the confidence score is too low, the system automatically generates a web search query designed to pull up texts likely to contain the data it is trying to extract.
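It then attempts to extract the relevant data from one of the retrieved texts and combines the result with that of its earlier attempt. If the confidence score remains too low, it moves on to the next text pulled up by the search string, and so on.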
Every decision the system makes is the result of machine learning. The system learns how to generate search queries, gauge the likelihood that a new text is relevant to its extraction task, and determine the best strategy for fusing the results of multiple attempts at extraction.
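The search-and-retry loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the researchers' implementation: in their system the query generation, relevance judgement and fusion strategy are all learned, whereas here `extract` and `search` are hypothetical callables supplied by the caller, the threshold is a fixed number, and "fusion" simply keeps the most confident result.

```python
def extract_with_fallback(text, extract, search, threshold=0.9, max_texts=5):
    """Extract a value from text, consulting extra texts while confidence is low.

    extract(text) -> (value, confidence)   # hypothetical extractor
    search(text)  -> iterable of texts     # hypothetical web search wrapper
    """
    value, confidence = extract(text)
    if confidence >= threshold:
        return value, confidence

    # Confidence is too low: look at further texts returned by the search.
    for i, candidate in enumerate(search(text)):
        if i >= max_texts:
            break
        new_value, new_confidence = extract(candidate)
        # Stand-in for learned fusion: keep whichever result is more confident.
        if new_confidence > confidence:
            value, confidence = new_value, new_confidence
        if confidence >= threshold:
            break
    return value, confidence
```

Capping the number of texts consulted (`max_texts`, also an assumption) keeps the loop from running indefinitely when no retrieved text ever raises the confidence above the threshold.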
The researchers compared their system's performance to that of several extractors trained using more conventional machine-learning techniques.
Disclaimer: No Business Standard Journalist was involved in creation of this content