One of the paradoxes of intelligence is that it is quite hard to develop an objective measure to test and quantify it. In the era of Artificial Intelligence, methods have been developed that enable computers to solve specific tasks once thought to require intelligence.
But AI programmes that solve specific problems often do so by using brute-force calculation and number-crunching that are very different from the ways in which humans approach the same problems. Those specific solutions do not help the AI learn to solve other types of problems. For example, an image-recognition AI cannot meaningfully analyse email messages or online buying habits for contextual references.
Humans are reckoned to possess general intelligence because they can solve many different types of problems, including problems they have never encountered before. They do so by combining abstract reasoning, pattern recognition, memory and so on to find solutions.
AI is generally considered to display "narrow" intelligence. Most IQ tests designed for humans are supposed to measure general intelligence, covering abstract logic, verbal competence in comprehending statements, and visual-spatial ability in recognising patterns. IQ tests are in themselves a little controversial, but they usually contain those three classes of problems.
A typical test of visual-spatial ability displays a series of images linked to each other by some logical "rule", with the final image left blank. The rule could involve the number of objects in each image, their shapes, colour, placement and so on. The person being tested is supposed to identify the missing image by discovering the underlying logical link.
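To make that concrete, here is a minimal sketch in Python, purely illustrative and not taken from the paper or from any real IQ test, of how such a puzzle can be reduced to a single attribute and solved by inferring the rule:

    # Toy illustration of a "number progression" puzzle, reduced to one
    # attribute per panel: how many objects each panel shows.
    context_panels = [1, 2, 3]          # the visible sequence of panels
    candidate_answers = [2, 3, 4, 6]    # options for the blank final panel

    # Infer the rule: a constant step between successive panels.
    step = context_panels[1] - context_panels[0]
    rule_holds = all(b - a == step
                     for a, b in zip(context_panels, context_panels[1:]))

    # Choose the candidate that continues the inferred progression.
    if rule_holds:
        predicted = context_panels[-1] + step
        print(predicted if predicted in candidate_answers else None)   # -> 4

The point of the exercise, for human or machine, is that the step-by-step rule is never stated; it has to be inferred from the panels themselves.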
This type of image test is called the Raven's Progressive Matrices (RPM) test, after John Carlyle Raven, who created the first such matrices in the 1930s. Researchers at DeepMind, the Alphabet subsidiary that created the AlphaZero chess and Go playing programme, have now started "training" AI to generate and solve these RPM-style problems. The results are interesting and are described in a paper, Measuring abstract reasoning in neural networks (http://proceedings.mlr.press/v80/santoro18a/santoro18a.pdf).
The researchers first created a programme to generate RPMs. It is important to understand that an RPM does not explicitly state what the relationship is, and it could be many different things, as anybody who has ever sat an IQ test, even for entertainment, will know. For example, it could be a progressively larger (or smaller) number of objects in each successive image. Or it may be a sequence of different geometric shapes, or the same shapes displayed in different relationships, or something else. To solve the test successfully, it is necessary to discover that logic.
So, any artificial generation of RPMs will use a logical sequence to create those matrices. The DeepMind team tried several approaches to creating programmes that could successfully invert that logic. They tested programmes on solving RPMs built from only one type of underlying sequence -- the same type the programmes had been trained on. They tested programmes on solving RPMs built from different types of sequences. And they tried augmenting the programmes' knowledge with different types of training. (The entire set of 1.2 million training questions and 20,000 test questions can be downloaded.)
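The paper's actual generator works over a much richer structured space of shapes, attributes and relations; the toy sketch below, with all names hypothetical, only illustrates the basic idea of sampling an (attribute, relation) rule and producing values for a three-by-three grid with the final panel hidden:

    import random

    # Toy sketch of procedural RPM generation (the DeepMind generator is
    # far richer). A "rule" pairs an attribute with a relation applied
    # along each row of a 3x3 grid.
    ATTRIBUTES = ["number", "size", "colour"]
    RELATIONS = {
        "progression": lambda start: [start, start + 1, start + 2],
        "constant":    lambda start: [start, start, start],
    }

    def generate_matrix():
        attribute = random.choice(ATTRIBUTES)
        relation_name, relation = random.choice(list(RELATIONS.items()))
        grid = [relation(random.randint(1, 4)) for _ in range(3)]  # three rows
        answer = grid[2][2]       # value of the "missing" bottom-right panel
        grid[2] = grid[2][:2]     # hide it, leaving the final panel blank
        return {"rule": (attribute, relation_name), "grid": grid, "answer": answer}

    print(generate_matrix())

Because the generator knows which rule it used, it can also label every puzzle with its underlying logic, which is what makes the different training and testing regimes possible.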
When the training and the test RPMs were based on exactly the same types of logic, the AIs solved the problems correctly about 75 per cent of the time. When they could also discern the underlying logical rules, solution rates rose to 87 per cent. But when the test problems differed from the training problems, success rates dropped sharply. This was true even for variations that humans would consider minor, such as changes of colour between training and testing matrices.
Even 75 per cent is a low solution rate, which may compare poorly to that of an intelligent human. Although the paper does not offer a formal "human baseline", it says that human participants "with a lot of experience" of solving RPMs generally scored about 80 per cent in informal testing. There is likely a selection bias, since we are dealing with a group of highly experienced, intelligent humans.
However, the paper says the AIs could not generalise or extrapolate well. If, for example, the training set involved number progressions (each successive image contained more objects) while the test set involved size progressions (each successive image contained bigger objects), the AI was stumped. A smart human would be able to generalise the principle of "more" in such circumstances.
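As a rough sketch of what such a regime looks like (the structure below is hypothetical, not the paper's code), the relation seen in training is the same, but the attribute it applies to is held out of training, so only a model that has abstracted the notion of "progression" can cope:

    # Illustrative "held-out attribute" split: training applies "progression"
    # to object counts, while testing applies the same relation to sizes.
    train_rules = [("number", "progression"), ("colour", "constant")]
    test_rules = [("size", "progression")]

    # A model that merely memorised "more objects per panel" fails on these;
    # one that abstracted "the attribute increases along the row" need not.
    unseen_at_test = [rule for rule in test_rules if rule not in train_rules]
    print(unseen_at_test)   # -> [('size', 'progression')]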
The conclusions are tentative. Since the systems performed poorly when extrapolating logical rules or dealing with the unfamiliar, those will have to be focus areas for further research. The tests do suggest that general intelligence in AI may be further away than the optimists believe.