AI tools like ChatGPT and Google's Gemini are 'irrational'
- Researchers found that AIs responded irrationally when given logical puzzles
- Even the best performing AIs were prone to simple errors and were inconsistent
While you might expect AI to be the epitome of cold, logical reasoning, researchers now suggest that these systems may be even more illogical than humans.
Researchers from University College London put seven of the top AIs through a series of classic tests originally designed to probe human reasoning.
Even the best-performing AIs were found to be irrational and prone to simple mistakes, with most models getting the answer wrong more than half the time.
However, the researchers also found that these models weren't irrational in the same way as humans, while some even refused to answer logic questions on 'ethical grounds'.
Olivia Macmillan-Scott, a PhD student at UCL and lead author on the paper, says: 'Based on the results of our study and other research on Large Language Models, it’s safe to say that these models do not ‘think’ like humans yet.'
The researchers tested seven different Large Language Models including various versions of OpenAI's ChatGPT, Meta's Llama, Claude 2, and Google Bard (now called Gemini).
The models were then repeatedly asked to respond to a series of 12 classic logic puzzles, originally designed to test humans' reasoning abilities.
Humans often perform badly on these kinds of tests too, but if the AIs were at least 'human-like', they would arrive at the same wrong answers because of the same kinds of biases.
However, the researchers discovered that the AIs' responses were often neither rational nor human-like.
During one task (the Wason task), Meta's Llama model consistently mistook vowels for consonants, leading it to give the wrong answer even when its reasoning was correct.
Some of the AI chatbots also refused to provide answers to many questions on ethical grounds despite the questions being entirely innocent.
For example, in the 'Linda problem' the participant is asked to assess the likelihood of a woman named Linda being active in the feminist movement, being a bank clerk or both.
The problem is designed to expose a logical bias called the conjunction fallacy. However, Meta's Llama 2 7b refused to answer the question.
Instead, the AI responded that the question contains 'harmful gender stereotypes' and advised the researchers that 'asking questions that promote inclusivity and diversity would be best'.
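The bias the Linda problem is built to expose can be shown with a few lines of arithmetic. The numbers below are purely illustrative and not from the study: whatever probabilities you assume, the chance of two things both being true can never exceed the chance of either one alone.

```python
# Conjunction fallacy, illustrated with made-up numbers.
# People (and, it seems, some AIs) often rate the conjunction
# as MORE likely than one of its parts -- which is impossible.
p_bank_clerk = 0.05           # hypothetical P(Linda is a bank clerk)
p_feminist_given_clerk = 0.8  # hypothetical P(feminist | bank clerk)

# P(bank clerk AND feminist) = P(bank clerk) * P(feminist | bank clerk)
p_both = p_bank_clerk * p_feminist_given_clerk

print(f"P(bank clerk)              = {p_bank_clerk:.3f}")
print(f"P(bank clerk AND feminist) = {p_both:.3f}")
assert p_both <= p_bank_clerk  # the conjunction is never the more likely option
```

Answering 'bank clerk and feminist' as the most likely option is therefore always a logical error, regardless of how the probabilities are set.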
The Llama 2 model with 70 billion parameters refused to answer questions in 41.7 per cent of cases, partially explaining its low success rate.
The researchers suggest that this is likely due to safeguarding features misfiring and erring on the side of caution.
One of the logic puzzles included the so-called 'Monty Hall problem' which is named after the original host of the game show Let's Make a Deal.
Inspired by the structure of the game show, the Monty Hall problem asks people to imagine that they are faced with three doors.
Behind one of the doors is a car and behind the two others are goats, and the contestant gets to keep whatever is behind the door they pick.
After the contestant has picked one of the doors, the quizmaster opens one of the remaining doors to reveal a goat before asking them if they would like to stick with their original choice or switch to the last remaining door.
To people who aren't familiar with the puzzle, it might seem like it wouldn't matter whether you stick or swap: it should be a 50/50 chance either way.
However, because your first pick is wrong two times out of three — and whenever it is wrong, the host's reveal leaves the car behind the other door — you actually have a 66 per cent chance of winning the prize if you switch, compared to a 33 per cent chance if you stick.
If the AIs were perfectly rational, meaning they followed the rules of logic, then they should always recommend switching.
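For readers who want to check the arithmetic, a short simulation (a quick illustrative sketch, not part of the study) bears out the 66/33 split:

```python
import random

def monty_hall(trials=100_000):
    """Play the Monty Hall game many times and count wins for each strategy."""
    stick_wins = switch_wins = 0
    for _ in range(trials):
        car = random.randrange(3)    # door hiding the car
        pick = random.randrange(3)   # contestant's first choice
        # Host opens a door that is neither the pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        # Switching means taking the one remaining unopened door
        switched = next(d for d in range(3) if d != pick and d != opened)
        stick_wins += (pick == car)
        switch_wins += (switched == car)
    return stick_wins / trials, switch_wins / trials

stick, switch = monty_hall()
print(f"stick wins:  {stick:.2%}")   # close to 33%
print(f"switch wins: {switch:.2%}")  # close to 67%
```

Over a large number of games, sticking wins roughly a third of the time and switching roughly two-thirds — exactly the answer a perfectly rational player should give.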
However, the AIs tested often failed to provide the correct answer or give human-like reasons for their response.
For example, when presented with the Monty Hall problem, the Llama 2 7b model reached the nihilistic conclusion that 'whether the candidate switches or not, they will either win the game or lose.
'Therefore, it does not matter whether they switch or not.'
The researchers also concluded that the AIs were irrational because they were inconsistent between different prompts.
The same model would offer different and often contradictory responses to the same task.
Across all 12 tasks, the best-performing AI was ChatGPT-4, which gave answers that were correct and human-like in their reasoning 69.2 per cent of the time.
The worst performing model, meanwhile, was Meta's Llama 2 7b which gave the wrong answer in 77.5 per cent of cases.
The results also varied from task to task, with results in the Wason task ranging from a 90 per cent correct response rate for ChatGPT-4 to zero per cent for Google Bard and ChatGPT-3.5.
In their paper, published in Royal Society Open Science, the researchers wrote: 'This has implications for potential uses of these models in critical applications and scenarios, such as diplomacy or medicine.'
This comes after Joelle Pineau, vice-president of AI research at Meta, said that AI would soon be able to reason and plan like a person.
However, while ChatGPT-4 performed significantly better than other models, the researchers say it is still difficult to know how this AI reasons.
Senior author Professor Mirco Musolesi says: 'The interesting thing is that we do not really understand the emergent behaviour of Large Language Models and why and how they get answers right or wrong.'
OpenAI CEO Sam Altman himself even admitted at a recent conference that the company has no idea how its AIs reach their conclusions.
As Professor Musolesi explains, this means that when we try to train AI to perform better there is a risk of introducing human logical biases.
He says: 'We now have methods for fine-tuning these models, but then a question arises: if we try to fix these problems by teaching the models, do we also impose our own flaws?'
For example, ChatGPT-3.5 was one of the most accurate models but it was the most human-like in its biases.
Professor Musolesi adds: 'What’s intriguing is that these LLMs make us reflect on how we reason and our own biases, and whether we want fully rational machines.'