People struggle to tell humans apart from ChatGPT in five-minute chat conversations, tests show
People find it difficult to distinguish between the GPT-4 model and a human agent when interacting with them in a two-person conversation.
“We argue that these falsehoods, and the overall activity of large language models, is better understood as bulls**t in the sense explored by Frankfurt: the models are in an important way indifferent to the truth of their outputs.”
“This could motivate people to make wiser choices in the present that optimise for their long-term wellbeing and life outcomes.”
Perhaps the most widely touted of GPT-4’s at-launch, zero-shot capabilities has been its reported 90th-percentile performance on the Uniform Bar Exam.
These AI tools still hallucinate an alarming amount of the time: the Lexis+ AI and Ask Practical Law AI systems produced incorrect information more than 17% of the time, while Westlaw’s AI-Assisted Research hallucinated more than 34% of the time.
A team of computer scientists at Purdue University has found that the popular LLM, ChatGPT, is wildly inaccurate when responding to computer programming questions.
New research claims that large language models, specifically GPT-4 (which powers certain versions of ChatGPT and various Microsoft Copilot-branded generative AI products), are able to analyze financial statements with greater accuracy than humans.
Even at this early stage, though, Anthropic’s research provides an exciting framework for making an LLM’s “black box” results that much more interpretable and, potentially, controllable.
GPT-4 showed markedly superior performance compared with unspecialised junior doctors, whose level of specialist eye knowledge is comparable to that of general practitioners.
One system even altered its behaviour during mock safety tests, raising the prospect of auditors being lured into a false sense of security.
The visual recognition skills of the large language model fell far short of clinical standards, achieving a positive predictive value (PPV) of less than 25% in its best attempt at spotting image findings from a set of 100 chest x-rays.
Med-Gemini was tested on 14 medical benchmarks and established a new state-of-the-art (SoTA) performance on 10, surpassing the GPT-4 model family on every benchmark where a comparison could be made.
Researchers found that morality judgments given by ChatGPT4 were “perceived as superior in quality to humans'” along a variety of dimensions like virtuousness and intelligence.
Leveraging advanced algorithms to manage patient messages had tremendous impact, including providing population-level insights regarding commonly recurring issues and even allowing for faster response times for potentially acute and emergent situations.
ChatGPT-4 generated acceptable messages to patients without any additional editing by radiation oncologists 58% of the time, and 7% of responses generated by GPT-4 were deemed unsafe by the radiation oncologists if left unedited.
OpenAI’s GPT-4 large language model (LLM) can autonomously exploit vulnerabilities in real-world systems if given a CVE advisory describing the flaw.
Errors in radiology reports may occur due to resident-to-attending discrepancies, speech recognition inaccuracies and high workload. Large language models, such as GPT-4, have the potential to enhance the report generation process.
Asking the chatbot for tales about future events rather than asking for direct predictions proved surprisingly effective, especially in forecasting Oscar winners.
The study, published in the Canadian Psychological Association’s Mind Pad, found that “false citation rates” across various psychology subfields ranged from 6% to 60%. Surprisingly, these fabricated citations feature elements such as legitimate researchers’ names and properly formatted digital object identifiers.
Developing ways to measure the persuasive capabilities of AI models is important because it serves as a proxy measure of how well AI models can match human skill in an important domain, and because persuasion may ultimately be tied to certain kinds of misuse, such as using AI to generate disinformation, or persuading people to take actions against their own interests.
A study that identified buzzword adjectives that could be hallmarks of AI-written text in peer-review reports suggests that researchers are turning to ChatGPT and other artificial intelligence (AI) tools to evaluate others’ work.
A large language model (LLM) can be convinced to tell you how to build a bomb if you prime it with a few dozen less-harmful questions first.
GPT-4 is reportedly able to convince human debate opponents to agree with its position 81.7 percent more often than a human opponent, according to research from a group of Swiss and Italian academics.
Researchers found that with some spare cash and enough technical know-how, even a “low-resourced attacker” can tamper with a relatively small amount of data that’s invasive enough to cause a large language model to churn out incorrect answers.
Eight names are listed as authors on “Attention Is All You Need,” a scientific paper written in the spring of 2017. They were all Google researchers, though by then one had left the company…
Almost as quickly as a paper came out last week revealing an AI side-channel vulnerability, Cloudflare researchers have figured out how to solve it: just obscure your token size.
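In rough terms, the mitigation pads every streamed token payload up to a fixed bucket size, so packet lengths no longer track token lengths. A minimal sketch with hypothetical names (not Cloudflare's actual code):

```python
import secrets

BLOCK = 32  # pad each streamed payload up to the next multiple of this size

def pad_chunk(token_text: str) -> bytes:
    """Pad a streamed token payload so its on-the-wire length no longer
    reveals the token's true size (the side channel the paper exploits).
    A 4-byte length prefix lets the client strip the padding."""
    raw = token_text.encode("utf-8")
    padded_len = (len(raw) // BLOCK + 1) * BLOCK
    padding = secrets.token_bytes(padded_len - len(raw))
    return len(raw).to_bytes(4, "big") + raw + padding

# short and long tokens land in the same size bucket:
print(len(pad_chunk("hi")), len(pad_chunk("hello world")))  # 36 36
```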
How can these powerful systems beat us in chess but falter on basic math? This paradox reflects more than just an idiosyncratic design quirk. It points toward something fundamental about how large language models think.
The dialect of the language you speak decides what artificial intelligence (AI) will say about your character, your employability, and whether you are a criminal.
The biggest models are now so complex that researchers are studying them as if they were strange natural phenomena, carrying out experiments and trying to explain the results.
A group of researchers has created what they claim is one of the first generative AI worms, which can spread from one system to another, potentially stealing data or deploying malware in the process.
Generative AI is continuing to improve — so publishers, grant-funding agencies and scientists must consider what constitutes ethical use of LLMs, and what over-reliance on these tools says about a research landscape that encourages hyper-productivity.
By employing results from learning theory, we show that LLMs cannot learn all of the computable functions and will therefore always hallucinate. Since the formal world is a part of the real world, which is much more complicated, hallucinations are inevitable.
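The core of the argument is a simple diagonalization; a sketch of the published reasoning, in notation of my own choosing:

```latex
% Enumerate all computably-describable LLMs as $h_1, h_2, \dots$ and
% define the ground-truth function $f$ to disagree with each model on
% its own index:
\[
  f(i) \neq h_i(i) \qquad \text{for every } i \in \mathbb{N}.
\]
% $f$ is itself computable, yet no $h_i$ computes it: model $h_i$ is
% wrong on input $i$, so every LLM in the enumeration must output a
% falsehood (hallucinate) somewhere.
```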
It defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
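The headline technique is quantizing every weight to one of the ternary values {-1, 0, +1} (about 1.58 bits per weight). A minimal sketch of the paper's absmean rounding scheme, not the authors' code:

```python
import torch

def absmean_ternarize(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight tensor to {-1, 0, +1} using a per-tensor
    absmean scale, in the spirit of the 1.58-bit recipe described
    above (illustrative only)."""
    scale = w.abs().mean().clamp(min=eps)    # average magnitude of the weights
    w_q = (w / scale).round().clamp(-1, 1)   # round-and-clip to three levels
    return w_q, scale                        # keep the scale for dequantization

w = torch.randn(256, 256)
w_q, scale = absmean_ternarize(w)
print(sorted(w_q.unique().tolist()))  # [-1.0, 0.0, 1.0]
```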
The study suggests that advanced AI tools could play an important role in providing decision-making support to ophthalmologists in the diagnosis and management of cases involving glaucoma and retina disorders, which afflict millions of patients.
“In a new preprint study, we develop an approach to verify how well LLMs are able to cite medical references and whether these references actually support the claims generated by the models.”
“A programming question is fed to the underlying large language model, and it’s asked to describe and summarize the problem. That information then guides how it should begin to solve the problem…”
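A minimal sketch of that two-step flow, assuming a hypothetical complete() helper that wraps whatever LLM client is available (this is not the authors' pipeline):

```python
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # stand-in

def solve_with_self_summary(question: str) -> str:
    # Step 1: ask the model to restate the problem in its own words.
    summary = complete(
        "Describe and summarize the following programming problem "
        "in your own words:\n\n" + question
    )
    # Step 2: feed that summary back in to guide the actual solution.
    return complete(
        "Problem:\n" + question
        + "\n\nYour summary of it:\n" + summary
        + "\n\nUsing the summary as a plan, write the solution code."
    )
```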
A recent paper explores how to use AI chatbots to autonomously hijack websites. The Register spoke to one of the authors of the paper.
Their data sources include machine translations of several existing datasets into more than 100 languages, roughly half of which are considered underrepresented — or unrepresented — including Azerbaijani, Bemba, Welsh and Gujarati.
By ensuring that these first few data points of a conversation remain in memory, the researchers’ method allows a chatbot to keep chatting no matter how long the conversation goes.
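A toy version of that eviction policy: keep the first few "attention sink" tokens forever, plus a sliding window of the most recent tokens, and discard everything in between (names and sizes here are illustrative):

```python
from collections import deque

class SinkWindowCache:
    """Toy KV-cache policy in the spirit of the method above: the first
    n_sink tokens are never evicted; the rest live in a fixed-size
    window that silently drops its oldest entries."""

    def __init__(self, n_sink: int = 4, window: int = 1024):
        self.n_sink = n_sink
        self.sinks = []                      # first tokens, kept forever
        self.recent = deque(maxlen=window)   # rolling window of recent tokens

    def append(self, token):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(token)
        else:
            self.recent.append(token)        # deque evicts the oldest itself

    def context(self):
        return self.sinks + list(self.recent)
```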
Version 3.5 of ChatGPT could not formulate a correct diagnosis in 83 of 100 pediatric cases, according to recent research published in JAMA Pediatrics.
“LLMs stand poised to disrupt the legal industry, enhancing accessibility and efficiency of legal services. Our research asserts that the era of LLM dominance in legal contract review is upon us, calling for a reimagined future of legal workflows.”
“We observe that models tend to develop arms-race dynamics, leading to greater conflict, and in rare cases, even to the deployment of nuclear weapons…”
The number of referrals from services using the Limbic chatbot rose by 15% during the study’s three-month time period, compared with a 6% rise in referrals for the services that weren’t using it.
A team of researchers at New York University wondered if AI could learn like a baby. What could an AI model do when given a far smaller data set—the sights and sounds experienced by a single child learning to talk?
Researchers have demonstrated that robots equipped with the ability to express emotions in real-time during interactions with humans are perceived as more likable, trustworthy, and human-like.
Researchers found that they were able to bypass its safety guardrails about 79 percent of the time using Zulu, Scots Gaelic, Hmong, or Guarani. The attack is about as successful as other types of jail-breaking methods.
Computer scientists have found that misinformation generated by large language models (LLMs) is more difficult to detect than artisanal false claims hand-crafted by humans.
The big takeaway from all these research efforts is that scientists are hard at work trying to find ways of compressing and dividing the work of training to make it feasible on battery-operated devices with less memory and less processing power…
“As consumers increasingly use generative AI (GenAI) in their personal and professional lives, they are optimistic about its potential positive impact on society and believe that the benefits of the technology outweigh the risks they see…”
Deloitte: “The arrival of generative AI heralds disruption and opportunity across industries. Organizations are exploring how generative AI can be used to unlock and open the door to entirely new products, services and business models…”
An artificial intelligence (AI) system trained to conduct medical interviews matched, or even surpassed, human doctors’ performance at conversing with simulated patients and listing possible diagnoses on the basis of the patients’ medical history.
Researchers keep finding new ways to ‘pervert’ AI chatbots. A new paper on arXiv describes the latest threat: a ‘sleeper’ agent…
According to new research from Stanford University, the popularization of A.I. chatbots has not boosted overall cheating rates in schools.
We found that only GPT-4 processes inputs with unnatural errors nearly flawlessly, even under extreme conditions, a task that poses significant challenges for other LLMs and often even for humans…
Because LLMs have been shown to “hallucinate” factually incorrect information, using them to make verifiably correct discoveries is a challenge. But what if we could harness the creativity of LLMs by identifying and building upon only their very best ideas?
Mistral, based in Paris and founded by Arthur Mensch, Guillaume Lample, and Timothée Lacroix, has seen a rapid rise in the AI space recently. It has been quickly raising venture capital to become a sort of French anti-OpenAI, championing smaller models with eye-catching performance.
Artificial intelligence (AI) systems are often depicted as sentient agents poised to overshadow the human mind. But AI lacks the crucial human ability of innovation, researchers at the University of California, Berkeley have found.
ChatGPT-4.0 has passed a clinical neurology exam with 85% correct answers in a proof-of-concept study. The research authors believe that after some fine-tuning, LLMs could have “significant applications” in clinical neurology.
This feature in Nature asks whether the poor use of AI is doing science more harm than good:
This game of whack-a-mole can never be won by OpenAI – or any other chatbot provider. But they’re going to try.
ChatGPT failed to accurately risk-stratify 35% of the patients studied, but the artificial intelligence (AI) chatbot was able to provide accurate treatment recommendations.
In the rush to deploy off-the-shelf proprietary LLMs, health-care institutions and other organizations risk ceding the control of medicine to opaque corporate interests.
“We have just released a paper that allows us to extract several megabytes of ChatGPT’s training data for about $200. We estimate that it would be possible to extract ~a gigabyte of ChatGPT’s training dataset from the model by spending more…”
For the first time, our study investigated whether ChatGPT’s responses are perceived as better than human responses in a task where humans were required to be empathetic.
Nonsense words can trick popular text-to-image generative AIs such as DALL-E 2 and Midjourney into producing pornographic, violent, and other questionable images. A new algorithm generates these commands to skirt these AIs’ safety filters.
The current version of ChatGPT has limitations in accurately answering MCQs and generating correct, relevant rationales, particularly when it comes to referencing. To avoid potential harm, ChatGPT should be used under supervision.
Casting dialogue-agent behaviour in terms of role play allows us to draw on familiar folk psychological terms, without ascribing human characteristics to language models that they in fact lack.
Current LLMs can infer a wide range of personal attributes (e.g., location, income, sex), achieving up to 85% top-1 and 95.8% top-3 accuracy at a fraction of the cost (100×) and time (240×) required by humans.
Why does this matter? First, different conversation types serve distinct information needs and demand varied UI designs. Second, there is no one optimal conversation length — both short and long conversations can be helpful, as they might support different user goals.
As an example, using a single publicly available large-language model, within 65 minutes, 102 distinct blog articles were generated that contained more than 17 000 words of disinformation related to vaccines and vaping.
To create an academia-worthy structure, Bilal says it is fundamental to master incremental prompting, a technique traditionally used in behavioural therapy and special education. It involves breaking down complex tasks into smaller, more manageable steps.
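A minimal sketch of incremental prompting, where complete() is a stand-in for any chat-completion call and the steps are purely illustrative:

```python
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # stand-in

STEPS = [
    "List the 3-5 key claims this literature review must cover.",
    "For each claim, draft one paragraph summarizing the evidence.",
    "Combine the paragraphs into a coherent review with transitions.",
]

def incremental_prompt(topic: str) -> str:
    context = f"Topic: {topic}"
    for step in STEPS:
        answer = complete(context + "\n\nNext step: " + step)
        context += f"\n\n[{step}]\n{answer}"   # carry each result forward
    return answer
```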
The study revealed that large language models may, in fact, be able to understand and respond to emotional cues. Researchers found that LLMs produced higher-quality outputs when emotional language was used to talk to AI chatbots.
The new Dutch LLM, dubbed GPT-NL, will be an open model, allowing everyone to see how the underlying software works and how the AI comes to certain conclusions, said its creators. The AI is being developed by research organisation TNO, the Netherlands Forensic Institute, and IT cooperative SURF.
All three platforms provided high rates of inaccurate recommendations. Chatbot ratings for answering patient questions varied, with Bing Chat (Creative) having the highest score and Bing Chat (Concise) having the lowest score.
Data sets that are poorly thought out or insufficiently described increase the risk of ‘garbage in, garbage out’ studies and the propagation of biases, rendering outcomes meaningless or, even worse, dangerous.
The original submission falsely accused KPMG of being complicit in a “KPMG 7-Eleven wage theft scandal” that led to the resignation of several partners.
Stanford University says that these prominent AI companies are becoming less transparent as their models become more powerful.
To varying degrees, the models appeared to be using race-based equations for kidney and lung function, which the medical establishment increasingly recognizes could lead to misdiagnosis or delayed care for Black patients.
The researchers found that “the trustworthiness of GPT models remains limited.” They also discovered that the GPT models have a tendency to generalize when asked about ongoing events outside their scope of knowledge.
That, in a nutshell, is why we should never trust pure LLMs; even under carefully controlled circumstances with massive amounts of directly relevant data, they still never really get even the most basic linear functions.
ChatGPT, the AI language model capable of mirroring human conversation, may be better than a doctor at following recognised treatment standards for clinical depression, and without any of the gender or social class biases sometimes seen in the primary care doctor-patient relationship.
Researchers found that a modest amount of fine-tuning – additional training for model customization – can undo AI safety efforts that aim to prevent chatbots from suggesting suicide strategies, harmful recipes, or other sorts of problematic content.
While it’s no shocker that a manager from OpenAI would endorse ChatGPT, it’s essential to tread cautiously. As per the MIT and Arizona research, it’s crucial to calibrate society’s expectations of AI, ensuring a clear line between genuine therapeutic sessions and AI interactions.
This feature from COSMOS Magazine asks — can we even ask if AI is conscious? And what does ‘consciousness’ even mean?
It’s becoming increasingly important to be able to distinguish between real images and ‘deepfakes’ – synthetic images generated by AI. Providers have strategies to ‘watermark’ these synthetic images, so they can be easily detected as fakes. But, as reported in Ars Technica, researchers have already found ways around this.
In this latest study, DeepMind researchers found “Take a deep breath and work on this problem step by step” to be the most effective prompt when used with Google’s PaLM 2 language model. The phrase achieved the top accuracy score of 80.2 percent in tests against GSM8K, which is a data set of grade-school math word problems.
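A sketch of how such a prompt prefix might be scored on GSM8K-style word problems (illustrative only, not DeepMind's evaluation harness; complete() is a stand-in for a real model call):

```python
import re

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # stand-in

def accuracy(prefix: str, problems: list[tuple[str, str]]) -> float:
    """Score a prompt prefix: an answer counts as correct when the gold
    number is the last number in the model's output."""
    hits = 0
    for question, gold in problems:
        output = complete(prefix + "\n\n" + question)
        numbers = re.findall(r"-?\d+\.?\d*", output)
        hits += bool(numbers) and numbers[-1] == gold
    return hits / len(problems)

# usage, once complete() is wired to a real model:
# accuracy("Take a deep breath and work on this problem step by step.",
#          gsm8k_pairs)
```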
Participants with access to the large language model GPT-4 completed 12.2% more tasks on average and completed tasks 25.1% more quickly than those without access to AI, the study conducted in tandem with Boston Consulting Group found.
In a new paper, researchers from DeepMind propose a new way: Optimization by PROmpting (OPRO), a method that uses AI large language models (LLM) as optimizers. The unique aspect of this approach is that the optimization task is defined in natural language rather than through formal mathematical definitions.
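A minimal sketch of that loop: past candidates and their scores go into a natural-language meta-prompt, and the model is asked for a better candidate. Here score_fn and complete are stand-ins (for instance, the accuracy harness sketched above):

```python
def opro(score_fn, complete, n_rounds: int = 10):
    """Optimization by PROmpting, sketched: the LLM sees the scored
    trajectory so far and proposes the next candidate prompt."""
    seed = "Let's think step by step."            # an arbitrary starting point
    history = [(seed, score_fn(seed))]            # (prompt, score) trajectory
    for _ in range(n_rounds):
        meta = ("Here are prompts with their accuracies, lowest to highest:\n"
                + "\n".join(f"prompt: {p!r}  score: {s:.2f}"
                            for p, s in sorted(history, key=lambda x: x[1]))
                + "\nWrite a new prompt that will achieve a higher accuracy.")
        candidate = complete(meta)
        history.append((candidate, score_fn(candidate)))
    return max(history, key=lambda x: x[1])       # best prompt found
```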