I have an idea how to test whether AI can be a good scientist:
Train on all published scientific knowledge and observations up to a certain point, before a breakthrough occurred. Then see if your AI can generate the breakthrough on its own.
For example, prior to 1900 quantum theory did not exist. Given what we knew then, could AI reproduce the ideas of Planck, Einstein, Bohr, etc.?
If not, then AI will never be useful for generating scientific theory.
I don’t think this is the main point of the paper. They’re not claiming that AI is capable of scientific breakthroughs. Rather, they argue that AI excels at summarising vast amounts of existing scientific knowledge.
That's literally what "knowledge synthesis" is. Not just summarizing, but "the combination of ideas to form a theory or system."
Breakthroughs are just a special case of synthesis.
Formally speaking, breakthroughs are not simply a subset of synthesis, as they can exist outside the realm of prior knowledge.
Or just have the AI generate new specific experimental setups and parameters that we can try and be like "oh yeah, we just made a room temperature superconductor".
Honestly given what we know about physics, the AI should be able to simulate physics within itself or deduce certain things we've missed.
> Honestly given what we know about physics, the AI should be able to simulate physics within itself or deduce certain things we've missed.
If by "AI" you mean language models, then no, it will not "be able to simulate physics within itself". No way.
It can simulate basic problems well enough when viewed as a black box. Give it one of Galileo's experiments.
Oh no, I mean that if we claim we have an AGI and it's true, it should be able to do that. LLMs are not that.
Fair enough.
And actually, I think that's an interesting line to consider for determining whether something is in fact an AGI.
Discover quantum mechanics or you’re a failure!
I hope your approach with your kids is a bit more nuanced.
Is your second sentence sincere? Attacking someone's parenting to win rhetorical points on an unrelated topic is pretty low.
How dare he have high expectations of the AI product!
High expectations are one thing, and I’m an AGI skeptic, but when did being the smartest person ever become a requirement of AGI?
Since always. That's what AGI means.
AGI doesn’t have a universally accepted definition.
I only did a quick read of the paper, but I couldn't find how many humans they used to generate their expected human performance. This seems to be the main content:
> To ensure that we did not overfit PaperQA2 to achieve high performance on LitQA2, we generated a new set of 101 LitQA2 questions after making most of the engineering changes to PaperQA2. The accuracy of PaperQA2 on the original set of 147 questions did not differ significantly from its accuracy on the latter set of 101 questions, indicating that our optimizations in the first stage generalized well to new and unseen LitQA2 questions (Table 2).
> To compare PaperQA2 performance to human performance on the same task, human annotators who either possessed a PhD in biology or a related science, or who were enrolled in a PhD program (see Section 8.2.1), were each provided a subset of LitQA2 questions and a performance-related financial incentive of $3-12 per question to answer as many questions correctly as possible within approximately one week, using any online tools and paper access provided by their institutions. Under these conditions, human annotators achieved 73.8% ± 9.6% (mean ± SD, n = 9) precision on LitQA2 and 67.7% ± 11.9% (mean ± SD, n = 9) accuracy (Figure 2A, green line). PaperQA2 thus achieved superhuman precision on this task (t(8.6) = 3.49, p = 0.0036) and did not differ significantly from humans in accuracy (t(8.5) = −0.42, p = 0.66).
Academic writing is notoriously hard to read and often poorly written. If this lives up to its billing it will be a game changer - no need to rely on the sporadic, manual, intrinsically limited surveys produced by everyone from academics and analysts through to gym bros and Reddit posters.
One of my big uses of LLMs has been searching through medical research. The issue has been occasionally running into confidence where there shouldn't be any, but I have found it hallucinates a lot less on science topics than it does on more common topics.
How are you searching with LLMs?
The strategies I've tried are:
1. Shove all potentially relevant data (e.g. an entire book or library) into an LLM (quite expensive, and the needle-in-a-haystack problem exists even in recent models IIRC -- though splitting the data into many smaller prompts seems to solve it without substantially increasing the price; see the first sketch after this list).
2. Vector database (in my experience overcomplicated and spotty in performance, often not much better than an expanded keyword search, sometimes worse? -- see the second sketch below).
3. Web search (generate queries, run them on DuckDuckGo, read the top N results) -- decent in theory, but most top search results are crap; I need to adapt this method to use only high-quality sources instead of general-purpose search engines.
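Here's roughly what I mean by splitting into many smaller prompts in strategy 1 -- a minimal map-reduce sketch assuming the OpenAI Python client; the model name, chunk size, and prompt wording are placeholders, not anything I've benchmarked:

```python
# Map-reduce sketch for strategy 1: query many small chunks, then combine
# the per-chunk findings. Model name, chunk size, and prompts are placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model

def ask(prompt: str) -> str:
    """Single LLM call; returns the text of the first choice."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def answer_over_document(question: str, document: str, chunk_chars: int = 12_000) -> str:
    # Map step: ask the question against each chunk independently, so no
    # single prompt has to find the needle in a huge haystack.
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    notes = []
    for i, chunk in enumerate(chunks):
        note = ask(
            f"Question: {question}\n\n"
            f"Excerpt {i + 1}/{len(chunks)}:\n{chunk}\n\n"
            "Quote anything relevant to the question, or reply NONE."
        )
        if note.strip() != "NONE":
            notes.append(note)
    # Reduce step: synthesize a final answer from only the relevant notes.
    return ask(
        f"Question: {question}\n\nNotes gathered from the document:\n"
        + "\n---\n".join(notes)
        + "\n\nAnswer the question using only these notes, and say if they are insufficient."
    )
```

The map step keeps each prompt small enough that recall stays reasonable, and the reduce step only pays for the chunks that actually contained something relevant.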
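And here's the corresponding sketch for strategy 2, with a plain in-memory list standing in for the vector database; again the embedding model name and top-k are placeholder choices:

```python
# Embedding-retrieval sketch for strategy 2: rank chunks by cosine similarity
# to the question and keep only the top hits for a single follow-up prompt.
# An in-memory list stands in for a real vector database here.
import math
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # placeholder embedding model

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts and return one vector per text."""
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return [item.embedding for item in resp.data]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_chunks(question: str, chunks: list[str], k: int = 5) -> list[str]:
    # Rank every chunk by similarity to the question and return the top k.
    chunk_vecs = embed(chunks)
    q_vec = embed([question])[0]
    ranked = sorted(zip(chunks, chunk_vecs), key=lambda cv: cosine(q_vec, cv[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

This is also where the spotty performance shows up: the ranking is only as good as the embeddings' grasp of the question, which for niche scientific phrasing may be no better than an expanded keyword search.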
I have free software that does this: https://github.com/impredicative/newssurvey
As for the how, it is described in the Approach section of its readme.
> hallucinates a lot less in science topics
Extremely dangerous for that one detail you won't be expecting, precisely because hallucination has become rarer; and extremely dangerous in the hands of practitioners who are not paranoid about hallucination.
(Incidentally: apparently, somebody recently lost a house after having some chatbot write the contract. This is indicative of how careless users can be.
Edit: I am trying to find that piece of news, but it seems non-trivial. Maybe the original reference that reported the news was itself a victim of hallucination? Meanwhile, I have found this noteworthy piece: a firm lets a chatbot present contractual terms to customers, the chatbot hallucinates terms, and the firm loses the resulting legal action - https://mashable.com/article/air-canada-forced-to-refund-aft... )
GitHub: https://github.com/Future-House/paper-qa