For a case study would be nice if the case were actually studied…
> had unusually legible handwriting, but even “easy” early modern paleography like this is still the sort of thing that requires days or weeks of training to get the hang of.
Why would you need weeks of training to use some OCR tool? There's no comparison to any existing alternatives in the article. And testing only on "unusually legible" handwriting isn't that relevant for the… usual cases.
> This is basically perfect,
I’ve counted at least 5 errors on the first line, how is this anywhere close to perfection???
Same with translation: first, is this an obscure text that has no existing translation to compare the accuracy to instead of relying on your own poor knowledge? Second, what about existing tools?
> which I hadn’t considered as being relevant to understanding a specific early modern map, but which, on reflection, actually are (the Peter Burke book on the Renaissance sense of the past).
How?
> Does this replace the actual reading required? Not at all.
With seemingly irrelevant books like the previous one, yes, it does; the poor student has a rather limited time budget.
I agree, I probably should've gone into more detail on the actual case studies and implications. I may write this up as a more academic article at some point so I have space to do that.
To your point about OCR: I think you'll find that the existing OCR tools will not know where to begin with the 18th century Mexican medical text in the second case study. If you can find one that is able to transcribe that lettering, please do let me know because it would be incredibly useful.
Speaking entirely for myself here, a pretty significant part of what professional historians do is to take a ton of photos of hard-to-read archival documents, then slowly puzzle them out after the fact - not by using any OCR tool (because none of them that I'm aware of are good enough to deal with difficult paleography) but the old fashioned way, by printing them out, finding individual letters or words that are readable, and then going from there. It's tedious work and it requires at least a few days of training to get the hang of.
For those looking for a specific example of an intermediate-difficulty level manuscript in English, that post shows a manuscript of the John Donne poem "The Triple Fool" which gives a sense of a typical 17th century paleography challenge that GPT-4o is able to transcribe (and which, as far as I know, OCR tools can't handle - though please correct me if I'm wrong). The "Sea surgeon" manuscript below it is what I would consider advanced-intermediate and is around the point where GPT-4o, and probably most PhD students in history, gets completely lost.
re: basically perfect, the errors I see are entirely typos which don't change the meaning (descritto instead of descritta, and the like). So yes, not perfect, but not anything which would impact a historical researcher. In terms of existing tools for translation, the state of the art that I was aware of before LLMs is Google Translate, and I think anyone who tries both on the same text can see which works better there.
re: "irrelevant books," there's really no way to make an objective statement about what's relevant and what's not until you actually read something rather than an AI summary. For that reason, in my own work, this is very much about augmenting rather than replacing human labor. The main work begins after this sort of LLM-augmented research. It isn't replaced by it in any way.
I wanted to say this, but could not express it as well.
I think what your points also reveal is the biggest success factor of ChatGPT: it can do many things that specialised tools have been doing (better), but many ChatGPT users had not known about those tools.
I do understand that a mere user of e.g. OCR tooling does not perform a systematic evaluation with the available tools, although it would be the scientific way to decide for one.
For a researcher, however, the lack of knowledge about the tooling ecosystem seems concerning.
> Granted, Monte had unusually legible handwriting, but even “easy” early modern paleography like this is still the sort of thing that requires days or weeks of training to get the hang of.
He isn't talking about weeks of training to learn to use OCR software, he means weeks of training to learn to read that handwriting without any assistance from software at all.
I'd love to read way more stuff like this. There are plenty of people writing about LLMs from a computer science point of view, but I'm much more interested in hearing from people in fields like this one (academic history) who are vigorously exploring applications of these tools.
I’m working with Neo-Latin texts at the Ritman Library of Hermetic Philosophy in Amsterdam (aka Embassy of the Free Mind).
Most of the library is untranslated Latin. I have a book that was recently professionally translated but it has not yet been published. I’d like to benchmark LLMs against this work by having experts rate preference for human translation vs LLM, at a paragraph level.
I’m also interested in a workflow that can enable much more rapid LLM transcriptions and translations — whereby experts might only need to evaluate randomized pages to create a known error rate that can be improved over time. This can be contrasted to a perfect critical edition.
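A minimal sketch of that sampled-evaluation idea, assuming transcribed pages are records that an expert reviewer can flag (the field names and review function here are purely illustrative):

```python
import random

def estimate_error_rate(pages, has_error, sample_size, seed=0):
    """Expert-review a random sample of LLM-transcribed pages and return
    the observed error rate as an estimate for the whole run."""
    rng = random.Random(seed)
    sample = rng.sample(pages, min(sample_size, len(pages)))
    flagged = sum(1 for page in sample if has_error(page))
    return flagged / len(sample)

# Illustrative data: pretend page 3 contains a transcription error.
pages = [{"id": i, "bad": i == 3} for i in range(100)]
rate = estimate_error_rate(pages, lambda p: p["bad"], sample_size=20)
```

Tracking this rate across successive prompt or model revisions would give the "known error rate that can be improved over time" without requiring experts to check every page.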
And, on this topic, just yesterday I tried and failed to find English translations of key works by Gustav Fechner, an early German psychologist. This isn’t obscure—he invented the median and created the field of “empirical aesthetics.” A quick translation of some of his work with Claude immediately revealed the concept I was looking for. Luckily, I had a German around to validate the translation…
LLMs will have a huge impact on humanities scholarship; we need methods and evals.
Thank you! Have been a big fan of your writing on LLMs over the past couple years. One thing I have been encouraged by over this period is that there are some interesting interdisciplinary conversations starting to happen. Ethan Mollick has been doing a good job as a bridge between people working in different academic fields, IMO.
A basic problem is they're trained on the Internet, and take on all its biases. Ask any of them who proposed edX to MIT or who wrote the platform. You'll get back official PR. Look at a primary source (e.g. public git history or private email records) and you'll get a factual story.
The tendency to reaffirm popular beliefs would make current LLMs almost useless for actual historical work, which often involves sifting fact from fiction.
Couldn’t LLMs cite primary sources much the same way as a textbook or Wikipedia? Which is how you circumvent the biases in textbooks and wikipedia summaries?
This is a showcase of exactly what LLMs are good at.
Handwriting recognition, a classic neural network application, and surfacing information and ideas, however flawed, that one may not have had themselves.
This is really cool. This is AI augmenting human capabilities.
Good read on what someone in a specific field considers to have been achieved (rightly or wrongly). It does lead me to wonder how many of these old manuscripts and their translations are in the training set. That may limit its abilities against any random sample that isn't included.
Then again, maybe not; OCR is one of the most worked on problems, so the quality of parsing characters into text maybe shouldn't be as surprising.
Off topic: it's wild to me that in 2025 sites like substack don't apply `prefers-color-scheme` logic to all their blogs.
The intractable problem, here, is that “LLMs are good historians” is a nearly useless heuristic.
I’m not a historian. I don’t speak old Spanish. I am not a domain expert at all. I can’t do what the author of this post can do: expertly review the work of an LLM in his field.
My expertise is in software testing, and I can report that LLMs sometimes have reasonable testing ideas— but that doesn’t mean they are safe and effective when used for that purpose by an amateur.
Despite what the author writes, I cannot use an LLM to get good information about history.
This is similar to the problem with some of the things people have been doing with o1 and o3. I've seen people share "PhD level" results from them... but if I don't have a PhD myself in that subject it's almost impossible for me to evaluate their output and spot if it makes sense or not.
I get a ton of value out of LLMs as a programmer partly because I have 20+ years of programming experience, so it's trivial for me to spot when they are doing "good" work as opposed to making dumb mistakes.
I can't credibly evaluate their higher level output in other disciplines at all.
> There are, again, a couple errors here: it should be “explicación phisica” [physical explanation] not “poetic explanation” in the first line, for instance.
The image seems to say "phicica" (with a "c"), but that's not Spanish. "ph" is not even a thing in Spanish. "Physical" is "física", at least today, IDK about the 1700's. So, if you try to make sense of it in such a way that you assume a nonsense word is you misreading rather than the writer "miswriting", I can see why it assumes it might say "poética", even though that makes less sense semantically.
Author here, I agree that my read may not be correct either. It’s tough to make out. Although keep in mind that “ph” is used in Latin and Greek (or at least transliterations of Greek into the Roman alphabet), so in an early modern medical context (i.e. one in which it is assumed the reader knows Latin, regardless of the language being used) “ph” is still a plausible start to a word. Early modern spelling in general is famously variable - it's common to see an author spell the same word two different ways in the same text.
> After all (he said, pleadingly) consciousness really is an irreducible interior fortress that refuses to be pinned down by the numeric lens (really, it is!)
I love this line and the “flattening of human complexity into numbers” quote above it. It sums up perfectly how I feel about the whole LLM to AGI hype/debate (even though he’s talking about consciousness).
Everyone who develops a model has to jump through the benchmark hoop which we all use to measure progress but we don’t even have anything approaching a rigorous definition of intelligence. Researchers are chasing benchmarks but it doesn’t feel like we’re getting any closer to true intelligence, just flattening its expression into next token prediction (aka everything is a vector).
Yeah precisely. Ever since the "brain as computer" metaphor was birthed in the 50s-60s the chief line of attack in the effort to make "intelligent" machines has been to continually narrow what we mean by intelligence further and further until we can divest it of any dependence on humanist notions. We have "intelligent" machines today more as a byproduct of our lowering the bar for what constitutes intelligence than by actually producing anything we'd consider remotely capable of the same ingenuity as the average human being.
>> One of the well-known limitations with ChatGPT is that it doesn’t tell you what the relevant sources are that it looked at to generate the text it gives you.
This isn't a limitation, this is critically dangerous. Commercial AI is a centralized, controlled, biased LLM. At what point will someone train it to say something they want people to believe? How can it be trusted?
Consensus based information is still best, and I don't feel LLMs will give us that.
I wonder (hope) that for any given issue, the majority of the internet/the training data, and therefore the model's output, will be fairly near to the truth. Maybe not for every topic, but most.
E.g., the models won't report that unicorns are real because the majority of the internet doesn't report that unicorns are real. Of course, there may be issues (like ghosts?) where the majority of the internet isn't accurate?
But the gist of its argument just seems to be that they don't know fine details of history, and make the same generalized assumptions that humans would make with only a cursory knowledge of a particular topic. This seems unavoidable for a model that compresses a broad swath of human knowledge down to a couple hundred gigabytes.
Using AI as a research tool instead of a fact database is of course a whole different thing.
One thing I'd love is for models to help me confirm a thing, or find the source of something I have a vague memory of and which may be right or wrong; I just don't know.
E.g. I have this recollection of a quote, slightly pithy, from around the 1900s, about hobby clubs controlling social life; maybe from Mark Twain, maybe not.
I just cannot come up with the prompt that gets me the answer, instead I just get hallucination after hallucination, just confirming whatever I put in, like a student who didn't study for the test and is just going along with what the professor is asking at the oral exam.
In my experience, these AI models haven't been great with knowledge about one specific figure (like a President). I wonder if there's a movement to start introducing these AI models to books or e-books that aren't accessible online? I wish I could be able to discuss the less publicly known details of historical figures' lives or upbringings with AI, but it's clear that more niche information that you can only read about isn't available to it.
Still waiting for someone to train an LLM entirely from sources written before a chosen date and be able to discuss concepts with someone apparently lacking any knowledge of the world after that date.
Might work for, say, post-1800s sources in literate countries, but for e.g. Rome our sources are so sparse and so far removed from the time they're writing about that it would be worse than nothing.
"What would have happened if ChatGPT was invented in the 17th century? MonadGPT is a possible answer. MonadGPT is a finetune of Mistral-Hermes 2 on 11,000 early modern texts in English, French and Latin, mostly coming from EEBO and Gallica. Like the original Mistral-Hermes, MonadGPT can be used in conversation mode. It will not only answer in an historical language and style but will use historical and dated references. This is especially visible for science questions (astronomy, medecine). Obviously, it's not recommended to follow any advice from Monad-GPT." Available to install and run locally, or you can try it out for free online.
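As a rough sketch of how such a pre-cutoff training corpus might be assembled before fine-tuning (the record format with `year` and `text` fields is an assumption, not how MonadGPT was actually built):

```python
# Sketch: filter a document collection down to texts published strictly
# before a cutoff year, as a first step toward a "period" fine-tune.

def build_period_corpus(records, cutoff_year):
    """Keep only the text of documents written before cutoff_year."""
    return [r["text"] for r in records if r["year"] < cutoff_year]

corpus = [
    {"year": 1667, "text": "Of Man's First Disobedience..."},
    {"year": 1859, "text": "It was the best of times..."},
    {"year": 1925, "text": "In my younger and more vulnerable years..."},
]

pre_1900 = build_period_corpus(corpus, 1900)
# pre_1900 contains only the 1667 and 1859 texts
```

The hard part, of course, is not the filtering but sourcing reliable publication dates and clean transcriptions in the first place.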
In the 1950s, most people believed that the Soviets made the biggest contribution to stopping the Nazis. However, today, most people think it was actually the Americans who played the biggest role in defeating the Nazis.
> "In 1945, the French public said the Soviets did the most to defeat Nazi Germany - but in 2024 they're most likely to say it was the Americans"[0]
Are there any successful models that weren't trained with RLHF, or with a system that uses RLHF? I'm curious if this could be done without a fine-tune step that would meaningfully bias it.
Normally I balk when commenters go “well they you’re the perfect person to go do it!”, but actually… this is the kind of thing that sounds like it could be a fun project if you’re legit interested. The necessary datasets are likely not hard to gather and collate, a lot of it is probably on places like Project Gutenberg or can be gleaned through OCR of images downloaded from various publicly available archives.
Granted, you’d need to spend about a year on this and for a lot of that time your graphics card (and possibly whole computer) would be unusable, but then if the results were compelling you’d get a cool 15 minutes of internet fame when you posted your results.
yes! There's this measure of historical expertise that involves "eating the brains", so to speak, of the people living back then such that if you time traveled back to a bar or street in [insert period], you could carry on a conversation about events going on in that time :) I would love something that uses newspaper fragments, books, etc. to simulate this experience!
The only reason LLMs “work” is because they are trained on a vast corpus of (text-based) human interactions online. The main reason LLMs weren’t a thing 25 years ago, was because there just wasn’t enough scrapeable and useful data available online…
Reduce the dataset to “knowledge as of year 1880” - and it’s not certain you’d even be able to “interact” with the LLM in any meaningful way…
Now the question is how can I, someone without a PhD in history but currently a PhD candidate in another discipline, use these tools to reliably interrogate topics of interest and produce at least a graduate level understanding of them?
I know this is possible, but the further away I get from my core domains, the harder it is for me to use these tools in a way that doesn’t feel like too much blind faith (even if it works!)
I think the trick here is to treat everything these models tell you as part of a larger information diet.
Like if you have a friend who's very well-read and talkative but is also extremely confident and loves the sound of their own voice. You quickly learn to treat them as a source of probably-correct information, but only part of the way you learn any given topic.
I do this with LLMs all the time: I'm constantly asking them clarifying questions about things, but I always assume that they might be making mistakes or feeding me convincing sounding half-truths or even full hallucinations.
Being good at mixing together information from a variety of sources - of different levels of accuracy - is key to learning anything well.
You ask them for references and check yourself. They are good exploratory and hypothesis-generating tools, but not more. Getting a sensible-sounding answer should not be an excuse to skip confirming it yourself. Often, the devil is in the details.
If you click on the Evaluation links, you can see how you can use multiple LLMs to validate LLM response. The evaluation of the accurate response is interesting since Llama 3.3 was the most critical.
At this point, you would ask Llama to explain why the response was not 100% which you can use to cross reference other LLMs or to do your own research.
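The multi-model cross-check described above can be sketched roughly as follows; `query_model` is a stand-in for whatever real LLM API you use, and the canned answers are purely illustrative:

```python
# Sketch: ask several models the same question, then have each model
# critique the others' answers, as a cheap consistency check.

def query_model(model, prompt):
    # Placeholder: in practice this would call an actual LLM API.
    canned = {
        "llama": "The Treaty of Tordesillas was signed in 1494.",
        "gpt": "The Treaty of Tordesillas was signed in 1494.",
        "claude": "The Treaty of Tordesillas was signed in 1494.",
    }
    return canned[model]

def cross_validate(question, models):
    answers = {m: query_model(m, question) for m in models}
    critiques = {}
    for critic in models:
        others = {m: a for m, a in answers.items() if m != critic}
        critiques[critic] = query_model(
            critic,
            f"Question: {question}\nOther answers: {others}\nCritique these.",
        )
    return answers, critiques
```

Agreement between models is no guarantee of truth (they share training data and biases), but disagreement is a useful flag for where to do your own research.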
> Now the question is how can I, someone without a PhD in history but currently a PhD candidate in another discipline, use these tools to reliably interrogate topics of interest and produce at least a graduate level understanding of them?
You can't, because LLMs are statistical generative text algorithms, dependent upon their training data set and subsequent reinforcement. Think Bayesian statistics.
What you are asking for is "to reliably interrogate topics of interest", which is not what LLMs do. Concepts such as reliability are orthogonal to their purpose.
I'm not sure what good a system that only focuses on targeted truths will ever do humanity; we already live in a world where stats are only valid if they do not offend a single person. The reason AIs are so doctored is that sometimes we just do not want to hear the truth, and we don't.
Interesting perspective. I appreciate that it tests the models at different "layers" of understanding.
I have always felt that LLMs would fall apart beyond summarization. Maybe they can regurgitate someone else's analysis, but the author seems to think there's some level of intelligent creativity at play.
I'm hopeful that the author is right. That truly creative thinking may be beyond the abilities of LLMs and be decades away.
I think the author doesn't consider the implications of broad use of LLM societally. Will people be willing to fund human historian grad students when they can get a LLM for a fraction of the price? Will prospective historians have gained the training necessary if they've used an LLM through all of school?
I believe the education system could figure it out over time.
I'm more worried that LLMs like this will be used as further justification to defund or halt humanities research. Who needs a history department when I can get 80% for the cost of a few chatGPT queries?
The problem is one of trust: it's very difficult to trust the output of LLMs to be correct/true rather than merely "truthy" without extensive verification, which may be as laborious as doing the original research, or may be difficult or impossible without knowledge of internals and sources that may not be available.
I'm no professional historian, but every time I try this kind of thing I'm very disappointed in the results.
A hobby of mine is editing Wikipedia articles about Australian motorsport (yes, I have an odd hobby, sue me).
The vehicles in the premier domestic auto racing category in Australia, the Supercars Championship, are unique to the category. Like NASCAR, they're built on a dedicated space frame chassis with body panels that look like either a Mustang or a Camaro draped over the top.
I'd seen occasional claims on forums that when the organising body was deciding on the design of the current generation of cars, they considered using the "Group GT3" rules that are used for a bunch of racing series around the world (including the German DTM championship, the GT World Challenge events raced across Europe, Asia, and Australia, and the IMSA GTD and GTD Pro categories). If true, it might be an interesting side note to the article about the Supercars Championship.
So I asked Copilot (the paid model) to find articles in motor sport media about this (there are a number of professional online publications that cover the series extensively). It confidently claimed that yes, indeed, there was some interest in using GT3 cars in the Supercars championship, and pointed me to three articles making this case.
The first was an article featuring quotes from the promoter of the DTM series saying what a good idea it was to have a common car across different national series. So the first article was relevant, but didn't actually show that anyone involved in the administration of the Supercars Championship was interested in the idea.
The second and third references were articles about drivers and teams whose core business is the Supercars championship also running cars in the local GT3 championship (while not explicitly mentioned in the article, they do this for a large wad of cash from the rich hobbyists who co-drive and fund most GT3 racing). Copilot's interpretation of the articles was just flat-out wrong.
Yes, this was a sample size of one historical query, but its response was very poor.
This seems to sap the intrigue out of research. But I get it. My impression of academia is antiquated. People have Jobs to do. Capital J. And this is more convenient to them. Even though I think it makes them look sort of dumb. But that’s just me and I’m not an academic anyhow.
While I welcome the rise of parallel shadow institutions as civilization grows spiritlessly utilitarian, the future for common sense looks bleak.
Robert Nozick (in The Examined Life) asked how we would feel if we found out that, say, Beethoven seriously composed music based on a secret formula, which is entirely mechanical and required no effort from him at all.
Would we still appreciate the music in the same way? If not, does our appreciation really stem from the fact that we feel he has also struggled like we do, and nevertheless produced something incredible?
I remember as a very small child watching figure skaters on TV and thinking "that's no big deal". And before I started programming: "it's just logic, all very straightforward". But that was before I first entered an ice rink or centered a div.
Maybe we don't really appreciate something unless we appreciate it is hard in a visceral way.
Are LLMs good historians? Of course not. These types of articles always have some click/rage-bait title declaring AI supremacy at whatever task. I have used ChatGPT 4o to help translate old high German from broadsides printed in the 15-16th centuries into English and it seems to work pretty well. I don't think I'm doing serious ground-breaking research, but I feel like LLMs open doors and expand access to many things that were once completely locked without specialized knowledge.
Someone who knows a lot of history is a history buff.
A historian works with (and may even seek out in musty rooms) primary and secondary sources to produce novel research and interpretation.
An AI is at best limited to ~reading sources that human historians/archivists/librarians have already identified and digitized.
Certainly value to be had here wrt to finding needles in and making sense of already-digitized historical records, but that's more like a research assistant.
If anyone wants to get a sense of what this paleography actually looks like, this is something I wrote about back in 2013 when I was in grad school - https://resobscura.blogspot.com/2013/07/why-does-s-look-like...
Do you know any OCR tools that work on early modern English handwriting?
I'd love to read way more stuff like this. There are plenty of people writing about LLMs from a computer science point of view, but I'm much more interested in hearing from people in fields like this one (academic history) who are vigorously exploring applications of these tools.
I’m working with Neo-Latin texts at the Ritman Library of Hermetic Philosophy in Amsterdam (aka Embassy of the Free Mind).
Most of the library is untranslated Latin. I have a book that was recently professionally translated but it has not yet been published. I’d like to benchmark LLMs against this work by having experts rate preference for human translation vs LLM, at a paragraph level.
I’m also interested in a workflow that can enable much more rapid LLM transcriptions and translations — whereby experts might only need to evaluate randomized pages to create a known error rate that can be improved over time. This can be contrasted to a perfect critical edition.
And, on this topic, just yesterday I tried and failed to find English translations of key works by Gustav Fechner, an early German psychologist. This isn’t obscure—he invented the median and created the field of “empirical aesthetics.” A quick translation of some of his work with Claude immediately revealed concept I was looking for. Luckily, I had a German around to validate the translation…
LLMs will have a huge impact on humanities scholarship; we need methods and evals.
Thank you! Have been a big fan of your writing on LLMs over the past couple years. One thing I have been encouraged by over this period is that there are some interesting interdisciplinary conversations starting to happen. Ethan Mollick has been doing a good job as a bridge between people working in different academic fields, IMO.
A basic problem is they're trained on the Internet, and take on all the biases. Ask any of them so purposed edX to MIT or wrote the platform. You'll get back official PR. Look at a primary source (e.g. public git history or private email records) and you'll get a factual story.
The tendency to reaffirm popular beliefs would make current LLMs almost useless for actual historical work, which often involves sifting fact from fiction.
Couldn’t LLMs cite primary sources much the same way as a textbook or Wikipedia? Which is how you circumvent the biases in textbooks and wikipedia summaries?
This is a showcase of exactly what LLMs are good at.
Handwriting recognition, a classic neural network application, and surfacing information and ideas, however flawed, that one may not have had themselves.
This is really cool. This is AI augmenting human capabilities.
Good read on what someone in a specific field considers to have been achieved (rightly or wrongly). It does lead me to wonder how many of these old manuscripts and their translations are in the training set. That may limit its abilities against any random sample that isn't included.
Then again, maybe not; OCR is one of the most worked on problems, so the quality of parsing characters into text maybe shouldn't be as surprising.
Off topic: it's wild to me that in 2025 sites like substack don't apply `prefers-color-scheme` logic to all their blogs.
The intractable problem, here, is that “LLMs are good historians” is a nearly useless heuristic.
I’m not a historian. I don’t speak old Spanish. I am not a domain expert at all. I can’t do what the author of this post can do: expertly review the work of an LLM in his field.
My expertise is in software testing, and I can report that LLMs sometimes have reasonable testing ideas, but that doesn’t mean they are safe and effective when used for that purpose by an amateur.
Despite what the author writes, I cannot use an LLM to get good information about history.
This is similar to the problem with some of the things people have been doing with o1 and o3. I've seen people share "PhD level" results from them... but if I don't have a PhD myself in that subject it's almost impossible for me to evaluate their output and spot if it makes sense or not.
I get a ton of value out of LLMs as a programmer partly because I have 20+ years of programming experience, so it's trivial for me to spot when they are doing "good" work as opposed to making dumb mistakes.
I can't credibly evaluate their higher level output in other disciplines at all.
You __can__ get good information from an LLM; you just have to backtrack every once in a while when the information turns out to be false.
> explicación poética
> There are, again, a couple errors here: it should be “explicación phisica” [physical explanation] not “poetic explanation” in the first line, for instance.
The image seems to say "phicica" (with a "c"), but that's not Spanish. "ph" is not even a thing in Spanish. "Physical" is "física", at least today, IDK about the 1700's. So, if you try to make sense of it in such a way that you assume a nonsense word is you misreading rather than the writer "miswriting", I can see why it assumes it might say "poética", even though that makes less sense semantically.
Author here, I agree that my read may not be correct either. It’s tough to make out. Although keep in mind that “ph” is used in Latin and Greek (or at least transliterations of Greek into the Roman alphabet) so in an early modern medical context (I.e. one in which it is assumed the reader knows Latin, regardless of the language being used) “ph” is still a plausible start to a word. Early modern spelling in general is famously variable - common to see an author spell the same word two different ways in the same text.
> After all (he said, pleadingly) consciousness really is an irreducible interior fortress that refuses to be pinned down by the numeric lens (really, it is!)
I love this line and the “flattening of human complexity into numbers” quote above it. It sums up perfectly how I feel about the whole LLM to AGI hype/debate (even though he’s talking about consciousness).
Everyone who develops a model has to jump through the benchmark hoop which we all use to measure progress but we don’t even have anything approaching a rigorous definition of intelligence. Researchers are chasing benchmarks but it doesn’t feel like we’re getting any closer to true intelligence, just flattening its expression into next token prediction (aka everything is a vector).
Yeah precisely. Ever since the "brain as computer" metaphor was birthed in the 50s-60s the chief line of attack in the effort to make "intelligent" machines has been to continually narrow what we mean by intelligence further and further until we can divest it of any dependence on humanist notions. We have "intelligent" machines today more as a byproduct of our lowering the bar for what constitutes intelligence than by actually producing anything we'd consider remotely capable of the same ingenuity as the average human being.
I wrote this piece in 2023, which argues similarly that LLMs are a boon, not a threat to historians
https://zwischenzugs.com/2023/12/27/what-i-learned-using-pri...
>> One of the well-known limitations with ChatGPT is that it doesn’t tell you what the relevant sources are that it looked at to generate the text it gives you.
This isn't a limitation, this is critically dangerous. Commercial AI is a centralized, controlled, biased LLM. At what point will someone train it to say something they want people to believe? How can it be trusted?
Consensus based information is still best, and I don't feel LLMs will give us that.
Discussed here!
What I learned using private LLMs to write an undergraduate history essay - https://news.ycombinator.com/item?id=38813297 - Dec 2023 (81 comments)
"LLMs, which are exquisitely well-tuned machines for finding the median viewpoint on a given issue..."
That's an excellent way to put it. It's the default mode of an LLM. You can ask an LLM for biases, and get them, of course.
I don't think there is any reason to believe this except that everyone seems to want it to be true.
An easy way to make it not be true would be to emphasize some sources in pretraining by putting them in the corpus multiple times.
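A toy sketch of what that oversampling might look like during corpus preparation (the names are illustrative; real pretraining pipelines are far more involved, but duplicating documents is genuinely one way sources get upweighted):

```python
def build_weighted_corpus(docs, weights):
    """Repeat each document according to an integer weight so that
    favored sources appear more often in the pretraining stream.

    docs:    dict mapping source name -> list of documents
    weights: dict mapping source name -> repetition count (default 1)
    """
    corpus = []
    for source, texts in docs.items():
        # Sources absent from `weights` appear exactly once
        corpus.extend(texts * weights.get(source, 1))
    return corpus
```

The point is that "median viewpoint" is a property of the corpus as weighted, not of the technology: whoever controls the weights controls the median.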
I wonder (hope) that for any given issue, the majority of the internet/the training data, and therefore the model's output, will be fairly near to the truth. Maybe not for every topic, but most.
E.g., the models won't report that unicorns are real because the majority of the internet doesn't report that unicorns are real. Of course, there may be issues (like ghosts?) where the majority of the internet isn't accurate?
It was pretty neat seeing this because a recent paper found that AI models are bad historians: https://techcrunch.com/2025/01/19/ai-isnt-very-good-at-histo...
But the gist of its argument just seems to be that they don't know fine details of history, and make the same generalized assumptions that humans would make with only a cursory knowledge of a particular topic. This seems unavoidable for a model that compresses a broad swath of human knowledge down to a couple hundred gigabytes.
Using AI as a research tool instead of a fact database is of course a whole different thing.
One thing I'd love is for models to help me confirm a thing, or find the source of something I have a vague memory of and which may be right or wrong; I just don't know.
E.g. I have this recollection of a quote, slightly pithy, from around the 1900s about hobby clubs controlling social life, maybe from Mark Twain, maybe not.
I just cannot come up with the prompt that gets me the answer, instead I just get hallucination after hallucination, just confirming whatever I put in, like a student who didn't study for the test and is just going along with what the professor is asking at the oral exam.
This was so good. I'm super curious to learn more about the strategies used to set up system prompts for the custom GPT that was set up here.
In my experience, these AI models haven't been great with knowledge about one specific figure (like a President). I wonder if there's a movement to start introducing these AI models to books or e-books that aren't accessible online? I wish I could be able to discuss the less publicly known details of historical figures' lives or upbringings with AI, but it's clear that more niche information that you can only read about isn't available to it.
Still waiting for someone to train an LLM entirely from sources written before a chosen date and be able to discuss concepts with someone apparently lacking any knowledge of the world after that date.
Try to get an LLM to admit it doesn't know something, first
Would be fascinating trying to get an LLM trained with 1900 data to discover Einstein physics
might work for say post the 1800's in literate countries, but for e.g. Rome our sources are so sparse and so far removed from the time they're writing about that it would be worse than nothing.
"What would have happened if ChatGPT was invented in the 17th century? MonadGPT is a possible answer. MonadGPT is a finetune of Mistral-Hermes 2 on 11,000 early modern texts in English, French and Latin, mostly coming from EEBO and Gallica. Like the original Mistral-Hermes, MonadGPT can be used in conversation mode. It will not only answer in an historical language and style but will use historical and dated references. This is especially visible for science questions (astronomy, medicine). Obviously, it's not recommended to follow any advice from MonadGPT." It's available to install and run locally, or you can try it out for free online.
https://www.metafilter.com/201537/O-brave-new-world-that-has...
In the 1950s, most people believed that the Soviets made the biggest contribution to stopping the Nazis. However, today, most people think it was actually the Americans who played the biggest role in defeating the Nazis.
> "In 1945, the French public said the Soviets did the most to defeat Nazi Germany - but in 2024 they're most likely to say it was the Americans"[0]
[0] https://yougov.co.uk/politics/articles/49613-d-day-anniversa...
Are there any successful models that weren't trained with RLHF, or within a system that relies on RLHF? I'm curious whether this could be done without a fine-tuning step that would meaningfully bias it.
Normally I balk when commenters go “well they you’re the perfect person to go do it!”, but actually… this is the kind of thing that sounds like it could be a fun project if you’re legit interested. The necessary datasets are likely not hard to gather and collate, a lot of it is probably on places like Project Gutenberg or can be gleaned through OCR of images downloaded from various publicly available archives.
Granted, you’d need to spend about a year on this and for a lot of that time your graphics card (and possibly whole computer) would be unusable, but then if the results were compelling you’d get a cool 15 minutes of internet fame when you posted your results.
yes! There's this measure of historical expertise that involves "eating the brains", so to speak, of the people living back then such that if you time traveled back to a bar or street in [insert period], you could carry on a conversation about events going on in that time :) I would love something that uses newspaper fragments, books, etc. to simulate this experience!
The only reason LLMs “work” is because they are trained on a vast corpus of (text-based) human interactions online. The main reason LLMs weren’t a thing 25 years ago is that there just wasn’t enough scrapeable and useful data available online…
Reduce the dataset to “knowledge as of year 1880” - and it’s not certain you’d even be able to “interact” with the LLM in any meaningful way…
I think I'm slow. Can you explain this again, maybe with more words?
How would one know that the translation of the Italian text (that he gives as an example) was not just already baked into the model’s training data?
Now the question is how can I, someone without a PhD in history but currently a PhD candidate in another discipline, use these tools to reliably interrogate topics of interest and produce at least a graduate level understanding of them?
I know this is possible, but the further away I get from my core domains, the harder it is for me to use these tools in a way that doesn’t feel like too much blind faith (even if it works!)
I think the trick here is to treat everything these models tell you as part of a larger information diet.
Like if you have a friend who's very well-read and talkative but is also extremely confident and loves the sound of their own voice. You quickly learn to treat them as a source of probably-correct information, but only as part of the way you learn any given topic.
I do this with LLMs all the time: I'm constantly asking them clarifying questions about things, but I always assume that they might be making mistakes or feeding me convincing sounding half-truths or even full hallucinations.
Being good at mixing together information from a variety of sources - of different levels of accuracy - is key to learning anything well.
You ask them for references and check yourself. They are good exploratory and hypothesis-generating tools, but not more. Getting a sensible-sounding answer is not an excuse to skip confirming it yourself. Often, the devil is in the details.
> the harder it is for me to use these tools in a way that doesn’t feel like too much blind faith (even if it works!)
I tend to ask multiple models and if they all give me roughly the same answer, then it's probably right.
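That cross-model check could be sketched like this (a toy illustration; `models` is just a list of callables standing in for real API clients, and exact string matching is a crude stand-in for semantic agreement):

```python
from collections import Counter

def cross_check(question, models, threshold=0.6):
    """Ask several models the same question and accept an answer only
    if a sufficient fraction of them agree (after light normalization).

    models: list of callables, each mapping a question string to an answer
    """
    answers = [m(question).strip().lower() for m in models]
    best, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= threshold:
        return best
    return None  # no consensus; fall back to doing your own research
```

Worth remembering the caveat raised elsewhere in the thread: models trained on overlapping data can agree on the same popular misconception, so consensus lowers but does not remove the risk.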
I wrote a chat app built around mistrust for LLM responses. You can see an example here:
https://beta.gitsense.com/?chat=ed907b02-4f03-477f-a5e4-ce9a...
If you click on the Evaluation links, you can see how you can use multiple LLMs to validate LLM response. The evaluation of the accurate response is interesting since Llama 3.3 was the most critical.
https://beta.gitsense.com/?chat=fdfb053d-f0e2-4346-bdfc-7305...
At this point, you would ask Llama to explain why the response was not 100% which you can use to cross reference other LLMs or to do your own research.
> Now the question is how can I, someone without a PhD in history but currently a PhD candidate in another discipline, use these tools to reliably interrogate topics of interest and produce at least a graduate level understanding of them?
You can't, because LLMs are statistical generative text algorithms, dependent on their training data and subsequent reinforcement. Think Bayesian statistics.
What you are asking for is "to reliably interrogate topics of interest," which is not what LLMs do. Concepts such as reliability are orthogonal to their purpose.
I'm not sure what good a system that only focuses on targeted truths will ever do for humanity; we already live in a world where stats are only valid if they do not offend a single person. The reason AIs are so doctored is that sometimes we just do not want to hear the truth, and we don't.
Interesting perspective. I appreciate that it tests the models at different "layers" of understanding.
I have always felt that LLMs would fall apart beyond summarization. Maybe they can regurgitate someone else's analysis. The author seems to think there's some level of intelligent creativity at play.
I'm hopeful that the author is right. That truly creative thinking may be beyond the abilities of LLMs and be decades away.
I think the author doesn't consider the implications of broad use of LLM societally. Will people be willing to fund human historian grad students when they can get a LLM for a fraction of the price? Will prospective historians have gained the training necessary if they've used an LLM through all of school?
I believe the education system could figure it out over time. I'm more worried that LLMs like this will be used as further justification to defund or halt humanities research. Who needs a history department when I can get 80% for the cost of a few chatGPT queries?
Good tools for translations, etc? Sure!
Good historians? Ehhhhhhh.
The problem is one of trust, and it's very difficult to trust the output of LLMs to be correct/true vs "truthy" without extensive verification that may be either as laborious as doing the original research or that may be difficult or impossible without knowledge and understanding of the internals and sources that may not be available.
I'm no professional historian, but every time I try this kind of thing I'm very disappointed in the results.
A hobby of mine is editing Wikipedia articles about Australian motorsport (yes, I have an odd hobby, sue me).
The vehicles in the premier domestic auto racing category in Australia, the Supercars Championship, are unique to the category. Like NASCAR, they're built on a dedicated space frame chassis with body panels that look like either a Mustang or a Camaro draped over the top.
I'd seen occasional claims on forums that when the organising body was deciding on the design of the current generation of cars, they considered using the "Group GT3" rules that are used for a bunch of racing series around the world (including the German DTM championship, the GT World Challenge events raced across Europe, Asia, and Australia, and the IMSA GTD and GTD Pro categories). If true, it might be an interesting side note to the article about the Supercars Championship.
So I asked Copilot (the paid model) to find articles in motor sport media about this (there are a number of professional online publications that cover the series extensively). It confidently claimed that yes, indeed, there was some interest in using GT3 cars in the Supercars championship, and pointed me to three articles making this case.
The first was an article featuring quotes from the promoter of the DTM series saying what a good idea it was to have a common car across different national series. So the first article was relevant, but didn't actually show that anyone involved in the administration of the Supercars Championship was interested in the idea.
The second and third references were articles about drivers and teams whose core business is the Supercars championship also running cars in the local GT3 championship (while not explicitly mentioned in the article, they do this for a large wad of cash from the rich hobbyists who co-drive and fund most GT3 racing). Copilot's interpretation of the articles was just flat-out wrong.
Yes, this was a sample size of one historical query, but its response was very poor.
Wow what an incredibly interesting article. Thank you for sharing.
This also means they'll be excellent at changing history for those who wish history was more aligned with their views.
This seems to sap the intrigue out of research. But I get it. My impression of academia is antiquated. People have Jobs to do. Capital J. And this is more convenient to them. Even though I think it makes them look sort of dumb. But that’s just me and I’m not an academic anyhow.
While I welcome the rise of parallel shadow institutions as civilization grows spiritlessly utilitarian, the future for common sense looks bleak.
On the last point, why struggle with history:
Robert Nozick (in Examined Life) asked how we feel if we found out, say, Beethoven seriously composed music based on a secret formula, which is entire mechanical and required no effort for him at all.
Would we still appreciate the music in the same way? If not, does our appreciation really stem from the fact that we feel he has also struggled like we do, and nevertheless produced something incredible.
I remember as a very small child watching figure skaters on TV and thinking "that's no big deal". And before I started programming: "it's just logic, all very straightforward". But that was before I first entered an ice rink or centre-d a div
Maybe we don't really appreciate something unless we appreciate it is hard in a visceral way.
Are LLMs good historians? Of course not. These types of articles always have some click/rage-bait title declaring AI supremacy at whatever task. I have used ChatGPT 4o to help translate old high German from broadsides printed in the 15th and 16th centuries into English, and it seems to work pretty well. I don't think I'm doing serious ground-breaking research, but I feel like LLMs open doors and expand access to many things that were once completely locked without specialized knowledge.
Someone who knows a lot of history is a history buff.
A historian works with (and may even seek out in musty rooms) primary and secondary sources to produce novel research and interpretation.
An AI is at best limited to ~reading sources that human historians/archivists/librarians have already identified and digitized.
Certainly value to be had here with respect to finding needles in, and making sense of, already-digitized historical records, but that's more like a research assistant.
Brings a whole new meaning to "history is written by the winners"
Yeah, especially the ones from China /s