Thanks for such a cool project! It's immediately apparent how to use it and I appreciate the brief examples.
Quick question: In the breast cancer example from the README, a simple support vector machine from sklearn (incidentally, the first thing I tried in order to compare baseline performance) seems to outperform TabPFN. Is this expected? I know it's a baseline to demonstrate ease of use rather than SOTA performance, but I am curious.
# (TabPFN)
In [13]: print("ROC AUC:", roc_auc_score(y_test, prediction_probabilities[:, 1]))
ROC AUC: 0.996299494264216
# (LinearSVC)
In [27]: from sklearn.svm import LinearSVC
In [28]: clf = LinearSVC(C=0.01).fit(X_train, y_train)
In [29]: roc_auc_score(y_test, clf.decision_function(X_test))
Out[29]: 0.997532996176144
Author here! The breast cancer dataset is simple and heavily saturated, so small differences between methods are expected. As you say, a single example can be noisy because of randomness in how the data is split into training and test sets, especially for a saturated dataset like this one. Cross-validation reduces this variance by averaging over multiple splits. I just ran this below:
TabPFN mean ROC AUC: 0.9973
SVM mean ROC AUC: 0.9903
TabPFN per split: [0.99737963 0.99639699 0.99966931 0.99338624 0.99966465]
SVM per split: [0.99312152 0.98788077 0.99603175 0.98313492 0.99128102]
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from tabpfn import TabPFNClassifier
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target

# TabPFN
tabpfn_clf = TabPFNClassifier()
tabpfn_scores = cross_val_score(tabpfn_clf, X, y, cv=5, scoring='roc_auc')
print("TabPFN per split:", tabpfn_scores)
print("TabPFN mean ROC AUC:", np.mean(tabpfn_scores))

# SVM
svm_clf = LinearSVC(C=0.01)
svm_scores = cross_val_score(svm_clf, X, y, cv=5, scoring='roc_auc')
print("SVM per split:", svm_scores)
print("SVM mean ROC AUC:", np.mean(svm_scores))
It's hard to communicate this properly; we should probably have a more favourable example ready, but we just included the simplest one!
Thanks, this is helpful!
I certainly appreciate how the example in the README makes it instantly apparent how to use the code.
Related: CARTE-AI, which can also deal with multiple tables.
https://soda-inria.github.io/carte/ https://arxiv.org/pdf/2402.16785
The paper includes a comparison to TabPFN v1 (among others), noting the lack of categorical & missing values handling which v2 now seems to have. Would be curious to see an updated comparison.
TabPFN has been better on numerical data since v1 (see Figure 6 in the CARTE paper). CARTE's main strength is on text features, which are now also supported in the TabPFN v2 API version (https://github.com/PriorLabs/tabpfn-client). We compared this to CARTE and found our model to be generally quite a bit better, and much faster. CARTE's multi-table approach is also very interesting, and we want to tackle this setting in the future.
A while back, I was looking for a project amateurs could do for experimenting with Transformer alternatives and optimization algorithms. My concept was grabbing objective/test functions from the literature, making custom ones based on realistic data, and layering them together based on real-world depth. Then training various approaches on them using consumer GPUs or spot instances of high-end GPUs.
What I read in this paper blew that idea out of the water! I mean, it's still doable, but you've far exceeded it.
I love that you covered many types of structures, used 8x consumer GPUs more like OSS folks do (widely accessible pretraining), claim no copyright infringement for pretraining, and use enough techniques in ML that people can enjoy Googling stuff for days.
I do have some questions about what I might have overlooked in the paper.
1. Are the training data and code available to reproduce the model? And to iteratively improve its architectural decisions?
2. Most authors claiming their data was legal or open were actually committing copyright infringement. Your method might dodge that if users generate their own synthetic data using methods they can verify aren’t themselves encumbered. Is that code available under open licensing? If not, would you offer it for a fee for companies or free for researchers?
3. What specific, common uses could amateurs try that would display the model's ability in a business setting? (Both to drive more research and to build products on the model.)
I thank you for your time.
Author here!
Thanks :)
1. Only for the first version, not for this version. I am sorry!
2. Yeah ours is guaranteed ok, as we wrote code to generate it basically just from plain torch ops. The code to run inference is available, just not the training code and data generation.
3. We have put it to work on time series data, which is very business-relevant, for example https://github.com/liam-sbhoo/tabpfn-time-series, and we have a table in the Appendix with all the datasets we evaluate on in our main analysis to give you some ideas for possible datasets.
“Yeah ours is guaranteed ok, as we wrote code to generate it basically just from plain torch ops.”
This is where there might be claims. It already sounds safer than training on copyrighted works. The only thing that could remain is if it was a derivative work by reusing parts of copyrighted works in your process.
So, I’m curious about how you produced the specifications that the data was generated from. In my case, I was going to just use open versions of all kinds of equations that I’d hand-convert to internal representations. Others might be fair use if my description were high level enough that it wasn’t close to theirs. Some I couldn’t use at all because they were patented and independent versions are prohibited by law.
Did you all also derive your causal models from real-world formulas and data sets? If so, did you have a rule about putting distance between your representation and theirs? Or was it an entirely random search process across endless configurations? (I have a hard time imagining the latter would work.)
Related repo: https://github.com/liam-sbhoo/tabpfn-time-series
Wow! Didn't expect the models to do so well on time series as well, will try this out.
Neat! Might this even be useful to impute missing data for a sparse network of votes, for a system like this (pol.is) whose goal is to do dimensional reduction and visualise the opinion space of divisive social topics: https://gwern.net/doc/sociology/2021-small.pdf
200 voters on 50 statements would fall within the 10,000 sample threshold. This is well within the bounds of some existing conversations with open data, so it could be tested... Potential values on each statement are agree/disagree/pass (+1/-1/0)
https://github.com/compdemocracy/openData/blob/master/brexit...
https://github.com/compdemocracy/openData/blob/master/brexit...
Looks like a great use case! We have a method specifically for imputation in the tabpfn-extensions package (https://github.com/PriorLabs/tabpfn-extensions/blob/dbc3f5da...). It needs some cleaning up before I want to highlight it in the notebooks and docs.
> 200 voters on 50 statements would fall within the 10,000 sample threshold.
I think you misinterpreted: 1 voter on 50 statements with (+1/-1/0) would be 1 datapoint with 50 features. 200 voters would be 200 rows with 50 features, so you would not need to be concerned about the 10,000 sample threshold. Hope that helps your study.
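If you want to experiment before that imputer is cleaned up, here is a minimal sketch of the idea using the plain TabPFNClassifier rather than the tabpfn-extensions API: impute the vote matrix one statement at a time, leaning on v2's native missing-value handling for the remaining columns. The 200 x 50 matrix and the 30% missingness below are synthetic stand-ins for the real pol.is votes.
# Minimal sketch (not the tabpfn-extensions imputer): impute missing votes
# column by column, treating each statement in turn as a classification target.
import numpy as np
from tabpfn import TabPFNClassifier

rng = np.random.default_rng(0)
votes = rng.choice([-1, 0, 1], size=(200, 50)).astype(float)  # 200 voters x 50 statements
mask = rng.random(votes.shape) < 0.3                          # ~30% of votes unobserved
votes[mask] = np.nan

imputed = votes.copy()
for j in range(votes.shape[1]):
    missing = np.isnan(votes[:, j])
    if not missing.any() or missing.all():
        continue
    X_other = np.delete(votes, j, axis=1)   # the other 49 statements as features (may contain NaNs)
    clf = TabPFNClassifier()
    clf.fit(X_other[~missing], votes[~missing, j].astype(int))
    imputed[missing, j] = clf.predict(X_other[missing])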
Great work!
Do you see any artifacts from having trained on synthetic data? Is there a natural benchmark dataset (real tables in the wild)?
In my experience synthetic data can only take you so far; it has all the quirks the dataset creator can think of, but the real value is usually in the patterns they cannot. Vision took a huge leap forward with the ImageNet dataset release.
Thanks a lot! We don't see clear artifacts from the synth data. Part of the "trick" is to keep the capacity of our model low: it has only about 11M parameters. That forces the model to "learn an in-context learning algorithm", or in other words to "do in-context learning rather than in-weights learning".
Adding real data on top will help, agreed! The synthetic data is very broad; we started with a synth data prior that was just BNN samples with differing sizes and thus super broad. Our new data prior samples simpler-to-explain functions more densely, but it can still sample almost any function (with the constraint that our networks aren't infinitely complex).
Thanks for sharing this. I will of course watch it closely, because claiming to beat GBDTs might be a bit early.
- It is not entirely clear how the dataset splits are done. Do you make sure that the model is evaluated on unseen data? More generally, how does one know whether a dataset was part of the training or not?
- You mention some serious limitations (10k rows, 500 cols). It seems a bit weird to have fixed numbers. Can these numbers be roughly balanced? (e.g. 1M rows, 5 columns ...) Do these numbers scale with memory? (What memory was used for the 10k rows / 500 cols figure?)
Great work you guys! I have been following discussions on DL vs ML for tabular data for some time now and am very excited to see TabPFN perform so well. I would like to play around with it a bit and am wondering if there is a way to use TabPFN with larger sample sizes, say, 1000000 rows? Can I disable the 10000 sample limitation? I would appreciate a code example if so. Great work again!
Thanks a lot! We currently have an issue on documenting how to use it with more samples at https://github.com/PriorLabs/TabPFN/issues/129. Will do this soon; maybe give it an upvote there if it matters to you.
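In the meantime, a generic workaround (not an official recommendation, just what works for any scikit-learn-style classifier) is to fit on a random subsample of rows so the context stays under the current limit. A rough sketch, assuming numpy arrays; X_big, y_big, and X_new are hypothetical names:
# Generic subsampling workaround: fit TabPFN on at most n_max randomly chosen rows.
import numpy as np
from tabpfn import TabPFNClassifier

def fit_on_subsample(X, y, n_max=10_000, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_max, len(X)), replace=False)
    return TabPFNClassifier().fit(X[idx], y[idx])

# clf = fit_on_subsample(X_big, y_big)
# proba = clf.predict_proba(X_new)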
I tried this on a few CARTE datasets and it works surprisingly better!! Woahhh
This looks amazing!
Just looking through the code a bit, it seems that the model supports a (custom) attention mechanism both between features and between rows (the code uses the term items)? If so, does the attention between rows help improve accuracy significantly?
Generally, for standard regression and classification use cases, rows (observations) are treated as independent, but I'm guessing cross-row attention might help the model see the gestalt of the data in some way that improves accuracy even when the independence assumption holds?
Author here: The newly introduced attention between features made a big impact compared to the first variant of TabPFN. The old model treated every feature position as completely distinct, as if it were fundamentally different to be feature 5 vs feature 15, but actually features are typically more-or-less permutation invariant. So the logic is similar to why a CNN is better for images than an MLP.
Speculating, cross-row might give you information where you are in that row distribution.
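For intuition only, here is a generic PyTorch sketch of the two-axis idea being discussed; it is not the TabPFN implementation, and the embedding size, head count, and shapes are made up. The table is embedded as a (rows, features, d) tensor and attention is applied alternately along the feature axis and the row axis.
# Generic two-axis attention sketch (illustrative, not TabPFN's actual code):
# attend across features within each row, then across rows within each feature.
import torch
import torch.nn as nn

d = 64
feat_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
row_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

x = torch.randn(200, 50, d)      # (rows, features, embedding dim)

# Attention between features: rows act as the batch, features as the sequence.
x = x + feat_attn(x, x, x)[0]

# Attention between rows: transpose so features act as the batch, rows as the sequence.
xt = x.transpose(0, 1)           # (features, rows, d)
xt = xt + row_attn(xt, xt, xt)[0]
x = xt.transpose(0, 1)           # back to (rows, features, d)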
anyone tried this? is this actually overall better than xgboost/catboost?
Benchmark of tabpfn<2 compared to xgboost, lightgbm, and catboost: https://x.com/FrankRHutter/status/1583410845307977733 .. https://news.ycombinator.com/item?id=33486914
Yes, it actually is, but the row and feature limits could be a hindrance.
Amazing results! Beating AutoML with a single model is not easy :)
Could you please explain like I'm five what is doing the trick? You have a model pre-trained on a large set of small datasets and you leverage it to boost performance?
Training is fast, a few seconds, but what is the time needed to compute predictions?
How large is the model?
To put it very simply, the trick is that while the others train a new model for each problem, TabPFN is pre-trained to handle any kind of problem on the fly.
To draw a parallel to NLP: previously people trained a neural network for each kind of text classification they wanted to do, but then LLMs came around that are pre-trained to perform new tasks on the fly. Similarly, TabPFN learns to do new tasks on the fly just from the context (dataset) given.
Training and prediction in these models are by default one and the same, similar to how the prediction of the next token in an LLM is not split into learning from context and then doing the actual prediction. There is a way to split this up, though; then the predictions, I believe, take something like 1/10 s for medium-sized datasets.
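A rough way to see where the time goes, assuming the scikit-learn-style API from the README and the breast cancer data used above; the numbers will vary a lot with hardware and dataset size.
# Rough timing sketch: per the discussion above, the transformer does its work when you
# predict with the context, so fit is expected to be cheap relative to predict_proba.
import time
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()
t0 = time.perf_counter()
clf.fit(X_train, y_train)
t1 = time.perf_counter()
proba = clf.predict_proba(X_test)
t2 = time.perf_counter()
print(f"fit: {t1 - t0:.2f}s  predict_proba: {t2 - t1:.2f}s")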
Congrats on your release. What is the best way to share feedback? I would like to share with you what I believe to be a challenging problem that this may help with.
Thanks a ton! If it's public, please share it in the Discord https://discord.com/channels/1285598202732482621/ > #use-cases (just created!); if not, mail me at noah@priorlabs.ai
getting a weird error, it says "no text channels" ?
If you're predicting on text data, our public models don't handle that; they would encode the text as classes. Our API (https://github.com/PriorLabs/tabpfn-client/) has experimental support.
Found the web interface: https://ux.priorlabs.ai/ Really cool!
Just playing around with regression mode...
... well, it has a positive slope. Let's see what happens if we copy the exact same values in the dataset 10 times first.
Interesting, repeated values give the model a lot more confidence in the known values. The interpolated #4 value is still off by 12%. It does not extrapolate well at all.
Looking forward to trying it on real world data with more features.
Yes! This makes sense from a learning perspective: more samples add additional evidence that the datapoint is actually what you observed - based on one sample the model is closer to a mean regression (which would translate to more balanced class probabilities in classification).
Transformers have trouble counting repeated entries (there was a famous failure case of ChatGPT: asking it to count the number of 1s and 0s in a string). This model has some tricks to solve this.
Were your benchmark methods tuned per dataset or across datasets?
Tuned per dataset
Up to 4 hrs of tuning per dataset / split (10-fold CV)
Did you compare the performance with o1 or Claude 3.5 Sonnet?
Author here! The fundamental challenge is that LLMs like o1 and Claude 3.5 simply aren't built for the unique structure of tabular data. When processing tables through LLMs, the inefficiencies quickly become apparent: tokenizing a 10,000 x 100 table as a sequence, with numerical values as tokens, is massively wasteful.
There's some interesting work on using LLMs for tabular data (TabLLM: https://proceedings.mlr.press/v206/hegselmann23a.html), but this only works for datasets with tens of samples rather than the thousands of rows needed in real-world applications.
What o1 and other LLMs typically do is wrap around existing tabular tools like XGBoost or scikit-learn. While this works, they're ultimately constrained by these tools' limitations. We're taking a fundamentally different approach - building foundation models that natively understand tabular relationships and patterns. Our approach combines the benefits of foundation models with architectures specifically designed for tabular data structures.
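To make the inefficiency concrete, a quick back-of-envelope estimate; the three tokens per cell is an assumed, ballpark figure for a formatted number plus a separator.
# Back-of-envelope: serializing a 10,000 x 100 numeric table into an LLM context.
rows, cols = 10_000, 100
tokens_per_cell = 3                   # assumed: a few tokens per formatted number
total_tokens = rows * cols * tokens_per_cell
print(f"{total_tokens:,} tokens")     # 3,000,000 tokens, far beyond typical context windows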
looks amazing - finally, DL that beats a tuned catboost?
How can you train a tabular foundation model when the tabular features themselves are inherently domain-specific? Is there some kind of preprocessing step beforehand to match the inference-time features with their closest analogues in the training set?
Yes, there are normalizations applied before the features are fed to the neural network. Additionally, the neural network is trained on a very diverse set of artificial datasets.
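As a purely generic illustration of what such a normalization can look like (not necessarily TabPFN's actual preprocessing), per-feature standardization maps features from very different domains onto a comparable scale:
# Generic per-feature standardization sketch (illustrative only).
import numpy as np

def standardize(X):
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std[std == 0] = 1.0               # guard against constant columns
    return (X - mean) / std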
It's probably the same model with the same limitations, released nearly two years ago?
https://arxiv.org/abs/2207.01848
There have been a ton of improvements! Much better performance overall, way larger data size limit (1K-->10K rows, 100-->500 features), regression support, native categorical data and missing values handling, much better support for uninformative or outlier features etc.
No, it is *much* stronger, a different architecture and scales to 10x the number of examples. It can also do regression now, and handle categorical features. Please, have a quick look at the abstract before making such claims.