Congratulations on your launch! But I confess that I am really confused. This sounds exactly like Aider, but closed source and locked into a single LLM API? I just watched you use it, and it looks a lot like Aider too. Why would I use this over Aider?
I've seen people say "you don't have to add files to Codebuff", but Aider tells me when the LLM has requested to see files. I just have to approve it. If that bothers you, it's open source, so you could probably just add a config to always add files when requested.
Aider tends to maintain near "state of the art" including e.g. treesitter, and an actually refined (as in, iterated improvements over time) user experience.
Aider has been refining for 8000 commits since May of 2023. Codebuff "all started" circa Claude Sonnet 3.5.
The story of discovery (e.g. git patch) at best feels like a lack of researching the landscape, since SOTA leaderboards indicate whether a model performs better with whole-file edits or diffs, and Anthropic even cites Aider benchmarks. More cynically, the narrative feels a bit like looking through the things Aider has been doing differently/better and putting them in an origin story, so the feature list might sound less like the “sincerest form of flattery.”
Particularly concerning is the story talking about "seeing" users' coding loops. Perhaps this is a figure of speech. As designed, Codebuff is in the middle of all users' code slinging, so perhaps it isn't.
Checking the Privacy Policy shows it's only about cookies and tracking, not about information privacy or IP protection of any kind.
The Terms of Service say they own any code you post through it and can give it to others:
"However, by posting Content using Service you grant us the right and license to use, modify, publicly perform, publicly display, reproduce, and distribute such Content on and through Service. You agree that this license includes the right for us to make your Content available to other users of Service."
Meaning, the TOS is for a public social-media-type service, not for an intellectual property service.
(Note that in VSCode "cline" can give Aider a run for its money.)
Thanks for your reply! I started Codebuff without being aware of Aider. I actually have not yet tried Aider (though I plan to try it soon!).
It's totally true that a lot of the development of Codebuff is merely me (and Brandon) working through a lot of the problems that Aider already solved! That makes sense.
Partly, my thesis is that if you start after Sonnet 3.5 is out, you design things differently. For example, I started without manual file selection and worked to make it more like an agent that has native access to your environment.
Needless to say, I'm a fan of the work Paul has done on Aider, and I've appreciated the benchmarks and guides he's created and shared publicly. And Cline is also an amazing project which I want to try out soon as well!
With respect to privacy, we have pledged not to store your codebase, and mainly store logs that we use to debug the application. When I talk about seeing users use Codebuff, I mean I literally watched them use it: we've done many in-person user tests, plus the Manifold team has been using Codebuff for a while.
We also intend to release a Privacy Mode, like Cursor has, where we will not store anything at all, not even the logs of your interactions!
It makes sense to be a bit skeptical of Codebuff, since we are so new, but I intend to not let our users down!
Being in the same product space for more than 3 months, I wonder how one can not come across 2 popular open source tools that do more or less the same thing.
Like Aider has 21k stars, and Cline has around 11k stars. Both these product names come up on HN, Reddit frequently.
Curious to know if YC does some research on existing products before backing a new business.
It seems like we're all in our own tech bubbles more and more. Distribution is clearly a tough problem to crack, and no one in this space has really mastered it yet, aside from arguably Github Copilot.
No comment on YC here, but I think it's easy to criticize from the outside. I've personally been impressed by all the peers, group partners, and alumni I've met so far. I'm biased, but I think YC knows what it's doing. Also, YC backs founders, not ideas.
In any YC application, they request a list of competitors and why your product is better! Curious about what OP listed as competitors in the application.
I've also built a similar free and open-source tool gptme (2.5k stars), since the start of last year (GPT-3.5). It has been impossible to ignore the great work done by Aider.
Yup. I recall that at some point maybe a year ago, pretty much every other LLM thread that had people speculating whether some LLM could do X or Y / improved or worsened for Z / etc., would have Paul show up and comment something along the lines, "Actually, I've benchmarked this thoroughly in my work on Aider; here's <link to data and analysis>" or such. Those were usually some of the most insightful comments in the whole thread.
I found those comments, and the work they linked to, especially valuable because it's rare to see advanced work on LLM applications done and talked about in the open. Everyone else doing equivalent work seems to want to make a startup out of it, so all we usually get is weekend hacks that stall out quickly.
It sounds minor that it finds files for you, but if you try it out, you'll see that it's a giant leap in UX and the extra files help it generate better code because it has more examples from your codebase.
But you said you haven't tried Aider, how can you say it's a "leap in UX"?
My own tool `gptme` lets the agent interactively read/collect context too (as does Anthropic in their latest minimal-harness submission to SWE-bench), it's nothing novel.
I did find codebuff a lot easier to install and get started with...usability can make or break a project. Just as a user, I think it's nice to have multiple projects doing the same thing -- exploring more of the solution space.
(I've just played a little bit with aider and codebuff. I've previously tried aider and it always errored out on my code base, but inspired by this comment I tried again, and now it works well.)
Beyond what James highlighted, I personally really like how simple Codebuff is. CLI tools tend to go a bit overboard with options and configurations imo, which is ok if you're just setting them up once or twice. But for a tool I want to rely upon every day for my work, I want them to be as simple as possible, but no simpler.
Have you used Aider extensively? How are you finding it for your coding needs vs IDE-based chats?
The demos I see for these types of tools are always some toy project and don't reflect the day-to-day work I do at all. Do you have any example PRs on larger, more complex projects that have been written with codebuff, and how much of that was human interactive?
The real problem I want someone to solve is helping me with the real niche/challenging portion of a PR, ex: new tiptap extension that can do notebook code eval, migrate legacy auth service off auth0, record and replay API GET requests and replay a % of them as unit tests, etc.
So many of these tools get stuck trying to help me "start" rather than help me "finish" or unblock the current problem I'm at.
I hear you. This is actually a foundational idea for Codebuff. I made it to work within the large-ish codebase of my previous startup, Manifold Markets.
I want the demos to be of real work, but somehow they never seem as cool unless it's a neat front end toy example.
Historically, Pepsi won taste tests and people chose Coke. Because Pepsi is sweeter, so that first sip tastes better. But it's less satisfying—too sweet—to drink a whole can.
The sexy demos don't, in my opinion and experience, win over the engineers and leaders you need. Lil startups, maybe, and engineers that love the flavor of the week. But for solving real, unsexy problems—that's where you'll pull in organizations.
> The sexy demos don't, in my opinion and experience, win over the engineers and leaders you need.
Great point, we're in talks with a company and this exact issue came up. An engineer used Codebuff over a weekend to build a demo app, but the CEO wasn't particularly interested even after he enthusiastically explained what he made. It was only when the engineer later used Codebuff to connect the demo app to their systems that the CEO saw the potential. Figuring out how to help these two stakeholders align with one another will be a key challenge for us as we grow. Thanks for the thought!
> Historically, Pepsi won taste tests and people chose Coke. Because Pepsi is sweeter, so that first sip tastes better. But it's less satisfying—too sweet—to drink a whole can.
As a Pepsi drinker (Though Pepsi Max/Zero), I disagree with this. That's one interpretation, the other is the one Pepsi was gesturing at - that people prefer Coke when knowing it's Coke, because of branding, but with branding removed, prefer Pepsi.
I personally drank Coke Zero for years, always being "unhappy" when a restaurant only had Pepsi, until one day I realized I was actually enjoying the Pepsi more when not thinking about it, and that the only reason I "preferred" Coke was the brand. So I know that this story can also be true, at least on n=1 examples.
Watching the demo it seems like it would be more effective to learn the skills you need rather than using this for a decade.
It takes 5+ seconds just to change one field to dark mode. I don't even want to imagine a situation where I have two fields and I want to explain that I need to change this field and not that field.
I'm not sure who is the target audience for this, people who want to be programmers without learning programming ?
> it seems like it would be more effective to learn the skills you need rather than using this for a decade.
Think of it as a calculator. You do want to be able to do addition, but not necessarily to manually add 4-digit numbers in your head.
> It takes 5+ seconds just to change one field to dark mode
Our current LLMs are way too slow for this. I am chuckling every time someone says "we don't need LLMs to be faster because people can't read faster". Imagine this using Groq with a future model with similar capability level, and taking 0.5 seconds to do this small change.
People need to remember we're at the very beginning of using AI for coding. Of course it's suboptimal for the majority of cases. Unless you believe we're way past half the sigmoid curve on AI improvements (which I don't), consider that this is the worst the AI is ever going to be for coding.
A year ago people were incredulous when told that AI could code. A year before that people would laugh you out of the room. Now we're at the stage where it kinda works, barely, sometimes. I'm bullish on the future.
In every experience I have had with LLMs generating code, they tend to follow the prompt much too closely and produce large amounts of convoluted code that in the end proves not only unnecessary but quite toxic.
Where LLMs shine is in being a personal Stack Overflow: asking a question and having a personalized, specific answer immediately, that uses one's data.
But solving actual, real problems still seems out of reach. And letting them touch my files sounds crazy.
(And yes, ok, maybe I just suck at prompting. But I would need detailed examples to be convinced this approach can work.)
I'm sure your prompting is great! It's just hard because LLMs tend to be very wordy by default. This was something we struggled with for a while, but I think we've done a good job at making Codebuff take a more minimal approach to code edits. Feel free to try it, let me know if it's still too wordy/convoluted for you.
> Do you have any example PRs on larger more complex projects that have been written with codebuff and how much of that was human interactive?
We have a lot of code in production which is AI-written. The important thing is that you need to consciously make a module or project AI-ready. This means that things like modularity and smaller files are even more important than they usually are.
I can't share those PRs, but projects on my profile page are almost entirely AI written (except the https://bashojs.org/ link). Some of them might meet your definition of niche based on the example you provided.
Kind of like "please describe the solution and I will write code to do it".
That's not how programming works.
Writing code and testing it against expectations to get to the solution, that's programming.
FWIW I don't find that I'm losing good engineering habits/thought processes. Codebuff is not at the stage where I'm comfortable accepting its work without reviewing, so I catch bugs it introduces or edge cases it's missed. The main difference for me is the speed at which I can build now. Instead of fussing over exact syntax or which package does what, I can keep my focus on the broader implications of a particular architecture or nuances of components, etc.
I will admit, however, that my context switching has increased a ton, and that's probably not great. I often tell Codebuff to do something, inevitably get distracted with something else, and then come back later barely remembering the original task.
Language is important here. Programming, at its basic definition, is just writing code that programs a machine. Software development or even design/engineering are closer to what you’re referring to.
+1; Ideally I want a tool I don't have to specify the context for. If I can point it via config files at my medium-sized codebase once (~2000 py files; 300k LOC according to `cloc`) then it starts to get actually usable.
Cursor Composer doesn't handle that and seems geared towards a small handful of handpicked files.
Would codebuff be able to handle a proper sized codebase? Or do the models fundamentally not handle that much context?
Yes. Natively, the models are limited to 200k tokens which is on the order of dozens of files, which is way too small.
But Codebuff has a whole preliminary step where it searches your codebase to find relevant files to your query, and only those get added to the coding agent's context.
That's why I think it should work up to medium-large codebases. If the codebase is too large, then our file-finding step will also start to fail.
I would give it a shot on your codebase. I think it should work.
RAG is a well-known technique now, and to paraphrase Emily Bender[1], here are some reasons why it's not a solution.
The code extruded from the LLM is still synthetic code, and likely to contain errors both in the form of extra tokens motivated by the pre-training data for the LLM rather than the input texts AND in the form of omission. It's difficult to detect when the summary you are relying on is actually missing critical information.
Even if the set up includes the links to the retrieved documents, the presence of the generated code discourages users from actually drilling down and reading them.
This is still a framing that says: Your question has an answer, and the computer can give it to you.
We actually don't use RAG! It's not that good, as you say.
We build a description of the codebase including the file tree and parsed function names and class names, and then just ask Haiku which files are relevant!
This works much better and doesn't require slowly creating an index. You can just run Codebuff in any directory and it works.
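Not our exact code, but a minimal sketch of that kind of file-finding step might look something like this (the model name, prompt format, and JSON-array convention here are just illustrative):

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Hypothetical sketch of a file-finding pass, not Codebuff's actual implementation.
// `codebaseSummary` is assumed to be a pre-built string containing the file tree
// plus parsed function/class names for each file.
async function findRelevantFiles(
  codebaseSummary: string,
  userRequest: string
): Promise<string[]> {
  const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

  const response = await client.messages.create({
    model: "claude-3-5-haiku-latest", // a small, fast model for the cheap preliminary pass
    max_tokens: 512,
    messages: [
      {
        role: "user",
        content: [
          "Here is a summary of a codebase (file tree plus function/class names):",
          codebaseSummary,
          `User request: ${userRequest}`,
          "Reply with a JSON array of the file paths most relevant to this request.",
        ].join("\n\n"),
      },
    ],
  });

  // Take the first text block and parse the JSON array of paths out of it.
  const text = response.content.find((b) => b.type === "text");
  return text && text.type === "text" ? JSON.parse(text.text) : [];
}
```

Only the files picked in this step get loaded into the main coding model's context.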
It sounds like it's arguably still a form of RAG, just where the retrieval is very different. I'm not saying that to knock your approach, just saying that it sounds like it's still the case where you're retrieving some context and then using that context to augment further generation. (I get that's definitely not what people think of when you say RAG though.)
Genuine question: at what point does the term RAG lose its meaning? Seems like LLMs work best when they have the right context, and that context must be pulled from somewhere for the LLM. But if that's RAG, then what isn't? Do you have a take on this? Been struggling to frame all this in my head, so would love some insight.
RAG is a search step in an attempt to put relevant context into a prompt before performing inference. You are “augmenting” the prompt by “retrieving” information from a data set before giving it to an LLM to “generate” a response. The data set may be the internet, or a code base, or text files. The typical examples online uses an embedding model and a vector database for the search step, but doing a web query before inference is also RAG. Perplexity.ai is a RAG (but fairly good quality). I would argue that Codebuff’s directory tree search to find relevant files is a search step. It’s not the same as a similarity search on vector embeddings, and it’s not PageRank, but it is a search step.
Things that aren’t RAG, but are also ways to get a LLM to “know” things that it didn’t know prior:
1. Fine-tuning with your custom training data, since it modifies the model weights instead of adding context.
2. LoRA with your custom training data, since it adds a few layers on top of a foundation model.
3. Stuffing all your context into the prompt, since there is no search step being performed.
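To make the distinction concrete, here's a toy sketch of the retrieval step (my own illustration, not any particular product): the "R" in RAG is just some search that decides which context gets put in the prompt, versus item 3 above where everything goes in unconditionally.

```typescript
// Toy illustration of the retrieval step in RAG: score documents against the
// query (here by naive keyword overlap) and put only the top hits in the prompt.
type Doc = { path: string; text: string };

function retrieve(query: string, docs: Doc[], k = 3): Doc[] {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  const scored = docs.map((doc) => ({
    doc,
    score: terms.filter((t) => doc.text.toLowerCase().includes(t)).length,
  }));
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((s) => s.doc);
}

function buildAugmentedPrompt(query: string, docs: Doc[]): string {
  const context = retrieve(query, docs)
    .map((d) => `--- ${d.path} ---\n${d.text}`)
    .join("\n\n");
  return `Use the following context to answer.\n\n${context}\n\nQuestion: ${query}`;
}

// Context stuffing (item 3 above) would skip retrieve() and concatenate every doc.
```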
Gotcha – so broadly encompasses how we give external context to the LLM. Appreciate the extra note about vector databases, that's where I've heard this term used most, but I'm glad to know it extends beyond that. Thanks for explaining!
I think parsimo2010 gave a good definition. If you're pulling context from somewhere using some search process to include as input to the LLM, I would call that RAG.
So I would not consider something like using a system prompt (which does add context, but does not involve search) to be RAG. Also, using an LLM to generate search terms before returning query results would not be RAG, because the output of the search is not input to the LLM.
I would also probably not categorize a system similar to Codebuff that just adds the entire repository as context to be RAG since there's not really a search process involved. I could see that being a bit of a grey area though.
> We build a description of the codebase including the file tree and parsed function names and class names
This sounds like RAG and also that you’re building an index? Did you just mean that you’re not using vector search over embeddings for the retrieval part, or have I missed something fundamental here?
I'm currently working on a demonstration/POC system using my ElasticSearch as my content source, generating embeddings from that content, and passing them to my local LLM.
It would be cool to be talking to other people about the RAG systems they’re building. I’m working in a silo at the moment, and pretty sure that I’m reinventing a lot of techniques
I didn't mean to be down on it, and I'm really glad it's working well! If you start to reach the limits of what you can achieve with your current approach, there are lots of cute tricks you can steal from RAG, e.g. nothing stopping you from doing a fuzzy keyword search for interesting-looking identifiers on larger codebases rather than giving the LLM the whole thing in-prompt.
I'll need to get approval to use this on that codebase. I've tried it out on a smaller open-source codebase as a first step.
For anyone interested:
- here's the Codebuff session: https://gist.github.com/craigds/b51bbd1aa19f2725c8276c5ad36947e2
- The result was this PR: https://github.com/koordinates/kart/pull/1011
It required a bit of back and forth to produce a relatively small change, and I think it was a bit too narrow with the files it selected (it missed updating the implementations of a method in some subclasses, since it didn't look at those files)
So I'm not sure if this saved me time, but it's nevertheless promising! I'm looking forward to what it will be capable of in 6mo.
What's the fundamental limitation to context size here? Why can't a model be fine-tuned per codebase, taking the entire code into context (and be continuously trained as it's updated)?
Forgive my naivety, I don't know anything about LLMs.
It's pretty good for complex projects imo because codebuff can understand the structure of your codebase and which files to change to implement changes. It still struggles when there isn't good documentation, but it has helped me finish a number of projects
One cool thing you can do is ask Codebuff to create these docs. In fact, we recommend it.
Codebuff natively reads any files ending in "knowledge.md", so you can add any extra info you want it to know to these files.
For example, to make sure Codebuff creates new endpoints properly, I wrote a short guide with an example on the three files you need to update, and put it in backend/api/knowledge.md. After that, Codebuff always creates new endpoints correctly!
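To give a flavor, a hypothetical backend/api/knowledge.md might look something like this (the paths and steps here are made up for illustration):

```markdown
# Adding a new API endpoint

To add an endpoint, update these three files:

1. `backend/api/routes.ts` – register the route and point it at a handler.
2. `backend/api/handlers/<name>.ts` – implement the handler, following the shape of the existing handlers.
3. `common/api-types.ts` – add the request/response types so the frontend client stays in sync.

Keep handlers small and put shared logic in `backend/api/lib/`.
```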
you can put the information into knowledge.md or [description].knowledge.md, but sometimes I can't find documentation and we're both learning as we go lmao
Absolutely! Imagine setting a bunch of CSS styles through a long-winded AI conversation, when you could have an IDE do it in a few seconds. I don't need that.
The long tail of niche engineering problems is the time consuming bit now. That's not being solved at all, IMHO.
"On the checkout page at the very bottom there are two buttons that are visible when the user chooses to select fast shipping. The right one of those buttons should be a tiny bit more round and it seems like it's not 100% vertically aligned with the other button."
Takes a lot longer to write than just diving into the code. I think that's what they meant.
Great question – we struggled for a long time to put our demo together precisely for this reason. Codebuff is so useful in a practical setting, but we can't bore the audience with a ton of background on a codebase when we do demos, so we have to pick a toy project. Maybe in the future, we could start our demo with a half-built project?
Hopefully the demo on our homepage shows a little bit more of your day-to-day workflows than other codegen tools show, but we're all ears on ways to improve this!
To give a concrete example of usefulness, I was implementing a referrals feature in Drizzle a few weeks ago, and Codebuff was able to build out the cli app, frontend, backend, and set up db schema (under my supervision, of course!) because of its deep understanding of our codebase. Building the feature properly requires knowing how our systems intersect with one another and the right abstraction at each point. I was able to bounce back and forth with it to build this out. It felt akin to working with a great junior engineer, tbh!
If you're not worried about showing off little hints of your own codebase, record running it on one of your day to day engineering tasks. It's perfect dog fooding and would be a fun meta example.
> To give a concrete example of usefulness, I was implementing a referrals feature in Drizzle a few weeks ago, and Codebuff was able to build out the cli app, frontend, backend, and set up db schema
Record this!
Better yet, stream it on Twitch and/or YouTube and/or Discord and build a small community of followers.
Great idea! We were kicking around something like this, but couldn't get it ready in time for today's launch – but keep your eyes peeled! Our Discord community is a great place to stay up to date.
Yup, I had the same thought. I just ran into an issue during today's launch and used Codebuff to help me resolve it: https://www.tella.tv/video/solving-website-slowdown-with-ai-.... Next time, I'll try to record before I start working, but it's hard to remember sometimes.
I'm not paying $20 for my ssh keys and the rest of the clipboard to be sent to multiple unknown 3rd parties, thanks, not for me.
I would, however, pay for actual software that I can just buy instead of rent to do the task of inline shell assistance, without making network calls behind my back that I'm not in complete, perfectionist, one hundred point zero zero percent control of.
Sorry, just my opinion in general with these types of products. If you don't have the skills to make a fully self-contained language-model type of product or something to do this, then you are not a skilled enough team for me to trust with my work shell.
The main issue is the models you need to do the job are too big for most consumers, even people with nice video cards. You'll need a couple L40S GPUs at minimum. Maybe a few H100s more realistically.
So do you want to buy tens of thousands of dollars in GPUs or do you want to rent them second-by-second? Most people will choose the latter. I understand you don't trust the infrastructure and that's reasonable. If self-hosting was viable it would be more popular.
It's become my go-to tool for handling fiddly refactors. Here’s an example session from a Rust project where I used it to break a single file into a module directory.
Yes, this is a good point. I think not asking to run commands is maybe the most controversial choice we've made so far.
The reason we don't ask for human review is simply: we've found that it works fine to not ask.
We've had a few hundred users so far, and usually people are skeptical of this at first, but as they use it they find that they don't want it to ask for every command. It enables cool use cases where Codebuff can iterate by running tests, seeing the error, attempting a fix, and running them again.
If you use source control like git, I also think that it's very hard for things to go wrong. Even if it ran rm -rf from your project directory, you should be able to undo that.
But here's the other thing: it won't do that. Claude is trained to be careful about this stuff and we've further prompted it to be careful.
I think not asking to run commands is the future of coding agents, so I hope you will at least entertain this idea. It's ok if you don't want to trust it, we're not asking you to do anything you are uncomfortable with.
I am not afraid of rm -rf on the whole directory. I am afraid of other stuff that it can do to my machines: leak my ssh keys, cookies, personal data, network devices, and make persistent modifications (malware) to my system. Or maybe inadvertently mess with my python version, or globally install some library and mess up the whole system.
I, as a well-intending human, have run commands that broke my local python install. At least I was vaguely aware of what I did, and was able to fix things. If I didn't know what had happened I'd be pretty lost.
It's mainly from experience. From when I set it up, there was no feature to ask whether to run commands. It has been rawdogging commands this whole time and it has never been a problem for me.
I think we have many other users who are similar. To be fair, sometimes after watching it install packages with npm, people are surprised and say that they would have preferred that it asked. But usually this is just the initial reaction. I'm pretty confident this is the way forward.
Do you have any sandbox-like restrictions in place to ensure that commands are limited to only touching the project folder not any other places in the system?
You can use pledge[1] to restrict the tool to read/write only in specific directories, or only use certain system calls. This is easier to run than from a container or VM, but can be a bit fiddly to setup at first.
Assuming you trust it with the files in your codebase, and them being shared with third parties. Which is a hard pill to swallow for a proprietary program.
It's strange that all the closed models, whose stated reason for being closed is safety, allow this, while banning the apps that allow erotic roleplay all the time. Roleplay is significantly less dangerous than full shell control.
One is that I think it is simpler for the end user to not have to add their own keys. It allows them to start for free and is less friction overall.
Another reason is that it allows us to use whichever models we think are best. Right now we just use Anthropic and OpenAI, but we are in talks with another startup to use their rewriting model. Previously, we have used our own fine-tuned model for one step, and that would be hard to do with just API keys.
The last reason that might be unpopular is that keeping it closed source and not allowing you to bring your keys means we can charge more money. Charging money for your product is good because then we can invest more energy and effort to make it even better. This is actually beneficial to you, the end user, because we can invest in making the product good. Capitalism works, cheers.
Per your last question, I do advise you to use git so that you can always revert to your old file state! Codebuff does have a native "undo" command as well.
None of them are owned by the creator of Codebuff. Why not create something to replace or be at the same level as those? Also, who is "we"? I don't use or like any of these.
Codebuff is the easiest to use of all these, because you just chat, and it finds all the right files to edit. There's no clicking and you don't have to confirm edits.
It is also a true agent. It can run terminal commands to aid the request. For one request it could:
1. Write a unit test
2. Run the test
3. Edit code to fix the error
4. Run it again and see it pass
If you try out Codebuff, I think you'll see why it's unique!
Can codebuff handle larger files (6,000 loc) and find the right classes/functions in that code, or if it finds the file with the necessary info does it load the entire file in?
I think it would handle the giant file, but it would definitely pull the whole thing into context.
We are doing some tricks so it should be able to edit the file without rewriting it, but occasionally that fails and we fall back to rewriting it all, which may time out on such a file.
one size does not fit them all, and such tools are quite straightforward to develop. I even created one myself: https://npmjs.com/genaicode, I started with CLI tool, but eventually learned that such UX is not good (even interactive CLI), and I created a web UI.
where the problems start: cost of inference vs quality, latency, multi modality (vision + imagen), ai service provider issues (morning hours in US time zones = poor quality results)
the best part is being able to adjust it to my work style
Fundamentally, I think codegen is a pretty new space and lots of people are jumping in because they see the promise. Remains to be seen what the consolidation looks like. With the rate of advancement in LLMs and codegen in particular, I wouldn't be surprised to see even more tools than we do now...
And they are all converging towards one use case and targeting one consumer base. It's still unclear for consumers, at least based on this thread, what differentiates them.
Quality of code wise, is it worse or better than Cursor? I pay for Cursor now and it saves me a LOT of time to not copy files around. I actually still use the chatGPT/claude interfaces to code as well.
If it's the same, it's hard to justify $100/month vs $20/month. I code mostly from vim, so I'm searching for my vim/cli replacement while I still use both vim and Cursor.
That sounds cool and I like the idea, but definitely won't pay 5x. Maybe charge $30/month plus bring your own key. Let me know when you lower the price :)
It might sound small, but pulling in more context can make a huge difference – I remember one time Cursor completely hallucinated Prisma as part of our tech stack and created a whole new schema for us, whereas Codebuff knew we were already hooked up to Drizzle and just modified our existing schema. But like James said, we do use more tokens to do this, so pros & cons.
Sounds pretty interesting, I was thinking that would be the way to work past limited context window sizes automatically.
> Codebuff has limited free usage, but if you like it you can pay $99/mo to get more credits...
> One user racked up a $500 bill...
Those two statements are kind of confusing together. Past the free tier, what does $99/month get you? It sounds like there's some sort of credit, but that's not discussed at all here. How much did this customer do to get to that kind of bill? I get that they built a flutter app, but did it take a hour to run up a $500 bill? 6 hours? a whole weekend? Is there a way to set a limit?
The ability to rack up an unreasonable bill by accident, even just conceptually, is a non-starter for many. This is interactive so it's not as bad as accidentally leaving a GPU EC2 instance on overnight, but I'll note that Aider shows per query and session costs.
Ah, good catch. We (read: I) introduced a bug that gave customers $500 worth of credits in a month as opposed to the $100 they paid for, and this user saw that and took advantage. Needless to say, we fixed the issue and added more copy to show to users when they _do_ exceed their thresholds. We haven't had that issue since. And of course, we ate the cost because it was our (expensive!) mistake.
The user had spent the entire weekend developing the app, and admitted that he would have been more careful to manage his Codebuff usage had it not been for this bug.
We're open to adding hard limits to accounts, so you're never charged beyond the credits you paid for. We just wanted to make sure people could pay a bit more to get to a good stopping point once they passed their limits.
> The user had spent the entire weekend developing the app, and admitted that he would have been more careful to manage his Codebuff usage had it not been for this bug.
On the flip side, there's probably useful things to learn from how he developed his app when he didn't feel the need to be careful; in a way, your $500 mistake bought you useful test data.
In my own use of Aider, I noticed I'm always worried about the costs and pay close attention to token/cost summaries it displays. Being always on my mind, this affects my use of this tool in ways I'm only beginning to uncover. Same would likely be true for Codebuff users.
Oh I see. It's an interesting story and thanks for the transparency. I might leave that out of the pitch, though: it's confusing, the thought of running up a $500 bill is scary, and since the user ultimately didn't pay for it, it seems like noise.
Have you considered a bring your own api key model?
Have an upvote! I've been trying it out, and it's quite nice. What I like about this vs CoPilot and Cursor is that I feel like (especially with CoPilot) I'm always "racing" the editor. Also, Cursor conflicts with some longstanding keybindings I have, vs this, which is just the terminal. Having worked on a similar system before, I know it's difficult to implement some of these things, but I am concerned about security. For instance, how well does it handle sensitive files like .env or gitignored files? At some point an audit, given that you're closed source, would go a long way.
Thanks! Yup, so long as you don't have key bindings for meta-x and meta-c, you ought to be good in the terminal. We honor any repository's .gitignore and won't touch those files. We don't even let Codebuff know they exist, which has caused some issues with hallucinations in the past.
Making sure that our word is trustworthy to the broader world at large is going to be a big challenge for us. Do you have any ideas for what we can do? We're starting to think about open source, but we aren't quite ready for that yet.
Very excited for codebuff, it's been a huge productivity boost for me! I've been putting it to use on a monorepo that has Go, Typescript, terraform and some SQL, and it always looks at the right files for the task. I like the UX way better than cursor - I like reviewing all changes at once and making minor tweaks when necessary. Especially for writing Go, I love being able to stick with the GoLand IDE while using codebuff.
Thanks for being one of our early users and dealing with our bugs! I love that we can fit into so many developers' workflows and support them where _they are_, as opposed to forcing them to use us for everything.
I've been using Codebuff (formerly manicode) for a few weeks. I think they have nailed the editing paradigm and I'm using it multiple times a day.
If you want to make a multi-file edit in cursor, you open composer, probably have to click to start a new composer session, type what you want, tell it which files it needs to include, watch it run through the change (seeing only an abbreviated version of the changes it makes), click apply all, then have to go and actually look at the real diff.
With codebuff, you open codebuff in terminal and just type what you want, and it will scan the whole directory to figure out which files to include. Then you can see the whole diff. It's way cleaner and faster for making large changes. Because it can run terminal commands, it's also really good at cleaning up after itself, e.g., removing files, renaming files, installing dependencies, etc.
Both tools need work in terms of reliability, but the workflow with Codebuff is 10x better.
I gave this a spin, this is the best iteration I've seen of a CLI agent, or just best agent period actually. Extremely impressed with how well it did making some modifications to my fairly complex 10,000 LOC codebase, with minimal instruction. Will gladly pay $99/mo when I run out of credits if it keeps up this level.
What if you have a microservice system with a repo-per-service setup, where to add functionality to a FE site you would have to edit code in three or four specific repos (FE site repo + backend service repo + API-client npm package repo + API gateway repo) out of hundreds of total repos?
Codebuff works on a local directory level, so it technically doesn't have to be a monorepo (though full disclaimer: our codebase is a monorepo and that's where we use it most). The most important thing is to make sure you have the projects in the same root directory so you can access them together. I've used it in a setup with two different repos in the same folder. That said, it might warn you that there's no .git folder at the root level when this happens.
Oh, it was a tool call I originally implemented so that Codebuff could look up the probabilities of markets to help it answer user questions.
I thought it would be fun if you asked it about the chance of the election or maybe something about AI capabilities, it could back up the answer by citing a prediction market.
How is this different from Qodo?
Why isn’t it mentioned as a competitor?
I have a hard time figuring out what codebuff brings to the table that hasn't been done before, other than being YC backed. I think to win in this massively competitive and fast-moving market, you really have to put forward something significantly better than an expensive cobbled-together script replicating OSS solutions…
I know this sounds harsh, but believe me, differentiation makes or breaks you sooner than later. Proper differentiation doesn’t have to be hard, it just needs to answer the question what you offer that I can’t get anywhere else at a similar price point. Right now, your offer is more expensive for basically something I get elsewhere better for 1/5 the price…
I’m seriously worried whether your venture will be around in one or two years from now without a more convincing value prop.
From my experience of leaning more into full end-to-end AI workflows building Rust, it seems that
1) context has clearly won over RAG. There is no way back.
2) workflow is the next obvious evolution and gets you an extra mile
3) adversarial GAN training seems a path forward to get from just-okay generated code to something close to a home run on the first try
4) generating a style guide based on the entire code base and feeding that style guide together with the task and context into the LLM is your ticket to enterprise customers, because no matter how good your stuff might be, if the generated code doesn't fit the mold you are not part of the conversation. Conversely, if you deliver code in the same style and formatting and it actually works, well, price doesn't matter much.
5) in terms of marketing to developers, I suggest starting by listening to their pain points working with existing AI tools. I don't have a single one of the problems you try to solve. I'm sitting over a massive Rust monorepo and I've seen virtually every existing AI coding assistant fail one way or another. The one I have now works miracles half the time and only fails the other half. That is already a massive improvement compared to everything else I tried over the past four years.
Point is, there is a massive need for coding assistance on complex systems and for CodeBuff to make a dime of a difference, you have to differentiate from what’s out there by starting with the challenges engineers face today.
Yes, but did you try it? I think Codebuff is by far the easiest to use and may also be more effective in your large codebase than any other comparable tool (i.e. like Cursor composer, Aider, Cline. Not sure about Qodo) because it is better at finding the appropriate files.
Re: style guide. We encourage you to write up `knowledge.md` files which are included in every prompt. You can specify styles or other guidelines to follow in your codebase. One motivating example is we wrote in instructions of how to add an endpoint (edit these three files), and that made it do the right thing when asked to create an endpoint.
Hah no, you're not alone! Candidly, this is one of the top complaints from users. We're doing a lot of prompt engineering to be safe, but we can definitely do more. But the ones who take the leap of faith have found that it speeds up their workflows tremendously. Codebuff installs the right packages, sets up environments correctly, runs their scripts, etc. It feels magical because you can stay at a high level and focus on the real problems you're trying to solve.
If you're nervous about this, I'd suggest throwing Codebuff in a Docker container or even a separate instance with just your codebase.
What do you think about having Codebuff write a parser for JavaScript? Something specifically built to enhance itself, going beyond the regular parsers to create a more useful structure of the codebase that can then be used for RAG for code writing?
This would be doubly useful: a great demo for your product as well as an intrinsic enhancement to the product itself. For example, the new parser could not only build the syntax tree but also provide relevant commentary for each method describing what it does, to better pick code context.
Congrats on the launch guys! Tried the product early on and it’s clearly improved a ton. I’m still using Cursor every day mainly because of how complete the feature set is - autocomplete, command K, highlight a function and ask questions about it, and command L / command shift L. I am not sure what it’ll take for me to switch - maybe I’m not an ideal user somehow… I’m working in a relatively simple codebase with few collaborators?
I’m curious what exactly people say causes them to make the switch from Cursor to Codebuff? Or do people just use both?
Sweet. Personally, I use both Cursor and Codebuff.
I open the terminal panel at the bottom of the Cursor window, start up `codebuff`, and voila, I have an upgraded version of Cursor Compose!
Depending on what exactly I'm implementing I rely more on codebuff or do more manual coding in Cursor. For manual coding, I mostly just use the tab autocomplete. That's their best feature IMO.
But codebuff is very useful for starting features out if I brain dump what I want and then go fix it up. Or, writing tests or scripts. Or refactoring. Or integrating a new api.
As codebuff has gotten better, I've found it useful in more cases. If I'm implementing a lot of web UI, I can nearly stop looking at the code altogether and just keep prompting it until it works.
Hopefully that gives you some idea of how you could use codebuff in your day-to-day development.
I've been using Codebuff for the last few weeks, and it's been really nice for working in my Elixir repo. And as someone who uses Neovim in the terminal instead of VS Code, it's nice to actually be able to have it live in the tmux split beside Neovim instead of having to switch to a different editor.
I have noticed some small oddities, like every now and then it will remove the existing contents of a module when adding a new function, but between a quick glance over the changes using the diff command and our standard CI suite, it's always pretty easy to catch and fix.
Any specific reason to choose the terminal as the interface? Do you plan to make it more extensible in the future? (sounds like this could be wrapped with an extension for any IDE, which is exciting)
Also, do you see it being a problem that you can't point it to specific lines of code? In Cursor you can select some lines and CMD+K to instruct an edit. This takes away that fidelity, is it because you suspect models will get good enough to not require that level of handholding?
Do you plan to benchmark this with swe-bench etc.?
We thought about making a VSCode extension/fork like everyone else, but decided that the future is coding agents that do most of the work for you.
The terminal is actually a great interface because it is so simple. It keeps the product focused to not have complex UI options. But also, we rarely thought we needed any options. It's enough to let the user say what they want in chat.
You can't point to specific lines, but Codebuff is really good at finding the right spot.
I actually still use Cursor to edit individual files because I feel it is better when you are manually coding and want to change just one thing there.
We do plan to do the SWE bench. It's mostly the new Sonnet 3.5 under the hood making the edits, so it should do about as well as Anthropic's benchmark for that, which is really high, 49%: https://www.anthropic.com/news/3-5-models-and-computer-use
Fun fact is that the new Sonnet was given two tools to do code edits and run terminal commands to reach this high score. That's pretty much what Codebuff does.
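For flavor, a stripped-down version of "give Sonnet an edit tool and a terminal tool" might look roughly like this with the Anthropic SDK (the tool names and schemas here are just illustrative, not Anthropic's exact harness or ours):

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Illustrative only: two tools in the spirit of "edit a file" and "run a command".
const tools: Anthropic.Tool[] = [
  {
    name: "edit_file",
    description: "Overwrite a file in the repository with new contents.",
    input_schema: {
      type: "object",
      properties: {
        path: { type: "string", description: "Path of the file to write" },
        contents: { type: "string", description: "New file contents" },
      },
      required: ["path", "contents"],
    },
  },
  {
    name: "run_terminal_command",
    description: "Run a shell command in the project directory and return its output.",
    input_schema: {
      type: "object",
      properties: {
        command: { type: "string", description: "Command to execute" },
      },
      required: ["command"],
    },
  },
];

async function step(client: Anthropic, messages: Anthropic.MessageParam[]) {
  const response = await client.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 4096,
    tools,
    messages,
  });
  // When stop_reason === "tool_use", execute the requested tool calls,
  // append their results as tool_result blocks, and call step() again.
  return response;
}
```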
To add on, I know a lot of people see the terminal as cruft/legacy from the mainframe days. But it is a funny thing to look at tons of people's IDE setup and see that the one _consistent_ thing between them all is that they have a terminal nearby. It makes sense, too, since programs run in the terminal and you can only abstract so much to developers. And like James said, this sets us up nicely to build for a future of coding agents running around. Feels like a unique insight, but I dunno. I guess time will tell.
> I know a lot of people see the terminal as cruft/legacy from the mainframe days.
Hah. If you encounter people that think like this, run away because as soon as they finish telling you that terminals are stupid they inevitably want help configuring their GUI for k8s or git. After that, with or without a GUI, it turns out they also don’t understand version control / containers.
Congrats on the launch! I tried this on a migration project I'm working on (which involves a lot of rote refactoring) and it worked very well. I think you've nailed the ergonomics for terminal-based operations on the codebase.
I've been using Zed editor as my primary workhorse, and I can see codebuff as a helper CLI when I need to work. I'm not sure if a CLI-only interface outside my editor is the right UX for me to generate/edit code — but this is perfect for refactors.
Amazing, glad it worked well for you! I main VSCode but tried Zed in my demo video and loved the smoothness of it.
Totally understand where you're coming from, I personally use it in a terminal tab (amongst many) in any IDE I'm using. But I've been surprised to see how different many developers' workflows are from one another. Some people use it in a dedicated terminal window, others have a vim-based setup, etc.
> I fine-tuned GPT-4o to turn Claude's sketch of changes into a git patch, which would add and remove lines to make the edits. I only finished generating the training data late at night, and the fine-tuning job ran as I slept
Could you say more about this? What was the entirety of your training data, exactly, and how did the sketch of changes and git patch play into that?
Sure! I git cloned some open source projects, and wrote a script (with Codebuff) to pick commits and individual diffs of files. For each of those, I had Claude write a sketch of what changed from the old file to the new.
This is all the data I need: the old file, the sketch of how Claude would update it, and the ground truth diff that should be produced. I compiled this into the ideal conversation where the assistant responds with the perfect patch, and that became the training set. I think I had on the order of ~300 of these conversations for the first run, and it worked pretty well.
I came up with more improvements too, like replacing all the variant placeholder comments like "// ... existing code ..." or "# ... (keep the rest of the function)" with one [[*REPLACE_WITH_EXISITNG_CODE*]] symbol, and that made it more accurate.
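Roughly, each training example looked conceptually like the sketch below in OpenAI's chat fine-tuning format (this is a reconstruction for illustration, not the actual pipeline code; the placeholder symbol and regex are mine):

```typescript
// Reconstruction for illustration: turn (old file, Claude's sketch, ground-truth patch)
// into one JSONL line in OpenAI's chat fine-tuning format.
type PatchExample = {
  oldFile: string; // file contents before the commit
  sketch: string; // Claude's sketch of the change
  truthPatch: string; // the real diff taken from the commit
};

// Collapse the many placeholder variants ("// ... existing code ...",
// "# ... (keep the rest of the function)", etc.) into a single symbol.
function normalizePlaceholders(sketch: string): string {
  return sketch.replace(
    /(\/\/|#)\s*\.\.\..*(existing code|keep the rest).*$/gim,
    "[[*REPLACE_WITH_EXISTING_CODE*]]"
  );
}

function toTrainingLine(ex: PatchExample): string {
  const messages = [
    {
      role: "system",
      content: "You convert a sketch of changes into a git patch for the given file.",
    },
    {
      role: "user",
      content: `Old file:\n${ex.oldFile}\n\nSketch of changes:\n${normalizePlaceholders(ex.sketch)}`,
    },
    { role: "assistant", content: ex.truthPatch }, // the "perfect" response to imitate
  ];
  return JSON.stringify({ messages });
}
```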
Manicode is really awesome, did some actual dev for live apps and it does work.
You must, though, learn to code in a different way if you are not that disciplined. I had excellent results asking for small changes, step by step, and committing often so I can undo and go back to a working version easily.
Net result was very positive, built two apps simultaneously (customer side and professional side).
Thanks! That's awesome to hear – if you wouldn't mind me asking, what was the context and tech stack of your side project? We love hearing about the wide variety of use cases people have found for Codebuff!
It's a pretty small, straightforward web app. Think: python backend and a vanilla HTML/JS frontend, served over Flask. The frontend is mostly in one file, so maybe it's not the best test case for cross-file reading, but still very happy with the user experience!
I really like the vibes on this: the YouTube video is pretty good, there’s a little tongue-in-cheek humor but it’s good natured, and the transparency around how it came together at the last minute is a great story.
It’s a crowded space and I don’t know how it’ll play, but in a space that hasn’t always brought out the best in the community, this Launch HN is a winner in my book.
I hope it goes great. Congratulations on the launch.
Tongue-in-cheek! No idea what you're talking about. But I appreciate the kind words :)
Ultimately, I think a future where the limit to good software is good ideas and agency to realize them, as opposed to engineering black boxes, mucking with mysterious runtime errors, holy wars on coding styles, etc. is where all the builders in this space are striving towards. We just want to see that happen sooner than later!
I'm curious how often others have experienced this. There have been so many times on many different projects where I've struggled with something hard and had the breakthrough only right before the deadline (self-imposed or actual deadline).
Congrats, sounds like an awesome project. I'll have to try it out.
I've been using the Cline extension in VSCode (which can execute commands and look at the output on the terminal) and it's an incredibly adept sysadmin, cloud architect, and data engineer. I like that Cline lets you approve/decline execution requests, and you can run it without sending the output, which is safer from a data perspective.
It's cool to have this natively on the remote system though. I think a safer approach would be to compile a small multi-platform binary locally, which has the command plus the capture of output to relay back, and transmit that over ssh for execution (like how MGMT config management compiles golang to a static binary and sends it over to the remote node vs having to have mgmt and all its deps installed on every system it's managing).
Could be low lift vs having a package, all its dependencies, and credentials running on the target system.
I'd assume the person giving the praise is at least a bit of all 3.
> It’s a weird catch-22 giving praise like that to LLMs.
It's a bit asymmetrical though isn't it -- judging quality is in fact much easier than producing it.
> you might be able to intuit and fill in the gaps left by the LLM and not even know it
Just because you are able to fill gaps with it doesn't mean it's not good. With all of these tools you basically have to fill gaps. There are still differences between Cline vs Cursor vs Aider vs Codebuff.
Personally I've found Cline to be the best to date, followed by Cursor.
> There’s still a skill floor required to accurately judge something.
Sure but it's not high at all.
Your typical sysadmin is doing a lot of Googling. If perplexity can tell you exactly what to do 90% of the time without error, that's a pretty good sysadmin.
Your typical programmer is doing a lot of googling and write-eval loops. If you are doing many flawless write-eval loops with the help of cline, cline is a pretty good programmer.
A lot of things AI is helping with also have good, easy to observe / generate, real-time metrics you can use to judge excellence.
It depends. For a sysadmin maybe not, but for data scientists, the bar would be pretty high just to understand the math jargon.
> If perplexity can tell you exactly what to do 90% of the time without error
That “if” is carrying a lot of weight. Anecdotally I haven’t seen any llm be correct 90% of the time. IIRC SOTA on swebench (which tbf isn’t a great benchmark) is around 30%.
> flawless write-eval loops with the help of cline, cline is a pretty good programmer.
I’m not really sure what you mean by “flawless” but having a rubber duck is always more helpful than harmful.
> A lot of things AI is helping with also have good, easy to observe / generate, real-time metrics you can use to judge excellence.
Exactly what I illustrated earlier: your developer productivity metrics. If you're turning code around faster, setting up your network better, turning around insights faster, the AI is working.
> It depends. For a sysadmin maybe not, but for data scientists, the bar would be pretty high just to understand the math jargon.
Why does an AI coding agent need to understand math jargon -- it just helps you write better code. Are you even familiar with what data scientists do? Seems not because if you were, you'd see clearly where the tool would be applied and do a good/bad job.
Reminder: we're talking about evaluating whether Codebuff / alternatives are "pretty good" at X. Just go play with the tools.
tgtweak expressed their opinion on how good the tool rates at some tasks {sysadmin, data engineering, cloud architecture} and your response was to question how someone could have an opinion about it. The obvious answer is that they used the tools and found them useful for those tasks. It may only be _subjectively_ good at what they're using it for, but it's also a rando's opinion on the internet. As another rando, I very much agree with what the person you responded to is saying. You're not going to get more rigor from this discourse - go form a real opinion of your own.
I would consider myself adept at all three, not top 1% in either but the intersection of all 3 easily.
Context: I have hired hundreds of engineers and built many engineering teams from scratch to 50+, and have been doing systems administration, solutions architecture, infrastructure design, devops, cloud orchestration, and data platform design for 25 years.
I'm not bluffing when I say Claude's latest sonnet model and Cline in vscode has really been 99th percentile good on everything I've thrown at it (with some direction, as needed) and has done more productive, quality work than a team of 10 engineers in the last week alone.
If you haven't tried it I can understand your pessimism.
I haven’t built engineering teams, but I’ve been in the server programming field for 15 years.
I have tried Claude (with aider) for programming tasks and have been impressed that it could do anything (with handholding) but haven’t been convinced that it’s something that will change how I write code forever.
It’s nice that I can describe how to graph some data in a csv and get 80% of the way there in python after a few rounds of clarification. Claude refused to use seaborn for some reason, but that’s no big deal.
Every time I’ve tried using it for work, though, I was sorely disappointed.
I recently convinced myself that it was pretty helpful in building a yjs backed text editor, but last week realized that it led me down an incorrect path with regards to the prosemirror plugin and I had to rewrite a good chunk of the code.
I have heard good things about Cline! I'm curious to learn more. I need to try it out myself.
I see Codebuff as a premium version of Cline, assuming that we are in fact more expensive. We do a lot of work to find more relevant files to include in context.
Tbh I used manicode once about a month ago and much preferred cline. Cline seems to find context just fine, can run terminal commands in the VS code terminal, and the flow where it proposes an edit is very good. Since it's in VS code I can even pause it and edit files then unpause it. I like that I can see how much everything costs and rely on good caching and usage based billing to get a fair price.
Admittedly the last time I used manicode was a while back but I even preferred Cursor to it, and Cursor hallucinates like a mf'er. What I liked about cursor is that I can just tell composer what files I want it to look at in the UI. But I just use Cline now because I find its performance to be the best.
Other datapoints: backend / ML engineer. Maybe other kinds of engineers have different experiences.
How does it work if I'm not adding features, but want to refactor my code bases? E.g., the OOD is poor, and I want to totally change it and split the code into new files. Would it work properly, given that this requires extensive reads + creating new files + writes ...
It couldn't write a simple test for my TypeScript Node system. It kept telling me about credits left and asking me to log in. I don't know who gets success from these tools and what they are building, but none of them actually work for me. Yesterday there was Aide, which I tried and found to be broken, and so is this one.
That's surprising to me, usually it works quite well at this. Did you start codebuff in the root of your project so that it can get context on your codebase?
In Codebuff you don't have to manually specify any files. It finds the right ones for you! It also pulls more files to get you a better result. I think this makes a huge difference in the ergonomics of just chatting to get results.
Codebuff also will run commands directly, so you can ask it to write unit tests and run them as it goes to make sure they are working.
Aider does all of this too, and it has for quite a while. It just tends to ask you for explicit permission when e.g. adding files to the context (potentially token-expensive), creating new files, or running commands (potentially dangerous); AFAIR there's an option (both configuration and cli arg) to auto-approve those requests, though I never tried it.
Aider has extensive code for computing "repository map", with specialized handling for many programming languages; that map is sent to LLM to give it an overview of the project structure and summary of files it might be interested in. It is indeed a very convenient feature.
I never tried writing and launching unit tests via Aider, but from what I remember from the docs, it should work out of the box too.
Thanks for sharing – I definitely want to play with Aider more. My knowledge of it is limited, but IIRC Aider doesn't pull in the right files to create context for itself like Codebuff does when making updates to your codebase.
Another aspect is simplicity. I think Aider and other CLI tools tend to err towards the side of configuration and more options, and we've been very intentional with Codebuff to avoid that. Not everyone values this, surely, but our users really appreciate how simple Codebuff is in comparison.
Aider very much does pull in the right files to create context for itself. It also uses treesitter to build a repo-map and then uses that as an initial reference for everything, asking you to add the files it thinks it needs for the context. As of quite recently it also supports adding files for context in explicit read-only mode. It works extremely well.
I think Aider does this to save tokens/money. It supports a lot of models so you can have Claude as your architect and another cheap model that does the coding.
Yup, there's a tradeoff in $$$, but for a lot of people it should be worth it, since Codebuff can find more relevant files with example code that will make the output higher quality.
The demo right there is worth $5 of software development (in offshored Upwork cost). Imagine when this can be done at scale for huge existing codebases.
- It chooses files to read automatically on each message — unlike Cursor’s composer feature. It also reads a lot more than Cursor's @codebase command.
- It takes 0 clicks — Codebuff just edits your files directly (you can always peek at the git diffs to see what it’s doing).
- It has full access to your existing tools, scripts, and packages — Codebuff can install packages, run terminal commands and tests, etc.
- It is portable to any development environment
We use OpenAI and Anthropic, so unfortunately we have to abide by their policies. But we only grab snippets of your code at any given point, so your codebase isn't seen by any entity in its entirety. We're also considering open-sourcing, so that might be a stronger privacy guarantee.
I should note that my cofounder James uses both and gets plenty of value by combining them. Myself, I'm more of a plain VSCode guy (Zed-curious, I'll admit). But because Codebuff lives in your terminal, it fits in anywhere you need.
Alright. That gives me some directional signal. I will be interested if you make it open source. We have a massive and critical code base, so I am always wary of giving access to third parties.
Codebuff is a bit simpler and requires less input from the user since you just chat and it does multi-file edits/runs commands. It's also more powerful since it pulls more files as context.
I think you just need to try it to see the difference. You can feel how much easier it is haha.
We don't store your codebase, and have a similar policy to Cursor, in that our server is mostly a thin wrapper that forwards requests to LLM providers.
The PearAI debacle is another story, but mostly they copied the open source project Continue.dev without giving proper attribution.
I've seen similar projects, but they all rely on paid LLMs, and can't work with local models, even if the endpoint is changed... what are the possibilities for this project to be run locally?
Are there any plans to add a sandbox? This seems cool, but it seems susceptible to prompt injection attacks when for example asking questions about a not necessarily trusted open source codebase.
I've been playing with Codebuff for a few days (building out some services with Node.js + Typescript) - been working beautifully! Feels like I'm watching a skilled surgeon at work.
A skilled surgeon is a great analogy! We actually instruct Codebuff to focus on making the most minimal edits, so that it does precisely what you want.
I'm currently hacking together a prototype of such a tool. The problem I noticed is that in CLI, commands are way less predictable than lines in code files, so such a tool will probably have a pretty low correct completion rate. However, there are clearly cases where it could be very helpful.
I wrote a custom `applyPatch` function that tries to use the line numbers, but falls back to searching for the context lines to line up the +/- patched lines.
It actually got the line numbers not too wrong, so they might have been helpful. (I included the line numbers for the original file in context.)
Ultimately though, this approach was still error prone enough that we recently switched away.
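For the curious, the fallback idea was roughly like the sketch below. This is a simplified illustration, not the actual implementation; the Hunk shape and helper names are assumptions.

```typescript
// Simplified sketch of a line-number-first, context-search-fallback patcher.
// The Hunk shape here is assumed for illustration.

interface Hunk {
  startLine: number; // 1-based line number the model claims the hunk starts at
  lines: string[];   // hunk body: " context", "-removed", "+added"
}

// Lines the hunk expects to find in the original file (context + removals).
function beforeLines(hunk: Hunk): string[] {
  return hunk.lines
    .filter((l) => l.startsWith(" ") || l.startsWith("-"))
    .map((l) => l.slice(1));
}

// Lines that should replace them (context + additions).
function afterLines(hunk: Hunk): string[] {
  return hunk.lines
    .filter((l) => l.startsWith(" ") || l.startsWith("+"))
    .map((l) => l.slice(1));
}

function matchesAt(file: string[], expected: string[], index: number): boolean {
  return expected.every((line, i) => file[index + i]?.trim() === line.trim());
}

export function applyPatch(source: string, hunk: Hunk): string {
  const file = source.split("\n");
  const expected = beforeLines(hunk);

  // First, trust the line number the model gave us.
  let index = hunk.startLine - 1;
  if (!matchesAt(file, expected, index)) {
    // Fall back to searching the whole file for the hunk's context lines.
    index = file.findIndex((_, i) => matchesAt(file, expected, i));
    if (index === -1) throw new Error("Could not locate hunk context in file");
  }

  file.splice(index, expected.length, ...afterLines(hunk));
  return file.join("\n");
}
```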
Can you speak more to how efficiency towards context management works (to reduce token costs)? Or are you loading up context to the brim with each request?
I think managing context is the most important aspect of today's coding agents. We pick only files we think would be relevant to the user request and add those. We generally pull more files than Cursor, which I think is an advantage.
However, we also try to leverage prompt-caching as much as possible to lower costs and improve latency.
So we basically only add files over time. Once context gets too large, it will purge them all and start again.
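As a rough sketch of what that append-only approach can look like with prompt caching (illustrative only: the token budget, model alias, and helper names are assumptions, not our internals), using Anthropic's TypeScript SDK:

```typescript
// Illustrative sketch: append-only file context, marked cacheable, purged when too large.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const MAX_CONTEXT_TOKENS = 150_000; // assumed budget, leaving headroom under 200k

let fileContext: string[] = []; // file snippets, only ever appended to

// Crude token estimate; good enough to decide when to purge.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

async function ask(userMessage: string) {
  const total = fileContext.reduce((n, f) => n + estimateTokens(f), 0);
  if (total > MAX_CONTEXT_TOKENS) fileContext = []; // purge and start again

  return client.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 4096,
    // Because the file context only grows, repeated requests share a stable
    // prefix, which is exactly what the prompt cache rewards.
    system: [
      {
        type: "text",
        text: fileContext.join("\n\n"),
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: userMessage }],
  });
}
```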
> However, we also try to leverage prompt-caching as much as possible to lower costs and improve latency.
Interesting! That does have a 5-minute expiry on Claude, and your users can use Codebuff in a suboptimal way. Do you have plans to align your users towards using the tool in a way that makes the most of prompt caches?
brilliant - and thank you - so impressed with your work, i finally made an account to just comment - out of the box worked, a few minor glitches, but this is the start of awesome. keep doing what you are doing.
Your website has a serious issue. Trying to play the YouTube video makes the page slow down to a crawl, even in 1080p, while playing it on YouTube directly has no issue, even in 4K.
On the project itself, I don't really find it exciting at all, I'm sorry. It's just another wrapper for a 3rd-party model, and the fact that you can 1) describe the entire workflow in 3 paragraphs, and 2) build and launch it in around 4 months, emphasizes that.
Weird, thanks for flagging – we're just using a Youtube embed in an iframe but I'll take a look.
No worries if this isn't a good fit for you. You're welcome to try it out for free anytime if you change your mind!
FWIW I wasn't super excited when James first showed me the project. I had tried so many AI code editors before, but never found them to be _actually usable_. So when James asked me to try, I just thought I'd be humoring him. Once I gave it a real shot, I found Codebuff to be great because of its form factor and deep context awareness: CLI allows for portability and system integration that plugins or extensions really can't do. And when AI actually understands my codebase, I just get a lot more done.
Not trying to convince you to change your mind, just sharing that I was in your shoes not too long ago!
> CLI allows for portability and system integration that plugins or extensions really can't do
In the past 6 or 7 years I haven't written a single line of code outside of a JetBrains IDE. Same thing for all of my team (whether they use JetBrains IDEs or VS Code), and I imagine for the vast majority of developers.
This is not a convincing argument for the vast majority of people. If anything, the fact that it requires a tool OUTSIDE of where they write code is an inconvenience.
> And when AI actually understands my codebase, I just get a lot more done.
But Amazon Q does this without me needing to type anything to instruct it, or to tell it which files to look at. And, again, without needing to go out of my IDE.
Having to switch to a new tool to write code using AI is a huge deterrent and asking for it is a reckless choice for any company offering those tools. Integrating AI in tools already used to write code is how you win over the market.
I was thinking the same. My (admittedly old-ish) 2070 Super runs at 25-30% just looking at the landing page. Seems a bit crazy for a basic web page. I'm guessing it's the background animation.
> I'm sorry. It's just another wrapper for a 3rd party model
The main challenge with working with LLMs is actually one of "ETL" and understanding what data to load and how to transform it into some form that leads to the desired output.
For trivial tasks, this is certainly easy. For complicated tasks, like understanding a codebase or a product catalog of tens of thousands of entries, this is non-trivial.
My team is not working in the code gen space, but even though we also "just wrap" an API, almost all of our work is in data acquisition, transformation, the retrieval strategy, and structuring of the request context.
The API call to the LLM is like hitting "bake" on an oven: all of the real work happens before that.
$99/month lol.
I have Perplexity, OpenAI, Claude and Cursor subscription and I end up paying way less than $99/month.
Clearly you haven't done any research on price.
Aider and Cline are open source; I'm not sure why someone would subscribe to this unless it's the top performer on http://swebench.com/
I tried it on two of my git repositories, just to see, if it could do a decent commit summary. I was very pleasantly surprised with the good result.
I was unpleasantly surprised that this already cost me 175 credits. If I extrapolate this over my ~100 repositories, that would already put me at 8,750 credits, just to let it write a commit message for release day. That is way out of free range and would basically eat up most of the $99 I would have to spend as well. My subscription price for Cody is $8 a month. Pricing seems just way off.
Don't take it personally or get too discouraged. You are not your product, and you're certainly not your first demo of your first product. But, knowing your competition, how you stack up against them, and how the people you're selling to feel about them, is a huge part of your job as a founder. It will only get more important.
You have to constantly do your research. It is one of those anxiety-inducing tasks that's easy to justify avoiding when all you want to do is code your idea up and there's so much other work to do. But it's your job. Even when you hire someone else to run product for you it'll be your responsibility to own it.
What you've built is cool, a lot of people love it even though they know about the other tools available. Now you know what your main competition does, you also know what it doesn't do, so you get to solve for that - and if you solved the context problem in isolation with treesitter then you're obviously capable.
You'll have realised by now that Aider didn't use treesitter when it started. Instead it used ctags: a pattern-matching approach to code indexing from 40 years ago that doesn't capture signatures or create an AST; it effectively just indexes the code with a bunch of regexes. And it's not like treesitter wasn't around when Aider was first written. Keep that in mind.
Interesting!
Being in the same product space for more than 3 months, I wonder how one can not come across 2 popular open source tools that do more or less the same thing.
Aider has 21k stars, for example, and Cline has around 11k. Both product names come up frequently on HN and Reddit.
Curious to know if YC does some research on existing products before backing a new business.
It seems like we're all in our own tech bubbles more and more. Distribution is clearly a tough problem to crack, and no one in this space has really mastered it yet, aside from arguably Github Copilot.
No comment on YC here, but I think it's easy to criticize from the outside. I've personally been impressed by all the peers, group partners, and alumni I've met so far. I'm biased, but I think YC knows what it's doing. Also, YC backs founders, not ideas.
In any YC application, they request a list of competitors and why your product is better! Curious about what OP listed as competitors in the application.
Here's my application: https://manicode.notion.site/Manicode-YC-application-c52f592...
I listed: Cursor, Devin, Codium, Augment, Greptile, Lovable.dev, Aider.chat, mentat.ai, devlo.ai, etc
So I did mention Aider. I was definitely aware that it existed, I just hadn't used it.
Maybe the person reading the application was not aware of the competition either.
This was surprising to me too.
I've also built a similar free and open-source tool gptme (2.5k stars), since the start of last year (GPT-3.5). It has been impossible to ignore the great work done by Aider.
Yup. I recall that at some point maybe a year ago, pretty much every other LLM thread that had people speculating whether some LLM could do X or Y / improved or worsened for Z / etc., would have Paul show up and comment something along the lines, "Actually, I've benchmarked this thoroughly in my work on Aider; here's <link to data and analysis>" or such. Those were usually some of the most insightful comments in the whole thread.
I found those comments, and the work they linked to, especially valuable because it's rare to see advanced work on LLM applications done and talked about in the open. Everyone else doing equivalent work seems to want to make a startup out of it, so all we usually get is weekend hacks that stall out quickly.
> With respect to privacy, we have pledged not to store your codebase [...]
It isn't necessarily a strong guarantee to have "pledged", although it is appreciated.
Amber Heard ruined that word for me.
It sounds minor that it finds files for you, but if you try it out, you'll see that it's a giant leap in UX and the extra files help it generate better code because it has more examples from your codebase.
But you said you haven't tried Aider, how can you say it's a "leap in UX"?
My own tool `gptme` lets the agent interactively read/collect context too (as does Anthropic in their latest minimal-harness submission to SWE-bench), it's nothing novel.
I'm just saying what users of Codebuff have said about us compared to competitors.
You should try Codebuff and see for yourself how it reads files! It's not simply a tool call. We put a lot of work into it.
I did find codebuff a lot easier to install and get started with...usability can make or break a project. Just as a user, I think it's nice to have multiple projects doing the same thing -- exploring more of the solution space.
(I've just played a little bit with aider and codebuff. I've previously tried aider and it always errored out on my code base, but inspired by this comment I tried again, and now it works well.)
Beyond what James highlighted, I personally really like how simple Codebuff is. CLI tools tend to go a bit overboard with options and configurations imo, which is ok if you're just setting them up once or twice. But for a tool I want to rely upon every day for my work, I want them to be as simple as possible, but no simpler.
Have you used Aider extensively? How are you finding it for your coding needs vs IDE-based chats?
The demos I see for these types of tools are always some toy project and don't reflect the day-to-day work I do at all. Do you have any example PRs on larger, more complex projects that have been written with Codebuff, and how much of that was human-interactive?
The real problem I want someone to solve is helping me with the real niche/challenging portion of a PR, ex: new tiptap extension that can do notebook code eval, migrate legacy auth service off auth0, record and replay API GET requests and replay a % of them as unit tests, etc.
So many of these tools get stuck trying to help me "start" rather than help me "finish" or unblock the current problem I'm at.
I hear you. This is actually a foundational idea for Codebuff. I made it to work within the large-ish codebase of my previous startup, Manifold Markets.
I want the demos to be of real work, but somehow they never seem as cool unless it's a neat front end toy example.
Here is the demo video I sent in my application to YC, which shows it doing real stuff: https://www.loom.com/share/fd4bced4eff94095a09c6a19b7f7f45c?...
This comment makes me think of Coke vs. Pepsi.
Historically, Pepsi won taste tests and people chose Coke. Because Pepsi is sweeter, so that first sip tastes better. But it's less satisfying—too sweet—to drink a whole can.
The sexy demos don't, in my opinion and experience, win over the engineers and leaders you need. Lil startups, maybe, and engineers that love the flavor of the week. But for solving real, unsexy problems—that's where you'll pull in organizations.
> The sexy demos don't, in my opinion and experience, win over the engineers and leaders you need.
Great point, we're in talks with a company and this exact issue came up. An engineer used Codebuff over a weekend to build a demo app, but the CEO wasn't particularly interested even after he enthusiastically explained what he made. It was only when the engineer later used Codebuff to connect the demo app to their systems that the CEO saw the potential. Figuring out how to help these two stakeholders align with one another will be a key challenge for us as we grow. Thanks for the thought!
> Historically, Pepsi won taste tests and people chose Coke. Because Pepsi is sweeter, so that first sip tastes better. But it's less satisfying—too sweet—to drink a whole can.
As a Pepsi drinker (Though Pepsi Max/Zero), I disagree with this. That's one interpretation, the other is the one Pepsi was gesturing at - that people prefer Coke when knowing it's Coke, because of branding, but with branding removed, prefer Pepsi.
I personally drank Coke Zero for years, always being "unhappy" when a restaurant only had Pepsi, until one day I realized I was actually enjoying the Pepsi more when not thinking about it, and that the only reason I "preferred" Coke was the brand. So I know that this story can also be true, at least on n=1 examples.
Watching the demo it seems like it would be more effective to learn the skills you need rather than using this for a decade.
It takes 5+ seconds just to change one field to dark mode. I don't even want to imagine a situation where I have two fields and I want to explain that I need to change this field and not that field.
I'm not sure who is the target audience for this, people who want to be programmers without learning programming ?
My 2c as someone who worked on a similar product:
> it seems like it would be more effective to learn the skills you need rather than using this for a decade.
Think of it as a calculator. You do want to be able to do addition, but not necessarily to manually add 4-digit numbers in your head.
> It takes 5+ seconds just to change one field to dark mode
Our current LLMs are way too slow for this. I am chuckling every time someone says "we don't need LLMs to be faster because people can't read faster". Imagine this using Groq with a future model with similar capability level, and taking 0.5 seconds to do this small change.
People need to remember we're at the very beginning of using AI for coding. Of course it's suboptimal for the majority of cases. Unless you believe we're way past half the sigmoid curve on AI improvements (which I don't), consider that this is the worst the AI is ever going to be for coding.
A year ago people were incredulous when told that AI could code. A year before that people would laugh you out of the room. Now we're at the stage where it kinda works, barely, sometimes. I'm bullish on the future.
In every experience I have had with LLMs generating code, they tend to follow the prompt much too closely and produce large amounts of convoluted code that in the end proves not only unnecessary but quite toxic.
Where LLMs shine is in being a personal Stack Overflow: asking a question and having a personalized, specific answer immediately, that uses one's data.
But solving actual, real problems still seems out of reach. And letting them touch my files sounds crazy.
(And yes, ok, maybe I just suck at prompting. But I would need detailed examples to be convinced this approach can work.)
I'm sure your prompting is great! It's just hard because LLMs tend to be very wordy by default. This was something we struggled with for a while, but I think we've done a good job at making Codebuff take a more minimal approach to code edits. Feel free to try it, let me know if it's still too wordy/convoluted for you.
> LLMs tend to follow the prompt much too closely
> produce large amounts of convoluted code that in the end prove not only unnecessary but quite toxic.
What does that say about your prompting?
> Do you have any example PRs on larger more complex projects that have been written with codebuff and how much of that was human interactive?
We have a lot of code in production that is AI-written. The important thing is that you need to consciously make a module or project AI-ready. This means that things like modularity and smaller files are even more important than they usually are.
I can't share those PRs, but projects on my profile page are almost entirely AI written (except the https://bashojs.org/ link). Some of them might meet your definition of niche based on the example you provided.
This is so REAL: LLMs suck probably because your modularity sucks LOL
Kind of like "please describe the solution and I will write code to do it". That's not how programming works. Writing code and testing it against expectations to get to the solution, that's programming.
FWIW I don't find that I'm losing good engineering habits/thought processes. Codebuff is not at the stage where I'm comfortable accepting its work without reviewing, so I catch bugs it introduces or edge cases it's missed. The main difference for me is the speed at which I can build now. Instead of fussing over exact syntax or which package does what, I can keep my focus on the broader implications of a particular architecture or nuances of components, etc.
I will admit, however, that my context switching has increased a ton, and that's probably not great. I often tell Codebuff to do something, inevitably get distracted with something else, and then come back later barely remembering the original task
Language is important here. Programming, at its basic definition, is just writing code that programs a machine. Software development or even design/engineering are closer to what you’re referring to.
> ex: new tiptap extension that can do notebook code eval
Claude wrote me a prosemirror extension doing a bunch of stuff that I couldn’t figure out how to do myself. It was very convenient.
+1; Ideally I want a tool I don't have to specify the context for. If I can point it via config files at my medium-sized codebase once (~2000 py files; 300k LOC according to `cloc`) then it starts to get actually usable.
Cursor Composer doesn't handle that and seems geared towards a small handful of handpicked files.
Would codebuff be able to handle a proper sized codebase? Or do the models fundamentally not handle that much context?
Yes. Natively, the models are limited to 200k tokens which is on the order of dozens of files, which is way too small.
But Codebuff has a whole preliminary step where it searches your codebase to find relevant files to your query, and only those get added to the coding agent's context.
That's why I think it should work up to medium-large codebases. If the codebase is too large, then our file-finding step will also start to fail.
I would give it a shot on your codebase. I think it should work.
RAG is a well-known technique now, and to paraphrase Emily Bender[1], here are some reasons why it's not a solution.
The code extruded from the LLM is still synthetic code, and likely to contain errors both in the form of extra tokens motivated by the pre-training data for the LLM rather than the input texts AND in the form of omission. It's difficult to detect when the summary you are relying on is actually missing critical information.
Even if the set up includes the links to the retrieved documents, the presence of the generated code discourages users from actually drilling down and reading them.
This is still a framing that says: Your question has an answer, and the computer can give it to you.
1 https://buttondown.com/maiht3k/archive/information-literacy-...
We actually don't use RAG! It's not that good, as you say.
We build a description of the codebase including the file tree and parsed function names and class names, and then just ask Haiku which files are relevant!
This works much better and doesn't require slowly creating an index. You can just run Codebuff in any directory and it works.
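Concretely, the flow is something like the sketch below. This is an illustration rather than our actual code: the regex "parsing" is a stand-in for real symbol extraction, and the prompt wording and model alias are assumptions.

```typescript
// Illustrative sketch: summarize the repo as a file tree with symbol names,
// then ask a small, fast model which files are relevant to the request.
import { readdirSync, readFileSync, statSync } from "node:fs";
import { join } from "node:path";
import Anthropic from "@anthropic-ai/sdk";

function describeRepo(dir: string, prefix = ""): string[] {
  return readdirSync(dir).flatMap((name) => {
    if (name === "node_modules" || name.startsWith(".")) return [];
    const path = join(dir, name);
    if (statSync(path).isDirectory()) return describeRepo(path, `${prefix}${name}/`);
    const src = readFileSync(path, "utf8");
    // Crude stand-in for parsing out function and class names.
    const symbols = [...src.matchAll(/(?:function|class)\s+(\w+)/g)].map((m) => m[1]);
    return [`${prefix}${name}: ${symbols.join(", ")}`];
  });
}

async function pickRelevantFiles(repoDir: string, request: string): Promise<string[]> {
  const client = new Anthropic();
  const res = await client.messages.create({
    model: "claude-3-5-haiku-latest",
    max_tokens: 1024,
    messages: [{
      role: "user",
      content:
        `Codebase summary:\n${describeRepo(repoDir).join("\n")}\n\n` +
        `User request: ${request}\n\n` +
        `List the file paths most relevant to this request, one per line.`,
    }],
  });
  const first = res.content[0];
  const text = first && first.type === "text" ? first.text : "";
  return text.split("\n").map((l) => l.trim()).filter(Boolean);
}
```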
It sounds like it's arguably still a form of RAG, just where the retrieval is very different. I'm not saying that to knock your approach, just saying that it sounds like it's still the case where you're retrieving some context and then using that context to augment further generation. (I get that's definitely not what people think of when you say RAG though.)
Genuine question: at what point does the term RAG lose its meaning? Seems like LLMs work best when they have the right context, and that context must be pulled from somewhere for the LLM. But if that's RAG, then what isn't? Do you have a take on this? Been struggling to frame all this in my head, so would love some insight.
RAG is a search step in an attempt to put relevant context into a prompt before performing inference. You are “augmenting” the prompt by “retrieving” information from a data set before giving it to an LLM to “generate” a response. The data set may be the internet, or a code base, or text files. The typical examples online uses an embedding model and a vector database for the search step, but doing a web query before inference is also RAG. Perplexity.ai is a RAG (but fairly good quality). I would argue that Codebuff’s directory tree search to find relevant files is a search step. It’s not the same as a similarity search on vector embeddings, and it’s not PageRank, but it is a search step.
Things that aren’t RAG, but are also ways to get a LLM to “know” things that it didn’t know prior:
1. Fine-tuning with your custom training data, since it modifies the model weights instead of adding context.
2. LoRA with your custom training data, since it adds a few layers on top of a foundation model.
3. Stuffing all your context into the prompt, since there is no search step being performed.
Gotcha – so broadly encompasses how we give external context to the LLM. Appreciate the extra note about vector databases, that's where I've heard this term used most, but I'm glad to know it extends beyond that. Thanks for explaining!
Not RAG: asking the LLM to generate using its internal weights only
RAG: providing the LLM with contextual data you’ve pulled from outside its weights that you believe relate to a query
Nice, super simple. We're definitely fitting into this definition of RAG then!
I think parsimo2010 gave a good definition. If you're pulling context from somewhere using some search process to include as input to the LLM, I would call that RAG.
So I would not consider something like using a system prompt (which does add context, but does not involve search) to be RAG. Also, using an LLM to generate search terms before returning query results would not be RAG, because the output of the search is not input to the LLM.
I would also probably not categorize a system similar to Codebuff that just adds the entire repository as context to be RAG since there's not really a search process involved. I could see that being a bit of a grey area though.
> We build a description of the codebase including the file tree and parsed function names and class names
This sounds like RAG and also that you’re building an index? Did you just mean that you’re not using vector search over embeddings for the retrieval part, or have I missed something fundamental here?
Ah yeah, that's what I mean! I thought RAG is synonymous with this vector search approach.
Either way, we do the search step a little different and it works well.
Any kind of search prior for content to provide as context to the LLM prompt is RAG. The goal is to leverage traditional information retrieval as a source of context. https://cloud.google.com/use-cases/retrieval-augmented-gener...
I'm currently working on a demonstration/POC system using my ElasticSearch as my content source, generating embeddings from that content, and passing them to my local LLM.
It would be cool to be talking to other people about the RAG systems they’re building. I’m working in a silo at the moment, and pretty sure that I’m reinventing a lot of techniques
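A minimal sketch of that kind of pipeline, assuming an Elasticsearch 8.x index with a dense_vector field and an Ollama-style local endpoint for embeddings and generation (index, field, and model names are all illustrative):

```typescript
// Illustrative RAG sketch: embed the query, kNN-search Elasticsearch,
// then hand the retrieved passages to a local LLM.
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });

async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  return (await res.json()).embedding;
}

async function answer(question: string): Promise<string> {
  const result = await es.search<{ body: string }>({
    index: "content",
    knn: { field: "embedding", query_vector: await embed(question), k: 5, num_candidates: 50 },
  });
  const context = result.hits.hits.map((h) => h._source?.body ?? "").join("\n---\n");

  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    body: JSON.stringify({
      model: "llama3.1",
      stream: false,
      prompt: `Answer using only this context:\n${context}\n\nQuestion: ${question}`,
    }),
  });
  return (await res.json()).response;
}
```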
I didn't mean to be down on it, and I'm really glad it's working well! If you start to reach the limits of what you can achieve with your current approach, there are lots of cute tricks you can steal from RAG, eg nothing stopping you doing a fuzzy keyword search for interesting-looking identifiers on larger codebases rather than giving the LLM the whole thing in-prompt, for example
I'll need to get approval to use this on that codebase. I've tried it out on a smaller open-source codebase as a first step.
For anyone interested:
It required a bit of back and forth to produce a relatively small change, and I think it was a bit too narrow with the files it selected (it missed updating the implementations of a method in some subclasses, since it didn't look at those files). So I'm not sure if this saved me time, but it's nevertheless promising! I'm looking forward to what it will be capable of in 6 months.
What's the fundamental limitation to context size here? Why can't a model be fine-tuned per codebase, taking the entire code into context (and be continuously trained as it's updated)?
Forgive my naivety, I don't know anything about LLMs.
It's pretty good for complex projects imo because codebuff can understand the structure of your codebase and which files to change to implement changes. It still struggles when there isn't good documentation, but it has helped me finish a number of projects
> It still struggles when there isn't good documentation
@Codebuff team, does it make sense to provide a documentation.md with exposition on the systems?
One cool thing you can do is ask Codebuff to create these docs. In fact, we recommend it.
Codebuff natively reads any files ending in "knowledge.md", so you can add any extra info you want it to know to these files.
For example, to make sure Codebuff creates new endpoints properly, I wrote a short guide with an example of the three files you need to update, and put it in backend/api/knowledge.md. After that, Codebuff always creates new endpoints correctly!
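For illustration, a hypothetical backend/api/knowledge.md might look something like this (the file names and commands below are made up, not our real ones):

```markdown
# API endpoint conventions (hypothetical example)

To add a new endpoint, update all three of:

1. `backend/api/routes.ts` - register the route and its handler.
2. `backend/api/schema.ts` - add request/response validation.
3. `common/api-client.ts` - expose a typed client method for the frontend.

Run `npm test -- api` after any endpoint change.
```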
you can put the information into knowledge.md or [description].knowledge.md, but sometimes I can't find documentation and we're both learning as we go lmao
Absolutely! Imagine setting a bunch of CSS styles through a long-winded AI conversation, when you could use an IDE to do it in a few seconds. I don't need that.
The long tail of niche engineering problems is the time consuming bit now. That's not being solved at all, IMHO.
> ... setting a bunch of css styles through a long winded AI conversation
Any links on this topic you rate/could share?
"On the checkout page at the very bottom there are two buttons that are visible when the user chooses to select fast shipping. The right one of those buttons should be a tiny bit more round and it seems like it's not 100% vertically aligned with the other button."
Takes a lot longer to write than just diving into the code. I think that's what they meant.
Which IDE do you use for CSS editing/adjustment?
I mostly mean the text editing functionality I guess. Visual Studio Code is the IDE I use
Great question – we struggled for a long time to put our demo together precisely for this reason. Codebuff is so useful in a practical setting, but we can't bore the audience with a ton of background on a codebase when we do demos, so we have to pick a toy project. Maybe in the future, we could start our demo with a half-built project?
Hopefully the demo on our homepage shows a little bit more of your day-to-day workflows than other codegen tools show, but we're all ears on ways to improve this!
To give a concrete example of usefulness, I was implementing a referrals feature in Drizzle a few weeks ago, and Codebuff was able to build out the cli app, frontend, backend, and set up db schema (under my supervision, of course!) because of its deep understanding of our codebase. Building the feature properly requires knowing how our systems intersect with one another and the right abstraction at each point. I was able to bounce back and forth with it to build this out. It felt akin to working with a great junior engineer, tbh!
EDIT: another user shared their use cases here! https://news.ycombinator.com/item?id=42079914
Why not take a large and complex code-base such as the firefox source code, feed that in and demonstrate how that goes?
If you're not worried about showing off little hints of your own codebase, record running it on one of your day to day engineering tasks. It's perfect dog fooding and would be a fun meta example.
> To give a concrete example of usefulness, I was implementing a referrals feature in Drizzle a few weeks ago, and Codebuff was able to build out the cli app, frontend, backend, and set up db schema
Record this!
Better yet, stream it on Twitch and/or YouTube and/or Discord and build a small community of followers.
People would love to watch you.
Great idea! We were kicking around something like this, but couldn't get it ready in time for today's launch – but keep your eyes peeled! Our Discord community is a great place to stay up to date.
Yup, I had the same thought. I just ran into an issue during today's launch and used Codebuff to help me resolve it: https://www.tella.tv/video/solving-website-slowdown-with-ai-.... Next time, I'll try to record before I start working, but it's hard to remember sometimes.
I'm not paying $20 for my ssh keys and rest of the clipboard to be sent to multiple unknown 3rd parties, thanks, not for me.
I would, however, pay for actual software that I can just buy instead of rent to do the task of inline shell assistance, without making network calls behind my back that I'm not in complete, perfectionist, one-hundred-point-zero-zero-percent control of.
Sorry, just my opinion in general on these types of products. If you don't have the skills to make a fully self-contained language-model product or something that does this, then you are not a skilled enough team for me to trust with my work shell.
The main issue is the models you need to do the job are too big for most consumers, even people with nice video cards. You'll need a couple L40S GPUs at minimum. Maybe a few H100s more realistically.
So do you want to buy tens of thousands of dollars in GPUs or do you want to rent them second-by-second? Most people will choose the latter. I understand you don't trust the infrastructure and that's reasonable. If self-hosting was viable it would be more popular.
Noting Codebuff is manicode renamed.
It's become my go-to tool for handling fiddly refactors. Here’s an example session from a Rust project where I used it to break a single file into a module directory.
https://gist.github.com/cablehead/f235d61d3b646f2ec1794f656e...
Notice how it can run tests, see the compile error, and then iterate until the task is done? Really impressive.
For reference, this task used ~100 credits
Haha yes, here's the story of why we rebranded: https://manifold.markets/JamesGrugett/what-will-we-rename-ma...
Thanks for sharing! haxton was asking about practical use cases, I'll link them here!
Allowing LLMs to execute unrestricted commands without human review is risky and insecure.
Yes, this is a good point. I think not asking to run commands is maybe the most controversial choice we've made so far.
The reason we don't ask for human review is simply: we've found that it works fine to not ask.
We've had a few hundred users so far, and usually people are skeptical of this at first, but as they use it they find that they don't want it to ask for every command. It enables cool use cases where Codebuff can iterate by running tests, seeing the error, attempting a fix, and running them again.
If you use source control like git, I also think that it's very hard for things to go wrong. Even if it ran rm -rf from your project directory, you should be able to undo that.
But here's the other thing: it won't do that. Claude is trained to be careful about this stuff and we've further prompted it to be careful.
I think not asking to run commands is the future of coding agents, so I hope you will at least entertain this idea. It's ok if you don't want to trust it, we're not asking you to do anything you are uncomfortable with.
I am not afraid of rm -rf on the whole directory. I am afraid of other stuff that it can do to my machines: leaking my ssh keys, cookies, personal data, and network devices, and making persistent modifications (malware) to my system. Or maybe inadvertently messing with my Python version, or globally installing some library and messing up the whole system.
I, as a well-intending human, have run commands that broke my local python install. At least I was vaguely aware of what I did, and was able to fix things. If I didn't know what had happened I'd be pretty lost.
You'd hope that most of it is just rm -rf .venv && poetry install, or similar.
> it won't do that. Claude is trained to be careful about this stuff and we've further prompted it to be careful.
Could you please explain a bit how you are sure about it?
It's mainly from experience. When I first set it up, I didn't have the feature to ask whether to run commands. It has been rawdogging commands this whole time and it has never been a problem for me.
I think we have many other users who are similar. To be fair, sometimes after watching it install packages with npm, people are surprised and say that they would have preferred that it asked. But usually this is just the initial reaction. I'm pretty confident this is the way forward.
Do you have any sandbox-like restrictions in place to ensure that commands are limited to only touching the project folder not any other places in the system?
You can use pledge[1] to restrict the tool to read/write only in specific directories, or only use certain system calls. This is easier to run than from a container or VM, but can be a bit fiddly to setup at first.
Assuming you trust it with the files in your codebase, and them being shared with third parties. Which is a hard pill to swallow for a proprietary program.
[1]: https://justine.lol/pledge/
We always reset the directory back to the project directory on each command, so that helps.
But we're open to adding more restrictions so that it can't for example run `cd /usr && rm -rf .`
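In Node terms, the per-command reset is conceptually something like this (an illustrative sketch, not the actual implementation):

```typescript
// Every agent command starts from the project root, so a stray `cd` in one
// command can't leak into the next one.
import { execSync } from "node:child_process";

const projectDir = process.cwd(); // captured once at startup

function runAgentCommand(command: string): string {
  return execSync(command, { cwd: projectDir, encoding: "utf8" });
}
```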
How about executing commands in a VM (perhaps Firecracker)?
It's strange that all the closed models, whose stated reason for being closed is safety, allow this, while apps that allow erotic roleplay get banned all the time. Roleplay is significantly less dangerous than full shell control.
You are really missing out: https://github.com/e2b-dev/e2b
I don't see any sandbox usage in the demo video.
Does this send code via your servers? If so, why? Nothing you've described couldn't be better implemented as a local service.
Could this tool get a command from the LLM which would result in file-loss? How would you prevent that?
Good question! There's a couple reasons.
One is that I think it is simpler for the end user to not have to add their own keys. It allows them to start for free and is less friction overall.
Another reason is that it allows us to use whichever models we think are best. Right now we just use Anthropic and OpenAI, but we are in talks with another startup to use their rewriting model. Previously, we have used our own fine-tuned model for one step, and that would be hard to do with just API keys.
The last reason that might be unpopular is that keeping it closed source and not allowing you to bring your keys means we can charge more money. Charging money for your product is good because then we can invest more energy and effort to make it even better. This is actually beneficial to you, the end user, because we can invest in making the product good. Capitalism works, cheers.
Per your last question, I do advise you to use git so that you can always revert to your old file state! Codebuff does have a native "undo" command as well.
We already have AIDE, Continue, Cody, Aider, Cursor... Why this?
None of them are owned by the creator of Codebuff. Why not create something to replace or be at the same level as those? Also, who is "we"? I don't use or like any of these.
Codebuff is the easiest to use of all these, because you just chat, and it finds all the right files to edit. There's no clicking and you don't have to confirm edits.
It is also a true agent. It can run terminal commands to aid the request. For one request it could:
1. Write a unit test
2. Run the test
3. Edit code to fix the error
4. Run it again and see it pass
If you try out Codebuff, I think you'll see why it's unique!
Can codebuff handle larger files (6,000 loc) and find the right classes/functions in that code, or if it finds the file with the necessary info does it load the entire file in?
I think it would handle the giant file, but it would definitely pull the whole thing into context.
We are doing some tricks so it should be able to edit the file without rewriting it, but occasionally that fails and we fallback to rewriting it all, which may time out on such a file.
One size does not fit all, and such tools are quite straightforward to develop. I even created one myself: https://npmjs.com/genaicode. I started with a CLI tool, but eventually learned that such UX is not good (even an interactive CLI), and I created a web UI.
Where the problems start: cost of inference vs. quality, latency, multimodality (vision + imagen), and AI service provider issues (morning hours in US time zones = poor-quality results).
The best part is being able to adjust it to my work style.
We also have ed, the standard Unix text editor.
A tool like this is probably a great way to get some VC money.
first time?
Heh, this is an underrated comment.
Fundamentally, I think codegen a pretty new space and lots of people are jumping in because they see the promise. Remains to be seen what the consolidation looks like. With the rate of advancement in LLMs and codegen in particular, I wouldn't be surprised to see even more tools than we do now...
And they all converging towards one use case, and targeting one consumer base. It's still unclear for consumers, at least based on this thread what differetiate them.
:KEKW:
Quality of code wise, is it worse or better than Cursor? I pay for Cursor now and it saves me a LOT of time to not copy files around. I actually still use the chatGPT/claude interfaces to code as well.
Cool, it's probably about the same, since we're both using the new Sonnet 3.5 for coding.
We might have a bit of an advantage because we pull more files as context so the edit can be more in the style of your existing code.
One downside to pulling more context is that we burn more tokens. That's partly why we have to charge $99, whereas Cursor is $20 per month.
If it's the same, it's hard to justify $100/month vs $20/month. I code mostly from vim, so I'm searching for my vim/CLI replacement while I still use both vim and Cursor.
Ah, but in Cursor you (mostly) have to manually choose files to edit and then approve all the changes.
With Codebuff, you just chat from the terminal. After trying it, I think you might not want to go back to Cursor haha.
That sounds cool and I like the idea, but definitely won't pay 5x. Maybe charge $30/month plus bring your own key. Let me know when you lower the price :)
https://aider.chat/
I've played with aider and didn't like it, is it essentially the same?
It might sound small, but pulling in more context can make a huge difference – I remember one time Cursor completely hallucinated Prisma as part of our tech stack and created a whole new schema for us, whereas Codebuff knew we were already hooked up to Drizzle and just modified our existing schema. But like James said, we do use more tokens to do this, so pros & cons.
Sounds pretty interesting, I was thinking that would be the way to work past limited context window sizes automatically.
> Codebuff has limited free usage, but if you like it you can pay $99/mo to get more credits...
> One user racked up a $500 bill...
Those two statements are kind of confusing together. Past the free tier, what does $99/month get you? It sounds like there's some sort of credit, but that's not discussed at all here. How much did this customer do to get to that kind of bill? I get that they built a flutter app, but did it take a hour to run up a $500 bill? 6 hours? a whole weekend? Is there a way to set a limit?
The ability to rack up an unreasonable bill by accident, even just conceptually, is a non-starter for many. This is interactive so it's not as bad as accidentally leaving a GPU EC2 instance on overnight, but I'll note that Aider shows per query and session costs.
Ah, good catch. We (read: I) introduced a bug that gave customers $500 worth of credits in a month as opposed to the $100 they paid for, and this user saw that and took advantage. Needless to say, we fixed the issue and added more copy to show to users when they _do_ exceed their thresholds. We haven't had that issue since. And of course, we ate the cost because it was our (expensive!) mistake.
The user had spent the entire weekend developing the app, and admitted that he would have been more careful to manage his Codebuff usage had it not been for this bug.
We're open to adding hard limits to accounts, so you're never charged beyond the credits you paid for. We just wanted to make sure people could pay a bit more to get to a good stopping point once they passed their limits.
> The user had spent the entire weekend developing the app, and admitted that he would have been more careful to manage his Codebuff usage had it not been for this bug.
On the flip side, there's probably useful things to learn from how he developed his app when he didn't feel the need to be careful; in a way, your $500 mistake bought you useful test data.
In my own use of Aider, I noticed I'm always worried about the costs and pay close attention to token/cost summaries it displays. Being always on my mind, this affects my use of this tool in ways I'm only beginning to uncover. Same would likely be true for Codebuff users.
Oh I see. it's an interesting story and thanks for the transparency. Might leave that out of the pitch as it's confusing and the thought of running up a $500 bill is scary and since the user ultimately didn't pay for it, seems like noise.
Have you considered a bring your own api key model?
Have an upvote! I've been trying it out, it's quite nice. What I like about this vs Copilot and Cursor is that I feel like (especially with Copilot) I'm always "racing" the editor. Also, Cursor conflicts with some longstanding keybindings I have, vs this which is just the terminal. Having worked on a similar system before, I know it's difficult to implement some of these things, but I am concerned about security. For instance, how well does it handle sensitive files like .env or gitignored files? At some point an audit, given that you're closed source, would go a long way.
Thanks! Yup, so long as you don't have key bindings for meta-x and meta-c, you ought to be good in the terminal. We honor any repository's .gitignore and won't touch those files. We don't even let Codebuff know they exist, which has caused some issues with hallucinations in the past.
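Conceptually, the filtering looks something like the sketch below (illustrative only, using the `ignore` npm package; not the actual implementation):

```typescript
// Files matched by .gitignore are dropped before any context is built,
// so the model never even learns they exist.
import { readFileSync } from "node:fs";
import ignore from "ignore";

const ig = ignore().add(readFileSync(".gitignore", "utf8"));

function visibleFiles(allFiles: string[]): string[] {
  // e.g. filters out ".env", "dist/bundle.js", etc.
  return allFiles.filter((f) => !ig.ignores(f));
}
```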
Making sure that our word is trustworthy to the broader world at large is going to be a big challenge for us. Do you have any ideas for what we can do? We're starting to think about open source, but we aren't quite ready for that yet.
Very excited for Codebuff, it's been a huge productivity boost for me! I've been putting it to use on a monorepo that has Go, TypeScript, Terraform, and some SQL, and it always looks at the right files for the task. I like the UX way better than Cursor's - I like reviewing all changes at once and making minor tweaks when necessary. Especially for writing Go, I love being able to stick with the GoLand IDE while using Codebuff.
Thanks for being one of our early users and dealing with our bugs! I love that we can fit into so many developers' workflows and support them where _they are_, as opposed to forcing them to use us for everything.
I've been using Codebuff (formerly manicode) for a few weeks. I think they have nailed the editing paradigm and I'm using it multiple times a day.
If you want to make a multi-file edit in cursor, you open composer, probably have to click to start a new composer session, type what you want, tell it which files it needs to include, watch it run through the change (seeing only an abbreviated version of the changes it makes), click apply all, then have to go and actually look at the real diff.
With codebuff, you open codebuff in terminal and just type what you want, and it will scan the whole directory to figure out which files to include. Then you can see the whole diff. It's way cleaner and faster for making large changes. Because it can run terminal commands, it's also really good at cleaning up after itself, e.g., removing files, renaming files, installing dependencies, etc.
Both tools need work in terms of reliability, but the workflow with Codebuff is 10x better.
Thanks for being an early user and supporter! You've helped us catch so many issues that have helped us get the product to where it is today!
I gave this a spin, this is the best iteration I've seen of a CLI agent, or just best agent period actually. Extremely impressed with how well it did making some modifications to my fairly complex 10,000 LOC codebase, with minimal instruction. Will gladly pay $99/mo when I run out of credits if it keeps up this level.
What if you have a microservice system with a repo-per-service setup, where to add functionality to a FE site you would have to edit code in three or four specific repos (FE site repo + backend service repo + API-client npm package repo + API gateway repo) out of hundreds of total repos?
Codebuff works on a local directory level, so it technically doesn't have to be a monorepo (though full disclaimer: our codebase is a monorepo and that's where we use it most). The most important thing is to make sure you have the projects in the same root directory so you can access them together. I've used it in a setup with two different repos in the same folder. That said, it might warn you that there's no .git folder at the root level when this happens.
This does seem to be suited to monorepo.
Yes, unfortunately, Codebuff will only read files within one directory (and sub-directories).
If you have multiple repos, you could create a directory that contains them all, and that should work pretty well!
Why is there stuff for Manifold Markets in the distributed package?
/codebuff/dist/manifold-api.js
https://www.npmjs.com/package/codebuff?activeTab=code
Haha, it's because I used to work on Manifold Markets!
Codebuff was originally called Manicode. We just renamed it this week actually.
There was meant to be a universe of "Mani" products. My other cofounder made Manifund, and there's a conference we made called Manifest!
but that doesn’t answer the question? are these test files?
Oh, it was a tool call I originally implemented so that Codebuff could look up the probabilities of markets to help it answer user questions.
I thought it would be fun if you asked it about the chance of the election or maybe something about AI capabilities, it could back up the answer by citing a prediction market.
Why not simply remove that dependency now?
Cruft that built up over time. But you make a good point; I'll write a ticket to remove it soon.
Codebuff waiting in the wings to implement.
https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExbDExYmM5Y3N0eHJ...
Thanks!
Cause they wrote codebuff using codebuff.
They came out of Manifold. Though I acknowledge that doesn't really answer your question.
How is this different from Qodo? Why isn’t it mentioned as a competitor?
I have a hard time figuring out what codebuff brings to the table that hasn’t been done before, other than being YC backed. I think to win in this massively competitive and fast-moving market, you really have to put forward something significantly better than an expensive, cobbled-together script replicating OSS solutions…
I know this sounds harsh, but believe me, differentiation makes or breaks you sooner rather than later. Proper differentiation doesn’t have to be hard; it just needs to answer the question of what you offer that I can’t get anywhere else at a similar price point. Right now, your offering is more expensive than something I can basically get elsewhere, better, for 1/5 the price… I’m seriously worried about whether your venture will be around one or two years from now without a more convincing value prop.
From my experience leaning more into full end-to-end AI workflows building Rust, it seems that:
1) Context has clearly won over RAG. There is no way back.
2) Workflow is the next obvious evolution and gets you an extra mile.
3) Adversarial, GAN-style training seems like a path forward to get from just-okay generated code to something close to a home run on the first try.
4) Generating a style guide based on the entire codebase and feeding that style guide, together with the task and context, into the LLM is your ticket to enterprise customers, because no matter how good your stuff might be, if the generated code doesn’t fit the mold you are not part of the conversation. Conversely, if you deliver code in the same style and formatting and it actually works, well, price doesn’t matter much.
5) In terms of marketing to developers, I suggest starting by listening to their pain points working with existing AI tools. I don’t have a single one of the problems you're trying to solve. I'm sitting on a massive Rust monorepo, and I’ve seen virtually every existing AI coding assistant fail one way or another. The one I have now works miracles half the time and only fails the other half. That is already a massive improvement compared to everything else I tried over the past four years.
Point is, there is a massive need for coding assistance on complex systems and for CodeBuff to make a dime of a difference, you have to differentiate from what’s out there by starting with the challenges engineers face today.
Yes, but did you try it? I think Codebuff is by far the easiest to use, and it may also be more effective in your large codebase than any other comparable tool (e.g. Cursor Composer, Aider, Cline; not sure about Qodo), because it is better at finding the appropriate files.
Re: style guide. We encourage you to write up `knowledge.md` files, which are included in every prompt. You can specify styles or other guidelines to follow in your codebase. One motivating example: we wrote instructions for how to add an endpoint (edit these three files), and that made it do the right thing when asked to create an endpoint.
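To illustrate, a hypothetical `knowledge.md` (made-up paths and rules here, not our actual file) might look something like:

```
# Project knowledge

## Style
- TypeScript strict mode everywhere; no `any`.
- Prefer named exports and small, pure functions.

## How to add an API endpoint
1. Define the request/response schema in common/src/api/schema.ts.
2. Implement the handler in backend/api/src/routes/<endpoint>.ts.
3. Register the route in backend/api/src/app.ts.
```

Because the file rides along with every prompt, conventions like these get applied without you having to restate them each time.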
am I the only one who is scared of "it can run any command in your terminal"?
Hah no, you're not alone! Candidly, this is one of the top complaints from users. We're doing a lot of prompt engineering to be safe, but we can definitely do more. But the ones who take the leap of faith have found that it speeds up their workflows tremendously. Codebuff installs the right packages, sets up environments correctly, runs their scripts, etc. It feels magical because you can stay at a high level and focus on the real problems you're trying to solve.
If you're nervous about this, I'd suggest throwing Codebuff in a Docker container or even a separate instance with just your codebase.
What do you think about having Codebuff write a parser for JavaScript? Something specifically built to enhance itself, going beyond the regular parsers and creating a more useful structure of the codebase to then be used for RAG for code writing. This would be doubly useful: a great demo for your product as well as an intrinsic enhancement of the product. For example, the new parser could not only build the syntax tree but also provide relevant commentary for each method describing what it does, to better pick code context.
Congrats on the launch guys! Tried the product early on and it’s clearly improved a ton. I’m still using Cursor every day mainly because of how complete the feature set is - autocomplete, command K, highlight a function and ask questions about it, and command L / command shift L. I am not sure what it’ll take for me to switch - maybe I’m not an ideal user somehow… I’m working in a relatively simple codebase with few collaborators?
I’m curious what exactly people say causes them to make the switch from Cursor to Codebuff? Or do people just use both?
Sweet. Personally, I use both Cursor and Codebuff.
I open the terminal panel at the bottom of the Cursor window, start up `codebuff`, and voila, I have an upgraded version of Cursor Compose!
Depending on what exactly I'm implementing I rely more on codebuff or do more manual coding in Cursor. For manual coding, I mostly just use the tab autocomplete. That's their best feature IMO.
But codebuff is very useful for starting features out if I brain dump what I want and then go fix it up. Or, writing tests or scripts. Or refactoring. Or integrating a new api.
As codebuff has gotten better, I've found it useful in more cases. If I'm implementing a lot of web UI, I can nearly stop looking at the code altogether and just keep prompting it until it works.
Hopefully that gives you some idea of how you could use codebuff in your day-to-day development.
I've been using Codebuff for the last few weeks, and it's been really nice for working in my Elixir repo. And as someone who uses Neovim in the terminal instead of VS Code, it's nice to actually be able to have it live in the tmux split beside Neovim instead of having to switch to a different editor.
I have noticed some small oddities, like every now and then it will remove the existing contents of a module when adding a new function, but between a quick glance over the changes using the diff command and our standard CI suite, it's always pretty easy to catch and fix.
Thanks for using Codebuff! Yeah, these edit issues are annoying, but I'm confident we can reduce the error rate a lot in the coming weeks.
Love the demo video! Three quick questions:
Any specific reason to choose the terminal as the interface? Do you plan to make it more extensible in the future? (sounds like this could be wrapped with an extension for any IDE, which is exciting)
Also, do you see it being a problem that you can't point it to specific lines of code? In Cursor you can select some lines and CMD+K to instruct an edit. This takes away that fidelity, is it because you suspect models will get good enough to not require that level of handholding?
Do you plan to benchmark this with swe-bench etc.?
We thought about making a VSCode extension/fork like everyone else, but decided that the future is coding agents that do most of the work for you.
The terminal is actually a great interface because it is so simple. It keeps the product focused to not have complex UI options. But also, we rarely thought we needed any options. It's enough to let the user say what they want in chat.
You can't point to specific lines, but Codebuff is really good at finding the right spot.
I actually still use Cursor to edit individual files because I feel it is better when you are manually coding and want to change just one thing there.
We do plan to run SWE-bench. It's mostly the new Sonnet 3.5 under the hood making the edits, so it should do about as well as Anthropic's own benchmark result, which is really high at 49%: https://www.anthropic.com/news/3-5-models-and-computer-use
Fun fact: to reach that high score, the new Sonnet was given two tools, one to make code edits and one to run terminal commands. That's pretty much what Codebuff does.
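To make that concrete, here's a minimal sketch of what "give the model two tools" looks like with Anthropic's tool-use API. The tool names and schemas below are illustrative, not our actual implementation:

```typescript
// Sketch only: an agent exposing "edit file" and "run command" tools via the
// Anthropic Messages API. Tool names/schemas are hypothetical.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const tools: Anthropic.Tool[] = [
  {
    name: "edit_file",
    description: "Replace the contents of a file in the repository.",
    input_schema: {
      type: "object",
      properties: {
        path: { type: "string" },
        contents: { type: "string" },
      },
      required: ["path", "contents"],
    },
  },
  {
    name: "run_terminal_command",
    description: "Run a shell command and return its output.",
    input_schema: {
      type: "object",
      properties: { command: { type: "string" } },
      required: ["command"],
    },
  },
];

async function step(userMessage: string) {
  const response = await anthropic.messages.create({
    model: "claude-3-5-sonnet-latest",
    max_tokens: 4096,
    tools,
    messages: [{ role: "user", content: userMessage }],
  });
  // A real agent loop would execute each tool_use block and feed results back.
  for (const block of response.content) {
    if (block.type === "tool_use") {
      console.log("model requested tool:", block.name, block.input);
    }
  }
}

step("Add a unit test for the parser and run it.").catch(console.error);
```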
To add on, I know a lot of people see the terminal as cruft/legacy from the mainframe days. But it's funny to look at tons of people's IDE setups and see that the one _consistent_ thing across them all is that they have a terminal nearby. It makes sense, too, since programs run in the terminal and you can only abstract so much away from developers. And like James said, this sets us up nicely to build for a future of coding agents running around. Feels like a unique insight, but I dunno. I guess time will tell.
> I know a lot of people see the terminal as cruft/legacy from the mainframe days.
Hah. If you encounter people that think like this, run away because as soon as they finish telling you that terminals are stupid they inevitably want help configuring their GUI for k8s or git. After that, with or without a GUI, it turns out they also don’t understand version control / containers.
Will do! Also forwarding to our competitors, but something tells me they will ignore this.
Congrats on the launch! I tried this on a migration project I'm working on (which involves a lot of rote refactoring) and it worked very well. I think you've nailed the ergonomics for terminal-based operations on the codebase.
I've been using Zed editor as my primary workhorse, and I can see codebuff as a helper CLI when I need to work. I'm not sure if a CLI-only interface outside my editor is the right UX for me to generate/edit code — but this is perfect for refactors.
Amazing, glad it worked well for you! I main VSCode but tried Zed in my demo video and loved the smoothness of it.
Totally understand where you're coming from, I personally use it in a terminal tab (amongst many) in any IDE I'm using. But I've been surprised to see how different many developers' workflows are from one another. Some people use it in a dedicated terminal window, others have a vim-based setup, etc.
> I fine-tuned GPT-4o to turn Claude's sketch of changes into a git patch, which would add and remove lines to make the edits. I only finished generating the training data late at night, and the fine-tuning job ran as I slept
Could you say more about this? What was the entirety of your training data, exactly, and how did the sketch of changes and git patch play into that?
Sure! I git cloned some open source projects, and wrote a script (with Codebuff) to pick commits and individual diffs of files. For each of those, I had Claude write a sketch of what changed from the old file to the new.
This is all the data I need: the old file, the sketch of how Claude would update it, and the ground truth diff that should be produced. I compiled this into the ideal conversation where the assistant responds with the perfect patch, and that became the training set. I think I had on the order of ~300 of these conversations for the first run, and it worked pretty well.
I came up with more improvements too, like replacing all the variants of placeholder comments like "// ... existing code ..." or "# ... (keep the rest of the function)" with one [[*REPLACE_WITH_EXISITNG_CODE*]] symbol, and that made it more accurate.
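If it helps, the data-generation step looked roughly like this sketch (TypeScript for illustration; the paths, prompts, and the sketch-generation call are placeholders, not the real script):

```typescript
// Rough sketch: build fine-tuning examples (old file + change sketch -> patch)
// from a cloned repo's git history, in OpenAI's JSONL "messages" format.
// Not the actual Codebuff pipeline.
import { execSync } from "node:child_process";
import { appendFileSync } from "node:fs";

const repoDir = "./some-cloned-repo"; // hypothetical local clone
const git = (args: string) =>
  execSync(`git -C ${repoDir} ${args}`, { encoding: "utf8", maxBuffer: 64 * 1024 * 1024 });

// Pick recent commits that touch exactly one file (keeps examples simple).
const commits = git("log --pretty=format:%H -n 200").split("\n");

for (const sha of commits) {
  const files = git(`diff-tree --no-commit-id --name-only -r ${sha}`).trim().split("\n");
  if (files.length !== 1 || !files[0]) continue;
  const file = files[0];

  let oldFile: string;
  try {
    oldFile = git(`show ${sha}~1:${file}`); // file contents before the commit
  } catch {
    continue; // file was newly added in this commit; skip
  }
  const groundTruthDiff = git(`diff ${sha}~1 ${sha} -- ${file}`);

  // In the real pipeline, Claude wrote the natural-language "sketch" of the
  // change; here it's just a placeholder string.
  const sketch = `TODO: LLM-generated sketch of the change to ${file}`;

  // One fine-tuning example per line.
  const example = {
    messages: [
      { role: "system", content: "You turn an old file plus a sketch of changes into a patch." },
      { role: "user", content: `Old file:\n${oldFile}\n\nSketch of changes:\n${sketch}` },
      { role: "assistant", content: groundTruthDiff },
    ],
  };
  appendFileSync("training-data.jsonl", JSON.stringify(example) + "\n");
}
```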
Very interesting, thanks!
Manicode is really awesome; I did some actual dev for live apps and it does work.
You must, though, learn to code in a different way if you are not that disciplined. I had excellent results asking for small changes, step by step, and committing often so I can undo and go back to a working version easily.
Net result was very positive, built two apps simultaneously (customer side and professional side).
I just tried it out in the context of a small but messy side project. It did exactly what I asked for. The ease of use is bliss. Impressive!
Thanks! That's awesome to hear – if you wouldn't mind me asking, what was the context and tech stack of your side project? We love hearing about the wide variety of use cases people have found for Codebuff!
It's a pretty small, straightforward web app. Think: Python backend and a vanilla HTML/JS frontend, served over Flask. The frontend is mostly in one file, so maybe it's not the best test case for cross-file reading, but I'm still very happy with the user experience!
I really like the vibes on this: the YouTube video is pretty good, there’s a little tongue-in-cheek humor but it’s good natured, and the transparency around how it came together at the last minute is a great story.
It’s a crowded space and I don’t know how it’ll play, but in a space that hasn’t always brought out the best in the community, this Launch HN is a winner in my book.
I hope it goes great. Congratulations on the launch.
Thank you for the very kind comment!
Tongue-in-cheek! No idea what you're talking about. But I appreciate the kind words :)
Ultimately, I think a future where the limit to good software is good ideas and the agency to realize them, as opposed to engineering black boxes, mucking with mysterious runtime errors, holy wars over coding styles, etc., is what all the builders in this space are striving toward. We just want to see that happen sooner rather than later!
"...in a Hail Mary attempt"
I'm curious how often others have experienced this. There have been so many times on many different projects where I've struggled with something hard and had the breakthrough only right before the deadline (self-imposed or actual deadline).
Congrats, sounds like an awesome project. I'll have to try it out.
I've been using the Cline extension in VS Code (which can execute commands and look at the output in the terminal) and it's an incredibly adept sysadmin, cloud architect, and data engineer. I like that Cline lets you approve/decline execution requests, and you can run it without sending the output, which is safer from a data perspective.
It's cool to have this natively on the remote system though. I think a safer approach would be to compile a small multi-platform binary locally, which carries the command plus the capture of its output to relay back, and transmit that over ssh for execution (like how MGMT config management compiles Go to a static binary and sends it over to the remote node, vs. having to have mgmt and all its deps installed on every system it's managing).
Could be low lift vs. having a package, all its dependencies, and credentials running on the target system.
Are you an adept sysadmin, cloud architect, and/or data engineer?
It’s a weird catch-22 giving praise like that to LLMs.
If you are, then you might be able to intuit and fill in the gaps left by the LLM and not even know it.
And if you’re not, then how could you judge?
Not much to do with what you were saying, really, just a thought I had.
I'd assume the person giving the praise is at least a bit of all 3.
> It’s a weird catch-22 giving praise like that to LLMs.
It's a bit asymmetrical though isn't it -- judging quality is in fact much easier than producing it.
> you might be able to intuit and fill in the gaps left my the LLM and not even know it
Just because you are able to fill gaps with it doesn't mean it's not good. With all of these tools you basically have to fill gaps. There are still differences between Cline vs Cursor vs Aider vs Codebuff.
Personally I've found Cline to be the best to date, followed by Cursor.
> judging quality is in fact much easier than producing it
There’s still a skill floor required to accurately judge something.
A layman can’t accurately judge the work of a surgeon.
> Just because you are able to fill gaps with it doesn't mean it's not good.
If I had to fill in my sysadmin’s knowledge gaps I wouldn’t call them a good sysadmin.
Not saying the tool isn’t useful, mind you, just playing semantics with calling a tool a “good sysadmin” or whatever.
> There’s still a skill floor required to accurately judge something.
Sure but it's not high at all.
Your typical sysadmin is doing a lot of Googling. If perplexity can tell you exactly what to do 90% of the time without error, that's a pretty good sysadmin.
Your typical programmer is doing a lot of googling and write-eval loops. If you are doing many flawless write-eval loops with the help of cline, cline is a pretty good programmer.
A lot of things AI is helping with also have good, easy to observe / generate, real-time metrics you can use to judge excellence.
> Sure but it's not high at all.
It depends. For a sysadmin maybe not, but for data scientists, the bar would be pretty high just to understand the math jargon.
> If perplexity can tell you exactly what to do 90% of the time without error
That “if” is carrying a lot of weight. Anecdotally I haven’t seen any LLM be correct 90% of the time. IIRC SOTA on SWE-bench (which tbf isn’t a great benchmark) is around 30%.
> flawless write-eval loops with the help of cline, cline is a pretty good programmer.
I’m not really sure what you mean by “flawless” but having a rubber duck is always more helpful than harmful.
> A lot of things AI is helping with also have good, easy to observe / generate, real-time metrics you can use to judge excellence.
Like what?
> A lot of things AI is helping with also have good, easy to observe / generate, real-time metrics you can use to judge excellence.
Exactly what I illustrated earlier: your developer productivity metrics. If you're turning code around faster, setting up your network better, turning around insights faster, the AI is working.
> It depends. For a sysadmin maybe not, but for data scientists, the bar would be pretty high just to understand the math jargon.
Why does an AI coding agent need to understand math jargon -- it just helps you write better code. Are you even familiar with what data scientists do? Seems not because if you were, you'd see clearly where the tool would be applied and do a good/bad job.
Reminder: we're talking about evaluating whether Codebuff / alternatives are "pretty good" at X. Just go play with the tools. tgtweak expressed their opinion on how well the tool rates at some tasks {sysadmin, data engineering, cloud architecture}, and your response was to question how someone could have an opinion about it. The obvious answer is that they used the tools and found them useful for those tasks. It may only be _subjectively_ good at what they're using it for, but it's also a rando's opinion on the internet. As another rando, I very much agree with what the person you responded to is saying. You're not going to get more rigor from this discourse - go form a real opinion of your own.
Wow… why’d you get so defensive and presumptuous?
I have my opinion, it’s just not the same as yours.
Label it what you want, I'm responding directly to questions you posed.
> I have my opinion, it’s just not the same as yours.
This is literally the TL;DR for what I wrote.
I would consider myself adept at all three: not top 1% in any one of them, but easily so for the intersection of all 3.
Context: I have hired hundreds of engineers and built many engineering teams from scratch to 50+, and have been doing systems administration, solutions architecture, infrastructure design, devops, cloud orchestration, and data platform design for 25 years.
I'm not bluffing when I say Claude's latest Sonnet model and Cline in VS Code have really been 99th-percentile good on everything I've thrown at them (with some direction, as needed) and have done more productive, quality work than a team of 10 engineers in the last week alone.
If you haven't tried it I can understand your pessimism.
I haven’t built engineering teams, but I’ve been in the server programming field for 15 years.
I have tried Claude (with aider) for programming tasks and have been impressed that it could do anything (with handholding) but haven’t been convinced that it’s something that will change how I write code forever.
It’s nice that I can describe how to graph some data in a csv and get 80% of the way there in python after a few rounds of clarification. Claude refused to use seaborn for some reason, but that’s no big deal.
Every time I’ve tried using it for work, though, I was sorely disappointed.
I recently convinced myself that it was pretty helpful in building a yjs backed text editor, but last week realized that it led me down an incorrect path with regards to the prosemirror plugin and I had to rewrite a good chunk of the code.
I have heard good things about Cline! I'm curious to learn more. I need to try it out myself.
I see Codebuff as a premium version of Cline, assuming that we are in fact more expensive. We do a lot of work to find more relevant files to include in context.
Tbh I used manicode once about a month ago and much preferred cline. Cline seems to find context just fine, can run terminal commands in the VS code terminal, and the flow where it proposes an edit is very good. Since it's in VS code I can even pause it and edit files then unpause it. I like that I can see how much everything costs and rely on good caching and usage based billing to get a fair price.
Admittedly the last time I used manicode was a while back but I even preferred Cursor to it, and Cursor hallucinates like a mf'er. What I liked about cursor is that I can just tell composer what files I want it to look at in the UI. But I just use Cline now because I find its performance to be the best.
Other datapoints: backend / ML engineer. Maybe other kinds of engineers have different experiences.
How does it work if I'm not adding features, but want to refactor my codebases? E.g., the OOD is poor, and I want to totally change it and split the code into new files. Would it work properly, since that requires extensive reads + creating new files + writes ...
It couldn't write a simple test for my TypeScript Node system. It kept giving me messages about credits left and telling me to log in. I don't know who gets success from these tools and what they are building, but none of them actually work for me. Yesterday there was Aide, which I tried and found to be broken, and so is this one.
That's surprising to me, usually it works quite well at this. Did you start codebuff in the root of your project so that it can get context on your codebase?
Yup I did.
Comparison with Aider?
Great question!
In Codebuff you don't have to manually specify any files. It finds the right ones for you! It also pulls more files to get you a better result. I think this makes a huge difference in the ergonomics of just chatting to get results.
Codebuff also will run commands directly, so you can ask it to write unit tests and run them as it goes to make sure they are working.
Aider does all of this too, and it has for quite a while. It just tends to ask you for explicit permission when e.g. adding files to the context (potentially token-expensive), creating new files, or running commands (potentially dangerous); AFAIR there's an option (both configuration and cli arg) to auto-approve those requests, though I never tried it.
Aider has extensive code for computing "repository map", with specialized handling for many programming languages; that map is sent to LLM to give it an overview of the project structure and summary of files it might be interested in. It is indeed a very convenient feature.
I never tried writing and launching unit tests via Aider, but from what I remember from the docs, it should work out of the box too.
Thanks for sharing – I definitely want to play with Aider more. My knowledge of it is limited, but IIRC Aider doesn't pull in the right files to create context for itself like Codebuff does when making updates to your codebase.
Another aspect is simplicity. I think Aider and other CLI tools tend to err towards the side of configuration and more options, and we've been very intentional with Codebuff to avoid that. Not everyone values this, surely, but our users really appreciate how simple Codebuff is in comparison.
Aider very much does pull in the right files to create context for itself. It also uses treesitter to build a repo-map and then uses that as an initial reference for everything, asking you to add the files it thinks it needs for the context. As of quite recently it also supports adding files for context in explicit read-only mode. It works extremely well.
This is a good read: https://aider.chat/2023/10/22/repomap.html
That's why I like aider tbh. I know it's not going nuts on my repo.
I think Aider does this to save tokens/money. It supports a lot of models so you can have Claude as your architect and another cheap model that does the coding.
Yup, there's a tradeoff in $$$, but for a lot of people it should be worth it, since Codebuff can find more relevant files with example code that will make the output higher quality.
> In Codebuff you don't have to manually specify any files.
Alright, I'm in.
Ah thanks, that's excellent! That is a massive issue for Aider; it was supposed to be solved, but last I tried I still had to do that manually.
Nice work!
https://www.codebuff.com/
The demo right there is worth $5 of software development (in offshored Upwork cost). Imagine when this can be done at scale for huge existing codebases.
whooooot! it's been a wild ride thus far, but we've been super thrilled at how people are using it and can't wait for you all to try it out!
we've seen our own productivity increase tenfold – using codebuff to buff our own code hah
let us know what you think!
I don't see the value. Why is this better than Cursor? What guarantees that you won't steal my code?
Wasn't there a recent startup in F24 that stole code from another YC company, and the fire was quickly put out by everyone?
Good question – here are a few reasons:
- It chooses files to read automatically on each message — unlike Cursor’s composer feature. It also reads a lot more than Cursor's @codebase command.
- It takes 0 clicks — Codebuff just edits your files directly (you can always peek at the git diffs to see what it’s doing).
- It has full access to your existing tools, scripts, and packages — Codebuff can install packages, run terminal commands and tests, etc.
- It is portable to any development environment.
We use OpenAI and Anthropic, so unfortunately we have to abide by their policies. But we only grab snippets of your code at any given point, so your codebase isn't seen by any entity in its entirety. We're also considering open-sourcing, so that might be a stronger privacy guarantee.
I should note that my cofounder James uses both and gets plenty of value by combining them. Myself, I'm more of a plain VSCode guy (Zed-curious, I'll admit). But because Codebuff lives in your terminal, it fits in anywhere you need.
No comment on our batchmates
Alright. That gives me some directional signal. I will be interested if you make it open source. We have a massive and critical codebase, so I am always wary of giving access to 3Ps.
Codebuff is a bit simpler and requires less input from the user since you just chat and it does multi-file edits/runs commands. It's also more powerful since it pulls more files as context.
I think you just need to try it to see the difference. You can feel how much easier it is haha.
We don't store your codebase, and have a similar policy to Cursor, in that our server is mostly a thin wrapper that forwards requests to LLM providers.
The PearAI debacle is another story, but mostly they copied the open source project Continue.dev without giving proper attribution.
Okay. I will try it once you are a bit further along and have OSS.
I've seen similar projects, but they all rely on paid LLMs, and can't work with local models, even if the endpoint is changed... what are the possibilities for this project to be run locally?
Are there any plans to add a sandbox? This seems cool, but it seems susceptible to prompt injection attacks when for example asking questions about a not necessarily trusted open source codebase.
We might! You could also set up your project within a docker container pretty simply (Codebuff would be great at setting that up :P).
I've been playing with Codebuff for a few days (building out some services with Node.js + Typescript) - been working beautifully! Feels like I'm watching a skilled surgeon at work.
A skilled surgeon is a great analogy! We actually instruct Codebuff to focus on making the most minimal edits, so that it does precisely what you want.
Does anyone know of a “copilot” style autocomplete in the CLI? I don’t want it to run anything for me, just predict what command I might type next
I'm currently hacking together a prototype of such a tool. The problem I noticed is that in CLI, commands are way less predictable than lines in code files, so such a tool will probably have a pretty low correct completion rate. However, there are clearly cases where it could be very helpful.
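For the curious, the most naive baseline (purely illustrative, not my actual prototype) is just frequency-matching against shell history, something like:

```typescript
// Toy baseline for CLI command prediction: suggest the most frequent command
// from shell history that starts with the current prefix. A real tool would
// add context (cwd, git state, recent output) and likely an LLM on top.
import { readFileSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

function suggest(prefix: string, historyPath = join(homedir(), ".bash_history")): string | null {
  const counts = new Map<string, number>();
  for (const line of readFileSync(historyPath, "utf8").split("\n")) {
    const cmd = line.trim();
    if (!cmd || !cmd.startsWith(prefix)) continue;
    counts.set(cmd, (counts.get(cmd) ?? 0) + 1);
  }
  let best: string | null = null;
  let bestCount = 0;
  for (const [cmd, n] of counts) {
    if (n > bestCount) {
      best = cmd;
      bestCount = n;
    }
  }
  return best;
}

console.log(suggest("git ")); // e.g. "git status", depending on your history
```

Even this gets the easy repeats right; it's everything context-dependent where the completion rate falls off.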
I assume you're referring to it guessing at what command you might type next? If so, the closest I can think of is Warp. Have you tried it?
The product design is really thoughtful, and thanks for sharing your story – can't wait to try this out and see how you iterate on it!
This is much needed! Gonna try this out. I haven't seen a good tool that lets me generate code via CLI.
How do you end up handling line numbers in patches? Counting has always been a sticking point for LLMs.
I wrote a custom `applyPatch` function that tries to use the line numbers, but falls back to searching for the context lines to line up the +/- patched lines.
It actually got the line numbers not too wrong, so they might have been helpful. (I included the line numbers for the original file in the context.)
Ultimately though, this approach was still error prone enough that we recently switched away.
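For anyone curious, the rough shape of that fallback (a minimal sketch, not the actual `applyPatch` implementation; the `Hunk` shape here is invented) was something like:

```typescript
// Minimal sketch of a line-number-tolerant patch applier: try each hunk at its
// stated start line; if the context/removed lines don't match there, scan the
// file for a position where they do. Not the real code.
interface Hunk {
  startLine: number; // 1-based line number the patch claims
  lines: string[];   // each prefixed with " ", "-", or "+"
}

function applyPatch(original: string, hunks: Hunk[]): string {
  let fileLines = original.split("\n");

  for (const hunk of hunks) {
    // Lines the hunk expects to find in the original (context + removals).
    const expected = hunk.lines
      .filter((l) => l.startsWith(" ") || l.startsWith("-"))
      .map((l) => l.slice(1));
    // Lines that should be present after applying (context + additions).
    const replacement = hunk.lines
      .filter((l) => l.startsWith(" ") || l.startsWith("+"))
      .map((l) => l.slice(1));

    const matchesAt = (pos: number) =>
      expected.every((line, i) => fileLines[pos + i] === line);

    // First trust the stated line number, then fall back to scanning the file.
    let pos = hunk.startLine - 1;
    if (!matchesAt(pos)) {
      pos = fileLines.findIndex((_, i) => matchesAt(i));
      if (pos === -1) throw new Error("Could not locate hunk context");
    }

    fileLines = [
      ...fileLines.slice(0, pos),
      ...replacement,
      ...fileLines.slice(pos + expected.length),
    ];
  }
  return fileLines.join("\n");
}
```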
The ergonomics of using unit tests + this to pass said unit tests is actually pretty good. Just tried it.
> One user racked up a $500 bill by building out two Flutter apps in parallel.
Is that through the Enterprise plan?
Nope, if you go over the allotted credits on the $99 plan, then you pay per usage (with a 5% discount).
We actually ended up not charging this guy since there was a bug where we told him he got 50,000 credits instead of 10,000. Oops!
Can you speak more to how efficiency towards context management works (to reduce token costs)? Or are you loading up context to the brim with each request?
I think managing context is the most important aspect of today's coding agents. We pick only files we think would be relevant to the user request and add those. We generally pull more files than Cursor, which I think is an advantage.
However, we also try to leverage prompt-caching as much as possible to lower costs and improve latency.
So we basically only add files over time. Once context gets too large, it will purge them all and start again.
> However, we also try to leverage prompt-caching as much as possible to lower costs and improve latency.
Interesting! That does have a 5-minute expiry on Claude, and your users can use Codebuff in a suboptimal way. Do you have plans for aligning your users toward using the tool in a way that makes the most of prompt caches?
That's a really great point. Since we manage the context, we should clear the old files if it's been > 5 minutes. Thanks for the idea!
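Roughly what I have in mind (a hypothetical sketch, not our actual code; the budget numbers are made up):

```typescript
// Hypothetical sketch: keep file context append-only so the cached prompt
// prefix stays stable, and reset it when it grows too large or when the
// ~5-minute prompt-cache TTL has lapsed anyway.
const CACHE_TTL_MS = 5 * 60 * 1000;  // Anthropic prompt-cache expiry
const MAX_CONTEXT_TOKENS = 150_000;  // arbitrary budget for this sketch

interface ContextFile {
  path: string;
  contents: string;
}

class FileContext {
  private files: ContextFile[] = [];
  private lastRequestAt = 0;

  // Very rough token estimate; a real implementation would use a tokenizer.
  private tokenCount(): number {
    return this.files.reduce((sum, f) => sum + Math.ceil(f.contents.length / 4), 0);
  }

  addFiles(newFiles: ContextFile[]): void {
    const now = Date.now();
    const cacheExpired = now - this.lastRequestAt > CACHE_TTL_MS;
    if (cacheExpired || this.tokenCount() > MAX_CONTEXT_TOKENS) {
      // The cached prefix is gone (or too big) anyway, so start fresh.
      this.files = [];
    }
    // Append only, never reorder: reordering would invalidate the cached prefix.
    const known = new Set(this.files.map((f) => f.path));
    this.files.push(...newFiles.filter((f) => !known.has(f.path)));
    this.lastRequestAt = now;
  }

  asPromptPrefix(): string {
    return this.files.map((f) => `// ${f.path}\n${f.contents}`).join("\n\n");
  }
}
```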
How did that bug occur? Was the code generated by your code generator?
Really like the look of this interface. You're definitely onto something. Good work.
Amazing stuff! The rebrand is great and it's cool to read the whole story!
Brilliant - and thank you - I'm so impressed with your work that I finally made an account just to comment. It worked out of the box; a few minor glitches, but this is the start of awesome. Keep doing what you are doing.
Amazing, good to hear! What were the minor glitches you encountered? Would love to fix them up.
Does Codebuff / the tree sitter implementation support Svelte?
Yes, at least partially. It will work, but maybe not as well, since we don't parse out the function names from .svelte files.
I can add that if tree-sitter supports Svelte. I haven't checked; maybe it already is supported?
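If a tree-sitter grammar for Svelte is available (I believe there's a community one, but I haven't verified), wiring it up with web-tree-sitter would look roughly like this sketch (wasm path hypothetical):

```typescript
// Sketch: load a hypothetical Svelte grammar with web-tree-sitter and walk the
// parse tree, e.g. to pull out function names for file summaries.
import Parser from "web-tree-sitter";

await Parser.init();
const parser = new Parser();
// Hypothetical path to a wasm build of a Svelte grammar.
const Svelte = await Parser.Language.load("./tree-sitter-svelte.wasm");
parser.setLanguage(Svelte);

const source = `<script>
  function greet(name) { return "hi " + name; }
</script>
<h1>{greet("world")}</h1>`;

const tree = parser.parse(source);

// Print the named node types, indented by depth.
function walk(node: Parser.SyntaxNode, depth = 0): void {
  console.log("  ".repeat(depth) + node.type);
  for (const child of node.namedChildren) walk(child, depth + 1);
}
walk(tree.rootNode);
```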
This looks so awesome! Congrats on your launch. Eager to use it!
Extra context length looks valuable! Excited to try this out!
congrats on the launch! that's super cool, but I'm also wondering about your vision for the number of calls / open source. thanks!
Looks awesome! Great work team.
Your website has a serious issue. Trying to play the YouTube video makes the page slow down to a crawl, even in 1080p, while playing it on YouTube directly has no issue, even in 4K.
On the project itself, I don't really find it exciting at all, I'm sorry. It's just another wrapper for a 3rd party model, and the fact that you can 1) describe the entire workflow in 3 paragraphs, and 2) built it and launched it in around 4 months, emphasizes that.
Congrats on launch I guess.
Weird, thanks for flagging – we're just using a YouTube embed in an iframe, but I'll take a look.
No worries if this isn't a good fit for you. You're welcome to try it out for free anytime if you change your mind!
FWIW I wasn't super excited when James first showed me the project. I had tried so many AI code editors before, but never found them to be _actually usable_. So when James asked me to try, I just thought I'd be humoring him. Once I gave it a real shot, I found Codebuff to be great because of its form factor and deep context awareness: CLI allows for portability and system integration that plugins or extensions really can't do. And when AI actually understands my codebase, I just get a lot more done.
Not trying to convince you to change your mind, just sharing that I was in your shoes not too long ago!
I would really rethink your value proposition.
> CLI allows for portability and system integration that plugins or extensions really can't do
In the past 6 or 7 years I haven't written a single line of code outside of a JetBrains IDE. Same thing for all of my team (whether they use JetBrains IDEs or VS Code), and I imagine for the vast majority of developers.
This is not a convincing argument for the vast majority of people. If anything, the fact that it requires a tool OUTSIDE of where they write code is an inconvenience.
> And when AI actually understands my codebase, I just get a lot more done.
But Amazon Q does this without me needing to type anything to instruct it, or to tell it which files to look at. And, again, without needing to go out of my IDE.
Having to switch to a new tool to write code using AI is a huge deterrent and asking for it is a reckless choice for any company offering those tools. Integrating AI in tools already used to write code is how you win over the market.
> Your website has a serious issue.
I was thinking the same. My (admittedly old-ish) 2070 Super runs at 25-30% just looking at the landing page. Seems a bit crazy for a basic web page. I'm guessing it's the background animation.
For trivial tasks, this is certainly easy. For complicated tasks, like understanding a codebase or a product catalog of tens of thousands of entries, this is non-trivial.
My team is not working in the code gen space, but even though we also "just wrap" an API, almost all of our work is in data acquisition, transformation, the retrieval strategy, and structuring of the request context.
The API call to the LLM is like hitting "bake" on an oven: all of the real work happens before that.
$99/month lol. I have Perplexity, OpenAI, Claude, and Cursor subscriptions and I end up paying way less than $99/month. Clearly you haven't done any research on price. Aider and Cline are open source; I'm not sure why someone would subscribe to this unless it's the top model on http://swebench.com/
I tried it on two of my git repositories, just to see if it could do a decent commit summary. I was very pleasantly surprised with the good result. I was unpleasantly surprised that this already cost me 175 credits. If I extrapolate that over my ~100 repositories, it would already put me at 8,750, just to let it write a commit message for release day. That is way outside the free range and would basically eat up most of the $99 I would have to spend as well. My subscription price for Cody is $8 a month. Pricing seems just way off.
Couldn't get through the video, your keyboard sounds are very annoying.
Sorry! Will try an external keyboard next time!
I'm a big fan! It's better than cursor in many ways
Your comment is a bit suspicious given that your previous submissions are limited to manifold market links, and this tool came from that company.
Yes, he's the lead Manifold eng. Please discount appropriately.
Really uncool. You guys are getting a pretty positive reception here, this deceptive behavior shouldn’t have happened
HN spamtroturfing one of their own startups? I'm shocked, shocked!
Could you elaborate on those ways please?
Brace yourself.
The night critics are coming.
Don't take it personally or get too discouraged. You are not your product, and you're certainly not your first demo of your first product. But, knowing your competition, how you stack up against them, and how the people you're selling to feel about them, is a huge part of your job as a founder. It will only get more important.
You have to constantly do your research. It is one of those anxiety-inducing tasks that's easy to justify avoiding when all you want to do is code your idea up and there's so much other work to do. But it's your job. Even when you hire someone else to run product for you it'll be your responsibility to own it.
What you've built is cool, a lot of people love it even though they know about the other tools available. Now you know what your main competition does, you also know what it doesn't do, so you get to solve for that - and if you solved the context problem in isolation with treesitter then you're obviously capable.
You'll have realised by now that Aider didn't use tree-sitter when it started. Instead it used ctags - a pattern-matching approach to code indexing from 40 years ago that doesn't capture signatures or create an AST; it effectively just indexes the code with a bunch of regexes. And it's not like tree-sitter wasn't around when Aider was first written. Keep that in mind.
Good luck.