This reflects the fact that a lot of working with LLMs is just organizing text. Prompts become a real engineering problem when you are orchestrating pipelines of dozens or more files, with completions at various points and context windows of 100K tokens or more.
I've not found a satisfying framework yet; I generally find raw Python best. But I still spend too much time on boilerplate and on tweaking formatting, samplers, and chunking for context windows.
If anyone knows of a better tool for abstracting that away (LangChain is not it IMO) please let me know.
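To make the chunking complaint concrete, this is the kind of helper I end up rewriting every time (a minimal sketch; it assumes the tiktoken package, and the budget and encoding are model-dependent — real pipelines usually also want overlap and per-file boundaries):

    # Pack text into chunks that fit a token budget.
    # Assumes: pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # encoding varies by model

    def chunk_by_tokens(text: str, budget: int = 100_000):
        """Yield slices of `text`, each at most `budget` tokens long."""
        tokens = enc.encode(text)
        for i in range(0, len(tokens), budget):
            yield enc.decode(tokens[i:i + budget])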
But sending whole files isn't always optimal. I'm thinking there has to be a better way, like picking workspace symbols and pulling in only the code they depend on from other files. Something something LSP/tree-sitter-based.
This is what aider does: it uses tree-sitter to extract the AST from each source file, uses the ASTs to build a call graph, and then runs a graph optimization to identify the most relevant parts of the code base given the current state of the LLM chat.
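Roughly like this — my sketch of the idea, not aider's actual code (the real repo map is more elaborate, e.g. it weights symbols already mentioned in the chat). Assumes the tree_sitter_languages and networkx packages:

    # Use tree-sitter ASTs to find function definitions and call sites,
    # link files into a call graph, and rank files with PageRank to pick
    # the most relevant context.
    # Assumes: pip install tree_sitter_languages networkx
    from pathlib import Path
    import networkx as nx
    from tree_sitter_languages import get_parser

    parser = get_parser("python")  # one grammar per language in a real tool

    def defs_and_calls(source: bytes):
        """Return (names defined, names called) in one file."""
        defs, calls = [], []
        stack = [parser.parse(source).root_node]
        while stack:
            node = stack.pop()
            if node.type == "function_definition":
                name = node.child_by_field_name("name")
                defs.append(source[name.start_byte:name.end_byte].decode())
            elif node.type == "call":
                fn = node.child_by_field_name("function")
                calls.append(source[fn.start_byte:fn.end_byte].decode())
            stack.extend(node.children)
        return defs, calls

    def rank_files(paths):
        """Edge file -> file whenever one file calls a name the other defines."""
        defined_in, called_by = {}, {}
        for p in map(Path, paths):
            defs, calls = defs_and_calls(p.read_bytes())
            for name in defs:
                defined_in.setdefault(name, p)
            called_by[p] = calls
        graph = nx.DiGraph()
        for p, calls in called_by.items():
            for name in calls:  # crude: dotted calls like obj.method won't match
                target = defined_in.get(name)
                if target and target != p:
                    graph.add_edge(p, target)
        scores = nx.pagerank(graph)
        return sorted(scores, key=scores.get, reverse=True)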
Not anymore in the subscription LLM offerings. Claude seems to allow 70k tokens or more in its paid UI; ChatGPT seems to be about half of that, while custom GPTs allow well over 100k.
function code2prompt() {
    # Wrap the code2prompt command in a function that sets a number of default excludes.
    # https://github.com/mufeedvh/code2prompt/
    # Usage: code2prompt [-t <template>] [-nn] <path> [further code2prompt options]
    local arguments excludeFiles excludeFolders templatesFolder excludeExtensions
    templatesFolder="${HOME}/git/code2prompt/templates"
    excludeFiles=".editorconfig,.eslintignore,.eslintrc,tsconfig.json,.gitignore,.npmrc,LICENSE,esbuild.config.mjs,manifest.json,package-lock.json,\
version-bump.mjs,versions.json,yarn.lock,CONTRIBUTING.md,CHANGELOG.md,SECURITY.md,.nvmrc,.env,.env.production,.prettierrc,.prettierignore,.stylelintrc,\
CODEOWNERS,commitlint.config.js,renovate.json,pre-commit-config.yaml,.vimrc,poetry.lock,changelog.md,contributing.md,.prettierrc.json,\
.prettierrc.yml,.prettierrc.js,.eslintrc.js,.eslintrc.json,.eslintrc.yml,.eslintrc.yaml,.stylelintrc.js,.stylelintrc.json,.stylelintrc.yml,.stylelintrc.yaml"
    excludeFolders="screenshots,dist,node_modules,.git,.github,.vscode,build,coverage,tmp,out,temp,logs"
    excludeExtensions="png,jpg,jpeg,gif,svg,mp4,webm,avi,mp3,wav,flac,zip,tar,gz,bz2,7z,iso,bin,exe,app,dmg,deb,rpm,apk,fig,xd,blend,fbx,obj,tmp,swp,\
lock,DS_Store,sqlite,log,sqlite3,dll,woff,woff2,ttf,eot,otf,ico,icns,csv,doc,docx,ppt,pptx,xls,xlsx,pdf,cmd,bat,dat,baseline,ps1,diff,bmp"
    echo "---"
    echo "Available templates:"
    ls -1 "$templatesFolder"
    echo "---"
    echo "Excluding files: $excludeFiles"
    echo "Excluding folders: $excludeFolders"
    echo "Run with -nn to disable the default excludes"
    # array of arguments to build up
    arguments=("--tokens")
    # if -t and a template name are provided, append the template flag with the full path to the template
    if [[ $1 == "-t" ]]; then
        arguments+=("--template" "$templatesFolder/$2")
        shift 2
    fi
    if [[ $1 == "-nn" ]]; then
        command code2prompt "${arguments[@]}" "${@:2}" # drop the -nn flag and skip the default excludes
    else
        command code2prompt "${arguments[@]}" --exclude-files "$excludeFiles" --exclude-folders "$excludeFolders" --exclude "$excludeExtensions" "$@"
    fi
}
If you need to feed multiple files to ChatGPT or another LLM, this makes it way easier than manually copying and pasting.
This app shows you a file-picker modal. Use the Shift or Option keys to select multiple text files across one or more directories.
All of the selected files will be concatenated for easy select all / paste into your LLM conversation.
Output format of selected files is:
### `[filepath]`
[file contents]
### `[filepath]`
... and so on.
- Output is in a text field for easy copy-pasta.
- File path starts at the common parent of all selected files
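For reference, here's a minimal script that produces the same format (my sketch, not the app's actual code; it assumes UTF-8 text files and at least two inputs for a meaningful common parent):

    # Concatenate files under "### `path`" headers, with paths relative
    # to the common parent directory of all selected files.
    import os
    import sys
    from pathlib import Path

    def concat(paths):
        files = [Path(p).resolve() for p in paths]
        root = Path(os.path.commonpath([f.parent for f in files]))
        parts = [f"### `{f.relative_to(root)}`\n\n{f.read_text()}" for f in files]
        return "\n\n".join(parts)

    if __name__ == "__main__":
        print(concat(sys.argv[1:]))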