Python Project X-Ray — Low‑Token Model Pack #
A practical, repeatable method to condense any Python codebase into a compact “model pack” that preserves the project’s structure, APIs, interactions, and intent with minimal tokens. Optimized for rapid onboarding of agents/developers and efficient LLM reasoning.
What You Get (Outputs) #
- overview.md — high-signal, low-token project narrative and entry points
- api.md — public surface: packages/modules/classes/functions with signatures
- tree.txt — pruned file tree (noise filtered), sizes, highlights
- graph.json — import graph + class inheritance + cross-module usage edges
- stats.json — LOC, counts, ranks, hotspots
- pack.json — metadata, budgets, and routing hints for LLMs
All outputs are derived from the source of truth (AST + file system), deterministic apart from the generated_at timestamp in pack.json, and safe to share (secrets are excluded by the default ignores).
Design Principles #
- Signal first: functions, classes, signatures, module roles, entrypoints
- Hierarchical: project → package → module → class/method → function
- Contracts over implementation: public API, docstrings, signatures
- Interactions over details: imports, ownership, orchestration points
- Token budgets: weight by impact (orchestrators, hotspots, public API)
- Deterministic & auditable: AST-based, reproducible, no LLM in pack build
End‑to‑End Pipeline (10 steps) #
1. Discover
   - Locate the project root. If ambiguous, prefer the folder containing pyproject.toml or setup.cfg.
   - Collect files with safe defaults and ignore noise (see baseline ignores).
2. Classify
   - Group by type: code (.py), tests, notebooks, config, assets, data, build.
   - Detect packages (folders with __init__.py) and top-level scripts.
3. Static Parse (AST)
   - Per .py: extract module docstring, imports, __all__, classes, functions.
   - For classes: bases, public methods (filter _private), decorators, docstrings.
   - For functions: signature, decorators, docstrings.
4. Graphs
   - Import graph: module → modules used.
   - Inheritance graph: class → base(s) (FQNs when resolvable).
   - Cross-module usage (approx): map imported names to attribute/function calls.
5. Entry Points & Runtimes
   - Scripts with if __name__ == "__main__".
   - CLI (argparse/click/typer), app factories, WSGI/ASGI apps, tasks.
6. Public Surface
   - Respect __all__ if present, else treat names without a leading _ as public.
   - Record canonical FQNs: package.module:Class.method, package.module:function.
7. Ranking & Budgets
   - Rank modules by LOC and coupling (in/out-degree in import graph).
   - Boost orchestrators by filename heuristics: app, main, server, manager, tasks, pipeline, service.
   - Allocate tokens: e.g., top 20% modules → 40% tokens; public API → 40%; rest → 20%.
8. Summarize (Deterministic)
   - Generate terse, structured bullets. Incorporate one-line docstrings if present.
   - Focus on responsibilities, collaborators, and side-effects.
9. Emit Pack
   - Write overview.md, api.md, tree.txt, graph.json, stats.json, pack.json.
   - Use stable IDs and FQNs; prefer consistent sort order.
10. Verify & Diff
   - Validate JSON schemas. Re-run and diff packs across commits to track drift.
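The Ranking & Budgets step can be sketched as follows. This is a minimal illustration, not the builder's actual logic: the function names `rank_modules` and `allocate_tokens`, the score formula (LOC multiplied by one plus the import degree), and the 1.5 boost factor are all assumptions chosen for the example. Only the 20% / 40% / 40% / 20% split and the filename hints come from the text above.

```python
from collections import Counter

# Filename hints from the heuristic list; matched against the last module component.
ORCHESTRATOR_HINTS = ("app", "main", "server", "manager", "tasks", "pipeline", "service")


def rank_modules(loc, imports, boost=1.5):
    """Return module names sorted by a LOC-and-coupling score, highest first.

    loc: dict of module name -> line count.
    imports: iterable of (source_module, imported_module) edges.
    The score formula and the boost value are assumptions for illustration.
    """
    degree = Counter()
    for src, dst in imports:
        degree[src] += 1  # out-degree
        degree[dst] += 1  # in-degree
    scores = {}
    for mod, lines in loc.items():
        score = lines * (1 + degree[mod])
        if mod.rsplit(".", 1)[-1] in ORCHESTRATOR_HINTS:
            score *= boost
        scores[mod] = score
    # Sort by score descending, then by name so ties are stable.
    return sorted(scores, key=lambda m: (-scores[m], m))


def allocate_tokens(ranked, total, top_percent=20, top_share=0.4, api_share=0.4):
    """Split a total token budget between top modules, public API, and the rest."""
    n_top = max(1, len(ranked) * top_percent // 100)
    top_budget = int(total * top_share)
    api_budget = int(total * api_share)
    return {
        "top_modules": ranked[:n_top],
        "per_top_module": top_budget // n_top,
        "public_api": api_budget,
        "rest": total - top_budget - api_budget,
    }
```

Sorting on `(-score, name)` keeps the ranking reproducible when two modules tie, which matters if packs are diffed across commits.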
Baseline Ignores (Safe Defaults) #
- Directories: .git, .venv, venv, __pycache__, .mypy_cache, .pytest_cache, .ruff_cache, .tox, build, dist, site-packages, node_modules, .idea, .vscode
- Files: *.pyc, *.pyo, *.pyd, *.so, *.dll, *~
- Data/large: data/, .cache/, .local/, media/, downloads/
- Secrets: .env, .env.* (strip values if included), credentials.*
Include tests/, notebooks/ optionally: summarize lightly unless they define APIs.
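If a .env file is included rather than ignored, the "strip values" rule could look like the sketch below. The helper name `redact_env` and the `<redacted>` placeholder are hypothetical; the function keeps variable names only and drops comments, blank lines, and anything that is not a KEY=VALUE pair.

```python
def redact_env(text):
    """Return .env-style text with every value replaced by a placeholder.

    Comments and non-assignment lines are dropped entirely, since they
    can also carry sensitive material.
    """
    keys = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        key = stripped.split("=", 1)[0].strip()
        # Tolerate the shell-style "export KEY=value" form.
        if key.startswith("export "):
            key = key[len("export "):].strip()
        keys.append(f"{key}=<redacted>")
    return "\n".join(keys)
```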
Heuristics That Matter #
- Public-first: list public functions and class methods before private helpers
- Contracts: keep signatures and one-liners; skip bodies
- Collapse trivial wrappers (<= 3 statements, no branching)
- Expand orchestrators and adapters; compress pure utilities
- Merge duplicates; respect __all__ for canonical exports
- External deps: name and role only (avoid copying third-party docs)
Output Schemas (Essential) #
- graph.json
{
  "$schema": "https://example.org/xray/graph.schema.json",
  "modules": ["pkg.mod1", "pkg.mod2"],
  "imports": [["pkg.mod1", "pkg.mod2"], ["pkg.mod2", "pkg.util"]],
  "classes": {"pkg.mod1:Foo": {"bases": ["pkg.base:Base"], "module": "pkg.mod1"}},
  "uses": [["pkg.mod1", "pkg.util:helper"], ["pkg.mod2:Bar", "pkg.mod1:Foo"]]
}
- pack.json
{
  "name": "project-xray-pack",
  "version": 1,
  "generated_at": "2025-09-02T00:00:00Z",
  "python": "3.11",
  "budgets": {"overview_md": 2000, "api_md": 4000},
  "policy": {
    "module_top_percent": 20,
    "module_token_share": 0.4,
    "public_api_share": 0.4
  }
}
- stats.json (example keys)
{
  "files": 134,
  "modules": 78,
  "loc": {"total": 21234, "by_module": {"pkg.core": 987}},
  "rank": {"modules": ["pkg.app", "pkg.core", "pkg.service"]}
}
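Without a JSON Schema library, a lightweight shape check against the example keys above might look like this. The `REQUIRED_KEYS` table and the `check_shape` helper are hypothetical; they cover only the top-level keys shown in the examples and are not a substitute for a full schema.

```python
# Top-level keys and expected Python types, taken from the example documents.
REQUIRED_KEYS = {
    "graph.json": {"modules": list, "imports": list, "classes": dict, "uses": list},
    "pack.json": {"name": str, "version": int, "budgets": dict, "policy": dict},
    "stats.json": {"files": int, "modules": int, "loc": dict, "rank": dict},
}


def check_shape(filename, data):
    """Return a list of human-readable problems; an empty list means the shape is OK."""
    problems = []
    for key, expected in REQUIRED_KEYS[filename].items():
        if key not in data:
            problems.append(f"{filename}: missing '{key}'")
        elif not isinstance(data[key], expected):
            problems.append(f"{filename}: '{key}' should be {expected.__name__}")
    return problems
```

Typical use would be `check_shape("graph.json", json.load(fh))` for each emitted file, failing the build if any list is non-empty.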
Golden Format: overview.md (Template) #
# Project: {name}
- Purpose: {one-liner}
- Stack: Python {py}, key deps: {X, Y, Z}
- Entrypoints: {scripts, CLIs, apps}
## Architecture (high signal)
- Packages: {pkgA} (role), {pkgB} (role)
- Orchestrators: {module/class} → coordinates {n} components
- Data flow: {producer} → {processor} → {sink}
- Storage/IO: {db/cache/files}, config via {env/config}
## Key Interactions
- Imports: {top edges}
- Cross-module usage: {who calls what}
## Risks/Hotspots
- {module} high complexity and centrality
- {module} high churn / LOC
## Notes
- Conventions, extension points, feature flags
Quickstart: One-File Builder (drop‑in script) #
The following script produces a compact pack folder for any Python project with zero external dependencies. Save as xray.py and run.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Python Project X-Ray — minimal, dependency-free pack builder.
Outputs: overview.md, api.md, tree.txt, graph.json, stats.json, pack.json
"""
from __future__ import annotations

import argparse
import ast
import datetime as dt
import json
import os
import sys
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

IGNORED_DIRS = {".git", ".venv", "venv", "__pycache__", ".mypy_cache", ".pytest_cache", ".ruff_cache", ".tox", "build", "dist", "site-packages", "node_modules", ".idea", ".vscode", ".cache"}
# Suffixes of compiled artifacts; a guard alongside the .py filter in walk_py_files.
IGNORED_FILE_GLOBS = {".pyc", ".pyo", ".pyd", ".so", ".dll"}
# Both sync and async definitions count as functions/methods.
FUNC_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


class ModuleInfo:
    def __init__(self, module: str, path: Path) -> None:
        self.module = module
        self.path = path
        self.doc: Optional[str] = None
        self.imports: List[str] = []
        self.exports: Optional[List[str]] = None
        self.classes: Dict[str, Dict[str, Any]] = {}
        self.functions: Dict[str, Dict[str, Any]] = {}
        self.loc: int = 0


def is_private(name: str) -> bool:
    # _name is private; dunder names such as __init__ are kept.
    return name.startswith("_") and not (name.startswith("__") and name.endswith("__"))


def fqmn(module: str, name: str) -> str:
    return f"{module}:{name}" if name else module


def parse_module(path: Path, pkg_root: Path) -> Optional[ModuleInfo]:
    try:
        text = path.read_text(encoding="utf-8", errors="ignore")
    except Exception:
        return None
    if not text.strip():
        return None
    # Module name from the relative path; pkg/__init__.py maps to "pkg".
    parts = path.relative_to(pkg_root).with_suffix("").parts
    if parts and parts[-1] == "__init__":
        parts = parts[:-1]
    mod = ".".join(parts) or pkg_root.name
    try:
        tree = ast.parse(text)
    except Exception:
        return None
    info = ModuleInfo(mod, path)
    info.loc = len(text.splitlines())
    info.doc = ast.get_docstring(tree)
    # Extract __all__ (only literal lists/tuples can be evaluated statically)
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for t in node.targets:
                if isinstance(t, ast.Name) and t.id == "__all__":
                    try:
                        val = ast.literal_eval(node.value)
                        if isinstance(val, (list, tuple)):
                            info.exports = [str(v) for v in val]
                    except Exception:
                        pass
    # Imports (relative imports without a module name are skipped)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for n in node.names:
                info.imports.append(n.name)
        elif isinstance(node, ast.ImportFrom):
            base = node.module or ""
            if base:
                info.imports.append(base)
    # Classes & functions (top level only)
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            cls_name = node.name
            bases = []
            for b in node.bases:
                if isinstance(b, ast.Name):
                    bases.append(b.id)
                elif isinstance(b, ast.Attribute):
                    bases.append(f"{getattr(b.value, 'id', '')}.{b.attr}".strip('.'))
                elif isinstance(b, ast.Subscript):
                    bases.append(ast.unparse(b.value) if hasattr(ast, 'unparse') else 'Generic')
                else:
                    bases.append(type(b).__name__)
            methods: Dict[str, Any] = {}
            for n in node.body:
                if isinstance(n, FUNC_NODES):
                    if is_private(n.name):
                        continue
                    sig = _format_signature(n.args)
                    doc = ast.get_docstring(n)
                    methods[n.name] = {"sig": sig, "doc": _one_line(doc)}
            info.classes[cls_name] = {"bases": bases, "methods": methods, "doc": _one_line(ast.get_docstring(node))}
        elif isinstance(node, FUNC_NODES):
            if is_private(node.name):
                continue
            sig = _format_signature(node.args)
            info.functions[node.name] = {"sig": sig, "doc": _one_line(ast.get_docstring(node))}
    # Respect __all__
    if info.exports is not None:
        info.functions = {k: v for k, v in info.functions.items() if k in info.exports}
        info.classes = {k: v for k, v in info.classes.items() if k in info.exports}
    return info


def _one_line(s: Optional[str], max_len: int = 160) -> Optional[str]:
    if not s:
        return None
    s = " ".join(s.strip().split())
    return (s[: max_len - 1] + "…") if len(s) > max_len else s


def _format_signature(args: ast.arguments) -> str:
    # Names only: annotations and default values are intentionally omitted.
    def fmt(arg: Optional[ast.arg]) -> str:
        if not arg:
            return ""
        return arg.arg

    parts: List[str] = []
    # Positional-only (3.8+)
    for a in getattr(args, "posonlyargs", []) or []:
        parts.append(fmt(a))
    if getattr(args, "posonlyargs", None):
        parts.append("/")
    # Positional/keyword
    for a in args.args:
        parts.append(fmt(a))
    # Vararg
    if args.vararg:
        parts.append("*" + fmt(args.vararg))
    else:
        # keyword-only separator
        if args.kwonlyargs:
            parts.append("*")
    for a in args.kwonlyargs:
        parts.append(fmt(a))
    if args.kwarg:
        parts.append("**" + fmt(args.kwarg))
    return "(" + ", ".join(parts) + ")"


def walk_py_files(root: Path) -> List[Path]:
    out: List[Path] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # prune ignored dirs in place so os.walk does not descend into them
        dirnames[:] = [d for d in dirnames if d not in IGNORED_DIRS]
        for fn in filenames:
            p = Path(dirpath) / fn
            if p.suffix == ".py" and not any(fn.endswith(ext) for ext in IGNORED_FILE_GLOBS):
                out.append(p)
    # os.walk order depends on the file system; sort for reproducible output
    return sorted(out)


def build_tree_text(root: Path) -> str:
    lines: List[str] = []
    for p in sorted(root.rglob("*")):
        parts = p.relative_to(root).parts
        # apply ignores
        if any(part in IGNORED_DIRS for part in parts):
            continue
        if p.is_file() and p.suffix not in {".py", ".md", ".toml", ".cfg", ".ini", ".yaml", ".yml", ".txt"}:
            continue
        depth = len(parts) - 1
        lines.append("  " * depth + (p.name + ("/" if p.is_dir() else "")))
    return "\n".join(lines)


def main() -> None:
    ap = argparse.ArgumentParser(description="Build a low-token X-Ray pack for a Python project")
    ap.add_argument("root", help="project root directory")
    ap.add_argument("out", help="output directory for the pack")
    args = ap.parse_args()
    root = Path(args.root).resolve()
    out = Path(args.out).resolve()
    out.mkdir(parents=True, exist_ok=True)
    py_files = walk_py_files(root)
    modules: List[ModuleInfo] = []
    for f in py_files:
        mi = parse_module(f, root)
        if mi:
            modules.append(mi)
    # Graphs & stats
    mod_names = sorted({m.module for m in modules})
    imports: List[Tuple[str, str]] = []
    classes: Dict[str, Any] = {}
    uses: List[Tuple[str, str]] = []  # rough placeholder; deterministic but shallow
    loc_total = 0
    rank = sorted(((m.loc, m.module) for m in modules), reverse=True)
    for m in modules:
        loc_total += m.loc
        for imp in m.imports:
            imports.append((m.module, imp))
        for cname, cinfo in m.classes.items():
            fq = f"{m.module}:{cname}"
            classes[fq] = {"bases": cinfo["bases"], "module": m.module}
    graph = {
        "modules": mod_names,
        "imports": sorted(set(imports)),  # de-duplicated, stable order
        "classes": classes,
        "uses": uses,
    }
    stats = {
        "files": len(py_files),
        "modules": len(modules),
        "loc": {"total": loc_total, "by_module": {m.module: m.loc for m in modules}},
        "rank": {"modules": [m for _, m in rank]},
    }
    pack = {
        "name": root.name,
        "version": 1,
        # timezone-aware UTC timestamp; datetime.utcnow() is deprecated since 3.12
        "generated_at": dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "budgets": {"overview_md": 2000, "api_md": 4000},
        "policy": {"module_top_percent": 20, "module_token_share": 0.4, "public_api_share": 0.4},
    }
    # overview.md
    overview = [
        f"# Project: {root.name}",
        f"- Python: {pack['python']} | Files: {stats['files']} | Modules: {stats['modules']} | LOC: {stats['loc']['total']}",
        "\n## Top Modules (by LOC)",
    ]
    for m in stats["rank"]["modules"][:10]:
        overview.append(f"- {m}")
    overview.append("\n## Entrypoints (heuristic)")
    # Reuse py_files so ignored directories (venv, build, ...) are not scanned
    for p in py_files:
        if p.name in {"__main__.py", "main.py", "app.py", "server.py", "manage.py", "cli.py"}:
            overview.append(f"- {p.relative_to(root).as_posix()}")
    (out / "overview.md").write_text("\n".join(overview), encoding="utf-8")
    # api.md
    api_lines: List[str] = ["# Public API\n"]
    for m in sorted(modules, key=lambda x: x.module):
        if not m.functions and not m.classes:
            continue
        api_lines.append(f"\n## {m.module}")
        for fname, finfo in sorted(m.functions.items()):
            api_lines.append(f"- def {fname}{finfo['sig']} — {finfo.get('doc') or ''}")
        for cname, cinfo in sorted(m.classes.items()):
            bases = f"({', '.join(cinfo['bases'])})" if cinfo['bases'] else ""
            api_lines.append(f"- class {cname}{bases} — {cinfo.get('doc') or ''}")
            for mname, minfo in sorted(cinfo["methods"].items()):
                api_lines.append(f"  - {cname}.{mname}{minfo['sig']} — {minfo.get('doc') or ''}")
    (out / "api.md").write_text("\n".join(api_lines), encoding="utf-8")
    # tree.txt
    (out / "tree.txt").write_text(build_tree_text(root), encoding="utf-8")
    # graph.json, stats.json, pack.json
    (out / "graph.json").write_text(json.dumps(graph, indent=2), encoding="utf-8")
    (out / "stats.json").write_text(json.dumps(stats, indent=2), encoding="utf-8")
    (out / "pack.json").write_text(json.dumps(pack, indent=2), encoding="utf-8")
    print(f"Pack written to: {out}")


if __name__ == "__main__":
    main()
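The script leaves the `uses` list in graph.json empty. One way it could be filled, following the "map imported names to attribute/function calls" idea from the pipeline, is sketched below. The helper `collect_uses` is hypothetical and deliberately approximate: it only sees direct calls on names bound by absolute imports, and it cannot tell whether a name imported with `from X import name` is a function, a class, or a submodule.

```python
import ast


def collect_uses(module, source):
    """Return sorted (module, "target_module:name") edges for calls on imported names."""
    tree = ast.parse(source)
    bound = {}  # local name -> (source module, imported name or None for "import X")
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            for alias in node.names:
                bound[alias.asname or alias.name] = (node.module, alias.name)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    bound[alias.asname] = (alias.name, None)
                elif "." not in alias.name:
                    bound[alias.name] = (alias.name, None)
    edges = set()
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id in bound:
            src, name = bound[func.id]
            if name:  # from X import name; name(...)
                edges.add((module, f"{src}:{name}"))
        elif (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id in bound
        ):
            src, name = bound[func.value.id]
            if name is None:  # import X; X.attr(...)
                edges.add((module, f"{src}:{func.attr}"))
            else:  # from X import name; name.attr(...)
                edges.add((module, f"{src}:{name}.{func.attr}"))
    return sorted(edges)
```

Returning a sorted list keeps graph.json stable between runs. Edges to standard-library or third-party modules are included and could be filtered against the known module list if only internal usage is wanted.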
Run It #
- Using uv (recommended):
uv run python xray.py /path/to/project ./project-pack
- Using system Python:
python3 xray.py /path/to/project ./project-pack
Result: folder ./project-pack containing overview.md, api.md, tree.txt, graph.json, stats.json, pack.json
LLM Usage Templates #
- System preface (paste once):
You are given a compact “X-Ray pack” of a Python project consisting of overview.md, api.md, tree.txt, graph.json, and stats.json. Use these as authoritative structure and contracts. Avoid speculating about code internals unless explicitly present. Prioritize public APIs, entrypoints, and orchestrators for reasoning.
- Question routing:
If the question pertains to structure, use overview.md.
If about interfaces, use api.md.
If about interactions, consult graph.json.
If about scope/size, use stats.json and tree.txt.
- Optimization brief (developer/agent prompt):
Identify the top 3 refactoring opportunities by combining centrality (graph.json imports in/out degree) and LOC hotspots (stats.json). For each candidate module, propose a refactor plan focusing on SRP and reducing coupling, and list risks. Keep it under 300 tokens.
Quality Gates (fast checks) #
- JSON validates and loads; counts match
- No secrets or large binaries present
- overview.md ≤ budget; api.md ≤ budget
- Graph is acyclic at package layer (imports across packages justified)
- Public API coverage ≥ 95% (if __all__ is present)
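The acyclicity gate can be checked directly on the import edges from graph.json. The sketch below assumes a module's package is its dotted name minus the last component, which is only an approximation (it misclassifies a package that is itself listed as a module). The names `package_of` and `find_package_cycle` are hypothetical.

```python
def package_of(module):
    """Approximate a module's package by dropping the last dotted component."""
    return module.rsplit(".", 1)[0] if "." in module else module


def find_package_cycle(imports, modules):
    """Return one package-level import cycle as a list of names, or None."""
    internal = set(modules)
    graph = {}
    for src, dst in imports:
        # Ignore edges to external or standard-library modules.
        if src not in internal or dst not in internal:
            continue
        a, b = package_of(src), package_of(dst)
        if a != b:
            graph.setdefault(a, set()).add(b)

    state = {}  # package -> "active" (on the current path) or "done"

    def visit(node, path):
        state[node] = "active"
        path.append(node)
        for nxt in sorted(graph.get(node, ())):
            if state.get(nxt) == "active":
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                found = visit(nxt, path)
                if found:
                    return found
        path.pop()
        state[node] = "done"
        return None

    for start in sorted(graph):
        if start not in state:
            cycle = visit(start, [])
            if cycle:
                return cycle
    return None
```

A returned cycle such as `["a", "b", "a"]` points at the imports that need justification; `None` means the package layer is acyclic under this approximation.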
Extending the Pack (optional add‑ons) #
- Complexity: add radon metrics and bump hotspots
- Test coverage map: link tests → modules under test
- Config surface: list env vars and defaults
- Typed API: include type hints from annotations (ast.get_source_segment)
- Diffs: compare packs across commits to produce CHANGELOG.md (structural)
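For the typed-API add-on, annotations can be recovered verbatim with `ast.get_source_segment`, as the list above suggests. A minimal sketch, assuming Python 3.8+ and a hypothetical helper named `typed_signature`; default values are still omitted, as in the builder script.

```python
import ast


def typed_signature(source, func):
    """Format a function signature including annotations, copied from the source text."""

    def fmt(arg):
        if arg.annotation is None:
            return arg.arg
        return f"{arg.arg}: {ast.get_source_segment(source, arg.annotation)}"

    a = func.args
    parts = [fmt(x) for x in a.posonlyargs]
    if a.posonlyargs:
        parts.append("/")
    parts += [fmt(x) for x in a.args]
    if a.vararg:
        parts.append("*" + fmt(a.vararg))
    elif a.kwonlyargs:
        parts.append("*")  # bare separator before keyword-only arguments
    parts += [fmt(x) for x in a.kwonlyargs]
    if a.kwarg:
        parts.append("**" + fmt(a.kwarg))
    sig = "(" + ", ".join(parts) + ")"
    if func.returns is not None:
        sig += " -> " + ast.get_source_segment(source, func.returns)
    return sig
```

The function needs the module's source text as well as the parsed node, so a caller would pass the same string that was given to `ast.parse`.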
Troubleshooting #
- Module missing? Path resolution errors, syntax errors, or a non-UTF-8 file: the builder skips files it cannot read or parse, so ensure a readable encoding
- Empty API? All names may be private or only re-exported in __init__.py
- Giant tree? Tighten ignores or set a max-depth for assets
- Wrong imports? Namespace packages or dynamic imports; treat conservatively
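For files that are not UTF-8, the standard library's `tokenize.open` detects the encoding from a BOM or a PEP 263 coding cookie. A minimal sketch of a more tolerant reader, with `read_source` as a hypothetical replacement for the plain `read_text` call:

```python
import tokenize


def read_source(path):
    """Read a Python file using its declared encoding; return None if unreadable."""
    try:
        # tokenize.open honours a BOM or "# -*- coding: ... -*-" cookie,
        # falling back to UTF-8 when neither is present.
        with tokenize.open(path) as fh:
            return fh.read()
    except (SyntaxError, UnicodeDecodeError, OSError):
        return None
```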
Summary #
Use this X-Ray methodology and script to produce deterministic, low-token packs that capture how a Python project is structured and how parts interact. Share packs, not repos, to accelerate agent and developer understanding while controlling cost.