Python Project X-Ray — Low‑Token Model Pack

A practical, repeatable method to condense any Python codebase into a compact “model pack” that preserves the project’s structure, APIs, interactions, and intent with minimal tokens. Optimized for rapid onboarding of agents/developers and efficient LLM reasoning.

What You Get (Outputs) #

  • overview.md — high-signal, low-token project narrative and entry points
  • api.md — public surface: packages/modules/classes/functions with signatures
  • tree.txt — pruned file tree (noise filtered), sizes, highlights
  • graph.json — import graph + class inheritance + cross-module usage edges
  • stats.json — LOC, counts, ranks, hotspots
  • pack.json — metadata, budgets, and routing hints for LLMs

All outputs are deterministic, source-of-truth derived (AST + file system), and safe to share (secrets filtered by default ignores).

Design Principles #

  • Signal first: functions, classes, signatures, module roles, entrypoints
  • Hierarchical: project → package → module → class/method → function
  • Contracts over implementation: public API, docstrings, signatures
  • Interactions over details: imports, ownership, orchestration points
  • Token budgets: weight by impact (orchestrators, hotspots, public API)
  • Deterministic & auditable: AST-based, reproducible, no LLM in pack build

End‑to‑End Pipeline (10 steps) #

  1. Discover
  • Locate project root. If ambiguous, prefer the folder containing pyproject.toml or setup.cfg.
  • Collect files with safe defaults and ignore noise (see baseline ignores).
  2. Classify
  • Group by type: code (.py), tests, notebooks, config, assets, data, build.
  • Detect packages (folders with __init__.py) and top-level scripts.
  3. Static Parse (AST)
  • Per .py: extract module docstring, imports, __all__, classes, functions.
  • For classes: bases, public methods (filter _private), decorators, docstrings.
  • For functions: signature, decorators, docstrings.
  4. Graphs
  • Import graph: module → modules used.
  • Inheritance graph: class → base(s) (FQNs when resolvable).
  • Cross-module usage (approx): map imported names to attribute/function calls.
  5. Entry Points & Runtimes
  • Scripts with if __name__ == "__main__".
  • CLI (argparse/click/typer), app factories, WSGI/ASGI apps, tasks.
  6. Public Surface
  • Respect __all__ if present, else treat names without a leading _ as public.
  • Record canonical FQNs: package.module:Class.method, package.module:function.
  7. Ranking & Budgets
  • Rank modules by LOC and coupling (in/out-degree in import graph).
  • Boost orchestrators by filename heuristics: app, main, server, manager, tasks, pipeline, service.
  • Allocate tokens: e.g., top 20% modules → 40% tokens; public API → 40%; rest → 20%.
  8. Summarize (Deterministic)
  • Generate terse, structured bullets. Incorporate one-line docstrings if present.
  • Focus on responsibilities, collaborators, and side-effects.
  9. Emit Pack
  • Write overview.md, api.md, tree.txt, graph.json, stats.json, pack.json.
  • Use stable IDs and FQNs; prefer consistent sort order.
  10. Verify & Diff
  • Validate JSON schemas. Re-run and diff packs across commits to track drift.
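
As an illustration of step 5, the sketch below detects a top-level main guard with the ast module instead of relying on file names. The function name has_main_guard is hypothetical and not part of xray.py; it inspects only top-level statements and accepts either operand order.

```python
import ast


def has_main_guard(source: str) -> bool:
    """Return True if the module has a top-level `if __name__ == "__main__":` block."""
    tree = ast.parse(source)
    for node in tree.body:
        if not isinstance(node, ast.If):
            continue
        test = node.test
        # Only a single equality comparison qualifies (no chained comparisons, no !=).
        if not (
            isinstance(test, ast.Compare)
            and len(test.ops) == 1
            and isinstance(test.ops[0], ast.Eq)
        ):
            continue
        operands = [test.left, test.comparators[0]]
        names = [o.id for o in operands if isinstance(o, ast.Name)]
        consts = [o.value for o in operands if isinstance(o, ast.Constant)]
        if names == ["__name__"] and consts == ["__main__"]:
            return True
    return False
```

A builder could call this on each module's source and list the matching modules under Entrypoints, alongside any filename-based hits.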

Baseline Ignores (Safe Defaults) #

  • Directories: .git, .venv, venv, __pycache__, .mypy_cache, .pytest_cache, .ruff_cache, .tox, build, dist, site-packages, node_modules, .idea, .vscode
  • Files: *.pyc, *.pyo, *.pyd, *.so, *.dll, *~
  • Data/large: data/, .cache/, .local/, media/, downloads/
  • Secrets: .env, .env.* (strip values if included), credentials.*

Include tests/, notebooks/ optionally: summarize lightly unless they define APIs.
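
One way to apply these defaults is a single predicate over project-relative file paths. This is a minimal sketch, not the check used in xray.py: the names is_ignored and IGNORED_PATTERNS are hypothetical, the directory set is abbreviated, and paths are assumed to use forward slashes.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Abbreviated copies of the baseline lists above (assumption: extend as needed).
IGNORED_DIRS = {".git", ".venv", "venv", "__pycache__", "build", "dist", "node_modules"}
IGNORED_PATTERNS = [
    "*.pyc", "*.pyo", "*.pyd", "*.so", "*.dll", "*~",
    ".env", ".env.*", "credentials.*",
]


def is_ignored(rel_path: str) -> bool:
    """Return True if a project-relative file path matches the baseline ignores."""
    path = PurePosixPath(rel_path)
    # Any ignored directory among the parent components excludes the file.
    if any(part in IGNORED_DIRS for part in path.parts[:-1]):
        return True
    # Otherwise match the file name against the glob patterns (case-sensitive).
    return any(fnmatchcase(path.name, pattern) for pattern in IGNORED_PATTERNS)
```

fnmatchcase is used rather than fnmatch so that matching does not depend on the host platform's case rules.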

Heuristics That Matter #

  • Public-first: list public functions and class methods before private helpers
  • Contracts: keep signatures and one-liners; skip bodies
  • Collapse trivial wrappers (<= 3 statements, no branching)
  • Expand orchestrators and adapters; compress pure utilities
  • Merge duplicates; respect __all__ for canonical exports
  • External deps: name and role only (avoid copying third-party docs)

Output Schemas (Essential) #

  • graph.json
{
  "$schema": "https://example.org/xray/graph.schema.json",
  "modules": ["pkg.mod1", "pkg.mod2"],
  "imports": [["pkg.mod1", "pkg.mod2"], ["pkg.mod2", "pkg.util"]],
  "classes": {"pkg.mod1:Foo": {"bases": ["pkg.base:Base"], "module": "pkg.mod1"}},
  "uses": [["pkg.mod1", "pkg.util:helper"], ["pkg.mod2:Bar", "pkg.mod1:Foo"]]
}
  • pack.json
{
  "name": "project-xray-pack",
  "version": 1,
  "generated_at": "2025-09-02T00:00:00Z",
  "python": "3.11",
  "budgets": {"overview_md": 2000, "api_md": 4000},
  "policy": {
    "module_top_percent": 20,
    "module_token_share": 0.4,
    "public_api_share": 0.4
  }
}
  • stats.json (example keys)
{
  "files": 134,
  "modules": 78,
  "loc": {"total": 21234, "by_module": {"pkg.core": 987}},
  "rank": {"modules": ["pkg.app", "pkg.core", "pkg.service"]}
}
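
The policy block in pack.json and the rank list in stats.json can be combined into per-module token budgets. The sketch below is one interpretation, not something xray.py does: allocate_budgets is a hypothetical helper, total_tokens is an assumed overall budget, and each pool is split evenly among its modules.

```python
import math
from typing import Any, Dict, List


def allocate_budgets(
    ranked_modules: List[str], total_tokens: int, policy: Dict[str, Any]
) -> Dict[str, Any]:
    """Split a token budget into a public-API pool and per-module shares."""
    # Number of modules in the "top" tier, e.g. the top 20% by rank (at least one).
    n_top = max(1, math.ceil(len(ranked_modules) * policy["module_top_percent"] / 100))
    top, rest = ranked_modules[:n_top], ranked_modules[n_top:]

    top_pool = round(total_tokens * policy["module_token_share"])
    api_pool = round(total_tokens * policy["public_api_share"])
    rest_pool = total_tokens - top_pool - api_pool  # whatever remains, e.g. 20%

    per_module: Dict[str, int] = {}
    for name in top:
        per_module[name] = top_pool // len(top)
    for name in rest:
        per_module[name] = rest_pool // len(rest)
    return {"public_api": api_pool, "modules": per_module}
```

With five ranked modules, a total of 1000 tokens, and the policy values shown above, the single top module receives 400 tokens, the public API pool 400, and each of the remaining four modules 50.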

Golden Format: overview.md (Template) #

# Project: {name}
- Purpose: {one-liner}
- Stack: Python {py}, key deps: {X, Y, Z}
- Entrypoints: {scripts, CLIs, apps}

## Architecture (high signal)
- Packages: {pkgA} (role), {pkgB} (role)
- Orchestrators: {module/class} → coordinates {n} components
- Data flow: {producer} → {processor} → {sink}
- Storage/IO: {db/cache/files}, config via {env/config}

## Key Interactions
- Imports: {top edges}
- Cross-module usage: {who calls what}

## Risks/Hotspots
- {module} high complexity and centrality
- {module} high churn / LOC

## Notes
- Conventions, extension points, feature flags

Quickstart: One-File Builder (drop‑in script) #

The following script produces a compact pack folder for any Python project with zero external dependencies. Save as xray.py and run.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Python Project X-Ray — minimal, dependency-free pack builder.
Outputs: overview.md, api.md, tree.txt, graph.json, stats.json, pack.json
"""
from __future__ import annotations
import argparse
import ast
import datetime as dt
import json
import os
from pathlib import Path
import sys
from typing import Any, Dict, List, Optional, Tuple

IGNORED_DIRS = {".git", ".venv", "venv", "__pycache__", ".mypy_cache", ".pytest_cache", ".ruff_cache", ".tox", "build", "dist", "site-packages", "node_modules", ".idea", ".vscode", ".cache"}
IGNORED_FILE_GLOBS = {".pyc", ".pyo", ".pyd", ".so", ".dll"}

class ModuleInfo:
    def __init__(self, module: str, path: Path) -> None:
        self.module = module
        self.path = path
        self.doc: Optional[str] = None
        self.imports: List[str] = []
        self.exports: Optional[List[str]] = None
        self.classes: Dict[str, Dict[str, Any]] = {}
        self.functions: Dict[str, Dict[str, Any]] = {}
        self.loc: int = 0


def is_private(name: str) -> bool:
    return name.startswith("_") and not (name.startswith("__") and name.endswith("__"))


def fqmn(module: str, name: str) -> str:
    return f"{module}:{name}" if name else module


def parse_module(path: Path, pkg_root: Path) -> Optional[ModuleInfo]:
    try:
        text = path.read_text(encoding="utf-8", errors="ignore")
    except Exception:
        return None
    if not text.strip():
        return None
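    rel = path.relative_to(pkg_root)
    parts = rel.with_suffix("").parts
    # A package's __init__.py represents the package itself: pkg/__init__.py -> "pkg"
    if parts and parts[-1] == "__init__":
        parts = parts[:-1]
    mod = ".".join(parts) or pkg_root.name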
    try:
        tree = ast.parse(text)
    except Exception:
        return None
    info = ModuleInfo(mod, path)
    info.loc = len(text.splitlines())
    info.doc = ast.get_docstring(tree)

    # Extract __all__
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for t in node.targets:
                if isinstance(t, ast.Name) and t.id == "__all__":
                    try:
                        val = ast.literal_eval(node.value)
                        if isinstance(val, (list, tuple)):
                            info.exports = [str(v) for v in val]
                    except Exception:
                        pass

    # Imports
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for n in node.names:
                info.imports.append(n.name)
        elif isinstance(node, ast.ImportFrom):
            base = node.module or ""
            if base:
                info.imports.append(base)

    # Classes & functions
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            cls_name = node.name
            bases = []
            for b in node.bases:
                if isinstance(b, ast.Name):
                    bases.append(b.id)
                elif isinstance(b, ast.Attribute):
                    bases.append(f"{getattr(b.value, 'id', '')}.{b.attr}".strip('.'))
                elif isinstance(b, ast.Subscript):
                    bases.append(ast.unparse(b.value) if hasattr(ast, 'unparse') else 'Generic')
                else:
                    bases.append(type(b).__name__)
            methods: Dict[str, Any] = {}
            for n in node.body:
                if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)):  # include async methods
                    if is_private(n.name):
                        continue
                    sig = _format_signature(n.args)
                    doc = ast.get_docstring(n)
                    methods[n.name] = {"sig": sig, "doc": _one_line(doc)}
            info.classes[cls_name] = {"bases": bases, "methods": methods, "doc": _one_line(ast.get_docstring(node))}
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):  # include async functions
            if is_private(node.name):
                continue
            sig = _format_signature(node.args)
            info.functions[node.name] = {"sig": sig, "doc": _one_line(ast.get_docstring(node))}

    # Respect __all__
    if info.exports is not None:
        info.functions = {k: v for k, v in info.functions.items() if k in info.exports}
        info.classes = {k: v for k, v in info.classes.items() if k in info.exports}

    return info


def _one_line(s: Optional[str], max_len: int = 160) -> Optional[str]:
    if not s:
        return None
    s = " ".join(s.strip().split())
    return (s[: max_len - 1] + "…") if len(s) > max_len else s


def _format_signature(args: ast.arguments) -> str:
    def fmt(arg: Optional[ast.arg]) -> str:
        if not arg:
            return ""
        return arg.arg

    parts: List[str] = []
    # Positional-only (3.8+)
    for a in getattr(args, "posonlyargs", []) or []:
        parts.append(fmt(a))
    if getattr(args, "posonlyargs", None):
        parts.append("/")
    # Positional/keyword
    for a in args.args:
        parts.append(fmt(a))
    # Vararg
    if args.vararg:
        parts.append("*" + fmt(args.vararg))
    else:
        # keyword-only separator
        if args.kwonlyargs:
            parts.append("*")
    for a in args.kwonlyargs:
        parts.append(fmt(a))
    if args.kwarg:
        parts.append("**" + fmt(args.kwarg))
    return "(" + ", ".join(parts) + ")"


def walk_py_files(root: Path) -> List[Path]:
    out: List[Path] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # prune ignored dirs
        dirnames[:] = [d for d in dirnames if d not in IGNORED_DIRS]
        for fn in filenames:
            p = Path(dirpath) / fn
            if p.suffix == ".py" and not any(fn.endswith(ext) for ext in IGNORED_FILE_GLOBS):
                out.append(p)
    return sorted(out)  # os.walk order is filesystem-dependent; sort for reproducible output


def build_tree_text(root: Path) -> str:
    lines: List[str] = []
    base_len = len(str(root))
    for p in sorted(root.rglob("*")):
        rel = str(p)[base_len + 1 :]
        if not rel:
            continue
        # apply ignores
        parts = Path(rel).parts
        if any(part in IGNORED_DIRS for part in parts):
            continue
        if p.is_file() and p.suffix not in {".py", ".md", ".toml", ".cfg", ".ini", ".yaml", ".yml", ".txt"}:
            continue
        depth = rel.count(os.sep)
        lines.append("  " * depth + (p.name + ("/" if p.is_dir() else "")))
    return "\n".join(lines)


def main() -> None:
    ap = argparse.ArgumentParser(description="Build a low-token X-Ray pack for a Python project")
    ap.add_argument("root", help="project root directory")
    ap.add_argument("out", help="output directory for the pack")
    args = ap.parse_args()

    root = Path(args.root).resolve()
    out = Path(args.out).resolve()
    out.mkdir(parents=True, exist_ok=True)

    py_files = walk_py_files(root)
    modules: List[ModuleInfo] = []
    for f in py_files:
        mi = parse_module(f, root)
        if mi:
            modules.append(mi)

    # Graphs & stats
    mod_names = sorted({m.module for m in modules})
    imports: List[Tuple[str, str]] = []
    classes: Dict[str, Any] = {}
    uses: List[Tuple[str, str]] = []  # rough placeholder; deterministic but shallow

    loc_total = 0
    rank = sorted(((m.loc, m.module) for m in modules), reverse=True)
    for m in modules:
        loc_total += m.loc
        for imp in m.imports:
            # normalize import to module-ish key
            target = imp.replace("/", ".")
            imports.append((m.module, target))
        for cname, cinfo in m.classes.items():
            fq = f"{m.module}:{cname}"
            classes[fq] = {"bases": cinfo["bases"], "module": m.module}

    graph = {
        "modules": mod_names,
        "imports": sorted(set(imports)),  # dedupe and sort edges for stable diffs
        "classes": classes,
        "uses": uses,
    }

    stats = {
        "files": len(py_files),
        "modules": len(modules),
        "loc": {"total": loc_total, "by_module": {m.module: m.loc for m in modules}},
        "rank": {"modules": [m for _, m in rank]},
    }

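    pack = {
        "name": root.name,
        "version": 1,
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        "generated_at": dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),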
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "budgets": {"overview_md": 2000, "api_md": 4000},
        "policy": {"module_top_percent": 20, "module_token_share": 0.4, "public_api_share": 0.4},
    }

    # overview.md
    overview = [
        f"# Project: {root.name}",
        f"- Python: {pack['python']}  |  Files: {stats['files']}  |  Modules: {stats['modules']}  |  LOC: {stats['loc']['total']}",
        "\n## Top Modules (by LOC)",
    ]
    for m in stats["rank"]["modules"][:10]:
        overview.append(f"- {m}")
    overview.append("\n## Entrypoints (heuristic)")
    for p in sorted(py_files):  # reuse the pruned list so ignored dirs (venvs, build) are skipped
        if p.name in {"__main__.py", "main.py", "app.py", "server.py", "manage.py", "cli.py"}:
            overview.append(f"- {p.relative_to(root)}")
    (out / "overview.md").write_text("\n".join(overview), encoding="utf-8")

    # api.md
    api_lines: List[str] = ["# Public API\n"]
    for m in sorted(modules, key=lambda x: x.module):
        if not m.functions and not m.classes:
            continue
        api_lines.append(f"\n## {m.module}")
        for fname, finfo in sorted(m.functions.items()):
            # Separate the signature from the one-line docstring
            doc = f": {finfo['doc']}" if finfo.get("doc") else ""
            api_lines.append(f"- def {fname}{finfo['sig']}{doc}")
        for cname, cinfo in sorted(m.classes.items()):
            bases = f"({', '.join(cinfo['bases'])})" if cinfo['bases'] else ""
            doc = f": {cinfo['doc']}" if cinfo.get("doc") else ""
            api_lines.append(f"- class {cname}{bases}{doc}")
            for mname, minfo in sorted(cinfo["methods"].items()):
                doc = f": {minfo['doc']}" if minfo.get("doc") else ""
                api_lines.append(f"  - {cname}.{mname}{minfo['sig']}{doc}")
    (out / "api.md").write_text("\n".join(api_lines), encoding="utf-8")

    # tree.txt
    (out / "tree.txt").write_text(build_tree_text(root), encoding="utf-8")

    # graph.json, stats.json, pack.json
    (out / "graph.json").write_text(json.dumps(graph, indent=2), encoding="utf-8")
    (out / "stats.json").write_text(json.dumps(stats, indent=2), encoding="utf-8")
    (out / "pack.json").write_text(json.dumps(pack, indent=2), encoding="utf-8")

    print(f"Pack written to: {out}")

if __name__ == "__main__":
    main()

Run It #

  • Using uv (recommended):
uv run python xray.py /path/to/project ./project-pack
  • Using system Python:
python3 xray.py /path/to/project ./project-pack

Result: folder ./project-pack containing overview.md, api.md, tree.txt, graph.json, stats.json, pack.json

LLM Usage Templates #

  • System preface (paste once):
You are given a compact “X-Ray pack” of a Python project consisting of overview.md, api.md, tree.txt, graph.json, and stats.json. Use these as authoritative structure and contracts. Avoid speculating about code internals unless explicitly present. Prioritize public APIs, entrypoints, and orchestrators for reasoning.
  • Question routing:
If the question pertains to structure, use overview.md.
If about interfaces, use api.md.
If about interactions, consult graph.json.
If about scope/size, use stats.json and tree.txt.
  • Optimization brief (developer/agent prompt):
Identify the top 3 refactoring opportunities by combining centrality (graph.json imports in/out degree) and LOC hotspots (stats.json). For each candidate module, propose a refactor plan focusing on SRP and reducing coupling, and list risks. Keep it under 300 tokens.

Quality Gates (fast checks) #

  • JSON validates and loads; counts match
  • No secrets or large binaries present
  • overview.md ≤ budget; api.md ≤ budget
  • Graph is acyclic at package layer (imports across packages justified)
  • Public API coverage ≥ 95% (if __all__ is present)
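
The acyclicity gate can be automated from the imports edges in graph.json. The sketch below collapses module edges to their parent packages and runs a depth-first search for a cycle. The helper names (package_of, package_edges, find_cycle) are hypothetical, and it assumes both ends of an internal edge appear in the modules list.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def package_of(module: str) -> str:
    """Assumption: a module's package is its dotted path minus the last component."""
    return module.rpartition(".")[0] or module


def package_edges(
    imports: Iterable[Tuple[str, str]], known_modules: Set[str]
) -> Dict[str, Set[str]]:
    """Collapse module-level import edges into package-level edges."""
    edges: Dict[str, Set[str]] = {}
    for src, dst in imports:
        if dst not in known_modules:
            continue  # external or unresolved import
        a, b = package_of(src), package_of(dst)
        if a != b:
            edges.setdefault(a, set()).add(b)
    return edges


def find_cycle(edges: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of packages (first == last), or None."""
    state: Dict[str, int] = {}  # absent/0 = unvisited, 1 = on stack, 2 = finished
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = 1
        stack.append(node)
        for nxt in sorted(edges.get(node, ())):
            if state.get(nxt, 0) == 1:
                return stack[stack.index(nxt):] + [nxt]
            if state.get(nxt, 0) == 0:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        state[node] = 2
        return None

    for start in sorted(edges):
        if state.get(start, 0) == 0:
            found = visit(start)
            if found:
                return found
    return None
```

A gate script could load graph.json, call find_cycle(package_edges(graph["imports"], set(graph["modules"]))), and fail the check when the result is not None.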

Extending the Pack (optional add‑ons) #

  • Complexity: add radon metrics and bump hotspots
  • Test coverage map: link tests → modules under test
  • Config surface: list env vars and defaults
  • Typed API: include type hints from annotations (ast.get_source_segment)
  • Diffs: compare packs across commits to produce CHANGELOG.md (structural)
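
For the Typed API add-on, annotations can be copied verbatim from the source with ast.get_source_segment. The sketch below is a hypothetical variant of the script's signature formatter (typed_signature is not part of xray.py) and, like the original, it omits default values.

```python
import ast
from typing import List, Union


def typed_signature(
    source: str, func: Union[ast.FunctionDef, ast.AsyncFunctionDef]
) -> str:
    """Format a signature that keeps parameter and return annotations as written."""

    def fmt(arg: ast.arg) -> str:
        if arg.annotation is None:
            return arg.arg
        # Fall back to "?" if position info is missing (e.g. a synthesized node).
        text = ast.get_source_segment(source, arg.annotation) or "?"
        return f"{arg.arg}: {text}"

    a = func.args
    parts: List[str] = [fmt(x) for x in a.posonlyargs]
    if a.posonlyargs:
        parts.append("/")
    parts += [fmt(x) for x in a.args]
    if a.vararg:
        parts.append("*" + fmt(a.vararg))
    elif a.kwonlyargs:
        parts.append("*")  # bare separator before keyword-only parameters
    parts += [fmt(x) for x in a.kwonlyargs]
    if a.kwarg:
        parts.append("**" + fmt(a.kwarg))

    sig = "(" + ", ".join(parts) + ")"
    if func.returns is not None:
        sig += " -> " + (ast.get_source_segment(source, func.returns) or "?")
    return sig
```

Because the function needs the original source text, a caller would pass the same string that was given to ast.parse.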

Troubleshooting #

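  • Module missing? The script skips files that are empty, unreadable, or fail ast.parse (for example, syntax newer than the running interpreter); check for parse errors first
  • Empty API? All names may be private, excluded by __all__, or only re-exported in __init__.py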
  • Giant tree? Tighten ignores or set a max-depth for assets
  • Wrong imports? Namespace packages or dynamic imports; treat conservatively
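
One common source of wrong import edges is relative imports, where ast.ImportFrom carries a level (the number of leading dots) and an optional module. The sketch below resolves such an import to an absolute module name. It is a simplified model under stated assumptions: resolve_import is a hypothetical helper, is_package must be True when the importing file is an __init__.py, and namespace packages and dynamic imports are not handled.

```python
from typing import Optional


def resolve_import(
    current_module: str, is_package: bool, level: int, name: Optional[str]
) -> Optional[str]:
    """Resolve `from <dots><name> import ...` to an absolute module name.

    Returns None when the import climbs above the top-level package.
    """
    if level == 0:
        return name  # already absolute
    # Anchor package: the module itself for an __init__.py, else its parent.
    parts = current_module.split(".")
    if not is_package:
        parts = parts[:-1]
    climb = level - 1  # one dot means "this package"; each extra dot climbs one level
    if climb >= len(parts):
        return None
    base = parts[: len(parts) - climb]
    if name:
        base = base + name.split(".")
    return ".".join(base)
```

For an ast.ImportFrom node, the call would be resolve_import(module_name, is_package, node.level, node.module), and a None result is best recorded as unresolved rather than guessed.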

Summary #

Use this X-Ray methodology and script to produce deterministic, low-token packs that capture how a Python project is structured and how parts interact. Share packs, not repos, to accelerate agent and developer understanding while controlling cost.