02 — The extractor

Article 01 pinned the user-facing surface: six TC39-standard decorators and a module-load registry that accumulates fragments during module evaluation. This article builds its static mirror. The extractor in @frenchexdev/ide-forge never evaluates a spec file. It opens it with ts-morph, walks its AST, reads the same decorator call expressions the registry would have seen at runtime, and assembles a LanguageIR that is byte-for-byte identical to what listFragments() would have produced. Same IR, two paths — runtime evaluation and static extraction — and the contract between them is the Requirement this article earns.

The framing worth keeping in mind: the extractor is pure. No filesystem reads by default, no network, no clock. A SourceReader port abstracts the source-code-origin away; callers inject the real fs in production and string literals in tests. The design is borrowed from packages/typed-fsm/src/analysis/state-machine-extractor.ts, where the same posture has been in production for a year and lets the unit suite test the extractor on literal source strings with no disk involvement. Replicating it here is not novelty — it is the application of a pattern that has already paid for itself.

REQ-IDEDSL-EXTRACTOR-DETERMINISTIC — the Requirement the chain stands on

REQ-IDEDSL-EXTRACTOR-DETERMINISTIC: Given identical input spec files, the chain SourceReader → ts.Program / SourceFile → decorator extraction → LanguageIR assembly shall produce byte-identical IR output, with no reliance on runtime evaluation, filesystem walk order, environment state, or timestamp.

Rationale: determinism is the pre-condition for caching, for snapshot tests, for diffing emitter output in review, and for the traceability reports the @frenchexdev/requirements package builds on top. A non-deterministic extractor turns every downstream artefact — grammar JSON, manifest, snippets, LSP server scaffold, generated .vsix — into an unreviewable moving target. Non-determinism here also breaks the equivalence with the module-load registry, because two runs of the static path would produce different IRs for a single evaluation of the runtime path, leaving the invariant "runtime eval and static extraction converge" un-assertable.

Fit criteria: property tests on the extractor over 200 fast-check runs assert idempotence and order-independence; two runs on the same input produce identical LanguageIR JSON (deep-equal); the IR shape matches what listFragments() returns after importing the spec module.

Verification: Test. Refines REQ-IDEDSL-DECORATORS-STANDARD.

@Refines is the SysML refinement link: REQ-IDEDSL-DECORATORS-STANDARD said "decorators shall run under TC39 without reflection"; this Requirement refines that into a concrete invariant on the static path. If the decorators went through reflect-metadata, no static extractor could replicate them without executing the module; the determinism Requirement would be impossible to honour without losing what the module-load channel could have told the runtime. Because article 01 already closed that channel, article 02 can make this Requirement in good faith.

The fit criteria are deliberately testable. "Byte-identical JSON" is a JSON.stringify comparison after a canonicalising serialisation pass (the IR's readonly arrays are already in decorator-firing order, so the canonicalisation is a no-op; that is the property articles 10 and 19 will property-test under 200 fast-check runs). "Matches what listFragments() returns" is the equivalence test: import the spec file normally (which evaluates it and populates the registry), run the extractor against the same file (which does not evaluate it), and deep-equal the two IRs. If the extractor misses a decorator, orders tokens differently, or brands a name wrong, this test fires immediately.
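
The byte-identical comparison can be sketched in a few lines. This is a minimal illustration, not the package's serialiser: `canonicalJson` is a hypothetical helper that sorts object keys and leaves arrays in their existing order, and the two sample objects are stand-ins for a real LanguageIR.

```typescript
// Hypothetical helper (an assumption for illustration): serialises JSON-compatible
// values with sorted object keys. Arrays keep their order, because in the IR that
// order is the decorator-firing order and carries meaning.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return '[' + value.map((item) => canonicalJson(item)).join(',') + ']';
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return '{' + keys.map((k) => JSON.stringify(k) + ':' + canonicalJson(obj[k])).join(',') + '}';
  }
  return JSON.stringify(value);
}

// Two runs that built the same IR with different property insertion order.
const runA = { id: 'req', tokens: [{ name: 'FIRST', scope: 'x.1' }, { name: 'SECOND', scope: 'x.2' }] };
const runB = { tokens: [{ scope: 'x.1', name: 'FIRST' }, { scope: 'x.2', name: 'SECOND' }], id: 'req' };
const sameBytes = canonicalJson(runA) === canonicalJson(runB);

// Reordering an array is a real difference and must not be canonicalised away.
const runC = { id: 'req', tokens: [...runA.tokens].reverse() };
const orderMatters = canonicalJson(runA) !== canonicalJson(runC);
```

Sorting keys but not arrays is the design point: key order is an accident of construction, while array order is part of the contract.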

FEAT-IDEDSL-02 — the satisfying Feature

// packages/ide-forge/requirements/features/extractor.ts
import { Feature, Priority, Satisfies, type ACResult } from '@frenchexdev/requirements';
import { ReqIdeDslExtractorDeterministicRequirement } from '../requirements/req-idedsl-extractor-deterministic.js';

@Satisfies(ReqIdeDslExtractorDeterministicRequirement)
export abstract class IdeForgeExtractorFeature extends Feature {
  readonly id = 'FEAT-IDEDSL-02';
  readonly title = 'ide-forge extractor — ts-morph walk that produces a LanguageIR byte-equal to the module-load registry';
  readonly priority = Priority.Critical;

  // ── Port ──
  abstract extractorTakesASourceReaderPort(): ACResult;
  abstract extractorNeverExecutesTheSpecModule(): ACResult;

  // ── Decorator extraction ──
  abstract extractsLanguageHeaderFromLanguageDecorator(): ACResult;
  abstract extractsTokenFragmentFromEachTokenDecorator(): ACResult;
  abstract extractsRuleFragmentFromEachRuleDecorator(): ACResult;
  abstract extractsSnippetFragmentFromEachSnippetDecorator(): ACResult;
  abstract extractsLspFeatureFragmentFromEachLspFeatureDecorator(): ACResult;
  abstract extractsExecutorFragmentFromEachExecutorDecorator(): ACResult;

  // ── Invariants ──
  abstract extractedIrEqualsRegistryIrForTheSameSpec(): ACResult;
  abstract preservesDecoratorOrderWithinEachFragmentKind(): ACResult;
  abstract raisesParseErrorWithSourceRangeOnMalformedInput(): ACResult;
  abstract twoRunsOnTheSameInputProduceIdenticalIr(): ACResult;
}

Twelve ACs, three clusters. The "Port" cluster pins the testability posture: the extractor takes a SourceReader, never executes the module. The "Decorator extraction" cluster has one AC per decorator — six ACs, verifying the one-to-one mapping from decorator call to IR fragment. The "Invariants" cluster has the four load-bearing properties: registry equivalence (the equivalence test above), order preservation (article 01's registry AC, re-asserted on the static path), rich error reporting with source ranges, and bit-for-bit determinism across runs. That last AC is the one the fast-check property test will verify.

The SourceReader port

A port is a TypeScript interface whose purpose is to invert a dependency. The extractor depends on a SourceReader; production wires the real filesystem; tests wire a Map<string, string>. The port is five methods wide:

// packages/ide-forge/src/ports/source-reader.ts
export interface SourceReader {
  /** Returns the source text of a file, or undefined if not found. */
  readText(absolutePath: string): string | undefined;
  /** Lists all absolute paths the reader can serve. */
  listPaths(): readonly string[];
  /** True if the reader can serve that path (cheap existence check). */
  exists(absolutePath: string): boolean;
  /** Module kind for resolver hints: 'commonjs', 'esm', or 'unknown' when it cannot be determined. */
  moduleKind(absolutePath: string): 'commonjs' | 'esm' | 'unknown';
  /** Canonical filesystem separator hint — '/' or '\\'. */
  pathSeparator(): '/' | '\\';
}

Five methods is already three more than strictly necessary for a happy-path extractor — readText and listPaths would do the work — but the extra three earn their keep on the edges. exists lets the extractor raise a clean ParseError with source range when a @Rule references a non-existent token without having to eat a read failure. moduleKind lets the ts-morph project configure its module resolver correctly, which matters because the spec file's own imports (from @frenchexdev/ide-dsl) have to resolve before the decorator calls can be read. pathSeparator is a cheap normalisation hint that prevents the extractor from producing OS-specific IR on Windows developers' machines — Windows paths inside the IR would break the byte-identical determinism property the moment a developer on macOS regenerated the extension.
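
How the pathSeparator hint might be used is easy to sketch. The helper below is hypothetical (the article does not show the extractor's normalisation code) and assumes the IR stores paths with forward slashes.

```typescript
// Hypothetical normalisation step (an assumption, not the package's code):
// rewrite paths to '/'-separated form before they enter the IR, so the same
// spec yields the same bytes on Windows and on macOS.
function toIrPath(path: string, separator: '/' | '\\'): string {
  return separator === '\\' ? path.split('\\').join('/') : path;
}

const fromWindows = toIrPath('C:\\spec\\req.ts', '\\');
const fromPosix = toIrPath('/spec/req.ts', '/');
```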

The real production reader is a thin wrapper around Node's fs.readFileSync (readText is synchronous, so the asynchronous fs.promises API would not fit the port) with a small LRU cache (200 entries is adequate for every realistic spec), and the InMemorySourceReader is a Map<string, string> with the five methods implemented inline. The latter is what every test in the extractor suite will use; the former appears only in the CLI shell at packages/ide-forge/src/bin/.
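
A minimal sketch of the in-memory reader follows. The interface is repeated so the block stands alone. The sorted listPaths and the extension-based moduleKind are assumptions made for illustration; the real class may decide both differently.

```typescript
interface SourceReader {
  readText(absolutePath: string): string | undefined;
  listPaths(): readonly string[];
  exists(absolutePath: string): boolean;
  moduleKind(absolutePath: string): 'commonjs' | 'esm' | 'unknown';
  pathSeparator(): '/' | '\\';
}

// Sketch of a Map-backed reader; not the package's actual implementation.
class InMemorySourceReader implements SourceReader {
  private readonly files: Map<string, string>;

  constructor(files: Map<string, string>) {
    // Copy defensively so later mutation of the caller's map cannot change results.
    this.files = new Map(files);
  }

  readText(absolutePath: string): string | undefined {
    return this.files.get(absolutePath);
  }

  listPaths(): readonly string[] {
    // Assumption: sort so the result never depends on insertion order.
    return Array.from(this.files.keys()).sort();
  }

  exists(absolutePath: string): boolean {
    return this.files.has(absolutePath);
  }

  moduleKind(absolutePath: string): 'commonjs' | 'esm' | 'unknown' {
    // Assumption: infer from the extension only; plain .ts stays 'unknown'.
    if (/\.m[tj]s$/.test(absolutePath)) return 'esm';
    if (/\.c[tj]s$/.test(absolutePath)) return 'commonjs';
    return 'unknown';
  }

  pathSeparator(): '/' | '\\' {
    return '/';
  }
}

const sketchReader = new InMemorySourceReader(
  new Map([['/spec/z.ts', 'export class Z {}'], ['/spec/a.mts', 'export class A {}']]),
);
```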

Figure 1 — Extractor pipeline. The SourceReader is the only port that touches the outside world; everything downstream is pure functions on ts-morph AST nodes.

Two arrows go into SourceReader (filesystem and in-memory), one arrow comes out. That funnel is what makes the extractor testable without spinning up a temp directory. It is also the shape of every port-driven analysis module in the monorepo: the single injection point is the one seam tests need.

Decorator-by-decorator extraction

Once ts-morph has a Project, the extractor walks every SourceFile it contains and every ClassDeclaration within, and inspects each class's decorator list. For each decorator, it dispatches on the identifier name and runs the matching extraction function.

// packages/ide-forge/src/extractor/extract.ts
import { Project, ClassDeclaration, Decorator, Node, ObjectLiteralExpression } from 'ts-morph';
import { LanguageIR, LanguageFragment, IRToken, IRRule, /* ... */ } from '@frenchexdev/ide-dsl/ir';
import { ParseError } from '../errors/parse-error.js';

export function extractFromProject(project: Project): readonly LanguageIR[] {
  const fragments: LanguageFragment[] = [];
  for (const sf of project.getSourceFiles()) {
    for (const cls of sf.getClasses()) {
      const frag = extractClassFragment(cls);
      if (frag) fragments.push(frag);
    }
  }
  return fragments.map(fragmentToIr);
}

function extractClassFragment(cls: ClassDeclaration): LanguageFragment | undefined {
  let frag: LanguageFragment | undefined;
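  // Class decorators are applied bottom-to-top at runtime, so walk them in
  // reverse source order to reproduce the registry's firing order.
  for (const dec of [...cls.getDecorators()].reverse()) {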
    const name = dec.getName();
    switch (name) {
      case 'Language':   frag = applyLanguage(dec, cls, frag);   break;
      case 'Token':      frag = applyToken(dec, cls, frag);      break;
      case 'Rule':       frag = applyRule(dec, cls, frag);       break;
      case 'Snippet':    frag = applySnippet(dec, cls, frag);    break;
      case 'LspFeature': frag = applyLspFeature(dec, cls, frag); break;
      case 'Executor':   frag = applyExecutor(dec, cls, frag);   break;
      default:           /* foreign decorator — ignored */       break;
    }
  }
  return frag;
}

The dispatch is a switch on the decorator's identifier name. Unrecognised decorators (@Satisfies, for example, which is a @frenchexdev/requirements decorator that may also appear on Ide.Dsl spec classes if the author is dog-fooding the requirements DSL) are silently ignored — the extractor only speaks six words.

Each apply* function reads the decorator's single argument-object literal and extracts the branded fields. applyToken is representative:

function applyToken(dec: Decorator, cls: ClassDeclaration, frag: LanguageFragment | undefined): LanguageFragment {
  const arg = requireSingleObjectArg(dec, cls);
  const name = readString(arg, 'name', dec);
  const pattern = readString(arg, 'pattern', dec);
  const scope = readString(arg, 'scope', dec);
  const token: IRToken = {
    name: name as TokenName,
    pattern,
    scope,
  };
  const out = frag ?? emptyFragment(cls.getNameOrThrow());
  return { ...out, tokens: [...out.tokens, token] };
}

Three field reads, one brand cast at the boundary (as TokenName), one immutable append to the fragment. The fragment is produced fresh, either from the accumulator or from emptyFragment when a @Token is processed before the class's @Language (which is allowed; the class header is attached by whichever @Language call appears, in any position). The return value replaces the accumulator, so the top-level extractClassFragment sees a persistent chain of immutable fragments. That matches exactly the module-load registry's behaviour, because the registry does the same append under the hood; the only difference is that the registry mutates its internal array in place, and the extractor can afford not to.
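
The accumulate-by-replacement step can be shown without ts-morph. The shapes below are simplified, hypothetical stand-ins for LanguageFragment and IRToken, and `appendToken` is an illustrative name rather than a function from the package.

```typescript
// Simplified stand-ins (assumptions): the real fragment carries more fields.
interface TokenSketch {
  readonly name: string;
  readonly pattern: string;
  readonly scope: string;
}

interface FragmentSketch {
  readonly className: string;
  readonly tokens: readonly TokenSketch[];
}

function emptyFragment(className: string): FragmentSketch {
  return { className, tokens: [] };
}

// Returns a new fragment; the previous one is never mutated.
function appendToken(
  frag: FragmentSketch | undefined,
  className: string,
  token: TokenSketch,
): FragmentSketch {
  const out = frag ?? emptyFragment(className);
  return { ...out, tokens: [...out.tokens, token] };
}

const afterFirst = appendToken(undefined, 'RequirementsSpec', { name: 'FIRST', pattern: '1', scope: 'x.1' });
const afterSecond = appendToken(afterFirst, 'RequirementsSpec', { name: 'SECOND', pattern: '2', scope: 'x.2' });
```

Because each call returns a new object, `afterFirst` still holds one token after `afterSecond` is built, which is what lets a test hold on to intermediate fragments safely.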

The readString helper is where parse-don't-validate lives: if the decorator argument is missing the name property, or it is not a string literal, readString throws a ParseError with the exact source range of the offending expression. No silent defaults, no null-coalescing to an empty string. The extractor refuses to make up data.

Parse-don't-validate and ParseError

Alexis King's "Parse, don't validate" argument (2019) is that a function which accepts loose input and returns structured output should not return T | null and a separate isValid: boolean; it should return T or throw, and the type system should reflect the narrowed type everywhere downstream. The extractor applies this to every field read.

export class ParseError extends Error {
  constructor(
    public readonly file: string,
    public readonly range: { start: number; end: number; line: number; column: number },
    public readonly code: 'E_MISSING_FIELD' | 'E_NOT_A_STRING' | 'E_NOT_AN_ARRAY' | 'E_UNKNOWN_DECORATOR' | 'E_DUPLICATE_LANGUAGE',
    message: string,
  ) {
    // A space separates the location from the message text.
    super(`[${code}] ${file}:${range.line}:${range.column} ${message}`);
  }
}

function readString(arg: ObjectLiteralExpression, field: string, dec: Decorator): string {
  const prop = arg.getProperty(field);
  if (!prop || !Node.isPropertyAssignment(prop)) {
    throw new ParseError(dec.getSourceFile().getFilePath(), rangeOf(dec), 'E_MISSING_FIELD', `Missing ${field} on @${dec.getName()}`);
  }
  const init = prop.getInitializerOrThrow();
  if (!Node.isStringLiteral(init) && !Node.isNoSubstitutionTemplateLiteral(init)) {
    throw new ParseError(dec.getSourceFile().getFilePath(), rangeOf(init), 'E_NOT_A_STRING', `Field ${field} on @${dec.getName()} must be a string literal`);
  }
  return init.getLiteralText();
}

Three things about this shape. The ParseError.code is a small string union, not an exception hierarchy; downstream code that cares about a specific error kind (ide-forge's CLI has to render E_DUPLICATE_LANGUAGE with a different colour from E_MISSING_FIELD) can switch on e.code the way HTTP handlers switch on status codes. The range is a {start, end, line, column} tuple rather than just a byte offset, because when the CLI renders the error it wants to show the offending source line with a caret, and the emitters that consume the IR want to emit LSP diagnostics with the same shape — same decision point as vscode-languageserver's Position type. And the file is an absolute path from the SourceReader's readText contract, which makes error messages copy-paste-able to open in an editor.
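
The caret rendering and the switch on code can be sketched together. Everything below is hypothetical: the article does not show rangeOf or the CLI renderer, the severity labels are invented for the example, and line and column are assumed to be 1-based.

```typescript
type ParseErrorCode =
  | 'E_MISSING_FIELD'
  | 'E_NOT_A_STRING'
  | 'E_NOT_AN_ARRAY'
  | 'E_UNKNOWN_DECORATOR'
  | 'E_DUPLICATE_LANGUAGE';

interface SourceRange {
  readonly start: number;
  readonly end: number;
  readonly line: number;
  readonly column: number;
}

// Derive 1-based line/column (an assumption) from a character offset.
function rangeFromOffsets(text: string, start: number, end: number): SourceRange {
  const before = text.slice(0, start);
  const line = before.split('\n').length;
  const column = start - before.lastIndexOf('\n');
  return { start, end, line, column };
}

// Switching on the code union; the labels are illustrative only.
function severityOf(code: ParseErrorCode): string {
  switch (code) {
    case 'E_DUPLICATE_LANGUAGE':
      return 'fatal';
    case 'E_MISSING_FIELD':
    case 'E_NOT_A_STRING':
    case 'E_NOT_AN_ARRAY':
    case 'E_UNKNOWN_DECORATOR':
      return 'error';
  }
}

// Header line, offending source line, then a caret under the start column.
function renderDiagnostic(
  text: string,
  file: string,
  range: SourceRange,
  code: ParseErrorCode,
  message: string,
): string {
  const lines = text.split('\n');
  const sourceLine = range.line <= lines.length ? lines[range.line - 1] : '';
  const caret = ' '.repeat(Math.max(0, range.column - 1)) + '^';
  const header = `${severityOf(code)} [${code}] ${file}:${range.line}:${range.column} ${message}`;
  return `${header}\n${sourceLine}\n${caret}`;
}

const sampleText = 'a\nbcd\nef';
const sampleRange = rangeFromOffsets(sampleText, 3, 4); // offset 3 is the 'c' on the second line
const sampleOutput = renderDiagnostic(
  sampleText, '/spec/bad.ts', sampleRange, 'E_MISSING_FIELD', 'Missing displayName on @Language',
);
```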

Parse-don't-validate at this boundary does two things downstream. It lets the emitters type their inputs as LanguageIR, not Partial<LanguageIR>; they are never forced to check whether a field they need exists. And it pushes every ambiguity to the error surface, where a developer sees it once and fixes it, rather than letting it trickle through to a generated extension that starts up without complaint and then silently omits a feature. The pattern appears throughout packages/requirements/src/cli/types.ts and its smart-constructor neighbours; this article is where the meta-DSL inherits it.
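
A smart constructor is the smallest unit of this pattern. The sketch below is an assumption throughout: the article names TokenName but shows neither its brand nor its rules, so the upper-snake-case check is invented purely to make the example concrete.

```typescript
// Hypothetical brand: a string that has passed parseTokenName.
type TokenName = string & { readonly __brand: 'TokenName' };

// Parse once at the boundary; downstream code takes TokenName and never re-checks.
function parseTokenName(raw: string): TokenName {
  // Assumed rule for illustration only: upper-case identifier characters.
  if (!/^[A-Z][A-Z0-9_]*$/.test(raw)) {
    throw new Error(`Invalid token name: ${JSON.stringify(raw)}`);
  }
  return raw as TokenName;
}

// A consumer typed on the narrowed value cannot be handed an unchecked string.
function scopeSuffix(name: TokenName): string {
  return name.toLowerCase();
}

const parsedName = parseTokenName('FIRST');
const suffix = scopeSuffix(parsedName);
```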

Prior art — ts-morph, state-machine-extractor, Langium

ts-morph (David Sherret, 2018–) is the TypeScript Compiler API wrapped in an ergonomic object model. Where the raw typescript package gives you a flat AST and a mass of type predicates (ts.isClassDeclaration, ts.isDecorator, …), ts-morph gives you a Project that owns a Program, a SourceFile that exposes getClasses(), and a ClassDeclaration with a getDecorators() method that returns typed nodes. The difference is ergonomic, not semantic (ts-morph produces the exact same AST the raw API does), but the ergonomic saving on a codegen walker is very large. A 40-line raw-API pass becomes an 8-line ts-morph pass. The cost is a dependency on an external wrapper; the benefit is code that reads the way the domain reads. The meta-DSL pays the cost.

The closest in-repo precedent is packages/typed-fsm/src/analysis/state-machine-extractor.ts, which does the same kind of walk for @FiniteStateMachine decorators — detects the decorator on a class, reads its option object, extracts the states, events, emits, listens arrays, and assembles a MachineNode. The ExtractorEnv in that module is exactly the shape the SourceReader port replicates here; the only difference is domain vocabulary. A developer who has already read that extractor will read the ide-forge one in five minutes.

Langium (TypeFox, 2021–) is the closest contemporary competitor to Ide.Dsl: a TypeScript language workbench that emits a VSCode extension from a grammar file. Langium's extractor reads a .langium grammar (not a decorated TypeScript class) and produces its own AST, then generates a parser, validator, and LSP server. The architectural shape is similar — single source, multi-emitter — and there is much to admire. The difference is surface: Langium's grammar file is a dedicated .langium DSL that requires its own tooling to edit, whereas Ide.Dsl keeps the author in TypeScript throughout. The trade-off is real. Langium's DSL is more expressive for grammars (it has first-class reference resolution, type unions in productions, and cross-references); Ide.Dsl's decorator-on-class approach has no tooling overhead and reuses every TypeScript editing feature the author already knows. Design article 03 engaged Langium at length; this article is where the architectural inheritance becomes visible. The extractor-as-pure-function-of-source-text posture is the one Langium helped normalise.

A smaller but load-bearing precedent is TypeScript's own tsserver, which every VSCode user interacts with every day. tsserver walks source files, produces diagnostics, answers completion requests — all without evaluating the modules it reads. The meta-DSL's extractor is a thin imitation of that posture, scoped to six decorator names.

One last precedent worth naming: Babel's decorator transforms (2014–), which predate the TypeScript ones and shaped the way the ecosystem still thinks about decorators. Babel offered a plugin (@babel/plugin-proposal-decorators) with two "versions" — legacy and 2018-09 — that each had its own runtime. Authors who targeted one and then switched had to rewrite every decorator, because the argument shapes were incompatible. The legacy Babel shape eventually aligned with TypeScript's experimentalDecorators; the 2018-09 shape was eclipsed by the TC39 Stage 3 one. What this history makes concrete is that building on unstable decorator proposals is an expensive commitment — every three-to-four-year cycle has broken the dependency chains that ride it. Choosing the ratified TC39 standard today is the first decorator choice in a decade that carries a decent actuarial guarantee of not needing a rewrite by the time article 15 ships.

Testing the extractor — the shape the port lets us take

Because the extractor takes a SourceReader, the test file injects an in-memory reader with the spec source as a string literal. No temp dir, no cleanup, no flaky disk:

// packages/ide-forge/test/unit/extractor/extractor.test.ts
import { expect } from 'vitest';
import { FeatureTest, Verifies } from '@frenchexdev/requirements';
import { Project } from 'ts-morph';
import { InMemorySourceReader } from '../../../src/ports/in-memory-source-reader.js';
import { extractFromProject } from '../../../src/extractor/extract.js';
import { IdeForgeExtractorFeature } from '../../../requirements/features/extractor.js';

const SPEC = `
import { Language, Token } from '@frenchexdev/ide-dsl/decorators';

@Token({ name: 'SECOND', pattern: '2', scope: 'x.2' })
@Token({ name: 'FIRST',  pattern: '1', scope: 'x.1' })
@Language({ id: 'req', displayName: 'Requirements', fileExtensions: ['.req'] })
export class RequirementsSpec {}
`;

@FeatureTest(IdeForgeExtractorFeature)
class ExtractorTests {
  @Verifies('extractsLanguageHeaderFromLanguageDecorator')
  pullsTheHeaderOffTheLanguageDecorator() {
    const reader = new InMemorySourceReader(new Map([['/spec/req.ts', SPEC]]));
    const project = createProjectFromReader(reader);
    const [ir] = extractFromProject(project);
    expect(ir!.id).toBe('req');
    expect(ir!.displayName).toBe('Requirements');
    expect(ir!.fileExtensions).toEqual(['.req']);
  }

  @Verifies('preservesDecoratorOrderWithinEachFragmentKind')
  preservesTheOrderDecoratorsFireIn() {
    const reader = new InMemorySourceReader(new Map([['/spec/req.ts', SPEC]]));
    const project = createProjectFromReader(reader);
    const [ir] = extractFromProject(project);
    // Class decorators fire bottom-to-top; push order in the fragment matches.
    expect(ir!.tokens.map(t => t.name)).toEqual(['FIRST', 'SECOND']);
  }

  @Verifies('twoRunsOnTheSameInputProduceIdenticalIr')
  isDeterministicAcrossRepeatedRuns() {
    const reader = new InMemorySourceReader(new Map([['/spec/req.ts', SPEC]]));
    const a = extractFromProject(createProjectFromReader(reader));
    const b = extractFromProject(createProjectFromReader(reader));
    expect(JSON.stringify(a)).toBe(JSON.stringify(b));
  }

  @Verifies('raisesParseErrorWithSourceRangeOnMalformedInput')
  raisesParseErrorOnAMissingRequiredField() {
    const BAD = `import { Language } from '@frenchexdev/ide-dsl/decorators'; @Language({ id: 'req' }) export class X {}`;
    const reader = new InMemorySourceReader(new Map([['/spec/bad.ts', BAD]]));
    const project = createProjectFromReader(reader);
    expect(() => extractFromProject(project)).toThrowError(/E_MISSING_FIELD.*displayName/);
  }
}

Four ACs, four methods, no disk. The createProjectFromReader helper wires the SourceReader into a ts-morph Project's compilerOptions.paths and readFile callback; its body is about fifteen lines of ts-morph API calls and lives next to InMemorySourceReader in src/ports/. The equivalence test — the deepest AC, extractedIrEqualsRegistryIrForTheSameSpec — is worth thirty words more: it imports the spec module via dynamic import(), reads listFragments() from @frenchexdev/ide-dsl, runs the extractor on the same source text, and asserts the two IRs serialise to the same JSON. That single test is what pins the runtime-static equivalence invariant and earns the Requirement.

Figure 2 — The equivalence invariant. The registry path evaluates the spec module; the extractor path walks the AST without evaluating it; both produce the same `LanguageIR`. The invariant is asserted by a single test that runs both in sequence and deep-equals the output.

SOLID lens

Dependency Inversion is the central principle of the extractor. The core extraction logic depends on the SourceReader interface; the filesystem-backed reader and the in-memory reader both implement it; neither one is visible from the extractor's point of view. Inverting the dependency is what makes the extractor testable at all, and it is what will let article 12 wire a headless @vscode/test-electron reader that streams source out of an editor buffer.

Single Responsibility lives at the apply* function level. applyToken does exactly one thing: turn a @Token(...) decorator into an IRToken fragment-append. It does not validate cross-token uniqueness (that is a later compliance pass), does not dedupe against previous tokens (same), does not emit anything. The function is about a dozen lines, and that is the right size; any more and it would be doing someone else's job.

Open/Closed lives at the switch(name) dispatch in extractClassFragment. Adding a seventh decorator (article 04 will mention @SemanticToken as a possibility) is a single new case and a single new apply* function, with no change to the six existing ones. The dispatch is an anti-pattern only when the cases carry logic; here each case carries a single call, so the switch is a legitimate discriminator.

Interface Segregation is worth naming explicitly. The SourceReader could have been a single read(path: string): string function, and most of the time it would have been fine; but the extractor genuinely needs listPaths() to discover spec files when given only a project root, and it genuinely needs exists() to keep error messages useful. Rolling them into the port — rather than passing three separate callbacks — keeps the dependency footprint legible at call sites (one parameter, one name) without widening the surface the consumer has to satisfy. The test reader satisfies all five methods in about twenty lines; the production reader in about forty. Neither is over-serving the extractor.

DRY lens

Reuse the ExtractorEnv shape. The SourceReader port is not a new invention; it is the shape typed-fsm already uses for its AST extractor, renamed and narrowed by two methods. A developer who has read one has read the other, and the cross-package consistency reduces the cognitive load of moving between the two — which is frequent because Ide.Dsl generates extensions whose LSP handlers may, in turn, need to extract state-machine information from the spec files (article 11 will reach for this).

ParseError is one class, not seven. The six decorators share one error class with a small code union. The alternative — a dedicated MissingLanguageFieldError, InvalidTokenPatternError, and so on — would type-discriminate more precisely but would also multiply files by seven and make the CLI's error-rendering code switch on class names. One class with a union code field is the DRY-correct trade-off, and it mirrors the way packages/requirements/src/cli/types.ts structures its ParseError.

Cross-link and what article 03 picks up

The design counterpart is Part 05 — DRY and the IR as contract, which argued the IR's single-source-of-truth role; and Part 06, Unit testing, port-driven, which argued the port/injection shape this article implements. Article 03 picks up the IR itself: its ajv-validated JSON schema, its branded primitives' smart constructors, and the golden-snapshot test fixtures that hold the extractor and the module-load registry to their equivalence promise over time.

Proposal (write-in-public). The extractor currently throws on the first ParseError it encounters. A future revision could collect all errors and return Result<LanguageIR[], ParseError[]>, so a developer who saved a spec with three missing fields sees all three reported at once. The cost is a Result monad the emitters do not otherwise need; the benefit is editor-feedback density. The call is deferred until article 09's diagnostics pipeline forces the decision by needing collected-error shape for publishDiagnostics.
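
What the collected-error shape might look like, as a sketch under stated assumptions: the `Result` type, the `FieldIssue` record and `readStrings` are all hypothetical, and the reads operate on a plain object rather than a ts-morph node to keep the example self-contained.

```typescript
// Hypothetical Result shape; the article defers the real decision to article 09.
type Result<T, E> =
  | { readonly ok: true; readonly value: T }
  | { readonly ok: false; readonly errors: readonly E[] };

interface FieldIssue {
  readonly code: 'E_MISSING_FIELD' | 'E_NOT_A_STRING';
  readonly field: string;
}

// Reads every requested field and reports all problems, not just the first.
function readStrings(
  arg: Readonly<Record<string, unknown>>,
  fields: readonly string[],
): Result<Readonly<Record<string, string>>, FieldIssue> {
  const values: Record<string, string> = {};
  const errors: FieldIssue[] = [];
  for (const field of fields) {
    const raw = arg[field];
    if (raw === undefined) {
      errors.push({ code: 'E_MISSING_FIELD', field });
    } else if (typeof raw !== 'string') {
      errors.push({ code: 'E_NOT_A_STRING', field });
    } else {
      values[field] = raw;
    }
  }
  return errors.length === 0 ? { ok: true, value: values } : { ok: false, errors };
}

// One save, two problems, both reported.
const collected = readStrings({ name: 'FIRST', scope: 42 }, ['name', 'pattern', 'scope']);
const clean = readStrings({ name: 'FIRST', pattern: '1', scope: 'x.1' }, ['name', 'pattern', 'scope']);
```

The trade-off named above is visible here: every caller now has to branch on `ok` before touching the value, which is the cost the emitters would inherit.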