SchemaInputReader

The source generator reads schemas in two upstream formats — JSON for core K8s OpenAPI dumps, YAML for CRD bundles — and converges them into one in-memory shape: System.Text.Json.Nodes.JsonNode. Everything downstream (SchemaVersionMerger, OpenApiV3SchemaEmitter, CrdSchemaEmitter, BuilderEmitter) operates on JsonNode and never knows or cares which format the source file used.

This chapter is short because the dispatcher is short. The whole pattern is ~20 lines of code plus a 15-line CRD envelope walker.

Why YAML in the source generator at all

Three options were considered:

  1. JSON-only. The downloader normalizes CRD YAML to JSON at fetch time. The SG only sees JSON. Pro: smallest SG dependency footprint. Con: ~30 LOC of normalization in the downloader, schemas/ less diffable, native fidelity lost.
  2. YAML-only. The downloader fetches everything and re-formats core K8s JSON to YAML. Pro: one parser. Con: needless reformatting of upstream-native JSON, slower SG cold start.
  3. Hybrid (chosen). Each schema stays in its native upstream format. The SG dispatches on file extension. Pro: native fidelity, smallest downloader, ~30 LOC saved. Con: one extra SG dependency on YamlDotNet.

The third option won because YamlDotNet 16.3.0 is already centrally pinned in Directory.Packages.props (the GitLab.Ci.Yaml runtime library uses it), and the cost of adding it to the SG analyzer pack is ~700 KB and ~20 LOC. The savings (~30 LOC deleted from the downloader, more readable PR diffs when bumping CRD bundles) are worth more than the cost.

The dispatcher

// Kubernetes.Dsl.SourceGenerator/SchemaInputReader.cs
using System.IO;
using System.Text.Json.Nodes;
using System.Threading;
using Microsoft.CodeAnalysis;
using YamlDotNet.Serialization;

namespace Kubernetes.Dsl.SourceGenerator;

internal static class SchemaInputReader
{
    public static JsonNode ReadSchema(AdditionalText file, CancellationToken ct)
    {
        var text = file.GetText(ct)?.ToString() ?? string.Empty;
        var ext = Path.GetExtension(file.Path).ToLowerInvariant();

        return ext switch
        {
            ".json" => JsonNode.Parse(text)
                       ?? throw new InvalidDataException($"Invalid JSON: {file.Path}"),

            ".yaml" or ".yml" => YamlToJsonNode(text, file.Path),

            _ => throw new InvalidDataException(
                $"Unsupported schema extension '{ext}' for {file.Path}. Expected .json, .yaml, or .yml.")
        };
    }

    private static JsonNode YamlToJsonNode(string yaml, string sourcePath)
    {
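        // YamlDotNet does not produce JsonNode directly. The canonical pattern is:
        //   YAML -> object graph -> JSON-compatible text -> JsonNode.
        // Roughly 2-3x slower than direct JSON parse but happens once per
        // schema file per build (cached by SG incremental compilation).
        //
        // When the target type is object, YamlDotNet returns every scalar as a
        // string by default, so `served: true` would reach JsonNode as "true"
        // and GetValue<bool>() would throw downstream.
        // WithAttemptingUnquotedStringTypeDeserialization keeps plain (unquoted)
        // booleans and numbers typed.
        var deserializer = new DeserializerBuilder()
            .IgnoreUnmatchedProperties()
            .WithAttemptingUnquotedStringTypeDeserialization()
            .Build();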

        var graph = deserializer.Deserialize<object?>(yaml);
        if (graph is null)
            throw new InvalidDataException($"Empty YAML document: {sourcePath}");

        var serializer = new SerializerBuilder()
            .JsonCompatible()
            .Build();

        var jsonText = serializer.Serialize(graph);
        return JsonNode.Parse(jsonText)
            ?? throw new InvalidDataException($"YAML to JSON round-trip failed: {sourcePath}");
    }
}

That's the entire dispatcher: about twenty lines of meaningful code. The two-step YAML-to-JSON path uses SerializerBuilder().JsonCompatible() because YamlDotNet's JSON-compatible serializer emits JSON text for the maps, sequences, and scalars that schema files are made of. The double parse is the price of not pulling in a third-party YAML-to-JSON-AST converter.
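
JsonNode is strict about scalar kinds: the string "true" is not a boolean, and GetValue<bool>() on it throws. Whether the YAML path preserves booleans and integers depends on how the deserializer is configured, so a check along these lines can be worth running against the round-trip output in a unit test. This is a minimal sketch; ScalarKindCheck and HasBooleanServed are hypothetical names, not part of the generator.

```csharp
using System.Text.Json.Nodes;

// Hypothetical test helper, not part of the generator.
internal static class ScalarKindCheck
{
    // Returns true when `served` is a real JSON boolean, and false when it is
    // missing or of any other kind (for example the string "true").
    public static bool HasBooleanServed(string versionJson)
    {
        var served = JsonNode.Parse(versionJson)?["served"];
        return served is JsonValue value && value.TryGetValue<bool>(out _);
    }
}
```

Feeding it the JSON text produced by the YAML branch for a single spec.versions[] entry shows whether the envelope walker's GetValue<bool>() calls will succeed on YAML-sourced CRDs.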

Why JsonNode and not a custom IR

Three reasons:

  1. System.Text.Json.Nodes.JsonNode is in the BCL. No extra dependency. The SG already references System.Text.Json for its own work.
  2. It's a tree, not a class hierarchy. OpenAPI v3 schemas are recursive ($ref, allOf, oneOf), and a tree IR matches them better than a parsed class graph.
  3. SchemaVersionMerger already operates on a tree shape. Reusing it (Part 5) means the merger sees JsonNode regardless of original format. Free reuse.

The alternative — parsing into a custom OpenApiSchema POCO — would force every schema-walking emitter to write twice as much code, and would make the merger a custom AST walker instead of a generic tree walker.
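
To illustrate what "generic tree walker" means here, the following sketch collects every $ref target from an arbitrary schema tree without knowing anything about OpenAPI types. RefCollector is a hypothetical helper written for this example; the article does not show how the real emitters traverse schemas.

```csharp
using System.Collections.Generic;
using System.Text.Json.Nodes;

// Hypothetical helper: one recursive function covers objects, arrays and leaves.
internal static class RefCollector
{
    // Depth-first walk yielding every string-valued "$ref" in document order.
    public static IEnumerable<string> CollectRefs(JsonNode? node)
    {
        switch (node)
        {
            case JsonObject obj:
                foreach (var pair in obj)
                {
                    if (pair.Key == "$ref"
                        && pair.Value is JsonValue value
                        && value.TryGetValue<string>(out var target))
                    {
                        yield return target;
                    }
                    else
                    {
                        foreach (var nested in CollectRefs(pair.Value))
                            yield return nested;
                    }
                }
                break;

            case JsonArray array:
                foreach (var item in array)
                    foreach (var nested in CollectRefs(item))
                        yield return nested;
                break;
        }
    }
}
```

The same function works on allOf, oneOf, properties, or any other nesting, because it depends only on the shape of the tree. A POCO model would need a visit method per schema class to do the same job.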

The CRD envelope

CRDs have an extra layer the dispatcher doesn't handle on its own. A CRD YAML file looks like this:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: rollouts.argoproj.io
spec:
  group: argoproj.io
  names:
    kind: Rollout
    plural: rollouts
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas: { type: integer }
                strategy: { ... }
                template: { $ref: '#/definitions/PodTemplateSpec' }
              required: [template]

The actual OpenAPI schema you want to emit a C# type for is buried at spec.versions[*].schema.openAPIV3Schema. There may be multiple versions (v1alpha1, v1beta1, v1), each with served: and storage: flags and its own schema. CrdEnvelopeWalker walks the envelope and yields one CrdSchemaSlice per served: true version; CrdSchemaEmitter consumes the merged result later in the pipeline:

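// Kubernetes.Dsl.SourceGenerator/CrdEnvelopeWalker.cs
using System.Collections.Generic;
using System.IO;
using System.Text.Json.Nodes;

namespace Kubernetes.Dsl.SourceGenerator;

internal static class CrdEnvelopeWalker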
{
    public static IEnumerable<CrdSchemaSlice> ExtractServedVersions(JsonNode crdDoc, string sourcePath)
    {
        var spec = crdDoc["spec"]
            ?? throw new InvalidDataException($"{sourcePath}: missing spec");
        var group = spec["group"]?.GetValue<string>()
            ?? throw new InvalidDataException($"{sourcePath}: missing spec.group");
        var kind = spec["names"]?["kind"]?.GetValue<string>()
            ?? throw new InvalidDataException($"{sourcePath}: missing spec.names.kind");

        var versions = spec["versions"] as JsonArray
            ?? throw new InvalidDataException($"{sourcePath}: missing spec.versions[]");

        foreach (var v in versions)
        {
            var served = v?["served"]?.GetValue<bool>() ?? false;
            if (!served) continue;

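            // A served version without a name is reported like any other
            // malformed envelope instead of failing with a null reference.
            var version = v!["name"]?.GetValue<string>()
                ?? throw new InvalidDataException($"{sourcePath}: served version is missing name");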
            var storage = v["storage"]?.GetValue<bool>() ?? false;
            var schema = v["schema"]?["openAPIV3Schema"]
                ?? throw new InvalidDataException(
                    $"{sourcePath}: {version} has served=true but no schema.openAPIV3Schema");

            yield return new CrdSchemaSlice(group, kind, version, storage, schema);
        }
    }
}

internal sealed record CrdSchemaSlice(
    string Group,
    string Kind,
    string Version,
    bool IsStorageVersion,
    JsonNode Schema);

Same logic whether the source was YAML or JSON. The walker doesn't know which.
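
One case the walker handles silently: it is a lazy iterator, so a CRD in which no version has served: true yields nothing rather than failing. The article lists that situation as a failure mode (KSG003), so some guard has to exist somewhere. The sketch below shows one way such a guard could look; ServedVersionGuard is a hypothetical name and the real generator may detect the condition differently.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json.Nodes;

// Hypothetical guard, written for illustration only.
internal static class ServedVersionGuard
{
    // Returns the names of all served versions, or throws when there are none.
    public static IReadOnlyList<string> ServedVersionNames(JsonNode crdDoc, string sourcePath)
    {
        var versions = crdDoc["spec"]?["versions"] as JsonArray
            ?? throw new InvalidDataException($"{sourcePath}: missing spec.versions[]");

        var served = versions
            .Where(v => v?["served"]?.GetValue<bool>() ?? false)
            .Select(v => v!["name"]?.GetValue<string>()
                ?? throw new InvalidDataException($"{sourcePath}: served version is missing name"))
            .ToList();

        if (served.Count == 0)
            throw new InvalidDataException($"{sourcePath}: no version has served: true");

        return served;
    }
}
```

Because the input is a JsonNode, the guard behaves the same for a CRD that started life as YAML and one that started as JSON, provided the YAML path preserved served as a boolean.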

Putting it together inside the SG

Here's how the dispatcher and the envelope walker compose inside the source generator's main pipeline:

// Kubernetes.Dsl.SourceGenerator/KubernetesBundleGenerator.cs (excerpt)
[Generator]
public sealed class KubernetesBundleGenerator : IIncrementalGenerator
{
    public void Initialize(IncrementalGeneratorInitializationContext context)
    {
        var bundles = context.SyntaxProvider.ForAttributeWithMetadataName(
            "Kubernetes.Dsl.Attributes.KubernetesBundleAttribute",
            predicate: (_, _) => true,
            transform: (ctx, _) => KubernetesBundleConfig.From(ctx));

        var schemaFiles = context.AdditionalTextsProvider
            .Where(file =>
            {
                var ext = Path.GetExtension(file.Path).ToLowerInvariant();
                return ext is ".json" or ".yaml" or ".yml";
            });

        var parsedSchemas = schemaFiles.Select((file, ct) =>
        {
            var node = SchemaInputReader.ReadSchema(file, ct);
            var path = file.Path.Replace('\\', '/');
            return new ParsedSchema(path, node);
        });

        var combined = bundles.Combine(parsedSchemas.Collect());

        context.RegisterSourceOutput(combined, (spc, pair) =>
        {
            var (config, schemas) = pair;
            var (coreSchemas, crdSchemas) = ClassifySchemas(schemas, config);

            // Track B core: walk OpenAPI v3 schemas directly
            var coreUnified = SchemaVersionMerger.MergeCore(coreSchemas, config);
            OpenApiV3SchemaEmitter.Emit(spc, coreUnified, config);

            // Track B CRDs: walk the envelope, then merge by (group, kind)
            var crdSlices = crdSchemas
                .SelectMany(s => CrdEnvelopeWalker.ExtractServedVersions(s.Document, s.Path)
                    .Select(slice => (s.Path, slice)));
            var crdUnified = SchemaVersionMerger.MergeCrds(crdSlices, config);
            CrdSchemaEmitter.Emit(spc, crdUnified, config);

            // Type registry for the YAML reader's discriminator dispatch
            TypeRegistryEmitter.Emit(spc, coreUnified, crdUnified);
        });
    }
}

Three things to notice:

  1. AdditionalTextsProvider.Where filters by extension. Roslyn passes anything matching <AdditionalFiles> globs through; the SG filters to the three formats it understands.
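  2. SchemaInputReader.ReadSchema runs once per file and returns a JsonNode. Caching comes from the SG's own incremental pipeline: if a file's text hasn't changed, the parsed JsonNode is reused. The caveat is Roslyn's value comparison: JsonNode compares by reference, so a re-parsed tree with identical content still counts as changed for downstream steps unless ParsedSchema supplies its own equality.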
  3. SchemaVersionMerger.MergeCore and MergeCrds both consume JsonNode trees. The merger never sees the file format. Part 5 unpacks this.
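
On the second point, Roslyn decides whether a downstream step must re-run by comparing a step's new output with its previous output, and JsonNode does not override Equals. The sketch below shows one way ParsedSchema could provide content-based equality. The article only shows the constructor call and the Path and Document members, so everything else here is an assumption, including the choice to compare serialized text.

```csharp
using System;
using System.Text.Json.Nodes;

// Assumed shape of ParsedSchema; the equality members are this sketch's addition.
internal sealed class ParsedSchema : IEquatable<ParsedSchema>
{
    private readonly string _serialized;

    public ParsedSchema(string path, JsonNode document)
    {
        Path = path;
        Document = document;
        // Serialize once so Equals and GetHashCode stay cheap.
        // Assumes the document is not mutated after construction.
        _serialized = document.ToJsonString();
    }

    public string Path { get; }
    public JsonNode Document { get; }

    public bool Equals(ParsedSchema? other) =>
        other is not null
        && string.Equals(Path, other.Path, StringComparison.Ordinal)
        && string.Equals(_serialized, other._serialized, StringComparison.Ordinal);

    public override bool Equals(object? obj) => Equals(obj as ParsedSchema);

    public override int GetHashCode() => (Path, _serialized).GetHashCode();
}
```

Comparing serialized text is sensitive to property order, which is acceptable here because the same file parsed twice produces the same order. With this in place, a whitespace-only or comment-only edit to a schema file re-runs the parse but produces an equal ParsedSchema.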

SG project file

<!-- Kubernetes.Dsl.SourceGenerator/Kubernetes.Dsl.SourceGenerator.csproj -->
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>netstandard2.0</TargetFramework>
    <IsRoslynComponent>true</IsRoslynComponent>
    <IncludeBuildOutput>false</IncludeBuildOutput>
    <Nullable>enable</Nullable>
    <LangVersion>latest</LangVersion>
    <EnforceExtendedAnalyzerRules>true</EnforceExtendedAnalyzerRules>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Microsoft.CodeAnalysis.CSharp" PrivateAssets="all" />
    <PackageReference Include="System.Text.Json" PrivateAssets="all" />
    <PackageReference Include="YamlDotNet" PrivateAssets="all" />
  </ItemGroup>
  <ItemGroup>
    <ProjectReference Include="..\..\..\Builder\src\FrenchExDev.Net.Builder.SourceGenerator.Lib\FrenchExDev.Net.Builder.SourceGenerator.Lib.csproj" />
  </ItemGroup>
</Project>

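PrivateAssets="all" keeps YamlDotNet from flowing to the consuming project as a transitive dependency. It does not by itself put YamlDotNet.dll into the analyzer pack: the assembly still has to be packed next to the generator (conventionally under analyzers/dotnet/cs), a step this excerpt does not show. The Builder.SourceGenerator.Lib reference is a project reference because the SG calls BuilderEmitter.Emit(...) directly as a library function (Part 6).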

Performance

The two-step YAML-to-JSON path is roughly 2-3x slower than JsonNode.Parse. For the v0.1 slice (~20 schema files, mostly JSON), the total parse time is single-digit milliseconds — negligible relative to Roslyn compile time. For v0.5 (full multi-version, ~200 files), it's still under 100 ms in cold-start scenarios. Part 8 has the actual numbers and the caching strategy.

The Roslyn incremental cache memoizes parsedSchemas per AdditionalText, keyed by the file's content hash. Unchanged schemas don't get re-parsed across builds. Bumping a single CRD bundle re-parses one file; bumping a K8s minor re-parses ~50 files; the merger then re-runs but BuilderEmitter only re-emits types whose underlying schema actually changed.

Verification

Three new diagnostics live in the SG itself (not the analyzer pack — the analyzer pack runs on user code, not on schemas):

Code    Failure mode
KSG001  Unsupported schema file extension. The dispatcher only accepts .json, .yaml, .yml.
KSG002  YAML parse failure. Reports the file path, line, column, and the YamlDotNet exception message.
KSG003  CRD envelope walker found a CRD with no served: true versions, or a served: true version with no schema.openAPIV3Schema.

These surface as compile-time errors with the originating AdditionalText file path, so they show up in the IDE's problems pane next to whatever schema file is broken.
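
The listings in this chapter throw InvalidDataException and do not show how a failure becomes a diagnostic. As a rough sketch of the Roslyn side, a descriptor for KSG001 and a helper that anchors it to the schema file could look like this. Only the id and the failure mode come from the table above; the title, message text, category, and the SchemaDiagnostics class itself are assumptions.

```csharp
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.Text;

// Hypothetical holder for the generator's diagnostics.
internal static class SchemaDiagnostics
{
    // Title, message format and category are illustrative.
    public static readonly DiagnosticDescriptor UnsupportedExtension = new DiagnosticDescriptor(
        id: "KSG001",
        title: "Unsupported schema file extension",
        messageFormat: "Schema file '{0}' has an unsupported extension; expected .json, .yaml, or .yml",
        category: "Kubernetes.Dsl.SourceGenerator",
        defaultSeverity: DiagnosticSeverity.Error,
        isEnabledByDefault: true);

    // Points the diagnostic at the start of the schema file so the IDE can
    // navigate to it from the problems pane.
    public static Diagnostic ForFile(DiagnosticDescriptor descriptor, string path)
    {
        var location = Location.Create(path, default(TextSpan), default(LinePositionSpan));
        return Diagnostic.Create(descriptor, location, path);
    }
}
```

Inside RegisterSourceOutput the result would be passed to spc.ReportDiagnostic(...). For KSG002, the line and column reported by the YAML parser could be turned into a LinePositionSpan in place of the default one.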


Previous: Part 3: Schema Acquisition — Where Schemas Come From
Next: Part 5: Multi-Version Schema Merging — Across Both Core and CRDs