
Part V: CobraHelpParser -- Parsing Go CLI Help Output

One parser handles Docker, Docker Compose, Podman, and any cobra-based CLI -- 5 different binaries, same IHelpParser.

The Realization

Docker is written in Go. Docker Compose is written in Go. Podman is written in Go. The GitLab CLI (glab) is written in Go. And they all use the same CLI framework: cobra.

That means their --help output follows a predictable structure -- predictable enough to parse with a single IHelpParser implementation.

I did not plan this. I started by writing a parser for Docker's help text. Then I pointed it at Docker Compose, and it worked. Then I pointed it at Podman -- same thing. Then glab. At that point I renamed the class from DockerHelpParser to CobraHelpParser and accepted the gift that the Go ecosystem had accidentally given me.

This post is about that parser: the state machine, the flag anatomy, the type mapping from Go to C#, and the edge cases that 97 scraped versions revealed. If you have not read the BinaryWrapper post yet, go there first -- it defines the IHelpParser interface and the three-phase pipeline that this parser plugs into.


The IHelpParser Interface

The contract is simple. I covered it briefly in the BinaryWrapper post, but here it is again because it is the only thing CobraHelpParser needs to satisfy:

public interface IHelpParser
{
    string Name { get; }

    CommandNode ParseHelp(string helpText, string commandPath);

    bool CanParse(string helpText);
}

Three members. Name is a human-readable identifier ("cobra" in our case). ParseHelp takes raw help text -- the full stdout of docker container run --help -- and the command path ("docker container run") and returns a CommandNode. CanParse is the auto-detection hook: given an unknown blob of help text, can this parser handle it?

The scraper calls CanParse first. If multiple parsers claim they can handle the text, the scraper uses the one registered first. In practice, CobraHelpParser is registered first for Docker and Compose because we know those are cobra binaries. The auto-detection matters more for the generic BinaryWrapper scenario where someone points the scraper at an unknown binary.
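The "first registered parser wins" rule can be sketched in a few lines. This is an illustration only, not the scraper's actual code: IDetectingParser is a trimmed-down stand-in for IHelpParser, and MarkerParser and ParserSelection are hypothetical names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down stand-in for IHelpParser: selection only needs Name and CanParse.
public interface IDetectingParser
{
    string Name { get; }
    bool CanParse(string helpText);
}

// Hypothetical parser that claims any text containing a marker substring.
public sealed record MarkerParser(string Name, string Marker) : IDetectingParser
{
    public bool CanParse(string helpText) =>
        helpText.Contains(Marker, StringComparison.Ordinal);
}

public static class ParserSelection
{
    // Registration order is the tie-breaker: the first parser that claims
    // the text wins, and null means nobody recognized it.
    public static IDetectingParser? Select(
        IReadOnlyList<IDetectingParser> registered, string helpText) =>
        registered.FirstOrDefault(p => p.CanParse(helpText));
}
```

Because order is the only tie-breaker, a permissive parser registered early would shadow every parser after it, which is one reason to register the most specific parsers first.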


The CommandNode Model

The parser's output is a tree of CommandNode objects:

public record CommandNode(
    string Name,
    string? Description,
    IReadOnlyList<CommandNode> SubCommands,
    IReadOnlyList<CommandOption> Options);

public record CommandOption(
    string LongName,
    string? ShortName,
    string? Description,
    string? DefaultValue,
    OptionValueKind ValueKind,
    string ClrType,
    bool IsRequired);

public enum OptionValueKind
{
    Flag,       // --detach (no value, boolean)
    Single,     // --name string (one value)
    List,       // --env list (repeatable)
}

CommandNode is recursive: a node can have subcommands, and each subcommand is itself a CommandNode with its own options and possibly more subcommands. The scraper walks this tree by calling docker {subcommand} --help for each discovered subcommand, parsing each response, and stitching the results into a tree.

CommandOption is where the interesting work happens. Every flag in the help text gets parsed into one of these records. The parser must figure out the long name, optional short name, Go type (mapped to a CLR type), whether it takes a value or is a boolean flag, whether it has a default, and the description. That is a lot to extract from a line of plain text.

OptionValueKind drives code generation downstream. Flag options get bool properties and --flag/no-flag toggle behavior. Single options get a property of their CLR type. List options get List<T> properties and can appear multiple times on the command line.
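To make the three kinds concrete, here is a minimal sketch of how a value could be turned into command-line tokens depending on its kind. ArgumentRendering.Render is a hypothetical helper, not part of the generated API; the enum is repeated so the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;

public enum OptionValueKind { Flag, Single, List }

public static class ArgumentRendering
{
    // Hypothetical helper: turns one option plus its value into argv tokens.
    public static IReadOnlyList<string> Render(
        string longName, OptionValueKind kind, object? value)
    {
        var args = new List<string>();
        switch (kind)
        {
            case OptionValueKind.Flag:
                // A flag is emitted only when set; it never carries a value.
                if (value is true) args.Add(longName);
                break;

            case OptionValueKind.Single:
                if (value is not null)
                {
                    args.Add(longName);
                    args.Add(value.ToString() ?? "");
                }
                break;

            case OptionValueKind.List:
                // Repeatable option: the name is repeated once per item.
                if (value is IEnumerable<string> items)
                {
                    foreach (var item in items)
                    {
                        args.Add(longName);
                        args.Add(item);
                    }
                }
                break;
        }
        return args;
    }
}
```

With this sketch, a List option holding FOO=1 and BAR=2 under --env renders as --env FOO=1 --env BAR=2, while a Flag set to false renders as nothing at all.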


Cobra Help Format Anatomy

Before I show the parser, you need to see what it is parsing. Here is a truncated docker container run --help from Docker 24.0.0:

Usage:  docker container run [OPTIONS] IMAGE [COMMAND] [ARG...]

Aliases:
  docker container run, docker run

Create and run a new container from an image

Options:
      --add-host list                  Add a custom host-to-IP mapping
                                       (host:ip)
  -a, --attach list                    Attach to STDIN, STDOUT or STDERR
      --blkio-weight uint16            Block IO (relative weight),
                                       between 10 and 1000, or 0 to
                                       disable (default 0)
      --blkio-weight-device list       Block IO weight (relative device
                                       weight) (default [])
      --cap-add list                   Add Linux capabilities
      --cap-drop list                  Drop Linux capabilities
      --cgroupns string                Cgroup namespace to use
                                       (host|private)
                                       'host':    Run the container in
                                                  the Docker host's
                                                  cgroup namespace
                                       'private': Run the container in
                                                  its own private cgroup
                                                  namespace
                                       '':        Use the cgroup
                                                  namespace as
                                                  configured by the
                                                  default-cgroupns-mode
                                                  option on the daemon
                                                  (default)
  -d, --detach                         Run container in background and
                                       print container ID
  -e, --env list                       Set environment variables
      --env-file list                  Read in a file of environment
                                       variables
  -h, --hostname string                Container host name
  -i, --interactive                    Keep STDIN open even if not
                                       attached
  -m, --memory bytes                   Memory limit
      --memory-swappiness int          Tune container memory swappiness
                                       (0 to 100) (default -1)
      --name string                    Assign a name to the container
      --network network                Connect a container to a network
  -p, --publish list                   Publish a container's port(s) to
                                       the host
      --pull string                    Pull image before running
                                       ("always", "missing", "never")
                                       (default "missing")
      --restart string                 Restart policy to apply when a
                                       container exits (default "no")
      --rm                             Automatically remove the container
                                       when it exits
  -t, --tty                            Allocate a pseudo-TTY
  -v, --volume list                    Bind mount a volume
  -w, --workdir string                 Working directory inside the
                                       container
      ...                              (90+ flags total)

That is roughly two dozen flags for a single command, and I have truncated the listing heavily. The full output has over 90 flags. The output as a whole always follows the same structure, with five identifiable sections.

Section 1: Usage Line

Usage:  docker container run [OPTIONS] IMAGE [COMMAND] [ARG...]

The Usage: line gives us the full command path (docker container run) and the argument syntax. The parser uses this to validate that the command path matches what the scraper expects. The [OPTIONS] placeholder tells us that flags will follow. IMAGE is a positional argument -- not a flag, not optional. [COMMAND] and [ARG...] are optional positional arguments.

I extract the command path from here, but I do not attempt to parse the argument syntax into a formal grammar. Positional arguments in Docker are too varied and too contextual to model generically. The generated API handles them as string[] trailing arguments.

Section 2: Aliases

Aliases:
  docker container run, docker run

This tells us that docker run is an alias for docker container run. The parser captures aliases because they affect how users invoke the command -- and because the scraper needs to avoid scraping the same command twice through different paths. If we already scraped docker container run, we skip docker run.

Section 3: Description

Create and run a new container from an image

Free-form text between the aliases (or usage line, if no aliases) and the first section header. This becomes the Description property on the CommandNode and eventually a /// <summary> XML doc comment on the generated C# class.

Section 4: Options

The bulk of the output. Every flag, its type, its description, and its default value. This is where the parser earns its keep, and I will dedicate an entire section to flag parsing below.

Section 5: Commands (for non-leaf commands)

For non-leaf commands like docker container --help, there is an Available Commands: section instead of (or in addition to) Options:

Available Commands:
  attach      Attach local standard input, output, and error streams
  cp          Copy files/folders between a container and the local filesystem
  create      Create a new container
  exec        Execute a command in a running container
  inspect     Display detailed information on one or more containers
  kill        Kill one or more running containers
  logs        Fetch the logs of a container
  ls          List containers
  rm          Remove one or more containers
  run         Create and run a new container from an image
  start       Start one or more stopped containers
  stop        Stop one or more running containers
  ...         (24 commands total)

The parser extracts these subcommand names. The scraper then recursively calls docker container {subcommand} --help for each one, parses the result, and attaches it as a child CommandNode.


The Parsing State Machine

CobraHelpParser is a line-by-line state machine. It reads the help text one line at a time, transitions between states based on section headers, and accumulates data into temporary buffers that ultimately become CommandNode fields.

State Enum

private enum ParserState
{
    Initial,
    Usage,
    Aliases,
    Description,
    Options,
    GlobalOptions,
    Commands,
    AdditionalHelp,
}

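Eight states. The parser starts in Initial and makes a single forward pass over the lines, never re-reading one. Section headers drive most transitions; a blank line after the usage or aliases block drops the parser into Description, so that state can be entered more than once. The single pass works because the help output of these CLIs is ordered the same way every time: Usage, Aliases, Description, Options/Flags, Global Flags, Available Commands, Additional Help Topics.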

The State Machine Diagram

Diagram: the eight-state parser that walks cobra help output once, never backtracking. Cobra always emits Usage, Aliases, Description, Flags, Global Flags, Commands and Additional Help in that order, so forward-only transitions are enough.

Section Detection

State transitions are driven by section header lines: once leading and trailing whitespace is stripped, the whole line must equal a known header ending in :. Here is the detection logic:

private static ParserState? DetectSectionHeader(string trimmedLine)
{
    return trimmedLine switch
    {
        "Usage:" => ParserState.Usage,
        "Aliases:" => ParserState.Aliases,
        "Options:" or "Flags:" => ParserState.Options,
        "Global Flags:" or "Global Options:" => ParserState.GlobalOptions,
        "Available Commands:" or "Commands:" => ParserState.Commands,
        "Additional help topics:" => ParserState.AdditionalHelp,
        _ => null,
    };
}

Pattern matching. Clean. "Flags:" is the heading in cobra's stock usage template, while Docker's customized template prints "Options:". "Global Options:" appears in a few custom cobra templates. The parser handles all of them.

Why not use string.EndsWith(":") and be more generic? Because that would also match lines inside multi-line descriptions that happen to end with a colon. A strict allowlist of known section headers is safer. If a future cobra version introduces a new section header, I add one line to this switch statement and reparse.

The Main Loop

public CommandNode ParseHelp(string helpText, string commandPath)
{
    var state = ParserState.Initial;
    var lines = helpText.Split('\n');

    string? usageLine = null;
    var aliases = new List<string>();
    var descriptionLines = new List<string>();
    var options = new List<CommandOption>();
    var globalOptions = new List<CommandOption>();
    var subcommands = new List<CommandNode>();

    CommandOption? pendingOption = null;

    for (var i = 0; i < lines.Length; i++)
    {
        var line = lines[i];
        var trimmed = line.TrimEnd();

        // Check for section header transition
        var nextState = DetectSectionHeader(trimmed.TrimStart());
        if (nextState.HasValue)
        {
            // Flush any pending multi-line option
            FlushPendingOption(ref pendingOption, options, globalOptions, state);
            state = nextState.Value;
            continue;
        }

        switch (state)
        {
            case ParserState.Initial:
                // Look for usage on same line as "Usage:"
                if (trimmed.StartsWith("Usage:"))
                {
                    usageLine = trimmed["Usage:".Length..].Trim();
                    state = ParserState.Usage;
                }
                break;

            case ParserState.Usage:
                if (string.IsNullOrWhiteSpace(trimmed))
                    state = ParserState.Description;
                else
                    usageLine = trimmed.Trim();
                break;

            case ParserState.Aliases:
                if (string.IsNullOrWhiteSpace(trimmed))
                    state = ParserState.Description;
                else
                    aliases.AddRange(
                        trimmed.Split(',')
                            .Select(a => a.Trim())
                            .Where(a => a.Length > 0));
                break;

            case ParserState.Description:
                if (!string.IsNullOrWhiteSpace(trimmed))
                    descriptionLines.Add(trimmed.Trim());
                break;

            case ParserState.Options:
            case ParserState.GlobalOptions:
                ParseOptionLine(
                    trimmed,
                    ref pendingOption,
                    state == ParserState.GlobalOptions
                        ? globalOptions
                        : options);
                break;

            case ParserState.Commands:
                ParseCommandLine(trimmed, subcommands);
                break;

            case ParserState.AdditionalHelp:
                // Ignored -- these are hints like
                // "Run 'docker COMMAND --help' for more"
                break;
        }
    }

    // Flush final pending option
    FlushPendingOption(ref pendingOption, options, globalOptions, state);

    var name = ExtractCommandName(commandPath);
    var description = string.Join(" ", descriptionLines);

    // Merge global options into options list
    options.AddRange(globalOptions);

    return new CommandNode(name, description, subcommands, options);
}

A few things to notice:

  1. pendingOption: Multi-line descriptions require buffering. When we parse a flag line, we do not immediately emit it -- we hold it in pendingOption. If the next line is a continuation (heavily indented, no -- prefix), we append to the pending option's description. If the next line starts a new flag, we flush the pending option and start a new one.

  2. Global options merge: Cobra separates Options: (command-specific) from Global Flags: (inherited from parent commands). I merge them into one list because the generated C# API needs all flags on the command, regardless of where cobra categorizes them. The code generator can separate them later if needed -- the JSON output preserves the isGlobal flag.

  3. Alias handling: Aliases are comma-separated on a single line. docker container run, docker run becomes two entries. The scraper uses this to build a deduplication set.

  4. Description accumulation: The description can span multiple lines between the aliases section and the options section. I join them with spaces.

Command Line Parsing

Commands are simpler than flags. Each line in the Available Commands: section follows this pattern:

  commandname    Description text here

Two or more leading spaces, the command name, a gap of two or more spaces, then the description. The parsing logic:

private static void ParseCommandLine(
    string line,
    List<CommandNode> subcommands)
{
    if (string.IsNullOrWhiteSpace(line))
        return;

    var trimmed = line.TrimStart();
    if (trimmed.Length == 0)
        return;

    // Find the gap between command name and description
    // Command names don't contain spaces; the gap is 2+ spaces
    var match = Regex.Match(trimmed, @"^(\S+)\s{2,}(.+)$");
    if (!match.Success)
        return;

    var name = match.Groups[1].Value;
    var description = match.Groups[2].Value.Trim();

    subcommands.Add(new CommandNode(
        name,
        description,
        SubCommands: Array.Empty<CommandNode>(),
        Options: Array.Empty<CommandOption>()));
}

The regex is simple: one or more non-whitespace characters (the command name), two or more whitespace characters (the gap), then the rest (the description). Command names never contain spaces in cobra -- they are single tokens like run, build, ls, network.

The resulting CommandNode has empty subcommands and options. Those get filled in when the scraper recursively invokes --help on each discovered subcommand.
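A quick way to see what the command regex accepts and rejects is to run it on a few lines. The sketch below reuses the same pattern; CommandLineDemo is a hypothetical wrapper that returns the captured pair instead of building a CommandNode.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommandLineDemo
{
    // Same pattern as in ParseCommandLine above.
    private static readonly Regex CommandRegex = new(@"^(\S+)\s{2,}(.+)$");

    // Returns (name, description), or null when the line is not a command row.
    public static (string Name, string Description)? Parse(string line)
    {
        var match = CommandRegex.Match(line.TrimStart());
        if (!match.Success)
            return null;

        return (match.Groups[1].Value, match.Groups[2].Value.Trim());
    }
}
```

A row such as "  attach      Attach local standard input, output, and error streams" yields the name attach plus its description. A bare "  ls" with no description is rejected, and so is ordinary prose like "Run 'docker COMMAND --help' for more information on a command.", because only a single space follows its first word.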


Flag Line Parsing: The Hard Part

This is where CobraHelpParser earns its complexity budget. A flag line in cobra help output packs five pieces of information into a single line of plain text, using alignment and whitespace as the only delimiters.

Flag Line Anatomy

  -d, --detach                         Run container in background
  ^^  ^^^^^^^^                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |   |                                |
  |   Long name                        Description
  Short name (optional)

      --blkio-weight uint16            Block IO weight (default 0)
                     ^^^^^^            ^^^^^^^^^^^^^^^^ ^^^^^^^^^^^
                     |                 |                |
                     Type hint         Description      Default value

Diagram: the five components the flag-line regex has to tease apart from a single plain-text line (short name, long name, type hint, description and default value), using only alignment and whitespace as delimiters.

A flag line has:

  1. Optional short name: -d, at the start. A single dash, a single letter, a comma, a space. Not all flags have short names.
  2. Long name: --detach or --blkio-weight. Double dash, then a name that can contain hyphens. Always present.
  3. Optional type hint: string, int, uint16, list, stringArray, duration, etc. If absent, the flag is a boolean toggle.
  4. Description: Free text. Can be very long and span multiple continuation lines.
  5. Optional default value: In parentheses at the end of the description: (default 0) or (default "missing") or (default []).

The Flag Parser

private static readonly Regex FlagLineRegex = new(
    @"^\s+" +                               // leading whitespace
    @"(?:(-\w),\s+)?" +                     // optional short name: -d,
    @"(--[\w][\w-]*)" +                     // long name: --detach
    @"(?:\s+(\S+))?" +                      // optional type: string, int, list
    @"(?:\s{2,}(.+))?$",                    // description (after 2+ space gap)
    RegexOptions.Compiled);

private static readonly Regex DefaultValueRegex = new(
    @"\(default\s+(.+?)\)\s*$",
    RegexOptions.Compiled);

Two regexes. The first handles the structural components of a flag line. The second extracts the default value from the description.
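To see how the first regex splits real lines, here is a small sketch that applies the same pattern, written as one literal, and returns the four captures. FlagLineDemo is a hypothetical wrapper used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FlagLineDemo
{
    // Same pattern as FlagLineRegex above, concatenated into one literal.
    private static readonly Regex FlagLineRegex = new(
        @"^\s+(?:(-\w),\s+)?(--[\w][\w-]*)(?:\s+(\S+))?(?:\s{2,}(.+))?$");

    // Returns the four captured parts, or null when the line is not a flag line.
    public static (string? Short, string Long, string? GoType, string? Description)?
        Parse(string line)
    {
        var m = FlagLineRegex.Match(line);
        if (!m.Success)
            return null;

        // Optional groups report Success == false when they did not participate.
        string? Optional(int group) =>
            m.Groups[group].Success ? m.Groups[group].Value : null;

        return (Optional(1), m.Groups[2].Value, Optional(3), Optional(4));
    }
}
```

For the -d, --detach line the type capture stays empty: the regex first tries to read "Run" as a type hint, finds only one space after it, and backtracks so that the whole text becomes the description. For --blkio-weight uint16 the single space before uint16 and the wide gap after it put each part in its own group. One thing the sketch makes visible: a boolean flag whose description is a single word would have that word captured as the type hint, since nothing but whitespace separates the two.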

The parsing itself:

private void ParseOptionLine(
    string line,
    ref CommandOption? pendingOption,
    List<CommandOption> target)
{
    // Is this a continuation line?
    // Continuation lines are indented but don't start with -
    if (pendingOption != null && IsContinuationLine(line))
    {
        pendingOption = pendingOption with
        {
            Description = pendingOption.Description + " "
                + line.Trim(),
        };
        return;
    }

    // Flush previous option
    if (pendingOption != null)
    {
        target.Add(FinalizeOption(pendingOption));
        pendingOption = null;
    }

    // Try to parse as a new flag line
    var match = FlagLineRegex.Match(line);
    if (!match.Success)
        return;

    var shortName = match.Groups[1].Success
        ? match.Groups[1].Value
        : null;
    var longName = match.Groups[2].Value;
    var goType = match.Groups[3].Success
        ? match.Groups[3].Value
        : null;
    var description = match.Groups[4].Success
        ? match.Groups[4].Value.Trim()
        : null;

    pendingOption = new CommandOption(
        LongName: longName,
        ShortName: shortName,
        Description: description,
        DefaultValue: null,     // extracted during finalize
        ValueKind: MapValueKind(goType),
        ClrType: MapClrType(goType),
        IsRequired: false);     // cobra doesn't mark required in help
}

private static bool IsContinuationLine(string line)
{
    if (string.IsNullOrWhiteSpace(line))
        return false;

    // Continuation lines have heavy indentation (30+ spaces)
    // and do NOT start with -- or -X,
    var trimmed = line.TrimStart();
    var indent = line.Length - trimmed.Length;

    return indent >= 30
        && !trimmed.StartsWith("--")
        && !Regex.IsMatch(trimmed, @"^-\w,");
}

private static CommandOption FinalizeOption(CommandOption option)
{
    if (option.Description == null)
        return option;

    // Extract default value from description
    var defaultMatch = DefaultValueRegex.Match(option.Description);
    if (!defaultMatch.Success)
        return option;

    var defaultValue = defaultMatch.Groups[1].Value.Trim('"');
    var cleanDescription = option.Description[..defaultMatch.Index].TrimEnd();

    return option with
    {
        DefaultValue = defaultValue,
        Description = cleanDescription,
    };
}

The key insight is the continuation line detection. Cobra aligns descriptions to a column (usually around column 39). When a description is too long, it wraps to the next line at the same column. That means continuation lines have 30+ spaces of leading indentation and do not start with -- or -X,. The parser detects this and appends to the pending option instead of starting a new one.

This handles the --cgroupns case from earlier, where the description spans fourteen lines and describes each possible value.
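The two helpers are easy to exercise in isolation. The sketch below copies the continuation test and the default-value extraction into a hypothetical OptionTextDemo class so they can be run against individual strings; the logic is meant to mirror IsContinuationLine and FinalizeOption, nothing more.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionTextDemo
{
    // Same pattern as DefaultValueRegex above.
    private static readonly Regex DefaultValueRegex =
        new(@"\(default\s+(.+?)\)\s*$");

    // Mirrors IsContinuationLine: deep indentation and no flag prefix.
    public static bool IsContinuation(string line)
    {
        if (string.IsNullOrWhiteSpace(line))
            return false;

        var trimmed = line.TrimStart();
        var indent = line.Length - trimmed.Length;

        return indent >= 30
            && !trimmed.StartsWith("--", StringComparison.Ordinal)
            && !Regex.IsMatch(trimmed, @"^-\w,");
    }

    // Mirrors FinalizeOption: splits a joined description into text and default.
    public static (string Description, string? Default) SplitDefault(string description)
    {
        var m = DefaultValueRegex.Match(description);
        if (!m.Success)
            return (description, null);

        return (description[..m.Index].TrimEnd(), m.Groups[1].Value.Trim('"'));
    }
}
```

Given "Tune container memory swappiness (0 to 100) (default -1)", SplitDefault keeps the "(0 to 100)" range in the description and returns -1 as the default, because only the trailing parenthesized group starts with the word default. A bare "(default)" with nothing after the word does not match the pattern, so it is left in the description and the default stays null.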

Multi-Line Description Example

Consider this flag from the real output:

      --cgroupns string                Cgroup namespace to use
                                       (host|private)
                                       'host':    Run the container in
                                                  the Docker host's
                                                  cgroup namespace
                                       'private': Run the container in
                                                  its own private cgroup
                                                  namespace
                                       '':        Use the cgroup
                                                  namespace as
                                                  configured by the
                                                  default-cgroupns-mode
                                                  option on the daemon
                                                  (default)

The first line matches FlagLineRegex: long name --cgroupns, type string, description starts with "Cgroup namespace to use". The next 13 lines are all continuation lines -- they have 30+ spaces of indentation and do not start with --. Each one gets appended to the description.

The final description is one long string: "Cgroup namespace to use (host|private) 'host': Run the container in the Docker host's cgroup namespace 'private': Run the container in its own private cgroup namespace '': Use the cgroup namespace as configured by the default-cgroupns-mode option on the daemon (default)".

The trailing (default) does not match DefaultValueRegex: the pattern requires whitespace and at least one character after the word default, and here the closing parenthesis follows immediately. The marker therefore stays at the end of the description and DefaultValue remains null. That is the right outcome: (default) here labels the '' choice, meaning "use the daemon's configured namespace mode," rather than supplying a value to capture.


Go Type to C# Type Mapping

Cobra attaches Go type hints to flag definitions. These appear in the help output after the flag name. The parser maps them to CLR types:

private static string MapClrType(string? goType) => goType switch
{
    null or ""          => "bool",
    "string"            => "string",
    "int"               => "int",
    "int64"             => "long",
    "uint" or "uint64"  => "ulong",
    "uint16"            => "ushort",
    "uint32"            => "uint",
    "float64"           => "double",
    "duration"          => "string",
    "list"              => "List<string>",
    "stringArray"       => "List<string>",
    "strings"           => "List<string>",
    "stringToString"    => "Dictionary<string,string>",
    "ulimit"            => "string",
    "bytes"             => "string",
    "mount"             => "string",
    "network"           => "string",
    "gpu-request"       => "string",
    "decimal"           => "double",
    "filter"            => "string",
    "command"           => "string",
    "ip"                => "string",
    _                   => "string",    // unknown types default to string
};

private static OptionValueKind MapValueKind(string? goType) => goType switch
{
    null or ""              => OptionValueKind.Flag,
    "list"                  => OptionValueKind.List,
    "stringArray"           => OptionValueKind.List,
    "strings"               => OptionValueKind.List,
    "stringToString"        => OptionValueKind.List,
    _                       => OptionValueKind.Single,
};

The full type mapping table:

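| Cobra Type     | Go Type           | C# Type                   | OptionValueKind |
|----------------|-------------------|---------------------------|-----------------|
| (absent)       | bool              | bool                      | Flag            |
| string         | string            | string                    | Single          |
| int            | int               | int                       | Single          |
| int64          | int64             | long                      | Single          |
| uint16         | uint16            | ushort                    | Single          |
| uint32         | uint32            | uint                      | Single          |
| uint / uint64  | uint / uint64     | ulong                     | Single          |
| float64        | float64           | double                    | Single          |
| duration       | time.Duration     | string                    | Single          |
| list           | []string          | List<string>              | List            |
| stringArray    | []string          | List<string>              | List            |
| strings        | []string          | List<string>              | List            |
| stringToString | map[string]string | Dictionary<string,string> | List            |
| ulimit         | custom            | string                    | Single          |
| bytes          | custom            | string                    | Single          |
| mount          | custom            | string                    | Single          |
| network        | custom            | string                    | Single          |
| gpu-request    | custom            | string                    | Single          |
| decimal        | custom            | double                    | Single          |
| filter         | custom            | string                    | Single          |
| command        | custom            | string                    | Single          |
| ip             | net.IP            | string                    | Single          |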

A few decisions worth explaining:

duration and bytes map to string, not TimeSpan or long. Go durations ("10s", "5m30s") and Docker memory values ("512m", "2g") have their own formats. I pass them through as strings -- the generated builders provide strongly typed overloads (TimeSpan, long) that format correctly, but the underlying option stays string to avoid lossy translation.

list, stringArray, and strings all map to List<string>. These slice types behave identically from the parser's perspective -- each produces a repeatable flag like -e FOO=1 -e BAR=2.

stringToString maps to Dictionary<string,string>. This is cobra's map[string]string. On the command line: --label key=value --label key2=value2.

Unknown types default to string. If Docker introduces a new cobra type, the parser does not crash. It maps to string, logs a warning, and I add the type on the next scrape cycle.
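As an illustration of the typed-overload idea for duration, here is a minimal sketch of a formatter that turns a TimeSpan into a Go-style duration string. GoDuration.Format is a hypothetical helper, not the generated builder's code; it assumes non-negative values and drops anything finer than a millisecond.

```csharp
using System;
using System.Text;

public static class GoDuration
{
    // Formats a non-negative TimeSpan as a Go-style duration such as "5m30s".
    // Sub-millisecond precision is discarded in this sketch.
    public static string Format(TimeSpan value)
    {
        if (value < TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(
                nameof(value), "negative durations are not handled in this sketch");

        if (value.Ticks < TimeSpan.TicksPerMillisecond)
            return "0s";

        var sb = new StringBuilder();

        // Whole hours, computed from ticks so that values above 24h stay in hours.
        long hours = value.Ticks / TimeSpan.TicksPerHour;
        if (hours > 0) sb.Append(hours).Append('h');
        if (value.Minutes > 0) sb.Append(value.Minutes).Append('m');
        if (value.Seconds > 0) sb.Append(value.Seconds).Append('s');
        if (value.Milliseconds > 0) sb.Append(value.Milliseconds).Append("ms");

        return sb.ToString();
    }
}
```

Keeping the stored option as a string means a caller can still pass a value this formatter would never produce, such as a fractional "1.5h", without the wrapper having to understand it.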


Real Scraped Output: Docker vs Docker Compose

Docker Compose uses cobra too, but its help output has different characteristics. Here is docker compose up --help from Compose v2.24.0:

Usage:  docker compose up [OPTIONS] [SERVICE...]

Create and start containers

Options:
      --abort-on-container-exit   Stops all containers if any container
                                  was stopped. Incompatible with -d
      --attach stringArray        Restrict attaching to the specified
                                  services. Incompatible with
                                  --attach-dependencies.
      --build                     Build images before starting
                                  containers
  -d, --detach                    Detached mode: Run containers in the
                                  background
      --dry-run                   Execute command in dry run mode
      --force-recreate            Recreate containers even if their
                                  configuration and image haven't changed
      --no-deps                   Don't start linked services
      --pull string               Pull image before running
                                  ("always"|"missing"|"never"|"build")
                                  (default "policy")
      --remove-orphans            Remove containers for services not
                                  defined in the Compose file
      --scale scale               Scale SERVICE to NUM instances.
                                  Overrides the `scale` setting in the
                                  Compose file if present.
  -t, --timeout int               Use this timeout in seconds for
                                  container shutdown (default 0)
      --wait                      Wait for services to be
                                  running|healthy. Implies detached mode.
  -w, --watch                     Watch source code and rebuild/refresh
                                  containers when files are updated.

Same cobra format. Same section headers. Same flag line structure. CobraHelpParser handles it identically. But there are differences worth noting:

No aliases section: Compose commands do not have aliases the way Docker commands do (docker container run / docker run). The parser handles the absence of the Aliases: section by transitioning directly from Usage to Description.

stringArray instead of list: Compose prefers stringArray for its repeatable flags (--attach, --no-attach). Docker prefers list. Both map to List<string> in C#.

scale as a type: The --scale flag has type scale, which is a custom cobra type. The parser maps it to string via the fallback rule. The actual format is service=num pairs -- the generated builder provides a Dictionary<string,int> overload that serializes correctly.

Global flags: Compose inherits global flags from the docker compose parent command (--file, --project-name, --project-directory, --profile, --env-file, etc.). Those appear in the Global Flags: section when you run docker compose up --help. The parser captures them and merges them into the option list.

The point: one parser, zero changes, two different CLI tools. The cobra format is the format. Docker and Compose differ in content, not in structure.
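The Dictionary<string,int> overload mentioned for --scale could serialize along these lines. This is a sketch under assumptions: ScaleArguments.Render is a hypothetical name, and sorting the services is a choice made here only to keep the output deterministic.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class ScaleArguments
{
    // Renders { "web": 3, "worker": 2 } as: --scale web=3 --scale worker=2
    public static IReadOnlyList<string> Render(
        IReadOnlyDictionary<string, int> scale)
    {
        var args = new List<string>();

        // Dictionary order is unspecified, so sort by service name.
        foreach (var (service, count) in scale.OrderBy(p => p.Key, StringComparer.Ordinal))
        {
            args.Add("--scale");
            args.Add(service + "=" + count.ToString(CultureInfo.InvariantCulture));
        }

        return args;
    }
}
```

The parsed option itself stays a plain string; only the builder-side overload knows about the service=num shape.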


The Recursive Scrape Flowchart

CobraHelpParser parses one command at a time. The scraper orchestrates the recursion:

Diagram: the recursion the scraper wraps around CobraHelpParser. The parser handles a single help page while this loop walks to unbounded depth until every leaf command's flags have been captured into one CommandTree.

Docker has three levels of nesting: docker -> docker container -> docker container run. Compose has two: docker compose -> docker compose up. The scraper does not hardcode depth -- it just follows subcommands until there are none.

For Docker 24.0.0, this produces 180+ CommandNode objects in a single tree. The scraper serializes the tree to JSON, one file per version: docker-24.0.0.json. That JSON file is what the source generator reads at build time.
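
As a rough sketch of what "one tree, one JSON file" looks like, the snippet below builds a three-level tree, counts its nodes, and serializes it. The CommandNode record here is a trimmed-down stand-in (no Options), TreeStats is a hypothetical helper, and the use of System.Text.Json is an assumption; the text does not say which serializer the scraper uses.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// docker -> docker container -> docker container run
var run = new CommandNode("run", "Create and run a new container", Array.Empty<CommandNode>());
var container = new CommandNode("container", "Manage containers", new[] { run });
var root = new CommandNode("docker", null, new[] { container });

Console.WriteLine(TreeStats.CountNodes(root)); // 3

// Assumption: System.Text.Json with indented output.
var json = JsonSerializer.Serialize(root, new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine(json);

// Simplified stand-in for the real CommandNode (Options omitted).
public sealed record CommandNode(
    string Name,
    string? Description,
    IReadOnlyList<CommandNode> SubCommands);

public static class TreeStats
{
    // Counts this node plus all descendants, at any depth.
    public static int CountNodes(CommandNode node) =>
        1 + node.SubCommands.Sum(child => CountNodes(child));
}
```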


The Other Parsers: Why One Parser Cannot Rule Them All

CobraHelpParser handles cobra-based CLIs. But BinaryWrapper supports other CLIs too -- Packer, Vagrant, PodmanCompose (Python), and others. Each has its own help format, and each requires its own parser.

Here is a brief comparison:

Parser | Framework | Used By | Section Header | Flag Format
CobraHelpParser | Go/cobra | Docker, Compose, Podman, glab | Available Commands: | -s, --name type
StandardHelpParser | GNU-style | Generic CLIs | Commands: | --name=VALUE
ArgparseHelpParser | Python argparse | PodmanCompose | optional arguments: | -s, --name VALUE
PackerHelpParser | HashiCorp custom | Packer | Flat subcommand list | -name=value
VagrantHelpParser | Ruby custom | Vagrant | Indented tree | --name VALUE
GlabHelpParser | Custom cobra template | GitLab CLI | Modified cobra sections | Same as cobra

Each format is different enough that a single parser would devolve into a mess of heuristics. Cobra uses -s, --name type. GNU-style uses --name=VALUE. Python argparse uses -s, --name VALUE with "optional arguments:" as the section header. Packer uses single-dash -name=value under "Available commands are:". Vagrant uses an indented tree under "Common commands:".

The decision tree for auto-detection:

Diagram: the auto-detection decision tree. Each parser's CanParse method fingerprints a section header and flag shape, so an unknown binary's help text is routed to the right parser without a hardcoded mapping.

Each parser's CanParse method checks for its framework's signature patterns. CobraHelpParser looks for "Usage:" plus either a commands header such as "Available Commands:" or flag lines that match the -s, --name type pattern. ArgparseHelpParser looks for "optional arguments:". PackerHelpParser looks for "Available commands are:" (note the lowercase c and the trailing are). VagrantHelpParser looks for "Common commands:".

The auto-detection is a best-effort heuristic. For known binaries -- Docker, Compose, Podman, Packer, Vagrant -- the scraper configuration explicitly specifies which parser to use. Auto-detection is the fallback for unknown binaries.
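
The "explicit configuration first, auto-detection as fallback" rule can be sketched like this. IParserProbe, MarkerParser, and ParserSelector are hypothetical names introduced for the example, and the single-marker CanParse is a deliberate simplification of the real fingerprinting; only the marker strings themselves come from the text.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

var parsers = new IParserProbe[]
{
    new MarkerParser("cobra", "Available Commands:"),
    new MarkerParser("argparse", "optional arguments:"),
    new MarkerParser("packer", "Available commands are:"),
    new MarkerParser("vagrant", "Common commands:"),
};
var configured = new Dictionary<string, string> { ["docker"] = "cobra" };

// Known binary: the configured parser wins without looking at the text.
Console.WriteLine(ParserSelector.Select("docker", "", parsers, configured)?.Name); // cobra

// Unknown binary: the first parser whose CanParse accepts the text is used.
Console.WriteLine(
    ParserSelector.Select("mytool", "optional arguments:\n  -h", parsers, configured)?.Name); // argparse

// Hypothetical, reduced view of IHelpParser: just what selection needs.
public interface IParserProbe
{
    string Name { get; }
    bool CanParse(string helpText);
}

// Simplified parser stand-in that fingerprints a single section header.
public sealed record MarkerParser(string Name, string Marker) : IParserProbe
{
    public bool CanParse(string helpText) =>
        helpText.Contains(Marker, StringComparison.Ordinal);
}

public static class ParserSelector
{
    public static IParserProbe? Select(
        string binary,
        string helpText,
        IReadOnlyList<IParserProbe> parsers,
        IReadOnlyDictionary<string, string> configured)
    {
        // 1. Explicit scraper configuration (throws if it names an unknown parser).
        if (configured.TryGetValue(binary, out var name))
            return parsers.First(p => p.Name == name);

        // 2. Best-effort auto-detection; null means no parser recognized the text.
        return parsers.FirstOrDefault(p => p.CanParse(helpText));
    }
}
```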


The Reparse Workflow

Here is a scenario I hit regularly: I discover that CobraHelpParser mishandles a rare flag format. Maybe a new Docker version introduces a flag with a type I have not seen, or a description wraps in an unexpected way. I fix the parser. Now I need to verify the fix against all 97 scraped versions without re-running the containers.

This is why the scraper saves raw help text alongside the parsed JSON.

The scraper produces two outputs per version:

scrape/
  docker-24.0.0.json          # Parsed CommandNode tree
  docker-24.0.0.help/         # Raw help text, one file per command
    docker.txt
    docker-container.txt
    docker-container-run.txt
    docker-container-ls.txt
    docker-image.txt
    docker-image-build.txt
    ...

The .help/ directory contains the raw stdout of every --help invocation. One file per command path, named with hyphens replacing spaces. For Docker 24.0.0, that is 180+ text files.

The --reparse flag tells the scraper to skip container operations and re-run the parser against cached help text:

dotnet run --project src/Scraper -- \
    --binary docker \
    --reparse \
    --output scrape/

This reads every .txt file in every .help/ directory, re-parses it with the updated CobraHelpParser, and overwrites the .json files. It takes about 2 seconds for all 97 Docker versions. Compare that to the 30+ minutes it takes to rebuild containers and re-scrape.
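
A minimal sketch of the reparse enumeration, assuming the directory layout shown above. ReparseSketch and EnumerateCachedHelp are hypothetical names. Recovering the command path by turning hyphens back into spaces is an assumption made for the example, and it is lossy for any command whose own name contains a hyphen, so the real scraper may well track the path differently.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Build a throwaway scrape directory so the example runs on its own.
var root = Path.Combine(Path.GetTempPath(), "scrape-demo-" + Guid.NewGuid().ToString("N"));
var helpDir = Path.Combine(root, "docker-24.0.0.help");
Directory.CreateDirectory(helpDir);
File.WriteAllText(
    Path.Combine(helpDir, "docker-container-run.txt"),
    "Usage:  docker container run [OPTIONS] IMAGE [COMMAND] [ARG...]");

foreach (var (version, commandPath, helpText) in ReparseSketch.EnumerateCachedHelp(root))
{
    // A real reparse would call parser.ParseHelp(helpText, commandPath) here
    // and rewrite <version>.json once the whole tree is rebuilt.
    Console.WriteLine($"{version}: {commandPath} ({helpText.Length} chars)");
}

Directory.Delete(root, recursive: true);

public static class ReparseSketch
{
    // Yields every cached help page without touching any container.
    public static IEnumerable<(string Version, string CommandPath, string HelpText)>
        EnumerateCachedHelp(string scrapeDir)
    {
        foreach (var dir in Directory.EnumerateDirectories(scrapeDir, "*.help"))
        {
            // ".../docker-24.0.0.help" -> "docker-24.0.0"
            var version = Path.GetFileNameWithoutExtension(dir);

            foreach (var file in Directory.EnumerateFiles(dir, "*.txt"))
            {
                // Assumption: "docker-container-run.txt" -> "docker container run".
                var commandPath = Path.GetFileNameWithoutExtension(file).Replace('-', ' ');
                yield return (version, commandPath, File.ReadAllText(file));
            }
        }
    }
}
```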

The reparse workflow serves two purposes:

  1. Development loop: Fix a parser bug, reparse, diff the JSON output, verify the fix.
  2. Regression detection: After any parser change, reparse ALL versions and diff. If any previously correct JSON changes in a way I did not expect, I have a regression.

I diff the JSON output with a simple script:

# Before: save current JSON as baseline
mkdir -p baseline
cp scrape/*.json baseline/

# After parser change: reparse
dotnet run --project src/Scraper -- --binary docker --reparse

# Diff the JSON only; the raw .help/ directories exist only under scrape/
diff -r --exclude="*.help" baseline/ scrape/ | head -100

If the diff shows only the fixes I intended, I commit. If it shows unexpected changes, I investigate. This has caught at least a dozen regressions over the life of the project.


CanParse: Auto-Detecting Cobra Output

The CanParse method is CobraHelpParser's gate. Given a blob of help text from an unknown binary, it returns true if the text looks like cobra output:

public bool CanParse(string helpText)
{
    if (string.IsNullOrWhiteSpace(helpText))
        return false;

    var lines = helpText.Split('\n');
    var hasUsage = false;
    var hasCobraFlags = false;
    var hasCommands = false;

    foreach (var line in lines)
    {
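        // Trim both ends: a trailing '\r' from CRLF output would otherwise
        // defeat the exact header comparisons below.
        var trimmed = line.Trim();

        if (trimmed.StartsWith("Usage:", StringComparison.Ordinal))
            hasUsage = true;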

        if (trimmed is "Available Commands:" or "Commands:")
            hasCommands = true;

        if (trimmed is "Options:" or "Flags:" or "Global Flags:")
            hasCobraFlags = true;

        // Look for cobra-style flag format: -X, --name
        if (FlagLineRegex.IsMatch(line))
            hasCobraFlags = true;
    }

    // Cobra output always has Usage: and either commands or flags
    return hasUsage && (hasCobraFlags || hasCommands);
}

The heuristic is deliberately conservative: cobra output always has Usage: plus either flags or commands. I would rather return false and let another parser try than return true and produce garbage. False negatives mean specifying the parser explicitly. False positives mean silently corrupted command trees.


Malformed Help Handling

Not all help text is clean. Over 97 Docker versions and 57 Compose versions, I have seen experimental commands with truncated output, plugin commands with custom cobra templates, old Docker 17.x formatting, and commands that error instead of printing help. The parser's philosophy: extract what you can, skip what you cannot, never crash.

public CommandNode ParseHelp(string helpText, string commandPath)
{
    try
    {
        return ParseHelpCore(helpText, commandPath);
    }
    catch (Exception ex)
    {
        // Pass the exception itself so the log keeps the stack trace,
        // not just the message.
        _logger.LogWarning(
            ex,
            "Failed to parse help for {Command}. " +
            "Returning empty CommandNode.",
            commandPath);

        return new CommandNode(
            Name: ExtractCommandName(commandPath),
            Description: null,
            SubCommands: Array.Empty<CommandNode>(),
            Options: Array.Empty<CommandOption>());
    }
}

Inside ParseHelpCore, unrecognized lines in the Options state are silently skipped -- no crash, no exception. Empty help text returns an empty CommandNode. Missing Usage lines fall back to the commandPath parameter. Truncated output returns whatever was parsed so far. Partial data is better than no data.

Post-parse validation catches parser bugs: duplicate long names (a strong signal that a continuation line was misidentified as a new flag), missing descriptions, and suspiciously empty nodes all generate warnings in the scrape log.
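
The three checks named above could look roughly like this. It is a sketch under assumptions: ParsedOption and PostParseValidation are hypothetical names, the warning wording is invented for the example, and "suspiciously empty" is interpreted here as "no options and no subcommands".

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

var options = new[]
{
    new ParsedOption("--name", "Assign a name to the container"),
    new ParsedOption("--name", "container"),   // continuation line misread as a flag
    new ParsedOption("--oom-score-adj", null), // flag with no description
};

foreach (var warning in PostParseValidation.Validate("docker container run", options, subCommandCount: 0))
    Console.WriteLine(warning);

// Simplified stand-in for CommandOption.
public sealed record ParsedOption(string LongName, string? Description);

public static class PostParseValidation
{
    public static IEnumerable<string> Validate(
        string commandPath, IReadOnlyList<ParsedOption> options, int subCommandCount)
    {
        // Duplicate long names: a strong hint that a continuation line
        // was mistaken for a new flag.
        foreach (var dup in options.GroupBy(o => o.LongName, StringComparer.Ordinal)
                                   .Where(g => g.Count() > 1))
            yield return $"{commandPath}: duplicate flag {dup.Key} ({dup.Count()} occurrences)";

        // Missing descriptions are legal (see the edge cases) but worth a look.
        foreach (var o in options.Where(o => string.IsNullOrWhiteSpace(o.Description)))
            yield return $"{commandPath}: flag {o.LongName} has no description";

        // Assumption: a node with neither options nor subcommands is suspicious.
        if (options.Count == 0 && subCommandCount == 0)
            yield return $"{commandPath}: node has no options and no subcommands";
    }
}
```

These are warnings rather than failures, which matches the "extract what you can" philosophy: the scrape completes, and the log points at what to inspect.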


Edge Cases from 97 Docker Versions

Scraping 97 versions of Docker (and 57 of Compose) is the best stress test a parser can get. Here are the edge cases that forced parser changes:

Edge Case 1: Flags with No Description

Some Docker commands have flags with no description at all:

      --oom-score-adj int

Just the name and type, no gap, no description. The regex handles this because the description group is optional: (?:\s{2,}(.+))?$. The continuation line detector, however, must not treat the next real flag line as a continuation of this one.
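
To make the interaction between the optional description group and continuation detection concrete, here is a minimal sketch. The regex is an approximation assembled for this example (only the description group and the long-name group mirror fragments quoted in this post), and FlagLines and ParsedFlag are hypothetical names, so treat it as an illustration rather than the parser's actual pattern.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var lines = new[]
{
    "      --oom-score-adj int",
    "      --name string                   Assign a name to the",
    "                                      container",
};

foreach (var f in FlagLines.Parse(lines))
    Console.WriteLine($"{f.LongName} [{f.Type}] '{f.Description}'");
// --oom-score-adj [int] ''
// --name [string] 'Assign a name to the container'

public sealed record ParsedFlag(string? ShortName, string LongName, string? Type, string Description);

public static class FlagLines
{
    // Groups: 1 = short name, 2 = long name, 3 = type, 4 = description (optional).
    private static readonly Regex FlagLine = new(
        @"^\s+(?:(-\w),\s+)?(--[\w][\w.-]*)(?:\s(\S+))?(?:\s{2,}(.+))?$",
        RegexOptions.Compiled);

    public static List<ParsedFlag> Parse(IEnumerable<string> lines)
    {
        var flags = new List<ParsedFlag>();
        foreach (var line in lines)
        {
            var m = FlagLine.Match(line);
            if (m.Success)
            {
                // A line that matches the flag pattern always starts a new flag,
                // even when the previous flag had no description.
                flags.Add(new ParsedFlag(
                    m.Groups[1].Success ? m.Groups[1].Value : null,
                    m.Groups[2].Value,
                    m.Groups[3].Success ? m.Groups[3].Value : null,
                    m.Groups[4].Success ? m.Groups[4].Value.Trim() : ""));
            }
            else if (flags.Count > 0 && line.StartsWith("  ", StringComparison.Ordinal)
                     && line.Trim().Length > 0)
            {
                // Indented non-flag text continues the previous description.
                var last = flags[flags.Count - 1];
                flags[flags.Count - 1] = last with
                {
                    Description = (last.Description + " " + line.Trim()).Trim(),
                };
            }
        }
        return flags;
    }
}
```

Checking the flag pattern before the continuation rule is the design choice that matters: the order guarantees that a description-less flag never swallows the flag that follows it.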

Edge Case 2: Plugin Commands

Docker plugins like buildx and scout are cobra-based but register as plugins. Their help output is slightly different:

Usage:  docker buildx build [OPTIONS] PATH | URL | -

Start a build

Build Flags:
      --add-host strings              Additional custom host-to-IP
                                      mapping (format: "host:ip")
      --allow strings                 Allow extra privileged
                                      entitlement (e.g.,
                                      "network.host",
                                      "security.insecure")

Note "Build Flags:" instead of "Options:". I added this to the section header detection:

_ when trimmedLine.EndsWith("Flags:") => ParserState.Options,

Any line ending with "Flags:" transitions to the Options state. This is slightly more permissive than the strict allowlist approach, but it is safe because this pattern only appears as a section header in cobra output.
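
For context, here is how that arm might sit inside a header-classification switch. This is a reduced sketch: the real state machine has more states, and apart from ParserState.Options the state names below, as well as the SectionHeaders class, are assumptions made for the example.

```csharp
using System;

Console.WriteLine(SectionHeaders.Classify("Options:"));            // Options
Console.WriteLine(SectionHeaders.Classify("Build Flags:"));        // Options
Console.WriteLine(SectionHeaders.Classify("Available Commands:")); // Commands
Console.WriteLine(SectionHeaders.Classify("Start a build"));       // None

// Only Options is named in the text; Commands and None are placeholders.
public enum ParserState { None, Commands, Options }

public static class SectionHeaders
{
    public static ParserState Classify(string line)
    {
        var trimmedLine = line.Trim();
        return trimmedLine switch
        {
            "Available Commands:" or "Commands:" => ParserState.Commands,
            "Options:" or "Flags:" or "Global Flags:" => ParserState.Options,
            // Permissive arm for plugin templates such as "Build Flags:".
            _ when trimmedLine.EndsWith("Flags:", StringComparison.Ordinal) => ParserState.Options,
            _ => ParserState.None,
        };
    }
}
```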

Edge Case 3: Deprecated Flags

Some flags have DEPRECATED in their description:

      --link list                      Add link to another container
                                       (DEPRECATED)

The parser does not strip this marker. It flows into the Description field and eventually into the generated C# code as an [Obsolete] attribute. The code generator detects (DEPRECATED) in the description and adds the attribute.
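
On the generator side, the detection could be as small as the sketch below. DeprecationSketch and ObsoleteAttributeFor are hypothetical names, and reusing the remaining description as the attribute message is an assumption; the text only says that the attribute is added.

```csharp
#nullable enable
using System;

Console.WriteLine(DeprecationSketch.ObsoleteAttributeFor(
    "Add link to another container (DEPRECATED)"));
// [Obsolete("Add link to another container")]

Console.WriteLine(DeprecationSketch.ObsoleteAttributeFor("Assign a name to the container") is null); // True

public static class DeprecationSketch
{
    private const string Marker = "(DEPRECATED)";

    // Returns the attribute source text, or null when the flag is not deprecated.
    public static string? ObsoleteAttributeFor(string? description)
    {
        if (description is null || !description.Contains(Marker, StringComparison.Ordinal))
            return null;

        // Assumption: the rest of the description doubles as the message.
        var message = description.Replace(Marker, "").Trim();

        // Escape backslashes and quotes so the emitted C# literal stays valid.
        var escaped = message.Replace("\\", "\\\\").Replace("\"", "\\\"");
        return "[Obsolete(\"" + escaped + "\")]";
    }
}
```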

Edge Case 4: Hidden Commands

Some Docker versions have hidden commands that do not appear in Available Commands: but respond to --help if you know the name. The parser cannot discover these -- they are invisible in the help text. I handle this with a supplementary list of known hidden commands per binary, maintained manually.

Edge Case 5: Flag Names with Dots

The Docker daemon has key=value flags such as --log-opt max-size=10m and --storage-opt dm.basesize, and in some help outputs the option name itself contains a dot. The original name pattern, [\w][\w-]*, matches only word characters and hyphens, and a dot is neither. I extended the character class:

@"(--[\w][\w.-]*)"    // long name: --name, --storage-opt

The dot only appeared in a few daemon-level flags, but without this fix, those flags silently disappeared from the parsed output.
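
The "silently disappeared" failure mode is easy to reproduce. In the sketch below, the two full-line patterns are approximations built around the name group shown above, and --sample.opt is a made-up flag name used purely for illustration (not a real Docker flag).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical help line with a dotted flag name.
const string line = "      --sample.opt string   Hypothetical dotted flag";

Console.WriteLine(DottedNames.Before.IsMatch(line)); // False: the line is skipped entirely
Console.WriteLine(DottedNames.After.IsMatch(line));  // True
Console.WriteLine(DottedNames.After.Match(line).Groups[1].Value); // --sample.opt

public static class DottedNames
{
    // Name group without the dot: the match dies at '.', and the end anchor
    // prevents a partial match, so the whole line is rejected.
    public static readonly Regex Before = new(
        @"^\s+(--[\w][\w-]*)(?:\s(\S+))?(?:\s{2,}(.+))?$");

    // Name group with the dot added to the character class.
    public static readonly Regex After = new(
        @"^\s+(--[\w][\w.-]*)(?:\s(\S+))?(?:\s{2,}(.+))?$");
}
```

Because unmatched lines in the Options state are skipped rather than reported, a rejected flag line leaves no trace, which is why this kind of gap only shows up when diffing parsed output against the raw help text.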


Putting It All Together

Here is the complete flow from raw help text to serialized JSON, showing how CobraHelpParser fits into the larger scraping pipeline:

// In the scraper's recursive walk
async Task<CommandNode> ScrapeCommand(
    string binaryPath,
    string commandPath,
    IHelpParser parser,
    CancellationToken ct)
{
    // 1. Execute --help
    var helpText = await _processRunner.RunAsync(
        binaryPath,
        $"{commandPath} --help",
        ct);

    // 2. Parse the help text
    var node = parser.ParseHelp(helpText, commandPath);

    // 3. Save raw help text for reparse
    var helpFileName = commandPath.Replace(' ', '-') + ".txt";
    await File.WriteAllTextAsync(
        Path.Combine(_helpDir, helpFileName),
        helpText, ct);

    // 4. Recurse into subcommands
    var children = new List<CommandNode>();
    foreach (var sub in node.SubCommands)
    {
        var childPath = $"{commandPath} {sub.Name}";
        var childNode = await ScrapeCommand(
            binaryPath, childPath, parser, ct);
        children.Add(childNode);
    }

    // 5. Return the node with fully populated children
    return node with
    {
        SubCommands = children,
    };
}

The parser is stateless. It takes text in, returns a CommandNode out. The scraper owns the recursion, the file I/O, and the tree assembly. This separation means I can unit test the parser with raw text fixtures and integration test the scraper with a mock IProcessRunner.
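
For the integration-test side, a dictionary-backed fake is usually enough. The sketch below assumes an IProcessRunner shape inferred from the RunAsync call in the scraper code (binary path, argument string, cancellation token, returning stdout); the interface declaration and FakeProcessRunner are illustrative, not the project's actual types.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

var runner = new FakeProcessRunner(new Dictionary<string, string>
{
    ["docker --help"] = "Usage:  docker [OPTIONS] COMMAND",
});

Console.WriteLine(await runner.RunAsync("docker", "docker --help", CancellationToken.None));
Console.WriteLine(runner.Calls.Count); // 1

// Assumed shape, inferred from how the scraper calls _processRunner.RunAsync.
public interface IProcessRunner
{
    Task<string> RunAsync(string binaryPath, string arguments, CancellationToken ct);
}

// Serves canned help text keyed by the argument string; no process is started.
public sealed class FakeProcessRunner : IProcessRunner
{
    private readonly IReadOnlyDictionary<string, string> _responses;

    public FakeProcessRunner(IReadOnlyDictionary<string, string> responses) =>
        _responses = responses;

    // Recorded so a test can assert which commands the scraper visited.
    public List<string> Calls { get; } = new();

    public Task<string> RunAsync(string binaryPath, string arguments, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        Calls.Add(arguments);

        // Assumption: an unknown command behaves like one that prints no help.
        return Task.FromResult(
            _responses.TryGetValue(arguments, out var text) ? text : string.Empty);
    }
}
```

Feeding the fake from the cached .help/ files would let the same fixtures drive both the unit tests and the scraper's integration tests.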


Performance

Parser performance does not matter much -- we parse at design time, not runtime. But for reference: a single command parse takes ~0.1ms, a full Docker version (180+ commands) takes ~15ms, and reparsing all 97 versions takes ~1.5 seconds. The regex is compiled (RegexOptions.Compiled), but even without that, string splitting on a few hundred lines of text is not where bottlenecks live.


Testing Strategy

CobraHelpParser has 80+ unit tests in three categories: fixture tests (real help text from specific Docker/Compose/Podman versions saved as embedded resources), edge case tests (synthetic help text exercising multi-line descriptions, missing fields, unusual types), and cross-binary consistency tests (verifying the same parser produces equivalent structures for Docker, Compose, and Podman).

[Theory]
[InlineData("docker-24.0.0-container-run")]
[InlineData("docker-20.10.0-container-run")]
[InlineData("compose-2.24.0-up")]
[InlineData("podman-4.9.0-run")]
public void ParsesRealHelpText(string fixtureName)
{
    var helpText = LoadFixture(fixtureName);
    var node = _parser.ParseHelp(helpText, fixtureName);

    node.Name.Should().NotBeNullOrEmpty();
    node.Options.Should().NotBeEmpty();
    node.Options.Should().AllSatisfy(o =>
        o.LongName.Should().StartWith("--"));
}

[Fact]
public void DockerAndPodmanProduceSameStructure()
{
    var dockerNode = _parser.ParseHelp(
        LoadFixture("docker-24.0.0-container-run"), "docker container run");
    var podmanNode = _parser.ParseHelp(
        LoadFixture("podman-4.9.0-run"), "podman run");

    var commonFlags = new[] { "--detach", "--name", "--env", "--volume" };
    foreach (var flag in commonFlags)
    {
        dockerNode.Options.Should().Contain(o => o.LongName == flag);
        podmanNode.Options.Should().Contain(o => o.LongName == flag);
    }
}

The fixture tests are regression tests -- if a parser change breaks an existing fixture, the test fails. The cross-binary tests verify the abstraction holds: same interface, same behavior, different binaries.


What The Parser Does Not Do

It is worth being explicit about the boundaries:

The parser does not discover hidden flags. Cobra has a concept of hidden flags that do not appear in --help output. If Docker hides an experimental flag, the parser cannot see it. This is fine -- if it is hidden, I probably should not generate a typed API for it.

The parser does not validate flag semantics. It does not know that --memory expects a value like "512m" or that --restart only accepts "no", "always", "unless-stopped", "on-failure". Semantic validation is the code generator's job, informed by hardcoded allowlists and (default ...) patterns in descriptions.

The parser does not resolve flag conflicts. Some Docker flags are mutually exclusive (--network host and --publish). Cobra does not express this in help text, so the parser cannot extract it. Conflict detection is manual, maintained in a supplementary configuration file.

The parser does not handle non-cobra formats. That is what the other five parsers are for.


Closing

One parser. Five binaries. Thousands of commands. CobraHelpParser is the most exercised parser in the BinaryWrapper suite -- and the one that demonstrates why scraping --help is a viable strategy for generating typed APIs.

The cobra framework gave Go CLI tools a consistent, machine-readable help format. CobraHelpParser exploits that consistency with an 8-state state machine, a single regex for flag lines, and a type mapping table that converts Go types to CLR types. The reparse workflow turns 97 cached versions into a regression test suite that validates every parser change against every Docker release from 17.x to 27.x.

The parsed CommandNode tree is the input to the next stage: the Roslyn source generator that turns it into C# code. That is Part VI: Build Time -- The Source Generator for CLI Commands.


