Internals
Internal architecture and design details for datacur8.
Table of contents
- Package Structure
- Validation Phases
- Selectors
- Strict Mode
- CSV Parsing
- Export Ordering
- Memory Model
- Performance Notes
Package Structure
datacur8 is implemented in Go with all application logic in the internal/ directory to enforce encapsulation. The main.go at the repository root handles CLI argument parsing and delegates to internal packages.
main.go # CLI entry point, flag parsing
internal/
cli/ # Command orchestration (validate, export, tidy)
config/ # Config model, loading, defaults, validation
constraints/ # Constraint evaluation engine
discovery/ # File discovery and type matching
export/ # Output file generation
schema/ # JSON Schema validation with strict mode
selector/ # JSONPath-like selector parser and evaluator
tidy/ # File formatting and normalization
Package dependencies
main → cli → config, constraints, discovery, export, schema, tidy
constraints → config, selector
discovery → config
export → config
schema → (external: google/jsonschema-go)
selector → (standalone)
tidy → (standalone)
Validation Phases
Validation runs in a strict sequence of phases. Each phase must succeed before the next phase runs. This ensures that errors are reported at the earliest meaningful point.
Phase 1: Config Validation
Package: config
- Load and parse the
.datacur8YAML file - Apply default values (strict_mode, constraint scope, csv delimiter)
- Validate the config structurally and semantically:
- Version format and compatibility
- Valid enum values for strict_mode, input, output.format
- Unique type names
- Unique output paths across types
- Regex patterns compile successfully
- Schemas are present with
type: object - CSV config is present when input is csv
- Constraint selectors are valid
- Foreign key references point to defined types
- Path capture groups exist in all include patterns
Config validation returns both warnings and errors. Warnings (e.g., version check skipped for dev builds) do not prevent further processing.
Phase 2: File Discovery
Package: discovery
- Walk the repository directory tree
- Skip ignored directories (
.git,node_modules,__pycache__, etc.) and output paths - For each file, test against all type include/exclude patterns
- Extract named capture groups and built-in path values
- Validate that each file matches exactly one type
Discovery pre-compiles all regex patterns for efficiency. The result is a sorted list of DiscoveredFile records, each carrying a pointer to its TypeDef and a map of path captures.
Phase 3: Schema Validation
Package: schema, cli
- Read and parse each discovered file according to its input format
- For JSON and YAML: parse into a single
map[string]any - For CSV: validate headers, convert each row into a typed
map[string]any - Apply strict mode overlay to the schema (if configured)
- Validate each item against its JSON Schema using
google/jsonschema-go
CSV parsing is notable: it uses the schema to guide type conversion of cell values (string → boolean, number, integer), and validates headers against schema properties and required fields.
Phase 4: Constraint Evaluation
Package: constraints
- Build in-memory indexes for all items grouped by type
- Evaluate each type’s constraints:
- unique: Build a set of seen values; report duplicates
- foreign_key: Build a lookup index of referenced type’s key values; check each owning item
- path_equals_attr: Compare path capture value against item attribute value
- Collect all errors with stable ordering (by type, then file path, then row index)
Selectors
The selector package implements a constrained subset of JSONPath for predictable behavior.
Supported syntax
| Syntax | Example | Meaning |
|---|---|---|
| Root | $ | The entire object |
| Field access | $.field | A top-level field |
| Nested access | $.a.b.c | Nested field traversal |
| Array projection | $.items[*].id | All id values from array items |
Evaluation behavior
- Selectors are parsed into a sequence of segments (field names and wildcards)
- Evaluation traverses the data structure following each segment
- Missing fields return an empty result (not an error)
- The
[*]wildcard expands across all elements of an array - A selector is “scalar” if it contains no
[*]wildcards
Multi-value handling
When a selector yields multiple values for a single item:
- unique with
scope: type: each value contributes to the global uniqueness set - unique with
scope: item: all values within one item must be unique - foreign_key: invalid — requires a single scalar value
- path_equals_attr: invalid — requires a single scalar value
Strict Mode
Strict mode is implemented as a schema overlay applied at validation time, not by modifying the config on disk.
DISABLED
Schemas are passed to the validator exactly as defined.
ENABLED
Before validation, the schema is traversed recursively. For every object schema that does not explicitly set additionalProperties, the overlay inserts additionalProperties: false. Explicit settings (true or false) are preserved.
FORCE
Same as ENABLED, but additionalProperties: true is overridden to false. Only additionalProperties set to a schema object (not boolean) is preserved.
The overlay function walks the schema tree, visiting properties, items, allOf, anyOf, oneOf, if, then, else, and additionalProperties (when it is a schema object).
CSV Parsing
CSV files are handled specially because they don’t have native types — every cell is a string.
Parsing flow
- Read the entire CSV file with the configured delimiter
- Validate headers: every column name must exist in
schema.properties; everyschema.requiredfield must be present as a column - Convert each cell value based on the schema property type:
string: used as-isboolean:"true"→true,"false"→false(case-insensitive)number: parsed as float64integer: parsed as integer, then stored as float64 for JSON compatibility
- Validate each row object against the JSON Schema
If any header validation fails, no rows are processed. If any cell cannot be converted, the entire file is rejected with per-row error messages.
Export Ordering
Export produces deterministic output through strict ordering rules:
- Types are processed in the order they appear in the config
- Items within a type are ordered by file path (lexicographic)
- For CSV files, rows maintain their within-file order
Output formats
- JSON: Items are wrapped in an object keyed by the type name, with the value being an array. Pretty-printed with 2-space indentation.
- YAML: Same structure as JSON but serialized as YAML.
- JSONL: One minified JSON object per line.
Output directories are created automatically if they don’t exist.
Memory Model
datacur8 uses an in-memory model for all processing:
- All discovered files are loaded into memory
- All parsed items are held in memory simultaneously
- Constraint indexes (uniqueness sets, foreign key lookup maps) are built in memory
This approach is simple and fast for the expected use case (hundreds to low thousands of files). The architecture allows for future optimizations (streaming, spill-to-disk) without changing the configuration model.
Performance Notes
- Regex patterns are pre-compiled once during discovery setup
- File discovery skips common non-data directories early
- Schema validation uses a compiled schema evaluator
- Constraint evaluation builds indexes in a single pass, then validates in a second pass
- Export and tidy operate on already-parsed data, avoiding re-reads