Data Modeling on

https://6a023ef24bddd400077b9264--thriving-cassata-78ae72.netlify.app/docs/data-modeling/

Overview
https://6a023ef24bddd400077b9264--thriving-cassata-78ae72.netlify.app/docs/0.1.0/data-modeling/overview/

Data modeling in DJ can be done in several ways:

- Using the UI
- Using the API
- Any supported clients
- Using YAML files

Data Modeling Stages

The typical flow for onboarding data models looks like this:

1. Create appropriate namespaces for organization.
2. Register tables as source nodes in DJ.
3. Create transform nodes if any additional, lightweight SQL transformations are necessary.
4. Create dimension nodes, depending on what dimensions are needed. These may be simple references to existing source nodes, if the dimensional modeling has already been done outside of DJ.

Column Types
https://6a023ef24bddd400077b9264--thriving-cassata-78ae72.netlify.app/docs/0.1.0/data-modeling/column-types/

DJ’s type system is based on Apache Spark SQL types. This ensures compatibility with Spark-based query engines and provides a familiar type system for users working with big data platforms.

Specifying Types

How you specify types depends on the node type:

Source nodes: You can manually specify column types as strings, or DJ can automatically infer types by connecting to the external table via table reflection.
    # Example: Specifying column types for a source node
    columns:
      - name: user_id
        type: bigint
      - name: username
        type: string
      - name: metadata
        type: map<string, string>
      - name: tags
        type: array<string>

Transform, dimension, and metric nodes: DJ automatically parses your SQL query and infers column types based on the expressions and upstream node types.

Namespaces
https://6a023ef24bddd400077b9264--thriving-cassata-78ae72.netlify.app/docs/0.1.0/data-modeling/namespaces/

👉 Namespaces can be thought of as folders.

All nodes in DataJunction exist within a namespace. Node names are dot-separated alphanumeric elements. The leading elements identify the namespace where the node exists. Nodes whose names contain no dots are automatically defined in the default namespace.

| Node Name | Namespace |
| --- | --- |
| roads.demo.repairs | roads.demo |
| finance.revenue | finance |
| hr.people.employees | hr.people |
| customer | default |

Since namespaces are inferred directly from the node name, creating a node in a particular namespace simply requires prefixing the node name with the namespace.

YAML Projects
https://6a023ef24bddd400077b9264--thriving-cassata-78ae72.netlify.app/docs/0.1.0/data-modeling/yaml-projects/

DJ entities can be managed through YAML definitions. This versatile feature enables change review and more holistic testing before deploying to production. Using source-controlled YAML definitions provides a more structured approach to development, allowing you to review and audit changes.

👉 You can anchor your project to a specific namespace in DJ, and use YAML files to define all nodes in that namespace.
While not required, this approach promotes a cleaner and more organized setup.

Git Integration
https://6a023ef24bddd400077b9264--thriving-cassata-78ae72.netlify.app/docs/0.1.0/data-modeling/git-integration/

DJ provides full Git integration for managing your node definitions. Link a namespace to a repository, then create branches, commit changes, open pull requests, and merge, all from the DJ UI.

Overview

Link any namespace to a Git repository to enable version-controlled workflows. Once linked, you choose how to work:

- Git as source of truth: Make the namespace read-only so all changes must come from Git commits (recommended for production).
- UI-driven development: Create branches, edit nodes in the UI, commit your changes, and open PRs to merge them back.

Recommended Workflow

This section walks through a typical setup where your production namespace is linked to the main branch and protected from direct edits.

Sources
https://6a023ef24bddd400077b9264--thriving-cassata-78ae72.netlify.app/docs/0.1.0/data-modeling/sources/

👉 Source nodes can be thought of as tables.

Source nodes represent external tables in a database or data warehouse and make up the foundational layer on which other nodes are built.
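As a rough illustration of what a source node carries, the sketch below collects the attributes listed in the source-node attribute table into a plain Python dataclass and serializes them to JSON. The class name `SourceNodeSpec`, the `namespace()` helper, and the catalog, schema, table, and column values are hypothetical and chosen only for this example; this is not DJ's client API, and the JSON shape DJ actually expects may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class SourceNodeSpec:
    """Hypothetical container mirroring the documented source-node attributes."""

    name: str        # unique node name, e.g. "roads.demo.repairs"
    catalog: str     # name of the external catalog
    schema_: str     # name of the external schema
    table: str       # name of the external table
    # Column name -> Spark SQL type string (the attribute table describes this as a map).
    columns: Dict[str, str] = field(default_factory=dict)
    description: str = ""
    display_name: str = ""
    mode: str = "published"  # "published" or "draft"

    def namespace(self) -> str:
        # The leading dot-separated elements of the name form the namespace;
        # a name without dots falls into the "default" namespace.
        head, sep, _ = self.name.rpartition(".")
        return head if sep else "default"


# Illustrative values only; none of these come from a real deployment.
spec = SourceNodeSpec(
    name="roads.demo.repairs",
    catalog="warehouse",
    schema_="roads",
    table="repairs",
    columns={"repair_id": "bigint", "notes": "string"},
    description="Repair orders registered as a source node",
)

payload = json.dumps(asdict(spec), indent=2)
print(spec.namespace())
print(payload)
```

Keeping the definition in a small typed structure like this makes it easy to validate names and column types locally before handing them to whichever interface (UI, API, client, or YAML) is used to register the node.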
| Attribute | Description | Type |
| --- | --- | --- |
| name | Unique name used by other nodes to select from this node | string |
| description | A human-readable description of the node | string |
| display_name | A human-readable name for the node | string |
| mode | published or draft (see Node Mode) | string |
| catalog | The name of the external catalog | string |
| schema_ | The name of the external schema | string |
| table | The name of the external table | string |
| columns | A map of the external table’s column names and types | map |

Creating Catalogs and Engines

Before creating source nodes, a DataJunction server must contain at least one catalog.

Transforms
https://6a023ef24bddd400077b9264--thriving-cassata-78ae72.netlify.app/docs/0.1.0/data-modeling/transforms/

👉 Transform nodes can be thought of as views.

Transform nodes allow you to perform arbitrary SQL operations on sources, dimensions, and even other transform nodes. With a perfect data model, you may not need to define any transform nodes. However, in some cases it may be convenient to use transform nodes to clean up your external data within DJ by joining, aggregating, casting types, or applying any other SQL operation that your query engine supports.

Dimensions
https://6a023ef24bddd400077b9264--thriving-cassata-78ae72.netlify.app/docs/0.1.0/data-modeling/dimensions/

👉 Dimension nodes can be thought of as special views, with the additional ability to configure joins.

Dimension nodes include a query that can select from any other node to create a representation of a dimension. They must always have a primary key configured, and can have any number of associated dimensional attributes. One key feature of dimension nodes is the ability to configure join links.
Any DJ node can be linked to a dimension node via two different types of dimension linking.

Dimension Links
https://6a023ef24bddd400077b9264--thriving-cassata-78ae72.netlify.app/docs/0.1.0/data-modeling/dimension-links/

Dimension links help build out DJ’s dimensional metadata graph, a key component of the DJ DAG. There are two types of dimension links: join links and alias/reference links.

Join Links

You can configure a join link between a dimension node and any source, transform, or dimension node. Configuring this join link makes all dimension attributes on the dimension node accessible from the original node.

Metrics
https://6a023ef24bddd400077b9264--thriving-cassata-78ae72.netlify.app/docs/0.1.0/data-modeling/metrics/

Metric nodes represent an aggregation of a measure defined as a single expression in a query that selects from a single source, transform, or dimension node.

| Attribute | Description | Type |
| --- | --- | --- |
| name | Unique name used by other nodes to select from this node | string |
| display_name | A human-readable name for the node | string |
| description | A human-readable description of the node | string |
| mode | published or draft (see Node Mode) | string |
| query | A SQL query that selects a single expression from a single node | string |

Creating Metric Nodes

    curl -X POST http://localhost:8000/nodes/metric/ \
      -H 'Content-Type: application/json' \
      -d '{
        "name": "default.