This blog addresses how to operationalize meta-data in data management using Talend. To explain further: files and tables have schema definitions, and those schemas hold information such as name, data type, and length. This information can be imported or keyed in with a visual editor like Talend Studio at design time; however, this adds a lot of extra work and is prone to errors. If you extract this information from a known and governed source and use it at run time, you can automate the creation of tables or file structures. Once tested and verified, the tables can be manually imported if stored meta-data schemas are desired.
What is a schema?
A schema is the definition of the data formats, fields, types and lengths.
This is an example of what a persisted meta-data schema looks like in Talend. This is created at design time in Talend.
Data Dictionary File Example
This is what a data dictionary file could look like. It can be used instead of defining a static schema in Talend Meta-Data as displayed above. The data dictionaries used in this process will have a static meta-data schema from a data modeling tool like Erwin or from a database schema. Your dictionaries should be transformed to a common format, which allows for more code reuse.
Schema for a Data Dictionary
This is a basic layout for a data dictionary using Column Name, Type and Length. An exhaustive list of Talend dictionary items is listed below.
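To make the three-column layout concrete, here is a minimal Java sketch that reads dictionary rows into simple objects. It assumes a comma-separated file with one column definition per line; the class names DictionaryEntry and DictionaryParser and the sample column names are hypothetical and are not part of Talend.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** One row of a data dictionary: column name, type and length. */
final class DictionaryEntry {
    final String name;
    final String type;
    final int length;

    DictionaryEntry(String name, String type, int length) {
        this.name = name;
        this.type = type;
        this.length = length;
    }
}

final class DictionaryParser {
    /** Parses lines such as "CUSTOMER_ID,Integer,10" (assumed comma-separated layout). */
    static List<DictionaryEntry> parse(List<String> lines) {
        List<DictionaryEntry> entries = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split(",", -1);
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected name,type,length but got: " + line);
            }
            entries.add(new DictionaryEntry(
                    parts[0].trim(), parts[1].trim(), Integer.parseInt(parts[2].trim())));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Sample rows are invented for illustration only
        List<DictionaryEntry> entries = parse(Arrays.asList(
                "CUSTOMER_ID,Integer,10",
                "CUSTOMER_NAME,String,50"));
        for (DictionaryEntry e : entries) {
            System.out.println(e.name + " " + e.type + " " + e.length);
        }
    }
}
```

Keeping every dictionary in one common layout like this is what lets the same parsing code be reused across sources.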
What is the conceptual process?
The diagram below is the logical process for handling dynamic schemas with a data dictionary at run time. Talend has components for handling dynamic schemas for positional non-Big-Data files, but this pattern can be applied to all other types of data sources and targets as well.
- Load the data dictionary
- Define input and output as a single data item
  - String for Big Data
  - Dynamic for files, tables and internal schemas
- Use Java APIs to operationalize the use of data dictionaries
  - For non-Big-Data, use the Talend API, which will be demonstrated in subsequent examples
  - For Big Data, use Java string utilities
This process can be used to:
- Virtualize schemas for files
  - Fixed (NOTE: There are Talend components to do this as well)
- Virtualize schemas for Tables
- Virtualize schemas for Big Data elements like HDFS or Hive Schemas
- Virtualize Internal Schemas used for Data services or Queues
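As an illustration of virtualizing a fixed (positional) file schema, the sketch below cuts one record into named fields using only the names and lengths from a data dictionary. It is plain Java rather than a Talend component, and the class name FixedWidthSplitter and the sample columns are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FixedWidthSplitter {
    /** Cuts one positional record into named fields using the dictionary lengths. */
    static Map<String, String> split(String record, List<String> names, List<Integer> lengths) {
        if (names.size() != lengths.size()) {
            throw new IllegalArgumentException("names and lengths must have the same size");
        }
        Map<String, String> fields = new LinkedHashMap<>(); // keeps dictionary order
        int offset = 0;
        for (int i = 0; i < names.size(); i++) {
            int end = Math.min(offset + lengths.get(i), record.length());
            // A record shorter than the dictionary expects yields empty trailing fields
            String raw = offset < record.length() ? record.substring(offset, end) : "";
            fields.put(names.get(i), raw.trim());
            offset += lengths.get(i);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample record: a 5-character ID followed by a 10-character name
        Map<String, String> fields = split(
                "42   Alice     ",
                Arrays.asList("ID", "NAME"),
                Arrays.asList(5, 10));
        System.out.println(fields);
    }
}
```

Because the field boundaries come from the dictionary, a change in the file layout means updating the dictionary rather than the job.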
Load the Dictionary
Below is the code to load the data dictionary from a file or table. The values read from the dictionary are moved into memory in the form of an ArrayList. This ArrayList can then be used throughout the data management process to operationalize the processing of data.
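// Requires java.util.List and java.util.ArrayList in the component's imports
// Read the counter of dictionary rows processed so far (one row per column)
Integer storedCnt = (Integer) globalMap.get("rowCount");
int rowCnt = (storedCnt == null) ? 0 : storedCnt;
// Define three lists to hold the data dictionary columns
List<String> nameList = new ArrayList<String>();
List<String> typeList = new ArrayList<String>();
List<Integer> lengthList = new ArrayList<Integer>();
// After the first row, reload the lists already stored in the global variables
if (rowCnt > 0) {
    nameList = (List<String>) globalMap.get("nameList");
    typeList = (List<String>) globalMap.get("typeList");
    lengthList = (List<Integer>) globalMap.get("lengthList");
}
// Move data dictionary file or table values to the list elements
// (assumption: the incoming row exposes name, type and length columns)
nameList.add(input_row.name);
typeList.add(input_row.type);
lengthList.add(input_row.length);
// Put the lists back into the global variables with the new values added
globalMap.put("nameList", nameList);
globalMap.put("typeList", typeList);
globalMap.put("lengthList", lengthList);
globalMap.put("rowCount", rowCnt + 1);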
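Once the dictionary is in memory, it can drive the automated creation of tables mentioned at the start of this post. The following is a minimal sketch, not Talend's own mechanism: the DdlBuilder class, the table name and the type mapping are all assumptions chosen for illustration, and the column names are taken as-is on the assumption that they come from a governed dictionary.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class DdlBuilder {
    /** Maps a dictionary type to a SQL type; this mapping is an illustrative assumption. */
    static String sqlType(String dictionaryType, int length) {
        switch (dictionaryType.toLowerCase(Locale.ROOT)) {
            case "string":
                return "VARCHAR(" + length + ")";
            case "integer":
                return "INTEGER";
            case "date":
                return "DATE";
            default:
                throw new IllegalArgumentException("Unmapped dictionary type: " + dictionaryType);
        }
    }

    /** Builds a CREATE TABLE statement from the three dictionary lists. */
    static String createTable(String table, List<String> names,
                              List<String> types, List<Integer> lengths) {
        StringBuilder sql = new StringBuilder("CREATE TABLE ").append(table).append(" (");
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) {
                sql.append(", ");
            }
            sql.append(names.get(i)).append(' ')
               .append(sqlType(types.get(i), lengths.get(i)));
        }
        return sql.append(")").toString();
    }

    public static void main(String[] args) {
        // Prints: CREATE TABLE customer (ID INTEGER, NAME VARCHAR(50))
        System.out.println(createTable("customer",
                Arrays.asList("ID", "NAME"),
                Arrays.asList("Integer", "String"),
                Arrays.asList(10, 50)));
    }
}
```

In a Talend job the three lists would come from the global variables populated above, and the generated statement could be handed to a database component.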
Define Input and Output
Dynamic Schema for non-Big-Data Components
This is an example of a dynamic schema definition for a delimited file. One field of the type Dynamic is used for the entire record and the schema will be determined at runtime.
Dynamic Schema definition for Big-Data Components
This is an example of a dynamic schema for Big Data. Notice that a data type of String is used instead of Dynamic. Talend doesn’t support dynamic types for Big Data components.
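When the whole record arrives as one String, the dictionary can be applied with ordinary Java string utilities. The sketch below assumes a delimited record and converts each value according to its dictionary type; the DelimitedRecordParser class, the pipe delimiter and the handling of only Integer and String types are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class DelimitedRecordParser {
    /** Splits a single String record and types each value using the dictionary. */
    static Map<String, Object> parse(String record, String delimiter,
                                     List<String> names, List<String> types) {
        // Pattern.quote treats the delimiter literally; -1 keeps trailing empty fields
        String[] values = record.split(Pattern.quote(delimiter), -1);
        if (values.length != names.size()) {
            throw new IllegalArgumentException(
                    "Expected " + names.size() + " fields but found " + values.length);
        }
        Map<String, Object> row = new LinkedHashMap<>(); // keeps dictionary order
        for (int i = 0; i < values.length; i++) {
            row.put(names.get(i), convert(values[i], types.get(i)));
        }
        return row;
    }

    /** Converts a raw value; only Integer is converted here, everything else stays a String. */
    static Object convert(String value, String type) {
        if ("Integer".equalsIgnoreCase(type)) {
            String trimmed = value.trim();
            return trimmed.isEmpty() ? null : Integer.valueOf(trimmed);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, Object> row = parse("42|Alice|",
                "|",
                Arrays.asList("ID", "NAME", "CITY"),
                Arrays.asList("Integer", "String", "String"));
        System.out.println(row);
    }
}
```

The field-count check matters here: with a single String column there is no schema to reject a malformed record, so the dictionary has to do that job.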
In the following weeks I will go into specific usages of dynamic schemas and the implementation of Talend jobs, components and Java APIs.
- Dynamic Schemas for traditional files
- Dynamic Schemas for Big Data Files
- Dynamic Schemas for NoSQL Tables
- Operationalizing with the Meta Data Bridge (Available in a future release of Talend)
The usage of dynamic schemas can save on maintenance and development in the data management layer. Dynamic schemas can also be used to create tables or files that can be imported as persisted meta-data schemas. This article is intended to propose some complementary technologies around meta-data management. How you use them will depend on the priority of your architectural objectives. These objectives can be somewhat opposing, such as code maintainability vs. strictly governed meta-data.
To find the meta-data approach that works for you, ask questions such as:
- Does your Talend use case favor code reuse and the use of governed third-party data dictionaries?
  - Use Dynamic Schemas
- Does your Talend use case favor persisted meta-data? Is this meta-data the vehicle for schema governance?
  - Use Persisted Schemas
- Do you want to use dynamic schemas to create meta-data for testing and POCs which will eventually end up as persisted meta-data?
  - Use a Hybrid Approach by dynamically creating files or tables that can be imported as persisted schemas.