Operations

Unary Operations

The abstract class for unary operation modules.

class modules.operations.unary_operation.UnaryOperation(module, env, named_modules)

The abstract base class for all unary operations (operations that take a single data flow as input).

Parameters:module (dict) – The module must contain a source field with the name of the incoming data flow.
add_to_graph(graph)

A method for adding the module to a graphviz graph instance.

Parameters:graph (graphviz.dot.Digraph) – A graphviz Digraph object
rendered_result()

Returns a pair of strings containing the rendered lines of code and the definitions of any external classes or objects.

Return type:Tuple[str, str]

Map

The map operation module

class modules.operations.map.Map(module, env, named_modules)

Bases: modules.operations.unary_operation.UnaryOperation

A module that maps an arbitrary Scala function over the incoming data flow.

Warning: The Scala code is only checked at compilation, so an invalid function can make the final program fail.

Parameters:module (dict) –

The module dict must contain a function field holding the Scala function to map over the data flow (ex: "(tuple) => (tuple._1*2, tuple._2)").

The outType field must also be provided to ensure compatibility with downstream modules.
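As an illustration, a Map module dict might look like the sketch below. Only the source, function and outType field names come from this documentation. The flow name, the format of outType (assumed here to be a list of Scala type names, by analogy with get_out_type()) and the missing_fields helper are hypothetical.

```python
# Hypothetical Map module configuration. The keys are the documented
# field names; every value is made up for the example.
map_module = {
    "source": "parsed_lines",  # name of the incoming data flow (assumed)
    "function": "(tuple) => (tuple._1*2, tuple._2)",  # Scala code kept as a string
    "outType": ["Int", "String"],  # assumed format: list of Scala type names
}


def missing_fields(module, required=("source", "function", "outType")):
    """Return the required keys that are absent from a module dict."""
    return [key for key in required if key not in module]


print(missing_fields(map_module))
```

A check of this kind only catches missing fields. The Scala function itself stays an opaque string until the rendered program is compiled.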

check_integrity()

Checks the upstream modules and types where necessary to ensure the integrity of the DAG.

get_out_type()

Returns the output type of the module as a list of strings.

rendered_result()

Returns a pair of strings containing the rendered lines of code and the definitions of any external classes or objects.

Return type:Tuple[str, str]

Count Distinct

The distinct count operation module

class modules.operations.count_distinct.CountDistinct(module, env, named_modules)

Bases: modules.operations.unary_operation.UnaryOperation

A module that counts the distinct elements of a data flow and appends the count to the data flow as a separate column.

Parameters:module (dict) – The module dict must contain a fields field: the list of columns to group the flow by when counting distinct elements.
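The description above leaves some room for interpretation. The plain-Python sketch below shows one possible reading, in which each row receives the number of distinct rows that share its values in the grouping columns. The count_distinct function and the sample rows are hypothetical and do not reproduce the code the module renders.

```python
from collections import defaultdict


def count_distinct(rows, fields):
    """Append to each row the number of distinct rows in its group.

    Assumption: a group is the set of rows with equal values in the
    columns listed in `fields`.
    """
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in fields)
        groups[key].add(row)
    return [row + (len(groups[tuple(row[i] for i in fields)]),) for row in rows]


rows = [("a", 1), ("a", 2), ("a", 2), ("b", 5)]
result = count_distinct(rows, fields=[0])
print(result)
```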
check_integrity()

Checks the upstream modules and types where necessary to ensure the integrity of the DAG.

get_out_type()

Returns the output type of the module as a list of strings.

rendered_result()

Returns a pair of strings containing the rendered lines of code and the definitions of any external classes or objects.

Return type:Tuple[str, str]

Projection

The projection operation module

class modules.operations.projection.Projection(module, env, named_modules)

Bases: modules.operations.unary_operation.UnaryOperation

A module that projects the incoming dataflow on the fields specified in fields.

Parameters:module (dict) – The module dict must contain a fields field, an array of integers representing the columns to project on (ex: [0, 2]).
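A minimal sketch of what projecting on [0, 2] means for a single row and for its type list. The project helper, the flow name and the sample values are hypothetical; the type list format is assumed from get_out_type(), which returns a list of strings.

```python
# Hypothetical Projection module configuration.
projection_module = {"source": "users", "fields": [0, 2]}


def project(values, fields):
    """Keep only the positions listed in `fields`, in that order."""
    return tuple(values[i] for i in fields)


row = ("alice", 31, "paris")
in_type = ["String", "Int", "String"]  # assumed shape of an upstream type list

projected_row = project(row, projection_module["fields"])
projected_type = list(project(in_type, projection_module["fields"]))
print(projected_row, projected_type)
```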
check_integrity()

Checks the upstream modules and types where necessary to ensure the integrity of the DAG.

get_out_type()

Returns the output type of the module as a list of strings.

rendered_result()

Returns a pair of strings containing the rendered lines of code and the definitions of any external classes or objects.

Return type:Tuple[str, str]

Split

The split operation module

class modules.operations.split.Split(module, env, named_modules)

Bases: modules.operations.unary_operation.UnaryOperation

A module that splits a given string field of an incoming data flow according to a regex.

Parameters:module (dict) –

The module dict must contain a field that specifies the column index on which to perform the split operation (ex: 0). A delimiter field gives the separator, which is treated as a regex (ex: ",").

Other optional fields are:
  • reduce selects a single element of the array produced by the split (ex: null, -1, 2).
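The effect of the delimiter and reduce options can be sketched in plain Python. This is an approximation under stated assumptions: a null reduce is taken to mean "keep the whole array", a negative value is taken to count from the end, and the split result is assumed to replace the original column. The split_field function is hypothetical, and Python's re.split does not behave exactly like Scala's split in every corner case (trailing empty strings, for instance).

```python
import re


def split_field(row, index, delimiter, reduce=None):
    """Split the string at `index` on a regex and optionally keep one part."""
    parts = re.split(delimiter, row[index])
    # Assumption: None keeps the full list, an integer picks one element.
    value = parts if reduce is None else parts[reduce]
    return row[:index] + (value,) + row[index + 1:]


row = ("a,b,c", 7)
whole = split_field(row, 0, ",")
last = split_field(row, 0, ",", reduce=-1)
third = split_field(row, 0, ",", reduce=2)
print(whole, last, third)
```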
check_integrity()

Checks the upstream modules and types where necessary to ensure the integrity of the DAG.

get_out_type()

Returns the output type of the module as a list of strings.

rendered_result()

Returns a pair of strings containing the rendered lines of code and the definitions of any external classes or objects.

Return type:Tuple[str, str]

Binary Operations

The abstract class for binary operation modules.

class modules.operations.binary_operation.BinaryOperation(module, env, named_modules)

The abstract base module for all binary operations (Operations that take two data flows as input).

Parameters:module (dict) – The module dict must have the two fields source1 and source2 that contain the names of the two input flows.
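As a sketch, a binary module names its two inputs as follows. Only source1 and source2 come from this documentation. The flow names are made up, and the shape of named_modules (assumed here to map flow names to declared modules) as well as the unknown_sources helper are hypothetical.

```python
# Hypothetical configuration of a binary module.
binary_module = {"source1": "customers", "source2": "orders"}

# Assumption: named_modules maps flow names to already-declared modules.
named_modules = {"customers": object(), "orders": object()}


def unknown_sources(module, named_modules):
    """Return the input flow names that no declared module provides."""
    return [
        module[key]
        for key in ("source1", "source2")
        if module[key] not in named_modules
    ]


print(unknown_sources(binary_module, named_modules))
```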
add_to_graph(graph)

A method for adding the module to a graphviz graph instance.

Parameters:graph (graphviz.dot.Digraph) – A graphviz Digraph object
rendered_result()

Returns a pair of strings containing the rendered lines of code and the definitions of any external classes or objects.

Return type:Tuple[str, str]

Join

The join operation module

class modules.operations.join.Join(module, env, named_modules)

Bases: modules.operations.binary_operation.BinaryOperation

A module that joins two incoming data flows on field1 == field2.

Parameters:module (dict) –

The module dict must contain the field1 and field2 fields, which give the column indexes to join on.

Other optional fields are:
  • leftFields and rightFields are lists of integers specifying the indexes to keep in the join output (ex: [0, 2]). The default value is "all".
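The join condition and the two projection options can be sketched in plain Python. The sketch assumes an inner join, with field1 indexing the first flow and field2 the second. The join function and the sample rows are hypothetical and stand in for the code the module renders.

```python
def join(left, right, field1, field2, left_fields="all", right_fields="all"):
    """Inner-join two lists of tuples on left[field1] == right[field2]."""

    def keep(row, fields):
        # "all" keeps the row unchanged, a list selects columns by index.
        return row if fields == "all" else tuple(row[i] for i in fields)

    # Index the right-hand rows by their join key.
    by_key = {}
    for r in right:
        by_key.setdefault(r[field2], []).append(r)

    return [
        keep(l, left_fields) + keep(r, right_fields)
        for l in left
        for r in by_key.get(l[field1], [])
    ]


left = [(1, "alice"), (2, "bob")]
right = [("x", 1), ("y", 3)]

full = join(left, right, field1=0, field2=1)
trimmed = join(left, right, 0, 1, left_fields=[1], right_fields=[0])
print(full, trimmed)
```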
check_integrity()

Checks the upstream modules and types where necessary to ensure the integrity of the DAG.

get_out_type()

Returns the output type of the module as a list of strings.

rendered_result()

Returns a pair of strings containing the rendered lines of code and the definitions of any external classes or objects.

Return type:Tuple[str, str]

String Similarity

The string similarity operation module

class modules.operations.string_similarity.StringSimilarity(module, env, named_modules)

Bases: modules.operations.binary_operation.BinaryOperation

A module that computes similarity scores between two string inputs with one of the available soft string matching algorithms.

Warning: Some algorithms compute a distance, some others a similarity score. Besides, some are normalized and some aren’t. See the documentation for more details on each algorithm.

The default algorithm is the Levenshtein distance, but any one from the following list can be chosen:

  • Levenshtein
  • NormalizedLevenshtein
  • Damerau
  • OptimalStringAlignment
  • JaroWinkler
  • LongestCommonSubsequence
  • MetricLCS
  • Cosine

All implementations are from Thibault Debatty’s string similarity Java library. See the javadoc for a detailed description of all algorithms.
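To make the distinction between a raw distance and a normalized score concrete, here is a small pure-Python sketch of the Levenshtein distance and of one common normalization, dividing by the length of the longer string. It is not the Java library's code, and the library's exact conventions should be checked in its javadoc. Both function names are hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Distance scaled to [0, 1] by the longer string's length (assumed convention)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


distance = levenshtein("kitten", "sitting")
scaled = normalized_levenshtein("kitten", "sitting")
print(distance, scaled)
```

The raw distance grows with string length, while the scaled value stays between 0 and 1, so thresholds chosen for one are not meaningful for the other.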

Parameters:module (dict) –

The module dict must contain the fields leftField and rightField that are integers corresponding to the columns that will be compared (ex: 0).

The module should also include an algorithm field, a string naming the desired algorithm (ex: "Levenshtein").

Other optional fields are:
  • leftOutFields and rightOutFields are lists of integers giving the columns on which the result is projected (ex: [0, 2, 3]). Both default to "all".
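Put together, a StringSimilarity module dict might look like the sketch below. The field names and the algorithm list come from this documentation; the flow names, the values and the resolve_algorithm helper are hypothetical, and the fallback to Levenshtein mirrors the documented default rather than the module's actual code.

```python
ALGORITHMS = (
    "Levenshtein",
    "NormalizedLevenshtein",
    "Damerau",
    "OptimalStringAlignment",
    "JaroWinkler",
    "LongestCommonSubsequence",
    "MetricLCS",
    "Cosine",
)

# Hypothetical module configuration.
similarity_module = {
    "source1": "left_names",
    "source2": "right_names",
    "leftField": 0,
    "rightField": 0,
    "algorithm": "JaroWinkler",
    "leftOutFields": [0, 2, 3],
    "rightOutFields": "all",
}


def resolve_algorithm(module):
    """Return the requested algorithm name, defaulting to Levenshtein."""
    name = module.get("algorithm", "Levenshtein")
    if name not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {name}")
    return name


print(resolve_algorithm(similarity_module))
```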
check_integrity()

Checks the upstream modules and types where necessary to ensure the integrity of the DAG.

get_out_type()

Returns the output type of the module as a list of strings.

rendered_result()

Returns a pair of strings containing the rendered lines of code and the definitions of any external classes or objects.

Return type:Tuple[str, str]

Jaccard Measure on Sets

The word similarity operation module

class modules.operations.extractor_word_similarity.ExtractorWordSimilarity(module, env, named_modules)

Bases: modules.operations.binary_operation.BinaryOperation

A module that computes a Jaccard similarity measure on two input sets.

Parameters:module (dict) – The module dict must contain the leftField and rightField fields, the indexes of the sets to compare in the two input flows.
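For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below computes it in plain Python. The jaccard function is hypothetical, and returning 1.0 for two empty sets is a convention chosen for the example, not something this documentation specifies.

```python
def jaccard(left, right):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(left), set(right)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


score = jaccard({"a", "b", "c"}, {"b", "c", "d"})
print(score)
```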
check_integrity()

Checks the upstream modules and types where necessary to ensure the integrity of the DAG.

get_out_type()

Returns the output type of the module as a list of strings.

rendered_result()

Returns a pair of strings containing the rendered lines of code and the definitions of any external classes or objects.

Return type:Tuple[str, str]

Union

The union operation module

class modules.operations.union.Union(module, env, named_modules)

Bases: modules.operations.binary_operation.BinaryOperation

A module that performs the union of two incoming data flows.

Parameters:module (dict) – The module dict must contain the fields leftField and rightField that are integers corresponding to the columns that will be compared (ex: 0).
check_integrity()

Checks the upstream modules and types where necessary to ensure the integrity of the DAG.

get_out_type()

Returns the output type of the module as a list of strings.

rendered_result()

Returns a pair of strings containing the rendered lines of code and the definitions of any external classes or objects.

Return type:Tuple[str, str]