Operations¶
Unary Operations¶
The abstract class for unary operation modules.
-
class
modules.operations.unary_operation.UnaryOperation(module, env, named_modules)¶ The abstract base class for all unary operations.
Parameters: module (dict) – The module dict must contain a source field with the name of the incoming data flow.
-
add_to_graph(graph)¶ Adds the module to a graphviz graph instance.
Parameters: graph (graphviz.dot.Digraph) – A graphviz Digraph object.
-
rendered_result()¶ Returns a pair of strings containing the rendered lines of code and the definitions of external classes or objects.
Return type: Tuple[str,str]
-
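The contract described above (a source field, a graph hook, and a pair of rendered strings) can be pictured with a small stand-in class. This is only a sketch: SketchUnaryOperation, Identity, RecordingGraph, the name key and the rendered Scala-like line are all hypothetical and are not taken from the project's code. RecordingGraph merely mimics the node() and edge() methods that a graphviz Digraph exposes, so the example runs without graphviz installed.

```python
from abc import ABC, abstractmethod
from typing import Tuple


class SketchUnaryOperation(ABC):
    """Hypothetical stand-in for UnaryOperation; not the project's code."""

    def __init__(self, module: dict, env=None, named_modules=None):
        # The documentation requires a "source" field in the module dict.
        if "source" not in module:
            raise ValueError("module dict needs a 'source' field")
        self.module = module
        self.source = module["source"]
        # "name" is an assumed key, used here only to label the graph node.
        self.name = module.get("name", "unnamed")

    def add_to_graph(self, graph) -> None:
        # Any object offering node() and edge() works, e.g. a graphviz Digraph.
        graph.node(self.name)
        graph.edge(self.source, self.name)

    @abstractmethod
    def rendered_result(self) -> Tuple[str, str]:
        """Return (rendered code lines, external definitions)."""


class Identity(SketchUnaryOperation):
    def rendered_result(self) -> Tuple[str, str]:
        # Made-up Scala-like line, for illustration only.
        return f"val {self.name} = {self.source}", ""


class RecordingGraph:
    """Minimal duck-typed replacement for a Digraph that records calls."""

    def __init__(self):
        self.nodes, self.edges = [], []

    def node(self, name):
        self.nodes.append(name)

    def edge(self, tail, head):
        self.edges.append((tail, head))


op = Identity({"name": "copy", "source": "input"})
graph = RecordingGraph()
op.add_to_graph(graph)
code, definitions = op.rendered_result()
```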
Map¶
The map operation module
-
class
modules.operations.map.Map(module, env, named_modules)¶ Bases: modules.operations.unary_operation.UnaryOperation
A module that maps an arbitrary Scala function over the incoming data flow.
Warning: Arbitrary Scala code is only checked at compilation, so an invalid function can make the final program fail.
Parameters: module (dict) – The module dict must contain a function field holding the Scala function to map over the data flow (ex: "(tuple) => (tuple._1*2, tuple._2)"). The outType field must also be provided to ensure compatibility with downstream modules.
-
check_integrity()¶ Performs checks on the upstream modules and types, where necessary, to ensure the integrity of the DAG.
-
get_out_type()¶ Returns the output type of the module as a list of strings.
-
rendered_result()¶ Returns a pair of strings containing the rendered lines of code and the definitions of external classes or objects.
Return type: Tuple[str,str]
-
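As an illustration of the Map parameters, a module dict might look like the one below. The field names source, function and outType come from the descriptions above; the flow name "pairs" and the list-of-strings shape of outType are assumptions. The Python function is only an analogue of the Scala lambda, applied to an in-memory list to show what the mapping does.

```python
# Hypothetical module dict for a Map operation.
map_module = {
    "source": "pairs",                                # assumed flow name
    "function": "(tuple) => (tuple._1*2, tuple._2)",  # example from the docs
    "outType": ["Int", "String"],                     # assumed shape
}


def double_first(row):
    """Python analogue of the Scala function stored in map_module."""
    return (row[0] * 2, row[1])


flow = [(1, "a"), (2, "b")]
mapped = [double_first(row) for row in flow]
```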
Count Distinct¶
The distinct count operation module
-
class
modules.operations.count_distinct.CountDistinct(module, env, named_modules)¶ Bases: modules.operations.unary_operation.UnaryOperation
A module that counts the distinct elements of a data flow and appends the count to the data flow as a separate column.
Parameters: module (dict) – The module dict must contain a fields field, the list of columns to group the flow by in order to count the number of distinct elements.
-
check_integrity()¶ Performs checks on the upstream modules and types, where necessary, to ensure the integrity of the DAG.
-
get_out_type()¶ Returns the output type of the module as a list of strings.
-
rendered_result()¶ Returns a pair of strings containing the rendered lines of code and the definitions of external classes or objects.
Return type: Tuple[str,str]
-
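One possible reading of the CountDistinct description is sketched below in plain Python: rows are grouped by the columns listed in fields, the distinct rows of each group are counted, and the count is appended to every row as a new last column. The exact semantics of the generated code may differ; the function name count_distinct and the sample data are hypothetical.

```python
from collections import defaultdict


def count_distinct(rows, fields):
    """Append to each row the number of distinct rows sharing its group key."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in fields)
        groups[key].add(row)
    return [
        row + (len(groups[tuple(row[i] for i in fields)]),)
        for row in rows
    ]


rows = [("a", 1), ("a", 2), ("a", 1), ("b", 3)]
result = count_distinct(rows, fields=[0])
```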
Projection¶
The projection operation module
-
class
modules.operations.projection.Projection(module, env, named_modules)¶ Bases: modules.operations.unary_operation.UnaryOperation
A module that projects the incoming data flow on the columns specified in fields.
Parameters: module (dict) – The module dict must contain a fields field, an array of integers representing the columns to project on (ex: [0, 2]).
-
check_integrity()¶ Performs checks on the upstream modules and types, where necessary, to ensure the integrity of the DAG.
-
get_out_type()¶ Returns the output type of the module as a list of strings.
-
rendered_result()¶ Returns a pair of strings containing the rendered lines of code and the definitions of external classes or objects.
Return type: Tuple[str,str]
-
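The effect of a projection on fields [0, 2] can be shown on an in-memory list of tuples. This is a plain Python illustration with hypothetical names and data, not the Scala code the module renders.

```python
def project(rows, fields):
    """Keep only the columns whose indexes are listed in fields, in that order."""
    return [tuple(row[i] for i in fields) for row in rows]


rows = [("alice", 30, "Paris"), ("bob", 25, "Lyon")]
projected = project(rows, [0, 2])
```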
Split¶
The split operation module
-
class
modules.operations.split.Split(module, env, named_modules)¶ Bases: modules.operations.unary_operation.UnaryOperation
A module that splits a given string field of the incoming data flow according to a regex.
Parameters: module (dict) – The module dict must contain a field entry that specifies the column index on which to perform the split operation (ex: 0). A delimiter field indicates the separator (ex: ",").
Other optional fields are:
- The optional reduce field selects a single element of the array produced by the split (ex: null, -1, 2).
-
check_integrity()¶ Performs checks on the upstream modules and types, where necessary, to ensure the integrity of the DAG.
-
get_out_type()¶ Returns the output type of the module as a list of strings.
-
rendered_result()¶ Returns a pair of strings containing the rendered lines of code and the definitions of external classes or objects.
Return type: Tuple[str,str]
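The interplay of field, delimiter and reduce can be sketched in plain Python. The sketch assumes that a null reduce keeps the whole array, that a non-negative reduce is an index into it, and that -1 selects the last element; these interpretations are assumptions drawn from the examples above, and split_field is a hypothetical helper.

```python
import re


def split_field(rows, field, delimiter, reduce=None):
    """Split column `field` of each row on the regex `delimiter`.

    With reduce=None the column is replaced by the full list of parts;
    otherwise only the part at index `reduce` is kept.
    """
    out = []
    for row in rows:
        parts = re.split(delimiter, row[field])
        value = parts if reduce is None else parts[reduce]
        out.append(row[:field] + (value,) + row[field + 1:])
    return out


rows = [("a,b,c", 1)]
all_parts = split_field(rows, field=0, delimiter=",")
last_part = split_field(rows, field=0, delimiter=",", reduce=-1)
third_part = split_field(rows, field=0, delimiter=",", reduce=2)
```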
Binary Operations¶
The abstract class for binary operation modules.
-
class
modules.operations.binary_operation.BinaryOperation(module, env, named_modules)¶ The abstract base module for all binary operations (operations that take two data flows as input).
Parameters: module (dict) – The module dict must have the two fields source1 and source2, which contain the names of the two input flows.
-
add_to_graph(graph)¶ Adds the module to a graphviz graph instance.
Parameters: graph (graphviz.dot.Digraph) – A graphviz Digraph object.
-
rendered_result()¶ Returns a pair of strings containing the rendered lines of code and the definitions of external classes or objects.
Return type: Tuple[str,str]
-
Join¶
The join operation module
-
class
modules.operations.join.Join(module, env, named_modules)¶ Bases: modules.operations.binary_operation.BinaryOperation
A module that joins two incoming data flows on field1 == field2.
Parameters: module (dict) – The module dict must contain the field1 and field2 fields, which give the column indexes to join on.
Other optional fields are:
- leftFields and rightFields are lists of integers specifying the indexes to keep in the join’s output (ex: [0, 2]). The default value is "all".
-
check_integrity()¶ Performs checks on the upstream modules and types, where necessary, to ensure the integrity of the DAG.
-
get_out_type()¶ Returns the output type of the module as a list of strings.
-
rendered_result()¶ Returns a pair of strings containing the rendered lines of code and the definitions of external classes or objects.
Return type: Tuple[str,str]
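A join on field1 == field2 with the optional leftFields and rightFields can be pictured as follows. The sketch assumes an inner join and uses a nested loop for clarity; the function name, argument names and data are hypothetical, and the join type of the generated code is not stated in this section.

```python
def join(left, right, field1, field2, left_fields="all", right_fields="all"):
    """Inner-join two lists of tuples where left[field1] == right[field2]."""

    def keep(row, fields):
        # "all" keeps the row untouched; otherwise keep the listed indexes.
        return row if fields == "all" else tuple(row[i] for i in fields)

    return [
        keep(left_row, left_fields) + keep(right_row, right_fields)
        for left_row in left
        for right_row in right
        if left_row[field1] == right_row[field2]
    ]


left = [(1, "alice"), (2, "bob")]
right = [("x", 1), ("y", 3)]
joined = join(left, right, field1=0, field2=1)
trimmed = join(left, right, 0, 1, left_fields=[1], right_fields=[0])
```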
String Similarity¶
The string similarity operation module
-
class
modules.operations.string_similarity.StringSimilarity(module, env, named_modules)¶ Bases: modules.operations.binary_operation.BinaryOperation
A module that computes similarity scores between two string inputs with one of the available soft string matching algorithms.
Warning: Some algorithms compute a distance, others a similarity score. Moreover, some are normalized and some aren’t. See the documentation for more details on each algorithm.
The default algorithm is the Levenshtein distance, but any one from the following list can be chosen:
- Levenshtein
- NormalizedLevenshtein
- Damerau
- OptimalStringAlignment
- JaroWinkler
- LongestCommonSubsequence
- MetricLCS
- Cosine
All implementations are from Thibault Debatty’s string similarity Java library. See the javadoc for a detailed description of all algorithms.
Parameters: module (dict) – The module dict must contain the fields leftField and rightField, integers corresponding to the columns that will be compared (ex: 0). An algorithm field should also be included in module, with a string containing the name of the desired algorithm (ex: "Levenshtein").
Other optional fields are:
- leftOutFields and rightOutFields are set to "all" by default but can be lists of integers representing the columns on which the result should be projected (ex: [0, 2, 3]).
-
check_integrity()¶ Performs checks on the upstream modules and types, where necessary, to ensure the integrity of the DAG.
-
get_out_type()¶ Returns the output type of the module as a list of strings.
-
rendered_result()¶ Returns a pair of strings containing the rendered lines of code and the definitions of external classes or objects.
Return type: Tuple[str,str]
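To make the warning about distances versus normalized scores concrete, here is a self-contained Python version of the Levenshtein distance together with a normalized variant. The module itself relies on the Java library; this sketch is an independent illustration, and the normalization used (dividing by the length of the longer string) is an assumption that should be checked against the library's javadoc.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Distance scaled to [0, 1] by the longer string's length (assumed)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


distance = levenshtein("kitten", "sitting")
normalized = normalized_levenshtein("kitten", "sitting")
```

The raw distance grows with string length, while the normalized value stays between 0 and 1, which is why thresholds chosen for one algorithm rarely transfer to another.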
Mention Extraction¶
The mention link extractor operation module
-
class
modules.operations.extractor_link.ExtractorLink(module, env, named_modules)¶ Bases: modules.operations.binary_operation.BinaryOperation
A module that extracts the occurrences of a given field of one data flow in a field of another data flow. The source extraction will always be the right flow and the target will be the left flow.
Parameters: module (dict) – The module dict must contain sourceExtract and targetExtract with the indexes of the source and target columns. sourceExtract corresponds to the source1 data flow, which holds the text to extract from. targetExtract corresponds to source2 and contains the patterns that will be searched for in sourceExtract.
-
check_integrity()¶ Performs checks on the upstream modules and types, where necessary, to ensure the integrity of the DAG.
-
get_out_type()¶ Returns the output type of the module as a list of strings.
-
rendered_result()¶ Returns a pair of strings containing the rendered lines of code and the definitions of external classes or objects.
Return type: Tuple[str,str]
-
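The roles of sourceExtract and targetExtract can be illustrated with a simple substring search. This sketch assumes that a match means the pattern occurs verbatim in the text and that the output is the text row concatenated with the matching pattern row; both points, along with the function name and data, are assumptions rather than documented behavior.

```python
def extract_mentions(texts, patterns, source_extract, target_extract):
    """Pair each text row with every pattern row whose pattern occurs in it."""
    return [
        text_row + pattern_row
        for text_row in texts
        for pattern_row in patterns
        if pattern_row[target_extract] in text_row[source_extract]
    ]


texts = [(1, "Ada Lovelace met Charles Babbage"), (2, "No names here")]
patterns = [("Ada Lovelace",), ("Alan Turing",)]
links = extract_mentions(texts, patterns, source_extract=1, target_extract=0)
```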
Jaccard Measure on Sets¶
The word similarity operation module
-
class
modules.operations.extractor_word_similarity.ExtractorWordSimilarity(module, env, named_modules)¶ Bases: modules.operations.binary_operation.BinaryOperation
A module that computes a Jaccard similarity measure on two input sets.
Parameters: module (dict) – The module dict must contain leftField and rightField, which correspond to the indexes of the sets to be compared in the input flows.
-
check_integrity()¶ Performs checks on the upstream modules and types, where necessary, to ensure the integrity of the DAG.
-
get_out_type()¶ Returns the output type of the module as a list of strings.
-
rendered_result()¶ Returns a pair of strings containing the rendered lines of code and the definitions of external classes or objects.
Return type: Tuple[str,str]
-
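For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The Python sketch below computes it on word sets; returning 1.0 for two empty sets is a convention chosen here, not something this section specifies.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity |A & B| / |A | B| of two collections seen as sets."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical (assumed convention).
    return len(a & b) / len(union) if union else 1.0


score = jaccard({"data", "flow", "join"}, {"data", "flow", "union"})
```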
Union¶
The union operation module
-
class
modules.operations.union.Union(module, env, named_modules)¶ Bases: modules.operations.binary_operation.BinaryOperation
A module that performs the union of two incoming data flows.
Parameters: module (dict) – The module dict must contain the fields leftField and rightField, integers corresponding to the columns that will be compared (ex: 0).
-
check_integrity()¶ Performs checks on the upstream modules and types, where necessary, to ensure the integrity of the DAG.
-
get_out_type()¶ Returns the output type of the module as a list of strings.
-
rendered_result()¶ Returns a pair of strings containing the rendered lines of code and the definitions of external classes or objects.
Return type: Tuple[str,str]
-
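A union of two flows with the same row shape can be pictured on in-memory lists. Whether the generated code keeps duplicates is not stated here, so both variants are shown; how leftField and rightField enter the operation is also not specified in this section, so the sketch leaves them out. All names and data are hypothetical.

```python
left = [(1, "a"), (2, "b")]
right = [(2, "b"), (3, "c")]

# Union that keeps duplicates: simple concatenation of the two flows.
union_all = left + right

# Union without duplicates, preserving first-seen order.
union_distinct = list(dict.fromkeys(union_all))
```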