Extractors

The package containing all the extractor operation modules

File Importer

The file importer operation module base class

class modules.extractors.file_importer.FileImporter(module, env, named_modules)

File Importer is an abstract class that is used for building modules that read files on disk.

Because it is abstract, it cannot be used as is; use a concrete subclass such as CsvImporter or JsonImporter instead.

Parameters: module (dict) – The module dict must have a path field containing the path to the file that the module reads (Ex: ~/project/file.csv).
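As a minimal sketch, a module dict for a file-based importer could look like the one below. Only the path field comes from the description above. The helper resolve_path is hypothetical and only shows how a path starting with ~ expands to a home directory; whether the library performs this expansion itself is not stated here.

```python
import os

# Module dict for a file-based importer; only "path" is documented above.
file_module = {"path": "~/project/file.csv"}


def resolve_path(module: dict) -> str:
    """Hypothetical helper: return the module's path with "~" expanded.

    Raises KeyError if the required "path" field is missing.
    """
    return os.path.expanduser(module["path"])


resolved = resolve_path(file_module)
print(resolved)
```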
add_to_graph(graph)

Adds the module to a graphviz graph instance.

Parameters: graph (graphviz.dot.Digraph) – A graphviz Digraph object.

check_integrity()

Performs checks on the upstream modules and types, when necessary, to ensure the integrity of the DAG.

rendered_result()

Returns a pair of strings: the rendered lines of code and the definitions of any external classes or objects.

Return type: Tuple[str, str]

CSV Importer

The CSV loader operation module

class modules.extractors.csv_importer.CsvImporter(module, env, named_modules)

Bases: modules.extractors.file_importer.FileImporter

Main CSV loader operation module class.

Parameters: module (dict) –

The module dict must have a dataType field that contains the input types as a list of strings (Ex: ["String", "Int", "Double"]).

Optional fields are:
  • fieldDelimiter (CSV delimiter if other than a comma, Ex: "|")
  • quoteCharacter (character enclosing quoted fields, inside which delimiters don’t separate values, Ex: "\"")
  • namedFields (for selecting only some of the columns by their name, Ex: ["name", "age"])
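The fields above can be sketched as a module dict, with the effect of fieldDelimiter and quoteCharacter illustrated using Python's standard csv module. This is only an illustration of what the two options mean. The importer's own parsing may differ, the sample data is made up, and the path field is assumed to be inherited from FileImporter.

```python
import csv
import io

# Module dict using the documented field names; values are examples.
csv_module = {
    "path": "~/project/file.csv",  # assumed to be required, as for FileImporter
    "dataType": ["String", "Int", "Double"],
    "fieldDelimiter": "|",
    "quoteCharacter": "\"",
}

# Made-up sample: the first field of the data row contains the delimiter,
# so it is wrapped in the quote character.
sample = 'name|age|score\n"Doe|Jane"|42|3.5\n'

reader = csv.reader(
    io.StringIO(sample),
    delimiter=csv_module["fieldDelimiter"],
    quotechar=csv_module["quoteCharacter"],
)
rows = list(reader)
print(rows)
```

The quoted value stays in one field even though it contains the delimiter, which is the behaviour quoteCharacter is meant to enable.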
get_out_type()

Returns the output type of the module as a list of strings.

rendered_result()

Returns a pair of strings: the rendered lines of code and the definitions of any external classes or objects.

Return type: Tuple[str, str]

JSON Importer

Warning: The JSON importer has limited extraction capabilities; the Mongo importer should be used instead when possible.

The JSON loader operation module

class modules.extractors.json_importer.JsonImporter(module, env, named_modules)

Bases: modules.extractors.file_importer.FileImporter

Main JSON loader operation module class.

get_out_type()

Returns the output type of the module as a list of strings.

rendered_result()

Returns a pair of strings: the rendered lines of code and the definitions of any external classes or objects.

Return type: Tuple[str, str]

DB Importer

The Database loader operation module

class modules.extractors.db_importer.DbImporter(module, env, named_modules)

Bases: modules.base_module.BaseModule

Main database loader operation module class.

Parameters: module (dict) –

The module dict must have:

  • A dbUrl field with the database entrypoint for JDBC (e.g., for a Postgres database named test running on localhost: "jdbc:postgresql://localhost/test").
  • A dataType field with the input data types (Ex: ["String", "Int", "Double"]).
  • The names of the desired columns in fieldNames (Ex: ["age", "date", "name"]).
  • The query to be interpreted by the database.

Optional fields are:
  • filterNull, a boolean value for filtering null values out of the database output.
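A module dict matching the list above might look like the following sketch. The description does not name the key that holds the query, so "query" is an assumption here, as are the table name and the idea that dataType and fieldNames line up one-to-one by position. The function db_module_problems is a hypothetical pre-flight check, not part of the library.

```python
from typing import List

# Required fields from the description above. The key holding the SQL text
# is not named there, so "query" is an assumption.
REQUIRED_DB_FIELDS = ("dbUrl", "dataType", "fieldNames", "query")

db_module = {
    "dbUrl": "jdbc:postgresql://localhost/test",
    "dataType": ["Int", "String", "String"],
    "fieldNames": ["age", "date", "name"],
    "query": "SELECT age, date, name FROM people",  # hypothetical table
    "filterNull": True,  # optional
}


def db_module_problems(module: dict) -> List[str]:
    """Hypothetical check: return a list of problems found in the dict."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_DB_FIELDS
        if name not in module
    ]
    types = module.get("dataType", [])
    names = module.get("fieldNames", [])
    # Assumption: types and column names are matched one-to-one by position.
    if len(types) != len(names):
        problems.append(f"{len(types)} types for {len(names)} columns")
    return problems


problems = db_module_problems(db_module)
print(problems)
```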
add_to_graph(graph)

Adds the module to a graphviz graph instance.

Parameters: graph (graphviz.dot.Digraph) – A graphviz Digraph object.

check_integrity()

Performs checks on the upstream modules and types, when necessary, to ensure the integrity of the DAG.

get_out_type()

Returns the output type of the module as a list of strings.

rendered_result()

Returns a pair of strings: the rendered lines of code and the definitions of any external classes or objects.

Return type: Tuple[str, str]

Mongo Importer

The Mongo loader operation module

class modules.extractors.mongo_importer.MongoImporter(module, env, named_modules)

Bases: modules.base_module.BaseModule

Main Mongo loader operation module class. The Mongo loader retrieves an arbitrary number of fields from MongoDB documents and converts them into a Flink DataSet.

Parameters: module (dict) – The module dict must have a dbName field with the name of the database (ex: "hatvpDb"), a collection field with the name of the desired collection (ex: "publications"), and a requiredFields field listing the fields to retrieve from the documents (ex: ["age", "name"]).
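The three fields can be sketched as a module dict, together with a plain-Python illustration of what selecting requiredFields from a document amounts to. The function project_fields and the sample document are hypothetical; the actual module renders code that produces a Flink DataSet rather than running anything like this.

```python
from typing import Any, Dict, List, Tuple

# Module dict using the documented field names and example values.
mongo_module = {
    "dbName": "hatvpDb",
    "collection": "publications",
    "requiredFields": ["age", "name"],
}


def project_fields(
    document: Dict[str, Any], required_fields: List[str]
) -> Tuple[Any, ...]:
    """Hypothetical illustration: keep only the required fields, in order.

    Fields absent from the document come back as None.
    """
    return tuple(document.get(field) for field in required_fields)


# Made-up document with more fields than the module asks for.
document = {"_id": 1, "name": "Ada", "age": 36, "city": "London"}
record = project_fields(document, mongo_module["requiredFields"])
print(record)
```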
add_to_graph(graph)

Adds the module to a graphviz graph instance.

Parameters: graph (graphviz.dot.Digraph) – A graphviz Digraph object.

check_integrity()

Performs checks on the upstream modules and types, when necessary, to ensure the integrity of the DAG.

get_out_type()

Returns the output type of the module as a list of strings.

rendered_result()

Returns a pair of strings: the rendered lines of code and the definitions of any external classes or objects.

Return type: Tuple[str, str]