Getting started¶
Dependencies¶
The project makes use of the Apache Flink stream and batch processing framework. Flink needs a working Java 8.x installation to run and is compatible with Windows, Linux, and macOS. The code is compatible with Flink 1.5 and above.
Apache Flink can be installed by downloading and extracting a binary from the Apache Flink downloads page.
Alternatively, you can install it with a package manager of your choice (e.g. Homebrew on macOS).
In order to run the code, Scala 2.12.x and sbt should also be installed.
Finally, the compiler is compatible with Python 3.6 and above. You can install the Python dependencies with pip install -r requirements.txt from the main directory.
The data journalism extractor has several modules that rely on external software. For instance, the MongoDB Importer module requires a working MongoDB installation.
To run the example, you will need working PostgreSQL and MongoDB installations.
Installation¶
Get the project from GitHub¶
The project can be downloaded from the GitHub repository with the command
git clone git@github.com:hugcis/data_journalism_extractor.git your/path/
then run cd your/path/ to get started.
How to run a program¶
The Flink program that executes the operations lives in the scala/src/main/scala/core directory, along with all the modules and classes used to run them.
The main template is MainTemplate.scala.template. It will be rendered with code corresponding to the specifications from a JSON file.
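Conceptually, the rendering step works like any template engine: each module specification is turned into a code snippet, and the snippets fill a placeholder in the main template. The sketch below illustrates the idea with Python's built-in string.Template; the template text, module handling, and function names here are simplified stand-ins, not the project's actual compiler.

```python
from string import Template

# Hypothetical, heavily simplified stand-in for MainTemplate.scala.template
main_template = Template(
    "val env = ExecutionEnvironment.getExecutionEnvironment\n"
    "$module_code\n"
    "env.execute()"
)

def render(modules):
    """Render Scala-like code for a list of module specs (illustration only)."""
    snippets = []
    for m in modules:
        if m["moduleType"] == "csvImporter":
            # An importer becomes a dataset read from its file path
            snippets.append(f'val {m["name"]} = env.readCsvFile[...]("{m["path"]}")')
        elif m["moduleType"] == "csvOutput":
            # An output writes the dataset produced by its source module
            snippets.append(f'{m["source"]}.writeAsCsv("{m["path"]}")')
    return main_template.substitute(module_code="\n".join(snippets))

spec = {"modules": [
    {"moduleType": "csvImporter", "name": "In", "path": "in.csv"},
    {"moduleType": "csvOutput", "name": "Out", "path": "out.csv", "source": "In"},
]}
print(render(spec["modules"]))
```

Running the sketch prints a small Scala-like program with the environment setup first, one line per module, and the execution call last.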
Specifications come in the following JSON structure:
{
    "modules": [
        {
            "moduleType": "string",
            "name": "string",
            ...
        },
        {
            "moduleType": "string",
            "name": "string",
            ...
        },
        ...
    ]
}
This defines a directed acyclic graph (DAG) of operations, where outputs of some modules can be fed to inputs of other modules.
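Because modules reference the modules whose outputs they consume by name (for instance through a source field, as in the minimal example below), the compiler can order operations with a topological sort and reject cyclic specifications. Here is a hedged sketch of such a check in plain Python; the input-key names used ("source", "leftSource", "rightSource") are an illustrative convention, not the full module specification.

```python
def topological_order(modules):
    """Return module names in dependency order; raise if the spec is not a DAG.

    Assumes each module lists its inputs under 'source'/'leftSource'/'rightSource'
    keys (an assumption for illustration, not the complete spec format).
    """
    deps = {}
    for m in modules:
        deps[m["name"]] = [m[k] for k in ("source", "leftSource", "rightSource")
                           if k in m]
    order, done, visiting = [], set(), set()

    def visit(name):
        if name in done:
            return
        if name in visiting:
            raise ValueError(f"cycle involving {name!r}: spec is not a DAG")
        visiting.add(name)
        for dep in deps.get(name, []):
            visit(dep)  # emit all inputs before the module itself
        visiting.discard(name)
        done.add(name)
        order.append(name)

    for name in deps:
        visit(name)
    return order

modules = [
    {"moduleType": "csvOutput", "name": "Out", "source": "In"},
    {"moduleType": "csvImporter", "name": "In"},
]
print(topological_order(modules))  # → ['In', 'Out']
```

A sort like this guarantees that, in the generated Scala, every dataset is defined before any module that reads from it.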
The JSON file defines a program that can be compiled with a Python script as follows:
python/compile.py -s spec.json
A more complete description of the available options and behavior of the script can be obtained by running python/compile.py -h.
Usage: compile.py [-h] [-s filename] [-t dirname] [-o output] [-p]

Command line interface for compiling a JSON spec file into Scala code.

Optional arguments:
  -h, --help                          show this help message and exit
  -s filename, --spec filename        spec filename
  -t dirname, --template-dir dirname  template directory
  -o output, --output output          output filename
  -p, --pdf-output                    output the rendered graph in a PDF file
Once the DAG has been compiled, the generated Scala program can in turn be compiled and packaged into a JAR to run on Flink (either locally or on a cluster).
Minimal Example¶
Take the following simple program:
{
"modules": [
{
"moduleType": "csvImporter",
"name": "In",
"path": "~/project/in.csv",
"dataType": ["String", "String"]
},
{
"moduleType": "csvOutput",
"name": "Out",
"path": "~/project/out.csv",
"source": "In"
}
]
}
This defines a program that copies a CSV file with two String fields called in.csv into another file called out.csv. The corresponding DAG looks like this:
It was generated by compiling the above JSON file with python/compile.py -s spec.json -p. The -p flag generates a DAG representation of your program in PDF format.
The following lines of Scala code were generated during compilation:
// set up the execution environment
val env = ExecutionEnvironment.getExecutionEnvironment
// ===== CSV Importer module In =====
val filePath_In = "~/project/in.csv"
val lineDelimiter_In = "\n"
val fieldDelimiter_In = ","
val In = env.readCsvFile[(String, String)](filePath_In, lineDelimiter_In, fieldDelimiter_In)
// ===== CSV Output File Out =====
val filePath_Out = "~/project/out.csv"
In.writeAsCsv(filePath_Out, writeMode=FileSystem.WriteMode.OVERWRITE)
// ===== Execution =====
env.execute()
To make Flink run the program, you need to pack your code into a .jar file with sbt clean assembly from the scala directory.
Next, make sure you have a Flink task manager running, for example a local cluster started with bin/start-cluster.sh from your Flink installation. You should be able to see the Flink web interface at http://localhost:8081.
You can then run your program with flink run target/scala-2.11/test-assembly-0.1-SNAPSHOT.jar (the exact name of the file depends on the parameters set in build.sbt).
You should then be able to see out.csv in ~/project and the following job in your Flink web interface:
A more thorough introduction to this software can be found at this link.