The project aims to process large volumes of data so that they can be explored through an advanced graphical interface. The source is a corpus of scientific publications in computer science, provided by CiteSeerX.
Instead of using a classical taxonomy to categorize information resources, the Research Map project treats them as "interconnected" with the words they contain, the authors who wrote them, and the research groups or institutions that connect these authors.
The working hypothesis is that publications can be classified by analyzing the frequency of their keywords. By considering co-occurrences of these keywords, publications can also be grouped into domains and sub-domains.
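The two measures mentioned above can be illustrated with a minimal sketch. It assumes each publication has already been reduced to a set of keywords; the toy corpus and the function names are hypothetical and do not describe the project's actual pipeline, which relied on distributed tools.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: each publication reduced to its set of keywords.
publications = {
    "paper_a": {"hadoop", "mapreduce", "indexing"},
    "paper_b": {"hadoop", "mapreduce"},
    "paper_c": {"indexing", "search"},
}


def keyword_frequencies(pubs):
    """Count in how many publications each keyword appears."""
    counts = Counter()
    for keywords in pubs.values():
        counts.update(keywords)
    return counts


def keyword_cooccurrences(pubs):
    """Count how often each unordered keyword pair shares a publication."""
    pairs = Counter()
    for keywords in pubs.values():
        # Sorting makes each pair a canonical (a, b) tuple with a < b.
        pairs.update(combinations(sorted(keywords), 2))
    return pairs


frequencies = keyword_frequencies(publications)
cooccurrences = keyword_cooccurrences(publications)
```

Keyword pairs with high co-occurrence counts, such as ("hadoop", "mapreduce") in this toy corpus, are candidates for belonging to the same domain; how counts are turned into domains (thresholding, clustering, etc.) is left open here.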
In this way, resources form aggregates that can be represented much like an interactive map. The user navigates the map through functions such as zoom, pan, and query.
The stronger the links between domains, the more obvious or shorter the paths between them. Evolution over time can also be shown, either statically, using one of the visual variables of graphic semiology (size, shape, color, value, etc.), or interactively (animation and/or a time slider).
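One way to make "stronger links give shorter paths" concrete is to convert each link strength into a distance and search for the shortest route between two domains. The sketch below assumes distance = 1 / strength and uses Dijkstra's algorithm; the domain names, the strength values, and the inverse mapping are illustrative assumptions, not the project's documented method.

```python
import heapq

# Hypothetical link strengths between domains (e.g. co-occurrence counts).
link_strength = {
    ("databases", "information retrieval"): 8.0,
    ("information retrieval", "machine learning"): 4.0,
    ("databases", "machine learning"): 1.0,
}


def build_graph(strengths):
    """Build an undirected graph whose edge length is 1 / strength (assumed mapping)."""
    graph = {}
    for (a, b), strength in strengths.items():
        distance = 1.0 / strength
        graph.setdefault(a, {})[b] = distance
        graph.setdefault(b, {})[a] = distance
    return graph


def shortest_distance(graph, source, target):
    """Return the length of the shortest path (Dijkstra), or infinity if unreachable."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")


graph = build_graph(link_strength)
distance = shortest_distance(graph, "databases", "machine learning")
```

With these toy values, the route through "information retrieval" (two strong links) is shorter than the weak direct link, which is the behaviour the map is meant to convey.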
A significant part of the project was devoted to setting up an infrastructure for the distributed processing of large data volumes.
Several languages, tools, libraries, and concepts, particularly from the Big Data trend (e.g. Hadoop, MapReduce, Scalding, Spark, Mahout) and the Information Retrieval field (Solr, Lucene), were tested and combined to arrive at a solution capable of handling more than two million records and producing a custom index in near real time.
The project's second aim was to provide an interface for accessing the indexing results from an HTTP client.
Many types of graphical representation, as well as libraries that could support them, were explored.