WP8 - JRA1 - Social Mining and Big Data Resource Integration

Objectives

This WP will promote repeatable and open science in large-scale social data mining. To this end, it will deliver datasets, methods and applications to the SoBigData++ platform, for both virtual and trans-national access. We will focus on the reproducibility of results by making it easier to find, access and replicate experiments in accordance with the FAIR principles. In the case of virtual access, the objective of the WP is to increase the number of datasets and methods integrated into the e-infrastructure. Methods will be grouped in libraries, available either to be used in the cloud as a service (using the computational resources of the e-infrastructure) or to be downloaded for local use.

Tasks

LUH, CNR, USFD, UNIPI, UT, IMT, SNS, AALTO, ETH Zürich, CNRS, CEU, URV, BSC, UPF, KTH, UvA
The following activity describes a service integration that is common to several tasks. Services design and integration: the activity consists of the planning, design and integration of services in the SoBigData++ platform. The objective is to give users the possibility of using the algorithms provided by the consortium in two modalities:

1- As part of a library for local execution and inclusion in the user’s own analytical processes (e.g. a Python library to be included in the user’s code).

2- As services in the cloud of the e-infrastructure, to be executed using the SoBigData++ computational resources.

In both instances, planning and design will be undertaken to allow interoperability of methods using common data representations and homogeneous programming environments. This will be done following the concepts of ‘megamodelling’ and the FAIR principles. The first result of the design will be the definition of a set of libraries which will group together homogeneous services (e.g. Community Discovery, Topic Analysis, Diffusion Simulation). Services will subsequently be integrated into the SoBigData++ e-infrastructure to ensure remote execution in the cloud. This twofold integration will be achieved for most of the public methods, except in specific cases with restrictions (e.g. restricted methods provided only in trans-national access, or methods with specific hardware requirements). Tools defined as applications represent a specific case (i.e. tools with a front-end consisting of complex visualization and/or ad-hoc interfaces). In these instances, tools will be integrated as web services running on the SoBigData++ e-infrastructure or relying on external services (e.g. TagMe, SWAT, WAT, SMAPH, Twitter Monitor or M-Atlas).

T8.1 Data Management and Integration of Social Data resources
Task leader: LUH
Participants: ALL
This task will continuously update the Data Management Plan (inherited from SoBigData, D8.1 https://goo.gl/kjcBZS), providing policies for the description, preservation and sharing of the social datasets for VA and TA. Each dataset description will include a unique reference and an assessment of the dataset's nature, scale and available metadata (such as related scientific publications, privacy issues, data governance policies, licensing, or similar resources). The preservation procedures describe how the partners store the data, which technology is used, and for how long the data will be available. The task will also verify that all partners comply with the privacy and licensing restrictions declared for their data, and will take care of the costs associated with long-term preservation in collaboration with the BOEL in WP2. The available and newly gathered datasets (listed in Section 1) will be registered in the e-infrastructure following the specification defined in this task and implemented in WP9. We plan the integration of two types of datasets:
1. Publicly Available Data. Several datasets are made available by private and public entities and will be included in the list of resources of the project. The information about these datasets is not always complete or well described. For this reason, a platform aimed at easily finding, annotating and discussing them within the research community is needed.
2. Restricted data. Within the consortium, we will make available proprietary datasets. Due to the restrictions imposed by data owners, such collections will be made available mainly on-site, through Transnational Access. In such scenarios, the consortium will take care of simplifying the procedures needed to grant data access to the researchers who want to pursue experiments on them. Access through VA will be granted for all datasets whose policies allow open diffusion; conversely, for all datasets whose access is restricted by licensing, access will be provided only through TA. Moreover, for some datasets, in order to avoid Terms of Service (ToS) infringements, access will be offered in the form of data crawlers, which can be used both in VA and TA to obtain data directly from the original source for a specific and time-limited experiment (e.g. Twitter data).
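
The access rules described in this task can be sketched as a small decision function. This is only an illustration under stated assumptions: the `DatasetRecord` fields and the `access_mode` function are hypothetical names and do not describe the actual SoBigData++ catalogue schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical catalogue entry; all field names are assumptions."""
    reference: str                # unique reference of the dataset
    open_policy: bool             # True if the policy allows open diffusion
    crawler_only: bool = False    # True if ToS require fetching from the source
    publications: List[str] = field(default_factory=list)  # related papers


def access_mode(record: DatasetRecord) -> str:
    """Return the access channel suggested by the rules of this task."""
    if record.crawler_only:
        # Data are fetched from the original source by a crawler,
        # which can be used both in VA and in TA.
        return "crawler"
    if record.open_policy:
        # Policies allowing open diffusion: virtual access.
        return "VA"
    # Licensing restrictions: on-site, trans-national access only.
    return "TA"
```

For instance, a record created with `open_policy=False` and the default `crawler_only=False` is routed to `"TA"`.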

T8.2 Social media observatory and crowd-sensing design and integration
Task leader: CNR
Participants: USFD, UT, ETHZ
This task aims to combine tools and services for social media analysis into an application named Social Media Observatory, to be used by domain experts and stakeholders (e.g. social and political scientists, journalists, economists) to create social media listening campaigns on specific topics of interest by leveraging the crowdsensing paradigm. Data collection will be carried out by specifying keywords, accounts, or geographic areas of interest. The sensed data will be cleansed of ‘noisy’ and unreliable information, such as that produced by fake or bot accounts, and possibly enriched with additional high-level information not directly available in the native data sources (e.g. the type of sentiment detected, the hate speech level, or the geographic coordinates associated with the messages and obtained via geoparsing). Moreover, the functionalities of the existing Twitter Monitor application will be enhanced by composing it with other services provided by SoBigData++. All data collection will be GDPR and ethics compliant, following the best practices and protocols defined in T8.1 and WP2.
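
As a minimal sketch of the collection and cleansing steps, the example below selects posts by keyword or account and discards those written by accounts already flagged as bots. The `Campaign` and `Post` classes and the `collect` function are hypothetical names introduced for illustration; bot detection itself and geographic filtering are assumed to be provided by other services and are not shown.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class Campaign:
    """Hypothetical listening campaign: lowercase keywords and account names."""
    keywords: Set[str] = field(default_factory=set)
    accounts: Set[str] = field(default_factory=set)


@dataclass
class Post:
    author: str
    text: str


def collect(posts: Iterable[Post], campaign: Campaign,
            bot_accounts: Set[str]) -> List[Post]:
    """Keep posts matching the campaign, skipping flagged bot accounts."""
    selected = []
    for post in posts:
        if post.author in bot_accounts:
            continue  # cleansing step: drop unreliable sources
        words = set(post.text.lower().split())
        if post.author in campaign.accounts or words & campaign.keywords:
            selected.append(post)
    return selected
```

Matching on whitespace-separated lowercase tokens is a deliberate simplification; a real campaign would rely on the text mining services of the platform.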

T8.3 Text and Social Media Mining services design and integration
Task leader: USFD
Participants: CNR, UNIPI, LUH, KTH
The task will plan, design and integrate the methods according to the SoBigData++ Services design and integration, focusing on Text and Social Media Mining services. An example of these services is the rich set of open-source natural language processing algorithms from GATE, which are continuously updated and will be the basis for complex analytical workflows. Methods for cross-lingual text classification will be integrated, as well as the DNA-based method for bot detection. Moreover, the platform will host the suite of advanced semantic annotation tools (Acube), which annotate social media content with entities drawn from a Knowledge Base, such as TagMe and WAT for texts, SMAPH for queries and SWAT for entity salience. They will be the basis for the design and implementation of novel Conversational AI and advanced profiling tools. There will also be new services for processing online misinformation, identifying hate speech, and other deeper semantic analysis.

T8.4 Complex Network Analysis Mining services design and integration
Task leader: CNR
Participants: IMT, SNS, ETHZ, CEU
The task will plan, design and integrate the methods according to the SoBigData++ Services design and integration, focusing on Complex Network Analysis services. In particular, tools for networks with a specific structure or specific properties will be provided. Examples of such tools are services and algorithms designed to analyse bipartite networks, which are typical in finance; Ego Network, which focuses on human relationships and social interactions; and temporal networks for social and economic systems. Different methods for network optimization problems will be analysed and integrated, in particular the methods of generalized network dismantling. Libraries for community detection and methods for the reconstruction and validation of bipartite networks will also be integrated.

T8.5 Human Mobility Analytics services design and integration
Task leader: CNR
Participants: CEU, KTH
The task will plan, design and integrate the methods according to the SoBigData++ Services design and integration, focusing on Human Mobility Analytics services. An example of these services will be the integration of all the trajectory reconstruction, annotation and mining algorithms of the M-Atlas application into a more flexible and efficient library, as well as their integration in the cloud. This will enable the definition of analytical workflows on mobility data running in the cloud. Methods for transfer learning of mobility models will be investigated and integrated.

T8.6 Web Analytics services design and integration
Task leader: LUH
Participants: USFD, CNR
The task will plan, design and integrate the methods according to the SoBigData++ Services design and integration, focusing on Web Analytics services. An example of these services will be the integration of many Web content extraction tools that are used to extract relevant text, entities, timestamps and tables from Web data. Most Web retrieval, social media analysis and text analysis tasks are direct beneficiaries of effective content extraction methods, and these tools are primitive operations in any NLP, IR and media analysis pipeline. We plan to greatly increase the number of extraction methods that integrate seamlessly as a pre-processing step when processing large volumes of Web data.

T8.7 Visual Analytics services design and integration
Task leader: UvA
Participants: CNRS, CNR
The task will plan, design and integrate the methods according to the SoBigData++ Services design and integration, focusing on Visual Analytics services. The task applies interactive visualisation techniques to two parts of the analytical process: the first is the inspection of analytical results and the second is their consolidation. The rationale for this approach is not the lack but the abundance of potentially useful results generated with different data sources, methods or settings. The consolidation process includes the comparison, evaluation and sense-making of partial results and the generation of a top-level view of the analysis.

T8.8 Privacy Enhancing Technology and Discrimination preventing services design and integration
Task leader: UNIPI
Participants: LUH, URV
The task will plan, design and integrate methods according to the SoBigData++ Services design and integration focusing on Privacy Enhancing Technology (PET) services. An example of these services will be the integration of a library of algorithms for data anonymization, the application Prudence for the enforcement of data subjects’ rights and privacy risk assessment as well as tools for the mitigation of such risk. Different methods for discrimination discovery will be analysed and integrated.

T8.9 Explainable AI services design and integration
Task leader: UNIPI
Participants: CNR, LUH, AALTO
The task will plan, design and integrate methods according to the SoBigData++ Services design and integration, focusing on Explainable AI. New algorithms for agnostic local explanation (LORE, MARLENA and L2G), which provide interpretable and faithful explanations, will be integrated as new services. Moreover, this task will focus on model-agnostic explanations and on the robustness of explanations for complex predictive models. The service library in this context will be a superset of the already existing X-Lib, including the up-to-date versions of the algorithms (i.e. the new *LIME# family of algorithms).

T8.10 Scalable machine learning services design and integration
Task leader: LUH
Participants: CNR, BSC
Continuous representations of discrete structures such as text, graphs and tables have become essential input formats for many deep learning models. The task will plan, design and integrate the methods according to the SoBigData++ Services design and integration principles, focusing on training and deploying such representations in a scalable manner from the existing datasets in the platform, in order to accelerate the use of many recent machine learning models. Examples include scalable learning of text representations such as word2vec and GloVe, and of network representations for social network analysis such as DeepWalk, node2vec and VERSE. In particular, BSC will use its expertise to port services from HPC environments and test them on its platform.

T8.11 Filling the gaps: emerging new analytical technologies
Task leader: UNIPI
Participants: ALL
This task will foster the research and development of new analytical methods for social mining among the consortium, associated partners and final users. This will expand the set of methods hosted on the platform, renewing and adapting services to emerging research topics and methodologies (e.g. peer-to-peer decentralized machine learning algorithms for personalised AI). Moreover, in this task we will collect new general-purpose data mining methods and machine learning and deep learning algorithms, which will be used in the different thematic clusters and exploratories in WP10.