Project-Team



Overall objectives
Data collection is ubiquitous nowadays and provides our society with tremendous volumes of knowledge about human, environmental, and industrial activity. This ever-increasing stream of data holds the keys to new discoveries, both in industrial and scientific domains. However, those keys will only be accessible to those who can make sense of such data. This is a hard problem. It requires a good understanding of the data at hand, proficiency with the available analysis tools and methods, and good deductive skills. All these skills have been grouped under the umbrella term "Data Science", and universities have put a lot of effort into producing professionals in this field. "Data Scientist" is currently an extremely sought-after job, as demand far exceeds the number of competent professionals. Despite its boom, data science is still mostly a "manual" process: current data analysis tools require a significant amount of human effort and know-how. This makes data analysis a lengthy and error-prone process, even for data science experts, and current approaches are mostly out of reach of non-specialists.
The objective of the team LACODAM is to facilitate the process of making sense of (large) amounts of data, in order to derive knowledge and insights for better decision-making. Our approaches are mostly dedicated to providing novel tools to data scientists that can either perform tasks not addressed by any other tool, or improve the performance of existing tasks (for instance by reducing execution time, improving accuracy, or better handling imbalanced data).

Research program

3.1 Introduction
LACODAM is a research team on data science methods and applications, composed of researchers with a background in symbolic AI, data mining, databases, and machine learning. Our research is organized along the following three research axes:
• Symbolic methods (Section 3.2) is the first fundamental research axis. It focuses on methods that operate in symbolic domains, which usually take as input discrete data (e.g. event logs, transactional data, RDF data) and output symbolic results (e.g. patterns, concepts).
• Interpretable Machine Learning (Section 3.3) is the other fundamental research axis of the team. It aims at providing interpretable machine learning approaches, mostly by proposing post-hoc interpretability for state-of-the-art numerical machine learning methods. Interpretable-by-design machine learning approaches that do not fall into the "Symbolic methods" axis are also studied here.
• Real world AI (Section 3.4) deals with the application or adaptation of the methods developed in the aforementioned fundamental axes to real-world problems. These works are conducted in collaboration with either industrial or academic partners from other domains. For example, one important application area for the team is digital agriculture, with colleagues from INRAE.

Symbolic methods
LACODAM's core symbolic expertise lies in methods for efficiently exploring large combinatorial spaces. This expertise is used in three main research areas:
• Pattern mining, a field of data mining where the goal is to find regularities in data (in an unsupervised way);
• Semantic Web, where the goal is to reason over the contents of the Web;
• Skyline queries, where the goal is to find solutions to multi-criteria optimization queries.
In the pattern mining domain, the team is well known for tackling problems where the data and expected patterns have a temporal component. Usually the data considered are timestamped event logs, a ubiquitous type of data nowadays. The patterns extracted can be more or less complex subsequences, but also patterns exhibiting temporal periodicity.
A well-known problem in pattern mining is pattern explosion: due to either underspecified constraints or the combinatorial nature of the search space, pattern mining approaches may produce millions of patterns of mixed interest. The current best approach to limit the number of output patterns is to produce a small pattern set that optimizes some quality criteria. The best pattern set methods so far are based on information theory and rely on the Minimum Description Length (MDL) principle. LACODAM is the leading French team on MDL-based pattern mining, especially for complex patterns. After integrating Peggy Cellier in 2021, who is the main French expert in MDL-based pattern mining, we integrated Sébastien Ferré in April 2022, who is also an expert in this area, especially for graph patterns.
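As a toy illustration of the MDL principle applied to pattern sets, the sketch below scores a candidate pattern set by a crude two-part description length (data bits plus model bits). The encoding is deliberately simplified compared to actual MDL-based miners, and the database and pattern are invented for illustration:

```python
import math
from collections import Counter

def cover(transaction, patterns):
    """Greedily cover a transaction with patterns (longest first),
    falling back to singleton codes for leftover items."""
    left, used = set(transaction), []
    for p in patterns:
        if p <= left:
            used.append(p)
            left -= p
    used.extend(frozenset([i]) for i in sorted(left))
    return used

def mdl_bits(db, patterns):
    """Toy two-part MDL score of a pattern set on a transaction database:
    data cost = Shannon-optimal code lengths -log2(usage/total),
    model cost = a crude 1 bit per item stored in the code table."""
    patterns = sorted(patterns, key=len, reverse=True)
    usage = Counter()
    for t in db:
        usage.update(cover(t, patterns))
    total = sum(usage.values())
    data = -sum(u * math.log2(u / total) for u in usage.values())
    model = sum(len(p) for p in usage)
    return data + model

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
# The frequent pattern {a,b} compresses the database, so MDL prefers it:
assert mdl_bits(db, [frozenset({"a", "b"})]) < mdl_bits(db, [])
```

The pattern set that minimizes this kind of score is small by construction, which is exactly how MDL-based approaches curb pattern explosion.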
The team's contribution in the Semantic Web domain focuses on different problems related to knowledge graphs (KGs), usually extracted (semi-)automatically from the Web. These include applications such as mining and reasoning, as well as data management tasks such as provenance and archiving. Reasoning can resort to either symbolic methods such as Horn rules or numeric approaches such as KG embeddings, which can be explained via post-hoc explainability modules. The integration of Sébastien Ferré (former SemLIS team leader) further strengthens the Semantic Web axis by extending our expertise in general graph mining, relation extraction, and semantic data exploration.
Skyline queries are a research topic from the database community, closely related to multi-criteria optimization. In transactional data, one may want to optimize over several attributes of equal importance, which amounts to discovering a Pareto front (the "skyline"). The team has expertise in skyline queries in traditional databases as well as their application to pattern mining (extraction of skypatterns). Recently, the team started to tackle the extraction of skyline groups, i.e. groups of records that together optimize multiple criteria.
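The skyline itself is easy to state: keep exactly the records that no other record dominates. A minimal sketch (maximizing all criteria; the data points are invented for illustration):

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every criterion
    (here: larger is better) and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(points):
    """Naive O(n^2) skyline: keep the points dominated by no other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# e.g. candidate records scored on two criteria of equal importance
pts = [(9, 2), (7, 7), (3, 9), (5, 5), (2, 2)]
assert sorted(skyline(pts)) == [(3, 9), (7, 7), (9, 2)]
```

Real skyline algorithms avoid the quadratic comparison via sorting or indexing, but the dominance test is the same.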

Interpretable ML
Making machine learning more interpretable is one of the greatest challenges for the AI community nowadays. LACODAM contributes to the main areas of explainable AI (XAI):
• From a fundamental point of view, the team is trying to deepen the understanding of state-of-the-art post-hoc interpretability approaches (LIME/SHAP), in order to improve these methods or adapt them to novel domains. The team has also started working on the generation of counterfactual explanations. Both lines of work share the need for novel notions of neighborhood of points in the model's data space.
• The team is also working on "interpretable-by-design" machine learning methods, where the decision taken can immediately be explained by the (part of the) model that took it. The approaches used can be deep learning architectures as well as hybrid numeric/symbolic models relying on pattern mining techniques.
• Last, the team has a special interest in time series data, which arise in many applications but have not yet received enough attention from the interpretability community. We have proposed both post-hoc and "by design" approaches for interpretable ML on time series.
More generally, LACODAM is interested in the study of the interpretability-accuracy trade-off. Our studies may be able to answer questions such as "how much accuracy can a model lose (or perhaps gain) by becoming more interpretable?". Such a goal requires us to define interpretability in a more principled way, a challenge that the community has only recently begun to address.
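As a toy illustration of the counterfactual-explanation setting mentioned above, the sketch below brute-forces the smallest feature change that flips a black-box decision. The model, feature names, and step sizes are all invented for illustration; real methods search far larger, continuous neighborhoods:

```python
from itertools import product

def black_box(x):
    """Hypothetical stand-in classifier: approve when income - 2*debt > 10."""
    return x["income"] - 2 * x["debt"] > 10

def counterfactual(x, steps):
    """Exhaustively try small per-feature perturbations (0 = unchanged)
    and return the candidate that flips the decision while changing
    the fewest features."""
    target = not black_box(x)
    names = list(steps)
    best, best_cost = None, float("inf")
    for deltas in product(*([0] + steps[n] for n in names)):
        cand = dict(x)
        for n, d in zip(names, deltas):
            cand[n] += d
        if black_box(cand) == target:
            cost = sum(d != 0 for d in deltas)
            if cost < best_cost:
                best, best_cost = cand, cost
    return best

x = {"income": 12, "debt": 3}          # rejected: 12 - 2*3 = 6 <= 10
cf = counterfactual(x, {"income": [3, 6], "debt": [-1, -2]})
assert black_box(cf) and not black_box(x)
```

The returned counterfactual ("with an income higher by 6, the loan would be approved") is the kind of contrastive answer these explanations aim to provide; the hard part in practice is defining a plausible neighborhood, which is precisely the shared need noted above.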

Real world AI
LACODAM's research work is firmly rooted in applications. On the one hand, the data science tools proposed in our fundamental work need to prove their value at solving actual problems. On the other hand, working with practitioners allows us to better understand their needs and the limitations of existing approaches with respect to those needs. This can open new and fruitful (fundamental) research directions.
Our objective in this axis is to work on challenging problems with interesting and pertinent partners. We target problems where off-the-shelf data science approaches either cannot be applied or do not give satisfactory results: such problems are the most likely to lead to new and meaningful research in our field. For some problems, collaborative research may not necessarily lead to fundamental breakthroughs, but can still allow making progress in the practitioners' field. We also value such work, which contributes to the discovery of new knowledge and helps industrial partners innovate.
Due to the team's expertise in handling temporal data, many of our applied collaborations revolve around the analysis of time series or event logs. Naturally, our work on interpretability is also present in most of our collaborations, as experts want accurate models but also want to understand the decisions of those models.
The application domains are described in more detail in the next section (Section 4).

Application domains
The current period is extremely favorable for teams working in data science and artificial intelligence, and LACODAM is no exception. We are eager to see our work applied in real-world applications, and thus have an important activity in maintaining strong ties with industrial partners concerned with marketing and energy, as well as public partners working on health, agriculture, and the environment.

Industry
We present below our industrial collaborations. Some are well-established partnerships, while others are more recent collaborations with local industries that wish to reinforce their Data Science R&D with us.
• Privacy-Preserving Data-Sharing. The collection of electrical consumption time series through smart meters grows with ambitious nationwide smart grid programs. These data are both highly sensitive and highly valuable: strong laws about personal data protect them, while laws about open data aim at making them public after a privacy-preserving data publishing process. The CIFRE PhD of Antonin Voyez, funded by Enedis, is concerned with this application. We study the uniqueness of large-scale, real-life, fine-grained electrical consumption time series, the potential privacy threats, and their mitigation.
• Heterogeneous tabular data generation with deep generative models. Tabular data generation is paramount when dealing with privacy-sensitive data and with missing values, which are frequent cases in the real (industrial) world and particularly at Orange. It is also used for data augmentation, a pre-processing step often needed when training data-hungry deep learning models (for example to detect anomalies in networks, study customer profiles, etc.). The CIFRE PhD of Charbel Kindji, funded by Orange, is concerned with this application. We study methods to tackle this problem when the tabular data are heterogeneous (numerical and symbolic) and when new tables should be generated from scratch based on a human prompt.
• Counterfactual explanations over multivariate time series. Very complex machine learning models (called black boxes) are often used in critical applications (e.g. self-driving cars). To comply with EU regulations and better understand their systems, many companies, and in particular Stellantis, are interested in developing skills in "explainable AI", a domain which aims at bringing the human back into decision loops that involve a black-box model. The CIFRE PhD of Paul Sevellec, funded by Stellantis, is concerned with this application. We study the particular case of counterfactual explanations in the challenging context of multivariate time series. This problem is related to the generation of new data that fulfills some human requirements.
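The uniqueness analysis mentioned in the privacy-preserving data-sharing bullet above can be illustrated with a toy re-identification measure: coarsen each household's readings and count how many remain unique. The bin size, prefix length, and data are illustrative choices, not those of the study:

```python
from collections import Counter

def uniqueness_rate(series, k=3, bin_size=0.5):
    """Fraction of households whose first k readings, coarsened into
    bin_size-kWh bins, match no other household: a rough proxy for
    re-identification risk in published consumption data."""
    keys = [tuple(round(v / bin_size) for v in s[:k]) for s in series]
    counts = Counter(keys)
    return sum(counts[key] == 1 for key in keys) / len(series)

households = [
    [0.4, 0.5, 2.1, 0.3],
    [0.4, 0.6, 2.0, 0.2],   # same coarsened prefix as the first household
    [1.9, 0.1, 0.1, 3.0],   # unique even after coarsening
]
assert abs(uniqueness_rate(households) - 1 / 3) < 1e-9
```

Coarser bins or shorter prefixes lower the uniqueness rate, which is the basic trade-off between privacy protection and data utility that such studies quantify at scale.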

Agriculture and Environment
• Animal welfare. There is an increasing concern among both consumers and professionals to better take into account the welfare of farm animals. For consumers, this is an important ethical issue. For professionals, their animals will have to adapt to quickly evolving climatic conditions due to global warming, which requires improving animal health and resilience. Better understanding animal welfare is a key component of these improvements. This is the general topic of the WAIT4 project (see Section 10.3), where LACODAM provides its data mining expertise to analyze time series from precision farming sensors, as well as event logs of animal behaviors. As a first research topic in this project, the PhD of Lucie Lepetit is concerned with heat stress. The data are rumen temperature measurements from dairy cows of our INRAE partner. In these data, we can notice that on especially hot summer days, some cows have difficulty coping with the high temperature and exhibit a high rumen temperature both during the event and for several days afterwards. Other cows, on the other hand, are only mildly affected by the heat during the event and quickly return to a normal rumen temperature. Our goal is to design a method that quickly identifies all the abnormal rumen temperature periods correlated with high external temperature, and that provides a characterization of the cows that either resist the heat well or, on the contrary, do not cope well with it.
• Prediction of the Dynamics of Crop Diseases. The PhD thesis of Olivier Gauriau focuses on predicting the dynamics of crop diseases by means of pattern-aided regression techniques. Such techniques are known to strike an interesting trade-off between accuracy and interpretability, which can help agronomists understand the best predictors of high disease incidence, and therefore optimize the usage of phytosanitary products. This project is funded by #DigitAg and the Ecophyto program, and constitutes a collaboration with the ACTA of Toulouse and INRAE.
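A rough sketch of the episode-detection step described in the animal welfare bullet above: flag days where external temperature and rumen temperature are jointly high, then group consecutive flagged days into episodes. The thresholds and data are invented for illustration; the actual study works on real sensor streams:

```python
def heat_stress_episodes(ext_temp, rumen_temp, ext_thresh=30.0, rumen_thresh=39.5):
    """Flag days where the external temperature is high AND the rumen
    temperature is abnormal, then group consecutive flagged days into
    (start_day, end_day) episodes. Thresholds are illustrative only."""
    flagged = [e > ext_thresh and r > rumen_thresh
               for e, r in zip(ext_temp, rumen_temp)]
    episodes, start = [], None
    for day, f in enumerate(flagged):
        if f and start is None:
            start = day
        elif not f and start is not None:
            episodes.append((start, day - 1))
            start = None
    if start is not None:
        episodes.append((start, len(flagged) - 1))
    return episodes

ext   = [25, 32, 33, 34, 26, 25, 31]
rumen = [39.0, 39.8, 40.1, 39.9, 39.7, 39.1, 39.2]
# days 1-3 are hot with elevated rumen temperature; day 4 rumen is still
# high after the heat event (a slow-recovery signature); on day 6 the
# cow copes with the heat
assert heat_stress_episodes(ext, rumen) == [(1, 3)]
```

Characterizing resilient versus sensitive cows then amounts to comparing per-animal episode lengths and post-episode recovery times, which this kind of segmentation makes directly computable.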

Education
• Data-oriented Academic Counseling. Course selection and recommendation are important aspects of any academic counseling system. The Learning Analytics community has long supported these activities via automatic, data-based tools for recommendation and prediction. LACODAM, in collaboration with the Ecuadorian research center CTI, has contributed to this body of research with the design of a tool that allows students to select multiple courses and predict their academic performance based on historical academic data. The tool resorts to visualization and interpretable machine learning techniques, and is intended to be used by students before counseling sessions to plan their upcoming semester at the Ecuadorian university ESPOL. In our ongoing collaboration with CTI, we are studying the impact of academic predictions and explanations on the behavior and decisions of students and counselors.
• Online Children Handwriting Recognition. The PhD thesis of Simon Corbillé addresses the problem of online handwriting recognition, a problem that enjoys satisfactory solutions for adults but remains a challenge for children. This is because children's handwriting is, at an early stage of learning, approximate and includes deformed letters. This is a joint effort between the LACODAM and ShaDoc (IRISA) teams.

Semantic Data Management
• RDF Archiving and Provenance. Archiving and provenance tracking are two crucial tasks in the management of large collaborative RDF knowledge bases, such as Wikidata or DBpedia. This is a consequence of the dynamicity and source heterogeneity of such data collections. Notwithstanding the value of RDF archiving and provenance tracking for both data maintainers and consumers, this field of research remains under-developed for multiple reasons. These include, among others, the lack of usability and scalability of existing systems, a disregard of the evolution patterns of RDF datasets, and a weaker focus on data processes involving non-monotone operations. These challenges are tackled in our ongoing collaboration with the DAISY team of Aalborg University, namely through the PhD thesis of Olivier Pelgrin on scalable RDF archiving, and the post-doctoral fellowship of Daniel Hernández on how-provenance computation for SPARQL queries.
5 Social and environmental responsibility

Footprint of research activities
Two main axes characterize the bulk of LACODAM's environmental impact: work trips and the utilisation of computing resources.
Work trips. While the sanitary crisis had drastically cut the number of work trips of the team, recent years have seen an increase in physical participation in conferences and various committees. However, compared to the pre-COVID period, the majority of trips are national or at most European, with very few trips outside of Europe, and most of them are made by train (rather than plane). It seems that, in general, the possibility of participating in meetings by videoconference has removed many "low added value" trips. This is a first step in reducing our carbon footprint in a meaningful way, while preserving those trips that are important for the scientific as well as the human aspects of our work.
Utilisation of computing resources. LACODAM contributed a new server (abacus12) to the Igrida computing platform in 2020. As a team specialized in data science and machine learning, a recurrent task in LACODAM is to run compute-intensive algorithms on large data collections, for example to train deep neural networks. Some of our recent PhD research topics (e.g. the theses of Simon Corbillé and Simon Felton) concern deep learning technologies, and the important place of eXplainable AI in our research program has made our team highly reliant on Igrida (notably with the PhDs of Julien Delaunay and Victor Guyomard). While the discontinuation of Igrida services and the transition towards Grid'5000 and Jean Zay has reduced our access to easily available computation resources (making it harder to perform experiments during the transition and requiring us to learn new ways of operating), it arguably has a positive effect on energy consumption, as we are now using national infrastructures that benefit from even better sharing between users than Igrida (which was already heavily used).

Impact of research results
We estimate that our research work can have actual impact in three different ways:
• In the short/medium term, a significant part of our research work is conducted in collaboration with companies, through CIFRE PhDs. Hence, the addressed research problems concern an important challenge for the company, and the solutions proposed are evaluated on their relevance to tackling this challenge.
• In the medium/long term, we also have potentially impactful research work with scientists from other domains, especially in environment and agriculture. Some earlier work of the team, conducted with the INRAE SAS team, helped better understand nitrate pollution in Brittany, an important environmental issue. The current work of Lucie Lepetit is dedicated to the design of better data mining tools to characterize heat stress in cows, which will help guarantee the well-being of farm animals in a time of climate change.
• Last, in the longer term, the team has a fundamental line of work on machine learning and interpretability. This is a critical topic nowadays, notably since the introduction of the GDPR. Given the increasing use of machine learning solutions in most areas of human activity, work on interpretability is of utmost societal importance, as it will help design more useful and more acceptable machine learning approaches. This will require a sustained effort from the community: LACODAM is taking part in this effort, both on its own, as the coordinator of the Inria HyAIAI project, and by having several of its members in the large European project TAILOR dedicated to this topic.

Other highlights
Romaric Gaudel defended his HDR in February on the subject "Recommender Systems: Online Learning and Ranking". Elisa Fromont and Alexandre Termier were involved in the proposal for the CMA AI training project (6 M euros), which was awarded in February to the Université de Rennes. Elisa Fromont is now the scientific leader of this 5-year project.

Functional description: Dexteris is a low-code tool for data exploration and transformation. It works as an interactive, data-oriented query builder with JSONiq as the target query language. It uses JSON as the pivot data format, but it can read from and write to a few other formats: text, CSV, and RDF/Turtle (to be extended to other formats).
Dexteris is very expressive, as JSONiq is Turing-complete, and it supports a varied set of data processing features:
- reading JSON files, and CSV as JSON (one object per row, one field per column),
- string processing (split, replace, match, ...),
- arithmetic, comparison, and logic,
- accessing and creating JSON data structures,
- iterations, grouping, filtering, aggregates, and ordering (FLWOR operators),
- local function definitions.
The built JSONiq programs are high-level, declarative, and concise. In-progress results are shown at every step so that users can stay focused on their data and on the transformations they want to apply.
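The "CSV as JSON" convention mentioned above (one object per row, one field per column) can be mimicked in a few lines of Python. This is only an analogue for illustration; Dexteris itself builds JSONiq queries:

```python
import csv
import io

def csv_as_json(text):
    """Read CSV text as a list of JSON-like objects:
    one dict per row, one field per column (all values as strings)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

table = "name,score\nada,3\nbob,5\n"
objs = csv_as_json(table)
assert objs == [{"name": "ada", "score": "3"}, {"name": "bob", "score": "5"}]

# FLWOR-style filtering/projection then becomes an ordinary comprehension:
high = [o["name"] for o in objs if int(o["score"]) > 4]
assert high == ["bob"]
```

In Dexteris the same pipeline is assembled interactively and compiled to a JSONiq FLWOR expression rather than written by hand.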

New results
We organize the scientific results of the research conducted at LACODAM according to the axes described in our research program (Section 3).
A remark about the "Participants" boxes: for each subsection, we compiled syntactically the list of co-authors of the papers that make up this year's "New Results". This obviously does not mean that other members of the team do not work on the topics listed; it only means that they did not have a publication on that topic this year.

Concepts of Neighbors and their Application to Instance-based Learning on Relational Data [20].
Knowledge graphs and other forms of relational data have become a widespread kind of data, and powerful methods to analyze and learn from them are needed. Formal Concept Analysis (FCA) is a mathematical framework for the analysis of symbolic datasets, which has been extended to graphs and relational data, as in Graph-FCA. It encompasses various tasks such as pattern mining or machine learning, but its application generally relies on the computation of a concept lattice whose size can be exponential in the number of instances. We propose to follow an instance-based approach where the learning effort is delayed until a new instance comes in and an inference task is set. This is the approach adopted in k-Nearest Neighbors, and it relies on a distance between instances. We define a conceptual distance based on FCA concepts, and from there the notion of concepts of neighbors, which can be used as a basis for instance-based reasoning. Those definitions are given for both classical FCA and Graph-FCA. We provide efficient algorithms for computing concepts of neighbors, and we demonstrate their inference capabilities by presenting three different applications: query relaxation, knowledge graph completion, and relation extraction.

Data Mining-Based Techniques for Software Fault Localization [50].
This chapter illustrates the basic concepts of fault localization using a data mining technique, using the Trityp program to illustrate the general method. Formal concept analysis and association rules are two well-known methods for symbolic data mining. In their original inception, they both consider data in the form of an object-attribute table. The chapter considers a debugging process in which a program is tested against different test cases. Two attributes, PASS and FAIL, represent the outcome of each test case. The chapter extends the analysis of data mining for fault localization to multiple-fault situations. It also addresses how data mining can be further applied to fault localization for GUI components. Unlike traditional software, GUI test cases are usually event sequences, and each individual event has a unique corresponding event handler.

[58]. A poster presenting the scikit-mine library, a Python library for pattern mining. This library proposes an open-source implementation of recent MDL-based pattern mining algorithms that focuses on ease of use.

[47]. To get a good understanding of a dynamical system, it is convenient to have an interpretable and versatile model of it. Timed discrete event systems are a kind of model that responds to these requirements. However, such models can be inferred from timestamped event sequences, but not directly from numerical data. To solve this problem, a discretization step must be performed to identify events or symbols in the time series. Persist is a discretization method that intends to create persisting symbols by using a score called the persistence score. This mitigates the risk of undesirable symbol changes that would lead to an overly complex model. After studying the persistence score, we point out that it tends to favor extreme cases, making it miss interesting persisting symbols. To correct this behavior, we replace the metric used in the persistence score, the Kullback-Leibler divergence, with the Wasserstein distance. Experiments show that the improved persistence score enhances Persist's ability to capture the information of the original time series, and makes it better suited for discrete event systems learning.

[35]. The ARC (Abstraction and Reasoning Corpus) challenge has been proposed to push AI research towards more generalization capability rather than ever more performance. It is a collection of unique tasks about generating colored grids, specified by a few examples only. We propose object-centered models analogous to the natural programs produced by humans. The MDL (Minimum Description Length) principle is exploited for an efficient search in the vast model space. We obtain encouraging results with a class of simple models: various tasks are solved and the learned models are close to natural programs.

[25]. The signature approach considers a sequence of itemsets and, given a number k, returns a segmentation of the sequence into k segments such that the number of items occurring in all segments is maximized. The limitation of this approach is that it requires manually setting k, which fixes the temporal granularity at which the data is analyzed. We propose the sky-signature model that removes this requirement and allows us to examine the results at multiple levels of granularity, while keeping a compact output. We also propose efficient algorithms to mine sky-signatures, as well as an experimental validation on real data from both the retail domain and natural language processing (political speeches).
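The fixed-k signature model discussed above can be sketched by brute force: enumerate all k-segmentations and keep the one whose segments share the most items. This is exponential and only illustrates the objective; the sequence is invented, and the actual algorithms are far more efficient:

```python
from itertools import combinations

def signature(sequence, k):
    """Brute-force k-segmentation of a sequence of itemsets that maximizes
    the number of items occurring in every segment (the 'signature')."""
    n = len(sequence)
    best_items, best_cuts = set(), None
    for cuts in combinations(range(1, n), k - 1):
        bounds = [0, *cuts, n]
        segments = [sequence[bounds[i]:bounds[i + 1]] for i in range(k)]
        # items present somewhere in EVERY segment
        common = set.intersection(*(set().union(*s) for s in segments))
        if len(common) > len(best_items):
            best_items, best_cuts = common, cuts
    return best_items, best_cuts

seq = [{"a", "b"}, {"c"}, {"a"}, {"b", "a"}, {"d"}, {"a", "c"}]
items, cuts = signature(seq, 3)
assert "a" in items   # "a" recurs throughout, so it belongs to the signature
```

The sky-signature model of [25] removes the need to fix k by keeping the Pareto-optimal (k, signature) pairs instead of a single segmentation.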

Graph-FCA
Participants: H. Ambre Ayats, Peggy Cellier, Sébastien Ferré. Some documents presented elsewhere also contribute to this research domain: [20].

[34]. A number of extensions have been proposed for Formal Concept Analysis (FCA). Among them, Pattern Structures (PS) bring complex descriptions of objects, as an extension to sets of binary attributes, while Graph-FCA brings n-ary relationships between objects, as well as n-ary concepts. We have introduced a novel extension named Graph-PS that combines the benefits of PS and Graph-FCA. In conceptual terms, Graph-PS can be seen as the meet of PS and Graph-FCA, seen as sub-concepts of FCA. We have demonstrated how it can be applied to RDFS graphs, handling hierarchies of classes and properties, and patterns on literals such as numbers and dates.
Language Models as Controlled Natural Language Semantic Parsers for Knowledge Graph Question Answering [41], in collaboration with Prof. Jens Lehmann (Amazon, TU Dresden). We have proposed the use of controlled natural language as a target for knowledge graph question answering (KGQA) semantic parsing via language models, as opposed to using formal query languages directly. Controlled natural languages are close to (human) natural languages, but can be unambiguously translated into a formal language such as SPARQL. Our research hypothesis is that the pre-training of large language models (LLMs) on vast amounts of textual data leads to the ability to parse into controlled natural language for KGQA with limited training data requirements. We devise an LLM-specific approach for semantic parsing to study this hypothesis. To conduct our study, we created a dataset that allows the comparison of one formal and two different controlled natural languages. Our analysis shows that training data requirements are indeed substantially reduced when using controlled natural languages, which is relevant since collecting and maintaining high-quality KGQA semantic parsing training data is very expensive and time-consuming.

[44]. In recent years, research in RDF archiving has gained traction due to the ever-growing nature of semantic data and the emergence of community-maintained knowledge bases. Several solutions have been proposed to manage the history of large RDF graphs, including approaches based on independent copies, time-based indexes, and change-based schemes. In particular, aggregated changesets have been shown to be relatively efficient at handling very large datasets. However, ingestion time can still become prohibitive as the revision history grows. To tackle this challenge, we propose a hybrid storage approach based on aggregated changesets, snapshots, and multiple delta chains. We evaluate different snapshot creation strategies on the BEAR benchmark for RDF archives, and show that our techniques can speed up ingestion time by up to two orders of magnitude while keeping competitive performance for version materialization and delta queries. This allows us to support revision histories of lengths that are beyond the reach of existing approaches.
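The snapshot-plus-changeset idea behind such hybrid storage can be sketched in memory: to materialize a version, start from the nearest preceding snapshot and replay the deltas after it. The data structures here are toy stand-ins (real systems use disk-based, indexed representations):

```python
def materialize(version, snapshots, deltas):
    """Rebuild one version of a triple store from the nearest preceding
    snapshot plus the changesets that follow it.
    snapshots: {version: set_of_triples};  deltas: {version: (added, deleted)}."""
    base = max(v for v in snapshots if v <= version)
    triples = set(snapshots[base])
    for v in range(base + 1, version + 1):
        added, deleted = deltas[v]
        triples |= added
        triples -= deleted
    return triples

snapshots = {0: {("s", "p", "o1")}, 3: {("s", "p", "o3")}}
deltas = {1: ({("s", "p", "o2")}, set()),
          2: (set(), {("s", "p", "o1")}),
          4: ({("s", "p", "o4")}, set())}

assert materialize(2, snapshots, deltas) == {("s", "p", "o2")}
assert materialize(4, snapshots, deltas) == {("s", "p", "o3"), ("s", "p", "o4")}
```

Placing snapshots more often shortens the delta chains that must be replayed (faster materialization) at the cost of storage, which is exactly the trade-off that snapshot creation strategies tune.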

GLENDA: Querying over RDF Archives with SPARQL [43].
The dynamicity of semantic data has propelled research on RDF archiving, i.e., the task of storing and making accessible the full history of a large RDF dataset. That said, existing archiving techniques fail to scale when confronted with very large RDF archives and complex SPARQL queries. In this demonstration, we showcase GLENDA, a system capable of running full SPARQL 1.1-compliant queries over large RDF archives. We achieve this through a multi-snapshot, change-based storage architecture that we interface using the Comunica query engine. Thanks to this integration, we demonstrate that fast SPARQL query processing over multiple versions is possible. Moreover, our demonstration provides different statistics about the history of RDF datasets, giving insights into the evolution dynamics of the data.
The Need for Better RDF Archiving Benchmarks [49]. The advancements and popularity of Semantic Web technologies in recent decades have led to an exponential adoption and availability of Web-accessible datasets. While most solutions consider such datasets to be static, they often evolve over time. Hence, efficient archiving solutions are needed to meet users' and maintainers' needs. While some solutions to these challenges already exist, standardized benchmarks are needed to systematically test the different capabilities of existing solutions and identify their limitations. Unfortunately, the development of new benchmarks has not kept pace with the evolution of RDF archiving systems. In this paper, we therefore identify the current state of the art in RDF archiving benchmarks and discuss to what degree such benchmarks reflect the current needs of real-world use cases and their requirements. Through this empirical assessment, we highlight the need for more advanced and comprehensive benchmarks that align with the evolving landscape of RDF archiving.

Dexteris: Data Exploration and Transformation with a Guided Query Builder Approach [33]. Data exploration and transformation remain a challenging prerequisite to the application of data analysis methods. The desired transformations are often ad hoc, so that existing end-user tools may not suffice, and plain programming may be necessary. We propose a guided query builder approach to reconcile expressivity and usability, i.e. to support the exploration of data, and the design of ad hoc transformations, through data-user interaction only. This approach is available online as a client-side web application, named Dexteris. Its strengths and weaknesses are evaluated on a representative use case, and compared to plain programming and ChatGPT-assisted programming.

We also propose Therapy, the first model-agnostic explainability method for language models that does not require input data. This method generates texts that follow the distribution learned by the classifier to be explained, through cooperative generation. Not depending on initial examples allows for applicability in situations where no data is available (e.g., for privacy reasons) and offers explanations of the global functioning of the model instead of multiple local explanations, thus providing an overview of the model's workings. Our experiments show that even without input data, Therapy provides instructive insights into the text features used by the classifier, which are competitive with those provided by methods using data.

Precise Segmentation for Children Handwriting Analysis by Combining Multiple Deep Models with Online Knowledge [31]. We present a strategy, called Seq2Seg, to reach both precise and accurate recognition and segmentation for children's handwritten words. Reaching such high performance for both tasks is necessary to give personalized feedback to children who are learning how to write. The first contribution is to combine the predictions of an accurate Seq2Seq model with the predictions of an R-CNN object detector. The second one is to refine the bounding box predictions provided by the detector with a segmentation lattice computed from the online signal. An ablation study shows that both contributions are relevant, and their combination is efficient enough for immediate feedback and achieves state-of-the-art results even compared to more informed systems.
Adaptation of AI Explanations to Users' Roles [48]. Surrogate explanations approximate a complex model by training a simpler model over an interpretable space. Among these simpler models, we identify three kinds of surrogate methods: (a) feature-attribution, (b) example-based, and (c) rule-based explanations. Each surrogate approximates the complex model differently, and we hypothesise that this can impact how users interpret the explanation. Despite the numerous calls for introducing explanations for all, no prior work has compared the impact of these surrogates on specific user roles (e.g., domain expert, developer). In this article, we outline a study design to assess the impact of these three surrogate techniques across different user roles.

Effects of Locality and Rule Language on Explanations for Knowledge Graph Embeddings [36].
Knowledge graphs (KGs) are key tools in many AI-related tasks such as reasoning or question answering. This has, in turn, propelled research in link prediction in KGs, the task of predicting missing relationships from the available knowledge. Solutions based on KG embeddings have shown promising results in this matter. On the downside, these approaches are usually unable to explain their predictions. While some works have proposed to compute post-hoc rule explanations for embedding-based link predictors, these efforts have mostly resorted to rules with unbounded atoms, e.g., bornIn(x,y)->residence(x,y), learned on a global scope, i.e., the entire KG. None of these works has considered the impact of rules with bounded atoms such as nationality(x,England)->speaks(x,English), or the impact of learning from regions of the KG, i.e., local scopes. We therefore study the effects of these factors on the quality of rule-based explanations for embedding-based link predictors. Our results suggest that more specific rules and local scopes can improve the accuracy of the explanations. Moreover, these rules can provide further insights about the inner workings of KG embeddings for link prediction.
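To make the notion of a bounded rule concrete, the sketch below measures the confidence of the example rule nationality(x,England)->speaks(x,English) on a tiny, invented KG. This is only an illustration of rule quality, not the paper's explanation pipeline:

```python
# Toy KG as a set of (subject, predicate, object) triples (invented facts).
kg = {
    ("alice", "nationality", "England"), ("alice", "speaks", "English"),
    ("bob",   "nationality", "England"), ("bob",   "speaks", "English"),
    ("carol", "nationality", "France"),  ("carol", "speaks", "French"),
    ("dave",  "nationality", "England"),  # no 'speaks' fact: a counterexample
}

def confidence(body_pred, body_obj, head_pred, head_obj):
    """Fraction of entities matching the rule body that also match the head."""
    body = {s for (s, p, o) in kg if p == body_pred and o == body_obj}
    head = {s for (s, p, o) in kg if p == head_pred and o == head_obj}
    return len(body & head) / len(body) if body else 0.0

# Bounded rule: nationality(x, England) -> speaks(x, English)
conf = confidence("nationality", "England", "speaks", "English")
print(conf)  # 2 of the 3 English nationals are known to speak English
```

A local scope, as studied in the paper, would amount to computing the same statistic on a region of the KG around the entity whose prediction is being explained, rather than on the whole triple set.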
Visualizing How-Provenance Explanations for SPARQL Queries [37]. Knowledge graphs (KGs) are vast collections of machine-readable information, usually modeled in RDF and queried with SPARQL. KGs have opened the door to a plethora of applications such as Web search or smart assistants that query and process the knowledge contained in those KGs. An important, but often disregarded, aspect of querying KGs is query provenance: explanations of the data sources and transformations that made a query result possible. In this article we demonstrate, through a Web application, the capabilities of SPARQLprov, an engine-agnostic method that annotates query results with how-provenance annotations. To this end, SPARQLprov resorts to query rewriting techniques, which make it applicable to already deployed SPARQL endpoints. We describe the principles behind SPARQLprov and discuss perspectives on visualizing how-provenance explanations for SPARQL queries.
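The core idea of how-provenance, annotating each answer with the source triples that produced it, can be sketched in a few lines of plain Python. This is a didactic stand-in, not SPARQLprov itself, and the triple identifiers and query are invented:

```python
# Each triple gets an identifier; answers carry the set of triples that
# justify them (a very simplified form of how-provenance annotation).
triples = {  # triple id -> (subject, predicate, object), invented data
    "t1": ("rennes", "locatedIn", "france"),
    "t2": ("france", "capital", "paris"),
    "t3": ("lyon",   "locatedIn", "france"),
}

def cities_in(country):
    """Return each answer together with the triples justifying it."""
    results = []
    for tid, (s, p, o) in triples.items():
        if p == "locatedIn" and o == country:
            results.append((s, {tid}))  # provenance = which triple matched
    return results

for city, why in sorted(cities_in("france")):
    print(city, sorted(why))
```

In SPARQLprov, the analogous bookkeeping is pushed into the rewritten SPARQL query itself, which is what makes the method work on unmodified, already deployed endpoints.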
Interactive Visualization of Counterfactual Explanations for Tabular Data [39], [46]. In this paper, we present an interactive visualization tool that exhibits counterfactual explanations to explain model decisions. Each individual sample is assessed to identify the set of changes needed to flip the output of the model. These explanations aim to provide end-users with personalized actionable insights with which to understand automated decisions. An interactive method is also provided so that users can explore various solutions. The functionality of the tool is demonstrated by its application to a customer retention dataset. The tool is compatible with any counterfactual explanation generator and decision model for tabular data.
Generating robust counterfactual explanations [38]. Counterfactual explanations have become a mainstay of the XAI field. This particularly intuitive statement allows the user to understand what small but necessary changes would have to be made to a given situation in order to change a model prediction. The quality of a counterfactual depends on several criteria: realism, actionability, validity, robustness, etc. In this paper, we are interested in the notion of robustness of a counterfactual. More precisely, we focus on robustness to counterfactual input changes. This form of robustness is particularly challenging, as it involves a trade-off between the robustness of the counterfactual and its proximity to the example to explain. We propose a new framework, CROCO, that generates robust counterfactuals while effectively managing this trade-off, and guarantees the user a minimal robustness. An empirical evaluation on tabular datasets confirms the relevance and effectiveness of our approach.
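The basic mechanics of counterfactual search, finding a small change to the input that flips the model's decision, can be illustrated with a greedy sketch. This toy example (invented model and features) shows the idea only; it is not the CROCO algorithm and ignores the robustness criterion that CROCO optimizes:

```python
# Greedy counterfactual search on a toy scoring model (illustrative only).

def model(x):
    """Toy credit model: approve iff the weighted score reaches 1.0."""
    return x["income"] * 0.5 + x["savings"] * 0.7 >= 1.0

def counterfactual(x, step=0.1, max_iter=100):
    """Nudge the most influential feature until the decision flips."""
    cf = dict(x)
    for _ in range(max_iter):
        if model(cf):
            return cf
        # 'savings' has the largest weight, so change it first (crude heuristic)
        cf["savings"] = round(cf["savings"] + step, 10)
    return None

x = {"income": 0.6, "savings": 0.5}   # rejected instance
cf = counterfactual(x)
print(model(x), model(cf), round(cf["savings"] - x["savings"], 2))
# → False True 0.5  (increasing savings by 0.5 flips the decision)
```

A robust counterfactual, in the sense studied above, would additionally require that small perturbations of `cf` still keep the decision flipped, which pushes the counterfactual further from the decision boundary and hence further from `x`: exactly the trade-off the paper manages.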

Impressions and Strategies of Academic Advisors When Using a Grade Prediction Tool During Term Planning [42]. Academic advising brings numerous benefits to the mission of Higher Education Institutions. Among its main actors are the advisors, who support students in defining appropriate academic roadmaps. One central and challenging duty of academic advisors is course recommendation for term planning. This task requires both knowledge of the study programs and a thorough analysis of the students' unique circumstances. If we add limited time and a large student population, the task becomes overwhelming. As a result, an important body of research has sought to expedite term planning via data-oriented decision-support tools. While the impact of such tools on students has been extensively studied, the advisors' perspective remains largely unexplored. We contribute to filling this gap by studying how a grade prediction tool shapes advisors' course recommendation strategies. Our observations suggest that while the advisors' usual strategies tend to prevail, their recommendations are mostly affected by the advisee's historical performance.

Computer Vision and Robotics
Participants: Simon Corbillé, Samuel Felton, Élisa Fromont, Elodie Germani. Some of the previously presented works also contribute to this research domain: [31].
Deep metric learning for visual servoing: when pose and image meet in latent space [32]. We propose a new visual servoing method that controls a robot's motion in a latent space. We aim to extract the best properties of two previously proposed servoing methods: we seek to obtain the accuracy of photometric methods such as Direct Visual Servoing (DVS), as well as the behavior and convergence of pose-based visual servoing (PBVS). Photometric methods suffer from a limited convergence area due to a highly nonlinear cost function, while PBVS requires estimating the pose of the camera, which may introduce some noise and incurs a loss of accuracy. Our approach relies on shaping (with metric learning) a latent space in which the representations of camera poses and the embeddings of their respective images are tied together. By leveraging the multimodal aspect of this shared space, our control law minimizes the difference between latent image representations thanks to information obtained from a set of pose embeddings. Experiments in simulation and on a robot validate the strength of our approach, showing that the sought-out benefits are effectively found.
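The servoing principle, driving the current latent code toward the goal code with a velocity proportional to the latent error, can be sketched without any learned embedding. The vectors and gain below are invented; the point is only the shape of the control loop:

```python
# Schematic latent-space servoing loop (illustrative; the real method uses a
# learned multimodal embedding, not raw vectors).

def control_step(z, z_goal, gain=0.3):
    """One step of an exponential-decay control law: e' = -lambda * e."""
    return [zi + gain * (gi - zi) for zi, gi in zip(z, z_goal)]

def distance(z, z_goal):
    return sum((zi - gi) ** 2 for zi, gi in zip(z, z_goal)) ** 0.5

z, z_goal = [1.0, -2.0, 0.5], [0.0, 0.0, 0.0]  # current and goal latent codes
d0 = distance(z, z_goal)
for _ in range(50):
    z = control_step(z, z_goal)
print(distance(z, z_goal) < 1e-3 < d0)  # → True (error decays geometrically)
```

In the paper, the interesting part is precisely what this sketch hides: the latent space is shaped by metric learning so that the error between image embeddings behaves like a pose error, combining DVS-like accuracy with PBVS-like convergence.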
On the benefits of self-taught learning for brain decoding [26]. Context. We study the benefits of using a large public neuroimaging database composed of fMRI statistic maps, in a self-taught learning framework, for improving brain decoding on new tasks. First, we leverage the NeuroVault database to train, on a selection of relevant statistic maps, a convolutional autoencoder to reconstruct these maps. Then, we use this trained encoder to initialize a supervised convolutional neural network to classify tasks or cognitive processes of unseen statistic maps from large collections of the NeuroVault database. Results. We show that such a self-taught learning process always improves the performance of the classifiers, but the magnitude of the benefits strongly depends on the number of samples available both for pre-training and fine-tuning the models and on the complexity of the targeted downstream task. Conclusion. The pre-trained model improves the classification performance and displays more generalizable features, less sensitive to individual differences.

Uncovering communities of pipelines in the task-fMRI analytical space [55], [60], [61]. Functional magnetic resonance imaging analytical workflows are highly flexible, with no definite consensus on how to choose a pipeline. While methods have been developed to explore this analytical space, there is still a lack of understanding of the relationships between the different pipelines. We use community detection algorithms to explore the pipeline space and assess its stability across different contexts. We show that there are subsets of pipelines that give similar results, especially those sharing specific parameters (e.g., number of motion regressors, software packages, etc.), with relative stability across groups of participants. By visualizing the differences between these subsets, we describe the effect of pipeline parameters and derive general relationships in the analytical space.
The HCP multi-pipeline dataset: an opportunity to investigate analytical variability in fMRI data analysis [56]. Results of functional Magnetic Resonance Imaging (fMRI) studies can be impacted by many sources of variability, including differences in the sampling of the participants, differences in acquisition protocols and material, but also different analytical choices in the processing of the fMRI data. While variability across participants or across acquisition instruments has been extensively studied in the neuroimaging literature, the root causes of analytical variability remain an open question. Here, we share the HCP multi-pipeline dataset, including the resulting statistic maps for 24 typical fMRI pipelines on 1,080 participants of the HCP-Young Adults dataset. We share both individual and group results, for 1,000 groups of 50 participants, over 5 motor contrasts. We hope that this large dataset covering a wide range of analysis conditions will provide new opportunities to study analytical variability in fMRI.
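The community-detection idea used to group pipelines with similar results can be approximated very crudely: link pipelines whose output similarity exceeds a threshold and take connected components. The similarity values below are invented, and real analyses would use a modularity-based algorithm rather than this stand-in:

```python
# Crude stand-in for community detection over pipeline similarities
# (illustrative; pairwise similarities are hypothetical numbers).

similarity = {
    ("p1", "p2"): 0.92, ("p1", "p3"): 0.15, ("p2", "p3"): 0.21,
    ("p3", "p4"): 0.88, ("p1", "p4"): 0.10, ("p2", "p4"): 0.18,
}

def communities(pipelines, sim, threshold=0.8):
    """Union-find over edges above the threshold; components = 'communities'."""
    parent = {p: p for p in pipelines}
    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p
    for (a, b), s in sim.items():
        if s >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for p in pipelines:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())

print(communities(["p1", "p2", "p3", "p4"], similarity))
# → [['p1', 'p2'], ['p3', 'p4']]
```

In the studies above, the interesting finding is that such subsets tend to align with specific pipeline parameters (motion regressors, software package), which this toy sketch cannot show.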

Agriculture
Participants: Christine Largouët.

Real-time combination of observed growth and feed intake performance with performance simulated by InraPorc® to apply precision feeding to growing pigs [57]. Precision feeding (PF) of growing pigs requires methods for real-time analysis of performance and prediction of nutritional requirements. Two calculation methods for reducing nutrient intake were evaluated. Individual daily kinetics of body weight (BW) and feed intake (FI) of 285 pigs, reared from 81 to 156 days of age (ad libitum feeding), were used. The PF1 approach (from the Feed-a-Gene project) used the Holt-Winters and MARS methods to predict individual daily FI and BW, respectively. The standardised digestible lysine (dLys) requirement was calculated daily from the predicted performance using the factorial method. The PF2 method used 2200 virtual pigs whose performance was simulated using InraPorc®. By comparing FI and BW dynamics of real and virtual pigs, the 10 closest virtual pigs were selected daily for each real pig. Individual daily performance and expected dLys requirements were obtained by averaging the InraPorc® data of these 10 virtual pigs. PF was then simulated for each real pig. For each method, the blend proportions of two diets (A and B, 9.7 MJ net energy (NE)/kg, crude protein: 16.9% and 9.3%, dLys: 1.0 and 0.4 g/MJ NE, respectively) were calculated daily to achieve the calculated requirements. Nitrogen (N) and dLys intake and N excretion were calculated individually. A two-phase (2-P) feeding strategy was also simulated (A:B = 83:17 until 65 kg BW, 50:50 afterwards). Compared to 2-P, N and dLys intake and N excretion were reduced by, respectively, 6.7%, 9.7% and 11.9% with PF1, and by 9.2%, 13.3% and 16.3% with PF2. The PF2 method provided better day-to-day stability of performance predictions, leading to a more regular decrease in N and dLys intakes during growth. The potential of this new method needs to be confirmed under field conditions.

The relational database SOWELL [21] was created to better understand the behaviour and individual responses of gestating sows facing different induced short-term events: a competitive situation for feed, hot and cold thermal conditions, a sound event, an enrichment (straw, ropes and bags available) and an impoverishment (no straw, no objects) of the pen. The data were collected on 102 crossbred sows equipped with activity sensors, group-housed in video-recorded pens (16-18 sows per pen), with access to automatons. Feeding and drinking behaviours were extracted from the electronic feeders' and drinkers' recordings. Social behaviours, physical activities and locations in the pen were recorded through manual video-analysis labelling at the individual scale. Accelerometers fixed on the sows' ears also recorded individual physical activities. The physical activity was also determined at a group scale by automatic video analysis using deep learning techniques. BW, back fat thickness, and body condition (cleanliness, body damage) were recorded weekly during the whole gestation. Finally, gestation room data regarding environmental conditions (temperature, humidity, noise level) were recorded using automatic sensors. The database can fulfil different research purposes, namely sows' nutrition, for example to better calculate the energy requirements with regard to environmental factors, or welfare and health during gestation by providing indicators.

[59]. Background and Objectives. Precision feeding aims to define the right feeding strategy according to an individual's nutrient requirements, in order to improve health and reduce feed costs. Usually, the nutrient requirements of gestating sows are provided by a mechanistic nutritional model requiring input data such as age and body status. This paper proposes to predict the daily nutritional requirements with only the data measured by sensors. According to various digital farm configurations, we explore and evaluate Machine Learning (ML) methods to predict the nutrient requirements of gestating sows. Material and Methods. Behavioural data of gestating sows were extracted from sensor data collected on 73 sows from parities 1 to 9. Their nutrient requirements concerned metabolisable energy (ME, in MJ/d) and standardised ileal digestible lysine (SID Lys, in g/d). Various digital farm configurations are proposed, from low-cost to more expensive equipment (electronic feeder and drinker, connected weight scale, accelerometers and video analysis software), producing various data at different levels of detail on sow behaviour. Nine ML algorithms were trained on these 23 scenarios to predict daily energy and lysine requirements for each sow. The results proposed by the ML algorithms are compared with the outputs given by the nutritional model InraPorc. Results. Using a Random Forest algorithm, the RMSE obtained with feeder data alone (1.22 MJ/d for ME and 0.53 g/d for SID Lys; 2.4% and 4.02% of mean absolute error, respectively) were higher than those obtained with combined data from feeders and accelerometers (1.01 MJ/d and 0.29 g/d; 1.9% and 2.1%). The inclusion of the sows' characteristics reduced the RMSE, on average, by 20% for ME and by 35% for Lys. Discussion and Conclusion. This study highlights that the daily requirements of gestating sows can be predicted accurately from behavioural data provided by sensors. It paves the way to simpler solutions for the application of precision feeding on farms.

[23]. Precision feeding is a strategy for supplying an amount and composition of feed as close as possible to each animal's nutrient requirements, with the aim of reducing feed costs and environmental losses. Usually, the nutrient requirements of gestating sows are provided by a nutrition model that requires input data such as sow and herd characteristics, but also an estimation of future farrowing performances. New sensors and automatons, such as automatic feeders and drinkers, have been developed on pig farms over the last decade, and have produced large amounts of data. This study evaluated machine-learning methods for predicting the daily nutrient requirements of gestating sows, based only on sensor data, according to various configurations of digital farms. The data of 73 gestating sows were recorded using sensors such as electronic feeder and drinker stations, connected weight scales, accelerometers, and cameras. Nine machine-learning algorithms were trained on various dataset scenarios according to different digital farm configurations (one or two sensors), in order to predict the daily metabolizable energy and standardized ileal digestible lysine requirements for each sow. The prediction results were compared to those of the InraPorc model, a mechanistic model for the precision feeding of gestating sows. The scenario predictions were also evaluated with or without the housing conditions and sow characteristics at artificial insemination usually integrated into the InraPorc model. Adding housing and sow characteristics to sensor data improved the mean average percentage error by 5.58% for lysine and by 2.22% for energy. The highest correlation coefficient values for lysine (0.99) and for energy (0.95) were obtained for scenarios involving an automatic feeder system only (daily duration and number of visits with or without consumption). The scenarios including an automatic feeder combined with another sensor gave good performance results. For the scenarios using sow and housing characteristics and the automatic feeder only, the root mean square error was lower with Gradient Tree Boosting (0.91 MJ/d for energy and 0.08 g/d for lysine) compared with linear regression (2.75 MJ/d and 1.07 g/d). The results of this study show that the daily nutrient requirements of gestating sows can be predicted accurately using data provided by sensors and machine-learning methods. It paves the way to simpler solutions for precision feeding.
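The virtual-pig matching idea summarized above (select the closest simulated pigs for each real pig, then average their requirements) is essentially a k-nearest-neighbor scheme. A minimal sketch, with entirely invented numbers and a hypothetical distance, could look like this:

```python
# Sketch of PF2-style matching: compare a real pig's feed intake (FI) and body
# weight (BW) to simulated pigs, and average the requirement of the k closest.

virtual_pigs = [  # (FI kg/d, BW kg, dLys requirement g/MJ NE), invented values
    (2.1, 60.0, 0.95), (2.3, 64.0, 0.90), (2.6, 70.0, 0.85),
    (2.8, 75.0, 0.80), (3.0, 82.0, 0.75),
]

def predicted_requirement(fi, bw, k=2):
    """Average dLys requirement of the k virtual pigs closest to (fi, bw)."""
    def dist(pig):
        # BW rescaled so both features weigh comparably (arbitrary choice)
        return ((pig[0] - fi) ** 2 + ((pig[1] - bw) / 10) ** 2) ** 0.5
    closest = sorted(virtual_pigs, key=dist)[:k]
    return sum(p[2] for p in closest) / k

print(round(predicted_requirement(2.4, 66.0), 3))  # → 0.875
```

The real method works on daily trajectories rather than single points and uses InraPorc® simulations as the virtual population, but the averaging-over-neighbors structure is the same.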

Prediction of the daily nutrient requirements of gestating sows based on sensor data and machine-learning algorithms
Estimation of gestating sows' welfare status based on machine learning methods and behavioral data [22]. Estimating welfare status at the individual level on the farm is a key issue for improving livestock monitoring. New technologies offer opportunities to analyze livestock behavior with machine learning and sensors. The aim of the study was to estimate some components of the welfare status of gestating sows based on machine learning methods and behavioral data. The dataset used was a combination of individual and group measures of behavior (activity, social and feeding behaviors). A clustering method was used to estimate the welfare status of 69 sows (housed in four groups) during different periods (sum of 2 days per week) of gestation (between 6 and 10 periods, depending on the group). Three clusters were identified and labelled (scapegoat, gentle and aggressive). Environmental conditions and the sows' health influenced the proportion of sows in each cluster, contrary to the characteristics of the sow (age, body weight or body condition). The results also confirmed the importance of group behavior for the welfare of each individual. A decision tree was learned and used to classify the sows into the three categories of welfare derived from the clustering step. This classification relied on data obtained from an automatic feeder and automated video analysis, achieving an accuracy rate exceeding 72%.This study showed the potential of an automatic decision support system to categorize welfare based on the behavior of each gestating sow and the group of sows.
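The two-step idea above (cluster sows into welfare categories, then classify new observations into those categories) can be illustrated with a nearest-centroid assignment. The features and centroid values below are entirely invented:

```python
# Toy nearest-centroid assignment into the three welfare labels from the study
# (centroids and behavioral features are hypothetical).

centroids = {  # (activity level, aggressive acts per day)
    "gentle": (0.3, 0.5), "aggressive": (0.6, 4.0), "scapegoat": (0.2, 0.2),
}

def welfare_label(activity, aggression):
    """Assign a sow to the closest cluster center (squared distance)."""
    def dist(label):
        a, g = centroids[label]
        return (a - activity) ** 2 + (g - aggression) ** 2
    return min(centroids, key=dist)

print(welfare_label(0.55, 3.5), welfare_label(0.25, 0.3))
# → aggressive scapegoat
```

The study itself goes further: the clusters are discovered from data rather than fixed, and a decision tree (rather than centroids) performs the final classification, which makes the decision rules inspectable by the farmer.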

Machine Learning on Sequences
Participants: Abderaouf N Amalou, Élisa Fromont.

[28]. This paper presents CAWET, a hybrid worst-case program timing estimation technique. CAWET identifies the longest execution path using static techniques, whereas the worst-case execution time (WCET) of basic blocks is predicted using an advanced language processing technique called Transformer-XL. By employing Transformer-XL in CAWET, the execution context formed by previously executed basic blocks is taken into account, allowing the microarchitecture of the processor pipeline to be considered without explicit modeling. Through a series of experiments on the TacleBench benchmarks, using different target processors (Arm Cortex M4, M7, and A53), our method is demonstrated to never underestimate WCETs and is shown to be less pessimistic than its competitors.
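The hybrid structure described above, combining per-block timing estimates with a static longest-path search, can be sketched as a longest-path computation over a control-flow DAG. The block timings below are invented stand-ins for what CAWET would predict with Transformer-XL:

```python
# Schematic WCET combination (illustrative, not CAWET): per-basic-block WCETs
# are combined by a longest-path search over the control-flow DAG.

block_wcet = {"A": 10, "B": 25, "C": 7, "D": 12}   # cycles, hypothetical
edges = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def program_wcet(entry):
    """Longest (most expensive) path from entry to any exit, memoized."""
    memo = {}
    def longest(b):
        if b not in memo:
            succ = edges[b]
            memo[b] = block_wcet[b] + (max(longest(s) for s in succ) if succ else 0)
        return memo[b]
    return longest(entry)

print(program_wcet("A"))  # worst path A -> B -> D → 47
```

The hard part in practice, and CAWET's contribution, is obtaining per-block values that account for the execution context and pipeline effects without an explicit microarchitectural model.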

Software Engineering
Participants: Peggy Cellier, Sébastien Ferré.

[50]. This book chapter illustrates the basic concepts of fault localization using a data mining technique. It uses the Trityp program to illustrate the general method. Formal concept analysis and association rules are two well-known methods for symbolic data mining. In their original inception, they both consider data in the form of an object-attribute table. The chapter considers a debugging process in which a program is tested against different test cases. Two attributes, PASS and FAIL, represent the outcome of each test case. The chapter extends the analysis of data mining for fault localization to multiple-fault situations. It addresses how data mining can be further applied to fault localization for GUI components. Unlike traditional software, GUI test cases are usually event sequences, and each individual event has a unique corresponding event handler.
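The object-attribute view described above can be made concrete with a tiny coverage matrix: each test case is an object, and the program lines it executes plus its PASS/FAIL verdict are its attributes. The sketch below ranks lines by the confidence of the association "line executed implies FAIL" (a simplified illustration, with invented coverage data, not the chapter's full FCA machinery):

```python
# Toy spectrum-based illustration: for each line, how strongly does
# "line executed" associate with the FAIL attribute?

test_cases = [  # (lines covered, verdict), invented coverage data
    ({1, 2, 3}, "PASS"),
    ({1, 3, 4}, "FAIL"),
    ({1, 2, 4}, "FAIL"),
    ({1, 2, 3}, "PASS"),
]

def fail_confidence(line):
    """Confidence of the association rule: line covered -> FAIL."""
    covering = [v for lines, v in test_cases if line in lines]
    return covering.count("FAIL") / len(covering) if covering else 0.0

ranking = sorted({1, 2, 3, 4}, key=fail_confidence, reverse=True)
print(ranking[0], fail_confidence(ranking[0]))  # → 4 1.0  (line 4 is suspect)
```

Line 4 is executed only by failing tests, so it tops the suspiciousness ranking; the chapter's FCA-based approach organizes such rules into a lattice to handle multiple faults.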
Contract amount: 30k€ + PhD salary
Context. This project is a collaboration with Orange Labs Lannion on interpretable machine learning. The Orange company aims to develop the use of machine learning algorithms to enhance the services it proposes to its customers (for instance, credit acceptance or attribution prediction). This entails the development of generic approaches for providing interpretable decisions to customers or client managers.
Objective. The GDPR, implemented by the EU in 2018, stipulates the right to explanations for EU citizens with regard to decisions made from personal data. In a society where many of those decisions are computer-assisted via machine learning algorithms, interpretable ML is crucial. A promising way to convey explanations for the outcomes of ML models is counterfactual explanations. The focus of the PhD thesis financed by this project is the generation of usable and actionable counterfactual explanations for ML classifiers, which are intensively used by Orange within their services.
Additional remarks. This contract finances the PhD of Victor GUYOMARD by Orange.

Contract amount: 70k€ + PhD salary
Context. This project is a collaboration with Stellantis and focuses on the development of interpretable machine learning models for multivariate time series data. Utilizing a range of sensors integrated within vehicles, these models are designed to make real-time decisions. Providing drivers with clear explanations of these decisions is a key aspect. We specifically concentrate on counterfactual explanations, which not only clarify why a particular decision was made but also illustrate how alternative scenarios might have led to different outcomes.
Objective. Current approaches providing counterfactual explanations for time series models are limited to univariate time series. In this project, we aim to develop approaches that handle multivariate time series, which requires capturing the correlations between the series.
Additional remarks. This is the doctoral contract for the PhD of Paul Sévellec (Thèse CIFRE).

Visits to international teams
Research stays abroad
Elodie Germani (PhD student supervised by Elisa Fromont with EMPENN) spent 3 months at Concordia University (Montreal, Canada) with a Mitacs Globalink Research Award (GRA), in partnership with the Centre National de la Recherche Scientifique (CNRS), on the project "Improving rs-fMRI-derived biomarkers of Parkinson's Disease".

European initiatives 10.2.1 H2020 projects
Elisa Fromont, Alexandre Termier and Luis Galárraga are all members (within Inria) of the H2020 ICT-48 project TAILOR "Foundations of Trustworthy AI - Integrating Reasoning, Learning and Optimization". Elisa Fromont is responsible for Tasks 3.7 and 3.8 (roadmap and synergies with industry) in WP3.
The Inria Project Lab HyAIAI is a consortium of Inria teams (Sequel, Magnet, Tau, Orpailleur, Multispeech, and LACODAM) that work together towards the development of novel machine learning methods that combine numerical and symbolic approaches. The goal is to develop new machine learning algorithms such that (i) they are as efficient as the current best approaches, (ii) they can be guided by means of human-understandable constraints, and (iii) their decisions can be better understood. The project ended in June 2023.
#DigitAg is a "Convergence Institute" dedicated to the increasing importance of digital techniques in agriculture. Its goal is twofold: first, conducting innovative research on the use of digital techniques in agriculture in order to improve competitiveness, preserve the environment, and offer decent living conditions to farmers; second, preparing future farmers and agricultural policy makers to successfully exploit such technologies. While #DigitAg is based in Montpellier, Rennes is a satellite of the institute focused on cattle farming.
LACODAM is involved in the "data mining" challenge of the institute, which A. Termier co-leads. He is also the representative of Inria in the steering committee of the institute. The interest for the team is to design novel methods to analyze and represent agricultural data, which are challenging because they are both heterogeneous and multi-scale (both spatial and temporal).
The WAIT 4 project is part of the "Agroecology and numeric" PEPR. The goal of this project is to provide the scientific basis for significant improvements in the well-being of farm animals. Up to now, animal well-being has been evaluated with indicators of the means deployed (e.g. available space, method to control building temperature, time spent outside). The goal of WAIT 4 is to provide the tools required to move to result indicators: can some guarantees be given on the well-being of animals? Can this well-being (or lack thereof) be correlated to management actions of the farmer, or to the animals' general living conditions?
This requires a much finer understanding of the animals' mental as well as physiological state. The project is led by Inrae (Florence Gondret), which brings animal science specialists, ranging from biologists to ethologists. CEA provides expertise on blood sensors, to measure molecules linked to stress. Inria as well as Insa Lyon provide computer science expertise for tools to analyse the data. More precisely, the LACODAM team will deal first with analyzing time series of numerical sensor data (e.g. temperature, activity), and second with categorical sequences of events produced by annotation tools from the analysis of videos. Both will help to better model animal behavior, and to determine which behaviors are "normal" and which are anomalous and may be linked to bad conditions for the animals.
• Bourse IUF - Elisa FROMONT. This project supports the work of Elisa Fromont both with a reduction of teaching load and with some research money (15k€ / year for 5 years). Elisa is currently working on designing effective data mining and machine learning algorithms for real-life data (which are scarce, heterogeneous, multimodal, imbalanced, temporal, ...). For the next few years, Elisa would like to focus on the interpretability of the results obtained by these algorithms. In pattern mining, her goal is to design algorithms which can directly mine a small number of relevant patterns. In the case of black-box machine learning models (e.g. deep neural nets), Elisa would like to design methods to help the end user understand the decisions taken by the model.
Scikit-mine (SKM for short) is a Python library of pattern mining algorithms, designed to be compatible with the well-known scikit-learn library. It allows practitioners to use state-of-the-art pattern mining algorithms through a library that has the same usage interface as scikit-learn and exploits the same data types. SKM was developed by CNRS AI engineers in the context of the F-WIN project of the PNR-IA program of CNRS, whose general goal is to improve the development of AI software in research teams of CNRS labs.

How can we fully automatically choose the best explanation for a given use case in classification?
Answering this question is the raison d'être of the JCJC ANR project FAbLe. By "best explanation" we mean an explanation that is both understandable by humans and faithful, among a universe of possible explanations. We focus on local explanations, i.e., when we want to explain the answer of a black-box model for a given use case, which we call the "target instance". We argue that the choice of the best explanation depends on (i) the data, namely the model, the explanation technique, and the target instance, and (ii) the recipients of the explanations. Hence our research is focused on two main questions: "What makes an explanation suitable (interpretable and faithful) for a particular instance and model?" and "What is the effect of the different AI-based explanation techniques and visual representations on users' comprehension and trust?". Answering these questions will help us understand and automate the selection of a particular explanation style based on the use case. Our ultimate goal is to produce a suite of algorithms that will compute suitable explanations for ML algorithms based on our insights into what is interpretable. User studies on different explanation settings (methods and visual representations) will allow us to characterize the features of explanations that make them acceptable (i.e., understandable and trustworthy) to users.
• Formal Concept Analysis (FCA) is a mathematical framework based on lattice theory and aimed at data analysis and classification. FCA, which is closely related to pattern mining in knowledge discovery (KD), can be used for data mining purposes in many application domains, e.g. life sciences and linked data. Moreover, FCA is human-centered and provides means for visualization and interaction with data and patterns. It is now possible to deal with complex data such as intervals, sequences, trajectories, trees, and graphs. Research in FCA is dynamic, but there is still room for extensions of the original formalism, and many theoretical and practical challenges remain. In particular, no consensual platform currently offers the necessary components for analyzing real-life data. This is precisely the objective of the SmartFCA project: to develop the theory and practice of FCA and its extensions, to make the related components interoperable, and to implement a usable and consensual platform offering the necessary services and workflows for KD.
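As a concrete illustration of the FCA formalism, the following sketch enumerates the formal concepts of a toy binary context using the two derivation operators; the context itself is invented for illustration and the enumeration-by-closure strategy is the naive textbook approach, not a SmartFCA algorithm.

```python
from itertools import combinations

def intent(objs, context):
    """Derivation operator: attributes shared by all objects in objs."""
    if not objs:
        return {a for attrs in context.values() for a in attrs}
    return set.intersection(*(context[o] for o in objs))

def extent(attrs, context):
    """Derivation operator: objects that have all attributes in attrs."""
    return {o for o, a in context.items() if attrs <= a}

# Toy formal context (objects -> attributes), invented for illustration.
context = {
    "duck":  {"flies", "swims"},
    "swan":  {"flies", "swims", "big"},
    "eagle": {"flies", "big"},
}

# A formal concept is a pair (A, B) with A = extent(B) and B = intent(A).
# Closing every subset of objects enumerates all concepts.
concepts = set()
objects = list(context)
for r in range(len(objects) + 1):
    for subset in combinations(objects, r):
        B = intent(set(subset), context)
        A = extent(B, context)
        concepts.add((frozenset(A), frozenset(B)))

for A, B in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(A), "<->", sorted(B))
```

The concepts, ordered by extent inclusion, form the concept lattice that FCA tools visualize and that supports the interaction with data and patterns mentioned above.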
In particular, to best satisfy the needs of experts in many application domains, SmartFCA will offer a "Knowledge as a Service" (KaaS) component for making domain knowledge operable and reusable on demand.

In MeKaNo, we aim to search the web with Things, in order to get more accurate results over a wide diversity of sources. Traditional web search engines search the web with strings. However, keyword search often returns many irrelevant documents, pushing users to refine their keyword list through a trial-and-error process. To overcome such limitations, major companies made it possible to search for Things, not strings. When you ask your vocal assistant for the age of "James Cameron", it locates in a Knowledge Graph (KG) a Person matching "James Cameron" whose property "age" is set to 66 years, i.e. the Thing "James Cameron". While searching for Things is tremendous progress and delivers exact answers, the search is done over a Knowledge Graph and not over the Web. Consequently, many answers that exist on the web may not be part of the knowledge graph.
To summarize, searching with strings over the web offers diversity at the expense of noise, while searching for Things delivers exact answers but loses diversity. In MeKaNo, we aim at searching the web with Things to get diversity and avoid noisy results. To do so, we face three main scientific challenges:
1. Users are used to searching with keywords. Transforming a keyword query into a mixed query that first searches over a KG and then over the web is difficult, especially for complex queries.
2. As with traditional web searches, users expect ranked results in a snap. Combining KG search and web search while preserving performance is highly challenging and requires a new kind of search engine.
3. Improving the connection between the web of microdata and Knowledge Graphs requires large-scale entity matching between microdata entities and KG entities.
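The contrast between the two search modes can be sketched as follows. The tiny knowledge graph, the document list, and the `search` function are all hypothetical illustrations of the idea, not MeKaNo components (only the "James Cameron" age example comes from the text above).

```python
# Toy "search with Things": first try to resolve the query against a
# knowledge graph (exact, structured answer), then fall back to plain
# string search over documents (diverse but noisy).
kg = {
    "James Cameron": {"type": "Person", "age": 66},
}

documents = [
    "James Cameron directed Avatar.",
    "Cameron Diaz starred in The Mask.",
]

def search(query):
    q = query.lower()
    # 1. Entity linking: does the query mention a Thing in the KG?
    for entity, props in kg.items():
        if entity.lower() in q:
            wanted = {p: v for p, v in props.items() if p in q}
            if wanted:
                return wanted  # exact answer from the KG
    # 2. Fallback: keyword search over the documents.
    return [d for d in documents if any(w in d for w in query.split())]

print(search("age of James Cameron"))  # exact KG answer
print(search("Cameron movies"))        # string search, includes noise
```

The second query shows the noise problem: plain string matching also returns the "Cameron Diaz" document, which is exactly the diversity/precision trade-off the three challenges above address.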

Member of the steering committees
• Peggy Cellier is a member of the steering committee of the international conference ECML PKDD.

Member of the editorial boards
• Peggy Cellier is a member of the editorial board of ICFCA.
• Sébastien Ferré is a member of the editorial board of ICFCA.

Chair of conference program committees
• Luis Galarraga Del Prado, Tassadit Bouadi: organization of the AIMLAI (Advances in Interpretable Machine Learning and Artificial Intelligence) workshop, co-located with the ECML/PKDD conference that took place in Turin on the week of September 18-22, 2023.
• 24/08/2023: invited talk (in French) about "L'IA à la ferme : moissonner les données pour quels fruits ?" (AI on the farm: harvesting data, for what fruits?), annual congress of the agriculture chambers of the Atlantic Arc (AC3A), Bruz.

Scientific expertise
• Tassadit Bouadi: member of the working group of Axis 2 'Research and Innovation Program' within the IRIS-E program at the University of Rennes.
• Peggy Cellier: member of the following recruiting committees (2): IUT Annecy/LISTIC, ISTIC/IRISA (3 positions for one committee). Also substitute member of the IUT Lannion MCF recruiting committee.
• Christine Largouët: Member of the CSTP of the PEPR Agroecology and ICT.

Research administration
• Peggy Cellier is in charge of the PhD students of the IRISA lab (monthly "commission personnel", etc.). She is an elected member of the "Conseil de composante" of the Computer Science department of INSA Rennes. She is also a member of the "Conseil de l'école doctorale MATISSE".

Teaching
Apart from Luis Galarraga Del Prado (research scientist) and Gaelle Tworkowski (administrative assistant), all permanent members of the project-team LACODAM are also faculty members and are actively involved in computer science teaching programs at ISTIC, the IUT of Lannion, INSA, or Agrocampus-Ouest. Besides these usual teaching duties, LACODAM is responsible for some teaching tracks and courses, among them "Advanced Database and Semantic Web" (Master 2) and "Use and functionalities of an operating system" (Licence 3), as well as 4 hours (in English) of the data mining course (DMV) at Master 2 SIF and a 2-hour lecture, also at Master 2 SIF, on "Qu'est-ce qu'une thèse, un doctorat, un•e doctorant•e ?" (What is a thesis, a doctorate, a doctoral student?).
• Alexandra Padonou (M1); supervisors: Peggy Cellier, Sebastien Ferre; title: Evaluation of the capabilities and performance of LLMs for question answering over knowledge graphs.
• Sebastien Ferre was a member of the mid-term evaluation jury of Thimotée Neithoffer.
• Luis Galarraga Del Prado was a member of the mid-term evaluation committees of Ataollah Kamal (INSA Lyon) and Sacha Corbugy (Université de Namur).
• Alexandre Termier was a member of the mid-term evaluation committees of Valentin Guien (University of Clermont-Ferrand), Armel Soubiega (University of Clermont-Ferrand), Lise Morice (University of Rennes), and Erwan Vincent (University of Rennes).
• Elisa Fromont was a member of the mid-term evaluation committee of Ewan MOREL-CORLU (Rennes).
• Laurence Rozé was a member of the mid-term evaluation jury of Nolwenn Pinczon du sel.
• Christine Largouët was a member of the mid-term evaluation jury of Baptiste Sorin (BIOEPAR, INRAE).

Interventions
• Luis Galarraga Del Prado participated (2 hours) in the tutorial on eXplainable Graph ML (in collaboration with Megha Khosla from TU Delft), organized within the AIMLAI workshop that took place at the ECML/PKDD 2023 conference (Sept 18-22).

Scientific events: organisation
As part of the scientific animation of the DKM (D7) research department at IRISA, Elisa Fromont (head of the department) co-organises monthly seminars which have featured, in 2023: Damien Eveillard, Frederic Jurie, Colin de la Higuera, Sihem Amer-Yahia, Meghyn Bienvenu, Hendrik Blockeel, Aurélien Bellet.

General chair, scientific chair
Romaric Gaudel was co-program chair of CAP 2023 (the French conference on Machine Learning) in Strasbourg.

• Deep learning-based analysis of the early development of bovine embryos from videomicroscopy.
Simon Corbillé (supervised by Elisa Fromont and Eric Anquetil) obtained the best poster award for his work on "Precise Segmentation for Children Handwriting Analysis by Combining Multiple Deep Models with Online Knowledge" at ICDAR 2023. Maëva Durand (supervised by Christine Largouet and Charlotte Gaillard (INRAE)) won the PhD prize of the Association Française de Zootechnie (AFZ) for her PhD "Alimentation sur mesure et estimation du bien-être des truies gestantes à partir de données hétérogènes" (Tailored feeding and welfare estimation of gestating sows from heterogeneous data). This PhD was funded by the DigitAg project.

"Honey, Tell Me What's Wrong": Global Explainability of NLP Models through Cooperative Generation [45].
The ubiquity of machine learning has highlighted the importance of explainability algorithms. Among these, model-agnostic methods generate artificial examples by slightly modifying original data and then observing changes in the model's decision-making on these artificial examples. However, such methods require initial examples and provide explanations only for the decisions based on these examples.
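The perturbation scheme underlying such model-agnostic methods can be sketched as follows, assuming a toy black-box text classifier; this illustrates the existing approach being discussed, not the cooperative-generation method of the paper itself, and all names in the code are invented.

```python
import random

# Toy black-box classifier: "positive" iff the text contains "great".
def black_box(words):
    return 1 if "great" in words else 0

def explain(instance, model, n_samples=500, seed=0):
    """Perturbation-based local explanation: drop random subsets of words,
    query the model on each perturbed example, and score each word as the
    difference between the average prediction when it is kept and when it
    is dropped."""
    rng = random.Random(seed)
    present = {w: [0.0, 0] for w in instance}  # [sum of preds, count] when kept
    absent = {w: [0.0, 0] for w in instance}   # [sum of preds, count] when dropped
    for _ in range(n_samples):
        kept = [w for w in instance if rng.random() < 0.5]
        pred = model(kept)
        for w in instance:
            bucket = present if w in kept else absent
            bucket[w][0] += pred
            bucket[w][1] += 1
    return {w: present[w][0] / max(present[w][1], 1)
               - absent[w][0] / max(absent[w][1], 1)
            for w in instance}

weights = explain(["the", "movie", "was", "great"], black_box)
print(max(weights, key=weights.get))
```

Note that the explanation is computed around one initial instance, which is exactly the limitation the summary above points out: without that starting example, the method has nothing to perturb.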

SmartFCA: A Smart Tool for Analyzing Complex Data with Formal Concept Analysis Participants:
Sébastien Ferré, Peggy Cellier.

• Véronique Masson is the head of the L3 studies in Computer Science at University of Rennes.
• Alexandre Termier is co-head of Master 2 SIF (Science Informatique - research master in Computer Science) at University of Rennes, with Bertrand Coüasnon (INSA Rennes).
• Sebastien Ferre is the head of Master M1 Miage and of the EIT international master track in Data Science (about 75 students).
• Peggy Cellier is the head of the last year of the Computer Science Department at INSA (master 2 level, about 70 students).
• Tassadit Bouadi was head of continuation of studies at the IUT of Lannion (computer science department) until July 2023. Since September 2023, she has been co-head (with Romaric Gaudel) of the future Masters M1 and M2 in Artificial Intelligence at ISTIC, University of Rennes.
• Peggy Cellier is in charge of the APC (Approche par compétences; competency-based approach) development for the Computer Science Department. She is also part of the IDPE (Ingénieur diplômé par l'état) committee of INSA Rennes, and represents INSA Rennes in the CMA (Compétence et Métier d'Avenir) IA TIAre.
• Laurence Rozé is in charge of communication at the computer science department at INSA Rennes.
• Julie Boudebs, 2021-2024; supervisors: Peggy Cellier and Sebastien Ferre; title: Un assistant en langue naturelle pour interroger le Web sémantique (A natural-language assistant for querying the Semantic Web), ED Matisse.
• Olivier Gauriau (Inria, DigitAg, Acta Toulouse), 2021-2024; supervisors: Luis Galarraga Del Prado, François Brun, Alexandre Termier and David Makowski; title: Numerical Rule Mining for the Prediction of the Dynamics of Crop Diseases, ED Matisse.
• Supervisors: Peggy Cellier, Bruno Crémilleux and Alexandre Termier; title: Data mining methods for discovering behaviors related to animal well-being in precision farming data, ED Matisse.
• Pierre Maurand, 2022-2025; supervisors: Tassadit Bouadi, Peggy Cellier, Bruno Crémilleux (GREYC) and Alexandre Termier; title: Tell me your preferences and I will show you what you are interested in, ED Matisse.
• Vanessa Fokou, 2022-2025; supervisors: Florence Le Ber and Xavier Dolques (Univ. Strasbourg), Sebastien Ferre, Peggy Cellier; title: Comparison and cooperation of different Formal Concept Analysis approaches for relational data, Univ. Strasbourg.