Abstract
- What is the initial representation of the provided database? That is: What are the objects (rows) and what are the descriptive attributes (columns)?
- What is the considered pattern language used to describe objects in the database w.r.t. the pattern mining task and the provided information?
- How to check whether a pattern holds for some object in the database?
- How to evaluate the “interestingness” of the findings (i.e. a quality measure, constraints that patterns or the pattern set need to have, etc.)?
- To revisit, from an order-theoretic point of view, the formalization of a common pattern mining task through the pattern setup framework [6]. This provides a better-understanding of the underlying properties that a pattern language could have.
- To instantiate the formalism for various pattern languages and to show how it fits well the usual pattern mining tasks.
- To explore various approaches to enumerate patterns in a pattern setup:
- Exhaustive approach. Leverage pattern setup properties. For instance, closure operators [39,40,10], anti-exchange property [21,7], etc.
- Sampling approach. Use direct sampling approach [11,19] to generate patterns such that the probability that some pattern is discovered is proportional to its quality (e.g. frequency, area, discriminativity).
- Anytime approach. Enumerate progressively the pattern search space [5,12] providing patterns with increasing quality anytime.
We give here an overview of the notions that will be discussed in the tutorial.
A pattern language or a description space is a partially ordered set (poset) $(\mathcal{D}, \sqsubseteq)$ where elements of $\mathcal{D}$ are called patterns and $\sqsubseteq$ is a subsumption order ; i.e. for two patterns $d_1, d_2 \in \mathcal{D}$, $d_1$ subsumes or generalizes $d_2$, denoted $d_1 \sqsubseteq d_2$, if $d_2$ logically implies $d_1$. For instance, the description $age \: \leq \: 60$ subsumes the description $age\:\leq\:30$ since every individuals having an age less than $30$, has also an age less than $60$.
A pattern setup [6,44] is then a triple $(\mathcal{G}, (\mathcal{D}, \sqsubseteq), \delta)$ where $\mathcal{G}$ is an arbitrary set of objects (i.e. individuals, transaction identifier, etc.), $(\mathcal{D}, \sqsubseteq)$ is an arbitrary poset designating the description space and $\delta: \mathcal{G} \to \mathcal{D}$ maps each object $g \in \mathcal{G}$ to its description in the pattern space $\delta(g) \in \mathcal{D}$. For instance, if the description space is the set of all possible interval-restrictions over the attribute $age$, then an individuals $g \in \mathcal{G}$ having an age of $26$ will be mapped to $\delta(g) = 26\:\leq\:age\:\leq\:26$.
Based on the pattern setup, the cover binary relation can be intuitively built between $\mathcal{G}$ and $\mathcal{D}$. A pattern $d \in \mathcal{D}$ is said to cover or hold for an object $g \in \mathcal{G}$ in the pattern setup if $d$ subsumes $\delta(g)$ (i.e. $d \sqsubseteq \delta(g)$). Again, it is clear that the pattern $age\:\leq\:60$ holds for the object $g$ with $\delta(g) = 26\:\leq\:age\:\leq\:26$.
Starting from the cover binary relation, one can build many notions such as the extent (i.e. the set of objects for which some pattern holds), the cover (i.e. the set of descriptions common to a set of objects), definable sets (i.e. subset of objects that are separable in the pattern space), implications between patterns w.r.t. to the dataset, equivalent patterns , minimal generators , support-closed patterns , and so on.
Additional properties of the pattern setup depends generally on the properties of the subsumption order (i.e. a lattice [18], a multilattice [8,46], a convex geometry [21], an arbitrary poset). For instance, having a unique maximum common description for any subset of objects is directly linked to the fact that the description space is a lattice. In such a case, we say that the pattern setup is a pattern structure [27,44]. Having a pattern structure allows to build a Galois Connection and hence a closure operator on the pattern space. Many usual pattern languages are lattice-based. One can cite for instance itemsets [51,2], intervals [36], convex polygons [7] and others (see [41]). However, not all pattern spaces are lattices as is it is the case with sequential patterns [3,53].
Understanding properties about the description space allows to design efficient enumeration algorithms. For instance, if the aim is to look for patterns $d \in D$ maximizing some quality measure depending solely on the covered instances (i.e. extent) as is it the case in Subgroup Discovery [38] or Exceptional Model Mining [42], one should only look for the definable sets (subset of objects that are separable w.r.t. the pattern language). Enumerating definable sets is equivalent to the task of enumerating closed patterns when the pattern space is a lattice. Moreover, if for example the pattern space has the anti-exchange property (i.e. convex geometry), one can enumerate definable sets without even computing the closure (i.e. successive removal of extreme points [7]). The task of enumerating definable sets in a non-lattice based pattern space (e.g. sequential patterns) is harder since it is even not equivalent to enumerate support-closed descriptions (e.g. maximal common subsequences).
It is worth to mention that pattern setups have the objective to model and understand the considered description spaces; the pattern setup definition is then independent from the algorithm enumerating its elements. It should be noticed that whenever a pattern structure is built using the considered description space, a general exhaustive/exact algorithm to enumerate its definable sets exists. It uses the closure operator [10,39,40]. Yet, one could use the same framework to design (direct) sampling algorithms [11,19,4], heuristic algorithms or anytime algorithms [6,12].
- Introduction and Motivation (30 mn)
This section should motivate the importance of a formal framework to model and define the pattern search space in order to understand the underlying properties. It allows also to revisit some usual pattern mining problems: Frequent Pattern Mining [2,3], Subgroup Discovery [38,52], Exceptional Model Mining [42] and Redescription Mining [26] among others. These pattern mining problems rely on a common need which is defining the pattern language: itemset patterns [2], interval patterns [36], sequential patterns [3,49,14], trajectory patterns [32] and so on. - Pattern Setup Framework I (60 mn)
This section presents the framework of pattern setups and the different notions that can be derived from, such as extent operator, covering descriptions, definable sets, minimal generator, support-closed patterns and so on. The presentation of these notions will be illustrated by examples instantiated for various pattern languages and pattern mining problems. - Coffee Break (20 mn)
One can also drink tea or juice ... - Pattern Setup Framework II (45 mn)
Further details on pattern setups and their properties are discussed. For instance, we start by presenting the framework of pattern structures [40] (i.e. pattern setups on a lattice-based pattern space) and the additional properties. Other details will be briefly investigated:- Problems related to maximal common patterns (support-closed patterns) in some arbitrary description spaces and the notion of multilattices.
- Transformations of pattern setups into simpler spaces (i.e. projections) or richer ones (i.e. completions).
- Algorithms in Pattern Setups (45 mn)
This section discusses the pattern setup framework and how it could be used to think about algorithms enumerating patterns w.r.t. pattern mining problem. We present here:- Exhaustive approaches. Particularly, algorithms like CbO (Close-by-One) [39,40,30,10] within finite pattern structures are detailed here. Variants depending on the description space properties [36,7] will also be investigated.
- Direct Sampling Approaches. Direct samplig algorithms w.r.t. to some distribution probability following [10,19] are briefly presented. Extensions to other types of patterns is investigated from a pattern setup perspective.
- Anytime Approaches. Algorithms exploring progressively the pattern search space providing guarantees on the optimized quality measure anytime are discussed. A particular attention will be paid to pattern subspace properties (e.g. complete sublattices [5], kernel systems [15,16]).
- Discussion and Open Problems (30 mn)
The aim of this final section is to propose discussions on some pattern mining open problems we are currently working on:- How to enumerate definable sets exhaustively and irredundantly in an arbitrary finite pattern setup?
- What about pattern sets from pattern setup perspective?
- Subgroup discovery in labeled datasets is tightly linked to Garriga’s relevant patterns [29], the existence of an output-polynomial (or polynomial delay) algorithm that is polynomial in space remains to be an open problem [33].
- Other questions emerging during the tutorial will also be discussed depending on the audience interest.
Materials
Speakers
- Aimene Belfodil
Aimene Belfodil is a 3rd year PhD student supervised by Dr. Mehdi Kaytoue and Pr. Céline Robardet in the LIRIS Laboratory in Lyon (France) and Mobile Devices company. His research interest are about Ordered Sets in Pattern Mining and Subgroup Discovery. Since 2016, he is a lecturer in the Computer Science department at INSA de Lyon (France). His teaching topics relate to Algorithmics, Data Analysis and Management and Linked Data. He was awarded for the Best Data Mining Student Paper in ECML-PKDD 2018 for “Anytime Subgroup Discovery in Numerical Domains with Guarantees” . - Mehdi kaytoue
Mehdi Kaytoue is an assistant-professor at the “Institut National des Sciences Appliquées” since 2012 (INSA de Lyon, France) and a member of the LIRIS Laboratory in Lyon (France). He is currently working at Infologic company that edits and integrates software solutions for the food supply chain. Mehdi Kaytoue has a large experience in KDD and has already published a series of papers in high level venues. He is a member of the program committee of ECML-PKDD since 2015 and is deeply involved in the CLA and ICFCA communities. - Sergei O. Kuznetsov
Sergei O. Kuznetsov is a professor, head of Department of Data Analysis and Artificial Intelligence at the National Research University Higher School of Economics (NRU HSE), Moscow, Russia. His interests are algorithmic complexity and algorithms in machine learning and formal concept analysis, data mining, and knowledge discovery. He is involved in the CLA, ICCS and ICFCA community among others and he chaired five international conferences and numerous international workshops on FCA and Artificial Intelligence. - Amedeo Napoli
Amedeo Napoli is a member of the LORIA Laboratory in Nancy (France). He is the scientific head of the Orpailleur team and is mainly interested in knowledge discovery and knowledge representation and reasoning, with particular attention to FCA in Knowledge Discovery. He was the Conference Chair of the ECML-PKDD Conference organized in Nancy (France) in September 2014. Finally, he is also deeply involved in the CLA and ICFCA communities.
Amedeo Napoli and Sergei O. Kuznetsov have a joint activity since many years now and they have co-authored a series of papers on FCA and extensions. They have been involved in conferences on FCA mainly ICCS, ICFCA and CLA and in the organization of special issues on FCA in journals (Annals of Mathematics and Artificial Intelligence, Discrete Applied Mathematics). They have organized a joint tutorial on FCA at IJCAI 2013 and IJCAI 2015, and they have a strong experience in teaching abroad on topics about Knowledge Discovery. In addition, Sergei O. Kuznetsov and Amedeo Napoli are co-chairs of the workshop series FCA4AI, which exists since 2012 and is usually related to ECAI or IJCAI Conferences (see http://fca4ai.hse.ru/ ). Both were also co-chairs of a special session on FCA for Knowledge Discovery at the ISMIS Conference (see http://ismis2017.ii.pw.edu.pl/s_kd_fca.php ).
References
[2] Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: SIGMOD Conference. pp. 207–216. ACM Press (1993)
[3] Agrawal, R., Srikant, R.: Mining sequential patterns. In: ICDE. pp. 3–14. IEEE Computer Society (1995)
[4] Belfodil, A., Cazalens, S., Lamarre, P., Plantevit, M.: Identifying exceptional (dis)agreement between groups. Research report, LIRIS UMR CNRS 5205 (Feb 2019), https://hal.archives-ouvertes.fr/hal-02018813
[5] Belfodil, A., Belfodil, A., Kaytoue, M.: Anytime subgroup discovery in numerical domains with guarantees. In: ECML/PKDD (2). Lecture Notes in Computer Science, vol. 11052, pp. 500–516. Springer (2018)
[6] Belfodil, A., Kuznetsov, S.O., Kaytoue, M.: Pattern setups and their completions. In: CLA. CEUR Workshop Proceedings, vol. 2123, pp. 243–253. CEUR-WS.org (2018)
[7] Belfodil, A., Kuznetsov, S.O., Robardet, C., Kaytoue, M.: Mining convex polygon patterns with formal concept analysis. In: IJCAI. pp. 1425–1432. ijcai.org (2017)
[8] Benado, M.: Les ensembles partiellement ordonnés et le théorème de raffinement de schreier. ii. théorie des multistructures. Czechoslovak Mathematical Journal 5(3), 308–344 (1955)
[9] Bendimerad, A.A., Plantevit, M., Robardet, C.: Mining exceptional closed patterns in attributed graphs. Knowl. Inf. Syst. 56(1), 1–25 (2018)
[10] Boley, M., Horváth, T., Poigné, A., Wrobel, S.: Listing closed sets of strongly accessible set systems with applications to data mining. Theor. Comput. Sci. 411(3), 691–700 (2010)
[11] Boley, M., Lucchese, C., Paurat, D., Gärtner, T.: Direct local pattern sampling by efficient two-step random procedures. In: KDD. pp. 582–590. ACM (2011)
[12] Bosc, G., Boulicaut, J., Raı̈ssi, C., Kaytoue, M.: Anytime discovery of a diverse setof patterns with monte carlo tree search. Data Min. Knowl. Discov. 32(3), 604–650 (2018)
[13] Brito, P.: Order structure of symbolic assertion objects. IEEE Trans. Knowl. Data Eng. 6(5), 830–834 (1994)
[14] Buzmakov, A., Egho, E., Jay, N., Kuznetsov, S.O., Napoli, A., Raı̈ssi, C.: On mining complex sequential data by means of FCA and pattern structures. Int. J. General Systems 45(2), 135–159 (2016)
[15] Buzmakov, A., Kuznetsov, S.O., Napoli, A.: Fast generation of best interval patterns for nonmonotonic constraints. In: ECML/PKDD (2). Lecture Notes in Computer Science, vol. 9285, pp. 157–172. Springer (2015)
[16] Buzmakov, A., Kuznetsov, S.O., Napoli, A.: Efficient mining of subsample-stable graph patterns. In: ICDM. pp. 757–762. IEEE Computer Society (2017)
[17] Casas-Garriga, G.: Summarizing sequential data with closed partial orders. In: SDM. pp. 380–391. SIAM (2005)
[18] Davey, B.A., Priestley, H.A.: Introduction to lattices and order. Cambridge university press (2002)
[19] Diop, L., Diop, C.T., Giacometti, A., Li, D.H., Soulet, A.: Sequential pattern sampling with norm constraints. In: ICDM. pp. 89–98. IEEE Computer Society (2018)
[20] Duivesteijn, W., Feelders, A., Knobbe, A.J.: Exceptional model mining - supervised descriptive local pattern mining with complex target concepts. Data Min. Knowl. Discov. 30(1), 47–98 (2016)
[21] Edelman, P.H., Jamison, R.E.: The theory of convex geometries. Geometriae dedicata 19(3), 247–270 (1985)
[22] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery in databases. AI Magazine 17(3), 37–54 (1996)
[23] Ferré, S., Ridoux, O.: A logical generalization of formal concept analysis. In: ICCS. Lecture Notes in Computer Science, vol. 1867, pp. 371–384. Springer (2000)
[24] Fournier-Viger, P., Lin, J.C.W., Nkambou, R., Vo, B., Tseng, V.S.: High-Utility Pattern Mining. Springer (2019)
[25] Galbrun, E., Cellier, P., Tatti, N., Termier, A., Crémilleux, B.: Mining periodic patterns with a MDL criterion. In: ECML/PKDD (2). Lecture Notes in Computer Science, vol. 11052, pp. 535–551. Springer (2018)
[26] Galbrun, E., Miettinen, P.: Redescription mining: An overview. IEEE Intelligent Informatics Bulletin 18(2), 7–12 (2017)
[27] Ganter, B., Kuznetsov, S.O.: Pattern structures and their projections. In: ICCS. Lecture Notes in Computer Science, vol. 2120, pp. 129–142. Springer (2001)
[28] Ganter, B., Wille, R.: Formal concept analysis - mathematical foundations. Springer (1999)
[29] Garriga, G.C., Kralj, P., Lavrac, N.: Closed sets for labeled data. In: PKDD. Lecture Notes in Computer Science, vol. 4213, pp. 163–174. Springer (2006)
[30] Gély, A.: A generic algorithm for generating closed sets of a binary relation. In: ICFCA. Lecture Notes in Computer Science, vol. 3403, pp. 223–234. Springer (2005)
[31] Giacometti, A., Soulet, A.: Dense neighborhood pattern sampling in numerical data. In: SDM. pp. 756–764. SIAM (2018)
[32] Giannotti, F., Nanni, M., Pinelli, F., Pedreschi, D.: Trajectory pattern mining. In: KDD. pp. 330–339. ACM (2007)
[33] Grosskreutz, H.: Class relevant pattern mining in output-polynomial time. In: SDM. pp. 284–294. SIAM / Omnipress (2012)
[34] Hacene, M.R., Huchard, M., Napoli, A., Valtchev, P.: Relational concept analysis: mining concept lattices from multi-relational data. Ann. Math. Artif. Intell. 67(1), 81–108 (2013)
[35] Imielinski, T., Mannila, H.: A database perspective on knowledge discovery. Commun. ACM 39(11), 58–64 (1996)
[36] Kaytoue, M., Kuznetsov, S.O., Napoli, A.: Revisiting numerical pattern mining with formal concept analysis. In: IJCAI. pp. 1342–1347. IJCAI/AAAI (2011)
[37] Kaytoue, M., Plantevit, M., Zimmermann, A., Bendimerad, A.A., Robardet, C.: Exceptional contextual subgraph mining. Machine Learning 106(8), 1171–1211 (2017)
[38] Klösgen, W.: Explora: A multipattern and multistrategy discovery assistant. In: Advances in Knowledge Discovery and Data Mining, pp. 249–271. AAAI/MIT Press (1996)
[39] Kuznetsov, S.O.: A Fast Algorithm for Computing All Intersections of Objects in a Finite Semi-lattice. Nauchno-Tekhnicheskaya Informatsiya ser. 2(1), 17–20 (1993)
[40] Kuznetsov, S.O.: Learning of simple conceptual graphs from positive and negative examples. In: PKDD. pp. 384–391 (1999)
[41] Kuznetsov, S.O.: Pattern structures for analyzing complex data. In: RSFDGrC (2009)
[42] Leman, D., Feelders, A., Knobbe, A.J.: Exceptional model mining. In: ECML/PKDD (2). Lecture Notes in Computer Science, vol. 5212, pp. 1–16. Springer (2008)
[43] Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining. In: KDD. pp. 80–86. AAAI Press (1998)
[44] Lumpe, L., Schmidt, S.E.: Pattern structures and their morphisms. In: CLA. CEUR Workshop Proceedings, vol. 1466, pp. 171–179. CEUR-WS.org (2015)
[45] Mannila, H., Toivonen, H.: Levelwise search and borders of theories in knowledge discovery. Data Min. Knowl. Discov. 1(3), 241–258 (1997)
[46] Martı́nez, J., Gutiérrez, G., de Guzmán, I.P., Cordero, P.: Generalizations of lattices via non-deterministic operators. Discrete Mathematics 295(1-3), 107–141 (2005)
[47] Novak, P.K., Lavrac, N., Webb, G.I.: Supervised descriptive rule discovery: A unifying survey of contrast set, emerging pattern and subgroup mining. Journal of Machine Learning Research 10, 377–403 (2009)
[48] Özden, B., Ramaswamy, S., Silberschatz, A.: Cyclic association rules. In: ICDE. pp. 412–421. IEEE Computer Society (1998)
[49] Plantevit, M., Laurent, A., Laurent, D., Teisseire, M., Choong, Y.W.: Mining multidimensional and multilevel sequential patterns. TKDD 4(1), 4:1–4:37 (2010)
[50] Ramakrishnan, N., Kumar, D., Mishra, B., Potts, M., Helm, R.F.: Turning cartwheels: an alternating algorithm for mining redescriptions. In: KDD. pp. 266– 275. ACM (2004)
[51] Wille, R.: Restructuring lattice theory: an approach based on hierarchies of concepts. In: Ordered sets, pp. 445–470. Springer (1982)
[52] Wrobel, S.: An algorithm for multi-relational discovery of subgroups. In: PKDD. Lecture Notes in Computer Science, vol. 1263, pp. 78–87. Springer (1997)
[53] Zaki, M.J.: SPADE: an efficient algorithm for mining frequent sequences. Machine Learning 42(1/2), 31–60 (2001)