Spectral Clustering¶
On veut couper un graphe en deux en coupant le moindre d’arcs possible. C’est un algorithme de clustering.
Un graphe¶
Les données contiennent deux dictionnaires :
noeuds[i]
: contient la valeur associée au noeudi
arcs
: si le dictionnaire contient le tuple(i,j)
, alors les noeudsi
etj
sont reliés.
[1]:
# tutoriel_graphe
noeuds = {
0: "le",
1: "silences",
2: "quelques",
3: "\xe9crit",
4: "non-dits.",
5: "Et",
6: "risque",
7: "\xe0",
8: "qu'elle,",
9: "parfois",
10: "aim\xe9",
11: "lorsque",
12: "que",
13: "plus",
14: "les",
15: "Minelli,",
16: "n'oublierai",
17: "je",
18: "prises",
19: "sa",
20: "la",
21: "jeune,",
22: "qu'elle,",
23: "\xe0",
24: "ont",
25: "j'ai",
26: "chemin",
27: "\xe9tranger",
28: "lente",
29: "de",
30: "voir",
31: "quand",
32: "la",
33: "recul,",
34: "de",
35: "trop",
36: "ce",
37: "Je",
38: "Il",
39: "l'extr\xeame",
40: "J'ai",
41: "silences,",
42: "qu'elle,",
43: "le",
44: "trace,",
45: "avec",
46: "seras",
47: "dire,",
48: "femme",
49: "soit",
}
arcs = {
(3, 15): None,
(46, 47): None,
(42, 33): None,
(35, 45): None,
(1, 14): None,
(22, 26): None,
(26, 28): None,
(43, 29): None,
(40, 41): None,
(29, 44): None,
(17, 3): None,
(32, 37): None,
(24, 19): None,
(46, 34): None,
(11, 19): None,
(34, 49): None,
(22, 2): None,
(37, 48): None,
(14, 12): None,
(3, 10): None,
(5, 18): None,
(12, 24): None,
(34, 32): None,
(45, 39): None,
(37, 26): None,
(33, 45): None,
(34, 47): None,
(36, 31): None,
(29, 47): None,
(13, 11): None,
(12, 21): None,
(2, 16): None,
(5, 4): None,
(33, 35): None,
(28, 49): None,
(25, 49): None,
(21, 0): None,
(3, 13): None,
(18, 24): None,
(12, 7): None,
(13, 15): None,
(11, 1): None,
(16, 23): None,
(37, 45): None,
(27, 32): None,
(32, 41): None,
(8, 24): None,
(10, 1): None,
(2, 24): None,
(24, 11): None,
(2, 14): None,
(47, 36): None,
(48, 39): None,
(30, 25): None,
(30, 43): None,
(15, 14): None,
(26, 27): None,
(6, 8): None,
(20, 10): None,
(19, 17): None,
(5, 7): None,
(44, 25): None,
(27, 38): None,
(2, 0): None,
(3, 18): None,
(3, 9): None,
(25, 33): None,
(42, 48): None,
(2, 15): None,
(26, 48): None,
(26, 38): None,
(7, 8): None,
(8, 4): None,
}
[3]:
from teachpyx.tools.graphviz_helper import draw_graph_graphviz
from IPython.display import Image
draw_graph_graphviz(noeuds, arcs, "image.png")
Image("image.png", width=400)
[3]:
Partie 1 : clustering en pratique¶
Q1¶
On souhaite obtenir deux composantes connexes en enlevant le minimum d’arcs. Pour cela, on pense à un algorithme appelé Minimum Spanning Tree. Est-ce que cet algorithme vous satisfait ?
Q2¶
On définit le degré d’un noeud comme étant le nombre d’arcs reliés à ce noeud. Ecrire une fonction qui calcule le degré pour un noeud donné.
Q3¶
On construit le Laplacien du graphe. C’est une matrice carrée \(n\times n\) où \(n\) est le nombre de noeuds du graphe. Soit \(M=(m_{ij})\) ce laplacien :
Construisez cette matrice. On rappelle que le graphe est toujours non-directionnel. Montrez que la somme des valeurs de chaque colonne est 0.
Q4¶
Montrez que l’une des valeurs propres de cette matrice est 0. Quel est le vecteur propre associé ?
Rappel : En anglais, valeurs propres et vecteurs propres se traduisent par eigen values et eigen vectors.
Q5¶
Calculez les valeurs propres à l’aide du module numpy. Vous vous assurerez que le résultat théorique précédent est vérifié numériquement.
Q6¶
Toutes les valeurs propres sont positives ou nulles, quel est le vecteur propre associé à la plus petite des valeurs propres non nulles ?
Q7¶
On classe les noeuds en deux classes selon qu’ils sont associés à une valeur positive ou négative d’après ce vecteur propre. On peut maintenant déterminer quel noeud appartient à la première composante, quel noeud appartient à la seconde. Ecrire une fonction qui calcule le nombre d’arcs qui relient un noeud de la première composante à un noeud de la seconde. Quel est le résultat ?
Partie 2 : intermède théorique¶
Les résultats suivants sont extraits du livre Statistical Analysis of Network Data de Eric D. Kolaczyk. \(G\) désigne un graphe connexe, avec une seule composante connexe. On définit \(S\) un ensemble de sommets du graphe, \(\bar{S}\) est son complémentaire. \(E(S,\bar{S})\) désigne l’ensemble des arcs qui relient un sommet de \(S\) à un sommet de son complémentaire. \(|S|\) désigne le nombre de sommets inclus dans \(S\). Enfin :
On cherche toujours à couper un graphe connexe en deux composantes connexes en supprimant le moins d’arcs possibles. Cela dit, on souhaite aussi éviter couper un seul sommet du reste du graphe tout simplement parce qu’il est relié par un seul arc. La fonction \(\Phi(G)\) exprime ce compromis.
Résoudre ce problème de minimisation est un problème NP-complet (il n’existe pas d’algorithme capable de résoudre ce problème en un temps polynomial).}. \(\lambda_2\) désigne la plus petite des valeurs propres non nulles et \(d_{max}\) le plus grand nombre d’arcs reliés au même sommet (degré maximal du graphe). On peut montrer dans ce cas :
De plus, si \(V=(x_1,...x_n)\) le vecteur propre associé à cette valeur propre, on définit \(S\) et \(\bar{S}\) comme suit :
On s’assure que \(|S| \leqslant |\bar{S}|\). Alors, on peut montrer que :
Le vecteur \(V\) est donc une réponse approchée au problème de minimisation \((1)\).
Partie 3 : un peu plus loin¶
On applique cette méthode à un problème de clusterisation.
[10]:
import pandas
points = [
(0.84737386691659533, 0.95848816613228727),
(0.28893525107454354, 0.66073249195336492),
(0.60382037086559148, 0.13747945088383384),
(0.21951613156582261, 0.040905525433785228),
(0.21613062123493632, 0.096875623632852625),
(0.99787588721497178, 0.79337171783327132),
(0.18576957348508683, 0.78396225027633837),
(0.23875443625588322, 0.35497638429086975),
(0.8713637939628045, 0.22983756618811024),
(0.28301724069085921, 0.99408996134013161),
(0.39792684083973429, 0.77105362865540716),
(0.75452041353842147, 0.330325155167562),
(0.24824845436118537, 0.95998690078041737),
(0.92318434139996397, 0.38115765401571988),
(0.54660304309415886, 0.62093667623480242),
(0.58899996464290505, 0.9017292023892568),
(0.60541336358687847, 0.28929082523865812),
(0.87925379747840293, 0.94834058131858756),
(0.61449632813730748, 0.94264237081849722),
(0.13119804743502139, 0.44158556198130949),
(0.20660796942108339, 0.915599021810789),
(0.3097131996826511, 0.81979953110332837),
(0.89711055197298928, 0.7298496710091944),
(0.22499060312661545, 0.072786594549671291),
(0.012604758185058018, 0.36199484670070914),
(0.92050750708863993, 0.91447248587261709),
(0.26304069827339327, 0.026058147250910935),
(0.59289937178711172, 0.86673111722782969),
(0.70640070176443837, 0.64096733852134291),
(0.049399266565914535, 0.54027723332288746),
(0.26450585597978316, 0.50883097182669357),
(0.91987410679455195, 0.97753050553942622),
(0.5618293073273094, 0.27688371997865069),
(0.91241761244784847, 0.090310675429991605),
(0.90925789663628509, 0.40628594240956295),
(0.3832814495252409, 0.66221025722485627),
(0.74928785967005418, 0.32840192750838815),
(0.25478832731446643, 0.70269825611412617),
(0.54293534537395793, 0.87800254191632932),
(0.89603330911109724, 0.77106655965183546),
(0.29830084404349644, 0.97117954065316903),
(0.075137754060910056, 0.086473140735377596),
(0.120307047737505, 0.073651360408690802),
(0.87835916829742444, 0.34622147871872355),
(0.20567119579830373, 0.42658381934346423),
(0.27715586337053655, 0.87999487046170488),
(0.16364186693234739, 0.98604111274325335),
(0.31830209002283116, 0.36372930495109934),
(0.73434680601907532, 0.65926820980026724),
(0.9830474686174655, 0.12246834322318068),
(4.0293130665095358, -3.0529459366329164),
(-3.7755737603387041, 2.2685053357046323),
(-2.1926920625846602, 2.4857321786911326),
(5.1445647965531025, 4.8943143876324848),
(0.87403644639763023, 5.6464000746270226),
(-3.5545355219233219, 3.8988261206085766),
(2.0785612031685732, -1.2948920530351256),
(-3.4682717483474708, 2.2364561845005868),
(2.0695530720860349, -2.9439062757612424),
(3.9563571060210054, -2.0678946581365616),
(3.2485209278176157, -2.6386418932454814),
(-3.4800728241977779, 0.72646452125011518),
(-1.8341241854718167, 3.3482541467971951),
(-4.5558692651012178, 3.5624030818263908),
(-4.6768285328272157, -1.0106699901361971),
(3.9175303893386597, 0.1087117017596031),
(-3.9111941479785823, 2.70001353796486),
(-5.5501953466420737, -3.8544512068951891),
(1.9246058344257151, 4.123740240481137),
(-4.110657752575519, -2.0774760107085393),
(2.6547967574269418, 4.6868873425221045),
(-1.9308254017076039, -2.9448006865754279),
(-3.0788555249744247, 3.396205767032443),
(-4.0516249434348621, 0.42035392996461629),
(-2.2989465364173602, -3.2706795830191275),
(4.651698949077459, 1.1364194264447973),
(3.3637257964296152, -2.5082040184760555),
(-3.2502121678035314, 4.5383631321594571),
(4.5274668721202556, 4.473426056956777),
(3.400114365788911, -3.0434200740148363),
(3.513062501300436, 2.718209259961025),
(-2.3986743034356737, -4.0590996420222467),
(2.6632346815268289, 4.8894243587379433),
(4.2802341564965607, -3.4921791441653762),
(-1.5297912885016269, 5.5780900056883569),
(4.0634598983096293, -2.2904478604819776),
(1.0857595813036722, 5.6366192967000295),
(-4.5596385297232223, -1.3177709282351766),
(-2.1361714943468244, -3.9107871995830976),
(3.7240531749202161, -4.8033709892679886),
(-4.1017624989859351, -0.54374796617700816),
(2.3715344477591818, -3.2387553898801391),
(3.8187172884547076, -5.1522284671097314),
(1.0454193728074506, 3.1688190599740418),
(-3.9848808505730315, -3.5176013894081675),
(-4.1965918931505275, 2.0248869962483522),
(3.4535361867324776, 3.4437155145638751),
(-3.2171776428648808, -2.0867326734388021),
(-3.5763512667620065, -3.785293447306691),
(-3.2489915323631275, -4.6589505137265448),
(1.2817385669950028, -4.0553290947191964),
(-5.5481507299407191, 5.2080477057573553),
(-2.2817876881965624, 0.12512408298772948),
(-3.4831125975271719, 1.7834950195462245),
(-4.2064606598908139, -3.2421411165648886),
(-5.3461204499811092, 0.65966593807378215),
(-0.36559473517464181, -3.9248327086099932),
(-4.4223418217602317, 4.790875007038224),
(-3.9026572243192548, -0.21621909226838504),
(0.16100173690141428, -4.8875278273011942),
(-4.2792213808538602, 1.9041297697847308),
(-4.4298318748123444, -3.8717874765920124),
(3.2660121035644738, 3.8922848961161609),
(4.4724681658043082, -1.8875314666371643),
(-3.1337207059785208, 7.2290596706950154),
(5.0970619686963916, 2.4188864705446997),
(-1.824501293502089, 0.87811217547665232),
(2.6141377553638456, 2.4736768016729647),
(-3.9646033676482686, 1.7291507868196327),
(-5.6494860793108481, -1.1744278681124489),
(3.3291564189715617, 3.1892910878432268),
(2.6260111359196396, -4.8029748349762125),
(-4.1110554486386404, 0.0087017311510849682),
(-3.812034605848817, 1.8310006567642712),
(-4.0643824785110239, 4.7806635726760689),
(-3.8724397920934015, 0.65927045141188367),
(-3.6202135060380289, -0.18281430910806151),
(1.8134764145891591, -4.0328054369849538),
(4.0315824591034124, -3.5339867923196042),
(-3.0906912982614791, -3.8390710019489158),
(0.77019164393866146, 4.0099320163703895),
(-3.2239134319849398, 2.5227757084315567),
(-2.5342615497190861, -4.5402720724503229),
(0.52313297572359074, 5.8268409663350287),
(-2.0896974241486603, -0.83931337455192145),
(5.9824769771009292, 1.8062615072223389),
(-1.7151819974072808, -4.6553638508191835),
(-0.94296691141453703, -4.3332773280899097),
(-2.9080659785364102, 3.8017876981653527),
(-4.146797854411842, -2.4943345068020939),
(-1.6135304662636716, -4.5968234340599352),
(-5.2240732422979015, -0.40050907128273239),
(3.0003615064702411, 4.3564534485947091),
(1.5251603471425388, 5.3602495377614252),
(0.70829180528117897, 4.8705912438690024),
(-1.9857439387875215, 4.3495410597763557),
(-1.7415118623160484, -2.8482449535792851),
(3.1227029816875906, -3.943690794192229),
(2.5533372938495322, 0.23654193364300019),
(4.9320538122814632, 0.27398085527961841),
(3.5379571426787906, 3.5479478416595258),
(-3.9952197756192462, 0.9519866242123729),
(-0.63418929807710789, 4.9714021509147459),
(3.7514419719026835, -3.7952656655539831),
(5.8168652955867248, -5.8059389896821614),
(-3.86083201462211, 1.6763339473293351),
(5.2346287443442741, -2.0049022214331869),
(3.0159172780756807, -4.6747832401686313),
(1.9625789720275502, 0.21332969214064601),
(-5.4459656516053521, 1.8490131071943328),
(5.4887755131556295, 1.0537691340713213),
(4.1214658457920255, 1.8180419262808878),
(1.0417225435808637, 6.4876076903545457),
(5.2056831059665383, 3.4403227294912879),
(-3.29183542445509, 1.1299087065549616),
(-4.6894950904308068, 0.67877427899602139),
(-4.2334935303450196, 0.66692066781151726),
(6.918359229911677, -0.43825691963852248),
(5.0912552685819197, 5.9256467457380193),
(3.9995400634925016, -4.2633779062253305),
(-1.3270510253578853, -2.8998811026998816),
(-3.4372749748248483, -2.800876689538256),
(2.5720483206059228, -4.5479241832525954),
(3.5107697954439923, -5.6063323885377114),
(3.45355690226015, -1.3924594206301864),
(4.8170391803389006, -1.3343907023480963),
(1.1592191821861308, 4.551692003143347),
(-2.2147820707711716, 0.55930561729387951),
(-3.2364813901253862, -1.7059292544869302),
(3.5980046177747229, 3.0606302788023871),
(3.0235041652892747, -0.27015781708378661),
(2.4303330714757383, 3.3989583334332432),
(2.4649562148782955, -4.3524552397826168),
(-3.3322237797463616, 1.6813558717119386),
(4.3359544685337736, 2.7104894884469877),
(3.350410042767797, -3.8412188670946792),
(-2.8993273426849919, 5.5101185505218293),
(3.3563537615645282, 5.0439247587050282),
(3.3738404946436238, -0.43277784903448813),
(1.6236691719193734, -4.8192122194763103),
(-4.3000303214498619, 2.7045156595962521),
(3.2036876689968699, -0.22379027409222038),
(5.0078193725337679, -0.33061456656172339),
(1.3173753727230917, 2.3292728936983247),
(0.17305051546078376, -2.3708524146324814),
(0.18920570140751003, 2.7288547711089577),
(4.5559793038807355, 2.4460955268542377),
(-0.65537111745445098, 4.3024274811626642),
(0.32733974310015845, -2.6653194005399481),
(-4.3495524342659682, 0.50620561077402126),
(3.6859406925109957, 1.0042337939426813),
(-5.4168309661540643, -2.3784247121303279),
(2.0873449293614152, 3.8206900404120345),
(3.3397623772131446, -2.2347446764630474),
(2.8720948774765485, 2.6955132035521556),
(5.9472576652843694, -3.3542922693748149),
(1.030233796538444, 1.6199282129862145),
(-1.7351581782776853, -5.5709314373179808),
(0.14607908112131446, 2.79251837326064),
(0.37002429983216167, 4.3653059393186942),
(3.8616789948811956, 3.6100436336617339),
(-4.8019087210485418, -3.5911421188072357),
(1.6953052292111459, -4.3928959775316905),
(-1.049532260408768, -2.9169000088107522),
(-4.8042700374731648, -2.6636201843555991),
(2.2856117402115821, -4.497386564362329),
(-1.1085015582769402, -4.1635806015318408),
(-0.51764720743541925, 3.3207617687324866),
(2.6552485122750968, 1.9457154950840061),
(4.4574030967957459, 0.13220998701481373),
(4.1064026703010086, -4.6992062016898437),
(3.6218017958370492, 2.4171784152426357),
(2.1893570148164336, -0.53987360896641756),
(-0.62289304323418893, 5.6377915319211773),
(0.95656595366184183, -3.5482370903224183),
(4.6552715153624238, -0.42419842122106877),
(3.9138981541477369, 1.5211086418661788),
(-5.7643908686171743, 3.3462875243179644),
(4.4001664954474204, 1.8715548148469952),
(3.7209034976257116, -4.3132712976844925),
(2.0077653108424371, -3.8044349295045858),
(-2.7004396541700451, 3.6313151291578776),
(2.7805282578575432, -1.3496033840422226),
(2.5149407509344646, -4.4491799573779538),
(-3.4969549443875327, 0.59052341158001964),
(2.5871839418980924, -2.8626995345211439),
(4.530084220131168, 0.73947783901217035),
(-4.2278934560638541, -1.4480933790189707),
(-3.6638968948801822, -1.8603129450393652),
(1.0034748779660814, 4.3783603559660618),
(-0.24711046251746965, 5.0245225170472958),
(-0.75233017871629115, -3.4003624728787472),
(-5.3204808270534789, 0.8530050107548528),
(-0.66555456366565435, -3.210607962975542),
(4.4312598575388913, -1.8510534338146063),
(-1.0579141292803367, -3.8599892658343156),
(5.1580465239922022, -1.6376354853614972),
(-2.6525127599513731, 2.9406618825179196),
(3.3353268107001339, 4.5193520805659642),
(4.9838132614191322, -4.5937246171656669),
]
df = pandas.DataFrame(points, columns=["x", "y"])
df.plot.scatter(x="x", y="y")
[10]:
<Axes: xlabel='x', ylabel='y'>
On note \(d(X_1,X_2)\) la distance euclidienne entre deux points \(X_1\) et \(X_2\). On construit le Laplacien suivant à partir d’un ensemble de points du plan \((X_i)_i\).
Et voici un ensemble de points juste au cas où.
[5]:
Q1¶
Créer une fonction qui génère un tel ensemble de points en deux dimensions de façon aléatoires ?
Q2¶
Implémentez la méthode suggérée et dessinez le résultat.
[6]: