{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Images and duplicates\n", "\n", "Material for the ENSAE / BRGM hackathon, 2018. The images are extracted from tweets, but the same image can be posted again without being marked as a retweet."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["%matplotlib inline\n", "import matplotlib.pyplot as plt"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Separating duplicates\n", "\n", "For the challenge, duplicate images must be identified. To do so, each image is converted to grayscale and resized to a 50x50 square, then a PCA followed by k-nearest neighbors is used to detect duplicates."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 50x50 grayscale images"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": ["folder = \"c:/temp/suricatenat_images\""]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["from ensae_projects.hackathon.image_helper import apply_image_transform, image_zoom, img2gray\n", "\n", "dest_folder = \"img5050\"\n", "list(apply_image_transform(folder, dest_folder, lambda img: image_zoom(img2gray(img), (50, 50)), fLOG=print))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Images as features\n", "\n", "Not used in the rest of the notebook."]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": ["from ensae_projects.hackathon.image_helper import stream_image2features\n", "import numpy\n", "\n", "dest_folder = \"img5050\"\n", "dest_batch = \"batch\"\n", "for b in stream_image2features(dest_folder, dest_batch, numpy.array, fLOG=print):\n", "    pass"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Neighbors"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "code", "execution_count": 7, "metadata": {"scrolled": true}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[ImageNearestNeighbors] processing image 0: 'inondation_2016\\735614357036519425_CjVtTTrUoAAUUZp.jpg' - class 'inondation_2016'\n", 
"[ImageNearestNeighbors] processing image 1000: 'inondation_2016\\737596119933321217_Cjx3w1FVAAAyyY1.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 2000: 'inondation_2016\\737891662077255685_Cj2EjjXWUAA8Dhq.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 3000: 'inondation_2016\\738050337521709056_Cj4UpFDUoAIR2gD.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 4000: 'inondation_2016\\738283056302313472_Cj7oe7VWkAAPwAT.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 5000: 'inondation_2016\\738366585526718464_Cj80fFNXEAAx9T2.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 6000: 'inondation_2016\\738439428159377408_Cj92vvAUYAARP2A.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 7000: 'inondation_2016\\738629637845221376_CkAjvUFVAAErbJF.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 8000: 'inondation_2016\\738695722296614912_CkBf1CbXIAAonp1.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 9000: 'inondation_2016\\738766013416787968_CkCfqR3XIAEuX8m.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 10000: 'inondation_2016\\738894521304526849_CkEUnRhW0AEh1e1.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 11000: 'inondation_2016\\739101985295728640_CkHRVZ-WUAAyCrp.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 12000: 'inondation_2016\\739400457899114496_CkLgzBAWkAE9hCa.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 13000: 'inondation_2016\\739732522427424768_CkQOztKWYAAlOul.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 14000: 'inondation_2016\\740054590863859712_CkUzuikWgAAJUTK.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 15000: 
'inondation_2016\\740416207296299008_CkZ8nAnWYAAG7cC.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 16000: 'inondation_2016\\740833843914153985_Ckf4dIFWEAANSOT.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 17000: 'inondation_2016\\742361701924937728_Ck1mBsLWkAE6EQX.jpg' - class 'inondation_2016'\n", "[ImageNearestNeighbors] processing image 18000: 'inondation_2018\\955391968712019968_DUI76ywW4AA2J1b.jpg' - class 'inondation_2018'\n", "[ImageNearestNeighbors] processing image 19000: 'inondation_2018\\956216357934325761_LKmRQ9hLmVxOkWtm.jpg' - class 'inondation_2018'\n", "[ImageNearestNeighbors] processing image 20000: 'inondation_2018\\957254473604268032_DUjZ2vSWkAAdzd2.jpg' - class 'inondation_2018'\n", "[ImageNearestNeighbors] processing image 21000: 'inondation_2018\\959020320320565248_DU8fYlpX4AAZIRV.jpg' - class 'inondation_2018'\n", "[ImageNearestNeighbors] processing image 22000: 'inondation_2018\\964034081381109761_DWDv4vHWsAAMQIS.jpg' - class 'inondation_2018'\n", "[ImageNearestNeighbors] processing image 23000: 'seisme_Amatrice\\768290329543995392_MwkGcfSrCBzWbxwK.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 24000: 'seisme_Amatrice\\768326333034364928_CqmktbfXEAAw2RU.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 25000: 'seisme_Amatrice\\768345861646581760_Cqm2eUjWYAAWS68.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 26000: 'seisme_Amatrice\\768361403522646016_CqnEgFrWcAANqdO.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 27000: 'seisme_Amatrice\\768374709645967361_CqnQt96XEAAew2V.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 28000: 'seisme_Amatrice\\768387852862455810_CqncoxLWYAAulnb.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 29000: 'seisme_Amatrice\\768401257769865216_CqnlYItWAAAbc7p.jpg' - class 
'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 30000: 'seisme_Amatrice\\768417849652027394_Cqnz5_gXgAAx967.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 31000: 'seisme_Amatrice\\768433724564377600_CqoGZC4WAAEh8zG.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 32000: 'seisme_Amatrice\\768451168372662272_CqoWQCGW8AQIbHV.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 33000: 'seisme_Amatrice\\768468307288743936_Cqol1cDXgAEychm.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 34000: 'seisme_Amatrice\\768488406091386880_Cqo4H6GWIAAr4YP.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 35000: 'seisme_Amatrice\\768511762429800448_CqpNXTxXYAATvNk.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 36000: 'seisme_Amatrice\\768543842845032448_CqplczAWIAAhINz.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 37000: 'seisme_Amatrice\\768647190260518912_CqrIhKnUkAAyvf6.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 38000: 'seisme_Amatrice\\768716815279063040_CqsH3mqUEAA6gXD.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 39000: 'seisme_Amatrice\\768743738634080256_CqsgWwgWcAE0rWO.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 40000: 'seisme_Amatrice\\768772807568351232_Cqs6ORfWIAAXlf8.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 41000: 'seisme_Amatrice\\768804543748575232_CqtXniSXYAAp7Tt.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 42000: 'seisme_Amatrice\\768843712357076993_Cqt7R_1WYAE6tr6.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 43000: 'seisme_Amatrice\\768901703898771456_Cquv7mKWgAAn6ZX.jpg' - class 'seisme_Amatrice'\n", "[ImageNearestNeighbors] processing image 44000: 
'suricatenat_inondation_aude\\1052220109740228608_Dpo8nOhXgAYLNEm.jpg' - class 'suricatenat_inondation_aude'\n"]}], "source": ["from ensae_projects.hackathon.image_knn import ImageNearestNeighbors\n", "folder = \"img5050\"\n", "knn = ImageNearestNeighbors()\n", "knn.fit(folder, fLOG=print)"]}, {"cell_type": "code", "execution_count": 8, "metadata": {"scrolled": false}, "outputs": [{"data": {"text/plain": ["44053"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["from ensae_projects.hackathon.image_helper import enumerate_image_class\n", "folder = \"img5050\"\n", "# avoid shadowing the builtin iter\n", "image_iter = enumerate_image_class(folder)\n", "imgs = [_[0] for _ in zip(image_iter, range(0,1000000))]\n", "len(imgs)"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["dist = [[ 0. 0. 7.93725393 366.16662874 380.73481585]]\n", "ind = [[ 12 3 10 21464 8684]]\n"]}], "source": ["# stop at the first image whose second nearest neighbor is within distance 10\n", "for i, img in enumerate(imgs):\n", "    dist, ind = knn.kneighbors(img[0])\n", "    if dist[0, 1] <= 10:\n", "        print(\"dist =\", dist)\n", "        print(\"ind =\", ind)\n", "        break"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["knn.plot_neighbors(ind, dist, obs=img[0], folder_or_images=folder);"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["0/44053 done\n", "1000/44053 done\n", "2000/44053 done\n", "3000/44053 done\n", "4000/44053 done\n", "5000/44053 done\n", "6000/44053 done\n", "7000/44053 done\n", "8000/44053 done\n", "9000/44053 done\n", "10000/44053 done\n", "11000/44053 done\n", "12000/44053 done\n", "13000/44053 done\n", "14000/44053 done\n", "15000/44053 done\n", "16000/44053 done\n", "17000/44053 done\n", "18000/44053 done\n", "19000/44053 done\n", "20000/44053 done\n", "21000/44053 done\n", "22000/44053 done\n", "23000/44053 done\n", "24000/44053 done\n", "25000/44053 done\n", "26000/44053 done\n", "27000/44053 done\n", "28000/44053 done\n", "29000/44053 done\n", "30000/44053 done\n", "31000/44053 done\n", "32000/44053 done\n", "33000/44053 done\n", "34000/44053 done\n", "35000/44053 done\n", "36000/44053 done\n", "37000/44053 done\n", "38000/44053 done\n", "39000/44053 done\n", "40000/44053 done\n", "41000/44053 done\n", "42000/44053 done\n", "43000/44053 done\n", "44000/44053 done\n"]}], "source": ["# collect every pair (image, neighbor) whose distance is at most 10\n", "pairs = []\n", "for i, img in enumerate(imgs):\n", "    if i % 1000 == 0:\n", "        print(\"{0}/{1} done\".format(i, len(imgs)))\n", "    dist, ind = knn.kneighbors(img[0])\n", "    sub = ind.ravel()[dist.ravel() <= 10]\n", "    if len(sub) > 0:\n", "        for j in sub:\n", "            pairs.append((i, j))"]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["[(0, 0),\n", " (1, 1),\n", " (2, 2),\n", " (3, 12),\n", " (3, 3),\n", " (3, 10),\n", " (4, 4),\n", " (5, 133),\n", " (5, 1549),\n", " (5, 158)]"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["pairs[:10]"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"data": 
{"text/plain": ["(75725, 33675)"]}, "execution_count": 14, "metadata": {}, "output_type": "execute_result"}], "source": ["pairs2 = [(i,j) for i,j in pairs if i != j]\n", "len(pairs), len(pairs2)"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [{"data": {"text/plain": ["[(3, 12),\n", " (3, 10),\n", " (5, 133),\n", " (5, 1549),\n", " (5, 158),\n", " (5, 5632),\n", " (5, 16784),\n", " (8, 14699),\n", " (8, 23),\n", " (8, 35)]"]}, "execution_count": 15, "metadata": {}, "output_type": "execute_result"}], "source": ["pairs2[:10]"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["dist, ind = knn.kneighbors(imgs[5][0])\n", "knn.plot_neighbors(ind, dist, obs=imgs[5][0], folder_or_images=folder);"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Connected components"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["0 9096\n", "1 6\n", "2 0\n", "3 0\n", "4 0\n", "5 0\n", "6 0\n", "7 0\n", "8 0\n", "9 0\n"]}], "source": ["distincts = []\n", "for i, j in pairs2:\n", "    distincts.append(i)\n", "    distincts.append(j)\n", "distincts = set(distincts)\n", "# every image starts in its own component, labeled by its index\n", "connex = {}\n", "for k in distincts:\n", "    connex[k] = k\n", "\n", "# propagate the smallest label along each pair, at most 10 passes\n", "n = 0\n", "while n < 10:\n", "    modif = 0\n", "    for i, j in pairs2:\n", "        a = min(connex[i], connex[j])\n", "        if a != connex[i] or a != connex[j]:\n", "            modif += 1\n", "            connex[i] = connex[j] = a\n", "    print(n, modif)\n", "    n += 1"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"text/plain": ["(13271, 4185)"]}, "execution_count": 18, "metadata": {}, "output_type": "execute_result"}], "source": ["len(connex), len(set(connex.values()))"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/plain": ["['inondation_2016/735614357036519425_CjVtTTrUoAAUUZp.jpg',\n", " 'inondation_2016/735616090261184512_CjVu73ZVEAAlWmu.jpg']"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["names = knn.image_names_\n", "names[:2]"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/plain": ["9086"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["# keep one representative per component, flag the others as duplicates\n", "dups = []\n", "for i, j in connex.items():\n", "    if i != j:\n", "        dups.append(names[i])\n", "len(dups)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Very close images"]}, {"cell_type": "code", "execution_count": 20, 
"metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["dist = [[ 0. 21.97726098 21.97726098 21.97726098 161.13348504]]\n", "ind = [[ 285 308 351 311 3005]]\n"]}], "source": ["# first image whose second nearest neighbor is close but above the duplicate threshold\n", "for i, img in enumerate(imgs):\n", "    dist, ind = knn.kneighbors(img[0])\n", "    if 10 < dist[0, 1] <= 30:\n", "        print(\"dist =\", dist)\n", "        print(\"ind =\", ind)\n", "        break"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["obs = imgs[ind[0, 0]][0]\n", "knn.plot_neighbors(ind, dist, obs=obs, folder_or_images=folder);"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Copying the dataset"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"data": {"text/plain": ["9086"]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["not_allowed = set(dups)\n", "len(not_allowed)"]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [{"data": {"text/plain": ["['inondation_2016/735805396657397762_CjYbG-DUgAQTu19.jpg',\n", " 'inondation_2016/735829559329853440_CjYxFcrXEAAvjlH.jpg',\n", " 'inondation_2016/735870604038045696_CjZWafAXEAA3sOb.jpg',\n", " 'inondation_2016/735892072960512000_CjZp8CoWsAIOhL5.jpg',\n", " 'inondation_2016/735892650583306240_CjZqdvoXAAEaSRM.jpg']"]}, "execution_count": 24, "metadata": {}, "output_type": "execute_result"}], "source": ["list(sorted(not_allowed))[:5]"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["[stream_copy_images] copy image 0: 'bing\\01-9.jpg' - class 'bing'\n", "[stream_copy_images] copy image 1000: 'imagenet1\\3271012508_955158b073.jpg' - class 'imagenet1'\n", "[stream_copy_images] copy image 2000: 'imagenet2\\3287016043_987800dc67.jpg' - class 'imagenet2'\n", "[stream_copy_images] copy image 3000: 'imagenet4\\106994_5349_big_200907_voyager11.jpg' - class 'imagenet4'\n", "[stream_copy_images] copy image 4000: 'imagenet5\\532346050_dafb11ec86.jpg' - class 'imagenet5'\n", "[stream_copy_images] copy image 5000: 'inondation_2016\\736966968138473472_Cjo7jTrXAAAeffo.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 6000: 'inondation_2016\\737629970399252480_CjySiGiUkAUr8TC.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 7000: 
'inondation_2016\\737923554407256064_Cj2hYNwWUAElsOP.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 8000: 'inondation_2016\\738072076880347136_Cj4opLHXEAAuuGK.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 9000: 'inondation_2016\\738298504267730945_Cj72k1kUoAAIKUA.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 10000: 'inondation_2016\\738378724442296321_Cj8_iRUXEAEIYex.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 11000: 'inondation_2016\\738456441082793984_Cj-GNbPWkAAecmj.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 12000: 'inondation_2016\\738642491671379968_CkAvbhyVAAQdhnl.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 13000: 'inondation_2016\\738708144893927424_CkBrBsFXIAAesMt.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 14000: 'inondation_2016\\738775822753013760_CkCosKRXEAAL3QS.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 15000: 'inondation_2016\\738983572388913152_CkFlodbW0AAjH1A.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 16000: 'inondation_2016\\739133036877467649_CkHtiX3XEAAQ5qt.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 17000: 'inondation_2016\\739435820709519360_CkMA9WNXAAEBBwW.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 18000: 'inondation_2016\\739759634534141958_CkQnd1TUUAQli3i.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 19000: 'inondation_2016\\740101248225935361_CkVVPYDWUAAc8U3.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 20000: 'inondation_2016\\740462147130556416_CkamZeeXAAIf6ru.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 21000: 'inondation_2016\\740924772062769152_CkhLHExW0AIpwYC.jpg' - class 'inondation_2016'\n", "[stream_copy_images] copy image 22000: 'inondation_2016\\742979124050964480_Ck-XkQfXEAE46Wh.jpg' - class 
'inondation_2016'\n", "[stream_copy_images] copy image 23000: 'inondation_2018\\955500762070769664_DUKe4P3WAAEBJFC.jpg' - class 'inondation_2018'\n", "[stream_copy_images] copy image 24000: 'inondation_2018\\956447069216165890_DUX7giCXUAANfkI.jpg' - class 'inondation_2018'\n", "[stream_copy_images] copy image 25000: 'inondation_2018\\957555126931279872_DUnrT9aXUAARFxJ.jpg' - class 'inondation_2018'\n", "[stream_copy_images] copy image 26000: 'inondation_2018\\959394452564598784_DVB0KQsWkAA4Bta.jpg' - class 'inondation_2018'\n", "[stream_copy_images] copy image 27000: 'inondation_2018\\965549350599487488_DWZSA7cWsAEWGaK.jpg' - class 'inondation_2018'\n", "[stream_copy_images] copy image 28000: 'seisme_Amatrice\\768296828550819841_CqmJ4k4UsAEcTaF.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 29000: 'seisme_Amatrice\\768330792049205248_CqmooXdXgAAJ2b3.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 30000: 'seisme_Amatrice\\768348574694408192_Cqm4itvWcAAsv_s.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 31000: 'seisme_Amatrice\\768363756728516608_CqnGo90WIAAfA1I.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 32000: 'seisme_Amatrice\\768376884677738496_CqnOR17WIAAP2Hn.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 33000: 'seisme_Amatrice\\768390411228422144_Cqne_UWWYAAnY6V.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 34000: 'seisme_Amatrice\\768404063755141120_CqnrVy8XYAAYIGO.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 35000: 'seisme_Amatrice\\768420565745106944_Cqn6bjbWIAEAbck.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 36000: 'seisme_Amatrice\\768436635444908032_CqoI-OfWIAEpe5T.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 37000: 'seisme_Amatrice\\768453842098880512_CqoYsPEXEAARM5o.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 38000: 
'seisme_Amatrice\\768471447140458496_CqoosvJW8AIjpEA.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 39000: 'seisme_Amatrice\\768492129882517506_Cqo7hBpW8AA64OU.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 40000: 'seisme_Amatrice\\768516668515577856_CqpR0mDWIAAEGkL.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 41000: 'seisme_Amatrice\\768550981206441984_Cqpw-qOWAAAVGTB.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 42000: 'seisme_Amatrice\\768679088013778944_CqrlaXlVUAAcqVG.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 43000: 'seisme_Amatrice\\768721000015892480_CqsLrL6UkAAofwM.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 44000: 'seisme_Amatrice\\768749206500741120_CqslU7hWEAA_Cyn.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 45000: 'seisme_Amatrice\\768777504609931264_Cqs_DzcWAAAZ2_h.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 46000: 'seisme_Amatrice\\768810730250461184_CqtdRvNWAAAE_HJ.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 47000: 'seisme_Amatrice\\768850688487022592_Cqsp-GyWgAEJ6Kp.jpg' - class 'seisme_Amatrice'\n", "[stream_copy_images] copy image 48000: 'seisme_Amatrice\\768916332322648064_Cqu9OPSWgAEzaux.jpg' - class 'seisme_Amatrice'\n"]}], "source": ["from ensae_projects.hackathon.image_helper import stream_copy_images\n", "\n", "src_folder = \"c:/temp/suricatenat_images/\"\n", "dest_folder = \"c:/temp/suricatenat_clean/\"\n", "\n", "def valid(name):\n", "    # relative path with forward slashes, comparable to the names in not_allowed\n", "    spl = name.split(\"suricatenat_images\")[-1].replace(\"\\\\\", \"/\").strip(\"/\\\\\")\n", "    return spl not in not_allowed\n", "\n", "for img in stream_copy_images(src_folder, dest_folder, valid, fLOG=print):\n", "    pass"]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": ["l1 = list(enumerate_image_class(\"c:/temp/suricatenat_images/\"))"]}, {"cell_type": "code", 
"execution_count": 26, "metadata": {}, "outputs": [], "source": ["l2 = list(enumerate_image_class(\"c:/temp/suricatenat_clean/\"))"]}, {"cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [{"data": {"text/plain": ["(48884, 39798)"]}, "execution_count": 28, "metadata": {}, "output_type": "execute_result"}], "source": ["len(l1), len(l2)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Taking a random sample"]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": ["from ensae_projects.hackathon.image_helper import stream_random_sample, last_element\n", "rnd = last_element(stream_random_sample(\"c:/temp/suricatenat_clean/\", abspath=False))"]}, {"cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('imagenet2\\\\2611787731_6b65bdaf6a.jpg', 'imagenet2'),\n", " ('inondation_2016\\\\740608740169224192_CkcruUEXIAEsWUl.jpg',\n", " 'inondation_2016'),\n", " ('inondation_2016\\\\738614580658606080_CkAWBegUgAA5Z9l.jpg',\n", " 'inondation_2016'),\n", " ('inondation_2018\\\\956548703552245760_DUZX5TRWsAAyDqH.jpg',\n", " 'inondation_2018'),\n", " ('inondation_2018\\\\956925376936148993_DUeuiGQX4AAocq-.jpg',\n", " 'inondation_2018')]"]}, "execution_count": 30, "metadata": {}, "output_type": "execute_result"}], "source": ["rnd[:5]"]}, {"cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": ["import os\n", "import shutil\n", "\n", "src_folder = \"c:/temp/suricatenat_clean/\"\n", "dest_folder = \"c:/temp/suricatenat_sample/\"\n", "\n", "# copy every sampled image, creating the class subfolder when needed\n", "for img, sub in rnd:\n", "    src = os.path.join(src_folder, img)\n", "    dst = os.path.join(dest_folder, img)\n", "    d = os.path.dirname(dst)\n", "    if not os.path.exists(d):\n", "        os.makedirs(d)\n", "\n", "    shutil.copy(src, dst)"]}, {"cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": 
"python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0"}}, "nbformat": 4, "nbformat_minor": 2}