{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# 1A.algo - La distance d'\u00e9dition (correction)\n", "\n", "Correction."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["Plan\n", "
run previous cell, wait for 2 seconds
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["1"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["def dist_hamming(m1,m2):\n", " d = 0\n", " for a,b in zip(m1,m2):\n", " if a != b : \n", " d += 1\n", " return d\n", "\n", "dist_hamming(\"close\", \"cloue\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 1 : comment prendre en compte diff\u00e9rentes tailles de mots ?\n", "\n", "On peut allonger le mot le plus court par des espaces ce qui revient \u00e0 ajouter au r\u00e9sultat la diff\u00e9rence de taille."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/plain": ["(1, 2)"]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["def dist_hamming(m1,m2):\n", " d = abs(len(m1)-len(m2))\n", " for a,b in zip(m1,m2):\n", " if a != b : \n", " d += 1\n", " return d\n", "\n", "dist_hamming(\"close\", \"cloue\"), dist_hamming(\"close\", \"clouet\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Exercice 2 : impl\u00e9menter une distance \u00e0 partir de cette \u00e9galit\u00e9"]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["### premi\u00e8re option r\u00e9cursive \n", "\n", "Comme l'\u00e9criture est r\u00e9cursive, on peut essayer m\u00eame si cela n'est pas optimal (pas optimal du tout)."]}, {"cell_type": "code", "execution_count": 4, "metadata": {"collapsed": true}, "outputs": [], "source": ["def distance_edition_rec(m1,m2):\n", " if max(len(m1), len(m2)) <= 2 or min(len(m1), len(m2)) <= 1:\n", " return dist_hamming(m1,m2)\n", " else:\n", " collecte = []\n", " for i in range(1,len(m1)):\n", " for j in range(1,len(m2)):\n", " d1 = distance_edition_rec(m1[:i],m2[:j])\n", " d2 = distance_edition_rec(m1[i:],m2[j:])\n", " collecte.append(d1+d2)\n", " return min(collecte)"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["1"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["distance_edition_rec(\"longmot\", \"liongmot\")"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"text/plain": ["1"]}, "execution_count": 7, "metadata": {}, "output_type": "execute_result"}], "source": ["distance_edition_rec(\"longmot\", \"longmoit\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Que se passe-t-il lorsqu'on enl\u00e8ve la condition ``or min(len(m1), len(m2)) <= 1`` ?"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### version non r\u00e9cursive qui m\u00e9morise les r\u00e9sultats"]}, {"cell_type": "code", "execution_count": 7, "metadata": {"collapsed": true}, "outputs": [], "source": ["def distance_edition_rec_cache(m1,m2,cache=None):\n", " if cache is None:\n", " cache = {}\n", " if (m1,m2) in cache:\n", " return cache[m1,m2]\n", " if max(len(m1), len(m2)) <= 2 or min(len(m1), len(m2)) <= 1:\n", " cache[m1,m2] = dist_hamming(m1,m2)\n", " return cache[m1,m2]\n", " else:\n", " collecte = []\n", " for i in range(1,len(m1)):\n", " for j in range(1,len(m2)):\n", " d1 = distance_edition_rec_cache(m1[:i],m2[:j], cache)\n", " d2 = distance_edition_rec_cache(m1[i:],m2[j:], cache)\n", " collecte.append(d1+d2)\n", " cache[m1,m2] = min(collecte)\n", " return cache[m1,m2] "]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [{"data": {"text/plain": ["(1, 1)"]}, "execution_count": 9, "metadata": {}, "output_type": "execute_result"}], "source": ["distance_edition_rec_cache(\"longmot\", \"liongmot\"), distance_edition_rec_cache(\"longmot\", \"longmoit\")"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["10 loops, best of 3: 48.4 ms per loop\n"]}], "source": ["%timeit distance_edition_rec(\"longmot\", \"longmoit\")"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["100 loops, best of 3: 4.82 ms per loop\n"]}], "source": ["%timeit distance_edition_rec_cache(\"longmot\", \"longmoit\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Il appara\u00eet qu'on perd un temps fou dans la premi\u00e8re version \u00e0 recalculer un grand nombre de fois les m\u00eames distances. Conserver ces r\u00e9sultats permet d'aller beaucoup plus vite."]}, {"cell_type": "markdown", "metadata": {"collapsed": true}, "source": ["## Exercice 3 : impl\u00e9menter la distance d'\u00e9dition"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### version r\u00e9cursive avec cache\n", "\n", "On reprend la derni\u00e8re version en la modificant pour ne tenir compte des mots ins\u00e9cables."]}, {"cell_type": "code", "execution_count": 11, "metadata": {"collapsed": true}, "outputs": [], "source": ["def distance_edition_rec_cache_insecable(m1,m2,cache=None):\n", " if cache is None:\n", " cache = {}\n", " if (m1,m2) in cache:\n", " return cache[m1,m2]\n", " if max(len(m1), len(m2)) <= 2 or min(len(m1), len(m2)) <= 1:\n", " cache[m1,m2] = dist_hamming(m1,m2)\n", " return cache[m1,m2]\n", " else:\n", " i = len(m1)\n", " j = len(m2)\n", " d1 = distance_edition_rec_cache_insecable(m1[:i-2],m2[:j-1], cache) + dist_hamming(m1[i-2:], m2[j-1:])\n", " d2 = distance_edition_rec_cache_insecable(m1[:i-1],m2[:j-2], cache) + dist_hamming(m1[i-1:], m2[j-2:])\n", " d3 = distance_edition_rec_cache_insecable(m1[:i-1],m2[:j-1], cache) + dist_hamming(m1[i-1:], m2[j-1:])\n", " cache[m1,m2] = min(d1,d2,d3)\n", " return cache[m1,m2] "]}, {"cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [{"data": {"text/plain": ["(1, 1)"]}, "execution_count": 13, "metadata": {}, "output_type": "execute_result"}], "source": ["distance_edition_rec_cache_insecable(\"longmot\", \"liongmot\"), distance_edition_rec_cache_insecable(\"longmot\", \"longmoit\")"]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["The slowest run took 4.80 times longer than the fastest. This could mean that an intermediate result is being cached \n", "1000 loops, best of 3: 227 \u00b5s per loop\n"]}], "source": ["%timeit distance_edition_rec_cache_insecable(\"longmot\", \"longmoit\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["C'est encore plus rapide."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### version non r\u00e9cursive\n", "\n", "La version non r\u00e9cursive est plus simple \u00e0 envisager dans ce cas."]}, {"cell_type": "code", "execution_count": 14, "metadata": {"collapsed": true}, "outputs": [], "source": ["def distance_edition_insecable(m1,m2,cache=None):\n", " dist = {}\n", " dist[-2,-1] = 0\n", " dist[-1,-2] = 0\n", " dist[-1,-1] = 0\n", " for i in range(0,len(m1)):\n", " dist[i,-1] = i\n", " dist[i,-2] = i\n", " for j in range(0,len(m2)):\n", " dist[-1,j] = j\n", " dist[-2,j] = j\n", " \n", " for i in range(0,len(m1)):\n", " for j in range(0,len(m2)):\n", " d1 = dist[i-2,j-1] + dist_hamming(m1[i-2:i], m2[j-1:j])\n", " d2 = dist[i-1,j-2] + dist_hamming(m1[i-1:i], m2[j-2:j])\n", " d3 = dist[i-1,j-1] + dist_hamming(m1[i-1:i], m2[j-1:j])\n", " dist[i,j] = min(d1,d2,d3)\n", " return dist[len(m1)-1, len(m2)-1]"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["(1, 1)"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["distance_edition_insecable(\"longmot\", \"liongmot\"), distance_edition_insecable(\"longmot\", \"longmoit\")"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["The slowest run took 7.55 times longer than the fastest. This could mean that an intermediate result is being cached \n", "1000 loops, best of 3: 354 \u00b5s per loop\n"]}], "source": ["%timeit distance_edition_insecable(\"longmot\", \"longmoit\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## diff\u00e9rence avec l'algorithme de wikip\u00e9dia\n", "\n", "La distance de Hamming n'est pas pr\u00e9sente dans l'algorithme d\u00e9crit sur la page Wikipedia. C'est parce qu'on d\u00e9compose la distance de Hamming entre un mot de 1 caract\u00e8re et un mot de 2 caract\u00e8res par une comparaison et une insertion (ou une suppression)."]}, {"cell_type": "code", "execution_count": 17, "metadata": {"collapsed": true}, "outputs": [], "source": ["def distance_edition(m1,m2,cache=None):\n", " dist = {}\n", " dist[-1,-1] = 0\n", " for i in range(0,len(m1)):\n", " dist[i,-1] = i\n", " for j in range(0,len(m2)):\n", " dist[-1,j] = j\n", " \n", " for i, c in enumerate(m1):\n", " for j, d in enumerate(m2):\n", " d1 = dist[i-1,j] + 1 # insertion\n", " d2 = dist[i,j-1] + 1 # suppression\n", " x = 0 if c == d else 1\n", " d3 = dist[i-1,j-1] + x\n", " dist[i,j] = min(d1,d2,d3)\n", " return dist[len(m1)-1, len(m2)-1]"]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [{"data": {"text/plain": ["(1, 1)"]}, "execution_count": 19, "metadata": {}, "output_type": "execute_result"}], "source": ["distance_edition(\"longmot\", \"liongmot\"), distance_edition(\"longmot\", \"longmoit\")"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["The slowest run took 4.42 times longer than the fastest. This could mean that an intermediate result is being cached \n", "1000 loops, best of 3: 355 \u00b5s per loop\n"]}], "source": ["%timeit distance_edition_insecable(\"longmot\", \"longmoit\")"]}, {"cell_type": "code", "execution_count": 20, "metadata": {"collapsed": true}, "outputs": [], "source": []}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1"}}, "nbformat": 4, "nbformat_minor": 2}