{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# TD7 - Analyse de texte - correction\n", "\n", "Analyse de texte, TF-IDF, LDA, moteur de recherche, expressions r\u00e9guli\u00e8res (correction)."]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", ""], "text/plain": [""]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["from jyquickhelper import add_notebook_menu\n", "add_notebook_menu()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Getting the data\n", "\n", "The data can be downloaded here: [df_pocket.zip](http://www.xavierdupre.fr/enseignement/complements/df_pocket.zip)."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/plain": ["['data_pocket.json', 'df_pocket.csv']"]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from pyensae.datasource import download_data\n", "download_data(\"df_pocket.zip\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Cleaning the data (regexp and nltk)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["First of all, we should improve the quality of our data (by improving the parsers and the stopword list). That is the purpose of this section."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" resolved_url \\\n", "1883956314 http://www.xavierdupre.fr/app/teachpyx/helpsph... \n", "1895830689 https://www.pluralsight.com/paths/javascript?a... \n", "1916603293 http://www.seloger.com/annonces/locations/appa... \n", "1916600800 http://www.seloger.com/annonces/locations/appa... \n", "1916598390 http://www.seloger.com/annonces/locations/appa... \n", "\n", " resolved_title \\\n", "1883956314 Types et variables du langage python\u00b6 \n", "1895830689 JavaScript \n", "1916603293 Location Appartement 56,88m\u00b2 Asnieres-sur-Sein... \n", "1916600800 Location Appartement 82m\u00b2 Asnieres sur Seine -... \n", "1916598390 Location Appartement 93,6m\u00b2 Asni\u00e8res-sur-Seine \n", "\n", " excerpt \\\n", "1883956314 Il est impossible d\u2019\u00e9crire un programme sans u... \n", "1895830689 ES6 is a major update to the JavaScript langua... \n", "1916603293 Prix au m\u00b2 fourni \u00e0 titre indicatif, seul un p... \n", "1916600800 Trouvez votre bien \u00e0 tout moment ... \n", "1916598390 Trouvez votre bien \u00e0 tout moment ... \n", "\n", " tags \n", "1883956314 {'python': {'item_id': '1883956314', 'tag': 'p... \n", "1895830689 NaN \n", "1916603293 NaN \n", "1916600800 NaN \n", "1916598390 NaN "]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["import json\n", "from pprint import pprint\n", "\n", "with open('./data_pocket.json') as fp: \n", " dict_pocket = json.load(fp)\n", "dict_to_df = {}\n", "\n", "keys = ['resolved_url', 'resolved_title', 'excerpt', 'tags']\n", "\n", "for (k,v) in dict_pocket.items():\n", " dict_to_df[k] = dict(zip(keys, [v[key] for key in keys if key in v]))\n", "import pandas as p\n", "df_pocket = p.DataFrame.from_dict(dict_to_df, orient = \"index\")\n", "df_pocket.head()"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" tags \\\n", "0 ['mobile app'] \n", "1 ['lewagon'] \n", "2 ['data science'] \n", "3 ['abtest'] \n", "4 ['mdn', 'documentation'] \n", "\n", " url \\\n", "0 https://www.grafikart.fr/tutoriels/cordova/ion... \n", "1 http://www.colorhunt.co \n", "2 https://jakevdp.github.io/blog/2015/08/14/out-... \n", "3 https://blog.dominodatalab.com/ab-testing-with... \n", "4 https://developer.mozilla.org/en-US/docs/Learn... \n", "\n", " excerpt \\\n", "0 Ionic est un framework qui va vous permettre d... \n", "1 Home Create Likes () About Add To Chrome Faceb... \n", "2 In recent months, a host of new tools and pack... \n", "3 In this post, I discuss a method for A/B testi... \n", "4 Getting started with the Web is a concise seri... \n", "\n", " title domain \\\n", "0 Tutoriel Vid\u00e9o Apache CordovaIonic Framework grafikart.fr \n", "1 Color Hunt colorhunt.co \n", "2 Out-of-Core Dataframes in Python: Dask and Ope... jakevdp.github.io \n", "3 A/B Testing with Hierarchical Models in Python blog.dominodatalab.com \n", "4 Getting started with the Web developer.mozilla.org \n", "\n", " html_soup \n", "0 {'h2': ['Petit', 'tour', 'du', 'propri\u00e9taire']... \n", "1 {} \n", "2 {'h2': ['Pubs', 'of', 'the', 'British', 'Isles... \n", "3 {'h2': ['Recent', 'Posts'], 'h3': ['Related'],... \n", "4 {'h2': ['Mozilla'], 'h3': ['How', 'the', 'web'... "]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["import pandas as p\n", "import ast\n", "df_pocket = p.read_csv('./df_pocket.csv')\n", "df_pocket.head()"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" tags url \\\n", "0 [mobile app] https://www.grafikart.fr/tutoriels/cordova/ion... \n", "1 [lewagon] http://www.colorhunt.co \n", "2 [data science] https://jakevdp.github.io/blog/2015/08/14/out-... \n", "3 [abtest] https://blog.dominodatalab.com/ab-testing-with... \n", "4 [mdn, documentation] https://developer.mozilla.org/en-US/docs/Learn... \n", "\n", " excerpt \\\n", "0 Ionic est un framework qui va vous permettre d... \n", "1 Home Create Likes () About Add To Chrome Faceb... \n", "2 In recent months, a host of new tools and pack... \n", "3 In this post, I discuss a method for A/B testi... \n", "4 Getting started with the Web is a concise seri... \n", "\n", " title domain \\\n", "0 Tutoriel Vid\u00e9o Apache CordovaIonic Framework grafikart.fr \n", "1 Color Hunt colorhunt.co \n", "2 Out-of-Core Dataframes in Python: Dask and Ope... jakevdp.github.io \n", "3 A/B Testing with Hierarchical Models in Python blog.dominodatalab.com \n", "4 Getting started with the Web developer.mozilla.org \n", "\n", " html_soup \n", "0 {'h2': ['Petit', 'tour', 'du', 'propri\u00e9taire']... \n", "1 {} \n", "2 {'h2': ['Pubs', 'of', 'the', 'British', 'Isles... \n", "3 {'h2': ['Recent', 'Posts'], 'h3': ['Related'],... \n", "4 {'h2': ['Mozilla'], 'h3': ['How', 'the', 'web'... 
"]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["df_pocket['html_soup'] = df_pocket['html_soup'].apply(lambda x : ast.literal_eval(x) if x != \"scraper banned\" else x)\n", "df_pocket['tags'] = df_pocket['tags'].apply(lambda x : ast.literal_eval(x) if x == x else x)\n", "df_pocket.head()"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": ["def nan_to_string(x):\n", " if x==x:\n", " return x\n", " else:\n", " return ''\n", "\n", "title_string = ' '.join(df_pocket['title'].apply( lambda x: nan_to_string(x)))"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": ["import re\n", "def url_cleaning(url):\n", " return ' '.join(re.split(r'\\/|\\.|\\:|-|\\?',url))\n", "\n", "url_string = ' '.join(df_pocket['url'].apply(lambda x : url_cleaning(x)))"]}, {"cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": ["def hx_cleaning(d,hx):\n", " if str(hx) in d:\n", " return ' '.join(d[str(hx)])\n", " else: \n", " return ''\n", " \n", "h1_string = ' '.join(df_pocket['html_soup'].apply(lambda x : hx_cleaning(x,'h1')))"]}, {"cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": ["h2_string = ' '.join(df_pocket['html_soup'].apply(lambda x : hx_cleaning(x,'h2')))\n", "h3_string = ' '.join(df_pocket['html_soup'].apply(lambda x : hx_cleaning(x, 'h3')))\n", "excerpt_string = ' '.join(df_pocket['excerpt'].apply( lambda x: nan_to_string(x)))"]}, {"cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": ["def p_cleaning(x):\n", " if (type(x) == dict) & ('p' in x ):\n", " return ' '.join(x['p'])\n", " else: \n", " return ''\n", "\n", "p_string = ' '.join(df_pocket['html_soup'].apply(lambda x : p_cleaning(x)))"]}, {"cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [{"data": {"text/plain": ["'Tutoriel Vido Apache CordovaIonic Framework Color Hunt OutofCore Dataframes in Python 
Dask and OpenStreetMap AB Testing with Hierarchical Models in Python Getting started with the Web Le Wagon UI components shakacodereactonrails Le Wagon Alumni Nocturne Opus No Frederic Chopin Piano Tutorial Synthesia ES Promises in Depth Jupyter Notebook Viewer Introduction lanalyse de texte avec nltk Tokenization CamFind API Documentation Build an Elasticsearch Index with PythonMachine Learning Series Part The MustHave Discovery Tool for every Media Professional Python NLTK WTF Chapter Notes on things that dont work right Google Maps Geolocation Tracking in Realtime with JavaScript The Definitive Guide to Natural Language Processing lewagonrailsstylesheets Productivity Tips for Programmers with ADHD EquityOwl No Cash No Salaries Just Equity React For Beginners Learn Enough Text Editor to Be Dangerous Michael Hartl Mapbox The New WordPresscom Why Proxima Nova Is Everywhere PayByPhone Adds NFC in San F...'"]}, "execution_count": 12, "metadata": {}, "output_type": "execute_result"}], "source": ["words = ' '.join([title_string,url_string,h1_string,h2_string,h3_string,excerpt_string])\n", "# keep only ASCII letters and spaces (accented characters are dropped as well)\n", "words_string = re.sub('[^A-Za-z ]','', words)\n", "# collapse runs of whitespace into a single space\n", "words_string = re.sub(r'\\s+',' ', words_string)\n", "words_string[:1000] + '...'"]}, {"cell_type": "code", "execution_count": 12, "metadata": {"scrolled": true}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Defaulting to user installation because normal site-packages is not writeable\n", "Requirement already satisfied: nltk in c:\\python395_x64\\lib\\site-packages (3.7)\n", "Requirement already satisfied: joblib in c:\\python395_x64\\lib\\site-packages (from nltk) (1.1.0)\n", "Requirement already satisfied: click in c:\\python395_x64\\lib\\site-packages (from nltk) (8.0.1)\n", "Requirement already satisfied: regex>=2021.8.3 in c:\\python395_x64\\lib\\site-packages (from nltk) (2022.1.18)\n", "Requirement already satisfied: tqdm in 
c:\\python395_x64\\lib\\site-packages (from nltk) (4.62.3)\n", "Requirement already satisfied: colorama in c:\\python395_x64\\lib\\site-packages (from click->nltk) (0.4.4)\n"]}], "source": ["! pip install nltk"]}, {"cell_type": "markdown", "metadata": {}, "source": ["NLTK ships a stopword corpus in several languages. It can be used to extend the list we already have."]}, {"cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["{'her', \"i'd\", 'than', 'under', 'yours', 'such', \"haven't\", \"doesn't\", \"didn't\", 'then', \"couldn't\", 'myself', 'from', 'since', 'are', 'have', 'again', 'the', 'did', 'being', 'which', 'be', 'yourself', 'am', 'here', \"weren't\", \"he's\", 'himself', 'ought', 'for', \"you're\", 'because', 'how', 'about', 'that', 'he', 'theirs', 'she', 'those', 'after', \"shan't\", 'other', 'ours', 'so', 'very', 'between', \"how's\", 'you', \"when's\", \"she'd\", 'above', \"he'd\", 'with', \"we're\", \"shouldn't\", 'www', 'should', 'in', 'like', 'down', 'hence', 'own', \"they'd\", 'any', 'but', 'as', 'was', \"aren't\", \"won't\", 'r', 'hers', 'them', \"mustn't\", 'this', 'too', 'most', 'not', 'through', \"let's\", \"she's\", 'all', 'up', 'get', 'do', 'both', 'been', \"that's\", \"you've\", \"he'll\", \"isn't\", 'is', 'further', \"wasn't\", \"wouldn't\", 'has', 'once', 'out', 'our', 'when', 'yourselves', 'until', 'also', 'otherwise', 'can', 'more', 'during', \"here's\", 'only', 'however', 'no', 'it', \"hasn't\", 'same', 'herself', \"they'll\", 'itself', 'i', 'ever', 'nor', \"she'll\", \"where's\", 'whom', 'there', \"hadn't\", 'would', 'these', 'does', 'each', 'off', 'what', 'at', 'or', 'to', \"you'd\", 'themselves', 'doing', \"what's\", 'and', 'him', 'had', \"we've\", \"i've\", \"they're\", 'my', 'therefore', \"i'll\", 'on', 'against', 'his', 'why', 'an', 'its', 'else', 'by', 'me', 'ourselves', \"they've\", 'k', \"i'm\", 'of', 'could', 'their', \"we'd\", 'below', 
'they', 'who', 'com', 'just', 'few', 'before', \"you'll\", 'if', 'some', \"we'll\", 'we', \"why's\", 'shall', 'having', 'while', 'into', \"it's\", 'your', 'http', 'a', 'where', \"there's\", \"who's\", \"can't\", 'were', 'over', 'cannot', \"don't\"}\n"]}], "source": ["from wordcloud import WordCloud, STOPWORDS\n", "\n", "stopwords = set(STOPWORDS)\n", "print(STOPWORDS)"]}, {"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": ["for wds in ['http', 'https', 'www', 'fr', 'com', 'io', 'org', 'co', 'jo', 'edu', 'news', 'html', 'htm',\\\n", " 'github', 'youtube', 'google', 'blog', 'watch', 'de', 'le', 'la', 'en', 'sur', 'vous', 'les', \\\n", " 'ajouter', 'README', 'md', 'et', 'PROCESS', 'CMYK', 'des', 'chargement', 'playlists', 'endobj', \\\n", " 'obj','est', 'use', 'using', 'will', 'web', 'first','pour', 'du', 'une', 'que']:\n", " stopwords.add(wds)"]}, {"cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [{"data": {"text/plain": ["['her',\n", " 'sont',\n", " 'under',\n", " 'such',\n", " \"haven't\",\n", " 'jo',\n", " 'then',\n", " 'myself',\n", " 'fussiez',\n", " 'sa',\n", " 'fr',\n", " 'haven',\n", " 'again',\n", " 'md',\n", " '\u00e9t\u00e9es',\n", " 'the',\n", " 'un',\n", " 'did',\n", " 'seront',\n", " 'par',\n", " 'playlists',\n", " 'yourself',\n", " 've',\n", " \"he's\",\n", " 'himself',\n", " 'use',\n", " 'ma',\n", " 'ses',\n", " 'about',\n", " 'he',\n", " 'theirs',\n", " 'she',\n", " 'ayant',\n", " 'en',\n", " 'news',\n", " 'other',\n", " 'll',\n", " 'weren',\n", " 'et',\n", " 'so',\n", " 'between',\n", " 'you',\n", " 'dans',\n", " \"when's\",\n", " 'aies',\n", " 't',\n", " \"she'd\",\n", " 'with',\n", " '\u00e9tants',\n", " \"that'll\",\n", " '...']"]}, "execution_count": 16, "metadata": {}, "output_type": "execute_result"}], "source": ["import nltk\n", "\n", "stopwords_fr_ntlk = set(nltk.corpus.stopwords.words('french'))\n", "stopwords_en_ntlk = set(nltk.corpus.stopwords.words('english'))\n", "stopwords_clean 
= [ l.lower() for l in list(stopwords.union(stopwords_fr_ntlk).union(stopwords_en_ntlk))]\n", "stopwords_clean[:50] + ['...']"]}, {"cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": ["%matplotlib inline"]}, {"cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": ["
"]}, "metadata": {"needs_background": "light"}, "output_type": "display_data"}], "source": ["wordcloud = WordCloud(stopwords=stopwords, background_color=\"white\")\n", "\n", "wordcloud.generate(words_string)\n", "\n", "import matplotlib.pyplot as plt\n", "plt.imshow(wordcloud)\n", "plt.axis('off');"]}, {"cell_type": "markdown", "metadata": {}, "source": ["On applique tout \u00e7a \u00e0 df_pocket."]}, {"cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": ["def words_cleaning(url,title,excerpt,html_soup):\n", " url_clean = url_cleaning(url)\n", " title_clean = nan_to_string(title)\n", " excerpt_clean = nan_to_string(excerpt)\n", " h1_clean = hx_cleaning(html_soup,'h1')\n", " h2_clean = hx_cleaning(html_soup,'h2')\n", " h3_clean = hx_cleaning(html_soup,'h3')\n", " p_clean = p_cleaning(html_soup)\n", " words = ' '.join([url_clean, title_clean, excerpt_clean, h1_clean, h2_clean, h3_clean, p_clean])\n", " words_clean = re.sub('[^A-Za-z ]','', words)\n", " words_clean = re.sub('\\s+',' ', words_clean)\n", " words_list = words_clean.split(' ')\n", " return ' '.join([w.lower() for w in words_list if w not in stopwords_clean])"]}, {"cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [{"data": {"text/html": ["
"], "text/plain": [" tags url \\\n", "0 [mobile app] https://www.grafikart.fr/tutoriels/cordova/ion... \n", "1 [lewagon] http://www.colorhunt.co \n", "2 [data science] https://jakevdp.github.io/blog/2015/08/14/out-... \n", "3 [abtest] https://blog.dominodatalab.com/ab-testing-with... \n", "4 [mdn, documentation] https://developer.mozilla.org/en-US/docs/Learn... \n", "\n", " excerpt \\\n", "0 Ionic est un framework qui va vous permettre d... \n", "1 Home Create Likes () About Add To Chrome Faceb... \n", "2 In recent months, a host of new tools and pack... \n", "3 In this post, I discuss a method for A/B testi... \n", "4 Getting started with the Web is a concise seri... \n", "\n", " title domain \\\n", "0 Tutoriel Vid\u00e9o Apache CordovaIonic Framework grafikart.fr \n", "1 Color Hunt colorhunt.co \n", "2 Out-of-Core Dataframes in Python: Dask and Ope... jakevdp.github.io \n", "3 A/B Testing with Hierarchical Models in Python blog.dominodatalab.com \n", "4 Getting started with the Web developer.mozilla.org \n", "\n", " html_soup \\\n", "0 {'h2': ['Petit', 'tour', 'du', 'propri\u00e9taire']... \n", "1 {} \n", "2 {'h2': ['Pubs', 'of', 'the', 'British', 'Isles... \n", "3 {'h2': ['Recent', 'Posts'], 'h3': ['Related'],... \n", "4 {'h2': ['Mozilla'], 'h3': ['How', 'the', 'web'... \n", "\n", " words_string \n", "0 grafikart tutoriels cordova ionic framework tu... \n", "1 colorhunt color hunt home create likes about a... \n", "2 jakevdp core dataframes python outofcore dataf... \n", "3 dominodatalab ab testing hierarchical models p... \n", "4 developer mozilla us docs learn gettingstarted... 
"]}, "execution_count": 20, "metadata": {}, "output_type": "execute_result"}], "source": ["import numpy as np\n", "df_pocket['words_string'] = np.vectorize(words_cleaning)(df_pocket['url'], \\\n", " df_pocket['title'], \\\n", " df_pocket['excerpt'], \\\n", " df_pocket['html_soup'])\n", "df_pocket.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["A pr\u00e9sent la base df_pocket est nettoy\u00e9e et pr\u00eate \u00e0 \u00eatre utilis\u00e9e pour les analyses de textes. "]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Analyse des donn\u00e9es textuelles - TD-IDF, similarit\u00e9 cosine et n-grams\n", "\n", "Le calcul [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (term frequency\u2013inverse document frequency) permet de calculer un score de proximit\u00e9 entre un terme de recherche et un document (c'est ce que font les moteurs de recherche). \n", "\n", "\n", "La partie tf calcule une fonction croissante de la fr\u00e9quence du terme de recherche dans le document \u00e0 l'\u00e9tude, la partie idf calcule une fonction inversement proportionnelle \u00e0 la fr\u00e9quence du terme dans l'ensemble des documents (ou corpus). \n", "\n", "\n", "\n", "Le score total, obtenu en multipliant les deux composantes, permet ainsi de donner un score d'autant plus \u00e9lev\u00e9 que le terme est surr\u00e9pr\u00e9sent\u00e9 dans un document (par rapport \u00e0 l'ensemble des documents). Il existe plusieurs fonctions, qui p\u00e9nalisent plus ou moins les documents longs, ou qui sont plus ou moins smooth."]}, {"cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": ["corpus = { \n", " 'a' : \"Mr. Green killed Colonel Mustard in the study with the candlestick. \"\n", " \"Mr. 
Green is not a very nice fellow.\",\n", "    'b' : \"Professor Plum has a green plant in his study.\",\n", "    'c' : \"Miss Scarlett watered Professor Plum's green plant while he was away \"\n", "          \"from his office last week.\"\n", "}\n", "terms = {\n", "    'a' : [ i.lower() for i in corpus['a'].split() ],\n", "    'b' : [ i.lower() for i in corpus['b'].split() ],\n", "    'c' : [ i.lower() for i in corpus['c'].split() ]\n", "}"]}, {"cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": ["from math import log\n", "\n", "QUERY_TERMS = ['mr.', 'green']\n", "\n", "def tf(term, doc, normalize=True):\n", "    # term frequency: occurrences of the term, divided by the document length if normalize\n", "    doc = doc.lower().split()\n", "    if normalize:\n", "        return doc.count(term.lower()) / float(len(doc))\n", "    else:\n", "        return doc.count(term.lower()) / 1.0\n", "\n", "\n", "def idf(term, corpus):\n", "    # inverse document frequency: 1 + ln(N / n), or 1 when no document contains the term\n", "    num_texts_with_term = len([True for text in corpus if term.lower() \\\n", "                               in text.lower().split()])\n", "    try:\n", "        return 1.0 + log(float(len(corpus)) / num_texts_with_term)\n", "    except ZeroDivisionError:\n", "        return 1.0\n", "\n", "def tf_idf(term, doc, corpus):\n", "    return tf(term, doc) * idf(term, corpus)"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["a : Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.\n", "b : Professor Plum has a green plant in his study.\n", "c : Miss Scarlett watered Professor Plum's green plant while he was away from his office last week.\n", "\n", "\n", "TF(a): mr. 0.10526315789473684\n", "TF(b): mr. 0.0\n", "TF(c): mr. 0.0\n", "IDF: mr. 2.09861228866811\n", "\n", "\n", "TF-IDF(a): mr. 0.22090655670190631\n", "TF-IDF(b): mr. 0.0\n", "TF-IDF(c): mr. 
0.0\n", "\n", "\n", "TF(a): green 0.10526315789473684\n", "TF(b): green 0.1111111111111111\n", "TF(c): green 0.0625\n", "IDF: green 1.0\n", "\n", "\n", "TF-IDF(a): green 0.10526315789473684\n", "TF-IDF(b): green 0.1111111111111111\n", "TF-IDF(c): green 0.0625\n", "\n", "\n", "Total TF-IDF score for the query 'mr. green'\n", "a 0.3261697145966431\n", "b 0.1111111111111111\n", "c 0.0625\n"]}], "source": ["for (k, v) in sorted(corpus.items()):\n", "    print(k, ':', v)\n", "print('\\n')\n", "\n", "query_scores = {'a': 0, 'b': 0, 'c': 0}\n", "for term in [t.lower() for t in QUERY_TERMS]:\n", "    for doc in sorted(corpus):\n", "        print('TF({}): {}'.format(doc, term), tf(term, corpus[doc]))\n", "    print('IDF: {}'.format(term, ), idf(term, corpus.values()))\n", "    print('\\n')\n", "    for doc in sorted(corpus):\n", "        score = tf_idf(term, corpus[doc], corpus.values())\n", "        print('TF-IDF({}): {}'.format(doc, term), score)\n", "        query_scores[doc] += score\n", "    print('\\n')\n", "\n", "print(\"Total TF-IDF score for the query '{}'\".format(' '.join(QUERY_TERMS), ))\n", "for (doc, score) in sorted(query_scores.items()):\n", "    print(doc, score)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Guided exercise - Computing TF-IDF \n", "\n", "Which document is closest to the query \"green plant\"? Compute the TF-IDF scores for the query \"green plant\". Does the result match your expectations? What happens with \"green\" alone?\n", "\n", "### Green plant"]}, {"cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": ["\n", "from math import log\n", "\n", "def tf(term, doc, normalize=True):\n", "    doc = doc.lower().split()\n", "    if normalize:\n", "        return doc.count(term.lower()) / float(len(doc))\n", "    else:\n", "        return doc.count(term.lower()) / 1.0\n", "\n", "\n", "def idf(term, corpus):\n", "    num_texts_with_term = len([True for text in corpus if term.lower()\n", "                               in text.lower().split()])\n", "    try:\n", "        return 1.0 + log(float(len(corpus)) / num_texts_with_term)\n", "    except ZeroDivisionError:\n", "        return 1.0\n", "\n", "def tf_idf(term, doc, corpus):\n", "    return tf(term, doc) * idf(term, corpus)\n", "\n"]}, {"cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Total TF-IDF score for the query 'green plant'\n", "a 0.10526315789473684\n", "b 0.26727390090090714\n", "c 0.1503415692567603\n"]}], "source": ["QUERY_TERMS = ['green', 'plant']\n", "query_scores = {'a': 0, 'b': 0, 'c': 0}\n", "for term in [t.lower() for t in QUERY_TERMS]:\n", "    for doc in sorted(corpus):\n", "        score = tf_idf(term, corpus[doc], corpus.values())\n", "        query_scores[doc] += score\n", "\n", "print(\"Total TF-IDF score for the query '{}'\".format(' '.join(QUERY_TERMS), ))\n", "for (doc, score) in sorted(query_scores.items()):\n", "    print(doc, score)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Two candidate documents: b or c (a does not contain the word \"plant\"). b is shorter, so \"green plant\" weighs more in it."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Green"]}, {"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Total TF-IDF score for the query 'green'\n", "a 0.10526315789473684\n", "b 0.1111111111111111\n", "c 0.0625\n"]}], "source": ["QUERY_TERMS = ['green']\n", "term = QUERY_TERMS[0].lower()\n", "\n", "query_scores = {'a': 0, 'b': 0, 'c': 0}\n", "\n", "for doc in sorted(corpus):\n", "    score = tf_idf(term, corpus[doc], corpus.values())\n", "    query_scores[doc] += score\n", "\n", "print(\"Total TF-IDF score for the query '{}'\".format(term))\n", "for (doc, score) in sorted(query_scores.items()):\n", "    print(doc, score)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["\"green\" appears in all three documents, so its idf equals 1 and the ranking is driven by tf alone, which favours the short document b."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 1 - TF-IDF on pocket data"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Take 5 articles saved in pocket and compute their scores for the words python, data and science."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 2 - Building a search engine for the pocket data"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The idea of this exercise is to build a search engine for ['python','data','science']. \n", "\n", "The goal: find the 5 most relevant articles for these terms. \n", "\n", " 1) The first step is to compute the tf-idf score of every article in the database.\n", "\n", " 2) The second step is to sort these scores from highest to lowest. 
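"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The two steps above can be sketched as follows. This is only a minimal sketch, not the reference solution: it assumes the cleaned `words_string` column built earlier, and the helpers `tf_tokens`, `idf_tokens` and `search` are hypothetical names introduced for this illustration. They follow the same formulas as the hand-written `tf` and `idf` above, but work on pre-split tokens and return 0 for empty documents."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from math import log\n", "\n", "def tf_tokens(term, tokens):\n", "    # relative frequency of the term; an empty document scores 0\n", "    return tokens.count(term) / len(tokens) if tokens else 0.0\n", "\n", "def idf_tokens(term, tokenized_docs):\n", "    # same formula as idf() above: 1 + ln(N / n), or 1 if no document has the term\n", "    n = sum(1 for tokens in tokenized_docs if term in tokens)\n", "    return 1.0 + log(len(tokenized_docs) / n) if n else 1.0\n", "\n", "def search(query_terms, documents, top=5):\n", "    # step 1: total tf-idf score of the query for every document\n", "    tokenized = [doc.lower().split() for doc in documents]\n", "    weights = {t.lower(): idf_tokens(t.lower(), tokenized) for t in query_terms}\n", "    scores = [sum(tf_tokens(t, tokens) * w for t, w in weights.items())\n", "              for tokens in tokenized]\n", "    # step 2: sort document positions by decreasing score and keep the best ones\n", "    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)\n", "    return [(i, scores[i]) for i in order[:top]]\n", "\n", "top5 = search(['python', 'data', 'science'], df_pocket['words_string'].tolist())\n", "df_pocket.iloc[[i for i, _ in top5]][['title', 'url']]"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Computing each idf once per query term avoids rescanning the whole corpus for every document, which is what calling the earlier `tf_idf` inside a loop over the articles would do. 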
"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercice 3 - Cat\u00e9gorisation automatique avec des m\u00e9thodes non supervis\u00e9es"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Avec ce que vous venez d'apprendre (le tf-idf), il est possible de cr\u00e9er un mod\u00e8le de cat\u00e9gorisation automatique \"non-supervis\u00e9\". Ce terme barbare signifie que l'on peut cr\u00e9\u00e9r des tags \u00e0 partir des seules variables explicatives, sans utiliser de \"label\", c'est-\u00e0-dire de donn\u00e9es qui valident si la pr\u00e9diction (ici, pr\u00e9sence de tel ou tel mot dans les tags) est correcte. Normalement, on utilise ce genre de m\u00e9thode quand on a pas de labels et que l'on cherche \u00e0 faire ressortir des r\u00e9gularit\u00e9s (des patterns) dans les donn\u00e9es. D'autres m\u00e9thodes de machine learning non-supervis\u00e9es connues sont : le clustering, les ACP."]}, {"cell_type": "markdown", "metadata": {}, "source": ["Pour bien comprendre le tf-idf, on vous l'a fait coder \"\u00e0 la main\". En r\u00e9alit\u00e9, c'est tellement classique, qu'il existe des librairies qui l'ont d\u00e9j\u00e0 cod\u00e9. Voir [scikitlearn.feature_extraction.text](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["L'id\u00e9e est la suivante : on va retenir comme \"tags\", les 3 mots les plus \"caract\u00e9ristiques\" d'un document. C'est-\u00e0-dire, les mots correspondants aux 3 scores tf-idf les plus \u00e9lev\u00e9s."]}, {"cell_type": "markdown", "metadata": {}, "source": ["Les \u00e9tapes \u00e0 suivre : \n", "- transformer les mots en vecteurs. L'id\u00e9e est de cr\u00e9er une matrice, avec en ligne les documents, en colonne les mots possibles (prendre tous le smots uniques pr\u00e9sents dans l'ensemble des documents). 
This takes about 3 lines of code, see the documentation of [sklearn.feature_extraction.text](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)\n", "- compute the normalized tf-idf scores\n", "- retrieve the indices of the words with the highest scores: see the [argsort](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.argsort.html) method\n", "- retrieve the mapping between words and indices\n", "- retrieve the 3 most characteristic words, and compare them with the tags of the df_pocket table"]}, {"cell_type": "markdown", "metadata": {}, "source": ["# Contextual approach"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Bag-of-words approaches, although simplistic, make it possible to build, index and compare documents. Taking into account sequences of 2, 3 or more words is one way to refine such models. It also helps to better capture the meaning of homonyms and of sentences (more generally, semantics).\n", "\n", "nltk provides methods that take context into account: to do so, we compute n-grams, that is, all the sequences of consecutive words taken two at a time (bigrams), three at a time (trigrams), etc. \n", "\n", "In general, bigrams are enough, trigrams at most: \n", "- classification, sentiment analysis, document comparison models, etc. 
that compare n-grams with too large an n quickly run into sparse data, which reduces the predictive power of the models; \n", "- performance drops very quickly as n grows, and data storage costs rise quickly (roughly n times the size of the initial database)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["reference: [introduction to nltk](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/notebooks/td2a_TD5_Traitement_automatique_des_langues_en_Python.html#introduction-a-nltk) \n", "\n", "\n", "In this part, we look at the number of occurrences and co-occurrences of terms in the articles of the pocket database. To do so, we use the methods available in the nltk package."]}, {"cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": ["import re\n", "import nltk\n", "# Naive tokenization on the whitespace between words => we get a list of words\n", "tokens = re.split(r'\\s+', ' '.join(df_pocket['words_string']))"]}, {"cell_type": "code", "execution_count": 27, "metadata": {"scrolled": true}, "outputs": [], "source": ["# Turn this list into an nltk \"Text\" object (a string-like object that keeps the notion of tokens\n", "# and provides a number of useful methods to explore the data).\n", "text = nltk.Text(tokens)"]}, {"cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [{"data": {"text/plain": ["[('grafikart', 1),\n", " ('tutoriels', 1),\n", " ('cordova', 1),\n", " ('ionic', 4),\n", " ('framework', 8),\n", " ('tutoriel', 3),\n", " ('vido', 7),\n", " ('apache', 6),\n", " ('cordovaionic', 2),\n", " ('va', 5),\n", " ('permettre', 1),\n", " ('crer', 3),\n", " ('applications', 17),\n", " ('mobiles', 1),\n", " ('utilisant', 1),\n", " ('technologies', 8),\n", 
('web', 39),\n", " ('base', 2),\n", " ('cela', 3),\n", " ('dautres', 3),\n", " ('frameworks', 3),\n", " ('fait', 6),\n", " ('leurs', 1),\n", " ('preuves', 1),\n", " ('avant', 1),\n", " ('pouvoir', 1),\n", " ('commencer', 2),\n", " ('faut', 1),\n", " ('videmment', 1),\n", " ('installer', 1),\n", " ('loutil', 1),\n", " ('petit', 3),\n", " ('tour', 3),\n", " ('propritaire', 1),\n", " ('la', 16),\n", " ('compilation', 1),\n", " ('colorhunt', 2),\n", " ('color', 10),\n", " ('hunt', 5),\n", " ('home', 12),\n", " ('create', 30),\n", " ('likes', 2),\n", " ('about', 8),\n", " ('add', 13),\n", " ('to', 14),\n", " ('chrome', 9),\n", " ('facebook', 17),\n", " ('thanks', 7),\n", " ('your', 26),\n", " ('scheme', 4)]"]}, "execution_count": 29, "metadata": {}, "output_type": "execute_result"}], "source": ["## the vocab method returns, for each term of the nltk text, its number of occurrences\n", "## here we build the fdist dictionary\n", "\n", "fdist = text.vocab()\n", "\n", "list(fdist.items())[:50]"]}, {"cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Exemples d'occurences du terme 'python' :\n", "Displaying 25 of 52 matches:\n", "wed hunters jakevdp core dataframes python outofcore dataframes python dask op\n", "aframes python outofcore dataframes python dask openstreetmap in recent months\n", "ckages announced working data scale python for excellent entertaining summary \n", "mary id suggest watching rob storys python data bikeshed talk pydata seattle c\n", "alab ab testing hierarchical models python ab testing hierarchical models pyth\n", "thon ab testing hierarchical models python in post i discuss method ab testing\n", "te nltk tokenization nltk librairie python trs utile manipuler texte market ma\n", "s qbox building elasticsearch index python build elasticsearch index pythonmac\n", "volving world media mikesboyle post python nltk wtf chapter notes 
things pytho\n", "ython nltk wtf chapter notes things python nltk wtf chapter notes things dont \n", "orpus linguistics fan learning code python excellent online tutorial called na\n", " called natural language processing python book buy case feel bit freeloader p\n", "n book buy case feel bit freeloader python nltk wtf chapter notes things dont \n", " a fast ondisk format data frames r python powered apache arrow this past janu\n", "discussed systems challenges facing python r open source communities feather a\n", " a fast ondisk format data frames r python powered apache arrow about rstudio \n", "ack last updated september tryolabs python elasticsearch steps python elastics\n", "tryolabs python elasticsearch steps python elasticsearch first steps lately tr\n", "ommunity tutorials install anaconda python distribution ubuntu how to install \n", "tion ubuntu how to install anaconda python distribution ubuntu anaconda openso\n", "er environment manager distribution python r programming languages report bug \n", " for many cases writing pandas pure python numpy sufficient in computationally\n", "ns evaluated numexpr must evaluated python space transparently user this done \n", "ing experience writing applications python flask microframework categories wha\n", "enchwebfr frenchweb jobs dveloppeur python startup hf les dernires offres demp\n", "\n", "\n"]}], "source": ["# Another method, \"concordance\": shows the occurrences of a word in its context\n", "print(\"Exemples d'occurences du terme 'python' :\")\n", "text.concordance(\"python\")\n", "print('\\n')"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 4 - Occurrences in Pocket"]}, {"cell_type": "markdown", "metadata": {}, "source": ["1) Compute the number of words and of unique words\n", "\n", "Hint: use the fdist object, which gives the frequency distribution of the terms."]}, {"cell_type": "markdown", "metadata": {}, "source": ["2) Find the following terms 
and their context: \n", "- github\n", "- data"]}, {"cell_type": "markdown", "metadata": {}, "source": ["3) Find the 100 most frequent words "]}, {"cell_type": "markdown", "metadata": {}, "source": ["4) Find the 100 most frequent words (stopwords excluded)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["5) Find the most frequent co-occurrences using the collocations() method"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 5 - unsupervised automatic categorization with context: LDA"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Draw on [LDA](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/) and [an example of LDA](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/notebooks/td2a_some_nlp.html#lda) to propose an automatic categorization of the documents. Compare it with the initial tags."]}, {"cell_type": "markdown", "metadata": {}, "source": ["### Exercise 6 - supervised automatic categorization with or without context: binary classification"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Read the notebook on [machine learning landmarks](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/specials/machine_learning.html#hyperparametres). Draw on this [notebook](http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/ml_basic/plot_binary_classification.html#sphx-glr-ml-basic-plot-binary-classification-py) and predict the tags with a logit model."]}, {"cell_type": "markdown", "metadata": {}, "source": ["Decide whether or not to take context into account (bigram or trigram features) based on the results of a [GridSearch](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). 
"]}], "metadata": {"anaconda-cloud": {}, "kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5"}}, "nbformat": 4, "nbformat_minor": 2}