Lecture 7: From Unstructured Texts to Structured Data - Part II

The Art of Analyzing Big Data - The Data Scientist’s Toolbox

By Dr. Michael Fire


0. Package Setup

For this lecture, we are going to use the Kaggle, TuriCreate, Gensim, pyLDAvis, spaCy, NLTK, Plotly Express, and Afinn packages. Let's set them up:

In [0]:
!pip install turicreate
!pip install kaggle 
!pip install gensim
!pip install pyLDAvis
!pip install spaCy
!pip install afinn
!pip install nltk
!pip install plotly_express

import nltk
nltk.download('stopwords')
nltk.download('punkt')

!python -m spacy download en_core_web_lg # Important! You need to restart the runtime after this install
Collecting turicreate
...
ERROR: tensorflow-probability 0.10.0rc0 has requirement gast>=0.3.2, but you'll have gast 0.2.2 which is incompatible.
Successfully installed coremltools-3.3 gast-0.2.2 resampy-0.2.1 tensorboard-2.0.2 tensorflow-2.0.1 tensorflow-estimator-2.0.1 turicreate-6.2
Successfully installed funcy-1.14 pyLDAvis-2.1.2
Successfully installed afinn-0.1
Successfully installed plotly-express-0.4.1
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Successfully installed en-core-web-lg-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')
In [0]:
# Setting up the Kaggle package
import json
import os

# Creating the Kaggle configuration directory
!mkdir /root/.kaggle/

# Important note: complete this with your own key - after running this for the first time, remember to **remove** your API key
api_token = {"username":"<Insert Your Kaggle User Name>","key":"<Insert Your Kaggle API key>"}

# Creating the kaggle.json file with the personal API key details
# You can also put this file on your Google Drive

with open('/root/.kaggle/kaggle.json', 'w') as file:
  json.dump(api_token, file)
!chmod 600 /root/.kaggle/kaggle.json
mkdir: cannot create directory ‘/root/.kaggle/’: File exists
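
As the comment above notes, the kaggle.json file can also be kept on Google Drive instead of being embedded in the notebook. Here is a minimal sketch of that alternative, assuming the notebook runs on Google Colab and that a kaggle.json file was previously uploaded to the top level of "My Drive":

In [0]:
# Hypothetical alternative: load the Kaggle API key from Google Drive (Colab only)
from google.colab import drive
drive.mount('/content/drive')

# copy the previously uploaded key into the location the Kaggle CLI expects
!cp "/content/drive/My Drive/kaggle.json" /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json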

Example 1: The World of Fake News

In this example, we are going to use the methods we learned to build a fake news classifier. We will use the Fake News Detection dataset from Kaggle. First, let's download the dataset and load it into an SFrame object:

In [0]:
!mkdir ./datasets
!mkdir ./datasets/fake-news

# download the dataset from Kaggle and unzip it
!kaggle datasets download jruvika/fake-news-detection -p ./datasets/fake-news
!unzip ./datasets/fake-news/*.zip  -d ./datasets/fake-news/
mkdir: cannot create directory ‘./datasets’: File exists
mkdir: cannot create directory ‘./datasets/fake-news’: File exists
Downloading fake-news-detection.zip to ./datasets/fake-news
100% 4.89M/4.89M [00:00<00:00, 40.1MB/s]

Archive:  ./datasets/fake-news/fake-news-detection.zip
  inflating: ./datasets/fake-news/data.csv  
  inflating: ./datasets/fake-news/data.h5  
In [0]:
import turicreate as tc
%matplotlib inline

fake_news_dataset_path = "./datasets/fake-news/data.csv"
sf = tc.SFrame.read_csv(fake_news_dataset_path)
sf
Finished parsing file /content/datasets/fake-news/data.csv
Parsing completed. Parsed 100 lines in 0.154547 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /content/datasets/fake-news/data.csv
Parsing completed. Parsed 4009 lines in 0.159761 secs.
Out[0]:
URLs | Headline | Body | Label
http://www.bbc.com/news/world-us- ... | Four ways Bob Corker skewered Donald Trump ... | Image copyright Getty Images\nOn Sunday ... | 1
https://www.reuters.com/article/us-filmfestival- ... | Linklater's war veteran comedy speaks to modern ... | LONDON (Reuters) - “Last Flag Flying”, a comedy- ... | 1
https://www.nytimes.com/2017/10/09/us/politics ... | Trump’s Fight With Corker Jeopardizes His ... | The feud broke into public view last week ... | 1
https://www.reuters.com/article/us-mexico-oil- ... | Egypt's Cheiron wins tie-up with Pemex for Mex ... | MEXICO CITY (Reuters) - Egypt’s Cheiron Holdings ... | 1
http://www.cnn.com/videos/cnnmoney/2017/10/08/ ... | Jason Aldean opens 'SNL' with Vegas tribute ... | Country singer Jason Aldean, who was ... | 1
http://beforeitsnews.com/sports/2017/09/jetnat ... | JetNation FanDuel League; Week 4 ... | JetNation FanDuel League; Week 4\n% of readers ... | 0
https://www.nytimes.com/2017/10/10/us/politics ... | Kansas Tried a Tax Plan Similar to Trump’s. It ... | In 2012, Kansas lawmakers, led by Gov. ... | 1
https://www.reuters.com/article/us-india-cenbank- ... | India RBI chief: growth important, but not at ... | The Reserve Bank of India (RBI) Governor Urjit ... | 1
https://www.reuters.com/article/us-climatechange- ... | EPA chief to sign rule on Clean Power Plan exit on ... | Scott Pruitt, Administrator of the ... | 1
https://www.reuters.com/article/us-air-berlin- ... | Talks on sale of Air Berlin planes to easyJet ... | FILE PHOTO - An Air Berlin sign is seen a ... | 1
[4009 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [0]:
sf['full_text'] = sf.apply(lambda r: r['Headline'] + "\n\n" + r['Body'])
sf
Out[0]:
URLs | Headline | Body | Label | full_text
http://www.bbc.com/news/world-us- ... | Four ways Bob Corker skewered Donald Trump ... | Image copyright Getty Images\nOn Sunday ... | 1 | Four ways Bob Corker skewered Donald ...
https://www.reuters.com/article/us-filmfestival- ... | Linklater's war veteran comedy speaks to modern ... | LONDON (Reuters) - “Last Flag Flying”, a comedy- ... | 1 | Linklater's war veteran comedy speaks to modern ...
https://www.nytimes.com/2017/10/09/us/politics ... | Trump’s Fight With Corker Jeopardizes His ... | The feud broke into public view last week ... | 1 | Trump’s Fight With Corker Jeopardizes His ...
https://www.reuters.com/article/us-mexico-oil- ... | Egypt's Cheiron wins tie-up with Pemex for Mex ... | MEXICO CITY (Reuters) - Egypt’s Cheiron Holdings ... | 1 | Egypt's Cheiron wins tie-up with Pemex for Mex ...
http://www.cnn.com/videos/cnnmoney/2017/10/08/ ... | Jason Aldean opens 'SNL' with Vegas tribute ... | Country singer Jason Aldean, who was ... | 1 | Jason Aldean opens 'SNL' with Vegas ...
http://beforeitsnews.com/sports/2017/09/jetnat ... | JetNation FanDuel League; Week 4 ... | JetNation FanDuel League; Week 4\n% of readers ... | 0 | JetNation FanDuel League; Week 4\n\nJetNation ...
https://www.nytimes.com/2017/10/10/us/politics ... | Kansas Tried a Tax Plan Similar to Trump’s. It ... | In 2012, Kansas lawmakers, led by Gov. ... | 1 | Kansas Tried a Tax Plan Similar to Trump’s. It ...
https://www.reuters.com/article/us-india-cenbank- ... | India RBI chief: growth important, but not at ... | The Reserve Bank of India (RBI) Governor Urjit ... | 1 | India RBI chief: growth important, but not at ...
https://www.reuters.com/article/us-climatechange- ... | EPA chief to sign rule on Clean Power Plan exit on ... | Scott Pruitt, Administrator of the ... | 1 | EPA chief to sign rule on Clean Power Plan exit on ...
https://www.reuters.com/article/us-air-berlin- ... | Talks on sale of Air Berlin planes to easyJet ... | FILE PHOTO - An Air Berlin sign is seen a ... | 1 | Talks on sale of Air Berlin planes to easyJet ...
[4009 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
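
Since our end goal is a fake news classifier, note that once the full_text column exists, a baseline classifier can already be trained on it. The following cell is a minimal sketch (not executed here) that uses TuriCreate's text_classifier with the Label and full_text columns of this dataset:

In [0]:
# Sketch: a baseline fake-news classifier on the raw text
train_sf, test_sf = sf.random_split(0.8, seed=42)

clf = tc.text_classifier.create(train_sf, target='Label', features=['full_text'])

# evaluate on the held-out 20%
metrics = clf.evaluate(test_sf)
print(metrics['accuracy'])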

Let's use TuriCreate to create topic models for the news articles with Label equal to 1:

In [0]:
import turicreate as tc
from nltk.corpus import stopwords
from nltk.stem.porter import *
from functools import lru_cache
from collections import Counter
from nltk.tokenize import word_tokenize
import nltk


stop_words_set = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Using caching for faster performance
@lru_cache(maxsize=None)
def word_stemming(w):
    return stemmer.stem(w)


def skip_word(w):
    if len(w) <2:
        return True
    if w.isdigit():
        return True
    if w in stop_words_set or stemmer.stem(w) in stop_words_set:
        return True
    return False

def text_to_bow(text):
    text = text.lower()
    l = [word_stemming(w) for w in word_tokenize(text) if not skip_word(w) ]
    l = [w for w in l if not skip_word(w)]
    d = Counter(l)
    return dict(d)

f_sf = sf[sf['Label'] == 1]
bow_list = []
for t in f_sf['Headline']:
    bow_list.append(text_to_bow(t))
f_sf['bow'] = bow_list

bow_list = []
for t in f_sf['full_text']:
    bow_list.append(text_to_bow(t))
f_sf['full_bow'] = bow_list

f_sf.materialize()
docs = f_sf['bow']
docs[:2]
Out[0]:
dtype: dict
Rows: 2
[{'four': 1, 'way': 1, 'bob': 1, 'corker': 1, 'skewer': 1, 'donald': 1, 'trump': 1}, {'linklat': 1, "'s": 1, 'war': 1, 'veteran': 1, 'comedi': 1, 'speak': 1, 'modern': 1, 'america': 1, 'say': 1, 'star': 1}]
In [0]:
topic_model = tc.topic_model.create(docs, num_topics=100)
Learning a topic model
       Number of documents      1872
           Vocabulary size      4048
   Running collapsed Gibbs sampling
+-----------+---------------+----------------+-----------------+
| Iteration | Elapsed Time  | Tokens/Second  | Est. Perplexity |
+-----------+---------------+----------------+-----------------+
| 10        | 43.798ms      | 3.98114e+06    | 0               |
+-----------+---------------+----------------+-----------------+
In [0]:
topic_model.get_topics().print_rows(200)
+-------+------------+-----------------------+
| topic |    word    |         score         |
+-------+------------+-----------------------+
|   0   |   trump    |  0.021310320535400298 |
|   0   |    deal    |  0.01602676998943328  |
|   0   |    say     |  0.01602676998943328  |
|   0   |    open    |  0.010743219443466265 |
|   0   |    new     |  0.008982035928143927 |
|   1   |   korea    |  0.009554140127388746 |
|   1   |   gunman   |  0.009554140127388746 |
|   1   |    iran    |  0.007680779318096836 |
|   1   | catalonia  |  0.007680779318096836 |
|   1   |  european  |  0.005807418508804925 |
|   2   |    u.s.    |  0.05055775458798247  |
|   2   |    back    |  0.012774379273120126 |
|   2   |   could    |  0.010975170924793347 |
|   2   |  opposit   |  0.010975170924793347 |
|   2   |   fight    |  0.00737675422813979  |
|   3   |   polit    |  0.012866980790141645 |
|   3   |   anthem   |  0.00924247915911583  |
|   3   | catalonia  |  0.007430228343602922 |
|   3   |   arrest   |  0.007430228343602922 |
|   3   |  independ  |  0.005617977528090014 |
|   4   |    leav    |  0.011623475609756349 |
|   4   |    open    |  0.011623475609756349 |
|   4   |    may     |  0.00781250000000017  |
|   4   |   harvey   |  0.00590701219512208  |
|   4   |   offic    |  0.004001524390243989 |
|   5   |   indian   |  0.01134250650799579  |
|   5   |   asset    |  0.009483079211603037 |
|   5   |    take    |  0.005764224618817533 |
|   5   |   attack   |  0.005764224618817533 |
|   5   |   presid   |  0.005764224618817533 |
|   6   |    vega    |  0.011579347000759549 |
|   6   |   europ    |  0.00588458618071387  |
|   6   |    open    |  0.00588458618071387  |
|   6   |    rico    |  0.00588458618071387  |
|   6   |    die     |  0.003986332574031976 |
|   7   |    get     |  0.017015706806283077 |
|   7   | weinstein  |  0.01514584891548274  |
|   7   |    ban     |  0.009536275243081725 |
|   7   |    shun    |  0.005796559461481049 |
|   7   |    big     |  0.005796559461481049 |
|   8   |     eu     |  0.020299926847110943 |
|   8   |   stori    |  0.009326993416240163 |
|   8   |   russia   |  0.009326993416240163 |
|   8   |   brazil   |  0.007498171177761699 |
|   8   | weinstein  |  0.007498171177761699 |
|   9   |   sourc    |  0.013580719204284917 |
|   9   |   shoot    |  0.007842387146136361 |
|   9   |   polic    |  0.005929609793420176 |
|   9   |   state    |  0.005929609793420176 |
|   9   |    rape    |  0.00401683244070399  |
|   10  |     's     |  0.03473990542015358  |
|   10  |   women    |  0.011094943615860565 |
|   10  |  shanghai  |  0.00927610040014572  |
|   10  |    game    |  0.00927610040014572  |
|   10  |    hit     |  0.007457257184430872 |
|   11  |   woman    |  0.007902852737085757 |
|   11  |   nobel    |  0.007902852737085757 |
|   11  |    vega    |  0.005975327679259963 |
|   11  |    u.s.    |  0.005975327679259963 |
|   11  |    back    |  0.005975327679259963 |
|   12  |   demand   | 0.0075534266764924266 |
|   12  |   arrest   |  0.005711127487104031 |
|   12  |   hotel    |  0.005711127487104031 |
|   12  |    put     | 0.0038688282977156338 |
|   12  |    win     | 0.0038688282977156338 |
|   13  |   brazil   |  0.011667941851568732 |
|   13  |   obama    |  0.007842387146136361 |
|   13  |    citi    |  0.007842387146136361 |
|   13  |    rise    |  0.005929609793420176 |
|   13  |    way     |  0.00401683244070399  |
|   14  |   trump    |  0.023612112472963735 |
|   14  |  wildfir   |  0.010994953136265556 |
|   14  |    shot    |  0.010994953136265556 |
|   14  |    fire    |  0.010994953136265556 |
|   14  |   futur    |  0.007390050468637504 |
|   15  |     's     |  0.049651887138147006 |
|   15  |   india    |  0.011176255038475894 |
|   15  |   russia   |  0.007511909124221502 |
|   15  |   brazil   |  0.007511909124221502 |
|   15  |   chief    |  0.005679736167094307 |
|   16  |    next    |  0.012820512820513105 |
|   16  |   presid   |  0.009209100758396737 |
|   16  |    talk    |  0.007403394727338554 |
|   16  |   russia   |  0.007403394727338554 |
|   16  |    poll    |  0.007403394727338554 |
|   17  |     's     |  0.012843704775687687 |
|   17  |   china    |  0.011034732272069704 |
|   17  |    give    |  0.009225759768451719 |
|   17  |    seek    |  0.007416787264833735 |
|   17  |  spanish   |  0.007416787264833735 |
|   18  |   trump    |  0.01646164978292365  |
|   18  |    say     |  0.012843704775687685 |
|   18  |    deal    |  0.007416787264833733 |
|   18  |    shot    |  0.00560781476121575  |
|   18  |  america   |  0.00560781476121575  |
|   19  |     's     |  0.013376036171816414 |
|   19  |    call    |  0.009608138658628692 |
|   19  |   elect    |  0.00772418990203483  |
|   19  |     wo     | 0.0058402411454409695 |
|   19  |   reform   | 0.0058402411454409695 |
|   20  |     's     |  0.023188961287850262 |
|   20  |   refuge   |  0.011690302798007157 |
|   20  |    deal    |  0.007857416634726121 |
|   20  |   korean   |  0.005940973553085605 |
|   20  |   review   |  0.004024530471445087 |
|   21  |     's     |  0.020563171545017116 |
|   21  |   india    | 0.0075954057058171326 |
|   21  |   cancel   |  0.005742867728788564 |
|   21  |    day     |  0.005742867728788564 |
|   21  |   innov    |  0.005742867728788564 |
|   22  |   train    |  0.007948817371074226 |
|   22  |   korean   |  0.006010081426909781 |
|   22  |   white    |  0.006010081426909781 |
|   22  |    wind    |  0.006010081426909781 |
|   22  |    bid     | 0.0040713454827453355 |
|   23  |    say     |  0.026072485207101162 |
|   23  |    new     | 0.0075813609467457275 |
|   23  |  scandal   |  0.005732248520710185 |
|   23  |   steel    |  0.005732248520710185 |
|   23  |   shoot    |  0.005732248520710185 |
|   24  |     's     |  0.009830377794911548 |
|   24  |  hurrican  |  0.007902852737085754 |
|   24  |   turkey   |  0.007902852737085754 |
|   24  |   market   |  0.007902852737085754 |
|   24  |     la     |  0.005975327679259961 |
|   25  |     's     |  0.018710633567988553 |
|   25  |   mexico   |  0.009447943682845706 |
|   25  |    call    |  0.007595405705817136 |
|   25  |   korea    |  0.005742867728788566 |
|   25  |   state    |  0.005742867728788566 |
|   26  |   woman    |  0.011492087415222542 |
|   26  |   media    |  0.005840241145440965 |
|   26  |   first    |  0.005840241145440965 |
|   26  |   yanke    |  0.003956292388847105 |
|   26  |  children  |  0.003956292388847105 |
|   27  |     's     |  0.030388825972065606 |
|   27  |   reform   |  0.005851264628161701 |
|   27  |   expect   |  0.005851264628161701 |
|   27  |    star    |  0.005851264628161701 |
|   27  |    rape    |  0.003963759909399862 |
|   28  |     's     |  0.020411916145642233 |
|   28  |    two     |  0.014895182052225413 |
|   28  |    new     |  0.011217359323280867 |
|   28  |    need    |  0.00753953659433632  |
|   28  |  facebook  |  0.005700625229864048 |
|   29  |    vega    |  0.005884586180713865 |
|   29  |   order    |  0.005884586180713865 |
|   29  |    lose    |  0.005884586180713865 |
|   29  |   polic    | 0.0039863325740319725 |
|   29  |   sudan    | 0.0039863325740319725 |
|   30  |    amaz    |  0.007609502598366911 |
|   30  |    cup     |  0.005753526354862787 |
|   30  |    bob     |  0.005753526354862787 |
|   30  |   olymp    |  0.005753526354862787 |
|   30  |    team    |  0.005753526354862787 |
|   31  |   trump    |  0.025739320920044412 |
|   31  |    meet    |  0.014786418400876576 |
|   31  |    keep    |  0.009309967141292659 |
|   31  |   attack   |  0.009309967141292659 |
|   31  |    fund    |  0.00748448338809802  |
|   32  |    war     |  0.01664228237015401  |
|   32  |    tax     |  0.009326993416240159 |
|   32  | weinstein  |  0.009326993416240159 |
|   32  |    u.s.    | 0.0074981711777616965 |
|   32  |  compani   | 0.0074981711777616965 |
|   33  |    win     |  0.009554140127388743 |
|   33  |  nuclear   |  0.007680779318096832 |
|   33  |   gener    |  0.007680779318096832 |
|   33  |   trump    |  0.005807418508804923 |
|   33  |    nfl     |  0.005807418508804923 |
|   34  |     's     |  0.06735657225853459  |
|   34  |   elect    |  0.00925925925925947  |
|   34  |     tv     |  0.005628177196804776 |
|   34  |  protest   |  0.005628177196804776 |
|   34  |   gener    | 0.0038126361655774293 |
|   35  |   offici   | 0.0077828397873957655 |
|   35  |    make    | 0.0077828397873957655 |
|   35  |   ahead    | 0.0058845861807138725 |
|   35  |  concern   | 0.0058845861807138725 |
|   35  |    vote    | 0.0058845861807138725 |
|   36  | california |  0.013350883790899114 |
|   36  |    say     |  0.007709665287702305 |
|   36  |   exclus   |  0.005829259119970036 |
|   36  |    big     |  0.005829259119970036 |
|   36  |    bad     |  0.003948852952237766 |
|   37  | weinstein  |   0.039704365761431   |
|   37  |   harvey   |  0.02251632863527039  |
|   37  |   alleg    |  0.010484702646957968 |
|   37  |   leagu    |  0.008765898934341907 |
|   37  |    sale    |  0.007047095221725847 |
|   38  |   japan    |  0.011513778784447211 |
|   38  |   brexit   |  0.009626274065685373 |
|   38  |    sign    |  0.009626274065685373 |
|   38  |   turkey   |  0.007738769346923535 |
|   38  |  russian   |  0.007738769346923535 |
|   39  |    help    |  0.009326993416240149 |
|   39  |   puerto   |  0.009326993416240149 |
|   39  |   artist   |  0.007498171177761689 |
|   39  |   nobel    |  0.007498171177761689 |
|   39  |   court    |  0.005669348939283229 |
+-------+------------+-----------------------+
[500 rows x 3 columns]
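
We can also apply the trained model back to the documents to see which topic each headline is assigned to. The following is a minimal sketch using the model's predict method (the topic column name is ours):

In [0]:
# Sketch: assign each headline's bag-of-words to its most likely topic
f_sf['topic'] = topic_model.predict(docs)

# count how many headlines fall into each topic
f_sf.groupby('topic', {'count': tc.aggregate.COUNT()}).sort('count', ascending=False)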

Let's use BM25 to find the news items most relevant to Trump and Obama:

In [0]:
tc.text_analytics.bm25(f_sf['bow'], ['trump', 'obama']).sort('bm25', ascending=False)
Out[0]:
doc_id bm25
945 6.934268829142688
358 6.193069615628017
121 5.98166894704028
1379 5.98166894704028
1857 5.575055437176514
684 5.575055437176514
29 5.575055437176514
427 5.575055437176514
1268 4.90782142209159
401 4.630714726516615
[205 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [0]:
f_sf[945]['Headline']
Out[0]:
'Trump administration to roll back Obama clean power rule'
In [0]:
f_sf[358]['Headline']
Out[0]:
'Trump Takes a First Step Toward Scrapping Obama’s Global Warming Policy'
In [0]:
tc.text_analytics.bm25(f_sf['bow'], ['brexit']).sort('bm25', ascending=False)
Out[0]:
doc_id bm25
720 5.093822731593708
1323 5.093822731593708
1758 4.747561987668387
1204 4.747561987668387
1390 4.747561987668387
784 4.445380163974075
1640 4.445380163974075
143 4.179364077783416
313 4.179364077783416
1659 4.179364077783416
[22 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [0]:
f_sf[1323]['Headline']
Out[0]:
'In ‘The Party,’ a Portrait of a U.K. Divided by ‘Brexit’'
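
Instead of looking up each doc_id by hand, the BM25 scores can be joined back onto the headlines. A minimal sketch, using add_row_number to align the doc_id values (which are row indices of the input column) with the rows of f_sf:

In [0]:
# Sketch: join BM25 scores back to the headlines
scores = tc.text_analytics.bm25(f_sf['bow'], ['brexit'])
ranked = f_sf.add_row_number('doc_id').join(scores, on='doc_id')
ranked[['Headline', 'bm25']].sort('bm25', ascending=False).print_rows(5)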

Let's find the most common people/organizations/locations in the texts:

In [0]:
import spacy
from tqdm import tqdm

nlp = spacy.load('en_core_web_lg')
def get_entities_from_text(text):
    entities_dict = {}
    # using spaCy to extract named entities
    doc = nlp(text)
    for entity in doc.ents:
        label = entity.label_
        if label not in entities_dict:
            entities_dict[label] = set()
        entities_dict[label].add(entity.text)

    return entities_dict

l =[] 
for i in tqdm(range(len(sf['full_text']))):
    t = sf[i]['full_text']
    l.append(get_entities_from_text(t))

sf['entities_dict'] = l
f_sf = sf[sf['Label'] == 1]
f_sf
100%|██████████| 4009/4009 [04:41<00:00, 14.26it/s]
Out[0]:
URLs | Headline | Body | Label | full_text | entities_dict
http://www.bbc.com/news/world-us- ... | Four ways Bob Corker skewered Donald Trump ... | Image copyright Getty Images\nOn Sunday ... | 1 | Four ways Bob Corker skewered Donald ... | {'CARDINAL': ['four', 'only two', '52', 'Fo ...
https://www.reuters.com/article/us-filmfestival- ... | Linklater's war veteran comedy speaks to modern ... | LONDON (Reuters) - “Last Flag Flying”, a comedy- ... | 1 | Linklater's war veteran comedy speaks to modern ... | {'PERSON': ['Saddam Hussein', 'Cranston', ...
https://www.nytimes.com/2017/10/09/us/politics ... | Trump’s Fight With Corker Jeopardizes His ... | The feud broke into public view last week ... | 1 | Trump’s Fight With Corker Jeopardizes His ... | {'ORG': ["The New York Times's", 'Senate', ...
https://www.reuters.com/article/us-mexico-oil- ... | Egypt's Cheiron wins tie-up with Pemex for Mex ... | MEXICO CITY (Reuters) - Egypt’s Cheiron Holdings ... | 1 | Egypt's Cheiron wins tie-up with Pemex for Mex ... | {'GPE': ['MEXICO CITY', 'Egypt', 'Reuters'], ...
http://www.cnn.com/videos/cnnmoney/2017/10/08/ ... | Jason Aldean opens 'SNL' with Vegas tribute ... | Country singer Jason Aldean, who was ... | 1 | Jason Aldean opens 'SNL' with Vegas ... | {'PERSON': ['Jason Aldean', "Tom Petty's"], ...
https://www.nytimes.com/2017/10/10/us/politics ... | Kansas Tried a Tax Plan Similar to Trump’s. It ... | In 2012, Kansas lawmakers, led by Gov. ... | 1 | Kansas Tried a Tax Plan Similar to Trump’s. It ... | {'GPE': ['Kansas', 'Washington'], 'ORG': ...
https://www.reuters.com/article/us-india-cenbank- ... | India RBI chief: growth important, but not at ... | The Reserve Bank of India (RBI) Governor Urjit ... | 1 | India RBI chief: growth important, but not at ... | {'ORG': ['the monetary policy committee', ...
https://www.reuters.com/article/us-climatechange- ... | EPA chief to sign rule on Clean Power Plan exit on ... | Scott Pruitt, Administrator of the ... | 1 | EPA chief to sign rule on Clean Power Plan exit on ... | {'ORG': ['EPA', 'REUTERS/', 'Obama', ...
https://www.reuters.com/article/us-air-berlin- ... | Talks on sale of Air Berlin planes to easyJet ... | FILE PHOTO - An Air Berlin sign is seen a ... | 1 | Talks on sale of Air Berlin planes to easyJet ... | {'ORG': ['Air Berlin', 'easyJet', 'Etihad', ...
https://www.reuters.com/article/us-deloitte- ... | Deloitte cyber attack affected up to 350 ... | FILE PHOTO: The Deloitte Company logo is seen ... | 1 | Deloitte cyber attack affected up to 350 ... | {'ORG': ['Deloitte', 'The Deloitte Company', ...
[? rows x 6 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
In [0]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from collections import Counter
%matplotlib inline

def draw_word_cloud(words_list, min_times=10):
    stopwords = set(STOPWORDS) 
    stopwords_parts = {"'s", " ' s'", " `s" }
    wordcloud = WordCloud(width = 800, height = 800, 
                    background_color ='white', 
                    stopwords = stopwords, 
                    min_font_size = 10)
    def skip_entity(e):
        if e in stopwords:
            return True
        for p in stopwords_parts:
            if p in e:
                return True
        return False
    c = Counter(words_list)
    # using the entity frequencies (keep entities that appear more than min_times)
    d = {k:v for k,v in dict(c).items() if v > min_times and not skip_entity(k)}
    wordcloud.generate_from_frequencies(d)
    plt.figure(figsize = (20, 20), facecolor = None) 
    plt.imshow(wordcloud)

find_most_common_person = []
for d in f_sf['entities_dict']:
    if 'PERSON' in d:
        find_most_common_person +=  d['PERSON'] 

draw_word_cloud(find_most_common_person, min_times=20)
In [0]:
find_most_common_location = []
for d in f_sf['entities_dict']:
    if 'LOC' in d:
        find_most_common_location +=  d['LOC'] 

draw_word_cloud(find_most_common_location, min_times=10)
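
The same helper can be reused for any other entity type that spaCy extracts; for example, the most common organizations:

In [0]:
find_most_common_org = []
for d in f_sf['entities_dict']:
    if 'ORG' in d:
        find_most_common_org += d['ORG']

draw_word_cloud(find_most_common_org, min_times=20)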