Collecting, Analyzing, and Visualizing Data with Python - Part II

The Art of Analyzing Big Data - The Data Scientist’s Toolbox - Lecture 3

By Dr. Michael Fire


0. Installing TuriCreate SFrame

In this lecture, we are going to work with TuriCreate's SFrame. Let's install it:

If you are running the notebook on your own laptop, we recommend installing TuriCreate using Anaconda. Use the following commands:

$ conda create -n venv anaconda
$ source activate venv
$ pip install -U turicreate

Additional installation instructions can be found on the TuriCreate homepage.

1. Introduction to SFrame using Seattle Library Collection Inventory Dataset

Let's analyze the Seattle Library Collection Inventory dataset (11 GB) using an SFrame. First, let's download the dataset:

In [1]:
# Installing the Kaggle package
!pip install kaggle 

# Important note: fill this in with your own key - after running this cell for the first time, remember to **remove** your API key
import json
import os

api_token = {"username":"<Insert Your Kaggle User Name>","key":"<Insert Your Kaggle API key>"}

# creating the kaggle.json file with the personal API-key details
# (note: Python's open() does not expand '~', so we use os.path.expanduser)
# You can also put this file on your Google Drive
os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)
with open(os.path.expanduser('~/.kaggle/kaggle.json'), 'w') as file:
  json.dump(api_token, file)
!chmod 600 ~/.kaggle/kaggle.json
Requirement already satisfied: kaggle in /anaconda3/envs/massivedata/lib/python3.6/site-packages (1.5.6)
In [2]:
# Creating a dataset directory

!mkdir -p ./datasets/library-collection

# download the dataset from Kaggle and unzip it
!kaggle datasets download city-of-seattle/seattle-library-collection-inventory  -f library-collection-inventory.csv -p ./datasets/library-collection/
!unzip ./datasets/library-collection/*.zip  -d ./datasets/library-collection
!ls ./datasets/library-collection
library-collection-inventory.csv     library-collection-inventory.csv.zip
In [3]:
import turicreate as tc
%matplotlib inline

# Loading the CSV into an SFrame (this can take some time)
sf = tc.SFrame.read_csv("./datasets/library-collection/library-collection-inventory.csv")
sf
1 lines failed to parse correctly
Finished parsing file /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/library-collection/library-collection-inventory.csv
Parsing completed. Parsed 100 lines in 0.607321 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str,str,str,str,str,str,str,str,str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Read 35007247 lines. Lines per second: 143675
14 lines failed to parse correctly
Parsing completed. Parsed 35531294 lines in 246.675 secs.
Out[3]:
BibNum Title Author ISBN PublicationYear
3011076 A tale of two friends /
adapted by Ellie O'Ry ...
O'Ryan, Ellie 1481425730, 1481425749,
9781481425735, ...
2014.
2248846 Naruto. Vol. 1, Uzumaki
Naruto / story and ar ...
Kishimoto, Masashi, 1974- 1569319006 2003, c1999.
3209270 Peace, love & Wi-Fi : a
ZITS treasury / by Jerry ...
Scott, Jerry, 1955- 144945867X, 9781449458676 2014.
1907265 The Paris pilgrims : a
novel / Clancy Carlile. ...
Carlile, Clancy, 1930- 0786706155 c1999.
1644616 Erotic by nature : a
celebration of life, of ...
094020813X 1991, c1988.
1736505 Children of Cambodia's
killing fields : memoirs ...
0300068395, 0300078730 c1997.
1749492 Anti-Zionism : analytical
reflections / editors: ...
091559773X c1989.
3270562 Hard-hearted Highlander /
Julia London. ...
London, Julia 0373789998, 037380394X,
9780373789993, ...
[2017]
3264577 The Sandcastle Empire /
Kayla Olson. ...
Olson, Kayla 0062484877, 9780062484871 2017.
3236819 Doctor Who. The return of
Doctor Mysterio / BBC ; ...
[2017]
Publisher Subjects ItemType ItemCollection FloatingItem ItemLocation
Simon Spotlight, Musicians Fiction,
Bullfighters Fiction, ...
jcbk ncrdr Floating qna
Viz, Ninja Japan Comic books
strips etc, Comic books ...
acbk nycomic None lcy
Andrews McMeel
Publishing, ...
Duncan Jeremy Fictitious
character Comic books ...
acbk nycomic None bea
Carroll & Graf, Hemingway Ernest 1899
1961 Fiction, ...
acbk cafic None cen
Red Alder Books/Down
There Press, ...
Erotic literature
American, American ...
acbk canf None cen
Yale University Press, Political atrocities
Cambodia, Children ...
acbk canf None cen
Amana Books, Berger Elmer 1908 1996,
Zionism Controversial ...
acbk canf None cen
HQN, Man woman relationships
Fiction, Betrothal ...
acbk nanew None lcy
HarperTeen, Survival Juvenile
fiction, Islands Juve ...
acbk nynew None nga
BBC Worldwide, Doctor Fictitious
character Drama, Time ...
acdvd nadvd Floating wts
ReportDate ItemCount
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 1
2017-09-01T00:00:00.000 2
[35531294 rows x 13 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
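The parser log above suggests passing a corrected column_type_hints list to read_csv when type inference fails. A minimal sketch of how one might override the inferred types (the read_csv call is commented out because it needs the full 11 GB file):

```python
# Types inferred by SFrame from the first 100 lines (copied from the log above)
inferred_hints = [int, str, str, str, str, str, str, str, str, str, str, str, int]

# A safe fallback: read every column as str and cast after loading,
# so a stray non-numeric value cannot make the parse fail
safe_hints = [str] * len(inferred_hints)

# sf = tc.SFrame.read_csv(
#     "./datasets/library-collection/library-collection-inventory.csv",
#     column_type_hints=safe_hints)

print(len(safe_hints))  # 13 columns, matching the dataset
```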

We loaded 35.5 million rows with 13 columns into an SFrame object. We can get a first impression of the dataset by using the show function:

In [4]:
sf.show()
Materializing SFrame

Let's create a new column with the publication year of each book as an integer:

In [5]:
sf['PublicationYear'] # SArray object
Out[5]:
dtype: str
Rows: 35531294
['2014.', '2003, c1999.', '2014.', 'c1999.', '1991, c1988.', 'c1997.', 'c1989.', '[2017]', '2017.', '[2017]', '2014.', '[2015]', '[2006?]', '2017.', '2017.', 'c2015.', '2016.', '[2015]', '2016.', 'c2008.', '2016.', '2000.', '1960.', 'c2000.', 'c2014.', '[2014]', '©2014', 'c2005.', '2008.', '2004.', '[2015]', '2012.', '[1983]', 'c1987.', '2014.', '2011.', '2005.', 'c2012.', '[1973]', '[2016]', '[1958]', '2012.', '[2016]', 'c2009.', '2016.', '2008.', '1982.', '1974.', 'c2012.', '2001.', '2016.', 'p2009.', '[2017]', '1981.', '2013.', '2011.', '[2014]', '2014.', 'c2002.', '2016.', 'c2011.', '2017.', '2015.', 'c2000.', '', '2013.', '1988.', '[2017]', '', '2013.', '2016.', '[2016]', 'c2007.', '[1971]', 'c1945.', '[2016]', '[2010]', 'c2012.', 'c1994.', '1974.', '2001, c2000.', '1905.', '1995.', 'p2002.', '2011.', 'c2007.', '2011.', 'c2011.', 'c2002.', 'c2010.', '2012.', 'p1990.', 'c2003.', 'c2011.', '1998.', 'c2013.', '2009.', '', 'c2013.', '[2015]', ... ]
In [6]:
import re
r = re.compile(r'\d{4}')
def get_year(y_str):
    l = r.findall(y_str)
    if len(l) == 0:
        return None
    return int(l[0]) # take the first year that appears in the string

sf['year'] = sf['PublicationYear'].apply(get_year)
sf['year']
Out[6]:
dtype: int
Rows: 35531294
[2014, 2003, 2014, 1999, 1991, 1997, 1989, 2017, 2017, 2017, 2014, 2015, 2006, 2017, 2017, 2015, 2016, 2015, 2016, 2008, 2016, 2000, 1960, 2000, 2014, 2014, 2014, 2005, 2008, 2004, 2015, 2012, 1983, 1987, 2014, 2011, 2005, 2012, 1973, 2016, 1958, 2012, 2016, 2009, 2016, 2008, 1982, 1974, 2012, 2001, 2016, 2009, 2017, 1981, 2013, 2011, 2014, 2014, 2002, 2016, 2011, 2017, 2015, 2000, None, 2013, 1988, 2017, None, 2013, 2016, 2016, 2007, 1971, 1945, 2016, 2010, 2012, 1994, 1974, 2001, 1905, 1995, 2002, 2011, 2007, 2011, 2011, 2002, 2010, 2012, 1990, 2003, 2011, 1998, 2013, 2009, None, 2013, 2015, ... ]
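Given the messy formats we saw in the PublicationYear column ('c1999.', '[2017]', '©2014', empty strings), it is worth sanity-checking get_year on a few raw values:

```python
import re

r = re.compile(r'\d{4}')

def get_year(y_str):
    l = r.findall(y_str)
    if len(l) == 0:
        return None
    return int(l[0])  # the first 4-digit run in the string

assert get_year('2003, c1999.') == 2003   # the first year wins
assert get_year('[2017]') == 2017
assert get_year('©2014') == 2014
assert get_year('[2006?]') == 2006
assert get_year('') is None               # no digits -> missing value
```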
In [7]:
?sf.materialize
sf.materialize()

Let's find which year has the most published books:

In [8]:
sf2 = sf[['BibNum', 'year']].unique() # remove duplicate (BibNum, year) pairs
sf2
Out[8]:
BibNum year
328223 1936
2238986 2004
598018 1901
2397795 2007
1846241 1997
2720373 2011
3460306 2019
3442648 2019
259334 1939
350368 1984
[792403 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [9]:
import turicreate.aggregate as agg
g = sf2.groupby('year', {'Count': agg.COUNT()})
print("Min year: %s" % g['year'].min())
print("Max year: %s"% g['year'].max())
g.sort("Count", ascending=False)
Min year: 1174
Max year: 9836
Out[9]:
year Count
2015 28681
2013 28539
2016 28513
2014 27945
2017 27655
2012 27411
2010 27244
None 26081
2011 25843
2018 25708
[341 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [10]:
g.sort("year", ascending=True)
Out[10]:
year Count
None 26081
1174 1
1199 1
1277 1
1342 1
1406 1
1416 1
1431 1
1460 1
1493 1
[341 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

We can see publication years ranging from 1174 all the way to the far-future 9836. The extreme values are mostly data errors, although, as we will see, the year 1342 is probably correct. Let's search for these books, but before that, let's do some plotting:

In [11]:
import matplotlib.pyplot as plt
g = g[g['year'] < 2020] # remove "future" published books
plt.bar(g['year'], list(g['Count']))
plt.xlabel("Year")
plt.ylabel("Count")
Out[11]:
Text(0, 0.5, 'Count')

Let's zoom in on books published since 1900:

In [12]:
g2 =  g[g['year']>= 1900]
plt.bar(g2['year'], g2['Count'])
plt.xlabel("Year")
plt.ylabel("Count")
Out[12]:
Text(0, 0.5, 'Count')

Let's look for the oldest book(s) in the library (this can take some time):

In [13]:
sf[sf['year'] < 1350][['Title', 'Author', 'year']].unique()
Out[13]:
Author Title year
[47 leaves from early
printed books, ...
1277
Boccaccio, Giovanni,
1313-1375, ...
Amorosa visione / di
Giovanni Boccaccio ; ...
1342
Linking transportation
and land use planning : ...
1199
Orton, Vrest, 1897-1986 Observations on the
forgotten art of buil ...
1174
[4 rows x 3 columns]

Let's find details about this manuscript on Wikipedia:

In [14]:
!pip install wikipedia
Requirement already satisfied: wikipedia in /anaconda3/envs/massivedata/lib/python3.6/site-packages (1.4.0)
In [15]:
import wikipedia
w = wikipedia.page('Amorosa visione')
w.summary
Out[15]:
'Amorosa visione (1342, revised c. 1365) is a narrative poem by Boccaccio, full of echoes of the Divine Comedy and consisting of 50 canti in terza rima. It tells of a dream in which the poet sees, in sequence, the triumphs of Wisdom, Earthly Glory, Wealth, Love, all-destroying Fortune (and her servant Death), and thereby becomes worthy of the now heavenly love of Fiammetta. The triumphs include mythological, classical and contemporary medieval figures. Their moral, cultural and historical architecture was without precedent, and led Petrarch to create his own Trionfi on the same model. Among contemporaries Giotto and Dante stand out, the latter being celebrated above any other artist, ancient or modern.'

Let's find the most popular subjects in a specific year:

In [16]:
sf2 = sf[['BibNum', 'year', 'Subjects']] # to make things run faster, we create a smaller SFrame
sf2['subject_list'] = sf2['Subjects'].apply(lambda s: s.split(","))
sf2['subject_list'] = sf2['subject_list'].apply(lambda l: [subject.strip() for subject in l])
sf2 = sf2.remove_column('Subjects')
# remove duplicate (book, subject list, year) rows
sf2 = sf2.unique()
sf2
Out[16]:
BibNum subject_list year
3102550 [Vitality, Fatigue
Prevention, Health] ...
2015
3428724 [Butterflies Life cycles
Juvenile literature, ...
2019
2792378 [Vietnam War 1961 1975
Juvenile fiction, ...
2012
2255555 [Success, Success
Psychological aspects] ...
2004
3222016 [British Germany Fiction,
Friendship Germany Bad ...
2012
3488479 [Stories in rhyme, Snow
Juvenile fiction, ...
2019
2808486 [Investment bankers
Fiction, Financial cr ...
2012
3469961 [Fatherhood Popular
works, Pregnancy Popular ...
2019
3021201 [Automobile industry and
trade United States ...
1997
3118745 [Death Valley National
Park Calif and Nev ...
2015
[883543 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [17]:
sf2 = sf2.stack("subject_list", new_column_name="subject") 
sf2['subject']
Out[17]:
dtype: str
Rows: 3019994
['Vitality', 'Fatigue Prevention', 'Health', 'Butterflies Life cycles Juvenile literature', 'Caterpillars Juvenile literature', 'Vietnam War 1961 1975 Juvenile fiction', 'Soldiers Juvenile fiction', 'War stories', 'Shooters of firearms Juvenile fiction', 'Best friends Juvenile fiction', 'Friendship Juvenile fiction', 'War Fiction', 'Sharpshooters Fiction', 'Success', 'Success Psychological aspects', 'British Germany Fiction', 'Friendship Germany Bad Nauheim Fiction', 'Married people Germany Bad Nauheim Fiction', 'Adultery Germany Bad Nauheim Fiction', 'Middle class Germany Bad Nauheim Fiction', 'Bad Nauheim Germany Fiction', 'Domestic fiction', 'Stories in rhyme', 'Snow Juvenile fiction', 'Community life Juvenile fiction', 'Wishes Juvenile fiction', 'Stories in rhyme', 'Picture books', 'Investment bankers Fiction', 'Financial crises Fiction', 'Family secrets Fiction', 'Upper class New York State New York Fiction', 'Large type books', 'New York N Y Fiction', 'Suspense fiction', 'Fatherhood Popular works', 'Pregnancy Popular works', 'Childbirth Popular works', 'Automobile industry and trade United States Statistics Periodicals', 'Automobiles Marketing Statistics Periodicals', 'Death Valley National Park Calif and Nev Guidebooks', 'Death Valley Calif and Nev', 'California Southern Guidebooks', 'Oceanography', 'Submarine geology', 'Murder Fiction', 'Forensic scientists Fiction', 'Superstition Fiction', 'Tennessee Fiction', 'Romantic suspense fiction', 'Japanese fiction 21st century', 'Ferris wheels Fiction', 'Vertigo Fiction', 'Teenage boys Fiction', 'Menopause Popular works', 'Christian art and symbolism', 'Christian antiquities', '', 'Moaveni Azadeh 1976', 'Iranian American women Biography', 'Iranian Americans Biography', 'Women Iran Biography', 'Journalists Iran Biography', 'Women journalists Iran Biography', 'Iran Social conditions 1997', 'Queen Victoria Ship Fiction', 'Pendergast Aloysius Fictitious character Fiction', 'Government investigators Fiction', 
'Americans Himalaya Mountains Fiction', 'Archaeological thefts Fiction', 'Monks Fiction', 'Ocean liners Fiction', 'Thrillers Fiction', 'Planets Environmental engineering Juvenile fiction', 'Space flight to Mars Juvenile fiction', 'Cyborgs Juvenile fiction', 'Space colonies Juvenile fiction', 'Family life Fiction', 'Science fiction Juvenile fiction', 'Russia Federation Fiction', 'Families Russia Federation Fiction', 'Food History', 'Food habits History', 'Food preferences History', 'Agriculture History', 'Food Social aspects', 'Food Symbolic aspects', 'Food Economic aspects', 'Large type books', 'Washington State Puget Sound Water Quality Authority Bibliography Catalogs', 'Water quality Washington State Puget Sound Bibliography Catalogs', 'Water quality management Washington State Puget Sound Bibliography Catalogs', 'Puget Sound Wash', 'Escort services Fiction', 'Single women Fiction', 'Rich people Fiction', 'Man woman relationships Fiction', 'Erotic fiction', 'Short stories', 'Illumination of books and manuscripts German', ... ]

Using stack to split each subject list into separate rows, we got over 3 million subject entries. Let's check which subject is the most common:
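The stack operation behaves like flattening each row's list column into one output row per list element. Its semantics can be illustrated in plain Python (toy rows, not the SFrame API):

```python
# toy rows mimicking (BibNum, year, subject_list)
rows = [
    {"BibNum": 1, "year": 2015, "subject_list": ["Vitality", "Health"]},
    {"BibNum": 2, "year": 2019, "subject_list": ["Picture books"]},
]

# stack("subject_list", new_column_name="subject"): one row per subject,
# with the other columns repeated for each element
stacked = [
    {"BibNum": r["BibNum"], "year": r["year"], "subject": s}
    for r in rows
    for s in r["subject_list"]
]

print(len(stacked))  # 3 rows produced from 2 input rows
```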

In [18]:
g = sf2.groupby('subject',{'Count': agg.COUNT()})
g.sort('Count', ascending=False ).print_rows(100)
+-------------------------------+-------+
|            subject            | Count |
+-------------------------------+-------+
|                               | 35433 |
|        Large type books       | 24947 |
| Video recordings for the h... | 21249 |
|         Graphic novels        | 19349 |
|        Mystery fiction        | 17158 |
|       Historical fiction      | 15342 |
|         Feature films         | 15296 |
|         Fiction films         | 11864 |
| Detective and mystery fiction | 11244 |
|          Love stories         | 11150 |
|        Fantasy fiction        |  9592 |
| Man woman relationships Fi... |  8643 |
|           Audiobooks          |  8642 |
|  Fiction television programs  |  8548 |
|        Science fiction        |  7966 |
|  Murder Investigation Fiction |  7586 |
|       Television series       |  7570 |
|       Thrillers Fiction       |  7267 |
|        Suspense fiction       |  7179 |
|        Domestic fiction       |  6759 |
|      Young adult fiction      |  6609 |
|       Friendship Fiction      |  6397 |
|        Romance fiction        |  5786 |
|  Friendship Juvenile fiction  |  5770 |
|         Short stories         |  5585 |
|     Psychological fiction     |  5271 |
|    Popular music 2011 2020    |  5266 |
|           Cookbooks           |  5142 |
|        Schools Fiction        |  5080 |
|      Comics Graphic works     |  4938 |
|       Documentary films       |  4832 |
|      Rock music 2011 2020     |  4807 |
|        Humorous stories       |  4618 |
|        Nonfiction films       |  4541 |
|         Popular music         |  4516 |
|         Magic Fiction         |  4481 |
|        Humorous fiction       |  4035 |
|         Picture books         |  3828 |
|     Comic books strips etc    |  3807 |
|       Christian fiction       |  3774 |
| Mystery and detective stories |  3628 |
|           Rock music          |  3606 |
|    Schools Juvenile fiction   |  3572 |
|      Cartoons and comics      |  3356 |
|          Comedy films         |  3236 |
|        Childrens films        |  3156 |
|            Fantasy            |  3060 |
|     Magic Juvenile fiction    |  2925 |
|             Songs             |  2906 |
|  Brothers and sisters Fiction |  2889 |
|   Spanish language materials  |  2881 |
|   Romantic suspense fiction   |  2829 |
|        Families Fiction       |  2713 |
|       Adventure stories       |  2693 |
|         Fantasy comics        |  2692 |
|       Paranormal fiction      |  2671 |
|        Stories in rhyme       |  2649 |
|      New York N Y Fiction     |  2611 |
|          Biographies          |  2601 |
| Man woman relationships Drama |  2507 |
|      Rock music 2001 2010     |  2499 |
|    Popular music 2001 2010    |  2496 |
|          Dogs Fiction         |  2459 |
|         Bildungsromans        |  2458 |
| Adventure and adventurers ... |  2454 |
|          Fairy tales          |  2432 |
| Childrens television programs |  2319 |
|    Missing persons Fiction    |  2315 |
| Science fiction comic book... |  2314 |
| Vietnamese language materials |  2277 |
|     Cats Juvenile fiction     |  2175 |
|      Television comedies      |  2160 |
|        Animals Fiction        |  2148 |
| Stories in rhyme Juvenile ... |  2146 |
|     Family secrets Fiction    |  2048 |
|   Families Juvenile fiction   |  2040 |
|         Murder Fiction        |  2031 |
| Brothers and sisters Juven... |  2018 |
| Childrens songs Juvenile s... |  2013 |
|         Horror fiction        |  1982 |
|      Family life Fiction      |  1982 |
|  Animated television programs |  1953 |
|        Western stories        |  1952 |
|        Sisters Fiction        |  1934 |
|      Biographical fiction     |  1916 |
|   Picture books for children  |  1890 |
|   Action and adventure films  |  1872 |
| Superheroes Comic books st... |  1866 |
|        Autobiographies        |  1856 |
|       Adventure fiction       |  1842 |
|     London England Fiction    |  1839 |
|  Action and adventure fiction |  1792 |
|     Dogs Juvenile fiction     |  1787 |
| City planning Washington S... |  1781 |
|         Animated films        |  1768 |
|    Animals Juvenile fiction   |  1763 |
|          Love Fiction         |  1741 |
|       Biographical films      |  1692 |
|           Rap Music           |  1636 |
|   Chinese language materials  |  1620 |
+-------------------------------+-------+
[577844 rows x 2 columns]

Let's visualize the subjects in a word cloud using the WordCloud package:

In [19]:
!pip install wordcloud
Collecting wordcloud
Successfully installed wordcloud-1.6.0
In [21]:
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='black', 
                stopwords = stopwords, 
                min_font_size = 10)

# using the subject frequencies computed above
wordcloud.generate_from_frequencies(frequencies={r['subject']:r['Count'] for r in g})
plt.figure(figsize = (20, 20), facecolor = None) 
plt.imshow(wordcloud)
Out[21]:
<matplotlib.image.AxesImage at 0xa24b313c8>

2. Analyzing the Blog Authorship Corpus

For this part, we will analyze the Blog Authorship Corpus. The corpus consists of data from 19,320 bloggers who together wrote 681,288 posts. Each blogger's posts are saved in a separate XML file whose name encodes the blogger's metadata. For example, 9470.male.25.Communications-Media.Aries.xml contains the posts of a 25-year-old male blogger with the Aries astrological sign who writes about Communications-Media.
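The metadata encoded in each filename can be unpacked with a plain split. A quick sketch of the convention (the parse_blogger_filename helper is ours, not part of the corpus):

```python
def parse_blogger_filename(file_name):
    # filename convention: <id>.<gender>.<age>.<topic>.<sign>.xml
    blogger_id, gender, age, topic, sign, _ext = file_name.split("/")[-1].split(".")
    return {"id": blogger_id, "gender": gender, "age": int(age),
            "topic": topic, "sign": sign}

meta = parse_blogger_filename("9470.male.25.Communications-Media.Aries.xml")
print(meta["topic"])  # Communications-Media
```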

We will start by converting the XML files into a JSON file:

In [1]:
!mkdir -p ./datasets/BIU-Blog-Authorship
!wget -O ./datasets/BIU-Blog-Authorship/blogs.zip http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip
!unzip ./datasets/BIU-Blog-Authorship/*.zip -d ./datasets/BIU-Blog-Authorship/
In [4]:
# first we create a directory to put the JSON files in
import os
import json
from tqdm.notebook import tqdm
blogger_xml_dir = "./datasets/BIU-Blog-Authorship/blogs"
#os.mkdir(f"{blogger_xml_dir}/json")

# a short piece of code that parses the XML files and converts them to JSON
def get_posts_from_file(file_name):
    posts_dict = {}
    txt = open(file_name, "r",  encoding="utf8", errors='ignore').read()
    txt = txt.replace("&nbsp;", " ")
    for p in txt.split("</post>"):
        if "<post>" not in p or "<date>" not in p:
            continue
        post = p.split("<post>")[1].strip()
        dt = p.split("</date>")[0].split("<date>")[1].strip()
        posts_dict[dt] = post

    return posts_dict
            

def blogger_xml_to_json(file_name):
    l = file_name.split("/")[-1].split(".")
    if len(l) != 6:
        raise Exception(f"Could not analyze file {file_name} - length {len(l)}")
    j = {"id": l[0], "gender": l[1], "age": int(l[2]), "topic": l[3], "sign": l[4], "posts": get_posts_from_file(file_name)}
    return j

# converting all the XMLs to a single large JSON file
all_jsons = []
for p in tqdm(os.listdir(blogger_xml_dir)):
    if not p.endswith(".xml"):
        continue
    j = blogger_xml_to_json(f"{blogger_xml_dir}/" + p)
    all_jsons.append(j)
json.dump(all_jsons, open(f"{blogger_xml_dir}/all_bloggers.json","w" ))
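The post-extraction logic inside get_posts_from_file can be exercised on an in-memory string. Below is a string-based variant of the same splitting, for illustration (parse_posts is our name for it):

```python
def parse_posts(txt):
    # same splitting logic as get_posts_from_file, but on a string
    posts_dict = {}
    txt = txt.replace("&nbsp;", " ")
    for p in txt.split("</post>"):
        if "<post>" not in p or "<date>" not in p:
            continue
        post = p.split("<post>")[1].strip()
        dt = p.split("</date>")[0].split("<date>")[1].strip()
        posts_dict[dt] = post
    return posts_dict

toy = """<date>19,August,2004</date>
<post>
DESTINY... you never know.
</post>
<date>21,August,2004</date>
<post>
Second entry.
</post>"""

print(parse_posts(toy))
```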

Now let's load the JSON file into an SFrame object using the read_json function:

In [5]:
import turicreate as tc
import turicreate.aggregate as agg


sf = tc.SFrame.read_json(f"{blogger_xml_dir}/all_bloggers.json")
sf
Parsing JSON records from /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/BIU-Blog-Authorship/blogs/all_bloggers.json
Successfully parsed 19320 elements from the JSON file /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/BIU-Blog-Authorship/blogs/all_bloggers.json
Out[5]:
age gender id posts sign topic
16 male 4162441 {'19,August,2004':
"DESTINY... you ...
Sagittarius Student
25 female 3489929 {'29,May,2004': 'It\'s
been a long time coming, ...
Cancer Student
23 female 3954575 {'17,July,2004': "Thought
I'd start off with a ...
Gemini BusinessServices
16 male 3364931 {'21,May,2004': "Today
was....normal. Nothing ...
Virgo Student
24 female 3162067 {'22,April,2004': 'I feel
it in the water; the ...
Cancer Education
23 female 813360 {'19,August,2002': "Just
to start, a little about ...
Capricorn BusinessServices
17 female 4028373 {'29,July,2004': "You
ever notice that you ...
Leo indUnk
34 male 3630901 {'30,June,2004': 'naked
spheres we seek not the ...
Leo Technology
23 female 2467122 {'31,December,2003':
"Okay- so today is the ...
Taurus Student
45 female 3732850 {'30,June,2004': 'Write
about something people ...
Taurus Technology
[19320 rows x 6 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Let's draw some charts using Matplotlib and Seaborn:

In [6]:
import matplotlib.pyplot as plt
import seaborn as sns
g = sf.groupby("gender", {"Count": agg.COUNT()})
barlist = plt.bar(g['gender'], g['Count'], align='center', alpha=0.5)
plt.ylabel('Number of Bloggers')
barlist[1].set_color('r') # changing the bar color
plt.title("Bloggers' Gender Distribution")
Out[6]:
Text(0.5, 1.0, "Bloggers' Gender Distribution")
In [7]:
g = sf.groupby(["gender", "topic"], {"Count": agg.COUNT()})
g_male = g[g['gender'] == 'male'].rename({'gender': 'male', 'Count': 'Count_male'})
g_female = g[g['gender'] == 'female'].rename({'gender': 'female','Count': 'Count_female'})
g2 = g_male.join(g_female, on='topic', how="outer")
# filling in missing values
g2 = g2.fillna('Count_male', 0)
g2 = g2.fillna('Count_female', 0)
g2['total'] = g2.apply(lambda r: r['Count_male'] + r['Count_female'])
g2
Out[7]:
male topic Count_male female Count_female total
male Military 84 female 32 116
male Marketing 73 female 107 180
male Arts 302 female 419 721
male Communications-Media 270 female 209 479
male Internet 296 female 101 397
male Manufacturing 63 female 24 87
male Architecture 34 female 35 69
male Non-Profit 178 female 194 372
male Engineering 242 female 70 312
male Consulting 118 female 73 191
[40 rows x 6 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
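The outer join above keeps topics that appear for only one gender, which leaves missing counts on the other side; that is why the two fillna calls are needed. The same bookkeeping can be sketched with plain dicts (the 'Military' and 'Arts' counts come from the table above; 'Fashion' is made up for illustration):

```python
# Hypothetical per-gender topic counts; 'Military' has no female bloggers
# here and 'Fashion' has no male bloggers, mimicking the outer-join gaps.
male_counts = {"Military": 84, "Arts": 302}
female_counts = {"Arts": 419, "Fashion": 55}

# Outer join on topic: take the union of keys, fill the missing side with 0
topics = set(male_counts) | set(female_counts)
joined = {
    t: {"Count_male": male_counts.get(t, 0),
        "Count_female": female_counts.get(t, 0)}
    for t in topics
}

# Same role as the apply() above: total bloggers per topic
for row in joined.values():
    row["total"] = row["Count_male"] + row["Count_female"]
```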
In [8]:
# see also https://seaborn.pydata.org/examples/horizontal_barplot.html
df = g2.to_dataframe()
plt.figure(figsize = (20, 20), facecolor = None) 
sns.set_color_codes("pastel")
sns.barplot(x="total", y="topic", data=df,
            label="Total", color="b")

sns.set_color_codes("muted")
sns.barplot(x="Count_female", y="topic", data=df,
            label="Female", color="r")
plt.xlabel("Total Bloggers")
plt.ylabel("Topic")
Out[8]:
Text(0, 0.5, 'Topic')

3. Matplotlib - A Closer Look

In this section, we will take a closer look at matplotlib. We will use a version of the US Baby Names dataset.

Note: This section is inspired by Python Data Science Handbook, Chapter 4 - Visualization with Matplotlib, which is a highly recommended read.

To use matplotlib, we first need to import it:

In [1]:
import matplotlib.pyplot as plt
# %matplotlib inline leads to embedded static images in the notebook
%matplotlib inline 

Now let's download the dataset and load it using TuriCreate:

In [2]:
# Creating a dataset directory
!mkdir ./datasets/us-baby-name

# download the dataset from Kaggle and unzip it
!kaggle datasets download kaggle/us-baby-names -f NationalNames.csv -p ./datasets/us-baby-name/
!unzip ./datasets/us-baby-name/*.zip  -d ./datasets/us-baby-name/
mkdir: ./datasets/us-baby-name: File exists
Downloading NationalNames.csv.zip to ./datasets/us-baby-name
 96%|████████████████████████████████████▍ | 11.0M/11.5M [00:02<00:00, 6.29MB/s]
100%|██████████████████████████████████████| 11.5M/11.5M [00:02<00:00, 5.26MB/s]
Archive:  ./datasets/us-baby-name/NationalNames.csv.zip
  inflating: ./datasets/us-baby-name/NationalNames.csv  
In [3]:
import turicreate as tc
sf = tc.SFrame.read_csv("./datasets/us-baby-name/NationalNames.csv")
sf
Finished parsing file /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/us-baby-name/NationalNames.csv
Parsing completed. Parsed 100 lines in 1.2137 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,int,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/us-baby-name/NationalNames.csv
Parsing completed. Parsed 1825433 lines in 1.02251 secs.
Out[3]:
Id Name Year Gender Count
1 Mary 1880 F 7065
2 Anna 1880 F 2604
3 Emma 1880 F 2003
4 Elizabeth 1880 F 1939
5 Minnie 1880 F 1746
6 Margaret 1880 F 1578
7 Ida 1880 F 1472
8 Alice 1880 F 1414
9 Bertha 1880 F 1320
10 Sarah 1880 F 1288
[1825433 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
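As the parser log shows, read_csv infers column types from the first 100 lines only; a later row that does not match those types breaks parsing, which is what the column_type_hints argument is for. A rough stdlib sketch of that inference step (simplified to int-vs-str, with two sample rows in the dataset's shape):

```python
def infer_types(rows, n_probe=100):
    """Guess int or str per column from the first n_probe rows (simplified sketch)."""
    probe = rows[:n_probe]
    n_cols = len(probe[0])
    types = []
    for c in range(n_cols):
        try:
            # If every probed value parses as int, call the column int
            for r in probe:
                int(r[c])
            types.append(int)
        except ValueError:
            types.append(str)
    return types

# Two rows in the NationalNames.csv shape: Id, Name, Year, Gender, Count
rows = [["1", "Mary", "1880", "F", "7065"],
        ["2", "Anna", "1880", "F", "2604"]]
types = infer_types(rows)
```

A row beyond the probe window with, say, a non-numeric Count would contradict the guess, which is exactly when you would pass the corrected list as column_type_hints.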

Now let's create a small SFrame with data on the name Elizabeth, and create a figure with the name's trend over time:

In [4]:
eliza_sf = sf[sf.apply(lambda r: r['Gender'] == 'F' and r['Name'] == "Elizabeth")].sort("Year")
eliza_sf
Out[4]:
Id Name Year Gender Count
4 Elizabeth 1880 F 1939
2004 Elizabeth 1881 F 1852
3939 Elizabeth 1882 F 2187
6066 Elizabeth 1883 F 2255
8150 Elizabeth 1884 F 2549
10447 Elizabeth 1885 F 2582
12741 Elizabeth 1886 F 2680
15132 Elizabeth 1887 F 2681
17505 Elizabeth 1888 F 3224
20156 Elizabeth 1889 F 3058
[135 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
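The row-wise apply above evaluates a Python lambda for every row; the same filter-and-sort logic can be sketched in plain Python over a list of dicts (three sample rows copied from the tables above):

```python
rows = [
    {"Id": 2004, "Name": "Elizabeth", "Year": 1881, "Gender": "F", "Count": 1852},
    {"Id": 1,    "Name": "Mary",      "Year": 1880, "Gender": "F", "Count": 7065},
    {"Id": 4,    "Name": "Elizabeth", "Year": 1880, "Gender": "F", "Count": 1939},
]

# Filter, then sort by Year, mirroring sf[...].sort("Year")
eliza = sorted(
    (r for r in rows if r["Gender"] == "F" and r["Name"] == "Elizabeth"),
    key=lambda r: r["Year"],
)
```

With SFrame itself, the vectorized form sf[(sf['Gender'] == 'F') & (sf['Name'] == 'Elizabeth')] does the same filtering and typically runs faster than a per-row apply.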
In [5]:
x = list(eliza_sf["Year"])
y = list(eliza_sf["Count"])
plt.plot(x, y)
Out[5]:
[<matplotlib.lines.Line2D at 0xa25682518>]

We can change the plot style as follows:

In [6]:
plt.style.use('dark_background') 
plt.plot(x, y)
Out[6]:
[<matplotlib.lines.Line2D at 0xa28511198>]

We can use print(plt.style.available) to get all the available styles:

In [7]:
print(plt.style.available)
['seaborn-dark', 'seaborn-darkgrid', 'seaborn-ticks', 'fivethirtyeight', 'seaborn-whitegrid', 'classic', '_classic_test', 'fast', 'seaborn-talk', 'seaborn-dark-palette', 'seaborn-bright', 'seaborn-pastel', 'grayscale', 'seaborn-notebook', 'ggplot', 'seaborn-colorblind', 'seaborn-muted', 'seaborn', 'Solarize_Light2', 'seaborn-paper', 'bmh', 'tableau-colorblind10', 'seaborn-white', 'dark_background', 'seaborn-poster', 'seaborn-deep']
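Note that plt.style.use changes the style globally for the rest of the session. When a style is only needed for one figure, plt.style.context scopes it to a with-block; a small sketch (the Agg backend is only selected so the snippet runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt

# 'ggplot' applies only inside the with-block; the global style is untouched
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.plot([1880, 1881], [1939, 1852])  # Elizabeth counts from the table above
```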

If we want to plot two or more curves in subplots, matplotlib offers two interfaces: the MATLAB-style interface and the object-oriented interface. Let's draw the curves with each of them:

In [8]:
mary_sf = sf[sf.apply(lambda r: r['Gender'] == 'F' and r['Name'] == "Mary")].sort("Year")

plt.style.use('ggplot') 
#MATLAB Style Interface
plt.figure()  # create a plot figure

# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))

# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
Out[8]:
[<matplotlib.lines.Line2D at 0xa264185f8>]
In [9]:
# Object-oriented interface
# First create a grid of plots
# ax will be an array of two Axes objects
fig, ax = plt.subplots(2)

# Call plot() method on the appropriate object
ax[0].plot(list(eliza_sf["Year"]), list(eliza_sf["Count"]))
ax[1].plot(list(mary_sf["Year"]), list(mary_sf["Count"]))
Out[9]:
[<matplotlib.lines.Line2D at 0xa27e7a710>]
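The object-oriented interface also makes it easy to share an axis between panels and label each one explicitly; a sketch using the Elizabeth counts from the table above (the Mary counts here are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt

years = [1880, 1881, 1882]
eliza_counts = [1939, 1852, 2187]  # from the Elizabeth table above
mary_counts = [7065, 6900, 8100]   # hypothetical values for Mary

# sharex=True aligns both panels on the same Year axis
fig, axes = plt.subplots(2, sharex=True, figsize=(6, 4))
axes[0].plot(years, eliza_counts)
axes[0].set_title("Elizabeth")
axes[1].plot(years, mary_counts)
axes[1].set_title("Mary")
axes[1].set_xlabel("Year")
fig.tight_layout()
```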