Lecture 4: Analyzing Massive Graphs - Part I

The Art of Analyzing Big Data - The Data Scientist's Toolbox

By Dr. Michael Fire


0. Package Setup

For this lecture, we are going to use the Kaggle, TuriCreate, NetworkX, and igraph packages. Let's set them up:

In [ ]:
# Installing the Kaggle package
!pip install kaggle

# Important Note: complete this with your own key - after running this for the first time, remember to **remove** your API key
import json
import os

api_token = {"username":"<Insert Your Kaggle User Name>","key":"<Insert Your Kaggle API key>"}

# creating the kaggle.json file with the personal API-key details
# (Python's open() does not expand '~', so we expand it explicitly and
#  make sure the directory exists first)
# You can also put this file on your Google Drive
os.makedirs(os.path.expanduser('~/.kaggle'), exist_ok=True)
with open(os.path.expanduser('~/.kaggle/kaggle.json'), 'w') as file:
    json.dump(api_token, file)
!chmod 600 ~/.kaggle/kaggle.json
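
To verify that the credentials were picked up correctly, we can run a quick search against the Kaggle API (a sanity check; the exact listing you get may differ):

In [ ]:
# If authentication is set up correctly, this lists Marvel-related datasets
!kaggle datasets list -s marvel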
In [1]:
!pip install turicreate
Requirement already satisfied: turicreate in /anaconda3/envs/massivedata/lib/python3.6/site-packages (6.1)
...
In [2]:
!pip install networkx
!pip install python-igraph
Requirement already satisfied: networkx in /anaconda3/envs/massivedata/lib/python3.6/site-packages (2.3)
Requirement already satisfied: decorator>=4.3.0 in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from networkx) (4.4.0)
Requirement already satisfied: python-igraph in /anaconda3/envs/massivedata/lib/python3.6/site-packages (0.8.0)
Requirement already satisfied: texttable>=1.6.2 in /anaconda3/envs/massivedata/lib/python3.6/site-packages (from python-igraph) (1.6.2)

Example 1: Marvel Superheroes - Working with NetworkX

In this example, we will learn how to work with graphs using the Marvel Universe Social Network dataset. First, let's download the dataset and use it to construct an undirected graph:

In [3]:
# Creating a dataset directory (-p avoids errors if it already exists)
!mkdir -p ./datasets/the-marvel-universe-social-network

# download the dataset from Kaggle and unzip it (-o overwrites existing files without prompting)
!kaggle datasets download csanhueza/the-marvel-universe-social-network -p ./datasets/the-marvel-universe-social-network
!unzip -o ./datasets/the-marvel-universe-social-network/*.zip -d ./datasets/the-marvel-universe-social-network/
In [4]:
import networkx as nx
import turicreate as tc 

n_sf = tc.SFrame.read_csv("./datasets/the-marvel-universe-social-network/nodes.csv")
e_sf = tc.SFrame.read_csv("./datasets/the-marvel-universe-social-network/hero-network.csv")

n_sf
Finished parsing file /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/the-marvel-universe-social-network/nodes.csv
Parsing completed. Parsed 19090 lines in 0.011335 secs.
Finished parsing file /Users/michael/Dropbox (BGU)/massive data mining/ 2020/notebooks/datasets/the-marvel-universe-social-network/hero-network.csv
Parsing completed. Parsed 574467 lines in 0.301575 secs.
Out[4]:
+----------------------+-------+
|         node         |  type |
+----------------------+-------+
|       2001 10        | comic |
|        2001 8        | comic |
|        2001 9        | comic |
| 24-HOUR MAN/EMMANUEL |  hero |
| 3-D MAN/CHARLES CHAN |  hero |
|   4-D MAN/MERCURIO   |  hero |
|       8-BALL/        |  hero |
|        A '00         | comic |
|        A '01         | comic |
|        A 100         | comic |
+----------------------+-------+
[19090 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
In [5]:
e_sf
Out[5]:
+----------------------+----------------------+
|        hero1         |        hero2         |
+----------------------+----------------------+
|    LITTLE, ABNER     |    PRINCESS ZANDA    |
|    LITTLE, ABNER     | BLACK PANTHER/T'CHAL |
| BLACK PANTHER/T'CHAL |    PRINCESS ZANDA    |
|    LITTLE, ABNER     |    PRINCESS ZANDA    |
|    LITTLE, ABNER     | BLACK PANTHER/T'CHAL |
| BLACK PANTHER/T'CHAL |    PRINCESS ZANDA    |
| STEELE, SIMON/WOLFGA |   FORTUNE, DOMINIC   |
| STEELE, SIMON/WOLFGA | ERWIN, CLYTEMNESTRA  |
| STEELE, SIMON/WOLFGA | IRON MAN/TONY STARK  |
| STEELE, SIMON/WOLFGA | IRON MAN IV/JAMES R. |
+----------------------+----------------------+
[574467 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Now let's load the nodes (vertices) and edges (links) into a graph object. We can build the graph either by adding each node and edge one at a time, or by adding all the nodes and edges at once:

In [6]:
%%timeit
g = nx.Graph() # Creating Undirected Graph

# adding each node and edge one after the other
for n in n_sf['node']:
    g.add_node(n)
    
for r in e_sf:
    g.add_edge(r['hero1'], r['hero2'])
2.28 s ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]:
%%timeit
g = nx.Graph() # Creating Undirected Graph
# adding all nodes and edges at once
g.add_nodes_from(n_sf['node'])
g.add_edges_from([(r['hero1'],r['hero2']) for r in e_sf])
2.25 s ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]:
g = nx.Graph() # Creating Undirected Graph
g.add_nodes_from(n_sf['node'])
g.add_edges_from([(r['hero1'],r['hero2']) for r in e_sf])
print(nx.info(g))
Name: 
Type: Graph
Number of nodes: 19232
Number of edges: 167219
Average degree:  17.3897

We can see that the constructed graph has over 19,000 nodes and over 167,000 edges. Let's use the graph structure to answer several questions.

Question: Who is the most friendly superhero?

Note: If we wanted to answer this question using a DataFrame, it wouldn't be trivial: for each hero, we would need to count the number of distinct friends across both the rows where the hero appears in the hero1 column and the rows where it appears in the hero2 column. However, answering this question using a graph object is relatively easy; we simply need to find the node with the maximal degree.
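
For comparison, here is a minimal sketch of that DataFrame-style computation on the `e_sf` SFrame loaded above (an illustration only; the column names `hero` and `friend` are names we introduce here):

In [ ]:
# A sketch of the SFrame-only approach: list each edge in both directions,
# drop duplicate rows, and count the distinct neighbors of every hero
fwd = tc.SFrame({'hero': e_sf['hero1'], 'friend': e_sf['hero2']})
rev = tc.SFrame({'hero': e_sf['hero2'], 'friend': e_sf['hero1']})
pairs = fwd.append(rev).unique()
degree_sf = pairs.groupby('hero',
                          {'degree': tc.aggregate.COUNT_DISTINCT('friend')})
degree_sf.topk('degree', k=3)  # the most connected heroes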

Let's calculate the degree of each vertex:

In [9]:
d = g.degree()
list(dict(d).items())[:20]
Out[9]:
[('2001 10', 0),
 ('2001 8', 0),
 ('2001 9', 0),
 ('24-HOUR MAN/EMMANUEL', 5),
 ('3-D MAN/CHARLES CHAN', 122),
 ('4-D MAN/MERCURIO', 72),
 ('8-BALL/', 14),
 ("A '00", 0),
 ("A '01", 0),
 ('A 100', 0),
 ('A 101', 0),
 ('A 102', 0),
 ('A 103', 0),
 ('A 104', 0),
 ('A 105', 0),
 ('A 106', 0),
 ('A 107', 0),
 ('A 108', 0),
 ('A 109', 0),
 ('A 10', 0)]
In [10]:
print("There are %s superheroes connected to Black Panter"  %
      d["BLACK PANTHER/T'CHAL"])
There are 711 superheroes connected to Black Panter

Let's find the vertex with the highest degree:

In [11]:
import operator
max(dict(d).items(), key=operator.itemgetter(1))
Out[11]:
('CAPTAIN AMERICA', 1908)

So, using the degree, we discovered that the "most friendly" superhero is Captain America, who is connected to 1,908 other heroes. Let's use seaborn to plot the graph's degree distribution:

In [12]:
import seaborn as sns
%matplotlib inline
sns.set()
sns.distplot([v for v in dict(d).values()])
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0xa22043e48>
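
Side note: in newer seaborn releases (0.11 and later), distplot is deprecated. If you are running a more recent environment, histplot produces essentially the same picture:

In [ ]:
# Equivalent plot with the newer seaborn API (seaborn >= 0.11)
sns.histplot(list(dict(d).values()), kde=True)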

From the above plot, we can see that many nodes have a degree of 0 or 1; i.e., these heroes are either not connected to any other hero or are connected to only a single hero. Let's create a subgraph without these nodes:

In [13]:
# let's create a list of the nodes that have degree > 1
# (using 'deg' as the loop variable so it doesn't shadow the degree view d)
selected_nodes_list = [n for n, deg in dict(d).items() if deg > 1]
# create a subgraph with only nodes from the above list
h = g.subgraph(selected_nodes_list)
print(nx.info(h))
Name: 
Type: Graph
Number of nodes: 6373
Number of edges: 167167
Average degree:  52.4610

We are left with only 6,373 heroes out of 19,232. One of the wonderful things about using graphs as data structures is the ability to separate them into communities, i.e., disjoint subgraphs. Let's use Clauset-Newman-Moore greedy modularity maximization to separate the graph into communities.
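
As a quick refresher (this is the standard definition, not anything NetworkX-specific): the algorithm greedily maximizes the modularity

$$Q = \frac{1}{2m}\sum_{i,j}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j),$$

where $A$ is the adjacency matrix, $k_i$ is the degree of node $i$, $m$ is the number of edges, and $\delta(c_i, c_j)$ equals 1 when nodes $i$ and $j$ belong to the same community. Starting with each node in its own community, the algorithm repeatedly merges the pair of communities that yields the largest increase in $Q$. With that in mind, let's answer the following question: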

Question: What is the largest community in the graph?

In [14]:
from networkx.algorithms.community import greedy_modularity_communities
cc = greedy_modularity_communities(h) # this can take some time
len(cc)
Out[14]:
68
In [15]:
list(cc[0])[:20]
Out[15]:
['GLITTER/',
 'WATCHLORD/',
 'TEN-THIRTIFOR',
 'HUMAN TORCH ANDROID/',
 'AVRIL, YVETTE',
 'EEL/LEOPOLD STRYKE',
 'HELA [ASGARDIAN]',
 'KID COLT',
 "IRON FIST H'YLTHRI I",
 'SPLICE/CHANDRA KU',
 'ASBERY, SHAMARI',
 'SHROUD/MAXIMILLIAN Q',
 'AUNTIE FREEZE/',
 'DEIMOS',
 'FORTHWARD, KENT',
 'VOLSTAGG',
 'BYRD, SEN. HARRINGTO',
 'MAXXAM',
 'NAPIER, RAMONA DR.',
 'HAROKIN [ASGARDIAN]']

Using the community detection algorithm, we detected 68 communities of different sizes. Let's view the distribution of the community sizes:

In [16]:
import matplotlib.pyplot as plt
community_size_list = [len(c) for c in cc]
plt.hist(community_size_list)
Out[16]:
(array([64.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.]),
 array([   3. ,  237.1,  471.2,  705.3,  939.4, 1173.5, 1407.6, 1641.7,
        1875.8, 2109.9, 2344. ]),
 <a list of 10 Patch objects>)

We can see that most communities are relatively small. Let's find the communities that have more than 100 but fewer than 500 members:

In [17]:
selected_community_list = [c for c in cc if 500 > len(c) > 100]
len(selected_community_list)
Out[17]:
2

Let's draw both communities, starting with the first one:

In [18]:
plt.figure(figsize=(20,20))
c1 = h.subgraph(selected_community_list[0])
nx.draw_kamada_kawai(c1, with_labels=True)
/anaconda3/envs/massivedata/lib/python3.6/site-packages/networkx/drawing/nx_pylab.py:579: MatplotlibDeprecationWarning: 
The iterable function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use np.iterable instead.
  if not cb.iterable(width):