For this lecture, we are going to use the Kaggle, TuriCreate, NetworkX, and igraph packages. Let's set them up:

In [ ]:

```
# Installing the Kaggle package
!pip install kaggle
# Important Note: fill this in with your own credentials - after running this for the first time, remember to **remove** your API key
import json
import os

api_token = {"username": "<Insert Your Kaggle User Name>", "key": "<Insert Your Kaggle API key>"}
# Creating the kaggle.json file with the personal API-key details
# You can also put this file on your Google Drive
os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)
# Note: open() does not expand '~' by itself, so we use os.path.expanduser
with open(os.path.expanduser("~/.kaggle/kaggle.json"), 'w') as file:
    json.dump(api_token, file)
!chmod 600 ~/.kaggle/kaggle.json
```

In [1]:

```
!pip install turicreate
```

In [2]:

```
!pip install networkx
!pip install python-igraph
```

In this example, we will learn how to work with graphs using the Marvel Universe Social Network dataset. First, let's download the dataset, and use it to construct an undirected graph:

In [3]:

```
# Creating a dataset directory (-p creates parents and ignores existing directories)
!mkdir -p ./datasets/the-marvel-universe-social-network
# download the dataset from Kaggle and unzip it
!kaggle datasets download csanhueza/the-marvel-universe-social-network -p ./datasets/the-marvel-universe-social-network
!unzip ./datasets/the-marvel-universe-social-network/*.zip -d ./datasets/the-marvel-universe-social-network/
```

In [4]:

```
import networkx as nx
import turicreate as tc
n_sf = tc.SFrame.read_csv("./datasets/the-marvel-universe-social-network/nodes.csv")
e_sf = tc.SFrame.read_csv("./datasets/the-marvel-universe-social-network/hero-network.csv")
n_sf
```

Out[4]:

In [5]:

```
e_sf
```

Out[5]:

Now let's load the nodes (vertices) and edges (links) data into a graph object. We can create the graph by inserting each node and each edge one after the other, or by inserting the nodes and edges all at once:

In [6]:

```
%%timeit
g = nx.Graph() # Creating Undirected Graph
# adding each node and edge one after the other
for n in n_sf['node']:
    g.add_node(n)
for r in e_sf:
    g.add_edge(r['hero1'], r['hero2'])
```

In [7]:

```
%%timeit
g = nx.Graph() # Creating Undirected Graph
# adding all nodes and edges at once
g.add_nodes_from(n_sf['node'])
g.add_edges_from([(r['hero1'],r['hero2']) for r in e_sf])
```

In [8]:

```
g = nx.Graph() # Creating Undirected Graph
g.add_nodes_from(n_sf['node'])
g.add_edges_from([(r['hero1'],r['hero2']) for r in e_sf])
print(g)  # nx.info() was removed in NetworkX 3.0; printing the graph gives the same summary
```

We can see that the constructed graph has over 19,000 nodes and over 167,000 edges. Let's use the graph structure to answer several questions.

**Question:** Who is the most friendly superhero?

**Note:** Answering this question with a DataFrame would not be trivial: for each hero, we would need to count the number of distinct friends whether the hero appears in the hero1 column or the hero2 column. With a graph object, however, it is easy; we simply need to find the node with the maximal degree.
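For contrast, here is a minimal sketch of what that DataFrame approach would look like, using pandas on a toy edge table (the `hero1`/`hero2` column names match the dataset; the rest is illustrative): stack both direction-swapped copies of the edge table, deduplicate, and count distinct friends per hero.

```python
import pandas as pd

# Toy edge table with the same columns as hero-network.csv
edges = pd.DataFrame({
    "hero1": ["A", "A", "B", "C"],
    "hero2": ["B", "C", "C", "A"],
})

# Each hero can appear in either column, so stack both orientations,
# drop duplicate (hero, friend) pairs, then count distinct friends.
pairs = pd.concat([
    edges.rename(columns={"hero1": "hero", "hero2": "friend"}),
    edges.rename(columns={"hero2": "hero", "hero1": "friend"}),
])
degree = (pairs.drop_duplicates()
               .groupby("hero")["friend"]
               .nunique())
print(degree.sort_values(ascending=False))
```

Note how the duplicate edge (A, C) / (C, A) must be handled explicitly here, while a `nx.Graph` collapses it automatically.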

Let's calculate the degree of each vertex:

In [9]:

```
d = g.degree()
list(dict(d).items())[:20]
```

Out[9]:

In [10]:

```
print("There are %s superheroes connected to Black Panther" %
      d["BLACK PANTHER/T'CHAL"])
```

Let's find the vertex with the highest degree:

In [11]:

```
import operator
max(dict(d).items(), key=operator.itemgetter(1))
```

Out[11]:

So, using the *degree*, we discovered that the "most friendly" superhero is Captain America, who is connected to 1908 heroes.
Let's use seaborn to plot the graph's degree distribution:

In [12]:

```
import seaborn as sns
%matplotlib inline
sns.set()
# distplot is deprecated in recent seaborn versions; histplot is the replacement
sns.histplot(list(dict(d).values()), kde=True)
```

Out[12]:

From the above plot, we can see that many nodes have degree 0 or 1, i.e., these heroes are connected to no other
hero, or to only a single hero. Let's create a *subgraph* without these nodes:

In [13]:

```
# let's create a list with nodes that have degree > 1
selected_nodes_list = [n for n, deg in dict(d).items() if deg > 1]
# create a subgraph with only nodes from the above list
h = g.subgraph(selected_nodes_list)
print(h)  # nx.info() was removed in NetworkX 3.0; printing the graph gives the same summary
```

We are left with only 6373 of the original 19232 heroes. One of the useful properties of graphs as a data structure is that they can be separated into communities, i.e., disjoint subgraphs. Let's use Clauset-Newman-Moore greedy modularity maximization to separate the graph into communities, and answer the following question:

**Question:** What is the largest community in the graph?

In [14]:

```
from networkx.algorithms.community import greedy_modularity_communities
cc = greedy_modularity_communities(h) # this can take some time
len(cc)
```

Out[14]:

In [15]:

```
list(cc[0])[:20]
```

Out[15]:

Using the community detection algorithm, we detected 66 communities of different sizes. Let's view the distribution of the community sizes:

In [16]:

```
import matplotlib.pyplot as plt
community_size_list = [len(c) for c in cc]
plt.hist(community_size_list)
```

Out[16]:

We can see that most communities are relatively small. Let's find a community that is larger than 100 but smaller than 500:

In [17]:

```
selected_community_list = [c for c in cc if 500 > len(c) > 100]
len(selected_community_list)
```

Out[17]:

Let's draw the first of these communities:

In [18]:

```
plt.figure(figsize=(20,20))
c1 = h.subgraph(selected_community_list[0])
nx.draw_kamada_kawai(c1, with_labels=True)
```