Node size: PageRank value
Node color: group (community label)
Link width: number of passes (made and received)
Workflow
Calling the API
I used the playerdashptpass endpoint and saved the data for every player on the team to local JSON files.
The data comes from the passing records of the 2015-16 season.

import os

# Golden State Warriors player IDs
playerids = [201575, 201578, 2738, 202691, 101106, 2760, 2571, 203949, 203546,
             203110, 201939, 203105, 2733, 1626172, 203084]

# Call the API and store each result as JSON
for playerid in playerids:
    os.system('curl "http://stats.nba.com/stats/playerdashptpass?'
              'DateFrom=&'
              'DateTo=&'
              'GameSegment=&'
              'LastNGames=0&'
              'LeagueID=00&'
              'Location=&'
              'Month=0&'
              'OpponentTeamID=0&'
              'Outcome=&'
              'PerMode=Totals&'
              'Period=0&'
              'PlayerID={playerid}&'
              'Season=2015-16&'
              'SeasonSegment=&'
              'SeasonType=Regular+Season&'
              'TeamID=0&'
              'VsConference=&'
              'VsDivision=" > {playerid}.json'.format(playerid=playerid))
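The long query string above is easy to mistype, so here is a minimal sketch of building the same URL from a dictionary with the standard library. The helper name `build_pass_url` is hypothetical and not part of the original script; the parameter names and values are copied from the curl command. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

BASE_URL = "http://stats.nba.com/stats/playerdashptpass"


def build_pass_url(player_id, season="2015-16"):
    """Return the playerdashptpass URL for one player (hypothetical helper)."""
    params = {
        "DateFrom": "", "DateTo": "", "GameSegment": "",
        "LastNGames": 0, "LeagueID": "00", "Location": "",
        "Month": 0, "OpponentTeamID": 0, "Outcome": "",
        "PerMode": "Totals", "Period": 0, "PlayerID": player_id,
        "Season": season, "SeasonSegment": "",
        "SeasonType": "Regular Season",  # urlencode turns the space into '+'
        "TeamID": 0, "VsConference": "", "VsDivision": "",
    }
    return BASE_URL + "?" + urlencode(params)


url = build_pass_url(201939)
print(url)
```

Keeping the parameters in a dictionary makes it simple to change the season or the player without touching the rest of the string.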
JSON -> pandas DataFrame
Next, I combined the JSON files into a single DataFrame.
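import json
import pandas as pd

frames = []
for playerid in playerids:
    with open("{playerid}.json".format(playerid=playerid)) as json_file:
        # The first result set holds the passes made by this player
        parsed = json.load(json_file)['resultSets'][0]
        frames.append(pd.DataFrame(parsed['rowSet'], columns=parsed['headers']))

# DataFrame.append was removed in pandas 2.0, so concatenate once instead
raw = pd.concat(frames, ignore_index=True)
raw = raw.rename(columns={'PLAYER_NAME_LAST_FIRST': 'PLAYER'})
# "Last, First" -> "LastFirst", used as the node id
raw['id'] = raw['PLAYER'].str.replace(', ', '')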
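To make the shape of the response clearer, the sketch below pairs `headers` with each entry of `rowSet` using only the standard library. The sample payload is made up for illustration: only the keys `resultSets`, `headers`, `rowSet` and the column names come from the code above, and the pass counts are not real data. `rows_as_dicts` is a hypothetical helper name.

```python
import json

# Made-up sample shaped like the response the code above reads
sample = json.loads("""
{"resultSets": [{"headers": ["PLAYER_NAME_LAST_FIRST", "PASS_TO", "PASS"],
                 "rowSet": [["Curry, Stephen", "Green, Draymond", 10],
                            ["Curry, Stephen", "Thompson, Klay", 7]]}]}
""")


def rows_as_dicts(payload):
    """Pair the column headers with every row of the first result set."""
    result = payload["resultSets"][0]
    return [dict(zip(result["headers"], row)) for row in result["rowSet"]]


records = rows_as_dicts(sample)
for record in records:
    # Same id rule as above: "Last, First" -> "LastFirst"
    record["id"] = record["PLAYER_NAME_LAST_FIRST"].replace(", ", "")

print(records)
```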
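Preparing vertices and edges
GraphFrames in Spark needs the data in a specific format: a table of vertices plus a table of edges. Vertices are the nodes of the graph, identified by player ID, and edges are the relationships between the nodes. You can add extra attributes such as a weight, but it is hard to know which attributes will be useful in the later analysis. One workable approach is brute force: the code below writes one edge row for every single pass. (Comments and discussion are welcome.)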
# Build the initial vertices
pandas_vertices = raw[['PLAYER', 'id']].drop_duplicates()
pandas_vertices.columns = ['name', 'id']

# Build the initial edges: one row per pass
edge_frames = []
for passer in raw['id'].drop_duplicates():
    # Only keep receivers who are also in the player list
    for receiver in raw[(raw['PASS_TO'].isin(raw['PLAYER'])) &
                        (raw['id'] == passer)]['PASS_TO'].drop_duplicates():
        n_passes = int(raw[(raw['id'] == passer) &
                           (raw['PASS_TO'] == receiver)]['PASS'].values[0])
        edge_frames.append(pd.DataFrame(
            {'passer': passer, 'receiver': receiver.replace(', ', '')},
            index=range(n_passes)))
pandas_edges = pd.concat(edge_frames)
pandas_edges.columns = ['src', 'dst']
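The repeated-row trick can be hard to picture, so here is a small standard-library sketch of the same idea. The pass counts are made up, and `expand_edges` is a hypothetical helper. It shows why repeating an edge once per pass makes plain in-degree and out-degree equal to passes received and passes made.

```python
from collections import Counter

# Made-up pass counts: (passer id, receiver id) -> number of passes
pass_counts = {
    ("CurryStephen", "GreenDraymond"): 3,
    ("GreenDraymond", "CurryStephen"): 2,
    ("CurryStephen", "ThompsonKlay"): 1,
}


def expand_edges(counts):
    """Return one (src, dst) tuple per pass, mirroring the repeated rows."""
    return [(src, dst) for (src, dst), n in counts.items() for _ in range(n)]


edges = expand_edges(pass_counts)

# On the expanded edge list, degrees are simply pass totals
out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

print(len(edges), dict(out_degree), dict(in_degree))
```

The cost of this design is size: the edge table grows with the total number of passes rather than with the number of passer and receiver pairs.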
Graph analysis
from graphframes import GraphFrame

# sqlContext is assumed to exist already, as it does in the pyspark shell
vertices = sqlContext.createDataFrame(pandas_vertices)
edges = sqlContext.createDataFrame(pandas_edges)

# Analysis part
g = GraphFrame(vertices, edges)
print("vertices")
g.vertices.show()
print("edges")
g.edges.show()
print("inDegrees")
g.inDegrees.sort('inDegree', ascending=False).show()
print("outDegrees")
g.outDegrees.sort('outDegree', ascending=False).show()
print("degrees")
g.degrees.sort('degree', ascending=False).show()
print("labelPropagation")
g.labelPropagation(maxIter=5).show()
print("pageRank")
g.pageRank(resetProbability=0.15, tol=0.01).vertices.sort(
    'pagerank', ascending=False).show()
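For readers who want to see what a PageRank score measures on this kind of edge list, here is a minimal power-iteration sketch in plain Python. It is not the GraphFrames implementation: it is the textbook normalized variant, so the absolute values may differ from what GraphFrames prints, and the edge list is made up. The name `reset_probability` mirrors the `resetProbability` argument above.

```python
def pagerank(edges, reset_probability=0.15, iterations=50):
    """Textbook PageRank by power iteration; repeated edges count as weight."""
    nodes = sorted({node for edge in edges for node in edge})
    out_count = {node: 0 for node in nodes}
    for src, _ in edges:
        out_count[src] += 1

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly
        dangling = sum(rank[n] for n in nodes if out_count[n] == 0)
        base = (reset_probability
                + (1 - reset_probability) * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for src, dst in edges:
            new_rank[dst] += (1 - reset_probability) * rank[src] / out_count[src]
        rank = new_rank
    return rank


# Made-up passes: A->B three times, B->A twice, A->C once, C->A once
sample_edges = ([("A", "B")] * 3 + [("B", "A")] * 2
                + [("A", "C")] + [("C", "A")])
ranks = pagerank(sample_edges)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

In this toy graph, A ranks highest because every pass from B and C goes to A, which is the intuition behind using PageRank for node size.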
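Network visualization
When you run gsw_passing_network.py from the GitHub repository, make sure the working directory contains the three files passes.csv, groups.csv and size.csv. I used the networkD3 package in R to build the interactive D3 chart.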
library(networkD3)

setwd('/Users/yuki/Documents/code_for_blog/gsw_passing_network')
passes <- read.csv("passes.csv")
groups <- read.csv("groups.csv")
size <- read.csv("size.csv")

# forceNetwork expects zero-based node indices
passes$source <- as.numeric(as.factor(passes$PLAYER))-1
passes$target <- as.numeric(as.factor(passes$PASS_TO))-1
passes$PASS <- passes$PASS/50   # scale down the link width

groups$nodeid <- groups$name
groups$name <- as.numeric(as.factor(groups$name))-1
groups$group <- as.numeric(as.factor(groups$label))-1
nodes <- merge(groups,size[-1],by="id")
nodes$pagerank <- nodes$pagerank^2*100   # exaggerate differences in node size

# d3.scale.category10() is the D3 v3 name; newer networkD3 releases that
# bundle D3 v4 need JS("d3.scaleOrdinal(d3.schemeCategory10);") instead
forceNetwork(Links = passes,
             Nodes = nodes,
             Source = "source",
             fontFamily = "Arial",
             colourScale = JS("d3.scale.category10()"),
             Target = "target",
             Value = "PASS",
             NodeID = "nodeid",
             Nodesize = "pagerank",
             linkDistance = 350,
             Group = "group",
             opacity = 0.8,
             fontSize = 16,
             zoom = TRUE,
             opacityNoHover = TRUE)
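The `as.numeric(as.factor(...))-1` idiom turns names into zero-based integers. Factoring `PLAYER` and `PASS_TO` separately only yields matching numbers when both columns contain the same set of names. The Python sketch below illustrates the mapping with a single shared index, which avoids that assumption. The rows are made up, `zero_based_index` is a hypothetical helper, and Python's sort order may not match R's factor levels in every locale.

```python
def zero_based_index(names):
    """Map each distinct name to 0, 1, 2, ... in sorted order."""
    return {name: i for i, name in enumerate(sorted(set(names)))}


# Made-up rows shaped like passes.csv: (PLAYER, PASS_TO, PASS)
passes = [
    ("Curry, Stephen", "Green, Draymond", 3),
    ("Green, Draymond", "Curry, Stephen", 2),
    ("Curry, Stephen", "Thompson, Klay", 1),
]

# One index built from both columns, so source and target always agree
index = zero_based_index([p for p, _, _ in passes] + [t for _, t, _ in passes])
links = [{"source": index[p], "target": index[t], "value": n}
         for p, t, n in passes]

print(index)
print(links)
```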
References
Introducing GraphFrames
Project source code on GitHub
Weaver: A High-Performance, Transactional Graph Database Based on Refinable Timestamps
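This article was translated by HarryZhu with permission from the original author, YUKI KATOH.
English original: http://opiateforthemass.es/ar...
As a believer in sharism, I release all text and images I publish online under a CC license. When reposting, please keep the author information and credit Harry Zhu's FinanceR column: https://segmentfault.com/blog.... If source code is involved, please cite the GitHub address: https://github.com/harryprince. WeChat ID: harryzhustudio. For commercial use, please contact the author.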