


>> Lev Nachmansen: Everybody is familiar with his seminal work on treemaps and many other works that have to do with information visualization and computer -- human-computer interaction. And of course he has a long list of awards that I cannot talk about because if I say all of this then we wouldn't have time to hear his speech.

However, I have the privilege -- should I introduce it now?

>> Ben Shneiderman: Please.

>> Lev Nachmansen: I have the privilege to announce that he received the IEEE Career Award on visualization, which will be awarded in the InfoVis next week.

>> Ben Shneiderman: That's right.

[applause]

>> Ben Shneiderman: All right. Thank you for that kind introduction. I'm very, very pleased to be here. I've for years followed the Graph Drawing Conference, but I don't think I've ever been. I was invited to be on the program committee. So I'm very pleased that you're seeking the interactive element as part of this community.

My background is very much in this framework of algorithm design for database and file strategies and indexing and search, and those kinds of algorithms still to me are the heart of computer science, but I've become what I say is 20 percent of an experimental psychologist in trying to study the way people use technology.

So I was pleased that at least one of the talks had some empirical results about people, not just empirical results with data.

And so I'm here today in appreciation and recognition of your work, but also I hope to change your position a little, your attitude, and maybe shift your attention towards the real opportunities. Because 20 years ago when graph drawing started, it was a very different world. Networks were a rare beast. It was hard to get the data. There weren't that many people who were interested. Now suddenly we're surrounded by social media, and the opportunities and the demand and the pressure and the interest in visualization overall, the number of blogs and the cultural phenomenon that visualization has become, is startling.

So I'm very pleased with that. So maybe I'll just plant one idea in your mind; that maybe graph drawing should rename, still keep a GD, but call it graph discovery. Because the idea is discovering and making insights. The purpose of drawing graphs is not pictures, it's insight. And that's what I hope to show and promote.

But first I have great appreciation to the organizers, Maurizio and Walter, and a copy of the book, which I've signed for them, so --

>>: Thank you very much.

>> Ben Shneiderman: Very much appreciate the opportunity, and thank you very much.

So today is meant as kind of a review, and you can look in the paper for a more detailed analysis, and I'm pleased that Cody Dunne, a Ph.D. student working on these problems, will present the latest -- last part of the talk about his work.

So also I'm proud to represent the human computer interaction lab, which this year celebrates its 30th anniversary. And I gave up being director to Ben Bederson and then Allison Druin, and Jen Golbeck's now the director. We are supported and administered by both computer science and the College of Information Studies and many relations around campus with different departments, including the wonderfully titled Maryland Institute for Technology and the Humanities.

So these humanities applications are increasingly interesting ones. In fact, I'm working with a classics professor who has the social network of Alexander the Great, over 32 years of his life as he traveled around and 650 connections, and it's a fascinating story, and how do we make a visual representation of that social network is the kind of challenge that she's asking.

So MITH is the Maryland Institute for Technology and Humanities. If you visit our lab's Web site, you'll find 650 technical reports, 200 videos, 40 pieces of software, and lots more about our projects.

I hope you know me from the book Designing the User Interface, which is now in 5th edition. It's written -- my coauthor for the fifth edition is Catherine Plaisant who's been my collaborator for 25 years, and the work you'll hear about has partially come from her, but also the wonderful graduate students that I've had the pleasure to work with over the years.

The story for you here is to recognize that when the 5th edition came out it had a whole new section on social media. In 2004, when we did the 4th edition, there was no Twitter, there was no Facebook, YouTube was small, Wikipedia was just starting. And now all of a sudden we're surrounded by social networks. If you haven't heard, that's the hot story around.

And so visualization has also gained a separate chapter, and so those are the important issues.

With Stu Card and Jock Mackinlay we tried to lay out the basis for this new field. This is a book from -- that goes back a few years now. And Stu Card gave the title Using Vision to Think; that visual representations are not just a representation but it's a way of solving problems. And that was really the significant point; that within 400 milliseconds, if the interface is correctly designed with color, shape, size, and proximity, then you will be able to spot clusters, gaps, outliers, and trends in that short amount of time. So there's many implications, and we've explored that. This book collects 47 papers from different sources and 60,000 words of our own work.

I think I just can't resist telling you about Spotfire, our early work. The paper was in 1994 at the CHI conference, remains one of the most cited papers and led to the company formed by Chris Ahlberg in '97 which grew to 200 people by 2007 and was purchased by TIBCO. So it was a great success story.

Here we're looking at 15,000 births in Washington, D.C. The red dots are girls, the blue dots are boys. The age of the mother is over here. You can see they go from about 12 to 50. The age of the father from about 13 to 65. And you get to see many, many patterns, these multiple coordinated windows. And the dynamic query sliders were the key features of that innovation.

Spotfire has grown to be a place to analyze -- tool for analyzing large complex datasets, and one of the lessons we've learned that's appropriate for today's talk is that one single visualization is not the way to show a complex amount of information, but here are 27 windows that are coordinated and so that if you filter, it filters everywhere. If you select in one window, it highlights in the other windows. That's the way to deal with complexity in data, not by trying to pack everything into one screen.

Okay. So the visual world is getting richer and richer, and here you see examples of the kind of environments people are working in to increase productivity, make better decisions and understand the world around us.

And, as I said, there's just a rich cultural phenomenon around -- just last night I saw there's a new -- there's two new blogs and a new conference. The New York Times will have a conference November 8th and 9th in New York called Visualized, which brings together 25 designers who are looking at or making creative visualizations.

Control rooms with lots of visual information and collaborative environments are becoming more and more the way realtime decisions are made. People, this is -- the counterterrorism center.

Also on small devices we see increasing use of visualization, and that's become another popular phenomenon. You can even see some treemaps over here to get an idea of what's going on.

So we learn from that. I wrote it down one day in a very playful way and called it the information seeking mantra, and I wrote it in this paper, 12 lines, each one represents one project where we struggled for weeks or months to find the right design, and it turned out to be show the overview first, even if it's a million or a billion items, so the users can get an understanding of the range of data, the clusters, the size of the clusters, the gaps, the outliers, and so on, and then allow the user to zoom in on what they want, filter out what they don't want, and click for details on demand.

And this has collected almost 2,000 citations, which is kind of amazing, and people who use it, people who contradict it, people who extend it, people who make jokes about it, so it's gained its own kind of little phenomenon.

And I think what people like about it is that it asserts this centrality of human decision-making -- [phone ringing] that's embarrassing -- the centrality of human decision-making where the user gets the overview, the user zooms in on what they want and then filters out what they don't want.

So we're not talking about algorithms or data mining. We're talking about a process by which users make decisions, make discoveries and make insights.

I was very pleased, for example, in March the White House issued its statement about big data and its expenditure, about $220 million in this country from seven different research agencies, and I had some influence, I'm pleased to say, but in that three-page press release the word "visualization" appeared five times. The words "data mining" did not appear at all.

So we're seeing this sort of shift in understanding that visual analytics and visual approaches are the way people make discoveries and that we support discovery by people aided by rich and powerful statistical methods; that integration of statistics and visualization is what I really want to stress with you.

So a little bit of my way of seeing the field. We have the traditional field of scientific visualization, which has a 50-year history, including geographic information systems and medical and architecture and so on. These are great success stories, especially if you go to Hollywood movies or play video games.

But the story that I'm talking about is here. Multivariate data, where Spotfire has been joined by very effective competitors like Tableau and many other tools, temporal data series, tree structures, and seems many people know about treemaps, so I'm pleased about that.

And then I save for myself in my work networks for last, because they seem to me the most difficult aspect of the work; that is, by -- when I think of networks, I think of nodes and edges, but the nodes may have many attributes and the edges may have many attributes and the problems we have to ask against those networks are very complex. And so that I felt was a substantive challenge.

And so I've become more and more devoted to this issue, especially because the social media have produced such huge resources and such important questions that we need to understand, not just for entertainment or e-commerce, but also for important national priorities, such as disaster response, health care, community safety, just so many ways that the benefits -- I noticed outside I think there's a sign left from yesterday where Christakis was speaking here. Maybe someone attended his talk. But he's a key figure at Harvard Medical School who has promoted the notion that if you study the networks of patients, you will find out that patients become obese if their friends become obese, they lose weight if their friends lose weight, they stop smoking if their friends stop. And the social networks determine these medical outcomes in a way that's remarkably powerful. In fact, so powerful that there are many skeptics of Christakis's work.

We've run this Summer Social Webshop with 50 doctoral students around the country twice now and it's been a great success story, and we're just happy to continue that.

And I just want to end the introduction by saying I hope you will think every time you go do your work that some way you're contributing to these important priorities of not only national but international, and I like to use this as an illustration, the goal set out by the United Nations in the year 2000 of ending poverty and hunger, universal education, gender equality, child health, maternal health, combat HIV/AIDS, environmental sustainability, and global partnership.

In some way I want my discipline and the work I do, and I hope you devote yourself also, to working in ways that your work gets applied to making the world a better place.

Okay. So we turn now narrowly to focus on networks, and hope some of you know this wonderful Web site by Manuel Lima called Visual Complexity, ironically called Visual Complexity. He has a new book out called Visual Complexity that I think you might want to take a look at, beautifully produced book that shows many network drawings. And he has 772 examples of network systems and pointers to those working tools. And as you can see many of them are very colorful and very beautiful, but many of them are also a mess, and the usual talk of hairball or bird's nest or spaghetti is what we see.

So some of them are beautiful and we might admire them, like Hubble Telescope photographs, and we can say something about the clusters and the size of groups here, but it's pretty hard to make sense of it. Some of you might want to frame it and put it on the wall, but I'm not sure if you can make any insights or discoveries in which you would make a decision to change things.

And some more examples, these tangled messes where you cannot see what's going on, there are some labels, but you don't even know what the labels are connected to, et cetera.

Okay. So one time to continue the mantra idea I made this little phrase of NetViz Nirvana. Our goal I would say for network visualization is that every node should be visible. I think you all agree with that and the metrics developed, Peter. I should say it's great to be in the room with heroes of mine like Peter Eades and Ulrik Brandes and other leaders and Roberto Tamassia and others. And actually all four authors of the great graph drawing book are in the room together, which is quite a wonderful thing. And also new younger stars of people who are working and doing great work in this area.

So, I mean, the idea that every node be visible is pretty common in this community, and there are metrics for visibility, et cetera, but for every node you can count its degree, for every link you can follow it from source to destination, and for the cluster you can even see them all and maybe see their sizes and also spot the outliers.

So I wrote this down in a rather playful way, but it's become a pretty important thing. And like nirvana, it's never really attainable, or not always attainable. But it's something we should strive for in order to make graphs visible, comprehensible in a way that people can make insights that they can depend on, that they can make a decision, that they can commit action to.

So here's the outline for the talk. There are four methods I want to talk about. These are all interactive and dynamic approaches that we have been developing and refining inside the tool NodeXL. That's the book that I handed out, and you'll see more of that.

And there are basic ideas of filtering: the dynamic filtering queries are alive and well in NodeXL, double-box sliders by which you can filter out the low edge density or the high edge density or both, or you can look for the high eigenvector centrality or low eigenvector centrality. All these different metrics are built in. And then we'll look at clustering, grouping, and motif simplification. So those are the goals here.

And in a way I see this as the beginnings, the beginnings of a process model. What do I do first? Well, first I want to filter to look at a simplified graph. Let me try that, see what I can learn from filtering, then let me try clustering, see what I get from that, grouping, maybe grouping first or clustering first, and then we'll see about motif simplification.

Okay. So we just start, and we'll just take quick examples of these. There's more examples in the paper.

So here is a great story that came to us from a practical problem. A journalist named Chris Wilson working in Washington, D.C., for Slate Magazine wanted to analyze the senate voting pattern. So there are in the U.S. 100 senators, and he had the data for the year 2007. And what he was trying to look at is the similarity in voting patterns. Okay. So the strength of each edge is an indication of how many times they voted the same way on a bill. Okay. So if there are a hundred senators, how many edges are there?

>>: [inaudible]

>> Ben Shneiderman: No, no. Not N squared. 100 choose two, which is?

[laughter]

>> Ben Shneiderman: I'll wait.

>>: [inaudible]

>> Ben Shneiderman: Let's see. This is -- who is this story? This is --

>>: [inaudible]

>> Ben Shneiderman: 4950. Okay. 4,950 edges. And of course it therefore forms a very dense, packed area. So that's not going to help. If you see all the edges at once, you really can't see the patterns. But if you filter out to show 65 percent similarity, you get this lovely network. Okay.
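[Editor's sketch] The edge count and the threshold filter described here can be illustrated in a few lines of Python. This is a minimal sketch, not NodeXL's implementation: the function names `similarity_edges` and `filter_edges` are hypothetical, and the vote records passed in are assumed to be equal-length lists, one entry per bill.

```python
from itertools import combinations
from math import comb


def similarity_edges(votes):
    """Map each unordered pair of senators to the fraction of bills
    on which they voted the same way.

    votes: dict of name -> list of votes (hypothetical input format;
    all lists are assumed to have the same length).
    """
    edges = {}
    for a, b in combinations(sorted(votes), 2):
        agree = sum(x == y for x, y in zip(votes[a], votes[b]))
        edges[(a, b)] = agree / len(votes[a])
    return edges


def filter_edges(edges, threshold):
    """Keep only the edges whose similarity is at least `threshold`."""
    return {pair: w for pair, w in edges.items() if w >= threshold}


# 100 senators give "100 choose 2" unordered pairs.
print(comb(100, 2))  # 4950
```

Calling `filter_edges(edges, 0.65)` corresponds to the 65 percent similarity cut mentioned in the talk; everything below the cut is simply dropped before layout.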

And the blue democrats are here and the red republicans are here, and in the middle we have three senators: Olympia Snowe, Arlen Specter, and Susan Collins. And this is 2007, so they are closer to the democratic position, they have a stronger relation. This is Fruchterman-Reingold layout on top of the filtering.

And so it really showed a dramatic result. And it was remarkably predictive because two years later in February 2009 three republicans crossed over to vote for the Obama Stimulus Bill, and it was exactly these three senators. So you can see quite a lot. And if you look carefully, you'll find that the more liberal, progressive democrats are over here, the more conservative republicans are over here.

So we did a lot just by filtering, but it's still not perfect and you'll see how we do even better to work on this graph.

And this shows you a first example of NodeXL. It's embedded in Excel as a template, so its strength is that it's easy to use, it's free, embedded in Excel, and its disadvantage is that it's embedded in Excel, which has many limitations, and so it has these problems. I mean, the benefits are -- we believe that we are -- with the book are trying to promote the democratization of social network analysis, to allow many more people to do it. It does not require programming, does not require advanced work, and in a couple of weeks in a sociology or a political science, or computer science, class -- in my class I have a three-week section, and my students do very ambitious projects inside NodeXL. So you might want to try it, free to download. We've had 125,000 downloads, so you can join that crowd and take a look at it.

Okay. So that was simple and, you know, many other examples of filtering. It's an easy concept and it's just easily done with a slider inside NodeXL and you can filter by any of the metrics of the nodes or the edges and -- so we're going to look more at clustering. Clustering I define here. There are many terms: aggregation, clustering, grouping we'll see, simplification, summarization, meta nodes. We still have to get our language organized.

Graph theory has been marked over the years by a failure of consistency in terminology, with the battling language of different groups making it hard to speak. But we'll see if I can convince you of that.

So here is a graph from somebody else. This is the network of people -- of actors in the play Les Misérables. And there's an edge between any pair of actors if they appear in the same scene. Okay?
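[Editor's sketch] The edge rule stated here (two characters are linked if they share a scene) can be sketched as a co-occurrence count in Python. The function name `cooccurrence_edges` and the input format are assumptions for illustration; counting shared scenes as the edge weight is one plausible way to get the edge strength that is later drawn as thickness.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(scenes):
    """Count, for every unordered pair of characters, how many scenes
    they share.

    scenes: iterable of character-name lists, one list per scene
    (hypothetical input format).
    """
    weights = Counter()
    for cast in scenes:
        # set() guards against a name listed twice in one scene;
        # sorted() gives each pair a single canonical key.
        for a, b in combinations(sorted(set(cast)), 2):
            weights[(a, b)] += 1
    return weights
```

A pair that never shares a scene gets no entry at all, so the result is already an edge list that could be pasted into a spreadsheet-style tool.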

Now, you can't make too much of that in here. And even if you look at it that way, there's not too much that leaps out. So that's really a bad example and it doesn't satisfy the NetViz Nirvana principle: the nodes are occluding one another, the edges are impossible to follow, you really can't see the structure or the clusters.

So inside NodeXL -- this is one of the examples from the book -- we do a little bit more of color coding by clustering. We have three different clustering algorithms built into NodeXL. If you want to add a fourth one, please help us by extending NodeXL and adding the code for yet another one.

So we can see immediately the main character, Jean Valjean, who appears in many scenes, and then there are various others, like Fantine, some of you may know the play, and Javert, who are key players, and then there are some groups that appear only in one scene. They form a clique over here.

And so we have other -- another small clique. They appear only once and they appear in the same scene and that's the end of it. So you can understand they're not very important, although usually you think of cliques as important, but here they're relatively low importance.

And then you can see other clusters of characters who have strong ties and connections with each other, and the thickness of the edges shows the strength of the connections between them.

So the clustering here illustrated by coloring gives you some help in understanding and making sense of it, and, again, size, color coding and so on. And then in here we label only the key players so we don't clutter the screen with other information.

This was another wonderful success story. We also -- NodeXL has importers for networks from Flickr, YouTube, Facebook, and other sources, GraphML and many other formats. You can just type it in or cut and paste it. If you have an edge list, you can just cut and paste it.

This was just another -- this was a remarkably good application of clustering where we took all the photos in Flickr that had the word "mouse" in them, and then by the co-occurrence of other terms, we created the linkages, and the clustering algorithm did a perfect job. And so you see in natural language processing this is called word sense disambiguation. And so here the yellow cluster is exactly the computer mouse, the blue cluster is the animal mouse, and the red cluster is Mickey Mouse. And so it just turned out to be a really nice example where clustering worked very sweetly. It's not always so fortunate that clustering works out so well, but here was a good example.

We also began to study and see the patterns of clustering in popular data. This is the Twitter stream of all the tweets at a certain point that had the hashtag "GOP." In the U.S. GOP stands for Grand Old Party, which is a short form for the republican party.

So these are all the people who used hashtag GOP in a tweet, and the clustering showed a large cluster, which we did color red for the republicans, the traditional color, and a smaller cluster of blue for the democrats. The red cluster is much more dense. There's many more of them. They're thickly connected. And there's a high density there. And there's a fewer number and a less dense connection over here.

And you can see the bridge between them is relatively mild except for one large node. These are betweenness-centrality-sized nodes, and that one node is the political Web site called Politico which both democrats and republicans will read. And so we got to see quite a lot.
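[Editor's sketch] Sizing nodes by betweenness centrality rewards exactly the kind of bridge node described here. The sketch below computes betweenness on a small unweighted, undirected graph using Brandes's accumulation scheme; it is an illustration only, the adjacency-dict input format is an assumption, and nothing here is taken from NodeXL's own code.

```python
from collections import deque


def betweenness(adj):
    """Betweenness centrality for an unweighted, undirected graph.

    adj: dict of node -> iterable of neighbours (assumed symmetric).
    Returns dict of node -> number of shortest paths between other
    node pairs that pass through it (fractional when paths tie).
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to s.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in bc.items()}
```

In a toy graph of two triangles joined only through one hub node, the hub scores highest because every path between the two groups runs through it, which is the role the talk attributes to the one large bridging node.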

There's another cluster of green which are kind of independents. They're floating around over here, and some other smaller clusters around there, but they didn't quite show up there. But I think you get the idea. This is a traditional example of conflict in social networks, where there's two tightly woven groups that are quite independent and the bridge between them is relatively low.

Okay. Any questions about this? You got the idea? All right. So this was actually for Microsoft TechFest on the campus here in '11. We were beginning to develop these techniques. This is still Fruchterman-Reingold. But with clustering you can see the colored clusters, not very effective, and this is what motivated us to try to do better.

On the bottom the singletons who are not connected -- I should say the way the network is formed is Twitter lets you download not just, you know, if you search on these terms you'll get all the tweets, but then you'll get the person who did the tweet and you get their follower network. So you build a network out of the followers. Okay? So very powerful.

These people were not connected. About 20 percent of them were just independent. And then you had one main large cluster here, but not well differentiated. And, as you can guess, the clusters are not strongly identified, but we developed the technique called group in a box by which we put the clusters in, believe it or not, a treemap.

So here are the singletons. That's the biggest group, actually. And then this main cluster -- this is Microsoft and its publicity mechanisms, and so all the people around here. These were Microsoft Research employees. And then we had a certain group -- I forgot which group this was, but we had a Brazilian group in here, which surprised us. And then the smaller groups get laid out over here.

So I would advocate rather than trying to draw one graph that you draw these multiple boxes. In these cases you can say the clustering is not so solid because there are lots of links between some of these clusters. So they're not tight or well-formed clusters. In NodeXL we let you delete the edges or bundle them if you wish -- I'll show you that soon -- between the clusters so as to clarify what's going on in the cluster.

My playful motivation for this when I gave the first talk about this was I went out and bought some grapes and I sort of asked my audience, and I can ask you, how many grapes are in this picture? Any guesses?

>>: [inaudible]

>> Ben Shneiderman: A hundred? 200? 300? Well, that's pretty good. There are 149 grapes here. But it's hard to tell, hard to count the grapes.

How many clusters are there? It's hard to see. But if I tear them apart and then lay them out on the table, as I did, you can count them. And there are -- you can count -- pretty close, there's still a couple of obscured ones, occluded ones, but pretty close to counting the 149, and you can see the nine clusters that were there in the grapes.

So it's sort of motivating the idea. And this was a conference I attended in April at MIT called Collective Intelligence, and this one turned out to be very nice and a good demonstration of our techniques.

And so this main cluster was a group of academics that includes me over here. I've become quite active in Twitter. Actually, how many people have Facebook accounts? About a third. How many people have Twitter accounts? Whoa, only about six or eight. That's pretty interesting and pretty typical. Computer science people are not quite into the social network things as much.

And I have to tell you, when I speak to business or sociology or other student groups, it's 95 percent. And this conference does not even -- does it have a hashtag? I searched on GD 2012. I found three tweets including one that I had of announcing the conference. There were two others. But there's just not a tweet stream coming from this audience, which just reminds you about the sort of different kinds of people and communities there are.

But conferences like the Webshop we ran generated thousands of tweets in a 24-hour period. So people are quite active, and understanding those patterns is of course an important social question, business question, but also important national security and health and other applications, which is why there's such a strong interest in studying these Twitter patterns.

In any case, the largest group was over here, and there were other people you may recognize. I guess Elizabeth Churchill from Yahoo!  I can't even see it on this screen from here. Let's see. Sean Munson, now University of Washington, Mike Bernstein from MIT. There were a whole bunch of those academics.

And then we had another group of -- this was I guess the French group, this Brazilian group was here, this woman was in the room and was very excited to see that she was quite central to the discussion there. And this German fellow over here turned out to be an important component and he had his own little community.

So being able to see these communities. And we -- in this case I used the technique called combined edges so that the edges across the clusters were combined into the single light gray edge so you could get an idea of the relative connectedness among the communities.
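[Editor's sketch] The combined-edges idea can be sketched as a simple aggregation: edges inside a cluster are kept, while all edges between a given pair of clusters collapse into one weighted meta-edge. The function name `combine_cross_cluster_edges` and the data shapes are assumptions for illustration, not NodeXL's API.

```python
from collections import Counter


def combine_cross_cluster_edges(edges, cluster_of):
    """Split edges into within-cluster edges and combined meta-edges.

    edges: iterable of (u, v) node pairs.
    cluster_of: dict of node -> cluster label (labels must be sortable).
    Returns (within, combined) where `within` is the list of edges whose
    endpoints share a cluster and `combined` maps each unordered pair of
    cluster labels to the number of edges running between them.
    """
    within = []
    combined = Counter()
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            within.append((u, v))
        else:
            combined[tuple(sorted((cu, cv)))] += 1
    return within, combined
```

The count on each meta-edge could then drive the width or darkness of the single line drawn between two cluster boxes, giving the relative connectedness among communities at a glance.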

We also do edge bundling and curly edges and we can have tight or loose edge bundling. And so here was another conference that was run at Maryland called Theorizing the Web, and the main cluster had links to quite a few of the other clusters.

But I'll show you others where you see distinct differences among the clusters. For example, you see no links between these clusters, or very few between these other clusters, but this main cluster had many links to the others.

This gives you an example of the power of this. We were approached, Cody and I, and especially Cody worked with Scott Dempwolf, an analyst who was working on innovation patterns in the state of Pennsylvania. This is 11,000 nodes and 26,000 edges. It looked quite beautiful. Done in NodeXL. But it doesn't really tell you too much about what's going on there. And so we need to -- hope you can see it.

But there we broke out the clusters. The main cluster turned out to be two key individuals who each had about a hundred patents, and they were the main drivers of innovation and economic development in the state of Pennsylvania.

Secondly, Westinghouse Electric in Pittsburgh was a great source of patents and other innovative work, and then we had the unfortunate problem that the two suburbs of Philadelphia were diagonally across, and there's some wispy lines going across them there, and so you get quite a lot. So that was clustering. But if I apply filtering, I can make a simpler story and get rid of the edges, and now I can much more clearly see who are the key groups, who are the key influencers in each group, and if I want to get out there and try to promote innovation, that might be a good way, so I filter down even more to just having about a dozen groups, I now can focus my attention and know what's going on in this dataset.

I asked Scott for an analysis of Maryland innovation, and so he gave -- he favored me by doing that one, and this is our lab, Human Computer Interaction Lab. These are NSF grants and copartnerships, so there's quite a few NSF grants in our group. Catherine Plaisant's there, I'm over here, Jen Golbeck, the current director, Ben Bederson, Allison Druin, I hope you know some of these names, and the partnerships we have with other groups around the state of Maryland and their own clusters of Johns Hopkins or in Baltimore or other places.

So it was a kind of confirming sense that these analysis tools were giving us insight to what's going on. This is also kind of pretty, but it is closer to NetViz Nirvana because you can read almost all the names here. This was done automatically. And sometimes we clean these up by hand to make it a little better and clean up some of the occlusions. But I think you get the basic idea of how to do this.

And this group in a box strategy I would like to recommend to you. I hope some of you will try it. And Cody, for his dissertation, will continue to work on what we call meta layouts, other layouts which have other properties that are effective where the clusters sit inside one region and then you can see the connectedness to other regions.

A question? Thank you. Roberto.

>>: I was wondering if clusters are placed inside the [inaudible] of the treemap and how do you decide what is the classification, what is the underlying tree, so how do you decide --

>> Ben Shneiderman: It's not a tree. I mean, it's a clustering. We use the Newman-Girvan -- actually Michelle Girvan is a member of our faculty in physics, so that's the one I think was used in this case. Do you recall? But we've got [inaudible] and we've got three different clustering algorithms in there.

So you cluster. It creates clusters. And it's not a tree structure, but they're partitioned. And then they're laid out just by the size of the cluster here. The biggest cluster goes here and the smallest one goes over here. Okay. So very straightforward. It's not optimal.
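[Editor's sketch] Laying out a flat partition by cluster size can be sketched with a simple treemap-style split: sort the clusters from largest to smallest and repeatedly cut a proportional slice off the longer side of the remaining rectangle. This is an assumed stand-in, since the talk does not specify which treemap algorithm NodeXL uses; `group_boxes` is a hypothetical name.

```python
def group_boxes(sizes, x=0.0, y=0.0, w=1.0, h=1.0):
    """Assign each cluster a rectangle with area proportional to its size.

    sizes: dict of cluster label -> node count.
    Returns dict of cluster label -> (x, y, width, height) inside the
    rectangle given by x, y, w, h. Largest clusters are placed first.
    """
    boxes = {}
    ordered = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
    remaining = sum(sizes.values())
    for i, (name, size) in enumerate(ordered):
        if i == len(ordered) - 1:
            # The last cluster takes whatever space is left.
            boxes[name] = (x, y, w, h)
            break
        share = size / remaining
        if w >= h:
            # Cut a vertical slice off the left of the free area.
            bw = w * share
            boxes[name] = (x, y, bw, h)
            x, w = x + bw, w - bw
        else:
            # Cut a horizontal slice off the top of the free area.
            bh = h * share
            boxes[name] = (x, y, w, bh)
            y, h = y + bh, h - bh
        remaining -= size
    return boxes
```

Each cluster's own nodes would then be laid out inside its box with whatever layout is preferred, which keeps the groups visually separate even when the clustering itself is loose.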

And do I give away the doughnut? Cody, don't do this. It's Cody. But the idea is put the big cluster in the middle and paint the other ones around it, and then you'll be able to see the connections more easily. And Cody's got three other ideas that will be I think nice improvements in ways of dealing with clusters and relationships among clusters.

But I think also managing the edges, either deleting them, combining them, okay, or showing them, are powerful ways. Because you want to control. You're essentially filtering those edges. You want to control the visibility so as to achieve NetViz Nirvana to be able to read what remains and to understand what's going on.

And selectively, as a sequence, not just one picture that solves a problem, but a process by which you interact and you successively explore hypotheses or seek out attributes that you believe are important. You go in -- if people come to me with a network and they say show me what's there, I say you're not ready. And I say what's your question? If you don't have a question, you're not ready to work. Okay. You have to have a question. You have to have at least one. I mean, we train our potential users to have questions, and that's where it starts.

Now, you know, we tried to have a systematic-yet-flexible, SYF, systematic-yet-flexible, process to explore. So we try to go in order so we would accomplish the systematic approach. But when something interesting pops up, you want to be flexible to go exploring. So it's not an automated process. And domain experts have a huge amount of knowledge by which they will spot things that we can't spot. Okay. Anything else? Thank you, Roberto. Yes, Christian.

>>: [inaudible] many of the networks that you have shown so far are actually two-mode networks, like this one as well where the edges are defined by some other type of entity and the co-adjacency towards these entities. Have you built any representations with these [inaudible]?

>> Ben Shneiderman: Let's see. If I remember, this is not a bipartite. The nodes are all principal investigators on NSF grants. And so it's --

>>: NSF grants would be the other node?

>> Ben Shneiderman: The grants are not shown here. But, yes, we could.

So, yeah, bipartite graphs is a very tempting thing. We've done a few things, nothing brilliant. And I think that's another good topic that we'd love to work on. Bipartite especially is tempting. But I haven't found a brilliant solution for that except just lining them up and showing the connections there, if you have a good idea.

Spotfire does include networks and it just has two regions and it will randomly jitter them around in two regions and then show the connections that way. Not a much better solution, but it allows for more than just a one-linear flow.

So I think bipartite, tripartite are other good problems to work on. There are many, many. We have about 150 items on our NodeXL to-do list of things that we want to do. And then of course conversion from a bipartite to a single mode graph would be another natural thing we've talked about.
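The bipartite-to-single-mode conversion mentioned here can be sketched as follows. This is a generic projection, not a NodeXL feature; the function name and the grant and investigator names are made up for illustration. Two investigators are linked once for every grant they share.

```python
from collections import defaultdict
from itertools import combinations


def project_to_single_mode(grant_to_pis):
    """Project a two-mode (grant, investigator) network onto investigators.

    grant_to_pis maps each grant to the investigators on it.
    Returns {(pi_a, pi_b): number_of_shared_grants} with pi_a < pi_b.
    """
    weights = defaultdict(int)
    for pis in grant_to_pis.values():
        for a, b in combinations(sorted(set(pis)), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical data, for illustration only.
grants = {"g1": ["ann", "bob"], "g2": ["ann", "bob", "cy"], "g3": ["cy"]}
coauthor_weights = project_to_single_mode(grants)
```

The weights could then drive edge width or an edge filter in the single-mode drawing.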

This was just pretty. This was analysis of community discussion groups. They were all very independent. They were just simply a posting and then discussions that followed the posting. But it just shows you the other way you can go, and it's kind of pretty. I thought we'd make a T-shirt out of this one or something.

Okay. So let me go on to grouping. And grouping is a very simple idea. Instead of clustering where you cluster by edge, grouping you group by node attributes. For me the idea of a network being points that are just like physics points in space is not very interesting. I really look for data where the nodes have many attributes so you can do something interesting about them.

So the classic one was -- here was another version of the senate voting patterns. And if you break them out by the attribute of which region of the country they come from, then you use group in a box, you get this very nice structure, which shows the Southern senators have the high degree, the republicans are very tightly woven together, the democrats less so, and then we see other regions we have.

You can see immediately the relative number of democrats and republicans, and you can see that like in the Pacific region that kind of -- that separation does not occur nearly as strongly. And sometimes there are overlaps as well.

So you get a lot more by tearing apart the graph and showing parts of it at a time, and that is a big win if you have categorical attributes for the nodes.

We can also -- I mean, in NodeXL, you can replace multiple nodes that have the same attributes with a single group node, a node that has a big plus sign in the middle of it, so you can simplify the graphs by grouping. Which takes me immediately to Cody and the idea of let's find another way to simplify the graph like common motifs.
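A rough sketch of collapsing nodes that share a categorical attribute into a single group node, assuming an undirected edge list and a node-to-attribute mapping. The function name and the sample data are hypothetical, and this is not NodeXL's code; it only shows what information a group node and its meta edges would carry.

```python
from collections import Counter


def collapse_by_attribute(edges, attr):
    """Replace all nodes sharing an attribute value with one group node.

    Returns (group_sizes, meta_edges): how many nodes each group node
    stands for, and how many original edges run between each pair of groups.
    """
    group_sizes = Counter(attr.values())
    meta_edges = Counter()
    for u, v in edges:
        gu, gv = attr[u], attr[v]
        if gu != gv:
            # Edges inside a group disappear into the group node.
            meta_edges[tuple(sorted((gu, gv)))] += 1
    return dict(group_sizes), dict(meta_edges)


# Made-up senators and regions, for illustration only.
region = {"s1": "South", "s2": "South", "s3": "Pacific",
          "s4": "Pacific", "s5": "Midwest"}
links = [("s1", "s2"), ("s1", "s3"), ("s2", "s4"), ("s3", "s4"), ("s4", "s5")]
sizes, meta = collapse_by_attribute(links, region)
```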

>> Cody Dunne: So we can take these nodes and combine them into a meta node, but when you do that you don't know really what -- sorry -- where it came from, you don't know anything about the underlying topology, you don't know anything about the attributes.

But my idea with motif simplification is to take specific repeating patterns that take up a whole lot of the screen space and replace it with representative glyphs that tell you what's inside them.

So, for example, we have a fan motif. It's all these singly connected nodes that are connected to only one head node and then to the rest of the network.

And the idea behind the glyphs is to replace these fan nodes with a fan-shaped glyph, you know, the arc is sized according to how many nodes it's replacing, so a large glyph will replace a lot of nodes, a small glyph will replace a small amount of nodes, and then if you have a color scale on it, in this case going from orange to purple, let's take all those attributes, so let's apply a function to it, like the mean, and let's put that on the same color scale and color the glyph using that color.

That way we can show some information, anyway, about the attributes, we can show how many nodes were inside it, and then the topology that it's replacing.
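As a sketch of the fan motif described above, assuming an undirected edge list and one numeric attribute per leaf: find the degree-1 nodes, group them by their head node, and record the two quantities the glyph encodes (how many leaves it replaces, and the mean of their attribute for the color scale). The function name, the `min_leaves` parameter, and the data are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean


def find_fans(edges, attr, min_leaves=2):
    """Return {head: {"count": n_leaves, "mean_attr": mean leaf attribute}}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    leaves_by_head = defaultdict(list)
    for node, nbrs in adj.items():
        if len(nbrs) == 1:
            (head,) = nbrs
            if len(adj[head]) > 1:  # skip isolated two-node components
                leaves_by_head[head].append(node)
    return {
        head: {"count": len(leaves), "mean_attr": mean(attr[n] for n in leaves)}
        for head, leaves in leaves_by_head.items()
        if len(leaves) >= min_leaves
    }


# Made-up network: "hub" has three leaves, "x" has two.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"),
         ("hub", "x"), ("x", "y"), ("x", "z")]
leaf_attr = {"a": 1.0, "b": 2.0, "c": 3.0, "y": 4.0, "z": 6.0}
fans = find_fans(edges, leaf_attr)
```

The `count` would set the arc size of the glyph and `mean_attr` would be looked up on the same color scale used for ordinary nodes.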

Similarly, we can look at a connector motif. The idea is, you have these functionally equivalent span nodes in the middle that are doing nothing except connecting two or more other nodes together. And we can replace these with the exact same visual representation as you would get if you drew it nicely in a graph, this tapered diamond shape.

Again, we could do some sizing based on how many nodes we're replacing, we can have meta edges on each side that are sized or colored depending on the edges that they replace. And, again, we have some nice coloring for those things.
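One way to sketch connector detection, under the assumption that span nodes are nodes with exactly the same set of two or more neighbors: group nodes by neighbor set. Nodes with identical neighbor sets cannot be adjacent to each other (a node is never in its own neighbor set), so every group found this way only links its anchors. The names and thresholds here are hypothetical.

```python
from collections import defaultdict


def find_connectors(edges, min_anchors=2, min_spans=2):
    """Return {frozenset(anchor_nodes): sorted list of span nodes}."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    spans_by_anchors = defaultdict(list)
    for node, nbrs in adj.items():
        if len(nbrs) >= min_anchors:
            spans_by_anchors[frozenset(nbrs)].append(node)
    return {
        anchors: sorted(spans)
        for anchors, spans in spans_by_anchors.items()
        if len(spans) >= min_spans
    }


# Made-up wiki data: e1..e3 edit both pages, e4 edits only p1.
wiki_edges = [("e1", "p1"), ("e1", "p2"), ("e2", "p1"), ("e2", "p2"),
              ("e3", "p1"), ("e3", "p2"), ("e4", "p1")]
connectors = find_connectors(wiki_edges)
```

The number of span nodes in each group would size the diamond glyph, and the replaced edges would determine the width or color of the meta edges on each side.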

Let's look at an example here. Here we have a bipartite network. It's wiki edits, so there's four wiki pages here from the Lostpedia wiki. So they have their main discussions and they have their theory of the Lost universe discussions and they're kind of separate. So we have those four wiki pages, and then we have all the little circular editors editing those pages. You can see there's a lot of editors in those big fans that edit only one page, and there's a fair number in the middle that edit two pages. And then in the very center of the drawing there's some that edit two or three or four pages and so on.

If we take these motifs and replace them with glyphs, we get that drawing there on the right. So that really big fan down in the bottom is replaced with that large arc fan. You can see that it has a very purple attribute value, whatever purple means in this network, and then we can see the sizes of the fans throughout it.

And then we can also do the pairwise connections and see that main discussion and main have a whole lot more editors editing both of those than main and theory on the far right. We went from about 512 nodes down to 25. And so now it's easier on you, it's easier on seeing labels at a distance, and it's easier on your layout algorithms, although this is going to be a denser network after you get rid of all that peripheral stuff.

So we can also look at cliques. Cliques are very interesting parts of a network. And finding a maximal clique is a hard task for a user to do just by looking at it. We can take these cliques, so like a four clique, five clique, six clique, replace them with glyphs, again size depending how many member nodes there are in the clique. And when we look at this senate example that we just saw, 65 percent agreement, this entire network gets simplified down into three cliques and one little individual node there.

Okay. We have 51 nodes in that top right democrat clique that's all the democrats, it's two independents, and it's Olympia Snowe who's actually in that. So it's a little bit off-blue instead of just being pure blue. You can't really see that, though.

We have on the bottom left 38 republicans. And then in the right we have four moderate republicans. We have McCain -- let's see who else is in there -- that's Collins and Smith and Specter as well. These are the moderate bridge builders. And we can see based on that meta edge that they're tightly connected with the rest of the republican party, but there are also a fair number of connections to the democrats.

And then down there in the bottom we have Coburn. He's a very staunch republican when it comes to very contentious issues, but he votes with his heart, so he's a little bit of a wildcard and he just kind of pops out right there.

So let's see what happens when we go to a higher threshold of agreement. We were at 65 percent. Let's move on to 70 percent. So in the node link visualization, when we laid it out again, I think it was with Harel-Koren when I did that, it spreads out a little bit more. You see a few more of the edges disappear, and you see a little bit more information happening.
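The thresholding step can be sketched like this, assuming each senator's votes are recorded as a string of Y/N over the same sequence of roll calls. The data and the function name are invented for illustration; only the idea of keeping an edge when the fraction of matching votes reaches the threshold comes from the talk.

```python
from itertools import combinations


def agreement_edges(votes, threshold):
    """Link two senators when they voted the same way often enough.

    votes: {senator: sequence of votes, one entry per roll call, same order}.
    threshold: minimum fraction of matching votes, e.g. 0.65.
    """
    edges = []
    for a, b in combinations(sorted(votes), 2):
        same = sum(x == y for x, y in zip(votes[a], votes[b]))
        if same / len(votes[a]) >= threshold:
            edges.append((a, b))
    return edges


# Made-up voting records over ten roll calls.
votes = {"A": "YYYYNNNNYY", "B": "YYYYNNNNYN", "C": "YYYNNNYYNN"}
loose = agreement_edges(votes, 0.50)
tight = agreement_edges(votes, 0.65)
```

Raising the threshold only ever removes edges, which is why the drawing spreads out and cliques split apart as the slider moves up.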

You see Olympia Snowe come out of the democrat clique. She's got that really tight connection there still. We still see Specter and Collins in the middle, and then over on the left side we see a bunch of wildcard republicans come out.

And I was talking to my brother, he's in political science, he says yes, these are the people who you just don't know what's going to happen with them. We've got Voinovich and Vitter and Hagel and that Coburn that we saw before. And that small clique, it's still got McCain in it, but the members have actually changed as the edges have been deleted. Some of them came out of the main clique into his little moderate clique, and others went out into the rest of the network.

Moving on, 80 percent agreement, we see the network it gets bisected, we saw this really tight, dense connection on the democrat side, but the republican party cohesion is starting to break down in 2007 anyway.

Up in the very top right we have the extreme liberal democrats. These are the East Coasters. Let's see. Who's that? That's Lieberman, Feingold, Kennedy, and Biden. Feingold, anyway, used to be an East Coaster. And then the small clique just below that are the really moderate republicans. That's actually called the Blue Dog Coalition. And right there down there in the bottom, Nelson sticking out, he is the epitome of Blue Dog democrats. And for those of you who don't know, the Blue Dogs are the extreme moderates of the democrat party.

So, as we continue on, 85 percent, we see a little bit more breaking down, 90 percent, we start seeing a whole lot of segmentation here. 95 percent, we still have a democrat clique, but then we have Isakson and Chambliss. Those are the only republicans left. And those are from the same state, Georgia.

So we can see some interesting patterns here just by looking at the maximal clique. And in this case I was just using a greedy algorithm, taking all of the cliques in the network, using the Tomita algorithm, I believe, and then just picking the largest one as we go.

There is a more expensive approach you could use to find which ones would be most effective to show, but in this case it actually worked rather well.
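A sketch of the greedy idea, not the speaker's actual code: enumerate maximal cliques with Bron-Kerbosch using the pivot rule of Tomita et al., take the largest, remove its nodes, and repeat. Re-enumerating after each removal and the `min_size` cutoff are assumptions made here for simplicity.

```python
def build_adj(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting; returns a list of node sets."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        # Pivot on the node with the most neighbors among the candidates.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def greedy_cliques(adj, min_size=3):
    """Repeatedly take the largest maximal clique and remove its nodes."""
    adj = {u: set(vs) for u, vs in adj.items()}
    chosen = []
    while adj:
        best = max(maximal_cliques(adj), key=len)
        if len(best) < min_size:
            break
        chosen.append(best)
        adj = {u: vs - best for u, vs in adj.items() if u not in best}
    return chosen


# Made-up graph: a 4-clique and a triangle joined by the edge d-e.
g = build_adj([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
               ("c", "d"), ("d", "e"), ("e", "f"), ("e", "g"), ("f", "g")])
picked = greedy_cliques(g)
```

Each chosen clique would become one glyph, sized by its member count.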

Now, one thing you might want to think about when you're figuring out what motifs you want to show and how to combine them is how you can overlap these glyphs together. Right? How can you design your motifs like the fans so that they can hang off the side of a clique or your parallel motifs, your connector motifs so they can hang off the side of a clique as well. There's some design issues going into how do you show these things in a small amount of space and in interesting combinations.

And of course it's not very useful unless you have interactions so you can see what's actually inside them. You have tool tips that can show exactly how many nodes are inside it if you can't understand the scale, information about where these -- all these leaf nodes are anchored, and then the context menu that lets you move back to the original visualization that's -- well, it's lossless. You start with a big simplified view, you get an overview, but you then can get to the details on demand.

Let's look at another network here. Here we have a big Web crawl, something like 4,000 nodes. Anytime you do these egocentrically collected datasets for Web crawls, or what we'd usually do to get social science datasets, you have those big fans along the periphery.

Here we have 800 nodes in any one of these fans. And you might say that this is a reasonable drawing. I can see most of what's going on here. But there's actually some hidden features based on the layout that's used.

If we color by the fans in the network, we can see a lot of the ones we saw before. But then down there in the bottom right we see a lot of overlap between the two fans. And there's actually a bunch of black nodes in the middle that aren't part of either.

You know, this is a layout heuristic at work. It isn't showing us every individual feature. But if we simplify these things away and including all those little fans there in the middle that really shouldn't be there, they're not really central core parts of the network, if we simplify those things away into the fan glyphs, we're using much smaller ones in this case, we now have dropped the amount of screen space required by two-thirds.

We still have the glyph so we can look at them. If we look at them closely, we can see the arcs and see exactly how many leaf nodes there are in these individual places. But we don't really care about those leaf nodes. We just want to know that they're there and how many of them there are. They're not the core interesting things to us. The rest of this stuff is, including that giant connector motif down there in the bottom right that was completely obscured by those fans.

So if we do the connector motifs, we color them, we see all their original locations in the network, we simplify them away, and now we have a drawing that has a fraction of the original number of nodes. It's a lot more dense, but you can lay it out again in a much larger screen space. And if we have a color scale, in this case it's Eigenvector centrality, that color gets mapped onto the nodes -- sorry, the glyphs there.

Just some information. Those two networks that I was showing, the Lostpedia wiki and the VOSON network, we drop the number of nodes by an order of magnitude, and it's pretty similar with the number of edges. And when you start looking at metrics for graph drawing aesthetics or readability metrics, you'll actually see that because we're getting rid of a lot of these edge crossings, we're getting rid of a lot of this node overlap caused by our layout heuristics. In a very limited amount of screen space, we're going to be ending up with a much more readable drawing.

So we showed this to some users. They said I'm overwhelmed. It's like one of those vision tests at the eye doctor when they're looking at the original network. But then when we put the simplified version, they said, okay, now I can see the central pages, there's few enough nodes in the network that I can do pairwise comparisons to look at the things. When there's 4,000 things I can't do that, when there's 500 it starts to become feasible.

And I just finished running statistics this morning on 38 users in this study, and it turns out that for a lot of interesting tasks, like finding labels and finding maximal cliques and doing things like that, this approach really works. For other stuff like tracking edges there's some work to be done still. But there's some interesting results there.

So motif simplification, it's pretty good for reducing complexity and understanding the large relationships in your network. However, like I said, it might not be so great for the edges. The frequent motifs you're interested in might not be included in the corpus of things I like to do.

So if you're a biologist, you're interested in feed-forward loops or something like that. What I focused on is the really high payoff things that take up a whole lot of screen space but don't really give you much in return.

And of course glyph design has tradeoffs. If you want really tiny glyphs that take up a very small amount of the screen space, you can't show distributions in them. You can't show much of the information about the underlying nodes. So how do you design that to match that tradeoff? And there's all sorts of details and algorithms in our tech report if you're interested.

And with that I'd like to turn it back over to Ben to finish this off.

>> Ben Shneiderman: Maybe I should just pause a minute and maybe go back to originally my -- I'll just review all that stuff. I want to go to my list. I should have had that -- oh, too far back. Where's my list? There.

So this is the first time we publicly presented the motif simplification. I just want to pause a little and ask for your comments. We know that it's a little extra cognitive complexity because you have to learn what those motifs are, you have to train your eye and your mind to look for them, but the dramatic simplification seems to be a winning strategy.

And, as Cody said, he's just finishing. He has two more hours before the deadline for the CHI conference to finish and submit the paper about it, but the results were very promising. There were 31 different tasks, and not for every one of them was the motif simplification a benefit, but for many it was and we're trying to understand better when it works and doesn't work. So any comments or challenges? Yeah.

>>: So both of those motifs you listed are examples of the sorts of things we found by a technique called modular decomposition of a graph where a module is a set of vertices that all have the same connections to the rest of the graph? And so I wondered if you had thought about using modular decomposition more generally as a way to visualization.

>> Cody Dunne: I hadn't talked -- thought about that in specific. I'd like to talk with you about it after. I had used an approach called Graph Summarization by Saket Navlakha that was designed mainly for biologic networks, finding functional equivalent things. But the problem was, like I said, it's these heuristics that combine things without you really knowing what the topology was.

>> Ben Shneiderman: So we chose only three to start with. Fans, connectors, and cliques. Which are -- and I guess where's Natalie Reese [phonetic]? Natalie did good work about doing cliques and near cliques. So that was another inspiration for us as well.

And there are lots of -- the algorithms exist to find all these things. We thought the important -- the contribution here is to turn them into glyphs that made sense, and there was more design effort than we thought in choosing the edges and the colors and those things.

I see two more hands. Ulrich, your chance.

>>: Two generalizations you might want to think about is, first of all, the degree 1 nodes, you can extend those into complete trees and then draw them as [inaudible].

>> Ben Shneiderman: Degree 1 nodes -- say again?

>>: If you look at the one shell, so you eliminate all the degree 1 nodes, continue doing this until you stop.

>> Ben Shneiderman: I see. I see.

>>: [inaudible]

>> Ben Shneiderman: Recur on the idea.

>>: [inaudible] that you represented as --

>> Ben Shneiderman: So the fan of fans essentially. Okay.
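The suggestion in this exchange, stripping degree-1 nodes repeatedly until none remain, can be sketched as below. The function name is hypothetical, and treating isolated nodes as removable too is an assumption of this sketch. The round in which a node is removed tells you how deep it sits in a "fan of fans".

```python
def peel_degree_one(edges):
    """Repeatedly remove nodes of degree 0 or 1.

    Returns (core_nodes, removal_round) where removal_round[node] is the
    pass in which the node was stripped (1 = outermost leaves).
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    removal_round = {}
    rnd = 0
    while True:
        leaves = [u for u, vs in adj.items() if len(vs) <= 1]
        if not leaves:
            break
        rnd += 1
        for u in leaves:
            removal_round[u] = rnd
            for v in adj.pop(u):
                adj[v].discard(u)
    return set(adj), removal_round


# Made-up graph: triangle x-y-z with a small tree (p, q, r) hanging off z.
core, rounds = peel_degree_one([("x", "y"), ("y", "z"), ("x", "z"),
                                ("z", "p"), ("p", "q"), ("p", "r")])
```

Whatever is removed forms trees attached to the remaining core, which is what would let them be drawn as nested fan glyphs.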

>>: And the second thing is that the connectors are not only a special case of modules and modular decomposition, but the first [inaudible] here are all the same as well. They're [inaudible] coordinates [inaudible] same neighbors but are not connected to each other or are all connected to each other.

>> Ben Shneiderman: Right. Well, we chose the very simple case of the two, three, and four connectors. That's what we were -- that's what we implement. This is working. It's in NodeXL. It's shippable. You can try it. Well, close to. Is Tony Capone [phonetic] here?

>>: Linear time.

>> Ben Shneiderman: Tony is our programmer.

>> Cody Dunne: What's that on linear time?

>>: The structural equivalence classes you can determine in linear time, so it might be interesting because it extends to higher degrees as well.

>> Ben Shneiderman: Well, we see many general --

>> Cody Dunne: So for the connector motifs, I'm finding them in time proportional to the number of edges times the average -- sorry, the number of nodes times the average node degree.
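The audience member's point about equivalence classes can be illustrated with a hashing sketch: bucket nodes by their neighbor set, once without the node itself (same neighbors, not connected to each other) and once with it (same neighbors and all connected to each other). Hashing each neighbor set costs time proportional to the node's degree, so the total is roughly proportional to nodes plus edges in expectation. This is an illustrative assumption, not the questioner's algorithm or what NodeXL does.

```python
from collections import defaultdict


def equivalence_classes(adj):
    """Return (non_adjacent_classes, adjacent_classes) as sorted lists."""
    open_cls = defaultdict(list)    # keyed by N(u)
    closed_cls = defaultdict(list)  # keyed by N(u) plus u itself
    for u, nbrs in adj.items():
        open_cls[frozenset(nbrs)].append(u)
        closed_cls[frozenset(nbrs | {u})].append(u)
    non_adjacent = [sorted(c) for c in open_cls.values() if len(c) > 1]
    adjacent = [sorted(c) for c in closed_cls.values() if len(c) > 1]
    return non_adjacent, adjacent


# Made-up graph: a and b both link only x and y; c, d, e form a triangle
# and are all attached to x.
adj = {
    "a": {"x", "y"}, "b": {"x", "y"},
    "c": {"d", "e", "x"}, "d": {"c", "e", "x"}, "e": {"c", "d", "x"},
    "x": {"a", "b", "c", "d", "e"}, "y": {"a", "b"},
}
non_adjacent, adjacent = equivalence_classes(adj)
```

The first kind of class matches the connector motif for any number of anchors; the second kind is a clique whose members all have the same outside connections.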

>> Ben Shneiderman: We think they can be made faster to identify, and then the control panel about how you do the replace. And what you allow for color and shape and size, also the cliques -- well, okay. There's still lots to be done in many variations.

Cody mentioned that especially rich motif work is done in biology, where they're looking for particular motifs in the biological pathways.

And I saw one more hand. I give Peter a chance.

>>: [inaudible] I think the clique I think [inaudible] when we did that was that so the [inaudible] we're looking at is [inaudible] but an ambiguity of it [inaudible] every graph that we looked at had many different clique counts and so you need an authorization [inaudible] choose what is not unique. Lots of different ways of [inaudible].

>> Cody Dunne: So I think the best approach would be to solve the set packing problem and choose the set of cliques that combine the most nodes or satisfies whatever property you're trying to satisfy.

The one I'm doing right now is just a greedy approach. In the examples, anyway, where there are a lot of overlapping cliques, it seems to perform pretty well.
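To make the contrast between set packing and the greedy choice concrete, here is a small sketch on made-up cliques. The exhaustive search is only feasible for a handful of candidate sets and is an illustration of the objective (cover the most nodes with disjoint cliques), not a practical solver; both function names are hypothetical.

```python
from itertools import combinations


def greedy_packing(cliques):
    """Take cliques largest first, skipping any that overlap earlier picks."""
    chosen, used = [], set()
    for c in sorted(cliques, key=len, reverse=True):
        if used.isdisjoint(c):
            chosen.append(c)
            used |= c
    return chosen


def best_packing(cliques):
    """Try every subset of cliques; keep the disjoint one covering most nodes."""
    best, best_cover = [], 0
    for k in range(1, len(cliques) + 1):
        for combo in combinations(cliques, k):
            total = sum(len(c) for c in combo)
            covered = len(set().union(*combo))
            if covered == total and covered > best_cover:  # pairwise disjoint
                best, best_cover = list(combo), covered
    return best


# Made-up overlapping cliques: the big one blocks both smaller ones.
cliques = [frozenset("cdef"), frozenset("abc"), frozenset("fgh")]
greedy = greedy_packing(cliques)
optimal = best_packing(cliques)
```

On this contrived input the greedy pick covers four nodes and the exhaustive search covers six, which is the kind of gap a set-packing formulation would close.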

>> Ben Shneiderman: And let's take two or three more. I love this. Yes. That's what we came for.

>>: I'm sorry, I'm having trouble telling the difference between a clique glyph and a connector glyph. They look the same to me. Are they different?

>> Ben Shneiderman: Yes. Clique is a fully connected subgraph.

[multiple people speaking at once]

>> Ben Shneiderman: The glyph, oh, they're rotated. They're 90 degrees rotated. We had long debates about this. I in my last e-mail to Cody said I thought that's a problem too. They're rotated by 90 degrees. They're lovely --

>>: Then we have edges coming in all different directions.

>> Ben Shneiderman: Correct. But that's another problem we were not able to solve in the cute way. But when you have three and four connectors like you do over here, you can't make them come in one corner and out the other. It doesn't work out.

We had other -- we had a bridge-shaped clique glyph. We tried many things. You may find ways to improve this. We look for your suggestions. This is still a fresh idea that's just getting tuned up. It will be part of Cody's dissertation, so don't steal it, but please help him.

But I think -- there's many things to be done to extend these ideas, and we think there are many ways people go -- and, Roberto, you get the last word.

>>: I was just wondering why the [inaudible] have one vertical edge and then the clockwise because --

>> Ben Shneiderman: Yes. That was my obsession.

>>: [inaudible]

>> Ben Shneiderman: That was my obsession. I won't fight that battle. I wanted them always to start straight up and then they would arc out over. And that way you could tell more easily the difference by how much angle was subtended and also visually if in a cluttered graph you would be able to spot them easier, but you're looking essentially -- your preattentive processing is looking for that one vertical edge.

Notice also in the current design the length of these fans is the same. That was another feature to keep this simple. We could have encoded some other variable, and yet we tried to make the glyphs look different from nodes, so we didn't use -- we have many different glyphs that we use circles, triangles and so on, but we tried to make these look different. So the difference should be there. There always should be a pointy part and a curvy part.

>> Cody Dunne: There's another cool thing you can do with the fan orientation. So just like this, they can overlap on each other and make a little pie chart and show us the proportions, but if you care about directed edges, how many of those edges to those fans are going out or coming in or bidirectional, then you can actually segment it in terms of the difference from the vertical. That way you see exactly which ones are going which direction without having to draw any additional little icons.

>> Ben Shneiderman: Right. So the further version for directed graphs splits this in three sections and they're pointed straight up so you go left and right and then center.

So there's a bunch of design problems that remain here, and especially Cody showed you this little thing, that they actually are combinable. And we think we got it right, but we have no proof that these will combine in a way that does not cause conflict, and we also still have to prove the theorem that says the order in which you create these glyphs will not affect the outcome.

But we think we can argue that case. Okay? I mean, there's some suggestion if you did one of these first then the other wouldn't work out, but, no, we think we defined these motifs in a way that they're independent of the sequence in which you apply them. Got it? Okay.

So we can talk more about it, but our time is running out, and now I'm going to have to go forward.

I just have a couple of closing slides, so we'll flash through here as fast as this will let me go, and you can review the whole presentation. And there we go.

So just to say more about NodeXL, we do have the NodeXL Graph Gallery. This is a public open source place like Many Eyes where you can upload your graph datasets and your visualizations. There's thousands of them out there by at least hundreds of people, many different ways. Some of them are good, not all of them are beautiful.

And we also -- it's in NodeXL which runs inside Excel. When you say export, it will export to graph gallery and give you the option of exporting the dataset as well.

So this page, if you're looking for network datasets, there are thousands of network datasets out there, and you can go and grab them if people have uploaded them.

And usually there are descriptions of them in great detail. Marc Smith, who's our strong collaborator, has many politically oriented ones, and so you can take a look at whatever he's up to today.

That's the book. The first three chapters describe network analysis and social media, then there's four chapters that walk you through the use of the tool, and then we have eight application chapters that show you analyses for e-mail, threaded networks, Twitter, Facebook, World Wide Web, Flickr, YouTube, and wiki networks.

So it's sort of a starting place. We wrote this as a textbook and to guide newcomers from many different disciplines to be able to create their own networks.

NodeXL was supported on -- happy to say this on Microsoft territory -- by Microsoft External Research for more than four years, and then they said it's time for you to go off and find the rest of your support. And so we are now owned by what we created, the Social Media Research Foundation for free, open source data, open data, open tools, open scholarship.

And we struggle. So if you can find someone to help sponsor us and support it, we'd appreciate that. But the Social Media Research Foundation is the home base for NodeXL. We see it as like the R statistics package. We'd like to keep it going as a community-based open source and free tool for people to use.

So if you can help us out, please do. Come visit SMR Foundation or the NodeXL site itself.

And I'd just close by thanking you from HCIL and our 25, 30 now years of happy use, of happy community, and the NodeXL Web site is down here, nodexl..

And I thank you for giving me the opportunity and look forward to discussions. Thank you very much.

[applause]

>> Ben Shneiderman: Thank you. Okay.
