Conducting Interview Studies: Challenges, Lessons Learned, and Open Questions

Jim Witschey, Emerson Murphy-Hill, Shundan Xiao
Department of Computer Science, North Carolina State University, Raleigh, NC, USA
jwshephe@ncsu.edu, emerson@csc.ncsu.edu, sxiao@ncsu.edu

Abstract--Our recent work uses sociological theories and interview techniques to discover why so few developers use tools that help them write secure code. In this experience report, we describe nine challenges we encountered in planning and conducting an interview study with industrial practitioners, from choosing a population of interest to presenting the work in a way that resonates with the research community. In doing so, we aim to spur discussion in the software engineering research community about common challenges in empirical research and ways to address them.

I. INTRODUCTION

In our recent work, we have been studying software developers' adoption of tools that help them write secure software. Software is secure when it functions correctly even when under attack by malicious parties [1]. Although the current best practice for creating secure software is to consider security through the entire software development lifecycle [2], many developers do not know about or use secure development practices. The use of secure development tools (subsequently called "security tools" for short), which help developers write more secure code, is one such underused secure development practice.

The primary thrust of our ongoing research is the application of a technology adoption theory called diffusion of innovations (DoI) [3] to understand developers' adoption of security tools. DoI theory is a long- and widely-used theory that has been applied to technologies in many domains including, to a limited extent, software engineering [4], [5]. We have conducted semi-structured interviews with 43 professional developers to determine how DoI might explain developers' decisions to adopt or reject security tools. Our work is novel in using this formal sociological approach to study adoption of security tools.

However, we have faced a number of challenges in defining and conducting this research. We do not consider these challenges minor or attributable to inexperience -- indeed, over the years we have conducted many studies involving software developers, including usability studies [6], heuristic evaluations [7], panels, controlled experiments [8], surveys [9], interviews [10], and paper mockups [11]. We faced these challenges in spite of following general [12] and methodology-specific [13] guidelines for conducting empirical research. In fact, following these guidelines was itself challenging in ways we did not expect. Guidelines are undoubtedly helpful, and we turn to them to conduct our research in a principled way. However, principles are not instruction manuals, and guidelines are difficult to implement.

In this experience report, we present a frank portrayal of nine challenges we faced. We hope to start discussion about difficulties we face in applying guidelines and provide concrete examples that other researchers can use alongside these guidelines, as has proved useful in other disciplines [14]. We also would like to discuss difficulties that fall outside existing guidelines. We believe that such a discussion will help us, as a community, create more concrete and comprehensive guidelines for conducting high-quality empirical research.

II. CHALLENGES AND OPEN QUESTIONS

Over the course of developing and conducting our interview study, we have encountered a number of challenges, nine of which we detail in the following subsections. In each subsection, we first explain what we did and describe the challenges we faced. We end each subsection with a few questions to prompt discussion in the community, in the hopes that we can share knowledge, begin to define best practices, and voice opinions left, as yet, unsaid.

A. Choosing a Population: Who Do We Want to Talk To, Anyway?

Before we developed our interview script and fully defined our research questions, we chose the population of developers we wanted to study. In doing so, we ostensibly followed Kitchenham's guidelines [12]. Naively, we might have defined the population to include all developers, but we recognized that security tools might not be relevant for all developers, so we chose all developers who work with security-sensitive software as our population.

Our population defined, we set out to interview a representative sample of these developers. We did so by recruiting developers in different-sized companies and with different programming roles within those companies. This made us confident that our results would generalize across many subpopulations of developers.

However, by only sampling developers in professional development environments, we made a fundamental mistake: we ignored open-source developers who write security-sensitive software. We realized this only after the interviews were complete. We then faced a choice: either limit the scope of our entire project to security tool adoption among professional developers, or conduct more interviews, this time with open-source developers. We chose the latter, and will conduct more interviews, but we would not have been able to do so if our project were less flexible. We believe that the community could define some process by which to consider important subpopulations of developers in study design.

What processes can we use to define our population? What processes can we use to define subpopulations to sample that will represent this population?

B. Recruiting Participants: Plenty of Fish in the Sea (but Where to Cast Our Nets?)

To recruit developers from our population, we used three different sampling methods. We recruited most of our participants using snowball sampling, in which we recruited participants through personal contacts, who forwarded our requests for participants to their contacts. We recruited some participants by posting flyers in break rooms at a major software development corporation. We also emailed 25 companies that had posted profiles on Guru [15], a website that connects freelance developers to those who want to hire them. Unfortunately, these companies were not interested in connecting us with their employees.

However, there are a number of other ways we might have tried to recruit developers. For example, we could have emailed open source software (OSS) development mailing lists or recruited users of websites frequented by developers, such as technical or programming blogs and social media sites like Quora [16] and Stack Overflow [17].

All of these methods have their strengths and limitations. Perhaps other recruitment methods would recruit more representative samples. Other research communities have not solved this problem either; for example, researchers studying healthcare professionals are still working to define best practices for determining sample sizes [18].

How do we recruit a representative sample of software developers? How can we predict what recruiting methods will be most effective? How can we objectively evaluate our and others' choices in recruiting methods?

C. Behaviors vs. Generalizations: Well, That's Just, Like, Your Opinion, Man

We asked participants in our practice interviews how they used security tools and about their attitudes towards security tools. However, we found it difficult to differentiate between facts about developers' behavior and developers' opinions about how developers should behave. For instance, we asked participants to rank different attributes of security tools by how important they were to adoption decisions. Most developers said, for example, that they were more likely to adopt a security tool that was not too complex than one that was highly configurable. However, it is unclear whether their actual behavior when evaluating a new tool would reflect that stated priority.

To separate fact from opinion during our interviews, we asked participants for details about their behavior to ground their responses in their experiences. For instance, when a participant mentioned using a particular tool, we asked her when she first used it. If a participant reported trying a tool but not adopting it for long-term use, we asked an open-ended question about why she stopped using it. We also plan on conducting surveys in the future to triangulate our results, to further distinguish between fact and opinion, as has been done in other studies [19].

How can we design studies that will allow us to differentiate between fact and opinion? How do we design such studies on principle, rather than on intuition?

D. Interview Refinement: Adapting Semi-Structured Interviews

Since our interviews were semi-structured, the interviewer asked questions that were not on the interview script in order to further explore potentially interesting things that participants said. If such a spontaneously added question yielded useful information, we added it to the script for subsequent interviews. If answers to a particular question became consistently uninformative, we removed it from the script. We stopped interviewing new participants once the interview as a whole stopped yielding new types of information -- that is, once it had reached saturation. Ending studies upon reaching saturation is standard practice in other fields as well [20].

However, we would have benefited from a more concrete set of principles by which to modify the script. We used an ad hoc approach, modifying the interview script as we saw fit. While we knew we needed to remove questions that did not prove useful in order to make room for new, more interesting questions, it was difficult to decide that a particular question would yield no interesting responses from any future participants. Similarly, we had no formal definition with which to identify saturation, so we identified when we had reached saturation intuitively.
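We did not apply any formal rule, but as an illustration of how a saturation criterion might be operationalized in the spirit of Francis et al. [20], the following Python sketch declares saturation once a chosen number of consecutive interviews contribute no codes that have not already appeared. The interview codes and the three-interview window are hypothetical and chosen only for illustration.

# Illustrative sketch (not our actual procedure): treat the study as
# saturated once `window` consecutive interviews introduce no codes
# that have not been seen in earlier interviews.

def saturation_point(coded_interviews, window=3):
    """Return the index of the first interview in a run of `window`
    consecutive interviews that added no new codes, or None."""
    seen = set()
    streak = 0
    for i, codes in enumerate(coded_interviews):
        new_codes = set(codes) - seen
        seen.update(new_codes)
        streak = 0 if new_codes else streak + 1
        if streak >= window:
            return i - window + 1
    return None

# Hypothetical codes assigned to six interviews:
interviews = [
    {"tool cost", "team policy"},
    {"tool cost", "false positives"},
    {"peer recommendation"},
    {"tool cost"},          # nothing new
    {"team policy"},        # nothing new
    {"false positives"},    # nothing new
]
print(saturation_point(interviews))  # -> 3 (the fourth interview)

A rule like this makes the stopping decision explicit and auditable, though choosing the window size is itself a judgment call.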

We may have been able to avoid some of these difficulties if the interviewer and senior researcher had had more in-depth, explicit discussions about the evolution of the script and the principles behind that evolution. We also would have benefited from more in-depth, explicit discussions in the literature of other researchers' decisions about how best to modify and use semi-structured interviews.

How do we identify and validate when studies reach saturation? How should we design our study protocol so that it is easy to use and to change?

E. Interview Preparation: An Ounce of Practice

Conducting practice interviews is an important part of the interview design process: it helps refine the interview script so that it flows well and is not based on flawed assumptions, and it helps prepare the interviewer for problems that can arise when conducting real interviews. As such, we conducted a number of practice interviews with graduate students with significant industry experience. Though most of these students were not developing software in industry at the time, their industry experience exposed a number of inadequacies in our interview script and in the underlying assumptions we made. For instance, these practice interviews revealed problems in our initial definition of security tools, which we corrected before continuing further with the research. We think that these practice interviews prepared us and our script well for the actual interviews. However, we made two important choices in arbitrary ways:

First, we had to choose subjects with whom to practice. While there were many graduate students available to be interviewed locally, we chose to interview only those with industry experience because we believed they would best help us improve our script. We could have interviewed current industry developers, but that would have significantly increased our recruiting costs. While this reasoning seems sound to us, we do not know how to validate this choice or how others would have made it, especially if students with industry experience had not been available.

Second, we had to decide when to stop practicing and start collecting data. We did not have a principled way to determine when the script was ready -- when we felt that practicing would not expose more shortcomings, we stopped.

While any guide to performing interviews (such as [21]) will say that practicing is an essential part of developing interviews, using these practice interviews effectively seems to depend largely on experience and intuition, and we have no criteria with which to evaluate our choices.

How can we best choose subjects with whom to practice interviews? How do we know when an interview script is ready?

F. Recognizing What's Interesting: Working the Crowd

When analyzing the interview data, the graduate research assistant performing the analysis was also on the lookout for interesting responses and interesting patterns in the data. The software engineering research community values novelty, so she used her study of the research literature to determine which of her findings were novel.

However, determining what is interesting requires an understanding of the current research community [22], an understanding that generally comes only from the experience afforded by being a member of it for more than a few years. In order to find potentially interesting patterns in the data from our interviews, the research advisor would ask the graduate research assistant questions about her findings. He based these questions on his intuitive understanding of the interests of the current research community. Naturally, this is an essential part of the advisor-student relationship, but it increases the cost and difficulty of discerning interesting findings in the data, since answering these questions required the graduate research assistant to iteratively consult with the advisor and the data.

Ideally, the researcher examining data for interesting trends would deeply understand the community's interests and thus be equipped to find interesting results in the data. However, this deep understanding comes from experience working and organizing within the community, such as the experience gained by taking part in program committee discussions. The graduate research personnel working on this project, naturally, have not spent as much time in the research community as the more senior research advisor. However, it was not senior research personnel, but a graduate research assistant, who analyzed most of the data to find interesting patterns. We find it notable that those who do the data analysis are those who understand the community least.

How can we sort through large amounts of data and find what will be interesting to the community? How can we better pass knowledge of the community's values and priorities to young researchers?

G. Data Overload: $5000 Buys a Lot of Ramen

As previously mentioned, we conducted hour-long interviews with 43 developers -- and as a result, we had to transcribe and code all of that data. We could have paid for professional transcription services, but this would have cost approximately $5000, and our experience indicates that the resulting transcripts would still have to be validated by hand by our research staff. Given the nominal cost of a graduate research assistant's time, we decided to transcribe the interviews ourselves.
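For concreteness, the trade-off can be written as a back-of-the-envelope calculation. In the Python sketch below, only the roughly 40 hours of audio and the $5000 quote come from our study; the staff rate and the hours-per-audio-hour factors are assumptions chosen purely for illustration.

# Rough cost comparison; the rates and speed factors below are assumed,
# not measured.
audio_hours = 40                # approximate total interview audio
professional_quote = 5000       # USD, quote for professional transcription
validation_factor = 2           # assumed staff hours per audio hour to validate
transcription_factor = 4        # assumed staff hours per audio hour to transcribe
staff_hourly_cost = 15          # USD, assumed cost of research assistant time

professional_total = professional_quote + audio_hours * validation_factor * staff_hourly_cost
in_house_total = audio_hours * transcription_factor * staff_hourly_cost

print(f"professional service plus validation: ${professional_total}")  # $6200
print(f"in-house transcription: ${in_house_total}")                    # $2400

Under these assumed figures, in-house transcription looks cheaper in cash terms, but the comparison hides the less tangible cost of staff time and attention, which is part of what makes the decision difficult.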

Coding these transcripts required many passes over the data (which was over 40 hours of audio), since some pertinent codes only emerged in later interviews. We used Atlas.ti [23], a qualitative data analysis program, to code and analyze our transcriptions. This software was helpful, but inspecting the entirety of our large dataset so many times was still exhausting. It seems unavoidable that iterating over large amounts of data by hand will be time-consuming, difficult, and sometimes unpleasant.
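As a small illustration of why late-emerging codes force repeated passes, the hypothetical sketch below records the interview at which each code first appeared and lists the earlier transcripts that would need to be revisited. This bookkeeping is not an Atlas.ti feature, and the code names and numbers are invented.

# Hypothetical bookkeeping for emergent coding (not an Atlas.ti feature):
# a code that first appears late in the study implies re-reading every
# transcript coded before the code existed.
codebook = {
    "tool cost": 1,             # code -> interview where it first emerged
    "false positives": 2,
    "peer recommendation": 15,
}

def transcripts_needing_recoding(codebook):
    """Map each late-emerging code to the transcripts that predate it."""
    return {code: list(range(1, first))
            for code, first in codebook.items() if first > 1}

for code, pending in transcripts_needing_recoding(codebook).items():
    print(f"{code!r}: revisit transcripts {pending[0]}-{pending[-1]}")
# 'false positives': revisit transcripts 1-1
# 'peer recommendation': revisit transcripts 1-14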

How do we best process and analyze large amounts of qualitative data? How do we balance financial costs with less tangible ones when making that decision?

H. Onboarding and Offboarding: Someday, This All Will Be Yours

Recently, responsibility for this project passed from one graduate student to another. We took steps to ensure that the departing student's work on the project was finished: she had conducted and transcribed the interviews, coded and analyzed the transcriptions, and written most of a forthcoming journal paper on the work. It seemed that she had written down everything the new student needed to know.

In spite of these efforts to ensure that the first student's work was done, the new student has had to work closely with the first to transfer the knowledge necessary for him to finish the paper and continue the research. This has been time-consuming for both students. It takes time away not only from designing the next phases of the research, but also from an adult who would like to move on with her life.

We believe that many researchers face these challenges as well: in academic research projects, graduate students often know the particularities of a given project more intimately than anyone else. In qualitative research, these particularities can be fundamentally important to the research. When these students leave, their in-depth knowledge of their work leaves with them. We know we are not the first to encounter problems in transferring knowledge between researchers: other studies have documented the challenges of continuing a thread of research with different research staff [24]. However, we have not yet seen any explicit discussion of best practices for easing transitions between researchers in qualitative empirical research.

How can we best hand responsibility for a research project between research personnel? How do we define when a student's work on a project is finished?

I. Presenting the Work: Tough Crowd

The software engineering research community often accepts work that presents the need for a tool, an implementation of that tool, and an experiment demonstrating that the tool solves some problem. More recently, some papers explore and evaluate practices in an area without proposing any particular tool [25], [26]. Our work follows neither of these common patterns, and perhaps as a result, we have had difficulties in presenting the work we've done so far: we had a poster and a student research competition abstract rejected. We feel that reviewers were uncomfortable viewing the tool adoption problem at such a high level and with the use of sociological methodologies. We believe we did not present the work in a way that shows how it will produce actionable results.

To address this challenge, we have tried to get feedback and advice from researchers in and out of the software engineering research community. We have discussed our preliminary findings with our peers in workshops. While the submissions mentioned above were rejected, the reviewers' responses helped us evaluate our presentation of the work. We believe it would be useful to discuss the ways and venues in which researchers present work that uses unconventional methodologies and how this work comes to be accepted by the community.

How can we present work that uses unorthodox methodologies in ways that will interest the research community? Where can we present this work?

III. CONCLUSION

This paper presents a frank portrayal of real challenges we have encountered in doing research with industrial practitioners. In presenting them, we hope to receive feedback to improve our own work, but also to discuss these challenges with others. We hope that such discussion will lead to more concrete guidelines and improve the community's research practices.

ACKNOWLEDGMENT

Thanks to Michael Bazik, Xi Ge, Brittany Johnson, Gustavo Soares, Nuo Shi, and Yoonki Song for providing comments on drafts of this paper. Thanks to the National Security Agency for funding the research described in this paper.

REFERENCES

[1] G. McGraw, "Software security," IEEE Security & Privacy, vol. 2, no. 2, pp. 80–83, Mar.–Apr. 2004.
[2] G. McGraw, "Building secure software: better than protecting bad software," IEEE Software, vol. 19, no. 6, pp. 57–58, Nov./Dec. 2002.
[3] E. M. Rogers, Diffusion of Innovations, 4th ed. Free Press, 1995.
[4] L. A. Meyerovich and A. S. Rabkin, "Socio-PLT: principles for programming language adoption," in Onward!, October 2012, pp. 39–54.
[5] S. Raghavan and D. Chand, "Diffusing software-engineering methods," IEEE Software, vol. 6, no. 4, pp. 81–90, Jul. 1989.
[6] E. Murphy-Hill and A. P. Black, "An interactive ambient visualization for code smells," in Proceedings of the 5th International Symposium on Software Visualization, ser. SOFTVIS '10. ACM, 2010, pp. 5–14.
[7] E. Murphy-Hill, T. Barik, and A. P. Black, "Interactive ambient visualizations for soft advice," Information Visualization, 2013, to appear.
[8] E. Murphy-Hill and A. P. Black, "Breaking the barriers to successful refactoring: Observations and tools for extract method," in ICSE '08: Proceedings of the 30th International Conference on Software Engineering, 2008, pp. 421–430.
[9] E. Murphy-Hill, T. Zimmermann, C. Bird, and N. Nagappan, "The design of bug fixes," in Proceedings of the International Conference on Software Engineering, 2013, to appear.
[10] E. Murphy-Hill and G. C. Murphy, "Peer interaction effectively, yet infrequently, enables programmers to discover new tools," in Proceedings of the ACM 2011 Conference on Computer Supported Cooperative Work, 2011, pp. 405–414.
[11] E. Murphy-Hill and A. P. Black, "Programmer-friendly refactoring errors," IEEE Transactions on Software Engineering, vol. 99, 2011.
[12] B. Kitchenham, S. L. Pfleeger, L. Pickard, P. Jones, D. C. Hoaglin, K. E. Emam, and J. Rosenberg, "Preliminary guidelines for empirical research in software engineering," IEEE Transactions on Software Engineering, vol. 28, no. 8, pp. 721–734, 2002.
[13] T. Punter, M. Ciolkowski, B. Freimut, and I. John, "Conducting online surveys in software engineering," in Proceedings of the 2003 International Symposium on Empirical Software Engineering, 2003, pp. 80–88.
[14] L. Tetzlaff and D. R. Schwartz, "The use of guidelines in interface design," in CHI '91: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. New York: ACM, 1991, pp. 329–333.
[15] "Guru," February 2013.
[16] "Quora," February 2013.
[17] "Stack Overflow," February 2013.
[18] A. Chang, P. K. McDonald, and P. M. Burton, "Methodological choices in work-life balance research 1987 to 2006: a critical review," International Journal of Human Resource Management, vol. 21, no. 13, pp. 2381–2413, 2010.
[19] A. Bacchelli and C. Bird, "Expectations, outcomes, and challenges of modern code review," in Proceedings of the 35th International Conference on Software Engineering, 2013.
[20] J. J. Francis, M. Johnston, C. Robertson, L. Glidewell, V. Entwistle, M. P. Eccles, and J. M. Grimshaw, "What is an adequate sample size? Operationalising data saturation for theory-based interview studies," Psychology & Health, vol. 25, no. 10, pp. 1229–1245, 2010.
[21] N. M. Bradburn, S. Sudman, and B. Wansink, Asking Questions: The Definitive Guide to Questionnaire Design -- For Market Research, Political Polls, and Social and Health Questionnaires, rev. ed. Jossey-Bass, 2004.
[22] M. Davis, "That's interesting!" Philosophy of the Social Sciences, vol. 1, no. 2, p. 309, 1971.
[23] ATLAS.ti Scientific Software Development GmbH, "ATLAS.ti," February 2013.
[24] J. Lung, J. Aranda, S. M. Easterbrook, and G. V. Wilson, "On the difficulty of replicating human subjects studies in software engineering," in Proceedings of the 30th International Conference on Software Engineering, ser. ICSE '08. ACM, 2008, pp. 191–200.
[25] C. Bird and T. Zimmermann, "Assessing the value of branches with what-if analysis," in Proceedings of the 20th International Symposium on Foundations of Software Engineering. ACM, 2012.
[26] M. Cherubini, G. Venolia, R. DeLine, and A. J. Ko, "Let's go to the whiteboard: how and why software developers use drawings," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI '07. ACM, 2007, pp. 557–566.
