Structural and Evolutionary Analysis of Developer Networks

  • Large-scale software engineering projects are often distributed among a number of sites that are geographically separated by substantial distances. In globally distributed software projects, time zone issues, language and cultural barriers, and a lack of familiarity among members of different sites all introduce coordination complexity and present significant obstacles to achieving a coordinated effort. For large-scale software engineering projects to satisfy their scheduling and quality goals, many developers must be capable of completing work items in parallel. A key factor in achieving this goal is to remove interdependencies among work items insofar as possible. By applying principles of modularity, work item interdependence can be reduced, but not removed entirely.
As a result of uncertainty during the design and implementation phases and incomplete or misunderstood design intents, dependencies between work items inevitably arise and lead to requirements for developers to coordinate. The capacity of a project to satisfy coordination needs depends on how the work items are distributed among developers and how developers are organizationally arranged, among other factors. Anecdotal evidence and prior empirical studies indicate that, when coordination requirements are not recognized and appropriately managed, product quality and developer productivity decrease. In essence, properties of the socio-technical environment, composed of developers and the tasks they must complete, provide important insights concerning the project's capacity to meet product quality and scheduling goals. In this dissertation, we make contributions to support socio-technical analyses of software projects by developing approaches for abstracting and analyzing the technical and social activities of developers. More specifically, we propose a fine-grained, verifiable, and fully automated approach to obtain a proper view on developer coordination, based on commit information and source-code structure, mined from version-control systems. We apply methodology from network analysis and machine learning to identify developer communities automatically. To evaluate our approach, we analyze ten open-source projects with complex and active histories, written in various programming languages. By surveying 53 open-source developers from the ten projects, we validate the accuracy of the extracted developer network and the authenticity of the inferred community structure. Our results indicate that developers of open-source projects form statistically significant community structures and that this particular network view largely coincides with developers' perceptions.
Equipped with a valid network view on developer coordination, we extend our approach to analyze the evolutionary nature of developer coordination. By means of a longitudinal empirical study of 18 large open-source projects, we examine and discuss the evolutionary principles that govern the coordination of developers. We found that the implicit and self-organizing structure of developer coordination is ubiquitously described by non-random organizational principles that defy conventional software-engineering wisdom. In particular, we found that: (a) developers form scale-free networks, in which the majority of coordination requirements arise among an extremely small number of developers, (b) developers tend to accumulate coordination requirements with more and more developers over time, presumably limited by an upper bound, and (c) initially developers are hierarchically arranged, but over time, form a hybrid structure, in which highly central developers are hierarchically arranged and all other developers are not. Our results suggest that the organizational structure of large software projects is constrained to evolve towards a state that balances the costs and benefits of coordination, and the mechanisms used to achieve this state depend on the project's scale. As a final contribution, we use developer networks to establish a richer understanding of the different roles that developers play in a project. Developers of open-source projects are often classified according to core and peripheral roles. Typically, count-based operationalizations, which rely on simple counts of individual developer activities (e.g., number of commits), are used for this purpose, but there is concern regarding their validity and ability to elicit meaningful insights. To shed light on this issue, we investigate whether count-based operationalizations of developer roles produce consistent results, and we validate them with respect to developers' perceptions by surveying 166 developers. 
We improve over the state of the art by proposing a relational perspective on developer roles, using our fine-grained developer networks, and by examining developer roles in terms of developers' positions and stability within the developer network. In a study of 10 substantial open-source projects, we found that the primary difference between the count-based and our proposed network-based core-peripheral operationalizations is that the network-based ones agree more with developer perception than count-based ones. Furthermore, we demonstrate that a relational perspective can reveal further meaningful insights, such as that core developers exhibit high positional stability, upper positions in the hierarchy, and high levels of coordination with other core developers, which confirms assumptions of previous work. Overall, our research demonstrates that data stored in software repositories, paired with appropriate analysis approaches, can elicit valuable, practical, and valid insights concerning socio-technical aspects of software development.
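
The contrast between count-based and network-based role classification described in the abstract can be illustrated with a small sketch. It is not the thesis's implementation: the commit records are invented, developers are linked at file level (the abstract describes a finer-grained approach based on source-code structure), degree stands in for the network position measures, and the 20% cut-off and the names `build_network` and `core_set` are assumptions made for this example. The data are deliberately constructed so that the two views can disagree.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical commit records (developer, file touched); not data from the thesis.
COMMITS = [
    ("alice", "core.c"), ("alice", "net.c"), ("alice", "core.c"),
    ("bob", "core.c"), ("bob", "ui.c"),
    ("carol", "net.c"),
    ("dave", "ui.c"),
    ("erin", "docs.md"), ("erin", "docs.md"),
    ("erin", "docs.md"), ("erin", "docs.md"),
]


def build_network(commits):
    """Link two developers whenever they committed to the same file.

    This file-level rule is a simplifying assumption for illustration.
    """
    devs_by_file = defaultdict(set)
    for dev, path in commits:
        devs_by_file[path].add(dev)
    # Start with every developer as a node so isolated ones are kept.
    neighbours = {dev: set() for dev, _ in commits}
    for devs in devs_by_file.values():
        for a, b in combinations(sorted(devs), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def commit_counts(commits):
    """Count-based view: number of commits per developer."""
    counts = defaultdict(int)
    for dev, _ in commits:
        counts[dev] += 1
    return dict(counts)


def core_set(scores, fraction=0.2):
    """Return the top `fraction` of developers by score (at least one).

    The 20% threshold is an assumed cut-off, not one taken from the source.
    """
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    k = max(1, round(len(ranked) * fraction))
    return set(ranked[:k])


network = build_network(COMMITS)
degree = {dev: len(nbrs) for dev, nbrs in network.items()}

count_core = core_set(commit_counts(COMMITS))   # ranks by activity volume
degree_core = core_set(degree)                  # ranks by number of collaborators

print("count-based core: ", sorted(count_core))
print("degree-based core:", sorted(degree_core))
```

In this toy data set, the developer with the most commits works alone on a single file, so a count-based ranking and a degree-based ranking select different developers. Whether such disagreements occur in real projects, and which view matches developers' perceptions, is what the studies summarized above examine.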

Metadata
Author: Mitchell Joblin
URN: urn:nbn:de:bvb:739-opus4-4616
Advisor: Sven Apel, Wolfgang Mauerer
Document Type: Doctoral Thesis
Language: English
Year of Completion: 2017
Date of Publication (online): 2017/03/08
Date of first Publication: 2017/03/08
Publishing Institution: Universität Passau
Granting Institution: Universität Passau, Fakultät für Informatik und Mathematik
Date of final exam: 2017/02/17
Release Date: 2017/03/08
GND Keyword: Software Engineering
Page Number: 192 pp.
Institutes: Fakultät für Informatik und Mathematik
Dewey Decimal Classification: 0 Computer science, information and general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
open_access (DINI-Set): open_access
Licence (German): Standard terms according to the declaration of consent