Contributor and repository graph analysis #175

Merged
merged 1 commit into from
Dec 6, 2022

oindrillac
Member

@oindrillac oindrillac commented Oct 28, 2022

In this PR, we created initial graph representations of existing open source GitHub repositories falling under a certain category using NetworkX.

related #168

We created two types of graph representations from the same data:

  • One where repositories and contributors are both nodes (colored differently). Viewing which repositories share which sets of contributors, and analyzing their clusters, can give an idea of how projects are connected to each other and to what degree.

Screen Shot 2022-10-27 at 8 45 12 PM

  • One where repositories are nodes and edges represent the number of shared contributions. The distance between repositories (how close or far apart they are) depends on the number of shared contributions between them, and indicates how well connected the projects are with each other.

Screen Shot 2022-10-27 at 8 44 54 PM
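The two representations above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `contributions` pairs, the `kind` node attribute, and the color names are made up for the example, and the edge weight here counts shared contributors, which is one plausible reading of "shared contributions".

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical (repo_id, contributor_id) pairs standing in for the query results.
contributions = [
    ("repo_a", "alice"),
    ("repo_a", "bob"),
    ("repo_b", "bob"),
    ("repo_b", "carol"),
    ("repo_c", "bob"),
    ("repo_c", "carol"),
    ("repo_d", "dave"),
]

repos = sorted({repo for repo, _ in contributions})
contributors = sorted({person for _, person in contributions})

# Representation 1: repositories and contributors are both nodes.
B = nx.Graph()
B.add_nodes_from(repos, kind="repo")
B.add_nodes_from(contributors, kind="contributor")
B.add_edges_from(contributions)

# One color per node, in node order, suitable for nx.draw(B, node_color=node_colors).
node_colors = ["tab:blue" if B.nodes[n]["kind"] == "repo" else "gold" for n in B]

# Representation 2: only repositories are nodes; each edge weight is the
# number of contributors the two repositories have in common.
R = bipartite.weighted_projected_graph(B, repos)
shared = {tuple(sorted((u, v))): d["weight"] for u, v, d in R.edges(data=True)}
```

In this toy data `repo_b` and `repo_c` share two contributors, so their edge gets the largest weight, while `repo_d` ends up with no edges at all in the projection.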

JamesKunstle commented on 2022-10-31T16:43:21Z
----------------------------------------------------------------

Line #20.    issue_contrib.columns =['Repo ID', 'Git', 'Issue Authors', 'Issue ID']

Purely aesthetic request: Name columns in the 'Select' statement for an easy 1:1 mental mapping between names in the table. Applies to other queries. Just a little nicer to read but doesn't impact correctness.


oindrillac commented on 2022-11-07T15:35:52Z
----------------------------------------------------------------

I agree with this. Will rename this, cleaner than having spaces in the names.

oindrillac commented on 2022-11-14T17:32:36Z
----------------------------------------------------------------

Modified the column names @JamesKunstle
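For reference, the suggestion in this thread (naming columns in the SELECT statement rather than renaming them afterwards) could look something like the sketch below. The `issues` table and its columns are hypothetical stand-ins, since the notebook's real schema is not shown in this conversation, and an in-memory SQLite database is used only to keep the example self-contained.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE issues (repo_id INTEGER, repo_git TEXT, reporter_id TEXT, issue_id INTEGER)"
)
conn.execute(
    "INSERT INTO issues VALUES (1, 'https://example.org/repo.git', 'alice', 10)"
)

# Aliasing in the SELECT keeps a 1:1 mapping between the query and the
# resulting column names, and avoids names containing spaces.
cursor = conn.execute(
    """
    SELECT repo_id     AS repo_id,
           repo_git    AS repo_git,
           reporter_id AS issue_author,
           issue_id    AS issue_id
    FROM issues
    """
)
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()
conn.close()
```

If the query is loaded with something like `pandas.read_sql`, the DataFrame should pick up these aliases directly, so a separate `.columns = [...]` assignment is not needed.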

JamesKunstle commented on 2022-10-31T16:43:22Z
----------------------------------------------------------------

Line #8.                    ca.cntrb_id,

Something tricky: commits have both Authors and Committers, who are not necessarily the same person. It certainly isn't incorrect to use only the committer as this data point, but perhaps there's something interesting to be gained later by considering both the 'Committer' and the 'Author', because we'd pay attention to both contributors' work.


oindrillac commented on 2022-11-07T15:36:50Z
----------------------------------------------------------------

That's a very good point. Will see how we can extend this work to include the Authors as well.

oindrillac commented on 2022-11-14T18:05:54Z
----------------------------------------------------------------

Created an issue to address this https://github.com//issues/177
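To make the author/committer distinction concrete, here is a small sketch of how counting both roles can change which repositories appear connected. The commit records and field names are hypothetical and do not reflect the actual schema queried in the notebook.

```python
from collections import defaultdict

# Hypothetical commit records; the field names are illustrative only.
commits = [
    {"repo": "repo_a", "author": "alice", "committer": "alice"},
    {"repo": "repo_a", "author": "bob", "committer": "alice"},
    {"repo": "repo_b", "author": "carol", "committer": "bob"},
]


def contributors_by_repo(commits, roles=("committer",)):
    """Collect the set of people per repo, counting only the given roles."""
    result = defaultdict(set)
    for commit in commits:
        for role in roles:
            result[commit["repo"]].add(commit[role])
    return dict(result)


committers_only = contributors_by_repo(commits)
both_roles = contributors_by_repo(commits, roles=("author", "committer"))

# People that repo_a and repo_b have in common under each definition.
shared_committers = committers_only["repo_a"] & committers_only["repo_b"]
shared_both = both_roles["repo_a"] & both_roles["repo_b"]
```

With committers only, the two toy repositories share nobody; once authors are included, `bob` links them, which would add an edge to the graph.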

JamesKunstle commented on 2022-10-31T16:43:23Z
----------------------------------------------------------------

From here to the bottom of this notebook, I can imagine the steps that are being taken (e.g. above, I think that's a sparse edge matrix between commit creators and repos) but I would really appreciate some in-depth annotation of your process. What I'm seeing throughout this notebook is phenomenal work but I don't want to misinterpret your results by injecting an incorrect assumption.

essentially: would love more annotation on phases of working on this network representation.


oindrillac commented on 2022-11-07T15:46:28Z
----------------------------------------------------------------

Yes absolutely. Intend to add more explanation, examples and samples with this https://github.com//issues/174 but even at this phase, going to add some in-depth comments/annotation to improve readability

oindrillac commented on 2022-11-14T18:06:19Z
----------------------------------------------------------------

Added some explanation for these transformations and the graphs.

JamesKunstle commented on 2022-10-31T16:43:24Z
----------------------------------------------------------------

this is such an awesome view! very cool to see the inter-connectedness of the 704 centroid w.r.t the relatively sparse connectivity of the other centroids.


oindrillac commented on 2022-11-07T15:57:26Z
----------------------------------------------------------------

Right! That makes me think it would be cool to build dynamic, Plotly-like graphs from these, where we can hover over the nodes to see their metadata, etc.

cdolfi commented on 2022-11-11T16:27:17Z
----------------------------------------------------------------

Is the coloring here the same as above? blue for repo and yellow for contributor?

JamesKunstle commented on 2022-10-31T16:43:25Z
----------------------------------------------------------------

Is this a fully-connected graph? e.g. # edges = (#nodes * (#nodes - 1)), I think. I'd assume that it would be. The weights of the edges would be interesting to see: where are the largest contributor population intersections between projects?


oindrillac commented on 2022-11-07T15:55:54Z
----------------------------------------------------------------

No, this is not a complete (fully-connected) graph. Not every node has a direct connection to every other node, but each node has at least one connection. This is similar to a social media platform where the requirement to be on the network is to have at least one connection. For example, see project nodes '27034' and '27032': they are not connected because they do not have any shared contributors.

To your previous comment, along with details on how we got to this graph, I will also add some comments on how this graph can be interpreted.

oindrillac commented on 2022-11-07T15:58:42Z
----------------------------------------------------------------

Would be cool to have a dynamic graph here too, so one can hover over the edges to see which contributors constitute these shared contributors. Would be so nice to be able to drill down into those views.

cdolfi commented on 2022-11-11T16:29:25Z
----------------------------------------------------------------

For the edge lengths here what do they specifically correlate to?

oindrillac commented on 2022-11-14T18:07:24Z
----------------------------------------------------------------

@cdolfi the edge lengths here correlate to the number of shared contributions amongst them. Added some explanations for the graph below for more clarity.
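The questions in this thread (is the graph complete, and where are the heaviest edges) can be checked directly on a graph object. The sketch below uses a small hypothetical graph with made-up weights rather than the notebook's data. Note that for an undirected simple graph the complete edge count is n(n-1)/2; n(n-1) counts ordered pairs, as in a directed graph.

```python
import networkx as nx

# Hypothetical repo-to-repo graph; weights stand in for shared-contributor counts.
G = nx.Graph()
G.add_weighted_edges_from(
    [
        ("repo_a", "repo_b", 5),
        ("repo_b", "repo_c", 12),
        ("repo_c", "repo_d", 2),
    ]
)

n = G.number_of_nodes()
max_edges = n * (n - 1) // 2  # edge count of an undirected complete graph
is_complete = G.number_of_edges() == max_edges

# "Each node has at least one connection" means no node has degree zero.
has_isolated_nodes = any(degree == 0 for _, degree in G.degree())

# Heaviest edges first: the pairs of projects with the largest overlap.
heaviest = sorted(G.edges(data="weight"), key=lambda edge: edge[2], reverse=True)
```

On how weights relate to drawn distances: if a force-directed layout such as `nx.spring_layout` is used (an assumption, since the layout call is not shown in this conversation), a larger `weight` means a stronger attractive force, so heavily connected repositories tend to be drawn closer together.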

@JamesKunstle
Collaborator

Overall: I want to re-review this thoroughly once there is more annotation, so I can follow your process very closely. It's so, so cool to see these results as they are though; the interconnectedness is fascinating to see. Amazing work, y'all.

Co-authored-by: Hema Veeradhi <hveeradh@redhat.com>
@oindrillac
Member Author

@cdolfi @JamesKunstle added some more annotations and explanations to the notebook. Please take a look

@oindrillac
Member Author

Is the coloring here the same as above? blue for repo and yellow for contributor?

Yes, that's correct. Added legends to each graph for clarity.

@oindrillac
Member Author

Just want to bump this up. Comments above were addressed and this is ready for another look.

Contributor

@cdolfi cdolfi left a comment

lgtm! @JamesKunstle What do you think?

@JamesKunstle
Collaborator

big lgtm from me

Collaborator

@JamesKunstle JamesKunstle left a comment

lgtm

@cdolfi cdolfi merged commit 796cbc6 into oss-aspen:main Dec 6, 2022
3 participants