Refactored data clumps with the help of LLMs (research project) #9352

Open
wants to merge 2 commits into
base: master


@compf compf commented Jun 5, 2024

Hello maintainers,

I am conducting a master's thesis project focused on enhancing code quality through automated refactoring of data clumps, assisted by Large Language Models (LLMs).

Data clump definition

A data clump exists if:

  1. Two methods (in the same or in different classes) have at least three common parameters, and neither method overrides the other, or
  2. At least three fields of a class match the parameters of a method (in the same or in a different class), or
  3. Two different classes have at least three common fields.

See also the following UML diagram as an example:
[Figure: Example data clump]
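
As a rough illustration of the first rule, here is a minimal Java sketch of a data clump and one way to remove it by introducing a parameter object. All names (DataClumpSketch, Endpoint, and so on) are hypothetical and are not taken from Jenkins core or from this pull request.

```java
import java.util.Objects;

// Hypothetical example: none of these names exist in Jenkins core.
final class DataClumpSketch {

    // Before: the same three parameters travel together through two methods.
    static String formatBefore(String host, int port, String user) {
        return user + "@" + host + ":" + port;
    }

    static boolean isLocalBefore(String host, int port, String user) {
        return "localhost".equals(host);
    }

    // After: the clump is extracted into one small immutable class.
    static final class Endpoint {
        final String host;
        final int port;
        final String user;

        Endpoint(String host, int port, String user) {
            this.host = Objects.requireNonNull(host);
            this.port = port;
            this.user = Objects.requireNonNull(user);
        }
    }

    static String format(Endpoint endpoint) {
        return endpoint.user + "@" + endpoint.host + ":" + endpoint.port;
    }

    static boolean isLocal(Endpoint endpoint) {
        return "localhost".equals(endpoint.host);
    }

    public static void main(String[] args) {
        Endpoint endpoint = new Endpoint("localhost", 8080, "admin");
        System.out.println(format(endpoint));  // admin@localhost:8080
        System.out.println(isLocal(endpoint)); // true
    }
}
```

The two "before" methods share three parameters and neither overrides the other, so they match rule 1. After the extraction, each method takes a single Endpoint argument.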

I believe these refactorings can contribute to the project by reducing complexity and enhancing the readability of your source code.

Pursuant to the EU AI Act, I fully disclose the use of LLMs in generating these refactorings, emphasizing that all changes have undergone human review for quality assurance.

Even if you decide not to integrate my changes into your codebase (which is perfectly fine), I ask you to fill out a feedback survey, which will be scientifically evaluated to determine the acceptance of AI-supported refactorings. You can find the feedback survey at https://campus.lamapoll.de/Data-clump-refactoring/en

Thank you for considering my contribution. I look forward to your feedback. If you have any other questions or comments, feel free to write a comment or email me at tschoemaker@uni-osnabrueck.de.

Best regards,
Timo Schoemaker
Department of Computer Science
University of Osnabrück

Proposed changelog entries

refactored data clumps

Proposed upgrade guidelines

N/A

Submitter checklist

  1. The Jira issue, if it exists, is well-described.
  2. The changelog entries and upgrade guidelines are appropriate for the audience affected by the change (users or developers, depending on the change) and are in the imperative mood (see examples). Fill in the Proposed upgrade guidelines section only if there are breaking changes or changes that may require extra steps from users during upgrade.
  3. There is automated testing or an explanation as to why this change has no tests.
  4. New public classes, fields, and methods are annotated with @Restricted or have @since TODO Javadocs, as appropriate.
  5. New deprecations are annotated with @Deprecated(since = "TODO") or @Deprecated(forRemoval = true, since = "TODO"), if applicable.
  6. New or substantially changed JavaScript is not defined inline and does not call eval to ease future introduction of Content Security Policy (CSP) directives (see documentation).
  7. For dependency updates, there are links to external changelogs and, if possible, full differentials.
  8. For new APIs and extension points, there is a link to at least one consumer.

Desired reviewers

Before the changes are marked as ready-for-merge:

Maintainer checklist

  1. There are at least two (2) approvals for the pull request and no outstanding requests for change.
  2. Conversations in the pull request are over, or it is explicit that a reviewer is not blocking the change.
  3. Changelog entries in the pull request title and/or Proposed changelog entries are accurate, human-readable, and in the imperative mood.
  4. Proper changelog labels are set so that the changelog can be generated automatically.
  5. If the change needs additional upgrade steps from users, the upgrade-guide-needed label is set and there is a Proposed upgrade guidelines section in the pull request title (see example).
  6. If it would make sense to backport the change to LTS, a Jira issue must exist, be a Bug or Improvement, and be labeled as lts-candidate to be considered (see query).

welcome bot commented Jun 5, 2024

Yay, your first pull request towards Jenkins core was created successfully! Thank you so much!

A contributor will provide feedback soon. Meanwhile, you can join the chats and community forums to connect with other Jenkins users, developers, and maintainers.

Contributor

@mawinter69 mawinter69 left a comment

What all the classes where you added ProcessProperties have in common is that they inherit from UnixProcess. So instead of adding a new class just for the properties, wouldn't it be better to define the fields in UnixProcess?

- private int ppid = -1;
- private EnvVars envVars;
- private List<String> arguments;
+ private ProcessProperties properties = new ProcessProperties(-1, null, null);
Contributor

Just wondering: in all other places the ProcessProperties fields are declared transient, so why not here?

Author

That seems to be an oversight on my part. SpotBugs complained that I should add transient everywhere, and once it stopped complaining I didn't look any further. Strange :)

Member

It's unclear to me why the fields are being made transient. If it's SE_BAD_FIELD, wouldn't making ProcessProperties Serializable address this without potentially causing serialization trouble?

(FWIW removing the transient doesn't fail SpotBugs for me locally.)
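
To make the trade-off in the question above concrete, the following sketch round-trips an object through standard Java serialization. It assumes a hypothetical holder class (Props) standing in for ProcessProperties and does not use the actual Jenkins classes. A transient field comes back as null, while a field whose type implements Serializable survives.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical example: Props stands in for a small value holder.
final class TransientSketch {

    static final class Props implements Serializable {
        private static final long serialVersionUID = 1L;
        final int ppid;

        Props(int ppid) {
            this.ppid = ppid;
        }
    }

    static final class Holder implements Serializable {
        private static final long serialVersionUID = 1L;
        // Skipped by serialization; reads back as null.
        transient Props skipped = new Props(42);
        // Written and restored because Props is Serializable.
        Props kept = new Props(42);
    }

    // Serializes the holder to a byte array and reads it back.
    static Holder roundTrip(Holder original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Holder) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Holder copy = roundTrip(new Holder());
        System.out.println(copy.skipped == null); // true
        System.out.println(copy.kept.ppid);       // 42
    }
}
```

Field initializers are not re-run when a Serializable class is deserialized, which is why the transient field is not silently recreated.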

Author

compf commented Jun 5, 2024

What all the classes where you added ProcessProperties have in common is that they inherit from UnixProcess. So instead of adding a new class just for the properties, wouldn't it be better to define the fields in UnixProcess?

Thank you for the feedback. In this particular case, that might be a better solution. The LLM chooses the approach that always works, but I agree that pulling up those fields is also a way to resolve data clumps :)

Member

@daniel-beck daniel-beck left a comment

Overall, the approach suggested in #9352 (review) seems preferable to a new type. The following should do it; all we'd lose is the finality of ppid, which is no different from this proposal.

diff --git a/core/src/main/java/hudson/util/ProcessTree.java b/core/src/main/java/hudson/util/ProcessTree.java
index 8fbb80c8a8..80155d3d37 100644
--- a/core/src/main/java/hudson/util/ProcessTree.java
+++ b/core/src/main/java/hudson/util/ProcessTree.java
@@ -796,6 +796,10 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
      * A process.
      */
     public abstract class UnixProcess extends OSProcess {
+        protected int ppid = -1;
+        protected EnvVars envVars;
+        protected List<String> arguments;
+
         protected UnixProcess(int pid) {
             super(pid);
         }
@@ -877,9 +881,6 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
         }
 
         class LinuxProcess extends UnixProcess {
-            private int ppid = -1;
-            private EnvVars envVars;
-            private List<String> arguments;
 
             LinuxProcess(int pid) throws IOException {
                 super(pid);
@@ -1001,13 +1002,9 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
              */
             private final boolean b64;
 
-            private final int ppid;
-
             private final long pr_envp;
             private final long pr_argp;
             private final int argc;
-            private EnvVars envVars;
-            private List<String> arguments;
 
             private AIXProcess(int pid) throws IOException {
                 super(pid);
@@ -1327,7 +1324,6 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
              */
             private final boolean b64;
 
-            private final int ppid;
             /**
              * Address of the environment vector.
              */
@@ -1337,8 +1333,6 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
              */
             private final long argp;
             private final int argc;
-            private EnvVars envVars;
-            private List<String> arguments;
 
             private SolarisProcess(int pid) throws IOException {
                 super(pid);
@@ -1596,9 +1590,6 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
         }
 
         private class DarwinProcess extends UnixProcess {
-            private final int ppid;
-            private EnvVars envVars;
-            private List<String> arguments;
 
             DarwinProcess(int pid, int ppid) {
                 super(pid);
@@ -1881,10 +1872,6 @@ public abstract class ProcessTree implements Iterable<OSProcess>, IProcessTree,
 
         private class FreeBSDProcess extends UnixProcess {
 
-            private final int ppid;
-            private EnvVars envVars;
-            private List<String> arguments;
-
             FreeBSDProcess(int pid, int ppid) {
                 super(pid);
                 this.ppid = ppid;
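
The idea behind the diff can be sketched in isolation as follows. This is a simplified illustration with hypothetical class names (BaseProcess, EagerProcess, LazyProcess), not the actual ProcessTree code. The shared fields live once in the base class, and ppid is non-final so that each subclass can assign it in its own way.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, heavily simplified stand-ins for UnixProcess and its subclasses.
final class PullUpSketch {

    abstract static class BaseProcess {
        final int pid;
        // Pulled-up fields: declared once instead of in every subclass.
        protected int ppid = -1;
        protected List<String> arguments;

        BaseProcess(int pid) {
            this.pid = pid;
        }

        int getParentPid() {
            return ppid;
        }

        abstract List<String> getArguments();
    }

    // Knows its parent pid at construction time.
    static final class EagerProcess extends BaseProcess {
        EagerProcess(int pid, int ppid) {
            super(pid);
            this.ppid = ppid;
        }

        @Override
        List<String> getArguments() {
            if (arguments == null) {
                arguments = Collections.emptyList();
            }
            return arguments;
        }
    }

    // Fills in ppid and arguments lazily, on first access.
    static final class LazyProcess extends BaseProcess {
        LazyProcess(int pid) {
            super(pid);
        }

        @Override
        List<String> getArguments() {
            if (arguments == null) {
                arguments = Arrays.asList("sh", "-c", "true");
                ppid = 1;
            }
            return arguments;
        }
    }

    public static void main(String[] args) {
        BaseProcess eager = new EagerProcess(10, 1);
        BaseProcess lazy = new LazyProcess(11);
        System.out.println(eager.getParentPid());   // 1
        System.out.println(lazy.getParentPid());    // -1
        System.out.println(lazy.getArguments());    // [sh, -c, true]
        System.out.println(lazy.getParentPid());    // 1
    }
}
```

Subclasses that assign ppid in the constructor and subclasses that fill it in later can share the same non-final field, which is the loss of finality mentioned above.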

import hudson.EnvVars;
import java.util.List;

public class ProcessProperties {
Member

Why is this a public class?


@@ -0,0 +1,16 @@
package hudson.util;
Member

Please add license header.

Author

compf commented Jun 13, 2024

Thank you very much for the feedback. @daniel-beck, you are correct that your proposal is better. I haven't previously encountered this corner case, where the fields are shared among derived classes, so it is interesting that the LLM did not spot this. I can update this PR to use your "pull the fields up" proposal when I find time :)

@MarkEWaite MarkEWaite added the skip-changelog label (Should not be shown in the changelog) Jun 24, 2024
4 participants