How to detect code duplication during development?

We have a fairly large code base, 400K LOC of C++, and code duplication is a bit of a problem. Are there any tools that can efficiently detect duplicated blocks of code?

Ideally, this would be something developers could use during development, rather than something run only occasionally to see where the problems are. It would also be nice if we could integrate such a tool with CruiseControl to produce a report after each check-in.

I looked at Duploc a while back. It showed a nice graph, but it requires a Smalltalk environment to use, which makes running it automatically rather difficult.

Free tools would be nice, but if there are good commercial tools, I would also be interested.

+66
c++ code-duplication
Oct 10 '08 at 14:34
13 answers

Simian detects duplicate code in C++ projects.

Update: it also works with Java, C#, C, COBOL, Ruby, JSP, ASP, HTML, XML, Visual Basic and Groovy source code, and even plain text files.

+33
Oct 10 '08 at 14:40

I have used PMD's Copy-and-Paste-Detector (CPD) and integrated it into CruiseControl using the following wrapper script (make sure the PMD jar is on the classpath).

Our check runs nightly. If you want to limit the output to files from the current change set, you will probably need some custom programming (idea: check everything and list only the duplicates in which one of the changed files is involved. You have to check all files, because a change could copy code from an unchanged file). This should be doable by using the XML output and parsing the result. Remember to post that script when you are done ;)

For starters, the "text" output should be fine, but you will want to display the results in a user-friendly way, so I use a Perl script to generate HTML files from CPD's "xml" output. These are made accessible by publishing them to the Tomcat instance where CruiseControl's reporting JSP resides. Developers can view them from there and see the results of their dirty hacking :)

It runs pretty fast: less than 2 seconds on 150 KLOC of code (empty lines and comments are not counted in this number).
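The filtering idea above could be sketched roughly like this in Python. This is only a sketch, not the script the answer asks for: it assumes the CPD XML report consists of `duplication` elements with a `lines` attribute and `file` children carrying `path` and `line` attributes, and the function name `duplications_touching` and the sample paths are made up for illustration. Check the element names against the report your PMD version actually writes.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}duplication'."""
    return tag.rsplit("}", 1)[-1]


def duplications_touching(cpd_xml, changed_files):
    """Return the duplications in which at least one changed file is involved.

    cpd_xml       -- text of a CPD report produced with format="xml"
    changed_files -- paths from the current change set (hypothetical input,
                     e.g. obtained from your version control system)
    """
    changed = set(changed_files)
    hits = []
    for elem in ET.fromstring(cpd_xml).iter():
        if local_name(elem.tag) != "duplication":
            continue
        # Every <file> child is one occurrence of the duplicated fragment.
        occurrences = [
            (child.get("path"), int(child.get("line")))
            for child in elem
            if local_name(child.tag) == "file"
        ]
        if any(path in changed for path, _ in occurrences):
            hits.append({"lines": int(elem.get("lines")), "files": occurrences})
    return hits


if __name__ == "__main__":
    # Hand-written sample report, assumed to match the CPD XML layout.
    sample = """<pmd-cpd>
      <duplication lines="12" tokens="110">
        <file line="10" path="src/a.cpp"/>
        <file line="40" path="src/b.cpp"/>
      </duplication>
      <duplication lines="8" tokens="101">
        <file line="5" path="src/c.cpp"/>
        <file line="90" path="src/d.cpp"/>
      </duplication>
    </pmd-cpd>"""
    for dup in duplications_touching(sample, ["src/b.cpp"]):
        print(dup["lines"], "duplicated lines:", dup["files"])
```

All files are still scanned by CPD; only the report is filtered, so a fragment copied from an unchanged file into a changed one still shows up. Paths are compared as exact strings here, so the change-set paths would have to be normalized to the same form the report uses.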

duplicatecheck.xml

<project name="duplicatecheck" default="cpd">
  <property name="files.dir" value="dir containing your sources"/>
  <property name="output.dir" value="dir containing results for publishing"/>

  <target name="cpd">
    <taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask"/>
    <cpd minimumTokenCount="100"
         language="cpp"
         outputFile="${output.dir}/duplicates.txt"
         ignoreLiterals="false"
         ignoreIdentifiers="false"
         format="text">
      <fileset dir="${files.dir}/">
        <include name="**/*.h"/>
        <include name="**/*.cpp"/>
        <!-- exclude third-party stuff -->
        <exclude name="boost/"/>
        <exclude name="cppunit/"/>
      </fileset>
    </cpd>
  </target>
</project>

+18
Nov 24 '08 at 17:12

Duplo appears to be a C implementation of the algorithm used in Duploc. It is easy to compile and install, and although its options are limited, it seems to work more or less out of the box.

+6
Dec 17 '08 at 4:54

Check out the PMD project.

I never used it, but always wanted to.

+5
Oct 10 '08 at 14:43

Well, you can run a clone detector on your source code base every night.

Many clone detectors work by comparing source lines, and can only find exact duplicates of code.

CCFinder (mentioned in another answer) works by comparing language tokens, so it is not sensitive to whitespace changes. It can detect clones that are variants of the original code if the only differences are single-token changes (for example, variable X renamed to Y in the clone).

Ideally, what you want is the above, plus the ability to find clones in which the variations can be relatively arbitrary, e.g. a variable replaced by an expression, a statement replaced by a block, etc.

Our clone detector, CloneDR, does this for Java, C#, C++, COBOL, VB.net, VB6, Fortran and a variety of other languages. It can be seen at: http://www.semdesigns.com/Products/Clone/index.html

In addition to handling multiple languages, the CloneDR engine can handle a variety of input encodings, including ASCII, ISO-8859-1, UTF8, UTF16, EBCDIC, a number of Microsoft encodings, and (Japanese) Shift-JIS.

The site has several sample reports from clone detection runs, including one for C++.

EDIT Feb 2014: Now handles all of C++14.
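To illustrate why comparing tokens is more tolerant than comparing lines, here is a minimal Python sketch. It is not CCFinder's actual algorithm: the regex tokenizer, the tiny keyword set and the names `normalized_tokens` and `is_token_clone` are all assumptions made for the example, and comments and string literals are not handled.

```python
import re

# Tiny illustrative keyword set; a real tool would use a full C++ lexer.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "double", "void"}

# Identifier/keyword, number, or any other single non-whitespace character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def normalized_tokens(source):
    """Tokenize source, replacing identifiers with ID and numbers with NUM."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


def is_token_clone(a, b):
    """True if both fragments have the same normalized token sequence."""
    return normalized_tokens(a) == normalized_tokens(b)


if __name__ == "__main__":
    original = "int total = 0;\nfor (int i = 0; i < n; i++) total += x[i];"
    renamed = "int sum=0; for (int k=0; k<count; k++)\n    sum += values[k];"
    changed = "int sum = 0; for (int k = 0; k < count; k++) sum -= values[k];"
    print(is_token_clone(original, renamed))
    print(is_token_clone(original, changed))
```

The first two fragments differ in layout and in every variable name, yet they produce the same normalized token sequence, so a line-by-line comparison would miss them while this one does not. The third fragment changes an operator and is not reported. Replacing every identifier with the same placeholder is deliberately crude: it would also match fragments whose variables are used in a different pattern.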

+2
Jun 28 '09 at 19:27

For my own future reference, these Debian packages seem to be doing something in this direction:

I could swear that I have other package(s) installed that may be even more relevant, but I cannot find them at the moment. (That's why I am listing my findings here this time: to give myself a chance to find them again!)

PS It seems that there should be a debtags tag for all tools related to finding [near] duplication. (But what would it be called?)

+2
Mar 13 2018-12-12T00:

CCFinderX is a free (for in-house use) code clone detector that supports several programming languages (Java, C, C++, COBOL, VB, C#).

+1
Oct 11 '08 at 4:55

Finding "identical" code fragments is relatively easy; there are existing tools that already do this (see the other answers).

Sometimes that is good, sometimes it is not. It can drag down development time if you apply it at too fine a "level": trying to refactor that much code will make you lose sight of your goal (and probably wreck your milestones and schedules).

What is harder to find are multiple functions/methods that do the same thing, but with different (yet similar) inputs and/or algorithms, and without proper documentation.

If you have two or more methods that do the same thing, and a programmer fixes one instance but forgets to fix the others (or does not know they exist), you increase the risk to your software.

+1
Nov 24 '08 at 17:25

Same (http://sourceforge.net/projects/same/) is extremely plain, but it works on text lines instead of tokens, which is useful if you are using a language that is not supported by one of the fancier clone finders.
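For comparison, the line-based approach can be sketched in a few lines of Python. This is not the implementation of the tool above, just an illustration of the general technique; the function name `duplicate_line_blocks` and the `window` parameter are invented for the example.

```python
from collections import defaultdict


def duplicate_line_blocks(files, window=3):
    """Find runs of `window` consecutive identical lines.

    files -- mapping of file name to source text
    Returns a list of location lists; each location is (file name, 1-based
    line number of the first line of the block).
    """
    seen = defaultdict(list)
    for name, text in files.items():
        # Ignore surrounding whitespace and blank lines, but remember the
        # original line numbers for reporting.
        lines = [
            (number, line.strip())
            for number, line in enumerate(text.splitlines(), start=1)
            if line.strip()
        ]
        for i in range(len(lines) - window + 1):
            block = tuple(content for _, content in lines[i:i + window])
            seen[block].append((name, lines[i][0]))
    return [locations for locations in seen.values() if len(locations) > 1]


if __name__ == "__main__":
    files = {
        "a.cpp": "int f() {\n  int x = 1;\n  x += 2;\n  return x;\n}\n",
        "b.cpp": "int g() {\n\n    int x = 1;\n    x += 2;\n    return x;\n}\n",
    }
    for locations in duplicate_line_blocks(files):
        print(locations)
```

Because it only compares stripped text lines, this works for any language, but a single renamed variable is enough to hide a clone. Overlapping windows are reported separately; a real tool would merge them into one longer block.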

+1
Aug 25 '09 at 16:10

ConQAT is a great tool that supports C++ code analysis. It can find duplicates while ignoring whitespace, and it has both a full-featured GUI and a console interface. Because it is so flexible, it is not easy to configure. I found this blog post very useful for setting up a C++ project.

+1
Aug 03 '13 at 14:48

You can use our SourceMeter tool to detect code duplication. It is a command-line tool (used much like a compiler), so you can easily integrate it into continuous integration tools such as the CruiseControl you mentioned, or Jenkins.

+1
Jul 31 '15 at 16:12

There is also Simian, which supports Java, C#, C++, C, Objective-C, JavaScript...

It is supported by Hudson (as is CPD).

Unless your project is open source, you have to pay for Simian.

0
Jul 15 '10 at 22:18

TeamCity has a powerful code duplication engine for .NET and Java, which can easily be run as part of your build system.

-3
Nov 17 '08 at 16:20


