Ask HN: Would you pay for a code similarity detection tool?
33 points by pka on July 27, 2015 | 45 comments
I've been working on a proof-of-concept code similarity detection tool.

The tool is based on matching semantically equivalent code fragments, i.e. it detects similarities across inlined or lifted functions, reordered but equivalent expressions and so on.

Check out the attached screenshot [1].

The idea is to mine the publicly available npm repos and provide a paid service for detecting similar code fragments already implemented in npm libraries -- basically an open REST service and one or two plugins for the most popular editors (vim, Sublime, Atom?).

If this proves successful, I would extend the service to other languages and services, where applicable.

Would you pay for such a service? What kind of features would you expect?

[1] https://dl.dropboxusercontent.com/u/30225560/ase.png



Commercial demand for this idea centers on license verification. My employer runs a tool against our internal repositories looking for code it's aware of on the internet. When it flags a match, a human looks at our use of the publicly available code and verifies our compliance with the license that came with the code.

I, a software developer, wouldn't buy such a thing, but there's certainly enterprise/corporate demand for it.


I haven't thought about enterprise; it may be a direction worth pursuing, thanks!




I've come across simian - isn't it a fuzzy matcher without semantic analysis? I'll check out the rest, thanks!


Well there are plagiarism detection tools that education institutions use to detect cheating in CS classes.

There are several machine learning solutions that can look for your IP in either source code or binary form even on an abstract (algorithm) level.

Many tools like that exist; the real question is what exactly your target audience is.

If you are scanning public repos, what service does the tool actually provide? And no, "detecting code similarity" isn't the answer here.

Selling something isn't really an issue as long as it has a clear purpose, which I don't think your idea has at this moment in time.


Many of the plagiarism detection tools don't work on a semantic level. There are some that do, but I haven't found an easy, it-just-works online solution.

Could you share links to the IP detection ML tools?

And to answer your question, please read my answer here [1].

[1] https://news.ycombinator.com/item?id=9954879


I've been demoed those solutions by some consultancy firms; I'll have to dig them up.

But still, what is the use case?

I mean, if I wrote code which is functioning, why would I replace it with someone else's code?

Taking in raw code, even (especially) from OSS repos, is a huge can of worms.

Say you are developing a product. If you use an OSS library that is distributed as-is, under most licenses you can bundle it with your product without having to worry about anything, unless you modify it.

If you copy-paste code from that library into your own code base, well, then what is it? A copyright violation? A derivative work? I can easily bundle OpenSSL with their Apache license and just make a remark about it somewhere during the installation; I don't have to distribute my software or source code under that license or under any other OSS license.

But if I take the raw source code, say their ASN.1 parser, and implement it in my own program, what now? I'm not an IP lawyer, but I'm pretty sure this either violates the license outright or my software has now become a derivative work, which means the terms (or some of them) of the original license now apply.

Even if my software was meant to be OSS, it's still an issue: maybe I don't want to use Apache or BSD; maybe I want GPLv2 or v3, or my own license, or whatever.

The other issue that stands out to me is that I already wrote functioning code. It works, it's mine, I know it, I understand it, and I can maintain it. Why should I onboard someone else's code that I won't know, won't understand, and won't be able to maintain as easily? Where is the benefit in that if I've already written all or most of my code by the time your similarity score triggers a suggestion?


Personally no.

Sometimes I deliberately choose not to abstract away simple logic (as in your screenshot) for performance reasons, or just to reduce a complex dependency chain (I'm all for code reuse, but I do also believe in a balance when writing portable code). And in the instances where the required logic is more complex, I'd know to be looking for a module before writing my own code.

Because of these reasons, I couldn't even see myself using this tool if it was free.

However, this does sound like an interesting project, and I think you should still proceed with it regardless of my feedback: even if it doesn't become a profitable exercise, I could see this becoming a future must-have feature for IDEs, e.g. for code refactoring. In fact, maybe you could extend this tool to analyse repeated code within a project and suggest abstracting that out to a function (that's a tool I probably would use on larger code bases!).


Would you pay for it yourself? I'm struggling to see what a normal developer would use it for. I think your main customer will be high schools, colleges and universities. Not so much professional developers.


I would, yes - it's a form of scratching my own itch.

I don't know about you, but I find myself writing boilerplate code very often. Converting between timezones, reading config files, making HTTP requests (+error handling), UI idioms (click & disable), etc.

Often the code handling such things is spread over a bigger function, like open a file on line 5, loop over lines and read into array on lines 24, 25, 26, close file in finally {} clause.

What I want is a tool telling me "hey, you can replace these lines by function X in open-source package Y." Maybe I overestimate its universal usefulness though :)
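
To make that concrete, here is a minimal sketch in Python (the service itself targets npm, so the language and both function names are illustrative assumptions, not output of the tool). `read_lines_manual` spreads the open / loop / close steps over the function body, and `read_lines_stdlib` is the kind of replacement such a tool might point to:

```python
from pathlib import Path


def read_lines_manual(path):
    """Hand-rolled version: open, loop, and close are spread over the function."""
    handle = open(path, encoding="utf-8")
    lines = []
    try:
        for line in handle:
            lines.append(line.rstrip("\n"))
    finally:
        handle.close()
    return lines


def read_lines_stdlib(path):
    """Standard-library replacement (hypothetical suggestion).

    Assumption: the file is ordinary text with newline-terminated lines;
    splitlines() also splits on a few rarer line-boundary characters.
    """
    return Path(path).read_text(encoding="utf-8").splitlines()
```

For an ordinary text file both functions return the same list of lines, which is the kind of match the tool would need to recognise even though the two bodies look nothing alike as text.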


In certain industries, where independent double programming is required, this could be used to catch people who steal/copy code.


There's a YC company doing something very much like this, but against compiled binaries.


Would you mind sharing which, or haven't they gone public yet?


I'm the founder of SourceDNA, which tptacek is referring to.

https://sourcedna.com/


You've got YC investment now? Congratulations!


Mirror, in case the OP's Dropbox account hits its bandwidth limit: http://i.imgur.com/ciEUQth.png


Your left-hand example works on "i" as a global variable, so it has additional effects compared to the example on the right.

Tools like this are common within Java, as Java and similar languages are easier to do this kind of analysis on.

IntelliJ IDEA is the long-reigning champ in this area. (Which is a product I would pay for.)


You are right - sorry about that. The example was put together in 5 minutes and was meant to just convey the general idea.

I'll have to check out IDEA again, thanks!


(I work mostly on Python, so this is my opinion based on that)

Most of the code that I write that would be duplicating the functionality in a library would be doing so because we don't need the extra functionality. For example, I need to pluralise a small set of words, so I write a 3-4 clause if-statement and append some "s" characters because I don't see the need to use python-inflection. The latter is massively more complex, so unlikely to be detected as the same thing.
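
As a rough illustration of the kind of helper meant here (a hypothetical sketch; the function name and the exact rules are assumptions, not the actual code in question):

```python
def pluralise(word):
    """Naive pluraliser for a small, known set of regular English nouns."""
    if word.endswith(("s", "x", "ch", "sh")):
        return word + "es"
    # Consonant + "y" becomes "ies"; vowel + "y" just takes an "s".
    if word.endswith("y") and word[-2:-1] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"
```

A general-purpose inflection library also has to cover irregular and uncountable nouns, so its implementation shares almost no structure with these few clauses.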

Sure, there might be a few matches, but I suspect they will mostly be helper functions within libraries, rather than the public API of libraries.

I would prefer not to send code to a web service for detection, although not totally against it. In many companies this would not be allowed, either through policy, firewalls, exfiltration detection systems, or lack of internet on development machines (I know people who work, or have worked myself, in all of these situations).

Something I think would be far more valuable, and possibly more realistic as well, is local detection that highlights possibly duplicated code in a codebase. I find little snippets (1-2 lines) that have been duplicated on a fairly regular basis, and if I could identify those to be extracted out into re-usable methods, that would be amazing.
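
A minimal sketch of what such local detection could look like, using only Python's standard `ast` module (requires Python 3.9+ for `ast.unparse`). The function name is hypothetical, and it only catches statements that are syntactically identical apart from whitespace and comments; renamed variables are not matched:

```python
import ast
from collections import defaultdict


def find_duplicate_statements(source):
    """Map each repeated simple statement to the sorted lines where it occurs."""
    tree = ast.parse(source)
    seen = defaultdict(list)
    for node in ast.walk(tree):
        # Only compare simple statements; skip anything that has its own body
        # (if/for/while/def/class/with/try).
        if isinstance(node, ast.stmt) and not hasattr(node, "body"):
            # ast.dump() leaves out line numbers by default, so identical
            # statements on different lines get the same key.
            seen[ast.dump(node)].append(node)
    return {
        ast.unparse(nodes[0]): sorted(n.lineno for n in nodes)
        for nodes in seen.values()
        if len(nodes) > 1
    }
```

Run over a file's source text, this returns entries such as `{"x = a.strip().lower()": [1, 5]}`, which an editor plugin could surface as candidates for extraction into a shared helper.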

Would I pay money for it? Probably not, it's not that much of a problem, and all those sorts of tools are usually open source anyway. Unfortunately that's my expectation now.


All valid concerns.

I guess the best way to address the usefulness issue is to run it against popular frameworks, like React or Angular, and see what happens.

If it proves useful (like say 10% of code could be replaced by existing functions), would you change your mind?


Tools are very hard to sell. Especially very specialized tools. And tools for programmers often come with the end-user price of 0, so justifying anything above that is hard.

For me personally, I don't see the need to even use such a tool, let alone paying for it. Many programmers seem to focus on code, but in my experience, that's usually not the problem you have when things go bad.


Your example looks similar to what ReSharper does: http://blog.jetbrains.com/dotnet/2009/12/11/resharper-50-pre...


I think Visual Studio does it too with its Analyse Solution for Code Clones. It's an Ultimate edition feature though.



I was just reading about plagiarism detection recently http://theory.stanford.edu/~aiken/moss/

How do your techniques compare to Winnowing? http://theory.stanford.edu/~aiken/publications/papers/sigmod...


It works on a semantic level (i.e. what the code actually means), rather than fingerprinting strings. This means that reordering code segments, renaming variables, inlining or lifting functions wouldn't affect a match, if the code is semantically equivalent.
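
A toy sketch of the difference, in Python rather than JavaScript and not the tool's actual method: it canonicalises an AST by ordering the operands of commutative operators and renaming variables in order of first use, so renamed and reordered expressions get the same fingerprint. All names are hypothetical, it does not handle inlined or lifted functions, and it assumes numeric operands (reordering `+` is not valid for strings or lists):

```python
import ast
import re

# Assumption: operands are numbers, so these operators are commutative.
COMMUTATIVE = (ast.Add, ast.Mult, ast.BitAnd, ast.BitOr, ast.BitXor)


def _shape(node):
    """Sort key: the subtree dump with every identifier blanked out."""
    return re.sub(r"id='[^']*'", "id='_'", ast.dump(node))


class _SortOperands(ast.NodeTransformer):
    """Put the operands of commutative operators into a canonical order."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # canonicalise nested expressions first
        if isinstance(node.op, COMMUTATIVE):
            node.left, node.right = sorted((node.left, node.right), key=_shape)
        return node


class _RenameVariables(ast.NodeTransformer):
    """Rename identifiers to v0, v1, ... in order of first appearance."""

    def __init__(self):
        self.names = {}

    def visit_Name(self, node):
        node.id = self.names.setdefault(node.id, f"v{len(self.names)}")
        return node


def fingerprint(source):
    """Return a canonical dump that ignores variable names and operand order."""
    tree = ast.parse(source)
    tree = _SortOperands().visit(tree)
    tree = _RenameVariables().visit(tree)
    return ast.dump(tree)
```

With this, `fingerprint("total = price * 2 + tax")` equals `fingerprint("sum_ = vat + 2 * cost")`, whereas a string fingerprint of the two lines would differ. It still misses many real equivalences, which is where a proper semantic analysis would have to go further.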


Cool. Sounds more robust.


> basically an open REST service

Be careful with that: if the plugin is intended to be real-time, it restricts your customers to a certain light-millisecond distance from your server.


If it turns out to be at all useful with regard to maintaining revenue-generating code in production, $25 per year is a laughably low ask, even if it's only useful some of the time for some of the people.

Figuring out which people and when is, generically, a matter of doing your homework with customer development and targeted marketing, and then further improving your marketing surface as you learn more about who is a customer and how to find them.

Going from "this might be useful" to a working business model is, of course, a deeply nontrivial problem...but it's one that many people have solved before, and it's a heck of a lot easier if you don't start with "maybe $25 per year."


For some reason my brain added in an estimated cost to the OP. You're right: it's outside the scope of what is being asked. Removed the pricing stuff.


There is a project called jsinspect which does what you are talking about. I don't know if yours would be a bit more robust than this one, but I just wanted to mention it so you know what you are competing with.

https://github.com/danielstjules/jsinspect


Thanks, I've seen jsinspect. It works by comparing ASTs, which is probably better than matching strings, but I guess it wouldn't be able to deal with inlined functions, reordered code segments, etc.


Would love to see this integrated into an IDE such that it is capable of detecting whether or not what you're typing (or something close to it) already exists. For companies with large codebases this is a major enough problem that a decent solution would have a compelling enough ROI to shell out some cash.


I'm not sure if I would buy it, but it could be a very nice way of introducing someone to a new language/framework. When learning a new environment, one doesn't know all the functions, and it's very easy to write code that is already implemented.

So, maybe try asking someone who is into programming training programs.


McCabe software has a tool that's supposed to find duplicate code within a code base (e.g. cut-n-pasted functions), using path analysis. They have been around for ages, but I've never used them since they're rather expensive.


As a professor, I always missed an easy-to-use and modern tool for that. I didn't research this topic much, but not finding anything easy and ready to use is probably a market opportunity (if there is a market).


Have you looked at Code Climate?

https://codeclimate.com/


No, thanks for the link!

Code Climate looks more like a linter though?


How did you choose to work on this idea?


Basically by wanting to make my life easier when working on bigger, hard-to-maintain codebases :)


How do you determine semantically equivalent code fragments? Is it a dynamic solution or a static solution?


Personally, no. Even if it was an IDE feature, I would rather just learn it if I use it frequently.


I would not; stuff like that exists for free, and I rarely use that kind of tool.


Generally, no. There is no money to be made making tools.



