How does code completion work?

Many editors and IDEs have code completion. Some of them are very "smart", others are actually not. I'm interested in a more intelligent type. For example, I saw IDEs that only offer a function if it is a) available in the current area b) its return value is valid. (For example, after "5 + foo [tab]", it offers only functions that return what can be added to integers or variable names of the correct type.) I also saw that they put the most used or longest option ahead of the list .

I understand that you need to parse the code. But usually when editing the current code are unacceptable, it has syntax errors. How do you parse something when it is incomplete and contains errors?

There is also a time limit. Completing is useless if it takes a few seconds to make a list. Sometimes the completion algorithm deals with thousands of classes.

What are good algorithms and data structures for this?

+69
algorithm autocomplete code-completion
Aug 2 '09 at 22:55
source share
3 answers

The IntelliSense engine in my UnrealScript language service product is complex, but I will cover it here in the best possible way. C # language service in VS2008 SP1 is my performance goal (for good reason). This is still not the case, but it’s fast / accurate enough that I can safely offer suggestions after entering one character without waiting for ctrl + space or the user who typed . (point). The more information people [working on linguistic services] learn about this, the better is the end-user experience that I get if I should ever use their products. There are a number of products that I came across with unsuccessful experience working with them that did not pay so much attention to details, and as a result, I struggled with the IDE more than I was.

In my language service, it looks like this:

  • Get expression in cursor. This happens from the beginning of the member access expression to the end of the identifier by which the cursor is completed. A member access expression is usually in the form aa.bb.cc , but may also contain method calls, as in aa.bb(3+2).cc .
  • Get the context surrounding the cursor. This is very difficult, because it does not always follow the same rules as the compiler (long story), but suppose it is. As a rule, this means obtaining cached information about the method / class in which the cursor is located.
  • Say the context object implements IDeclarationProvider , where you can call GetDeclarations() to get the IEnumerable<IDeclaration> all the elements visible in the area. In my case, this list contains locals / parameters (if in the method), members (fields and methods, static, only if in the instance method and without private members of the base types), global (types and constants for the language I'm working with) and keywords. This list will have an element named aa . As the first step in evaluating the expression in # 1, we select an element from the context enumeration named aa , providing us with IDeclaration for the next step.
  • Then I apply the operator to the IDeclaration representing aa to get another IEnumerable<IDeclaration> containing the "members" (in a sense) of aa . Because the operator . different from the -> operator, I call declaration.GetMembers(".") and expect the IDeclaration object IDeclaration correctly use the specified operator.
  • This continues until I delete cc , where the list of declarations may or may not contain an object named cc . As I am sure, you know that if several elements start with cc , they should also appear. I solve this by taking the final listing and passing it through my documented algorithm to provide the user with the most useful information.

Here are some additional notes for the IntelliSense backend:

  • I make extensive use of lazy LINQ evaluation mechanisms when implementing GetMembers . Each object in my cache is capable of providing a functor that evaluates its members, so performing complex actions with a tree is almost trivial.
  • Instead of each object containing the List<IDeclaration> its members, I save a List<Name> , where Name is a structure containing a hash of a specially formatted string describing the element. There is a huge cache that displays the names of objects. Thus, when re-analyzing the file, I can delete all the elements declared in the file from the cache and re-populate it with updated members. Thanks to how functors are configured, all expressions immediately evaluate new elements.

IntelliSense Interface

As the user enters data, the file is syntactically incorrect more often than correct. Thus, I do not want to accidentally delete cache sections when the user types. I have a large number of special rules that allow me to process incremental updates as quickly as possible. The incremental cache is stored only in an open file and helps to ensure that the user does not understand that entering them causes the backend cache to contain incorrect row and column information for things like every method in the file.

  • One repayment factor is my parser is fast. It can handle a full cache update of 20,000 lines of the source file in 150 ms while working offline in the background thread with low priority. Whenever this parser successfully completes the transfer of an open file (syntactically), the current state of the file is moved to the global cache.
  • If the file is not syntactically correct, I use the ANTLR filter parser (sorry for the link - most of the information is on the mailing list or source collected from reading) to revise the file that is looking for:
    • Variable / field declarations.
    • Signature for class / structure definitions.
    • Signature for method definitions.
  • In the local cache, class / structure / method definitions begin with a signature and end when the binding insertion level returns to even. Methods can also end if another method declaration is reached (no nesting methods).
  • In the local cache, variables / fields are associated with the immediately preceding unclosed element. The following is a snippet of short code below for an example of why this is important.
  • In addition, as the user enters, I save a reassignment table indicating the added / removed character ranges. This is used for:
    • Making sure that I can determine the correct cursor context, as the method can / can move between full parses in the file.
    • Make sure that Go To Declaration / Definition / Reference correctly finds items in open files.

Code snippet for the previous section:

 class A { int x; // linked to A void foo() // linked to A { int local; // linked to foo() // foo() ends here because bar() is starting void bar() // linked to A { int local2; // linked to bar() } int y; // linked again to A 

I decided that I would add a list of IntelliSense features that I implemented using this layout. Images of each of them are here.

  • Autofill
  • Tool tips
  • Method Tips
  • View class
  • Code Definition Window
  • Browser call (VS 2010 finally adds it to C #)
  • Semantically correct search for all links
+53
Aug 02 '09 at 23:59
source share

I cannot say for sure which algorithms are used by any particular implementation, but I can make some reasonable assumptions. A trie is a very useful data structure for this problem: the IDE can support a large amount of memory for all characters in your project, with some additional metadata in each node.

When you enter a character, it jumps along the path to trie. All descendants of a particular trie node are possible completions. After that, the IDE just needs to filter out those that make sense in the current context, but just need to calculate as many as can be displayed in the tab completion popup window.

More advanced tab filling requires more sophisticated controls. For example, Visual Assist X has a function in which you only need to enter the capital letters of the CamelCase characters - for example, if you type SFN, this shows you the SomeFunctionName character in the tab completion window.

Computing trie (or other data structures) requires parsing all of your code to get a list of all the characters in your project. Visual Studio stores this in its IntelliSense database, the .ncb file that is stored next to your project, so it does not need to repeat everything every time you close and reopen your project. The first time you open a large project (say, one of which you only synchronized the control of the source code of the form), VS will take time to parse everything and generate a database.

I do not know how it handles incremental changes. As you said, when you write code, this syntax is invalid 90% of the time, and whenever you are idle, you get a huge tax on your processor for very little benefit, especially if you change the header file included in a large number of source files .

I suspect that it either (a) only reparations whenever you really build your project (or perhaps when you close / open it), or (b) it does some kind of local parsing where it only analyzes code where you are "We just edited in a limited way, just to get the names of the corresponding characters. Since C ++ has such an extremely complex grammar, it can behave strangely in dark corners if you use heavy template metaprogramming, etc. .

+14
Aug 02 '09 at 23:12
source share

The following link will help you further.

Syntax highlighting: Fast color text box for syntax highlighting

+1
Apr 01 '12 at 8:21
source share



All Articles