I have written several assemblers over the years, doing the parsing by hand, and to be honest, you are probably better off using a grammar language and a parser generator.
Here's why - a typical assembly line will probably look something like this:
[label:] [instruction|directive][newline]
and the instruction will be:
plain-mnemonic|mnemonic-withargs
and the directive will be:
plain-directive|directive-withargs
and so on.
With a decent parser generator like Gold, you can knock out a grammar for the 8051 in a few hours. The advantage of this over manual parsing is that you can support fairly complex expressions in your assembly source, for example:
    .define kMagicNumber 0xdeadbeef
    CMPA
which can be a real bear to handle by hand.
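To give a feel for what "by hand" involves, here is a minimal sketch of a hand-written recursive-descent evaluator for constant expressions. It is only an illustration under my own assumptions: it handles integer literals (decimal or `0x` hex), `+ - * /`, unary minus and parentheses, does no symbol lookup, and reports errors with a single flag. The names (`EvalExpr`, `ParseTerm`, and so on) are made up for this example.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical hand-written evaluator for constant operand expressions.
 * Grammar:  expr   := term   (('+' | '-') term)*
 *           term   := factor (('*' | '/') factor)*
 *           factor := number | '(' expr ')' | '-' factor
 */
static const char *p;   /* cursor into the expression text */
static int failed;      /* set to 1 on any syntax error or division by zero */

static unsigned long ParseExpr(void);

static void SkipSpaces(void)
{
    while (isspace((unsigned char)*p)) p++;
}

static unsigned long ParseFactor(void)
{
    SkipSpaces();
    if (*p == '(') {
        unsigned long v;
        p++;
        v = ParseExpr();
        SkipSpaces();
        if (*p == ')') p++; else failed = 1;
        return v;
    }
    if (*p == '-') { p++; return 0UL - ParseFactor(); }
    if (isdigit((unsigned char)*p)) {
        char *end;
        /* base 0 accepts 0x... hex (and treats a leading 0 as octal) */
        unsigned long v = strtoul(p, &end, 0);
        p = end;
        return v;
    }
    failed = 1;
    return 0;
}

static unsigned long ParseTerm(void)
{
    unsigned long v = ParseFactor();
    for (;;) {
        SkipSpaces();
        if (*p == '*') { p++; v *= ParseFactor(); }
        else if (*p == '/') {
            unsigned long d;
            p++;
            d = ParseFactor();
            if (d == 0) { failed = 1; return 0; }
            v /= d;
        }
        else return v;
    }
}

static unsigned long ParseExpr(void)
{
    unsigned long v = ParseTerm();
    for (;;) {
        SkipSpaces();
        if (*p == '+') { p++; v += ParseTerm(); }
        else if (*p == '-') { p++; v -= ParseTerm(); }
        else return v;
    }
}

/* Returns 1 and stores the value on success, 0 on an error. */
int EvalExpr(const char *text, unsigned long *result)
{
    p = text;
    failed = 0;
    *result = ParseExpr();
    SkipSpaces();
    return !failed && *p == '\0';
}
```

Even this toy version needs three mutually recursive functions, and it still lacks symbols, forward references and decent error messages, which is the part a parser generator saves you from writing.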
If you want to do this manually, create a table of all your mnemonics that also records the addressing modes each one supports and, for each addressing mode, the number of bytes the instruction takes and its opcode. Something like this:
    typedef enum {
        Implied  = 1,
        Direct   = 2,
        Extended = 4,
        Indexed  = 8
        /* etc. */
    } AddressingMode;

    /* For a 4-char mnemonic, this struct will be 5 bytes. A typical small processor
     * has on the order of 100 instructions, making this table come in at ~500 bytes
     * when all is said and done.
     * The time to binary search that will be, worst case, 8 compares on the mnemonic.
     * I claim that I/O will take way more time than the lookup.
     * You will also need a table and/or a routine that, given a mnemonic and
     * addressing mode, will give you the actual opcode.
     */
    typedef struct {
        char Mnemonic[4];      /* not NUL-terminated when all 4 chars are used */
        char AddressingModes;  /* bitmask of AddressingMode values */
    } InstructionInfo;

    /* order them by mnemonic */
    static InstructionInfo instrs[] = {
        { {'A', 'D', 'D', '\0'}, Direct|Extended|Indexed },
        { {'A', 'D', 'D', 'A'},  Direct|Extended|Indexed },
        { {'S', 'U', 'B', '\0'}, Direct|Extended|Indexed },
        { {'S', 'U', 'B', 'A'},  Direct|Extended|Indexed }
        /* etc. */
    };

    static int nInstrs = sizeof(instrs) / sizeof(InstructionInfo);

    InstructionInfo *GetInstruction(char *mnemonic)
    {
        /* binary search instrs[0..nInstrs-1] for mnemonic */
    }

    int InstructionSize(AddressingMode mode)
    {
        switch (mode) {
        case Implied: return 1;
        /* etc. */
        }
    }
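As a rough sketch of the lookup side, the binary search and the "mnemonic plus addressing mode gives opcode" routine could look something like the following. This is one possible layout, not the only one: the `OpcodeEntry` struct, the `FindInstruction` and `GetOpcode` names, and above all the opcode values are placeholders I made up for illustration, not taken from any real processor.

```c
#include <stddef.h>
#include <string.h>

enum { Implied = 1, Direct = 2, Extended = 4, Indexed = 8 };

typedef struct {
    char Mnemonic[4];         /* padded with '\0' if shorter than 4 chars */
    unsigned char Modes;      /* bitmask of supported addressing modes */
    unsigned char Opcode[4];  /* slots: Implied, Direct, Extended, Indexed */
} OpcodeEntry;

/* Must stay sorted by mnemonic for the binary search to work.
 * The opcode bytes are made-up placeholders. */
static const OpcodeEntry table[] = {
    { {'A', 'D', 'D', '\0'}, Direct|Extended|Indexed, {0x00, 0x10, 0x11, 0x12} },
    { {'A', 'D', 'D', 'A'},  Direct|Extended|Indexed, {0x00, 0x20, 0x21, 0x22} },
    { {'S', 'U', 'B', '\0'}, Direct|Extended|Indexed, {0x00, 0x30, 0x31, 0x32} },
    { {'S', 'U', 'B', 'A'},  Direct|Extended|Indexed, {0x00, 0x40, 0x41, 0x42} }
};

/* Binary search over the sorted table; returns NULL if not found. */
const OpcodeEntry *FindInstruction(const char *mnemonic)
{
    size_t lo = 0, hi = sizeof table / sizeof table[0];

    /* strncmp only looks at 4 chars, so reject longer names up front. */
    if (strlen(mnemonic) > 4) return NULL;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strncmp(mnemonic, table[mid].Mnemonic, 4);
        if (cmp == 0) return &table[mid];
        if (cmp < 0) hi = mid; else lo = mid + 1;
    }
    return NULL;
}

/* Returns the opcode byte, or -1 if the mnemonic is unknown or does not
 * support the requested mode. `mode` must be exactly one mode bit. */
int GetOpcode(const char *mnemonic, int mode)
{
    const OpcodeEntry *e = FindInstruction(mnemonic);
    int slot;

    if (e == NULL || (e->Modes & mode) == 0) return -1;
    switch (mode) {
    case Implied:  slot = 0; break;
    case Direct:   slot = 1; break;
    case Extended: slot = 2; break;
    case Indexed:  slot = 3; break;
    default:       return -1;
    }
    return e->Opcode[slot];
}
```

Storing the opcodes next to the mode mask costs a few more bytes per entry than the 5-byte struct, but it keeps validation and code generation down to a single lookup.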
That gives you a list of every instruction, each of which in turn carries the list of addressing modes it supports.
So, your parser will become something like this:
    char *line = ReadLine();
    int nextStart = 0;

    int labelLen;
    char *label = GetLabel(line, &labelLen, nextStart, &nextStart); // may be empty

    int mnemonicLen;
    char *mnemonic = GetMnemonic(line, &mnemonicLen, nextStart, &nextStart); // may be empty

    if (IsOpcode(mnemonic, mnemonicLen)) {
        AddressingModeInfo info = GetAddressingModeInfo(line, nextStart, &nextStart);
        if (IsValidInstruction(mnemonic, info)) {
            GenerateCode(mnemonic, info);
        }
        else {
            // pseudocode: report the error however your language does it
            throw new BadInstructionException(mnemonic, info);
        }
    }
    else if (IsDirective(mnemonic, mnemonicLen)) {
        /* etc. */
    }
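For the first step of that loop, splitting a line into its fields, a minimal sketch might look like this. It assumes a label is simply a first token ending in `:`, that fields are separated by spaces or tabs, and that everything after the mnemonic is operand text; `LineFields`, `SplitLine` and `MAX_FIELD` are names invented for this example, and it does no comment stripping or length checking beyond truncation.

```c
#include <ctype.h>
#include <string.h>

#define MAX_FIELD 32

typedef struct {
    char label[MAX_FIELD];     /* empty if the line has no label */
    char mnemonic[MAX_FIELD];  /* instruction or directive; may be empty */
    char operands[MAX_FIELD];  /* rest of the line, trimmed; may be empty */
} LineFields;

/* Copies at most MAX_FIELD-1 chars from start into dst and terminates it. */
static void CopyField(char *dst, const char *start, size_t len)
{
    if (len >= MAX_FIELD) len = MAX_FIELD - 1;
    memcpy(dst, start, len);
    dst[len] = '\0';
}

static const char *SkipBlanks(const char *s)
{
    while (*s == ' ' || *s == '\t') s++;
    return s;
}

static const char *TokenEnd(const char *s)
{
    while (*s != '\0' && !isspace((unsigned char)*s)) s++;
    return s;
}

void SplitLine(const char *line, LineFields *out)
{
    const char *s = SkipBlanks(line);
    const char *e = TokenEnd(s);
    size_t n;

    out->label[0] = out->mnemonic[0] = out->operands[0] = '\0';

    /* A first token ending in ':' is treated as a label. */
    if (e > s && e[-1] == ':') {
        CopyField(out->label, s, (size_t)(e - s) - 1);
        s = SkipBlanks(e);
        e = TokenEnd(s);
    }
    CopyField(out->mnemonic, s, (size_t)(e - s));

    /* Everything after the mnemonic is operand text; trim trailing whitespace. */
    s = SkipBlanks(e);
    n = strlen(s);
    while (n > 0 && isspace((unsigned char)s[n - 1])) n--;
    CopyField(out->operands, s, n);
}
```

The mnemonic field can then be looked up in the instruction table, and the operand text handed to whatever works out the addressing mode.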