Search words in an unextended paragraph?

I use python to create Caesar's encryption decryptor, it works and decrypts the already encrypted word. However, he shows all his brute force decryption attempts, for example, "HELLO", encrypted with a key of 3, is KHOOR. The results obtained after decryption are "KHOORJGNNQIFMMPHELLOGDKKNFCJJMEBIILDAHHKCZGGJBYFFIAXEEHZWDDGYVCCFXUBBEWTAADVSZZCURYYBTQXXASPWWZROVVYQNUUXPMTTWOLSSVNKRRUMJQQTLIPPS" I wonder if there is a way to use a dictionary with Python to find the English words in the output, or I can improve my code to print only certain English words. I apologize if this was asked earlier, I searched around and could not find the right solution.

+4
source share
2 answers
englishWords = ['HELLO', 'ME', 'AXE', 'FOO', 'BAR', 'BAZ'] #and many more
cypher = 'KHOORJGNNQIFMMPHELLOGDKKNFCJJMEBIILDAHHKCZGGJBYFFIAXEEHZWDDGYVCCFXUBBEWTAADVSZZCURYYBTQXXASPWWZROVVYQNUUXPMTTWOLSSVNKRRUMJQQTLIPPS'

for word in englishWords:
    if word not in cypher: continue
    print('Found "{}"'.format(word))

This gives:

Found "HELLO"
Found "ME"
Found "AXE"

If it concerns viewing, if the key to decrypt the text is correct, that is, if the result can be English words, I would not look for words, but try to find clusters inside the result that does not match the assembler of the English syllable.


Here's a very naive implementation scan for letter frequencies:

#! /usr/bin/python3

plain = 'Z RD NFEUVIZEX ZW KYVIV ZJ R NRP KF LJV R UZTKZFERIP NZKY GPKYFE KF JVRITY WFI RE VEXCZJY NFIU ZE KYZJ FLKGLK FI TRE Z ZDGIFMV DP TFUV KF FECP GIZEK FLK BEFNE VEXCZJY NFIUJ. RGFCFXZVJ ZW KYZJ YRJ SVVE RJBVU SVWFIV, Z JVRITYVU RIFLEU REU TFLCUE\'K JVVD KF WZEU KYV IZXYK KYZEX.'.upper ()

freqs = {'E': 12.7, 'T': 9.1, 'A': 8.2, 'O': 7.5, 'I': 7.0}

def cypher(text, key):
    return ''.join(chr((ord(c) - ord('A') + key) % 26 + ord('A')) if 'A' <= c <= 'Z' else c for c in text)


def crack(text):
    length = len(text)
    best = 100000
    bestMatch = ''
    for key in range(26):
        cand = cypher(text, key)
        quality = 0
        for l, c in {letter: sum(1 for c in cand if c == letter) for letter in 'ETAOI'}.items():
            quality += (c / length - freqs[l]) ** 2
        if quality < best:
            best = quality
            bestMatch = cand
    return bestMatch

print(crack(plain))

Here are three examples:

Input: TQ ESTD TD LMZFE DPPTYR, TQ ESP VPJ QZC OPNJASPCTYR L EPIE TD ESP NZCCPNE ZYP, T.P. TQ ESP CPDFWE XTRSE MP PYRWTDS HZCOD, T HZFWOY'E WZZV QZC HZCOD, MFE ECJ EZ QTYO NWFDEPCD TYDTOP ESP CPDFWE HSTNS OZ YZE NZXAWJ HTES ESP PYRWTDS DJWWLMWP LALCEFD.

Output: IF THIS IS ABOUT SEEING, IF THE KEY FOR DECYPHERING A TEXT IS THE CORRECT ONE, I.E. IF THE RESULT MIGHT BE ENGLISH WORDS, I WOULDN'T LOOK FOR WORDS, BUT TRY TO FIND CLUSTERS INSIDE THE RESULT WHICH DO NOT COMPLY WITH THE ENGLISH SYLLABLE APARTUS.

Input: KWSJUZAFY XGJ WFYDAKZ OGJVK AF S TDGUC GX MFVAXXWJWFLASLWV LWPL DACW LZSL AK UWJLSAFDQ HGKKATDW, SFV VGAFY AL WXXAUAWFLDQ AK S YWFMAFWDQ AFLWJWKLAFY HJGTDWE. TML AL'K HJGTDWESLAU XGJ DGLK GX JWSKGFK, BMKL GFW GX OZAUZ AK LZSL QGMJ WFUJQHLWV LWPL ESQ AFUDMVW LWPL LZSL JSFVGEDQ ZSHHWFK LG XGJE SF WFYDAKZ OGJV UGEHDWLWDQ TQ UZSFUW.

Output: SEARCHING FOR ENGLISH WORDS IN A BLOCK OF UNDIFFERENTIATED TEXT LIKE THAT IS CERTAINLY POSSIBLE, AND DOING IT EFFICIENTLY IS A GENUINELY INTERESTING PROBLEM. BUT IT PROBLEMATIC FOR LOTS OF REASONS, JUST ONE OF WHICH IS THAT YOUR ENCRYPTED TEXT MAY INCLUDE TEXT THAT RANDOMLY HAPPENS TO FORM AN ENGLISH WORD COMPLETELY BY CHANCE.

Input: QZC PILXAWP, UFDE ESP EPIE JZF'GP AZDEPO SPCP TYNWFOPD SPWW, TQ, WTA, WZR, LDA LYO ACZMLMWJ ZESPCD. JZF NZFWO ECTX OZHY ESP LWEPCYLETGPD MJ ZYWJ DPLCNSTYR QZC HZCOD ESP DLXP WPYRES LD JZFC ELCRPE HZCO, LYO ZYWJ QZC HZCOD HTES ESP DLXP WPEEPC ALEEPCY. MFE ESLE'D CPLWWJ BFTEP L WZE ZQ HZCV EZ RPE LCZFYO ESP QLNE ESLE JZFC TYTETLW ZFEAFE SLD L WZE ZQ FDPWPDD OLEL TY TE.

Output: FOR EXAMPLE, JUST THE TEXT YOU'VE POSTED HERE INCLUDES HELL, IF, LIP, LOG, ASP AND PROBABLY OTHERS. YOU COULD TRIM DOWN THE ALTERNATIVES BY ONLY SEARCHING FOR WORDS THE SAME LENGTH AS YOUR TARGET WORD, AND ONLY FOR WORDS WITH THE SAME LETTER PATTERN. BUT THAT REALLY QUITE A LOT OF WORK TO GET AROUND THE FACT THAT YOUR INITIAL OUTPUT HAS A LOT OF USELESS DATA IN IT.

And here is the last example without spaces or punctuation marks:

Input: ZRDLJZEXGPKYFEKFSLZCURTRVJRITZGYVIUVTIPGKVIZKNFIBJREUUVTIPGKJKYVRCIVRUPVETIPGKVUNFIUYFNVMVIZKJYFNJRCCZKJ SILKVWFITVUVTIPGKZFERKKVDGKJWFIVORDGCVYVCCFVETIPGKVUNZKYRBVPFW3ZJBYFFI

Output: IAMUSINGPYTHONTOBUILDACAESARCIPHERDECRYPTERITWORKSANDDECRYPTSTHEALREADYENCRYPTEDWORDHOWEVERITSHOWSALLITS BRUTEFORCEDECRYPTIONATTEMPTSFOREXAMPLEHELLOENCRYPTEDWITHAKEYOF3ISKHOOR
+4
source

, , , , - . , , , , .

, , , HELL, IF, LIP, LOG, ASP , , . , , , . , , .

, , :

  • (/usr/share/dict/words ).
  • , Python.
  • , Python.

, , , .

+3

All Articles