FCSC Quals 2022 - Perfect Cell (reverse)

May 07, 2022

Challenge description

cell

Votre ami féru de "homebrews" vous a fourni ce qu'il qualifie de "jeu de combat génial en réseau !" pour votre console "moddée", non sans un petit sourire en coin. Et vous êtes en effet assez dubitatif : son jeu ne semble pas faire grand chose ...

Connaissant les facéties de votre ami, et armé de votre curiosité, vous vous lancez dans l'analyse de ce qu'il vous a fourni en espérant lui prouver que vous relevez le challenge !

The attached binary file is a weird elf file named perfect-cell.self.

Recon

Running file on the binary does not tell us much

$ file perfect-cell.self
perfect-cell.self: data

However by viewing it in a hex viewer we notice a ELF header.

00000000  53 43 45 00 00 00 00 02  00 01 00 01 00 00 03 f0  |SCE.............|
00000010  00 00 00 00 00 00 0a 80  00 00 00 00 00 37 53 e8  |.............7S.|
00000020  00 00 00 00 00 00 00 03  00 00 00 00 00 00 00 70  |...............p|
00000030  00 00 00 00 00 00 00 90  00 00 00 00 00 00 00 d0  |................|
00000040  00 00 00 00 00 17 53 d0  00 00 00 00 00 00 02 90  |......S.........|
00000050  00 00 00 00 00 00 03 90  00 00 00 00 00 00 03 a0  |................|
00000060  00 00 00 00 00 00 00 70  00 00 00 00 00 00 00 00  |.......p........|
00000070  10 10 00 00 01 00 00 03  01 00 00 02 00 00 00 04  |................|
00000080  00 01 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  7f 45 4c 46 02 02 01 66  00 00 00 00 00 00 00 00  |.ELF...f........|

And by running file again we see it is a powerpc executable.

$ file file.elf
ELF 64-bit MSB executable, 64-bit PowerPC or cisco 7500, Unspecified or Power ELF V1 ABI, version 1, statically linked, stripped

Having done some reverse-engineering on the PSP, I knew that SCE was something sony related (present in the syscall names) and that .self are encrypted/compressed executables for sony consoles. The only PowerPC based sony console is the PS3.

To be able to play around with the program, I used the rpcs3 emulator (which can be found here). When launching the binary, we are met with the following screen:

main_screen

And not much happens beyond that. As it speaks about a “network shell” we can run netstat and check whether any port is open:

$ netstat -lpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:1337            0.0.0.0:*               LISTEN      12024/./bin/rpcs3   
tcp        0      0 127.0.0.1:2345          0.0.0.0:*               LISTEN      12024/./bin/rpcs3
[...]

Lo and behold, a very sus port 1337 is open. By connecting to it with netcat we are greeted by the following message:

$ nc localhost 1337
    ____            ____          __     ______     ____
   / __ \___  _____/ __/__  _____/ /_   / ____/__  / / /
  / /_/ / _ \/ ___/ /_/ _ \/ ___/ __/  / /   / _ \/ / / 
 / ____/  __/ /  / __/  __/ /__/ /_   / /___/  __/ / /  
/_/    \___/_/  /_/  \___/\___/\__/   \____/\___/_/_/   
                                                        
========================================================
Welcome to Pefect Cell!

Please provide a correct input ...

This is where we will enter our flag !

Reverse-Engineering the ELF

More than simply providing emulation, rpcs3 also provides some nifty developer oriented utilities, such as dumping the unencrypted ELF file on disk (Utilities > Decrypt PS3 Binaries).

Loading the ELF as is in a disassembler would work. But to be as comfortable as possible I used the Ps3GhidraScripts which provides Ghidra scripts to resolve PS3 syscalls.

I first started to look at the cross-references to the accept function.

accept_crossref

Which lands in a wrapper function, that I renamed calls_accept:

calls_accept

This function is called in only one place, which looks like the main client accept loop for the server:

bind_listing_accept

The rest of the function gives us some insight as to the form or our input. For example the loop only exits if, (1) the input size is 0x67, (2) the first 5 bytes match the value ‘FCSC{‘, (3) the last char of the input (not counting the ‘\n’) is ‘}’.

sanity_check

If everything matches, this function returns and we land in the function at address 0x11350. The program continues by setting up some stuff called SPUs, spu threads, groups, writing to snr (???), and then there is a memcmp of size 0x60 (which is the size of our input minus the size of ‘FCSC{}’) and two pointers.

spu_init

We can already guess that one of them is our input, and the other our target. To check the assumption, I configured the emulator to fully interpret the CPU (instead of using a JIT), and I added the following code to rpcs3/rpcs3/Emu/Cell/PPUThread.cpp:

void ppu_thread::exec_task()
{
// ...
        if (u64{cia} == 0x11628)
        {
            uint8_t user_input[0x60] = {};
            uint8_t cst_check[0x60] = {};

            std::printf("user input ptr: 0x%lx cst_check_ptr: 0x%lx\n", gpr[30], gpr[4]);

            std::memcpy(user_input, vm::base(gpr[30]), 0x60);
            std::memcpy(cst_check, vm::base(gpr[4]), 0x60);

            std::printf("user_input: 0x%lx\n", gpr[30]);

            for (unsigned i = 0; i < 0x60; i++)
            {
                std::printf("%02x ", user_input[i]);
            }

            std::printf("\n");
            std::printf("cst_check: 0x%lx\n", gpr[4]);

            for (unsigned i = 0; i < 0x60; i++)
            {
                std::printf("%02x ", cst_check[i]);
            }

            std::printf("\n");
        }

If we input a flag with all ‘A’s we get the following output:

user input ptr: 0x30000100 cst_check_ptr: 0x10001fd8
user_input: 0x30000100
a4 a4 a4 a4 a4 a4 a4 a4 a4 a4 a4 a4 a4 a4 a4 a4 f3 c0 e2 4d db 77 4d 1e f3 c0 e2 4d db 77 4d 1e cf 19 f5 d2 ed f7 71 0c cf 19 f5 d2 ed f7 71 0c af af af af af af af af af af af af af af af af 49 3b ac 1e 49 3b ac 1e 49 3b ac 1e 49 3b ac 1e 5e c8 5e c8 5e c8 5e c8 5e c8 5e c8 5e c8 5e c8 
cst_check: 0x10001fd8
05 d2 5a cb 9f 61 24 2e 24 90 c3 0e c3 dc 1b a6 92 dc 21 66 bb a1 3c 99 92 0a 21 75 24 9d 5d 6a c9 b2 58 0f 30 87 5b 47 46 07 fa 45 38 4d e8 4c ab 9d be 5b a7 cd a7 7b 1f 04 fe 86 78 5d cb 49 1b 8b d3 82 5a 7e cb a7 2a 2a 52 bc 73 77 b9 79 78 3a ae c4 6b 19 e8 a5 a7 75 fb 46 05 d3 82 b5 

We can already notice that our input was modified in a seemlingly linearish fashion for parts of it. And apart from a few places, it seems to have been modified in 16 bytes chunks (6 chunks of 16). Now we only need to reverse the algorithm used for the transformation… wait, what was that SPU thing again ??

Cell architecture and the SPU(s)

The PS3’s CPU is called Cell and it is known for being exotic and complex. It is architectured in the following way:

cell_architecture

First there is the main unit, the PowerPC Processing Element (or PPU). The code we analyzed so far concerns this elements. Then there are 7 powerful coprocessors called Synergistic Processor Unit (or SPU), which run on a different architecture and are orchestrated by the PPU. They act as accelerators and custom code can be uploaded to them. A very good resource for understanding more about them is the very good blogpost made by Rodrigo Copetti (https://www.copetti.org/writings/consoles/playstation-3/).

Now let’s take a look back at the code between our read and the memcpy check:

spu_setup

In (1) 6 SPUs are initialized, then in (2) an image (spu code) is loaded, then finally the SPU threads are configured with their index and our input. As our input is 96 bytes long, and as 6 SPU’s are spawned, we can guess that each of them will separately process 16 byte chunks of input.

By again injecting code in the PPU interpreter, the SPU image can be recovered during the call to sys_spu_image_import.

$ file spu_bin.elf 
spu_bin.elf: ELF 32-bit MSB executable, Cell SPU, version 1 (SYSV), statically linked, stripped

Reversing the SPUs code

Even though the SPU architecture is exotic, some reverse-engineering tools have been extended to support it. For reverse-engineering the code I used the GhidraSPU project to be able to analyze it in ghidra.

Before delving into the analysis of the code, I wanted to understand what kind of architecture I was up against. A particular sentence in the architecture manual sent chills down my spine:

The SPU architecture defines a set of 128 general-purpose registers (GPRs), each of which contains 128
data bits. Registers are used to hold fixed-point and floating-point data. Instructions operate on the full width
of the register, treating the register as multiple operands of the same format.

simd_code

Reverse-Engineering SIMD heavy code (at least for me) is extremely painful as the data can be broken up into chunks of different sizes, which can be interpreted differently from an instruction to the next. To add insult to injury this architecture has 128 SIMD REGISTERS. But anyway, a flag is a flag and now we are in too deep to go back.

Descent into madness

Now begs the question of the strategy to take to solve this challenge. As seen previously, on at least parts of the input, the transformations seem at first glance simple enough and could be simple substitutions. If this is the case a bruteforcing approach could be envisioned where we would bruteforce one by one the bytes of the input. However this would require repurposing the code of the emulator to extract only the SPU emulation code, which would incur significant engineering work. However by doing this we could completely forget about the abomination that is the SPU and treat it as a blackbox.

The other approach would be the classical one of fully reverse engineering the SPU code and try to inverse the algorithms transforming our flag (or bruteforcing the algorithm if needed). This requires looking at the code, and from a cursory glance this is a big no no. But in the end, I selected this method.

insanity

I spent about 15h hours reading the machine code line by line and reimplemented every relevant instruction in python. I do not recommend.

Methodology

As seen previously, 6 SPUs are launched to transform our input. To be able to follow the transformation of the input data and the operations, I chose to debug the code through execution traces. This can be done by setting the SPU emulation mode to interpreter and by injecting the following code in rpcs3/rpcs3/Emu/Cell/SPURecompiler.cpp:

std::ofstream* streams[8] = {};

void trace_open()
{
    if (!streams[0])
    {
        for (unsigned i = 0; i < 8; i++)
            streams[i] = new std::ofstream(fmt::format("spu_%u.trace", i));
    }
}


void spu_recompiler_base::old_interpreter(spu_thread& spu, void* ls, u8* /*rip*/)
{
// ...
        trace_open();
        *streams[spu.index] << fmt::format("--- pc: 0x%08x ---\n", spu.pc);
        std::string regs;

        for (unsigned i = 0; i < 128; i++)
        {
            *streams[spu.index] << fmt::format("r%u=", i);
            uint8_t* number = reinterpret_cast<uint8_t*>(&spu.gpr[i]);

            for (unsigned j = 0; j < 16; j++)
                *streams[spu.index] << fmt::format("%02x", number[15-j]);

            *streams[spu.index] << '\n';
        }
        streams[spu.index]->flush();

This gives a trace of the program counter as well as all gprs, which is enough for our purpose. Using execution traces is great for reverse engineering as it allows to record the full execution of a program, then go back and forth to any point in the execution which greatly speeds up the process of following the data and understanding the behavior of instructions.

input[0..15]

The function treating the first chunk starts at 0x5d0 and is one of the simple ones.

input_0_compute

In (1), the input is modified by substracting a value from it.

In (2), the bytes of the input are rotated using the rotqby instruction. This instruction operates on the level of bytes (rotqby of 2 means rotating by 16 bits). Afterwards a shuffling mask is computed using cbx and it is applied with the shufb instruction. As these rotations and shuffles don’t depend on the input, using the execution trace we can simply get the value before and after the loop to extract the shuffling constants.

In (3) there is this time a rotation at the bit level with the instruction rotqbii.

In (4) things get interesting. The shuffled and rotated bytes of the input are used as an index in a 256 bytes buffer of constants. This is done in the following way:

ai off, base, input_char
lxq cst, base, input_char
ai off, off, 0xd
rotqby s, cst, off
shufb result, s, ..., ...

In 1 the extracted input character value is added to the address of the array into an accumulator. In 2 the lxq instruction is used to load a 16 bytes constant from the array, do note that the offset base+input_char is aligned to 16 bytes before the load. In 3, 12 is added to the original offset. Rotqby is used with this offset to do a byte rotation on the constant array. Finally in 5 the shuffle instruction is used to extract a byte from the rotated constant to be placed in the resulting output (depending on the mask the shufb instruction can be used for shuffling or specific byte extraction).

From there it is not apparently clear how this is reversible, but by looking at the constant array itself we can devise a plan:

0x32, 0xa5, 0x2b, 0xdf, 0xac, 0xd9, 0xcd, 0xc1, 0xef, 0x3a,
0x84, 0x2c, 0x53, 0x3c, 0xff, 0x9a, 0xeb, 0x16, 0xd3, 0x3f,
0xbc, 0xc4, 0x31, 0x47, 0x12, 0x96, 0x75, 0x90, 0x8a, 0xc5,
0xbf, 0x43, 0x7a, 0x64, 0xb5, 0xc6, 0xaa, 0xa1, 0x22, 0x48,
0xf9, 0x99, 0x08, 0x34, 0x65, 0xa3, 0x98, 0x44, 0xa0, 0xe0,
0x6b, 0xb6, 0xbe, 0xce, 0xfc, 0x03, 0xa6, 0xa4, 0x35, 0x5e,
0x11, 0xd7, 0x76, 0x05, 0x0e, 0xf4, 0xa2, 0xae, 0x70, 0x6f,
0x78, 0x9c, 0x09, 0x4c, 0x17, 0xf0, 0xd8, 0x56, 0xfd, 0x15,
0x71, 0xf3, 0xfb, 0x1d, 0xb7, 0xa7, 0x66, 0x9f, 0xb3, 0x79,
0xc3, 0xb4, 0x5d, 0x5b, 0xab, 0xd6, 0x94, 0x1e, 0xc0, 0x0c,
0x9d, 0x04, 0x7f, 0xe4, 0x6a, 0x4b, 0x9b, 0xe8, 0x0d, 0x86,
0x82, 0xea, 0x52, 0xdb, 0x51, 0xd4, 0xb0, 0xb9, 0xa8, 0x61,
0x6d, 0xe6, 0x97, 0x7c, 0xad, 0xcb, 0xda, 0x88, 0x68, 0x1c,
0x07, 0x00, 0x30, 0x6c, 0x50, 0x59, 0x3e, 0x40, 0x20, 0xde,
0xf1, 0xf5, 0x0f, 0xa9, 0x5c, 0x4d, 0x89, 0x21, 0xbd, 0xe2,
0x57, 0x0b, 0x67, 0xc7, 0x87, 0xc2, 0xd2, 0xdc, 0x18, 0x93,
0x8e, 0x36, 0x77, 0x8c, 0xe5, 0x4e, 0x9e, 0x39, 0x60, 0x80,
0x73, 0xfa, 0xfe, 0x0a, 0x81, 0xed, 0xb1, 0x2f, 0xf6, 0xba,
0x58, 0xee, 0x85, 0x45, 0x38, 0xd5, 0x4a, 0xe7, 0x5f, 0x2e,
0xf7, 0x91, 0x2d, 0x95, 0x7d, 0xb8, 0x7e, 0x1f, 0x49, 0x02,
0x63, 0xe3, 0x2a, 0xe1, 0x10, 0xca, 0x06, 0x27, 0x37, 0x69,
0x3b, 0x83, 0x8f, 0x92, 0x13, 0x55, 0xcc, 0x29, 0x54, 0x26,
0x14, 0x1a, 0x1b, 0x41, 0xd1, 0x42, 0x46, 0xec, 0x62, 0x74,
0xd0, 0x28, 0x3d, 0x8b, 0xf8, 0xcf, 0x7b, 0xb2, 0x4f, 0x6e,
0xc8, 0xdd, 0x33, 0x23, 0xf2, 0xaf, 0x72, 0x24, 0xe9, 0x01,
0x19, 0x8d, 0xbb, 0x25, 0xc9, 0x5a

This 256 bytes array contain only unique values from 0 to 255. Which means that for any output byte, we can find the original with some tricks ! By taking the output byte we can find back the correct 16 byte value in which it appears, which gives us the offset from the start of the array, minus the alignment. By then computing the shift in position of our byte compared to the same in the original array, we get the missing nibble in the offset ! By substracting the offset we computed from the base address of the array, we get the original input value.

Finally in (5), the input is written back to the main processor.

The algorithm can be reversed with the following piece of code:

def input_0_rev(value):
    input_0_unshuffle = [5, 9, 8, 4, 1, 0, 10, 12, 11, 15, 14, 7, 13, 2, 3, 6]
    result = [0]*16
    r15 = [0]*16

    # Invert last loop
    for i in range(16):
        r22_3 = value[16 - i - 1]
        r19_off = 0

        for j in range(0, len(constant_table_1730), 16):
            cst = constant_table_1730[j:j+16]

            if r22_3 in cst:
	    	# j is the offset from the base of the array, aligned on 16 bytes
                r19 = cst.copy()
                r19_off = j
                break
        else:
	    # Should not happen
            raise Exception("")

	# Add the shift to the offset which gives us the original value
        r19_off += r19.index(r22_3)
        r15[16 - i - 1] = r19_off

    r12 = r15
    inv_r2 = ilh(0xeeef) # Add the inverse of the constant
    r5 = ah(r12, inv_r2)
    r5 = inv_rotqbii(r5, 5) # Inverse the bit shift

    result = [0]*16

    # Unshuffle (first loop)
    for i, e in enumerate(input_0_unshuffle):
        result[i] = r5[e]

    return result

Running this with the first 16 bytes of the expected bytes (the constant array in the memcmp), we get the string 30gbt9RzIif5L_s7.

input[16..31]

This is the part of the challenge that tooks the most time to solve. The checking function starts of very innocently with only a few instructions (at 0x6ac):

input_1_compute

This does a few computations and then calls the sub_fe0 function with the transformed input as an argument. The called function is simply massive to unpack:

sub_fe0

It does not look scary in the decompiler, however it is only composed of a series of shifts, loads, rotations, using at least 20 different registers moving data around. I spent about a full day on this function reimplementing it in python all the while comparing with the execution traces checking for correctness. In the end the algorithm is similar to the first function, except that it processes 8 bytes of input at a time:

def sub_fe0_rev(r3):
    def bruteforce_char(rev_reg):
        # Find the only non null input in reg
        for i in range(len(rev_reg)):
            if rev_reg[i] != 0:
                idx = i
                break
        else:
            raise Exception("noop")

        # Bruteforce char
        for j in range(256):
            constant = lxq_1730_i(j)
            shift = (0x1730 + j + 0xd) & 0xf
            res = rotqby(constant, shift)

            if res[idx] == rev_reg[idx]:
                return j

        raise Exception("nop")

    res = []

    for i in range(2):
        chunk = [0]*8
        value = r3[i*8:(i*8)+8]

        # input_0
        r63 = [0]*16
        r63[0] = value[0]
        r62 = shrb(r63, 0xf)
        r46 = rotqby(r62, 0x8)
        r42 = shlb(r46, 0x4)
        chunk[0] = bruteforce_char(r42)

        # input_1
        r74 = [0]*16
        r74[6] = value[6]
        r73 = shrb(r74, 0x9)
        r59 = rotqby(r73, 0x8)
        r54 = shlb(r59, 0x4)
        chunk[6] = bruteforce_char(r54)

        # input_2
        r4 = [0]*16
        r4[5] = value[5]
        r5 = shrb(r4, 0xa)
        r64 = rotqby(r5, 0x8)
        r60 = shlb(r64, 0x4)
        chunk[5] = bruteforce_char(r60)

        # input_3
        r16 = [0]*16
        r16[4] = value[4]
        r2 = shrb(r16, 0xb)
        r72 = rotqby(r2, 0x8)
        r68 = shlb(r72, 0x4)
        chunk[4] = bruteforce_char(r68)

        # input_4
        r55 = [0]*16
        r55[7] = value[7]
        r51 = shlb(r55, 0x4)
        chunk[7] = bruteforce_char(r51)

        # input_5
        r20 = [0]*16
        r20[3] = value[3]
        r14 = shrb(r20, 0xc)
        r78 = rotqby(r14, 0x8)
        r75 = shlb(r78, 0x4)
        chunk[3] = bruteforce_char(r75)

        # input_6
        r18 = [0]*16
        r18[2] = value[2]
        r24 = shrb(r18, 0xd)
        r10 = rotqby(r24, 0x8)
        r8 = shlb(r10, 0x4)
        chunk[2] = bruteforce_char(r8)

        # input_7
        r12 = [0]*16
        r12[1] = value[1]
        r26 = shrb(r12, 0xe)
        r11 = rotqby(r26, 0x8)
        r7 = shlb(r11, 0x4)
        chunk[1] = bruteforce_char(r7)

        res += chunk

    return res

The top function can then be inverted as well with the following code:

def input_1_rev(value):
    rev_value = sub_fe0_rev(value)
    r60 = u128_to_reg(0xfeedbabedeadbeeffeedbabedeadbeef)
    r65 = u128_to_reg(0x04050607c0c0c0c00c0d0e0fc0c0c0c0)
    r63 = u128_to_reg(0x00000000ffffffff00000000ffffffff)

    m = inv_sfx(r63, r60, rev_value)
    m = inv_rotqbii(m, 6)
    m = rotqby(m, 7)

    return m

By passing the second 16 bytes slice of the constant to this function we get the string 0cRm7gXHm_R-8WkK.

input[32..47]

After ~7h spent suffering on the previous function alone, the third one was a godsend (at 0x6e8).

input_2_compute

It simply xors the passed input with a constant, calls the previously reversed horrible function, and forwards the result to the main processor. As such only the xor needs to be reversed:

def input_2_rev(value):
    rev_value = sub_fe0_rev(value)
    cst = u128_to_reg(0xaabbccddeeff1122aabbccddeeff1122)
    return xor(rev_value, cst)

Passing the third 16 bytes slice of the constant array gives the string TVxSjeL5H9gjVnzk.

input[48..63]

This function at 0x6f8 follows the models of the previous ones, only again differing in the number of bytes processed at a time.

input_3_compute

In (1) we have the usual shuffle. In (2) some bit and byte level rotations as well as a not. In (3) the transformation of input bytes in a loop executing 6 times. And finally in (4) we have the same thing. I don’t know if it is a compiler shenanigan or voluntary, but the work done outside the last loop looks identical to the one done inside.

After reimplementing the parts (1) to (3) of the function, I tried to add two iterations two my second loop (from 6 to 8), and after comparing with execution traces I could confirm that this simplification was correct. The following python code can reverse the fourth step:

def input_3_rev(value):
    def bruteforce_char(rev_reg):
    	# ...

    rev_input = [0]*16

    for i in range(8):
        r16 = [0]*16
        r19 = [0]*16

        r16[2] = value[i*2]
        r19[3] = value[(i*2)+1]

        r14 = shrhi(r16, 0x8)
        rev_input[i*2] = bruteforce_char(r14)
        rev_input[(i*2)+1] = bruteforce_char(r19)

    rev_input = nor(rev_input)
    rev_input = inv_rotqbii(rev_input, 3)
    rev_input = rotqby(rev_input, 16-5)

    # Now unshuffle
    shuffle = [14, 15, 12, 13, 6, 7, 2, 3, 0, 1, 4, 5, 8, 9, 10, 11]
    result = [0]*16

    for i, e in enumerate(shuffle):
        result[i] = rev_input[shuffle[i]]

    return result

Passing it the fourth slice of our constant array gives the string SJBg3y4prWG4tU_5.

input[64..79]

input_4_compute

The fifth function at 0x250 starts of with simple xors and shuffles. The shuffled input is then passed to a function and the result is sent to the main program. Remembering the dreaded function at 0xfe0 I did not feel so good at first, but in the end it does not look so bad:

sub_dd0

The loop in (2) is executed twice and processes 4 bytes of input. As the parts (1) and (3) are pretty similar to the loop body, we can make the assumption that this is the same outlining issue as the previous function. After reimplementing in python, I confirmed that again, simplifying the code again by adding two iterations to the loop body gave the exact same result. The code to reverse this step is the following:

def sub_dd0_rev(value):
    def bruteforce_char(rev_reg):
    	# ...

    res = []

    for i in range(0, 16, 4):
        r20 = lxq_1730_i(value[i])
        r24 = lxq_1730_i(value[i+2])
        r17 = lxq_1730_i(value[i+3])
        r22 = lxq_1730_i(value[i+1])

        r0 = [0]*16
        r1 = [0]*16
        r2 = [0]*16
        r3 = [0]*16

        r0[3] = value[i]
        r1[3] = value[i+1]
        r2[3] = value[i+2]
        r3[3] = value[i+3]

        chunk = [
            bruteforce_char(r0),
            bruteforce_char(r1),
            bruteforce_char(r2),
            bruteforce_char(r3),
        ]

        res += chunk

    return res

def input_4_rev(value):
    rev_input = sub_dd0_rev(value)

    # unsuffle
    shuffle = [12, 13, 14, 15, 4, 5, 6, 7, 0, 1, 2, 3, 8, 9, 10, 11]
    res = [0]*16

    for i in range(16):
        res[i] = rev_input[shuffle[i]]

    return xor(res, u128_to_reg(0x87934520879345208793452087934520))

Passing the fifth slice to this function gives the string -10yxW8uYzWNMY54.

Fun fact, it is at this step that my monkey brain finally clicked and saw the similarities between the different transformation functions (shuffling + byte translation + misc operations for obfuscation). It clicked after about 20h into the challenge. Noticing this early would have saved a LOT of time, and it was a harsh reminder that sometimes your really need to step back and see the bigger picture.

input[80..95]

input_5_compute

The structure of this function (at 0x420) is eerily similar to the fourth one. At this point I was too lazy to reverse it properly. I just copy-pasted the solving code for the fourth step and fiddled with it until it worked.

def input_5_rev(value):
    def bruteforce_char(rev_reg):
    	# ...

    rev_input = [0]*16

    for i in range(8):
        r16 = [0]*16
        r19 = [0]*16

        r16[2] = value[i*2]
        r19[3] = value[(i*2)+1]

        r14 = shrhi(r16, 0x8)
        rev_input[i*2] = bruteforce_char(r14)
        rev_input[(i*2)+1] = bruteforce_char(r19)

    rev_input = rotqby(rev_input, 0)

    # Now unshuffle
    shuffle = [14, 15, 12, 13, 6, 7, 2, 3, 0, 1, 4, 5, 8, 9, 10, 11]
    result = [0]*16

    for i, e in enumerate(shuffle):
        result[i] = rev_input[shuffle[i]]

    return ah(ilh(0x551), result)

Passing the last slice of bytes of expected bytes to this function gives out the string ssDcpRHfKZ8KZkX3.

Now just concat the strings …

fusion

… and send them to the game for confirmation !

$ nc localhost 1337
    ____            ____          __     ______     ____
   / __ \___  _____/ __/__  _____/ /_   / ____/__  / / /
  / /_/ / _ \/ ___/ /_/ _ \/ ___/ __/  / /   / _ \/ / / 
 / ____/  __/ /  / __/  __/ /__/ /_   / /___/  __/ / /  
/_/    \___/_/  /_/  \___/\___/\__/   \____/\___/_/_/   
                                                        
========================================================
Welcome to Pefect Cell!

Please provide a correct input ...
FCSC{30gbt9RzIif5L_s70cRm7gXHm_R-8WkKTVxSjeL5H9gjVnzkSJBg3y4prWG4tU_5-10yxW8uYzWNMY54ssDcpRHfKZ8KZkX3}
Well done, this is a win! :-)

Afterword

Even though it was a really painful experience, the idea behind this challenge was really cool. It is a clear evolution of last year’s 2ndMix challenge, and it is always fun to reverse homebrews on weird architectures.