I have 15 .csv files with the following formats:

**File 1**
MYC
RASSF1
DAPK1
MDM2
TP53
E2F1
...

**File 2**
K06227
C00187
GLI1
PTCH1
BMP2
TP53
...

I would like to create a loop that runs through all 15 files and compares them two at a time, covering each unique pair once. So File 1 and File 2 would be compared with each other, and the output would tell me how many matches were found and what they were. In the above example, the output would be:

1 match and TP53

The loop should compare every file against every other file: 1,3 (File 1 against File 3), 1,4, and so on.

f1 = set(open(str(cancers[1]) + '.csv', 'r'))
f2 = set(open(str(cancers[2]) + '.csv', 'r'))
f3 = open(str(cancers[1]) + '_vs_' + str(cancers[2]) + '.txt', 'wb').writelines(f1 & f2)

The above works but I'm having a hard time creating the looping portion.

2 Answers

To avoid comparing a file with itself, and to keep the code flexible in the number of cancers, I would write it like this. I assume cancers is a list.

# example list of cancers
cancers = ['BRCA', 'BLCA', 'HNSC']
with open('match.csv', 'w') as fout:
    for i in range(len(cancers)):
        # start j at i + 1 so each pair appears once and no file meets itself
        for j in range(i + 1, len(cancers)):
            # cancers already holds strings, so str(cancers[i]) is not needed
            with open(cancers[i] + '.csv', 'r') as fh1:
                f1 = {x.strip() for x in fh1}
            with open(cancers[j] + '.csv', 'r') as fh2:
                f2 = {x.strip() for x in fh2}
            match = list(f1 & f2)
            # ';' separates the matched genes so the CSV opens cleanly in Excel
            fout.write('{}_vs_{},{} matches,{}\n'.format(
                cancers[i], cancers[j], len(match), ';'.join(match)))

Results

BRCA_vs_BLCA,1 matches,TP53
BRCA_vs_HNSC,6 matches,TP53;BMP2;GLI1;C00187;PTCH1;K06227
BLCA_vs_HNSC,1 matches,TP53
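
The pairing done by the nested loops above can also be expressed with itertools.combinations, which yields each unordered pair exactly once. This is only a minimal sketch, assuming each file holds one identifier per line; the helper names read_ids and compare_all and the directory parameter are illustrative and not part of the answer above.

```python
import itertools
import os


def read_ids(path):
    """Return the set of non-empty, whitespace-stripped lines in path."""
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}


def compare_all(names, directory='.'):
    """Map each unique (a, b) pair of names to their sorted shared identifiers."""
    # Read every file once up front instead of once per pair.
    ids = {n: read_ids(os.path.join(directory, n + '.csv')) for n in names}
    return {
        (a, b): sorted(ids[a] & ids[b])
        for a, b in itertools.combinations(names, 2)
    }
```

Calling compare_all(cancers) would return a dictionary keyed by pairs such as ('BRCA', 'BLCA'), and each value could then be written out in whatever format is needed. Sorting the matches is a choice made here so the output order is reproducible.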

4 Comments

This works but I'm still getting instances where the output .txt file equates to a comparison of the same file. Ex: BRCA_vs_BRCA.txt. Do you know how I could bypass this?
@Quintakov Yes. I meant to avoid the same file comparison, but I just found I didn't. Now it should work.
Do you know how I could produce the "1 match and TP53" part? Basically, I would like an output .csv that lists the number of matches for every pair of files and what they are. So something like: file 1_vs_file 2, 2 matches, {TP53, BRCA1}
@Quintakov I edited as you requested. Please check the edited version.

To loop through all unique pairs of the 15 files, something like this should work:

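# Assumes cancers can be indexed 1..15, as in the question;
# for a plain 0-based list, use range(len(cancers)) instead.
for i in range(1, 15):
    for j in range(i + 1, 16):
        with open(str(cancers[i]) + '.csv', 'r') as fh1:
            f1 = set(fh1)
        with open(str(cancers[j]) + '.csv', 'r') as fh2:
            f2 = set(fh2)
        # text mode ('w'): the set holds str lines, not bytes
        out_name = str(cancers[i]) + '_vs_' + str(cancers[j]) + '.txt'
        with open(out_name, 'w') as out:
            out.writelines(f1 & f2)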
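
One caveat when intersecting raw lines, as the loop above does: each line keeps its trailing newline, so an identifier on the last line of a file with no final newline will not match the same identifier elsewhere. The sketch below illustrates this under the assumption that one identifier per line is intended; shared_lines is a hypothetical helper name.

```python
def shared_lines(lines_a, lines_b):
    """Intersect two iterables of lines after stripping surrounding whitespace."""
    clean_a = {line.strip() for line in lines_a if line.strip()}
    clean_b = {line.strip() for line in lines_b if line.strip()}
    return clean_a & clean_b


# 'TP53' ends the first list without a newline, so the raw sets do not intersect.
raw_a = ['MYC\n', 'TP53']
raw_b = ['TP53\n', 'GLI1\n']
print(set(raw_a) & set(raw_b))     # set()
print(shared_lines(raw_a, raw_b))  # {'TP53'}
```

Because the stripped identifiers no longer end in a newline, they would need to be written back with one appended (for example by joining with '\n') rather than passed straight to writelines.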
