I'm trying to process a file from the Protein Data Bank whose fields are separated by spaces (not \t). I have a .txt file and I want to extract specific rows and, from those rows, only a few columns.
I need to do it in Python. I first tried it on the command line with awk and had no problem, but I have no idea how to do the same in Python.
Here is an extract of my file:
[...]
SEQRES 6 B 80 ALA LEU SER ILE LYS LYS ALA GLN THR PRO GLN GLN TRP
SEQRES 7 B 80 LYS PRO
HELIX 1 1 THR A 68 SER A 81 1 14
HELIX 2 2 CYS A 97 LEU A 110 1 14
HELIX 3 3 ASN A 122 SER A 133 1 12
[...]
For example, I'd like to take only the 'HELIX' rows and then the 4th, 6th, 7th and 9th columns. I started reading the file line by line with a for loop and then extracted those rows starting with 'HELIX'... and that's all.
EDIT: This is the code I have right now, but the print doesn't work properly; it only prints the first line of each block (HELIX, SHEET and DBREF):
#!/usr/bin/python
import sys
for line in open(sys.argv[1]):
    if 'HELIX' in line:
        helix = line.split()
    elif 'SHEET' in line:
        sheet = line.split()
    elif 'DBREF' in line:
        dbref = line.split()
print (helix), (sheet), (dbref)
Solution
If you have already extracted the line, you can split it on whitespace with line.split(). This gives you a list from which you can pick the elements you need by index (indices start at 0, so the 4th column is index 3):
>>> test='HELIX 2 2 CYS A 97'
>>> test.split()
['HELIX', '2', '2', 'CYS', 'A', '97']
>>> test.split()[3]
'CYS'
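Putting this together for the whole file, here is a minimal sketch of one way to keep only the 'HELIX' rows and their 4th, 6th, 7th and 9th columns. The function name extract_columns and its parameters are made up for illustration, and the sample lines are copied from the question. Each matching row is appended to a list instead of being stored in a single variable, because a single variable gets overwritten on every match and leaves only one line per block. Note that this relies on whitespace splitting; PDB files are fixed-width, so it may break on lines where two fields run together without a space.

```python
def extract_columns(lines, record="HELIX", columns=(4, 6, 7, 9)):
    """Return the chosen 1-based columns of every line whose first field is `record`."""
    rows = []
    for line in lines:
        fields = line.split()
        # Compare the first field instead of using `in`, so a line that merely
        # contains the word somewhere else is not matched by accident.
        if fields and fields[0] == record:
            # Skip lines that are too short to hold all requested columns.
            if len(fields) >= max(columns):
                rows.append([fields[c - 1] for c in columns])
    return rows


# Sample lines taken from the extract in the question.
sample = [
    "SEQRES 7 B 80 LYS PRO",
    "HELIX 1 1 THR A 68 SER A 81 1 14",
    "HELIX 2 2 CYS A 97 LEU A 110 1 14",
    "HELIX 3 3 ASN A 122 SER A 133 1 12",
]

for row in extract_columns(sample):
    print(" ".join(row))
# THR 68 SER 81
# CYS 97 LEU 110
# ASN 122 SER 133
```

To run it on a real file, pass an open file object instead of the sample list, for example extract_columns(open(sys.argv[1])) after importing sys. Calling it again with record="SHEET" or record="DBREF" collects the other blocks in the same way.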