I'm trying to read in data from a text corpus of science paper abstracts (available here). I've posted an example file below, where I've read in the data with
with open(filePath, "r") as f:
    data = f.readlines()
for i, x in enumerate(data): print(i, x, end="")
I would like to extract only the category name (line 25 in this example) and the abstract text; for the example below that would be ("Commercial exploitation over the...", "Life Science Biological"). I cannot assume that the category name and the abstract always appear at these specific line numbers. The abstract text does, however, always start two lines after the Abstract label and run to the end of the file.
0 Title       : CRB: Genetic Diversity of Endangered Populations of Mysticete Whales:
1                Mitochondrial DNA and Historical Demography
2 Type        : Award
3 NSF Org     : DEB 
4 Latest
5 Amendment
6 Date        : August 1,  1991     
7 File        : a9000006
8 
9 Award Number: 9000006
10 Award Instr.: Continuing grant                             
11 Prgm Manager: Scott Collins                           
12        DEB  DIVISION OF ENVIRONMENTAL BIOLOGY       
13        BIO  DIRECT FOR BIOLOGICAL SCIENCES          
14 Start Date  : June 1,  1990       
15 Expires     : November 30,  1992   (Estimated)
16 Expected
17 Total Amt.  : $179720             (Estimated)
18 Investigator: Stephen R. Palumbi   (Principal Investigator current)
19 Sponsor     : U of Hawaii Manoa
20        2530 Dole Street
21        Honolulu, HI  968222225    808/956-7800
22 
23 NSF Program : 1127      SYSTEMATIC & POPULATION BIOLO
24 Fld Applictn: 0000099   Other Applications NEC                  
25               61        Life Science Biological                 
26 Program Ref : 9285,
27 Abstract    :
28                                                                                              
29               Commercial exploitation over the past two hundred years drove                  
30               the great Mysticete whales to near extinction.  Variation in                   
31               the sizes of populations prior to exploitation, minimal                        
32               population size during exploitation and current population                     
33               sizes permit analyses of the effects of differing levels of                    
34               exploitation on species with different biogeographical                         
35               distributions and life-history characteristics.  Dr. Stephen                   
36               Palumbi at the University of Hawaii will study the genetic                     
37               population structure of three whale species in this context,                   
38               the Humpback Whale, the Gray Whale and the Bowhead Whale.  The                 
39               effect of demographic history will be determined by comparing                  
40               the genetic structure of the three species.  Additional studies                
41               will be carried out on the Humpback Whale.  The humpback has a                 
42               world-wide distribution, but the Atlantic and Pacific                          
43               populations of the northern hemisphere appear to be discrete                   
44               populations, as is the population of the southern hemispheric                  
45               oceans.  Each of these oceanic populations may be further                      
46               subdivided into smaller isolates, each with its own migratory                  
47               pattern and somewhat distinct gene pool.  This study will                      
48               provide information on the level of genetic isolation among                    
49               populations and the levels of gene flow and genealogical                       
50               relationships among populations.  This detailed genetic                        
51               information will facilitate international policy decisions                     
52               regarding the conservation and management of these magnificent                 
53               mammals
UPDATE: the code below works for me, but is there a more efficient way to do this?
import re

def getAbstractAndCategory(filePath):
    with open(filePath, "r") as f:
        data = f.readlines()
    # Find the line that holds the "Abstract" label
    abstract = re.compile("Abstract")
    for i, line in enumerate(data):
        if abstract.search(line): break
    # i is now the index of the "Abstract" line
    temp = "".join(data[i+1:])
    # Keep only alphabetic words; this also collapses the padding whitespace
    abstractText = " ".join(re.findall('[A-Za-z]+', temp))
    # The category name sits two lines above the "Abstract" line
    category = " ".join(re.findall('[A-Za-z]+', data[i-2]))
    return abstractText, category
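For comparison, here is a minimal single-pass sketch of the same idea; I have not benchmarked it. The function name extract_abstract_and_category and the WORD pattern are hypothetical names of my own. It assumes, like the code above, that the category sits two lines before the Abstract label, and additionally that the label starts its line. It remembers only the last two lines seen instead of indexing into the full list, and it still drops punctuation and digits.

```python
import re
from collections import deque

# Hypothetical helper pattern: runs of ASCII letters only.
WORD = re.compile(r"[A-Za-z]+")


def extract_abstract_and_category(lines):
    """Return (abstract, category) from an iterable of lines.

    Assumes the category is two lines above the "Abstract" label and
    that the abstract runs from after the label to the end of the input.
    """
    previous = deque(maxlen=2)  # the two lines read most recently
    lines = iter(lines)
    for line in lines:
        if line.lstrip().startswith("Abstract"):
            break
        previous.append(line)
    else:
        raise ValueError("no 'Abstract' label found")
    if len(previous) < 2:
        raise ValueError("fewer than two lines before the 'Abstract' label")

    # previous[0] is the line two above the label.
    category = " ".join(WORD.findall(previous[0]))
    # Whatever the iterator has left is the abstract body.
    abstract = " ".join(WORD.findall(" ".join(lines)))
    return abstract, category
```

Because it accepts any iterable of lines, it can be called on the open file object directly, for example extract_abstract_and_category(f) inside the with block, so readlines() is not needed.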

