A function for parsing words from a string without using whitespace

I'm trying to parse words from a badly garbled text file that contains many repeats. It's about 100k characters in length and was formed from joining many substrings in alphabetical order.

I'm curious about other methods for finding words without using whitespace.

Here's my solution:

def unique_words(string):
    words = dict()
    p1 = 0  # String slice position 1
    p2 = 1  # String slice position 2
    len_string = len(string)
    while p2 < len_string:
        p2 += 1
        sub1 = string[p1:p2]  # A shorter sub
        sub2 = string[p1:(p2 + 1)]  # A longer sub, extended by one character
        sub1_count = string.count(sub1)  # Frequency of the shorter sub
        sub2_count = string.count(sub2)  # Frequency of the longer sub
        if sub2_count * len(sub2) < sub1_count * len(sub1):  # True when extending the slice lowers count * length
            words[sub1] = ''  # Add the shorter sub as a found word
            p1 = p2
    return words

The above code works when the number of unique words is small but fails when it is large. I've used the website TextMechanic to generate a random string like

'updownleftupdowndownleftupleftrightupdownleftup'

and the above code returns a dictionary exactly as desired:

{'up': '', 'down': '', 'left': '', 'right': ''}

Here's the problem:

When the number of unique words increases, there comes a point where a single letter occurs more often than the total character count of any whole word, so the length-weighted comparison favors one-letter "words". For example, if 'u' appeared 500 times but 'up' only 200 times, then 500 × 1 beats 200 × 2 = 400, and the algorithm splits off the single letter 'u' instead of 'up'.

My current solution is to run the algorithm on short slices of the original string, but this involves trial and error and produces artifacts.
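For reference, here is a minimal sketch of what that chunked workaround looks like. The name unique_words_chunked and the chunk_size=1000 default are made up for illustration; it simply reuses unique_words from above.

def unique_words_chunked(string, chunk_size=1000):
    # Sketch of the workaround described above: run unique_words()
    # on fixed-size slices and merge the results. Words that straddle
    # a chunk boundary are cut in half and missed, which is one
    # plausible source of the artifacts mentioned.
    words = dict()
    for start in range(0, len(string), chunk_size):
        words.update(unique_words(string[start:start + chunk_size]))
    return words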
