A function for parsing words from a string without using whitespace

I'm trying to parse words from a badly garbled text file that contains many repeats. It's about 100k characters in length and was formed from joining many substrings in alphabetical order.

I'm curious about other methods for finding words without using whitespace.

Here's my solution:

def unique_words(string):
    words = dict()
    p1 = 0  # String slice position 1
    p2 = 1  # String slice position 2
    len_string = len(string)
    while p2 < len_string:
        p2 += 1
        sub1 = string[p1:p2]  # A shorter sub
        sub2 = string[p1:(p2 + 1)]  # A longer sub, extended by one character
        sub1_count = string.count(sub1)  # Frequency of the shorter sub
        sub2_count = string.count(sub2)  # Frequency of the longer sub
        if sub2_count * len(sub2) < sub1_count * len(sub1):  # True when extending the slice lowers count * length
            words[sub1] = ''  # Add the shorter sub as a found word
            p1 = p2
    return words

The above code works when the number of unique words is small but fails when it is large. I've used the website TextMechanic to generate a random string like

'updownleftupdowndownleftupleftrightupdownleftup'

and the above code returns a dictionary exactly as desired:

{'up': '', 'down': '', 'left': '', 'right': ''}

Here's the problem:

When the number of unique words increases, there comes a point where a single letter occurs more often than the total character count of any whole word, so the length-weighted comparison favors one-letter "words". For example, if 'u' appeared 500 times but 'up' only 200 times, then 500 × 1 beats 200 × 2 = 400, and the algorithm splits off the single letter 'u' instead of 'up'.

My current solution is to run the algorithm on short slices of the original string, but this involves trial and error and produces artifacts.
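For reference, here is a minimal sketch of what that chunked workaround looks like. The name unique_words_chunked and the chunk_size=1000 default are made up for illustration; it simply reuses unique_words from above.

def unique_words_chunked(string, chunk_size=1000):
    # Sketch of the workaround described above: run unique_words()
    # on fixed-size slices and merge the results. Words that straddle
    # a chunk boundary are cut in half and missed, which is one
    # plausible source of the artifacts mentioned.
    words = dict()
    for start in range(0, len(string), chunk_size):
        words.update(unique_words(string[start:start + chunk_size]))
    return words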
