nlp/notes.md

What needs to be improved about this repo:

Generalize and standardize the steps in an NLP pipeline into Python classes and functions. I can think of these off the top of my head (a rough sketch follows the list):

  • Scraper - get text from the internet to a local file
  • Cleaner - clean raw text of non-corpus text
  • Ngramer - assemble the text into a Python list of n-gram lists
  • Cfdister - restructure the data into a conditional frequency distribution
  • Other? - restructure the data by some other metric (rhyming, similarity, etc.)
  • Assembler loop - takes the structure above and outputs one word at a time
    • Maybe this should be wrapped in a sentence loop, line-by-line loop, paragraph loop, etc.
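
Something like this minimal stdlib-only sketch (all class and function names are placeholders, not existing repo code):

```python
import random
import re
import urllib.request
from collections import defaultdict


class Scraper:
    """Get text from the internet to a local file."""
    def fetch(self, url, path):
        with urllib.request.urlopen(url) as resp:
            raw = resp.read().decode("utf-8", errors="replace")
        with open(path, "w", encoding="utf-8") as f:
            f.write(raw)
        return path


class Cleaner:
    """Clean raw text of non-corpus text (here: keep only words)."""
    def clean(self, raw):
        return re.sub(r"[^A-Za-z' ]+", " ", raw).lower()


class Ngramer:
    """Assemble the text into a list of n-gram lists."""
    def ngrams(self, text, n=2):
        words = text.split()
        return [words[i:i + n] for i in range(len(words) - n + 1)]


class Cfdister:
    """Restructure n-grams into a conditional frequency distribution:
    (first n-1 words) -> {next word: count}."""
    def cfd(self, ngrams):
        dist = defaultdict(lambda: defaultdict(int))
        for gram in ngrams:
            dist[tuple(gram[:-1])][gram[-1]] += 1
        return dist


def assemble(cfd, seed, length=10):
    """Assembler loop: emit one word at a time by sampling the CFD,
    conditioned on the last n-1 words generated so far."""
    out = list(seed)
    for _ in range(length):
        choices = cfd.get(tuple(out[-len(seed):]))
        if not choices:
            break
        words, counts = zip(*choices.items())
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)


if __name__ == "__main__":
    text = Cleaner().clean("The rain falls, and the rain stops, and the sun shines.")
    cfd = Cfdister().cfd(Ngramer().ngrams(text, n=2))
    print(assemble(cfd, seed=["the"], length=8))
```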

Syntax-aware generation is actually pretty bad. I think it forces the output to be too random, and the POS tagging is too error-prone and too fine-grained.

Ideas for the future:

Pick one or two lines of the haiku from actual haiku or other poems. Then add a line or two from the corpus (e.g. Trump tweets) that both fits the syllable count and rhymes with the end(s) of the real poetic line(s). Both sources could be n-gram generated, but I think it would be ideal if they were picked wholesale from the source. The problem with that approach is that you'd also have to find a common word between the two source extractions so the sentence doesn't shift abruptly between lines. Or maybe that's a good thing? I guess I should try both.
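
A rough sketch of the syllable/rhyme filter, assuming NLTK's CMU pronouncing dictionary is installed (`matching_lines` and the other names are hypothetical):

```python
from nltk.corpus import cmudict

# Requires: import nltk; nltk.download("cmudict")
PRON = cmudict.dict()


def word_syllables(word):
    """Count syllables via CMUdict: vowel phonemes carry a stress digit."""
    prons = PRON.get(word.lower())
    if not prons:
        return None  # unknown word: caller should skip this line
    return sum(ph[-1].isdigit() for ph in prons[0])


def line_syllables(line):
    counts = [word_syllables(w) for w in line.split()]
    return None if None in counts else sum(counts)


def rhyme_tail(word):
    """Phonemes from the last stressed vowel onward; equal tails rhyme."""
    prons = PRON.get(word.lower())
    if not prons:
        return None
    phones = prons[0]
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1] in "12":  # 1/2 = primary/secondary stress
            return tuple(phones[i:])
    return tuple(phones)


def matching_lines(real_line, corpus_lines, target_syllables):
    """Corpus lines that fit the syllable budget and rhyme with the
    end of the real poetic line."""
    tail = rhyme_tail(real_line.split()[-1])
    return [c for c in corpus_lines
            if line_syllables(c) == target_syllables
            and tail is not None
            and rhyme_tail(c.split()[-1]) == tail]
```

The common-word constraint could then just be one more predicate in that comprehension.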

Maybe try just swapping out the nouns, verbs, adjectives, and adverbs, leaving the rest of the sentence structure largely intact after the tree replace?
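
A quick sketch of that swap, assuming NLTK's tokenizer and tagger are available (`pos_lexicon` and `swap_open_class` are made-up names):

```python
import random
import nltk

# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
OPEN_CLASS = ("NN", "VB", "JJ", "RB")  # noun/verb/adjective/adverb tag prefixes


def pos_lexicon(corpus_text):
    """Map each POS tag to every corpus word seen with that tag."""
    lex = {}
    for word, tag in nltk.pos_tag(nltk.word_tokenize(corpus_text)):
        lex.setdefault(tag, []).append(word)
    return lex


def swap_open_class(template_sentence, lex):
    """Keep the template's structure; swap only nouns, verbs, adjectives,
    and adverbs for corpus words carrying the same tag."""
    out = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(template_sentence)):
        if tag.startswith(OPEN_CLASS) and lex.get(tag):
            out.append(random.choice(lex[tag]))
        else:
            out.append(word)
    return " ".join(out)
```

For example, `swap_open_class("the old dog sleeps quietly", pos_lexicon(tweet_text))` (with `tweet_text` standing in for the corpus) keeps the determiner and word order but re-fills the content words.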