2017-03-14 03:51:07 +00:00

What needs to be improved about this repo:

Generalize and standardize the steps in an NLP pipeline into Python classes
and functions. I can think of these off the top of my head (a rough sketch
follows the list):

* Scraper - get text from the internet to a local file
* Cleaner - clean the raw text of non-corpus text
* Ngramer - assemble the text into a Python list of lists (n-grams)
* Cfdister - restructure the data into a conditional frequency distribution
* Other? - restructure the data by some other metric (rhyming, similarity,
  etc.)
* Assembler loop - takes a structure from above and outputs one word at a
  time
  - Maybe this should be wrapped in a sentence loop, line-by-line loop,
    paragraph loop, etc.
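
A minimal sketch of how those steps might hang together, written as plain
functions for now (every name here is hypothetical; assumes the requests and
nltk packages are installed):

    # Hypothetical pipeline sketch: scrape -> clean -> ngramize -> cfdist
    # -> assemble. Assumes the requests and nltk packages.
    import random

    import nltk
    import requests


    def scrape(url, path):
        """Scraper: get text from the internet to a local file."""
        with open(path, "w") as f:
            f.write(requests.get(url).text)


    def clean(raw_text):
        """Cleaner: strip non-corpus text (placeholder rule: drop blank
        lines and collapse whitespace)."""
        return " ".join(line.strip() for line in raw_text.splitlines()
                        if line.strip())


    def ngramize(text, n=2):
        """Ngramer: assemble the text into a list of n-gram tuples."""
        tokens = text.split()
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


    def cfdist(ngrams):
        """Cfdister: condition each word on its preceding context."""
        return nltk.ConditionalFreqDist((g[:-1], g[-1]) for g in ngrams)


    def assemble(cfd, seed, length=10):
        """Assembler loop: emit one word at a time from the distribution."""
        context, words = tuple(seed), list(seed)
        for _ in range(length):
            if context not in cfd:
                break
            word = random.choice(list(cfd[context]))
            words.append(word)
            context = context[1:] + (word,)
        return " ".join(words)

Each step could just as easily become a class behind a common interface; the
sketch only pins down the data handed from one step to the next.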
2017-03-22 18:08:15 +00:00

Syntax-aware generate is actually pretty bad. I think it forces the output to
be too random, and the POS tagging is too error-prone and fine-grained.

Ideas for the future:

Pick one or two lines of the haiku from actual haiku or other poems. Then add
a line or two from the corpus (e.g. Trump tweets) that both fits the syllable
count and rhymes with the end(s) of the real poetic line. Both sources could
be ngram-generated, but I think it would be ideal if they were picked
wholesale from the source. The problem with that approach is that you'd also
have to find a common word between the two source extractions so that the
sentence doesn't shift abruptly between lines. Or maybe that's a good thing?
I guess I should try both.
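
The syllable and rhyme checks could lean on NLTK's CMU pronouncing dictionary
(run nltk.download('cmudict') once). A minimal sketch, with hypothetical
function names and a deliberately crude rhyme test:

    # Hypothetical syllable/rhyme filter over the CMU pronouncing
    # dictionary; words missing from the dictionary disqualify a line.
    import nltk

    PRON = nltk.corpus.cmudict.dict()  # word -> list of pronunciations


    def syllables(word):
        """CMU phonemes ending in a digit are vowels, one per syllable."""
        prons = PRON.get(word.lower())
        if not prons:
            return None  # unknown word: let the caller skip the line
        return sum(ph[-1].isdigit() for ph in prons[0])


    def line_syllables(line):
        counts = [syllables(w) for w in line.split()]
        return None if None in counts else sum(counts)


    def rhyme_tail(word):
        """Phonemes from the last stressed vowel to the end of the word;
        two words rhyme (crudely) when their tails match."""
        prons = PRON.get(word.lower())
        if not prons:
            return None
        pron = prons[0]
        stressed = [i for i, ph in enumerate(pron) if ph[-1] in "12"]
        return tuple(pron[stressed[-1]:] if stressed else pron)


    def candidate_lines(corpus_lines, poem_line, target):
        """Corpus lines that fit the syllable count and rhyme with the
        end of the real poetic line."""
        end_tail = rhyme_tail(poem_line.split()[-1])
        if end_tail is None:
            return []
        return [line for line in corpus_lines
                if line.split()
                and line_syllables(line) == target
                and rhyme_tail(line.split()[-1]) == end_tail]

The same filter works whether the corpus lines are pulled wholesale from the
source or ngram-generated first, so both variants could share it.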

Maybe try just switching out the nouns, verbs, adjectives, and adverbs,
leaving the rest of the sentence structure largely intact after the tree
replace?
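
A minimal sketch of that swap, using NLTK's default tokenizer and tagger (run
nltk.download('punkt') and nltk.download('averaged_perceptron_tagger') once;
the pool structure and all names are hypothetical):

    # Hypothetical open-class swap: replace nouns, verbs, adjectives, and
    # adverbs with same-tag corpus words, leave everything else alone.
    import random

    import nltk

    OPEN_CLASS = ("NN", "VB", "JJ", "RB")  # tag prefixes to swap


    def swap_open_class(sentence, pool):
        """`pool` maps a POS tag to candidate corpus words for that tag."""
        out = []
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            if tag.startswith(OPEN_CLASS) and pool.get(tag):
                out.append(random.choice(pool[tag]))
            else:
                out.append(word)
        return " ".join(out)


    # Building the pool from a tagged corpus, then swapping:
    # pool = {}
    # for w, t in nltk.pos_tag(nltk.word_tokenize(corpus_text)):
    #     pool.setdefault(t, []).append(w)
    # print(swap_open_class("An old silent pond, a frog jumps in", pool))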