Looking at Text I: Low-Level Formatting Issues
Junk formatting/content. Examples: document headers and separators, typesetter codes, tables and diagrams, and garbled data in the computer file. Text retrieved through OCR brings further problems, such as unrecognized words. One often needs a filter to remove junk content before any processing begins.
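A minimal sketch of such a filter, in Python. The patterns are illustrative assumptions, not a standard set: the header names, the troff-style typesetter codes, and the 50% threshold for "garbled" lines are all hypothetical and would need tuning for a real corpus.

```python
import re

# Hypothetical junk patterns; a real corpus needs its own rules.
SEPARATOR = re.compile(r"^[\s\-=_*#]{3,}$")    # lines made only of rule characters
TYPESETTER = re.compile(r"^\.[A-Za-z]{2}\b")   # troff-style codes such as ".PP"
HEADER = re.compile(r"^(From|Subject|Date):", re.IGNORECASE)


def is_junk(line: str) -> bool:
    """Return True if the line looks like formatting debris rather than text."""
    stripped = line.strip()
    if not stripped:
        return False  # keep blank lines; they may mark paragraph breaks
    if SEPARATOR.match(stripped) or TYPESETTER.match(stripped) or HEADER.match(stripped):
        return True
    # Assumed threshold: a line is garbled when fewer than half of its
    # characters are letters or spaces.
    good = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return good / len(stripped) < 0.5


def filter_junk(text: str) -> str:
    """Drop junk lines and return the remaining text."""
    return "\n".join(line for line in text.splitlines() if not is_junk(line))
```

For example, given a text whose lines are "Subject: test", "-----", ".PP", "The brown dog barked." and "@#$%^&*123", filter_junk keeps only "The brown dog barked." A character-ratio test like this will also discard legitimate lines that are mostly digits, so it is only a starting point.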
Uppercase and lowercase: should we keep the case or not? “the”, “The”, and “THE” should all be treated as the same word, but “brown” in “George Brown” and in “brown dog” should be treated as different words.
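One simple heuristic for this trade-off, sketched in Python under stated assumptions: lowercase a word when it starts a sentence or is written entirely in capitals, and leave other capitalized words alone so that names keep their case. The function name normalize_case and the tokenizer pattern are hypothetical, chosen for illustration only.

```python
import re

# Assumed tokenizer: runs of ASCII letters, plus sentence-ending punctuation.
TOKEN = re.compile(r"[A-Za-z]+|[.!?]")


def normalize_case(text: str) -> list[str]:
    """Lowercase sentence-initial and all-caps words; keep other capitals."""
    tokens = []
    sentence_start = True
    for tok in TOKEN.findall(text):
        if tok in ".!?":
            sentence_start = True  # the next word begins a new sentence
            continue
        if sentence_start or (tok.isupper() and len(tok) > 1):
            tok = tok.lower()
        tokens.append(tok)
        sentence_start = False
    return tokens


print(normalize_case("The brown dog saw George Brown. THE end."))
# prints ['the', 'brown', 'dog', 'saw', 'George', 'Brown', 'the', 'end']
```

The heuristic has known weaknesses: a proper name at the start of a sentence is lowercased, and so are acronyms written in capitals. Handling those cases properly requires knowing which words are names.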