Here are some of the tools I use to process and clean data from all manner of customers:
detox
The detox utility renames files to make them easier to work with. It removes spaces and other such annoyances. It’ll also translate or clean up Latin-1 (ISO 8859-1) characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI-escaped characters.
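To give a feel for the kind of renaming described above, here is a minimal Python sketch. It is not detox's actual algorithm, and `sanitize_filename` is a made-up helper name: it only strips accents, swaps spaces for underscores, and replaces anything outside a conservative set of characters.

```python
import re
import unicodedata


def sanitize_filename(name: str) -> str:
    """Return a shell-friendly version of name (hypothetical helper, not detox itself)."""
    # Decompose accented characters, then drop the non-ASCII combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace each run of whitespace with a single underscore.
    no_spaces = re.sub(r"\s+", "_", ascii_only.strip())
    # Replace anything outside a conservative safe set with an underscore.
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", no_spaces)
    # Collapse repeated underscores left behind by the substitutions.
    return re.sub(r"_+", "_", safe)


print(sanitize_filename("My R\u00e9sum\u00e9 (final).txt"))  # My_Resume_final_.txt
```

Passing the result to `os.rename` would apply the new name on disk; detox itself handles many more cases than this.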
sed
See other episodes for great sed information. I like to remove DOS end-of-line and end-of-file characters:
sed -i 's/^M//g' *.txt
(here ^M stands for a literal carriage return, typed as Ctrl-V Ctrl-M)
or
sed -i 's/\r//g' *.txt
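For anyone who would rather do the same cleanup from Python, here is a rough equivalent. It assumes the "end of file character" is the old DOS Ctrl-Z byte (0x1A), and the function names are my own for this sketch.

```python
from pathlib import Path


def strip_dos_artifacts(data: bytes) -> bytes:
    """Remove carriage returns and any trailing DOS end-of-file marker (Ctrl-Z)."""
    cleaned = data.replace(b"\r", b"")
    # Assumption: the end-of-file character is 0x1A, which some DOS tools append.
    return cleaned.rstrip(b"\x1a")


def clean_file(path: Path) -> None:
    """Rewrite the file in place, similar in spirit to sed -i."""
    path.write_bytes(strip_dos_artifacts(path.read_bytes()))
```

Calling `clean_file` on each path from `Path(".").glob("*.txt")` would mirror the `*.txt` glob in the sed commands. Working on bytes rather than text avoids guessing the file's encoding.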
Command-line tools
ack
awk
detox
grep
pandoc
pdftotext -layout
sed
unix2dos and dos2unix
wget
curl
R libraries
RCurl
XML
rvest
tm
xlsx
Python libraries
beautifulsoup
csv
nltk (YouTube series)
rdflib
re
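As a small illustration of how two of the standard-library entries above, csv and re, might be combined for cleaning, the sketch below normalizes whitespace in every cell of a CSV file. The helper names are invented for this example and it is only one of many ways to do it.

```python
import csv
import io
import re


def normalize_cell(value: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", value).strip()


def clean_csv(text: str) -> str:
    """Return CSV text with every cell normalized (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    # Write Unix line endings regardless of what the input used.
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([normalize_cell(cell) for cell in row])
    return out.getvalue()
```

Letting the csv module do the parsing means quoted fields with embedded commas survive intact, which a plain `split(",")` would not manage.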
Vim tricks
buffer searches (:vim /pattern/ ##)
Ack plugin
bufdo (:bufdo %s/pattern/replace/ge | update)
Other tools
OpenRefine
reconcile-csv
tabula