# WordOcc

A word frequency tool that outputs sorted results in CSV format and supports stop words.

# Requirements

- Python 2.6
- TextBlob: https://textblob.readthedocs.org/en/dev/

# Usage

## Basic

    python wordocc.py a_interesting_text.txt

This writes content such as the following to _wordocc.csv_ in the current directory:

    top,43
    image,31
    sample,29
    ...

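The counting and CSV-writing steps can be sketched with the standard library alone. This is only an illustration, not the code in wordocc.py: the tool itself relies on TextBlob, and the regex tokenizer, the tie-breaking order, and the names `count_words` and `write_csv` are assumptions made for this example.

```python
import csv
import io
import re
from collections import Counter


def count_words(text):
    """Return a Counter of lowercased word occurrences in text."""
    # Assumption: a word is a run of word characters (\w) or apostrophes.
    words = re.findall(r"[\w']+", text.lower())
    return Counter(words)


def write_csv(counts, stream):
    """Write word,count rows to stream, most frequent word first."""
    writer = csv.writer(stream, lineterminator="\n")
    # Sort by descending count, then alphabetically to break ties.
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([word, count])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(count_words("top image top sample top image"), buffer)
    print(buffer.getvalue(), end="")
```

Writing to a stream rather than a fixed path keeps the sketch easy to try in memory; passing an open file would produce a CSV file instead.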
## Options

    wordocc.py -h

    Usage: wordocc.py [options] FILE

    Options:
      -h, --help            show this help message and exit
      -s STOP_WORDS, --stop-words=STOP_WORDS
                            path to stop word file
      -o OUTPUT, --output=OUTPUT
                            csv output filename (default: wordocc.csv)
      -e ENCODING, --encoding=ENCODING
                            file encoding (default: utf-8)

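The help text has the shape that Python's `optparse` module generates, so a parser along the following lines would likely print something similar. This is a guess at the structure, not the tool's actual source; the function name `build_parser` and the `dest` names are assumptions.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser whose help resembles the output shown above."""
    parser = OptionParser(usage="%prog [options] FILE")
    parser.add_option("-s", "--stop-words", dest="stop_words",
                      help="path to stop word file")
    parser.add_option("-o", "--output", dest="output", default="wordocc.csv",
                      help="csv output filename (default: %default)")
    parser.add_option("-e", "--encoding", dest="encoding", default="utf-8",
                      help="file encoding (default: %default)")
    return parser


if __name__ == "__main__":
    # Parse a sample command line instead of sys.argv for illustration.
    options, args = build_parser().parse_args(
        ["-s", "stop.txt", "-o", "counts.csv", "a_interesting_text.txt"])
    print(options.stop_words, options.output, options.encoding, args)
```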
## Stop words

Stop words are words that carry little interest for a statistical study, such as articles and conjunctions.

You can provide a file containing those words, one per line. The following files can help:

- English: http://snowball.tartarus.org/algorithms/english/stop.txt
- French: http://snowball.tartarus.org/algorithms/french/stop.txt

Use the -s option to specify the file path:

    python wordocc.py -s /home/jdoe/en/stop.txt a_interesting_text.txt

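Loading a stop word file and filtering the counts could look like the sketch below. It is not taken from wordocc.py: the helper names are hypothetical, and stripping everything after a `|` is an assumption based on the Snowball lists, which appear to use that character for comments. Whether the tool itself handles such comments is not stated here.

```python
import io
from collections import Counter


def load_stop_words(stream):
    """Read one stop word per line, ignoring blanks and '|' comments."""
    stop_words = set()
    for line in stream:
        # Assumption: anything after '|' is a comment, as in the Snowball lists.
        word = line.split("|", 1)[0].strip().lower()
        if word:
            stop_words.add(word)
    return stop_words


def remove_stop_words(counts, stop_words):
    """Return a new Counter without the entries listed in stop_words."""
    return Counter({w: n for w, n in counts.items() if w not in stop_words})


if __name__ == "__main__":
    sample = io.StringIO(" | a tiny stop list\nthe\nand   | conjunction\n\n")
    stops = load_stop_words(sample)
    counts = Counter({"the": 12, "image": 5, "and": 7})
    print(remove_stop_words(counts, stops))
```

Keeping the stop words in a set makes each membership check constant time, which matters when the input text has many distinct words.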