wordocc/README.md


2015-12-08 01:30:02 +11:00
# WordOcc
A word-frequency tool that writes sorted results in CSV format and supports stop words.
# Requirements
- Python 2.6
- TextBlob https://textblob.readthedocs.org/en/dev/
# Usage
## Basic
    python wordocc.py a_interesting_text.txt

This writes results such as the following to `wordocc.csv`:

    top,43
    image,31
    sample,29
    ...
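
The counting step behind this output can be sketched as follows. This is a minimal illustration written for Python 3, not the code in `wordocc.py`: it assumes a simple regular-expression tokenizer instead of TextBlob, and the function names `count_words` and `write_csv` are hypothetical.

```python
import csv
import io
import re
from collections import Counter


def count_words(text):
    """Count lowercase word occurrences (regex tokenizer, not TextBlob)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def write_csv(counts, stream):
    """Write word,count rows, most frequent first; ties are sorted by word."""
    writer = csv.writer(stream, lineterminator="\n")
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([word, count])


# Small demonstration on an in-memory buffer instead of wordocc.csv.
buffer = io.StringIO()
write_csv(count_words("top image top sample top image"), buffer)
print(buffer.getvalue())
```

Sorting on the negated count gives descending frequency, and the secondary sort on the word keeps the output deterministic when counts are equal.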
## Options
    wordocc.py -h

    Usage: wordocc.py [options] FILE

    Options:
      -h, --help            show this help message and exit
      -s STOP_WORDS, --stop-words=STOP_WORDS
                            path to stop word file
      -o OUTPUT, --output=OUTPUT
                            csv output filename
      -e ENCODING, --encoding=ENCODING
                            file encoding (default: utf-8)
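
The help text above has the layout produced by Python's `optparse` module, which suggests (but does not confirm) that the options are declared along these lines. The sketch below is an assumed reconstruction: the function name `build_parser` is hypothetical, and the `wordocc.csv` default for `--output` is inferred from the Basic example rather than stated in the help text.

```python
from optparse import OptionParser


def build_parser():
    """Declare options matching the help text (assumed reconstruction)."""
    parser = OptionParser(usage="%prog [options] FILE", prog="wordocc.py")
    parser.add_option("-s", "--stop-words", dest="stop_words",
                      help="path to stop word file")
    # Default assumed from the Basic example; the help text does not state it.
    parser.add_option("-o", "--output", dest="output", default="wordocc.csv",
                      help="csv output filename")
    parser.add_option("-e", "--encoding", dest="encoding", default="utf-8",
                      help="file encoding (default: utf-8)")
    return parser


# Parse an example command line; positional arguments end up in `args`.
options, args = build_parser().parse_args(
    ["-s", "stop.txt", "-o", "result.csv", "a_interesting_text.txt"])
print(options.stop_words, options.output, options.encoding, args)
```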
## Stop words
### Introduction
Stop words are words that carry little interest for statistical analysis, such as articles and conjunctions.
You must provide a file containing those words, one per line. The following files can help:
- English: http://snowball.tartarus.org/algorithms/english/stop.txt
- French: http://snowball.tartarus.org/algorithms/french/stop.txt
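
Reading such a file and filtering a word list with it might look like the sketch below. It is an illustration under stated assumptions, not the logic of `wordocc.py`: the names `load_stop_words` and `remove_stop_words` are hypothetical, and treating everything after a `|` as a comment is an assumption made to cope with the annotated Snowball lists. Whether `wordocc.py` does the same is not documented here.

```python
import io


def load_stop_words(path, encoding="utf-8"):
    """Read one stop word per line; text after '|' is treated as a comment."""
    stop_words = set()
    with io.open(path, encoding=encoding) as handle:
        for line in handle:
            # Assumption: drop any trailing "| comment" and surrounding spaces.
            word = line.split("|", 1)[0].strip().lower()
            if word:
                stop_words.add(word)
    return stop_words


def remove_stop_words(words, stop_words):
    """Keep only the words that are not in the stop word set."""
    return [word for word in words if word.lower() not in stop_words]
```

A set is used so that each membership check during filtering takes constant time, regardless of how long the stop word list is.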
### Usage
    python wordocc.py -s /home/jdoe/en/stop.txt a_interesting_text.txt