"90% of data analysis is counting" - John Rauser
…well, at least once you've figured out the right question to ask, which is, perhaps, the other 90%.
Example - Counting the size of a population
The simplest command for counting things is wc
, which stands for
word count. By default, wc
prints the number of lines, words, and
characters in a file.
$ wc pums_53.dat
85025 1219861 25659175 pums_53.dat
Nearly always we just want to count the number of lines (records), which can be
done by giving the -l
option to wc
:
$ wc -l pums_53.dat
85025 pums_53.dat
Example - Using grep to select a subset
So, recalling that this is a 1% sample, were there 8.5 million people in
Washington as of the 2000 census? Nope, the census data has two kinds of
records, one for households and one for persons. The first character of a
record, an H or P, indicates which kind of record it is. We can grep
for and
count just person records like this:
$ grep -c "^P" pums_53.dat
59150
The caret ^
means that the P
must occur at the beginning of the line. So
there were about 5.9 million people in Washington State in 2000. Also
interesting, the average household had 59,150/(85,025-59,150) = 2.3
people.