Displaylink setup

Recently I purchased an iClever DisplayLink docking station (model# 131-00001-35) to use with my laptop.

To configure it, start by downloading and installing the DisplayLink driver and xrandr, then check the system configuration:

# pacman -S dkms
# git clone https://aur.archlinux.org/displaylink.git
# chown pl -R /home/pl/Downloads
$ cd ./Downloads/displaylink && makepkg
# pacman -S linux44-headers
# pacman -U ./displaylink-1.1.62-2-x86_64.pkg.tar.xz
$ cat /proc/version
Linux version 4.4.21-1-MANJARO (builduser@manjaro) (gcc version 6.2.1 20160830 (GCC) ) #1 SMP PREEMPT Thu Sep 15 19:16:23 UTC 2016
$ pacman -Q xorg-xrandr
xorg-xrandr 1.5.0-1
$ pacman -Q dkms
dkms 2.2.0.3+git151023-12
$ pacman -Q xfwm4
xfwm4 4.12.3-2

Next, start the DisplayLink service and list the available providers. The --setprovideroutputsource option activates the second display.

$ xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x48 cap: 0xb, Source Output, Sink Output, Sink Offload crtcs: 4 outputs: 5 associated providers: 0 name:Intel
Provider 1: id: 0x12a cap: 0x2, Sink Output crtcs: 1 outputs: 1 associated providers: 0 name:modesetting
$ xrandr --setprovideroutputsource 1 0

With the display running, xrandr will list available resolutions and refresh rates.

$ xrandr
Screen 0: minimum 8 x 8, current 1600 x 960, maximum 32767 x 32767
eDP1 connected 1600x900+0+0 (normal left inverted right x axis y axis) 280mm x 160mm
1920x1080 59.93 +
1400x1050 59.98
1600x900 60.00*
1280x1024 60.02
1280x960 60.00
1368x768 60.00
1280x720 60.00
1024x768 60.00
1024x576 60.00
960x540 60.00
800x600 60.32 56.25
864x486 60.00
640x480 59.94
720x405 60.00
640x360 60.00
DP1 disconnected (normal left inverted right x axis y axis)
HDMI1 disconnected (normal left inverted right x axis y axis)
VGA1 disconnected (normal left inverted right x axis y axis)
VIRTUAL1 disconnected (normal left inverted right x axis y axis)
DVI-I-1 connected 1280x960+0+0 (normal left inverted right x axis y axis) 410mm x 256mm
1440x900 59.89 +
1280x1024 75.02 60.02
1280x960 60.00*
1152x864 75.00
1152x720 59.97
1024x768 75.08 60.00
832x624 74.55
800x600 75.00 60.32 56.25
848x480 60.00
640x480 75.00 60.00 59.94
720x400 70.08
1280x1024 (0x10b) 108.000MHz +HSync +VSync
h: width 1280 start 1328 end 1440 total 1688 skew 0 clock 63.98KHz
v: height 1024 start 1025 end 1028 total 1066 clock 60.02Hz
1280x960 (0x10c) 108.000MHz +HSync +VSync
h: width 1280 start 1376 end 1488 total 1800 skew 0 clock 60.00KHz
v: height 960 start 961 end 964 total 1000 clock 60.00Hz
1024x768 (0x110) 65.000MHz -HSync -VSync
h: width 1024 start 1048 end 1184 total 1344 skew 0 clock 48.36KHz
v: height 768 start 771 end 777 total 806 clock 60.00Hz
800x600 (0x113) 40.000MHz +HSync +VSync
h: width 800 start 840 end 968 total 1056 skew 0 clock 37.88KHz
v: height 600 start 601 end 605 total 628 clock 60.32Hz
800x600 (0x114) 36.000MHz +HSync +VSync
h: width 800 start 824 end 896 total 1024 skew 0 clock 35.16KHz
v: height 600 start 601 end 603 total 625 clock 56.25Hz
640x480 (0x118) 25.175MHz -HSync -VSync
h: width 640 start 656 end 752 total 800 skew 0 clock 31.47KHz
v: height 480 start 490 end 492 total 525 clock 59.94Hz

Output “eDP1” is my laptop screen, with a resolution of 1600x900 and a refresh rate of 60 Hz.
Output “DVI-I-1” is my external display, with a resolution of 1280x960.
Resolution can be set or changed with the xrandr command:

# external
xrandr --output DVI-I-1 --mode 1280x960
# built-in
xrandr --output eDP1 --mode 1600x900

Upon reboot these changes are lost. To bring the DisplayLink driver up automatically at boot, enable displaylink.service with systemctl, which requires root privileges.

# systemctl enable displaylink.service
Created symlink /etc/systemd/system/graphical.target.wants/displaylink.service → /usr/lib/systemd/system/displaylink.service.
#

The xrandr command needs to be executed after the X system is initialized during the boot process. Create an executable file with the required commands:

displaylink.sh
#!/usr/bin/env sh
sleep 10 && xrandr --setprovideroutputsource 1 0 && xrandr --output DVI-I-1 --mode 1280x960

Go to Settings > Settings Manager > Session and Startup > Application Autostart > Add.

Add an autostart with the following parameters:
Name: Displaylink
Command: bash /home/<myaccount>/displaylink.sh
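
Behind the scenes, XFCE records this as an XDG autostart entry under ~/.config/autostart. A minimal hand-written equivalent (a sketch, assuming the script path above) is ~/.config/autostart/displaylink.desktop:

[Desktop Entry]
Type=Application
Name=Displaylink
Exec=bash /home/<myaccount>/displaylink.sh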

With this configuration I need to be docked during boot for the second display to be recognized. If I dock after boot, I can manually execute ./displaylink.sh to activate the monitor. Without additional configuration I can close the lid of the laptop, deactivating the built-in screen while retaining signal to the external display.


Sentiment analysis - DT matrix

As I work with various packages related to text manipulation, I am beginning to realize what a mess the R package ecosystem can turn into: a variety of packages written by different contributors with no coordination among them, overlapping functionality, and colliding nomenclature. Many functions exist for “convenience” when base R could do the job. I also noticed this with packages like dplyr. I have commenced learning dplyr on multiple occasions only to find I don’t need it - I can do everything with base R without loading an extra package and learning new terminology. The problem I now encounter is that as these packages gain in popularity, code snippets and examples use them, and I need to learn and understand the packages to make sense of the examples.

In my previous post on text manipulation I discussed the process of creating a corpus object. In this post I will investigate what can be done with a document term matrix. Starting with the previous post’s corpus:

library(tm)   # DocumentTermMatrix() is provided by tm
dtm <- DocumentTermMatrix(corp)

There are a variety of methods available to inspect the document term matrix:

> dtm
<<DocumentTermMatrix (documents: 17, terms: 5500)>>
Non-/sparse entries: 18083/75417
Sparsity : 81%
Maximal term length: 26
Weighting : term frequency (tf)
> dim(dtm)
[1] 17 5500
> inspect(dtm[2, 50:100])
<<DocumentTermMatrix (documents: 1, terms: 51)>>
Non-/sparse entries: 10/41
Sparsity : 80%
Maximal term length: 9
Weighting : term frequency (tf)
Terms
Docs accentu accept access accid accompani accord account accumul
chapter02.xhtml 0 0 0 0 0 1 1 0
Terms
Docs accus accustom ach achiev acid acquaint acquir acr across
chapter02.xhtml 0 0 0 0 0 0 0 0 0

Note the sparsity is 81%: 75,417 of the 17 x 5500 = 93,500 entries are zero. Remove sparse terms and inspect:

> dtms <- removeSparseTerms(dtm, 0.1) # This makes a matrix that is 10% empty space, maximum.
> dtms
<<DocumentTermMatrix (documents: 17, terms: 66)>>
Non-/sparse entries: 1082/40
Sparsity : 4%
Maximal term length: 26
Weighting : term frequency (tf)

Now sparsity is down to 4%. Calculate word frequencies and plot as a histogram.

> freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
> head(freq, 14)
the said like one rock girl now littl miss look mrs know day
367 204 188 184 180 170 167 164 163 148 142 133 122
come
116
> wf <- data.frame(word=names(freq), freq=freq)
> head(wf)
word freq
the the 367
said said 204
like like 188
one one 184
rock rock 180
girl girl 170
> wf$nc <- sapply(as.character(wf$word), nchar)
> wf <- wf[wf$nc > 3,]
> library(ggplot2)
> p <- ggplot(subset(wf, freq>60), aes(word, freq))
> p <- p + geom_bar(stat="identity")
> p <- p + theme(axis.text.x=element_text(angle=45, hjust=1))
> p

We can use hierarchical clustering to group related words. I wouldn’t read much meaning into this for Picnic, but it is comforting to see the xml/html terms clustering together in the third group - a sort of positive control.

library(cluster)
dtms <- removeSparseTerms(dtm, 0.05) # This makes a matrix that is at most 5% empty space
d <- dist(t(dtms), method="euclidean")
fit <- hclust(d=d, method="ward.D")  # "ward" was renamed "ward.D" in newer R releases
plot(fit, hang=-1)
groups <- cutree(fit, k=4) # "k=" defines the number of clusters you are using
rect.hclust(fit, k=4, border="red") # draw dendrogram with red borders around the 4 clusters


We can also use K-means clustering:

library(fpc)
d <- dist(t(dtms), method="euclidean")
kfit <- kmeans(d, 4)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)

Back here I didn’t mention that when creating the epub, it would display fine on my computer, but would not display on my Nook. A solution was to pass the file through Calibre. I diff’ed files coming out of Calibre with my originals but was not able to determine the minimum set of changes required for Nook compatibility. You can download the Calibre modified epub here, and the original here. If you determine what those Nook requirements are, please inform me.
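
For what it's worth, the same Calibre pass-through can be scripted with Calibre's ebook-convert command-line tool (the file names here are placeholders):

ebook-convert picnic_original.epub picnic_nook.epub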


Sentiment analysis - Corpus

In a previous post on text manipulation I discussed text mining manipulations that could be performed with a data frame. In this post I will explore what can be done with a corpus. Start by importing the text manipulation package tm. tm has many useful methods for creating a corpus from various sources. My texts are in a directory as xhtml files, one per chapter. I will use VCorpus(DirSource()) to read the files into a corpus data object:

> library(tm)
>
> myfiles <- paste(getwd(),"/xhtmlfiles",sep="")
> corp <- VCorpus(DirSource(myfiles), readerControl = list(language="en"))
>
> length(corp)
[1] 17
> corp[[2]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 14639
> writeLines(as.character(corp[[2]]))
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta charset="UTF-8" /></head>
<body><p>Chapter 11</p><p> </p><p> Mrs Fitzhubert at the breakfast table looked out on to the mist-shrouded garden, and decided to instruct the maids to begin putting away the chintzes preparatory to the....

The variable “corp” is a 17-member list, each member containing a chapter. tm provides many useful methods for word munging, referred to as “transformations”. Transformations are applied with the tm_map() function. Below I remove white space, remove stop words, stem (i.e. remove common endings like “ing”, “es”, “s”), etc.:

corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, removePunctuation, preserve_intra_word_dashes = TRUE)
prop.nouns <- c("Albert","Miranda","Mike","Michael","Edith","Irma","Sara","Dora","Appleyard","Hussey","Fitzhubert","Bumpher","Leopold","McCraw","Marion","Woodend","Leopold","Lumley","pp","p" )
corp <- tm_map(corp, removeWords, prop.nouns)
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, stemDocument)
corp <- tm_map(corp, stripWhitespace)
> writeLines(as.character(corp[[2]]))
xml version10 encodingutf-8
html xmlnshttpwwww3org1999xhtml
headmeta charsetutf-8 head
bodypchapt 2pp
manmad improv natur picnic ground consist sever circl flat stone serv fireplac wooden privi shape japanes pagoda the creek close summer ran sluggish long dri grass now almost disappear re-appear shallow pool lunch set larg white tablecloth close shade heat sun two three spre

A corpus object allows for the addition of metadata. I will add two events per chapter, which may be useful as overlays during graphing:

meta(corp[[1]], "event1") <- "Exposition of main characters"
meta(corp[[1]], "event2") <- "Journey to the rock"
meta(corp[[2]], "event1") <- "Picnic"
meta(corp[[2]], "event2") <- "Crossing the creek"
meta(corp[[3]], "event1") <- "A surprising number of human beings are without purpose..."
meta(corp[[3]], "event2") <- "Edith screams, girls disappear"
meta(corp[[4]], "event1") <- "Sarah hasn't memorized 'The Hesperus'"
meta(corp[[4]], "event2") <- "Drag returns from the Rock"
meta(corp[[5]], "event1") <- "Michael interviewed by Constable Bumpher"
meta(corp[[5]], "event2") <- "The red cloud"
meta(corp[[6]], "event1") <- "The garden party"
meta(corp[[6]], "event2") <- "Mike decides to search for the girls"
meta(corp[[7]], "event1") <- "Mike decides to spend the night on the rock"
meta(corp[[7]], "event2") <- "Mike hallucinates on the rock"
meta(corp[[8]], "event1") <- "Michael rescued on the rock"
meta(corp[[8]], "event2") <- "Irma is found alive"
meta(corp[[9]], "event1") <- "Letters to/from parents"
meta(corp[[9]], "event2") <- "Sara informed of her debts to the school"
meta(corp[[10]], "event1") <- "Visit from the Spracks"
meta(corp[[10]], "event2") <- "Michael and Irma meet, date, break up"
meta(corp[[11]], "event1") <- "Michael avoids luncheon with Irma"
meta(corp[[11]], "event2") <- "Fitzhuberts entertain Irma"
meta(corp[[12]], "event1") <- "Irma visits the gymnasium"
meta(corp[[12]], "event2") <- "Mademoiselle de Poitiers threatens Dora Lumley"
meta(corp[[13]], "event1") <- "Reg collects his sister Dora"
meta(corp[[13]], "event2") <- "Reg and Dora die in a fire"
meta(corp[[14]], "event1") <- "Albert describes a dream about his kid sister"
meta(corp[[14]], "event2") <- "Mr Leopold thanks Albert with a cheque"
meta(corp[[15]], "event1") <- "Mrs Appleyard lies about Sara's situation"
meta(corp[[15]], "event2") <- "Mademoiselle de Poitiers reminisces about Sara"
meta(corp[[16]], "event1") <- "Mademoiselle de Poitiers letter to Constable Bumpher"
meta(corp[[16]], "event2") <- "Sara found dead"
meta(corp[[17]], "event1") <- "Newspaper extract"
meta(corp[[17]], "event2") <- ""
> meta(corp[[2]], "event2")
[1] "Crossing the creak"

The corpus object is a list of lists. The main object has 17 elements, one for each chapter, but each chapter element is also a list. The “content” variable of the list holds the original xml file contents, with each element being either xml notation, a blank line, or a paragraph of text. Looking at the second chapter, corp[[2]]$content is a list of 18 elements. The first paragraph of the chapter begins with element 6:

> length(corp[[2]]$content)
[1] 18
> corp[[2]]$content[1]
[1] "xml version10 encodingutf-8"
> corp[[2]]$content[2]
[1] "html xmlnshttpwwww3org1999xhtml"
> corp[[2]]$content[3]
[1] ""
> corp[[2]]$content[4]
[1] "headmeta charsetutf-8 head"
> corp[[2]]$content[5]
[1] "bodypchapt 2pp"
> corp[[2]]$content[6]
[1] " manmad improv natur picnic ground consist sever circl flat stone serv fireplac wooden privi shape japanes pagoda the creek close summer ran sluggish long dri grass now almost disappear re-appear shallow pool lunch set larg white tablecloth close shade heat sun two three spread gum in addit chicken pie angel cake jelli tepid banana insepar australian picnic cook provid handsom ice cake shape heart tom oblig cut mould piec tin mr boil two immens billycan tea fire bark leav now enjoy pipe shadow drag keep watch eye hors tether shade"
>

This corpus is the end product of the preprocessing stage and will be the input for the document term matrix discussed in the next post.


PAHR Sentiment Network

In my previous post on sentiment analysis I used a dataframe to plot the trajectory of sentiment across the novel Picnic at Hanging Rock. In this post I will use the same dataframe of non-unique, non-stop, greater than 3 character words (red line from an earlier post) to create a network of associated words. Words can be grouped by sentence, paragraph, or chapter. I have already removed stop words and punctuation, so I will use my previous grouping of every 15 words in the order they appear in the novel. Looking at my dataframe rows 10 to 20:

> d2[10:20,]
chpt word sentiment lexicon group
10 1 silent positive bing 1
11 1 ridiculous negative bing 1
12 1 supported positive bing 1
13 1 bust negative bing 1
14 1 tortured negative bing 1
15 1 hopeless negative bing 1
16 1 misfit negative bing 2
17 1 clumsy negative bing 2
18 1 gold positive bing 2
19 1 suitable positive bing 2
20 1 insignificant negative bing 2
>

You can see the column “group” has grouped every 15 words. First I create a table of word co-occurrences using the pair_count function (from an early version of tidytext; later releases moved this functionality to widyr::pairwise_count), then I use ggraph to create the network graph. The number of co-occurrences is reflected in edge opacity and width. At the time of this writing, ggraph was still in beta and had to be downloaded from github and built locally. The igraph package provides the graph_from_data_frame function.

library(igraph)
library(ggraph)
library(dplyr)   # provides %>% and filter()
word_cooccurences <- d2 %>%
  pair_count(group, word, sort = TRUE)
set.seed(1900)
word_cooccurences %>%
  filter(n >= 4) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 5) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  ggtitle(expression(paste("Word Network ",
                           italic("Picnic at Hanging Rock")))) +
  theme_void()


Let’s regroup every 25 words:

d2$group <- sort( c(rep(1:120, 25),rep(121,19)))


And now include only word pairs with 5 or more co-occurrences:

..... filter(n >=5) %>% ....



PAHR Sentiment Trajectory

In my previous post on sentiment I discussed the process of building data frames of chapter metrics and word lists. I will use the word data frame to monitor sentiment across the book. I am working with non-unique, non-stop, greater than 3 character words (red line from the previous post). Looking at the word list and comparing to text, I can see that the words are in the order that they appear in the novel. I will use the Bing sentiment determinations from the tidytext package to annotate each word as being either of positive or negative sentiment. I will then group by 15 words and calculate the average sentiment.

##make a dataframe of all chapters
##use non.stops which also has words with <=3 chars removed
word <- non.stops[[1]]
chpt <- rep(1, length(word))
pahr.words <- data.frame( cbind(chpt, word))
for(i in 2:17){
  word <- non.stops[[i]]
  chpt <- rep(i, length(word))
  holder <- cbind(chpt, word)
  pahr.words <- rbind(pahr.words, holder)
  rm(holder)
}
##I checked and words are in the order that they appear
##in the novel
library(tidytext)
library(dplyr)   # provides %>%, filter(), select(), count()
bing <- sentiments %>%
  filter(lexicon == "bing") %>%
  select(-score)
d2 <- pahr.words %>%
  inner_join(bing) %>%
  cbind(sort( c(rep(1:201, 15),rep(202,4)))) ##this will group words by 15 for averaging sentiment
names(d2)[5] <- "group"
d3 <- count(d2, chpt, group, sentiment)
library(tidyr)
d4 <- spread(d3, sentiment, n)
d4$sentiment <- d4$positive - d4$negative

Plot as a line graph, with odd chapters colored black and even chapters colored grey. I also annotate a few moments of trauma within the narrative.

library(ggplot2)
mycols <- c(rep(c("black","darkgrey"), 8), "black")
ggplot( d4, aes(group, sentiment, color=chpt)) +
  geom_line() +
  scale_color_manual(values = mycols) +
  geom_hline(yintercept=0, linetype="dashed", color="red") +
  annotate("text", x = 146, y = -14, label = "Hysteria in the gymnasium") +
  annotate("text", x = 147, y = -13, label = "x") +
  annotate("text", x = 12, y = -11, label = "Edith screams on Rock") +
  annotate("text", x = 35, y = -11, label = "x") +
  annotate("text", x = 68, y = -13, label = "Bad news delivered\n to Ms Appleyard") +
  annotate("text", x = 49, y = -13, label = "x")


We can see that the novel starts with a positive sentiment - “Beautiful day for a picnic…” - which gradually moves into negative territory and remains there for the majority of the book.

Does sentiment analysis really work? Depends on how accurately word sentiment is characterized. Consider the word “drag”:

> d2[d2$word=="drag",]
chpt word sentiment lexicon group
133 1 drag negative bing 9
141 1 drag negative bing 10
162 1 drag negative bing 11
169 1 drag negative bing 12
183 1 drag negative bing 13
198 1 drag negative bing 14
199 1 drag negative bing 14
213 1 drag negative bing 15
227 1 drag negative bing 16
250 1 drag negative bing 17
263 1 drag negative bing 18
275 2 drag negative bing 19
300 2 drag negative bing 20
457 3 drag negative bing 31
468 3 drag negative bing 32
585 4 drag negative bing 39
602 4 drag negative bing 41
630 4 drag negative bing 42
633 4 drag negative bing 43
665 4 drag negative bing 45
678 4 drag negative bing 46
679 4 drag negative bing 46
743 5 drag negative bing 50
1224 7 drag negative bing 82
2978 16 drag negative bing 199
>

There are many instances of the word drag annotated as negative. Consider the sentence “It’s a drag that sentiment analysis isn’t reliable.” That would be drag in a negative context. In Picnic, a drag is a buggy pulled by horses, mentioned many times, imparting lots of undeserved negative sentiment to the novel. Drag in Picnic is neutral and should have been discarded. Inspecting the sentiment annotated word list, many other examples similar to drag could be found, some providing negative, some positive sentiment, on average probably cancelling each other out. Even more abundant are words properly annotated, which, on balance may convey the proper sentiment. I would be skeptical, though, of any sentiment analysis without a properly curated word list.
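
One low-tech way to curate the list along these lines is to drop words judged neutral in this particular novel before joining with the lexicon. A minimal sketch, with a hypothetical hand-curated vector of context-neutral words:

# words the Bing lexicon scores but that are neutral in Picnic (hypothetical list)
neutral.in.context <- c("drag")
d2.curated <- d2[!(d2$word %in% neutral.in.context), ]
nrow(d2) - nrow(d2.curated)   # how many scored word occurrences were dropped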

In the next post I will look at what can be done with a corpus.


Sentiment analysis

In my previous post on text manipulation I discussed the process of OCR and text munging to create a list of chapter contents. In this post I will investigate what can be done with a data frame; future posts will discuss using a corpus and a document term matrix.

Each chapter is an XML file, so read those into a variable and inspect:

##Indicate working directory
> setwd("~/pahr/sentiment/")
> all.files <- list.files(paste(getwd(), "/xhtmlfiles", sep=""))
> all.files
[1] "chapter10.xhtml" "chapter11.xhtml" "chapter12.xhtml" "chapter13.xhtml"
[5] "chapter14.xhtml" "chapter15.xhtml" "chapter16.xhtml" "chapter17.xhtml"
[9] "chapter1.xhtml" "chapter2.xhtml" "chapter3.xhtml" "chapter4.xhtml"
[13] "chapter5.xhtml" "chapter6.xhtml" "chapter7.xhtml" "chapter8.xhtml"
[17] "chapter9.xhtml"
>

Create a dataframe that will serve as the worklist to process through, as well as hold data about each chapter. The dataframe will contain a row for each chapter and tally information such as:

  • bname: base name of the chapter XML file
  • chpt: chapter number
  • paragraphs: number of paragraphs
  • total: total number of words
  • nosmall: number of small (<4 characters) words
  • uniques: number of unique words
  • nonstop: number of non-stop words
  • unnstop: number of unique non-stop words
d <- data.frame(matrix(ncol = 9, nrow = length(all.files)))
names(d) <- c("file.name","bname","chpt","paragraphs","total","nosmall","uniques","nonstop","unnstop")
d$file.name <- all.files
for(i in 1:nrow(d)){
  numc <- nchar(d[i,"file.name"])
  d[i,"bname"] <- substring( d[i,"file.name"], 1, numc - 6)
  d[i,"chpt"] <- as.integer(substring( d[i,"file.name"], 8, numc - 6))
}
d <- d[order(d$chpt),]
> head(d)
file.name bname chpt paragraphs total nosmall uniques nonstop
9 chapter1.xhtml chapter1 1 NA NA NA NA NA
10 chapter2.xhtml chapter2 2 NA NA NA NA NA
11 chapter3.xhtml chapter3 3 NA NA NA NA NA
12 chapter4.xhtml chapter4 4 NA NA NA NA NA
13 chapter5.xhtml chapter5 5 NA NA NA NA NA
14 chapter6.xhtml chapter6 6 NA NA NA NA NA
unnstop
9 NA
10 NA
11 NA
12 NA
13 NA
14 NA
>

I will read the chapter XML files into a list and at the same time count the number of paragraphs per chapter:

library(XML)   # xmlToList() is provided by the XML package
chpts <- vector(mode="list", length=nrow(d))
for(i in 1:nrow(d)){
  chpt.num <- d[i,"chpt"]
  chpts[[chpt.num]] <- xmlToList( paste( getwd(), "/xhtmlfiles/", d[i,"file.name"], sep=""))
  d[i,"paragraphs"] <- length(chpts[[chpt.num]]$body )
}

Each quote from a character is given its own paragraph, so a high paragraph count is indicative of lots of conversation.

Next create a list for each parameter I would like to extract. Stop words are provided by the tidytext package:

library(tidytext)
total <- vector(mode="list", length=nrow(d))
nosmall <- vector(mode="list", length=nrow(d))
un <- vector(mode="list", length=nrow(d)) ##uniques no blanks
data(stop_words) #from tidytext package
non.stops <- vector(mode="list", length=nrow(d))
unstops <- vector(mode="list", length=nrow(d))
for(i in 1:nrow(d)){
  chpt.num <- d[i,"chpt"]
  total[[chpt.num]] <- strsplit(gsub( "[[:punct:]]", "", chpts[[chpt.num]])[2], " ", fixed=TRUE)
  d[i,"total"] <- length(total[[chpt.num]][[1]] )
  ##eliminate words with fewer than 4 characters
  nosmall[[chpt.num]] <- total[[chpt.num]][[1]][!(nchar(total[[chpt.num]][[1]] ) < 4)]
  d[i,"nosmall"] <- length(nosmall[[chpt.num]] )
  ##uniques
  un[[chpt.num]] <- unique(nosmall[[chpt.num]])
  d[i,"uniques"] <- length( un[[chpt.num]] )
  ##no stops (but not unique)
  non.stops[[chpt.num]] <- nosmall[[chpt.num]][!(nosmall[[chpt.num]] %in% as.list(stop_words[,1])$word)]
  d[i,"nonstops"] <- length(non.stops[[chpt.num]] )
  ##no stops AND unique
  unstops[[chpt.num]] <- un[[chpt.num]][!(un[[chpt.num]] %in% as.list(stop_words[,1])$word)]
  d[i,"unstop"] <- length(unstops[[chpt.num]] )
}
> head(d)
file.name bname chpt paragraphs total nosmall uniques nonstop
9 chapter1.xhtml chapter1 1 50 5151 2854 1649 NA
10 chapter2.xhtml chapter2 2 59 3490 1844 1077 NA
11 chapter3.xhtml chapter3 3 42 2904 1632 971 NA
12 chapter4.xhtml chapter4 4 59 4064 2011 1066 NA
13 chapter5.xhtml chapter5 5 100 6216 3267 1572 NA
14 chapter6.xhtml chapter6 6 48 3305 1741 1028 NA
unnstop nonstops unstop
9 NA 2061 1414
10 NA 1228 883
11 NA 1107 786
12 NA 1290 843
13 NA 2124 1306
14 NA 1171 835
>
plot(d$chpt, d$total, type="o", ylab="Words", xlab="Chapter Number", main="Words by Chapter", ylim=c(0,9000))
points(d$chpt, d$nosmall, type="o", col="lightblue")
points(d$chpt, d$uniques, type="o", col="blue")
points(d$chpt, d$nonstops, type="o", col="red")
points(d$chpt, d$unstop, type="o", col="orange")
# get the range for the x and y axis
legend(1, 9000, c("Total words","Big words (> 3 chars)","Unique(Big words)","Non-stop(Big words)","Unique*Non-stop(Big words)"), col=c("black","lightblue","blue","red","orange"),lty=1, pch=1, cex=0.9)

The word count trends are the same for all categories, which is expected. I am interested in the “Non-stop(Big words)”, the red line, as I don’t want to normalize word dosage i.e. if the word “happy” is present 20 times, I want the 20x dosage of the happiness sentiment that I wouldn’t get using unique words. To visually inspect the word list I will simply pull out the first 50 words from each category for chapter 2:
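
A minimal sketch of that inspection, using the lists built above:

# first 50 words of chapter 2 in each category (a sketch; variables as defined above)
head(total[[2]][[1]], 50)   # all words
head(nosmall[[2]], 50)      # words with 4 or more characters
head(un[[2]], 50)           # unique big words
head(non.stops[[2]], 50)    # non-stop big words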


Comparing nosmall to non.stops the first two words eliminated are words 9 and 24, “several” and “through”, two words I would agree don’t contribute to sentiment or content.

Next I will make a wordcloud of the entire book. To do so I must get the words into a dataframe.

word <- non.stops[[1]]
chpt <- rep(1, length(word))
pahr.words <- data.frame( cbind(chpt, word))
for(i in 2:17){
  word <- non.stops[[i]]
  chpt <- rep(i, length(word))
  holder <- cbind(chpt, word)
  pahr.words <- rbind(pahr.words, holder)
  rm(holder)
}
library('wordcloud')
wordcloud(pahr.words$word, max.words = 100, random.order = FALSE)

Appropriately “rock” is the most frequent word. The word cloud contains many proper nouns. I will make a vector of these nouns, remove them from the collection of words and re-plot:

> prop.nouns <- c("Albert","Miranda","Mike","Michael","Edith","Irma","Sara","Dora","Appleyard","Hussey","Fitzhubert","Bumpher","Leopold","McCraw","Marion","Woodend","Leopold","Lumley" )
> cloud.words <- as.character(pahr.words$word)
> ioi <- (cloud.words %in% prop.nouns)
> summary(ioi)
Mode FALSE TRUE NA's
logical 24524 1194 0
> cw2 <- cloud.words[!ioi]
>
> wordcloud(cw2, max.words = 100, random.order = FALSE)
>


In the next post I will look at what can be done with a corpus.


ebook text manipulation

In my first post on creating an ebook I discussed the physical manipulation required to convert a paperback book into images and ultimately text files. Now I want to convert the text files into an ebook. Here is the sequence of events:

  1. Organize text in chapter/page order
  2. Read into a list, combining pages into chapters
  3. Remove ligatures, common misspellings, combine hyphenated word fragments
  4. Annotate with ebook XML tags
  5. Generate the ebook

Organize text

I start with my dataframe listing all files and their page numbers and read each individual page text file into an R list.

library(tidytext)
library(dplyr)
library(stringr)
library(readr)
library(tokenizers)
> d
file.name bname chpt eo page img.num pnumber chpteo
1 ch1o-0001.txt ch1o-0001 1 o 1 1 NA 1o
2 ch1e-0010.txt ch1e-0010 1 e 2 10 NA 1e
3 ch1o-0002.txt ch1o-0002 1 o 3 2 NA 1o
4 ch1e-0009.txt ch1e-0009 1 e 4 9 NA 1e
5 ch1o-0003.txt ch1o-0003 1 o 5 3 NA 1o
6 ch1e-0008.txt ch1e-0008 1 e 6 8 NA 1e
7 ch1o-0004.txt ch1o-0004 1 o 7 4 7 1o
8 ch1e-0007.txt ch1e-0007 1 e 8 7 8 1e
9 ch1o-0005.txt ch1o-0005 1 o 9 5 9 1o
10 ch1e-0006.txt ch1e-0006 1 e 10 6 NA 1e

By counting the number of rows associated with each chapter in the dataframe, determine the number of pages per chapter, then combine those pages into a list by chapter; there are 17 chapters in Picnic. I will not annotate individual pages with page numbers, but will combine all pages into a chapter and let the epub format handle the flow.

pages <- vector(mode="list", length=nrow(d))
for(i in 1:nrow(d)){
  pages[i] <- list(read_lines( paste( getwd(), "/textfiles/", d[i,"bname"], ".txt" , sep=""), skip = 0))
}
# how many pages per chapter
chap.lengths <- vector('integer', length=17)
for(i in 1:17){
  chap.lengths[i] <- nrow(d[d$chpt == i,])
}
chapter <- vector(mode = "list", length = 17) #there are 17 chapters
page.counter <- 1
for(i in 1:length(chap.lengths)){ #i is the chapter number
  for(j in 1:chap.lengths[i]){ #j is the page number within the chapter, but I need a page counter that spans the book
    chapter[[i]] <- c(chapter[[i]], pages[[page.counter]])
    page.counter <- page.counter + 1
  }
}

Remove ligatures, correct misspellings

OCR will have introduced many misspellings, some of which can be corrected in bulk. I also want to remove ligatures, as these interfere with word recognition when I am performing spell checking. Finally, the typesetting process introduces many hyphenated words at the end of a line of text, to preserve readability. I want to remove these and let the epub flow the text instead.

I create the method replaceforeignchars which will replace ligatures and common misspellings. The replacements to be executed are tabulated in the table “fromto”:

##define method
##note: the "from" column includes the single-character ligatures ﬁ and ﬂ
fromto <- read.table(text="
from to
š s
-— -
—- -
» ''
œ oe
ﬁ fi
ﬂ fl
ğ g
mr Mr
mrs Mrs",header=TRUE)
replaceforeignchars <- function(dat,fromto) {
  for(i in 1:nrow(fromto) ) {
    dat <- gsub(fromto$from[i],fromto$to[i],dat)
  }
  dat
}
chapters2 <- chapter
for(i in 1:17){
  chapters2[[i]] <- replaceforeignchars( chapters2[[i]], fromto)
}

With pages combined into chapters, rejoin words that were hyphenated across line breaks.

chapters3 <- vector(mode = "list", length = 17) #17 chapters; build the de-hyphenated text here so the ligature-fixed chapters2 stays intact
concat.flag <- FALSE #if TRUE, concatenate the last word of the line with the first word of the next line and remove the dash
for(i in 1:17){ #i is the chapter counter
  for(j in 1:length(chapters2[[i]])){ #j is the line counter
    sentence <- chapters2[[i]][j]
    sentence.length <- nchar(sentence)
    if( substring( sentence, sentence.length, sentence.length)=='-'){
      concat.flag <- TRUE
      words.first <- tokenize_words(sentence)
      words.first.len <- length(words.first[[1]])
      words.second <- tokenize_words(chapters2[[i]][j+1])
      words.second.len <- length(words.second[[1]])
      concatenated.word <- paste( words.first[[1]][words.first.len], words.second[[1]][1], sep="", collapse="")
      new.sen.first <- paste(c(words.first[[1]][1:(words.first.len-1)],concatenated.word), sep=" ", collapse=" ")
      new.sen.second <- paste( words.second[[1]][2:(words.second.len)], sep=" ", collapse=" ")
      chapters3[[i]] <- c(chapters3[[i]], new.sen.first, new.sen.second)
    }else{
      if(concat.flag){
        concat.flag <- FALSE
      }else{
        chapters3[[i]] <- c(chapters3[[i]], chapters2[[i]][j] )
      }
    }
  }
}

In practice this didn’t work so well. Tokenizing a sentence removes capitalization, which then has to be manually corrected. There were also occasions where a line was duplicated and had to be manually corrected. I decided to remove hyphens manually while editing the text.

Next I print out each chapter as a page with xhtml annotation:

for(i in 1:17){
  out.file <- paste(getwd(), "/content/chapter", i, ".xhtml", sep="")
  cat("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n" , file = out.file, append=TRUE)
  cat( "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n\n" , file = out.file, append=TRUE)
  cat( "<head>", file = out.file, append=TRUE)
  cat( "<meta charset=\"UTF-8\" />" , file = out.file, append=TRUE)
  cat( "</head>\n", file = out.file, append=TRUE)
  cat( "<body>", file = out.file, append=TRUE)
  cat( chapters2[[i]], file = out.file, append=TRUE)   #the manually edited chapter text
  cat( "</body>", file = out.file, append=TRUE)
  cat( "\n</html>", file = out.file, append=TRUE)
}

A useful command is saveRDS which allows for the saving of R objects. Here I save my list, which I can read back into an object, modify, and resave.

saveRDS(chapters2, paste(getwd(), "/chptobj/chptobj.list", sep=""))
chapters2 <- readRDS(paste(getwd(), "/chptobj/chptobj.list", sep=""))

The package qdap provides an interactive method, check_spelling_interactive, for spell checking. A dialog box will pop up for each unrecognized word in turn, providing you with a pick list of potential corrections or the opportunity to type in a correction manually.

library(qdap)
check_spelling_interactive(pages[[100]], range=2, assume.first.correct=FALSE)

I found that the pick list often did not provide the appropriate choice, capitalization is not preserved, and Picnic has many slang words that forced interaction with qdap too frequently. I decided to read through the text and correct manually.

Here is qdap flagging the French ‘alors’. There are settings for qdap that may improve the word choices available, but I did not spend the time investigating.


Once the chapters have been edited and proofread, it is time to create the epub. An epub is a zip file with the extension “.epub”. It also has a well-defined directory layout and required files that define chapters, images, flow control, etc. ePub specifications and tutorials are readily available online. Here I will show examples of some of the epub contents.
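
For orientation, a minimal layout consistent with the files shown below (directory names vary from book to book; this sketch assumes the EPUB/ directory referenced by container.xml) looks something like:

mimetype                      (a one-line file containing: application/epub+zip)
META-INF/container.xml        (points at the .opf package file)
EPUB/pahf.opf                 (the package file: metadata, manifest, spine)
EPUB/toc.ncx                  (the navigation map)
EPUB/content/*.xhtml          (cover, title, chapters, back)
EPUB/content/images/*.jpg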

File toc.ncx

<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
<head>
<meta name="dtb:uid" content="granitemtn.net [2016.05.30-07:52:00]"/>
<meta name="dtb:depth" content="3"/>
<meta name="dtb:totalPageCount" content="190"/>
<meta name="dtb:maxPageNumber" content="190"/>
</head>
<docTitle>
<text>Picnic at Hanging Rock</text>
</docTitle>
<navMap>
<navPoint id="navpoint-1" playOrder="1">
<navLabel>
<text>Cover</text>
</navLabel>
<content src="content/cover.xhtml"/>
</navPoint>
<navPoint id="navpoint-2" playOrder="2">
<navLabel>
<text>Title Page</text>
</navLabel>
<content src="content/title.xhtml"/>
</navPoint>
.
.
.
<navPoint id="navpoint-21" playOrder="21">
<navLabel>
<text>Back</text>
</navLabel>
<content src="content/back.xhtml"/>
</navPoint>
</navMap>
</ncx>

File metadata.opf

<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
<dc:title>Picnic at Hanging Rock</dc:title>
<dc:creator opf:file-as="Lindsay, Joan" opf:role="aut">Joan Lindsay</dc:creator>
<dc:language>en-US</dc:language>
<dc:identifier id="bookid">granitemtn.net [2016.05.30-07:52:00]</dc:identifier>
<dc:rights>Public Domain</dc:rights>
</metadata>
<manifest>
<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
<item id="title" href="content/title.xhtml" media-type="application/xhtml+xml"/>
<item id="characters" href="content/characters.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter1" href="content/chapter1.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter2" href="content/chapter2.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter3" href="content/chapter3.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter4" href="content/chapter4.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter5" href="content/chapter5.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter6" href="content/chapter6.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter7" href="content/chapter7.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter8" href="content/chapter8.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter9" href="content/chapter9.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter10" href="content/chapter10.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter11" href="content/chapter11.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter12" href="content/chapter12.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter13" href="content/chapter13.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter14" href="content/chapter14.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter15" href="content/chapter15.xhtml" media-type="application/xhtml+xml"/>
<item id="chapter16" href="content/chapter16xhtml" media-type="application/xhtml+xml"/>
<item id="chapter17" href="content/chapter17.xhtml" media-type="application/xhtml+xml"/>
<item id="cover-image" href="content/images/cover.jpg" media-type="image/jpeg"/>
<item id="back-image" href="content/images/back.jpg" media-type="image/jpeg"/>
<item id="cover" href="content/cover.xhtml" media-type="application/xhtml+xml"/>
<item id="back" href="content/back.xhtml" media-type="application/xhml+xml"/>
</manifest>
<spine>
<itemref idref="cover"/>
<itemref idref="title"/>
<itemref idref="characters"/>
<itemref idref="chapter1"/>
<itemref idref="chapter2"/>
<itemref idref="chapter3"/>
<itemref idref="chapter4"/>
<itemref idref="chapter5"/>
<itemref idref="chapter6"/>
<itemref idref="chapter7"/>
<itemref idref="chapter8"/>
<itemref idref="chapter9"/>
<itemref idref="chapter10"/>
<itemref idref="chapter11"/>
<itemref idref="chapter12"/>
<itemref idref="chapter13"/>
<itemref idref="chapter14"/>
<itemref idref="chapter15"/>
<itemref idref="chapter16"/>
<itemref idref="chapter17"/>
<itemref idref="back"/>
</spine>
</package>

File container.xml:

<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
<rootfiles>
<rootfile full-path="EPUB/pahf.opf"
media-type="application/oebps-package+xml" />
</rootfiles>
</container>

File - an example chapter:

<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta charset="UTF-8" /></head>
<body><p>Chapter 2</p><p> </p>
<p> Manmade improvements on Nature at the Picnic Grounds consisted of several circles of flat stones to serve as fireplaces and a wooden privy in the shape of a Japanese pagoda. The creek at the close of summer ran sluggishly through long dry grass, now and then almost disappearing to re-appear as a shallow pool. Lunch had been set out on large white tablecloths close by, shaded from the heat of the sun by two or three spreading gums. In addition to the chicken pie, angel cake, jellies and the tepid bananas inseparable from an Australian picnic, Cook had provided a handsome iced cake in the shape of a heart, for which Tom had obligingly cut a mould from a piece of tin. Mr Hussey had boiled up two immense billycans of tea on a fire of bark and leaves and was now enjoying a pipe in the shadow of the drag where he could keep a watchful eye on his horses tethered in the shade. </p>
.
.
.
<p> The four girls were already out of sight when Mike came out of the first belt of trees. He looked up at the vertical face of the Rock and wondered how far they would go before turning back. The Hanging Rock, according to Albert, was a tough proposition even for experienced climbers. If Albert was right and they were only schoolgirls about the same age as his sisters in England, how was it they were allowed to set out alone, at the end of a summer afternoon? He reminded himself that he was in Australia now: Australia, where anything might happen. In England everything had, been done before: quite often by one’s own ancestors, over and over again. He sat down on a fallen log, heard Albert calling him through the trees, and knew that this was the country where he, Michael Fitzhubert, was going to live. What was her name, the tall pale girl with straight yellow hair, who had gone skimming over the water like one of the white swans on his Uncle’s lake? </p><p></p></body>
</html>

Once the files are in order they are zipped into an epub. Navigate to the directory containing your files and:

zip -Xr9D Picnic_at_Hanging_Rock.epub mimetype * -x .DS_Store

Some of the switches I am using:

-X: Exclude extra file attributes (permissions, ownership, anything that adds extra bytes)

-r: Recurse into directories

-9: Slowest but most optimized compression

-D: Do not create directory entries in the zip archive

-x .DS_Store: Don’t include Mac OS X’s hidden file of snapshots etc.
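
One caveat: the EPUB container spec expects the mimetype entry to be the first file in the archive and stored uncompressed. If a validator or reader complains, a common two-step variant (same file names as above) is:

zip -X -0 Picnic_at_Hanging_Rock.epub mimetype
zip -Xr9D Picnic_at_Hanging_Rock.epub * -x mimetype .DS_Store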

The next post in this series discusses sentiment analysis.


Create an eBook

One of my all-time favorite movies is Picnic at Hanging Rock by Peter Weir. Every scene is a painting, and the atmosphere transports you back to the Australian bush of 1900. The movie is based on a book by Joan Lindsay, who had the genius to leave the plot’s main mystery unresolved. During her lifetime she never discouraged anyone from claiming the book was based on real events. After her death in 1984 a “lost” final chapter was discovered, which purportedly resolved the mystery. Most (including myself) believe the final chapter is a hoax.
Recently on R-bloggers there has been a run of articles discussing sentiment analysis. I thought it would be fun to text mine and sentiment analyze Picnic. I purchased a paperback version of the book years ago, which I read while on vacation.


My book is old and the pages are yellowing. Time to preserve it for posterity.
In this post I will discuss converting a paperback into an ebook. Future posts will discuss the text mining/sentiment analysis. The steps are:

  1. Cut off the spine
  2. Scan the pages, one image per page
  3. Perform OCR (optical character recognition)
  4. Assemble the text in page order
  5. Proofread

As an aside, one of the most impressive crowd sourcing pieces of software I have seen is Project Gutenberg’s Distributed Proofreaders website. Dump in your scanned images and the site will coordinate proofreading and text assembly. Procedures are in place for managing the workflow, resolving discrepancies, motivating volunteers, etc. Picnic doesn’t qualify for this treatment as it is not in the public domain. I will have to do it myself.

Cut off the spine

I used a single edge razor blade. Cut as smoothly and straight as possible. Keep the pages in numerical order.

Scan

I have an HP OfficeJet 5610 All-in-One multifunction printer equipped with a document feeder. I am working with Debian Linux, so I use Xsane as the scanning software. Searching the web I find that there is a lot of discussion concerning the optimum resolution, color, and file format for images destined for OCR. I decided on 300dpi grayscale TIFF, which in retrospect was a good choice. I load one chapter at a time onto the document feeder, positioned so that the smooth edge enters the feeder first. This results in odd pages being rotated 90 degrees counterclockwise, and even pages being rotated 90 degrees clockwise. Xsane will auto-number the images, but I supply a prefix following the convention “chNN[e|o]-NNNN”, where NN is the chapter number, e|o marks even or odd pages, and NNNN is the Xsane-supplied image number (e.g. ch10e-0003 is the third even-page image scanned for chapter 10). The image number starts at 1 for each set (even or odd) of chapter pages.
Once all images are scanned, I will need to rotate either 90 or 270 degrees to prepare for OCR, using the rotate.image function from the adimpro package. I use the following code, depositing the rotated images in a separate directory:

library("adimpro")
#populate a vector with all image file names
all.files <- list.files(paste(getwd(), "/rawimages", sep=""))
for(i in 1:length(all.files)){
img <- read.image(paste(getwd(),"/rawimages/", all.files[i], sep=""))
if(nchar(all.files[i])==14){enum <- 4}else{enum <- 5}
if( substring(all.files[i],enum,enum)=="e" ){
img <- rotate.image(img, angle = 270, compress=NULL)
}
if( substring(all.files[i],enum,enum)=="o" ){
img <- rotate.image(img, angle = 90, compress=NULL)
}
write.image(img, file = paste(getwd(),"/rotatedimages/", all.files[i], sep=""))
}

OCR

Next perform OCR on each image. I use tesseract from Google which has a Debian package.

for(i in 1:length(all.files)){
  basefile <- substring(all.files[i], 1, nchar(all.files[i])-5)
  system( paste("tesseract", paste(getwd(),"/rotatedimages/", all.files[i], sep=""), paste(getwd(),"/textfiles/", basefile, sep=""), sep=" "))
}

Seems to work well. Here is a comparison of image and text:


Assemble text

I need to create a table of textfile name, page number, words per page etc. to coordinate assembly of the final text and assist with future text mining. Here are the contents of the all.files variable:

> all.files <- list.files(paste(getwd(), "/textfiles", sep=""))
> all.files
[1] "ch10e-0001.txt" "ch10e-0002.txt" "ch10e-0003.txt" "ch10e-0004.txt"
[5] "ch10e-0005.txt" "ch10e-0006.txt" "ch10o-0001.txt" "ch10o-0002.txt"
[9] "ch10o-0003.txt" "ch10o-0004.txt" "ch10o-0005.txt" "ch10o-0006.txt"
[13] "ch11e-0001.txt" "ch11e-0002.txt" "ch11e-0003.txt" "ch11e-0004.txt"
[17] "ch11o-0001.txt" "ch11o-0002.txt" "ch11o-0003.txt" "ch11o-0004.txt"
.....

Make a data.frame extracting relevant information from the filenames:

all.files <- list.files(paste(getwd(), "/textfiles", sep=""))
d <- data.frame(matrix(ncol = 11, nrow = 190))
names(d) <- c("file.name","bname","chpt","eo","page","lines","words","img.num","text","problems","pnumber")
d$file.name <- all.files
for(i in 1:nrow(d)){
  numc <- nchar(d[i,"file.name"])
  d[i,"bname"] <- substring( d[i,"file.name"], 1, numc - 4)
  if(numc==13){
    d[i,"chpt"] <- as.numeric(as.character(substring(d[i,"file.name"], 3, 3)))
    d[i,"eo"] <- substring( d[i,"file.name"], 4, 4)
    d[i,"img.num"] <- substring( d[i,"file.name"], 6, 9)
  }else{
    d[i,"chpt"] <- as.numeric(as.character(substring( d[i,"file.name"], 3, 4)))
    d[i,"eo"] <- substring( d[i,"file.name"], 5, 5)
    d[i,"img.num"] <- substring( d[i,"file.name"], 7, 10)
  }
}

Read in all the pages of text using the read_lines function from the readr package:

library(readr)
pages <- vector(mode="list", length=nrow(d))
for(i in 1:nrow(d)){
  pages[i] <- list(read_lines( paste( getwd(), "/textfiles/", d[i,"bname"], ".txt" , sep=""), skip = 0))
}

If I look at some random pages, I can see that usually the second to the last line has the page number, when it exists on a page:

>pages[[10]]
.......
[63] "needed, the poor young things . . ."
[64] "As soon- as he Could escape from his Aunt’s dinner table"
[65] "1 I 7" #actually page 117
[66] ""
>pages[[21]]
.......
[37] "The gold padlock on the Head’s heavy chain bracelet rattled"
[38] ""
[39] "142"
[40] ""
>

Many of the page numbers are corrupt i.e. there are random characters thrown in by mistake by the OCR. I make note of these characters and use gsub to get rid of them. Some escape my efforts, but enough are accurate that I can compare the extracted page number to the expected page number, determined by the order in which the pages were fed into the scanner.
I will extract the second to the last line (stll) and include it in my table:

pnumber <- vector(mode="list", length=190)
for(i in 1:nrow(d)){
  stll <- pages[[i]][length(pages[[i]])-1] #second to last line
  #get rid of: ' . : - x |  (a single character class replaces the original chain of gsub calls)
  pnumber[[i]] <- gsub("[.':x|-]", "", stll)
  #as.integer() yields NA (with a warning) when non-numeric characters remain
  d[i,"pnumber"] <- suppressWarnings(as.integer(pnumber[[i]]))
}

For the expected page number, create a column “chpteo” which is the concatenation of the chapter number and e or o for even/odd. Number these sequentially in steps of 2.

d$chpteo <- paste0(d$chpt, d$eo)
odds <- d[d$eo=="o",]
odds <- odds[ order(c(as.numeric(as.character(odds$chpt)), as.numeric(as.character(odds$img.num)))),]
odds <- odds[!is.na(odds$file.name),]
odds$page <- seq(1,189,by=2)
evens <- d[d$eo=="e",]
evens <- evens[ order(c(as.numeric(as.character(evens$img.num))), decreasing=TRUE),]
evens <- evens[ order(c(as.numeric(as.character(evens$chpt)))),]
evens <- evens[!is.na(evens$file.name),]
evens$page <- seq(2,190,by=2)
d2 <- rbind(evens, odds)
d2 <- d2[order(d2$page),]

Here is what my data.frame “d2” looks like:

> head(d2)
file.name bname chpt eo page lines words img.num text problems
91 ch1o-0001.txt ch1o-0001 1 o 1 NA NA 0001 NA NA
92 ch1o-0002.txt ch1o-0002 1 o 3 NA NA 0002 NA NA
93 ch1o-0003.txt ch1o-0003 1 o 5 NA NA 0003 NA NA
94 ch1o-0004.txt ch1o-0004 1 o 7 NA NA 0004 NA NA
95 ch1o-0005.txt ch1o-0005 1 o 9 NA NA 0005 NA NA
96 ch1o-0006.txt ch1o-0006 1 o 11 NA NA 0006 NA NA
pnumber chpteo
91 NA 1o
92 NA 1o
93 NA 1o
94 7 1o
95 9 1o
96 NA 1o

“page” is the expected page number based on scanning order.
“pnumber” is the OCR extracted page. Compare them:

> d2[,c("page","pnumber")]
page pnumber
91 1 NA
92 3 NA
93 5 NA
94 7 7
95 9 9
96 11 NA
97 13 NA
98 15 NA
99 17 17
100 19 NA
109 21 NA
110 23 23
111 25 95
112 27 97
113 29 29
114 31 31
115 33 33
116 35 35
122 37 37
123 39 39

Looks good. There are some OCR errors but enough come through to verify that the order is correct. Now I can sort on page and use that order to assemble the ebook. Read each page file and append to an output file. Since I want to be able to refer to images to correct problems, I also insert the image information between text files:

####write it all out
out.file <- paste(getwd(), "/ebook-draft/output.txt", sep="")
for(i in 1:nrow(d2)){
  a <- readLines(con = paste(getwd(), "/textfiles/", d2[i,"file.name"], sep=""), n = -1L, ok = TRUE, warn = TRUE, encoding = "unknown", skipNul = FALSE)
  cat( paste(d2[i,"file.name"], "\n\n", sep=""), file = out.file, fill=80, append=TRUE)
  cat(a, file = out.file, fill=80, append=TRUE)
  cat("\n\n", file = out.file, fill=80, append=TRUE)
}

Here is what a page junction looks like:


You can see the page number when present, which will provide a method to confirm the correct order. The file name is included, which will allow me to go back to the original image during the proofreading process to verify words I may be uncertain of.

Proofread

It would be nice to have the image and text juxtaposed during the proofreading process. To see what this looks like, take a look at Project Gutenberg’s Distributed Proofreaders website. I will have to read on a device that allows me to refer to the images when needed. Once the proofreading is complete, I will be ready for sentiment analysis.

The next post in this series discusses text manipulation.


Euler-98

https://projecteuler.net/problem=98

By replacing each of the letters in the word CARE with 1, 2, 9, and 6 respectively, we form a square number: 1296 = 36^2. What is remarkable is that, by using the same digital substitutions, the anagram, RACE, also forms a square number: 9216 = 96^2. We shall call CARE (and RACE) a square anagram word pair and specify further that leading zeroes are not permitted, neither may a different letter have the same digital value as another letter.

Using words.txt, a 16K text file containing nearly two-thousand common English words, find all the square anagram word pairs (a palindromic word is NOT considered to be an anagram of itself).

What is the largest square number formed by any member of such a pair?

NOTE: All anagrams formed must be contained in the given text file.

rm(list=ls(all=TRUE))
#I want all large integers manipulated without scientific notation
options( scipen = 20 ) ##don't use scientific notation
options(digits=22)
> words <- scan( file= paste(getwd(),"p098_words.txt", sep="/"), what="list", sep=",",skip=0, quote="\"")
Read 1786 items
>
> counts <- nchar(words)
> max.counts <- max(counts)
> max.counts
[1] 14
>
> min.counts <- min(counts)
> min.counts
[1] 1
>
> words.len <- length(words)
>
> d <- data.frame(counts, words)
> head(d)
counts words
1 1 A
2 7 ABILITY
3 4 ABLE
4 5 ABOUT
5 5 ABOVE
6 7 ABSENCE
> tail(d)
counts words
1781 3 YET
1782 3 YOU
1783 5 YOUNG
1784 4 YOUR
1785 8 YOURSELF
1786 5 YOUTH
>

There are 1786 words, the longest is 14 characters and the smallest is 1 character. Since we have already been given the square anagram word pair CARE / RACE I will assume the answer is greater than 4 characters and ignore all 1-4 character words.

Write a function compare.word that will take 2 words, sort the characters and determine if the two words have the same characters.

> compare.word <- function( a, b){
+ aa <-sort(strsplit (a,"")[[1]])
+ bb <-sort(strsplit (b,"")[[1]])
+ (length(aa)==sum(aa==bb))
+ }
>
##process through all words looking for words that have the same characters
##get words from d by counts; only consider words with 5 or more characters
mywords <- data.frame(matrix(nrow=1, ncol=5))   #results holder, initialized the same way mynums is initialized further below
names(mywords) <- c("len","row","col","a","b")
for(i in 5:14){
  d2 <- d[d$counts==i,]
  d2.len <- nrow(d2)
  col.index <- 2
  row.index <- 1
  for(row in row.index:d2.len){
    for(col in col.index:d2.len){
      if(compare.word(as.character(d2[col,2]), as.character(d2[row,2]))){
        mywords <- rbind(mywords, c(i, row, col, as.character(d2[row,2]), as.character(d2[col,2])))
      }
    }
    if(col.index < d2.len) col.index <- col.index+1
  }
}
mywords <- mywords[!is.na(mywords$a),]
> mywords$same <- mywords$a!=mywords$b
> mywords <- mywords[mywords$same==TRUE,]
> mywords
len row col a b same
21 5 18 175 ARISE RAISE TRUE
22 5 31 36 BOARD BROAD TRUE
23 5 68 101 EARTH HEART TRUE
24 5 118 219 LEAST STEAL TRUE
25 5 141 237 NIGHT THING TRUE
26 5 157 194 PHASE SHAPE TRUE
27 5 172 173 QUIET QUITE TRUE
28 5 196 236 SHEET THESE TRUE
29 5 199 208 SHOUT SOUTH TRUE
30 5 240 279 THROW WORTH TRUE
32 6 53 228 CENTRE RECENT TRUE
33 6 69 272 COURSE SOURCE TRUE
34 6 71 89 CREDIT DIRECT TRUE
35 6 74 135 DANGER GARDEN TRUE
36 6 110 111 EXCEPT EXPECT TRUE
37 6 132 231 FORMER REFORM TRUE
38 6 144 234 IGNORE REGION TRUE
41 8 35 125 CREATION REACTION TRUE
43 9 62 87 INTRODUCE REDUCTION TRUE
>
> nrow(mywords)
[1] 19
>
> len.of.int <- unique(mywords$len)
> len.of.int
[1] 5 6 8 9
>

There are 19 anagram pairs, with lengths of 5, 6, 8, or 9 characters.

Now figure out how many squares with 5, 6, 8, or 9 digits exist.

These lie between 100^2 = 10,000 (the smallest 5-digit square) and 31622^2 = 999,950,884 (the largest 9-digit square), so the loop below runs from 100 to 31623 and the digit count is used to filter.

nums <- vector(mode="numeric", 31524)   #31623 - 100 + 1 squares
counter <- 1
for(i in 100:31623){
  nums[counter] <- as.numeric(i^2)
  counter <- counter + 1
}
num.lengths <- nchar(nums)
d.nums <- data.frame(nums, num.lengths)
> head(d.nums)
nums num.lengths
1 10000 5
2 10201 5
3 10404 5
4 10609 5
5 10816 5
6 11025 5
> tail(d.nums)
nums num.lengths
31519 999697924 9
31520 999761161 9
31521 999824400 9
31522 999887641 9
31523 999950884 9
31524 1000014129 10
>
##compare.square is used below but its definition did not survive in this listing;
##presumably it is the digit analogue of compare.word, something like:
compare.square <- function( a, b){
  aa <- sort(strsplit(as.character(a), "")[[1]])
  bb <- sort(strsplit(as.character(b), "")[[1]])
  (length(aa) == sum(aa == bb))
}
mynums <- data.frame(matrix(nrow=1,ncol=5))
names(mynums) <- c("len","row","col","a","b")
for( i in len.of.int){
  ##d.nums is the data.frame that holds the squares
  ##and their lengths
  temp.nums <- as.numeric(d.nums[d.nums$num.lengths==i,"nums"])
  temp.len <- length(temp.nums)
  col.index <- 2
  row.index <- 1
  for(row in row.index:temp.len){
    for(col in col.index:temp.len){
      if(compare.square(as.integer(temp.nums[col]), as.integer(temp.nums[row]))){
        mynums <- rbind(mynums, c(i, col, row, temp.nums[col], temp.nums[row]))
      }
    }
    if(col.index < temp.len) col.index <- col.index+1
  }
}
> head(mynums)
len row col a b
2 5 11 2 12100 10201
3 5 21 3 14400 10404
4 5 102 3 40401 10404
5 5 111 3 44100 10404
6 5 31 4 16900 10609
7 5 41 4 19600 10609
> tail(mynums)
len row col a b
40991 9 21612 21516 999255321 993195225
40992 9 21526 21517 993825625 993258256
40993 9 21589 21530 997801744 994077841
40994 9 21574 21538 996854329 994582369
40995 9 21594 21573 998117649 996791184
40996 9 21623 21623 999950884 999950884
> nrow(mynums)
[1] 40994
>

There are about 41,000 pairs of squares whose digits are anagrams of one another. Assign a square to a word, rearrange the word according to the square pair’s digit mapping, and determine if the new word is in the anagram list.

scramble <- function( w, n1, n2){
  #w: the word; n1: the first square; n2: the rearranged square
  w.vector <- strsplit(w,"")[[1]]
  n1.vector <- as.integer(strsplit(as.character(n1),"")[[1]])
  n2.vector <- as.integer(strsplit(as.character(n2),"")[[1]])
  d1 <- data.frame(col1=w.vector, col2= n1.vector, stringsAsFactors = FALSE)
  paste(as.character(d1$col1[match(n2.vector, d1$col2)]), sep="", collapse="")
}
holder <- data.frame(matrix(nrow=1,ncol=5))
names(holder) <- c("len","wa","wb","na","nb")
for(i in len.of.int){
  temp.words <- mywords[mywords$len==i,]
  temp.nums <- mynums[mynums$len==i,]
  for(j in 1:nrow(temp.words)){
    for(k in 1:nrow(temp.nums)){
      new.word <- temp.words[ match(scramble(temp.words[j,"a"], temp.nums[k,"a"], temp.nums[k,"b"]), temp.words$a), "a"]
      if( !is.na( new.word)){
        holder <- rbind(holder, c(i, temp.words[j,"a"], as.character(new.word), temp.nums[k,"a"], temp.nums[k,"b"] ))
      }
    }
  }
}
> holder
len wa wb na nb
1 5 BROAD BOARD 18769 17689
2 6 CENTRE RECENT 436921 214369

CENTRE/RECENT assigns different digits to the two E’s, violating one of the rules, so BROAD/BOARD is the square anagram pair and the largest square is 18769.
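
As a quick sanity check (simple arithmetic, not from the original post), both members of the winning pair are indeed perfect squares:

sqrt(c(18769, 17689))   # 137 133, i.e. 18769 = 137^2 and 17689 = 133^2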


Euler-17

https://projecteuler.net/problem=17

If the numbers 1 to 5 are written out in words: one, two, three, four, five, then there are 3 + 3 + 5 + 4 + 4 = 19 letters used in total.

If all the numbers from 1 to 1000 (one thousand) inclusive were written out in words, how many letters would be used?

NOTE: Do not count spaces or hyphens. For example, 342 (three hundred and forty-two) contains 23 letters and 115 (one hundred and fifteen) contains 20 letters. The use of “and” when writing out numbers is in compliance with British usage.

rm(list=ls(all=TRUE))
a <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten")
b <- c("eleven","twelve","thirteen","fourteen","fifteen","sixteen","seventeen","eighteen","nineteen")
#don't use c as a variable
#this is to concatenate with each tens
d <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine")
#20 through 29: ten copies of "twenty" plus the unit words for 21-29
e <- c(rep("twenty", 10), d)
e
## [1] "twenty" "twenty" "twenty" "twenty" "twenty" "twenty" "twenty"
## [8] "twenty" "twenty" "twenty" "one" "two" "three" "four"
## [15] "five" "six" "seven" "eight" "nine"
#30 through 39, and similarly for the other tens
f <- c(rep("thirty", 10), d)
g <- c(rep("forty", 10), d)
h <- c(rep("fifty", 10), d)
i <- c(rep("sixty", 10), d)
j <- c(rep("seventy", 10), d)
k <- c(rep("eighty", 10), d)
l <- c(rep("ninety", 10), d)
#character count for 1 through 99
through99 <- sum(nchar(c(a,b,e,f,g,h,i,j,k,l)))
through99
## [1] 854
#100, 200,300, ...900
m <- sum(nchar( c(rep("hundred",9),d))) #hundreds
#101-199
n <- sum(nchar(rep("onehundredand",99))) + through99
#201-299 etc.
o <- sum(nchar(rep("twohundredand",99))) + through99
p <- sum(nchar(rep("threehundredand",99))) + through99
q <- sum(nchar(rep("fourhundredand",99))) + through99
r <- sum(nchar(rep("fivehundredand",99))) + through99
s <- sum(nchar(rep("sixhundredand",99))) + through99
t <- sum(nchar(rep("sevenhundredand",99))) + through99
u <- sum(nchar(rep("eighthundredand",99))) + through99
v <- sum(nchar(rep("ninehundredand",99))) + through99
w <- sum(nchar("onethousand"))
results <- sum(through99,m,n,o,p,q,r,s,t,u,v,w)
results
## [1] 21124
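
As a sanity check (not part of the original post), the letter counts for the two examples in the problem statement come out as expected:

nchar("threehundredandfortytwo")   # 23
nchar("onehundredandfifteen")      # 20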