Statistical Analysis of Written Texts
Modern European Portuguese vs. Brazilian Portuguese
This work is part of the project “Rhythmic Patterns Parameter Setting & Language Change” [1]. The main goal of the present work is to develop a methodology for detecting the rhythmic patterns of spoken language in written texts.
In carrying out the investigation, I met with more difficulties than I had foreseen. Consider the term used above: RHYTHM. What is RHYTHM? What are the variables that express rhythm?
After several fruitful discussions, I concluded, I hope rightly, that the primary stress (its position in the word and its relative position) must be evaluated.
To create the learning sets, I took 20 written texts of European Portuguese (EP) and 23 of Brazilian Portuguese (BP), published in the last 10 years. These texts are short stories or excerpts from books by well-known authors (Tab. 1). They are formal but popular.
We deal with the following variables:
MA - unstressed monosyllable
MT - stressed monosyllable
OX - word stressed on the last syllable
PA - word stressed on the penultimate syllable
PR - word stressed on the antepenultimate syllable
For each of the above variables we compute the number of words in that group divided by the total number of words in the respective text.
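The proportion described above can be sketched in a few lines. This is only an illustration: it assumes every word of a text has already been labelled with one of the five classes (the labelling step itself is not shown), and the function name `class_proportions` and the sample labels are hypothetical.

```python
from collections import Counter

# The five stress classes defined in the text.
CLASSES = ("MA", "MT", "OX", "PA", "PR")


def class_proportions(labels):
    """Return, for one text, the share of words falling in each stress class.

    `labels` holds one class code per word, in any order.
    """
    total = len(labels)
    if total == 0:
        raise ValueError("the text has no words")
    counts = Counter(labels)
    # Counter returns 0 for classes that never occur in the text.
    return {c: counts[c] / total for c in CLASSES}


# Hypothetical labelling of an eight-word text (not taken from the corpus).
sample_labels = ["MA", "PA", "OX", "PA", "MT", "PR", "PA", "MA"]
proportions = class_proportions(sample_labels)
print(proportions)
```

Each text then contributes one value per class, and the five values sum to 1 because every word is assumed to belong to exactly one class.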
We also analyse the position of the primary stress within the sentence, taking the syllable as the unit of measure. For example, take the sentence
“Pedro conversou demoradamente comigo”.
This sentence has 14 syllables, as follows:
Pe/dro con/ver/sou de/mo/ra/da/men/te co/mi/go.
The primary stresses are marked in capitals:
PE/dro con/ver/SOU de/mo/ra/da/MEN/te co/MI/go.
We call “first distance” (d1) the number of syllables between the beginning of the sentence and the first primary stress, “second distance” (d2) the number of syllables between the first primary stress and the second primary stress, and so on.
For the example, we have d1 = 0, d2 = 3, d3 = 4 and d4 = 2.
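The distances for the example sentence can be reproduced with a short sketch. It assumes a marking convention that is ours, not the source's: syllables are separated by "/" or a space, and a syllable carrying primary stress is written in capitals. The function name `stress_distances` is hypothetical.

```python
def stress_distances(marked_sentence):
    """Return [d1, d2, ...] for a sentence with syllables split by '/' or spaces.

    Assumed convention: a syllable written in capitals carries primary stress.
    Each distance is the number of unstressed syllables between one primary
    stress and the previous one (or the sentence start, for d1).
    """
    syllables = marked_sentence.replace(" ", "/").split("/")
    distances = []
    previous = 0  # 1-based position of the previous stress; 0 = sentence start
    for position, syllable in enumerate(syllables, start=1):
        if syllable.isupper():
            distances.append(position - previous - 1)
            previous = position
    return distances


example = "PE/dro con/ver/SOU de/mo/ra/da/MEN/te co/MI/go."
distances = stress_distances(example)
print(distances)
```

The stressed syllables sit at positions 1, 5, 10 and 13 of the 14 syllables, which gives the distances 0, 3, 4 and 2 stated in the text.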
In this way, for each text we compute
D1: the mean of d1 over its sentences,
D2: the mean of d2 over its sentences, etc.
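A minimal sketch of this averaging step follows. It assumes that each sentence of a text has already been reduced to its list of distances, and that Dk is averaged only over the sentences that have at least k primary stresses; the source does not say how shorter sentences are handled, so that choice is an assumption. The name `mean_distances` and the sample data are hypothetical.

```python
from statistics import mean


def mean_distances(per_sentence):
    """Return [D1, D2, ...] for one text.

    `per_sentence` is a list with one distance list [d1, d2, ...] per sentence.
    Assumption: Dk averages over the sentences that contain a k-th stress.
    """
    longest = max((len(d) for d in per_sentence), default=0)
    result = []
    for k in range(longest):
        values = [d[k] for d in per_sentence if len(d) > k]
        result.append(mean(values))
    return result


# Hypothetical text with three sentences (the first is the example sentence).
text_distances = [[0, 3, 4, 2], [2, 1], [1, 2, 3]]
D = mean_distances(text_distances)
print(D)
```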
We begin with descriptive statistics of our data, shown below as box plots for all variables. Box plots, also called box-and-whisker plots, are particularly useful for showing the distributional characteristics of data.
A line is drawn across the box at the median. By default, the bottom of the box is at the first quartile (Q1) and the top is at the third quartile (Q3). The whiskers are the lines that extend from the top and bottom of the box to the adjacent values. The adjacent values are the lowest and highest observations that are still inside the region defined by the following limits:
Lower limit: Q1 - 1.5 (Q3 - Q1)
Upper limit: Q3 + 1.5 (Q3 - Q1)
Outliers are points outside the lower and upper limits and are plotted with asterisks (*).
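The limits, adjacent values and outliers can be computed as in the sketch below. Statistical packages differ in how they interpolate quartiles, so the quartile method used here (Python's "inclusive" method) is an assumption and may not match the software that produced the original plots. The name `boxplot_summary` and the sample data are hypothetical.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Return quartiles, limits, adjacent values and outliers for a box plot."""
    data = sorted(values)
    # Assumed quartile definition; other packages may interpolate differently.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    inside = [x for x in data if lower <= x <= upper]
    outliers = [x for x in data if x < lower or x > upper]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_limit": lower,
        "upper_limit": upper,
        "adjacent": (inside[0], inside[-1]),  # whisker ends
        "outliers": outliers,                 # plotted with asterisks
    }


# Hypothetical sample with one extreme observation.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(summary)
```

In this sample the whiskers stop at 1 and 9, and the value 50 lies above the upper limit, so it would be drawn as an asterisk.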
The box plots below show each variable for the two groups, BP and EP.
[Box plots for BP and EP: unstressed monosyllabic word; last stressed syllable; penultimate stressed syllable; antepenultimate stressed syllable; first distance]
[Profiles of the first distance in BP and in EP]
BP and EP appear to behave differently on all variables.
Inferential Statistical Analysis
We use binary logistic regression, which models a response variable that takes only two values, here BP and EP, and can therefore be used to classify each text into one of the two groups.
The vertical axis indicates the probability that a specific text belongs to EP.
There are only two misclassifications, one EP text and one BP text, which is a very good result.
The fitted model, on the logit (log-odds) scale, is
logit P(text belongs to EP) = 537.6 - 310.1 MA - 443.4 MT - 3539 OXI - 248 PA - 991 PA2 + 6412 OX*PA + 15.303 D1 - 22.29 S(D1)
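In binary logistic regression, a fitted linear combination of predictors is normally read as the log-odds of the event, and the probability follows by applying the logistic function. The sketch below shows only that conversion, assuming this reading applies to the model above; the values passed to it are hypothetical and are not computed from any text in the learning set.

```python
import math


def logistic(eta):
    """Map a linear predictor (log-odds) to a probability between 0 and 1."""
    # Two branches avoid overflow in exp() for large positive or negative eta.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


# Hypothetical linear-predictor values, for illustration only.
for eta in (-5.0, 0.0, 5.0):
    print(eta, logistic(eta))
```

A linear predictor of 0 corresponds to a probability of 0.5, large positive values approach 1 (EP) and large negative values approach 0 (BP), which is why the fitted probabilities can be so close to 0 or 1.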
Author BP | Probability | Author EP | Probability
Adelia | 0 | Venancio | 0.62656
AntonioMaria | 0 | Sa | 0.99997
Chico | 0.39659 | Carvalho | 0.99929
Dinah | 0.00004 | Carmelo | 0.90238
Inacio | 0 | Venda | 0.81739
JoaoSaldanha | 0.00276 | Cruz | 0.99939
Lara Resende | 0.19836 | Navarro | 0.93045
Lygia | 0.10032 | Almeida | 0.01951
Mendes Campos | 0.34252 | Faria | 0.97402
Raquel Queiroz | 0.41368 | Moreira | 0.99861
Rubem Braga | 0.01267 | Moreira2 | 0.79481
Sergio Porto | 0.04571 | Moreira3 | 1
Ubaldo | 0 | Faires | 0.99798
Verissimo | 0 | Claudio | 1
Ziraldo | 0.09044 | Nery | 0.9995
gemeas | 0.03742 | Mmiranda | 0.91339
latricerio | 0.00212 | Mmiranda2 | 0.99968
ioga | 0.56013 | Cabral | 0.87516
anjos | 0.07006 | Leon | 0.99608
tropicais | 0.30143 | Aguiar | 0.99934
pertencer | 0.05358 | Saramago | 0.52862
cracha | 0 | |
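The probabilities in the table above can be turned into group labels with a cut-off. The sketch below assumes a cut-off of 0.5, which the source does not state, and copies the fitted probabilities from the table; the names `classify`, `bp_texts` and `ep_texts` are hypothetical.

```python
# Fitted probabilities of belonging to EP, copied from the table above.
bp_texts = {
    "Adelia": 0.0, "AntonioMaria": 0.0, "Chico": 0.39659, "Dinah": 0.00004,
    "Inacio": 0.0, "JoaoSaldanha": 0.00276, "Lara Resende": 0.19836,
    "Lygia": 0.10032, "Mendes Campos": 0.34252, "Raquel Queiroz": 0.41368,
    "Rubem Braga": 0.01267, "Sergio Porto": 0.04571, "Ubaldo": 0.0,
    "Verissimo": 0.0, "Ziraldo": 0.09044, "gemeas": 0.03742,
    "latricerio": 0.00212, "ioga": 0.56013, "anjos": 0.07006,
    "tropicais": 0.30143, "pertencer": 0.05358, "cracha": 0.0,
}
ep_texts = {
    "Venancio": 0.62656, "Sa": 0.99997, "Carvalho": 0.99929,
    "Carmelo": 0.90238, "Venda": 0.81739, "Cruz": 0.99939,
    "Navarro": 0.93045, "Almeida": 0.01951, "Faria": 0.97402,
    "Moreira": 0.99861, "Moreira2": 0.79481, "Moreira3": 1.0,
    "Faires": 0.99798, "Claudio": 1.0, "Nery": 0.9995,
    "Mmiranda": 0.91339, "Mmiranda2": 0.99968, "Cabral": 0.87516,
    "Leon": 0.99608, "Aguiar": 0.99934, "Saramago": 0.52862,
}

CUTOFF = 0.5  # assumed threshold; not stated in the source


def classify(p_ep, cutoff=CUTOFF):
    """Label a text EP when its fitted probability reaches the cut-off."""
    return "EP" if p_ep >= cutoff else "BP"


# Texts whose predicted label disagrees with their known group.
bp_errors = [name for name, p in bp_texts.items() if classify(p) != "BP"]
ep_errors = [name for name, p in ep_texts.items() if classify(p) != "EP"]
print(bp_errors, ep_errors)
```

With this cut-off, one BP text (ioga) and one EP text (Almeida) fall on the wrong side, in line with the two misclassifications reported.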
We can see the classifications of all texts in our learning set in the graphic above.
CLASSIFICATION FOR MODERN PORTUGUESE
We apply the fitted model above to the historical written texts of the Tycho Brahe corpus. Again, the vertical axis shows the probability that a given text belongs to EP. We observe that the seventeenth-century texts were classified as BP.
It is important to note that this result holds for this specific model, with these variables and this learning set.
CLASSIFICATION FOR HISTORICAL CORPUS
It is possible that the difference between BP and EP pointed out by this study does not in fact reflect rhythm but some other characteristic of these languages.
TABLES
EP texts
Author | Year | Title
Venâncio, Fernando | 2001 | Maquinações e Bons Sentimentos
Sá, Daniel de | 1995 | Crônica do Despovoamento das Ilhas Salamandra
Carvalho, Rentes de | 1998 | O Joalheiro
Carmelo, Luís | 1999 | As Saudades do Mundo
Venda, Antônio Manuel | 1996 | A Bruxa do Bairro Alto
Cruz, Bento da | 1996 | O Retábulo das Virgens Loucas
Navarro, Antônio Rebordão | 1995 | O Discurso da Desordem
Almeida, Onésimo Teotônio | 1997 | O Meu Nacionalismo
Almeida Faria | 1980 | Carta
Moreira, Antônio Mendes | 1966 | Vida de Médico; "O Primeiro Doente"
Moreira, Antônio Mendes | 1966 | Os Amantes
Moreira, Antônio Mendes | 1966 | Uma Visita
Aires, Fernando | 1995 | Memórias da Cidade Cercada
Aires, Fernando | 1995 | A Quinta das Virtudes
Nery, Júlia | 1998 | Valéria, Valéria
Miranda, Miguel | 1996 | Tripas com Limão
Miranda, Miguel | 1997 | Bailado de Sombras
Cabral, Antônio | 1931 | Memória Delta
Machado, José Leon | 1997 | A Margem
Aguiar, João | 1984 | A insígnia do Touro
Saramago, José | * |
BP texts
Author | Year | Title
Adélia Prado | 1980 | Cacos para um Vitral
Antonio Maria | 1964 | Canções de Homens e Mulheres Lamentáveis
Chico Buarque de Holanda | 1991 | Estorvo
Dinah Silveira | 1967 | A Moralista
Ignácio de Loyola | 1997 | Obscenidades para uma dona-de-casa
João Saldanha | 1966 | A Silhueta
Otto Lara Resende | 1975 | O Elo Partido
Lygia Fagundes Teles | 1973 | As meninas
Paulo Mendes Campos | 1960 | Receita de Domingo
Rachel de Queirós | 1959 | Viagem de Bonde
Rubem Braga | 1967 | Meu Ideal Seria Escrever …
Sérgio Porto (Stanislaw Ponte Preta) | 1963 | Éramos mais unidos aos domingos
João Ubaldo Ribeiro | 1999 | A casa dos budas ditosos
Luís Fernando Veríssimo | 2001 | Manolo (O ESTADO DE SÃO PAULO)
Ziraldo Alves Pinto | 1984 | Reminiscência
Nelson Rodrigues | 1980 |
Stanislaw Ponte Preta | 1990 |
Veja | 2001 | Ioga
Luís Fernando Veríssimo | 2000 | Clube dos Anjos
Nelson Motta | 2000 | Noites Tropicais
Clarice Lispector | |
Lígia Fagundes Teles | |