Statistical Analysis of Written Texts

Modern European Portuguese vs. Brazilian Portuguese

 

 

 

Introduction

 

 

This work is part of the project “Rhythmic Patterns Parameter Setting & Language Change” [1]. The main goal of the present work is to develop a methodology for detecting the rhythmic patterns of spoken language in written texts.

 

In carrying out the investigation, I met with more difficulties than I had foreseen. They begin with the term used above: rhythm. What is rhythm? Which variables express it?

 

After several fruitful discussions, I concluded, I hope rightly, that the primary stress must be evaluated, both its position within the word and its relative position.

 

 

 

Descriptive Statistical Analysis

 

As mentioned in the project, we want to assign historical texts to either the Modern European Portuguese (EP) group or the Brazilian Portuguese (BP) group.

 

To create the learning sets, I took 20 EP and 23 BP written texts published in the last 10 years. These texts are short stories or excerpts from books by well-known authors (Tab. 1). They are formal in register but popular.

       

We deal with the following variables:

 

                MA - Unstressed monosyllable

                MT - Stressed monosyllable

                OX - Word stressed on the last syllable

                PA - Word stressed on the penultimate syllable

                PR - Word stressed on the antepenultimate syllable

 

 

For each of the above variables we compute the number of words in that category divided by the total number of words in the respective text.
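The proportion described above can be sketched as follows. This is only an illustration: it assumes every word of a text has already been tagged with one of the five categories, and the function name `category_proportions` and the sample tags are hypothetical, not part of the original analysis.

```python
from collections import Counter

CATEGORIES = ("MA", "MT", "OX", "PA", "PR")

def category_proportions(tags):
    """Return the share of words in each stress category for one text.

    `tags` holds one category label per word of the text.
    """
    total = len(tags)
    if total == 0:
        raise ValueError("the text has no words")
    counts = Counter(tags)
    # A category that never occurs gets a proportion of 0.0.
    return {cat: counts[cat] / total for cat in CATEGORIES}

# Hypothetical tagging of an eight-word text (illustration only).
example_tags = ["MA", "PA", "OX", "PA", "MT", "PA", "PR", "PA"]
proportions = category_proportions(example_tags)
```

Because every word falls into exactly one category, the five proportions of a text add up to 1.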

 

We also analyse the position of the primary stress within the sentence, using the syllable as the unit of measure. For example, take the sentence

 

“Pedro conversou demoradamente comigo”.

 

This sentence has 14 syllables, as follows:

 

        Pe/dro con/ver/sou de/mo/ra/da/men/te co/mi/go.

 

The syllables carrying the primary stress are shown in capital letters:

 

        PE/dro con/ver/SOU de/mo/ra/da/MEN/te co/MI/go.

 

 

We call “first distance” (d1) the number of syllables between the beginning of the sentence and the first primary stress, “second distance” (d2) the number of syllables between the first and the second primary stress, and so on.

 

For the example, we have d1=0, d2=3, d3=4 and d4=2.
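The distances can be computed from the positions of the stressed syllables, as in the minimal sketch below. The stress positions used here (syllables 1, 5, 10 and 13 of the 14) are inferred from the distances given in the text, and the function name `stress_distances` is hypothetical.

```python
def stress_distances(stress_positions):
    """Return d1, d2, ... for one sentence.

    `stress_positions` holds the 1-based syllable index of each primary
    stress, in order. d1 counts the syllables before the first stress;
    each later distance counts the syllables strictly between two
    consecutive stresses.
    """
    distances = []
    previous = 0  # virtual position just before the first syllable
    for position in stress_positions:
        if position <= previous:
            raise ValueError("stress positions must be strictly increasing")
        distances.append(position - previous - 1)
        previous = position
    return distances

# "Pedro conversou demoradamente comigo": stresses on syllables
# 1 (Pe), 5 (sou), 10 (men) and 13 (mi).
distances = stress_distances([1, 5, 10, 13])
```

For this sentence the function returns the same values as the text: d1=0, d2=3, d3=4 and d4=2.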

 

In this way, for each text we compute

 

                        D1: the mean of the first distances d1 over the sentences of the text,

                        D2: the mean of the second distances d2, etc.
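A possible way to obtain D1, D2, ... for one text is sketched below. It assumes the means are taken over the sentences of the text and that the k-th mean uses only the sentences with at least k primary stresses; neither detail is stated in the text. The distance lists of the second and third sentences are made up for illustration.

```python
def mean_distances(sentences):
    """Average d1, d2, ... over the sentences of one text.

    `sentences` is a list of distance lists, one per sentence.
    The k-th mean only uses sentences that have a k-th distance.
    """
    sums, counts = [], []
    for distances in sentences:
        for k, d in enumerate(distances):
            if k == len(sums):  # first sentence long enough to reach position k
                sums.append(0)
                counts.append(0)
            sums[k] += d
            counts[k] += 1
    return [s / c for s, c in zip(sums, counts)]

# First list: the example sentence; the other two are hypothetical.
text_distances = [[0, 3, 4, 2], [2, 1], [1, 2, 3]]
means = mean_distances(text_distances)  # [D1, D2, D3, D4]
```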

 

                 

                       

 

We begin with descriptive statistics of our data, using box plots for all variables. Box plots, also called box-and-whisker plots, are particularly useful for showing the distributional characteristics of data.

 

A line is drawn across the box at the median. By default, the bottom of the box is at the first quartile (Q1) and the top is at the third quartile (Q3). The whiskers are the lines that extend from the top and bottom of the box to the adjacent values. The adjacent values are the lowest and highest observations that are still inside the region defined by the following limits:

 

Lower Limit:       Q1 - 1.5 (Q3 - Q1)

 

Upper Limit:       Q3 + 1.5 (Q3 - Q1)

 

Outliers are points outside of the lower and upper limits and are plotted with asterisks (*).
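The limits, adjacent values and outliers defined above can be computed as in the sketch below. The sample values are hypothetical, and the quartiles come from Python's `statistics.quantiles` with the "inclusive" method; statistical packages use slightly different quartile conventions, so the limits may differ a little from those behind the original plots.

```python
import statistics

def boxplot_summary(values):
    """Compute box plot limits, whisker ends and outliers for one sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [v for v in values if lower_limit <= v <= upper_limit]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        # Adjacent values: the most extreme observations inside the limits.
        "adjacent": (min(inside), max(inside)),
        # Outliers: observations outside the limits.
        "outliers": sorted(v for v in values
                           if v < lower_limit or v > upper_limit),
    }

# Hypothetical sample with one unusually large observation.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

In this sample the whiskers end at 1 and 9, and the value 30 is the only point that would be plotted with an asterisk.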

 

 

The box plots for each variable are shown separately for the two groups, BP and EP.

 

 

 

 

 

 

 

 

 

BOX PLOT GRAPHICS

[Figures: box plots for the BP and EP groups, one panel per variable: MONOSYLLABIC STRESSED WORD; UNSTRESSED MONOSYLLABIC WORD; LAST STRESSED SYLLABLE; PENULTIMATE STRESSED SYLLABLE; ANTEPENULTIMATE STRESSED SYLLABLE; FIRST DISTANCE.]

[Figures: PROFILE OF THE FIRST DISTANCE IN BP; PROFILE OF THE FIRST DISTANCE IN EP.]

The plots suggest that BP and EP behave differently for all variables.

  

 

 

Inferential Statistical Analysis

 

 

We use binary logistic regression, which models a response variable that has only two possible values, here BP and EP. The fitted model is then used to classify each text into one of these two categories.

 

The vertical axis of the classification graphic indicates the probability that a specific text belongs to EP.

There are only two misclassifications, one for EP and one for BP, which is a very good result.

 

The fitted model, written on the logit (log-odds) scale with P = P(text belongs to EP), is

 

 

 log( P / (1 - P) ) =

          537.6 - 310.1 MA - 443.4 MT - 3539 OXI - 248 PA - 991 PA2 + 6412 OX*PA + 15.303 D1 - 22.29 S(D1)
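In a binary logistic regression the fitted linear expression gives the log-odds, and the probability is obtained by applying the logistic function to it. The sketch below assumes that the expression above is such a linear predictor; the function name `inverse_logit` and the sample values of the predictor are hypothetical.

```python
import math

def inverse_logit(eta):
    """Map a linear predictor (log-odds) to a probability between 0 and 1."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    # For negative eta this equivalent form avoids overflow in exp().
    e = math.exp(eta)
    return e / (1.0 + e)

# Hypothetical values of the linear predictor, for illustration only.
for eta in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"eta = {eta:6.1f}  ->  P(EP) = {inverse_logit(eta):.5f}")
```

Large positive or negative values of the predictor give probabilities that round to 1 or 0, which would be consistent with the many fitted probabilities equal to 0 or 1 reported for the learning set.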

 

 

 

 

Author BP         Probability      Author EP     Probability

Adelia            0                Venancio      0.62656
AntonioMaria      0                Sa            0.99997
Chico             0.39659          Carvalho      0.99929
Dinah             0.00004          Carmelo       0.90238
Inacio            0                Venda         0.81739
JoaoSaldanha      0.00276          Cruz          0.99939
Lara Resende      0.19836          Navarro       0.93045
Lygia             0.10032          Almeida       0.01951
Mendes Campos     0.34252          Faria         0.97402
Raquel Queiroz    0.41368          Moreira       0.99861
Rubem Braga       0.01267          Moreira2      0.79481
Sergio Porto      0.04571          Moreira3      1
Ubaldo            0                Faires        0.99798
Verissimo         0                Claudio       1
Ziraldo           0.09044          Nery          0.9995
gemeas            0.03742          Mmiranda      0.91339
latricerio        0.00212          Mmiranda2     0.99968
ioga              0.56013          Cabral        0.87516
anjos             0.07006          Leon          0.99608
tropicais         0.30143          Aguiar        0.99934
pertencer         0.05358          Saramago      0.52862
cracha            0

 

 

 

 

 

 

 

 

 

The classifications of all texts in our learning set are shown in the graphic “Classification for Modern Portuguese”.
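The two misclassifications can be checked against the probabilities listed above. The sketch below assumes a cutoff of 0.5 (a text is assigned to EP when its probability is at least 0.5); the cutoff is not stated in the text. The values are copied from the listing, with decimal commas written as points.

```python
# Fitted probabilities of belonging to EP, by true group.
bp_probs = {
    "Adelia": 0, "AntonioMaria": 0, "Chico": 0.39659, "Dinah": 0.00004,
    "Inacio": 0, "JoaoSaldanha": 0.00276, "Lara Resende": 0.19836,
    "Lygia": 0.10032, "Mendes Campos": 0.34252, "Raquel Queiroz": 0.41368,
    "Rubem Braga": 0.01267, "Sergio Porto": 0.04571, "Ubaldo": 0,
    "Verissimo": 0, "Ziraldo": 0.09044, "gemeas": 0.03742,
    "latricerio": 0.00212, "ioga": 0.56013, "anjos": 0.07006,
    "tropicais": 0.30143, "pertencer": 0.05358, "cracha": 0,
}
ep_probs = {
    "Venancio": 0.62656, "Sa": 0.99997, "Carvalho": 0.99929,
    "Carmelo": 0.90238, "Venda": 0.81739, "Cruz": 0.99939,
    "Navarro": 0.93045, "Almeida": 0.01951, "Faria": 0.97402,
    "Moreira": 0.99861, "Moreira2": 0.79481, "Moreira3": 1,
    "Faires": 0.99798, "Claudio": 1, "Nery": 0.9995,
    "Mmiranda": 0.91339, "Mmiranda2": 0.99968, "Cabral": 0.87516,
    "Leon": 0.99608, "Aguiar": 0.99934, "Saramago": 0.52862,
}

CUTOFF = 0.5  # assumed classification threshold, not given in the text

# A BP text is misclassified when it is assigned to EP, and vice versa.
bp_errors = sorted(name for name, p in bp_probs.items() if p >= CUTOFF)
ep_errors = sorted(name for name, p in ep_probs.items() if p < CUTOFF)

print("BP texts classified as EP:", bp_errors)
print("EP texts classified as BP:", ep_errors)
```

Under this assumption, exactly one text in each group falls on the wrong side of the cutoff, which agrees with the count reported in the text.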

 

 

 

 

[Figure: CLASSIFICATION FOR MODERN PORTUGUESE. Vertical axis: probability that a text belongs to EP.]

We then apply the fitted model given above to the historical written texts of the Tycho Brahe corpus.

 

Again, the vertical axis gives the probability that a specific text belongs to EP. We observe that the seventeenth-century texts were classified as BP.

It is important to note that this result holds for this specific model, with these variables and this learning set.

 

 

 

[Figure: CLASSIFICATION FOR HISTORICAL CORPUS. Vertical axis: probability that a text belongs to EP.]

It is possible that the difference between BP and EP pointed out by this study does not, in fact, reflect rhythm but some other characteristic of these languages.

 

 

 

 

 

 

TABLES

 

European Portuguese (EP)

Author                                  Year    Title

Venâncio, Fernando                      2001    Maquinações e Bons Sentimentos
Sá, Daniel de                           1995    Crônica do Despovoamento das Ilhas Salamandra
Carvalho, Rentes de                     1998    O Joalheiro
Carmelo, Luís                           1999    As Saudades do Mundo
Venda, Antônio Manuel                   1996    A Bruxa do Bairro Alto
Cruz, Bento da                          1996    O Retábulo das Virgens Loucas
Navarro, Antônio Rebordão               1995    O Discurso da Desordem
Almeida, Onésimo Teotônio               1997    O Meu Nacionalismo
Almeida Faria                           1980    Carta
Moreira, Antônio Mendes                 1966    Vida de Médico; "O Primeiro Doente"
Moreira, Antônio Mendes                 1966    Os Amantes
Moreira, Antônio Mendes                 1966    Uma Visita
Aires, Fernando                         1995    Memórias da Cidade Cercada
Aires, Fernando                         1995    A Quinta das Virtudes
Nery, Júlia                             1998    Valéria, Valéria
Miranda, Miguel                         1996    Tripas com Limão
Miranda, Miguel                         1997    Bailado de Sombras
Cabral, Antônio                         1931    Memória Delta
Machado, José Leon                      1997    A Margem
Aguiar, João                            1984    A insígnia do Touro
Saramago, José                          *

Brazilian Portuguese (BP)

Author                                  Year    Title

Adélia Prado                            1980    Cacos para um Vitral
Antonio Maria                           1964    Canções de Homens e Mulheres Lamentáveis
Chico Buarque de Holanda                1991    Estorvo
Dinah Silveira                          1967    A Moralista
Ignácio de Loyola                       1997    Obscenidades para uma dona-de-casa
João Saldanha                           1966    A Silhueta
Otto Lara Resende                       1975    O Elo Partido
Lygia Fagundes Teles                    1973    As meninas
Paulo Mendes Campos                     1960    Receita de Domingo
Rachel de Queirós                       1959    Viagem de Bonde
Rubem Braga                             1967    Meu Ideal Seria Escrever …
Sérgio Porto (Stanislaw Ponte Preta)    1963    Éramos mais unidos aos domingos
João Ubaldo Ribeiro                     1999    A casa dos budas ditosos
Luís Fernando Veríssimo                 2001    Manolo (O ESTADO DE SÃO PAULO)
Ziraldo Alves Pinto                     1984    Reminiscência
Nelson Rodrigues                        1980
Stanislaw Ponte Preta                   1990
Veja                                    2001    Ioga
Luís Fernando Veríssimo                 2000    Clube dos Anjos
Nelson Motta                            2000    Noites Tropicais
Clarice Lispector
Lígia Fagundes Teles