Data Analysis For Question Answering

Ismailouahbi
4 min read · Oct 24, 2022


text (unsplash.com)

Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.[1] Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science, and social science domains.[2] In today’s business world, data analysis plays a role in making decisions more scientific and helping businesses operate more effectively. (credit: Wikipedia)

Step by Step

As a starting point, I’ve taken the “Titanic” data set, a famous data set that is a good place to start your data science journey. You can download it from this website:

Kaggle

Problem Introduction

cruise ship (unsplash.com)

On April 15, 1912, during its maiden voyage, the Titanic sank after striking an iceberg, killing 1,502 of the 2,224 passengers and crew. The survival rate was 32%. One of the reasons the sinking caused such a loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class. As a result, the Titanic voyage left behind a lot of data that statisticians have collected and analyzed, both to predict outcomes in similar situations and to help avoid a recurrence of such a disaster.
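The 32% figure follows directly from the two counts quoted above. Here is the arithmetic as a tiny Python sketch; the variable names are only illustrative.

```python
# Counts quoted in the paragraph above.
total_on_board = 2224
deaths = 1502

# Everyone who did not die counts as a survivor.
survivors = total_on_board - deaths
survival_rate = survivors / total_on_board

print(f"survivors: {survivors}")
print(f"survival rate: {survival_rate:.1%}")
```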

Data Set Description

The data set has 12 columns and 418 rows; 7 columns are quantitative and 5 are qualitative. It contains 414 null values spread across several columns.
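Numbers like these can be checked with a few pandas calls. The sketch below is only an illustration: it runs on four made-up rows that mimic the column layout, and the file name mentioned in the comment is an assumption, not something given in this article.

```python
import io

import pandas as pd

# Four made-up rows in the same column layout as the data set.
# With the real file you would call pd.read_csv("titanic.csv") instead
# (the file name is an assumption).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,34.5,0,0,A1,7.83,,Q
2,1,3,"Roe, Mrs. Jane",female,47,1,0,A2,7.0,,S
3,0,2,"Poe, Mr. Sam",male,,0,0,A3,9.69,,Q
4,1,1,"Low, Miss. Ann",female,23,1,0,A4,82.27,B45,S
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Size of the table.
n_rows, n_cols = df.shape

# Numeric columns are treated as quantitative, everything else as qualitative.
quantitative = df.select_dtypes(include="number").columns.tolist()
qualitative = df.select_dtypes(exclude="number").columns.tolist()

# Missing values per column and in total.
nulls_per_column = df.isna().sum()
total_nulls = int(nulls_per_column.sum())

print(f"{n_rows} rows x {n_cols} columns")
print("quantitative:", quantitative)
print("qualitative:", qualitative)
print(nulls_per_column[nulls_per_column > 0])
print("total nulls:", total_nulls)
```

Splitting columns by dtype is a rough shortcut: a column such as Pclass is stored as a number even though it encodes a category.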

  • pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • survival: Survival (0 = No; 1 = Yes)
  • name: Name
  • sex: Sex
  • age: Age
  • sibsp: Number of siblings/spouses aboard
  • parch: Number of parents/children aboard
  • ticket: Ticket number
  • fare: Passenger fare (British pound)
  • cabin: Cabin
  • embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Two of these variables are coded, so here is a brief description of them.

  • pclass = ticket class: 1 = upper class (rich), 2 = middle class (general class), 3 = lower class (working class)
  • embarked = port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton; NaN represents a missing value
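When reading tables and plots, it can help to replace these codes with readable labels. The following is a minimal pandas sketch on made-up rows; the new column names `class_label` and `port_label` are hypothetical.

```python
import pandas as pd

# Code-to-label tables taken from the variable descriptions above.
PCLASS_LABELS = {1: "upper", 2: "middle", 3: "lower"}
PORT_LABELS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Made-up rows for illustration; None stands in for a missing port.
df = pd.DataFrame({"Pclass": [1, 3, 2], "Embarked": ["S", None, "C"]})

# Series.map looks each code up in the dictionary.
df["class_label"] = df["Pclass"].map(PCLASS_LABELS)
df["port_label"] = df["Embarked"].map(PORT_LABELS)  # a missing code stays missing

print(df)
```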

Assumptions and Questions

I can barely remember when I first watched the movie Titanic, but the Titanic is still a topic of discussion in a wide variety of fields. Several questions arise at this point:

  • What kind of people survived?
  • What factors influenced people’s survival?
  • How old were the survivors?
  • Which passenger class dominates among the survivors?
  • Can we say that women and children had a better chance of surviving?
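Most of these questions boil down to comparing survival rates between groups. Below is a hedged sketch of how that could look in pandas. The eight rows are made up purely so the example runs, and the cut-off of 16 years for "child" is an assumption, so none of the printed numbers say anything about the real passengers.

```python
import pandas as pd

# Made-up rows for illustration only; they are not the real data.
df = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
    "Sex": ["female", "male", "female", "male",
            "female", "male", "male", "female"],
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 3],
    "Age": [29, 40, 8, 35, 52, 61, 4, 19],
})

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()

# Hypothetical cut-off: passengers under 16 are treated as children.
df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 16, 120], right=False,
                        labels=["child", "adult"])
rate_by_age_group = df.groupby("AgeGroup", observed=True)["Survived"].mean()

# Age profile of the survivors only.
survivor_ages = df.loc[df["Survived"] == 1, "Age"].describe()

print(rate_by_sex, rate_by_class, rate_by_age_group, survivor_ages, sep="\n\n")
```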

EDA Part:

Since I already did that part in a previous project, I’ll skip it here.

Analysis Part:

PCA:

Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data. (credit: Wikipedia)

=> The correlation between the variables Age<->Pclass is negative

=> The correlation between the variables Pclass<->Fare is negative

=> The remaining pairs of variables show low to medium correlation.
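Statements like these are usually read off a correlation matrix. Here is a minimal sketch of how such a matrix can be computed with pandas. The six rows are invented, so only the method carries over, not the values.

```python
import pandas as pd

# Invented rows for illustration; only the column names mirror the data set.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [45, 38, 30, 33, 22, 25],
    "Fare": [80.0, 60.0, 25.0, 20.0, 8.0, 7.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")

print(corr.round(2))
```

A negative entry means that one variable tends to go down as the other goes up, and an entry close to zero means there is little linear relationship.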

PCA Part

PCA results
PCA plot

=> The correlation between variables Age<->Pclass is negative

=> The correlation between variables SibSp<->Parch<->Survived is strong

=> Age & Parch variables are independent

=> Survived & Age variables are independent

=> Survived & Pclass variables are independent
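Readings of this kind typically come from a PCA correlation circle: arrows pointing in opposite directions suggest a negative correlation, and arrows at roughly a right angle suggest little linear correlation. The sketch below shows one way to compute the variable coordinates with NumPy only. It is not the code from the project paper, the helper `pca_correlation_circle` is a hypothetical name, and the six rows are invented.

```python
import numpy as np


def pca_correlation_circle(X, n_components=2):
    """Return explained-variance ratios and variable coordinates."""
    X = np.asarray(X, dtype=float)
    # Standardize so every variable has mean 0 and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # PCA on standardized data is an eigen-decomposition of the
    # correlation matrix.
    corr = (Z.T @ Z) / Z.shape[0]
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    # Coordinate of a variable on a component = its correlation with it.
    scale = np.sqrt(np.clip(eigvals[:n_components], 0, None))
    coords = eigvecs[:, :n_components] * scale
    return explained[:n_components], coords


# Invented rows for illustration; columns are Pclass, Age, Fare.
names = ["Pclass", "Age", "Fare"]
X = np.array([
    [1, 45, 80.0],
    [1, 38, 60.0],
    [2, 30, 25.0],
    [2, 33, 20.0],
    [3, 22, 8.0],
    [3, 25, 7.0],
])

explained, coords = pca_correlation_circle(X, n_components=2)
for name, (x, y) in zip(names, coords):
    print(f"{name:>6}: ({x:+.2f}, {y:+.2f})")
print("explained variance ratio:", explained.round(2))
```

Each printed pair is the tip of one arrow on the circle. Arrows that stay short are poorly represented by the first two components, so conclusions about those variables should be drawn with care.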

Stay tuned for more advanced projects, and don’t forget to download my entire project paper for more details and code (using Python & R).

Thanks for reading.

Visit my website: Home — IsmailOuahbi.com
Follow me on LinkedIn for more.
