Open In Colab

Si no funciona el botó podeu copiar el següent enllaç

3 - Estructura del dataframe

Ara que ja sabem carregar dataframes de fitxers, descobrirem com podem accedir a l’informació que es troba dins de l’estructura de dades.

Començarem seleccionant columnes i obtenint resums estadístics d’elles. Més endavant passarem a fer seleccions de files del dataframe. Finalment, realitzarem seleccions combinades creant els nostres propis dataframes a partir dels subconjunts seleccionats.

[1]:
#Seguirem emprant la llibreria pandas i el dataset WHO

import pandas as pd

df_who = pd.read_csv("http://www.exploredata.net/ftp/WHO.csv") # carregam un dataframe

Columnes

Com hem comentat a l’introducció, començarem fent feina amb les columnes

[2]:
# Columnes o caracteristiques de cada mostra
print(df_who.columns)
Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
       'Adult literacy rate (%)',
       'Gross national income per capita (PPP international $)',
       'Net primary school enrolment ratio female (%)',
       'Net primary school enrolment ratio male (%)',
       'Population (in thousands) total', 'Population annual growth rate (%)',
       ...
       'Total_CO2_emissions', 'Total_income', 'Total_reserves',
       'Trade_balance_goods_and_services', 'Under_five_mortality_from_CME',
       'Under_five_mortality_from_IHME', 'Under_five_mortality_rate',
       'Urban_population', 'Urban_population_growth',
       'Urban_population_pct_of_total'],
      dtype='object', length=358)
[3]:
for col in df_who.columns:
    print(col)
Country
CountryID
Continent
Adolescent fertility rate (%)
Adult literacy rate (%)
Gross national income per capita (PPP international $)
Net primary school enrolment ratio female (%)
Net primary school enrolment ratio male (%)
Population (in thousands) total
Population annual growth rate (%)
Population in urban areas (%)
Population living below the poverty line (% living on < US$1 per day)
Population median age (years)
Population proportion over 60 (%)
Population proportion under 15 (%)
Registration coverage of births (%)
Total fertility rate (per woman)
Antenatal care coverage - at least four visits (%)
Antiretroviral therapy coverage among HIV-infected pregt women for PMTCT (%)
Antiretroviral therapy coverage among people with advanced HIV infections (%)
Births attended by skilled health personnel (%)
Births by caesarean section (%)
Children aged 6-59 months who received vitamin A supplementation (%)
Children aged <5 years sleeping under insecticide-treated nets (%)
Children aged <5 years who received any antimalarial treatment for fever (%)
Children aged <5 years with ARI symptoms taken to facility (%)
Children aged <5 years with diarrhoea receiving ORT (%)
Contraceptive prevalence (%)
Neonates protected at birth against neonatal tetanus (PAB) (%)
One-year-olds immunized with MCV
One-year-olds immunized with three doses of diphtheria tetanus toxoid and pertussis (DTP3) (%)
One-year-olds immunized with three doses of Hepatitis B (HepB3) (%)
One-year-olds immunized with three doses of Hib (Hib3) vaccine (%)
Tuberculosis detection rate under DOTS (%)
Tuberculosis treatment success under DOTS (%)
Women who have had mammography (%)
Women who have had PAP smear (%)
Community and traditional health workers density (per 10 000 population)
Dentistry personnel density (per 10 000 population)
Environment and public health workers density (per 10 000 population)
External resources for health as percentage of total expenditure on health
General government expenditure on health as percentage of total expenditure on health
General government expenditure on health as percentage of total government expenditure
Hospital beds (per 10 000 population)
Laboratory health workers density (per 10 000 population)
Number of community and traditional health workers
Number of dentistry personnel
Number of environment and public health workers
Number of laboratory health workers
Number of nursing and midwifery personnel
Number of other health service providers
Number of pharmaceutical personnel
Number of physicians
Nursing and midwifery personnel density (per 10 000 population)
Other health service providers density (per 10 000 population)
Out-of-pocket expenditure as percentage of private expenditure on health
Per capita government expenditure on health (PPP int. $)
Per capita government expenditure on health at average exchange rate (US$)
Per capita total expenditure on health (PPP int. $)
Per capita total expenditure on health at average exchange rate (US$)
Pharmaceutical personnel density (per 10 000 population)
Physicians density (per 10 000 population)
Private expenditure on health as percentage of total expenditure on health
Private prepaid plans as percentage of private expenditure on health
Ratio of health management and support workers to health service providers
Ratio of nurses and midwives to physicians
Social security expenditure on health as percentage of general government expenditure on health
Total expenditure on health as percentage of gross domestic product
Births attended by skilled health personnel (%) highest educational level of mother
Births attended by skilled health personnel (%) highest wealth quintile
Births attended by skilled health personnel (%) lowest educational level of mother
Births attended by skilled health personnel (%) lowest wealth quintile
Births attended by skilled health personnel (%) rural
Births attended by skilled health personnel (%) urban
Births attended by skilled health personnel difference highest lowest educational level of mother
Births attended by skilled health personnel difference highest-lowest wealth quintile
Births attended by skilled health personnel difference urban-rural
Births attended by skilled health personnel ratio highest-lowest educational level of mother
Births attended by skilled health personnel ratio highest-lowest wealth quintile
Births attended by skilled health personnel ratio urban-rural
Measles immunization coverage among one-year-olds (%) highest educational level of mother
Measles immunization coverage among one-year-olds (%) highest wealth quintile
Measles immunization coverage among one-year-olds (%) lowest educational level of mother
Measles immunization coverage among one-year-olds (%) lowest wealth quintile
Measles immunization coverage among one-year-olds (%) rural
Measles immunization coverage among one-year-olds (%) urban
Measles immunization coverage among one-year-olds difference highest-lowest educational level of mother
Measles immunization coverage among one-year-olds difference highest-lowest wealth quintile
Measles immunization coverage among one-year-olds difference urban-rural
Measles immunization coverage among one-year-olds ratio highest-lowest educational level of mother
Measles immunization coverage among one-year-olds ratio highest-lowest wealth quintile
Measles immunization coverage among one-year-olds ratio urban-rural
Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) difference lowest-highest educational level of mother
Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) difference lowest-highest wealth quintile
Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) difference rural-urban
Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) highest educational level of mother
Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) highest wealth quintile
Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) lowest educational level of mother
Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) lowest wealth quintile
Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) ratio lowest-highest educational level of mother
Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) ratio lowest-highest wealth quintile
Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) ratio rural-urban
Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) rural
Under-5 mortality rate (Probability of dying aged < 5 years per 1 000 live births) urban
Adult mortality rate (probability of dying between 15 to 60 years per 1000 population) both sexes
Adult mortality rate (probability of dying between 15 to 60 years per 1000 population) female
Adult mortality rate (probability of dying between 15 to 60 years per 1000 population) male
Age-standardized mortality rate for cancer (per 100 000 population)
Age-standardized mortality rate for cardiovascular diseases (per 100 000 population)
Age-standardized mortality rate for injuries (per 100 000 population)
Age-standardized mortality rate for non-communicable diseases (per 100 000 population)
Deaths among children under five years of age due to diarrhoeal diseases (%)
Deaths among children under five years of age due to HIV/AIDS (%)
Deaths among children under five years of age due to injuries (%)
Deaths among children under five years of age due to malaria (%)
Deaths among children under five years of age due to measles (%)
Deaths among children under five years of age due to neonatal causes (%)
Deaths among children under five years of age due to other causes (%)
Deaths among children under five years of age due to pneumonia (%)
Deaths due to HIV/AIDS (per 100 000 population per year)
Deaths due to tuberculosis among HIV-negative people (per 100 000 population)
Deaths due to tuberculosis among HIV-positive people (per 100 000 population)
Healthy life expectancy (HALE) at birth (years) both sexes
Healthy life expectancy (HALE) at birth (years) female
Healthy life expectancy (HALE) at birth (years) male
Incidence of tuberculosis (per 100 000 population per year)
Infant mortality rate (per 1 000 live births) both sexes
Infant mortality rate (per 1 000 live births) female
Infant mortality rate (per 1 000 live births) male
Life expectancy at birth (years) both sexes
Life expectancy at birth (years) female
Life expectancy at birth (years) male
Maternal mortality ratio (per 100 000 live births)
Neonatal mortality rate (per 1 000 live births)
Number of confirmed poliomyelitis cases
Prevalence of HIV among adults aged >=15 years (per 100 000 population)
Prevalence of tuberculosis (per 100 000 population)
Under-5 mortality rate (probability of dying by age 5 per 1000 live births) both sexes
Under-5 mortality rate (probability of dying by age 5 per 1000 live births) female
Under-5 mortality rate (probability of dying by age 5 per 1000 live births) male
Years of life lost to communicable diseases (%)
Years of life lost to injuries (%)
Years of life lost to non-communicable diseases (%)
Children under five years of age overweight for age (%)
Children under five years of age stunted for age (%)
Children under five years of age underweight for age (%)
Newborns with low birth weight (%)
Per capita recorded alcohol consumption (litres of pure alcohol) among adults (>=15 years)
Population using solid fuels (%) rural
Population using solid fuels (%) urban
Population with sustainable access to improved drinking water sources (%) rural
Population with sustainable access to improved drinking water sources (%) total
Population with sustainable access to improved drinking water sources (%) urban
Population with sustainable access to improved sanitation (%) rural
Population with sustainable access to improved sanitation (%) total
Population with sustainable access to improved sanitation (%) urban
Prevalence of adults (>=15 years) who are obese (%) female
Prevalence of adults (>=15 years) who are obese (%) male
Prevalence of condom use by young people (15-24 years) at higher risk sex (%) female
Prevalence of condom use by young people (15-24 years) at higher risk sex (%) male
Prevalence of current tobacco use among adolescents (13-15 years) (%) both sexes
Prevalence of current tobacco use among adolescents (13-15 years) (%) female
Prevalence of current tobacco use among adolescents (13-15 years) (%) male
Prevalence of current tobacco use among adults (>=15 years) (%) both sexes
Prevalence of current tobacco use among adults (>=15 years) (%) female
Prevalence of current tobacco use among adults (>=15 years) (%) male
Adolescent_fertility_rate
Agricultural_land
Agriculture_contribution_to_economy
Aid_given
Aid_received
Aid_received_total
All_forms_of_TB_new_cases_per_100_000_estimated
All_forms_of_TB_new_cases_per_100_000_reported
Annual_freshwater_withdrawals_total
Arms_exports
Arms_imports
Bad_teeth_per_child
Births_attended_by_skilled_health_staff
Breast_cancer_deaths_per_100_000_women
Breast_cancer_new_cases_per_100_000_women
Breast_cancer_number_of_female_deaths
Breast_cancer_number_of_new_female_cases
Broadband_subscribers
Broadband_subscribers_per_100_people
CO2_emissions
CO2_intensity_of_economic_output
Capital_formation
Cell_phones_per_100_people
Cell_phones_total
Central_bank_discount_rate
Cervical_cancer_deaths_per_100_000_women
Cervical_cancer_new_cases_per_100_000_women
Cervical_cancer_number_of_female_deaths
Cervical_cancer_number_of_new_female_cases
Children_and_elderly
Children_out_of_school_primary
Children_out_of_school_primary_female
Children_out_of_school_primary_male
Children_per_woman
Coal_consumption
Coal_consumption_per_person
Coal_production
Coal_production_per_person
Colon_and_Rectum_cancer_deaths_per_100_000_men
Colon_and_Rectum_cancer_deaths_per_100_000_women
Colon_and_Rectum_cancer_new_cases_per_100_000_men
Colon_and_Rectum_cancer_new_cases_per_100_000_women
Colon_and_Rectum_cancer_number_of_female_deaths
Colon_and_Rectum_cancer_number_of_male_deaths
Colon_and_Rectum_cancer_number_of_new_female_cases
Colon_and_Rectum_cancer_number_of_new_male_cases
Consumer_price_index
Contraceptive_use
Deaths_from_TB_per_100_000_estimated
Debt_servicing_costs
Democracy_score
Electric_power_consumption
Electricity_generation
Electricity_generation_per_person
Energy_use
Expenditure_per_student_primary
Expenditure_per_student_secondary
Expenditure_per_student_tertiary
Exports_of_goods_and_services
Exports_unit_value
External_debt_total_DOD_current_USdollars
External_debt_total_pct_of_GNI
Female_labour_force
Fixed_line_and_mobile_phone_subscribers
Foreign_direct_investment_net_inflows
Foreign_direct_investment_net_outflows
Forest_area
Gross_capital_formation
HIV_infected
Health_expenditure_per_person
Health_expenditure_private
Health_expenditure_public_pct_of_GDP
Health_expenditure_public_pct_of_government_expenditure
Health_expenditure_public_pct_of_total_health_expenditure
Health_expenditure_total
High_technology_exports
Hydroelectricity_consumption
Hydroelectricity_consumption_per_person
Imports_of_goods_and_services
Imports_unit_value
Improved_sanitation_facilities_urban
Improved_water_source
Income_growth
Income_per_person
Income_share_held_by_lowest_20pct
Industry_contribution_to_economy
Inequality_index
Infant_mortality_rate
Infectious_TB_new_cases_per_100_000_estimated
Infectious_TB_new_cases_per_100_000_reported
Infectious_TB_treatment_completeness
Inflation_GDP_deflator
Internet_users
Life_expectancy_at_birth
Literacy_rate_adult_female
Literacy_rate_adult_male
Literacy_rate_adult_total
Literacy_rate_youth_female
Literacy_rate_youth_male
Literacy_rate_youth_total
Liver_cancer_deaths_per_100_000_men
Liver_cancer_deaths_per_100_000_women
Liver_cancer_new_cases_per_100_000_men
Liver_cancer_new_cases_per_100_000_women
Liver_cancer_number_of_female_deaths
Liver_cancer_number_of_male_deaths
Liver_cancer_number_of_new_female_cases
Liver_cancer_number_of_new_male_cases
Lung_cancer_deaths_per_100_000_men
Lung_cancer_deaths_per_100_000_women
Lung_cancer_new_cases_per_100_000_men
Lung_cancer_new_cases_per_100_000_women
Lung_cancer_number_of_female_deaths
Lung_cancer_number_of_male_deaths
Lung_cancer_number_of_new_female_cases
Lung_cancer_number_of_new_male_cases
Malaria_prevention_insecticide_treated_bed_nets_usage
Malaria_treatment
Malnutrition_weight_for_age
Market_value_of_listed_companies
Maternal_mortality
Math_achievement_4th_grade
Math_achievement_8th_grade
Measles_immunization
Medical_Doctors
Merchandise_trade
Military_expenditure
Natural_gas_consumption
Natural_gas_consumption_per_person
Natural_gas_production
Natural_gas_production_per_person
Natural_gas_proved_reserves
Natural_gas_proven_reserves_per_person
Net_barter_terms_of_trade
Nuclear_consumption
Nuclear_consumption_per_person
Number_of_deaths_from_TB_estimated
Number_of_existing_TB_cases_estimated
Oil_consumption
Oil_consumption_per_person
Oil_production
Oil_production_per_person
Oil_proved_reserves
Oil_proven_reserves_per_person
Old_version_of_Income_per_person
Patent_applications
Patents_granted
Patents_in_force
People_living_with_HIV
Personal_computers_per_100_people
Personal_computers_total
Population_growth
Population_in_urban_agglomerations_more_than_1_million
Population_total
Poverty_headcount_ratio_at_national_poverty_line
Present_value_of_debt
Primary_completion_rate_total
Primary_energy_consumption
Primary_energy_consumption_per_person
Primary_school_completion_pct_of_boys
Primary_school_completion_pct_of_girls
Prostate_cancer_deaths_per_100_000_men
Prostate_cancer_new_cases_per_100_000_men
Prostate_cancer_number_of_male_deaths
Prostate_cancer_number_of_new_male_cases
Pump_price_for_gasoline
Ratio_of_girls_to_boys_in_primary_and_secondary_education
Ratio_of_young_literate_females_to_males
Roads_paved
SO2_emissions_per_person
Services_contribution_to_economy
Stomach_cancer_deaths_per_100_000_men
Stomach_cancer_deaths_per_100_000_women
Stomach_cancer_new_cases_per_100_000_men
Stomach_cancer_new_cases_per_100_000_women
Stomach_cancer_number_of_female_deaths
Stomach_cancer_number_of_male_deaths
Stomach_cancer_number_of_new_female_cases
Stomach_cancer_number_of_new_male_cases
Sugar_per_person
Surface_area
Tax_revenue
Total_CO2_emissions
Total_income
Total_reserves
Trade_balance_goods_and_services
Under_five_mortality_from_CME
Under_five_mortality_from_IHME
Under_five_mortality_rate
Urban_population
Urban_population_growth
Urban_population_pct_of_total

Ja coneixem els dataframes, la segona estructura de dades que més empram a Pandas és la serie:

[4]:
df_who["Country"]
[4]:
0             Afghanistan
1                 Albania
2                 Algeria
3                 Andorra
4                  Angola
              ...
197               Vietnam
198    West Bank and Gaza
199                 Yemen
200                Zambia
201              Zimbabwe
Name: Country, Length: 202, dtype: object

Un cop seleccionam una columna podem accedir als seus elements com si fossin una llista:

[5]:
print(df_who["Country"][0])
print("-"*30)
print(df_who["Country"][:5])
print("-"*30)
print(df_who["Country"].values)
Afghanistan
------------------------------
0    Afghanistan
1        Albania
2        Algeria
3        Andorra
4         Angola
Name: Country, dtype: object
------------------------------
['Afghanistan' 'Albania' 'Algeria' 'Andorra' 'Angola'
 'Antigua and Barbuda' 'Argentina' 'Armenia' 'Australia' 'Austria'
 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh' 'Barbados' 'Belarus'
 'Belgium' 'Belize' 'Benin' 'Bermuda' 'Bhutan' 'Bolivia'
 'Bosnia and Herzegovina' 'Botswana' 'Brazil' 'Brunei Darussalam'
 'Bulgaria' 'Burkina Faso' 'Burundi' 'Cambodia' 'Cameroon' 'Canada'
 'Cape Verde' 'Central African Republic' 'Chad' 'Chile' 'China' 'Colombia'
 'Comoros' 'Congo, Dem. Rep.' 'Congo, Rep.' 'Cook Islands' 'Costa Rica'
 "Cote d'Ivoire" 'Croatia' 'Cuba' 'Cyprus' 'Czech Republic' 'Denmark'
 'Djibouti' 'Dominica' 'Dominican Republic' 'Ecuador' 'Egypt'
 'El Salvador' 'Equatorial Guinea' 'Eritrea' 'Estonia' 'Ethiopia' 'Fiji'
 'Finland' 'France' 'French Polynesia' 'Gabon' 'Gambia' 'Georgia'
 'Germany' 'Ghana' 'Greece' 'Grenada' 'Guatemala' 'Guinea' 'Guinea-Bissau'
 'Guyana' 'Haiti' 'Honduras' 'Hong Kong, China' 'Hungary' 'Iceland'
 'India' 'Indonesia' 'Iran (Islamic Republic of)' 'Iraq' 'Ireland'
 'Israel' 'Italy' 'Jamaica' 'Japan' 'Jordan' 'Kazakhstan' 'Kenya'
 'Kiribati' 'Korea, Dem. Rep.' 'Korea, Rep.' 'Kuwait' 'Kyrgyzstan'
 "Lao People's Democratic Republic" 'Latvia' 'Lebanon' 'Lesotho' 'Liberia'
 'Libyan Arab Jamahiriya' 'Lithuania' 'Luxembourg' 'Macao, China'
 'Macedonia' 'Madagascar' 'Malawi' 'Malaysia' 'Maldives' 'Mali' 'Malta'
 'Marshall Islands' 'Mauritania' 'Mauritius' 'Mexico'
 'Micronesia (Federated States of)' 'Moldova' 'Monaco' 'Mongolia'
 'Montenegro' 'Morocco' 'Mozambique' 'Myanmar' 'Namibia' 'Nauru' 'Nepal'
 'Netherlands' 'Netherlands Antilles' 'New Caledonia' 'New Zealand'
 'Nicaragua' 'Niger' 'Nigeria' 'Niue' 'Norway' 'Oman' 'Pakistan' 'Palau'
 'Panama' 'Papua New Guinea' 'Paraguay' 'Peru' 'Philippines' 'Poland'
 'Portugal' 'Puerto Rico' 'Qatar' 'Romania' 'Russia' 'Rwanda'
 'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and the Grenadines'
 'Samoa' 'San Marino' 'Sao Tome and Principe' 'Saudi Arabia' 'Senegal'
 'Serbia' 'Seychelles' 'Sierra Leone' 'Singapore' 'Slovakia' 'Slovenia'
 'Solomon Islands' 'Somalia' 'South Africa' 'Spain' 'Sri Lanka' 'Sudan'
 'Suriname' 'Swaziland' 'Sweden' 'Switzerland' 'Syria' 'Taiwan'
 'Tajikistan' 'Tanzania' 'Thailand' 'Timor-Leste' 'Togo' 'Tonga'
 'Trinidad and Tobago' 'Tunisia' 'Turkey' 'Turkmenistan' 'Tuvalu' 'Uganda'
 'Ukraine' 'United Arab Emirates' 'United Kingdom'
 'United States of America' 'Uruguay' 'Uzbekistan' 'Vanuatu' 'Venezuela'
 'Vietnam' 'West Bank and Gaza' 'Yemen' 'Zambia' 'Zimbabwe']

Existeix una altra manera més simple de seleccionar una única columna, ja que podem trobar noms de columnes molt llargs: “Children aged <5 years who received any antimalarial treatment for fever (%)”

Nota: Per tant, en la creació de documents és important un adequat nom de columnes!!

[6]:
df_who.Country
[6]:
0             Afghanistan
1                 Albania
2                 Algeria
3                 Andorra
4                  Angola
              ...
197               Vietnam
198    West Bank and Gaza
199                 Yemen
200                Zambia
201              Zimbabwe
Name: Country, Length: 202, dtype: object
[7]:
print(df_who.columns[9])
print("-"*30)
print(df_who[df_who.columns[9]])

Population annual growth rate (%)
------------------------------
0      4.0
1      0.6
2      1.5
3      1.0
4      2.8
      ...
197    1.4
198    NaN
199    3.0
200    1.9
201    0.8
Name: Population annual growth rate (%), Length: 202, dtype: float64

Aquí també podem emprar l’slicing, per seleccionar múltiples columnes.

[8]:

df_who[df_who.columns[0:5]]
[8]:
Country CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%)
0 Afghanistan 1 1 151.0 28.0
1 Albania 2 2 27.0 98.7
2 Algeria 3 3 6.0 69.9
3 Andorra 4 2 NaN NaN
4 Angola 5 3 146.0 67.4
... ... ... ... ... ...
197 Vietnam 198 6 25.0 90.3
198 West Bank and Gaza 199 1 NaN NaN
199 Yemen 200 1 83.0 54.1
200 Zambia 201 3 161.0 68.0
201 Zimbabwe 202 3 101.0 89.5

202 rows × 5 columns

I també podem seleccionar diverses columnes si coneixem el seu nom, veiem que Pandas és molt flexible:

[9]:
df_who[["CountryID","Continent"]]
[9]:
CountryID Continent
0 1 1
1 2 2
2 3 3
3 4 2
4 5 3
... ... ...
197 198 6
198 199 1
199 200 1
200 201 3
201 202 3

202 rows × 2 columns

Estadístics

En seleccionar una columna d’un dataframe obtenim una sèrie. Podem obtenir estadístics d’aquesta sèrie de manera molt senzilla:

[10]:
fertilitat = df_who[df_who.columns[3]]
print("Min ", fertilitat.min())
print("Max ", fertilitat.max())
print("Mean ", fertilitat.mean())
Min  0.0
Max  199.0
Mean  59.45762711864407

A continuació trobareu la taula que mostra les funcions descriptives que tenim disponibles: e52cf318af924c62a5f8030e9857cb3f

Veurem que obtenir aquestes informacions estadístiques ens pot ajudar a extreure informació molt concreta de la taula, per exemple, si volem saber:

¿Quin pais té la major emissió de CO2 ?

[11]:
co2 = df_who["Total_CO2_emissions"]
row = df_who[co2 == co2.max()]  # Selecció condicionada
type(row)
[11]:
pandas.core.frame.DataFrame

La variable row conté la fila amb el valor màxim a la columna “Total_CO2_emissions”

[12]:
print(row["Country"])
192    United States of America
Name: Country, dtype: object
[13]:
row["Country"].values
[13]:
array(['United States of America'], dtype=object)
[14]:
print("El pais mas contaminante es: ", row["Country"].values[0])
El pais mas contaminante es:  United States of America

Files

Cada fila té un índex. L’índex pot ser numèric, alfabètic o de temps.

[15]:
df_who.index
[15]:
RangeIndex(start=0, stop=202, step=1)
[16]:
df_who.index.values
[16]:
array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
       182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,
       195, 196, 197, 198, 199, 200, 201])

El mètode loc

Emprant el mètode loc del dataframe podem accedir a les seves files amb la mateixa lògica que ja coneixem per les llistes:

[17]:
df_who.loc[0] #la mostra zero, primera fila
[17]:
Country                           Afghanistan
CountryID                                   1
Continent                                   1
Adolescent fertility rate (%)           151.0
Adult literacy rate (%)                  28.0
                                     ...
Under_five_mortality_from_IHME          231.9
Under_five_mortality_rate               257.0
Urban_population                    5740436.0
Urban_population_growth                  5.44
Urban_population_pct_of_total            22.9
Name: 0, Length: 358, dtype: object
[18]:
type(df_who.loc[0]) # una fila també és una sèrie
[18]:
pandas.core.series.Series
[19]:
print(df_who.loc[0].Country) # Podem accedir a una sèrie amb el seu índex
print(df_who.loc[0][0]) #o amb la seva posició
print(df_who.loc[0][3])
Afghanistan
Afghanistan
151.0
/tmp/ipykernel_1292/52902763.py:2: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  print(df_who.loc[0][0]) #o amb la seva posició
/tmp/ipykernel_1292/52902763.py:3: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  print(df_who.loc[0][3])

Com passava amb les llistes també podem seleccionar diversos elements (files) en una sola comanda usant la tècnica de slicing.

[20]:
df_who.loc[166:170] #slicing
[20]:
Country CountryID Continent Adolescent fertility rate (%) Adult literacy rate (%) Gross national income per capita (PPP international $) Net primary school enrolment ratio female (%) Net primary school enrolment ratio male (%) Population (in thousands) total Population annual growth rate (%) ... Total_CO2_emissions Total_income Total_reserves Trade_balance_goods_and_services Under_five_mortality_from_CME Under_five_mortality_from_IHME Under_five_mortality_rate Urban_population Urban_population_growth Urban_population_pct_of_total
166 Somalia 167 3 35.0 NaN NaN 6.0 11.0 8445.0 3.0 ... 589.90 NaN NaN NaN 150.8 219.9 150.8 2885131.0 4.09 35.2
167 South Africa 168 3 65.0 82.4 8900.0 88.0 88.0 48282.0 0.7 ... 408792.47 1.610000e+11 66.32 -2.200000e+09 68.5 68.5 68.5 27800000.0 1.98 59.3
168 Spain 169 2 10.0 97.2 28200.0 99.0 100.0 43887.0 1.1 ... 343701.53 6.780000e+11 NaN -5.770000e+10 4.9 4.2 4.9 33300000.0 1.75 76.7
169 Sri Lanka 170 7 29.0 90.7 3730.0 100.0 99.0 19207.0 0.5 ... 11017.65 1.980000e+10 24.28 -2.180000e+09 21.4 12.1 21.4 2969868.0 0.26 15.1
170 Sudan 171 3 51.0 60.9 1780.0 37.0 45.0 37707.0 2.2 ... 10618.27 1.670000e+10 10.13 -2.850000e+09 111.3 94.4 111.3 15100000.0 4.40 40.8

5 rows × 358 columns

[21]:
# Els índexs són útils quan són dates: Sèries temporals
# per exemple:
# df.loc["2020":"2022"]

Selecció condicional

Si volem que la nostra selecció es correspongui amb un criteri lògic, per exemple saber quins són els països amb una taxa d’alfabetització dels adults amb més d’un 70% podem fer el següent:

[22]:
alfabetitzacio = df_who[df_who['Adult literacy rate (%)'] > 70][["Country","Adult literacy rate (%)"]]

alfabetitzacio[alfabetitzacio["Country"] == "Italy"]
[22]:
Country Adult literacy rate (%)
85 Italy 98.4

L’estructura d’aquest tipus de selecció sembla complexa, però si la dividim en parts veurem que és abordable:

[23]:
# En aquest codi obtenim una llista de valors booleans (True o False) segons és compleix la condició per cada una de les files:
seleccio = df_who['Adult literacy rate (%)'] > 70
[24]:
# En aquest codi, de les files on seleccio == True agafam les dues columnes que ens interessen
df_who[seleccio][["Country","Adult literacy rate (%)"]]
[24]:
Country Adult literacy rate (%)
1 Albania 98.7
6 Argentina 97.2
7 Armenia 99.4
10 Azerbaijan 98.8
12 Bahrain 86.5
... ... ...
193 Uruguay 96.8
195 Vanuatu 75.5
196 Venezuela 93.0
197 Vietnam 90.3
201 Zimbabwe 89.5

93 rows × 2 columns

Aquest tipus de selecció ens obre tot un nou ventall de possibilitats de selecció “automàtica” de dades que fins ara es feia molt complicat.

Files i columnes

Obviament, si sabem seleccionar files i columnes, podem combinar-les per fer una selecció més específica.

[25]:

df_who.loc[167:169, ["Country","Total_CO2_emissions"]] # per noms
[25]:
Country Total_CO2_emissions
167 South Africa 408792.47
168 Spain 343701.53
169 Sri Lanka 11017.65

El mètode iloc

No sempre ens serà còmode fer seleccions amb els noms de les files (encara que normalment siguin nombres) i de les columnes. Si volem fer seleccions emprant només els índexs podem emprar el mètode .iloc[files, columnes]:

[26]:
df_who.iloc[[168,192],[0,43,118]] # per posicions
[26]:
Country Hospital beds (per 10 000 population) Deaths among children under five years of age due to pneumonia (%)
168 Spain 34.0 1.3
192 United States of America 32.0 1.3

Activitat

En aquesta activitat practicarem la selecció de dades (columnes i files) en un dataframe

1) Quina és la mitjana de la població urbana (“Urban_population”) de tots els països? La seva desviació típica (std)?

[ ]:

2) Consulta la fila del país: “Spain”

[ ]:

3a) Quin país té una població urbana més gran? 3b) Quins països tenen una població urbana menor a 50.000 ?

[ ]:

4) El continent on està situat Spain és el mateix que el d’UnitedStates?

Utilitza una condició per obtenir un resultat Booleà (True o False)

[ ]:

5) Quins són els cinc països més contaminants (“Total_CO2_emissions”)?

Aquesta és la meva pista per a una solució elegant

6) Observant algunes mostres del fitxer pots establir la relació entre l’identificador del continent i el seu nom?

És a dir, sabem que Spain és al continent Europeu i el codi del continent és el 2.

Hi ha els codis de continents: 1, 2, 3, 4, 5, 6, 7

Nota: Hi ha dos codis associats a Àsia.

Fes les consultes pertinents al dataframe per construir un diccionari amb la següent

[27]:
codigoContinentes = {1:"Asia",2:"Europa"} #Al menos hay 7!
print(codigoContinentes[2])
Europa
[28]:
codigoContinentes[3] = ""
codigoContinentes[4] = ""
codigoContinentes[5] = ""
codigoContinentes[6] = ""
codigoContinentes[7] = ""

License: CC BY 4.0 Isaac Lera and Gabriel Moya Universitat de les Illes Balears isaac.lera@uib.edu, gabriel.moya@uib.edu