Organized by zipcode: rates of Alzheimer's disease, percent of landcover types, modelled PM2.5, and socioeconomic variables. This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Lucas Neas (CPHEA/PHESD/EB) is the owner of the copy of this dataset that was used. Format: Medicare database. This dataset is associated with the following publication: Wu, J., and L. Jackson. Greenspace inversely associated with the risk of Alzheimer’s disease in the mid-Atlantic United States. Earth. MDPI AG, Basel, SWITZERLAND, 2(1): 140-150, (2021).
By US Open Data Portal, data.gov [source]
How to use this dataset
- Accessing the data: To access this dataset you can visit Data.cdc.gov, where it is publicly available, or download it directly from Kaggle at https://www.kaggle.com/cdc/us-national-cardiovascular-disease.
- Exploring the data: The dataset has 20 columns/variables, including Year, LocationAbbr, LocationDesc, DataSource, PriorityArea1 through PriorityArea4, Category, Topic, Indicator, Data_Value_Type, Data_Value_Unit, Data_Value_Alt, Data_Value_Footnote_Symbol, Break_Out_Category, and GeoLocation (see the column table below for the full list). You can explore one variable or several together to gain insight into CVDs in America, such as rates across different locations over the years, or the prevalence of certain risk factors among different age groups and genders.
- Uses of this dataset: Researchers can use it to assess disease burden and monitor trends over time across population subgroups; health authorities can use it to disseminate vital health knowledge through outreach programs; and policy makers can use it to inform community-level interventions. For example, someone might compare smoking prevalence between males and females within one state or countrywide, or extend that comparison into a time-series analysis of smoking prevalence trends across both genders nationally from 2001 to the present day.
- Creating a real-time cardiovascular disease surveillance system that can send updates and alert citizens about risks in their locale.
- Generating targeted public health campaigns for different demographic groups by drawing insights from the dataset to reach those most at risk of CVDs.
- Developing an app or software interface that allows users to visualize data trends around CVD prevalence and risk factors between different locations, age groups and ethnicities quickly, easily and accurately.
If you use this dataset in your research, please credit the original authors.

Data Source: Unknown License. Please check the dataset description for more information.
File: csv-1.csv

| Column name | Description |
|:---|:---|
| Year | Year of the survey. (Integer) |
| LocationAbbr | Abbreviation of the location. (String) |
| LocationDesc | Description of the location. (String) |
| DataSource | Source of the data. (String) |
| PriorityArea1 | Priority area 1. (String) |
| PriorityArea2 | Priority area 2. (String) |
| PriorityArea3 | Priority area 3. (String) |
| PriorityArea4 | Priority area 4. (String) |
| Category | Category of the data value type. (String) |
| Topic | Topic related to the indicator of the data value unit. (String) |
| Indicator | Indicator of the data value unit. (String) |
| Data_Value_Type | Type of data value. (String) |
| Data_Value_Unit | Unit of the data value. (String) |
| Data_Value_Alt | Alternative value of the data value. (Float) |
| Data_Value_Footnote_Symbol | Footnote symbol of the data value. (String) |
| Break_Out_Category | Break out category of the data value. (String) |
| GeoLocation | Geographic location associated with the survey d... |
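For readers working in Python, a minimal exploration sketch using a few of the column names from the table above; the toy rows are placeholders, not real survey records, and pandas is assumed to be available:

```python
import pandas as pd

# Toy rows standing in for real records; column names follow the table above.
df = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020],
    "LocationAbbr": ["NY", "CA", "NY", "CA"],
    "Break_Out_Category": ["Gender", "Gender", "Gender", "Gender"],
    "Data_Value_Alt": [5.2, 4.8, 5.5, 4.6],
})

# Mean alternative data value per location across years.
by_state = df.groupby("LocationAbbr")["Data_Value_Alt"].mean()
print(by_state)
```

The same groupby pattern scales to the real file once it is read with `pd.read_csv`.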
The Integrated Postsecondary Education Data System (IPEDS) is a system of interrelated surveys conducted annually by the U.S. Department of Education's National Center for Education Statistics (NCES). IPEDS annually gathers information from about 6,400 colleges, universities, and technical and vocational institutions that participate in the federal student aid programs.
Access Database: To eliminate the step of downloading IPEDS data separately by survey component or selected variables, IPEDS has made the entire survey data for one collection year available in Microsoft Access format, beginning with the 2004-05 IPEDS data collection year. Each database contains the relational data tables as well as the metadata tables that describe each data table, the variable titles, descriptions, and variable types. Value codes and value labels are also available for all categorical variables. When downloading an IPEDS Access Database, the file is compressed using WinZip.
The Annual Population Survey (APS) is a major survey series, which aims to provide data that can produce reliable estimates at local authority level. Key topics covered in the survey include education, employment, health and ethnicity. The APS comprises key variables from the Labour Force Survey (LFS) (held at the UK Data Archive under GN 33246), all of its associated LFS boosts and the APS boost. Thus, the APS combines results from five different sources: the LFS (waves 1 and 5); the English Local Labour Force Survey (LLFS), the Welsh Labour Force Survey (WLFS), the Scottish Labour Force Survey (SLFS) and the Annual Population Survey Boost Sample (APS(B) - however, this ceased to exist at the end of December 2005, so APS data from January 2006 onwards will contain all the above data apart from APS(B)). Users should note that the LLFS, WLFS, SLFS and APS(B) are not held separately at the UK Data Archive. For further detailed information about methodology, users should consult the Labour Force Survey User Guide, selected volumes of which have been included with the APS documentation for reference purposes (see 'Documentation' table below).
The APS aims to provide enhanced annual data for England, covering a target sample of at least 510 economically active persons for each Unitary Authority (UA)/Local Authority District (LAD) and at least 450 in each Greater London Borough. In combination with local LFS boost samples such as the WLFS and SLFS, the survey provides estimates for a range of indicators down to Local Education Authority (LEA) level across the United Kingdom.
APS Well-Being data
Since April 2011, the APS has included questions about personal and subjective well-being. The responses to these questions have been made available as annual subsets of the APS person-level files. It is important to note that the achieved sample for the well-being questions within the dataset is approximately 165,000 people. This reduction arises because the well-being questions are only asked of persons aged 16 and above who gave a personal interview; proxy answers are not accepted. As a result, some caution should be used when analysing responses to the well-being questions at detailed geographic levels, and in relation to any other variables where respondent numbers are relatively small. For lower-level geographic analysis, it is recommended that the variable UACNTY09 is used.
As well as annual datasets, three-year pooled datasets are available. When combining multiple APS datasets, it is important to account for the rotational design of the APS and ensure that no person appears more than once in the multi-year dataset. The well-being datasets are not designed to be longitudinal, i.e. they are not intended to track individuals over time. They are instead cross-sectional, designed to use a cross-section of the population to make inferences about the whole population. For this reason, the three-year dataset includes only a selection of the cases from the individual-year APS datasets, chosen so that no individual is included more than once and the cases are approximately equally spread across the three years. Further information is available in the 'Documentation' section below.
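The combination rule above can be sketched in a few lines of pandas; this is only a toy illustration, and `person_id` is a hypothetical identifier, not a documented APS variable:

```python
import pandas as pd

# Two toy annual person-level files; id 3 appears in both years because of
# the survey's rotational design.
y2019 = pd.DataFrame({"person_id": [1, 2, 3], "year": 2019})
y2020 = pd.DataFrame({"person_id": [3, 4, 5], "year": 2020})

# Pool the years, keeping each person's first appearance only, so that no
# individual is counted twice in the pooled cross-section.
pooled = (pd.concat([y2019, y2020], ignore_index=True)
          .drop_duplicates(subset="person_id", keep="first"))
print(pooled)
```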
Secure Access APS Well-Being data
Secure Access datasets for the APS Well-Being include additional variables not included in either the standard End User Licence (EUL) versions (see under GN 33357) or the Special Licence (SL) access versions (see under GN 33376). Extra variables that typically can be found in the Secure Access version but not in the EUL or SL versions relate to:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains information about individuals' demographic and employment attributes to predict whether their income exceeds $50,000 per year. It originates from the 1994 U.S. Census database and has been widely used in classification problems, making it an excellent resource for machine learning, data analysis, and statistical modeling.
The dataset includes various features related to personal and work-related attributes. The target variable is whether an individual's income exceeds $50,000 annually.
Key features include:
Age: Continuous variable representing the age of the individual.
Workclass: Categorical variable indicating the type of employment (e.g., Private, Self-Employed, Government).
Education: Categorical variable showing the highest level of education achieved (e.g., Bachelors, Masters).
Education-Num: Numerical representation of the education level.
Marital Status: Categorical variable representing marital status (e.g., Married, Never-Married).
Occupation: Categorical variable indicating the job role or occupation
Relationship: Categorical variable describing the family relationship (e.g., Husband, Wife).
Race: Categorical variable showing the race of the individual.
Sex: Categorical variable indicating the gender of the individual.
Capital Gain: Continuous variable representing income from capital gains.
Capital Loss: Continuous variable representing losses from investments.
Hours Per Week: Continuous variable showing the average working hours per week.
Native Country: Categorical variable indicating the country of origin.
Income: Target variable (binary), indicating whether the individual earns more than $50,000 (>50K) or not (<=50K).
This dataset was derived from the 1994 U.S. Census database and has been made publicly available for research and educational purposes. It is not affiliated with any specific organization. Users are encouraged to comply with ethical data usage guidelines while working with this dataset.
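As a quick illustration of the classification task, here is a hedged sketch using scikit-learn on toy Adult-style rows; the records and the simplified column subset are illustrative, not taken from the real Census file:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for Census records, using a subset of the features above.
data = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "workclass": ["Private", "Self-emp", "Private", "Private",
                  "Government", "Private"],
    "education_num": [13, 13, 9, 7, 13, 14],
    "hours_per_week": [40, 13, 40, 40, 40, 80],
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

features = ["age", "workclass", "education_num", "hours_per_week"]
model = Pipeline([
    # One-hot encode the categorical column; pass numeric columns through.
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"])],
        remainder="passthrough")),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(data[features], data["income"])
preds = model.predict(data[features])
```

On the full dataset the same pipeline applies unchanged, with all categorical features listed in the encoder.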
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open-access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
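The long-to-wide reduction described above (one row per film, keeping the first sample festival) can be sketched in pandas; the column names and rows below are illustrative stand-ins, not the dataset's real variable names:

```python
import pandas as pd

# Toy long-format table: film 101 appears at two festivals in one year.
long_df = pd.DataFrame({
    "film_id": [101, 101, 102],
    "fest": ["Berlinale", "Frameline", "Berlinale"],
    "month": [2, 6, 2],  # month of the festival edition
})

# Keep the earliest festival appearance per film: one row per unique film.
wide_df = (long_df.sort_values("month")
           .drop_duplicates(subset="film_id", keep="first")
           .reset_index(drop=True))
```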
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes eight script files used for web scraping. They were written using R version 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, falling back to an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. It then defines a loop that matches (with matching scores) each film in the core dataset against the suggested films on the IMDb search page. Matching was done on directors, production year (+/- one year), and title, using a fuzzy matching approach with two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
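The dual-method scoring can be approximated in a few lines. The original scripts use R's stringdist “cosine” and “osa” measures; the sketch below substitutes a character-trigram cosine similarity and Python's difflib ratio as a stand-in for the edit-distance method:

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt

def trigrams(s: str) -> Counter:
    """Character trigram counts of a lowercased title."""
    s = s.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity between trigram count vectors (1.0 = identical)."""
    ta, tb = trigrams(a), trigrams(b)
    dot = sum(ta[k] * tb[k] for k in ta)
    norm = (sqrt(sum(v * v for v in ta.values()))
            * sqrt(sum(v * v for v in tb.values())))
    return dot / norm if norm else 0.0

def match_score(title: str, candidate: str) -> float:
    """Best of the two methods, mirroring the dual-method matching."""
    edit_like = SequenceMatcher(None, title.lower(), candidate.lower()).ratio()
    return max(cosine_sim(title, candidate), edit_like)
```

Cosine on trigrams tolerates reordered words; the sequence ratio tolerates small typos, so taking the maximum covers both failure modes.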
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does this for the first 100 films as a check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.
The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, i.e. all information for each festival is listed in one row.
This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma separated value (csv) file of county-level data and one csv file of city-level data.

The county-level csv (“county_data.csv”) contains data for 3,109 counties. It includes two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single family homes, percent apartments, and a numeric version of urban class.

The city-level csv (“city_data.csv”) contains data for 83 cities. It includes descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity.

The R scripts construct fixed-effects and Bayesian hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings.
Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed using only the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level. The county-level models were compared using the Watanabe-Akaike information criterion (WAIC), which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance. All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
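The pooled-versus-unpooled distinction can be illustrated with ordinary least squares on toy grouped data; this sketch uses numpy only (the release itself uses R and Stan), and the group labels and slopes are invented:

```python
import numpy as np

rng = np.random.default_rng(42)

# Three toy groups (think "climate regions") with different true slopes.
true_slopes = {"A": 1.0, "B": 2.0, "C": 3.0}
data = {}
for g, b in true_slopes.items():
    x = rng.normal(size=200)
    y = b * x + rng.normal(scale=0.1, size=200)
    data[g] = (x, y)

def ols_slope(x, y):
    """Slope of a simple least-squares line fit."""
    return float(np.polyfit(x, y, 1)[0])

# Fully pooled: clustering ignored, one coefficient from all observations.
all_x = np.concatenate([x for x, _ in data.values()])
all_y = np.concatenate([y for _, y in data.values()])
pooled = ols_slope(all_x, all_y)

# Unpooled: a separate coefficient per group.
unpooled = {g: ols_slope(x, y) for g, (x, y) in data.items()}
# A hierarchical model sits between the two, shrinking each group's
# estimate toward the pooled value according to the data's support.
```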
Background
The Annual Population Survey (APS) is a major survey series, which aims to provide data that can produce reliable estimates at local authority level. Key topics covered in the survey include education, employment, health and ethnicity. The APS comprises key variables from the Labour Force Survey (LFS) (held at the UK Data Archive under GN 33246), all of its associated LFS boosts and the APS boost. Thus, the APS combines results from five different sources: the LFS (waves 1 and 5); the English Local Labour Force Survey (LLFS), the Welsh Labour Force Survey (WLFS), the Scottish Labour Force Survey (SLFS) and the Annual Population Survey Boost Sample (APS(B) - however, this ceased to exist at the end of December 2005, so APS data from January 2006 onwards will contain all the above data apart from APS(B)). Users should note that the LLFS, WLFS, SLFS and APS(B) are not held separately at the UK Data Archive. For further detailed information about methodology, users should consult the Labour Force Survey User Guide, selected volumes of which have been included with the APS documentation for reference purposes (see 'Documentation' table below).
The APS aims to provide enhanced annual data for England, covering a target sample of at least 510 economically active persons for each Unitary Authority (UA)/Local Authority District (LAD) and at least 450 in each Greater London Borough. In combination with local LFS boost samples such as the WLFS and SLFS, the survey provides estimates for a range of indicators down to Local Education Authority (LEA) level across the United Kingdom.
Secure Access APS data
Secure Access datasets for the APS include additional variables not included in the standard End User Licence (EUL) versions (see under GN 33357). Extra variables that typically can be found in the Secure Access version but not in the EUL versions relate to:
Occupation data for 2021 and 2022 data files
The ONS have identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. None of ONS' headline statistics, other than those directly sourced from occupational data, are affected and you can continue to rely on their accuracy. For further information on this issue, please see: https://www.ons.gov.uk/news/statementsandletters/occupationaldatainonssurveys.
Latest edition information:
For the thirty-second edition (August 2025), a data file for January to December 2024 has been added to the study.
By City of Chicago [source]
This public health dataset contains a comprehensive selection of indicators related to natality, mortality, infectious disease, lead poisoning, and economic status from Chicago community areas. It is an invaluable resource for those interested in understanding the current state of public health within each area in order to identify any deficiencies or areas of improvement needed.
The data includes 27 indicators such as birth and death rates, prenatal care beginning in first trimester percentages, preterm birth rates, breast cancer incidences per hundred thousand female population, all-sites cancer rates per hundred thousand population and more. For each indicator provided it details the geographical region so that analyses can be made regarding trends on a local level. Furthermore this dataset allows various stakeholders to measure performance along these indicators or even compare different community areas side-by-side.
This dataset provides a valuable tool for those striving toward better public health outcomes for the citizens of Chicago's communities, allowing greater insight into trends specific to geographic regions that could inform further research and implementation practices based on empirical evidence gathered from this comprehensive yet digestible selection of indicators.
In order to use this dataset effectively to assess the public health of a given area or areas in the city:
- Understand which data is available: The list of data included in this dataset can be found above. It is important to know all of the indicators as well as their definitions so that accurate conclusions can be made when using the data for research or analysis.
- Identify areas of interest: Once you are familiar with the available data, identify which community areas you would like to study more closely or compare with one another.
- Choose your variables: Once you have identified your areas, decide which variables are most relevant to your studies, and frame specific research questions around those variables based on what you are trying to learn from this dataset.
- Analyze the data: Once your variables have been selected, dive into analyzing the corresponding values across different community areas using statistical tests such as t-tests or correlations. This will help answer questions like “Are there significant differences between two outputs?”, allowing you to compare how different Chicago community areas stack up against each other with regard to the public health statistics tracked by this dataset.
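A minimal sketch of such a comparison in Python, using scipy; the indicator values below are invented placeholders, not real figures from the file:

```python
from scipy import stats

# Illustrative birth rates (per 1,000 population) for two hypothetical
# groups of community areas.
group_a = [16.4, 15.1, 14.8, 17.2, 15.9]
group_b = [19.8, 21.3, 20.1, 18.7, 22.0]

# Two-sample t-test: is the mean indicator different between the groups?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
if p_value < 0.05:
    print("Significant difference between the two groups of areas")
```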
- Creating interactive maps that show data on public health indicators by Chicago community area to allow users to explore the data more easily.
- Designing a machine learning model to predict future variations in public health indicators by Chicago community area such as birth rate, preterm births, and childhood lead poisoning levels.
- Developing an app that enables users to search for public health information in their own community areas and compare with other areas within the city or across different cities in the US
If you use this dataset in your research, please credit the original authors. Data Source
See the dataset description for more information.
File: public-health-statistics-selected-public-health-indicators-by-chicago-community-area-1.csv

| Column name | Description |
|:---|:---|
| Community Area | Unique identifier for each community area in Chicago. (Integer) |
| Community Area Name | Name of the community area in Chicago. (String) |
| Birth Rate | Number of live births per 1,000 population. (Float) |
| General Fertility Rate | Number of live births per 1,000 women aged 15-44. (Float) |
...
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
New version 2.0.0 with major changes.
For free and complete information concerning the CASSMIR datasets, please visit our website (in French).
The CASSMIR database (Contribution to the Spatial and Sociological Analysis of Residential Real Estate Markets) is a set of spatial and population datasets on the residential property market of the Parisian metropolitan area, from 1996 to 2018. The indicators in the CASSMIR database cover four "thematic areas of investigation": prices; the socio-demographic profile of buyers and sellers; purchasing regimes and types of property transfers; and types of real estate. These indicators characterize spatial units at three scales (communal level, 1 km grid and 200 m grid) and population groups of buyers and sellers broken down by social, generational and gender criteria. The database is produced through a series of matching and aggregation steps applied to individual data from two original databases: a database on real estate transactions (the BIEN database) and a database on first-time buyer investments (the PTZ database). CASSMIR delivers aggregated data (nearly 350 variables) in open access for non-commercial use.
This repository consists of seven files.
"CASSMIR_SpatialDataBase" is a GeoPackage file listing all the data aggregated to the spatial units of reference. It is composed of three layers corresponding to the geographical scales of aggregation: the communal level, a grid with cells one kilometre on a side, and a grid with cells two hundred metres on a side.
"CASSMIR_GroupesPopDataBase" is a .csv file listing all the data aggregated to the population groups of reference. There are three types of population group: groups defined by the social position of the buyers/sellers (social groups), groups defined by the age group to which the buyers/sellers belong (generational groups), and groups defined by the sex of the buyers/sellers (gender groups).
Two metadata files (.csv) list the metadata of the indicators made available. They are systematically structured as follows:
"BIENSampleForTest" and "PTZSampleForTest" are two .txt files that provide a sample of individual data from each of the original databases. All data were anonymized and the values randomized. These two files are dedicated solely to reproducing the different processing stages that lead to the production of the CASSMIR files ("CASSMIR_SpatialDataBase" and "CASSMIR_GroupesPopDataBase") and cannot be used in any other way.
"LEXIQUE" is a glossary of terms used to name the variables (.csv).
The creation of the database was funded by the National Research Agency (ANR WIsDHoM, https://anr.fr/Projet-ANR-18-CE41-0004).
All CASSMIR documentation (in French) and R code are accessible via the GitLab repository at the following address: https://gitlab.huma-num.fr/tlecorre/cassmir.git
METADATA :
This dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. You are free to copy, distribute, transmit, and adapt the data, provided that you credit the CASSMIR database and specify the original source of the data. If you modify the data or use them in derivative works, you may distribute them only under the same license. You may not make commercial use of this database, nor use it for any purpose other than scientific research.
- Figures: (CC - CASSMIR database, indicator(s) constructed from XXX data)
- Bibliography : Productions that use the CASSMIR database must reference the dataset and the data paper.
Dataset: Le Corre T., 2020, CASSMIR (Version 2.0.0) [Data set], Zenodo. http://doi.org/10.5281/zenodo.4497219
Data paper: Thibault Le Corre, « Une base de données pour étudier vingt années de dynamiques du marché immobilier résidentiel en Île-de-France », Cybergeo: European Journal of Geography [En ligne], Data papers, article No.992, mis en ligne le 09 août 2021. URL : http://journals.openedition.org/cybergeo/37430 ; DOI : https://doi.org/10.4000/cybergeo.37430
"Une base de données pour étudier vingt années de dynamiques du marché immobilier en Île-de-France"
Thibault Le Corre
Housing market, database, Île-de-France, spatio-temporal dynamics
DOI : https://doi.org/10.4000/cybergeo.37430
French
The time period covered by the indicators in the database depends on the data sources used, respectively:
For data from BIEN: 1996, 1999, 2003-2012, 2015, 2018
For data from PTZ: 1996-2016
Nature of data submitted
vector: Vector data
grid: Data mesh
code: programming code (see the website or GitLab of the project)
Île-de-France region
Municipalities and grid cells (1 km side grid and 200 m side grid) concerned by real estate transactions
Reference Coordinate System (RCS): EPSG 2154 RGF93/Lambert 93.
- Xmin : 586421.7
- Xmax : 741205.6
- Ymin : 6780020
- Ymax : 6905324
Data Paper
By Noah Rippner [source]
This dataset provides an in-depth look at the data elements for the US College Scorecard Graduation and Opportunity Project use case. It contains information on the variables used to create a comprehensive report, including Year, dev-category, developer-friendly name, VARIABLE NAME, API data type, label, VALUE, LABEL, SCORECARD? Y/N, SOURCE and NOTES. The data is provided by the U.S. Department of Education and allows parents, students and policymakers to take meaningful action to improve outcomes. The dataset contains more than enough information to allow people like Maria, a 25-year-old recent US Army veteran who wants a degree in Management Systems and Information Technology, to distinguish between her school options; access services; and find affordable housing near high-quality schools located in safe neighborhoods with access to transport links and nearby employment opportunities. This highly useful dataset provides detailed analysis of all these criteria so that users can make an informed decision about which school is best for them.
This dataset contains data related to college students, including college graduation rates, access-to-opportunity indicators such as geographic mobility and career readiness, and other important indicators of the overall learning experience in the United States. This guide will show you how to use this dataset to draw meaningful conclusions about higher education in America.
First, familiarize yourself with the fields included in this College Scorecard US College Graduation and Opportunity data set. Each record comprises several data elements, defined by concise labels: Name of Data Element, Year, dev-category (developmental category), Variable Name, API data type (type information for the programmatic interface), Label (descriptive content labeling for visual reporting), and Value with its Label (descriptive value labeling for visual reporting). SCORECARD? Y/N indicates whether a field pertains to the U.S. Department of Education's College Scorecard program, SOURCE indicates where the source of the variable can be found, and the NOTES column beneath each row entry holds further details for analysis or comparison between elements.
Now that you understand the components of each record, here are some key steps you can take when working with this dataset:
- Apply year-specific filters on specified fields if needed, e.g., Year = 2020 & API Data Type = Character.
- Check for any "NCalPlaceHolder" values where applicable; these are placeholders indicating that a value has been withheld from the Scorecard display for formatting reasons or has not yet been updated, so re-query the API later for the latest results.
- Pivot data points into more custom tabular outputs, distilling complex unstructured raw sources into more digestible medium-level datasets that can be consumed via Power BI / Tableau-compatible snapshots, going beyond the delimited text exports provided.
- Explore correlations between education metrics and third-party data, such as values indicative of educational adherence, ROI and growth potential, looking beyond campus recognition metrics.
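The filtering step above can be sketched as follows; the records are hypothetical stand-ins for rows of the data dictionary, and the variable names are only examples.

```python
# Hypothetical records mimicking the data-dictionary fields described above
records = [
    {"Year": 2020, "API data type": "Character", "VARIABLE NAME": "INSTNM"},
    {"Year": 2020, "API data type": "Float",     "VARIABLE NAME": "COSTT4_A"},
    {"Year": 2019, "API data type": "Character", "VARIABLE NAME": "CITY"},
]

# Keep only 2020 fields of Character type
subset = [r for r in records
          if r["Year"] == 2020 and r["API data type"] == "Character"]
print([r["VARIABLE NAME"] for r in subset])   # → ['INSTNM']
```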
- Creating an interactive dashboard to compare school performance in terms of safety, entrepreneurship and other criteria.
- Using the data to create a heat map visualization that shows which cities are most conducive to a successful educational experience for students like Maria.
- Gathering information about average course costs at different universities and mapping them against US unemployment rates to indicate which states might offer the best value for money in higher education expenses.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Integrated Postsecondary Education Data System (IPEDS) is a system of interrelated surveys conducted annually by the U.S. Department of Education's National Center for Education Statistics (NCES). IPEDS annually gathers information from about 6,400 colleges, universities, and technical and vocational institutions that participate in the federal student aid programs. Access Database: To eliminate the step of downloading IPEDS separately by survey component or select variables, IPEDS has made available the entire survey data for one collection year in the Microsoft Access format beginning with the 2004-05 IPEDS data collection year. Each database contains the relational data tables as well as the metadata tables that describe each data table, the variable titles, descriptions and variables types. Value codes and value labels are also available for all categorical variables. When downloading an IPEDS Access Database, the file is compressed using WinZip.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The survey dataset for identifying the Shiraz old silo's new use includes four components: 1. The survey instrument used to collect the data, "SurveyInstrument_table.pdf". The survey instrument contains 18 main closed-ended questions in a table format. Two of these concern information on the Silo's decision-makers and the proposed new use, following a short introduction to the questionnaire; the other 16 (each identifying 3 variables) address the level of appropriate opinions for ideal intervention in the Façade, Openings, Materials and Floor heights of the building across four values: Feasibility, Reversibility, Compatibility and Social Benefits. 2. The raw survey data, "SurveyData.rar". This file contains an Excel .xlsx and an SPSS .sav file. The survey data file contains 50 variables (12 for each of the four values, separated by colour) and data from each of the 632 respondents. Answering each question in the survey was mandatory, therefore there are no blanks or non-responses in the dataset. In the .sav file, all variables were assigned the numeric type and nominal measurement level. More details about each variable can be found in the Variable View tab of this file. Additional variables were created by grouping or consolidating categories within each survey question for simpler analysis; these variables are listed in the last columns of the .xlsx file. 3. The analysed survey data, "AnalysedData.rar". This file contains 6 "SPSS Statistics Output Documents" which demonstrate statistical tests and analyses such as mean, correlation, automatic linear regression, reliability, frequencies, and descriptives. 4. The codebook, "Codebook.rar". The detailed SPSS "Codebook.pdf", alongside the simplified codebook "VariableInformation_table.pdf", provides a comprehensive guide to all 50 variables in the survey data, including numerical codes for survey questions and response options.
They serve as valuable resources for understanding the dataset, presenting dictionary information, and providing descriptive statistics, such as counts and percentages for categorical variables.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The AgrImOnIA dataset is a comprehensive dataset relating air quality and livestock (expressed as the density of bovines and swine bred) along with weather and other variables. The AgrImOnIA Dataset represents the first step of the AgrImOnIA project. The purpose of this dataset is to give the opportunity to assess the impact of agriculture on air quality in Lombardy through statistical techniques capable of highlighting the relationship between the livestock sector and air pollutants concentrations.
The building process of the dataset is detailed in the companion paper:
A. Fassò, J. Rodeschini, A. Fusta Moro, Q. Shaboviq, P. Maranzano, M. Cameletti, F. Finazzi, N. Golini, R. Ignaccolo, and P. Otto (2023). Agrimonia: a dataset on livestock, meteorology and air quality in the Lombardy region, Italy. SCIENTIFIC DATA, 1-19.
available here.
This dataset is a collection of estimated daily values for a range of measurements across different dimensions: air quality, meteorology, emissions, livestock animals and land use. Data relate to Lombardy and the surrounding area for 2016-2021, inclusive. The surrounding area is obtained by applying a 0.3° buffer to the Lombardy borders.
The data uses several aggregation and interpolation methods to estimate the measurement for all days.
The files in the record, renamed according to their version (e.g., .._v_3_0_0), are:
Agrimonia_Dataset.csv (.mat and .Rdata), which is built by joining the daily time series related to the AQ, WE, EM, LI and LA variables. To simplify access to variables in the Agrimonia dataset, each variable name starts with its dimension, i.e., names of variables related to the AQ dimension start with 'AQ_'. This file is also archived in formats for the MATLAB and R software.
Metadata_Agrimonia.csv which provides further information about the Agrimonia variables: e.g. sources used, original names of the variables imported, transformations applied.
Metadata_AQ_imputation_uncertainty.csv, which contains the daily uncertainty estimates of the imputed observations for the AQ variables, used to mitigate missing data in the hourly time series.
Metadata_LA_CORINE_labels.csv which contains the label and the description associated with the CLC class.
Metadata_monitoring_network_registry.csv which contains all details about the AQ monitoring station used to build the dataset. Information about air quality monitoring stations include: station type, municipality code, environment type, altitude, pollutants sampled and other. Each row represents a single sensor.
Metadata_LA_SIARL_labels.csv which contains the label and the description associated with the SIARL class.
AGC_Dataset.csv(.mat and .Rdata) that includes daily data of almost all variables available in the Agrimonia Dataset (excluding AQ variables) on an equidistant grid covering the Lombardy region and its surrounding area.
The Agrimonia dataset can be reproduced using the code available at the GitHub page: https://github.com/AgrImOnIA-project/AgrImOnIA_Data
UPDATE 31/05/2023 - NEW RELEASE - V 3.0.0
A new version of the dataset is released: Agrimonia_Dataset_v_3_0_0.csv (.Rdata and .mat), where the variables WE_rh_min, WE_rh_mean and WE_rh_max have been recomputed due to some bugs.
In addition, two new columns are added, LI_pigs_v2 and LI_bovine_v2, which represent the density of pigs and bovines (expressed as animals per square kilometre) in a square of size ~10 x 10 km centred at the station location.
A new dataset is released: the Agrimonia Grid Covariates (AGC), which includes daily information for the period from 2016 to 2020 for almost all variables within the Agrimonia Dataset on an equidistant grid covering the Lombardy region and its surrounding area. The AGC does not include AQ variables, as these come from monitoring stations that are irregularly spread over the area considered.
UPDATE 11/03/2023 - NEW RELEASE - V 2.0.2
A new version of the dataset is released: Agrimonia_Dataset_v_2_0_2.csv (.Rdata), where the variable WE_tot_precipitation has been recomputed due to some bugs.
A new version of the metadata is available: Metadata_Agrimonia_v_2_0_2.csv, where the spatial resolution of the variable WE_precipitation_t is corrected.
UPDATE 24/01/2023 - NEW RELEASE - V 2.0.1
Minor bug fixed.
UPDATE 16/01/2023 - NEW RELEASE - V 2.0.0
A new version of the dataset is released, Agrimonia_Dataset_v_2_0_0.csv (.Rdata) and Metadata_monitoring_network_registry_v_2_0_0.csv. Some minor points have been addressed:
Added values for LA_land_use variable for Switzerland stations (in Agrimonia Dataset_v_2_0_0.csv)
Deleted incorrect values for LA_soil_use variable for stations outside Lombardy region during 2018 (in Agrimonia Dataset_v_2_0_0.csv)
Fixed duplicate sensors corresponding to the same pollutant within the same station (in Metadata_monitoring_network_registry_v_2_0_0.csv)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and, currently, the escalation of the so-called Israeli-Palestinian conflict, which has strained neighbouring countries like Jordan due to the influx of Syrian refugees and has increased the population's vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data; the following subfolders are stored as zipped files:
“code” stores the nine code chunks described below, used to read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders: “raw_data” holds the unprocessed datasets and “derived_data” stores the cropped built-up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk, covering the extraction of MODIS data from the .hdf file format. The following packages must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatial) time series and merge them later. Note that the time series are temporally consistent.
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif. We now produced single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.
4_TREND_analysis_NDVI.R
Now we want to perform trend analysis on the derived data. The data we load are tricky, as they contain a 16-day return period across a year for a period of 22 years. Growing season sums cover MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the DJF period (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and flag all values significant at the 0.05 confidence level. Using the ggplot2 package and the melt function from the reshape2 package, we can plot the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
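The slope calculation at the heart of this step can be illustrated outside R. The sketch below (in Python, while the repository's code is in R) computes an ordinary least-squares slope over hypothetical annual growing-season sums; `ols_slope` is an illustrative helper, not part of the repository.

```python
def ols_slope(y):
    """Least-squares slope of y against time indices 0..n-1."""
    n = len(y)
    xs = range(n)
    mx = (n - 1) / 2            # mean of 0..n-1
    my = sum(y) / n
    num = sum((x - mx) * (v - my) for x, v in zip(xs, y))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical annual growing-season NDVI sums over four years
print(ols_slope([10.0, 10.4, 10.9, 11.3]))
```

A positive slope indicates greening over the period; significance would additionally require a p-value, which the repository's R code extracts from the fitted trend.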
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
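The z-score normalization described here is straightforward to sketch. The example below is an illustration in Python (the repository's code is in R), using the sample standard deviation; the input series is invented.

```python
import math

def z_scores(values):
    """Deviation of each value from the series mean, in standard deviations."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values]

print(z_scores([1.0, 2.0, 3.0]))   # → [-1.0, 0.0, 1.0]
```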
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023, 100 m resolution, global coverage). One can download the temporal coverage that is aimed for and reclassify it using the code, after cropping to the individual study area. Here, I summed up different rasters to characterize the built-up change as continuous values between 1975 and 2022.
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we use the country productivity data from the "yield_productivity" folder in the repository (e.g., "Jordan_yield.csv"). Each single-country yield dataset is plotted with ggplot and combined using the patchwork package in R.
8_GLDAS_read_extract_trend
The last code chunk provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data come in .nc file format, and various variables can be extracted using the ["^a variable name"] command on the SpatRaster collection. Each time you run the code, this variable name must be adjusted to match the required variable (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R), or run print(nc) from the code, or use names() on the SpatRaster collection.
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. For, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and 95% confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe thanks to the availability of the GLDAS variables.
Please review Zhang et al. (2021) for details on study design and datasets (https://doi.org/10.1016/j.watres.2022.118443). In summary, predictor and response variable data were acquired from the Chesapeake Bay Program and USGS. These data were subjected to a trend analysis to estimate the MK linear slope change for both predictor and response variables. After running a cluster analysis on the scaled TN loading time series (the response variable), the cluster assignment was paired with the slope estimates from the suite of predictor variables tied to the nutrient inventory and static geologic and land use variables. From there, an RF analysis was executed to link trends in anthropogenic drivers and other contextual environmental factors to the identified trend cluster types. After calibrating the RF model, the likelihood of improving, relatively static, or degrading catchments across the Chesapeake Bay was identified for the 2007 to 2018 period. Tabular data is available on the journal website and PubMed, and the predictor/response variable data can be downloaded individually via the USGS and Chesapeake Bay Program links listed in the data access section. Portions of this dataset are inaccessible because the data were generated by other federal entities and are housed in their respective data warehouse domains (e.g., USGS and Chesapeake Bay Program). The data can also be accessed on the journal website as well as NCBI PubMed (https://pubmed.ncbi.nlm.nih.gov/35461100/). They can be accessed through the following means: the combined dataset is available on the journal website (https://www.sciencedirect.com/science/article/pii/S0043135422003979?via%3Dihub#ack0001) and will soon be available on NCBI (https://pubmed.ncbi.nlm.nih.gov/35461100/).
The predictor variable data can be accessed from the Chesapeake Bay Program (https://cast.chesapeakebay.net/) and USGS (https://pubs.er.usgs.gov/publication/ds948 and https://www.sciencebase.gov/catalog/item/5669a79ee4b08895842a1d47). This dataset is associated with the following publication: Zhang, Q., J. Bostic, and R. Sabo. Regional patterns and drivers of total nitrogen trends in the Chesapeake Bay watershed: Insights from machine learning approaches and management implications. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 218: 1-15, (2022).
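The "MK linear slope" mentioned above is commonly estimated as the Theil-Sen slope: the median of all pairwise slopes, which is the estimator usually paired with the Mann-Kendall trend test. A sketch under that assumption follows (the TN loading values are invented for illustration, not taken from the study):

```python
from itertools import combinations
from statistics import median

def theil_sen_slope(y):
    """Median of all pairwise slopes of y against its time indices —
    the robust slope estimator usually paired with the Mann-Kendall test."""
    slopes = [(y[j] - y[i]) / (j - i)
              for i, j in combinations(range(len(y)), 2)]
    return median(slopes)

# Hypothetical annual TN loading values for one catchment
print(theil_sen_slope([4.0, 3.6, 3.5, 3.1, 2.9]))
```

A negative slope would place the catchment toward the "improving" end of the cluster types described above; the study itself pairs such slopes with RF-based classification.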
The Annual Population Survey (APS) household datasets are produced annually and are available from 2004 (Special Licence) and 2006 (End User Licence). They allow production of family and household labour market statistics at local areas and for small sub-groups of the population across the UK. The household data comprise key variables from the Labour Force Survey (LFS) and the APS 'person' datasets. The APS household datasets include all the variables on the LFS and APS person datasets, except for the income variables. They also include key family and household-level derived variables. These variables allow for an analysis of the combined economic activity status of the family or household. In addition, they also include more detailed geographical, industry, occupation, health and age variables.
For further detailed information about methodology, users should consult the Labour Force Survey User Guide, included with the APS documentation. For variable and value labelling and coding frames that are not included either in the data or in the current APS documentation, users are advised to consult the latest versions of the LFS User Guides, which are available from the ONS Labour Force Survey - User Guidance webpages.
Occupation data for 2021 and 2022
The ONS has identified an issue with the collection of some occupational data in the 2021 and 2022 data files in a number of its surveys. While it estimates any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. None of the ONS's headline statistics, other than those directly sourced from occupational data, are affected, and you can continue to rely on their accuracy. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022
End User Licence and Secure Access APS data
Users should note that there are two versions of each APS dataset. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. The EUL version includes Government Office Region geography, banded age, 3-digit SOC and industry sector for main, second and last job. The Secure Access version contains more detailed variables relating to:
The Gender Statistics database is a comprehensive source for the latest sex-disaggregated data and gender statistics covering demography, education, health, access to economic opportunities, public life and decision-making, and agency.
The data is split into several files, with the main one being Data.csv, which contains all the variables of interest in this dataset; the others are lists of references and general nation-by-nation information.
Data.csv contains the following fields:
I couldn't find any metadata for these, and I'm not qualified to guess at what each variable means. I'll list the variables for each file, and if anyone has any suggestions (or, even better, actual knowledge/citations) as to what they mean, please leave a note in the comments and I'll add your info to the data description.
Country-Series.csv
Country.csv
FootNote.csv
Series-Time.csv
Series.csv
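Since the column layouts aren't documented, a quick way to inventory them is to read only the header row of each file. A minimal sketch, assuming the six CSVs listed above sit in your working directory (the file names come from the dataset description; the example header in the usage note is hypothetical):

```python
import pandas as pd

def list_columns(paths):
    """Return {filename: column list}, reading only each file's header row."""
    return {p: pd.read_csv(p, nrows=0).columns.tolist() for p in paths}

# The six files shipped with this dataset, per the description above.
files = ["Data.csv", "Country.csv", "Country-Series.csv",
         "FootNote.csv", "Series.csv", "Series-Time.csv"]
# for name, cols in list_columns(files).items():
#     print(f"{name}: {cols}")
```

Uncomment the loop once the files are in place; `nrows=0` keeps this fast even for the large Data.csv, since pandas parses only the header line.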
This dataset was downloaded from The World Bank's Open Data project. The summary of the Terms of Use of this data is as follows:
You are free to copy, distribute, adapt, display or include the data in other products for commercial and noncommercial purposes at no cost subject to certain limitations summarized below.
You must include attribution for the data you use in the manner indicated in the metadata included with the data.
You must not claim or imply that The World Bank endorses your use of the data, or use The World Bank's logo(s) or trademark(s) in conjunction with such use.
Other parties may have ownership interests in some of the materials contained on The World Bank Web site. For example, we maintain a list of some specific data within the Datasets that you may not redistribute or reuse without first contacting the original content provider, as well as information regarding how to contact the original content provider. Before incorporating any data in other products, please check the list: Terms of use: Restricted Data.
-- [ed. note: this last is not applicable to the Gender Statistics database]
The World Bank makes no warranties with respect to the data and you agree The World Bank shall not be liable to you in connection with your use of the data.
This is only a summary of the Terms of Use for Datasets Listed in The World Bank Data Catalogue. Please read the actual agreement that controls your use of the Datasets, which is available here: Terms of use for datasets. Also see World Bank Terms and Conditions.
Summary statistics and variable definitions - regular access sample.
This dataset is part of a series of datasets in which batteries are continuously cycled with randomly generated current profiles. Reference charging and discharging cycles are also performed after a fixed interval of randomized usage to provide benchmarks for battery state of health. In this dataset, four 18650 Li-ion batteries (identified as RW1, RW2, RW7 and RW8) were continuously operated by repeatedly discharging them to 3.2 V using a randomized sequence of discharging currents between 0.5 A and 4 A. This type of discharging profile is referred to here as random walk (RW) discharging. After each discharging cycle, the batteries were charged for a randomly selected duration between 0.5 and 3 hours. After every fifty RW cycles, a series of reference charging and discharging cycles was performed to provide benchmarks for battery state of health.
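The cycling schedule above can be sketched as a small simulation. The battery model here (linear open-circuit-voltage curve, 2.2 Ah capacity, fixed 2 A charge current) is a placeholder assumption, not the actual cells' behaviour; only the protocol structure (random 0.5-4 A discharge steps to 3.2 V, random 0.5-3 h charges, reference cycles every fifty RW cycles) follows the description.

```python
import random

CAPACITY_AH = 2.2
V_FULL, V_CUTOFF = 4.2, 3.2

def voltage(soc):
    """Crude linear open-circuit-voltage model (placeholder)."""
    return V_CUTOFF + (V_FULL - V_CUTOFF) * soc

def rw_discharge(soc, rng, step_h=0.1):
    """Draw a new random current (0.5-4 A) each step until 3.2 V is reached."""
    while soc > 0 and voltage(soc) > V_CUTOFF:
        current_a = rng.uniform(0.5, 4.0)
        soc = max(0.0, soc - current_a * step_h / CAPACITY_AH)
    return soc

def random_charge(soc, rng, current_a=2.0):
    """Charge for a randomly selected duration between 0.5 and 3 hours."""
    hours = rng.uniform(0.5, 3.0)
    return min(1.0, soc + current_a * hours / CAPACITY_AH)

rng = random.Random(0)
soc, schedule = 1.0, []
for cycle in range(1, 101):
    soc = rw_discharge(soc, rng)
    soc = random_charge(soc, rng)
    if cycle % 50 == 0:
        # Reference charge/discharge cycles benchmark state of health.
        schedule.append(("reference", cycle))
        soc = 1.0
```

The point of the randomized profile is that usage is unpredictable, so degradation models trained on it generalize better than those trained on fixed-load cycling; the periodic reference cycles provide the consistent measurement needed to track capacity fade.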