62 datasets found
  1. Project for Statistics on Living Standards and Development 1993 - South...

    • microdata.fao.org
    • catalog.ihsn.org
    • +2 more
    Updated Oct 20, 2020
    Cite
    Southern Africa Labour and Development Research Unit (2020). Project for Statistics on Living Standards and Development 1993 - South Africa [Dataset]. https://microdata.fao.org/index.php/catalog/1527
    Explore at:
    Dataset updated
    Oct 20, 2020
    Dataset authored and provided by
    Southern Africa Labour and Development Research Unit
    Time period covered
    1993
    Area covered
    South Africa
    Description

    Abstract

    The Project for Statistics on Living Standards and Development was a countrywide World Bank Living Standards Measurement Survey. It covered approximately 9,000 households drawn from a representative sample of South African households. The fieldwork was undertaken during the nine months leading up to the country's first democratic elections at the end of April 1994. The purpose of the survey was to collect statistical information about the conditions under which South Africans live, in order to provide policymakers with the data necessary for planning strategies. This data would aid the implementation of goals such as those outlined in the Government of National Unity's Reconstruction and Development Programme.

    Geographic coverage

    National

    Analysis unit

    Households

    Universe

    All household members. Individuals in hospitals, old-age homes, hotels and hostels of educational institutions were not included in the sample. Migrant labour hostels were included. In addition to those that turned up in the selected ESDs, a sample of three hostels was chosen from a national list provided by the Human Sciences Research Council, and within each of these hostels a representative sample was drawn on a basis similar to that described above for the households in ESDs.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    (a) SAMPLING DESIGN

    Sample size is 9,000 households. The sample design adopted for the study was a two-stage self-weighting design in which the first-stage units were Census Enumerator Subdistricts (ESDs, or their equivalent) and the second-stage units were households. The advantage of using such a design is that it provides a representative sample that need not be based on an accurate census population distribution. In the case of South Africa, the sample will automatically include many poor people, without the need to go beyond this and oversample the poor. Proportionate sampling in such a self-weighting design offers the simplest possible data files for further analysis, as weights do not have to be added. In the end, however, this advantage could not be retained, and weights had to be added.

    (b) SAMPLE FRAME

    The sampling frame was drawn up on the basis of small, clearly demarcated area units, each with a population estimate. The nature of the self-weighting procedure adopted ensured, however, that this population estimate was not important for determining the final sample. For most of the country, census ESDs were used. Where some ESDs comprised relatively large populations, as for instance in some black townships such as Soweto, aerial photographs were used to divide the areas into blocks of approximately equal population size. In other instances, particularly in some of the former homelands, the area units were not ESDs but villages or village groups.

    In the sample design chosen, the area stage units (generally ESDs) were selected with probability proportional to size, based on the census population. Systematic sampling was used throughout, that is, sampling at a fixed interval in a list of ESDs, starting at a randomly selected starting point. Given that sampling was self-weighting, the impact of stratification was expected to be modest. The main objective was to ensure that the racial and geographic breakdown approximated the national population distribution. This was done by listing the area stage units (ESDs) by statistical region, and then within each statistical region by urban or rural status. Within these sub-statistical regions, the ESDs were then listed in order of percentage African. The sampling interval for the selection of the ESDs was obtained by dividing the 1991 census population of 38,120,853 by the 300 clusters to be selected. This yielded 105,800. Starting at a randomly selected point, every 105,800th person down the cluster list was selected. This ensured both geographic and racial diversity (ESDs were ordered by statistical sub-region and proportion of the population African). In three or four instances, the ESD chosen was judged inaccessible and replaced with a similar one.

    In the second sampling stage the unit of analysis was the household. In each selected ESD a listing or enumeration of households was carried out by means of a field operation. From the households listed in an ESD, a sample of households was selected by systematic sampling. Even though the ultimate enumeration unit was the household, in most cases "stands" were used as enumeration units. However, when a stand was chosen as the enumeration unit, all households on that stand had to be interviewed.
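    To make the first-stage selection concrete, here is a minimal illustrative sketch (in Python, not SALDRU's original tooling) of systematic probability-proportional-to-size selection from an ordered ESD list; the ESD populations below are invented.

    ```python
    import numpy as np

    rng = np.random.default_rng(1993)

    def systematic_pps(populations, n_clusters):
        """Systematic PPS selection: returns the index of the ESD containing
        each selection point on the cumulative population list."""
        cum = np.cumsum(populations)            # running population total
        interval = cum[-1] / n_clusters         # e.g. census total / number of clusters
        start = rng.uniform(0, interval)        # random starting point
        points = start + interval * np.arange(n_clusters)
        return np.searchsorted(cum, points)

    # Toy ESD list, already ordered by region and percentage African
    esd_pops = np.array([1200, 800, 1500, 950, 2100, 700, 1800])
    print(systematic_pps(esd_pops, 3))          # indices of the selected ESDs
    ```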

    Mode of data collection

    Face-to-face [f2f]

    Cleaning operations

    All the questionnaires were checked when received. Where information was incomplete or appeared contradictory, the questionnaire was sent back to the relevant survey organization. As soon as the data was available, it was captured using the local development platform ADE. This was completed in February 1994. Following this, a series of exploratory programs were written to highlight inconsistencies and outliers. For example, all person-level files were linked together to ensure that the same person code reported in different sections of the questionnaire corresponded to the same person. The error reports from these programs were compared to the questionnaires and the necessary alterations made. This was a lengthy process, as several files were checked more than once, and was completed at the beginning of August 1994. In some cases, questionnaires contained missing values, or comments indicating that the respondent did not know the answer or refused to answer a question.

    These responses are coded in the data files with the following values:

    -1 : The data was not available on the questionnaire or form
    -2 : The field is not applicable
    -3 : Respondent refused to answer
    -4 : Respondent did not know answer to question
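    For analysis, these codes can be recoded to explicit missing values; a minimal pandas sketch, assuming the data has been exported to a CSV file (the file name here is hypothetical):

    ```python
    import numpy as np
    import pandas as pd

    # Special codes documented above
    MISSING_CODES = [-1, -2, -3, -4]  # unavailable, not applicable, refused, did not know

    df = pd.read_csv("pslsd_household.csv")       # hypothetical export of the data
    df_clean = df.replace(MISSING_CODES, np.nan)  # treat all four codes as missing
    ```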

    Data appraisal

    The data collected in clusters 217 and 218 should be viewed as highly unreliable and have therefore been removed from the data set. The data currently available on the web site has been revised to remove the data from these clusters. Researchers who have downloaded the data in the past should revise their data sets. For information on the data in those clusters, contact SALDRU (http://www.saldru.uct.ac.za/).

  2. Data from: Best Management Practices Statistical Estimator (BMPSE) Version...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 27, 2025
    Cite
    U.S. Geological Survey (2025). Best Management Practices Statistical Estimator (BMPSE) Version 1.2.0 [Dataset]. https://catalog.data.gov/dataset/best-management-practices-statistical-estimator-bmpse-version-1-2-0
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    The Best Management Practices Statistical Estimator (BMPSE) version 1.2.0 was developed by the U.S. Geological Survey (USGS), in cooperation with the Federal Highway Administration (FHWA) Office of Project Delivery and Environmental Review, to provide planning-level information about the performance of structural best management practices for decision makers, planners, and highway engineers to assess and mitigate possible adverse effects of highway and urban runoff on the Nation's receiving waters (Granato 2013, 2014; Granato and others, 2021). The BMPSE was assembled by using a Microsoft Access® database application to facilitate calculation of BMP performance statistics. Granato (2014) developed quantitative methods to estimate values of the trapezoidal-distribution statistics, correlation coefficients, and the minimum irreducible concentration (MIC) from available data. Granato (2014) developed the BMPSE to hold and process data from the International Stormwater Best Management Practices Database (BMPDB, www.bmpdatabase.org). Version 1.0 of the BMPSE contained a subset of the data from the 2012 version of the BMPDB; the current version of the BMPSE (1.2.0) contains a subset of the data from the December 2019 version of the BMPDB. Selected data from the BMPDB were screened for import into the BMPSE in consultation with Jane Clary, the data manager for the BMPDB. Modifications included identifying water quality constituents, making measurement units consistent, identifying paired inflow and outflow values, and converting BMPDB water quality values set as half the detection limit back to the detection limit. Total polycyclic aromatic hydrocarbon (PAH) values were added to the BMPSE from BMPDB data; they were calculated from individual PAH measurements at sites with enough data to calculate totals.

    The BMPSE tool can sort and rank the data, calculate plotting positions, calculate initial estimates, and calculate potential correlations to facilitate the distribution-fitting process (Granato, 2014). For water-quality ratio analysis the BMPSE generates the input files and the list of filenames for each constituent within the Graphical User Interface (GUI). The BMPSE calculates the Spearman's rho (ρ) and Kendall's tau (τ) correlation coefficients with their respective 95-percent confidence limits and the probability that each correlation coefficient value is not significantly different from zero by using standard methods (Granato, 2014). If the 95-percent confidence limit values are of the same sign, then the correlation coefficient is statistically different from zero. For hydrograph extension, the BMPSE calculates ρ and τ between the inflow volume and the hydrograph-extension values (Granato, 2014). For volume reduction, the BMPSE calculates ρ and τ between the inflow volume and the ratio of outflow to inflow volumes (Granato, 2014). For water-quality treatment, the BMPSE calculates ρ and τ between the inflow concentrations and the ratio of outflow to inflow concentrations (Granato, 2014; 2020). The BMPSE also calculates ρ between the inflow and the outflow concentrations when a water-quality treatment analysis is done. The current version (1.2.0) of the BMPSE also has the option to calculate urban-runoff quality statistics from inflows to BMPs by using computer code developed for the Highway Runoff Database (Granato and Cazenas, 2009; Granato, 2019).

    References:

    Granato, G.E., 2013, Stochastic empirical loading and dilution model (SELDM) version 1.0.0: U.S. Geological Survey Techniques and Methods, book 4, chap. C3, 112 p., CD-ROM, https://pubs.usgs.gov/tm/04/c03

    Granato, G.E., 2014, Statistics for stochastic modeling of volume reduction, hydrograph extension, and water-quality treatment by structural stormwater runoff best management practices (BMPs): U.S. Geological Survey Scientific Investigations Report 2014–5037, 37 p., http://dx.doi.org/10.3133/sir20145037

    Granato, G.E., 2019, Highway-Runoff Database (HRDB) Version 1.1.0: U.S. Geological Survey data release, https://doi.org/10.5066/P94VL32J

    Granato, G.E., and Cazenas, P.A., 2009, Highway-Runoff Database (HRDB Version 1.0)--A data warehouse and preprocessor for the stochastic empirical loading and dilution model: Washington, D.C., U.S. Department of Transportation, Federal Highway Administration, FHWA-HEP-09-004, 57 p., https://pubs.usgs.gov/sir/2009/5269/disc_content_100a_web/FHWA-HEP-09-004.pdf

    Granato, G.E., Spaetzel, A.B., and Medalie, L., 2021, Statistical methods for simulating structural stormwater runoff best management practices (BMPs) with the stochastic empirical loading and dilution model (SELDM): U.S. Geological Survey Scientific Investigations Report 2020–5136, 41 p., https://doi.org/10.3133/sir20205136
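    As an illustration of the sign-based significance rule described above, here is a generic Python sketch (not BMPSE code; the Fisher-z standard error below is a common approximation for Spearman's rho, not necessarily the report's exact method):

    ```python
    import numpy as np
    from scipy import stats

    def spearman_ci(x, y, alpha=0.05):
        """Spearman's rho with an approximate 95-percent confidence interval."""
        rho, _ = stats.spearmanr(x, y)
        z = np.arctanh(rho)                  # Fisher z-transform
        se = 1.03 / np.sqrt(len(x) - 3)      # common SE approximation for rho
        zcrit = stats.norm.ppf(1 - alpha / 2)
        lo, hi = np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)
        significant = (lo > 0) == (hi > 0)   # same sign => differs from zero
        return rho, lo, hi, significant
    ```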

  3. Data Use in Academia Dataset

    • datacatalog.worldbank.org
    csv, utf-8
    Updated Nov 27, 2023
    Cite
    Semantic Scholar Open Research Corpus (S2ORC) (2023). Data Use in Academia Dataset [Dataset]. https://datacatalog.worldbank.org/search/dataset/0065200/data_use_in_academia_dataset
    Explore at:
    Available download formats: csv (utf-8)
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Semantic Scholar Open Research Corpus (S2ORC)
    Brian William Stacy
    License

    https://datacatalog.worldbank.org/public-licenses?fragment=cc

    Description

    This dataset contains metadata (title, abstract, date of publication, field, etc.) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.


    Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.


    We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and LaTeX file are important for extracting key information such as the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the years 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries' national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
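    In code, these restrictions amount to a simple record filter; the field names in this sketch are assumptions for illustration, not the actual S2ORC schema:

    ```python
    KEEP_FIELDS = {"Economics", "Political Science", "Business",
                   "Sociology", "Medicine", "Psychology"}

    def keep_article(rec):
        """Apply the three corpus restrictions described above to one record."""
        return (
            bool(rec.get("abstract"))                 # abstract required
            and bool(rec.get("parsed_body"))          # parsed PDF/LaTeX required
            and 2000 <= rec.get("year", 0) <= 2020    # publication window
            and rec.get("field") in KEEP_FIELDS       # fields likely to use official data
        )
    ```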


    Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.


    The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.


    To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO 3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country's name is spelled in a non-standard way.
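    A minimal sketch of the regular-expression approach (the country list here is a stand-in for the full ISO 3166 list):

    ```python
    import re

    COUNTRIES = ["South Africa", "Kenya", "India", "Brazil"]   # truncated for illustration
    PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, COUNTRIES)) + r")\b",
                         re.IGNORECASE)

    def countries_mentioned(text):
        """Return the country names found in a title or abstract."""
        return sorted({m.group(0).title() for m in PATTERN.finditer(text)})

    print(countries_mentioned("Maize yields and rainfall variability in South Africa"))
    ```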


    The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
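    A sketch of the NER approach with spaCy; the small English pipeline below is an assumption, as the project does not specify which model was used:

    ```python
    import spacy

    nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

    def ner_countries(text):
        """Return GPE (geopolitical entity) mentions, which include countries."""
        return [ent.text for ent in nlp(text).ents if ent.label_ == "GPE"]

    print(ner_countries("We study household surveys from Kenya and the Philippines."))
    ```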


    The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:


    Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.

    There are two classification tasks in this exercise:

    1. Identifying whether an academic article is using data from any country.

    2. Identifying from which country that data came.

    For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues indicating that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.

    After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]

    For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.

    We expect between 10 and 35 percent of all articles to use data.


    The median amount of time that a worker spent on an article, measured as the time between when the article was accepted for classification by the worker and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244 (assuming a cost of $3 per article, as was paid to the MTurk workers).


    A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model to classify articles based on the labeled data. Of the 3,500 articles that were hand-coded by the MTurk workers, 900 are fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of "uses data" was assigned if the model predicted an article used data with at least 90% confidence.
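    A hedged sketch of this step with the Hugging Face transformers API (not the authors' code; in the actual pipeline the classifier head would first be fine-tuned on the labelled articles):

    ```python
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2   # label 1 = "uses data" (assumed)
    )

    def uses_data(abstract, threshold=0.90):
        """Label an abstract as data-using only at >= 90% predicted confidence."""
        inputs = tokenizer(abstract, truncation=True, return_tensors="pt")
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)
        return probs[0, 1].item() >= threshold
    ```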


    The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People's Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.


    The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of

  4. MATLAB code and output files for integral, mean and covariance of the...

    • researchdatafinder.qut.edu.au
    Updated Jul 25, 2022
    Cite
    Dr Matthew Adams (2022). MATLAB code and output files for integral, mean and covariance of the simplex-truncated multivariate normal distribution [Dataset]. https://researchdatafinder.qut.edu.au/display/n20044
    Explore at:
    Dataset updated
    Jul 25, 2022
    Dataset provided by
    Queensland University of Technology (QUT)
    Authors
    Dr Matthew Adams
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Compositional data, which is data consisting of fractions or probabilities, is common in many fields including ecology, economics, physical science and political science. If these data would otherwise be normally distributed, their spread can be conveniently represented by a multivariate normal distribution truncated to the non-negative space under a unit simplex. Here this distribution is called the simplex-truncated multivariate normal distribution. For calculations on truncated distributions, it is often useful to obtain rapid estimates of their integral, mean and covariance; these quantities characterising the truncated distribution will generally possess different values to the corresponding non-truncated distribution.

    In the paper Adams, Matthew (2022) Integral, mean and covariance of the simplex-truncated multivariate normal distribution. PLoS One, 17(7), Article number: e0272014. https://eprints.qut.edu.au/233964/, three different approaches that can estimate the integral, mean and covariance of any simplex-truncated multivariate normal distribution are described and compared. These three approaches are (1) naive rejection sampling, (2) a method described by Gessner et al. that unifies subset simulation and the Holmes-Diaconis-Ross algorithm with an analytical version of elliptical slice sampling, and (3) a semi-analytical method that expresses the integral, mean and covariance in terms of integrals of hyperrectangularly-truncated multivariate normal distributions, the latter of which are readily computed in modern mathematical and statistical packages. Strong agreement is demonstrated between all three approaches, but the most computationally efficient approach depends strongly both on implementation details and the dimension of the simplex-truncated multivariate normal distribution.
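    To give a flavor of approach (1), here is a minimal NumPy sketch of naive rejection sampling (the paper's own implementation is in MATLAB):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def simplex_truncated_mvn_stats(mu, Sigma, n_draws=100_000):
        """Estimate the integral, mean and covariance of a multivariate normal
        truncated to the non-negative space under the unit simplex."""
        draws = rng.multivariate_normal(mu, Sigma, size=n_draws)
        keep = (draws >= 0).all(axis=1) & (draws.sum(axis=1) <= 1)
        accepted = draws[keep]
        integral = keep.mean()   # acceptance rate estimates the integral
        return integral, accepted.mean(axis=0), np.cov(accepted, rowvar=False)

    mu = np.array([0.4, 0.3])
    Sigma = np.array([[0.05, 0.01], [0.01, 0.04]])
    print(simplex_truncated_mvn_stats(mu, Sigma))
    ```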

    This dataset consists of all code and results for the associated article.

  5. Social Media and Mental Health

    • kaggle.com
    zip
    Updated Jul 18, 2023
    Cite
    SouvikAhmed071 (2023). Social Media and Mental Health [Dataset]. https://www.kaggle.com/datasets/souvikahmed071/social-media-and-mental-health
    Explore at:
    Available download formats: zip (10944 bytes)
    Dataset updated
    Jul 18, 2023
    Authors
    SouvikAhmed071
    License

    Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This dataset was originally collected for a data science and machine learning project that aimed to investigate the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.

    The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.

    This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.

    The following is the Google Colab link to the project, done on Jupyter Notebook -

    https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN

    The following is the GitHub Repository of the project -

    https://github.com/daerkns/social-media-and-mental-health

    Libraries used for the Project -

    Pandas
    Numpy
    Matplotlib
    Seaborn
    Scikit-learn
    
  6. DivStat: A User-Friendly Tool for Single Nucleotide Polymorphism Analysis of...

    • plos.figshare.com
    docx
    Updated May 30, 2023
    Cite
    Inês Soares; Ana Moleirinho; Gonçalo N. P. Oliveira; António Amorim (2023). DivStat: A User-Friendly Tool for Single Nucleotide Polymorphism Analysis of Genomic Diversity [Dataset]. http://doi.org/10.1371/journal.pone.0119851
    Explore at:
    Available download formats: docx
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Inês Soares; Ana Moleirinho; Gonçalo N. P. Oliveira; António Amorim
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent developments have led to an enormous increase in publicly available large genomic data, including complete genomes. The 1000 Genomes Project was a major contributor, releasing the results of sequencing a large number of individual genomes and allowing for a myriad of large-scale studies on human genetic variation. However, the tools currently available are insufficient when the goal concerns some analyses of data sets encompassing more than hundreds of base pairs, and when considering haplotype sequences of single nucleotide polymorphisms (SNPs). Here, we present a new and potent tool to deal with large data sets, allowing the computation of a variety of summary statistics of population genetic data and increasing the speed of data analysis.

  7. Census Microdata Samples Project

    • neuinfo.org
    • dknet.org
    • +2 more
    Updated Jan 29, 2022
    Cite
    (2022). Census Microdata Samples Project [Dataset]. http://identifiers.org/RRID:SCR_008902
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    A data set of cross-nationally comparable microdata samples for 15 Economic Commission for Europe (ECE) countries (Bulgaria, Canada, Czech Republic, Estonia, Finland, Hungary, Italy, Latvia, Lithuania, Romania, Russia, Switzerland, Turkey, UK, USA) based on the 1990 national population and housing censuses in countries of Europe and North America, to study the social and economic conditions of older persons. These samples have been designed to allow research on a wide range of issues related to aging, as well as on other social phenomena. A common set of nomenclatures and classifications, derived on the basis of a study of census data comparability in Europe and North America, was adopted as a standard for recoding. This series was formerly called Dynamics of Population Aging in ECE Countries.

    The recommendations regarding the design and size of the samples drawn from the 1990 round of censuses envisaged: (1) drawing individual-based samples of about one million persons; (2) progressive oversampling with age in order to ensure sufficient representation of various categories of older people; and (3) retaining information on all persons co-residing in the sampled individual's dwelling unit. Estonia, Latvia and Lithuania provided the entire population over age 50, while Finland sampled it with progressive over-sampling. Canada, Italy, Russia, Turkey, UK, and the US provided samples that had not been drawn specially for this project, and cover the entire population without over-sampling. Given its wide user base, the US 1990 PUMS was not recoded. Instead, PAU offers mapping modules, which recode the PUMS variables into the project's classifications, nomenclatures, and coding schemes. Because of the high sampling density, these data cover various small groups of older people; contain as much geographic detail as possible under each country's confidentiality requirements; include more extensive information on housing conditions than many other data sources; and provide information for a number of countries whose data were not accessible until recently.

    Data Availability: Eight of the fifteen participating countries have signed the standard data release agreement making their data available through NACDA/ICPSR (see links below). Hungary and Switzerland require a clearance to be obtained from their national statistical offices for the use of microdata; however, the documents signed between the PAU and these countries include clauses stipulating that, in general, all scholars interested in social research will be granted access. Russia requested that certain provisions for archiving the microdata samples be removed from its data release arrangement. The PAU has an agreement with several British scholars to facilitate access to the 1991 UK data through collaborative arrangements. Statistics Canada and the Italian Institute of Statistics (ISTAT) provide access to data from Canada and Italy, respectively.

    * Dates of Study: 1989-1992
    * Study Features: International, Minority Oversamples
    * Sample Size: Approx. 1 million/country

    Links:
    * Bulgaria (1992), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/02200
    * Czech Republic (1991), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06857
    * Estonia (1989), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06780
    * Finland (1990), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06797
    * Romania (1992), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06900
    * Latvia (1989), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/02572
    * Lithuania (1989), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/03952
    * Turkey (1990), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/03292
    * U.S. (1990), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06219

  8. UFC Complete Dataset (All events 1996-2024)

    • kaggle.com
    zip
    Updated Mar 28, 2024
    Cite
    MaksBasher (2024). UFC Complete Dataset (All events 1996-2024) [Dataset]. https://www.kaggle.com/datasets/maksbasher/ufc-complete-dataset-all-events-1996-2024
    Explore at:
    Available download formats: zip (2149419 bytes)
    Dataset updated
    Mar 28, 2024
    Authors
    MaksBasher
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is my first public project; upvotes and suggestions are appreciated 😎🖤

    Project description

    The UFC (Ultimate Fighting Championship) is an American mixed martial arts promotion company, considered the biggest promotion in the MMA world. Soon they will host an anniversary event, UFC 300. It is interesting to see what path the promotion has taken from 1996 to this day. There are UFC datasets available on Kaggle, but all of them are outdated. For that reason I've decided to gather a new dataset which includes most of the useful stats for various data analysis tasks, and to put my theoretical skills into practice. I've created a Python script to parse the ufcstats website and gather the available data.

    Currently 4 datasets are available

    Large dataset

    The biggest dataset yet, with over 7000 rows and 95 different features to explore. Some ideas for projects with this dataset:
    - ML model for betting predictions;
    - Data analysis to compare different years, weight classes, fighters, etc;
    - In-depth analysis of a specific fight or all fights of a selected fighter;
    - Visualisation of average stats (strikes, takedowns, subs) per weight class, gender, year, etc.

    Source code for the scraper that was used to create this dataset can be found in this notebook

    Medium dataset

    Medium dataset for some basic tasks (contains 7582 rows and 19 columns). You can use it for getting a basic understanding of UFC historical data and perform different visualisations.

    Source code for the scraper that was used to create this dataset can be found in this notebook

    Small dataset

    Contains information about completed or upcoming events, with only 683 rows and 3 columns.

    Source code for the scraper that was used to create this dataset can be found in this notebook

    Fighter stats

    A dataset with the stats for every fighter who has fought at a UFC event.

  9. Liverpool-East Anglia Coastal Study 2 (LEACOAST2) Data Set (2005 - 2007)

    • bodc.ac.uk
    • edmed.seadatanet.org
    • +1 more
    nc
    Updated Mar 30, 2017
    Cite
    HR Wallingford Group Ltd. (2017). Liverpool-East Anglia Coastal Study 2 (LEACOAST2) Data Set (2005 - 2007) [Dataset]. https://www.bodc.ac.uk/resources/inventories/edmed/report/231/
    Explore at:
    Available download formats: nc
    Dataset updated
    Mar 30, 2017
    Dataset provided by
    Proudman Oceanographic Laboratory
    University of Plymouth School of Earth, Ocean and Environmental Sciences
    Environment Agency Head Office
    HR Wallingford Group Ltd.
    University of East Anglia, School of Environmental Sciences
    License

    https://vocab.nerc.ac.uk/collection/L08/current/LI/

    Time period covered
    2005 - 2007
    Area covered
    Description

    This dataset comprises field measurements of waves, currents and beach shape collected from the (nine segmented shore-parallel) breakwaters at Sea Palling, Norfolk between October 2005 and April 2007 as part of a 3-year programme funded by the Engineering and Physical Sciences Research Council. Field data were collected using Global Positioning System (GPS) surveys, acoustic current meters, video systems, radar, and side-scan sonar imaging of bedforms. These data were used to evaluate the effects of the breakwaters on the low-lying, flood-prone coastal sub-cell between Happisburgh and Waxham over two years. The field measurements were used in combination with computer models to monitor the manner in which the sea defences interact with nearby beaches and to provide enhanced tools for improving the design guidelines for coastal defences. The project partners were the School of Environmental Sciences, University of East Anglia; Department of Civil Engineering, University of Liverpool; Coastal Engineering Group, University of Plymouth; Proudman Oceanographic Laboratory; Halcrow Engineering; HR Wallingford Ltd; and the Environment Agency. The data are stored at the British Oceanographic Data Centre (BODC).

  10. Comprehensive Income by Age Group Dataset: Longitudinal Analysis of Gate, OK...

    • neilsberg.com
    Updated Aug 7, 2024
    Cite
    Neilsberg Research (2024). Comprehensive Income by Age Group Dataset: Longitudinal Analysis of Gate, OK Household Incomes Across 4 Age Groups and 16 Income Brackets. Annual Editions Collection // 2024 Edition [Dataset]. https://www.neilsberg.com/research/datasets/2ecf2ac3-aeee-11ee-aaca-3860777c1fe6/
    Explore at:
    Dataset updated
    Aug 7, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Oklahoma, Gate
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates Gate household income by age. It can be utilized to understand the age-based income distribution in Gate.

    Content

    Where applicable, the dataset includes the following datasets:

    Please note: The 2020 1-Year ACS estimates data was not reported by the Census Bureau due to the impact on survey collection and analysis caused by COVID-19. Consequently, median household income data for 2020 is unavailable for large cities (population 65,000 and above).

    • Gate, OK annual median income by age groups dataset (in 2022 inflation-adjusted dollars)
    • Age-wise distribution of Gate, OK household incomes: Comparative analysis across 16 income brackets

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for any of your research projects, reports or presentations, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Interested in deeper insights and visual analysis?

    Explore our comprehensive data analysis and visual representations for a deeper understanding of Gate income distribution by age. You can refer to the same here.

  11. Transit Network Design dataset

    • researchdatafinder.qut.edu.au
    Updated Aug 18, 2020
    Cite
    Mr Zeke Ahern (2020). Transit Network Design dataset [Dataset]. https://researchdatafinder.qut.edu.au/individual/n1527
    Explore at:
    Dataset updated
    Aug 18, 2020
    Dataset provided by
    Queensland University of Technology (QUT)
    Authors
    Mr Zeke Ahern
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains solutions to the Transit Network Design Problem over various problem instances.

    Solutions for Mandl and Mumford problem instances. A description of the data is contained in the file “Data Information.pdf”.

  12. Local authority school places scorecards - 07 Cost of additional school...

    • explore-education-statistics.service.gov.uk
    Updated Sep 28, 2020
    Cite
    Department for Education (2020). Local authority school places scorecards - 07 Cost of additional school places from LA reported projects AY201516-AY201718 [Dataset]. https://explore-education-statistics.service.gov.uk/data-catalogue/data-set/f64761f4-8664-4027-832e-c1acc8196c7b
    Explore at:
    Dataset updated
    Sep 28, 2020
    Dataset authored and provided by
    Department for Education (https://gov.uk/dfe)
    License

    Open Government Licence 3.0, http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Time period covered
    2019
    Description

    Explore Education Statistics data set "07 Cost of additional school places from LA reported projects AY201516-AY201718", from Local authority school places scorecards.

  13. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Nov 15, 2024
    Cite
    Andrew Felton; Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14171251
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrew Felton; Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 11/15/2024

    This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has been updated throughout the peer review process.

    #Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

    #Code information

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.

    "04_figures_tables.R": This is the main workhouse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study that then get saved in the "manuscript_figures" folder. Note that all maps were produced using Python code found in the "supporting_code"" folder. Also note that within the "manuscript_figures" folder there is an "extended_data" folder, which contains tables of the summary statistics (e.g., quartiles and sample sizes) behind figures containing box plots or depicting regression coefficients.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

  14. 2010 Index of Stream Condition - Full Set of ISC2010 Data Sets

    • data.wu.ac.at
    • researchdata.edu.au
    • +1 more
    shp
    Updated Jul 20, 2018
    Cite
    Department of Environment, Land, Water & Planning (2018). 2010 Index of Stream Condition - Full Set of ISC2010 Data Sets [Dataset]. https://data.wu.ac.at/schema/www_data_vic_gov_au/OGMzNjg0ZWQtYWZhMi00NzRjLWIzYTctMmRlMWNlN2JmZjcz
    Explore at:
    Available download formats: shp
    Dataset updated
    Jul 20, 2018
    Dataset provided by
    Department of Environment, Land, Water & Planning
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    The ISC2010_FULL_METRICS_PACKAGE represents the full set of available ISC2010 data sets. This includes all spatial (vector) features and all a-spatial statistical summary tables. The individual ISC2010 metric data sets can also be accessed separately.

    The Product Description "ISC2010 Remote Sensing Metrics" describes the remote sensing and mapping approach used to create all ISC2010 Metric data sets. Go to http://ics.water.vic.gov.au/ics/

    River condition in Victoria is assessed every 5 years using the Index of Stream Condition (ISC). The Department of Environment and Primary Industries (DEPI) developed a methodology to assess the Physical Form and Riparian Vegetation components of the ISC using remote sensing data, specifically LIDAR and aerial photography.

    A state-wide mapping project was undertaken in 2010-13 to accurately map the Physical Form and Riparian Vegetation metrics of the ISC. Other ISC metrics were not assessed in the project and were derived from other sources.

    The Physical Form and Riparian Vegetation Metric products are a combination of mapped Vector and Raster data as well as Tabular Summary Statistics about the mapped features. In the context of the project, the term Metrics is used to refer to both the mapped features and the summary statistics.

    Remote sensing data used includes 15 cm true colour and infra-red aerial photography and four-return multi-pulse LiDAR data. This source data was used to derive a variety of raster data sets including Digital Terrain Models, Slope, Vegetation Height and Vegetation Cover. The Digital Terrain and Slope rasters were used to map Physical Form metrics including Stream Bed, Top of Bank and River Centre Lines, while the Vegetation Height and Cover rasters were used to map the Riparian Vegetation metrics. The Project Report "Aerial Remote Sensing for Physical Channel Form and Riparian Vegetation Mapping" describes the remote sensing and mapping approach used to create this data set.

  15. Diamond project pipeline worldwide by yearly average production 2016-2020

    • statista.com
    Updated Sep 15, 2016
    Cite
    Statista (2016). Diamond project pipeline worldwide by yearly average production 2016-2020 [Dataset]. https://www.statista.com/statistics/348124/diamond-project-piepline-by-average-annual-production/
    Explore at:
    Dataset updated
    Sep 15, 2016
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2015
    Area covered
    Worldwide
    Description

    This statistic depicts the global pipeline of diamond projects set to begin between 2016 and 2020, based on 2015 data, by estimated average annual production. The Renard project in Canada, under Stornoway, is set to begin in 2016 with an annual production expected to be around *** million carats. The diamond industry, unlike other precious metals and natural resources, relies almost exclusively on consumer demand for diamond jewelry. The diamond industry is expected to flourish despite weak global economic growth.

  16. Data from: The Berth Allocation Problem with Channel Restrictions - Datasets...

    • researchdatafinder.qut.edu.au
    • researchdata.edu.au
    Updated Jun 22, 2018
    Cite
    Dr Paul Corry (2018). The Berth Allocation Problem with Channel Restrictions - Datasets [Dataset]. https://researchdatafinder.qut.edu.au/individual/n4992
    Explore at:
    Dataset updated
    Jun 22, 2018
    Dataset provided by
    Queensland University of Technology (QUT)
    Authors
    Dr Paul Corry
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These datasets relate to the computational study presented in the paper The Berth Allocation Problem with Channel Restrictions, authored by Paul Corry and Christian Bierwirth. They consist of all the randomly generated problem instances along with the computational results presented in the paper.

    Results across all problem instances assume ship separation parameters of [delta_1, delta_2, delta_3] = [0.25, 0, 0.5].

    Excel Workbook Organisation:

    The data is organised into separate Excel files for each table in the paper, as indicated by the file description. Within each file, each row of data presented in the corresponding table (aggregating 10 replications) is captured in two worksheets: one with the problem instance data, and the other with solution data generated by several solution methods (described in the paper). For example, row 3 of Tab. 2 will have data for 10 problem instances on worksheet T2R3, and corresponding solution data on T2R3X.

    Problem Instance Data Format:

    On each problem instance worksheet (e.g. T2R3), each row of data corresponds to a different problem instance, and there are 10 replications on each worksheet.

    The first column provides a replication identifier which is referenced on the corresponding solution worksheet (e.g. T2R3X).

    Following this, there are n*(2c+1) columns (n = number of ships, c = number of channel segments) with headers p(i)_(j).(k), where i references the operation (channel transit/berth visit) id, j references the ship id, and k references the index of the operation within the ship. All indexing starts at 0. These columns define the transit or dwell times on each segment. A value of -1 indicates a segment on which a berth allocation must be applied, and hence the dwell time is unknown.

    There are then a further n columns with headers r(j), defining the release times of each ship.

    For ChSP problems, there are a final n columns with headers b(j), defining the berth to be visited by each ship. ChSP problems with fixed berth sequencing enforced have an additional n columns with headers toa(j), indicating the order in which ship j sits within its berth sequence. For BAP-CR problems, these columns are not present, but are replaced by n*m columns (m = number of berths) with headers p(j).(b), defining the berth processing time of ship j if allocated to berth b.

    Solution Data Format:

    Each row of data corresponds to a different solution.

    Column A references the replication identifier (from the corresponding instance worksheet) that the solution refers to.

    Column B defines the algorithm that was used to generate the solution.

    Column C shows the objective function value (total waiting and excess handling time) obtained.

    Column D shows the CPU time consumed in generating the solution, rounded to the nearest second.

    Column E shows the optimality gap as a proportion. A value of -1 or an empty value indicates that optimality gap is unknown.

    From column F onwards, there are n*(2c+1) columns with the previously described p(i)_(j).(k) headers. The values in these columns define the entry times at each segment.

    For BAP-CR problems only, following this there are a further 2n columns. For each ship j, there will be columns titled b(j) and p.b(j) defining the berth that was allocated to ship j, and the processing time on that berth respectively.
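    For illustration, here is a hedged pandas sketch of loading one instance/solution worksheet pair and joining them on the replication identifier; the workbook file name is an assumption, and the worksheet names follow the T2R3/T2R3X convention described above.

    ```python
    import pandas as pd

    instances = pd.read_excel("berth_allocation_tab2.xlsx", sheet_name="T2R3")
    solutions = pd.read_excel("berth_allocation_tab2.xlsx", sheet_name="T2R3X")

    # Column A of the solutions sheet references the instance replication identifier
    merged = solutions.merge(
        instances,
        left_on=solutions.columns[0],
        right_on=instances.columns[0],
        suffixes=("_sol", "_inst"),
    )
    ```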

  17. GDP of Qinghai Province by project and industry (1978-2000)

    • data.tpdc.ac.cn
    • tpdc.ac.cn
    zip
    Updated Mar 30, 2021
    Cite
    Provincial Qinghai (2021). GDP of Qinghai Province by project and industry (1978-2000) [Dataset]. https://data.tpdc.ac.cn/en/data/5c264cd7-cad8-4a13-979d-84ec9d4ed1cb
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 30, 2021
    Dataset provided by
    TPDC
    Authors
    Provincial Qinghai
    Area covered
    Description

    This data set records the statistical data of GDP of Qinghai Province from 1978 to 2000, divided by project and year. The data are collected from the statistical yearbook of Qinghai Province issued by the Bureau of Statistics of Qinghai Province. The data set consists of three data tables: GDP of sub-projects and sub-industries 1978-1998.xls, GDP of sub-projects and sub-industries 1978-1999.xls, and GDP of sub-projects and sub-industries 1978-2000.xls. The data table structure is the same. For example, the 1978-1998 data table has 13 fields:

    Field 1: Project
    Field 2: 1978
    Field 3: 1980
    Field 4: 1985
    Field 5: 1990
    Field 6: 1991
    Field 7: 1992
    Field 8: 1993
    Field 9: 1994
    Field 10: 1995
    Field 11: 1996
    Field 12: 1997
    Field 13: 1998

  18. Number of 500000 yuan construction and production projects in cities and...

    • data.tpdc.ac.cn
    • tpdc.ac.cn
    zip
    Updated Mar 25, 2021
    Cite
    Provincial Qinghai (2021). Number of 500000 yuan construction and production projects in cities and towns by industry in Qinghai Province (2006-2016) [Dataset]. https://data.tpdc.ac.cn/en/data/04df9305-1505-41ec-89db-80bbddefcd32
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 25, 2021
    Dataset provided by
    TPDC
    Authors
    Provincial Qinghai
    Area covered
    Description

    The data set records the statistical data of the number of 500,000-yuan construction and production projects in cities and towns of Qinghai Province by industry; the data is divided by industry, region, affiliation and registration type. The data are collected from the statistical yearbook of Qinghai Province issued by the Bureau of Statistics of Qinghai Province. The data set contains 12 data tables, including: the number of construction and production projects with RMB 500,000 in cities and towns by industry (2015).xls; the number of construction and production projects with RMB 500,000 in cities and towns by industry (2016).xls; the number of construction and production projects with RMB 500,000 in cities and towns by industry (2006), in 2007.xls; the number of construction and production projects with RMB 500,000 in cities and towns by industry (2007), in 2008.xls; the number of construction and production projects with RMB 500,000 in cities and towns (2008).xls; the number of construction and production projects with RMB 500,000 in cities and towns by industry (2009).xls; etc. The data table structure is the same. For example, the 2015 data table has four fields:

    Field 1: Industry
    Field 2: Construction Project
    Field 3: Completed project
    Field 4: Completion rate

  19. Z

    The Human Mobility Index Project - Data

    • data-staging.niaid.nih.gov
    Updated Dec 6, 2024
    Cite
    Özak, Ömer (2024). The Human Mobility Index Project - Data [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14285745
    Explore at:
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    Southern Methodist University
    Authors
    Özak, Ömer
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Human Mobility Index

    The distance among villages, cities, regions, countries, and other socio-economic and political entities has long been considered important in explaining their comparative political and economic development. While theories abound about how geographical isolation and distances affect trade, warfare, epidemics, and colonization among many others, there is a lack of systematic historical data on communication and transportation costs, as well as geographical distances. This has led many scholars to use great circle distances as a proxy. This measure, however, is rather crude as it does not capture the large variation in the geographical and technological conditions underlying similar great circle distances in different periods.

    In Özak (2018) I rectify this deficiency by introducing a novel set of measures of historical mobility, "The Human Mobility Index (HMI)", which estimates the potential minimum travel time across the globe (measured in hours), accounting for human biological constraints as well as the geographical and technological factors that determined travel time before the widespread use of steam power. In particular, the HMI indices provide distinct measures of human mobility potential in different eras:

    1. Human Mobility Index (HMI): This index measures mobility on land without seafaring technology. It shows mobility potential on land before the widespread use of steam power.
    2. Human Mobility Index with Seafaring: HMI expanded to allow mobility on a select set of seas for which historical data was available. It shows potential mobility on land and seas before the introduction of ocean-faring ships.
    3. Human Mobility Index with Ocean: HMI expanded to allow mobility on all seas based on CLIWOC (interpolated). It shows potential mobility on land and seas after the introduction of ocean-faring ships but before the widespread use of steamships.

    Based on these cost surfaces, researchers can find the minimum travel times between locations or construct more sophisticated statistics based on these. For example, Ashraf, Galor, and Özak (2010) construct measures of pre-historic geographical isolation to study the effect of isolation on development. Similarly, Özak (2010), Depetris-Chauvin and Özak (2016, 2018), and Michalopoulos and Özak (2019) construct potential trade and information flow networks among countries, ethnic groups, cities, and artificial geographical units, to study the origins of the division of labor, and the effect of technological change on isolation and development. Likewise, Depetris-Chauvin and Özak (2019) use these measures to construct artificial states based on Voronoi partitions.

    This strategy overcomes the potential mismeasurement of distances generated by using geodesic distances (Özak 2010), for a period when travel time was the most important determinant of transportation costs. Additionally, it removes the potential concern that travel time to the frontier reflects a country's stage of development, mitigating further possible endogeneity concerns. The research validates these measures by (i) analyzing their association with actual historical travel time; (ii) examining their explanatory power for the location of historical trade routes in the Old World; and (iii) analyzing their association with genetic and cultural distances.

    The project's main page is https://human-mobility-index.github.io/. This repository holds the data files, which researchers can download and use with GIS software. It will also provide the data for the Python package to compute distances based on them.
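
    As an illustration of the kind of computation these cost surfaces enable, here is a minimal sketch of a least-cost (minimum travel time) path between two points; it assumes the HMI surface has been exported as a raster of per-cell travel costs, and the file name and pixel coordinates are hypothetical (nodata handling is omitted).

        import rasterio
        from skimage.graph import route_through_array

        with rasterio.open("hmi_cost_surface.tif") as src:
            cost = src.read(1)  # per-cell travel cost (hours, under the HMI definition)

        start, end = (120, 340), (560, 980)  # (row, col) pixel coordinates, illustrative
        path, total_cost = route_through_array(cost, start, end, fully_connected=True)
        print(f"Estimated minimum travel time: {total_cost:.1f} hours over {len(path)} cells")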

  20. Forest condition survey data

    • data.europa.eu
    unknown
    Updated Oct 16, 2025
    Cite
    Valstybės duomenų agentūra (2025). Forest condition survey data [Dataset]. https://data.europa.eu/data/datasets/https-data-gov-lt-datasets-2066-/embed
    Explore at:
    Available download formats: unknown
    Dataset updated
    Oct 16, 2025
    Dataset provided by
    State Data Agency of Lithuania (https://vda.lrv.lt/)
    Authors
    Valstybės duomenų agentūra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Forest condition survey data at the level of municipalities, prepared from the data of the project "Experimental Ecosystem Volume and Forest Condition Account". The data are linked to the second stage of the experimental ecosystem statistics project: the basic data set of forest indicators, organised by the groups and classes of indicators of the state of ecosystems. The indicators are calculated and presented for the relevant spatial units at the beginning of the period, around 2000 (denoted by the letter 'x'), and at the end of the period, 2021.
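
    A minimal sketch of comparing the two periods, assuming the table loads with the start-of-period columns carrying the "x" suffix described above; the file and column naming scheme here is illustrative only.

        import pandas as pd

        df = pd.read_csv("forest_condition_indicators.csv")
        # Pair each end-of-period (2021) column with its "x" (around 2000) counterpart
        for col in [c for c in df.columns if c.endswith("_x")]:
            base = col[:-2]
            if base in df.columns:
                df[base + "_change"] = df[base] - df[col]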

Project for Statistics on Living Standards and Development 1993 - South Africa

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)

Mode of data collection

Face-to-face [f2f]

Cleaning operations

All the questionnaires were checked when received. Where information was incomplete or appeared contradictory, the questionnaire was sent back to the relevant survey organization. As soon as the data was available, it was captured using the local development platform ADE. This was completed in February 1994. Following this, a series of exploratory programs was written to highlight inconsistencies and outliers. For example, all person-level files were linked together to ensure that the same person code reported in different sections of the questionnaire corresponded to the same person. The error reports from these programs were compared to the questionnaires and the necessary alterations made. This was a lengthy process, as several files were checked more than once; it was completed at the beginning of August 1994. In some cases, questionnaires would contain missing values, or comments indicating that the respondent did not know, or refused to answer, a question.
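
The person-level linkage check described above can be illustrated with a short pandas sketch; the file and column names here are hypothetical, since the actual file layout is documented elsewhere.

    import pandas as pd

    demog = pd.read_csv("section_demographics.csv")   # hypothetical section files
    health = pd.read_csv("section_health.csv")

    # A person code appearing in one section but not the other signals an inconsistency
    merged = demog.merge(health, on=["hhid", "person_code"],
                         how="outer", indicator=True)
    print(merged.loc[merged["_merge"] != "both", ["hhid", "person_code", "_merge"]])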

These responses are coded in the data files with the following values:

VALUE  MEANING
  -1   The data was not available on the questionnaire or form
  -2   The field is not applicable
  -3   Respondent refused to answer
  -4   Respondent did not know answer to question
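
A minimal sketch of recoding these values before analysis, assuming the files are read with pandas; the file name is hypothetical, and whether refusals and don't-knows should become missing values or stay as distinct categories depends on the analysis.

    import pandas as pd

    MISSING_CODES = [-1, -2, -3, -4]  # see the value/meaning list above

    df = pd.read_csv("pslsd_household_file.csv")  # hypothetical file name
    df_clean = df.replace(MISSING_CODES, pd.NA)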

Data appraisal

The data collected in clusters 217 and 218 should be viewed as highly unreliable and have therefore been removed from the data set. The data currently available on the web site has been revised to remove the data from these clusters. Researchers who have downloaded the data in the past should revise their data sets. For information on the data in those clusters, contact SALDRU (http://www.saldru.uct.ac.za/).
