This dashboard was created from data published by Olist Store (a Brazilian e-commerce public dataset). The raw data contains information about 100,000 orders placed between 2016 and 2018 in many regions of Brazil.
The raw datasets were imported into Excel using the “Get Data” option (formerly known as “Power Query”) and cleaned. An additional table with the names of Brazilian states was also imported from the Wikipedia page.
A data table with payment information was created from the imported statistics using nested formulas. Pivot charts were then used to build an Olist Store Payment Dashboard that lets you review the data using a connected timeline and slicers.
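The dashboard itself lives in Excel, but for readers who prefer code, a minimal pandas sketch of the same payment summary is shown below; the file and column names follow the public Kaggle release of the Olist data and are assumptions here, not part of the published workbook.

```python
import pandas as pd

# Hedged sketch: the dashboard was built with Excel pivot charts, but an
# equivalent payment summary can be reproduced in pandas. Assumes the Kaggle
# Olist file olist_order_payments_dataset.csv with columns such as
# payment_type and payment_value.
payments = pd.read_csv("olist_order_payments_dataset.csv")

summary = payments.pivot_table(
    index="payment_type",            # e.g. credit_card, boleto, voucher
    values="payment_value",
    aggfunc=["count", "sum", "mean"],
)
print(summary)
```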
https://datacatalog.worldbank.org/public-licenses?fragment=cc
This dataset contains metadata (title, abstract, date of publication, field, etc.) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.
Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.
We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF or LaTeX file is important for extracting key information such as the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
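As a rough illustration of these restrictions (not the project's actual pipeline), the filtering logic could look like the following sketch, assuming the article metadata is held in a DataFrame with hypothetical columns for the parsed-PDF flag, abstract, year, and field:

```python
import pandas as pd

# Illustrative sketch of the corpus restrictions described above, assuming the
# article metadata sits in a DataFrame with hypothetical columns
# "has_parsed_pdf", "abstract", "year", and "field".
EXCLUDED_FIELDS = {
    "Biology", "Chemistry", "Engineering", "Physics", "Materials Science",
    "Environmental Science", "Geology", "History", "Philosophy", "Math",
    "Computer Science", "Art",
}

def restrict_corpus(articles: pd.DataFrame) -> pd.DataFrame:
    """Apply the abstract/PDF, year, and field-of-study restrictions."""
    mask = (
        articles["has_parsed_pdf"]
        & articles["abstract"].notna()
        & articles["year"].between(2000, 2020)
        & ~articles["field"].isin(EXCLUDED_FIELDS)
    )
    return articles[mask]
```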
Due to the intensive computational resources required, a set of 1,037,748 articles was randomly selected from the 10 million articles in our restricted corpus as a convenience sample.
The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.
To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.
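A minimal sketch of this regular-expression approach is shown below; the country list is truncated and the exact matching rules used in the project may differ:

```python
import re

# Minimal sketch of the regular-expression approach: flag an article if any
# ISO 3166 country name appears in its title, abstract, or topic fields. The
# country list is truncated here for illustration.
COUNTRY_NAMES = ["Brazil", "India", "Kenya", "Republic of Korea"]  # ...full ISO 3166 list
COUNTRY_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in COUNTRY_NAMES) + r")\b",
    flags=re.IGNORECASE,
)

def countries_by_regex(text: str) -> set[str]:
    """Return the set of country names matched in the text."""
    return {match.group(1) for match in COUNTRY_PATTERN.finditer(text)}

print(countries_by_regex("Evidence from a household survey in Kenya"))  # {'Kenya'}
```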
The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify named entities in text, and is implemented with the spaCy Python library. The NER step is used in this project to identify countries of study in the academic articles. spaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
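The sketch below illustrates the NER step with spaCy; the pipeline name is the standard small English model and is an assumption, and the final matching of GPE entities to ISO 3166 countries is omitted:

```python
import spacy

# Sketch of the NER approach: spaCy's pretrained pipeline labels geopolitical
# entities as "GPE"; those entities are then matched against a country list
# (matching step omitted here). The model name is the standard small English
# pipeline, not necessarily the one used in the project.
nlp = spacy.load("en_core_web_sm")

def countries_by_ner(text: str) -> set[str]:
    """Return GPE entities found in the text as candidate countries of study."""
    doc = nlp(text)
    return {ent.text for ent in doc.ents if ent.label_ == "GPE"}

print(countries_by_ner("We study maternal health outcomes in Viet Nam and Brazil."))
```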
The second task is to classify whether the paper uses data. A supervised machine learning approach is employed: 3,500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:
Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. Identifying whether an academic article is using data from any country
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.
The median amount of time that a worker spent on an article, measured as the time between when the article was accepted for classification by the worker and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, then reviewing the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time at a cost of $3,113,244, assuming a cost of $3 per article as was paid to the MTurk workers.
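A quick back-of-the-envelope check reproduces these figures from the numbers quoted above:

```python
# Back-of-the-envelope check of the time and cost figures quoted above.
n_articles = 1_037_748
minutes_per_article = 25.4   # median MTurk review time
cost_per_article = 3         # USD paid per classification

total_years = n_articles * minutes_per_article / 60 / 24 / 365
total_cost = n_articles * cost_per_article
print(round(total_years, 1))  # ~50.1 years of continuous review time
print(total_cost)             # 3113244 USD
```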
A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model that classifies articles based on the labeled data. Of the 3,500 articles that were hand-coded by the MTurk workers, 900 are fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of “uses data” was assigned if the model predicted an article used data with at least 90% confidence.
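The sketch below shows how such a fine-tuned DistilBERT classifier could be applied with the 90% confidence rule; the checkpoint name and the use of the Hugging Face transformers API are assumptions, as the text above only specifies DistilBERT and PyTorch:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hedged sketch of applying a DistilBERT "uses data" classifier with the 90%
# confidence rule described above. In practice the fine-tuned checkpoint would
# be loaded; the base model name is used here only as a placeholder.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # assumed labels: 0 = no data, 1 = uses data
)
model.eval()

def predicts_uses_data(abstract: str, threshold: float = 0.9) -> bool:
    inputs = tokenizer(abstract, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    return probs[0, 1].item() >= threshold
```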
The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We treat the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.
The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of
Football is more than just a game — it’s data-rich and decision-driven. From match results to player statistics, the English Premier League (EPL) offers a goldmine of insights for analysts, fans, and data scientists.
This dataset is part of a personal data preprocessing project designed to transform messy raw data into a clean, structured format — enabling meaningful analysis, modeling, or visualization. Whether you're predicting match outcomes, exploring season trends, or learning data science, this dataset gives you a strong starting point.
This dataset was originally sourced from football-data.co.uk, a trusted source for historical football data. The raw data was downloaded in CSV format and carefully cleaned using Python. The resulting dataset is ready for analysis and includes statistics such as:
Match dates
Full-time and half-time results
Goals, corners, shots, fouls
Yellow and red cards
It’s ideal for building machine learning models, dashboards, or practicing sports analytics.
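As a hedged sketch of the kind of cleaning described above (not the exact script used for this dataset), a raw football-data.co.uk season file could be processed with pandas as follows; the column names follow that site's standard layout:

```python
import pandas as pd

# Hedged sketch of cleaning a raw football-data.co.uk season file. Column names
# (Date, FTHG, FTAG, FTR, ...) follow that site's standard layout; the actual
# cleaning steps in this project may differ.
raw = pd.read_csv("E0.csv")  # one EPL season, as downloaded

epl = (
    raw.assign(Date=lambda d: pd.to_datetime(d["Date"], dayfirst=True))
       .rename(columns={"FTHG": "home_goals", "FTAG": "away_goals",
                        "FTR": "full_time_result", "HTR": "half_time_result"})
       .dropna(subset=["home_goals", "away_goals"])
       .drop_duplicates()
)
epl.to_csv("epl_clean.csv", index=False)
```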
This dataset is for educational and non-commercial use only. Raw data sourced from football-data.co.uk. Please credit the source if you use or share this dataset.
The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices and determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative on the sub-district level. The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.
Data collected through the survey helped in achieving the following objectives:
1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators
4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
6. Provide the necessary income data to serve in calculating poverty indices and identifying the characteristics of the poor, as well as drawing poverty maps
7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty
National
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
The 2008 Household Expenditure and Income Survey sample was designed using a two-stage stratified cluster sampling method. In the first stage, the primary sampling units (PSUs), the blocks, were drawn with probability proportionate to size, taking the number of households in each block as the block size. In the second stage, the household sample (8 households from each PSU) was drawn using the systematic sampling method. Four substitute households were also drawn from each PSU, using the systematic sampling method, to be used on the first visit to the block in case any of the main sample households could not be visited for any reason.
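For illustration only (this is not the official sampling program), the two stages described above could be sketched as follows, with hypothetical block sizes:

```python
import numpy as np

# Illustrative sketch of first-stage systematic PPS (probability proportional
# to size) selection of blocks, with household counts as the size measure,
# followed by a fixed systematic take of 8 households per PSU.
rng = np.random.default_rng(seed=2008)

def pps_systematic(sizes, n_psu):
    """Select n_psu primary sampling units with probability proportional to size."""
    cum = np.cumsum(sizes)
    step = cum[-1] / n_psu
    start = rng.uniform(0, step)
    points = start + step * np.arange(n_psu)
    return np.searchsorted(cum, points)          # indices of the selected blocks

block_sizes = rng.integers(60, 250, size=1000)   # hypothetical block sizes
selected_psus = pps_systematic(block_sizes, n_psu=120)

def systematic_households(n_in_block, take=8):
    """Second stage: systematic selection of `take` households within a block."""
    step = n_in_block / take
    start = rng.uniform(0, step)
    return np.floor(start + step * np.arange(take)).astype(int)
```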
To estimate the sample size, the coefficient of variation and the design effect in each sub-district were calculated for the expenditure variable from data of the 2006 Household Expenditure and Income Survey. These results were used to estimate the sample size at the sub-district level, provided that the coefficient of variation of the expenditure variable at the sub-district level did not exceed 10%, with a minimum of 6 clusters at the district level, to ensure good cluster representation in the administrative areas and enable the identification of poverty pockets.
It is worth mentioning that the expected non-response, as well as areas where poor families are concentrated in the major cities, were taken into consideration in designing the sample. A larger sample was therefore taken from these areas compared to other areas, in order to help reach and cover the poverty pockets.
Face-to-face [f2f]
List of survey questionnaires: (1) General Form (2) Expenditure on food commodities Form (3) Expenditure on non-food commodities Form
Raw Data: The survey design and implementation procedures were:
1. Sample design and selection
2. Design of forms/questionnaires, guidelines to assist in filling out the questionnaires, and preparing instruction manuals
3. Design of the table templates to be used for the dissemination of the survey results
4. Preparation of the fieldwork phase, including printing forms/questionnaires, instruction manuals, data collection instructions, data checking instructions and codebooks
5. Selection and training of survey staff to collect data and run the required data checks
6. Preparation and implementation of the pretest phase of the survey, designed to test and develop forms/questionnaires, instructions and software programs required for data processing and production of survey results
7. Data collection
8. Data checking and coding
9. Data entry
10. Data cleaning using data validation programs
11. Data accuracy and consistency checks
12. Data tabulation and preliminary results
13. Preparation of the final report and dissemination of final results
Harmonized Data:
- The Statistical Package for the Social Sciences (SPSS) was used to clean and harmonize the datasets
- The harmonization process started with cleaning all raw data files received from the Statistical Office
- Cleaned data files were then all merged to produce one data file on the individual level containing all variables subject to harmonization
- A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables
- A post-harmonization cleaning process was run on the data
- Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format
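The harmonization itself was done in SPSS with country-specific programs; the sketch below only illustrates the merge/rename/recode pattern in pandas, with placeholder file and variable names:

```python
import pandas as pd

# Illustrative sketch only: the actual harmonization was done in SPSS with
# country-specific programs. File names, variable names, and recodes below are
# placeholders showing the merge/rename/recode pattern described above.
household = pd.read_csv("raw_household.csv")      # cleaned raw files
individual = pd.read_csv("raw_individual.csv")

merged = individual.merge(household, on="household_id", how="left")
harmonized = (
    merged.rename(columns={"exp_total": "total_expenditure"})
          .assign(locality=lambda d: d["locality_type"].map({1: "urban", 2: "rural", 3: "camp"}))
)
harmonized.to_stata("heis_harmonized.dta", write_index=False)  # also kept in SPSS format
```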
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on detalization level (see Step 2 for more details):

- up to 2Tb of disk space (see Step 2 detalization levels)
- at least 16Gb of RAM (64 preferable)
- few hours to few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as a current directory.

- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
  Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes.

- create a folder `
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The tasks (called items in the study) are the first 6 histogram tasks and all 6 case-value plot tasks (hence, the first 12 tasks from the data in dataset 1_Raw_Data_Students). This dataset contains all data needed for reproducing the results described in the qualitative article belonging to it, including, for example, the codebook, the coding of transcripts, and the RStudio file for calculating accuracy and precision, as well as detailed coding results, including second-coder results. Note that the raw data of this project, as well as the design of the project, materials and so on, are in the dataset 1_Raw_Data_Students. The latter dataset is needed for replicating the whole eye-tracking study.
https://data.go.kr/ugs/selectPortalPolicyView.do
This is raw data on ODA project information reported by the Korea International Cooperation Agency to the OECD DAC (Development Assistance Committee), providing the project name, project area, region, project implementing agency, and corresponding SDG indicators.

Code column description:
- [Channel Code]: code column containing the code of the data value specified in [Project Implementing Agency].
- [Aid Type Code]: code column containing the code of the data value specified in [Aid Type].
- [Sustainable Development Goals (SDGs)]: refer to the SDGs focus sheet (codebook) in the OECD CRS CODE list file.
- [Technical Cooperation], [Program Aid], [Project Aid Code], [Mixed Credit]: indicates whether the attribute applies to the related project (1 if applicable, 0 if not).
- [Currency]: currency code; the data value 302 means USD.

Data values left blank do not correspond to the project and are therefore not aggregated.
THE CLEANED AND HARMONIZED VERSION OF THE SURVEY DATA PRODUCED AND PUBLISHED BY THE ECONOMIC RESEARCH FORUM REPRESENTS 100% OF THE ORIGINAL SURVEY DATA COLLECTED BY THE PALESTINIAN CENTRAL BUREAU OF STATISTICS
The Palestinian Central Bureau of Statistics (PCBS) carried out four rounds of the Labor Force Survey 2017 (LFS). The survey rounds covered a total sample of about 23,120 households (5,780 households per quarter).
The main objective of collecting data on the labour force and its components, including employment, unemployment and underemployment, is to provide basic information on the size and structure of the Palestinian labour force. Data collected at different points in time provide a basis for monitoring current trends and changes in the labour market and in the employment situation. These data, supported with information on other aspects of the economy, provide a basis for the evaluation and analysis of macro-economic policies.
The raw survey data provided by the Statistical Agency were cleaned and harmonized by the Economic Research Forum, in the context of a major project that started in 2009, during which extensive efforts have been exerted to acquire, clean, harmonize, preserve and disseminate micro data of existing labor force surveys in several Arab countries.
The survey covers a representative sample at the region level (West Bank, Gaza Strip), by locality type (urban, rural, camp) and by governorate.
1- Household/family. 2- Individual/person.
The survey covered all Palestinian households whose usual place of residence is the Palestinian Territory.
Sample survey data [ssd]
The methodology was designed according to the context of the survey, international standards, data processing requirements and comparability of outputs with other related surveys.
---> Target Population: All individuals aged 10 years and above who normally stay with their households in the State of Palestine during 2017.
---> Sampling Frame: The sampling frame consists of the master sample, which was updated in 2011: each enumeration area consists of buildings and housing units with an average of about 124 households. The master sample consists of 596 enumeration areas; we used 494 enumeration areas as a framework for the labor force survey sample in 2017 and these units were used as primary sampling units (PSUs).
---> Sampling Size: The estimated sample size is 5,780 households in each quarter of 2017.
---> Sample Design: The sample is a stratified two-stage cluster sample. First stage: a systematic random sample of 494 enumeration areas was selected for the whole round, excluding enumeration areas with fewer than 40 households. Second stage: a systematic random sample of households was selected from each enumeration area chosen in the first stage: 16 households from enumeration areas with 80 households or more, and 8 households from enumeration areas with fewer than 80 households.
---> Sample strata: The population was divided by: 1- Governorate (16 governorates) 2- Type of Locality (urban, rural, refugee camps).
---> Sample Rotation: Each round of the Labor Force Survey covers all of the 494 master sample enumeration areas. Basically, the areas remain fixed over time, but households in 50% of the EAs are replaced in each round. The same households remain in the sample for two consecutive rounds, are left out for the next two rounds, and are then selected for the sample for another two consecutive rounds before being dropped from the sample. An overlap of 50% is thus achieved between consecutive rounds and between consecutive years (making the sample efficient for monitoring purposes).
Face-to-face [f2f]
The survey questionnaire was designed according to the International Labour Organization (ILO) recommendations. The questionnaire includes four main parts:
---> 1. Identification Data: The main objective of this part is to record the necessary information to identify the household, such as cluster code, sector, type of locality, cell, housing number and cell code.
---> 2. Quality Control: This part involves groups of control standards to monitor the field and office operations and to keep the questionnaire stages in order (data collection, field and office coding, data entry, editing after entry and data storage).
---> 3. Household Roster: This part involves demographic characteristics about the household, like number of persons in the household, date of birth, sex, educational level…etc.
---> 4. Employment Part: This part involves the major research indicators; one questionnaire is answered by every household member aged 15 years and over, in order to explore their labour force status and capture their major characteristics with respect to employment status, economic activity, occupation, place of work, and other employment indicators.
---> Raw Data: PCBS started collecting data in the first quarter of 2017 using handheld devices (HHD) in Palestine, excluding Jerusalem inside the borders (J1) and the Gaza Strip. The HHD application, based on SQL Server and Microsoft .NET, was developed by the General Directorate of Information Systems. Using HHD reduced the number of data processing stages: fieldworkers collect data and send them directly to the server, and the project manager can retrieve the data at any time. In order to work in parallel with the Gaza Strip and Jerusalem inside the borders (J1), an office program was developed using the same techniques and the same database as the HHD.
---> Harmonized Data
- The SPSS package is used to clean and harmonize the datasets.
- The harmonization process starts with a cleaning process for all raw data files received from the Statistical Agency.
- All cleaned data files are then merged to produce one data file on the individual level containing all variables subject to harmonization.
- A country-specific program is generated for each dataset to generate/compute/recode/rename/format/label harmonized variables.
- A post-harmonization cleaning process is then conducted on the data.
- Harmonized data is saved on the household as well as the individual level, in SPSS and then converted to STATA, to be disseminated.
The survey sample consists of about 30,230 households, of which 23,120 households completed the interview: 14,682 households in the West Bank and 8,438 households in the Gaza Strip. Weights were adjusted to account for non-response. The response rate reached 82.4% in the West Bank and 92.7% in the Gaza Strip.
---> Sampling Errors Data of this survey may be affected by sampling errors due to use of a sample and not a complete enumeration. Therefore, certain differences can be expected in comparison with the real values obtained through censuses. Variances were calculated for the most important indicators: the variance table is attached with the final report. There is no problem in disseminating results at national or governorate level for the West Bank and Gaza Strip.
---> Non-Sampling Errors Non-statistical errors are possible at all stages of the project, during data collection or processing. These are referred to as non-response errors, response errors, interviewing errors, and data entry errors. To avoid errors and reduce their effects, great efforts were made to train the fieldworkers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, and received practical and theoretical training, including a pilot survey, during the training course. Data entry staff were also trained on the data entry program, which was tested before the data entry process started. To keep track of the progress of fieldwork activities and to limit obstacles, there was continuous contact with the fieldwork team through regular visits to the field and regular meetings with them during the different field visits. Problems faced by fieldworkers were discussed to clarify any issues. Non-sampling errors can occur at the various stages of survey implementation, whether in data collection or in data processing, and are generally difficult to evaluate statistically.
They cover a wide range of errors, including errors resulting from non-response, sampling frame coverage, coding and classification, data processing, and survey response (both respondent and interviewer-related). The use of effective training and supervision and the careful design of questions have a direct bearing on limiting the magnitude of non-sampling errors, and hence on enhancing the quality of the resulting data. The survey encountered non-response, with the cases where the household was not present at home during the fieldwork visit and where the housing unit was vacant accounting for the highest share of non-response cases. The total non-response rate reached 14.2%, which is very low compared to the household surveys conducted by PCBS. The refusal rate reached 3.0%, which is a very low percentage compared to the
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset was originally collected for a data science and machine learning project investigating the potential relationship between the amount of time an individual spends on social media and the impact it has on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The following is the Google Colab link to the project, done on Jupyter Notebook -
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The following is the GitHub Repository of the project -
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the Project -
Pandas
NumPy
Matplotlib
Seaborn
scikit-learn
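As a hedged sketch of the kind of predictive model described above (the actual notebook is in the linked repository), a minimal scikit-learn pipeline might look like this, with placeholder file, feature, and target names:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hedged sketch of a survey-based classifier; file name, feature columns, and
# target column are placeholders, not the project's actual variable names.
df = pd.read_csv("smmh_survey.csv")
X = df[["daily_hours_on_social_media", "age"]]   # placeholder features
y = df["should_seek_help"]                       # placeholder binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```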
https://creativecommons.org/publicdomain/zero/1.0/
All data taken from https://fbref.com/
GitHub to my project: https://github.com/emreguvenilir/fifa23-ml-ratingsystem
There is another statistics dataset here on Kaggle where the data is largely incomplete, so I took the time, mainly because of a final school project, to download the raw data using R. I then cleaned the data to the specifics of my project. The data contains only players from the big 5 leagues (Premier League, La Liga, Bundesliga, Ligue 1, Serie A).
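A small usage sketch is shown below: it recomputes a per-90 metric from the raw counting columns documented in the column list that follows; the file name is a placeholder:

```python
import pandas as pd

# Usage sketch for the cleaned dataset: recompute a goals-per-90 metric from
# the counting columns documented below. The file name is a placeholder.
players = pd.read_csv("big5_players.csv")

players["goals_per90"] = players["Goals"] / (players["Minutes_Played"] / 90)
regulars = players[players["Minutes_Played"] >= 900]    # drop tiny-minute samples
print(regulars.nlargest(10, "goals_per90")[["player", "squad", "Goals", "goals_per90"]])
```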
Column Description
squad: The team of a given player
comp: The league of the team, only includes the “big 5”
player: player name
nation: nationality of the player
pos: position of the player
age: age of the player
born: year born
MP: matches played
Minutes_Played: minutes played in the season
Mn_per_MP: minutes per match played
Mins_Per_90: minutes played divided by 90 (the length of a soccer match)
Starts: matches started
PPM_Team.Success: avg # of points earned by the team from matches in which the player appeared for a minimum of 30 minutes
OnG_Team.Success: goals scored by team while on pitch
onGA_Team.Success: goals allowed by team while on pitch
plus_per_minus_Team.Success: goals scored minus goals allowed while on pitch
Goals: goals scored
Assists: assists that led to goal
GoalsAssists: goals + assists
NonPKG: non penalty kick goals
PK: penalty kicks made
PKatt: penalties attempted
CrdY: yellow cards
CrdR: red cards
xG: expected goals based on all shots taken
xAG: expected assisted goals
npxG+xAG: non penalty expected goals + assisted goals
PrgC: progressive carries, i.e., carries into the attacking half of the pitch that moved the ball at least 10 yards toward the opponent's goal
PrgP: progressive passes, i.e., completed passes into the attacking half of the pitch that moved the ball at least 10 yards toward the opponent's goal
Gls_Per90: goals per 90 minutes
Ast_Per90: assists per 90 minutes
G+A_Per90: goals + assists per 90
G_minus_PK_Per: goals excluding penalties per 90
G+A_minus_PK_Per: goals and assists excluding penalties per 90
xG_Per: xG per 90
xAG_Per: xAG per 90
xG+xAG_Per: xG+xAG per 90
Shots: shots taken
Shots_On_Target: shots on goal frame
SoT_percent: SoT/Shots * 100 (percentage of shots on target)
G_per_Sh: goals per shot taken
G_per_SoT: goal per shot on target
Avg_Shot_Dist: avg shot dist
FK_Standard: shots from free kicks
G_minus_xG_expected: goals minus expected goals
np:G_minus_xG_Expected: non-penalty goals minus non-penalty expected goals
Passes_Completed: passes completed
Passes_attempted: passes attempted
Passes_Cmp_percent: pass completion percentage
PrgDist_Total: progressive pass total distance
Passes_Cmp_Short: short passes completed (5 to 15 yds)
Passes_Att_Short: short passes Attempted (5 to 15 yds)
Passes_Cmp_Percent_Short: short passes completed percentage (5 to 15 yds)
Passes_Cmp_Medium: medium passes completed (15 to 30 yds)
Passes_Att_medium: medium passes Attempted (15 to 30 yds)
Passes_Cmp_Percent_Medium: medium passes completed percentage (15 to 30 yds)
Passes_Cmp_long: long passes completed (30+ yds)
Passes_Att_long : long passes Attempted (30+ yds)
Passes_Cmp_Percent_long : long passes completed percentage (30+ yds)
A_minus_xAG_expected: assists minus expected assists
Key_Passes: passes that lead directly to a shot
Final_third: passes that enter the final third of the field
PPA: passes into the penalty area
CrsPA: crosses into penalty area
TB_pass: through ball passes
Crs_Pass: number of crosses
Offside_passes: passes that resulted in an offside
Blocked_passes: passes blocked by an opponent
Shot_Creating_Actions: shot creating actions
SCA_90: shot creating actions per 90
TakeOnTo_Shot: take ons that led to shot
FoulTo_Shot: fouls drawn that led to a shot
DefAction_Shot: defensive actions that led to a shot (pressing)
GoalCreatingAction: goal creating actions
GCA90: goal creating actions per 90
TakeOn_Goal: take ons that led to a goal
Fld_goal: fouls drawn that led to a goal
DefAction_Goal: defensive actions that led to a goal (pressing)
Tackles: number of tackles made
Tackles_won: tackles won
Def_3rd_Tackles: tackles in the defensive 1/3 of the pitch
Mid_3rd_Tackles: tackles in the middle 1/3 of the pitch
Att_3rd_Tackles: tackles in the attacking 1/3 of the pitch
Tkl_percent_won: % of dribblers tackled
Lost_challenges: lost challenges, unsuccessful attempts to win the ball
Blocks: # of times blocking the ball by standing in path
Sh_blocked: shots blocked
Passes_blocked: number of passes blocked
Interceptions: interceptions
Clearances: clearances
ErrorsLead_ToShot: errors made leading to a shot
Att_Take: attacking take ons attempted
Succ:Take: attacking take ons successful
Succ_percent_take: percentage of attacking take ons completed successfully
Tkld_Take: times tackled during a take on
Tkld_percent_Take: percentage of times tackled during a take on
TotDist_Carries: total distance carrying the ball in any direction
PrgDist_carries: progressive carry distance total
Miscontrolls: # of times a player...
A. SUMMARY
This dataset provides review time metrics for the San Francisco Planning Department’s application review process. The following metrics are provided: total days to Planning approval, days to finish completeness review, days to first check plan letter, and days to complete resubmission review. Targets for each metric and outcomes relative to these targets are also included. These metrics allow for ongoing tracking for individual planning projects and for the calculation of summary statistics for Planning review timelines. There are both project level metrics and project event level metrics in this table. You can see a dashboard which shows the City's current permit processing performance on sf.gov.

B. HOW THE DATASET IS CREATED
Planning application review is tracked within Planning’s Project and Permit Tracking System (PPTS). Planners enter review period start and end dates in PPTS when review milestones are reached. Review timeline data is extracted from PPTS and review timelines and outcomes are calculated and consolidated within this dataset. The dataset is generated by a data model that pulls from multiple raw Accela sources and joins them together.

C. UPDATE PROCESS
This dataset is updated daily overnight.

D. HOW TO USE THIS DATASET
Use this dataset to analyze project level timelines for planning projects or to calculate summary metrics related to the planning review and approval processes. The review metric type is defined in the ‘project stage’ column. Note that multiple rounds of completeness check review and resubmission review may occur for a single Planning project. The ‘potential error’ column flags records where data entry errors are likely present. Filter out rows where a value is entered in this column before building summary statistics.

E. RELATED DATASETS
Planning Department Project Events (coming soon)
Planning Department Projects (coming soon)
Building Permits
Building Permit Application Issuance Metrics
Building Permit Completeness Check Review Metrics
Building Permit Application Review Metrics
Planning Department Project Application Review Metrics
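As a hedged sketch of the guidance in section D (field names are approximations of the published columns), the flagged rows could be filtered out with pandas before summarizing:

```python
import pandas as pd

# Hedged sketch: drop rows flagged in the 'potential error' column before
# computing summary review-time statistics. Exact field names in the published
# dataset may differ.
reviews = pd.read_csv("planning_review_metrics.csv")

clean = reviews[reviews["potential_error"].isna()]
summary = clean.groupby("project_stage")["days_to_planning_approval"].median()
print(summary)
```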
The basic goal of this survey is to provide the necessary database for formulating national policies at various levels. It represents the contribution of the household sector to the Gross National Product (GNP). Household Surveys help as well in determining the incidence of poverty, and providing weighted data which reflects the relative importance of the consumption items to be employed in determining the benchmark for rates and prices of items and services. Generally, the Household Expenditure and Consumption Survey is a fundamental cornerstone in the process of studying the nutritional status in the Palestinian territory.
The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality. Data is a public good, in the interest of the region, and it is consistent with the Economic Research Forum's mandate to make micro data available, aiding regional research on this important topic.
The survey data covers urban, rural and camp areas in West Bank and Gaza Strip.
1- Household/families. 2- Individuals.
The survey covered all Palestinian households whose usual place of residence is the Palestinian Territory.
Sample survey data [ssd]
The sampling frame consists of all enumeration areas which were enumerated in 1997; the enumeration area consists of buildings and housing units and is composed of an average of 120 households. The enumeration areas were used as Primary Sampling Units (PSUs) in the first stage of the sampling selection. The enumeration areas of the master sample were updated in 2003.
The sample is a stratified cluster systematic random sample with two stages: First stage: selection of a systematic random sample of 299 enumeration areas. Second stage: selection of a systematic random sample of 12-18 households from each enumeration area selected in the first stage. A person (18 years and more) was selected from each household in the second stage.
The population was divided by: 1- Governorate 2- Type of Locality (urban, rural, refugee camps)
The calculated sample size is 3,781 households.
The target cluster size or "sample-take" is the average number of households to be selected per PSU. In this survey, the sample take is around 12 households.
Detailed information/formulas on the sampling design are available in the user manual.
Face-to-face [f2f]
The PECS questionnaire consists of two main sections:
First section: Certain articles/provisions of the form are filled out at the beginning of the month, and the remainder are filled out at the end of the month. The questionnaire includes the following provisions:
Cover sheet: Contains detailed particulars of the family, date of visit, particulars of the field/office work team, and the number/sex of the family members.
Statement of the family members: Contains social, economic and demographic particulars of the selected family.
Statement of the long-lasting commodities and income generation activities: Includes a number of basic and indispensable items (e.g., livestock or agricultural land).
Housing Characteristics: Includes information and data pertaining to the housing conditions, including type of shelter, number of rooms, ownership, rent, water, electricity supply, connection to the sewer system, source of cooking and heating fuel, and remoteness/proximity of the house to education and health facilities.
Monthly and Annual Income: Data pertaining to the income of the family is collected from different sources at the end of the registration / recording period.
Second section: The second section of the questionnaire includes a list of 54 consumption and expenditure groups, itemized and serially numbered according to their importance to the family. Each of these groups contains important commodities. The total number of commodity and service items across all groups is 667. Groups 1-21 include food, drink, and cigarettes. Group 22 includes homemade commodities. Groups 23-45 include all items except for food, drink and cigarettes. Groups 50-54 include all of the long-lasting commodities. Data on each of these groups was collected over different intervals of time so as to reflect expenditure over a period of one full year.
Both data entry and tabulation were performed using the ACCESS and SPSS software programs. The data entry process was organized in 6 files, corresponding to the main parts of the questionnaire. A data entry template was designed to reflect an exact image of the questionnaire, and included various electronic checks: logical check, range checks, consistency checks and cross-validation. Complete manual inspection was made of results after data entry was performed, and questionnaires containing field-related errors were sent back to the field for corrections.
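The sketch below illustrates, in pandas rather than the original ACCESS/SPSS template, the kinds of range and consistency checks described above; file and column names are placeholders:

```python
import pandas as pd

# Illustrative sketch (not the original ACCESS/SPSS data entry template) of the
# electronic checks described above: a range check and a simple consistency
# check. File and column names are placeholders.
entries = pd.read_csv("pecs_data_entry.csv")

range_errors = entries[~entries["household_size"].between(1, 30)]
consistency_errors = entries[entries["food_expenditure"] > entries["total_expenditure"]]

flagged = pd.concat([range_errors, consistency_errors]).drop_duplicates()
flagged.to_csv("questionnaires_to_recheck.csv", index=False)  # sent back for correction
```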
The survey sample consists of about 3,781 households interviewed over a twelve-month period between January 2004 and January 2005. There were 3,098 households that completed the interview, of which 2,060 were in the West Bank and 1,038 in the Gaza Strip. The response rate was 82% in the Palestinian Territory.
The calculations of standard errors for the main survey estimations enable the user to identify the accuracy of estimations and the survey reliability. Total errors of the survey can be divided into two kinds: statistical errors, and non-statistical errors. Non-statistical errors are related to the procedures of statistical work at different stages, such as the failure to explain questions in the questionnaire, unwillingness or inability to provide correct responses, bad statistical coverage, etc. These errors depend on the nature of the work, training, supervision, and conducting all various related activities. The work team spared no effort at different stages to minimize non-statistical errors; however, it is difficult to estimate numerically such errors due to absence of technical computation methods based on theoretical principles to tackle them. On the other hand, statistical errors can be measured. Frequently they are measured by the standard error, which is the positive square root of the variance. The variance of this survey has been computed by using the “programming package” CENVAR.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
QUORA_ONE_MANY_QA
This dataset is derived from quora.com question data: each record is one question with multiple answers. The project contributes data to the MNBVC project.
STATISTICS
Raw data size (w = 10,000 records, so 100w = 1,000,000 records):
- 100w: 16 GB
- 200w: 17 GB
- 300w: 15 GB
- 400w: 11 GB
- 500w: 10 GB
- 600w: 9 GB
- 700w: 9 GB
- 800w: 7.5 GB
- 900w: 7 GB
- 1000w: 6.5 GB
Updating...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The dataset contains eddy-covariance data from five i-Box stations in the Austrian Inn Valley, which have been processed to 5-min statistics. The i-Box is a long-term measurement platform, including a small network of eddy-covariance stations in the lower Inn Valley, to study boundary-layer processes in mountainous terrain. More information about the i-Box can be found at https://www.uibk.ac.at/acinn/research/atmospheric-dynamics/projects/innsbruck-box-i-box.html.en and in Rotach et al. (2017).
Data description
Station locations
The present dataset contains processed data from five i-Box stations located in the Austrian Inn Valley. The Inn Valley is an approximately southwest-northeast oriented valley in the western part of Austria, with a depth of about 2000 m and a width of about 2 km at the valley floor. The locations of the sites are shown in the overview figure i-Box_sites.pdf.
VF0 is located at the almost flat valley floor. The site is surrounded by grassland and agricultural fields. (47.305°N, 11.622°E, 545 m MSL)
SF8 is located at the foot of the north sidewall next to a steep embankment between an agricultural field and a concrete parking lot. (47.326°N, 11.652°E, 575 m MSL)
SF1 is located on an almost flat plateau running along the northern valley sidewall. The site is mainly surrounded by grassland and agricultural fields. (47.317°N, 11.616°E, 829 m MSL)
NF10 is located on an approximately 10 deg slope on the south sidewall, covered by grassland. (47.300°N, 11.673°E, 930 m MSL)
NF27 is located on a steep, grass-covered slope on the south sidewall, with a slope angle of about 25 deg. (47.288°N, 11.631°E, 1009 m MSL)
Further information about station locations can be found in Rotach et al. (2017) and Lehner et al. (2021).
Temporal coverage
The dataset contains processed data between 2014 and 2020. Some instruments were replaced and new instruments were added during this period. Data gaps occur as a result of instrument malfunctions and maintenance.
Instrumentation
Each station is equipped with at least one sonic anemometer and a gas analyzer. The instrumentation usually consists of a CSAT3 sonic anemometer (Campbell Scientific, USA) and KH20 Krypton hygrometer (Campbell Scientific) or an EC150 open-path infrared gas analyzer (Campbell Scientific). In 2020, several of the instruments were replaced with an Irgason (Campbell Scientific), which combines an open-path infrared gas analyzer with a sonic anemometer. Pressure, air temperature, and humidity used for calculating flux corrections are measured with Setra 278 sensors (Setra Systems, USA) and Rotronic HC2A-S temperature and humidity probes (Rotronic, Switzerland).
VF0: CSAT3 and EC150 at 4.0 m, CSAT3 at 8.7 m, CSAT3 and KH20 (until July 2020) or Irgason (since July 2020) at 16.9 m
SF8: CSAT3 at 6.1 m, CSAT3 and KH20 (until September 2020) or Irgason (since September 2020) at 11.2 m
SF1: CSAT3 and KH20 (until June 2020) or Irgason (since June 2020) at 6.8 m
NF10: CSAT3 and KH20 (until June 2020) or Irgason (since June 2020) at 5.7 m
NF27: CSAT3 at 1.5 m (since September 2017), CSAT3 and KH20 (until November 2016) or Irgason (since September 2017) at 6.8 m
Further information about the instrumentation can be found in Rotach et al. (2017), Lehner et al. (2021), and in the ACINN database:
NF10: https://acinn-data.uibk.ac.at/pages/i-box-weerberg.html
NF27: https://acinn-data.uibk.ac.at/pages/i-box-hochhaeuser.html
Data processing
Raw 20-Hz data were quality controlled and rotated into a streamline coordinate system using double rotation before block averaging the data to 5-min statistics, without previous filtering. Flux corrections were applied to the turbulence statistics, including a frequency response correction (Aubinet et al. 2012) with spectral models following Moore (1986), Højstrup (1981), and Kaimal et al. (1972); a sonic heat-flux correction of the vertical heat flux and the temperature variance (Schotanus et al. 1983); a WPL correction of the vertical moisture flux (Webb et al. 1980); and an Oxygen correction of the vertical moisture flux for data from Krypton hygrometers (van Dijk et al. 2003).
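The double-rotation step mentioned above can be sketched as follows for a single averaging block; this is an illustration of the standard algorithm, not the project's processing code:

```python
import numpy as np

# Sketch of the double-rotation step described above: the sonic wind components
# of one averaging block are rotated so that the mean crosswind and mean
# vertical velocity vanish, before the turbulence statistics are computed.
def double_rotation(u, v, w):
    # first rotation: align the x-axis with the mean horizontal wind (mean v = 0)
    alpha = np.arctan2(np.mean(v), np.mean(u))
    u1 = u * np.cos(alpha) + v * np.sin(alpha)
    v1 = -u * np.sin(alpha) + v * np.cos(alpha)
    # second rotation: tilt the x-axis into the mean streamline (mean w = 0)
    beta = np.arctan2(np.mean(w), np.mean(u1))
    u2 = u1 * np.cos(beta) + w * np.sin(beta)
    w2 = -u1 * np.sin(beta) + w * np.cos(beta)
    return u2, v1, w2
```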
The quality control procedures include the removal of data during periods of instrument malfunction as indicated by the instruments’ quality flags, a despiking, the removal of data points exceeding 30 m s-1 for the horizontal wind components, 10 m s-1 for the vertical wind velocity, and 50 g m-3 for water vapor density, and the removal of sonic temperature data outside the range -20 – 40°C. The removed data are replaced with random values drawn from a Gaussian distribution, with its mean and standard deviation calculated over a 30-s data window.
Quality flags are based on the criteria described in Stiperski and Rotach (2016):
-1: More than 10% of the raw data within the averaging period are replaced during the quality control.
0: More than 90% of the raw data fulfill the quality control criteria.
1: In addition to fulfilling the quality control criteria, the skewness is within the range -2–2 and the kurtosis is less than 8.
2: In addition to the above criteria, the stationarity test by Foken and Wichura (1996) is below 30% and the uncertainty is less than 50% based on Stiperski and Rotach (2016) and Wyngaard (1973)
Data files
i-Box_sites.pdf contains a map of the i-Box stations.
list_variables.pdf contains a list of variable names with a short description.
SITENAME_5min.zip contains the processed turbulence statistics, split into yearly files. There is more than one file per year if the instrumentation changed during the year or because of memory restrictions during the processing.
Acknowledgments
Data processing was performed in the framework of the TExSMBL (Turbulent Exchange in the Stable Mountain Boundary Layer) project funded by the Austrian Science Fund (FWF) under grant V 791-N. Data were processed on the LEO HPC infrastructure of the University of Innsbruck.
References
Aubinet M, Vesala T, Papale D (eds) (2012) Eddy Covariance. A practical guide to measurements and data analysis. Springer, Dordrecht, DOI 10.1007/978-94-007-2351-1
Højstrup J (1981) A simple model for the adjustment of velocity spectra in unstable conditions downstream of an abrupt change in roughness and heat flux. Boundary-Layer Meteorol 21:341–356, DOI 10.1007/bf00119278
Kaimal JC, Wyngaard JC, Izumi Y, Coté OR (1972) Spectral characteristics of surface-layer turbulence. Q J R Meteorol Soc 98:563–589, DOI 10.1002/qj.49709841707
Lehner M, Rotach MW, Sfyri E, Obleitner F (2021) Spatial and temporal variations in near-surface energy fluxes in an Alpine valley under synoptically undisturbed and clear-sky conditions. Q J R Meteorol Soc 147:2173–2196, DOI 10.1002/qj.4016
Moore CJ (1986) Frequency response corrections for eddy correlation systems. Boundary-Layer Meteorol 37:17–35, DOI 10.1007/BF00122754
Rotach MW, Stiperski I, Fuhrer O, Goger B, Gohm A, Obleitner F, Rau G, Sfyri E, Vergeiner J (2017) Investigating exchange processes over complex topography—the Innsbruck Box (i-Box). Bull Amer Meteorol Soc 98:787–805, DOI 10.1175/BAMS-D-15-00246.1
Schotanus P, Nieuwstadt FTM, de Bruijn HAR (1983) Temperature measurement with a sonic anemometer and its application to heat and moisture fluxes. Boundary-Layer Meteorol 26:81–93, DOI 10.1007/BF00164332
Stiperski I, Rotach MW (2016) On the measurement of turbulence over complex mountainous terrain. Boundary-Layer Meteorol 159:97–121, DOI 10.1007/s10546-015-0103-z
Van Dijk A, Kohsiek W, de Bruin HAR (2003) Oxygen sensitivity of Krypton and Lyman-α hygrometers. J Atmos Ocean Technol 20:143–151, DOI 10.1175/1520-0426(2003)020<0143:OSOKAL>2.0.CO;2
Webb EK, Pearman GI, Leuning R (1980) Correction of flux measurements for density effects due to heat and water vapour transfer. Q J R Meteorol Soc 106:85–100, DOI 10.1002/qj.49710644707
Wyngaard JC (1973) On surface layer turbulence. In: Haugen DA (ed) Workshop on Micrometeorology. American Meteorological Society, pp 101–150
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Crowdfunding has become one of the main sources of initial capital for small businesses and start-up companies that are looking to launch their first products. Websites like Kickstarter and Indiegogo provide a platform for millions of creators to present their innovative ideas to the public. This is a win-win situation: creators can accumulate initial funding while the public gets access to cutting-edge prototype products that are not yet available on the market.
At any given point, Indiegogo has around 10,000 live campaigns while Kickstarter has 6,000. It has become increasingly difficult for projects to stand out from the crowd. Of course, advertisements via various channels are by far the most important factor in a successful campaign. However, for creators with a smaller budget, this leaves them wondering,
"How do we increase the probability of success of our campaign starting from the very moment we create our project on these websites?"
All of the raw data were scraped from Kickstarter.com:
The first 4,000 live projects currently campaigning on Kickstarter (live.csv)
The 4,000 most-backed projects ever on Kickstarter (most_backed.csv)
See more at http://datapolymath.paperplane.io/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The dataset contains eddy-covariance data from five i-Box stations in the Austrian Inn Valley, processed to 30-s statistics. The i-Box is a long-term measurement platform for studying boundary-layer processes in mountainous terrain and includes a small network of eddy-covariance stations in the lower Inn Valley. More information about the i-Box can be found at https://www.uibk.ac.at/acinn/research/atmospheric-dynamics/projects/innsbruck-box-i-box.html.en and in Rotach et al. (2017).
Data description
Station locations
The present dataset contains processed data from five i-Box stations located in the Austrian Inn Valley. The Inn Valley is an approximately southwest-northeast oriented valley in the western part of Austria, with a depth of about 2000 m and a width of about 2 km at the valley floor. The locations of the sites are shown in the overview figure i-Box_sites.pdf.
VF0 is located at the almost flat valley floor. The site is surrounded by grassland and agricultural fields. (47.305°N, 11.622°E, 545 m MSL)
SF8 is located at the foot of the north sidewall next to a steep embankment between an agricultural field and a concrete parking lot. (47.326°N, 11.652°E, 575 m MSL)
SF1 is located on an almost flat plateau running along the northern valley sidewall. The site is mainly surrounded by grassland and agricultural fields. (47.317°N, 11.616°E, 829 m MSL)
NF10 is located on an approximately 10 deg slope on the south sidewall, covered by grassland. (47.300°N, 11.673°E, 930 m MSL)
NF27 is located on a steep, grass-covered slope on the south sidewall, with a slope angle of about 25 deg. (47.288°N, 11.631°E, 1009 m MSL)
Further information about station locations can be found in Rotach et al. (2017) and Lehner et al. (2021).
Temporal coverage
The dataset contains processed data between 2014 and 2020. Some instruments were replaced and new instruments were added during this period. Data gaps occur as a result of instrument malfunctions and maintenance.
Instrumentation
Each station is equipped with at least one sonic anemometer and a gas analyzer. The instrumentation usually consists of a CSAT3 sonic anemometer (Campbell Scientific, USA) and KH20 Krypton hygrometer (Campbell Scientific) or an EC150 open-path infrared gas analyzer (Campbell Scientific). In 2020, several of the instruments were replaced with an Irgason (Campbell Scientific), which combines an open-path infrared gas analyzer with a sonic anemometer. Pressure, air temperature, and humidity used for calculating flux corrections are measured with Setra 278 sensors (Setra Systems, USA) and Rotronic HC2A-S temperature and humidity probes (Rotronic, Switzerland).
VF0: CSAT3 and EC150 at 4.0 m, CSAT3 at 8.7 m, CSAT3 and KH20 (until July 2020) or Irgason (since July 2020) at 16.9 m
SF8: CSAT3 at 6.1 m, CSAT3 and KH20 (until September 2020) or Irgason (since September 2020) at 11.2 m
SF1: CSAT3 and KH20 (until June 2020) or Irgason (since June 2020) at 6.8 m
NF10: CSAT3 and KH20 (until June 2020) or Irgason (since June 2020) at 5.7 m
NF27: CSAT3 at 1.5 m (since September 2017), CSAT3 and KH20 (until November 2016) or Irgason (since September 2017) at 6.8 m
Further information about the instrumentation can be found in Rotach et al. (2017), Lehner et al. (2021), and in the ACINN database:
VF0: https://acinn-data.uibk.ac.at/pages/i-box-kolsass.html
SF8: https://acinn-data.uibk.ac.at/pages/i-box-terfens.html
SF1: https://acinn-data.uibk.ac.at/pages/i-box-eggen.html
NF10: https://acinn-data.uibk.ac.at/pages/i-box-weerberg.html
NF27: https://acinn-data.uibk.ac.at/pages/i-box-hochhaeuser.html
Data processing
Raw 20-Hz data were quality controlled and rotated into a streamline coordinate system using double rotation before block averaging the data to 30-s statistics, without prior filtering. Flux corrections were applied to the turbulence statistics, including a frequency response correction (Aubinet et al. 2012) with spectral models following Moore (1986), Højstrup (1981), and Kaimal et al. (1972); a sonic heat-flux correction of the vertical heat flux and the temperature variance (Schotanus et al. 1983); a WPL correction of the vertical moisture flux (Webb et al. 1980); and an oxygen correction of the vertical moisture flux for data from Krypton hygrometers (van Dijk et al. 2003).
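To illustrate the double-rotation and block-averaging steps, a short, generic Python sketch is given below. It rotates a synthetic 20-Hz record so that the mean cross-stream and vertical wind components vanish and then computes 30-s block statistics; it is not the processing code used for this dataset, and none of the flux corrections listed above are included.

```python
# Generic sketch: double rotation followed by 30-s block averaging of a
# synthetic 20-Hz record. Not the dataset's processing code; flux
# corrections (frequency response, Schotanus, WPL, oxygen) are omitted.
import numpy as np

def double_rotation(u, v, w):
    """Rotate wind components so that mean(v) = 0 and mean(w) = 0."""
    # First rotation (about z): align the x-axis with the mean horizontal wind.
    alpha = np.arctan2(np.mean(v), np.mean(u))
    u1 = u * np.cos(alpha) + v * np.sin(alpha)
    v1 = -u * np.sin(alpha) + v * np.cos(alpha)
    # Second rotation (about the new y): tilt the x-axis into the mean streamline.
    beta = np.arctan2(np.mean(w), np.mean(u1))
    u2 = u1 * np.cos(beta) + w * np.sin(beta)
    w2 = -u1 * np.sin(beta) + w * np.cos(beta)
    return u2, v1, w2

def block_stats(u, w, T, fs=20.0, window=30.0):
    """Block-average to `window`-second statistics (means and the w'T' covariance)."""
    n = int(fs * window)
    out = []
    for i in range(len(u) // n):
        s = slice(i * n, (i + 1) * n)
        ub, wb, Tb = u[s], w[s], T[s]
        out.append({
            "u_mean": ub.mean(),
            "w_mean": wb.mean(),
            "T_mean": Tb.mean(),
            "cov_wT": np.mean((wb - wb.mean()) * (Tb - Tb.mean())),
        })
    return out

# Synthetic 5-minute example at 20 Hz.
rng = np.random.default_rng(0)
n = 20 * 300
u = 3.0 + rng.normal(0, 0.5, n)
v = 0.5 + rng.normal(0, 0.5, n)
w = 0.1 + rng.normal(0, 0.2, n)
T = 285.0 + rng.normal(0, 0.3, n)
u_r, v_r, w_r = double_rotation(u, v, w)
stats = block_stats(u_r, w_r, T)
print(len(stats), stats[0]["cov_wT"])
```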
The quality control procedures include the removal of data during periods of instrument malfunction, as indicated by the instruments' quality flags; despiking; the removal of data points exceeding 30 m s-1 for the horizontal wind components, 10 m s-1 for the vertical wind velocity, and 50 g m-3 for water vapor density; and the removal of sonic temperature data outside the range -20 to 40°C. The removed data are replaced with random values drawn from a Gaussian distribution, with its mean and standard deviation calculated over a 30-s data window.
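A minimal sketch of the plausibility screening and gap filling described above follows. It flags values outside the stated limits and replaces them with Gaussian random values whose mean and standard deviation are taken from the valid samples of the same 30-s window; the despiking algorithm actually used is not specified here and is therefore omitted.

```python
# Minimal sketch of the plausibility limits and Gaussian replacement
# described above (despiking omitted). Thresholds follow the text; the
# replacement statistics are taken from each 30-s (600-sample) window.
import numpy as np

LIMITS = {
    "u": (-30.0, 30.0),   # horizontal wind components, m s-1
    "v": (-30.0, 30.0),
    "w": (-10.0, 10.0),   # vertical wind velocity, m s-1
    "Ts": (-20.0, 40.0),  # sonic temperature, degC
    "q": (0.0, 50.0),     # water vapor density, g m-3
}

def qc_replace(x, lo, hi, fs=20.0, window=30.0, rng=None):
    """Replace out-of-range samples with Gaussian noise drawn from the mean/std
    of the valid samples in the same 30-s window. Returns the cleaned series
    and the fraction of replaced samples per window."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x, dtype=float).copy()
    n = int(fs * window)
    frac = []
    for i in range(len(x) // n):
        s = slice(i * n, (i + 1) * n)
        block = x[s]
        bad = (block < lo) | (block > hi) | ~np.isfinite(block)
        good = block[~bad]
        if bad.any() and good.size > 1:
            block[bad] = rng.normal(good.mean(), good.std(), bad.sum())
        x[s] = block
        frac.append(bad.mean())
    return x, np.array(frac)

# Example: screen a synthetic vertical-wind record containing a few outliers.
rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.3, 20 * 300)
w[::1500] = 25.0  # artificial out-of-range values
w_clean, replaced = qc_replace(w, *LIMITS["w"], rng=rng)
print(replaced.max())
```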
Quality flags are based on the criteria described in Stiperski and Rotach (2016):
-1: More than 10% of the raw data within the averaging period are replaced during the quality control.
0: More than 90% of the raw data fulfill the quality control criteria.
1: In addition to fulfilling the quality control criteria, the skewness is within the range -2 to 2 and the kurtosis is less than 8.
2: In addition to the above criteria, the stationarity test by Foken and Wichura (1996) is below 30% and the uncertainty is less than 50%, based on Stiperski and Rotach (2016) and Wyngaard (1973).
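The flag hierarchy can be summarized by the following decision logic. The replaced-data fraction, skewness, kurtosis, stationarity measure, and relative uncertainty are assumed to be precomputed for each averaging period; this is an illustration of the criteria above, not the code used to produce the dataset.

```python
# Rough illustration of the flag hierarchy above. All inputs are assumed to be
# precomputed per 30-s window; this is not the dataset's processing code.
def quality_flag(replaced_fraction, skewness, kurtosis, stationarity, rel_uncertainty):
    """Return the quality flag (-1, 0, 1, or 2) for one averaging period."""
    if replaced_fraction > 0.10:
        return -1                      # more than 10% of raw data replaced
    flag = 0                           # at least 90% of raw data pass QC
    if -2.0 <= skewness <= 2.0 and kurtosis < 8.0:
        flag = 1                       # higher-moment plausibility satisfied
        if stationarity < 0.30 and rel_uncertainty < 0.50:
            flag = 2                   # stationary and low uncertainty
    return flag

# Example: a well-behaved window (flag 2) and a heavily gap-filled one (flag -1).
print(quality_flag(0.02, 0.3, 3.1, 0.12, 0.20))
print(quality_flag(0.15, 0.3, 3.1, 0.12, 0.20))
```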
Data files
i-Box_sites.pdf contains a map of the i-Box stations.
list_variables.pdf contains a list of variable names with a short description.
SITENAME_30s.zip contains the processed turbulence statistics, split into yearly files. There is more than one file per year if the instrumentation changed during the year or because of memory restrictions during the processing.
Acknowledgments
Data processing was performed in the framework of the TExSMBL (Turbulent Exchange in the Stable Mountain Boundary Layer) project funded by the Austrian Science Fund (FWF) under grant V 791-N. Data were processed on the LEO HPC infrastructure of the University of Innsbruck.
References
Aubinet M, Vesala T, Papale D (eds) (2012) Eddy Covariance: A Practical Guide to Measurement and Data Analysis. Springer, Dordrecht, DOI 10.1007/978-94-007-2351-1
Foken T, Wichura B (1996) Tools for quality assessment of surface-based flux measurements. Agric For Meteorol 78:83–105, DOI 10.1016/0168-1923(95)02248-1
Højstrup J (1981) A simple model for the adjustment of velocity spectra in unstable conditions downstream of an abrupt change in roughness and heat flux. Boundary-Layer Meteorol 21:341–356, DOI 10.1007/BF00119278
Kaimal JC, Wyngaard JC, Izumi Y, Coté OR (1972) Spectral characteristics of surface-layer turbulence. Q J R Meteorol Soc 98:563–589, DOI 10.1002/qj.49709841707
Lehner M, Rotach MW, Sfyri E, Obleitner F (2021) Spatial and temporal variations in near-surface energy fluxes in an Alpine valley under synoptically undisturbed and clear-sky conditions. Q J R Meteorol Soc 147:2173–2196, DOI 10.1002/qj.4016
Moore CJ (1986) Frequency response corrections for eddy correlation systems. Boundary-Layer Meteorol 37:17–35, DOI 10.1007/BF00122754
Rotach MW, Stiperski I, Fuhrer O, Goger B, Gohm A, Obleitner F, Rau G, Sfyri E, Vergeiner J (2017) Investigating exchange processes over complex topography—the Innsbruck Box (i-Box). Bull Amer Meteorol Soc 98:787–805, DOI 10.1175/BAMS-D-15-00246.1
Schotanus P, Nieuwstadt FTM, de Bruin HAR (1983) Temperature measurement with a sonic anemometer and its application to heat and moisture fluxes. Boundary-Layer Meteorol 26:81–93, DOI 10.1007/BF00164332
Stiperski I, Rotach MW (2016) On the measurement of turbulence over complex mountainous terrain. Boundary-Layer Meteorol 159:97–121, DOI 10.1007/s10546-015-0103-z
Van Dijk A, Kohsiek W, de Bruin HAR (2003) Oxygen sensitivity of Krypton and Lyman-α hygrometers. J Atmos Ocean Technol 20:143–151, DOI 10.1175/1520-0426(2003)020<0143:OSOKAL>2.0.CO;2
Webb EK, Pearman GI, Leuning R (1980) Correction of flux measurements for density effects due to heat and water vapour transfer. Q J R Meteorol Soc 106:85–100, DOI 10.1002/qj.49710644707
Wyngaard JC (1973) On surface layer turbulence. In: Haugen DA (ed) Workshop on Micrometeorology. American Meteorological Society, pp 101–150
MSZSI: Multi-Scale Zonal Statistics [AgriClimate] Inventory
--------------------------------------------------------------------------------------
MSZSI is a data extraction tool for Google Earth Engine that aggregates time-series remote sensing information to multiple administrative levels using the FAO GAUL data layers. The code at the bottom of this page (metadata) can be pasted into the Google Earth Engine JavaScript code editor and run at https://code.earthengine.google.com/; a rough Python (earthengine-api) sketch of the same kind of aggregation is included after the output-field list below.
Please refer to the associated publication:
Peter, B.G., Messina, J.P., Breeze, V., Fung, C.Y., Kapoor, A. and Fan, P., 2024. Perspectives on modifiable spatiotemporal unit problems in remote sensing of agriculture: evaluating rice production in Vietnam and tools for analysis. Frontiers in Remote Sensing, 5, p.1042624.
https://www.frontiersin.org/journals/remote-sensing/articles/10.3389/frsen.2024.1042624
Input options:
[1] Country of interest
[2] Start and end year
[3] Start and end month
[4] Option to mask data to a specific land-use/land-cover type
[5] Land-use/land-cover type code from CGLS LULC
[6] Image collection for data aggregation
[7] Desired band from the image collection
[8] Statistics type for the zonal aggregations
[9] Statistic to use for annual aggregation
[10] Scaling options
[11] Export folder and label suffix
Output: Two CSVs containing zonal statistics for each of the FAO GAUL administrative level boundaries
Output fields: system:index, 0-ADM0_CODE, 0-ADM0_NAME, 0-ADM1_CODE, 0-ADM1_NAME, 0-ADMN_CODE, 0-ADMN_NAME, 1-AREA_PERCENT_LULC, 1-AREA_SQM_LULC, 1-AREA_SQM_ZONE, 2-X_2001, 2-X_2002, 2-X_2003, ..., 2-X_2020, .geo
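For orientation, the following Python (earthengine-api) sketch reproduces the general flow of the aggregation for one example configuration (annual-mean MODIS NDVI, 2001–2020, FAO GAUL level-1 units of a single country). The asset IDs, band name, country filter, and 250 m scale are illustrative assumptions, and the land-cover masking of input options [4]–[5] is omitted; this is not the MSZSI script itself.

```python
# Hedged sketch of a multi-year zonal aggregation over FAO GAUL units using
# the Python earthengine-api. Asset IDs, band, country, and scale are assumptions.
import ee

ee.Initialize()

# Zonal units (here FAO GAUL level 1; the tool also exports level 2).
gaul1 = (ee.FeatureCollection('FAO/GAUL/2015/level1')
         .filter(ee.Filter.eq('ADM0_NAME', 'Viet Nam')))

# Example image collection and band (input options [6] and [7]).
ndvi = ee.ImageCollection('MODIS/061/MOD13Q1').select('NDVI')

def annual_mean(year):
    """One band of annual-mean NDVI, named X_<year> (cf. the 2-X_2001 fields)."""
    yr = ee.Number(year).int()
    start = ee.Date.fromYMD(yr, 1, 1)
    img = ndvi.filterDate(start, start.advance(1, 'year')).mean()
    return img.multiply(0.0001).rename([ee.String('X_').cat(yr.format('%d'))])

# Stack one band per year, then reduce over the administrative units.
stack = ee.ImageCollection(ee.List.sequence(2001, 2020).map(annual_mean)).toBands()
stats = stack.reduceRegions(collection=gaul1,
                            reducer=ee.Reducer.mean(),
                            scale=250)

# Export a CSV of zonal statistics (the tool writes one CSV per admin level).
ee.batch.Export.table.toDrive(collection=stats,
                              description='mszsi_sketch_adm1_ndvi',
                              fileFormat='CSV').start()
```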
PREPROCESSED DATA DOWNLOAD
The datasets available for download contain zonal statistics at 2 administrative levels (FAO GAUL levels 1 and 2). Select countries from Southeast Asia and Sub-Saharan Africa (Cambodia, Indonesia, Lao PDR, Myanmar, Philippines, Thailand, Vietnam, Burundi, Kenya, Malawi, Mozambique, Rwanda, Tanzania, Uganda, Zambia, Zimbabwe) are included in the current version, with plans to extend the dataset to contain global metrics. Each zip file is described below and two example NDVI tables are available for preview.
Key: [source, data, units, temporal range, aggregation, masking, zonal statistic, notes]
Currently available:
MSZSI-V2_V-NDVI-MEAN.tar: [NASA-MODIS, NDVI, index, 2001–2020, annual mean, agriculture, mean, n/a]
MSZSI-V2_T-LST-DAY-MEAN.tar: [NASA-MODIS, LST Day, °C, 2001–2020, annual mean, agriculture, mean, n/a]
MSZSI-V2_T-LST-NIGHT-MEAN.tar: [NASA-MODIS, LST Night, °C, 2001–2020, annual mean, agriculture, mean, n/a]
MSZSI-V2_R-PRECIP-SUM.tar: [UCSB-CHG-CHIRPS, Precipitation, mm, 2001–2020, annual sum, agriculture, mean, n/a]
MSZSI-V2_S-BDENS-MEAN.tar: [OpenLandMap, Bulk density, g/cm3, static, n/a, agriculture, mean, at depths 0-10-30-60-100-200]
MSZSI-V2_S-ORGC-MEAN.tar: [OpenLandMap, Organic carbon, g/kg, static, n/a, agriculture, mean, at depths 0-10-30-60-100-200]
MSZSI-V2_S-PH-MEAN.tar: [OpenLandMap, pH in H2O, pH, static, n/a, agriculture, mean, at depths 0-10-30-60-100-200]
MSZSI-V2_S-WATER-MEAN.tar: [OpenLandMap, Soil water, % at 33kPa, static, n/a, agriculture, mean, at depths 0-10-30-60-100-200]
MSZSI-V2_S-SAND-MEAN.tar: [OpenLandMap, Sand, %, static, n/a, agriculture, mean, at depths 0-10-30-60-100-200]
MSZSI-V2_S-SILT-MEAN.tar: [OpenLandMap, Silt, %, static, n/a, agriculture, mean, at depths 0-10-30-60-100-200]
MSZSI-V2_S-CLAY-MEAN.tar: [OpenLandMap, Clay, %, static, n/a, agriculture, mean, at depths 0-10-30-60-100-200]
MSZSI-V2_E-ELEV-MEAN.tar: [MERIT, [elevation, slope, flowacc, HAND], [m, degrees, km2, m], static, n/a, agriculture, mean, n/a]
Coming soon
MSZSI-V2_C-STAX-MEAN.tar: [OpenLandMap, Soil taxonomy, category, static, n/a, agriculture, area sum, n/a]
MSZSI-V2_C-LULC-MEAN.tar: [CGLS-LC100-V3, LULC, category, 2015–2019, mode, none, area sum, n/a]
Data sources:
/*///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
MSZSI: Multi-Scale Zonal Statistics Inventory
Authors:
Brad G. Peter, Department of Geography, University of Alabama
Joseph Messina, Department of Geography, University of Alabama
Austin Raney, Department of Geography, University of Alabama
Rodrigo E. Principe, AgriCircle AG
Peilei Fan, Department of Geography, Environment, and Spatial Sciences, Michigan State University
Citation: Peter, Brad; Messina, Joseph; Raney, Austin; Principe, Rodrigo; Fan, Peilei, 2021, 'MSZSI: Multi-Scale Zonal Statistics Inventory', https://doi.org/10.7910/DVN/YCUBXS, Harvard Dataverse, V#
SEAGUL: Southeast Asia Globalization, Urbanization, Land and Environment Changes
http://seagul.info/
https://lcluc.umd.edu/projects/divergent-local-responses-globalization-urbanization-land-transition-and-environmental
This project was made possible by the NASA Land-Cover/Land-Use Change Program (Grant #: 80NSSC20K0740)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The School of Education at the University of Cape Town (UCT) investigated children's learning through digital play. The aim of the study was to explore the intersection between child play, technology, creativity, and learning among children aged between 3 and 11 years. The study also identified skills and dispositions children develop through both digital and non-digital play. The data shared emerged from a survey of parents of children in this age group, with particular reference to the parents' views on children's play practices, including the time parents spent playing with their children, parents' concerns about the time children spend playing on various technologies, the types of play children in South Africa engaged in, and parents' concerns when children played with some electronic devices. The following data files are shared:
SA - Survey - Children, Technology and Play (CTAP) - Google Forms.pdf
Descriptive Stats 2020.1.9 -Children Technology and Play SURVEY.xlsx
Parent Survey RAW PUBLIC DATA 2020.2.29 - Children Technology and Play Project.xlsx
Parent Survey RAW PUBLIC DATA 2020.2.29 - Children Technology and Play Project.csv
Parent Survey REPORT DATA 2020.2.29 - Children Technology and Play Project.xlsx
Parent Survey REPORT DATA 2020.2.29 - Children Technology and Play Project.csv
Parent Survey RAW and REPORT DATA SYNTAX 2020.2.29 - Children Technology and Play Project.sps
NOTE: This survey was adapted from Marsh, J., Stjerne Thomsen, B., Parry, B., Scott, F., Bishop, J.C., Bannister, C., Driscoll, A., Margary, T., Woodgate, A. (2019) Children, Technology and Play. UK Survey Questions. LEGO Foundation.
THE CLEANED AND HARMONIZED VERSION OF THE SURVEY DATA PRODUCED AND PUBLISHED BY THE ECONOMIC RESEARCH FORUM REPRESENTS 25% OF THE ORIGINAL SURVEY DATA COLLECTED BY THE DEPARTMENT OF STATISTICS OF THE HASHEMITE KINGDOM OF JORDAN
In light of the rapid socio-economic development of this era, it is necessary to make data on household expenditure and income available, as well as on the relationship between these statistics and variables with a direct or indirect impact on them. Most countries therefore now carry out Household Expenditure and Income surveys periodically. Given the continuous changes in spending patterns, income levels, and prices, as well as in the population through both internal and external migration, it became necessary to update household income and expenditure data over time. The main objective of the survey is to obtain detailed data on household income and expenditure, linked to various demographic and socio-economic variables, to enable the computation of poverty indices, determine the characteristics of the poor, and prepare poverty maps. To achieve these goals, the sample was designed to be representative at the sub-district level. The data collected through the survey therefore also serve the following objectives:
1. Provide weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index.
2. Study the consumer expenditure patterns prevailing in the society and the impact of demographic and socio-economic variables on those patterns.
3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as the profession and educational level of the head of the household and other indicators.
4. Study the distribution of individuals and households by income and expenditure categories and analyze the associated factors.
5. Provide the data necessary for the national accounts related to overall consumption and income of the household sector.
6. Provide the income data needed to calculate poverty indices, identify the characteristics of the poor, and draw poverty maps.
7. Provide the data necessary for the formulation, follow-up, and evaluation of economic and social development programs, including those aimed at eradicating poverty.
The raw survey data provided by the Statistical Agency were cleaned and harmonized by the Economic Research Forum, in the context of a major project that started in 2009. During which extensive efforts have been exerted to acquire, clean, harmonize, preserve and disseminate micro data of existing household surveys in several Arab countries.
This survey was carried out for a sample of 12,678 households distributed across urban and rural areas in all governorates of the Kingdom.
1- Household/family. 2- Individual/person.
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
A two-stage stratified cluster sampling technique was used. In the first stage, a cluster sample was selected with probability proportional to size; in the second stage, a systematic approach was applied, guaranteeing a representative sample of all sub-districts (Qada).
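Because the actual sampling frame is not described here, the sketch below only illustrates the two generic building blocks named above, probability-proportional-to-size (PPS) selection of clusters in the first stage and systematic selection of households in the second stage, on an invented frame; it is not the survey's actual design.

```python
# Generic two-stage sketch: PPS cluster selection (systematic cumulative-size
# method) followed by systematic selection within each sampled cluster.
# The frame, cluster sizes, and sample sizes below are invented for illustration.
import random

random.seed(42)

def pps_systematic(clusters, n_clusters):
    """Stage 1: systematic PPS selection of clusters. Very large clusters can
    be hit more than once; a production design handles such clusters explicitly."""
    total = sum(size for _, size in clusters)
    interval = total / n_clusters
    point = random.random() * interval
    selected, cum = [], 0.0
    for name, size in clusters:
        cum += size
        while point <= cum and len(selected) < n_clusters:
            selected.append(name)
            point += interval
    return selected

def systematic_sample(units, n):
    """Stage 2: systematic selection of n units with a random start."""
    step = len(units) / n
    start = random.random() * step
    return [units[int(start + i * step)] for i in range(n)]

# Hypothetical frame: (cluster id, number of households in the cluster).
frame = [(f"cluster_{i:03d}", random.randint(80, 400)) for i in range(200)]
sizes = dict(frame)

sampled_clusters = pps_systematic(frame, n_clusters=40)
sample = {
    c: systematic_sample([f"{c}_hh_{j}" for j in range(sizes[c])], n=16)
    for c in set(sampled_clusters)
}
print(len(sample), sum(len(v) for v in sample.values()))
```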
Face-to-face [f2f]
List of survey questionnaires:
(1) General Form (2) Expenditure on food commodities Form (3) Expenditure on non-food commodities Form
The design and implementation of the survey comprised the following steps:
1. Sample design and selection.
2. Design of forms/questionnaires, guidelines to assist in filling out the questionnaires, and preparation of instruction manuals.
3. Design of the table templates to be used for the dissemination of the survey results.
4. Preparation of the fieldwork phase, including printing of forms/questionnaires, instruction manuals, data collection instructions, data checking instructions, and codebooks.
5. Selection and training of survey staff to collect data and run the required data checks.
6. Preparation and implementation of the pretest phase, designed to test and develop forms/questionnaires, instructions, and the software required for data processing and production of survey results.
7. Data collection.
8. Data checking and coding.
9. Data entry.
10. Data cleaning using data validation programs.
11. Data accuracy and consistency checks.
12. Data tabulation and preliminary results.
13. Preparation of the final report and dissemination of final results.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The stakeholder survey conducted under the ERA_FABRIC project examines the characteristics of research and innovation ecosystems across Europe. Through engaging 169 stakeholders from diverse regions, this report identifies critical factors influencing collaboration, inclusivity, and alignment between policy and practice. It reveals challenges such as the need for enhanced intra-regional cooperation, better stakeholder engagement, and improved public-private alignment. It also underscores the importance of environmental sustainability and of fostering interregional connections. These insights are pivotal for advancing the ERA_Hub concept, promoting evidence-based policy-making, and reinforcing innovation ecosystems' capacity to address dynamic regional needs. The main purpose of these documents is to provide the statistics behind the conclusions reached in the project deliverable of the same name. Both the raw and weighted data sets can be viewed.