Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Exploring E-commerce Trends: A Guide to Leveraging a Dummy Dataset
Introduction: In the world of e-commerce, data is a powerful asset that can be leveraged to understand customer behavior, improve sales strategies, and enhance overall business performance. This guide explores how to effectively utilize a dummy dataset generated to simulate various aspects of an e-commerce platform. By analyzing this dataset, businesses can gain valuable insights into product trends, customer preferences, and market dynamics.
Dataset Overview: The dummy dataset contains information on 1000 products across different categories such as electronics, clothing, home & kitchen, books, toys & games, and more. Each product is associated with attributes such as price, rating, number of reviews, stock quantity, discounts, sales, and date added to inventory. This comprehensive dataset provides a rich source of information for analysis and exploration.
Data Analysis: Using tools like Pandas, NumPy, and visualization libraries like Matplotlib or Seaborn, businesses can perform in-depth analysis of the dataset. Key insights such as top-selling products, popular product categories, pricing trends, and seasonal variations can be extracted through exploratory data analysis (EDA). Visualization techniques can be employed to create intuitive graphs and charts for better understanding and communication of findings.
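As a minimal sketch of such an EDA workflow (the file name products.csv and the column names product_name, category, price, sales, and date_added are assumptions, not a documented schema), the snippet below loads the product table and pulls out a few of the summaries mentioned above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names; adjust to the actual dummy dataset.
df = pd.read_csv("products.csv", parse_dates=["date_added"])

# Top-selling products and most popular categories.
top_products = df.nlargest(10, "sales")[["product_name", "category", "sales"]]
category_sales = df.groupby("category")["sales"].sum().sort_values(ascending=False)

# Average price by month of inventory addition (a rough seasonal view).
monthly_price = df.set_index("date_added")["price"].resample("ME").mean()  # "M" on pandas < 2.2

category_sales.plot(kind="bar", title="Sales by category")
plt.tight_layout()
plt.show()
```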
Machine Learning Applications: The dataset can be used to train machine learning models for various e-commerce tasks such as product recommendation, sales prediction, customer segmentation, and sentiment analysis. By applying algorithms like linear regression, decision trees, or neural networks, businesses can develop predictive models to optimize inventory management, personalize customer experiences, and drive sales growth.
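As one hedged illustration, a linear-regression sales predictor might look like the sketch below; the feature and target column names are assumptions based on the attributes listed earlier, not a documented schema.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("products.csv")  # assumed file name, as above

# Assumed columns: price, rating, num_reviews, discount as features; sales as target.
X = df[["price", "rating", "num_reviews", "discount"]]
y = df["sales"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```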
Testing and Prototyping: Businesses can utilize the dummy dataset to test new algorithms, prototype new features, or conduct A/B testing experiments without impacting real user data. This enables rapid iteration and experimentation to validate hypotheses and refine strategies before implementation in a live environment.
Educational Resources: The dummy dataset serves as an invaluable educational resource for students, researchers, and professionals interested in learning about e-commerce data analysis and machine learning. Tutorials, workshops, and online courses can be developed using the dataset to teach concepts such as data manipulation, statistical analysis, and model training in the context of e-commerce.
Decision Support and Strategy Development: Insights derived from the dataset can inform strategic decision-making processes and guide business strategy development. By understanding customer preferences, market trends, and competitor behavior, businesses can make informed decisions regarding product assortment, pricing strategies, marketing campaigns, and resource allocation.
Conclusion: In conclusion, the dummy dataset provides a versatile and valuable resource for exploring e-commerce trends, understanding customer behavior, and driving business growth. By leveraging this dataset effectively, businesses can unlock actionable insights, optimize operations, and stay ahead in today's competitive e-commerce landscape.
U.S. Geological Survey scientists, funded by the Climate and Land Use Change Research and Development Program, developed a dataset of 2006 and 2011 land use and land cover (LULC) information for selected 100-km2 sample blocks within 29 EPA Level 3 ecoregions across the conterminous United States. The data were collected for validation of new and existing national-scale LULC datasets developed from remotely sensed data sources. The data can also be used with the previously published Land Cover Trends Dataset: 1973-2000 (http://pubs.usgs.gov/ds/844/) to assess land-use/land-cover change in selected ecoregions over a 37-year study period.

LULC data for 2006 and 2011 were manually delineated using the same sample block classification procedures as the previous Land Cover Trends project. The methodology is based on a statistical sampling approach, manual classification of land use and land cover, and post-classification comparisons of land cover across different dates. Landsat Thematic Mapper and Enhanced Thematic Mapper Plus imagery was interpreted using a modified Anderson Level I classification scheme. Landsat data were acquired from the National Land Cover Database (NLCD) collection of images. For the 2006 and 2011 update, ecoregion-specific alterations in the sampling density were made to expedite the completion of manual block interpretations. The data collection process started with the 2000 date from the previous assessment, and any needed corrections were made before interpreting the next two dates of 2006 and 2011 imagery. The 2000 land cover was copied, and any changes seen in the 2006 Landsat images were digitized into a new 2006 land cover image. Similarly, the 2011 land cover image was created after completing the 2006 delineation.

Results from analysis of these data include ecoregion-based statistical estimates of the amount of LULC change per time period, ranking of the most common types of conversions, rates of change, and percent composition. The overall estimated amount of change per ecoregion from 2001 to 2011 ranged from a low of 370 km2 in the Northern Basin and Range Ecoregion to a high of 78,782 km2 in the Southeastern Plains Ecoregion. The Southeastern Plains Ecoregion continues to encompass the most intense forest harvesting and regrowth in the country. Forest harvesting and regrowth rates in the southeastern U.S. and Pacific Northwest continued at late 20th-century levels. The land use and land cover data collected by this study are ideally suited for training, validation, and regional assessments of land use and land cover change in the U.S. because they were collected using manual interpretation techniques of Landsat data aided by high-resolution photography.

The 2001-2011 Land Cover Trends Dataset is provided in an Albers Conical Equal Area projection using the NAD 1983 datum. The sample blocks have a 30-meter resolution, and file names follow a specific naming convention that includes the number of the ecoregion containing the block, the block number, and the Landsat image date. The data files are organized by ecoregion and are available in the ERDAS Imagine (.img) format.
The harmonized data set on health, created and published by the ERF, is a subset of the Iraq Household Socio Economic Survey (IHSES) 2012. It was derived from the household, individual and health modules, collected in the context of the above mentioned survey. The sample was then used to create a harmonized health survey, comparable with the Iraq Household Socio Economic Survey (IHSES) 2007 micro data set.
----> Overview of the Iraq Household Socio Economic Survey (IHSES) 2012:
Iraq is considered a leader in household expenditure and income surveys: the first was conducted in 1946, followed by surveys in 1954 and 1961. After the establishment of the Central Statistical Organization, household expenditure and income surveys were carried out every 3-5 years (1971/1972, 1976, 1979, 1984/1985, 1988, 1993, 2002/2007). In cooperation with the World Bank (WB), the Central Statistical Organization (CSO) and the Kurdistan Region Statistics Office (KRSO) launched IHSES fieldwork on 1/1/2012. The survey was carried out over a full year, covering all governorates including those in the Kurdistan Region.
The survey has six main objectives.
The raw survey data provided by the Statistical Office were then harmonized by the Economic Research Forum, to create a comparable version with the 2006/2007 Household Socio Economic Survey in Iraq. Harmonization at this stage only included unifying variables' names, labels and some definitions. See: Iraq 2007 & 2012- Variables Mapping & Availability Matrix.pdf provided in the external resources for further information on the mapping of the original variables on the harmonized ones, in addition to more indications on the variables' availability in both survey years and relevant comments.
National coverage: Covering a sample of urban, rural and metropolitan areas in all the governorates including those in Kurdistan Region.
1- Household/family. 2- Individual/person.
The survey was carried out over a full year covering all governorates including those in Kurdistan Region.
Sample survey data [ssd]
----> Design:
The sample size was 25,488 households for the whole of Iraq: 216 households in each of 118 districts, organized into 2,832 clusters of 9 households each, distributed across districts and governorates for both rural and urban areas.
----> Sample frame:
The listing and numbering results of the 2009-2010 Population and Housing Survey were adopted in all governorates, including the Kurdistan Region, as a frame from which to select households. The sample was selected in two stages. Stage 1: primary sampling units (blocks) within each stratum (district), for urban and rural areas, were systematically selected with probability proportional to size to reach 2,832 units (clusters). Stage 2: 9 households were selected from each primary sampling unit to create a cluster, giving a total sample of 25,488 households distributed across the governorates, with 216 households in each district.
----> Sampling Stages:
In each district, the sample was selected in two stages. Stage 1: based on the 2010 listing and numbering frame, 24 sample points were selected within each stratum through systematic sampling with probability proportional to size, with an implicit breakdown by urban/rural status and geography (sub-district, quarter, street, county, village and block). Stage 2: using households as secondary sampling units, 9 households were selected from each sample point using systematic equal-probability sampling. Sampling frames for each stage can be developed based on the 2010 building listing and numbering without updating household lists. In some small districts, the random selection of primary sampling units may yield fewer than 24 distinct units; in that case a sampling unit is selected more than once, so two or more clusters may come from the same enumeration unit when necessary.
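As a minimal sketch of the systematic probability-proportional-to-size selection described above (an illustrative reconstruction, not the CSO's actual implementation), the function below cumulates block sizes and walks through them with a fixed step; note that it naturally selects a large block more than once, mirroring the repeated-selection case mentioned for small districts.

```python
import numpy as np

def systematic_pps(sizes, n, seed=0):
    """Systematic PPS: select n primary sampling units with probability
    proportional to size. Returns unit indices; a unit whose size exceeds
    the sampling interval can appear more than once."""
    rng = np.random.default_rng(seed)
    cum = np.cumsum(np.asarray(sizes, dtype=float))
    step = cum[-1] / n                       # sampling interval
    points = rng.uniform(0, step) + step * np.arange(n)
    return np.searchsorted(cum, points)      # index of the unit covering each point

# Example: pick 24 blocks in a district from hypothetical block sizes.
block_sizes = [120, 80, 300, 45, 60, 210, 90, 150, 75, 400, 55, 130]
print(systematic_pps(block_sizes, n=24))
```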
Face-to-face [f2f]
----> Preparation:
The questionnaire of the 2006 survey was adopted in designing the 2012 questionnaire, with many revisions. Two rounds of pre-testing were carried out, and revisions were made based on feedback from the fieldwork team, World Bank consultants and others. Further revisions were made before a final version was implemented in a pilot survey in September 2011. After the pilot survey, additional revisions based on the challenges and feedback that emerged during implementation produced the final version used in the actual survey.
----> Questionnaire Parts:
The questionnaire consists of four parts, each with several sections:

Part 1: Socio-Economic Data:
- Section 1: Household Roster
- Section 2: Emigration
- Section 3: Food Rations
- Section 4: Housing
- Section 5: Education
- Section 6: Health
- Section 7: Physical Measurements
- Section 8: Job Seeking and Previous Job

Part 2: Monthly, Quarterly and Annual Expenditures:
- Section 9: Expenditures on Non-Food Commodities and Services (past 30 days)
- Section 10: Expenditures on Non-Food Commodities and Services (past 90 days)
- Section 11: Expenditures on Non-Food Commodities and Services (past 12 months)
- Section 12: Expenditures on Frequent Food Stuff and Non-Food Commodities (past 7 days)
- Section 12, Table 1: Meals Had Within the Residential Unit
- Section 12, Table 2: Number of Persons Other Than Household Members Participating in Meals at Household Expense

Part 3: Income and Other Data:
- Section 13: Job
- Section 14: Paid Jobs
- Section 15: Agriculture, Forestry and Fishing
- Section 16: Household Non-Agricultural Projects
- Section 17: Income from Ownership and Transfers
- Section 18: Durable Goods
- Section 19: Loans, Advances and Subsidies
- Section 20: Shocks and Household Coping Strategies
- Section 21: Time Use
- Section 22: Justice
- Section 23: Satisfaction in Life
- Section 24: Food Consumption During Past 7 Days
Part 4: Diary of Daily Expenditures: The diary of expenditures is an essential component of this survey. It is left with the household to record all daily purchases, such as expenditures on food and frequent non-food items (gasoline, newspapers, etc.), during 7 days. Two pages were allocated for recording each day's expenditures, so the diary consists of 14 pages.
----> Raw Data:
Data Editing and Processing: To ensure accuracy and consistency, the data were edited at the following stages:
1. Interviewer: checks all answers on the household questionnaire, confirming that they are clear and correct.
2. Local supervisor: checks to make sure that questions have been correctly completed.
3. Statistical analysis: after exporting data files from Excel to SPSS, the Statistical Analysis Unit uses program commands to identify irregular or non-logical values, in addition to auditing some variables.
4. World Bank consultants, in coordination with the CSO data management team: the World Bank technical consultants use additional programs in SPSS and Stata to examine and correct remaining inconsistencies within the data files. The software detects errors by analyzing questionnaire items according to the expected parameters for each variable.
----> Harmonized Data:
The Iraq Household Socio Economic Survey (IHSES) reached a total of 25,488 households. A total of 305 households refused to respond, giving a response rate of 98.6%. The highest interview rates were in Ninevah and Muthanna (100%), while the lowest was in Sulaimaniya (92%).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Development for Codes of Conduct in Online Classrooms of Vietnamese High School Students (CCOCVHSS) dataset includes 06 files in different formats (.doc, .csv, .sav) to suit each step in the process of developing the CCOCVHSS items, as follows:
1. Initial_Items_Pool.docx: presents 34 items developed by the research team based on an overview and analysis of the research literature on student behavior in the online learning environment, in relation to teachers and other students, along two main aspects (attitude and behavior), together with codes of conduct for students at general schools for online learning.
2. Experts_Judge_Results.xlsx: includes 07 columns and 35 rows, in which the columns correspond to data fields. The rows show each item's code, the content of that item, each expert's rating for that item, the total score of that item, and the analysis results of the proportions of the three rating levels.
3. Questionare_Of_CCOCVHSS.docx: a questionnaire designed for data collection, with three parts: (1) introduction and declaration of consent; (2) demographic information; and (3) questions.
4. CCOCVHSS _rawdata.csv: the data used for analysis, cleaned from the raw data collected in the online survey.
By Health [source]
The Behavioral Risk Factor Surveillance System (BRFSS) offers an expansive collection of data on the health-related quality of life (HRQOL) from 1993 to 2010. Over this time period, the Health-Related Quality of Life dataset consists of a comprehensive survey reflecting the health and well-being of non-institutionalized US adults aged 18 years or older. The data collected can help track and identify unmet population health needs, recognize trends, identify disparities in healthcare, determine determinants of public health, inform decision making and policy development, as well as evaluate programs within public healthcare services.
The HRQOL surveillance system has developed a compact set of HRQOL measures such as a summary measure indicating unhealthy days which have been validated for population health surveillance purposes and have been widely implemented in practice since 1993. Within this study's dataset you will be able to access information such as year recorded, location abbreviations & descriptions, category & topic overviews, questions asked in surveys and much more detailed information including types & units regarding data values retrieved from respondents along with their sample sizes & geographical locations involved!
This dataset tracks the Health-Related Quality of Life (HRQOL) from 1993 to 2010 using data from the Behavioral Risk Factor Surveillance System (BRFSS). This dataset includes information on the year, location abbreviation, location description, type and unit of data value, sample size, category and topic of survey questions.
Using this dataset on BRFSS: HRQOL data between 1993-2010 will allow for a variety of analyses related to population health needs. The compact set of HRQOL measures can be used to identify trends in population health needs as well as determine disparities among various locations. Additionally, responses to survey questions can be used to inform decision making and program and policy development in public health initiatives.
- Analyzing trends in HRQOL over the years by location to identify disparities in health outcomes between different populations and develop targeted policy interventions.
- Developing new models for predicting HRQOL indicators at a regional level, and using this information to inform medical practice and public health implementation efforts.
- Using the data to understand differences between states in terms of their HRQOL scores and establish best practices for healthcare provision based on that understanding, including areas such as access to care and availability of preventative care services.
If you use this dataset in your research, please credit the original authors.
See the dataset description for more information.
File: rows.csv

| Column name | Description |
|:---|:---|
| Year | Year of survey. (Integer) |
| LocationAbbr | Abbreviation of location. (String) |
| LocationDesc | Description of location. (String) |
| Category | Category of survey. (String) |
| Topic | Topic of survey. (String) |
| Question | Question asked in survey. (String) |
| DataSource | Source of data. (String) |
| Data_Value_Unit | Unit of data value. (String) |
| Data_Value_Type | Type of data value. (String) |
| Data_Value_Footnote_Symbol | Footnote symbol for data value. (String) |
| Data_Value_Std_Err | Standard error of the data value. (Float) |
| Sample_Size | Sample size used in sample. (Integer) |
| Break_Out | Break out categories used. (String) |
| Break_Out_Category | Type of break out assessed. (String) |
| **GeoLocation*... | |
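A minimal sketch for loading and slicing this file with pandas follows; the file name rows.csv and the column names come from the listing above, while the example filter values ("CA", "Healthy Days") are purely illustrative.

```python
import pandas as pd

df = pd.read_csv("rows.csv")

# Trend of one topic across years for one state (illustrative filter values).
subset = df[(df["LocationAbbr"] == "CA") & (df["Topic"] == "Healthy Days")]
trend = subset.groupby("Year")["Sample_Size"].sum()
print(trend)
```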
The resilience of health care development in countries along the Belt and Road reflects how resilient health care development is in those countries; the higher the data value, the stronger the resilience. The World Bank statistical database was used to prepare the health resilience data. Based on the year-on-year data of the four underlying indicators, and taking into account the year-on-year changes of each indicator, the resilience of health care development was derived through comprehensive diagnosis based on sensitivity and adaptability analysis. The Resilience in Health Care Development dataset for countries along the Belt and Road is an important reference for analysing and comparing the current resilience of health care development in each country.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Global Roads Open Access Data Set, Version 1 (gROADSv1) was developed under the auspices of the CODATA Global Roads Data Development Task Group. The data set combines the best available roads data by country into a global roads coverage, using the UN Spatial Data Infrastructure Transport (UNSDI-T) version 2 as a common data model. All country road networks have been joined topologically at the borders, and many countries have been edited for internal topology. Source data for each country are provided in the documentation, and users are encouraged to refer to the readme file for use constraints that apply to a small number of countries. Because the data are compiled from multiple sources, the date range for road network representations ranges from the 1980s to 2010 depending on the country (most countries have no confirmed date), and spatial accuracy varies. The baseline global data set was compiled by the Information Technology Outreach Services (ITOS) of the University of Georgia. Updated data for 27 countries and 6 smaller geographic entities were assembled by Columbia University's Center for International Earth Science Information Network (CIESIN), with a focus largely on developing countries with the poorest data coverage.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset combines multimodal biosignals and eye-tracking information gathered under a human-computer interaction framework. The dataset was developed within the MAMEM project, which aims to endow people with motor disabilities with the ability to edit and author multimedia content through mental commands and gaze activity. The dataset includes EEG, eye-tracking, and physiological (GSR and heart rate) signals, along with demographic, clinical and behavioral data collected from 36 individuals (18 able-bodied and 18 motor-impaired). Data were collected during interaction with a specifically designed interface for web browsing and multimedia content manipulation, and during imaginary movement tasks. Alongside these data we also include evaluation reports from both the subjects and the experimenters concerning the experimental procedure and the collected dataset. We believe that the presented dataset will contribute towards the development and evaluation of modern human-computer interaction systems that would foster the integration of people with severe motor impairments back into society.

Please use the following citation: Nikolopoulos, Spiros, Georgiadis, Kostas, Kalaganis, Fotis, Liaros, Georgios, Lazarou, Ioulietta, Adam, Katerina, Papazoglou-Chalikias, Anastasios, Chatzilari, Elisavet, Oikonomou, Vangelis P., Petrantonakis, Panagiotis C., Kompatsiaris, Ioannis, Kumar, Chandan, Menges, Raphael, Staab, Steffen, Müller, Daniel, Sengupta, Korok, Bostantjopoulou, Sevasti, Katsarou, Zoe, Zeilig, Gabi, Plotnik, Meir, Gottlieb, Amihai, Fountoukidou, Sofia, Ham, Jaap, Athanasiou, Dimitrios, Mariakaki, Agnes, Comanducci, Dario, Sabatini, Edoardo, Nistico, Walter & Plank, Markus. (2017). The MAMEM Project - A dataset for multimodal human-computer interaction using biosignals and eye tracking information. Zenodo. http://doi.org/10.5281/zenodo.834154

Read/analyze the data using the following software: https://github.com/MAMEM/eeg-processing-toolbox
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please use the MESINESP2 corpus (from the second edition of the shared task) instead, since it has a higher level of curation and quality and is organized by document type (scientific articles, patents and clinical trials).
Introduction
The Mesinesp (Spanish BioASQ track, see https://temu.bsc.es/mesinesp) development set has a total of 750 records indexed manually by seven experienced medical literature indexers. Indexing is done using DeCS codes, a sort of Spanish equivalent to MeSH terms. Records were distributed so that each article was annotated by at least two different human indexers.
The data annotation process consisted of two steps:
Manual indexing step. DeCS codes were manually assigned to each record following the DeCS manual indexing guidelines.
Manual validation and consensus. The joint set of manually indexed DeCS codes generated by both indexers was manually revised and corrected.
Inter-annotator agreement on these annotations was measured using the Jaccard index.
Records consist mainly of medical literature abstracts and titles from the IBECS and LILACS databases.
Zip structure
The zip file contains two different development sets:
Official development set, which has the union of the annotations, with an agreement of macro = 0.6568 and micro = 0.6819. This set is composed of all the different (unique) DeCS codes that have been added by any annotator for each document; and
Core-descriptors development set, which has the intersection of the annotations, with an agreement of macro = 1.0 and micro = 1.0. This set is composed of the common DeCS codes that have been added by two or more annotators for each document.
Corpus format
Each dataset is a JSON object with a single key named "articles", which contains a list of documents. The raw format of the file is one line per document, plus two additional lines (the first and the last) that enclose the list of documents. The expected type of each field is as follows:
{"articles":[ {"abstractText":str,"db":str,"decsCodes":list,"id":str,"journal":str,"title":str,"year":int}, ... ]}
To clarify, the order of appearance of the fields in each document is as follows (note that this example is pretty-printed for readability):
{ "articles": [ { "abstractText": "Content of the abstract", "db": "Name of the source database", "decsCodes": [ "code1", "code2", "code3" ], "id": "Id of the document", "journal": "Name of the journal", "title": "Title of the document", "year": 2019 } ] }
Note: The fields "db", "journal" and "year" might be null.
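A minimal sketch for reading one of these development sets in Python follows; the file name development_set.json is a placeholder for whichever of the two JSON files you extract from the zip.

```python
import json
from collections import Counter

# Placeholder file name; use the actual JSON file extracted from the zip.
with open("development_set.json", encoding="utf-8") as f:
    articles = json.load(f)["articles"]

print(len(articles), "documents")

# Count the most frequently assigned DeCS codes.
code_counts = Counter(code for doc in articles for code in doc["decsCodes"])
print(code_counts.most_common(10))

# Remember: "db", "journal" and "year" may be null (None in Python).
```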
Copyright (c) 2020 Secretaría de Estado de Digitalización e Inteligencia Artificial
The United States Census Bureau’s International Dataset provides estimates of country populations since 1950 and projections through 2050. Specifically, the data set includes midyear population figures broken down by age and gender assignment at birth. Additionally, they provide time-series data for attributes including fertility rates, birth rates, death rates, and migration rates.
The full documentation is available here. For basic field details, please see the data dictionary.
Note: The U.S. Census Bureau provides estimates and projections for countries and areas that are recognized by the U.S. Department of State that have a population of at least 5,000.
This dataset was created by the United States Census Bureau.
Which countries have made the largest improvements in life expectancy? Based on current trends, how long will it take each country to catch up to today’s best performers?
You can use Kernels to analyze, share, and discuss this data on Kaggle, but if you’re looking for real-time updates and bigger data, check out the data on BigQuery, too: https://cloud.google.com/bigquery/public-data/international-census.
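As a hedged sketch of the BigQuery route, the snippet below queries midyear population figures; the table name bigquery-public-data.census_bureau_international.midyear_population and its column names are assumptions based on the public dataset's usual layout, so verify them in the BigQuery console before relying on them.

```python
from google.cloud import bigquery  # requires Google Cloud credentials

client = bigquery.Client()

# Assumed table and column names; check them in the BigQuery console.
query = """
    SELECT country_name, year, midyear_population
    FROM `bigquery-public-data.census_bureau_international.midyear_population`
    WHERE country_name = 'Japan' AND year BETWEEN 1950 AND 2050
    ORDER BY year
"""
for row in client.query(query).result():
    print(row.country_name, row.year, row.midyear_population)
```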
The National Child Development Study (NCDS) is a continuing longitudinal study that seeks to follow the lives of all those living in Great Britain who were born in one particular week in 1958. The aim of the study is to improve understanding of the factors affecting human development over the whole lifespan.
The NCDS has its origins in the Perinatal Mortality Survey (PMS) (the original PMS study is held at the UK Data Archive under SN 2137). This study was sponsored by the National Birthday Trust Fund and designed to examine the social and obstetric factors associated with stillbirth and death in early infancy among the 17,000 children born in England, Scotland and Wales in that one week. Selected data from the PMS form NCDS sweep 0, held alongside NCDS sweeps 1-3, under SN 5565.
Survey and Biomeasures Data (GN 33004):
To date there have been ten attempts to trace all members of the birth cohort in order to monitor their physical, educational and social development. The first three sweeps were carried out by the National Children's Bureau, in 1965, when respondents were aged 7, in 1969, aged 11, and in 1974, aged 16 (these sweeps form NCDS1-3, held together with NCDS0 under SN 5565). The fourth sweep, also carried out by the National Children's Bureau, was conducted in 1981, when respondents were aged 23 (held under SN 5566). In 1985 the NCDS moved to the Social Statistics Research Unit (SSRU) - now known as the Centre for Longitudinal Studies (CLS). The fifth sweep was carried out in 1991, when respondents were aged 33 (held under SN 5567). For the sixth sweep, conducted in 1999-2000, when respondents were aged 42 (NCDS6, held under SN 5578), fieldwork was combined with the 1999-2000 wave of the 1970 Birth Cohort Study (BCS70), which was also conducted by CLS (and held under GN 33229). The seventh sweep was conducted in 2004-2005 when the respondents were aged 46 (held under SN 5579), the eighth sweep was conducted in 2008-2009 when respondents were aged 50 (held under SN 6137), the ninth sweep was conducted in 2013 when respondents were aged 55 (held under SN 7669), and the tenth sweep was conducted in 2020-24 when the respondents were aged 60-64 (held under SN 9412).
A Secure Access version of the NCDS is available under SN 9413, containing detailed sensitive variables not available under Safeguarded access (currently only sweep 10 data). Variables include uncommon health conditions (including age at diagnosis), full employment codes and income/finance details, and specific life circumstances (e.g. pregnancy details, year/age of emigration from GB).
Four separate datasets covering responses to NCDS over all sweeps are available. National Child Development Deaths Dataset: Special Licence Access (SN 7717) covers deaths; National Child Development Study Response and Outcomes Dataset (SN 5560) covers all other responses and outcomes; National Child Development Study: Partnership Histories (SN 6940) includes data on live-in relationships; and National Child Development Study: Activity Histories (SN 6942) covers work and non-work activities. Users are advised to order these studies alongside the other waves of NCDS.
From 2002-2004, a Biomedical Survey was completed and is available under Safeguarded Licence (SN 8731) and Special Licence (SL) (SN 5594). Proteomics analyses of blood samples are available under SL SN 9254.
Linked Geographical Data (GN 33497):
A number of geographical variables are available, under more restrictive access conditions, which can be linked to the NCDS EUL and SL access studies.
Linked Administrative Data (GN 33396):
A number of linked administrative datasets are available, under more restrictive access conditions, which can be linked to the NCDS EUL and SL access studies. These include a Deaths dataset (SN 7717) available under SL and the Linked Health Administrative Datasets (SN 8697) available under Secure Access.
Multi-omics Data and Risk Scores Data (GN 33592)
Proteomics analyses were run on the blood samples collected from NCDS participants in 2002-2004 and are available under SL SN 9254. Metabolomics analyses were conducted on respondents of sweep 10 and are available under SL SN 9411. Polygenic indices are available under SL SN 9439. Derived summary scores have been created that combine the estimated effects of many different genes on a specific trait or characteristic, such as a person's risk of Alzheimer's disease, asthma, substance abuse, or mental health disorders, for example. These scores can be combined with existing survey data to offer a more nuanced understanding of how cohort members' outcomes may be shaped.
Additional Sub-Studies (GN 33562):
In addition to the main NCDS sweeps, further studies have also been conducted on a range of subjects such as parent migration, unemployment, behavioural studies and respondent essays. The full list of NCDS studies available from the UK Data Service can be found on the NCDS series access data webpage.
How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
For information on how to access biomedical data from NCDS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.
Further information about the full NCDS series can be found on the Centre for Longitudinal Studies website.
NCDS6:
The sixth NCDS sweep took place in 1999-2000, when cohort members were aged 41-42 years. Fieldwork was combined with the 29-year follow-up for the 1970 British Cohort Study (BCS70), also conducted by CLS.

SN 5578 supersedes the former combined NCDS6/BCS70 1999-2000 dataset, which was held under SN 4396, National Child Development Study and 1970 British Cohort Study (BCS70) Follow-ups, 1999-2000. The Centre for Longitudinal Studies updated the first six waves of NCDS in late 2006 and, as part of this work, separated the composite NCDS6/BCS70 dataset. Improvements made include further data cleaning and the addition of new documentation. Users who have previously obtained SN 4396 should no longer use it, and should completely replace it with this one. The BCS70 component of SN 4396 is now held separately under SN 5558, 1970 British Cohort Study: Twenty-Nine-Year Follow-up, 1999-2000.
Latest edition information
For the third edition (November 2024), 14 new variables have been added. These variables correspond to truncated ICD-10 codes, limited to the first letter, derived from free-text responses regarding general health issues, kidney and bladder conditions, and long-standing illnesses. In addition, a small number of variables have been removed as a result of a disclosure review.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created in 2025 by the CATReloaded team in the Data Science Circle at Mansoura University, Faculty of Engineering, Egypt.
The dataset was originally prepared as the supporting material for a pandas practice notebook. That notebook was designed as a practical task to follow Corey Schafer’s YouTube pandas course.
The goal was to create a comprehensive pandas challenge that includes almost every skill you might need when working with pandas. The idea is that you can save the code and revisit it later whenever you need a reference.
This dataset and task are aimed at:
- Anyone just starting with pandas
- Learners who want a structured challenge to test and refresh their skills
- People looking for a practice task they can build on, enhance, or adapt
👉 Link to Notebook: https://www.kaggle.com/code/seifhafez/pandas-exercise/edit
The task may contain non-beginner-friendly questions, so don’t worry if they take some time.
I plan to provide solutions/answers when I have free time to write them down.
If anyone from the community shares model answers, I’ll be very grateful. I will gladly give credit and mention those contributions so others can benefit from them too.
You are welcome to design new tasks or variations using this dataset or notebook, as long as credit is kept to the CATReloaded team.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Major Development Sites in York. For further information about major development sites, please visit the City of York Council website. Please note that the data published within this dataset is a live API link to CYC's GIS server; any changes made to the master copy of the data will be immediately reflected in the resources of this dataset. The date shown in the "Last Updated" field of each GIS resource reflects when the data was first published.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book series. It has 1 row and is filtered where the book series is Property development. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 3 rows and is filtered where the book subject is Understanding development & learning : implications for teaching. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Open access or shared research data must comply with HIPAA patient privacy regulations. These regulations require the de-identification of datasets before they can be placed in the public domain. The process of image de-identification is time-consuming, requires significant human resources, and is prone to human error. Automated image de-identification algorithms have been developed, but the research community requires some method of evaluation before such tools can be widely accepted. This evaluation requires a robust dataset that can be used as part of an evaluation process for de-identification algorithms.
We developed a DICOM dataset that can be used to evaluate the performance of de-identification algorithms. DICOM image information objects were selected from datasets published in TCIA. Synthetic Protected Health Information (PHI) was generated and inserted into selected DICOM data elements to mimic typical clinical imaging exams. The evaluation dataset was de-identified by a TCIA curation team using standard TCIA tools and procedures. We are publishing the evaluation dataset (containing synthetic PHI) and de-identified evaluation dataset (result of TCIA curation) in advance of a potential competition, sponsored by the National Cancer Institute (NCI), for de-identification algorithm evaluation, and de-identification of medical image datasets. The evaluation dataset published here is a subset of a larger evaluation dataset that was created under contract for the National Cancer Institute. This subset is being published to allow researchers to test their de-identification algorithms and promote standardized procedures for validating automated de-identification.
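As a minimal sketch of how one might inspect the synthetic PHI in this evaluation set (illustrative only; TCIA's own curation tools and workflow are not reproduced here), the snippet below uses pydicom to print a few identity-bearing data elements from one file; the file path is a placeholder.

```python
import pydicom

# Placeholder path to one DICOM file from the evaluation dataset.
ds = pydicom.dcmread("evaluation/series1/000001.dcm")

# Data elements that commonly carry PHI and that de-identification must handle.
for keyword in ["PatientName", "PatientID", "PatientBirthDate",
                "InstitutionName", "StudyDate", "AccessionNumber"]:
    print(keyword, "=", ds.get(keyword, "<absent>"))
```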
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This new dataset was established based on the MIMIC-III dataset, an openly available database developed by the Laboratory for Computational Physiology at the Massachusetts Institute of Technology (MIT), which consists of data from more than 25,000 patients who were admitted to the Beth Israel Deaconess Medical Center (BIDMC) since 2003 and whose records have been de-identified for information safety. Here, we identified patients who were diagnosed with pelvic, acetabular, or combined pelvic and acetabular fractures according to ICD-9 codes and who survived at least 72 hours after ICU admission. All data within the first 72 hours following ICU admission were collected and extracted from the MIMIC-III clinical database, version 1.4.
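As a hedged illustration of the cohort-selection step (a sketch against the standard MIMIC-III v1.4 CSV tables, not the authors' published code), the snippet below filters DIAGNOSES_ICD for pelvic/acetabular fracture codes and keeps ICU stays of at least 72 hours; the ICD-9 808 prefix covers pelvic fractures (808.0 is the acetabulum), but the exact code list the authors used is not stated here.

```python
import pandas as pd

# Standard MIMIC-III v1.4 tables (CSV export); paths are placeholders.
dx = pd.read_csv("DIAGNOSES_ICD.csv")
icu = pd.read_csv("ICUSTAYS.csv", parse_dates=["INTIME", "OUTTIME"])

# ICD-9 808.x = fracture of pelvis; assumed code set (codes stored without dots).
fracture = dx[dx["ICD9_CODE"].astype(str).str.startswith("808")]

# Keep ICU stays lasting at least 72 hours.
icu["los_hours"] = (icu["OUTTIME"] - icu["INTIME"]).dt.total_seconds() / 3600
cohort = icu[icu["los_hours"] >= 72].merge(
    fracture[["SUBJECT_ID", "HADM_ID"]].drop_duplicates(),
    on=["SUBJECT_ID", "HADM_ID"],
)
print(cohort["SUBJECT_ID"].nunique(), "patients in cohort")
```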
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The images in this dataset were taken from different angles and heights with the help of a DJI Phantom 3 model drone from a 30-acre field in Tekirdağ / Köseilyas, which was planted on April 30, 2020 and harvested on August 31, 2020. The images collected at different times of the day, on 43 separate days, at average intervals of 2-3 days, were filed separately. A total of 6465 images with a resolution of 2250 x 4000 were obtained.
All images collected on different days and filed separately were combined into a single folder in the order of the day they were taken. When the images are examined, the entire development process from the sprouting of the plants can be easily observed. The dimensions of the original images are quite high (2250 x 4000 pixels) and since they are reduced by about one tenth to fit the input of the network during training, a lot of data is lost. In addition, the success of deep convolutional neural networks depends on having a large amount of data. If there is not enough data, the network may overfit and memorize the data instead of learning it. Therefore, it is thought that using the original images divided into 6 equal parts will reduce data loss and increase the size of the data set. In addition, the divided images increase the diversity in the data set due to perspective differences.
Although the land where the study was conducted has a smooth (homogeneous) structure, there are development differences in some parts. These differences are seen more especially in the early stages and at the beginning of flowering. For example, while plants have started to emerge from the soil in one part of a divided image, no plants may have sprouted in another part. Or, similarly, while no flowers are seen in one part of a divided image, flowering may have started in another part. Therefore, not all parts of an original image are included in the same class, and each class in the data set is created from individually selected parts.
When all images are divided into 6 parts, a total of 38790 images with a size of 1333x1125 pixels are obtained. From these images, those that can be clearly distinguished by eye are selected and 8 separate classes are created. In this case, a dataset consisting of 1600 images in each class and 12800 images in total was created. All images were reduced to 224x224 pixels to be suitable for the input of CNN networks.
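A minimal sketch of the tiling-and-resizing step described above (an illustrative reconstruction, not the authors' code) could look like this with Pillow, splitting each 4000 x 2250 image into a 2 x 3 grid of roughly 1333 x 1125 tiles and resizing each tile to 224 x 224:

```python
from PIL import Image

def split_into_six(path):
    """Split one field image into a 2x3 grid of tiles and
    resize each tile to 224x224 for CNN input."""
    img = Image.open(path)
    w, h = img.size  # expected 4000 x 2250
    tiles = []
    for row in range(2):
        for col in range(3):
            box = (col * w // 3, row * h // 2,
                   (col + 1) * w // 3, (row + 1) * h // 2)
            tiles.append(img.crop(box).resize((224, 224)))
    return tiles

# Placeholder file name for illustration.
for i, tile in enumerate(split_into_six("field_day01_001.jpg")):
    tile.save(f"tile_{i}.png")
```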
The description of each class is as follows.
1st class: Images were taken from the first emergence from the soil (cotyledon) to the 4-5 leaf stage.
2nd class: Images were taken from the 5-6 leaf stage to the 10-11 leaf stage. The distances between the plants have decreased and overlapping has begun. The plant rows have begun to become clear.
3rd class: Images were taken from the 11-12 leaf stage to the formation of the flower head. The soil is almost completely covered with plants. Vivid green tones dominate.
4th class: Flower beds have begun to open. The flowering process in the middle of the bed has begun. When viewed from above, yellow ray flowers have begun to be seen. Flower heads are upright.
5th class: The flowering process on the bed is complete or close to complete. Flower beds have begun to bend. Yellow ray flowers continue to be seen.
6th class: Flowering is complete, flower beds are completely bent. Yellow ray flowers have almost completely fallen off. Green leaves have started to fade.
7th class: The back of the beds have turned light yellow. Plants can be seen separately. Green leaves have completely faded, soil ground has started to be seen.
8th class: Physiological maturity is complete. Flower beds and bracts have turned brown. Suitable for harvest.
You can access the details and the article about the study from the link below. https://dergipark.org.tr/tr/pub/gazimmfd/issue/82783/1200615
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
M-Vet Livestock Dataset is an open-access dataset created by the M-Vet project (www.m-vet.net), based at the Makerere University Artificial Intelligence & Data Science Research Lab (www.air.ug) and supported by the Lacuna Fund. It is aimed at supporting machine learning models for image classification tasks in livestock. This dataset contains about 18,000 images of different animal types, including cows, goats, and pigs, collected from various farms and regions and annotated for animal type with corresponding classes. The dataset is designed to facilitate research and development in livestock management, particularly in animal classification tasks using computer vision. It is a valuable resource for researchers, developers, and agricultural stakeholders looking to innovate in animal health monitoring and diagnostics. It is available on GitHub for public use.
The dataset consists of nine subfolders (0001 to 0009), each containing three directories: labels, images, and data. Each image has a corresponding .txt file containing its annotations.
For example, given the image:
M-Vet_Livestock-Dataset-main/0001/images/e0b206bf-ee2f-4d6a-bdbe-ea29d70402aa7725714119516389484_jpg.rf.31cf1c70bbd1a46bfb83404ffe9414dc.jpg
The corresponding label file: M-Vet_Livestock-Dataset-main/0001/labels/e0b206bf-ee2f-4d6a-bdbe-ea29d70402aa7725714119516389484_jpg.rf.31cf1c70bbd1a46bfb83404ffe9414dc.txt
Contains the following annotation: 0 1 0.07107843124999999 0.311274509375 0.07107843124999999 0.311274509375 0.51992034375 1 0.51992034375 1 0.07107843124999999
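The annotation appears to follow a normalized, YOLO-style segmentation layout: a class index followed by (x, y) coordinate pairs scaled to [0, 1], with the polygon closed by repeating the first vertex (that reading is an inference from the sample line, not documented here). A minimal parsing sketch under that assumption:

```python
def parse_label_file(path, img_w, img_h):
    """Parse an M-Vet label file, assuming YOLO-style segmentation lines:
    a class index followed by normalized (x, y) polygon coordinates."""
    annotations = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            cls = int(parts[0])
            coords = [float(v) for v in parts[1:]]
            # Pair up coordinates and rescale to pixel space.
            polygon = [(x * img_w, y * img_h)
                       for x, y in zip(coords[0::2], coords[1::2])]
            annotations.append((cls, polygon))
    return annotations

# Placeholder file name and image size for illustration.
print(parse_label_file("0001/labels/example.txt", img_w=640, img_h=640))
```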
Acknowledgement: The dataset was created by the M-Vet project (www.m-vet.net), led by Daniel Mutembesa (www.linkedin.com/in/mutembesa-daniel-447452165), in collaboration with 162 expert and rural-based veterinarians and a network of over 1,500 livestock farmers in Uganda, the National Livestock Resources and Research Institute (https://naro.go.ug/naris/nalirri/), Veterinarians Without Borders (https://vetswithoutbordersus.org), and the Research Consortium on African Swine Fever at Makerere University.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Bengali Character Recognition Dataset (BCRD) is a comprehensive collection of images designed for training and evaluating machine learning models focused on the recognition of Bengali characters. The dataset contains 1000 images per character and covers all the basic characters used in the Bengali script, including vowels, consonants, and special symbols. The images in this dataset have been generated using the Noto Sans Bengali font (NotoSansBengali-Regular.ttf), which is a widely used and highly legible font for Bengali text. This ensures that the dataset represents standard, clean text representations in Bengali, making it ideal for character recognition tasks.

Dataset Details:
- Total Characters: The dataset includes both vowels and consonants, as well as special symbols from the Bengali alphabet.
- Number of Images: For each character, there are 1000 images, ensuring a diverse set of images for training deep learning models.
- Image Format: All images are stored in PNG format for high quality and clarity.
- Resolution: Each image has a resolution of 200x200 pixels.
- Font Used: The images are generated using the NotoSansBengali-Regular.ttf font, a standard font for Bengali text representation, which ensures consistency and legibility.
- Language: Bengali, which is the primary language of the Bengali-speaking community in Bangladesh and India.

Characters Included:
1. Vowels (Sorbonno): অ, আ, ই, ঈ, উ, ঊ, ঋ, এ, ঐ, ও, ঔ
2. Consonants (Bengali Consonants): ক, খ, গ, ঘ, ঙ, চ, ছ, জ, ঝ, ঞ, ট, ঠ, ড, ঢ, ণ, ত, থ, দ, ধ, ন, প, ফ, ব, ভ, ম, য, র, ল, শ, ষ, স, হ, ড়, ঢ়, য়
3. Special Symbols: ৎ, ং, ঃ, ঁ

Purpose: The BCRD is designed to enable and enhance the development of models that can automatically recognize Bengali characters in a variety of real-world applications, including but not limited to:
- Optical Character Recognition (OCR) for Bengali text
- Handwriting recognition
- Text classification tasks involving Bengali script
- Linguistic research on the Bengali writing system

Applications:
- OCR Systems: This dataset is perfect for training OCR models for Bengali text extraction from scanned documents or images.
- AI & Machine Learning: It can be used for various AI tasks, including supervised learning, classification, and model evaluation.
- Language Processing: Researchers working on Bengali language processing can use this dataset for training recognition and generation models.

Dataset Usage:
- Academic Research: This dataset is an excellent resource for students and researchers working on character recognition and natural language processing in Bengali.
- Deep Learning Models: The dataset is suitable for training Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), or other state-of-the-art models for Bengali character recognition.
- Open Source Projects: It can be utilized by developers and open-source projects aiming to build Bengali OCR systems or handwriting recognition tools.

Note for researchers using the dataset:
This dataset was created by Shuvo Kumar Basak. If you use this dataset for your research or academic purposes, please ensure to cite this dataset appropriately. If you have published your research using this dataset, please share a link to your paper. Good Luck.
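As a hedged sketch of how such font-rendered character images could be produced with Pillow (an illustrative reconstruction, since the generation script is not published with this description), the snippet below renders one character at 200x200 using the named font; the font file must be available locally, and the centering logic is an assumption.

```python
from PIL import Image, ImageDraw, ImageFont

def render_character(char, out_path, size=200):
    """Render one Bengali character as a size x size PNG,
    roughly centered, using Noto Sans Bengali."""
    font = ImageFont.truetype("NotoSansBengali-Regular.ttf", int(size * 0.6))
    img = Image.new("L", (size, size), color=255)  # white background
    draw = ImageDraw.Draw(img)
    # Measure the glyph's bounding box to center it on the canvas.
    left, top, right, bottom = draw.textbbox((0, 0), char, font=font)
    x = (size - (right - left)) / 2 - left
    y = (size - (bottom - top)) / 2 - top
    draw.text((x, y), char, font=font, fill=0)  # black glyph
    img.save(out_path)

render_character("ক", "ka.png")
```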