https://www.gnu.org/licenses/gpl-3.0-standalone.html
This dataset has been collected and annotated by Terms of Service; Didn't Read (ToS;DR), an independent project aimed at analyzing and summarizing the terms of service and privacy policies of various online services. ToS;DR helps users understand the legal agreements they accept when using online platforms by categorizing and evaluating specific cases related to these policies.
The dataset includes structured information on individual cases, broader topics, specific services, detailed documents, and key points extracted from legal texts.
Cases refer to individual legal cases or specific issues related to the terms of service or privacy policies of a particular online service. Each case typically focuses on a specific aspect of a service's terms, such as data collection, user rights, content ownership, or security practices.
Topics are general categories or themes that encompass various cases. They help organize and group similar cases together based on the type of issues they address. For example, "Data Collection" could be a topic that includes cases related to how a service collects and uses user data.
Services represent specific online platforms, websites, or applications that have their own terms of service and privacy policies.
Points are individual statements or aspects within a case that highlight important information about a service's terms of service or privacy policy. These points can be positive (e.g., strong privacy protections) or negative (e.g., data sharing with third parties).
Documents refer to the original terms of service and privacy policies of the services analyzed on ToS;DR. These documents are the source of information for the cases, points, and ratings provided on the platform. ToS;DR links to the actual documents, so users can review the full details if they choose to.
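To make the relationships concrete, here is a purely illustrative sketch in Python of how these entities might relate to one another; the field names, IDs, and values are hypothetical and do not reflect the actual ToS;DR export schema.

```python
# Illustrative only: hypothetical records showing how services, topics,
# cases, points, and documents relate. Field names and IDs are invented,
# not the actual ToS;DR schema.
service = {"id": 182, "name": "ExampleService", "rating": "C"}
topic = {"id": 7, "title": "Data Collection"}
case = {
    "id": 331,
    "topic_id": topic["id"],                     # a case belongs to a topic
    "title": "This service shares your data with third parties",
    "classification": "bad",
}
document = {"id": 412, "service_id": service["id"], "name": "Privacy Policy"}
point = {
    "id": 9154,
    "case_id": case["id"],                       # a point instantiates a case...
    "service_id": service["id"],                 # ...for a specific service...
    "document_id": document["id"],               # ...citing a specific document
    "quote": "We may share information with our partners...",
}
```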
A merge of multiple existing Kaggle datasets for convenience's sake, with a few quality-of-life tweaks and minor cleanup:
* Class names are cleaned (no spaces; only ISO basic Latin alphabet and - characters) and combined across the different component datasets
* Images are grouped by class into directories
* [v2 and Beyond] Only JPG(/JPEG) and PNG filetypes
* [v2 and Beyond] Only images with uniform, square aspect ratios (via the original dataset creators!)
* [v2 and Beyond] Images are cropped, centered, and verified by hand (via the original dataset creators!)
File naming conventions vary across the component datasets:
* Some images are named `pokemonXX.jpg` and are all formatted as JPG files.
* Some images are named `xxxxxxxx.jpg`, where the name is either a number (which is not unique across the dataset, only within the class it belongs to) or a hash-like string (such as `0a3a642e700b4153b115e0f645d273f1.jpg`). These files are formatted as JPG (named both `.jpg` and `.jpeg`) and PNG. (SVG files are present in the original dataset, but removed from this one.)
* Some images are named `NNN-XXXPokemon[_Descriptor].png` or `NNN_HOMEXXX.png`, where `NNN` is a unique-per-class 3-digit number, `XXX` is the Pokemon's number, `Pokemon` is the Pokemon's name, and `[_Descriptor]` is an optional tag describing more about the image or its context (the square brackets `[]` are not in the actual filenames). These files are formatted as PNG.
* Some images are named `pokemonXX.jpg` and are all formatted as JPG files, or `xxxxxxxx.jpg` (either a number that is not unique across the dataset, only within its class, or a hash-like string such as `0a3a642e700b4153b115e0f645d273f1.jpg`); these files are formatted as JPG (named both `.jpg` and `.jpeg`), PNG, GIF, and SVG.

The license for this dataset as a whole is a bit complicated, as it is:
* Comprised of images created by many different artists, gathered from across the internet and therefore not created by any of the dataset creators.
* A collection of 3-4 other datasets, whose original authors have all specified various types of licenses.

As of January 2024 (and the versions of these datasets gathered at this time):
* "11945 Pokemon from First Gen" by Unexpected Scepticism is labeled as CC BY-SA 4.0
* "7,000 Labeled Pokemon" by Lance Zhang is labeled as "Data files © Original Authors"
* Its parent dataset, which was previously used in v1 of this dataset, ["Pokemon Generation One"](https://www.kaggle.com/datasets/thedagger/pokemon-generation-o...
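Purely as an illustration of the PNG naming conventions described above, a small parser sketch follows; the field names and example filenames are hypothetical, and real filenames in the dataset may differ in detail.

```python
# Hypothetical sketch: parse the NNN-XXXPokemon[_Descriptor].png and
# NNN_HOMEXXX.png conventions described above. Example filenames are invented.
import re

PATTERNS = [
    # e.g. "001-001Bulbasaur_Shiny.png"
    re.compile(r"^(?P<class_id>\d{3})-(?P<dex_no>\d{3})(?P<name>[A-Za-z-]+?)(?:_(?P<descriptor>[^.]+))?\.png$"),
    # e.g. "001_HOME001.png"
    re.compile(r"^(?P<class_id>\d{3})_HOME(?P<dex_no>\d{3})\.png$"),
]

def parse_filename(filename: str):
    """Return the fields encoded in a PNG filename, or None if no pattern matches."""
    for pattern in PATTERNS:
        match = pattern.match(filename)
        if match:
            return match.groupdict()
    return None

print(parse_filename("001-001Bulbasaur_Shiny.png"))
# {'class_id': '001', 'dex_no': '001', 'name': 'Bulbasaur', 'descriptor': 'Shiny'}
print(parse_filename("001_HOME001.png"))
# {'class_id': '001', 'dex_no': '001'}
```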
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is designed for object detection in grape cultivation environments. The primary task is to identify and annotate grapes and packing materials used during grape harvesting.
Grapes are clusters of small, round fruits growing on vines. They are typically seen in tight bunches, each bunch containing several individual fruits. The clusters are often attached to a stem and are distinguishable by their spherical shape and the way they group together.
Packing refers to the materials used around grape clusters for protection or organization. These may include bags or paper-like sheets and often appear alongside or in proximity to grape clusters.
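As a purely hypothetical illustration of the two-class detection task (the dataset's actual annotation format is not specified in this description), the classes and a sample annotation record might be represented like this:

```python
# Hypothetical sketch only: the two object classes and an example annotation
# with pixel-coordinate bounding boxes (x1, y1, x2, y2). Values are invented.
CLASSES = {0: "grape", 1: "packing"}

sample_annotations = [
    {"class_id": 0, "bbox": [120, 80, 310, 260]},   # a grape cluster
    {"class_id": 1, "bbox": [300, 60, 520, 240]},   # packing material nearby
]
```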
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1 Introduction
There are several works based on Natural Language Processing applied to newspaper reports. Rameshbhai et al. [1] mined opinions from headlines using Stanford NLP and SVM, comparing several algorithms on small and large datasets. Rubin et al., in their paper [2], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types; the purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. [3] implemented LDA, a topic modeling approach, to study bias present in online news media.
However, there has not been much NLP research invested in studying COVID-19. Most applications involve the classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight it and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [9]. To the best of our knowledge, the NKK dataset is the first study on a comparatively large dataset of newspaper reports on COVID-19, contributing to awareness of the virus.
2 Data-set Introduction
2.1 Data Collection
We accumulated 1,000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We have named this collection “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named it “Covid-News-BD-NNK”. The newspapers include The Daily Star (BD) and Prothom Alo (BD). All of these newspapers are among the top providers and most widely read in their respective countries. The collection was done manually by 10 human data collectors of age group 23- with university degrees. This manual approach was preferable to automation in ensuring the news reports were highly relevant to the subject: the newspapers' online sites had dynamic content with advertisements in no particular order, so automated scrapers would have had a high chance of collecting inaccurate news reports. One challenge while collecting the data was the requirement of a subscription; each newspaper required $1 per subscription. Some criteria for collecting the news reports, provided as guidelines to the human data collectors, were as follows:
The headline must have one or more words directly or indirectly related to COVID-19.
The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.
The genre of the news can be anything as long as it is relevant to the topic. Political, social, and economic genres are to be prioritized.
Avoid taking duplicate reports.
Maintain a time frame for the above mentioned newspapers.
To collect these data, we used a Google Form for both the USA and BD collections. Two human editors went through each entry to check for spam or troll entries.
2.2 Data Pre-processing and Statistics
Some pre-processing steps performed on the newspaper report dataset are as follows:
Remove hyperlinks.
Remove non-English alphanumeric characters.
Remove stop words.
Lemmatize text.
While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for the presence of the above-mentioned criteria.
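For illustration only (the authors' actual script is not reproduced here), the four listed steps could be sketched with NLTK roughly as follows:

```python
# Hedged sketch of the listed pre-processing steps: remove hyperlinks,
# remove non-English alphanumeric characters, remove stop words, lemmatize.
# The authors' actual script may differ.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)        # remove hyperlinks
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)                # keep English alphanumerics
    tokens = [t.lower() for t in text.split()]
    tokens = [t for t in tokens if t not in STOP_WORDS]        # remove stop words
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)   # lemmatize
```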
The primary data statistics of the two datasets are shown in Tables 1 and 2.
Table 1: Covid-News-USA-NNK data statistics
No. of words per headline: 7 to 20
No. of words per body content: 150 to 2100
Table 2: Covid-News-BD-NNK data statistics
No. of words per headline: 10 to 20
No. of words per body content: 100 to 1500
2.3 Dataset Repository
We used GitHub as our primary data repository under the account name NKK^1. Here, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.
3 Literature Review
Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.
Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and the machine translation used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms solely for Natural Language Processing tasks. The two most trending ones are BERT [14], which uses a bidirectional transformer encoder architecture that can do near-perfect classification tasks and next-word prediction, and the GPT-3 models released by OpenAI [15], which can generate almost human-like text. However, these are all pre-trained models, since they carry a huge computation cost. Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc. [16]. Information extraction in texts could be identifying named entities, locations, or other essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic modeling method is Latent Dirichlet Allocation (LDA) [17].
Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [18] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and picks the words with the highest weights.
Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.
4 Our Experiments and Result Analysis
We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2, and 3, we can note the following (a word cloud generation sketch follows this list):
In February, both newspapers talked about China and the source of the outbreak.
StarTribune emphasized Minnesota as the most concerned state; in April, this concern appeared to grow.
Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations, and markets.
The Washington Post discussed global issues more than StarTribune.
In February, StarTribune mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.
While both newspapers mentioned the outbreak in China in February, the spread in the United States is highlighted more from March through May, displaying the critical impact caused by the virus.
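A minimal sketch of the word cloud step, using the wordcloud library mentioned above; the input file and output name are placeholders, not the authors' actual filenames.

```python
# Hedged sketch: build one monthly word cloud with the wordcloud library.
# "covid_news_usa_feb.txt" and the output name are placeholder filenames.
from wordcloud import WordCloud

with open("covid_news_usa_feb.txt", encoding="utf-8") as f:
    february_text = f.read()

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate(february_text)
wc.to_file("wordcloud_feb.png")
```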
We used a script to extract all numbers related to certain keywords like 'Deaths', 'Infected', 'Died', 'Infections', 'Quarantined', 'Lock-down', 'Diagnosed', etc. from the news reports and compiled a case-count series for both newspapers. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for COVID cases, which rose gradually from February. Both newspapers clearly show that the rise in COVID cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack. We used VADER sentiment analysis to extract the sentiment of the headlines and the body. On average, the sentiments were from -0.5 to -0.9. The VADER sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact it caused. Moreover, sentiment analysis can also provide information about how a state or country is reacting to the pandemic. We used the PageRank algorithm to extract keywords from headlines as well as the body content. PageRank efficiently highlights important, relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: 'China', 'Government', 'Masks', 'Economy', 'Crisis', 'Theft', 'Stock market', 'Jobs', 'Election', 'Missteps', 'Health', 'Response'. Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
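As a hedged illustration of the sentiment step (the headline below is invented, not from the dataset), scoring with the vaderSentiment package looks roughly like this:

```python
# Hedged sketch of VADER sentiment scoring as described above.
# Requires: pip install vaderSentiment. The example headline is invented.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
headline = "Infections surge as hospitals struggle to cope with the outbreak"
scores = analyzer.polarity_scores(headline)
print(scores)  # dict with 'neg', 'neu', 'pos', and 'compound' scores
# 'compound' ranges from -1 (highly negative) to 1 (highly positive).
```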
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository holds the data and models created by training and testing a Hybrid Deep Learning model whose results are published in the Conference Paper "To Whom are You Talking? A DL model to Endow Social Robots with Addressee Estimation Skills" presented at the International Joint Conference on Neural Networks (IJCNN) 2023. https://ieeexplore.ieee.org/document/10191452
OA version: http://doi.org/10.48550/arXiv.2308.10757/
Addressee Estimation is the ability to understand to whom a person is directing an utterance. This ability is crucial for social robots engaging in multi-party interaction to understand the basic dynamics of social communication.
In this project, we trained a DL model composed of convolutional layers and LSTM cells and taking as input visual information of the speaker to estimate the placement of the addressee. We used a supervised learning approach. The data to train our model were taken from the Vernissage Corpus, a dataset collected in multi-party Human-Robot Interaction from the robot's sensors. For the original HRI corpus, see http://vernissage.humavips.eu/
This repository contains the /labels used for the training, the /models resulting from the training, and the /results, i.e., the files containing the results of the tests.
The code used to obtain these data can be found at http://doi.org/10.5281/zenodo.10709858
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Python script used to examine how the marketing of properties explains neighborhood racial and income change, using historical public remarks in real estate listings from Multiple Listing Services (MLS) collected and curated by CoreLogic.

The primary dataset used for this research consists of 158,253 geocoded real estate listings for single-family homes in Mecklenburg County, North Carolina between 2001 and 2020. The historical MLS data, which include public remarks, are proprietary and can be obtained through a purchase agreement with CoreLogic. The MLS is not publicly available and is only available to members of the National Association of Realtors. Public remarks for homes currently listed for sale can be collected from online real estate websites such as Zillow, Trulia, Realtor.com, Redfin, and others.

Since we cannot share this data, users need to, before running the script provided here, run the script provided by Nilsson and Delmelle (2023), which can be accessed here: https://doi.org/10.6084/m9.figshare.20493012.v1. This is in order to get a fabricated/mock dataset of classified listings called classes_mock.csv. The article associated with Nilsson and Delmelle's (2023) script can be accessed here: https://www.tandfonline.com/doi/abs/10.1080/13658816.2023.2209803

The user can then run the code together with the data provided here to estimate the threshold models, together with data derived from the publicly available HMDA data. To compile a historical data set of loan/application records (LAR) for the user's own study area, the user will need to download data from the following websites:
https://ffiec.cfpb.gov/data-publication/snapshot-national-loan-level-dataset/2022 (2017-forward)
https://www.ffiec.gov/hmda/hmdaproducts.htm (2007-2016)
https://catalog.archives.gov/search-within/2456161?limit=20&levelOfDescription=fileUnit&sort=naId:asc (for data prior to 2007)
Huddle is a collaboration platform that provides USAID with a secure environment where the Agency can manage, organize and work together on all of their content. Huddle workspaces are central repositories for all types of files to be saved and accessed by USAID offices around the globe. In workspaces, USAID users will manage project tasks, files and workflows. Huddle's file management features enable users to upload multiple files from their desktop, create a folder structure and share files with their team members through the platform. Users can share and comment on files, and direct the comments to specific team members. When edits to a file are required, users will open the file in its native application directly in the platform, make changes, and a new version will be automatically saved to the workspace. The editing feature provides users with all of the familiar features and functionality of the native application without leaving Huddle. Files are locked when they are opened for editing so there is no confusion about which user has made changes to a version. All content stored on Huddle has access permission settings so USAID can ensure that the right documents are visible and being shared with the appropriate users.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This poster is for the SciDataCon 2023 poster exhibition.
Nowadays, research teams everywhere face challenges in better managing research data for cross-domain collaboration and long-term use. Research teams are often diverse in their composition in terms of application domains, computational resources, research methods, and lab practices, just to name a few. To overcome these differences, we believe that it is essential to foster a culture of sharing experiences and ideas about research data management planning among and within the teams. By doing so, we can navigate around common barriers as well as grow data expertise together.
In this poster, we report on a joint effort between a research data repository (the depositar; https://data.depositar.io/) and a biodiversity information facility (TaiBIF; https://portal.taibif.tw/) in engaging with local research communities in fostering good data management practices. The depositar is a data repository open to researchers worldwide for the deposit, discovery, and reuse of datasets. TaiBIF (Taiwan Biodiversity Information Facility) builds essential information infrastructures and promotes the openness and integration of biodiversity data. Both teams are based in Academia Sinica, Taiwan. TaiBIF has been organizing workshops in Taiwan for the management, mobilization, application, and integration of biodiversity information. In the past years, the depositar team has been taking part in TaiBIF workshops to organize hands-on courses on writing Data Management Plans (DMPs). These workshops offer training and guidance to help researchers acquire practical skills in research data management. The course activities are designed to encourage workshop participants not only to draft DMPs but also to engage in the peer review of their draft DMPs. As a result, we empower the workshop participants to take ownership of their data management practices and contribute to the overall improvement of their data management skills.
Our templates for drafting and reviewing DMPs are derived from Science Europe's Practical Guide to the International Alignment of Research Data Management (extended edition). We have created online instructional materials where participants can simulate the process of writing DMPs based on their own research projects. Furthermore, we facilitate peer review activities in small groups by means of the DMP evaluation criteria listed in Science Europe's guide. The entire process is conducted through open sharing, allowing participants to learn from each other and to share data management practices within their knowledge domains. Subsequently, we select outstanding DMPs from these exercises, which serve as examples and discussion points for future workshops. This approach allows us to increase the availability of data management solutions that are closely aligned with specific domains. It also fosters a friendly environment that encourages researchers to organize, share, and improve upon their data management planning skills.
Reference
Science Europe. (2021). Practical Guide to the International Alignment of Research Data Management - Extended Edition. (W. W. Tu & C. H. Wang & C. J. Lee & T. R. Chuang & M. S. Ho, Trans.). https://pid.depositar.io/ark:37281/k516v4d6w
https://spdx.org/licenses/CC0-1.0.html
Rising sea levels (SLR) will cause coastal groundwater to rise in many coastal urban environments. Inundation of contaminated soils by groundwater rise (GWR) will alter the physical, biological, and geochemical conditions that influence the fate and transport of existing contaminants. These transformed products can be more toxic and/or more mobile under future conditions driven by SLR and GWR. We reviewed the vulnerability of contaminated sites to GWR in a US national database and in a case comparison with the San Francisco Bay region to estimate the risk of rising groundwater to human and ecosystem health. The results show that 326 sites in the US Superfund program may be vulnerable to changes in groundwater depth or flow direction as a result of SLR, representing 18.1 million hectares of contaminated land. In the San Francisco Bay Area, we found that GWR is predicted to impact twice as much coastal land area as inundation from SLR alone, and 5,297 state-managed sites of contamination may be vulnerable to inundation from GWR in a 1-meter SLR scenario. Increases of only a few centimeters of elevation can mobilize soil contaminants, alter flow directions in a heterogeneous urban environment with underground pipes and utility trenches, and result in new exposure pathways. Pumping for flood protection will elevate the salt water interface, changing groundwater salinity and mobilizing metals in soil. Socially vulnerable communities are more exposed to this risk at both the national scale and in a regional comparison with the San Francisco Bay Area.

Methods

This data set includes data from the California State Water Resources Control Board (WRCB), the California Department of Toxic Substances Control (DTSC), the USGS, the US EPA, and the US Census.

National Assessment Data Processing: For this portion of the project, ArcGIS Pro and RStudio software applications were used. Data processing for Superfund site contaminants in the text and supplementary materials was done in RStudio using the R programming language. RStudio and R were also used to clean population data from the American Community Survey. Packages used include dplyr, data.table, and tidyverse to clean and organize data from the EPA and ACS. ArcGIS Pro was used to compute spatial data regarding sites in the risk zone and vulnerable populations. DEM data processed for each state removed any elevation data above 10 m, keeping anything 10 m and below. The Intersection tool was used to identify Superfund sites within the 10 m sea level rise risk zone. The Calculate Geometry tool was used to calculate the area within each coastal state occupied by the 10 m SLR zone, and again to calculate the area of each Superfund site. Summary Statistics were used to generate the total proportion of Superfund site surface area / 10 m SLR area for each state. To generate population estimates of socially vulnerable households in proximity to Superfund sites, we followed methods similar to those of Carter and Kalman (2020). First, we generated buffers at 1 km, 3 km, and 5 km distances from Superfund sites. Then, using Tabulate Intersection, the estimated population of each census block group within each buffer zone was calculated. Summary Statistics were used to generate total numbers for each state.

Bay Area Data Processing: In this regional study, we compared the groundwater elevation projections by Befus et al. (2020) to a combined dataset of contaminated sites that we built from two separate databases (EnviroStor and GeoTracker) maintained by two independent agencies of the State of California (DTSC and WRCB). We used ArcGIS to manage both the groundwater surfaces (as raster files) from Befus et al. (2020) and the State's point datasets of street addresses for contaminated sites. We used SF BCDC (2020) as the source of social vulnerability rankings for census blocks, using block shapefiles from the US Census (ACS) dataset. In addition, we generated isolines that represent the magnitude of change in groundwater elevation in specific sea level rise scenarios. We compared these isolines of change in elevation to the USGS geological map of the San Francisco Bay region and noted that groundwater is predicted to rise farther inland where Holocene paleochannels meet artificial fill near the shoreline. We also used maps of historic baylands (altered by dikes and fill) from the San Francisco Estuary Institute (SFEI) to identify the number of contaminated sites over rising groundwater that are located on former mudflats and tidal marshes.

The contaminated sites data from the California State Water Resources Control Board (WRCB) and the Department of Toxic Substances Control (DTSC) were clipped to our study area of nine Bay Area counties. The study area does not include the ocean shorelines or the North Bay delta area because water system dynamics differ in deltas. The data were cleaned of any duplicates within each dataset using the Find Identical and Delete Identical tools. Then duplicates between the two datasets were removed by running the Intersect tool on the DTSC and WRCB point data. We chose this method over searching for duplicates by name because some sites change names when management is transferred from DTSC to WRCB. Lastly, the datasets were sorted into open and closed sites based on the DTSC and WRCB classifications, which are shown in a table in the paper's supplemental material.

To calculate areas of rising groundwater, we used data from the USGS paper "Projected groundwater head for coastal California using present-day and future sea-level rise scenarios" by Befus, K. M., Barnard, P., Hoover, D. J., & Erikson, L. (2020). We used the hydraulic conductivity of 1 condition (Kh1) to calculate areas of rising groundwater. We used the Raster Calculator to subtract the existing groundwater head from the groundwater head under a 1-meter sea level rise scenario to find the areas where groundwater is rising. Using the Reclass Raster tool, we reclassified the data to give every cell with a value of 0.1016 meters (4 inches) or greater a value of 1. We chose 0.1016 meters because a groundwater rise of even that amount can leach into pipes and infrastructure. We then used the Raster to Poly tool to generate polygons of areas of groundwater rise.
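The raster differencing and reclassification step described above could be sketched outside ArcGIS roughly as follows, using rasterio; the filenames are placeholders and nodata handling is omitted.

```python
# Hedged sketch of the groundwater-rise calculation: subtract present-day
# head from the 1-meter SLR head and flag cells rising >= 0.1016 m (4 inches).
# Filenames are placeholders; nodata handling is omitted for brevity.
import rasterio

with rasterio.open("gw_head_present_kh1.tif") as src:
    present = src.read(1)
    profile = src.profile

with rasterio.open("gw_head_slr1m_kh1.tif") as src:
    future = src.read(1)

rise = future - present                     # change in groundwater head (m)
mask = (rise >= 0.1016).astype("uint8")     # 1 where rise >= 4 inches, else 0

profile.update(dtype="uint8", count=1, nodata=0)
with rasterio.open("gw_rise_ge_4in.tif", "w", **profile) as dst:
    dst.write(mask, 1)
```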
https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/cc-by/cc-by_f24dc630aa52ab8c52a0ac85c03bc35e0abc850b4d7453bdc083535b41d5a5c3.pdf
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis.

Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product.

ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system, which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread.

ERA5 is updated daily with a latency of about 5 days (monthly means are available around the 6th of each month). If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later; in that case users are notified. So far this has only been the case for the month of September 2021, while it will also be the case for October, November and December 2021. For months prior to September 2021 the final release has always been equal to ERA5T, and the goal is to align the two again after December 2021.

The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper air fields) and single levels (atmospheric, ocean-wave and land surface quantities). The present entry is "ERA5 monthly mean data on pressure levels from 1940 to present".
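As an illustrative sketch (not an official snippet), a request for this entry through the CDS API might look like the following; the chosen variable, pressure level, dates, and output filename are placeholder values.

```python
# Hedged sketch: retrieve "ERA5 monthly mean data on pressure levels" via the
# CDS API (requires a registered account and a configured ~/.cdsapirc).
# Variable, level, dates, and output filename below are placeholders.
import cdsapi

client = cdsapi.Client()
client.retrieve(
    "reanalysis-era5-pressure-levels-monthly-means",
    {
        "product_type": "monthly_averaged_reanalysis",
        "variable": "temperature",
        "pressure_level": "500",
        "year": "2020",
        "month": "01",
        "time": "00:00",
        "format": "netcdf",
    },
    "era5_t500_2020-01.nc",
)
```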
IPUMS-International is an effort to inventory, preserve, harmonize, and disseminate census microdata from around the world. The project has collected the world's largest archive of publicly available census samples. The data are coded and documented consistently across countries and over time to facilitate comparative research. IPUMS-International makes these data available to qualified researchers free of charge through a web dissemination system.
The IPUMS project is a collaboration of the Minnesota Population Center, National Statistical Offices, and international data archives. Major funding is provided by the U.S. National Science Foundation and the Demographic and Behavioral Sciences Branch of the National Institute of Child Health and Human Development. Additional support is provided by the University of Minnesota Office of the Vice President for Research, the Minnesota Population Center, and Sun Microsystems.
National coverage
Household
UNITS IDENTIFIED:
- Dwellings: No
- Vacant units: No
- Households: Yes
- Individuals: Yes
- Group quarters: Yes (institutional)
UNIT DESCRIPTIONS:
- Dwellings: Not available
- Households: An individual or group of people who inhabit part or all of the physical or census building, usually live together, who eat from one kitchen or organize daily needs together as one unit.
- Group quarters: A special household includes people living in dormitories, barracks, or institutions in which daily needs are under the responsibility of a foundation or other organization. Also includes groups of people in lodging houses or buildings, where the total number of lodgers is ten or more.
All population residing in the geographic area of Indonesia regardless of residence status. Diplomats and their families residing in Indonesia were excluded.
Census/enumeration data [cen]
MICRODATA SOURCE: Statistics Indonesia
SAMPLE DESIGN: Geographically stratified systematic sample (drawn by MPC).
SAMPLE UNIT: Household
SAMPLE FRACTION: 10%
SAMPLE SIZE (person records): 20,112,539
Face-to-face [f2f]
L1 questionnaire for buildings and households; L2 questionnaire for permanent residents; and L3 questionnaire for non-permanent residents (boat people, homeless persons, etc).
Round 1 of the Afrobarometer survey was conducted from July 1999 through June 2001 in 12 African countries, to solicit public opinion on democracy, governance, markets, and national identity. The full 12-country dataset released was pieced together out of different projects: Round 1 of the Afrobarometer survey, the old Southern African Democracy Barometer, and similar surveys done in West and East Africa.
The 7-country dataset is a subset of the Round 1 survey dataset, and consists of a combined dataset for the 7 Southern African countries surveyed with other African countries in Round 1, 1999-2000 (Botswana, Lesotho, Malawi, Namibia, South Africa, Zambia and Zimbabwe). It is a useful dataset because, in contrast to the full 12-country Round 1 dataset, all countries in this dataset were surveyed with the identical questionnaire.
Botswana, Lesotho, Malawi, Namibia, South Africa, Zambia, Zimbabwe
Basic units of analysis that the study investigates include: individuals and groups
Sample survey data [ssd]
A new sample has to be drawn for each round of Afrobarometer surveys. Whereas the standard sample size for Round 3 surveys will be 1200 cases, a larger sample size will be required in societies that are extremely heterogeneous (such as South Africa and Nigeria), where the sample size will be increased to 2400. Other adaptations may be necessary within some countries to account for the varying quality of the census data or the availability of census maps.
The sample is designed as a representative cross-section of all citizens of voting age in a given country. The goal is to give every adult citizen an equal and known chance of selection for interview. We strive to reach this objective by (a) strictly applying random selection methods at every stage of sampling and by (b) applying sampling with probability proportionate to population size wherever possible. A randomly selected sample of 1200 cases allows inferences to national adult populations with a margin of sampling error of no more than plus or minus 2.5 percent with a confidence level of 95 percent. If the sample size is increased to 2400, the confidence interval shrinks to plus or minus 2 percent.
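For intuition, the textbook simple-random-sample approximation of the margin of error (p = 0.5, 95% confidence) gives figures in the same range as those quoted; the clustered, stratified Afrobarometer design will differ somewhat from this idealized calculation.

```python
# Hedged sketch: simple-random-sample margin of error at 95% confidence.
# Real Afrobarometer samples are clustered and stratified, so quoted figures
# may differ slightly from this idealized approximation.
from math import sqrt

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    return z * sqrt(p * (1 - p) / n)

print(f"n=1200: +/-{margin_of_error(1200):.1%}")  # roughly +/-2.8%
print(f"n=2400: +/-{margin_of_error(2400):.1%}")  # roughly +/-2.0%
```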
Sample Universe
The sample universe for Afrobarometer surveys includes all citizens of voting age within the country. In other words, we exclude anyone who is not a citizen and anyone who has not attained this age (usually 18 years) on the day of the survey. Also excluded are areas determined to be either inaccessible or not relevant to the study, such as those experiencing armed conflict or natural disasters, as well as national parks and game reserves. As a matter of practice, we have also excluded people living in institutionalized settings, such as students in dormitories and persons in prisons or nursing homes.
What to do about areas experiencing political unrest? On the one hand we want to include them because they are politically important. On the other hand, we want to avoid stretching out the fieldwork over many months while we wait for the situation to settle down. It was agreed at the 2002 Cape Town Planning Workshop that it is difficult to come up with a general rule that will fit all imaginable circumstances. We will therefore make judgments on a case-by-case basis on whether or not to proceed with fieldwork or to exclude or substitute areas of conflict. National Partners are requested to consult Core Partners on any major delays, exclusions or substitutions of this sort.
Sample Design
The sample design is a clustered, stratified, multi-stage, area probability sample.
To repeat the main sampling principle, the objective of the design is to give every sample element (i.e. adult citizen) an equal and known chance of being chosen for inclusion in the sample. We strive to reach this objective by (a) strictly applying random selection methods at every stage of sampling and by (b) applying sampling with probability proportionate to population size wherever possible.
In a series of stages, geographically defined sampling units of decreasing size are selected. To ensure that the sample is representative, the probability of selection at various stages is adjusted as follows:
The sample is stratified by key social characteristics in the population such as sub-national area (e.g. region/province) and residential locality (urban or rural). The area stratification reduces the likelihood that distinctive ethnic or language groups are left out of the sample. And the urban/rural stratification is a means to make sure that these localities are represented in their correct proportions. Wherever possible, and always in the first stage of sampling, random sampling is conducted with probability proportionate to population size (PPPS). The purpose is to guarantee that larger (i.e., more populated) geographical units have a proportionally greater probability of being chosen into the sample. The sampling design has four stages:
A first-stage to stratify and randomly select primary sampling units;
A second-stage to randomly select sampling start-points;
A third stage to randomly choose households;
A final-stage involving the random selection of individual respondents
We shall deal with each of these stages in turn.
STAGE ONE: Selection of Primary Sampling Units (PSUs)
The primary sampling units (PSU's) are the smallest, well-defined geographic units for which reliable population data are available. In most countries, these will be Census Enumeration Areas (or EAs). Most national census data and maps are broken down to the EA level. In the text that follows we will use the acronyms PSU and EA interchangeably because, when census data are employed, they refer to the same unit.
We strongly recommend that NIs use official national census data as the sampling frame for Afrobarometer surveys. Where recent or reliable census data are not available, NIs are asked to inform the relevant Core Partner before they substitute any other demographic data. Where the census is out of date, NIs should consult a demographer to obtain the best possible estimates of population growth rates. These should be applied to the outdated census data in order to make projections of population figures for the year of the survey. It is important to bear in mind that population growth rates vary by area (region) and (especially) between rural and urban localities. Therefore, any projected census data should include adjustments to take such variations into account.
Indeed, we urge NIs to establish collegial working relationships with professionals in the national census bureau, not only to obtain the most recent census data, projections, and maps, but also to gain access to sampling expertise. NIs may even commission a census statistician to draw the sample to Afrobarometer specifications, provided that provision for this service has been made in the survey budget.
Regardless of who draws the sample, the NIs should thoroughly acquaint themselves with the strengths and weaknesses of the available census data and the availability and quality of EA maps. The country and methodology reports should cite the exact census data used, its known shortcomings, if any, and any projections made from the data. At minimum, the NI must know the size of the population and the urban/rural population divide in each region in order to specify how to distribute population and PSU's in the first stage of sampling. National investigators should obtain this written data before they attempt to stratify the sample.
Once this data is obtained, the sample population (either 1200 or 2400) should be stratified, first by area (region/province) and then by residential locality (urban or rural). In each case, the proportion of the sample in each locality in each region should be the same as its proportion in the national population as indicated by the updated census figures.
Having stratified the sample, it is then possible to determine how many PSU's should be selected for the country as a whole, for each region, and for each urban or rural locality.
The total number of PSU's to be selected for the whole country is determined by calculating the maximum degree of clustering of interviews one can accept in any PSU. Because PSUs (which are usually geographically small EAs) tend to be socially homogenous we do not want to select too many people in any one place. Thus, the Afrobarometer has established a standard of no more than 8 interviews per PSU. For a sample size of 1200, the sample must therefore contain 150 PSUs/EAs (1200 divided by 8). For a sample size of 2400, there must be 300 PSUs/EAs.
These PSUs should then be allocated proportionally to the urban and rural localities within each regional stratum of the sample. Let's take a couple of examples from a country with a sample size of 1200. If the urban locality of Region X in this country constitutes 10 percent of the current national population, then the sample for this stratum should be 15 PSUs (calculated as 10 percent of 150 PSUs). If the rural population of Region Y constitutes 4 percent of the current national population, then the sample for this stratum should be 6 PSU's.
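A minimal sketch of the allocation arithmetic described above, in Python; the strata shares below are the hypothetical figures used in the examples, not real census data.

```python
# 8 interviews per PSU, so a 1200-case sample needs 150 PSUs (300 for 2400 cases).
total_psus = 1200 // 8

# Hypothetical strata shares of the national adult population (from the text's examples).
strata_shares = {
    "Region X, urban": 0.10,
    "Region Y, rural": 0.04,
}

for stratum, share in strata_shares.items():
    print(f"{stratum}: {round(share * total_psus)} PSUs")
# Region X, urban: 15 PSUs; Region Y, rural: 6 PSUs
```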
The next step is to select particular PSUs/EAs using random methods. Using the above example of the rural localities in Region Y, let us say that you need to pick 6 sample EAs out of a census list that contains a total of 240 rural EAs in Region Y. But which 6? If the EAs created by the national census bureau are of equal or roughly equal population size, then selection is relatively straightforward. Just number all EAs consecutively, then make six selections using a table of random numbers. This procedure, known as simple random sampling (SRS), will give each of the 240 EAs an equal and known chance of selection.
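A minimal sketch of such a simple random selection in Python (the seed is fixed only so the example is reproducible; the protocol itself uses a table of random numbers):

```python
import random

rng = random.Random(12345)               # any seed; fixed here only for reproducibility
rural_y_eas = list(range(1, 241))        # the 240 rural EAs in Region Y, numbered consecutively
sample_eas = rng.sample(rural_y_eas, 6)  # equal-probability selection without replacement
print(sorted(sample_eas))
```

Where EAs differ substantially in population size, the PPPS principle stated earlier applies instead. The following sketch uses the standard cumulative-size (systematic) method; the EA names and populations are made up for illustration and do not come from any census list.

```python
import random

eas = [("EA-01", 900), ("EA-02", 450), ("EA-03", 1500), ("EA-04", 650), ("EA-05", 1100)]
n_select = 2
total_pop = sum(pop for _, pop in eas)
interval = total_pop / n_select
start = random.uniform(0, interval)
targets = [start + k * interval for k in range(n_select)]

selected, cumulative, i = [], 0, 0
for name, pop in eas:
    cumulative += pop
    while i < n_select and targets[i] <= cumulative:
        selected.append(name)        # larger EAs cover more of the cumulative range,
        i += 1                       # so they are proportionally more likely to be hit
print(selected)
```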
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The experiment in which this dataset was collected was conducted on Tuesday, September 22nd, 2015, from approximately 4:00-4:45 PM in the Osgoode Woodlot area at York University. The temperature of the area was around 22°C with a clear, sunny sky, but because the experiment took place in the woodlot, the tree canopies blocked most of the light, leaving the site very dimly lit and cooler overall. The experiment was done together with Maleeha, Laiba and Rachel. There were 4 datasets to complete in the experiment, and each group member was assigned one dataset to analyze further; I was assigned dataset #3 with the help of my partner, Maleeha. The experiment started at the edge of the Osgoode Woodlot, and we walked in a straight line towards the centre of the woodlot. For every adult tree encountered (defined as at least twice the observer's height, around 340 cm), we measured the distance from it to the closest adult tree nearby with the transect tape, in metres, at the observer's breast height (around 140 cm), thereby forming a pair (replicate). We also measured the diameter at breast height (dbh) of the paired trees using the transect tape, in centimetres, at the observer's breast height (around 140 cm), and recorded the condition of each tree on a 0-2 rating (0 = dead, 1 = living, 2 = huge green canopy). This procedure was repeated 9 more times to obtain a total of 10 pairings (replicates). The data were recorded by hand during the experiment and then transferred onto Microsoft Excel to organize the dataset further. The experiment allows students to learn how to identify plant species and how to collect datasets with different sampling tools such as the quadrat and transect.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Information: This online replication package corresponds to our qualitative study understanding the modeling issues and tool-related issues (of popular modeling tools) that modelers face. The icpc2022_db.sql file contains the dataset of discussions from Stack Overflow, Eclipse forums, and Matlab forums associated with MDSE. Each table in the database is the extracted information from one of the data sources. The other files contain the links to the posts that were analyzed through our random sampling as part of the qualitative analysis. We have also provided the visualization of the taxonomy for the research questions.

Important Replication Details of the Qualitative Study: To analyze the data, we performed an open-coding process with multi-coding to address the two research questions (i.e., the modeling-related issues and the tool-related issues). Since we performed open coding, the codes were not predefined prior to the analysis. Open coding is an inductive process, where in our case the codes represent the underlying issue or difficulty that the modeler experienced in the post. Since it was multi-coding, if a post was not related to one of the research questions, the judge (i.e., one of the authors of this work) was instructed to use "N/A" for the code. To reduce the bias of a single judge determining the code of a post, each post was examined by at least two judges. We utilized a custom online tool to randomly assign the posts in our dataset (the SQL dump in this replication package) such that each post had two judges. The tool displayed the original title of the post, the body of the post, and a link to the entire discussion thread; it provided free-form text fields for the authors to enter a code for each of the aspects investigated (i.e., each research question) and validated the input to ensure codes were not left unfilled. We performed this analysis iteratively and met to discuss the codes to ensure consistency between coding sessions. Additionally, existing codes were merged when applicable to avoid redundant (i.e., semantically equivalent) codes. In each iteration, a sample of 200 posts was coded by each judge, resulting in 4 iterations. In the last iteration, we observed that no new codes were introduced. A final round of coding was performed to address conflicts in the coding between two judges: for each post with a conflict, a third judge was assigned to the post to resolve the disagreement. We computed Fleiss' kappa to assess inter-rater agreement (kappa = 0.83 for our first research question and kappa = 0.46 for our second research question). To generate a taxonomy from this qualitative open-coding process, we performed card-sorting to organize the codes into higher-level categories. The codes were iteratively clustered into groups, which represent a higher-level abstraction of the codes. We performed this card-sorting until all the codes were assigned and we converged on a common set of clusters. The number of clusters was not predefined, but was determined systematically through the iterative card-sorting process. Subsequently, we organized these groups hierarchically where there was a relationship between two or more groups. It is important to note that this process was done independently for each research question. In this replication package, we provide the two taxonomies as well as the combined taxonomy displayed in our paper.
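For readers replicating the agreement analysis, Fleiss' kappa can be computed with statsmodels; the toy ratings matrix below is purely illustrative (the code labels and values are made up), and the study's own analysis scripts are those shipped in the replication package.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per post, one column per judge; labels are hypothetical example codes.
ratings = np.array([
    ["tool-install", "tool-install"],
    ["model-semantics", "model-semantics"],
    ["N/A", "tool-install"],
    ["model-semantics", "model-semantics"],
    ["tool-install", "N/A"],
])

table, _ = aggregate_raters(ratings)   # convert to per-post category counts
print(fleiss_kappa(table))
```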
To replicate the study, the data should be analyzed independently of our codes and taxonomy to determine whether the same taxonomy is generated for the research questions. Our paper provides more details on the taxonomy and implications of our work.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Fireshed Registry is a geospatial dashboard and decision tool built to organize information about wildfire transmission to buildings and monitor progress towards risk reduction for communities from management investments. The concept behind the Fireshed Registry is to identify and map the source of risk rather than what is at risk across all lands in the United States. While the Fireshed Registry was organized around mapping the source of fire risk to communities, the framework does not preclude the assessment of other resource management priorities and trends such as water, fish and aquatic or wildlife habitat, or recreation. The Fireshed Registry is also a multi-scale decision tool for quantifying, prioritizing, and geospatially displaying wildfire transmission to buildings in adjacent or nearby communities. Fireshed areas in the Fireshed Registry are approximately 250,000-acre accounting units that are delineated based on a smoothed building exposure map of the United States. These boundaries were created by dividing up the landscape into regular-sized units that represent similar source levels of community exposure to wildfire risk. Subfiresheds are approximately 25,000-acre accounting units nested within firesheds. Firesheds for the Conterminous U.S., Alaska, and Hawaii were generated in separate research efforts and are published in incremental versions in the Research Data Archive; they are combined here for ease of use. This record was taken from the USDA Enterprise Data Inventory that feeds into the https://data.gov catalog. Data for this record includes the following resources: ISO-19139 metadata, ArcGIS Hub Dataset, ArcGIS GeoService, OGC WMS, CSV, Shapefile, GeoJSON, and KML. For complete information, please visit https://data.gov.
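As an illustrative (not official) access example, the GeoJSON or Shapefile distribution can be loaded with geopandas once downloaded from the catalog; the local file name below is an assumption.

```python
import geopandas as gpd

# File name is hypothetical; use whatever name the downloaded export carries.
firesheds = gpd.read_file("firesheds_conus_ak_hi.geojson")

print(firesheds.crs)
print(len(firesheds), "fireshed polygons")
# Subfiresheds (~25,000-acre units) nest within firesheds (~250,000-acre units),
# so a shared fireshed identifier or a spatial join can relate the two layers.
```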
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
100 microarray and RNA-seq gene expression datasets from five model species (human, mouse, fruit fly, arabidopsis plants, and baker's yeast). These datasets represent the benchmark set that was used to test our clust clustering method and to compare it with seven widely used clustering methods (Cross-Clustering, k-means, self-organising maps, MCL, hierarchical clustering, CLICK, and WGCNA). This data resource includes raw data files, pre-processed data files, clustering results, clustering results evaluation, and scripts.
The files are split into eight zipped parts, 100Datasets_0.zip to 100Datasets_7.zip. The contents of all eight zipped files should be extracted to a single folder (e.g. 100Datasets).
Below is a thorough description of the files and folders in this data resource.
Scripts
The scripts used to apply each one of the clustering methods to each one of the 100 datasets and to evaluate their results are all included in the folder (scripts/).
Datasets and clustering results (folders starting with D)
The datasets are labelled as D001 to D100. Each dataset has two folders: D###/ and D###_Res/, where ### is the number of the dataset. The first folder only includes the raw dataset while the second folder includes the results of applying the clustering methods to that dataset. The files ending with _B.tsv include clustering results in the form of a partition matrix. The files ending with _E include metrics evaluating the clustering results. The files ending with _go and _go_E respectively include the enriched GO terms in the clustering results and evaluation metrics of these GO terms. The files ending with _REACTOME and _REACTOME_E are similar to the GO term files but for the REACTOME pathway enrichment analysis. Each of these D###_Res/ folders includes a sub-folder "ParamSweepClust" which includes the results of applying clust multiple times to the same dataset while sweeping some parameters.
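As a hedged example of navigating this layout, one dataset's partition matrix and evaluation metrics could be read as tab-separated files; the exact file names below (method prefix and extensions) are assumptions based on the naming scheme, not verified paths.

```python
import pandas as pd

dataset = "D001"

# Partition matrix produced by one method for this dataset (file name assumed).
partition = pd.read_csv(f"{dataset}_Res/clust_B.tsv", sep="\t")

# Evaluation metrics for the same result (extension assumed to be .tsv).
metrics = pd.read_csv(f"{dataset}_Res/clust_E.tsv", sep="\t")

print(partition.shape, metrics.shape)
```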
Large datasets analysis results
The folder LargeDatasets/ includes data and results for what we refer to as "large" datasets. These are 19 datasets that have more than 50 samples (including replicates) and were therefore not included in the set of 100 datasets, although they fit all of the other dataset selection criteria. We have compared clust with the other clustering methods over these datasets to demonstrate that clust still outperforms the other methods on larger datasets. This folder includes folders LD001/ to LD019/ and LD001_Res/ to LD019_Res/. These have a similar format and contents to the D###/ and D###_Res/ folders described above.
Simultaneous analysis of multiple datasets (folders starting with MD)
As our clust method is designed to extract clusters from multiple datasets simultaneously, we also tested it over multiple datasets. All folders starting with MD_ are related to the "multiple datasets" (MD) results. Each MD experiment simultaneously analyses d randomly selected datasets, either out of a set of 10 arabidopsis datasets or out of a set of 10 yeast datasets. For each of the two species, all d values from 2 to 10 were tested, and at each of these d values, 10 different runs were conducted, where at each run a different subset of d datasets was selected randomly.
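The combinatorics of this design can be summarised in a short sketch; the dataset identifiers are placeholders, not the actual file names.

```python
import random

pool = [f"dataset_{i:02d}" for i in range(1, 11)]   # 10 datasets for one species (placeholder names)

experiments = []
for d in range(2, 11):              # subset sizes d = 2 .. 10
    for run in range(1, 11):        # 10 random runs per subset size
        experiments.append((d, run, random.sample(pool, d)))

print(len(experiments), "runs per species")   # 9 subset sizes x 10 runs = 90
```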
The folders MD_10A and MD_10Y include the full sets of 10 arabidopsis or 10 yeast datasets, respectively. Each folder with the format MD_10#_d#_Res## includes the results of applying the eight clustering methods at one of the 10 random runs of one of the selected d values. For example, the "MD_10A_d4_Res03/" folder includes the clustering results of the 3rd random selection of 4 arabidopsis datasets (the letter A in the folder's name refers to arabidopsis).
Our clust method is applied directly over multiple datasets where each dataset is in a separate data file. Each "MD_10#_d#_Res##" folder includes these individual files in a sub-folder named "Processed_Data/". However, the other clustering methods only accept a single input data file. Therefore, the datasets are merged first before being submitted to these methods. Each "MD_10#_d#_Res##" folder includes a file "X_merged.tsv" for the merged data.
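A rough sketch of how the per-dataset files could be merged into a single table such as X_merged.tsv; the wildcard, file extension, and the choice to keep only genes present in all datasets are assumptions, and the authors' own merging is performed by the scripts in scripts/.

```python
import glob
import pandas as pd

folder = "MD_10A_d4_Res03"   # one example results folder

# Read each processed dataset (genes in rows), indexed by gene identifier.
frames = [pd.read_csv(path, sep="\t", index_col=0)
          for path in sorted(glob.glob(f"{folder}/Processed_Data/*.tsv"))]

# Join on the gene index; 'inner' keeps genes present in every dataset (an assumption).
merged = pd.concat(frames, axis=1, join="inner")
merged.to_csv(f"{folder}/X_merged.tsv", sep="\t")
```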
Evaluation metrics (folders starting with Metrics)
Each clustering results folder (D###_Res or MD_10#_d#_Res##) includes some clustering evaluation files ending with _E. This information is combined into tables for all datasets, and these tables appear in the folders starting with "Metrics_".
Other files and folders
The GO folder includes the reference GO term annotations for arabidopsis and yeast. Similarly, the REACTOME folder includes the reference REACTOME pathway annotations for arabidopsis and yeast. The Datasets file includes a TAB-delimited table describing the 100 datasets. The SearchCriterion file includes the objective methodology used to search the NCBI database and select these 100 datasets. The Specials file includes some special considerations for a couple of datasets that differ slightly from what is described in the SearchCriterion file. The Norm### files and the files in the Reps/ folder describe the normalisation codes and replicate structures of the datasets and were fed to the clust method as inputs. The Plots/ folder includes plots of the gene expression profiles of the individual genes in the clusters generated by each of the eight methods over each of the 100 datasets. Only up to 14 clusters per method are plotted.