“You are provided a selection of norms for English and German across a range of variables. The
norms rely on human judgements and/or semi-automatic extensions regarding degrees of concreteness, valence, arousal, imageability and further perception modalities. In addition, you are provided corpus-based frequency lists as well as distributional co-occurrence scores.
The goal of your project is to first analyse a subset of the norm data and then to explore whether
judgements are related across modalities and to corpus-based frequency and semantic diversity.

Task: Write a report about your findings. (Your report should be 5–8 pages long, excluding the bibliography.)”

(Unfortunately, the corpus frequency and distributional information files are too large to be uploaded here; I hope there is another way to provide them if someone takes on this assignment. It is a beginners' R course, so only basic plots and statistics should appear in the report, though you can still freely choose the variables. Basically, have some fun looking into the data!)

What you need to do:
 
Part I Data collection
Write a Python script that harvests the tweets of the three Twitter accounts the study focuses on. Get the contents of their tweets and when they were tweeted, and write that information into a data file (an .xlsx file, not a CSV file; see the documentation below). You will need to upload both the Python script and the data file that you generated as part of the assignment.
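Part I could be sketched roughly as follows. This is only an illustrative outline, assuming the Tweepy library against the v1.1 API: the credentials, account handles, and helper names (`tweets_to_rows`, `harvest`) are placeholders, not specified by the brief.

```python
# Illustrative harvesting sketch (Tweepy + pandas are assumptions, not
# requirements of the brief; handles and key strings below are placeholders).
import pandas as pd


def tweets_to_rows(tweets):
    """Convert tweet objects (with .user, .full_text, .created_at) to rows."""
    return [{"account": t.user.screen_name,
             "text": t.full_text,
             "created_at": t.created_at} for t in tweets]


def harvest(api, handles, per_account=200):
    """Fetch recent tweets for each handle and collect them in a DataFrame."""
    rows = []
    for handle in handles:
        # user_timeline returns the most recent tweets for one account
        tweets = api.user_timeline(screen_name=handle, count=per_account,
                                   tweet_mode="extended")
        rows.extend(tweets_to_rows(tweets))
    return pd.DataFrame(rows)


if __name__ == "__main__":
    import tweepy
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)
    df = harvest(api, ["account_one", "account_two", "account_three"])
    df.to_excel("twitterdata.xlsx", index=False)  # .xlsx, not CSV, as required
```

Note that `to_excel` requires the `openpyxl` package to be installed alongside pandas.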

Part II Data analysis
Analyse the file that is provided, named twitterdata.xlsx. The analysis will be a descriptive analysis of the tweets, in which you compare how the three accounts in focus have tweeted (and how that may have changed over time). This will require you to draw a random sample of 50 COVID tweets per account (see the additional documentation on how to do that).
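Drawing the 50-tweet sample can be done with pandas; a minimal sketch, assuming the file has `account` and `text` columns and using a hypothetical keyword filter to flag COVID tweets (the additional documentation may prescribe a different definition):

```python
import pandas as pd


def covid_sample(df, n=50, seed=42):
    """Filter COVID-related tweets, then draw n per account, reproducibly."""
    # Hypothetical keyword filter; adjust to the documented definition
    covid = df[df["text"].str.contains("covid|corona", case=False, na=False)]
    return covid.groupby("account").sample(n=n, random_state=seed)


if __name__ == "__main__":
    df = pd.read_excel("twitterdata.xlsx")
    sample = covid_sample(df)
    print(sample["account"].value_counts())  # should show 50 per account
```

If an account has fewer than 50 matching tweets, `sample` raises an error; handling for that case is left out of this sketch for brevity.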
Be creative in how you handle the analysis. You will have to upload the Python script in which you perform the analysis (data cleaning if necessary and analysis/visualisation).

Part III Report
Write a research report (1,500 to 2,000 words, all included), with the following sections:
1. Introduction section in which you contextualize the research and outline the research questions.
2. Methodology section in which you concisely explain the procedure: (1) how did you get the data (although you will work with the data that I provide, the procedure should be just the same as the one that you used; only the timeframe of data collection is wider, starting December 1st and lasting until April 18th), (2) what do the sample data look like (i.e., how many tweets, harvested in what period; this will be a description of the file that is available on Blackboard, not the data file you harvested yourself).
3. Results section in which you discuss the analysis: i.e., what did you do with what data, and what does that tell us.
4. Discussion section in which you explain how the results answer the initial research questions (i.e., what do the results mean). This is concluded by a reflection on the strengths and weaknesses of the research methodology (draw inspiration from the introduction lecture, as well as from the module on APIs).
Make sure that the report mentions your name and student number. There are no strict guidelines on how to format the document, except for the word count. However, make it look clean and professional in every possible way. A professionally typeset research article by a publisher such as Sage, Wiley-Blackwell, or Elsevier might inspire you.

In total, there are five files you need to upload, combined in a single compressed .zip file:

1. A Python file that harvests tweets and writes them into a data file (.py file)
2. The data file with the harvested tweets (.xlsx file)
3. A Python file with the data processing/analysis/visualisation (.py file)
4. The final version of the data file that you processed
5. A text document with the 1,500 to 2,000-word research report (.pdf)
Your project makes up 50% of your final grade.

What are you graded on?
 
1. Were you able to outline the relevance of the research question? (introduction section of the report)
2. Is the code that you wrote to harvest tweets valid and effective? (harvest file)
3. Were you able to clearly describe the procedure for how tweets were harvested? (methodology section of the report)
4. Were you able to transparently explain what you did with the data and what you analysed/visualised? (results section of the report)
5. Were you able to clean and format the given research data? (analysis file)
6. Is the analysis/visualisation that you performed sound and valid? (analysis file)
7. Does your discussion of the results make sense in answering the research questions? Are you able to pinpoint the strengths and weaknesses of the method (including whether analysing tweets is the right way to go…)? (discussion section of the report)
8. Is your writing tidy and clear? (entire report)
9. Is your document professionally formatted? (entire report)

1. Consider the training examples shown in Table 3.5, page 185 of the second edition of the textbook. Compute the Gini index for the overall collection of training examples, and then for the Customer ID, Gender, Car Type, and Shirt Size attributes. Which attribute is better: Gender, Car Type, or Shirt Size? Explain why Customer ID should not be used as the attribute test condition even though it has the lowest Gini index.
2. Repeat exercise (1) using entropy instead of the Gini index.
3. Use the outline of code we discussed in class to create a decision tree for the IrisDataSet which predicts the Type column using the other attributes. Create three versions of this tree: one using entropy, one using the Gini coefficient, and one using the classification error as splitting criteria. Use the first half of the data set as the training data and the second half as the test data. Provide the error rate for each tree.
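For exercises 1 and 2, the impurity measures can be computed directly; a minimal sketch (the actual class counts must be read off Table 3.5, so nothing below is specific to the textbook data):

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy: -sum p * log2(p) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_impurity(groups, measure):
    """Impurity of a split: size-weighted average over the attribute values."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * measure(g) for g in groups)
```

For example, an even two-class split gives `gini(["C0"] * 10 + ["C1"] * 10) == 0.5` and an entropy of 1 bit; the split with the lowest weighted impurity is the better attribute.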

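For exercise 3, a scikit-learn sketch covers the entropy and Gini versions (scikit-learn is a stand-in here: it loads its own copy of the iris data rather than the course's IrisDataSet, and it has no classification-error criterion, so that third tree must come from the code outline discussed in class):

```python
# Entropy and Gini trees on iris, first half train / second half test.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
half = len(X) // 2
X_train, y_train = X[:half], y[:half]   # first half: training data
X_test, y_test = X[half:], y[half:]     # second half: test data

for criterion in ("entropy", "gini"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    tree.fit(X_train, y_train)
    err = 1.0 - tree.score(X_test, y_test)  # error rate on the test half
    print(f"{criterion}: error rate = {err:.3f}")
```

Note that the iris rows are ordered by class, so this deliberate half-and-half split leaves one class unseen in training, which the error rates will reflect.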

An office supply store tests a telemarketing campaign on its existing business customers. The company targeted approximately 16,000 customers for the campaign. Assume you are a consultant brought on board to help the company leverage the findings from the tests to its advantage. Refer to the accompanying spreadsheet, which contains the results of the tests.
The detailed requirements and expected deliverables are given in Capstone Assignment.docx.
Three sample presentations are attached for reference. The data to be used are in the Excel file.

Research Report – Twitter Analysis and Presentation

The aim of this assignment is to collect Twitter data, summarise the data using a spreadsheet or other tool, and then write a report about that data. The purpose of the report is to investigate and discuss the use of Twitter analysis by researchers, brands or journalists (depending on your major). The report is not meant to be written as a public-facing report or feature, but rather an internal research report that might be used in a professional context or to inform your own practices.

You can choose to follow a group of people or a hashtag/hashtags over a period of time that will yield a reasonably sized data set (a few thousand tweets at least, up to a maximum of about 250 thousand, is the right size for this task; much bigger and Excel will struggle to open the file). Suitable targets could be hashtags for a TV show or media event, a new or defective product, a group of journalists attending a conference or the conference hashtag, a brand campaign, or a news event as it happens.

You may have to try a few different scenarios before you get some data you can use. For example, broad hashtags like christmas or happy are bad choices; corbyn (during PMQs) or MUFC (during a game) are probably better ones. Spend a little time exploring how hashtags are used together (co-hashtags), partly to make sure that you have all the relevant tags covered (e.g. ‘Manu’ as well as ‘MUFC’); this can be done with the Twitter advanced search page. You should write about this hashtag research as part of your reflection.

Once you have collected the tweets and profile data use this data set to discuss the following questions in your report. You can do more analysis than this, but these are expected as part of the report.

Required analysis

a) Who were the top tweeters and retweeters?

b) How many of your top tweeters are bots? (remove as many as possible from your data set before performing the rest of the analysis)

c) What was the top retweet, and what was the ratio of tweets to retweets in your data set?

d) What % of tweets/retweets in your data set came from the top 10 tweeters?

e) Use a word cloud or word tree of the most used words in your data set to show the type of language being used. Was the hashtag being used in conjunction with other hashtags?

f) Where do the tweets come from? What % are geocoded, and what % of profiles have a location?

g) Do the tweeters fall into any demographic groupings that you can see (look at follower and friend counts, total number of tweets, etc.)?

In addition to answering these questions you can perform other types of acquisition and/or analysis and you may be awarded extra marks for doing so.

Visualisations you should also include in your report:

Time series of your tweets over a suitable time unit

Word Cloud or Word Tree of language in popular retweets (or co-hashtag use)

Chart showing the % of tweets to retweets

Chart showing the % of tweets geocoded and the % of profiles with locations

Histogram of tweeters volumes (i.e. 1 person tweeted more than 100 times, 5 people tweeted 50-100 times, 50 people tweeted 10-50 times, 1,000 people tweeted 5-10 times etc)
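The tweeter-volume histogram is the least obvious chart; a pandas sketch, assuming a hypothetical `screen_name` column and bands like those in the example above:

```python
import pandas as pd


def volume_bins(df, col="screen_name"):
    """Count tweets per user, then count users per volume band."""
    per_user = df[col].value_counts()
    bands = pd.cut(per_user, bins=[0, 5, 10, 50, 100, float("inf")],
                   labels=["1-5", "5-10", "10-50", "50-100", "100+"])
    return bands.value_counts().sort_index()
```

Calling `volume_bins(df).plot(kind="bar")` then draws the histogram of how many people fall into each tweeting-volume band.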

The report should be ~1,500 words, done as a basic but well-styled HTML page that includes some visualisations to help illustrate your data. MSc students should attempt visualisations using a JavaScript library rather than iframe embeds. You should try to use a template system to start your page. As well as answering the questions above, your report should draw on some research on social media analytics and Twitter use in journalism, and consider how the types of analysis you have performed can be used in a professional context. Include references to the research material you used. You can also talk about the fifth estate and Twitter more generally and its effect on journalism and society, making reference to your own data where possible.

You should also supply a written ~500-word reflection. The reflection should consider the following points: Why did you settle on the hashtag(s) and timeframe that you did? What issues did you encounter in gathering and analysing the tweets, and how did you overcome them? How would you extend or improve your study given more time and/or resources? Include attributions for any code libraries or images used in your report.

Submission

The submission should consist of a single Word or RTF document that contains your reflection and a link to the online report. If you have used code in the acquisition or analysis of your tweets, you should also provide a link to a GitHub Gist for each script. You should add plenty of comments to this code to demonstrate your understanding of how it works.

Marking scheme

Marks will be allocated based on the following scheme:

5/25 Acquisition – Research and discussion of method used to acquire tweets and data obtained

5/25 Presentation – HTML/CSS, layout, quality of writing, overall quality

5/25 Visualisation – Quality, scope/difficulty, integration with report

5/25 Analysis – Quality, depth, difficulty

5/25 Reflection – Discussion of techniques, self critique, journalistic context

Note that extra marks can be awarded for using acquisition, analysis and presentation techniques beyond those taught in class.

Code is not required for the acquisition stage to pass the assignment unless you are on an MSc award. Note that use of code in this coursework can contribute towards the award of the MSc for journalism students.

Produce an illustrated report that uses analysis and techniques examined during lectures and practicals to examine the distribution, variation and relationships between at least two variables from the following London data:
UK Census
Air Quality
Roads and Parks
Airbnb
Another dataset for London as agreed with your lecturers
The following specific requirements apply (over and above the official Coursework Submission Requirements):
Students are expected to present and interpret a mix of descriptive statistics, maps, tables (and other visualisations) to provide an evidential base to describe spatial patterns and relationships. Literature should be used to support analysis of the patterns and relationships observed, including a discussion of the possible underlying drivers or causes. Analysis could be at neighbourhood, borough, or city scales.
You are free to develop a topic that speaks to your research and study interests, but some possible topics include: the impact of Airbnb on housing; the relationship between air pollution and deprivation; and the impact of green space and roads on air pollution. The code used to create the supplied data set is available for those who wish to extend it with new data. Feel free to discuss your ideas with the module co-ordinator, especially if you wish to use data not supplied to you.
Your submission should include a balanced assessment of the strengths and limitations of the data (e.g. what is recorded, what is not recorded, what is potentially misleading, etc.), as well as a justification of the methods used in your analysis. The focus of this assessment is a demonstration of judgement and understanding, not mindlessly applying every technique acquired during the term.
The report should be structured using the following sub-headers:
Introduction: to set the context for your analysis, including brief overview of relevant literature;
Data and Methods: briefly describe the origin of the data and the rationale for any transformation/manipulation of the data;
Results: present an analysis (not simply a summary) of your data using charts, maps and tables (ensure these are embedded in the text);
Discussion: reflect on the possible drivers or causes of the Results, including commenting on the weight of evidence provided (e.g. the strengths and weaknesses of the analyses and data used);
Summary: briefly wrap up your report with the key conclusions you want the reader to take away.
Figures and summary tables should be used and well presented. Use of the wider literature to support discussion and analysis is important. Any code used for analysis should be presented in an Appendix (not in the main body of the report).
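As a starting point for the Data and Methods work, a pandas sketch of the kind of two-variable pairing expected (the column names `no2_mean` and `imd_score` are hypothetical; check the supplied LSOA file for the real ones):

```python
import pandas as pd


def describe_pair(df, x, y):
    """Descriptive statistics and Pearson correlation for two variables."""
    summary = df[[x, y]].describe()
    r = df[x].corr(df[y])  # Pearson by default
    return summary, r


if __name__ == "__main__":
    # Column names below are hypothetical; inspect the file for the real ones.
    lsoa = pd.read_csv("LSOA Data.csv.gz")
    summary, r = describe_pair(lsoa, "no2_mean", "imd_score")
    print(summary)
    print(f"Pearson r = {r:.2f}")
```

The `describe` table gives the distribution of each variable, and the correlation is one simple way to quantify a relationship before moving on to maps and literature-supported interpretation.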

https://github.com/kingsgeocomp/geocomputation/blob/master/data/LSOA%20Data.csv.gz – data to be used