
CMT309 Coursework Assessment Pro-forma

 Cardiff School of Computer Science and Informatics

Coursework Assessment Pro-forma
Module Code: CMT309
Module Title: Computational Data Science
Lecturer: Dr. Matthias Treder, Dr. Luis Espinosa-Anke
Assessment Title: CMT309 Programming Exercises
Assessment Number: 3
Date set: 06-03-2020
Submission date and time: 08-05-2020 at 9:30 am
Return date:
This assignment is worth 40% of the total marks available for this module. If coursework is 
submitted late (and where there are no extenuating circumstances):
1 - If the assessment is submitted no later than 24 hours after the deadline, the 
mark for the assessment will be capped at the minimum pass mark;
2 - If the assessment is submitted more than 24 hours after the deadline, a mark 
of 0 will be given for the assessment.
Your submission must include the official Coursework Submission Cover sheet, which can 
be found here:
https://docs.cs.cf.ac.uk/downloads/coursework/Coversheet.pdf
Submission Instructions
Your coursework should be submitted via Learning Central by the above deadline. You have 
to upload the following files:
Description                  Type                               Name
Cover sheet                  Compulsory, one PDF (.pdf) file    Student_number.pdf
Your solution to question 1  Compulsory, one Python (.py) file  Q1.py
Your solution to question 2  Compulsory, one Python (.py) file  Q2.py
Your solution to question 3  Compulsory, one Word (.docx) file  Q3.docx
For the filename of the Cover Sheet, replace ‘Student_number’ by your student number, e.g. 
“C1234567890.pdf”. Make sure to include your student number as a comment in all of the 
Python files! Any deviation from the submission instructions (including the number and types 
of files submitted) may result in a reduction of marks for the assessment or question part.
You can submit multiple times on Learning Central. ONLY files contained in the last attempt 
will be marked, so make sure that you upload all files in the last attempt.
Assignment
Start by downloading the following files from Learning Central:
• Q1.py
• acronym_example1.txt, acronym_example2.txt, acronym_example3.txt, 
acronym_example4.txt, acronym_tuples.txt
• Q2.py
• Q3.py
• Q3.docx
Then answer the following questions. You can use any Python expression or package that was 
used in the lectures. Additional packages are not allowed unless instructed in the question.
Question 1 - What is the long form of the acronym? (Total 35 marks)
In this question, your task is to implement several functions that parse text strings for 
acronyms and their long forms. Acronyms are abbreviations typically formed from the initial 
letters of multiple words and pronounced as a word. For instance, the acronym "GPU" stands 
for the long form "graphics processing unit". In this question, an acronym is defined as a 
character sequence of at least two successive capital letters. Your task is to implement several 
functions that together parse a text for acronyms and find their long forms. 
As an example text, let us define the string
s = ("A GPU, which stands for graphics processing unit, is different "
     "from CPUs, says the IT expert. For some operations, a GPU is faster "
     "than a CPU. GPUs are not always faster though.")
Q1 a) Parse acronyms (10 marks)
Write a function read_file(filename) that receives as input a filename. The filename 
includes the filepath. The function returns the entire content of the file as a single string.
Write a function find_acronyms(s) that receives as input a string s representing the text. 
The function returns a list of acronyms. For our example above, find_acronyms(s) returns 
the list ['GPU', 'CPU', 'IT']. Note: It is not important in which order the acronyms 
appear in the returned list.
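A minimal sketch of both functions, assuming the definition above (an acronym is a run of at least two successive capital letters). The regular expression, the UTF-8 encoding and the use of a set to drop duplicates are illustrative choices, not the required solution.

```python
import re

def read_file(filename):
    """Return the entire content of the file as a single string."""
    # The encoding is an assumption; the brief does not specify one.
    with open(filename, "r", encoding="utf-8") as f:
        return f.read()

def find_acronyms(s):
    """Return the unique runs of two or more capital letters found in s."""
    # [A-Z]{2,} matches 'GPU' inside 'GPUs' but skips single capitals such as 'A'.
    return list(set(re.findall(r"[A-Z]{2,}", s)))

s = ("A GPU, which stands for graphics processing unit, is different "
     "from CPUs, says the IT expert. For some operations, a GPU is faster "
     "than a CPU. GPUs are not always faster though.")

print(find_acronyms(s))  # contains 'GPU', 'CPU' and 'IT' in arbitrary order
```

Going through a set removes repeated hits such as the three occurrences of GPU, at the cost of losing the order of appearance, which the question says does not matter.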
Q1 b) Find the long forms (15 marks)
In this question the hard work is done: given the acronyms, your task is to find their long 
form in the text. To this end, write a function find_long_forms(s, acronyms). It receives 
as input a string s representing the text and a Python list of acronyms. The function returns 
a dictionary d with key-value pairs, where the key is the acronym and the value is its long 
form. For instance, in our example above the output is the dictionary d = {'GPU' : 
'graphics processing unit', 'CPU' : None, 'IT' : None}.
You can make the following assumptions:
• The long form is found in the same sentence as the acronym itself.
• If the acronym occurs multiple times in a text, its long form is found in the first 
sentence that contains the acronym. 
• Every '.' (dot) marks the end of a sentence. Sentences like "I talked to the Dr. and 
raised my concerns." where dots are contained within the sentence will not occur.
• The first letter of the acronym is the same letter as the first letter of the first word of 
the long form. All of the letters in the acronym need to appear in the long form.
• If no long form can be found for an acronym, it is set to None (Python's None type)
as in the dictionary above.
Four examples for texts with acronyms are given in the example files
acronym_example1.txt, acronym_example2.txt etc. The corresponding tuples 
of (acronym, long form) are specified in the file acronym_tuples.txt. 
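The assumptions above can be turned into a simple search. The sketch below adds one further simplifying assumption that is not in the brief: each letter of the acronym is the initial of one of several consecutive words. Long forms in which a letter sits inside a word would need a more flexible matcher.

```python
import re

def find_long_forms(s, acronyms):
    """Map each acronym to its long form, or None if no long form is found."""
    sentences = s.split(".")  # every '.' ends a sentence (assumption from the brief)
    d = {}
    for acro in acronyms:
        d[acro] = None
        # The long form is searched in the first sentence containing the acronym.
        first = next((sent for sent in sentences if acro in sent), None)
        if first is None:
            continue
        words = re.findall(r"[A-Za-z]+", first)
        k = len(acro)
        # Slide a window of k consecutive words and compare their initials.
        for i in range(len(words) - k + 1):
            window = words[i:i + k]
            initials = "".join(w[0] for w in window)
            if initials.lower() == acro.lower() and acro not in window:
                d[acro] = " ".join(window)
                break
    return d

s = ("A GPU, which stands for graphics processing unit, is different "
     "from CPUs, says the IT expert. For some operations, a GPU is faster "
     "than a CPU. GPUs are not always faster though.")

print(find_long_forms(s, ["GPU", "CPU", "IT"]))
```

For the example text this yields 'graphics processing unit' for GPU and None for CPU and IT, matching the dictionary given above.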
Q1 c) Replace acronyms by long forms (10 marks)
Assume we want to make the document more self-explanatory and replace its acronyms with 
their corresponding long forms. To this end, write a function replace_acronyms(s, d). It 
receives as input a string s representing the text, and a dictionary d which contains
<acronym : long_form> key-value pairs as defined in Q1b). The function returns another string as 
output. In this output, all acronyms in s have been replaced with their long forms. The 
following rules apply:
- If an acronym has a long form, the sentence wherein the long form was defined
remains unchanged. For any other sentence, the acronym is replaced by the long form.
- If an acronym has no long form, it is not replaced anywhere.
- If you add the long form at the beginning of a sentence, make sure that its first word is 
capitalised.
For instance, in our example above the output of the function is the string:
"A GPU, which stands for graphics processing unit, is different from 
CPUs, says the IT expert. For some operations, a graphics processing 
unit is faster than a CPU. Graphics processing units are not always 
faster though."
As a starting point, use Q1.py from Learning Central. Do not rename the file or the function.
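One possible way to apply the three replacement rules is sketched below. It assumes, as in Q1b, that every '.' ends a sentence and that the defining sentence is the first sentence containing the acronym. The helper name expand and the look-ahead that keeps a plural 's' are illustrative choices.

```python
import re

def replace_acronyms(s, d):
    """Replace acronyms by their long forms outside the defining sentence."""
    original = s.split(".")   # sentences of the unmodified text
    result = list(original)   # sentences that get rewritten
    for acro, long_form in d.items():
        if long_form is None:
            continue          # acronyms without a long form are never replaced
        # Match the acronym as a word start; a trailing lowercase 's' is left in place.
        pattern = re.compile(r"\b" + re.escape(acro) + r"(?![A-Z])")

        def expand(m):
            # Capitalise the long form if only whitespace precedes the match.
            at_start = m.string[:m.start()].strip() == ""
            return long_form[0].upper() + long_form[1:] if at_start else long_form

        defined = False
        for i, sentence in enumerate(original):
            if not pattern.search(sentence):
                continue
            if not defined:
                defined = True  # first sentence with the acronym stays unchanged
                continue
            result[i] = pattern.sub(expand, result[i])
    return ".".join(result)

s = ("A GPU, which stands for graphics processing unit, is different "
     "from CPUs, says the IT expert. For some operations, a GPU is faster "
     "than a CPU. GPUs are not always faster though.")
d = {"GPU": "graphics processing unit", "CPU": None, "IT": None}

print(replace_acronyms(s, d))
```

Because only the acronym itself is substituted, the plural "GPUs" becomes "graphics processing units" without special handling.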
Question 2 - Statistics (Total 35 marks)
In this question, your task is to implement several statistical functions that perform t-tests, 
linear regression, and variable selection.
Q2 a) Mass t-tests (10 marks)
In this question, your task is to implement two functions that perform dependent and 
independent t-tests on input data. You can use the corresponding t-test functions in 
scipy.stats. 
Write a function mass_paired_ttest(X) that performs a series of paired-samples t-tests. It 
receives as input a numpy array X with dimensions $n \times m$, where $n$ is the number of rows and 
$m$ is the number of columns. Each of the $m$ columns represents one sample. Your function 
should find the pair of columns that yields the lowest p-value, i.e. the 'most significant' pair. 
Then the function returns a tuple with three elements (index of the first column from the pair, 
index of the second column from the pair, corresponding p-value).
Example: imagine your dataset has shape (100, 3), i.e. it has three columns. Assume the p-values 
for the three pairs of columns are p = 0.4 (col 0 vs col 1), p = 0.12 (col 0 vs col 2), and p = 
0.08 (col 1 vs col 2). The lowest p-value is obtained for col 1 vs col 2 and its value is 0.08, so 
the tuple that is returned by the function is t = (1, 2, 0.08). 
Write a similar function mass_independent_ttest(*X) that performs a series of 
independent t-tests. It takes multiple inputs: each input is a vector (1-D array) representing 
a single sample, so inside the function X is a sequence (a tuple) of Numpy arrays. The arrays 
can have different lengths. You can access each array using its index, e.g. X[0] is the first 
array, X[1] is the second array etc. As for the paired-samples t-test, find the most significant 
pair of samples and return the tuple of three elements (index of the first array, index of the 
second array, corresponding p-value). 
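A sketch of how the pairwise search could look, using scipy.stats.ttest_rel and scipy.stats.ttest_ind. The default equal-variance setting of ttest_ind and the synthetic demo data are assumptions made for illustration only.

```python
from itertools import combinations

import numpy as np
from scipy import stats

def mass_paired_ttest(X):
    """Return (i, j, p) for the pair of columns with the lowest paired t-test p-value."""
    best = None
    for i, j in combinations(range(X.shape[1]), 2):
        p = stats.ttest_rel(X[:, i], X[:, j]).pvalue
        if best is None or p < best[2]:
            best = (i, j, float(p))
    return best

def mass_independent_ttest(*X):
    """Same search for independent samples that may differ in length."""
    best = None
    for i, j in combinations(range(len(X)), 2):
        p = stats.ttest_ind(X[i], X[j]).pvalue  # equal variances assumed (scipy default)
        if best is None or p < best[2]:
            best = (i, j, float(p))
    return best

# Synthetic demo data: the last sample is shifted, so it should be in the best pair.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))
data[:, 2] += 5.0
paired = mass_paired_ttest(data)

a, b, c = rng.normal(size=50), rng.normal(size=60), rng.normal(loc=5.0, size=70)
independent = mass_independent_ttest(a, b, c)
print(paired, independent)
```

itertools.combinations visits every unordered pair exactly once, so for three columns the pairs (0, 1), (0, 2) and (1, 2) are tested.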
Q2 b) Ridge regression (10 marks)
In this question your task is to implement ridge regression from scratch using Numpy. Do not 
use statsmodels or scipy for this question. Ridge regression is a slightly modified version 
of linear regression which is more stable for collinear data. 
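Let us first develop the theory behind linear regression: Assume you have a vector of 
responses $\mathbf{y} \in \mathbb{R}^n$, where $n$ is the number of samples. Let 
$\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_p \in \mathbb{R}^n$ be our 
predictors, where $p$ is the number of predictors. Then our linear regression model is 
$$\hat{\mathbf{y}} = \beta_0 + \beta_1 \mathbf{x}_1 + \beta_2 \mathbf{x}_2 + \beta_3 \mathbf{x}_3 + \ldots + \beta_p \mathbf{x}_p$$
with $\beta_0$ being the intercept and $\beta_1, \ldots, \beta_p$ being the slopes for the predictors. For convenience, 
we store our predictors in a matrix 
$\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \ldots, \mathbf{x}_p, \mathbf{1}] \in \mathbb{R}^{n \times (p+1)}$. In other words, each 
column of $\mathbf{X}$ represents one predictor. The last column consists entirely of ones; it represents 
the intercept. We also store all $\beta$'s in a vector 
$\boldsymbol{\beta} = [\beta_1, \beta_2, \ldots, \beta_p, \beta_0] \in \mathbb{R}^{p+1}$. To calculate 
$\boldsymbol{\beta}$ we can use the equation
$$\boldsymbol{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
where $\mathbf{X}^\top$ is the matrix transpose of $\mathbf{X}$ and the superscript $(\cdot)^{-1}$ 
refers to the matrix inverse. 
Unfortunately the inverse can be unstable or even undefined if $\mathbf{X}^\top \mathbf{X}$ is not well-conditioned. 
As a fix, we will use a different formula called ridge regression which adds a so-called 
regularization term $\alpha \mathbf{I}$: 
$$\boldsymbol{\beta} = (\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}$$
where $\mathbf{I} \in \mathbb{R}^{(p+1) \times (p+1)}$ is an identity matrix and $\alpha$ is a positive number that represents the 
regularization strength. The inverse then always exists as long as $\alpha > 0$. The parameter $\alpha$ has 
to be provided by the user.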
Your task: Write a function fit_ridge(y, X, a) that implements ridge regression as 
defined above. It receives the following inputs: 
- The response vector y is a numpy array with shape (n,1). 
- The matrix X is a numpy array of predictors with shape (n, p). Note that X does not 
contain the column of 1's, so you need to add it yourself. 
- The input a represents the strength of regularization. a can be either a single number 
(e.g. a = 1) or a list with multiple numbers (e.g. a = [1, 5, 10]). 
If a is a single number, the function returns $\boldsymbol{\beta}$, the ridge regression coefficients using a for 
the regularization. If a is a list with multiple numbers, separate ridge regression solutions
should be calculated for each value of a. In this case, the function returns a Python list of 
vectors of regression coefficients [$\boldsymbol{\beta}_0, \boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots$], where $\boldsymbol{\beta}_0$ holds the regression coefficients 
using the first value of a, $\boldsymbol{\beta}_1$ holds the regression coefficients using the second value of a, and so 
on. 
Tip: remember that the * operator operates element-wise on Numpy arrays. If you want 
proper matrix or vector multiplication like in linear algebra, you can use the @ operator.
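The ridge formula translates almost directly into Numpy. The sketch below appends the column of ones as the last column, as described above, and uses np.linalg.solve rather than an explicit inverse. That swap is a numerical convenience chosen here, not something the brief prescribes, and it yields the same coefficients. The helper name solve_for and the demo data are hypothetical.

```python
import numpy as np

def fit_ridge(y, X, a):
    """Ridge coefficients for one regularization value or for a list of values."""
    n = X.shape[0]
    X1 = np.hstack([X, np.ones((n, 1))])  # intercept column goes last
    identity = np.eye(X1.shape[1])

    def solve_for(alpha):
        # Solves (X^T X + alpha I) beta = X^T y without forming the inverse.
        return np.linalg.solve(X1.T @ X1 + alpha * identity, X1.T @ y)

    if np.isscalar(a):
        return solve_for(a)
    return [solve_for(alpha) for alpha in a]

# Noise-free demo data with slope 2 and intercept 1.
X_demo = np.arange(10.0).reshape(-1, 1)
y_demo = 2.0 * X_demo + 1.0

beta = fit_ridge(y_demo, X_demo, 1e-8)        # close to the OLS solution
betas = fit_ridge(y_demo, X_demo, [1, 5, 10])
print(beta.ravel(), [b.ravel() for b in betas])
```

With a very small regularization value the result approaches the ordinary least-squares coefficients, here a slope near 2 followed by an intercept near 1.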
Q2 c) Variable selection in linear regression (15 marks)
In this question, your task is to use statsmodels to implement two variable selection 
functions for standard linear regression (a.k.a. OLS regression). The motivation is that 
regression models can have dozens or even hundreds of predictors. This can make it difficult 
to interpret the relationship between the predictors and the response variable y. Ideally, one 
wants to identify a subset of the predictors that carries most of the information about y. A 
possible approach is variable selection. In variable selection (‘variable’ means the same as 
‘predictor’), variables get iteratively added or removed from the regression model. Once 
finished, the model typically contains only a subset of the original variables.
In the following, we will call a predictor "significant" if the p-value of its coefficient is 
smaller or equal to a given threshold. Your approach operates in two stages: In stage 1, you 
iteratively remove predictors that are not significant. This leaves you with a subset of the 
original predictors. In stage 2, you iteratively add interaction terms and keep them in the 
model if they are significant. Remember what an interaction term is: if $\mathbf{x}_1$ and $\mathbf{x}_2$ are two 
predictors, then the variable $\mathbf{z} = \mathbf{x}_1 \cdot \mathbf{x}_2$ (their element-wise product) is their corresponding interaction term. We will split 
the two stages into two functions:
Stage 1 (remove variables)
Write a function remove_variables(y, X, threshold = 0.05, variable_names = 
None). The function receives the following inputs:
• y and X are numpy arrays like in Q2b). 
• threshold is the cut-off value that determines whether a p-value is significant. If a 
p-value <= threshold, it counts as significant. 
• variable_names is a Python list of variable names that a user can provide. These are the 
names for the columns of X (e.g. ['TV', 'radio', 'newspaper'] for the advertisement 
dataset discussed in the lecture). If no variable names are provided, your function 
should create the variable names ['x1', 'x2', 'x3', ...] where 'x1' is the name for the 
first column of X, 'x2' is the name for the second column of X, and so on.
The function returns a tuple (new_X, new_variable_names) containing two variables:
• new_X is the matrix of predictors after non-significant variables have been removed. It 
should not include the column of 1’s corresponding to the intercept.
• new_variable_names is a list of strings containing the variable names for the 
columns of new_X. 
Use the statsmodels function add_constant to make sure that X contains a column of 1's for 
the intercept, and use the intercept in all fits. Next, these are the details on how to implement 
stage 1 of variable selection:
• To start, fit an OLS model using all of the predictors in X. 
• Identify the predictor whose coefficient has the largest p-value. If it is not significant, 
remove it and fit the model again.
• Repeat this process until either all predictors have been removed or all predictors left are 
significant.
• Never remove the intercept irrespective of whether or not it is significant.
• If no predictors are left after stage 1, return the tuple (None, None).
Tip: It might be useful to use Boolean arrays to select subsets of columns of X.
Stage 2 (add interaction terms)
Write a function add_interaction_terms(y, X, threshold = 0.05, variable_names 
= None). The inputs have the same meaning as in remove_variables. The function 
returns a tuple (new_X, new_variable_names) containing two variables:
• new_X is the matrix of predictors after the interaction terms have been added. Hence, 
it contains the predictors in X plus the interaction terms that have been added as new 
columns to the right. It should not contain the column of 1’s corresponding to the 
intercept term.
• new_variable_names is a list of strings containing the variable names for the 
columns of new_X. For interaction terms, use names that combine the two variable 
names with a ‘*’ sign. For instance, if you add the interaction term for ‘tv’ and 
‘radio’, then call their interaction variable ‘tv*radio’.
The function implements the following algorithm:
• To start, fit an OLS model using all of the predictors in X. 
• Test whether it is useful to add interaction terms: For each pair of predictors, add their 
interaction term into the model. If the interaction term is significant, keep it in the model. 
If it is not significant, remove it again.
• Continue this until you checked every pair of predictors. 
• Never add an interaction term involving the intercept.
• It can happen that when you add new interaction terms, predictors that you previously 
added become non-significant. You can ignore this issue.
• Add the interaction terms in order, starting from the leftmost predictor in X. For instance, 
if you have predictors with column indices 1, 2, 3, and 4, you first add the [1, 2] 
interaction term, then [1, 3], [1, 4], [2, 3], [2, 4], and finally [3, 4].
• After you checked the interaction terms for all pairs of predictors, you are finished. 
Return the new set of predictors and variable names as defined above.
Finally, note that it should be possible to run both functions one after the other. For instance, 
given y and X, the following two lines of code
new_X, new_variable_names = remove_variables(y, X)
new_X, new_variable_names = add_interaction_terms(y, new_X, variable_names=new_variable_names)
should first perform removal of variables and then add interaction terms.
As a starting point, use Q2.py from Learning Central. Do not rename the file or the function.
Question 3 – Ethics (Total 30 Marks)
In this question you will investigate bias in text corpora (document collections). You are 
provided with two datasets from a recent data science competition on Hyperpartisan News 
Detection [1]. These datasets are
- bias_corpus.txt: a corpus of news articles from media that have been 
classified as exhibiting right or left political bias.
- nobias_corpus.txt: a corpus of news articles that have been classified 
as neutral.
These newspaper articles are mostly written in the context of US politics. They could be used 
for building targeted political ads (reader of newspaper X will prefer to see ads of party Y), 
user or community profiling, etc. However, some articles may depict certain protected 
communities (women, immigrants or LGBT) in a negative way. This may bias any data 
science model built on top of this data.
In this question you implement a 'pattern matching' procedure for investigating how protected 
communities are depicted in both corpora (biased vs non-biased). As an inspiration, you can
start experimenting with Hearst patterns [2], which are often used to identify word pairs in 
which a type-of relationship holds. An example for a Hearst pattern is ‘X is a type of Y’. 
The slots X and Y will be filled with matches in corpora, e.g., ‘cat is a type of animal’
or ‘sofa is a type of furniture’. Such patterns can also be used to reveal how certain 
communities are depicted. For example, ‘immigrants and other x’ would reveal how 
immigrants are depicted in these media. A neutral example could be ‘immigrants and 
other communities’, whereas a (negatively) biased example could be ‘immigrants and 
other criminals’. An initial list of patterns is provided below (with actual examples from 
the data). However, you are free and encouraged to experiment with text patterns of 
your own design. You can experiment using only X, only Y, or both X and Y as empty slots 
(regex groups).
Pattern                        Example occurrence
X is a Y                       Obama is a citizen
X is Y                         Trump is threatening
X and other Y                  Refugees and other criminals
marginalized Y, especially X   marginalized groups, especially gays
X works as a Y                 He works as a manager, or She works as a hairdresser
Your tasks:
• Download and uncompress the text corpora from this URL: 
https://drive.google.com/drive/folders/1ATp_zALwRRG5-rd9o0WEcP9IXKOGADSd?usp=sharing
• Decide on a person or community of interest. Define a pattern which you hypothesize 
is likely to reveal how this person/community is depicted. This pattern could include 
regular expressions and group matching for a slot x to fill. For example, the pattern 
‘Trump is x’ is likely to match more verbs in non-biased media because they talk 
more about what he does (‘Trump is speaking’ or ‘Trump is attending’). In 
biased media, however, we could find more adjectives (‘Trump is arrogant’ or 
‘Trump is great’).
• Retrieve and count the hits you get for each value of x in the biased and the non-biased 
corpora separately and store those in the dbias and dnobias dictionaries. For 
example, if ‘Trump is speaking’ occurs twice in the non-biased corpus, then the 
dictionary entry dnobias['speaking'] has the value 2.
• Do this pattern extraction process for three different persons/communities to obtain a 
total of three case studies. You can use the same or different patterns.
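The counting step can be sketched with a regular expression group and collections.Counter. The function name count_pattern and the two toy texts are made up for illustration. In practice the text would be the content of bias_corpus.txt and nobias_corpus.txt.

```python
import re
from collections import Counter

def count_pattern(text, pattern):
    """Count how often each filler of the pattern's single regex group occurs."""
    # With exactly one group, re.findall returns the list of group matches.
    return dict(Counter(match.lower() for match in re.findall(pattern, text)))

# Made-up toy texts standing in for the two corpora (not taken from the data).
toy_nobias = ("Trump is speaking today. Later Trump is attending a summit. "
              "Trump is speaking again.")
toy_bias = "Trump is arrogant. Some say Trump is great."

pattern = r"\bTrump is (\w+)"  # the slot x is captured by the group
dnobias = count_pattern(toy_nobias, pattern)
dbias = count_pattern(toy_bias, pattern)
print(dnobias, dbias)
```

Lower-casing the matches merges variants such as 'Great' and 'great' into one dictionary key, which may or may not be desirable for a given case study.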
Then, report your results in the template document Q3.docx as follows: For each of the three 
case studies, write a short justification (up to 300 words for each) with
1. Your initial hypothesis, why you chose the pattern and no other, how many patterns 
you tried for the case you wanted to test, etc.
2. Provide the comparison match frequency table for the two corpora (see example, 
provided as a comment, at the end of Q3.py).
3. Discuss differences, if any, between the results obtained from the two corpora, and 
highlight the stereotypical or negative depictions you found.
As a starting point, use Q3.py and Q3.docx from Learning Central. Your submission should 
only include the Word document, not the Python script.
References:
[1] Kiesel, J., Mestre, M., Shukla, R., Vincent, E., Adineh, P., Corney, D., ... & Potthast, M. (2019, 
June). Semeval-2019 Task 4: Hyperpartisan news detection. In Proceedings of the 13th 
International Workshop on Semantic Evaluation (pp. 829-839). 
(Available at https://www.aclweb.org/anthology/S19-2145/)
[2] Hearst, M. A. (1992, August). Automatic acquisition of hyponyms from large text corpora. 
In Proceedings of the 14th conference on Computational Linguistics-Volume 2 (pp. 539-545). 
Association for Computational Linguistics.
(Available at https://www.aclweb.org/anthology/C92-2082/)
Learning Outcomes Assessed
• Carry out data analysis and statistical testing using code
• Critically analyse and discuss methods of data collection, management and storage
• Reflect upon the legal, ethical and social issues relating to data science and its 
applications
Criteria for assessment
Credit will be awarded against the following criteria. The score in each implemented function 
is judged by its functionality. For Q1 and Q2, the functions you have implemented will be 
tested against different data sets to judge their functionality. Additionally, quality and 
efficiency (Q1) will be assessed. For Q3, marks are based on the written report. The below 
table explains the criteria.
Criteria Distinction
Feedback on your coursework will address the above criteria. Feedback and marks will be 
returned within 4 weeks of your submission date via Learning Central. In case you require 
further details, you are welcome to schedule a one-to-one meeting. Feedback from this 
assignment will be useful for next year’s version of this module as well as the Python for Data 
Analysis module.