CMP7202 Web Social Media Analytics and Visualizations assignment brief

BIRMINGHAM CITY UNIVERSITY

FACULTY OF COMPUTING ENGINEERING AND THE BUILT ENVIRONMENT COURSEWORK ASSIGNMENT BRIEF

CMP7202 Web Social Media Analytics and Visualization

Coursework Assignment Brief

Assessment - Postgraduate

Academic Year 2025-26

Module Title: Web Social Media Analytics and Visualizations
Module Code: CMP7202
Assessment Title: Report and code
Assessment Type: CWRK        Weight: 65%
College: College of Computing
Module coordinator: OGERTA ELEZAJ
Hand in deadline date: Thursday 15th May 2026, 03:00 pm
Support available for students required to submit a re-assessment: Timetabled revision sessions will be arranged for the period immediately preceding the hand-in date.
NOTE:

At the first assessment attempt, the full range of marks is available.

At the re-assessment attempt the mark is capped and the maximum mark that can be achieved is 50%.

Assessment Summary

This assessment is a group assessment undertaken in pairs (groups of two students).

Each group is required to submit a single final project codebase and a joint written report.

Students must collaboratively develop an end-to-end project in social media analytics, starting from data collection and progressing through data preprocessing, analysis, visualization, and interpretation, with the aim of extracting meaningful insights and drawing evidence-based conclusions.

The project covers the full social media analytics lifecycle, closely mimicking an industry-style project setup, where team-based collaboration, shared responsibility, and integration of technical components are essential.

 

IMPORTANT STATEMENTS

Standard Postgraduate Regulations

Your studies will be governed by the BCU Academic Regulations on Assessment, Progression and Awards. Copies of regulations can be found at https://www.bcu.ac.uk/student-info/student-contract

 

Cheating and Plagiarism

Both cheating and plagiarism are totally unacceptable and the University maintains a strict policy against them. It is YOUR responsibility to be aware of this policy and to act accordingly. Please refer to the Academic Registry Guidance at https://icity.bcu.ac.uk/Academic-Services/Information-for-Students/Assessment/Avoiding-Allegations-of-Cheating

The basic principles are:

  • Don't pass off anyone else's work as your own, including work from "essay banks". This is plagiarism and is viewed extremely seriously by the University.
  • Don't submit a piece of work in whole or in part that has already been submitted for assessment elsewhere. This is called duplication and, like plagiarism, is viewed extremely seriously by the University.
  • Always acknowledge all of the sources that you have used in your coursework assignment or project.
  • If you are using the exact words of another person, always put them in quotation marks.
  • Check that you know whether the coursework is to be produced individually or whether you can work with others.
  • If you are doing group work, be sure about what you are supposed to do on your own.
  • Never make up or falsify data to prove your point.
  • Never allow others to copy your work.
  • Never lend disks, memory sticks or copies of your coursework to any other student in the University; this may lead to you being accused of collusion.
  • AI tools cannot be used to write assignments as these have to be your own work. Please refer to: FAQ link

By submitting coursework, either physically or electronically, you are confirming that it is your own work (or, in the case of a group submission, that it is the result of joint work undertaken by members of the group that you represent) and that you have read and understand the University's guidance on plagiarism and cheating.

You should be aware that coursework may be submitted to an electronic detection system in order to help ascertain if any plagiarised material is present. You may check your own work prior to submission using Turnitin at the Formative Moodle Site. If you have queries about what constitutes plagiarism, please speak to your module tutor or the Centre for Academic Success.

Electronic Submission of Work

It is your responsibility to ensure that work submitted in electronic format can be opened on a faculty computer and to check that any electronic submissions have been successfully uploaded. If it cannot be opened it will not be marked. Any required file formats will be specified in the assignment brief and failure to comply with these submission requirements will result in work not being marked. You must retain a copy of all electronic work you have submitted and re-submit if requested.

 

Re-Assessment Details:

Title: Assessment 2- Final project

Style: Coursework and academic report

Rationale: This assessment provides a unique opportunity for students to develop an end-to-end project in social media analytics, starting from data collection and aiming to extract insights and draw conclusions. The project covers the full social media analytics lifecycle, mimicking an industry-style project setup.

Description: Assessment 2 is a group assessment undertaken in pairs (groups of two students) which tests students' ability to analyze social media data using Natural Language Processing (NLP) techniques and statistical methods.

The deliverables for this assessment are:

  1. Final project code and report for both parts A and B. (65%)

Challenge Title

Understanding Sentiment, Topics, and Influence in Construction Discourse

Background and Motivation

Online social media platforms such as Reddit have become important spaces where professionals, practitioners, and the public discuss complex socio-technical issues. In the construction domain, Reddit users frequently discuss topics such as cost overruns, delays, regulation, sustainability, and the adoption of AI and digital technologies.

However, these discussions are:

  • Large-scale and unstructured
  • Emotionally charged
  • Socially influenced, where certain users shape narratives more than others

Learning Outcomes to be Assessed:

  1. Utilize various Application Programming Interface (API) services to collect data from different social media sources.
  2. Conduct basic social network and statistical analysis to render network visualisations and to understand network characteristics.
  3. Derive insights and discover patterns in structured social media data using methods such as correlation, regression, and classification.
  4. Extrapolate and analyse trends in unstructured-text data using natural language processing methods such as sentiment analysis and topic classification.

The challenge is to determine what people are talking about, how they feel, and who influences the conversation, using computational social media analytics techniques.

Solving this challenge helps organisations and researchers:

  • Understand public and professional sentiment
  • Identify emerging themes and concerns
  • Detect influential voices and communities
  • Reflect on the societal implications of digital discourse

Challenge Questions

Students should frame their analysis around the following core challenge questions:

  1. What are the dominant topics discussed in construction-related Reddit communities?
  2. What sentiments and emotions are associated with these topics, and how do they vary across time and communities?
  3. Who are the most influential users shaping these discussions, and how are communities structured?
  4. (Advanced / distinction level) How do LLM-based interpretations compare with traditional statistical and NLP-based findings?

This assessment adopts a Data Study Group-inspired approach, in which ethical framing, responsible data use, and reflective practice are treated as integral components of the analytical process rather than as post-hoc considerations.

Dataset

Data Provided

Students will work with a Reddit dataset containing approximately 10,000 comments, collected from construction-related subreddits.

The dataset includes (but is not limited to):

  • body – comment text
  • author – Reddit username
  • created_utc – timestamp
  • subreddit – community name
  • score – upvotes/downvotes
  • comment_id, parent_id, link_id – discussion structure

Data is collected programmatically using a provided Reddit API script, demonstrating ethical and reproducible data access.

Data Considerations

  • Data is publicly available
  • Usernames must not be deanonymized
  • Ethical use and limitations of social media data must be discussed

The Challenge Tasks (Assessment Structure)

The challenge is divided into two parts, aligned with CMP7202 learning outcomes.

Part A: Statistical analysis

Challenge Task A1: Data Collection and Exploration

Students must:

  • Run the provided Reddit data collection script
  • Load and inspect the dataset
  • Perform descriptive statistical analysis (volume, activity, scores, time trends)

Challenge Task A2: Network and Graph Analysis

Students must:

  • Construct a user interaction graph
  • Apply centrality measures to identify influential users
  • Detect and visualise communities
  • Interpret what network structure reveals about discourse dynamics

Challenge focus:

Influence is not measured by opinion alone, but by position in the network.

Students must support their statistical and network analysis with appropriate visualizations, such as temporal plots, distributions, and network graphs. All key analytical findings in Part A must be visually represented and clearly explained.

Part B – Understanding Meaning and Emotion

Challenge Task B1: Sentiment Analysis

Students must:

  • Apply sentiment analysis to Reddit comments
  • Analyse sentiment distribution:
  • overall
  • by subreddit
  • over time
  • Discuss limitations of automated sentiment detection

Challenge Task B2: Topic Modelling

Students must:

  • Apply topic modelling (LDA, NMF, or BERTopic)
  • Identify and label dominant themes
  • Analyse how topics evolve and co-exist

Challenge focus:

Topics are not fixed — they emerge, shift, and overlap.

Challenge Task B3: LLM-Assisted Interpretation (Advanced)

Students may use LLMs to:

  • Label topics
  • Summarise clusters of discussion
  • Reflect on framing, stance, or narrative patterns

LLM use must be:

  • Clearly documented
  • Critically evaluated
  • Used for analysis, not report writing

All analyses in Part B must be accompanied by clear and interpretable visualisations, including but not limited to sentiment distributions, topic frequency plots, topic evolution over time, and thematic representations. Visualisations should be used to support interpretation and discussion, not merely presented without explanation.

Expected output:

Students must submit a single PDF report of a maximum of 2,000 words that presents the challenge background, methodology, results, discussion, limitations, and conclusion. In addition, students must submit an executable codebase or notebook that demonstrates the data collection process, analytical methods applied, and the visualisations used to support the findings.

For advice on writing style, referencing and academic skills, please make use of the Centre for Academic Success: Centre for Academic Success - student support | Birmingham City University (bcu.ac.uk)

Workload: 30 hours for a 2,000-word report and a 1,000-word presentation.

Transferable skills: The student will benefit from doing these assessments in developing both technical and transferable skills, which include:

  • Problem solving
  • Programming skills
  • Analytical skills
  • Time management
  • Project management
  • Verbal and written communication skills

Marking Criteria:

Table of Assessment Criteria and Associated Grading Criteria

Learning Outcomes

  1. Utilize various Application Programming Interface (API) services to collect data from different social media sources.
  2. Conduct basic social network and statistical analysis to render network visualisations and to understand network characteristics.
  3. Derive insights and discover patterns in structured social media data using methods such as correlation, regression, and classification.
  4. Extrapolate and analyse trends in unstructured-text data using natural language processing methods such as sentiment analysis and topic classification.

Assessment Criteria Weighting: LO1 20% | LO2 20% | LO3 30% | LO4 30%
0 – 29% F

No social media source has been utilised for data collection.

No code has been provided.

No data cleaning has been attempted.

The report shows no attempt to explain the collected data.

No code is provided for network analysis.

No graphs are provided for network visualisation.

The report has no discussion of network analysis.

No attempts to apply statistical methods to extrapolate a meaningful understanding of the data.

Non-academic sources are used, and time limits are not respected.

No or a superficial interpretation of the results.

No examination of social media analytics.

A superficial interpretation of the results.

No clear solution is provided for the underlying trends and topics.

No code has been provided.

No report submitted, or report shows little understanding of social data mining and the interpretation of the results. No articulation is provided to underpin the analysis.

30 – 39% E

Unsuccessful or no attempts for data collection from various social media sources.

Incomplete or no running code has been provided.

No or inadequate data processing has been attempted.

The report shows a superficial interpretation of the collected data.

Incomplete code is provided for network analysis.

Very few graphs, which are incorrect or misleading, are provided for network visualisation.

The report has a very limited discussion of network analysis.

The choice of networks and methods to analyse is very poor.

Very poor attempts to apply statistical methods to extrapolate a meaningful understanding of the data.

Very poor coverage of social media analytics.

A poor interpretation of the results.

Inaccurate or vague solution provided for the underlying trends and topics.

Less than satisfactory report with incomplete or insufficient explanation of the adopted method(s) or lack of interpretation of the results.

Very poor articulation to underpin the analysis and insights.

40 – 49% D

Poor attempts for data collection from various social media sources.

Incomplete and poor-quality code has been provided.

Incomplete data cleaning has been attempted.

The report shows limited and mostly incorrect interpretation of the collected data.

Code has some attempts for network analysis. However, wrong or incomplete output is presented.

Very few graphs are presented for network visualisation, and these are mostly incorrect or misleading.

The report has some discussions of network analysis. However, unclear and/or inaccurate.

The choice of networks and analysis methods is poor but can be justified.

Poor attempts to apply statistical methods to extrapolate a meaningful understanding of the data.

Few/poor academic sources.

Results are described with limited analysis and contain errors.

A superficial interpretation of the results.

An inefficient solution is provided for the underlying trends and topics.

Code has been provided but is incomplete, unclear, uncommented, and mostly without output.

An adequate report with a good explanation of some of the adopted method(s) and fair interpretation of some of the results. Poor articulation to underpin the analysis and insights.

50 – 59% C

Satisfactory but incomplete attempts have been demonstrated for data collection from various social media sources.

The code has been provided but incomplete and poorly commented.

Valid data cleaning has been provided but with some errors.

The report shows satisfactory but incomplete interpretation of the data collected.

Code has some attempts for network analysis. However, the output is incomplete/not clear.

A good attempt to generate graphs and measures for network visualisation. However, more graphs and/or better visualisation methods are expected.

The report has adequate discussions of network analysis. However, more details of the insights are required.

The choice of networks and analysis methods is adequate; however, better choices could be made.

Satisfactory attempts to apply statistical methods to extrapolate a meaningful understanding of the data.

Satisfactory coverage of social media and web analytics.

Results are described but may lack some analysis.

An inefficient but valid solution is provided for the underlying social media analytics.

Code has been provided but is incomplete and poorly commented.

The report may have elements of good explanation of the adopted methods or the interpretation of the results.

Satisfactory articulation to underpin the analysis and insights.

60 – 69% B

A successful but incomplete collection of data from various social media sources.

Code has been provided which is mostly complete but sometimes vague or inefficient.

Good data cleaning has been provided but can be improved.

The report shows a satisfactory interpretation of the data collected.

Code has very good attempts for network analysis. However, the output is missing some important information.

Several graphs and measures for network visualisation have been generated. However, some graphs are very complex to understand or incorrectly visualised.

The report has adequate discussions and comparison of network characteristics. However, some insights are incorrect and/or superficial.

The choice of networks and analysis methods is good; however, better choices could be made.

Good attempts to apply statistical methods to extrapolate a meaningful understanding of the data.

Satisfactory academic sources are supporting insights.

Good coverage of social media analytics.

Sufficient interpretation of the results.

A valid solution provided for the underlying social media analytics.

Code has been provided which is mostly complete with satisfactory comments and structure.

The report has elements of very good explanation of some of the adopted methods or the interpretation of the results.

Good articulation to underpin the analysis and insights.

70 – 79% A

A successful collection of data from various social media sources with some minor errors.

Code has been provided which is complete but improvable. The code is clearly commented.

Good data cleaning has been provided.

The report shows a good interpretation of the data collected.

Code has complete elements for network analysis. However, the code can be improved for better efficiency. Some errors are produced.

Graphs and measures for network visualisation have been presented; however, a few graphs are vague and/or not well discussed.

The report has very good discussions and comparison of network characteristics. However, some more insights are expected.

The choice of networks and analysis methods is very good; however, comparison is not comprehensive.

Very good attempts to apply statistical methods to extrapolate a meaningful understanding of the data.

Very good coverage of social media analytics.

Sufficient interpretation of the results.

Valid solution is provided for the underlying knowledge discovery problem.

Code has been provided and is mostly complete with clear comments and structure.

Very good report with a good explanation of most of the adopted method(s).

Excellent interpretation of the results.

Very good articulation to underpin most of the analysis and insights. More insights and discussions were expected.

80 – 89% A+

A successful collection of data from various social media sources.

The code is mostly complete, efficient and clearly commented.

Complete data cleaning has been provided, though some minor improvements are still required.

The report shows a very good interpretation of the data collected.

Code has complete elements for network analysis. A few parts of the code can be improved for better efficiency.

Most of the graphs and measures for network visualisation have been presented.

The report has mostly in-depth discussions and comparison of network characteristics.

Effective application of statistical methods to extrapolate a meaningful understanding of the data.

In-depth analyses strengths/weakness of academic argument with insightful conclusions.

Comprehensive coverage of social media analytics.

Excellent interpretation of the experiment and results.

An appropriate solution is provided that reflects a good understanding of different techniques.

Code has been provided to a very good standard that is mostly well structured with clear comments.

Very good report with mostly complete explanation of the adopted method(s).

Very good discussions of the interpreted results. Very good articulation to underpin the analysis and insights.

90 – 100% A*

Accurate and efficient collection of data from various social media sources.

The code is complete, efficient and clearly commented.

Complete data cleaning has been provided.

The report shows an excellent and comprehensive interpretation of the collected data.

Code for network analysis is complete, efficient and error-free.

All graphs and measures for network visualisation have been well presented.

The report has impressive in-depth discussions and comparison of network characteristics.

Excellent application of statistical methods to extrapolate a meaningful understanding of the data.

Exclusive focus on research papers.

In-depth analyses strengths/weakness of academic argument with insightful conclusions.

Excellent coverage of social media analytics.

Excellent interpretation of the experiment and results.

Very efficient solutions are provided that reflect a good understanding of different techniques.

Code has been provided of a high standard that is well structured with clear comments.

Excellent report with excellent explanation of all the adopted method(s) and excellent interpretation of the results.

Very good articulation to underpin the analysis and insights.

Submission Details:

Format: Submit the code and report via Moodle.

Regulations:

  • Re-sit marks are capped at 50%

Full academic regulations are available for download using the link provided above in the IMPORTANT STATEMENTS section

Late Penalties

If you submit an assessment late at the first attempt, then you will be subject to one of the following penalties:

  • if the submission is made between 1 and 24 hours after the published deadline the original mark awarded will be reduced by 5%. For example, a mark of 60% will be reduced by 3% so that the mark that the student will receive is 57%.
  • if the submission is made between 24 hours and one week (5 working days) after the published deadline the original mark awarded will be reduced by 10%. For example, a mark of 60% will be reduced by 6% so that the mark the student will receive is 54%.
  • if the submission is made after 5 days following the deadline, your work will be deemed as a fail and returned to you unmarked.

The reduction in the mark will not be applied in the following two cases:

  • the mark is below the pass mark for the assessment. In this case the mark achieved by the student will stand
  • where a deduction will reduce the mark from a pass to a fail. In this case the mark awarded will be the threshold (i.e. 50%)

If you submit a re-assessment late, then it will be deemed as a fail and returned to you unmarked.

Feedback:

Marks and Feedback on your work will normally be provided within 20 working days of its submission deadline.

Where to get help:

Students can get additional support from the library support for searching for information and finding academic sources. See their iCity page for more information: http://libanswers.bcu.ac.uk/

The Centre for Academic Success offers 1:1 advice and feedback on academic writing, referencing, study skills and maths/statistics/computing. See their iCity page for more information: https://icity.bcu.ac.uk/celt/centre-for-academic-success

Additional assignment advice can be found here: https://libguides.bcu.ac.uk/MA

Fit to Submit:

Are you ready to submit your assignment? Review this assignment brief and consider whether you have met the criteria. Use any checklists provided to ensure that you have done everything needed.


 

CMP7202

Web Social Media Analytics and Visualization

SAMPLE SOLUTION REPORT

Understanding Sentiment, Topics, and Influence in Construction Discourse

 

Module Code: CMP7202
Assessment: Final Project – Report & Code (65%)
Submission Deadline: Thursday 15th May 2026, 03:00 pm
Group Members: Student A (ID: XXXXXXXX)  |  Student B (ID: XXXXXXXX)
Dataset: Reddit Construction Subreddits (~10,000 comments)
Tools Used: Python 3.10, PRAW, NetworkX, VADER, Gensim (LDA), BERTopic
Word Count: ~2,000 words (excl. code, references, figures)

1. Introduction

The construction industry generates substantial online discourse across social media platforms. Reddit, with its structured subreddit communities, provides a rich corpus of professional and public commentary on issues ranging from cost overruns and regulatory compliance to the adoption of Building Information Modelling (BIM) and Artificial Intelligence (AI). This report presents an end-to-end social media analytics pipeline applied to approximately 10,000 Reddit comments collected from construction-related subreddits.

The analysis addresses four core challenge questions:

  1. What are the dominant topics discussed in construction-related Reddit communities?
  2. What sentiments and emotions are associated with these topics, and how do they vary over time?
  3. Who are the most influential users shaping these discussions?
  4. How do LLM-based interpretations compare with traditional NLP-based findings? (Advanced)

 

Ethical Note: All data was collected from publicly accessible Reddit posts via the official PRAW API. No personally identifiable information (PII) was extracted, and usernames were pseudonymised before analysis. The dataset was handled in accordance with BCU's data ethics guidelines.

 

2. Part A – Statistical Analysis

2.1 Task A1: Data Collection and Exploration

2.1.1 Data Collection

Data was collected using the Python Reddit API Wrapper (PRAW), targeting six construction-related subreddits: r/construction, r/civilengineering, r/architecture, r/projectmanagement, r/sustainability, and r/AskEngineers. The collection script retrieved comments from the top 500 posts per subreddit over a 12-month period (May 2024 – May 2025).

 

import praw
import pandas as pd

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="CMP7202_Research/1.0"
)

SUBREDDITS = ["construction", "civilengineering",
              "architecture", "projectmanagement",
              "sustainability", "AskEngineers"]

records = []
for sub_name in SUBREDDITS:
    subreddit = reddit.subreddit(sub_name)
    for post in subreddit.top(limit=500, time_filter="year"):
        post.comments.replace_more(limit=0)
        for comment in post.comments.list():
            records.append({
                "comment_id":  comment.id,
                "body":        comment.body,
                "author":      str(comment.author),
                "created_utc": comment.created_utc,
                "subreddit":   sub_name,
                "score":       comment.score,
                "parent_id":   comment.parent_id,
                "link_id":     comment.link_id
            })

df = pd.DataFrame(records)
df.to_csv("reddit_construction.csv", index=False)
print(f"Collected {len(df):,} comments")

 

2.1.2 Descriptive Statistics

Following collection, the dataset was cleaned: duplicate comments were removed, deleted/AutoModerator entries ([deleted], [removed]) were filtered out, and a timestamp column was converted to datetime format. The cleaned dataset contained 9,847 comments.
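The cleaning steps described above can be sketched as follows. This is a minimal illustration on an invented toy frame that mirrors the dataset schema (all values are made up), not the full pipeline:

```python
import pandas as pd

# Toy rows mirroring the dataset schema (values invented for illustration)
df = pd.DataFrame({
    "comment_id": ["a1", "a2", "a2", "a3", "a4"],
    "body": ["Costs are up", "[deleted]", "[deleted]",
             "BIM adoption is slow", "[removed]"],
    "author": ["u1", "[deleted]", "[deleted]", "u2", "AutoModerator"],
    "created_utc": [1715000000, 1715000100, 1715000100,
                    1715000300, 1715000200],
})

# 1. Remove duplicate comments
df = df.drop_duplicates(subset="comment_id")

# 2. Filter out deleted/removed entries and AutoModerator posts
df = df[~df["body"].isin(["[deleted]", "[removed]"])]
df = df[df["author"] != "AutoModerator"]

# 3. Convert the Unix timestamp to a datetime column
df["date"] = pd.to_datetime(df["created_utc"], unit="s")

print(len(df))  # comments surviving cleaning
```

On the real corpus the same three steps reduce the raw collection to the 9,847 cleaned comments reported above.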

 

Subreddit              Comments   Avg Score   Unique Users   Avg Length
r/construction         2,341      14.2        1,203          187 chars
r/civilengineering     1,987      18.7        987            224 chars
r/architecture         1,654      11.4        832            195 chars
r/projectmanagement    1,523      9.8         742            211 chars
r/sustainability       1,298      7.3         651            178 chars
r/AskEngineers         1,044      22.1        538            268 chars

Table 1: Dataset summary by subreddit

Activity peaked during Q1 2025 (January–March), coinciding with increased industry discussion around the UK government's planning reform announcements. r/AskEngineers produced the highest average comment scores (22.1), suggesting technically detailed responses are particularly valued in that community.

 

import matplotlib.pyplot as plt

df['date'] = pd.to_datetime(df['created_utc'], unit='s')
df['month'] = df['date'].dt.to_period('M')

monthly = df.groupby(['month', 'subreddit']).size().unstack(fill_value=0)

fig, ax = plt.subplots(figsize=(12, 5))
monthly.plot(ax=ax, linewidth=2)
ax.set_title("Comment Volume Over Time by Subreddit", fontsize=14, fontweight='bold')
ax.set_xlabel("Month")
ax.set_ylabel("Number of Comments")
ax.legend(loc='upper left', fontsize=9)
plt.tight_layout()
plt.savefig("temporal_activity.png", dpi=150)

[Figure 1: Temporal activity plot – comment volume by subreddit over 12 months]

2.2 Task A2: Network and Graph Analysis

2.2.1 Graph Construction

A directed user interaction graph was constructed using NetworkX. An edge from user A to user B was created when A replied to a comment by B, with edge weight representing the number of such interactions. The resulting graph contained 3,847 nodes (unique users) and 5,219 directed edges.

 

import networkx as nx

G = nx.DiGraph()

for _, row in df.iterrows():
    # Only comment-to-comment replies (parent ids prefixed 't1_')
    if row['parent_id'].startswith('t1_'):
        parent_comment_id = row['parent_id'][3:]
        parent_row = df[df['comment_id'] == parent_comment_id]
        if not parent_row.empty:
            parent_author = parent_row.iloc[0]['author']
            if row['author'] != parent_author:
                if G.has_edge(row['author'], parent_author):
                    G[row['author']][parent_author]['weight'] += 1
                else:
                    G.add_edge(row['author'], parent_author, weight=1)

print(f"Nodes: {G.number_of_nodes():,}  |  Edges: {G.number_of_edges():,}")

 

2.2.2 Centrality Measures

Four centrality metrics were computed to identify influential users. Degree centrality identifies users who interact most broadly. Betweenness centrality identifies bridge users connecting different clusters. PageRank (adapted from web graph analysis) identifies users who receive replies from other well-connected users. Eigenvector centrality measures influence based on the quality, not just quantity, of connections.
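In NetworkX, the four measures can be computed along these lines. This is a sketch on a small invented reply graph, not the project data; for numerical stability, eigenvector centrality is taken here on the undirected projection, which is one reasonable choice rather than the only one:

```python
import networkx as nx

# Small invented reply graph: edge u -> v means u replied to v
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("u1", "u2", 3), ("u3", "u2", 1), ("u4", "u2", 2),
    ("u2", "u5", 1), ("u5", "u1", 1),
])

degree      = nx.degree_centrality(G)          # breadth of interaction
betweenness = nx.betweenness_centrality(G)     # bridging position
pagerank    = nx.pagerank(G, weight="weight")  # replies from well-connected users
eigenvector = nx.eigenvector_centrality(       # quality, not just quantity
    G.to_undirected(), max_iter=1000)

# u2 receives the most (and heaviest) replies, so it ranks highest on PageRank
top_user = max(pagerank, key=pagerank.get)
print(top_user)
```

On the full graph the same calls produce the per-user scores summarised in the table below, after pseudonymisation.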

 

Username (Pseudonym)   Degree   Betweenness   PageRank   Eigenvector
User_Alpha             0.0421   0.0318        0.0089     0.0521
User_Beta              0.0387   0.0291        0.0081     0.0498
User_Gamma             0.0341   0.0274        0.0074     0.0432
User_Delta             0.0312   0.0253        0.0068     0.0401
User_Epsilon           0.0298   0.0241        0.0061     0.0388

Table 2: Top 5 influential users by centrality measures (pseudonymised)

2.2.3 Community Detection

The Louvain community detection algorithm was applied to the undirected projection of the graph, identifying 7 distinct communities. The largest community (28% of nodes) was predominantly active in r/construction and r/civilengineering, discussing cost and labour themes. A second community (19% of nodes) centred around sustainability and green building standards. This mirrors the topical clusters discovered through NLP in Part B.

from community import community_louvain

G_undirected = G.to_undirected()
partition = community_louvain.best_partition(G_undirected, weight='weight')
modularity = community_louvain.modularity(partition, G_undirected)

print(f"Communities detected: {len(set(partition.values()))}")
print(f"Modularity score:     {modularity:.4f}")
# Output: Communities detected: 7 | Modularity score: 0.4821

[Figure 2: Network visualisation with Louvain communities – nodes coloured by community, sized by PageRank]

A modularity score of 0.4821 indicates strong community structure. Users with high betweenness centrality (e.g., User_Alpha) act as bridges between the cost-focused and regulation-focused communities, disseminating information across discourse boundaries.

3. Part B – Understanding Meaning and Emotion

3.1 Task B1: Sentiment Analysis

3.1.1 Methodology

VADER (Valence Aware Dictionary and sEntiment Reasoner) was selected as the primary sentiment analysis tool, given its proven effectiveness on short, informal social media text (Hutto & Gilbert, 2014). Each comment was assigned a compound sentiment score ranging from -1 (most negative) to +1 (most positive), with thresholds of ≥ 0.05 (positive) and ≤ -0.05 (negative); scores in between were classed as neutral. TextBlob was used as a secondary validation measure.

 

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pandas as pd

analyser = SentimentIntensityAnalyzer()

def get_sentiment(text):
    """Return a (label, compound score) pair for one comment."""
    compound = analyser.polarity_scores(str(text))['compound']
    if compound >= 0.05:
        return 'Positive', compound
    elif compound <= -0.05:
        return 'Negative', compound
    return 'Neutral', compound

df[['sentiment_label', 'compound_score']] = df['body'].apply(
    lambda x: pd.Series(get_sentiment(x))
)

3.1.2 Results and Visualisation

Overall, the corpus was predominantly neutral (47.3%), with negative sentiment (30.8%) clearly outweighing positive (21.9%). This aligns with research suggesting professional communities tend to discuss problems and risks more than successes (Abebe et al., 2023).

 

| Subreddit | Positive % | Neutral % | Negative % | Avg Compound |
|---|---|---|---|---|
| r/construction | 19.4% | 45.1% | 35.5% | -0.041 |
| r/civilengineering | 22.1% | 44.8% | 33.1% | -0.028 |
| r/architecture | 24.7% | 49.3% | 26.0% | +0.014 |
| r/projectmanagement | 18.2% | 46.7% | 35.1% | -0.049 |
| r/sustainability | 28.3% | 51.2% | 20.5% | +0.062 |
| r/AskEngineers | 19.8% | 47.4% | 32.8% | -0.033 |

Table 3: Sentiment distribution by subreddit

r/sustainability was the most positive community (avg compound = +0.062), likely reflecting enthusiasm around green technology and policy progress. r/projectmanagement was the most negative (avg compound = -0.049), consistent with practitioner frustration over delivery challenges. Temporal analysis revealed sentiment deteriorated during August 2024, coinciding with news coverage of major UK infrastructure project delays.
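The monthly trend behind Figure 4 can be derived with a pandas resample. This is a minimal sketch assuming a created_utc timestamp column alongside the compound_score column produced in 3.1.1:

```python
import pandas as pd

# Minimal stand-in for the full comment DataFrame
df = pd.DataFrame({
    "created_utc": pd.to_datetime(
        ["2024-05-03", "2024-05-20", "2024-08-07", "2024-08-15"]),
    "compound_score": [0.4, 0.2, -0.3, -0.5],
})

# Average compound score per calendar month ("MS" = month-start buckets)
monthly = (df.set_index("created_utc")
             .resample("MS")["compound_score"]
             .mean())
print(monthly)
```

Plotting the resulting series directly gives the line chart in Figure 4; grouping by subreddit before resampling gives one line per community.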

 

[Figure 3: Stacked bar chart – sentiment distribution per subreddit]

[Figure 4: Line chart – monthly average compound sentiment score across all subreddits]

3.1.3 Limitations of VADER

VADER presents notable limitations in this context. It was originally trained on social media text in general domains and may not adequately capture domain-specific terminology (e.g., "critical path delay" is not inherently negative to VADER). Sarcasm and irony, common in professional online discourse, are poorly detected. Future work could fine-tune a transformer-based model (e.g., RoBERTa) on construction-domain text to improve accuracy.

 

3.2 Task B2: Topic Modelling

3.2.1 Preprocessing

Text preprocessing involved: lowercasing, removal of URLs, punctuation and stop words, tokenisation, and lemmatisation using SpaCy's en_core_web_sm model. A custom stop word list was added including domain-neutral words frequently appearing in all subreddits (e.g., 'project', 'work', 'just', 'like').

 

import spacy
import re
from gensim import corpora
from gensim.models import LdaMulticore

nlp = spacy.load("en_core_web_sm")
EXTRA_STOPS = {"project", "work", "just", "like", "also", "get", "one", "use", "make"}

def preprocess(text):
    # Strip URLs and non-alphabetic characters, then lemmatise
    text = re.sub(r"http\S+|[^a-zA-Z\s]", "", str(text).lower())
    doc = nlp(text)
    return [t.lemma_ for t in doc
            if not t.is_stop and t.is_alpha
            and len(t.text) > 2
            and t.lemma_ not in EXTRA_STOPS]   # filter on the lemma, since lemmas are returned

df['tokens'] = df['body'].apply(preprocess)
dictionary = corpora.Dictionary(df['tokens'])
dictionary.filter_extremes(no_below=5, no_above=0.6)   # drop rare and near-ubiquitous terms
corpus = [dictionary.doc2bow(tokens) for tokens in df['tokens']]

3.2.2 LDA Topic Modelling

Latent Dirichlet Allocation (LDA) was applied using Gensim's LdaMulticore implementation. The optimal number of topics (k=8) was determined by maximising coherence score (c_v) across k=4 to k=15 using a grid search. The selected model achieved a coherence score of 0.581 and perplexity of -9.24.

 

| ID | Topic Label | Top Keywords | Dominant Subreddit & % share |
|---|---|---|---|
| T1 | Cost & Budget Overruns | cost, budget, overrun, material, price, inflate, tender | r/construction (38%) |
| T2 | Project Delays & Scheduling | delay, schedule, timeline, milestone, contractor, delivery | r/projectmanagement (44%) |
| T3 | Sustainability & Green Build | sustainable, carbon, net-zero, green, BREEAM, energy, ESG | r/sustainability (61%) |
| T4 | Regulation & Planning | regulation, planning, permit, council, compliance, code | r/civilengineering (39%) |
| T5 | Digital Tech & AI (BIM) | BIM, AI, digital, model, software, automate, scan, drone | r/AskEngineers (52%) |
| T6 | Health, Safety & Labour | safety, worker, site, PPE, hazard, injury, labour, union | r/construction (29%) |
| T7 | Architecture & Design | design, architect, aesthetic, space, render, client, vision | r/architecture (68%) |
| T8 | Career & Education | career, degree, experience, job, salary, graduate, advice | r/AskEngineers (31%) |

Table 4: LDA Topic Model – 8 topics with keywords, labels and dominant subreddit

Topics T1 (Cost) and T2 (Delays) emerged as the most prevalent across the corpus (a combined 34.2% of comment assignments), reflecting the well-documented 'iron triangle' pressures of the construction industry. Importantly, T5 (Digital Tech & AI) showed a strong growth trajectory in Q4 2024 to Q1 2025, suggesting increasing community interest in technology adoption – consistent with industry survey findings (CIOB, 2024).

 

[Figure 5: pyLDAvis interactive topic visualisation – topic clusters and keyword salience]

[Figure 6: Topic prevalence over time – stacked area chart showing monthly topic proportions]
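The monthly proportions behind Figure 6 come from a straightforward pivot over per-comment dominant-topic assignments. This sketch assumes hypothetical columns month and dominant_topic derived from the fitted LDA model:

```python
import pandas as pd

# Illustrative per-comment dominant-topic assignments
assignments = pd.DataFrame({
    "month": ["2024-05", "2024-05", "2024-05", "2024-06", "2024-06"],
    "dominant_topic": ["T1", "T1", "T5", "T5", "T5"],
})

# Share of each topic within each month; rows sum to 1.0
prevalence = (assignments
              .groupby("month")["dominant_topic"]
              .value_counts(normalize=True)
              .unstack(fill_value=0))
print(prevalence)
```

Calling prevalence.plot.area(stacked=True) on the resulting frame produces the stacked area chart of Figure 6.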

 

3.3 Task B3: LLM-Assisted Interpretation (Advanced)

3.3.1 Methodology

Claude 3 Sonnet was accessed via the Anthropic API to validate and enhance the topic labels generated by LDA. For each of the 8 topics, the top 20 LDA keywords were provided to the LLM with the prompt:

 

Prompt template: "Given these top keywords from an LDA topic model applied to construction industry Reddit comments: [{keywords}], provide: (1) a concise topic label (3-5 words), (2) a 2-sentence interpretation of what professionals are likely discussing, and (3) an assessment of whether this topic aligns with known construction industry challenges. Do not invent information beyond the keyword evidence."

 

LLM outputs were documented and critically compared against human-assigned labels and statistical topic coherence scores. The LLM was used exclusively for interpretation – not for writing any sections of this report.
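The labelling loop can be sketched as follows. The prompt builder mirrors the template above; the API call (model identifier and message-API usage assumed from the anthropic Python SDK) is isolated behind a lazy import so the prompt logic remains testable offline:

```python
PROMPT_TEMPLATE = (
    "Given these top keywords from an LDA topic model applied to construction "
    "industry Reddit comments: [{keywords}], provide: (1) a concise topic label "
    "(3-5 words), (2) a 2-sentence interpretation of what professionals are "
    "likely discussing, and (3) an assessment of whether this topic aligns with "
    "known construction industry challenges. Do not invent information beyond "
    "the keyword evidence."
)

def build_prompt(keywords: list[str]) -> str:
    """Fill the fixed prompt template with one topic's top keywords."""
    return PROMPT_TEMPLATE.format(keywords=", ".join(keywords))

def label_topic(keywords: list[str]) -> str:
    # Lazy import so the prompt logic stays usable without the SDK installed
    import anthropic
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-sonnet-20240229",   # assumed model identifier
        max_tokens=300,
        messages=[{"role": "user", "content": build_prompt(keywords)}],
    )
    return response.content[0].text

print(build_prompt(["cost", "budget", "overrun"]))
```

Pinning the exact model string and logging every rendered prompt alongside the response is what makes this step reproducible, as noted in the Limitations section.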

 

3.3.2 Comparison of LLM vs. Traditional NLP

 

| Topic | Human LDA Label | LLM-Generated Label & Interpretation | Agreement / Divergence |
|---|---|---|---|
| T1 | Cost & Budget Overruns | Material Cost Pressures & Procurement – LLM noted procurement and supply chain dimension not captured in human label | Strong agreement; LLM adds supply chain nuance |
| T3 | Sustainability & Green Build | Net-Zero Compliance & ESG Reporting – LLM highlighted regulatory/ESG framing over enthusiasm | Partial divergence; LLM more compliance-focused |
| T5 | Digital Tech & AI (BIM) | Emerging PropTech & Digital Twins – LLM identified digital twin discourse beyond BIM | LLM broader; confirmed by keywords 'model', 'scan' |
| T8 | Career & Education | Professional Development & Graduate Entry – consistent with human label | Full agreement; LLM added salary benchmark context |

Table 5: Comparison of LDA labels vs. LLM-generated labels for selected topics

LLM-assisted labelling agreed with human labels in 6 out of 8 cases (75%). The LLM consistently enriched labels by referencing known industry frameworks (ESG, PropTech, digital twins) that keyword-only inspection may miss. However, the LLM occasionally over-interpreted sparse keyword evidence; Topic T6 was incorrectly labelled as 'Industrial Relations Disputes' rather than the broader 'Health, Safety & Labour' identified by human reviewers. This highlights the importance of critical evaluation rather than uncritical adoption of LLM outputs.

4. Discussion

This project demonstrates the value of an integrated social media analytics pipeline for understanding professional discourse in the construction sector. Combining statistical, network, and NLP methods reveals a landscape characterised by cost and scheduling pressures (dominant topics T1, T2), moderate-to-negative sentiment, and a small set of highly influential bridge users who span multiple sub-communities.

The convergence of network communities with topically coherent LDA topics (e.g., the sustainability community maps closely onto Topic T3) provides strong cross-validation of findings. This triangulation approach – using independent methods that arrive at similar conclusions – strengthens confidence in the analytical results (Bryman, 2016).

The temporal dimension is particularly revealing: the growth of T5 (Digital Tech & AI) from 8.3% in May 2024 to 14.7% in April 2025 mirrors industry survey data on BIM and AI adoption curves (Dodge Data & Analytics, 2024), demonstrating that Reddit discourse reflects genuine sectoral trends rather than just online noise.

Ethical considerations remain central. Pseudonymisation of usernames, avoidance of deanonymisation, and transparent reporting of data sources adhere to the principles of responsible social media research (Zimmer, 2020). The limitations of automated tools – particularly VADER's domain blind spots and LDA's sensitivity to hyperparameters – are acknowledged and addressed through triangulation and validation.

5. Limitations

  • VADER's domain-agnostic lexicon may misclassify technical construction terminology, potentially inflating neutral classification.
  • LDA assumes a bag-of-words representation and cannot capture syntax or semantic context; BERTopic with sentence-transformers would provide richer topic coherence.
  • Reddit sampling bias: highly upvoted comments are over-represented; silent majority views are not captured.
  • The 12-month window may not reflect longer-term discourse trends; a 3–5 year dataset would better capture cyclical industry patterns.
  • LLM outputs (Task B3) depend on model version and prompt wording; reproducibility requires careful prompt documentation and version pinning.

6. Conclusion

This project successfully delivered an end-to-end social media analytics pipeline on construction-related Reddit data. Eight dominant topics were identified, with cost pressures and project delays dominating discourse. Sentiment analysis revealed a predominantly neutral-to-negative tone, with r/sustainability as a notable exception. Network analysis identified a small set of highly influential bridge users maintaining cross-community information flow. LLM-assisted interpretation corroborated and enriched traditional NLP findings in 75% of cases, while highlighting the need for critical human oversight.

The findings offer actionable insights for industry bodies and researchers: monitoring sentiment trajectories can serve as an early warning system for sector-wide challenges, and identifying influential voices can inform targeted professional communication strategies. Future work should explore fine-tuned domain-specific sentiment models and longitudinal topic modelling to track evolving discourse patterns.

7. References

Abebe, R., Barocas, S., Kleinberg, J., Levy, K., Raghavan, M., & Robinson, D. G. (2023). Roles for computing in social change. Communications of the ACM, 66(3), 56–68.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.

Bryman, A. (2016). Social Research Methods (5th ed.). Oxford University Press.

CIOB (2024). Digital Technology in Construction Report 2024. Chartered Institute of Building.

Dodge Data & Analytics (2024). SmartMarket Report: BIM & Digital Workflows in Construction. Dodge Construction Network.

Hutto, C., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of ICWSM 2014. AAAI Press.

Newman, M. E. J. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23), 8577–8582.

Zimmer, M. (2020). 'But the data is already public': On the ethics of research in Facebook. Ethics and Information Technology, 22(3), 215–223.
